Hi Nils,

In general, yes, but there are still cases where RDK5 gives better ML models than ECFP or FCFP (e.g. on the HSP90 dataset from D3R GC2015). In the end, I combine them all. Anyway, we are getting off topic, and I am afraid I won't get an answer to my original question.

Thomas

On Thu, 4 Oct 2018 at 11:28, Nils Weskamp <nils.wesk...@gmail.com> wrote:

> Hi Thomas,
>
> is there a particular reason why you want to use the RDK5-fingerprints? My
> impression was always that circular (Morgan) fingerprints generate better
> results than the path-oriented RDK-fingerprints.
>
> Best,
> Nils
>
> On Thu, Oct 4, 2018 at 11:22 AM Thomas Evangelidis <teva...@gmail.com>
> wrote:
>
>> Dear RDKit community,
>>
>> I need some advice regarding the usage of RDK5 fingerprints for machine
>> learning. I have a big training set (2200 molecules) and a small test set
>> (130 molecules). The RDK5 similarity between training and test set is very
>> high. When calculating the RDK5 using fpSize=4096, minPath=1, maxPath=7,
>> 95% of the test set molecules have at least 3 highly similar
>> (Tanimoto>0.9) training set molecules. Strangely, ML models trained with
>> these RDK5 fingerprints do not have very good performance. Therefore I
>> would like to increase the amount of information stored in RDK5. I tried
>> fpSize=8192, minPath=1, maxPath=12 but all test molecules had the same
>> score in the results, which is very weird.
>>
>> What do you think would be a good combination of parameter values to use
>> during RDK5 generation in order to include more information in the
>> fingerprints? It is fine if the fingerprint length becomes very long, as
>> long as it does its job well. The Chem.RDKFingerprint() function offers
>> a lot of arguments:
>>
>> *minPath:* (optional) minimum number of bonds to include in the
>> subgraphs. Defaults to 1.
>> *maxPath:* (optional) maximum number of bonds to include in the
>> subgraphs. Defaults to 7.
>> *fpSize:* (optional) number of bits in the fingerprint. Defaults to 2048.
>> *nBitsPerHash:* (optional) number of bits to set per path. Defaults to 2.
>> *useHs:* (optional) include paths involving Hs in the fingerprint if the
>> molecule has explicit Hs. Defaults to True.
>> *tgtDensity:* (optional) fold the fingerprint until this minimum density
>> has been reached. Defaults to 0.
>> *minSize:* (optional) the minimum size the fingerprint will be folded to
>> when trying to reach tgtDensity. Defaults to 128.
>> *branchedPaths:* (optional) if set both branched and unbranched paths
>> will be used in the fingerprint. Defaults to True.
>> *useBondOrder:* (optional) if set both bond orders will be used in the
>> path hashes. Defaults to True.
>> *atomInvariants:* (optional) a sequence of atom invariants to use in the
>> path hashes. Defaults to empty.
>> *fromAtoms:* (optional) a sequence of atom indices. If provided, only
>> paths/subgraphs starting from these atoms will be used. Defaults to empty.
>> *atomBits:* (optional) an empty list. If provided, the result will
>> contain a list containing the bits each atom sets. Defaults to empty.
>> *bitInfo:* (optional) an empty dict. If provided, the result will
>> contain a dict with bits as keys and corresponding bond paths as values.
>> Defaults to empty.
>>
>> Thanks in advance.
>> Thomas
>>
>> --
>> ======================================================================
>> Dr Thomas Evangelidis
>> Research Scientist
>> IOCB - Institute of Organic Chemistry and Biochemistry of the Czech
>> Academy of Sciences <https://www.uochb.cz/web/structure/31.html?lang=en>
>> Prague, Czech Republic
>> &
>> CEITEC - Central European Institute of Technology
>> <https://www.ceitec.eu/>
>> Brno, Czech Republic
>>
>> email: teva...@gmail.com
>> website: https://sites.google.com/site/thomasevangelidishomepage/
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
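
For anyone following up on the original question, here is a minimal sketch of two checks that might help narrow it down. It is not something proposed in the thread. It computes (a) the fraction of test molecules that have at least three training molecules above a Tanimoto threshold, as described in the question, and (b) the bit density of a fingerprint. The density check rests on an assumption the thread does not confirm: with maxPath=12 and branched paths enabled, so many subgraphs may be hashed that nearly every bit is set, in which case all molecules would look alike to a model. The helper names (`tanimoto`, `bit_density`, `fraction_with_close_neighbours`, `rdk_on_bits`) are hypothetical; only `Chem.MolFromSmiles`, `Chem.RDKFingerprint`, `GetOnBits` and `GetNumBits` are RDKit calls.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    # Two empty fingerprints: return 0.0 (a convention chosen for this sketch).
    return shared / union if union else 0.0


def bit_density(on_bits, n_bits):
    """Fraction of bits that are set; values near 1.0 suggest saturation."""
    return len(on_bits) / n_bits


def fraction_with_close_neighbours(test_fps, train_fps,
                                   threshold=0.9, min_neighbours=3):
    """Fraction of test fingerprints that have at least `min_neighbours`
    training fingerprints with Tanimoto similarity above `threshold`."""
    covered = 0
    for test_fp in test_fps:
        close = sum(1 for train_fp in train_fps
                    if tanimoto(test_fp, train_fp) > threshold)
        if close >= min_neighbours:
            covered += 1
    return covered / len(test_fps)


def rdk_on_bits(smiles, fp_size=4096, min_path=1, max_path=7):
    """Return (set of on-bit indices, fingerprint length) for one SMILES.

    RDKit is imported lazily so the helpers above also run without it.
    """
    from rdkit import Chem
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    fp = Chem.RDKFingerprint(mol, minPath=min_path, maxPath=max_path,
                             fpSize=fp_size)
    return set(fp.GetOnBits()), fp.GetNumBits()


# Toy on-bit sets standing in for real fingerprints (made-up data).
train = [set(range(20)), set(range(19)), set(range(1, 20)), {100, 101}]
test = [set(range(20)), {100, 200}]
coverage = fraction_with_close_neighbours(test, train)
print(coverage)  # 0.5: only the first toy test set has three close neighbours
```

With real molecules, `rdk_on_bits(smiles, fp_size=8192, max_path=12)` followed by `bit_density(...)` would show whether the longer-path setting saturates the fingerprint. If the density is close to 1.0, a lower maxPath, branchedPaths=False or nBitsPerHash=1 would be reasonable things to try first; if it is not, the identical scores probably have another cause.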