Hi Thomas, is there a particular reason why you want to use the RDK5 fingerprints? My impression has always been that circular (Morgan) fingerprints give better results than the path-based RDK fingerprints.
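To make the comparison concrete, here is a minimal sketch that builds both fingerprint types for the same molecules and prints their Tanimoto similarities to the first one. The three SMILES strings, the helper names rdk_fp and morgan_fp, and the Morgan settings (radius 2, 2048 bits) are illustrative assumptions, not values from your data set. Recent RDKit releases also provide rdFingerprintGenerator as a newer interface and may log a deprecation warning for the call used here.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def rdk_fp(mol, max_path=7, n_bits=4096):
    """Path-based RDK fingerprint with the settings quoted in the question."""
    return Chem.RDKFingerprint(mol, minPath=1, maxPath=max_path, fpSize=n_bits)

def morgan_fp(mol, radius=2, n_bits=2048):
    """Circular (Morgan) fingerprint; radius and size are assumed defaults."""
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)

# Toy molecules, chosen only for illustration.
smiles = ["CCOc1ccccc1", "CCOc1ccccc1C", "CCN(CC)CC"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

results = {}
for name, fp_func in (("RDK", rdk_fp), ("Morgan", morgan_fp)):
    fps = [fp_func(m) for m in mols]
    # Similarity of the first molecule to each of the others.
    results[name] = DataStructs.BulkTanimotoSimilarity(fps[0], fps[1:])
    print(name, [round(s, 2) for s in results[name]])
```

Feeding both fingerprint types to the same model and comparing the cross-validated performance would show whether the difference matters for your data.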
Best,
Nils

On Thu, Oct 4, 2018 at 11:22 AM Thomas Evangelidis <teva...@gmail.com> wrote:

> Dear RDKit community,
>
> I need some advice regarding the usage of RDK5 fingerprints for machine
> learning. I have a big training set (2200 molecules) and a small test set
> (130 molecules). The RDK5 similarity between training and test set is very
> high. When calculating the RDK5 using fpSize=4096, minPath=1, maxPath=7, 95%
> of the test set molecules have at least 3 highly similar (Tanimoto>0.9)
> training set molecules. Strangely, ML models trained with these RDK5
> fingerprints do not have very good performance. Therefore I would like to
> increase the amount of information stored in RDK5. I tried fpSize=8192,
> minPath=1, maxPath=12 but all test molecules had the same score in the
> results, which is very weird.
>
> What do you think would be a good combination of parameter values to use
> during RDK5 generation in order to include more information in the
> fingerprints? It is fine if the fingerprint length becomes very long, as
> long as it does its job well. The Chem.RDKFingerprint() function offers a
> lot of arguments:
>
> *minPath:* (optional) minimum number of bonds to include in the subgraphs
> Defaults to 1.
> *maxPath:* (optional) maximum number of bonds to include in the subgraphs
> Defaults to 7.
> *fpSize:* (optional) number of bits in the fingerprint Defaults to 2048.
> *nBitsPerHash:* (optional) number of bits to set per path Defaults to 2.
> *useHs:* (optional) include paths involving Hs in the fingerprint if the
> molecule has explicit Hs. Defaults to True.
> *tgtDensity:* (optional) fold the fingerprint until this minimum density
> has been reached Defaults to 0.
> *minSize:* (optional) the minimum size the fingerprint will be folded to
> when trying to reach tgtDensity Defaults to 128.
> *branchedPaths:* (optional) if set both branched and unbranched paths
> will be used in the fingerprint. Defaults to True.
> *useBondOrder:* (optional) if set both bond orders will be used in the
> path hashes Defaults to True.
> *atomInvariants:* (optional) a sequence of atom invariants to use in the
> path hashes Defaults to empty.
> *fromAtoms:* (optional) a sequence of atom indices. If provided, only
> paths/subgraphs starting from these atoms will be used. Defaults to empty.
> *atomBits:* (optional) an empty list. If provided, the result will
> contain a list containing the bits each atom sets. Defaults to empty.
> *bitInfo:* (optional) an empty dict. If provided, the result will contain
> a dict with bits as keys and corresponding bond paths as values. Defaults
> to empty.
>
> Thanks in advance.
> Thomas
>
> --
>
> ======================================================================
>
> Dr Thomas Evangelidis
>
> Research Scientist
>
> IOCB - Institute of Organic Chemistry and Biochemistry of the Czech
> Academy of Sciences <https://www.uochb.cz/web/structure/31.html?lang=en>
> Prague, Czech Republic
> &
> CEITEC - Central European Institute of Technology <https://www.ceitec.eu/>
> Brno, Czech Republic
>
> email: teva...@gmail.com
>
> website: https://sites.google.com/site/thomasevangelidishomepage/
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
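The similarity check described in the quoted message (the fraction of test molecules with at least 3 training molecules above Tanimoto 0.9) can be sketched in plain Python as below. Fingerprints are represented here as sets of on-bit indices, and the function names tanimoto and coverage as well as the toy bit sets are assumptions made for illustration; with RDKit bit vectors, GetOnBits() would give the same kind of index list.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def coverage(train_fps, test_fps, threshold=0.9, min_neighbors=3):
    """Fraction of test fingerprints with at least `min_neighbors`
    training fingerprints whose similarity exceeds `threshold`."""
    covered = 0
    for test_fp in test_fps:
        n_close = sum(tanimoto(test_fp, train_fp) > threshold
                      for train_fp in train_fps)
        if n_close >= min_neighbors:
            covered += 1
    return covered / len(test_fps)

# Toy fingerprints (hypothetical bit indices, not real molecules).
train = [set(range(20)), set(range(1, 21)), set(range(19)) | {25}, {100, 101, 102}]
test = [set(range(20)), {200, 201}]

print(coverage(train, test))
```

Running the same function on fingerprints generated with different parameter sets shows how much the 95% figure depends on the fingerprint settings rather than on the molecules themselves.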