Dear RDKit community,

I need some advice on using RDK5 fingerprints for machine learning. I have a large training set (2200 molecules) and a small test set (130 molecules). The RDK5 similarity between the training and test sets is very high: when I calculate RDK5 with fpSize=4096, minPath=1, maxPath=7, 95% of the test set molecules have at least 3 highly similar (Tanimoto>0.9) training set molecules. Strangely, ML models trained on these RDK5 fingerprints do not perform very well. I would therefore like to increase the amount of information stored in RDK5. I tried fpSize=8192, minPath=1, maxPath=12, but then all test molecules received the same score in the results, which is very weird.
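For readers who want to reproduce the kind of similarity check described above, here is a minimal sketch. It assumes the molecules are available as SMILES strings; the helper names `rdk_fingerprint` and `fraction_with_similar_neighbors` and the toy molecules are illustrative only and are not taken from the actual data set or workflow.

```python
from rdkit import Chem, DataStructs


def rdk_fingerprint(smiles, fp_size=4096, min_path=1, max_path=7):
    """Return an RDKit path fingerprint for a SMILES string, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.RDKFingerprint(mol, minPath=min_path, maxPath=max_path, fpSize=fp_size)


def fraction_with_similar_neighbors(train_smiles, test_smiles,
                                    threshold=0.9, min_neighbors=3, **fp_kwargs):
    """Fraction of test molecules with at least `min_neighbors` training
    molecules whose Tanimoto similarity exceeds `threshold`."""
    train_fps = [fp for fp in (rdk_fingerprint(s, **fp_kwargs) for s in train_smiles)
                 if fp is not None]
    test_fps = [fp for fp in (rdk_fingerprint(s, **fp_kwargs) for s in test_smiles)
                if fp is not None]
    if not test_fps:
        return 0.0

    hits = 0
    for fp in test_fps:
        # Similarity of one test fingerprint against every training fingerprint
        sims = DataStructs.BulkTanimotoSimilarity(fp, train_fps)
        n_similar = sum(1 for s in sims if s > threshold)
        if n_similar >= min_neighbors:
            hits += 1
    return hits / len(test_fps)


if __name__ == "__main__":
    # Toy data, purely for illustration
    train = ["CCO", "CCO", "CCO", "c1ccccc1"]
    test = ["CCO", "CCCCCCCC"]
    print(fraction_with_similar_neighbors(train, test, fp_size=4096, max_path=7))
```

Passing different `fp_size` and `max_path` values to the same function makes it easy to see how the neighbour statistics change between parameter settings.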
What do you think would be a good combination of parameter values to use during RDK5 generation in order to include more information in the fingerprints? It is fine if the fingerprint length becomes very long, as long as it does its job well. The Chem.RDKFingerprint() function offers a lot of arguments:

*minPath:* (optional) minimum number of bonds to include in the subgraphs. Defaults to 1.
*maxPath:* (optional) maximum number of bonds to include in the subgraphs. Defaults to 7.
*fpSize:* (optional) number of bits in the fingerprint. Defaults to 2048.
*nBitsPerHash:* (optional) number of bits to set per path. Defaults to 2.
*useHs:* (optional) include paths involving Hs in the fingerprint if the molecule has explicit Hs. Defaults to True.
*tgtDensity:* (optional) fold the fingerprint until this minimum density has been reached. Defaults to 0.
*minSize:* (optional) the minimum size the fingerprint will be folded to when trying to reach tgtDensity. Defaults to 128.
*branchedPaths:* (optional) if set both branched and unbranched paths will be used in the fingerprint. Defaults to True.
*useBondOrder:* (optional) if set both bond orders will be used in the path hashes. Defaults to True.
*atomInvariants:* (optional) a sequence of atom invariants to use in the path hashes. Defaults to empty.
*fromAtoms:* (optional) a sequence of atom indices. If provided, only paths/subgraphs starting from these atoms will be used. Defaults to empty.
*atomBits:* (optional) an empty list. If provided, the result will contain a list containing the bits each atom sets. Defaults to empty.
*bitInfo:* (optional) an empty dict. If provided, the result will contain a dict with bits as keys and corresponding bond paths as values. Defaults to empty.

Thanks in advance.
Thomas

--
======================================================================
Dr Thomas Evangelidis
Research Scientist
IOCB - Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences <https://www.uochb.cz/web/structure/31.html?lang=en>
Prague, Czech Republic
&
CEITEC - Central European Institute of Technology <https://www.ceitec.eu/>
Brno, Czech Republic

email: teva...@gmail.com
website: https://sites.google.com/site/thomasevangelidishomepage/
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss