Dear RDKit community,

I need some advice on using RDK5 fingerprints for machine learning. I have
a large training set (2200 molecules) and a small test set (130 molecules).
The RDK5 similarity between the training and test sets is very high: when I
calculate the fingerprints with fpSize=4096, minPath=1, maxPath=7, 95% of
the test set molecules have at least 3 highly similar (Tanimoto > 0.9)
training set molecules. Strangely, ML models trained on these fingerprints
do not perform very well, so I would like to increase the amount of
information stored in the fingerprints. I tried fpSize=8192, minPath=1,
maxPath=12, but then all test molecules received the same score in the
results, which is very odd.
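
For reference, the similarity check described above can be sketched roughly
as follows. This is only an illustration, not my actual script: the helper
names rdk_fingerprint and count_close_neighbors and the SMILES strings are
made-up placeholders, and only the fingerprint parameters and the 0.9
threshold come from the description above.

```python
from rdkit import Chem
from rdkit import DataStructs


def rdk_fingerprint(smiles, fp_size=4096, min_path=1, max_path=7):
    """Build an RDKit path-based fingerprint from a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return Chem.RDKFingerprint(
        mol, minPath=min_path, maxPath=max_path, fpSize=fp_size
    )


def count_close_neighbors(test_smiles, train_smiles, threshold=0.9):
    """For each test molecule, count training molecules above the threshold."""
    train_fps = [rdk_fingerprint(s) for s in train_smiles]
    counts = []
    for smi in test_smiles:
        sims = DataStructs.BulkTanimotoSimilarity(rdk_fingerprint(smi), train_fps)
        counts.append(sum(sim > threshold for sim in sims))
    return counts


# Placeholder molecules, standing in for the real training and test sets.
train = ["CCO", "CCCO", "c1ccccc1O"]
test = ["CCO", "c1ccccc1N"]
counts = count_close_neighbors(test, train)
print(counts)
```

Each entry in counts is the number of training molecules with Tanimoto
above the threshold for the corresponding test molecule.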

What do you think would be a good combination of parameter values to use
during RDK5 generation in order to include more information in the
fingerprints? It is fine if the fingerprint length becomes very long, as
long as it does its job well. The Chem.RDKFingerprint() function offers a
lot of arguments:

*minPath:* (optional) minimum number of bonds to include in the subgraphs.
Defaults to 1.
*maxPath:* (optional) maximum number of bonds to include in the subgraphs.
Defaults to 7.
*fpSize:* (optional) number of bits in the fingerprint. Defaults to 2048.
*nBitsPerHash:* (optional) number of bits to set per path. Defaults to 2.
*useHs:* (optional) include paths involving Hs in the fingerprint if the
molecule has explicit Hs. Defaults to True.
*tgtDensity:* (optional) fold the fingerprint until this minimum density
has been reached. Defaults to 0.
*minSize:* (optional) the minimum size the fingerprint will be folded to
when trying to reach tgtDensity. Defaults to 128.
*branchedPaths:* (optional) if set both branched and unbranched paths will
be used in the fingerprint. Defaults to True.
*useBondOrder:* (optional) if set both bond orders will be used in the path
hashes. Defaults to True.
*atomInvariants:* (optional) a sequence of atom invariants to use in the
path hashes. Defaults to empty.
*fromAtoms:* (optional) a sequence of atom indices. If provided, only
paths/subgraphs starting from these atoms will be used. Defaults to empty.
*atomBits:* (optional) an empty list. If provided, the result will contain
a list containing the bits each atom sets. Defaults to empty.
*bitInfo:* (optional) an empty dict. If provided, the result will contain a
dict with bits as keys and corresponding bond paths as values. Defaults to
empty.
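
To make the comparison concrete, here is a minimal sketch of how the two
parameter sets mentioned above could be compared by the fraction of bits
they set. The helper name bit_density and the example molecule (aspirin)
are placeholders chosen for illustration. I am assuming that bit density is
a useful thing to look at here, since a fingerprint with nearly all bits
set would make molecules hard to tell apart; that is a guess on my side,
not something I have confirmed.

```python
from rdkit import Chem


def bit_density(smiles, **fp_kwargs):
    """Return (number of bits, fraction of bits set) for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    fp = Chem.RDKFingerprint(mol, **fp_kwargs)
    return fp.GetNumBits(), fp.GetNumOnBits() / fp.GetNumBits()


# Placeholder molecule (aspirin); replace with molecules from the real sets.
smiles = "CC(=O)Oc1ccccc1C(=O)O"

# The two parameter combinations mentioned in this message.
settings = {
    "short paths": dict(fpSize=4096, minPath=1, maxPath=7),
    "long paths": dict(fpSize=8192, minPath=1, maxPath=12),
}

densities = {}
for label, kwargs in settings.items():
    n_bits, density = bit_density(smiles, **kwargs)
    densities[label] = (n_bits, density)
    print(f"{label}: {n_bits} bits, density {density:.3f}")
```

Running this over the whole training set instead of a single molecule would
show how the density changes with maxPath and fpSize.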

Thanks in advance.
Thomas


-- 

======================================================================

Dr Thomas Evangelidis

Research Scientist

IOCB - Institute of Organic Chemistry and Biochemistry of the Czech Academy
of Sciences <https://www.uochb.cz/web/structure/31.html?lang=en>
Prague, Czech Republic
  &
CEITEC - Central European Institute of Technology <https://www.ceitec.eu/>
Brno, Czech Republic

email: teva...@gmail.com

website: https://sites.google.com/site/thomasevangelidishomepage/