Hi Thomas,

Is there a particular reason why you want to use the RDK5 fingerprints? My
impression has always been that circular (Morgan) fingerprints give better
results than the path-based RDK fingerprints.

Best,
Nils


On Thu, Oct 4, 2018 at 11:22 AM Thomas Evangelidis <teva...@gmail.com>
wrote:

> Dear RDKit community,
>
> I need some advice regarding the usage of RDK5 fingerprints for machine
> learning. I have a big training set (2200 molecules) and a small test set
> (130 molecules). The RDK5 similarity between training and test set is very
> high. When calculating RDK5 fingerprints with fpSize=4096, minPath=1,
> maxPath=7, 95% of the test set molecules have at least 3 highly similar
> (Tanimoto>0.9) training set molecules. Strangely, ML models trained with
> these RDK5 fingerprints
> do not have very good performance. Therefore I would like to increase the
> amount of information stored in RDK5. I tried fpSize=8192, minPath=1,
> maxPath=12 but all test molecules had the same score in the results,
> which is very weird.
>
> What do you think would be a good combination of parameter values to use
> during RDK5 generation in order to include more information in the
> fingerprints? It is fine if the fingerprint length becomes very long, as
> long as it does its job well. The Chem.RDKFingerprint() function offers a
> lot of arguments:
>
> *minPath:* (optional) minimum number of bonds to include in the
> subgraphs. Defaults to 1.
> *maxPath:* (optional) maximum number of bonds to include in the
> subgraphs. Defaults to 7.
> *fpSize:* (optional) number of bits in the fingerprint. Defaults to 2048.
> *nBitsPerHash:* (optional) number of bits to set per path. Defaults to 2.
> *useHs:* (optional) include paths involving Hs in the fingerprint if the
> molecule has explicit Hs. Defaults to True.
> *tgtDensity:* (optional) fold the fingerprint until this minimum density
> has been reached. Defaults to 0.
> *minSize:* (optional) the minimum size the fingerprint will be folded to
> when trying to reach tgtDensity. Defaults to 128.
> *branchedPaths:* (optional) if set, both branched and unbranched paths
> will be used in the fingerprint. Defaults to True.
> *useBondOrder:* (optional) if set, both bond orders will be used in the
> path hashes. Defaults to True.
> *atomInvariants:* (optional) a sequence of atom invariants to use in the
> path hashes. Defaults to empty.
> *fromAtoms:* (optional) a sequence of atom indices. If provided, only
> paths/subgraphs starting from these atoms will be used. Defaults to empty.
> *atomBits:* (optional) an empty list. If provided, the result will
> contain a list containing the bits each atom sets. Defaults to empty.
> *bitInfo:* (optional) an empty dict. If provided, the result will contain
> a dict with bits as keys and corresponding bond paths as values. Defaults
> to empty.
>
> Thanks in advance.
> Thomas
>
>
> --
>
> ======================================================================
>
> Dr Thomas Evangelidis
>
> Research Scientist
>
> IOCB - Institute of Organic Chemistry and Biochemistry of the Czech
> Academy of Sciences <https://www.uochb.cz/web/structure/31.html?lang=en>
> Prague, Czech Republic
>   &
> CEITEC - Central European Institute of Technology <https://www.ceitec.eu/>
> Brno, Czech Republic
>
> email: teva...@gmail.com
>
> website: https://sites.google.com/site/thomasevangelidishomepage/
>
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
