Hi Thomas,

My understanding was always that RDK5 corresponds to maxPath = 5. I'm
not sure if significantly longer path lengths (e.g. 12) actually
"increase the amount of information" since they also increase the risk
of bit collisions in folded fingerprints.
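
A rough way to see the collision effect is an idealized occupancy model. The sketch below is not RDKit's actual hashing: it assumes every path sets bits_per_hash bits chosen uniformly at random, and the path counts are hypothetical (real counts depend on the molecule and grow quickly with maxPath).

```python
def expected_density(n_paths, fp_size, bits_per_hash=2):
    """Expected fraction of bits set in a folded fingerprint.

    Idealized model (an assumption, not RDKit's exact hashing): each of the
    n_paths distinct paths sets bits_per_hash bits chosen uniformly at random.
    """
    return 1.0 - (1.0 - 1.0 / fp_size) ** (n_paths * bits_per_hash)


# Hypothetical path counts, chosen only to illustrate the trend.
for n_paths in (200, 2000, 20000):
    for fp_size in (2048, 8192):
        density = expected_density(n_paths, fp_size)
        print(f"paths={n_paths:6d} fpSize={fp_size:5d} density={density:.3f}")
```

In this model a fingerprint approaches saturation once the number of paths is well above fpSize. Nearly saturated fingerprints differ in very few bits, so they carry little discriminating information, whatever the path length.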

Nils

On 04.10.2018 at 19:56, Thomas Evangelidis wrote:
> Hi Nils,
> 
> In general, yes, but there are still cases where RDK5 gives better ML
> models than ECFP or FCFP (e.g. the HSP90 dataset from D3R GC2015). In
> the end, I combine them all. Anyway, we are off topic and I am afraid
> I won't get an answer to my original question.
> 
> Thomas
> 
> 
> 
> 
> On Thu, 4 Oct 2018 at 11:28, Nils Weskamp <nils.wesk...@gmail.com
> <mailto:nils.wesk...@gmail.com>> wrote:
> 
>     Hi Thomas,
> 
>     Is there a particular reason why you want to use the
>     RDK5-fingerprints? My impression was always that circular (Morgan)
>     fingerprints generate better results than the path-oriented
>     RDK-fingerprints.
> 
>     Best,
>     Nils
> 
> 
>     On Thu, Oct 4, 2018 at 11:22 AM Thomas Evangelidis
>     <teva...@gmail.com <mailto:teva...@gmail.com>> wrote:
> 
>         Dear RDKit community,
> 
>         I need some advice regarding the usage of RDK5 fingerprints for
>         machine learning. I have a big training set (2200 molecules) and
>         a small test set (130 molecules). The RDK5 similarity between
>         training and test set is very high. When calculating the RDK5
>         using fpSize=4096, minPath=1, maxPath=7, 95% of the test set
>         molecules have at least 3 highly similar (Tanimoto>0.9) training
>         set molecules. Strangely, ML models trained with these RDK5
>         fingerprints do not have very good performance. Therefore I
>         would like to increase the amount of information stored in RDK5.
>         I tried fpSize=8192, minPath=1, maxPath=12 but all test
>         molecules had the same score in the results, which is very weird.
> 
>         What do you think would be a good combination of
>         parameter values to use during RDK5 generation in order to
>         include more information in the fingerprints? It is fine if the
>         fingerprint length becomes very long, as long as it does its job
>         well. The Chem.RDKFingerprint() function offers a lot of arguments:
> 
>         *minPath:* (optional) minimum number of bonds to include in the
>         subgraphs. Defaults to 1.
>         *maxPath:* (optional) maximum number of bonds to include in the
>         subgraphs. Defaults to 7.
>         *fpSize:* (optional) number of bits in the fingerprint. Defaults
>         to 2048.
>         *nBitsPerHash:* (optional) number of bits to set per path.
>         Defaults to 2.
>         *useHs:* (optional) include paths involving Hs in the
>         fingerprint if the molecule has explicit Hs. Defaults to True.
>         *tgtDensity:* (optional) fold the fingerprint until this minimum
>         density has been reached. Defaults to 0.
>         *minSize:* (optional) the minimum size the fingerprint will be
>         folded to when trying to reach tgtDensity. Defaults to 128.
>         *branchedPaths:* (optional) if set, both branched and unbranched
>         paths will be used in the fingerprint. Defaults to True.
>         *useBondOrder:* (optional) if set, both bond orders will be used
>         in the path hashes. Defaults to True.
>         *atomInvariants:* (optional) a sequence of atom invariants to
>         use in the path hashes. Defaults to empty.
>         *fromAtoms:* (optional) a sequence of atom indices. If provided,
>         only paths/subgraphs starting from these atoms will be used.
>         Defaults to empty.
>         *atomBits:* (optional) an empty list. If provided, the result
>         will contain a list containing the bits each atom sets. Defaults
>         to empty.
>         *bitInfo:* (optional) an empty dict. If provided, the result
>         will contain a dict with bits as keys and corresponding bond
>         paths as values. Defaults to empty.
> 
>         Thanks in advance.
>         Thomas
> 
> 
>         -- 
> 
>         ======================================================================
> 
>         Dr Thomas Evangelidis
> 
>         Research Scientist
> 
>         IOCB - Institute of Organic Chemistry and Biochemistry of the
>         Czech Academy of Sciences
>         <https://www.uochb.cz/web/structure/31.html?lang=en>
> 
>         Prague, Czech Republic
>           & 
>         CEITEC - Central European Institute of Technology
>         <https://www.ceitec.eu/>
>         Brno, Czech Republic 
> 
>         email: teva...@gmail.com <mailto:teva...@gmail.com>
> 
>         website: https://sites.google.com/site/thomasevangelidishomepage/
> 
> 
> 
>         _______________________________________________
>         Rdkit-discuss mailing list
>         Rdkit-discuss@lists.sourceforge.net
>         <mailto:Rdkit-discuss@lists.sourceforge.net>
>         https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
> 
> 
> 
> -- 
> 
> ======================================================================
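
The similarity check described in the original question (the fraction of test molecules with at least 3 training molecules at Tanimoto > 0.9) can be sketched in plain Python. This is only an illustration: fingerprints are represented as sets of on-bit indices, the toy data is made up, and the function names are hypothetical. With RDKit, the index sets could presumably be built from fp.GetOnBits(), or the comparison done directly with DataStructs.BulkTanimotoSimilarity.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def fraction_with_neighbours(test_fps, train_fps, threshold=0.9, min_neighbours=3):
    """Fraction of test fingerprints that have at least min_neighbours
    training fingerprints with Tanimoto similarity above threshold."""
    covered = 0
    for test_fp in test_fps:
        n_similar = sum(1 for train_fp in train_fps
                        if tanimoto(test_fp, train_fp) > threshold)
        if n_similar >= min_neighbours:
            covered += 1
    return covered / len(test_fps)


# Toy on-bit sets, invented for illustration only.
train = [set(range(0, 20)), set(range(0, 19)), set(range(1, 20)), set(range(100, 120))]
test = [set(range(0, 20)), set(range(200, 220))]

coverage = fraction_with_neighbours(test, train)
print(f"{coverage:.0%} of test fingerprints have >= 3 close training neighbours")
```

High coverage by this measure only says that the fingerprints look alike. If the fingerprints are close to saturated, it says little about whether the molecules themselves are similar.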


