Hi Nils,

In general, yes, but there are still cases where RDK5 gives better ML
models than ECFP or FCFP (e.g. the HSP90 dataset from D3R GC2015). In the
end, I combine them all. Anyway, we are getting off topic, and I am afraid
I won't get an answer to my original question.

Thomas




On Thu, 4 Oct 2018 at 11:28, Nils Weskamp <nils.wesk...@gmail.com> wrote:

> Hi Thomas,
>
> is there a particular reason why you want to use the RDK5-fingerprints? My
> impression was always that circular (Morgan) fingerprints generate better
> results than the path-oriented RDK-fingerprints.
>
> Best,
> Nils
>
>
> On Thu, Oct 4, 2018 at 11:22 AM Thomas Evangelidis <teva...@gmail.com>
> wrote:
>
>> Dear RDKit community,
>>
>> I need some advice regarding the usage of RDK5 fingerprints for machine
>> learning. I have a big training set (2200 molecules) and a small test set
>> (130 molecules). The RDK5 similarity between training and test set is very
>> high. When calculating the RDK5 using fpSize=4096, minPath=1, maxPath=7,
>> 95% of the test set molecules have at least 3 highly similar
>> (Tanimoto>0.9) training set molecules. Strangely, ML models trained with
>> these RDK5 fingerprints do not have very good performance. Therefore I
>> would like to increase the amount of information stored in RDK5. I tried
>> fpSize=8192, minPath=1, maxPath=12 but all test molecules had the same
>> score in the results, which is very weird.
>>
>> What do you think would be a good combination of parameter values to use
>> during RDK5 generation in order to include more information in the
>> fingerprints? It is fine if the fingerprint length becomes very long, as
>> long as it does its job well. The Chem.RDKFingerprint() function offers
>> a lot of arguments:
>>
>> *minPath:* (optional) minimum number of bonds to include in the
>> subgraphs. Defaults to 1.
>> *maxPath:* (optional) maximum number of bonds to include in the
>> subgraphs. Defaults to 7.
>> *fpSize:* (optional) number of bits in the fingerprint. Defaults to 2048.
>> *nBitsPerHash:* (optional) number of bits to set per path. Defaults to 2.
>> *useHs:* (optional) include paths involving Hs in the fingerprint if the
>> molecule has explicit Hs. Defaults to True.
>> *tgtDensity:* (optional) fold the fingerprint until this minimum density
>> has been reached. Defaults to 0.
>> *minSize:* (optional) the minimum size the fingerprint will be folded to
>> when trying to reach tgtDensity. Defaults to 128.
>> *branchedPaths:* (optional) if set, both branched and unbranched paths
>> will be used in the fingerprint. Defaults to True.
>> *useBondOrder:* (optional) if set, bond orders will be used in the
>> path hashes. Defaults to True.
>> *atomInvariants:* (optional) a sequence of atom invariants to use in the
>> path hashes. Defaults to empty.
>> *fromAtoms:* (optional) a sequence of atom indices. If provided, only
>> paths/subgraphs starting from these atoms will be used. Defaults to empty.
>> *atomBits:* (optional) an empty list. If provided, the result will
>> contain a list containing the bits each atom sets. Defaults to empty.
>> *bitInfo:* (optional) an empty dict. If provided, the result will
>> contain a dict with bits as keys and corresponding bond paths as values.
>> Defaults to empty.
>>
>> Thanks in advance.
>> Thomas
>>
>>
>> --
>>
>> ======================================================================
>>
>> Dr Thomas Evangelidis
>>
>> Research Scientist
>>
>> IOCB - Institute of Organic Chemistry and Biochemistry of the Czech
>> Academy of Sciences <https://www.uochb.cz/web/structure/31.html?lang=en>
>> Prague, Czech Republic
>>   &
>> CEITEC - Central European Institute of Technology
>> <https://www.ceitec.eu/>
>> Brno, Czech Republic
>>
>> email: teva...@gmail.com
>>
>> website: https://sites.google.com/site/thomasevangelidishomepage/
>>
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
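
For reference, the similarity check described in the original question (the
share of test molecules that have at least 3 training molecules with
Tanimoto > 0.9) could be sketched roughly as below. This is only an
illustration under stated assumptions: each fingerprint is represented as a
plain Python set of on-bit indices, the helper names tanimoto and
fraction_covered are hypothetical, and the toy data is made up rather than
real RDK5 output. With RDKit, such index sets could presumably be built from
the on bits of each bit-vector fingerprint.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def fraction_covered(test_fps, train_fps, threshold=0.9, min_neighbors=3):
    """Fraction of test fingerprints that have at least `min_neighbors`
    training fingerprints with Tanimoto similarity above `threshold`."""
    if not test_fps:
        return 0.0
    covered = 0
    for test_fp in test_fps:
        neighbors = sum(
            1 for train_fp in train_fps if tanimoto(test_fp, train_fp) > threshold
        )
        if neighbors >= min_neighbors:
            covered += 1
    return covered / len(test_fps)


# Toy data (hypothetical, for illustration only).
train = [set(range(20)), set(range(1, 21)), set(range(19)) | {25}, {100, 101, 102}]
test = [set(range(20)), {100, 101, 200}]

print(fraction_covered(test, train))
```

A high value from this check only says that the test set has close neighbours
in fingerprint space; it does not by itself say how well a model trained on
those fingerprints will perform.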
