Hi Greg,

Thanks for the quick feedback! I will either play with the path length
parameter or switch to an entirely different fingerprint for my
application.

Kind regards,
Nils

On Fri, Jun 2, 2017 at 7:43 AM, Greg Landrum <greg.land...@gmail.com> wrote:

> Hi Nils,
>
> I don't think this is really a bug; it's more a matter of default
> parameters that aren't appropriate for the molecules being considered.
>
> The RDKit fingerprint hashes subgraphs (branched and unbranched) within a
> particular range of sizes, as measured by the number of bonds in the
> subgraph. The default is to include subgraphs with between 1 and 7 bonds.
> These molecules are quite complex and thus have lots of different unique
> subgraphs, particularly when going out to 7 bonds.
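> Here's a quick demo of how quickly the number of set bits falls off as the
> maximum subgraph size is reduced for one of your molecules (the calls
> report the number of unset bits out of 2048, so a larger value means a
> sparser fingerprint):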
>
> In [6]: m = Chem.MolFromSmiles(r'COCC(C)NC1C(N(CC(C)(C)C)CC(N1C1=C2N=C(NC(C)=O)SC2=C(C2C(C(CC3=C2C=CC=C3)C(F)(F)F)C2=CC=CN=C2)C(N2CCN(CC(C)(C)C)C(CC(C)C)C2)=C1C1=C2N=C(NC(C)=O)SC2=CC=C1)C1=CC=CC2=C1N=C(N)C(=O)N2C1=CC=CN=C1C(F)(F)F)C1=CC=CC=C1')
>
> In [7]: Chem.RDKFingerprint(m,maxPath=7,fpSize=2048).GetNumOffBits()
> Out[7]: 32
>
> In [8]: Chem.RDKFingerprint(m,maxPath=6,fpSize=2048).GetNumOffBits()
> Out[8]: 363
>
> In [9]: Chem.RDKFingerprint(m,maxPath=5,fpSize=2048).GetNumOffBits()
> Out[9]: 999
>
>
> If you think about how many different 7-bond branched subgraphs are
> possible in that molecule, the large number of set bits starts to make
> sense.
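
To put a rough number on that argument, here is a back-of-the-envelope sketch. It assumes an idealised model in which every distinct subgraph sets `bits_per_hash` bits at uniformly random positions; this is not RDKit's actual hashing scheme, and the function names and the value 2 for `bits_per_hash` are assumptions made for illustration. Inverting the model on the unset-bit counts quoted above gives an order-of-magnitude estimate of how many distinct subgraphs would be needed to fill the fingerprint that far.

```python
import math


def expected_on_fraction(n_subgraphs, fp_size=2048, bits_per_hash=2):
    """Expected fraction of set bits under the idealised uniform-hash model.

    Each subgraph is assumed to set `bits_per_hash` independent, uniformly
    random bit positions (an assumption, not RDKit's real hash function).
    """
    draws = n_subgraphs * bits_per_hash
    return 1.0 - (1.0 - 1.0 / fp_size) ** draws


def implied_subgraph_count(num_off_bits, fp_size=2048, bits_per_hash=2):
    """Invert the model: subgraph count that would leave `num_off_bits` unset."""
    if not 0 < num_off_bits <= fp_size:
        raise ValueError("num_off_bits must be in (0, fp_size]")
    draws = math.log(num_off_bits / fp_size) / math.log(1.0 - 1.0 / fp_size)
    return draws / bits_per_hash


# Unset-bit counts reported in the thread for maxPath = 7, 6, 5 (fpSize=2048).
for max_path, off_bits in ((7, 32), (6, 363), (5, 999)):
    estimate = implied_subgraph_count(off_bits)
    print(f"maxPath={max_path}: ~{estimate:.0f} distinct subgraphs implied")
```

Under these assumptions the maxPath=7 count comes out at a few thousand distinct subgraphs, which is plausible for a molecule of this size once branched subgraphs are included. A fingerprint with zero unset bits cannot be inverted this way; it only says the count is larger still.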
>
> The default of 7 bonds is probably not the best choice - Sereina ended up
> selecting "RDK5" (i.e. maxPath=5) while doing the benchmarking papers and
> that's what I also tend to use now - but it's difficult to change the
> default at this point without breaking a lot of code in hard-to-detect ways.
>
> The maxPath parameter is available in KNIME on the advanced tab in the
> RDKit Fingerprint configuration dialog.
>
> For the sake of completeness, the real increase in number of set bits is
> due to the inclusion of branched subgraphs; if you turn those off the
> number of set bits drops dramatically:
>
> In [13]: Chem.RDKFingerprint(m,maxPath=7,branchedPaths=False,fpSize=2048).GetNumOffBits()
> Out[13]: 786
>
> In [14]: Chem.RDKFingerprint(m,maxPath=6,branchedPaths=False,fpSize=2048).GetNumOffBits()
> Out[14]: 1145
>
> In [15]: Chem.RDKFingerprint(m,maxPath=5,branchedPaths=False,fpSize=2048).GetNumOffBits()
> Out[15]: 1460
>
>
> This form of the fingerprint, of course, contains a lot less information.
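
The two experiments above can be combined into one sweep. The following is a minimal sketch, assuming RDKit is installed; the helper names `on_bit_fraction` and `sweep_max_path` are made up for this example, while `Chem.MolFromSmiles`, `Chem.RDKFingerprint` (with `maxPath`, `fpSize`, `branchedPaths`) and `GetNumOffBits` are the calls already used in the session above. RDKit is imported inside the function so the density helper can be used on its own.

```python
def on_bit_fraction(num_off_bits, fp_size):
    """Convert an unset-bit count into the fraction of bits that are set."""
    if fp_size <= 0 or not 0 <= num_off_bits <= fp_size:
        raise ValueError("need 0 <= num_off_bits <= fp_size and fp_size > 0")
    return 1.0 - num_off_bits / fp_size


def sweep_max_path(smiles, max_paths=(5, 6, 7), fp_size=2048):
    """Return (maxPath, branched, set-bit fraction) rows for one molecule."""
    from rdkit import Chem  # imported lazily; requires RDKit

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("could not parse SMILES")

    rows = []
    for branched in (True, False):
        for max_path in max_paths:
            fp = Chem.RDKFingerprint(
                mol, maxPath=max_path, fpSize=fp_size, branchedPaths=branched
            )
            rows.append(
                (max_path, branched, on_bit_fraction(fp.GetNumOffBits(), fp_size))
            )
    return rows
```

Applied to the counts quoted in this thread, `on_bit_fraction(32, 2048)` is about 98% for the default branched fingerprint at maxPath=7, while `on_bit_fraction(786, 2048)` is about 62% for the unbranched variant. A density cutoff for flagging saturated fingerprints in a database would be a judgment call and is not something the thread specifies.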
>
> -greg
>
>
>
> On Thu, Jun 1, 2017 at 4:28 PM, Nils Weskamp <nils.wesk...@gmail.com>
> wrote:
>
>> Dear RDKitters,
>>
>> I just calculated RDKit "Daylight-like" fingerprints for a number of
>> public compound databases and found quite a number of examples where the
>> resulting fingerprints have *all* bits set to 1. This happens in both KNIME
>> 3.2.1 (1024/1/7) and also via the command line (2048/1/7/4) for RDKit
>> 2016.03.
>>
>> Examples include (from SureChEMBL):
>>
>> SCHEMBL5141968
>>
>> SCHEMBL13916889
>>
>> SCHEMBL16257315
>>
>> SCHEMBL16257310
>>
>> SCHEMBL16257297
>>
>> SCHEMBL16257215
>>
>> SCHEMBL16257169
>>
>> SCHEMBL8232906
>>
>> SCHEMBL16257312
>>
>> SCHEMBL13011081
>>
>> SCHEMBL12570100
>>
>> SCHEMBL14524878
>>
>> SCHEMBL6370886
>>
>> SCHEMBL15305169
>>
>> SCHEMBL16912871
>>
>> SCHEMBL13290179
>>
>>
>> Now, these are obviously some very large and complex molecules, so I
>> would expect that they contain many features and thus set many bits - but
>> all of them?
>>
>> So, in short: Are these compounds so ugly that it is normal for the
>> fingerprints to have all bits set or are they so ugly that they trigger
>> some rare bug in RDKit?
>>
>> Any ideas / suggestions / comments?
>>
>> Thanks a lot,
>> Nils
>>
>> ------------------------------------------------------------------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>>
>