Hi Nils,

I don't think this is really a bug; it's more a matter of default
parameters that aren't appropriate for the molecules being considered.

The RDKit fingerprint hashes subgraphs (branched and unbranched) within a
particular range of sizes, as measured by the number of bonds in the
subgraph. The default is to include subgraphs with between 1 and 7 bonds.
These molecules are quite complex and thus have lots of different unique
subgraphs, particularly when going out to 7 bonds.
Here's a quick demo of how quickly the fingerprint fills up as the maximum
subgraph size increases for one of your molecules (note that
GetNumOffBits() counts the bits that are *not* set, so smaller numbers
mean a fuller fingerprint):

In [6]: m = Chem.MolFromSmiles(r'COCC(C)NC1C(N(CC(C)(C)C)CC(N1C1=C2N=C(NC(C)=O)SC2=C(C2C(C(CC3=C2C=CC=C3)C(F)(F)F)C2=CC=CN=C2)C(N2CCN(CC(C)(C)C)C(CC(C)C)C2)=C1C1=C2N=C(NC(C)=O)SC2=CC=C1)C1=CC=CC2=C1N=C(N)C(=O)N2C1=CC=CN=C1C(F)(F)F)C1=CC=CC=C1')

In [7]: Chem.RDKFingerprint(m,maxPath=7,fpSize=2048).GetNumOffBits()
Out[7]: 32

In [8]: Chem.RDKFingerprint(m,maxPath=6,fpSize=2048).GetNumOffBits()
Out[8]: 363

In [9]: Chem.RDKFingerprint(m,maxPath=5,fpSize=2048).GetNumOffBits()
Out[9]: 999


If you think about how many different 7-bond branched subgraphs are
possible in that molecule, the large number of set bits starts to make
sense.
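A rough way to put numbers on that intuition is sketched below. It is only an idealized model, not a description of RDKit's actual hashing: it assumes every unique subgraph sets a fixed number of bits chosen uniformly at random, and the value bits_per_hash=2 is an assumption chosen to mirror what I believe is the RDKFingerprint default. The function names are made up for this illustration. The off-bit counts fed in at the end are the ones from the session above.

```python
import math


def expected_on_bits(n_subgraphs, fp_size=2048, bits_per_hash=2):
    """Expected number of set bits under the idealized model where each
    unique subgraph sets bits_per_hash bits chosen uniformly at random."""
    draws = n_subgraphs * bits_per_hash
    return fp_size * (1.0 - (1.0 - 1.0 / fp_size) ** draws)


def implied_subgraphs(n_off_bits, fp_size=2048, bits_per_hash=2):
    """Invert the model: roughly how many unique subgraphs would leave
    n_off_bits bits unset. Undefined when no bits are left unset."""
    if not 0 < n_off_bits <= fp_size:
        raise ValueError("n_off_bits must be between 1 and fp_size")
    draws = math.log(n_off_bits / fp_size) / math.log(1.0 - 1.0 / fp_size)
    return draws / bits_per_hash


# Off-bit counts reported above for maxPath = 5, 6 and 7 (fpSize=2048).
for max_path, n_off in [(5, 999), (6, 363), (7, 32)]:
    estimate = implied_subgraphs(n_off)
    print(f"maxPath={max_path}: roughly {estimate:.0f} unique subgraphs")
```

Under this model the number of set bits levels off once the number of unique subgraphs is comparable to the fingerprint size, so adding one more bond to the maximum subgraph size can push an already dense fingerprint close to full.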

The default of 7 bonds is probably not the best choice - Sereina ended up
selecting "RDK5" (the RDKit fingerprint with maxPath=5) while doing the
benchmarking papers and that's what I also tend to use now - but it's
difficult to change the default at this point without breaking a lot of
code in hard-to-detect ways.

The maxPath parameter is available in KNIME on the advanced tab in the
RDKit Fingerprint configuration dialog.

For the sake of completeness: most of the increase in the number of set
bits is due to the inclusion of branched subgraphs; if you turn those off,
the number of set bits drops dramatically (again, the values below are
counts of unset bits):

In [13]: Chem.RDKFingerprint(m,maxPath=7,branchedPaths=False,fpSize=2048).GetNumOffBits()
Out[13]: 786

In [14]: Chem.RDKFingerprint(m,maxPath=6,branchedPaths=False,fpSize=2048).GetNumOffBits()
Out[14]: 1145

In [15]: Chem.RDKFingerprint(m,maxPath=5,branchedPaths=False,fpSize=2048).GetNumOffBits()
Out[15]: 1460


This form of the fingerprint, of course, contains a lot less information.
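
To repeat the comparison above on other molecules, the scan over maxPath and branchedPaths can be wrapped in a small helper. This is just a sketch: off_bit_table and main are names invented here, the only RDKit calls used are the ones already shown in this message (Chem.MolFromSmiles, Chem.RDKFingerprint with maxPath, branchedPaths and fpSize, and GetNumOffBits), and the RDKit import is deferred so the helper itself has no hard dependency.

```python
def off_bit_table(mol, fingerprint, max_paths=(5, 6, 7), fp_size=2048):
    """Return {(max_path, branched): number of unset bits} for one molecule.

    `fingerprint` is any callable with the keyword interface of
    Chem.RDKFingerprint that returns an object with GetNumOffBits().
    """
    table = {}
    for branched in (True, False):
        for max_path in max_paths:
            fp = fingerprint(mol, maxPath=max_path,
                             branchedPaths=branched, fpSize=fp_size)
            table[(max_path, branched)] = fp.GetNumOffBits()
    return table


def main(smiles):
    """Print the off-bit table for a SMILES string (requires RDKit)."""
    from rdkit import Chem  # imported here so the helper above stays RDKit-free

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    table = off_bit_table(mol, Chem.RDKFingerprint)
    for (max_path, branched), n_off in sorted(table.items()):
        print(f"maxPath={max_path} branchedPaths={branched}: {n_off} bits unset")


# Example call, with RDKit installed:
#     main('c1ccccc1O')
```

A molecule whose table shows only a handful of unset bits at the settings you plan to use is a sign that a smaller maxPath (or a larger fpSize) is worth considering.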

-greg



On Thu, Jun 1, 2017 at 4:28 PM, Nils Weskamp <nils.wesk...@gmail.com> wrote:

> Dear RDKitters,
>
> I just calculated RDKit "Daylight-like" fingerprints for a number of
> public compound databases and found quite a number of examples where the
> resulting fingerprints have *all* bits set to 1. This happens in both KNIME
> 3.2.1 (1024/1/7) and also via the command line (2048/1/7/4) for RDKit
> 2016.03.
>
> Examples include (from SureChEMBL):
>
> SCHEMBL5141968
> SCHEMBL13916889
> SCHEMBL16257315
> SCHEMBL16257310
> SCHEMBL16257297
> SCHEMBL16257215
> SCHEMBL16257169
> SCHEMBL8232906
> SCHEMBL16257312
> SCHEMBL13011081
> SCHEMBL12570100
> SCHEMBL14524878
> SCHEMBL6370886
> SCHEMBL15305169
> SCHEMBL16912871
> SCHEMBL13290179
>
> Now, these are obviously some very large and complex molecules, so I would
> expect that they contain many features and thus set many bits - but all of
> them?
>
> So, in short: Are these compounds so ugly that it is normal for the
> fingerprints to have all bits set or are they so ugly that they trigger
> some rare bug in RDKit?
>
> Any ideas / suggestions / comments?
>
> Thanks a lot,
> Nils
>
> ------------------------------------------------------------
> ------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>