Dear TJ,
On Mon, Dec 27, 2010 at 11:41 PM, TJ O'Donnell <[email protected]> wrote:
> I was surprised that, using topological fingerprints, the tanimoto
> similarity between benzene and toluene is 0.32
> Examining the fp bits, I can see why. But I don't understand why so
> many paths are repeated for toluene.
> To my way of thinking, paths that trace the same types of atoms should
> not be considered different, and therefore
> set new bits. Am I missing something?
One point that may help: with the default arguments to RDKFingerprint,
each path (really each subgraph, since they can be branched) sets
multiple bits. The number of bits set per subgraph is controlled by the
nBitsPerHash argument to RDKFingerprint. Here's a demonstration:
In [9]: from rdkit import Chem
In [10]: bz = Chem.MolFromSmiles('c1ccccc1')
In [11]: tl = Chem.MolFromSmiles('Cc1ccccc1')
In [12]: fp1 = Chem.RDKFingerprint(bz,nBitsPerHash=1)
In [13]: fp2 = Chem.RDKFingerprint(tl,nBitsPerHash=1)
In [14]: fp1.GetNumOnBits()
Out[14]: 6
In [15]: fp2.GetNumOnBits()
Out[15]: 19
In [16]: iBits=fp1&fp2
In [17]: iBits.GetNumOnBits()
Out[17]: 6
In [18]: fp12 = Chem.RDKFingerprint(bz,nBitsPerHash=2)
In [19]: fp22 = Chem.RDKFingerprint(tl,nBitsPerHash=2)
In [20]: fp12.GetNumOnBits()
Out[20]: 12
In [21]: fp22.GetNumOnBits()
Out[21]: 38
In [22]: iBits=fp12&fp22
In [23]: iBits.GetNumOnBits()
Out[23]: 12
The default is to set 4 bits per subgraph:
In [36]: fp14 = Chem.RDKFingerprint(bz)
In [37]: fp14.GetNumOnBits()
Out[37]: 24
In [38]: fp24 = Chem.RDKFingerprint(tl)
In [39]: fp24.GetNumOnBits()
Out[39]: 75
That last value is 75 instead of the 76 (4 x 19) you would expect
because of a bit collision.
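The counts above also account for the similarity you saw. Here is a small
pure-Python sketch of the Tanimoto arithmetic using only the on-bit counts
from the session. The helper name tanimoto_from_counts is made up for
illustration, and the intersection count for the default (4 bits per hash)
case is an assumption, since the session above does not print it.

```python
def tanimoto_from_counts(n_a, n_b, n_common):
    """Tanimoto similarity from on-bit counts: c / (a + b - c)."""
    union = n_a + n_b - n_common
    # Returning 0.0 for two empty fingerprints is a convention chosen for this sketch.
    return n_common / union if union else 0.0


# (benzene on bits, toluene on bits, bits in common)
cases = {
    "nBitsPerHash=1": (6, 19, 6),    # counts from the session above
    "nBitsPerHash=2": (12, 38, 12),  # counts from the session above
    # ASSUMPTION: all 24 benzene bits are also set in toluene; the
    # intersection for the default case was not shown in the session.
    "nBitsPerHash=4 (default)": (24, 75, 24),
}

for label, (n_bz, n_tl, n_both) in cases.items():
    print(f"{label}: {tanimoto_from_counts(n_bz, n_tl, n_both):.3f}")
# prints 0.316, 0.316 and 0.320
```

Working from the actual bit vectors, DataStructs.TanimotoSimilarity(fp1, fp2)
from rdkit does the same calculation.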
In some other validation work I've done recently, it's become pretty
clear that the default value for nBitsPerHash is too high: the bit
densities for drug-like molecules get really high, which leads to a
general increase in calculated similarities and to too many molecules
that have high calculated similarities but don't look much alike
(due to bit collisions). I've already changed the default value in the
database cartridge to 2 bits per hash instead of 4, and I am considering
making the same change for Python as well. I'll cover that in a separate
post (thanks for the reminder that I should bring it up).
Best Regards,
-greg
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss