Hi Jan,

  GetMorganFingerprint() returns a count fingerprint, and the Tanimoto 
calculation on it does the full Jaccard similarity, including the counts. 

The GetMorganFingerprintAsBitVect() version only uses the keys (that is, it 
treats all non-zero values as being 1) when computing the Tanimoto.

> On Sep 14, 2019, at 11:07, Jan Halborg Jensen <jhjen...@chem.ku.dk> wrote:
> 
> When using GetMorganFingerprintAsBitVect I get the “expected” Tanimoto score
> 
> mol1 = Chem.MolFromSmiles('CCC')
> mol2 = Chem.MolFromSmiles('CNC')
> 
> fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1,2,nBits=1024)
> fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2,2,nBits=1024)

>>> list(fp1.GetOnBits())
[33, 80, 294, 320]
>>> list(fp2.GetOnBits())
[33, 128, 406, 539]

You can see the intersection is 1 and the union is 7, giving 1/7 = 0.142... as 
the Tanimoto, which is what you demonstrated was the result.
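
If you want to check that by hand, here is a minimal sketch in plain Python. It 
does not call RDKit; the on-bit lists are copied from the session above, and 
binary_tanimoto() is just a helper name I made up for this illustration.

```python
# On-bit indices copied from the GetOnBits() output above.
on_bits_1 = {33, 80, 294, 320}
on_bits_2 = {33, 128, 406, 539}


def binary_tanimoto(a, b):
    """Binary Tanimoto (Jaccard) of two sets of on-bit indices."""
    union = a | b
    if not union:
        # Assumed convention for two empty fingerprints; not taken from RDKit.
        return 0.0
    return len(a & b) / len(union)


score = binary_tanimoto(on_bits_1, on_bits_2)
print(len(on_bits_1 & on_bits_2), len(on_bits_1 | on_bits_2))  # intersection, union
print(score)  # 1/7
```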

> However, when using GetMorganFingerprint I get a difference score.
> 
> fp1 = AllChem.GetMorganFingerprint(mol1,2)
> fp2 = AllChem.GetMorganFingerprint(mol2,2)

>>> fp1.GetNonzeroElements()
{2068133184: 1, 2245384272: 1, 2246728737: 2, 3542456614: 2}
>>> fp2.GetNonzeroElements()
{847961216: 1, 869080603: 1, 2246728737: 2, 3824063894: 2}

Note that there is one shared key (2246728737) while the other 6 each occur in 
only one fingerprint, so the union has 7 keys. The binary Tanimoto - treating 
all counts as 1 - gives 1/7, matching the BitVect version.

On the other hand, the common value 2246728737 is present 2 times in each 
fingerprint, and 3542456614 and 3824063894 are each present twice in one 
fingerprint, so the Jaccard, or count Tanimoto, is

   2 / ((1+1+2+2)+(1+1+2+2)-2) = 2/10 = 0.2

matching the value you computed.
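
Here is a rough sketch of both calculations in plain Python, using the 
dictionaries from GetNonzeroElements() above. It follows the arithmetic in this 
message rather than calling RDKit, and the helper names keys_tanimoto() and 
count_tanimoto() are mine, not part of the RDKit API.

```python
# Key -> count dictionaries copied from the GetNonzeroElements() output above.
counts_1 = {2068133184: 1, 2245384272: 1, 2246728737: 2, 3542456614: 2}
counts_2 = {847961216: 1, 869080603: 1, 2246728737: 2, 3824063894: 2}


def keys_tanimoto(a, b):
    """Binary Tanimoto: ignore the counts and compare only the keys."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a.keys() & b.keys()) / len(union)


def count_tanimoto(a, b):
    """Count Tanimoto: shared / (total_a + total_b - shared), where
    'shared' sums the smaller of the two counts over the common keys."""
    shared = sum(min(a[k], b[k]) for k in a.keys() & b.keys())
    denominator = sum(a.values()) + sum(b.values()) - shared
    if denominator == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return shared / denominator


binary_score = keys_tanimoto(counts_1, counts_2)   # 1/7
count_score = count_tanimoto(counts_1, counts_2)   # 2/10
print(binary_score, count_score)
```

The only difference between the two helpers is whether a key contributes 1 or 
its count, which is why the two fingerprint types give you different scores 
for the same pair of molecules.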


                                Andrew
                                da...@dalkescientific.com




_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
