Re: [Rdkit-discuss] Tanimoto and fingerprint representation

2019-09-14 Thread Andrew Dalke
Hi Jan,

  GetMorganFingerprint() returns a count fingerprint, and the Tanimoto 
calculation on it computes the full Jaccard similarity, including the counts. 

The GetMorganFingerprintAsBitVect() version only uses the keys (that is, it 
treats all non-zero values as being 1) when computing the Tanimoto.

> On Sep 14, 2019, at 11:07, Jan Halborg Jensen wrote:
> 
> When using GetMorganFingerprintAsBitVect I get the “expected” Tanimoto score
> 
> mol1 = Chem.MolFromSmiles('CCC')
> mol2 = Chem.MolFromSmiles('CNC')
> 
> fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1,2,nBits=1024)
> fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2,2,nBits=1024)

>>> list(fp1.GetOnBits())
[33, 80, 294, 320]
>>> list(fp2.GetOnBits())
[33, 128, 406, 539]

Only bit 33 is set in both, so the intersection is 1 and the union is 
4 + 4 - 1 = 7, giving 1/7 = 0.142... as the Tanimoto, which is what you 
demonstrated was the result.
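
As a cross-check that needs no RDKit, here is a minimal sketch of the binary Tanimoto computed directly from the on-bit lists above. The function name binary_tanimoto is made up for illustration, and returning 0.0 for two empty fingerprints is an arbitrary choice of this sketch, not necessarily what RDKit does.

```python
def binary_tanimoto(on_bits_a, on_bits_b):
    """Binary Tanimoto from two iterables of on-bit indices."""
    a, b = set(on_bits_a), set(on_bits_b)
    union = len(a | b)
    # Assumption of this sketch: two empty fingerprints score 0.0.
    return len(a & b) / union if union else 0.0


# On-bit lists copied from the GetOnBits() output above.
fp1_bits = [33, 80, 294, 320]
fp2_bits = [33, 128, 406, 539]

print(binary_tanimoto(fp1_bits, fp2_bits))  # 1/7 = 0.14285714285714285
```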

> However, when using GetMorganFingerprint I get a different score.
> 
> fp1 = AllChem.GetMorganFingerprint(mol1,2)
> fp2 = AllChem.GetMorganFingerprint(mol2,2)

>>> fp1.GetNonzeroElements()
{2068133184: 1, 2245384272: 1, 2246728737: 2, 3542456614: 2}
>>> fp2.GetNonzeroElements()
{847961216: 1, 869080603: 1, 2246728737: 2, 3824063894: 2}

Note that there is one shared key (2246728737) while the other 6 are unique, 
so the union contains 7 keys. The binary Tanimoto - treating all counts as 1 - 
gives 1/7, matching the BitVect version.

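On the other hand, the count Tanimoto divides the shared counts by the total 
counts of both fingerprints minus the shared counts. The common value 
2246728737 is present 2 times in each fingerprint, and 3542456614 and 
3824063894 are each present twice in one fingerprint, so the Jaccard, or 
count Tanimoto, is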

   2 / ((1+1+2+2)+(1+1+2+2)-2) = 2/10 = 0.2

matching the value you computed.
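
The same arithmetic can be sketched in plain Python, working on the dictionaries returned by GetNonzeroElements(). This assumes the min/max form of the count Tanimoto (sum of per-key minimums over sum of per-key maximums), which for non-negative counts equals the shared / (total1 + total2 - shared) expression above. The function name count_tanimoto is hypothetical, and this is not RDKit's own implementation.

```python
def count_tanimoto(counts_a, counts_b):
    """Count (Jaccard) Tanimoto for two {key: count} dictionaries."""
    keys = set(counts_a) | set(counts_b)
    # Shared counts: per-key minimum, zero if the key is missing on one side.
    shared = sum(min(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    # Per-key maximum; equals sum(a) + sum(b) - shared for non-negative counts.
    total = sum(max(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    # Assumption of this sketch: two empty fingerprints score 0.0.
    return shared / total if total else 0.0


# Dictionaries copied from the GetNonzeroElements() output above.
fp1_counts = {2068133184: 1, 2245384272: 1, 2246728737: 2, 3542456614: 2}
fp2_counts = {847961216: 1, 869080603: 1, 2246728737: 2, 3824063894: 2}

print(count_tanimoto(fp1_counts, fp2_counts))  # 2/10 = 0.2

# Setting every count to 1 reproduces the binary result of 1/7.
fp1_binary = dict.fromkeys(fp1_counts, 1)
fp2_binary = dict.fromkeys(fp2_counts, 1)
print(count_tanimoto(fp1_binary, fp2_binary))  # 1/7 = 0.14285714285714285
```

Binarizing the counts before the call is what links the two results: the same function then gives the 1/7 of the BitVect version.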


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Tanimoto and fingerprint representation

2019-09-14 Thread Jan Halborg Jensen
When using GetMorganFingerprintAsBitVect I get the “expected” Tanimoto score

import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

mol1 = Chem.MolFromSmiles('CCC')
mol2 = Chem.MolFromSmiles('CNC')

fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1,2,nBits=1024)
fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2,2,nBits=1024)

print(DataStructs.TanimotoSimilarity(fp1, fp2))

arr1 = np.zeros((1,))
DataStructs.ConvertToNumpyArray(fp1, arr1)
arr2 = np.zeros((1,))
DataStructs.ConvertToNumpyArray(fp2, arr2)
print(np.sum(arr1*arr2)/np.sum(arr1+arr2-arr1*arr2))

0.14285714285714285
0.14285714285714285



However, when using GetMorganFingerprint I get a different score.

fp1 = AllChem.GetMorganFingerprint(mol1,2)
fp2 = AllChem.GetMorganFingerprint(mol2,2)

print(DataStructs.TanimotoSimilarity(fp1, fp2))

0.2

I thought the Tanimoto score was always computed using bit vectors.  Can anyone 
explain?

Best regards, Jan