Hi Jan,
GetMorganFingerprint() returns a count fingerprint, and the Tanimoto
calculation on it is the full Jaccard similarity, including the counts.
The GetMorganFingerprintAsBitVect() version only uses the keys (that is, it
treats all non-zero values as being 1) when computing the Tanimoto.
> On Sep 14, 2019, at 11:07, Jan Halborg Jensen wrote:
>
> When using GetMorganFingerprintAsBitVect I get the “expected” Tanimoto score
>
> mol1 = Chem.MolFromSmiles('CCC')
> mol2 = Chem.MolFromSmiles('CNC')
>
> fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1,2,nBits=1024)
> fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2,2,nBits=1024)
>>> list(fp1.GetOnBits())
[33, 80, 294, 320]
>>> list(fp2.GetOnBits())
[33, 128, 406, 539]
You can see the intersection is 1 and the union is 7, giving 1/7 = 0.142... as
the Tanimoto, which is what you demonstrated was the result.
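To make the arithmetic explicit, here is a minimal sketch in plain Python (no
RDKit needed) that recomputes the score from the on-bit lists above. The helper
name binary_tanimoto and the choice to return 0.0 for two empty fingerprints
are my own assumptions for illustration, not RDKit API.

```python
def binary_tanimoto(on_bits_a, on_bits_b):
    """Binary Tanimoto: |A intersect B| / |A union B| over the sets of on bits."""
    a, b = set(on_bits_a), set(on_bits_b)
    union = a | b
    if not union:
        # Assumption for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / len(union)


# On bits copied from the GetOnBits() output above.
fp1_bits = [33, 80, 294, 320]
fp2_bits = [33, 128, 406, 539]

# One shared bit (33) out of seven distinct bits.
bitvect_score = binary_tanimoto(fp1_bits, fp2_bits)
print(bitvect_score)  # 1/7, about 0.1429
```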
> However, when using GetMorganFingerprint I get a different score.
>
> fp1 = AllChem.GetMorganFingerprint(mol1,2)
> fp2 = AllChem.GetMorganFingerprint(mol2,2)
>>> fp1.GetNonzeroElements()
{2068133184: 1, 2245384272: 1, 2246728737: 2, 3542456614: 2}
>>> fp2.GetNonzeroElements()
{847961216: 1, 869080603: 1, 2246728737: 2, 3824063894: 2}
Note that there is one shared key (2246728737) while the other 6 are unique
(3 in each fingerprint), for 7 distinct keys in total.
The binary Tanimoto, which treats all counts as 1, gives 1/7, matching the
BitVect version.
On the other hand, the common value 2246728737 is present 2 times in each
fingerprint, and 3542456614 and 3824063894 are each present twice in one
fingerprint, so the Jaccard, or count Tanimoto, is
2 / ((1+1+2+2)+(1+1+2+2)-2) = 2/10 = 0.2
matching the value you computed.
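The same calculation as a minimal plain-Python sketch, using the
GetNonzeroElements() dictionaries shown above. The helper name count_tanimoto
and the 0.0 result for two empty fingerprints are assumptions of mine for
illustration; this is not RDKit's own code, just the formula written out.

```python
def count_tanimoto(counts_a, counts_b):
    """Count (Jaccard) Tanimoto: shared / (total_a + total_b - shared),
    where shared sums min(count_a, count_b) over the keys present in both."""
    shared = sum(min(n, counts_b[key])
                 for key, n in counts_a.items() if key in counts_b)
    total = sum(counts_a.values()) + sum(counts_b.values()) - shared
    # Assumption for this sketch: two empty fingerprints score 0.0.
    return shared / total if total else 0.0


# Dictionaries copied from the GetNonzeroElements() output above.
fp1_counts = {2068133184: 1, 2245384272: 1, 2246728737: 2, 3542456614: 2}
fp2_counts = {847961216: 1, 869080603: 1, 2246728737: 2, 3824063894: 2}

count_score = count_tanimoto(fp1_counts, fp2_counts)
print(count_score)  # 2 / (6 + 6 - 2) = 0.2

# Setting every count to 1 reduces this to the binary Tanimoto on the keys.
keys_only_score = count_tanimoto(dict.fromkeys(fp1_counts, 1),
                                 dict.fromkeys(fp2_counts, 1))
print(keys_only_score)  # 1/7, about 0.1429
```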
Andrew
da...@dalkescientific.com