On Thu, Apr 3, 2008 at 4:15 PM, Adrian Schreyer <ams...@cam.ac.uk> wrote:
> After a couple of hours work on comparing molecules with RDKit I
>  noticed that the algorithm is apparently struggling with structures
>  that consist of a repetitive substructure, such as linear alkanes
>  (azelaic acid C(CCCC(=O)O)CCCC(=O)O being the worst offender) or
>  molecules like gentian violet
>  CN(C)C1=CC=C(C=C1)C(=C2C=CC(=[N+](C)C)C=C2)C3=CC=C(C=C3)N(C)C and
>  malachite green CN(C)c1ccc(cc1)C(c2ccccc2)=C3C=CC(C=C3)=[N+](C)C. In
>  those cases the daylight fingerprint tanimoto is 1.0, although
>  malachite green for example lacks a complete amine group. I assume
>  this is due to the hashing of only unique topological paths;
>  Similarity using atom pair descriptors / SparseIntVect returns more
>  reasonable results. Maybe someone can shed light on this matter.

Let me start with a disclaimer: the RDKit topological fingerprint is
not identical to the Daylight fingerprint and its name should be
changed so that no one is misled. The fingerprint is calculated using
an algorithm similar to that described in the Daylight theory manual,
but it's definitely not the same.

Now an explanation of what I think is going on. For the sake of
accuracy, I will call the Daylight-like fingerprint the RDKit
fingerprint.
The RDKit fingerprint uses a bit vector where individual bits are set
by substructures (paths) in the molecule. The paths are by default at
most 7 bonds long. Because it's a bit vector, it doesn't matter how
many times a particular substructure appears: a second copy sets no
new bits so long as the copies are far enough apart that they don't
"see" each other in the fingerprint (that is, by default, more than 7
bonds, so that no single path spans both). Atom pairs and topological
torsions, on the other hand, use counts, so they would recognize the
difference between malachite green and gentian violet solely based on
the counts.
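
A toy sketch in plain Python may make this concrete. It is not the
RDKit algorithm: it assumes an unbranched chain of identical atoms,
treats linear paths of 1 to 7 bonds as the only features, and the
helper names (chain_path_counts, tanimoto_bits, dice_counts) are made
up for illustration.

```python
from collections import Counter

MAX_BONDS = 7  # default maximum path length mentioned above


def chain_path_counts(n_atoms, max_bonds=MAX_BONDS):
    """Count linear paths of 1..max_bonds bonds in an unbranched chain
    of identical atoms. The key is the path length in bonds."""
    counts = Counter()
    for n_bonds in range(1, max_bonds + 1):
        n_paths = n_atoms - n_bonds  # start positions for such a path
        if n_paths > 0:
            counts[n_bonds] = n_paths
    return counts


def tanimoto_bits(a, b):
    """Tanimoto on presence/absence only, as a bit vector would see it."""
    on_a, on_b = set(a), set(b)
    return len(on_a & on_b) / len(on_a | on_b)


def dice_counts(a, b):
    """Dice similarity that takes feature counts into account."""
    shared = sum(min(a[k], b[k]) for k in a.keys() & b.keys())
    return 2.0 * shared / (sum(a.values()) + sum(b.values()))


short_chain = chain_path_counts(10)
long_chain = chain_path_counts(14)

print(tanimoto_bits(short_chain, long_chain))  # 1.0
print(dice_counts(short_chain, long_chain))    # 0.75
```

Both chains contain every path length from 1 to 7 bonds, so the
bit-based similarity is 1.0 even though the chains differ by four
atoms. Only the counts tell them apart.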
One can see what a difference the counts make by using the two
different forms of atom pair fingerprints:
>>> # gv = gentian violet, mg = malachite green
>>> pair1 = Pairs.GetAtomPairFingerprintAsIntVect(gv)
>>> pair2 = Pairs.GetAtomPairFingerprintAsIntVect(mg)
>>> DataStructs.DiceSimilarity(pair1, pair2)
0.81415929203539827
>>> pair1bv = Pairs.GetAtomPairFingerprintAsBitVect(gv)
>>> pair2bv = Pairs.GetAtomPairFingerprintAsBitVect(mg)
>>> DataStructs.DiceSimilarity(pair1bv, pair2bv)
0.93693693693693691
so when we ignore the counts, we get higher similarities.

Atom pairs have the added advantage that they "see" the full size of
the molecule, so there are atom pairs corresponding to the
"amine-amine" distances. Topological torsions, which are paths of
four consecutive atoms (three bonds), don't see these, so the TT
similarity between your two molecules is higher than the AP
similarity:
>>> tors1 = Torsions.GetTopologicalTorsionFingerprintAsIntVect(mg)
>>> tors2 = Torsions.GetTopologicalTorsionFingerprintAsIntVect(gv)
>>> DataStructs.DiceSimilarity(tors1, tors2)
0.86274509803921573
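
The same effect shows up in a toy model, sketched here in plain
Python rather than with RDKit. The assumptions are mine: molecules
are linear chains written as strings of atom labels, an "atom pair"
is just (label, label, distance in bonds), a "torsion" is a run of
four consecutive atom labels, and the function names are hypothetical.
Real atom typing in RDKit is richer than this.

```python
from collections import Counter
from itertools import combinations


def atom_pairs(chain):
    """Count (label, label, distance) features over all atom pairs
    in a linear chain given as a string of atom labels."""
    counts = Counter()
    for i, j in combinations(range(len(chain)), 2):
        a, b = sorted((chain[i], chain[j]))
        counts[(a, b, j - i)] += 1  # j - i is the distance in bonds
    return counts


def torsions(chain):
    """Count runs of four consecutive atoms, read in a canonical
    direction so that a run and its reverse are the same feature."""
    counts = Counter()
    for i in range(len(chain) - 3):
        window = tuple(chain[i:i + 4])
        counts[min(window, window[::-1])] += 1
    return counts


def dice(a, b):
    """Count-based Dice similarity between two feature counters."""
    shared = sum(min(a[k], b[k]) for k in a.keys() & b.keys())
    return 2.0 * shared / (sum(a.values()) + sum(b.values()))


# Two toy "diamines" whose end groups sit 7 and 9 bonds apart.
short_diamine = "N" + "C" * 6 + "N"
long_diamine = "N" + "C" * 8 + "N"

ap_sim = dice(atom_pairs(short_diamine), atom_pairs(long_diamine))
tt_sim = dice(torsions(short_diamine), torsions(long_diamine))
print(f"atom pairs: {ap_sim:.3f}  torsions: {tt_sim:.3f}")
# atom pairs: 0.740  torsions: 0.833
```

The N-N pair lands at a different distance in each chain, which only
the atom pairs can register. The four-atom windows never span both
ends, so the torsion similarity stays higher.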

Technical reasons prevent topological torsions from being directly
represented as bit vectors, so I can't easily show the difference
there, but I hope the point is already clear.

Does this help clear things up?
-greg
