On Thu, Apr 3, 2008 at 4:15 PM, Adrian Schreyer <ams...@cam.ac.uk> wrote:
> After a couple of hours work on comparing molecules with RDKit I
> noticed that the algorithm is apparently struggling with structures
> that consist of a repetitive substructure, such as linear alkanes
> (azelaic acid C(CCCC(=O)O)CCCC(=O)O being the worst offender) or
> molecules like gentian violet
> CN(C)C1=CC=C(C=C1)C(=C2C=CC(=[N+](C)C)C=C2)C3=CC=C(C=C3)N(C)C and
> malachite green CN(C)c1ccc(cc1)C(c2ccccc2)=C3C=CC(C=C3)=[N+](C)C. In
> those cases the daylight fingerprint tanimoto is 1.0, although
> malachite green for example lacks a complete amine group. I assume
> this is due to the hashing of only unique topological paths;
> Similarity using atom pair descriptors / SparseIntVect returns more
> reasonable results. Maybe someone can shed light on this matter.

Let me start with a disclaimer: the RDKit topological fingerprint is not identical to the Daylight fingerprint, and its name should be changed so that no one is misled. The fingerprint is calculated using an algorithm similar to the one described in the Daylight theory manual, but it's definitely not the same.

Now an explanation of what I think is going on. For the sake of accuracy, I will call the Daylight-like fingerprint the RDKit fingerprint.

The RDKit fingerprint uses a bit vector where individual bits are set by substructures in the molecule. The substructures are by default at most 7 bonds long. Because it's a bit vector, it doesn't matter how many times a particular substructure appears, so long as the copies are far enough away from each other that they don't "see" each other in the fingerprint (that is, by default, 7 bonds).

Atom pairs and topological torsions, on the other hand, use counts, so they would recognize the difference between malachite green and gentian violet solely based on the counts. One can see what a difference the counts make by using the two different forms of atom pair fingerprints:

    # gv = gentian violet, mg = malachite green (the molecules from your SMILES)
    >>> pair1 = Pairs.GetAtomPairFingerprintAsIntVect(gv)
    >>> pair2 = Pairs.GetAtomPairFingerprintAsIntVect(mg)
    >>> DataStructs.DiceSimilarity(pair1, pair2)
    0.81415929203539827
    >>> pair1bv = Pairs.GetAtomPairFingerprintAsBitVect(gv)
    >>> pair2bv = Pairs.GetAtomPairFingerprintAsBitVect(mg)
    >>> DataStructs.DiceSimilarity(pair1bv, pair2bv)
    0.93693693693693691

So when we ignore the counts, we get higher similarities. Atom pairs have the added advantage that they "see" the full size of the molecule, so there are atom pairs corresponding to the "amine-amine" distances.

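To make the counts-versus-bits point concrete without depending on any particular RDKit version, here is a minimal sketch in plain Python. The feature labels and counts are made up for illustration and are not real fingerprint output, and the helper names dice_bits and dice_counts are hypothetical. The count-based Dice formula used here, 2 * sum(min(a, b)) / (sum(a) + sum(b)), is the usual definition for count vectors.

```python
from collections import Counter


def dice_counts(a, b):
    """Dice similarity on count vectors: 2 * sum(min) / (sum(a) + sum(b))."""
    shared = sum(min(a[k], b[k]) for k in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values())
    return 2.0 * shared / total if total else 0.0


def dice_bits(a, b):
    """Dice similarity on presence/absence only, i.e. with counts ignored."""
    on_a, on_b = set(a), set(b)
    total = len(on_a) + len(on_b)
    return 2.0 * len(on_a & on_b) / total if total else 0.0


# Hypothetical feature counts (NOT real fingerprint output): the two toy
# "molecules" contain the same kinds of features, but one has an extra amine.
three_amines = Counter({"aryl-NMe2": 3, "aryl ring": 3, "central C": 1})
two_amines = Counter({"aryl-NMe2": 2, "aryl ring": 3, "central C": 1})

bit_sim = dice_bits(three_amines, two_amines)      # same features present in both
count_sim = dice_counts(three_amines, two_amines)  # penalizes the missing amine

print("presence only:", bit_sim)   # 1.0
print("with counts:  ", count_sim)
```

With presence-only features the two toy molecules are indistinguishable, while the count-based comparison drops below 1.0 because one copy of the amine feature has no partner.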
Topological torsions, which cover only four consecutively bonded atoms, don't see these, so the TT similarity between your two molecules is higher than the AP similarity:

    >>> tors1 = Torsions.GetTopologicalTorsionFingerprintAsIntVect(mg)
    >>> tors2 = Torsions.GetTopologicalTorsionFingerprintAsIntVect(gv)
    >>> DataStructs.DiceSimilarity(tors1, tors2)
    0.86274509803921573

Technical reasons prevent topological torsions from being directly represented as bit vectors, so I can't easily show the difference there, but I hope the point is already clear.

Does this help clear things up?

-greg