Thanks Greg, this is what I suspected. By the way, are there plans to change error handling in RDKit more generally? Currently, GetAtomPairFingerprintAsIntVect(mol) throws a RuntimeError if the molecule is invalid; maybe this could be changed to behave analogously to the supplier functions. That would make generator expressions and list comprehensions easier to use. In addition, would it be possible to have a BulkDiceSimilarity function for SparseIntVects as well? I guess it would be considerably faster than iterating through a Python list and calling SparseIntVect.DiceSimilarity each time.
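To illustrate the kind of behaviour I mean, here is a minimal sketch of a wrapper that returns None instead of raising. It is plain Python with no RDKit calls: fingerprint_or_none and toy_fingerprint are hypothetical names, and toy_fingerprint is only a stand-in for a real fingerprint function such as GetAtomPairFingerprintAsIntVect. It assumes that failures surface as RuntimeError and that unparsable molecules arrive as None.

```python
from typing import Callable, Optional, TypeVar

M = TypeVar("M")
F = TypeVar("F")


def fingerprint_or_none(fp_func: Callable[[M], F], mol: Optional[M]) -> Optional[F]:
    """Return fp_func(mol), or None if mol is None or fp_func raises RuntimeError.

    Hypothetical helper, not part of RDKit.
    """
    if mol is None:  # assumption: an unparsable molecule is passed in as None
        return None
    try:
        return fp_func(mol)
    except RuntimeError:  # assumption: invalid molecules raise RuntimeError
        return None


def toy_fingerprint(mol: str) -> int:
    """Stand-in for a real fingerprint function; NOT an RDKit call."""
    if not mol:
        raise RuntimeError("invalid molecule")
    return len(mol)


mols = ["CCO", "", None, "c1ccccc1"]

# The comprehension no longer aborts on the first bad molecule.
results = [fingerprint_or_none(toy_fingerprint, m) for m in mols]

# Drop the failures afterwards if only valid fingerprints are wanted.
valid = [fp for fp in results if fp is not None]
```

With a wrapper like this the failures stay in position as None, so the results can still be matched back to the input molecules before filtering.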
Adrian

On Thu, Apr 3, 2008 at 8:56 PM, Greg Landrum <greg.land...@gmail.com> wrote:
>
> On Thu, Apr 3, 2008 at 4:15 PM, Adrian Schreyer <ams...@cam.ac.uk> wrote:
> > After a couple of hours work on comparing molecules with RDKit I
> > noticed that the algorithm is apparently struggling with structures
> > that consist of a repetitive substructure, such as linear alkanes
> > (azelaic acid C(CCCC(=O)O)CCCC(=O)O being the worst offender) or
> > molecules like gentian violet
> > CN(C)C1=CC=C(C=C1)C(=C2C=CC(=[N+](C)C)C=C2)C3=CC=C(C=C3)N(C)C and
> > malachite green CN(C)c1ccc(cc1)C(c2ccccc2)=C3C=CC(C=C3)=[N+](C)C. In
> > those cases the daylight fingerprint tanimoto is 1.0, although
> > malachite green for example lacks a complete amine group. I assume
> > this is due to the hashing of only unique topological paths;
> > Similarity using atom pair descriptors / SparseIntVect returns more
> > reasonable results. Maybe someone can shed light on this matter.
>
> Let me start with a disclaimer: the RDKit topological fingerprint is
> not identical to the Daylight fingerprint and its name should be
> changed so that no-one is mislead. The fingerprint is calculated using
> an algorithm similar to that described in the Daylight theory manual,
> but it's definitely not the same.
>
> Now an explanation of what I think is going on. For the sake of
> accuracy, I will call the Daylight-like fingerprint the RDKit
> fingerprint.
> The RDKit fingerprint uses a bit vector where individual bits are set
> by substructures in the molecule. The substructures are by default at
> most 7 bonds long. Because it's a bit vector, it doesn't matter how
> many times a particular substructure appears, so long as they are far
> enough away from each other that they don't "see" each other in the
> fingerprint (that is, by default, 7 bonds). Atom pairs and topological
> torsions, on the other hand, use counts, so they would recognize the
> difference between malachite green and gentian violet solely based on
> the counts.
> One can see what a difference the counts make by using the two
> different forms of atom pair fingerprints:
> [16]>>> pair1=Pairs.GetAtomPairFingerprintAsIntVect(gv)
> [17]>>> pair2=Pairs.GetAtomPairFingerprintAsIntVect(mg)
> [18]>>> DataStructs.DiceSimilarity(pair1,pair2)
> Out[18] 0.81415929203539827
> [19]>>> pair1bv=Pairs.GetAtomPairFingerprintAsBitVect(gv)
> [20]>>> pair2bv=Pairs.GetAtomPairFingerprintAsBitVect(mg)
> [21]>>> DataStructs.DiceSimilarity(pair1bv,pair2bv)
> Out[21] 0.93693693693693691
> so when we ignore the counts, we get higher similarities.
>
> Atom pairs have the added advantage that they "see" the full size of
> the molecule, so there are atom pairs corresponding to the
> "amine-amine" distances. Topological torsions (which are 4-bonds
> long), don't see these, so the TT similarity between your two
> molecules is higher than the AP similarity:
> [22]>>> tors1 = Torsions.GetTopologicalTorsionFingerprintAsIntVect(mg)
> [26]>>> tors2 = Torsions.GetTopologicalTorsionFingerprintAsIntVect(gv)
> [27]>>> DataStructs.DiceSimilarity(tors1,tors2)
> Out[27] 0.86274509803921573
>
> Technical reasons prevent topological torsions from being directly
> represented as bit vectors, so I can't easily show the difference
> there, but I hope the point is already clear.
>
> Does this help clear things up?
> -greg
>
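
The counts-versus-bits point in the quoted explanation can be reproduced on toy data. The following is a minimal sketch in plain Python, not RDKit code: dice_counts and dice_bits are hypothetical helpers, the feature names are made up, and the count-based formula used here, 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b)), is the usual Dice definition for count vectors and is assumed rather than taken from the RDKit source.

```python
from collections import Counter


def dice_counts(a: Counter, b: Counter) -> float:
    """Dice similarity on feature counts (hypothetical helper, not RDKit)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    # A feature shared by both contributes the smaller of its two counts.
    shared = sum(min(n, b[key]) for key, n in a.items())
    return 2.0 * shared / total


def dice_bits(a: Counter, b: Counter) -> float:
    """Dice similarity on presence/absence only, ignoring the counts."""
    keys_a, keys_b = set(a), set(b)
    total = len(keys_a) + len(keys_b)
    if total == 0:
        return 0.0
    return 2.0 * len(keys_a & keys_b) / total


# Toy features: two "molecules" that differ only in how often one
# substructure occurs (three amine groups versus two).
three_amines = Counter({"amine": 3, "aryl_ring": 3})
two_amines = Counter({"amine": 2, "aryl_ring": 3})

sim_counts = dice_counts(three_amines, two_amines)  # 2 * 5 / 11
sim_bits = dice_bits(three_amines, two_amines)      # 2 * 2 / 4
```

On this toy input the bit-based similarity is exactly 1.0 while the count-based one is 10/11, which is the same pattern as the 1.0 similarity reported for the two dyes: once counts are dropped, a missing repeat of a substructure becomes invisible.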