Thanks Greg, this is what I suspected. By the way, will you change
error handling in RDKit generally? Currently,
GetAtomPairFingerprintAsIntVect(mol) throws a RuntimeError if the
molecule is invalid. Maybe this could be changed to behave
analogously to the supplier functions. This would make the use of
generator expressions or list comprehensions easier. In addition,
would it be possible to have a BulkDiceSimilarity function for
SparseIntVects as well? I guess it would be considerably faster than
iterating through a Python list and calling
SparseIntVect.DiceSimilarity each time.
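
To illustrate the list comprehension point, here is a minimal sketch in
plain Python. The wrapper name or_none and the stand-in fingerprint
function are hypothetical and not part of RDKit; the stand-in only
mimics a call that raises RuntimeError on bad input.

```python
def or_none(func):
    """Wrap func so that a RuntimeError yields None instead of propagating."""
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except RuntimeError:
            return None
    return wrapper


def fingerprint(mol):
    # Hypothetical stand-in for a fingerprint call that raises on bad input.
    if mol is None:
        raise RuntimeError("invalid molecule")
    return len(mol)


mols = ["CCO", None, "c1ccccc1"]  # None plays the role of an invalid molecule
safe_fingerprint = or_none(fingerprint)

# Failed entries come back as None and are filtered out in one pass.
fps = [fp for fp in (safe_fingerprint(m) for m in mols) if fp is not None]
```

If the fingerprint functions returned None on failure themselves, the
wrapper would not be needed and only the final filter would remain.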

Adrian

On Thu, Apr 3, 2008 at 8:56 PM, Greg Landrum <greg.land...@gmail.com> wrote:
>
> On Thu, Apr 3, 2008 at 4:15 PM, Adrian Schreyer <ams...@cam.ac.uk> wrote:
>  > After a couple of hours work on comparing molecules with RDKit I
>  >  noticed that the algorithm is apparently struggling with structures
>  >  that consist of a repetitive substructure, such as linear alkanes
>  >  (azelaic acid C(CCCC(=O)O)CCCC(=O)O being the worst offender) or
>  >  molecules like gentian violet
>  >  CN(C)C1=CC=C(C=C1)C(=C2C=CC(=[N+](C)C)C=C2)C3=CC=C(C=C3)N(C)C and
>  >  malachite green CN(C)c1ccc(cc1)C(c2ccccc2)=C3C=CC(C=C3)=[N+](C)C. In
>  >  those cases the daylight fingerprint tanimoto is 1.0, although
>  >  malachite green for example lacks a complete amine group. I assume
>  >  this is due to the hashing of only unique topological paths;
>  >  Similarity using atom pair descriptors / SparseIntVect returns more
>  >  reasonable results. Maybe someone can shed light on this matter.
>
>  Let me start with a disclaimer: the RDKit topological fingerprint is
>  not identical to the Daylight fingerprint and its name should be
>  changed so that no-one is misled. The fingerprint is calculated using
>  an algorithm similar to that described in the Daylight theory manual,
>  but it's definitely not the same.
>
>  Now an explanation of what I think is going on. For the sake of
>  accuracy, I will call the Daylight-like fingerprint the RDKit
>  fingerprint.
>  The RDKit fingerprint uses a bit vector where individual bits are set
>  by substructures in the molecule. The substructures are by default at
>  most 7 bonds long. Because it's a bit vector, it doesn't matter how
>  many times a particular substructure appears, so long as they are far
>  enough away from each other that they don't "see" each other in the
>  fingerprint (that is, by default, 7 bonds). Atom pairs and topological
>  torsions, on the other hand, use counts, so they would recognize the
>  difference between malachite green and gentian violet solely based on
>  the counts.
>  One can see what a difference the counts make by using the two
>  different forms of atom pair fingerprints:
>  >>> # gv = gentian violet, mg = malachite green (the molecules above)
>  >>> pair1 = Pairs.GetAtomPairFingerprintAsIntVect(gv)
>  >>> pair2 = Pairs.GetAtomPairFingerprintAsIntVect(mg)
>  >>> DataStructs.DiceSimilarity(pair1, pair2)
>  0.81415929203539827
>  >>> pair1bv = Pairs.GetAtomPairFingerprintAsBitVect(gv)
>  >>> pair2bv = Pairs.GetAtomPairFingerprintAsBitVect(mg)
>  >>> DataStructs.DiceSimilarity(pair1bv, pair2bv)
>  0.93693693693693691
>  so when we ignore the counts, we get higher similarities.
>
>  Atom pairs have the added advantage that they "see" the full size of
>  the molecule, so there are atom pairs corresponding to the
>  "amine-amine" distances. Topological torsions (which are four atoms
>  long) don't see these, so the TT similarity between your two
>  molecules is higher than the AP similarity:
>  >>> tors1 = Torsions.GetTopologicalTorsionFingerprintAsIntVect(mg)
>  >>> tors2 = Torsions.GetTopologicalTorsionFingerprintAsIntVect(gv)
>  >>> DataStructs.DiceSimilarity(tors1, tors2)
>  0.86274509803921573
>
>  Technical reasons prevent topological torsions from being directly
>  represented as bit vectors, so I can't easily show the difference
>  there, but I hope the point is already clear.
>
>  Does this help clear things up?
>  -greg
>
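
The effect of counts versus bits described above can be reproduced with
a toy example in plain Python. This is only a sketch: the feature names
and counts are invented, and the count-based Dice formula used here
(twice the sum of the per-feature minima over the sum of all counts) is
assumed to be the usual definition rather than taken from the RDKit
source.

```python
from collections import Counter


def dice_counts(a, b):
    """Dice similarity on count vectors: 2 * sum(min) / (sum(a) + sum(b))."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(n, b[k]) for k, n in a.items())
    return 2.0 * shared / total


def dice_bits(a, b):
    """Dice similarity after reducing every nonzero count to a single bit."""
    a_bits = Counter(k for k, n in a.items() if n > 0)
    b_bits = Counter(k for k, n in b.items() if n > 0)
    return dice_counts(a_bits, b_bits)


# Invented features: one molecule has three copies of a feature,
# the other has only two, and everything else is identical.
mol_a = Counter({"amine-ring": 3, "ring-ring": 3})
mol_b = Counter({"amine-ring": 2, "ring-ring": 3})

sim_counts = dice_counts(mol_a, mol_b)  # sees the missing copy
sim_bits = dice_bits(mol_a, mol_b)      # cannot see it
```

With these toy vectors the bit version reports identical molecules,
while the count version does not, which is the same pattern as the
1.0 Tanimoto reported for the two dyes.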
