Dear Greg,

thanks for your detailed reply. I have just subscribed to rdkit-
discuss.

On Friday 11 February 2011 05:49:42 Greg Landrum wrote:
> > thanks for providing your software RDKit. I have a question
> > regarding the topological fingerprint defined in
> > /GraphMol/Fingerprints/Fingerprints.h and the parameter
> > 'branchedPaths'.
> > 
> > What exactly is a branched path? Does in differ from a subgraph?
> > Does it contain cycles?
> 
> A branched path is a subgraph. It can contain cycles. I used the
> "branchedPaths" name in the fingerprinting code so that I could
> avoid repeating the explanation of the difference between a path
> and subgraph that one finds in
> $RDBASE/Code/GraphMol/Subgraphs/Subgraphs.h

Thanks for clarification. I think the term 'branched path' is slightly 
misleading, especially if 'subgraph' is used to describe exactly the 
same concept -- but that's just my opinion.

> > I have looked at the code. Paths and branched path are both
> > identified by a set of bonds and are hashed in a similar
> > fashion. The only difference for paths and branched paths seems
> > to be 'bondNbrs', since branched paths contain bonds with more
> > than two adjacent bonds.
> 
> Both paths and subgraphs (branched paths) are hashed the same, the
> difference in the fingerprinting code is how they are identified.
> 
> The hash for each bond in a subgraph (or path) is determined by
> combining: the number of neighbors that bond has in the subgraph,
> the bond order, and the atomic numbers of the two atoms involved.
> The hash for the entire subgraph is determined by sorting the list
> of bond hashes, adding the count of the total number of atoms in
> the subgraph, and then hashing that list.
> 
> > However, I don't think that the number of neighbors  stored for
> > each bond uniquely defines the structure of a branched path.
> 
> It certainly doesn't, and it's not that hard to find
> counter-examples where the hashing algorithm breaks down and you
> get collisions (I give an example below). The simple approach used
> is certainly not a true canonicalization of the subgraphs. If you
> have ideas for improvements, I'd be more than happy to hear them;
> the primary constraint is that the subgraph hashing algorithm
> should not be too computationally intensive.
> 
> The current algorithm seems to work in practice, but that's more of
> a qualitative feeling than something concrete

I think it's important to distinguish two types of "collisions": Those 
that are due to insufficient canonicalization and those that are due to 
hashing. The latter are unavoidable when using exhaustive enumeration 
of paths/subgraphs but their influence on the performance can be 
reduced, e.g. by using a good hash function, adjusting the fingerprint 
size etc.
Collisions caused by insufficient canonicalization mean that some 
different structures can not be distinguished and thus not be fully 
exploited by the fingerprint, regardless of the fingerprint size used. 
Nevertheless, I also believe that the algorithm used in RDKit works 
well.

Although it's certainly not that hard to find a counterexample, I 
miserably failed...

> > Furthermore, does the hashing procedure distinguish, e.g., a path
> > CCN=CC from CN=CCC? Both have the same bonds but are different
> > paths...
> 
> These kinds of questions are, luckily, quite easy to answer:
> In [13]: m1 = Chem.MolFromSmiles('CCN=CC')
> 
> In [14]: m2=Chem.MolFromSmiles('CN=CCC')
> 
> In [15]:
> list(Chem.RDKFingerprint(m1,minPath=4,nBitsPerHash=1).GetOnBits())
> Out[15]: [65]
> 
> In [16]:
> list(Chem.RDKFingerprint(m2,minPath=4,nBitsPerHash=1).GetOnBits())
> Out[16]: [1491]
> 
> These are different because the CN bond has two neighbors in the
> first molecule and only one in the second (there's a CC bond that
> changes too).

Good point, adding another C atom should help to make my 
counterexample work as intended: The paths CCCN=CC and CCN=CCC result 
in the same 1-bit.

> Here's a real collision:
> In [23]:
> list(Chem.RDKFingerprint(Chem.MolFromSmiles('CC=CCCC'),minPath=5,n
> BitsPerHash=1).GetOnBits()) Out[23]: [1349]
> 
> In [24]:
> list(Chem.RDKFingerprint(Chem.MolFromSmiles('CCC=CCC'),minPath=5,n
> BitsPerHash=1).GetOnBits()) Out[24]: [1349]
> 
> > I would have expected a more sophisticated graph/tree
> > canonization algorithm for branched paths. Maybe I misunderstood
> > the code. Can you give me a hint on the hashing technique or is
> > there a publication the procedure is based on?
> 
> The details are above.
> 
> For this application -- generating a fingerprint for molecule
> similarity and filtering substructures -- it's not essential that
> the subgraphs be exactly canonicalized. Obviously the closer one
> gets to a true canonicalization the better, but it's definitely
> not critical. That said: if you have computationally efficient
> ideas for improving the canonicalization, it would be great to
> hear them; improved hashing of the subgraphs would definitely
> improve the fingerprints.

Regarding (simple) paths canonicalization is straight-forward and 
efficient: Each path (represented by a set of bonds) can be associated 
with two (possibly different) ordered lists of atom and bond types, 
e.g. N=C-C and C-C=N. Choosing the "smaller" one regarding atomic 
number and bond type (compared from left to right) as feature for 
hashing should work.

I recently implemented a fingerprint using trees as features, i.e. 
subgraphs not containing cycles. Canonicalization is implemented by 
unambiguously determining a rooted ordered tree and deriving an 
ordered list of atoms and bonds from that tree (also adding some 
additional flags). Of course, this is computationally more demanding, 
but still reasonable fast, I think.

However, I'm not sure if the procedure for trees pays off compared to 
your approach. I will do some comparisons next week and will report my 
results.


Best regards
Nils

------------------------------------------------------------------------------
The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE:
Pinpoint memory and threading errors before they happen.
Find and fix more than 250 security defects in the development cycle.
Locate bottlenecks in serial and parallel code that limit performance.
http://p.sf.net/sfu/intel-dev2devfeb
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to