Hello,
I'm trying to understand RDKit's hash fingerprint code. The
functions in Fingerprints.cpp call
Subgraphs.cpp:findAllSubgraphsOfLengthsMtoN. The code in Subgraphs
makes a distinction between subgraphs (which can have branches) and
paths (which are linear).
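To check that I have the distinction right, here is a minimal sketch in plain Python. It is my own brute-force illustration, not RDKit's algorithm: the bond list is a made-up 2-methylbutane skeleton, and the function names are hypothetical. It enumerates every connected set of n bonds (n >= 1) as a "subgraph" and keeps only the unbranched, ring-free ones as "paths".

```python
from itertools import combinations

# Hypothetical hydrogen-suppressed skeleton of 2-methylbutane:
# chain 0-1-2-3 with a methyl branch (atom 4) on atom 1.
BONDS = [(0, 1), (1, 2), (2, 3), (1, 4)]


def is_connected(bonds):
    """Return True if the bonds form one connected piece."""
    remaining = list(bonds)
    atoms = set(remaining.pop())
    grew = True
    while grew and remaining:
        grew = False
        for bond in list(remaining):
            if bond[0] in atoms or bond[1] in atoms:
                atoms.update(bond)
                remaining.remove(bond)
                grew = True
    return not remaining


def is_linear(bonds):
    """Return True if a connected bond set is an unbranched, open chain."""
    degree = {}
    for a, b in bonds:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    # No atom with more than two bonds (no branch), and one more atom
    # than bonds (no ring closure).
    return max(degree.values()) <= 2 and len(degree) == len(bonds) + 1


def subgraphs_of_length(bonds, n):
    """All connected sets of n bonds, branched or not."""
    return [c for c in combinations(bonds, n) if is_connected(c)]


def paths_of_length(bonds, n):
    """Only the linear subgraphs of n bonds."""
    return [c for c in subgraphs_of_length(bonds, n) if is_linear(c)]


all_subgraphs = subgraphs_of_length(BONDS, 3)
linear_only = paths_of_length(BONDS, 3)
branched = [s for s in all_subgraphs if s not in linear_only]
print(len(all_subgraphs), len(linear_only), branched)
```

For this skeleton there are three 3-bond subgraphs, of which two are linear paths; the third is the branched piece centered on atom 1. A path-only fingerprint would never see that third fragment.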
The implementation of Subgraphs.cpp:findAllSubgraphsOfLengthsMtoN
appears to do what it says - return subgraphs - and this comment
reinforces that:

  - to make few things clear it might be useful to typdef a
    "subgraphListType" even if it is exactly same as the
    "pathListType", just to not confuse between path vs. subgraph
    definitions

The code then uses Balaban's J index for the subgraph to seed the
PRNG that generates the fingerprint bits.
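As I understand the general idea, it looks something like the sketch below. This is only an illustration under my own assumptions: the integer invariants stand in for a per-fragment value such as a discretized Balaban J, Python's random module stands in for whatever PRNG RDKit uses, and fp_size and bits_per_fragment are hypothetical parameters, not RDKit's settings.

```python
import random


def fragment_bits(invariant, fp_size=2048, bits_per_fragment=4):
    """Map one fragment invariant to a reproducible list of bit positions."""
    # Seeding with the invariant means the same fragment always
    # produces the same sequence of positions.
    rng = random.Random(invariant)
    return [rng.randrange(fp_size) for _ in range(bits_per_fragment)]


def fingerprint(invariants, fp_size=2048, bits_per_fragment=4):
    """Union of the bit positions set by every fragment invariant."""
    bits = set()
    for invariant in invariants:
        bits.update(fragment_bits(invariant, fp_size, bits_per_fragment))
    return bits


# Hypothetical integer invariants for three fragments of one molecule.
fp = fingerprint([1017, 2203, 1017])
print(sorted(fp))
```

The point of the sketch is that the bits depend only on the invariant, so two fragments with the same invariant set the same bits whether they are branched or linear.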
This surprises me because I thought the Daylight approach was to
consider only linear fragments. The other implementations I looked at
use only linear paths:

OpenBabel - finger2.cpp:1077
  /// \brief Fingerprint based on linear fragments up to 7 atoms ID="FP2"

the CDK - PathTools.java:552-553
  * This method returns a set of paths. Each path is a <code>List</code> of atoms that
  * make up the path (ie they are sequentially connected).

and a private code base I have access to.
But upon closer reading of
http://www.daylight.com/dayhtml/doc/theory/theory.finger.html
I see that it never says the paths are linear; it only has text like
"atoms and bonds connected by paths up to 3 bonds long"
I admit, though, that I'm a bit surprised. I thought I knew how this
sort of code works.
Hmm, I haven't asked a question yet.... Greg? How did RDKit manage to
be the only one I found which uses branching in the fingerprint
generation code? Or do I have things wrong?
Andrew