Hello,

I'm trying to understand RDKit's hash fingerprint code. The functions in Fingerprints.cpp call Subgraphs.cpp:findAllSubgraphsOfLengthsMtoN. The code in Subgraphs distinguishes between subgraphs (which can have branches) and paths (which are linear).
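
To make the subgraph/path distinction concrete, here is a minimal sketch in plain Python. It is not RDKit code; the bond list and the helper names (is_connected, is_linear, subgraphs, linear_paths) are made up for illustration. It enumerates bond subsets by brute force on a three-armed skeleton (like the carbons of isobutane) and shows that the full branched fragment is a subgraph but not a path.

```python
from itertools import combinations

# Hypothetical toy molecule: central atom 0 bonded to atoms 1, 2 and 3.
BONDS = [(0, 1), (0, 2), (0, 3)]

def is_connected(bonds):
    """True if the given bonds form a single connected piece."""
    atoms = {a for bond in bonds for a in bond}
    seen = set()
    stack = [next(iter(atoms))]
    while stack:
        atom = stack.pop()
        if atom in seen:
            continue
        seen.add(atom)
        for a, b in bonds:
            if a == atom:
                stack.append(b)
            elif b == atom:
                stack.append(a)
    return seen == atoms

def is_linear(bonds):
    """True if the bonds form an unbranched, non-cyclic chain."""
    degree = {}
    for a, b in bonds:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    unbranched = max(degree.values()) <= 2
    acyclic = len(degree) == len(bonds) + 1
    return is_connected(bonds) and unbranched and acyclic

def subgraphs(bonds, n):
    """All connected sets of n bonds (branches allowed)."""
    return [c for c in combinations(bonds, n) if is_connected(c)]

def linear_paths(bonds, n):
    """The connected sets of n bonds that are also linear."""
    return [c for c in subgraphs(bonds, n) if is_linear(c)]

for n in (1, 2, 3):
    print(n, len(subgraphs(BONDS, n)), len(linear_paths(BONDS, n)))
```

With one or two bonds the two enumerations agree on this skeleton; they differ only at three bonds, where the single branched fragment appears among the subgraphs and no linear path exists.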

The implementation for Subgraphs.cpp:findAllSubgraphsOfLengthsMtoN appears to do what it says - return subgraphs - and the comment

- to make few things clear it might be useful to typdef a "subgraphListType" even if it is exactly same as the "pathListType", just to not confuse between
      path vs. subgraph definitions


reinforces that. The code then uses Balaban's J index for the subgraph to seed the PRNG to generate the fingerprint bits.
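
As a rough illustration of that seeding step, here is a sketch in plain Python. It is not RDKit's implementation: the integer seeds stand in for whatever value is derived from the subgraph invariant, and the function names, the fingerprint size of 2048 and the 4 bits per fragment are all assumptions chosen for the example.

```python
import random

def fragment_bits(seed, fp_size=2048, bits_per_fragment=4):
    """Bit positions for one fragment, derived from a PRNG seeded by `seed`.

    `seed` is a hypothetical integer standing in for a hash of the
    fragment's invariant; the same seed always yields the same bits.
    """
    rng = random.Random(seed)
    return {rng.randrange(fp_size) for _ in range(bits_per_fragment)}

def fingerprint(seeds, fp_size=2048, bits_per_fragment=4):
    """Union of the bit positions set by every fragment seed."""
    bits = set()
    for seed in seeds:
        bits |= fragment_bits(seed, fp_size, bits_per_fragment)
    return bits

# Two made-up fragment seeds; identical fragments would share a seed
# and therefore set identical bits.
print(sorted(fingerprint([1234, 5678])))
```

The point of the sketch is only that the bits depend on the fragment through its seed, so whether fragments are branched or linear changes which seeds, and hence which bits, show up.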

The use of branched subgraphs surprises me because I thought the Daylight approach was to consider only linear fragments. Implementations such as

OpenBabel - finger2.cpp:1077
/// \brief Fingerprint based on linear fragments up to 7 atoms ID="FP2"

the CDK - PathTools.java:552-553
* This method returns a set of paths. Each path is a <code>List</code> of atoms that
     * make up the path (ie they are sequentially connected).

and a private code base I have access to only use linear paths.

But upon closer reading of
  http://www.daylight.com/dayhtml/doc/theory/theory.finger.html
I see that it never says the paths are linear; it only has text like

  "atoms and bonds connected by paths up to 3 bonds long"

I admit, though, that I'm a bit surprised. I thought I knew how this sort of code works.

Hmm, I haven't asked a question yet.... Greg? How did RDKit manage to be the only one I found which uses branching in the fingerprint generation code? Or do I have things wrong?

                                Andrew


