Hi Andrew, I'll start with your question and then elaborate a bit:
On Mon, Feb 9, 2009 at 1:54 AM, Andrew Dalke <[email protected]> wrote: > > Hmm, I haven't asked a question yet.... Greg? How did RDKit manage to > be the only one I found which uses branching in the fingerprint > generation code? Or do I have things wrong? You are definitely correct: the topological fingerprints in the RDKit use subgraphs in addition to paths. I did things that way because that's how I read the Daylight theory page at the time, I had the subgraph code available, and it seemed reasonable. Now that you've pointed out the "uniqueness" of the approach and I've re-read the Daylight page, I can easily imagine doing it otherwise. As to why no-one else has done it this way... tja; that's a question I can't answer. :-) > I'm trying to understand RDKit's hash fingerprint code. The > functions in Fingerprints.cpp call > Subgraphs.cpp:findAllSubgraphsOfLengthsMtoN . The code in Subgraphs > makes a distinction between a subgraph (which can have branches) and > paths (which are linear). > > The implementation for Subgraphs.cpp:findAllSubgraphsOfLengthsMtoN > appears to do what it says - return subgraphs - and the comment > >> - to make few things clear it might be useful to typdef a >> "subgraphListType" >> even if it is exactly same as the "pathListType", just to not >> confuse between >> path vs. subgraph definitions > > > reinforces that. The code then uses Balaban's J index for the > subgraph to seed the PRNG to generate the fingerprint bits. One quick clarification: Balaban J is used as the hash in the function DaylightFingerprintMol(). This function is deprecated in favor of RDKFingeprintMol(), which uses a cleaner scheme for hashing the bits. I see that I never actually put a comment in the code to that effect though; I will fix that > > This surprises me because I thought the Daylight approach was to only > consider linear fragments. Implementations like in > > OpenBabel - finger2.cpp:1077 > /// \brief Fingerprint based on linear fragments up to 7 atoms ID="FP2" > > the CDK - PathTools.java:552-553 > * This method returns a set of paths. Each path is a > <code>List</code> of atoms that > * make up the path (ie they are sequentially connected). > > and a private code base I have access to only use linear paths. > > But upon closer reading of > http://www.daylight.com/dayhtml/doc/theory/theory.finger.html > I see that it doesn't say that paths are linear, only text like > > "atoms and bonds connected by paths up to 3 bonds long" > > I admit though that I'm a bit surprised. I thought I know how this > sort of code works. I must admit that I find the use of branched paths somehow more pleasing. If you don't include either branching or more detail about atom identity in the hashing, then it seems like you'd get 100% similarity between CCC and CC(C)C. One argument that's certainly in favor of doing paths instead of subgraphs is speed: generating and hashing the linear paths would be faster than the branched subgraphs. Of course, I doubt that generating these fingerprints normally lies on the critical path. -greg

