Re: [Rdkit-discuss] question about fingerprint generation

Greg Landrum Mon, 09 Feb 2009 06:21:56 +0000

Hi Andrew,

I'll start with your question and then elaborate a bit:

On Mon, Feb 9, 2009 at 1:54 AM, Andrew Dalke <[email protected]> wrote:
>
> Hmm, I haven't asked a question yet.... Greg? How did RDKit manage to
> be the only one I found which uses branching in the fingerprint
> generation code? Or do I have things wrong?

You are definitely correct: the topological fingerprints in the RDKit
use subgraphs in addition to paths. I did things that way because
that's how I read the Daylight theory page at the time, I had the
subgraph code available, and it seemed reasonable. Now that you've
pointed out the "uniqueness" of the approach and I've re-read the
Daylight page, I can easily imagine doing it otherwise. As to why
no-one else has done it this way... tja; that's a question I can't
answer. :-)

>   I'm trying to understand RDKit's hash fingerprint code. The
> functions in Fingerprints.cpp call
> Subgraphs.cpp:findAllSubgraphsOfLengthsMtoN . The code in Subgraphs
> makes a distinction between a subgraph (which can have branches) and
> paths (which are linear).
>
> The implementation for Subgraphs.cpp:findAllSubgraphsOfLengthsMtoN
> appears to do what it says - return subgraphs - and the comment
>
>>       - to make few things clear it might be useful to typdef a
>> "subgraphListType"
>>       even if it is exactly same as the "pathListType", just to not
>> confuse between
>>       path vs. subgraph definitions
>
>
> reinforces that. The code then uses Balaban's J index for the
> subgraph to seed the PRNG to generate the fingerprint bits.

One quick clarification: Balaban J is used as the hash in the function
DaylightFingerprintMol(). This function is deprecated in favor of
RDKFingeprintMol(), which uses a cleaner scheme for hashing the bits.
I see that I never actually put a comment in the code to that effect
though; I will fix that

>
> This surprises me because I thought the Daylight approach was to only
> consider linear fragments. Implementations like in
>
> OpenBabel - finger2.cpp:1077
> /// \brief Fingerprint based on linear fragments up to 7 atoms ID="FP2"
>
> the CDK - PathTools.java:552-553
>      * This method returns a set of paths. Each path is a
> <code>List</code> of atoms that
>      * make up the path (ie they are sequentially connected).
>
> and a private code base I have access to only use linear paths.
>
> But upon closer reading of
>   http://www.daylight.com/dayhtml/doc/theory/theory.finger.html
> I see that it doesn't say that paths are linear, only text like
>
>   "atoms and bonds connected by paths up to 3 bonds long"
>
> I admit though that I'm a bit surprised. I thought I know how this
> sort of code works.

I must admit that I find the use of branched paths somehow more
pleasing. If you don't include either branching or more detail about
atom identity in the hashing, then it seems like you'd get 100%
similarity between CCC and CC(C)C.

One argument that's certainly in favor of doing paths instead of
subgraphs is speed: generating and hashing the linear paths would be
faster than the branched subgraphs. Of course, I doubt that generating
these fingerprints normally lies on the critical path.

-greg

Re: [Rdkit-discuss] question about fingerprint generation

Reply via email to