Re: [Rdkit-discuss] question about fingerprint generation

Greg Landrum Mon, 09 Feb 2009 10:38:26 +0000

Andrew,

On Mon, Feb 9, 2009 at 11:26 AM, Andrew Dalke wrote:
> Greg:
>> I must admit that I find the use of branched paths somehow more
>> pleasing. If you don't include either branching or more detail about
>> atom identity in the hashing, then it seems like you'd get 100%
>> similarity between CCC and CC(C)C.
>
> I quite agree. I'm looking at this for substructure filtering,
> and it feels like the additional topology information should
> be better, though the code is a bit more complex. With linear
> branching I did a hard-coded set of for-loops so I wouldn't need
> to use recursion or need a dynamic data structure.
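
To make the CCC vs. CC(C)C point concrete, here is a minimal sketch in plain Python. It is not RDKit code: the adjacency-list molecules, the helper names, and the use of a feature set in place of hashed bits are all assumptions made for illustration. It enumerates linear paths only, then compares the two molecules with and without atom degree folded into each atom's label.

```python
def linear_paths(adj, max_bonds):
    """Yield every simple linear path (tuple of atom indices) with 1..max_bonds bonds."""
    def extend(path):
        if len(path) > 1:
            yield tuple(path)
        if len(path) - 1 == max_bonds:
            return
        for nbr in adj[path[-1]]:
            if nbr not in path:
                yield from extend(path + [nbr])

    for start in adj:
        yield from extend([start])


def features(symbols, adj, max_bonds=3, use_degree=False):
    """Collect path features; a real fingerprint would hash these into bits."""
    feats = set()
    for path in linear_paths(adj, max_bonds):
        if use_degree:
            labels = tuple((symbols[i], len(adj[i])) for i in path)
        else:
            labels = tuple(symbols[i] for i in path)
        # A path and its reverse are the same feature.
        feats.add(min(labels, labels[::-1]))
    return feats


def tanimoto(a, b):
    return len(a & b) / len(a | b)


# Hypothetical hand-built heavy-atom graphs for CCC and CC(C)C.
propane = (["C", "C", "C"], {0: [1], 1: [0, 2], 2: [1]})
isobutane = (["C"] * 4, {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]})

plain = tanimoto(features(*propane), features(*isobutane))
with_degree = tanimoto(features(*propane, use_degree=True),
                       features(*isobutane, use_degree=True))
print(plain, with_degree)
```

With element symbols alone, both molecules reduce to the same two features (C-C and C-C-C), so the similarity is 1.0. Adding the degree to each atom label separates them completely in this toy case, since the central carbon has degree 2 in one and 3 in the other.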


For substructure filtering, it might be worth taking a look at the
(newish) "layered fingerprints", also in Fingerprints.h. Those were
introduced with the idea of providing a more efficient (and
potentially better) fingerprint for this purpose. Things worked well
in some preliminary testing, but they need a bit more validation.

>> Of course, I doubt that generating
>> these fingerprints normally lies on the critical path.
>
> Though of course while I'm thinking about that for speed reasons,
> as you say, that's not on the critical path.

If you do find yourself cursing the speed of the fingerprint
generation, it might be worth taking a look at using the alternate RNG
that is applied in the layered fingerprint code (line 229 of
Fingerprints.cpp). Some profiling I did while implementing those fps
showed that I was spending a disproportionate (and unnecessary) amount
of time in the RNG seeding process. The adjusted params for the
layered fingerprint RNG seemed to solve that problem.
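
For anyone unfamiliar with why seeding can dominate, here is a rough sketch in plain Python. It is not the RDKit code, and it assumes the generator is re-seeded once per hashed path and then asked for a few bit positions, which is the situation where seeding cost would matter. The function names, the fingerprint size, the bits-per-hash value, and the small LCG are all illustrative choices, not the parameters used in Fingerprints.cpp.

```python
import random

FP_SIZE = 2048       # illustrative fingerprint length
BITS_PER_HASH = 2    # illustrative number of bits set per path


def bits_with_reseeded_rng(path_hashes, fp_size=FP_SIZE, bits_per_hash=BITS_PER_HASH):
    """Set bits by re-seeding a Mersenne Twister for every path hash."""
    rng = random.Random()
    bits = set()
    for h in path_hashes:
        # Each seed() call rebuilds the full 624-word generator state,
        # even though only a couple of numbers are drawn afterwards.
        rng.seed(h)
        for _ in range(bits_per_hash):
            bits.add(rng.randrange(fp_size))
    return bits


class Lcg:
    """Minimal 32-bit linear congruential generator; seeding is one assignment."""

    def __init__(self):
        self.state = 0

    def seed(self, value):
        self.state = value & 0xFFFFFFFF

    def next(self):
        self.state = (1664525 * self.state + 1013904223) & 0xFFFFFFFF
        return self.state


def bits_with_lcg(path_hashes, fp_size=FP_SIZE, bits_per_hash=BITS_PER_HASH):
    """Same scheme, but with a generator whose seeding is essentially free."""
    rng = Lcg()
    bits = set()
    for h in path_hashes:
        rng.seed(h)
        for _ in range(bits_per_hash):
            # Use the high bits; the low bits of a power-of-two LCG are weak.
            bits.add((rng.next() >> 16) % fp_size)
    return bits
```

Both functions are deterministic for a given list of path hashes, so either could serve as the hash-to-bits step; they differ only in how much work each re-seed costs.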

-greg
