Re: [Rdkit-discuss] question about fingerprint generation

Andrew Dalke Mon, 09 Feb 2009 10:57:46 +0000

Greg:

For substructure filtering, it might be worth taking a look at the
(newish) "layered fingerprints", also in Fingerprints.h.


I'll take a look at that as well.

If you do find yourself cursing the speed of the fingerprint
generation, it might be worth taking a look at using the alternate RNG
that is applied in the layered fingerprint code (line 229 of
Fingerprints.cpp). Some profiling I did while implementing those fps
showed that I was spending a disproportionate (and unnecessary) amount
of time in the RNG seeding process. The adjusted params for the
layered fingerprint RNG seemed to solve that problem.


That's another advantage of the linear paths: CDK for example
converts the path to a string (as in "C-C-O-C=N") and uses the
normal Java hashString to get the initial seed to the system
PRNG. But I suspect they could use a better hash algorithm and
skip the PRNG completely. Unlike RDKit, CDK only uses a single
number from the PRNG.
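
A minimal sketch of that string-hash-then-PRNG scheme, written in
Python rather than Java. The function names, the 1024-bit fingerprint
size, and the use of Python's random.Random are assumptions for
illustration only. The hash follows the documented String.hashCode
recurrence (h = 31*h + c), which may not be exactly what CDK calls.

```python
import random

FP_SIZE = 1024  # assumed fingerprint length in bits (hypothetical)

def java_style_string_hash(s: str) -> int:
    """32-bit hash using the String.hashCode recurrence h = 31*h + c."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep 32 bits, unsigned
    return h

def bit_for_path(path: str, fp_size: int = FP_SIZE) -> int:
    """Seed a PRNG with the path's hash and use only its first output."""
    rng = random.Random(java_style_string_hash(path))
    return rng.randrange(fp_size)

def fingerprint(paths, fp_size: int = FP_SIZE) -> set:
    """Set of bit positions turned on by a collection of linear paths."""
    return {bit_for_path(p, fp_size) for p in paths}
```

Since only the first PRNG output is used, a well-mixed hash reduced
modulo the fingerprint size would play the same role, which is the
"skip the PRNG" idea.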

I saw that RDKit uses the Mersenne Twister as its PRNG, with
special parameters which I don't understand. The comment is

    // The standard parameters (used to create boost::mt19937)
    // result in an RNG that's much too computationally intensive
    // to seed.

which I'm a bit cautious of. From the Mersenne Twister Wikipedia page:

    Unlike Blum Blum Shub, the algorithm in its native form is not
    suitable for cryptography. Observing a sufficient number of
    iterates (624 in the case of MT19937) allows one to predict all
    future iterates.

    Another issue is that it can take a long time to turn a non-random
    initial state into output that passes randomness tests, due to its
    size. A small lagged Fibonacci generator or linear congruential
    generator gets started much quicker and usually is used to seed the
    Mersenne Twister. If only a few numbers are required and standards
    aren't high it is simpler to use the seed generator. But the
    Mersenne Twister will still work.


and I don't know if what you've done reduces the randomness. If it
did, the resulting fingerprints wouldn't be wrong, just less useful.
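
For context on why seeding cost matters when the generator is
re-seeded for every path: the reference MT19937 initialization fills
all 624 state words from the single seed. Below is a sketch of that
recurrence. The function name is made up for illustration, and RDKit's
adjusted parameters are not reproduced here since they are not given
in this thread.

```python
N = 624  # number of 32-bit words in the MT19937 state

def mt19937_init_state(seed: int) -> list:
    """Expand one 32-bit seed into the full 624-word state.

    Uses the recurrence from the reference MT19937 seeding routine:
    s[i] = 1812433253 * (s[i-1] ^ (s[i-1] >> 30)) + i  (mod 2**32)
    """
    state = [seed & 0xFFFFFFFF]
    for i in range(1, N):
        prev = state[-1]
        state.append((1812433253 * (prev ^ (prev >> 30)) + i) & 0xFFFFFFFF)
    return state
```

Every re-seed repeats this loop over the whole state, so the work per
seed is fixed even if only one or two numbers are then drawn.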

I've considered using a cryptographic hash, like taking the same seed
as input to SHA-512. That would give up to 16 values (512-bit hash /
32 bits per value) for the fingerprint, give good random values, and
likely be faster. But that's a thought without anything to back it up.
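
A rough sketch of the SHA-512 idea, assuming the seed is the path
string. The function names, the 2048-bit fingerprint size, and the
little-endian byte order are arbitrary choices for illustration, and
nothing here has been timed against the PRNG approach.

```python
import hashlib
import struct

def sha512_values(seed_text: str) -> tuple:
    """Split the 64-byte SHA-512 digest into sixteen unsigned 32-bit ints."""
    digest = hashlib.sha512(seed_text.encode("utf-8")).digest()
    return struct.unpack("<16I", digest)  # byte order is an arbitrary choice

def bits_for_path(path: str, n_bits: int, fp_size: int = 2048) -> list:
    """Map one path to up to 16 bit positions (n_bits, fp_size are assumed)."""
    if not 1 <= n_bits <= 16:
        raise ValueError("one SHA-512 digest yields at most 16 values")
    return [v % fp_size for v in sha512_values(path)[:n_bits]]
```

For example, bits_for_path("C-C-O-C=N", 4) would give four bit
positions for that path from a single hash call.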


                                Andrew
                                [email protected]
