Greg:
For substructure filtering, it might be worth taking a look at the
(newish) "layered fingerprints", also in Fingerprints.h.

I'll take a look at that as well.

If you do find yourself cursing the speed of the fingerprint
generation, it might be worth taking a look at using the alternate RNG
that is applied in the layered fingerprint code (line 229 of
Fingerprints.cpp). Some profiling I did while implementing those fps
showed that I was spending a disproportionate (and unecessary) amount
of time in the RNG seeding process. The adjusted params for the
layered fingerprint RNG seemed to solve that problem.

That's another advantage of the linear paths: CDK for example
converts the path to a string (as in "C-C-O-C=N") and uses the
normal Java hashString to get the initial seed to the system
PRNG. But I suspect they could use a better hash algorithm and
skip the PRNG completely. Unlike RDKit, CDK only uses a single
number from the PRNG.

I saw that RDKit uses the Mersenne Twister as its PRNG, with
special parameters which I don't understand. The comment is

    // The standard parameters (used to create boost::mt19937)
    // result in an RNG that's much too computationally intensive
    // to seed.

which I'm a bit cautious of. From the Mersenne Twister Wikipedia page:
Unlike Blum Blum Shub, the algorithm in its native form is not suitable for cryptography. Observing a sufficient number of iterates (624 in the case of MT19937) allows one to predict all future iterates.

Another issue is that it can take a long time to turn a non-random initial state into output that passes randomness tests, due to its size. A small lagged Fibonacci generator or linear congruential generator gets started much quicker and usually is used to seed the Mersenne Twister. If only a few numbers are required and standards aren't high it is simpler to use the seed generator. But the Mersenne Twister will still work.



and I don't know if what you've done reduces the randomness. The
result if that were the case wouldn't be wrong, just less useful.

I've considered using a crytographic hash, like taking the same seed
as input to SHA-512. That would give up to 16 values (hash size 512 /
32 bits per value) for the fingerprint, give good random values, and
likely be faster. But that's a thought without anything to back it up.


                                Andrew
                                [email protected]



Reply via email to