Dear all,

Andrew's question about fingerprints hit me at the right time: I had just finished doing some optimization work on the RDKit substructure-search machinery (removing the vflib dependency). The details are here: http://code.google.com/p/rdkit/wiki/SubgraphIsomorphismOptimization
It would be quite interesting to use the new Ullmann code as a framework and do an implementation of the VF or VF2 algorithms used in vflib. Of course, there's no better way to optimize subgraph isomorphism than to avoid it altogether, which is where the fingerprints Andrew mentioned come in.

I'm spending a couple of days home from work (with a cold), so I have some room to explore this a little.

I put together a sandbox using my 1000 PubChem molecules (they're from the HTS set, so they're all either drug-like or lead-like, whatever that means). To get a set of "molecule-like" queries, I fragmented those 1000 molecules using RECAP and kept the 823 unique fragments that produced. I've been using those 823 fragments to query the full set of 1000 molecules and counting how many calls to the isomorphism code I can avoid using either the RDKit (Daylight-like) fingerprints or the layered fingerprints (out to layer 0x4; beyond that they aren't suitable for SSS).

The results look pretty encouraging: I can easily filter out more than 90% of the comparisons via fingerprints without losing anything. There are 823000 (823x1000) possible comparisons with my dataset:
- RDKit fingerprints as a screen: 765534 comparisons filtered out (93%)
- layered fingerprints: 765224 filtered out (also 93%)

The screening takes about 10 seconds in each case. It's not even remotely optimized: I'm calculating (A&B)==A instead of doing the test on the fly and short-circuiting as soon as a bit mismatches.

By default each fingerprint uses 2048 bits. I can shrink this by folding the fingerprints (or generating them shorter in the first place; the end result is the same). That potentially gains speed and certainly saves storage space, but there may be a cost in how discriminating the fingerprints are.

Experiment 1: reduce the fps to 1024 bits
- RDKit fingerprints: filter out 717356 (87%)
- layered fingerprints: filter out 752948 (91%)
- no obvious speed improvement

Experiment 2: reduce the fps to 512 bits
- RDKit fingerprints: filter out 441529 (54%)
- layered fingerprints: filter out 710647 (86%)
- 10-15% faster

The layered fps are clearly more robust w.r.t. fingerprint size, which makes sense: I set only one bit per path there, as opposed to four per path for the RDKit fp (a good experiment would be to try the RDKit fps with one bit per path). They're also faster to generate, since they no longer require a PRNG.

I think the screening speed is a bit of a red herring at the moment, since I'm not doing a smart screen, but there is a real impact on storage space.

So what does "the community" think? Interesting results? Arguments about my testing/benchmarking methodology? Obvious next steps? Suggestions for improving the layered fps so that they're even more discriminating?

For anyone who wants to play along, I've appended some rough Python sketches of the fragmentation, screening, and folding steps below my sig.

-greg
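
P.S. The promised sketches. These are approximations of what I ran, not the actual scripts; the input file name is a placeholder. First, building the query set: RECAP's GetLeaves() returns a dictionary keyed by fragment SMILES, so the dict keys give us unique fragments for free:

    from rdkit import Chem
    from rdkit.Chem import Recap

    # placeholder file standing in for the 1000 PubChem HTS molecules
    mols = [m for m in Chem.SDMolSupplier('pubchem_hts.sdf') if m is not None]

    # RECAP-fragment everything and keep the unique leaf fragments;
    # GetLeaves() returns {fragment SMILES: node}, so the dict keys
    # de-duplicate the fragments for us
    uniqueFrags = {}
    for mol in mols:
        tree = Recap.RecapDecompose(mol)
        for smi, node in tree.GetLeaves().items():
            uniqueFrags[smi] = node.mol

    queries = list(uniqueFrags.values())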
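The screen itself is just a subset test: every bit set in the query's fingerprint must also be set in the molecule's fingerprint, otherwise the molecule cannot contain the query. What I timed is the dumb version, (A&B)==A; something like DataStructs.AllProbeBitsMatch() would be the short-circuiting variant. A sketch (layerFlags=0x7 is my shorthand for "layers out to 0x4", i.e. 0x1|0x2|0x4):

    from rdkit import Chem

    # fingerprints for the 1000 molecules and 823 queries;
    # swap in Chem.LayeredFingerprint(m, layerFlags=0x7, fpSize=2048)
    # to test the layered variant instead
    molFPs = [Chem.RDKFingerprint(m, fpSize=2048) for m in mols]
    queryFPs = [Chem.RDKFingerprint(q, fpSize=2048) for q in queries]

    nFiltered = nSearched = 0
    for qFP, q in zip(queryFPs, queries):
        for mFP, m in zip(molFPs, mols):
            # the not-even-remotely-optimized screen: (A&B)==A
            if (qFP & mFP) != qFP:
                nFiltered += 1   # screen says: no substructure match possible
            else:
                nSearched += 1
                m.HasSubstructMatch(q)   # only now pay for the isomorphism

    print(nFiltered, 'of', nFiltered + nSearched, 'comparisons avoided')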
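Folding for the size experiments is a single call per fingerprint. Folding just ORs the two halves of the bit vector together, so the subset property survives it: if (A&B)==A held at 2048 bits it still holds after folding (it's the converse that fails, which is where the lost discrimination comes from). Again a sketch:

    from rdkit import DataStructs

    # factor 2: 2048 -> 1024 bits; factor 4 would give 512 bits
    molFPs1024 = [DataStructs.FoldFingerprint(fp, 2) for fp in molFPs]
    queryFPs1024 = [DataStructs.FoldFingerprint(fp, 2) for fp in queryFPs]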

