Dear all,

Andrew's question about fingerprints hit me at the right time: I had just finished doing some optimization work on the RDKit substructure-search machinery (removing the vflib dependency). The details are here: http://code.google.com/p/rdkit/wiki/SubgraphIsomorphismOptimization
It would be quite interesting to use the new Ullmann code as a framework and do an implementation of the VF or VF2 algorithms used in vflib. Of course, there's no better way to optimize subgraph isomorphism than to avoid it altogether, which is where the fingerprints Andrew mentioned come in.

I'm spending a couple of days home from work (with a cold), so I have some room to explore this a little.

I put together a sandbox using my 1000 PubChem molecules (they're from the HTS set, so they're all either drug-like or lead-like, whatever that means). To get a set of "molecule-like" queries, I fragmented those 1000 molecules using RECAP and kept the 823 unique fragments that produced. I've been using those 823 fragments to query the full set of 1000 molecules and counting how many calls to the isomorphism code I can avoid using either the RDKit (Daylight-like) fingerprints or the layered fingerprints (out to layer 0x4; beyond that they aren't suitable for SSS).

The results look pretty encouraging: I can easily filter out more than 90% of the comparisons via fingerprints without losing anything. There are 823000 (823x1000) possible comparisons with my dataset:
- RDKit fingerprints as a screen: 765534 comparisons filtered out (93%)
- layered fingerprints: 765224 filtered out (also 93%)

The screening takes about 10 seconds in each case. It's not even remotely optimized: I'm calculating (A&B)==A instead of doing the test on the fly and short-circuiting as soon as a bit mismatches.

By default each fingerprint uses 2048 bits. I can shrink this by folding the fingerprints (or generating them shorter in the first place; the end result is the same). That potentially gains speed and certainly saves storage space, but there may be a cost in how discriminating the fingerprints are.

Experiment 1: reduce the fps to 1024 bits
- RDKit fingerprints: filter out 717356 (87%)
- layered fingerprints: filter out 752948 (91%)
- no obvious speed improvement

Experiment 2: reduce the fps to 512 bits
- RDKit fingerprints: filter out 441529 (54%)
- layered fingerprints: filter out 710647 (86%)
- 10-15% faster

The layered fps are clearly more robust w.r.t. fingerprint size, which makes sense: I set only one bit per path there, as opposed to four per path for the RDKit fp (a good experiment would be to try the RDKit fps with one bit per path). They're also faster to generate, since they no longer require a PRNG.

I think the screening speed is a bit of a red herring at the moment, since I'm not doing a smart screen, but there is a real impact on storage space.

So what does "the community" think? Interesting results? Arguments about my testing/benchmarking methodology? Obvious next steps? Suggestions for improving the layered fps so that they're even more discriminating?

For anyone who wants to play along, I've appended some rough Python sketches of the fragmentation, screening, and folding steps below my sig.

-greg
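
P.S. The promised sketches. These are approximations of what I ran, not the actual scripts; the input file name is a placeholder. First, building the query set: RECAP's GetLeaves() returns a dictionary keyed by fragment SMILES, so the dict keys give us unique fragments for free:

    from rdkit import Chem
    from rdkit.Chem import Recap

    # placeholder file standing in for the 1000 PubChem HTS molecules
    mols = [m for m in Chem.SDMolSupplier('pubchem_hts.sdf') if m is not None]

    # RECAP-fragment everything and keep the unique leaf fragments;
    # GetLeaves() returns {fragment SMILES: node}, so the dict keys
    # de-duplicate the fragments for us
    uniqueFrags = {}
    for mol in mols:
        tree = Recap.RecapDecompose(mol)
        for smi, node in tree.GetLeaves().items():
            uniqueFrags[smi] = node.mol

    queries = list(uniqueFrags.values())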
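The screen itself is just a subset test: every bit set in the query's fingerprint must also be set in the molecule's fingerprint, otherwise the molecule cannot contain the query. What I timed is the dumb version, (A&B)==A; something like DataStructs.AllProbeBitsMatch() would be the short-circuiting variant. A sketch (layerFlags=0x7 is my shorthand for "layers out to 0x4", i.e. 0x1|0x2|0x4):

    from rdkit import Chem

    # fingerprints for the 1000 molecules and 823 queries;
    # swap in Chem.LayeredFingerprint(m, layerFlags=0x7, fpSize=2048)
    # to test the layered variant instead
    molFPs = [Chem.RDKFingerprint(m, fpSize=2048) for m in mols]
    queryFPs = [Chem.RDKFingerprint(q, fpSize=2048) for q in queries]

    nFiltered = nSearched = 0
    for qFP, q in zip(queryFPs, queries):
        for mFP, m in zip(molFPs, mols):
            # the not-even-remotely-optimized screen: (A&B)==A
            if (qFP & mFP) != qFP:
                nFiltered += 1   # screen says: no substructure match possible
            else:
                nSearched += 1
                m.HasSubstructMatch(q)   # only now pay for the isomorphism

    print(nFiltered, 'of', nFiltered + nSearched, 'comparisons avoided')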
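Folding for the size experiments is a single call per fingerprint. Folding just ORs the two halves of the bit vector together, so the subset property survives it: if (A&B)==A held at 2048 bits it still holds after folding (it's the converse that fails, which is where the lost discrimination comes from). Again a sketch:

    from rdkit import DataStructs

    # factor 2: 2048 -> 1024 bits; factor 4 would give 512 bits
    molFPs1024 = [DataStructs.FoldFingerprint(fp, 2) for fp in molFPs]
    queryFPs1024 = [DataStructs.FoldFingerprint(fp, 2) for fp in queryFPs]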

