Hi Greg, just a curiosity ...
765534 vs 765224: is one a subset of the other? If not, would it make sense to test on both? Just a thought.

Apart from that, I think the setup is reasonable for most applications we will have ...

Nik

Greg Landrum <greg.land...@gmail.com>
10.02.2009 15:11
To RDKit Discuss <rdkit-discuss@lists.sourceforge.net>
cc
Subject [Rdkit-discuss] Optimizing SSS in the RDKit

Dear all,

Andrew's question about fingerprints hit me at the right time: I had just finished doing some optimization work on the RDKit substructure search machinery (removing the vflib dependency). The details are here:
http://code.google.com/p/rdkit/wiki/SubgraphIsomorphismOptimization

It would be quite interesting to use the new Ullmann code as a framework and do an implementation of the VF or VF2 algorithms used in vflib.

Of course there's no better way to optimize subgraph isomorphism than to avoid it altogether, which is where the fingerprints mentioned come in. I'm spending a couple of days home from work (with a cold), so I have some room to explore here a little bit.

I put together a sandbox using my 1000 PubChem molecules (they're from the HTS set, so they are all either drug-like or lead-like, whatever that means). To get a set of "molecule-like" queries, I fragmented those 1000 molecules using RECAP and kept the 823 unique fragments I got. I've been using those 823 molecules to query the full set of 1000 molecules and looking at how many calls to the isomorphism code I can avoid using either the RDKit (Daylight-like) fingerprints or the layered fingerprints (out to layer 0x4; beyond that these aren't suitable for SSS).

The results look pretty encouraging: I can easily filter out more than 90% of the comparisons via fingerprints without losing anything. There are 823000 (823x1000) possible comparisons with my dataset; using the RDKit fingerprints as a screen I filter out 765534 of them (93%); using the layered fingerprints I filter out 765224 (also 93%).
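For reference, the kind of screen being counted here can be sketched as follows. This is only a minimal illustration under stated assumptions, not the RDKit implementation: fingerprints are modeled as plain Python integers used as bit vectors, and the names passes_screen and count_filtered are hypothetical. The idea is that a query can only be a substructure of a target if every bit set in the query fingerprint is also set in the target fingerprint, so any pair failing that test never needs to reach the isomorphism code.

```python
def passes_screen(query_fp: int, target_fp: int) -> bool:
    """Return True if every bit set in query_fp is also set in target_fp.

    Fingerprints are plain ints here (an assumption for illustration);
    a pair that fails this test cannot be a substructure match.
    """
    return (query_fp & target_fp) == query_fp


def count_filtered(query_fps, target_fps) -> int:
    """Count the (query, target) pairs that the screen rules out."""
    return sum(
        1
        for q in query_fps
        for t in target_fps
        if not passes_screen(q, t)
    )


# Toy 4-bit fingerprints, made up purely for this example.
queries = [0b0011, 0b0100]
targets = [0b0111, 0b0001, 0b1100]

filtered = count_filtered(queries, targets)
total = len(queries) * len(targets)
print(f"filtered out {filtered} of {total} comparisons")
```

Pairs that pass the screen are only candidates; they still have to go through the full substructure match.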
The screening [not even remotely optimized: I'm calculating (A&B)==A instead of doing it on the fly and short-circuiting when something mismatches] takes about 10 seconds in each case.

By default each fingerprint uses 2048 bits. I can shrink this by folding the fingerprints (or generating them shorter in the first place... the end result is the same). That potentially gains speed and certainly saves storage space, but there may be a cost in how discriminating the fingerprints are.

Experiment 1: reduce fps to 1024 bits
  RDK fingerprints: filter out 717356 (87%)
  Layered: filter out 752948 (91%)
  No obvious speed improvement

Experiment 2: reduce fps to 512 bits
  RDK fingerprints: filter out 441529 (54%)
  Layered: filter out 710647 (86%)
  10-15% faster

The layered fps are clearly more robust w.r.t. fingerprint size (which makes sense: I only set one bit per path there, as opposed to 4 per path for the RDKit fp; a good experiment would be to try the RDKit fps with one bit per path). They're also faster to generate (they no longer require a PRNG).

I think the screening speed thing is a bit of a red herring at the moment since I'm not doing a smart screen, but there is a real impact on storage space.

So what does "the community" think? Interesting results? Arguments about my testing/benchmarking methodology? Obvious next steps? Suggestions for improving the layered fps so that they're even more discriminating?

-greg
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss