Dear all,

I've been unhappy with the performance of the RDKit substructure fingerprinter (used in the database cartridge and usable elsewhere as well to speed up substructure searching) for a while, but haven't managed to do anything about it. This week, inspired by some discussions at a KNIME meeting, I decided to try some experiments. This was probably a goofy decision, since there are a fairly large number of other things I ought to be doing, but sometimes the itch just needs to be scratched. :-)
I've got something in place now that seems pretty useful in my hands, and I'd like to get some feedback from others before putting too much more time into it. I'll post details about how the new method works later, but for now here's an explanation of how I tested and why I think it's effective.

I used a number of data sets to do the testing, but here's the basic idea: I take M patterns and search a pool of N molecules with them to find substructure matches. This means, theoretically, that I would have to do MxN substructure searches. I reduce this using the substructure fingerprints: if the fingerprint for pattern molecule i contains bits that are not set in the fingerprint for pool molecule j, then I don't need to do the substructure search for that pair of molecules. A perfectly effective fingerprint would give 100% accuracy: every pair that passes the fingerprint test would actually contain a substructure match. Of course perfection is too much to hope for, but the goal is to get the accuracy as high as possible.

For my test pool, I used a set of diverse drug-like molecules from ZINC. The queries are described on page 15 of this presentation (http://www.hinxton.wellcome.ac.uk/advancedcourses/MIOSS%20Greg%20Landrum.pdf):
- 500 lead-like molecules from ZINC
- 500 fragment-like molecules from ZINC
- 823 pieces of molecules created from a BRICS fragmentation of PubChem screening molecules

There's more tuning to do, but the substructure screening accuracy of the new fingerprint is pretty good:
- ZINC lead-like queries: 92% (previously 11%)
- ZINC fragment-like queries: 99% (previously 20%)
- pieces of molecules: 89% (previously 35%)

I'm pretty happy with this. :-)

The cartridge has not yet been updated to use the new fingerprint since I haven't extended the method to support query features yet, but this will come relatively soon. I'd love to get feedback from others about how the fingerprint works for them, so I've checked the code in.
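To make the accuracy measure concrete, here is a minimal sketch of the screening test described above. It is not the RDKit implementation and not the pastebin script: fingerprints are represented as plain Python ints used as bit sets, and the names passes_screen, screenout_accuracy, and toy_fp are made up for illustration. The toy data uses strings as stand-in "molecules" and substring containment as the stand-in "substructure search"; with real molecules you would supply real fingerprints and a real matcher instead.

```python
def passes_screen(query_fp: int, mol_fp: int) -> bool:
    """True if every bit set in the query fingerprint is also set in the molecule's."""
    return query_fp & ~mol_fp == 0


def screenout_accuracy(queries, pool, is_substructure):
    """Measure how often a fingerprint pass is a real match.

    queries, pool: lists of (fingerprint, item) pairs (hypothetical layout).
    is_substructure(item, query): the expensive exact test, supplied by the caller.
    Returns (accuracy, n_candidates, n_matches).
    """
    n_candidates = 0  # pairs that survive the fingerprint screen
    n_matches = 0     # surviving pairs that are true substructure matches
    for q_fp, q in queries:
        for m_fp, m in pool:
            if not passes_screen(q_fp, m_fp):
                continue  # screened out: no exact search needed for this pair
            n_candidates += 1
            if is_substructure(m, q):
                n_matches += 1
    accuracy = n_matches / n_candidates if n_candidates else 1.0
    return accuracy, n_candidates, n_matches


def toy_fp(s: str) -> int:
    """Toy fingerprint (assumption for illustration): one bit per character present."""
    fp = 0
    for ch in s:
        fp |= 1 << (ord(ch) % 64)
    return fp


# Toy stand-ins: strings instead of molecules, substring test instead of
# a real substructure search.
toy_queries = [(toy_fp(s), s) for s in ["CO", "CN"]]
toy_pool = [(toy_fp(s), s) for s in ["CCO", "OCC", "CCN", "CCC"]]
accuracy, n_candidates, n_matches = screenout_accuracy(
    toy_queries, toy_pool, lambda mol, query: query in mol
)
print(f"{n_matches}/{n_candidates} screened pairs are true matches ({accuracy:.0%})")
```

In the toy run, "OCC" passes the screen for "CO" (it has both a C bit and an O bit) but is not a real match, which is exactly the kind of false positive that pulls the accuracy below 100%.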
The new method is currently named Chem.LayeredFingerprint2, but this is definitely temporary. There's some sample code for testing screenout accuracy here: http://pastebin.com/9v93WHTr

Best Regards,
-greg

_______________________________________________
Rdkit-discuss mailing list
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

