[Rdkit-discuss] request for feedback/testing

Greg Landrum Thu, 08 Dec 2011 19:22:34 -0800

Dear all,

I've been unhappy with the performance of the RDKit substructure
fingerprinter (used in the database cartridge and useable elsewhere as
well to speed up substructure searching) for a while, but haven't
managed to do anything about it. This week, inspired by some
discussions at a KNIME meeting, I decided to try some experiments.
This probably was a goofy decision since there are a fairly large
number of other things I ought to be doing, but sometimes the itch
just needs to be scratched. :-)


I've got something in place now that seems pretty useful in my hands
and I'd like to get some feedback from others before putting too much
more time into it. I'll post details about how the new method works
later, but for now here's an explanation of how I tested and why I
think it's effective.

I used a number of data sets to do the testing, but here's the basic
idea: I take M patterns and search a pool of N molecules with them to
find substructure matches. This means, theoretically, that I would
have to do MxN substructure searches. I reduce this using the
substructure fingerprints: if the fingerprint for pattern molecule i
contains bits that are not set in pool molecule j, then I don't need
to do the substructure search for that pair of molecules. A perfectly
effective fingerprint would give 100% accuracy: every pair that passes
the fingerprint test would actually contain a substructure match. Of
course perfection is too much to hope for, but the goal is to get the
accuracy as high as possible.
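
As a rough illustration of the screening test described above, here is a
minimal Python sketch. It is not the RDKit implementation: strings stand in
for molecules, substring containment stands in for a substructure match, and
toy_fingerprint, passes_screen and screening_accuracy are hypothetical names
made up for this example. The only part taken from the description is the
logic: skip a pair when the pattern fingerprint has a bit that the pool
fingerprint lacks, then count how many of the surviving pairs really match.

```python
import zlib

N_BITS = 64  # toy fingerprint length (assumption, chosen only for illustration)


def toy_fingerprint(s, max_len=3):
    """Set one bit per substring of length 1..max_len.

    Toy stand-in for a substructure fingerprint: if q is a substring of m,
    every bit set for q is also set for m, so the screen has no false negatives.
    """
    fp = 0
    for n in range(1, max_len + 1):
        for i in range(len(s) - n + 1):
            # crc32 is used because it is deterministic across runs
            fp |= 1 << (zlib.crc32(s[i:i + n].encode()) % N_BITS)
    return fp


def passes_screen(query_fp, mol_fp):
    """True if every bit set in the query is also set in the pool molecule."""
    return (query_fp & mol_fp) == query_fp


def screening_accuracy(queries, pool, is_match):
    """Return (hits, passed): true matches and pairs that pass the screen."""
    pool_fps = [(m, toy_fingerprint(m)) for m in pool]
    hits = passed = 0
    for q in queries:
        q_fp = toy_fingerprint(q)
        for m, m_fp in pool_fps:
            if not passes_screen(q_fp, m_fp):
                continue  # screened out: no full search needed for this pair
            passed += 1
            if is_match(q, m):  # the expensive test, run only on survivors
                hits += 1
    return hits, passed


queries = ["CCO", "c1ccccc1"]
pool = ["CCOC", "CCN", "c1ccccc1O"]
hits, passed = screening_accuracy(queries, pool, lambda q, m: q in m)
print(f"{hits} matches in {passed} screened pairs "
      f"(out of {len(queries) * len(pool)} possible)")
```

The accuracy in the sense used here is hits / passed. With real molecules the
fingerprint and the matcher would come from the toolkit (for example
HasSubstructMatch for the match test) in place of the string stand-ins.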

For my test pool, I used a set of diverse drug-like molecules from ZINC.
The queries are described on page 15 of this presentation
(http://www.hinxton.wellcome.ac.uk/advancedcourses/MIOSS%20Greg%20Landrum.pdf):
  - 500 lead-like molecules from ZINC
  - 500 fragment-like molecules from ZINC
  - 823 pieces of molecules created from a BRICS fragmentation of
    PubChem screening molecules.

There's more tuning to do, but the substructure screening accuracy of
the new fingerprint is pretty good:
  - ZINC lead-like queries: 92% (previously 11%)
  - ZINC fragment-like queries: 99% (previously 20%)
  - pieces of molecules: 89% (previously 35%)
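
One way to read these numbers: since accuracy is the fraction of pairs passing
the fingerprint test that are real matches, its reciprocal is the average
number of full substructure searches run per real match. The short sketch
below just does that arithmetic on the percentages quoted above;
searches_per_hit is a hypothetical helper name, not part of the RDKit.

```python
def searches_per_hit(accuracy):
    """Average number of full substructure searches per true match."""
    return 1.0 / accuracy


# (query set, previous accuracy, new accuracy), taken from the list above
results = [
    ("ZINC lead-like", 0.11, 0.92),
    ("ZINC fragment-like", 0.20, 0.99),
    ("pieces of molecules", 0.35, 0.89),
]

for name, old, new in results:
    print(f"{name}: {searches_per_hit(old):.2f} -> "
          f"{searches_per_hit(new):.2f} searches per match")
```

For the lead-like queries, for instance, this is a drop from about nine
searches per real match to just over one.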

I'm pretty happy with this. :-)

The cartridge has not yet been updated to use the new fingerprint
since I haven't extended the method to support query features yet, but
this will come relatively soon.

I'd love to get feedback from others about how the fingerprint works
for them, so I've checked the code in. The new method is currently
named Chem.LayeredFingerprint2, but this is definitely temporary.
There's some sample code for testing screenout accuracy here:
http://pastebin.com/9v93WHTr


Best Regards,
-greg

