JP,

A bit of self-advertisement, if I may: our Diversity Genie, which uses RDKit in the background by the way, was initially created to answer this exact question. www.diversitygenie.com - hope it comes in useful.
Igor

On Wed, May 27, 2015 at 4:05 AM, JP wrote:

> Thanks all for the lit. references (and for the ever useful TL;DR). It
> now seems clear that 0.7 is too high a value for ECFP4 (you convinced me).
>
> Yes George, that was what I was trying to do - make statements like "this
> compound library is more diverse than this other", and quantify that
> diversity with a set of numbers.
>
> -
> Jean-Paul Ebejer
> Early Stage Researcher
>
> On 26 May 2015 at 12:57, George Papadatos wrote:
>
>> Hi JP,
>>
>> Aha, so you're looking for a threshold that will exhibit the optimal
>> balance between the false positives and false negatives in the
>> *biological* *activity* space. This threshold varies depending on the
>> fingerprint and the dataset of course.
>> See here for some generalised insights:
>>
>> (1) Papadatos, G.; Cooper, A. W. J.; Kadirkamanathan, V.; Macdonald, S.
>> J. F.; McLay, I. M.; Pickett, S. D.; Pritchard, J. M.; Willett, P.; Gillet,
>> V. J. Analysis of Neighborhood Behavior in Lead Optimization and Array
>> Design. *J. Chem. Inf. Model.* *2009*, *49*, 195–208.
>>
>> especially Figure 17, and
>>
>> (2) Muchmore, S. W.; Debe, D. A.; Metz, J. T.; Brown, S. P.; Martin, Y.
>> C.; Hajduk, P. J. Application of Belief Theory to Similarity Data Fusion
>> for Use in Analog Searching and Lead Hopping. *J. Chem. Inf. Model.*
>> *2008*, *48*, 941–948.
>>
>> and also Greg's blog post:
>>
>> http://rdkit.blogspot.co.uk/2013/10/fingerprint-thresholds.html
>>
>> The TL;DR version is that for ECFP_4, this threshold should be around
>> 0.45-0.55.
>> Wrt methodology, are you trying to score/rank the
>> intra-diversity/heterogeneity for different structure sets?
>>
>> Cheers,
>>
>> George
>>
>> On 26 May 2015 at 11:59, JP wrote:
>>
>>> On 25 May 2015 at 22:23, Tim Dudgeon wrote:
>>>
>>>> Maybe a clustering approach may work? Something like sphere exclusion
>>>> clustering (counting the number of clusters at 0.9 - 0.8 similarity)?
>>>> With 30K structures it sounds computationally tractable?
>>>
>>> Thanks Tim for this idea. I hadn't heard of sphere exclusion. The
>>> problem is we still need a distance / similarity function (which using ECFP
>>> with high similarity 0.8-0.9 would result in very few compounds being
>>> thrown out). I think the real issue here is selecting a sensible
>>> similarity threshold which defines my idea of "similarity". But that is a
>>> tricky number to get right - too high and you remove nothing, too low and
>>> you start catching "different" molecules. I guess the best thing is try a
>>> few values (0.5, 0.6, 0.7, 0.8, 0.9) and have a visual look at the
>>> remaining compounds.
>>>
>>> -
>>> JP
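
For what it's worth, the sphere exclusion idea from Tim, combined with JP's plan to try a few thresholds, could be sketched roughly as below. This is only an illustration under stated assumptions, not anyone's actual implementation: the fingerprints are made-up sets of on-bit indices (with RDKit one would typically use Morgan fingerprints of radius 2 as the ECFP4-like equivalent), and the names `tanimoto`, `sphere_exclusion`, and `cluster_counts` are hypothetical helpers. The greedy pass takes compounds in input order, so the cluster count can depend on that order.

```python
from typing import Dict, FrozenSet, List, Sequence

# Assumption: a fingerprint is represented as the set of its on-bit indices.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity: |A & B| / |A | B|."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def sphere_exclusion(fps: Sequence[Fingerprint], threshold: float) -> List[int]:
    """Greedy sphere exclusion: return the indices of the cluster centres.

    A compound becomes a new centre only if its similarity to every
    existing centre is below `threshold`; otherwise it falls inside an
    existing sphere and is excluded.
    """
    centres: List[int] = []
    for i, fp in enumerate(fps):
        if all(tanimoto(fp, fps[c]) < threshold for c in centres):
            centres.append(i)
    return centres


def cluster_counts(
    fps: Sequence[Fingerprint], thresholds: Sequence[float]
) -> Dict[float, int]:
    """Number of sphere-exclusion clusters at each similarity threshold."""
    return {t: len(sphere_exclusion(fps, t)) for t in thresholds}


# Toy, made-up fingerprints (not real molecules).
toy_fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),      # similarity 0.6 to the first
    frozenset({1, 2, 3, 4, 6}),   # similarity 0.8 to the first
    frozenset({10, 11, 12}),      # shares no bits with the others
]

if __name__ == "__main__":
    for t, n in cluster_counts(toy_fps, [0.5, 0.7, 0.9]).items():
        print(f"threshold {t}: {n} clusters")
```

On the toy set the count rises from 2 clusters at 0.5 to 4 at 0.9, which is the effect JP describes: at a high threshold almost nothing is merged, so every compound ends up as its own cluster. Comparing two libraries would then mean comparing these counts (normalised by library size) at the same threshold.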
------------------------------------------------------------------------------
_______________________________________________ Rdkit-discuss mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

