Thanks all for the literature references (and for the ever-useful TL;DR). It now seems clear that 0.7 is too high a similarity threshold for ECFP4 (you convinced me).
Yes George, that was what I was trying to do - make statements like "this compound library is more diverse than this other one", and quantify that diversity with a set of numbers.

-
Jean-Paul Ebejer
Early Stage Researcher

On 26 May 2015 at 12:57, George Papadatos <[email protected]> wrote:

> Hi JP,
>
> Aha, so you're looking for a threshold that will exhibit the optimal
> balance between the false positives and false negatives in the
> *biological* *activity* space. This threshold varies depending on the
> fingerprint and the dataset of course.
> See here for some generalised insights:
>
> (1) Papadatos, G.; Cooper, A. W. J.; Kadirkamanathan, V.; Macdonald, S.
> J. F.; McLay, I. M.; Pickett, S. D.; Pritchard, J. M.; Willett, P.; Gillet,
> V. J. Analysis of Neighborhood Behavior in Lead Optimization and Array
> Design. *J. Chem. Inf. Model.* *2009*, *49*, 195–208.
>
> especially Figure 17, and
>
> (2) Muchmore, S. W.; Debe, D. A.; Metz, J. T.; Brown, S. P.; Martin, Y.
> C.; Hajduk, P. J. Application of Belief Theory to Similarity Data Fusion
> for Use in Analog Searching and Lead Hopping. *J. Chem. Inf. Model.*
> *2008*, *48*, 941–948.
>
> and also Greg's blog post:
>
> http://rdkit.blogspot.co.uk/2013/10/fingerprint-thresholds.html
>
> The TL;DR version is that for ECFP_4, this threshold should be around
> 0.45-0.55.
> Wrt methodology, are you trying to score/rank the
> intra-diversity/heterogeneity for different structure sets?
>
> Cheers,
>
> George
>
> On 26 May 2015 at 11:59, JP <[email protected]> wrote:
>
>> On 25 May 2015 at 22:23, Tim Dudgeon <[email protected]> wrote:
>>
>>> Maybe a clustering approach may work? Something like sphere exclusion
>>> clustering (with counting the number of clusters at 0.9 - 0.8 similarity)?
>>> With 30K structures it sounds computationally tractable?
>>
>> Thanks Tim for this idea. I hadn't heard of sphere exclusion. The
>> problem is we still need a distance / similarity function (which, using
>> ECFP with a high similarity threshold of 0.8-0.9, would result in very
>> few compounds being thrown out). I think the real issue here is selecting
>> a sensible similarity threshold which defines my idea of "similarity".
>> But that is a tricky number to get right - too high and you remove
>> nothing, too low and you start catching "different" molecules. I guess
>> the best thing is to try a few values (0.5, 0.6, 0.7, 0.8, 0.9) and have
>> a visual look at the remaining compounds.
>>
>> -
>> JP
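
For anyone following the thread, here is a minimal sketch of the threshold sweep discussed above: sphere exclusion run at 0.5 to 0.9, counting the clusters at each value as a rough diversity number. It is not taken from any of the posters' code. To stay self-contained it treats each fingerprint as a plain Python set of on-bit indices (an assumption; in practice these would come from an ECFP4-style fingerprinter), and the names `tanimoto`, `sphere_exclusion`, `cluster_counts` and `toy_library` are all hypothetical. The clustering is the simple order-dependent "leader" variant, which may differ from other sphere exclusion implementations.

```python
from typing import Dict, FrozenSet, List, Sequence

# Assumption: a fingerprint is modelled as the set of its on-bit indices.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity: shared on-bits divided by on-bits in either."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def sphere_exclusion(fps: Sequence[Fingerprint], threshold: float) -> List[int]:
    """Return the indices of the cluster centres.

    Compounds are visited in input order. A compound becomes a new centre
    only if its similarity to every existing centre is below `threshold`;
    otherwise it falls inside an existing sphere and is excluded.
    """
    centres: List[int] = []
    for i, fp in enumerate(fps):
        if all(tanimoto(fp, fps[c]) < threshold for c in centres):
            centres.append(i)
    return centres


def cluster_counts(
    fps: Sequence[Fingerprint],
    thresholds: Sequence[float] = (0.5, 0.6, 0.7, 0.8, 0.9),
) -> Dict[float, int]:
    """Number of sphere-exclusion clusters at each similarity threshold."""
    return {t: len(sphere_exclusion(fps, t)) for t in thresholds}


# Made-up toy data, not real fingerprints.
toy_library = [
    frozenset({1, 2, 3, 4, 5, 6, 7, 8}),
    frozenset({1, 2, 3, 4, 5, 6, 7, 9}),     # similarity 7/9 to the first
    frozenset({1, 2, 3, 4, 5, 10, 11, 12}),  # similarity 5/11 to the first
    frozenset({20, 21, 22}),                 # shares nothing with the others
]

print(cluster_counts(toy_library))
# {0.5: 3, 0.6: 3, 0.7: 3, 0.8: 4, 0.9: 4}
```

On this toy set nothing is excluded at 0.8 and above, which is the "too high and you remove nothing" effect. Comparing the count-versus-threshold profile of two libraries of similar size is one possible way to put numbers on "more diverse", though the result depends on input order and on the fingerprint used.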
------------------------------------------------------------------------------
_______________________________________________ Rdkit-discuss mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

