Thanks all for the literature references (and for the ever-useful TL;DR).  It now
seems clear that 0.7 is too high a similarity threshold for ECFP4 (you convinced me).

Yes George, that was what I was trying to do: make statements like "this
compound library is more diverse than that one", and quantify that
diversity with a set of numbers.
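
To make the "set of numbers" idea concrete, here is a rough sketch of what I
have in mind, combining Tim's sphere exclusion suggestion with the 0.45-0.55
range George quotes.  It is only an illustration under stated assumptions:
fingerprints are represented as plain Python sets of "on" bit indices (in
practice they would come from an ECFP4-like fingerprint), and the names
tanimoto, sphere_exclusion_count and diversity_profile are made up for this
example, not RDKit functions.

```python
from typing import Dict, FrozenSet, Sequence

# Assumption: a fingerprint is modelled as the set of its "on" bit indices.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity: shared bits / bits set in either fingerprint."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def sphere_exclusion_count(fps: Sequence[Fingerprint], threshold: float) -> int:
    """Count sphere exclusion clusters at a given similarity threshold.

    Compounds are visited in input order.  A compound becomes a new cluster
    centre only if its similarity to every existing centre is below the
    threshold; otherwise it falls inside an existing sphere and is excluded.
    """
    centres = []
    for fp in fps:
        if all(tanimoto(fp, c) < threshold for c in centres):
            centres.append(fp)
    return len(centres)


def diversity_profile(
    fps: Sequence[Fingerprint],
    thresholds: Sequence[float] = (0.45, 0.5, 0.55),
) -> Dict[float, float]:
    """Fraction of compounds that end up as cluster centres, per threshold."""
    if not fps:
        raise ValueError("empty library")
    return {t: sphere_exclusion_count(fps, t) / len(fps) for t in thresholds}


if __name__ == "__main__":
    # Toy data: three close analogues versus three unrelated compounds.
    analogues = [frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}), frozenset({1, 2, 3, 6})]
    unrelated = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]
    print(diversity_profile(analogues))
    print(diversity_profile(unrelated))
```

A library whose profile stays close to 1.0 across the thresholds would be
the more diverse one by this measure.  Note that the cluster count depends
on the order in which compounds are visited, so the same ordering rule (or a
few shuffles) should be used when comparing libraries.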

-
Jean-Paul Ebejer
Early Stage Researcher

On 26 May 2015 at 12:57, George Papadatos <[email protected]> wrote:

> Hi JP,
>
> Aha, so you're looking for a threshold that will exhibit the optimal
> balance between the false positives and false negatives in the
> *biological* *activity* space. This threshold varies depending on the
> fingerprint and the dataset of course.
> See here for some generalised insights:
>
> (1) Papadatos, G.; Cooper, A. W. J.; Kadirkamanathan, V.; Macdonald, S.
> J. F.; McLay, I. M.; Pickett, S. D.; Pritchard, J. M.; Willett, P.; Gillet,
> V. J. Analysis of Neighborhood Behavior in Lead Optimization and Array
> Design. *J. Chem. Inf. Model.* *2009*, *49*, 195–208.
>
> especially Figure 17, and
>
> (2) Muchmore, S. W.; Debe, D. A.; Metz, J. T.; Brown, S. P.; Martin, Y.
> C.; Hajduk, P. J. Application of Belief Theory to Similarity Data Fusion
> for Use in Analog Searching and Lead Hopping. *J. Chem. Inf. Model.*
> *2008*, *48*, 941–948.
>
> and also Greg's blog post:
>
> http://rdkit.blogspot.co.uk/2013/10/fingerprint-thresholds.html
>
>
> The TL;DR version is that for ECFP_4, this threshold should be around
> 0.45-0.55.
> With respect to methodology, are you trying to score/rank the
> internal diversity/heterogeneity of different structure sets?
>
>
> Cheers,
>
> George
>
>
>
> On 26 May 2015 at 11:59, JP <[email protected]> wrote:
>
>>
>> On 25 May 2015 at 22:23, Tim Dudgeon <[email protected]> wrote:
>>
>>> Maybe a clustering approach would work? Something like sphere exclusion
>>> clustering (counting the number of clusters at 0.8-0.9 similarity)?
>>> With 30K structures it sounds computationally tractable.
>>
>>
>> Thanks Tim for this idea.  I hadn't heard of sphere exclusion.  The
>> problem is that we still need a distance / similarity function, and using
>> ECFP with a high similarity threshold (0.8-0.9) would result in very few
>> compounds being thrown out.  I think the real issue here is selecting a
>> sensible similarity threshold which defines my idea of "similarity".  But
>> that is a tricky number to get right: too high and you remove nothing, too
>> low and you start catching "different" molecules.  I guess the best thing
>> is to try a few values (0.5, 0.6, 0.7, 0.8, 0.9) and have a visual look at
>> the remaining compounds.
>>
>> -
>> JP
>>
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>>
>