Hi JP,

Aha, so you're looking for a threshold that gives the best balance
between false positives and false negatives in the *biological activity*
space. This threshold varies with the fingerprint and the dataset, of
course.
See here for some generalised insights:

(1) Papadatos, G.; Cooper, A. W. J.; Kadirkamanathan, V.; Macdonald, S. J.
F.; McLay, I. M.; Pickett, S. D.; Pritchard, J. M.; Willett, P.; Gillet, V.
J. Analysis of Neighborhood Behavior in Lead Optimization and Array Design. *J.
Chem. Inf. Model.* *2009*, *49*, 195–208.

especially Figure 17, and

(2) Muchmore, S. W.; Debe, D. A.; Metz, J. T.; Brown, S. P.; Martin, Y. C.;
Hajduk, P. J. Application of Belief Theory to Similarity Data Fusion for
Use in Analog Searching and Lead Hopping. *J. Chem. Inf. Model.* *2008*,
*48*, 941–948.

and also Greg's blog post:

http://rdkit.blogspot.co.uk/2013/10/fingerprint-thresholds.html


The TL;DR version is that for ECFP_4, this threshold should be around
0.45-0.55.
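
To make the effect of that range concrete, here is a minimal sketch of a
Tanimoto cutoff applied to fingerprints held as sets of on-bit indices. The
bit sets, compound names and the neighbours() helper are made up for
illustration; with real data the on-bits would come from an ECFP_4-style
fingerprint rather than being typed in by hand.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of on-bit indices."""
    shared = len(a & b)
    total = len(a) + len(b) - shared
    return shared / total if total else 0.0


def neighbours(query, library, threshold=0.5):
    """Return (name, similarity) pairs at or above the threshold, best first."""
    hits = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [(name, sim) for name, sim in hits if sim >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy on-bit sets (hypothetical, not real ECFP_4 output).
query = {1, 2, 3, 4, 5, 6}
library = {
    "close_analogue": {1, 2, 3, 4, 5, 7},  # 5 shared bits out of 7 in the union
    "borderline": {1, 2, 3, 4, 8, 9},      # 4 shared bits out of 8 in the union
    "unrelated": {20, 21, 22},             # no shared bits
}

for cutoff in (0.45, 0.55, 0.8):
    names = [name for name, _ in neighbours(query, library, cutoff)]
    print(cutoff, names)
```

The "borderline" compound sits at exactly 0.5, so it counts as a neighbour
at 0.45 but not at 0.55, which is why it is worth checking how sensitive
your result is across that band.
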
Regarding methodology, are you trying to score/rank the
intra-diversity/heterogeneity for different structure sets?
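
If that is the goal, one rough way to put a number on it would be to count
clusters at the chosen cutoff, along the lines of the sphere exclusion idea
quoted below. The sketch here is a simplified, order-dependent leader
variant on toy on-bit sets; the function name and the two example sets are
hypothetical, and a proper sphere exclusion implementation may pick its
centres differently.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of on-bit indices."""
    shared = len(a & b)
    total = len(a) + len(b) - shared
    return shared / total if total else 0.0


def count_leader_clusters(fps, threshold):
    """Greedy leader pass: a fingerprint that is not at least `threshold`
    similar to any existing centre becomes a new centre."""
    centres = []
    for fp in fps:
        if not any(tanimoto(fp, centre) >= threshold for centre in centres):
            centres.append(fp)
    return len(centres)


# Five analogues sharing a common core of bits (pairwise Tanimoto = 5/7).
tight_series = [{1, 2, 3, 4, 5, extra} for extra in range(10, 15)]
# Five fingerprints with no bits in common at all.
mixed_set = [{10 * i + k for k in range(6)} for i in range(5)]

for name, fps in (("tight_series", tight_series), ("mixed_set", mixed_set)):
    for cutoff in (0.5, 0.8):
        n_clusters = count_leader_clusters(fps, cutoff)
        print(name, cutoff, n_clusters, "clusters for", len(fps), "compounds")
```

The ratio of clusters to compounds then gives a crude heterogeneity score
per set, but note how the tight series goes from one cluster at 0.5 to five
at 0.8, so the ranking is only as meaningful as the cutoff.
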


Cheers,

George



On 26 May 2015 at 11:59, JP <[email protected]> wrote:

>
> On 25 May 2015 at 22:23, Tim Dudgeon <[email protected]> wrote:
>
>> Maybe a clustering approach would work? Something like sphere exclusion
>> clustering, counting the number of clusters at 0.8-0.9 similarity?
>> With 30K structures it sounds computationally tractable?
>
>
> Thanks Tim for this idea.  I hadn't heard of sphere exclusion.  The
> problem is we still need a distance / similarity function (and using ECFP
> with a high similarity cutoff of 0.8-0.9 would throw out very few
> compounds).  I think the real issue here is selecting a sensible
> similarity threshold which defines my idea of "similarity".  But that is a
> tricky number to get right - too high and you remove nothing, too low and
> you start catching "different" molecules.  I guess the best thing is try a
> few values (0.5, 0.6, 0.7, 0.8, 0.9) and have a visual look at the
> remaining compounds.
>
> -
> JP
>
>
> ------------------------------------------------------------------------------
> One dashboard for servers and applications across Physical-Virtual-Cloud
> Widest out-of-the-box monitoring support with 50+ applications
> Performance metrics, stats and reports that give you Actionable Insights
> Deep dive visibility with transaction tracing using APM Insight.
> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
> _______________________________________________
> Rdkit-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>