Re: [Rdkit-discuss] Butina clustering with additional output

Peter S. Shenkin Tue, 25 Sep 2018 11:56:21 -0700

Well, I'm not really familiar with the Taylor-Butina clustering method, so
I'm proposing a methodology based on generalizing something that I found to
be useful in a somewhat different clustering context.

Presuming that what you are clustering is the fingerprints of structures,
and that you know which structures are in each cluster, you'd compute the
average of all the fingerprints. That is, each bit position would be given
a floating point number that is the average of the 0s and 1s at that
position computed over the structures in the cluster.  Then you'd compute
the distance (say, Manhattan or Euclidean) between the fingerprint of each
structure in the cluster and the average so computed. The "most
representative structure" would be the cluster member whose fingerprint is
closest to the cluster's average fingerprint. (Some additional mileage
could be gained by seeing just how far from the average the "most
representative structures" are. They might be more representative (i.e.,
closer) for some clusters than for others.)
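
A minimal sketch of the procedure described above, in Python with NumPy. It
assumes the fingerprints of one cluster have already been unpacked into a
0/1 array with one row per structure; the function name
most_representative and its metric argument are made up for illustration
and are not part of RDKit.

```python
import numpy as np


def most_representative(fps, metric="manhattan"):
    """Find the cluster member nearest to the cluster's average fingerprint.

    fps: array-like of shape (n_members, n_bits) holding 0/1 values.
    Returns (index of the nearest member, distances of all members).
    """
    fps = np.asarray(fps, dtype=float)
    if fps.ndim != 2 or fps.shape[0] == 0:
        raise ValueError("expected a non-empty 2D array of fingerprints")

    # Per-bit average over the cluster: a float in [0, 1] for each position.
    centroid = fps.mean(axis=0)
    diff = fps - centroid

    if metric == "manhattan":
        dists = np.abs(diff).sum(axis=1)
    elif metric == "euclidean":
        dists = np.sqrt((diff ** 2).sum(axis=1))
    else:
        raise ValueError("metric must be 'manhattan' or 'euclidean'")

    # Ties are broken in favour of the first member with the minimum distance.
    return int(np.argmin(dists)), dists


# Toy cluster of four 4-bit fingerprints (invented data, not real molecules).
cluster = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
]
idx, dists = most_representative(cluster)
# Member 0 is nearest to the average, at Manhattan distance 0.75.
print(idx, dists[idx])
```

The returned distances also give the "how far away" number mentioned above,
so clusters whose nearest member is still far from the average can be
flagged. With RDKit bit vectors, each row could be filled with
DataStructs.ConvertToNumpyArray before calling the function.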

It would make sense to try this (since it's easy enough) and see whether
the resulting "most representative structures" from the clusters really are
at least roughly representative, by comparing them with viewable random
subsets of structures from the clusters.

-P.

On Tue, Sep 25, 2018 at 2:36 PM, Andrew Dalke <da...@dalkescientific.com>
wrote:

> On Sep 25, 2018, at 17:13, Peter S. Shenkin <shen...@gmail.com> wrote:
> > FWIW, in work on conformational clustering, I used the “most
> > representative” molecule; that is, the real molecule closest to the
> > mathematical centroid. This would probably be the best way of displaying a
> > single molecule that typifies what is in the cluster.
>
> In some sense I'm rephrasing Chris Earnshaw's earlier question - how does
> one do that with Taylor-Butina clustering? And does it make sense?
>
> The algorithm starts by picking a centroid based on the fingerprints with
> the highest number of neighbors, so none of the other cluster members
> should have more neighbors within that cutoff.
>
> I am far from an expert on this topic, but any alternative I can think of
> makes me think I should have started with something other than
> Taylor-Butina.
>
>
>
>                                 Andrew
>                                 da...@dalkescientific.com
>
>
>
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>

