Well, I'm not really familiar with the Taylor-Butina clustering method, so I'm proposing a methodology based on generalizing something that I found to be useful in a somewhat different clustering context.
Presuming that what you are clustering is the fingerprints of structures, and that you know which structures are in each cluster, you'd compute the average of all the fingerprints in a cluster. That is, each bit position would be given a floating-point number that is the average of the 0s and 1s at that position, computed over the structures in the cluster. Then you'd compute the distance (say, Manhattan or Euclidean) between the fingerprint of each structure in the cluster and the average so computed. The "most representative structure" would be the cluster member whose fingerprint is closest to the cluster's average fingerprint.

(Some additional mileage could be gained by seeing just how far away from the average the "most representative structures" are. The chosen structure might be more representative (i.e., closer) for some clusters than for others.)

It would make sense to try this (since it's easy enough) and see whether the resulting "most representative structures" from the clusters really are at least roughly representative, by comparing them with viewable random subsets of structures from the clusters.

-P.

On Tue, Sep 25, 2018 at 2:36 PM, Andrew Dalke <da...@dalkescientific.com> wrote:

> On Sep 25, 2018, at 17:13, Peter S. Shenkin <shen...@gmail.com> wrote:
> > FWIW, in work on conformational clustering, I used the “most representative” molecule; that is, the real molecule closest to the mathematical centroid. This would probably be the best way of displaying a single molecule that typifies what is in the cluster.
>
> In some sense I'm rephrasing Chris Earnshaw's earlier question - how does one do that with Taylor-Butina clustering? And does it make sense?
>
> The algorithm starts by picking a centroid based on the fingerprints with the highest number of neighbors, so none of the other cluster members should have more neighbors within that cutoff.
>
> I am far from an expert on this topic, but any alternative I can think of makes me think I should have started with something other than Taylor-Butina.
>
> Andrew
> da...@dalkescientific.com
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
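
P.S. In case it helps, here is a rough sketch of the averaging idea from the top of this message. It assumes each fingerprint has already been converted to a plain list of 0/1 integers of equal length, and that cluster membership is already known. The function names (mean_fingerprint, manhattan, most_representative) and the toy cluster are made up for illustration; nothing here is RDKit API, and Manhattan distance is just one of the two choices mentioned above.

```python
from typing import List, Sequence, Tuple


def mean_fingerprint(fps: Sequence[Sequence[int]]) -> List[float]:
    """Per-bit average of the 0/1 values over all cluster members."""
    n = len(fps)
    return [sum(bits) / n for bits in zip(*fps)]


def manhattan(fp: Sequence[int], mean: Sequence[float]) -> float:
    """Manhattan (L1) distance between one fingerprint and the average."""
    return sum(abs(b - m) for b, m in zip(fp, mean))


def most_representative(fps: Sequence[Sequence[int]]) -> Tuple[int, float]:
    """Return (index, distance) of the member closest to the average.

    Ties are broken in favor of the earliest member in the list.
    """
    if not fps:
        raise ValueError("cluster must contain at least one fingerprint")
    mean = mean_fingerprint(fps)
    distances = [manhattan(fp, mean) for fp in fps]
    best = min(range(len(fps)), key=distances.__getitem__)
    return best, distances[best]


# Hypothetical 6-bit fingerprints for a single cluster (illustrative data only).
cluster = [
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 1],
]
index, distance = most_representative(cluster)
print(index, distance)
```

The returned distance is the "how far away" number from the parenthetical remark: comparing it across clusters gives a crude sense of which clusters are well typified by their chosen member and which are not.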