Ah, David, but how do you define a "real" singleton? -P.
On Wed, Sep 26, 2018 at 1:30 PM David Cosgrove <davidacosgrov...@gmail.com> wrote:

> Slightly off topic, but a minor issue with the Taylor-Butina algorithm is
> that it generates “false singletons”. These are molecules just outside the
> clustering cutoff that are stranded when their neighbours are put in a
> different, larger cluster. We used to find it convenient to have a sweep of
> these, at a slightly looser cutoff, and drop them into the cluster whose
> centroid/seed they were nearest to. This could be added to Andrew’s code
> quite easily. At the very least, it’s worth keeping track of the initial
> number of neighbours within the cluster cutoff that each fingerprint had so
> as to distinguish real singletons from these artefactual ones.
> Dave
>
> On Tue, 25 Sep 2018 at 19:56, Peter S. Shenkin <shen...@gmail.com> wrote:
>
>> Well, I'm not really familiar with the Taylor-Butina clustering method,
>> so I'm proposing a methodology based on generalizing something that I found
>> to be useful in a somewhat different clustering context.
>>
>> Presuming that what you are clustering is the fingerprints of structures,
>> and that you know which structures are in each cluster, you'd compute the
>> average of all the fingerprints. That is, each bit position would be given
>> a floating point number that is the average of the 0s and 1s at that
>> position computed over the structures in the cluster. Then you'd compute
>> the distance (say, Manhattan or Euclidean) between the fingerprint of each
>> structure in the cluster and the average so computed. The "most
>> representative structure" would be the cluster member whose fingerprint is
>> closest to the cluster's average fingerprint. (Some additional mileage
>> could be gained by seeing just how far away from the average the "most
>> representative structures" are. It might be more representative (i.e.,
>> closer) for some clusters than for others.)
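
The averaging step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not code from anyone in the thread: fingerprints are taken to be plain lists of 0/1 values of equal length, and the names mean_fingerprint and most_representative are hypothetical helpers, not RDKit functions.

```python
from math import sqrt


def mean_fingerprint(fps):
    """Per-bit average over the cluster: one float in [0, 1] per bit position."""
    n = len(fps)
    return [sum(bits) / n for bits in zip(*fps)]


def most_representative(fps, metric="manhattan"):
    """Return (index, distance) of the member closest to the mean fingerprint.

    fps is assumed to be a non-empty list of equal-length 0/1 lists.
    """
    mean_fp = mean_fingerprint(fps)

    def dist(fp):
        if metric == "manhattan":
            return sum(abs(b - m) for b, m in zip(fp, mean_fp))
        if metric == "euclidean":
            return sqrt(sum((b - m) ** 2 for b, m in zip(fp, mean_fp)))
        raise ValueError("metric must be 'manhattan' or 'euclidean'")

    dists = [dist(fp) for fp in fps]
    best = min(range(len(fps)), key=dists.__getitem__)
    return best, dists[best]


# Toy cluster of four 4-bit fingerprints (made-up data).
cluster = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
]
idx, dist_to_mean = most_representative(cluster)
```

The returned distance is the quantity the parenthetical remark refers to: comparing it across clusters gives a rough idea of how representative each pick is.
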
>>
>> It would make sense to try this (since it's easy enough) and see whether
>> the resulting "most representative structures" from the clusters really are
>> at least roughly representative, by comparing them with viewable random
>> subsets of structures from the clusters.
>>
>> -P.
>>
>> On Tue, Sep 25, 2018 at 2:36 PM, Andrew Dalke <da...@dalkescientific.com>
>> wrote:
>>
>>> On Sep 25, 2018, at 17:13, Peter S. Shenkin <shen...@gmail.com> wrote:
>>> > FWIW, in work on conformational clustering, I used the “most
>>> > representative” molecule; that is, the real molecule closest to the
>>> > mathematical centroid. This would probably be the best way of displaying a
>>> > single molecule that typifies what is in the cluster.
>>>
>>> In some sense I'm rephrasing Chris Earnshaw's earlier question - how
>>> does one do that with Taylor-Butina clustering? And does it make sense?
>>>
>>> The algorithm starts by picking a centroid based on the fingerprints
>>> with the highest number of neighbors, so none of the other cluster members
>>> should have more neighbors within that cutoff.
>>>
>>> I am far from an expert on this topic, but any alternative I can
>>> think of makes me think I should have started with something other than
>>> Taylor-Butina.
>>>
>>> Andrew
>>> da...@dalkescientific.com
>>>
>>> _______________________________________________
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
> --
> David Cosgrove
> Freelance computational chemistry and chemoinformatics developer
> http://cozchemix.co.uk
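
For what it's worth, the bookkeeping Dave describes in the quoted message (record each fingerprint's initial neighbour count, then sweep the stranded singletons at a looser cutoff) might look roughly like the sketch below. It is a simplified stand-in, not Andrew's code or RDKit's implementation. Assumptions: fingerprints are Python sets of on-bit indices, the cutoffs are Tanimoto similarity thresholds rather than distances, and butina and sweep_false_singletons are hypothetical names.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina(fps, cutoff):
    """Simplified Taylor-Butina pass.

    Returns (clusters, initial_counts). Each cluster lists its centroid first;
    initial_counts[i] is the neighbour count of i before any assignment.
    """
    n = len(fps)
    nbrs = [{j for j in range(n)
             if j != i and tanimoto(fps[i], fps[j]) >= cutoff}
            for i in range(n)]
    initial_counts = [len(s) for s in nbrs]
    unassigned = set(range(n))
    clusters = []
    # Visit candidates in order of decreasing initial neighbour count.
    for i in sorted(range(n), key=lambda k: -initial_counts[k]):
        if i not in unassigned:
            continue
        members = [i] + sorted(nbrs[i] & unassigned)
        unassigned -= set(members)
        clusters.append(members)
    return clusters, initial_counts


def sweep_false_singletons(fps, clusters, initial_counts, loose_cutoff):
    """Move false singletons into the cluster with the nearest centroid."""
    big = [list(c) for c in clusters if len(c) > 1]
    leftovers = []
    for c in clusters:
        if len(c) > 1:
            continue
        i = c[0]
        if initial_counts[i] == 0 or not big:
            leftovers.append([i])  # never had neighbours, or nowhere to go
            continue
        best = max(big, key=lambda m: tanimoto(fps[i], fps[m[0]]))
        if tanimoto(fps[i], fps[best[0]]) >= loose_cutoff:
            best.append(i)
        else:
            leftovers.append([i])  # still too far from every centroid
    return big + leftovers


# Toy data: index 4 neighbours only index 3, which index 0's cluster absorbs.
fps = [
    {1, 2, 3, 4, 5, 6},
    {1, 2, 3, 4, 5, 6, 7},
    {1, 2, 3, 4, 5, 6, 8},
    {1, 2, 3, 4, 5, 9},
    {1, 2, 3, 4, 5, 9, 10},
    {20, 21, 22},
]
clusters, counts = butina(fps, cutoff=0.7)
swept = sweep_false_singletons(fps, clusters, counts, loose_cutoff=0.6)
```

In this toy set, index 4 comes out of the first pass as a singleton even though it started with one neighbour, so it counts as a false singleton and the sweep places it with centroid 0. Index 5 had no neighbours at all and stays on its own, which is one possible answer to the question of what a "real" singleton is.
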