On Sep 26, 2018, at 20:26, Peter S. Shenkin <[email protected]> wrote:
> Ah, David, but how do you define a "real" singleton?
There can be many different definitions of what a '"real" singleton' might be,
but we are specifically talking about Butina clustering.
The Butina paper defines the term "false singleton", which Dave quoted. The
relevant text from DOI: 10.1021/ci9803381 is:
"""The molecules that have not been flagged by the end of the clustering
process, either as a cluster centroid or as a cluster member, become
singletons. It is important to emphasize at this stage that one of the
consequences of this approach is that some molecules defined as singletons may
have neighbors at the given Tanimoto similarity index, but those neighbors have
been excluded by a ‘stronger’ cluster centroid, i.e., one with more neighbors
in its list. .... the problem with the creation of a number of false singletons
that do in fact have similar compounds within the set is easily offset by the
final quality of the clusters that this approach generates."""
As you can see, there are two types of singletons, and one is called a "false
singleton". No specific name is used for the other type of singleton, but it's
easy to see how they can be called "real" singletons, without confusion or
misunderstanding.
(FWIW, my implementation, mentioned in an earlier email, uses the term "true
singleton" for a singleton which is not a "false singleton", but the
difference is only in the label.)
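
To make the distinction concrete, here is a minimal sketch in plain Python. It
is not the code from the Butina paper, from RDKit, or from any implementation
mentioned in this thread. It assumes the neighbor lists have already been
computed at the chosen similarity threshold, and the names
(butina_with_singleton_labels, neighbors, flagged) are made up for
illustration. Candidates are ordered once by neighbor count, which is one
common reading of the algorithm; other implementations re-count after each
cluster is formed.

```python
def butina_with_singleton_labels(neighbors):
    """Sphere-exclusion clustering sketch that labels each singleton.

    neighbors: dict mapping each item to the set of other items within the
    similarity threshold (the item itself is not included).
    Returns (clusters, labels): clusters is a list of lists with the centroid
    first; labels maps each singleton to "true" or "false".
    """
    # Strongest candidate centroids first; ties are broken by item name.
    order = sorted(neighbors, key=lambda item: (-len(neighbors[item]), item))
    flagged = set()
    clusters = []
    labels = {}
    for centroid in order:
        if centroid in flagged:
            continue  # already a centroid or a member of an earlier cluster
        members = sorted(n for n in neighbors[centroid] if n not in flagged)
        flagged.add(centroid)
        flagged.update(members)
        clusters.append([centroid] + members)
        if not members:
            # No neighbors at all: true singleton. Neighbors exist but were
            # all taken by earlier, stronger centroids: false singleton.
            labels[centroid] = "false" if neighbors[centroid] else "true"
    return clusters, labels


# Hypothetical toy data: A is near B, C and D; D is also near E; F is isolated.
example_neighbors = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A", "E"},
    "E": {"D"},
    "F": set(),
}
example_clusters, example_labels = butina_with_singleton_labels(example_neighbors)
```

In this toy data E ends up alone because its only neighbor, D, was already
claimed by the cluster around A, so E is a false singleton. F has no neighbors
at the threshold, so F is a true singleton.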
To confirm that this is what Dave means, I'll quote from his paper:
Blomberg, N., Cosgrove, D. A., Kenny, P. W., & Kolmodin, K. (2009). Design of
compound libraries for fragment screening. Journal of Computer-Aided Molecular
Design, 23(8), 513–525. doi:10.1007/s10822-009-9264-5
"""The clustering program flush_clus is an implementation of the
sphere-exclusion algorithm of Taylor [41], which has also been reported
independently by Butina ... One consequence of the algorithm is the production
of ‘false singleton clusters.’ The final clusters in the output are invariably
singleton clusters, where the only member is the seed. Some of these will be
true singletons, i.e. molecules lacking neighbors within the clustering
threshold, but others (the false singletons) will be singletons by virtue of
the fact that their neighbors were placed in other larger clusters in a
previous iteration of the algorithm. The flush_clus program offers the
opportunity of performing a final sweep through the clusters using a larger
similarity threshold and placing the singleton molecules within the cluster for
which it has the greatest similarity with the seed, so long as this is within
the threshold."""
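
The final sweep described in that passage could look something like the
following. This is only a sketch under stated assumptions, not the flush_clus
code: the function name sweep_singletons, the similarity callable, and the
sweep_threshold parameter are all hypothetical, and each cluster is assumed to
be a list with its seed first.

```python
def sweep_singletons(clusters, similarity, sweep_threshold):
    """Reassign singleton clusters to the most similar non-singleton seed.

    clusters: list of lists, seed first. similarity: callable (a, b) -> float.
    A singleton joins the cluster whose seed it is most similar to, but only
    if that similarity reaches sweep_threshold; otherwise it stays alone.
    """
    merged = [list(c) for c in clusters if len(c) > 1]
    seeds = [c[0] for c in merged]
    leftover = []
    for cluster in clusters:
        if len(cluster) != 1:
            continue
        mol = cluster[0]
        # Index of the best-matching seed, or None if there are no seeds.
        best = max(
            range(len(seeds)),
            key=lambda i: similarity(mol, seeds[i]),
            default=None,
        )
        if best is not None and similarity(mol, seeds[best]) >= sweep_threshold:
            merged[best].append(mol)
        else:
            leftover.append([mol])
    return merged + leftover


# Hypothetical toy similarities between singletons and cluster seeds.
toy_sims = {("E", "A"): 0.55, ("E", "G"): 0.20, ("F", "A"): 0.10, ("F", "G"): 0.30}


def toy_similarity(a, b):
    # Symmetric lookup; pairs not listed are treated as dissimilar.
    return toy_sims.get((a, b), toy_sims.get((b, a), 0.0))


swept = sweep_singletons(
    [["A", "B", "C", "D"], ["G", "H"], ["E"], ["F"]], toy_similarity, 0.5
)
```

Here E is folded into the cluster seeded by A, while F stays a singleton
because none of its seed similarities reaches the sweep threshold.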
Cheers,
Andrew
[email protected]