Hi,

There's a degree of confusion here, depending on whether people are considering *similarity* (Tanimoto, cosine, whatever) or *distance* (however that's defined). The two are very different from each other. Cosine *similarity* obeys the triangle inequality; cosine *distance* doesn't, pretty much by definition. Tanimoto *similarity* does not obey the inequality, but Tanimoto *distance* may, depending on how you define it. It's very important to be clear about whether it's similarity or distance that you're considering.
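To make the distance side of this concrete, here is a minimal sketch in plain Python (not RDKit code) that tests the triangle inequality on one hand-picked triple of binary fingerprints, each written as a set of on-bit indices. The helper names (`tanimoto`, `cosine`, `triangle_holds`) and the choice of 1 - similarity as the distance are assumptions made for illustration. A single triple can expose a violation, but it cannot prove that the inequality always holds.

```python
from math import sqrt


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def cosine(a, b):
    """Cosine similarity of two binary fingerprints given as sets of on-bits."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def triangle_holds(dist, x, y, z, tol=1e-12):
    """True if d(x, z) <= d(x, y) + d(y, z) for this one triple."""
    return dist(x, z) <= dist(x, y) + dist(y, z) + tol


def tanimoto_dist(p, q):
    # Assumed definition of distance: 1 - similarity
    return 1.0 - tanimoto(p, q)


def cosine_dist(p, q):
    # Assumed definition of distance: 1 - similarity
    return 1.0 - cosine(p, q)


# Hand-picked triple: b shares one bit with a and one bit with c,
# while a and c share nothing.
a, b, c = {1}, {1, 2}, {2}

print("1 - Tanimoto:", triangle_holds(tanimoto_dist, a, b, c))
print("1 - cosine:  ", triangle_holds(cosine_dist, a, b, c))
```

For this triple, 1 - Tanimoto compares 1.0 against 0.5 + 0.5, so the inequality just holds, while 1 - cosine compares 1.0 against roughly 0.29 + 0.29, so it fails.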
Years ago I worked on some software for fingerprint-based similarity searching using a variety of metrics, and also for clustering. This was based entirely on *similarity* calculations, and clustering used the cosine metric for this reason. BTW, based on this work I recommend k-means relocation clustering, but it needs carefully optimised code to make it run fast enough to be useful.

Regards,
Chris Earnshaw

On Thu, 27 Sep 2018 at 02:36, Francois Berenger <mli...@ligand.eu> wrote:

> On 21/09/2018 16:53, Chris Earnshaw wrote:
> > Hi
> >
> > I'm afraid I can't help with an RDkit solution to your question, but
> > there are a couple of issues which should be borne in mind:
> > 1) The centroid of a cluster is a vector mean of the fingerprints of
> > all the members of the cluster and probably will not be represented
> > _exactly_ by any member of the cluster; in this case no structures
> > will have a distance of 0.0 from the centroid. Do you want to
> > calculate the distances from the true centroid or from the
> > structure(s) closest to the centroid?
>
> I have seen 'clustroid' in the literature to mean
> cluster member nearest to the centroid of that cluster.
>
> > 2) The Tanimoto metric doesn't obey the triangle inequality and is
> > therefore sub-optimal for this kind of analysis. It's better to use an
> > alternative which does obey the triangle inequality - e.g. the Cosine
> > metric.
>
> The opposite is true.
>
> Sven Kosub. A note on the triangle inequality for the Jaccard distance.
> CoRR, abs/1612.02696, 2016.
>
> Alan H. Lipkus. A proof of the triangle inequality for the Tanimoto
> distance. Journal of Mathematical Chemistry, 26(1):263–265, Oct 1999.
>
> While cosine similarity is not a metric, according to Wikipedia.
>
> I'm not a mathematician, but I think (1 - Tanimoto) is a proper distance
> as long as the molecules are encoded with only positive values.
> So, Boolean fingerprints are OK, and counted unfolded fingerprints
> as well.
>
> Regards,
> Francois.
>
> > Regards,
> > Chris Earnshaw
> >
> > On Thu, 20 Sep 2018 at 21:55, James T. Metz via Rdkit-discuss
> > <rdkit-discuss@lists.sourceforge.net> wrote:
> >
> >> RDkit Discussion Group,
> >>
> >> I note that RDkit can perform Butina clustering. Given an SDF of
> >> small molecules I would like to cluster the ligands, but obtain
> >> additional information from the clustering algorithm. In particular,
> >> I would like to obtain the cluster number and Tanimoto distance from
> >> the centroid for every ligand in the SDF. The centroid would
> >> obviously have a distance of 0.00.
> >>
> >> Has anyone written additional RDkit code to extract this
> >> additional information?
> >>
> >> Thank you.
> >>
> >> Regards,
> >>
> >> Jim Metz
> >>
> >> _______________________________________________
> >> Rdkit-discuss mailing list
> >> Rdkit-discuss@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
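On the original question (cluster number plus Tanimoto distance from the centroid for every ligand), the bookkeeping can be sketched as below. This is plain Python rather than tested RDKit code, and it rests on stated assumptions: clusters are given as a tuple of index tuples with the centroid listed first, which is the shape RDKit's `Butina.ClusterData` returns, and fingerprints are plain sets of on-bit indices so that the example runs without RDKit. The function names `distances_to_centroid` and `clustroid` are hypothetical. Note that a Butina "centroid" is an actual cluster member, so it does have a distance of 0.0, unlike a vector-mean centroid.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def distances_to_centroid(clusters, fps):
    """Map each member index to (cluster number, 1 - Tanimoto to the centroid).

    Assumes the first index in each cluster is its centroid.
    """
    result = {}
    for cluster_no, members in enumerate(clusters):
        centre = members[0]
        for idx in members:
            result[idx] = (cluster_no, 1.0 - tanimoto(fps[centre], fps[idx]))
    return result


def clustroid(members, fps):
    """Member with the highest summed Tanimoto similarity to its cluster."""
    return max(members, key=lambda i: sum(tanimoto(fps[i], fps[j]) for j in members))


# Toy data: five "fingerprints" and two clusters, centroid index first.
fps = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2}, {7, 8}, {7, 8, 9}]
clusters = ((0, 1, 2), (3, 4))

table = distances_to_centroid(clusters, fps)
for idx in sorted(table):
    cluster_no, dist = table[idx]
    print(f"ligand {idx}: cluster {cluster_no}, distance {dist:.2f}")
```

With real molecules, the sets would be replaced by RDKit fingerprints and `tanimoto` by the corresponding RDKit similarity call. `clustroid` is only needed if the cluster centre is defined some other way than Butina's (for example after k-means), in which case it picks the member closest to the rest of the cluster.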