Hi

There's a degree of confusion here depending on whether people are
considering *similarity* (Tanimoto, cosine, whatever) or *distance*
(however that's defined). The two are very different from each other.
Cosine *similarity* obeys the triangle inequality; cosine *distance*
doesn't - pretty much by definition. Tanimoto *similarity* does not obey
the inequality, but Tanimoto *distance* may - depending on how you define
it. It's very important not to get confused about whether it's similarity
or distance that you're considering.
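
To make the distinction concrete, here is a rough sketch, not anything
authoritative. It assumes "distance" simply means 1 - similarity and that
fingerprints are held as Python sets of on-bit indices; the function names
are made up for the illustration. It tests the triangle inequality
d(x, z) <= d(x, y) + d(y, z) on one hand-picked triple and then spot-checks
1 - Tanimoto on random bit sets (a spot check, not a proof).

```python
import itertools
import math
import random


def tanimoto_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumed convention: two empty fingerprints are identical
    return 1.0 - len(a & b) / union


def cosine_distance(a, b):
    """1 - |a & b| / sqrt(|a| * |b|) for two non-empty sets of on-bit indices."""
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))


def violates_triangle(dist, x, y, z, tol=1e-12):
    """True if going from x to z directly is longer than the detour via y."""
    return dist(x, z) > dist(x, y) + dist(y, z) + tol


# Hand-picked triple: y overlaps both x and z, while x and z share nothing.
x, y, z = {0}, {0, 1}, {1}
cosine_fails = violates_triangle(cosine_distance, x, y, z)
tanimoto_fails = violates_triangle(tanimoto_distance, x, y, z)
print("1 - cosine violates the inequality:", cosine_fails)
print("1 - Tanimoto violates the inequality:", tanimoto_fails)

# Random spot check of 1 - Tanimoto over all ordered triples of 40 bit sets.
rng = random.Random(0)
fps = [frozenset(rng.sample(range(16), rng.randint(1, 8))) for _ in range(40)]
bad = sum(
    violates_triangle(tanimoto_distance, a, b, c)
    for a, b, c in itertools.permutations(fps, 3)
)
print("violations found for 1 - Tanimoto:", bad)
```

For the hand-picked triple, 1 - cosine gives 1.0 for the direct path against
roughly 0.586 for the detour, so the inequality fails; 1 - Tanimoto gives 1.0
against 0.5 + 0.5, so it just holds.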

Years ago I worked on some software for fingerprint-based similarity
searching using a variety of metrics, and also for clustering. This was
based entirely on *similarity* calculations, and clustering used the cosine
metric for this reason. BTW, based on this work I recommend k-means
relocation clustering, but it needs carefully optimised code to make it run
fast enough to be useful.
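
Coming back to the original question (cluster number and Tanimoto distance
from the centroid for every ligand), something along these lines might be a
starting point. It is only a sketch under a few assumptions: the function
name, the choice of Chem.RDKFingerprint and the 0.35 cutoff are mine, not
anything from the thread. Note that the "centroid" RDKit's Butina code
reports is an actual cluster member (listed first in each cluster), not a
vector mean, which is why it comes out at distance 0.00.

```python
from rdkit import Chem, DataStructs
from rdkit.ML.Cluster import Butina


def butina_with_distances(mols, cutoff=0.35):
    """Return sorted (mol_index, cluster_number, distance_to_centroid) tuples."""
    fps = [Chem.RDKFingerprint(m) for m in mols]
    n = len(fps)

    # Butina.ClusterData wants the lower triangle of the distance matrix,
    # flattened row by row: d(1,0), d(2,0), d(2,1), d(3,0), ...
    dists = []
    for i in range(1, n):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend(1.0 - s for s in sims)

    clusters = Butina.ClusterData(dists, n, cutoff, isDistData=True)

    rows = []
    for cluster_no, members in enumerate(clusters):
        centroid = members[0]  # RDKit lists each cluster's centroid first
        for idx in members:
            sim = DataStructs.TanimotoSimilarity(fps[centroid], fps[idx])
            rows.append((idx, cluster_no, 1.0 - sim))
    return sorted(rows)


# Small illustrative input; for an SDF the molecules could instead be read
# with Chem.SDMolSupplier, skipping any entries that come back as None.
smiles = ["CCO", "CCCO", "CCCCO", "c1ccccc1", "Cc1ccccc1"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
for idx, cluster_no, dist in butina_with_distances(mols):
    print(f"mol {idx}: cluster {cluster_no}, distance {dist:.2f}")
```

If distances from the true (vector mean) centroid are wanted instead, the
inner loop would need a different reference point than members[0].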

Regards,
Chris Earnshaw

On Thu, 27 Sep 2018 at 02:36, Francois Berenger <mli...@ligand.eu> wrote:

> On 21/09/2018 16:53, Chris Earnshaw wrote:
> > Hi
> >
> > I'm afraid I can't help with an RDKit solution to your question, but
> > there are a couple of issues which should be borne in mind:
> > 1) The centroid of a cluster is a vector mean of the fingerprints of
> > all the members of the cluster and probably will not be represented
> > _exactly_ by any member of the cluster; in this case no structures
> > will have a distance of 0.0 from the centroid. Do you want to
> > calculate the distances from the true centroid or from the
> > structure(s) closest to the centroid?
>
> I have seen 'clustroid' in the literature to mean
> cluster member nearest to the centroid of that cluster.
>
> > 2) The Tanimoto metric doesn't obey the triangle inequality and is
> > therefore sub-optimal for this kind of analysis. It's better to use an
> > alternative which does obey the triangle inequality - e.g. the Cosine
> > metric.
>
> The opposite is true.
>
> Sven Kosub. A note on the triangle inequality for the Jaccard distance.
> CoRR, abs/1612.02696, 2016.
>
> Alan H. Lipkus. A proof of the triangle inequality for the Tanimoto
> distance. Journal of Mathematical Chemistry, 26(1):263–265, Oct 1999.
>
> Cosine similarity, by contrast, is not a metric, according to Wikipedia.
>
> I'm not a mathematician, but I think (1 - Tanimoto) is a proper distance
> as long as the molecules are encoded with only non-negative values.
> So, Boolean fingerprints are OK, and unfolded count fingerprints
> as well.
>
> Regards,
> Francois.
>
> > Regards,
> > Chris Earnshaw
> >
> > On Thu, 20 Sep 2018 at 21:55, James T. Metz via Rdkit-discuss
> > <rdkit-discuss@lists.sourceforge.net> wrote:
> >
> >> RDKit Discussion Group,
> >>
> >> I note that RDKit can perform Butina clustering. Given an SDF of
> >> small molecules I would like to cluster the ligands, but obtain
> >> additional information from the clustering algorithm. In particular,
> >> I would like to obtain the cluster number and Tanimoto distance from
> >> the centroid for every ligand in the SDF. The centroid would
> >> obviously have a distance of 0.00.
> >>
> >> Has anyone written additional RDKit code to extract this
> >> additional information?
> >>
> >> Thank you.
> >>
> >> Regards,
> >>
> >> Jim Metz
> >>
> >> _______________________________________________
> >> Rdkit-discuss mailing list
> >> Rdkit-discuss@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss