Hi

I'm afraid I can't help with an RDKit solution to your question, but there
are a couple of issues which should be borne in mind:
1) The centroid of a cluster is a vector mean of the fingerprints of all
the members of the cluster and probably will not be represented *exactly*
by any member of the cluster; in this case no structures will have a
distance of 0.0 from the centroid. Do you want to calculate the distances
from the true centroid or from the structure(s) closest to the centroid?
2) The Tanimoto metric doesn't obey the triangle inequality and is
therefore sub-optimal for this kind of analysis. It's better to use an
alternative which does obey the triangle inequality - e.g. the Cosine
metric.
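
To make point 1 concrete, here is a minimal pure-Python sketch (not RDKit
code). The toy fingerprints and the helper names mean_vector and
cosine_distance are made up for illustration. It computes the vector-mean
centroid of a small cluster, the Cosine distance of each member from that
centroid, and, as the alternative, the distance of each member from the
structure closest to the centroid.

```python
import math


def mean_vector(fps):
    """Component-wise mean of equal-length fingerprint vectors."""
    n = len(fps)
    return [sum(column) / n for column in zip(*fps)]


def cosine_distance(a, b):
    """1 - cosine similarity, clamped at 0 to absorb rounding error."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0  # convention chosen here for all-zero vectors
    return max(0.0, 1.0 - dot / norm)


# Hypothetical toy cluster: three 6-bit fingerprints.
cluster = [
    [1, 1, 0, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 1],
]

# Option A: distances from the true (vector-mean) centroid.
centroid = mean_vector(cluster)
dist_to_centroid = [cosine_distance(fp, centroid) for fp in cluster]

# Option B: distances from the member closest to the centroid.
closest = min(range(len(cluster)), key=dist_to_centroid.__getitem__)
dist_to_closest = [cosine_distance(fp, cluster[closest]) for fp in cluster]

print("centroid:", centroid)
print("distances from centroid:", dist_to_centroid)
print("closest member index:", closest)
print("distances from closest member:", dist_to_closest)
```

With option A no member is at distance 0.0, because the mean vector is not
itself a bit string; with option B the chosen member is at distance 0.0 by
construction.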

Regards,
Chris Earnshaw


On Thu, 20 Sep 2018 at 21:55, James T. Metz via Rdkit-discuss <
rdkit-discuss@lists.sourceforge.net> wrote:

> RDkit Discussion Group,
>
>     I note that RDkit can perform Butina clustering.  Given an SDF of
> small molecules I would like to cluster the ligands, but obtain additional
> information from the clustering algorithm.  In particular, I would like to
> obtain
> the cluster number and Tanimoto distance from the centroid for every ligand
> in the SDF.  The centroid would obviously have a distance of 0.00.
>
>     Has anyone written additional RDkit code to extract this additional
> information?
> Thank you.
>
>     Regards,
>     Jim Metz
>
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>