Hi,

I'm afraid I can't help with an RDKit solution to your question, but there are a couple of issues which should be borne in mind:

1) The centroid of a cluster is the vector mean of the fingerprints of all the members of the cluster, and it probably will not be represented *exactly* by any member of the cluster. In that case no structure will have a distance of 0.0 from the centroid. Do you want to calculate the distances from the true centroid or from the structure(s) closest to the centroid?

2) The Tanimoto metric doesn't obey the triangle inequality and is therefore sub-optimal for this kind of analysis. It's better to use an alternative which does obey the triangle inequality, e.g. the Cosine metric.
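To make point 1 concrete, here is a minimal pure-Python sketch (not an RDKit solution). The 6-bit fingerprints, the names `mol_a` to `mol_c`, and the helpers `centroid` and `cosine_distance` are all made up for illustration; real fingerprints would come from your toolkit of choice. It computes the vector-mean centroid of one cluster, the Cosine distance of every member from it, and the member closest to it.

```python
import math

def centroid(fps):
    """Component-wise mean of equal-length bit vectors (hypothetical helper)."""
    n = len(fps)
    return [sum(bits) / n for bits in zip(*fps)]

def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

# Toy cluster: three invented 6-bit fingerprints, assumed to belong together.
cluster = {
    "mol_a": [1, 1, 0, 0, 1, 0],
    "mol_b": [1, 1, 1, 0, 0, 0],
    "mol_c": [1, 1, 0, 0, 1, 1],
}

# The "true" centroid has fractional components, so it is not a real structure.
center = centroid(list(cluster.values()))

# Distance of every member from the true centroid.
distances = {name: cosine_distance(fp, center) for name, fp in cluster.items()}

# The member closest to the centroid, usable as a representative structure.
nearest = min(distances, key=distances.get)

for name, d in sorted(distances.items(), key=lambda kv: kv[1]):
    print(f"{name}: {d:.3f}")
print("closest member:", nearest)
```

In this toy cluster every member ends up at a non-zero distance from the centroid, so a distance of 0.00 only appears if you measure from the closest member instead.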
Regards,
Chris Earnshaw

On Thu, 20 Sep 2018 at 21:55, James T. Metz via Rdkit-discuss <rdkit-discuss@lists.sourceforge.net> wrote:

> RDkit Discussion Group,
>
> I note that RDkit can perform Butina clustering. Given an SDF of small molecules I would like to cluster the ligands, but obtain additional information from the clustering algorithm. In particular, I would like to obtain the cluster number and Tanimoto distance from the centroid for every ligand in the SDF. The centroid would obviously have a distance of 0.00.
>
> Has anyone written additional RDkit code to extract this additional information?
> Thank you.
>
> Regards,
> Jim Metz
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss