Hi,

you probably read about the Tanimoto being a proper metric for binary data in Leach and Gillet, 'Introduction to Chemoinformatics', chapter 5.3.1 of the revised edition.
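For anyone who wants to convince themselves numerically, here is a minimal sketch that checks the triangle inequality for the Tanimoto distance (1 - similarity) over every triple of binary fingerprints on a small number of bits. It is a brute-force sanity check, not a proof, and it says nothing about non-binary (count or continuous) variants. The function names and the choice to treat two empty fingerprints as identical are my own assumptions for illustration.

```python
from fractions import Fraction
from itertools import combinations, product


def tanimoto_distance(a: frozenset, b: frozenset) -> Fraction:
    """Tanimoto (Jaccard) distance between two sets of on-bits."""
    union = len(a | b)
    if union == 0:
        # Assumption: two empty fingerprints are treated as identical.
        return Fraction(0)
    return 1 - Fraction(len(a & b), union)


def all_fingerprints(n_bits: int) -> list:
    """Every possible binary fingerprint on n_bits, as sets of on-bits."""
    bits = range(n_bits)
    return [frozenset(c) for r in range(n_bits + 1) for c in combinations(bits, r)]


def triangle_violations(fps: list) -> list:
    """Triples (a, b, c) with d(a, c) > d(a, b) + d(b, c)."""
    return [
        (a, b, c)
        for a, b, c in product(fps, repeat=3)
        if tanimoto_distance(a, c) > tanimoto_distance(a, b) + tanimoto_distance(b, c)
    ]


fps = all_fingerprints(4)  # 16 fingerprints, 4096 ordered triples
violations = triangle_violations(fps)
print(f"{len(fps)} fingerprints, {len(violations)} triangle-inequality violations")
```

Exact fractions are used instead of floats so that borderline cases where both sides are equal are not misreported because of rounding.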
Best,
Philipp Thiel

> From: "David Cosgrove" <davidacosgrov...@gmail.com>
> To: "Chris Earnshaw" <cgearns...@gmail.com>
> Cc: "Rdkit-discuss@lists.sourceforge.net" <rdkit-discuss@lists.sourceforge.net>,
> "James T. Metz" <jamestm...@aol.com>
> Sent: Friday, 21 September, 2018 13:45:18
> Subject: Re: [Rdkit-discuss] Butina clustering with additional output
>
> I used to have a paper that demonstrated that the Tanimoto coefficient does, in
> fact, obey the triangle inequality. I fear I lost access to it when I retired
> but maybe a determined Google expert could rediscover it.
>
> I expect James means what we used to call the cluster seed, i.e. the molecule
> the cluster was based on, rather than the mathematical centroid. Calculating
> distances from each cluster member to that would be quite straightforward as a
> post-processing step although that would roughly double the time taken.
>
> Regards,
> Dave
>
> On Fri, 21 Sep 2018 at 09:55, Chris Earnshaw <cgearns...@gmail.com> wrote:
>
>> Hi
>>
>> I'm afraid I can't help with an RDKit solution to your question, but there are a
>> couple of issues which should be borne in mind:
>>
>> 1) The centroid of a cluster is a vector mean of the fingerprints of all the
>> members of the cluster and probably will not be represented exactly by any
>> member of the cluster; in this case no structures will have a distance of 0.0
>> from the centroid. Do you want to calculate the distances from the true
>> centroid or from the structure(s) closest to the centroid?
>>
>> 2) The Tanimoto metric doesn't obey the triangle inequality and is therefore
>> sub-optimal for this kind of analysis. It's better to use an alternative which
>> does obey the triangle inequality - e.g. the Cosine metric.
>>
>> Regards,
>> Chris Earnshaw
>>
>> On Thu, 20 Sep 2018 at 21:55, James T. Metz via Rdkit-discuss
>> <rdkit-discuss@lists.sourceforge.net> wrote:
>>
>>> RDKit Discussion Group,
>>>
>>> I note that RDKit can perform Butina clustering. Given an SDF of
>>> small molecules I would like to cluster the ligands, but obtain additional
>>> information from the clustering algorithm. In particular, I would like to obtain
>>> the cluster number and Tanimoto distance from the centroid for every ligand
>>> in the SDF. The centroid would obviously have a distance of 0.00.
>>>
>>> Has anyone written additional RDKit code to extract this additional
>>> information?
>>>
>>> Thank you.
>>>
>>> Regards,
>>> Jim Metz
>
> --
> David Cosgrove
> Freelance computational chemistry and chemoinformatics developer
> http://cozchemix.co.uk
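The post-processing step Dave describes above (distance from each cluster member to the cluster seed) can be sketched roughly as follows. This is pure Python on fingerprints stored as sets of on-bits, so it runs without RDKit; the function names and the toy data are hypothetical. It assumes the clusters arrive as tuples of molecule indices with the seed listed first, which as far as I know is the layout returned by RDKit's Butina.ClusterData, but please check that against your RDKit version.

```python
def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    """Tanimoto distance (1 - similarity) between two sets of on-bits."""
    union = len(a | b)
    if union == 0:
        # Assumption: two empty fingerprints are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / union


def seed_distances(fingerprints, clusters):
    """Map molecule index -> (cluster number, distance to that cluster's seed).

    Assumes the first index in each cluster tuple is the cluster seed.
    """
    result = {}
    for cluster_id, members in enumerate(clusters):
        seed_fp = fingerprints[members[0]]
        for idx in members:
            result[idx] = (cluster_id, tanimoto_distance(seed_fp, fingerprints[idx]))
    return result


# Toy data standing in for real fingerprints and a real clustering result.
fingerprints = [
    frozenset({0, 1, 2, 3}),
    frozenset({0, 1, 2}),
    frozenset({0, 1, 2, 4}),
    frozenset({7, 8, 9}),
    frozenset({7, 8}),
]
clusters = ((0, 1, 2), (3, 4))  # seed first in each tuple

assignments = seed_distances(fingerprints, clusters)
for idx, (cluster_id, dist) in sorted(assignments.items()):
    print(f"molecule {idx}: cluster {cluster_id}, distance to seed {dist:.2f}")
```

Each seed gets a distance of 0.00 to itself, which matches what Jim asked for. With real RDKit fingerprints the inner loop would presumably be replaced by a bulk similarity call per cluster, which is where the extra cost Dave mentions comes from.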
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss