Hi, 

you probably read about the Tanimoto being a proper metric for binary data
in Leach and Gillet, 'An Introduction to Chemoinformatics', chapter 5.3.1 of
the revised edition.
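
For anyone who wants a quick sanity check rather than the reference, here is
a minimal sketch that brute-forces the triangle inequality for the Tanimoto
distance (1 - Tanimoto similarity) over every binary fingerprint of a few
bits. It is an empirical check on a tiny space, not a proof. The function
names are made up for this example and nothing here is RDKit API;
fingerprints are modelled as sets of on-bit indices, and two all-zero
fingerprints are assumed to have distance 0.

```python
from itertools import combinations, product


def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    """1 - |a & b| / |a | b| for fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        # Assumption: two empty fingerprints are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / union


def all_fingerprints(n_bits: int) -> list:
    """Every possible n_bits-wide binary fingerprint, as on-bit sets."""
    bits = range(n_bits)
    return [frozenset(c) for r in range(n_bits + 1) for c in combinations(bits, r)]


def count_triangle_violations(n_bits: int, tol: float = 1e-12) -> int:
    """Count triples (a, b, c) with d(a, c) > d(a, b) + d(b, c)."""
    fps = all_fingerprints(n_bits)
    violations = 0
    for a, b, c in product(fps, repeat=3):
        direct = tanimoto_distance(a, c)
        via_b = tanimoto_distance(a, b) + tanimoto_distance(b, c)
        if direct > via_b + tol:  # tol absorbs floating-point noise
            violations += 1
    return violations


# 4 bits gives 16 fingerprints and 4096 ordered triples.
print(count_triangle_violations(4))
```

Note that this only covers binary fingerprints, which is the case the book
chapter refers to.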

Best, 
Philipp Thiel 

> From: "David Cosgrove" <davidacosgrov...@gmail.com>
> To: "Chris Earnshaw" <cgearns...@gmail.com>
> Cc: "Rdkit-discuss@lists.sourceforge.net" 
> <rdkit-discuss@lists.sourceforge.net>,
> "James T. Metz" <jamestm...@aol.com>
> Sent: Friday, 21 September, 2018 13:45:18
> Subject: Re: [Rdkit-discuss] Butina clustering with additional output

> I used to have a paper that demonstrated that the Tanimoto coefficient does,
> in fact, obey the triangle inequality. I fear I lost access to it when I
> retired but maybe a determined Google expert could rediscover it.
> I expect James means what we used to call the cluster seed, i.e. the molecule
> the cluster was based on, rather than the mathematical centroid. Calculating
> distances from each cluster member to that would be quite straightforward as a
> post-processing step although that would roughly double the time taken.
> Regards,
> Dave
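
The post-processing step Dave describes could look roughly like the sketch
below. It assumes the clustering result is a tuple of index tuples with the
seed listed first in each cluster, which is the layout RDKit's
Butina.ClusterData returns (it calls that first element the centroid). To
keep the example self-contained it does not import RDKit: fingerprints are
plain sets of on-bit indices, and seed_distances is a hypothetical helper
name, not an existing function.

```python
def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    """1 - Tanimoto similarity for fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints are identical
    return 1.0 - len(a & b) / union


def seed_distances(fingerprints, clusters):
    """Return (molecule_index, cluster_number, distance_to_seed) per molecule.

    clusters: iterable of index tuples, seed first in each tuple (assumed).
    """
    rows = []
    for cluster_number, members in enumerate(clusters):
        seed_fp = fingerprints[members[0]]
        for idx in members:
            rows.append((idx, cluster_number, tanimoto_distance(seed_fp, fingerprints[idx])))
    return sorted(rows)  # back in input order, one row per molecule


# Toy data standing in for real fingerprints and a real clustering result.
fingerprints = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3}),
    frozenset({1, 2, 3, 5}),
    frozenset({7, 8, 9}),
    frozenset({7, 8}),
]
clusters = ((0, 1, 2), (3, 4))

rows = seed_distances(fingerprints, clusters)
for idx, cluster_number, dist in rows:
    print(f"mol {idx}: cluster {cluster_number}, distance to seed {dist:.2f}")
```

Each seed comes out with a distance of 0.00, as Jim expected, and the extra
cost is one distance calculation per molecule.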

> On Fri, 21 Sep 2018 at 09:55, Chris Earnshaw <cgearns...@gmail.com> wrote:

>> Hi

>> I'm afraid I can't help with an RDKit solution to your question, but there
>> are a couple of issues which should be borne in mind:
>> 1) The centroid of a cluster is a vector mean of the fingerprints of all the
>> members of the cluster and probably will not be represented exactly by any
>> member of the cluster; in this case no structures will have a distance of 0.0
>> from the centroid. Do you want to calculate the distances from the true
>> centroid or from the structure(s) closest to the centroid?
>> 2) The Tanimoto metric doesn't obey the triangle inequality and is therefore
>> sub-optimal for this kind of analysis. It's better to use an alternative
>> which does obey the triangle inequality - e.g. the Cosine metric.
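
To illustrate point 1, here is a small sketch of a "true" centroid as the
vector mean of the members' fingerprints. It assumes the usual
continuous-valued form of the Tanimoto coefficient,
a.b / (|a|^2 + |b|^2 - a.b), since the mean vector is no longer binary; the
three fingerprints are invented toy data and the function names are
hypothetical.

```python
def continuous_tanimoto_distance(a, b):
    """1 - a.b / (|a|^2 + |b|^2 - a.b) for real-valued vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denom == 0:
        return 0.0  # assumption: two all-zero vectors are identical
    return 1.0 - dot / denom


def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Three toy binary fingerprints forming one cluster.
members = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 1, 1, 1, 0],
]

center = centroid(members)  # fractional values, so not a valid fingerprint
distances = [continuous_tanimoto_distance(m, center) for m in members]
closest = min(range(len(members)), key=distances.__getitem__)

for i, d in enumerate(distances):
    print(f"member {i}: distance to centroid {d:.4f}")
print(f"member closest to the centroid: {closest}")
```

No member sits at distance 0 from the mean vector here, so reporting
distances to the closest real structure (or to the cluster seed) is a
separate choice from reporting distances to the true centroid.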

>> Regards,
>> Chris Earnshaw

>> On Thu, 20 Sep 2018 at 21:55, James T. Metz via Rdkit-discuss
>> <rdkit-discuss@lists.sourceforge.net> wrote:

>>> RDKit Discussion Group,

>>> I note that RDKit can perform Butina clustering. Given an SDF of
>>> small molecules I would like to cluster the ligands, but obtain additional
>>> information from the clustering algorithm. In particular, I would like to
>>> obtain the cluster number and Tanimoto distance from the centroid for
>>> every ligand in the SDF. The centroid would obviously have a distance of
>>> 0.00.

>>> Has anyone written additional RDKit code to extract this additional
>>> information?
>>> Thank you.

>>> Regards,
>>> Jim Metz

>>> _______________________________________________
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

> --
> David Cosgrove
> Freelance computational chemistry and chemoinformatics developer
> http://cozchemix.co.uk
