Hi Francesca,

Adding to David's comment, we do have some material for beginners that also 
covers and applies Butina clustering, which may be useful: 
https://github.com/volkamerlab/teachopencadd/blob/master/teachopencadd/talktorials/T005_compound_clustering/talktorial.ipynb


Best, Andrea


----

Prof. Dr. Andrea Volkamer

In silico Toxicology and Structural Bioinformatics<https://volkamerlab.org/>,
Institute of Physiology, Charité Universitätsmedizin Berlin

Campus Mitte: Virchowweg 6, 10117 Berlin
Phone: +49 30 - 450 528 504
E-Mail: andrea.volka...@charite.de<mailto:andrea.volka...@charite.de>


________________________________
From: David Cosgrove <davidacosgrov...@gmail.com>
Sent: Wednesday, 21 July 2021 14:01:03
To: Francesca Magarotto - francesca.magarot...@studio.unibo.it
Cc: RDKit Discuss
Subject: [ext] Re: [Rdkit-discuss] Taylor-Butina clustering

Hi Francesca,

The Taylor-Butina clustering is not hierarchical.  It is a type of sphere 
exclusion algorithm.  A useful image for the results would be the "centroid" of 
each cluster, possibly followed by the other cluster members.  You will need to 
generate the images from the original input molecules, not the fingerprints, 
and you'll need to write some extra code to read the clusters and do this.  The 
Getting Started document 
(https://www.rdkit.org/docs/GettingStartedInPython.html) should help you with 
the image generation.

Technically, the centroids aren't proper centroids; they are the molecules that 
each cluster is based on.  The true centroid would be some sort of average of 
the fingerprints of the molecules in the cluster, which itself would not be a 
molecule.

Dealing with false singletons is a matter of taste, as they are an artifact of 
the clustering method.  One approach I have had success with in the past is to 
define a second, looser similarity threshold and put each false singleton into 
the cluster whose centroid it is most similar to, so long as it is within this 
new threshold.  False singletons are certainly more common than true ones in my 
experience.

The threshold you use for the clustering should be chosen with some care, and 
will depend on the fingerprint type more than anything else.  Greg did a blog 
post recently 
(https://greglandrum.github.io/rdkit-blog/similarity/reference/2021/05/26/similarity-threshold-observations1.html) 
on selecting a threshold for similarity searching, and those suggestions are 
probably a good place to start for this, too.
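
To make the sphere-exclusion idea and the false-singleton step concrete, here is 
a minimal pure-Python sketch. It is not RDKit's implementation: fingerprints are 
stand-in sets of on-bits, the function names (butina_sketch, 
reassign_false_singletons) and both cutoffs are hypothetical, and it works with 
a similarity cutoff, whereas RDKit's Butina.ClusterData takes a distance 
threshold. It also assumes that false singletons are only merged into clusters 
that already have more than one member, which is one possible reading of the 
approach described above.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina_sketch(fps, sim_cutoff):
    """Sphere-exclusion clustering; the first index of each cluster is its 'centroid'."""
    n = len(fps)
    # Neighbours of each molecule: everything at or above the similarity cutoff.
    neighbours = [
        {j for j in range(n) if j != i and tanimoto(fps[i], fps[j]) >= sim_cutoff}
        for i in range(n)
    ]
    # Visit molecules with the most neighbours first (ties broken by index).
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))
    assigned, clusters = set(), []
    for i in order:
        if i in assigned:
            continue
        # The cluster is the centroid plus its neighbours that are still free.
        members = sorted(neighbours[i] - assigned)
        clusters.append([i] + members)
        assigned.add(i)
        assigned.update(members)
    return clusters, neighbours


def reassign_false_singletons(fps, clusters, neighbours, loose_cutoff):
    """Move false singletons into the most similar multi-member cluster (assumption)."""
    merged = {c[0]: list(c) for c in clusters if len(c) > 1}
    leftover = []
    for c in clusters:
        if len(c) > 1:
            continue
        idx = c[0]
        if not neighbours[idx]:
            leftover.append([idx])  # true singleton: no neighbours at all
            continue
        # False singleton: it had neighbours, but other clusters claimed them.
        best = max(merged, key=lambda k: tanimoto(fps[idx], fps[k]), default=None)
        if best is not None and tanimoto(fps[idx], fps[best]) >= loose_cutoff:
            merged[best].append(idx)
        else:
            leftover.append([idx])
    return list(merged.values()) + leftover


# Toy data: molecule 3 is a neighbour of 1 only, molecule 4 has no neighbours.
fps = [
    {1, 2, 3, 4},
    {1, 2, 3, 4, 5},
    {1, 2, 3, 4, 6},
    {1, 2, 3, 4, 5, 7, 8},
    {20, 21, 22},
]
clusters, neighbours = butina_sketch(fps, sim_cutoff=0.7)
print(clusters)  # [[0, 1, 2], [3], [4]]
final = reassign_false_singletons(fps, clusters, neighbours, loose_cutoff=0.55)
print(final)     # [[0, 1, 2, 3], [4]]
```

Each molecule is assigned exactly once and no tree is built, which is why there 
is no dendrogram to draw. For a dataset of this size the all-pairs neighbour 
list in the sketch would be far too slow; it is only meant to show the logic.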

Best,
Dave


On Wed, Jul 21, 2021 at 8:58 AM Francesca Magarotto 
<francesca.magarot...@studio.unibo.it> wrote:
Hi,
I managed to perform Taylor-Butina clustering on a dataset of 193 571 
fragments retrieved from ZINC20.
I followed the instructions at this link 
https://www.macinchem.org/reviews/clustering/clustering.php
I have never used RDKit before and have never done a cluster analysis, so I'm 
really new to this type of work. I've read the paper on Taylor-Butina 
clustering (https://pubs.acs.org/doi/10.1021/ci9803381), but I don't understand 
whether it can be considered a hierarchical method or not.
Could someone help me understand this?
Moreover, I have some problems generating the images after clustering.
First, I don't know which images I need: if it's hierarchical I should draw a 
dendrogram, but if it isn't hierarchical there's no need (I think).
I only managed to obtain an image of a sparse similarity matrix; there is not 
enough RAM to build a dense matrix.
I wasn't able to plot the clusters or to obtain images of the molecules that 
are centroids or false singletons (I tried using RDKit to obtain images from 
fingerprints, but the images of the molecules look strange). I have thousands 
of clusters and false singletons as results.
Has anyone done something like this in the past? Any suggestions?
I worked out my own explanation of what false and true singletons are (I obtain 
only false singletons, is that normal?), but I would appreciate it if someone 
more expert could explain them and confirm my guess.
I'm sorry for all these questions, but I'm really new to this topic.
I hope someone can help me.
Kind regards.
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net<mailto:Rdkit-discuss@lists.sourceforge.net>
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


--
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk
