Hi Francesca,
adding to David's comment, we do have some material for beginners that also covers and applies Butina clustering, which may be useful:
https://github.com/volkamerlab/teachopencadd/blob/master/teachopencadd/talktorials/T005_compound_clustering/talktorial.ipynb

Best,
Andrea

----
Prof. Dr. Andrea Volkamer
In silico Toxicology and Structural Bioinformatics (https://volkamerlab.org/),
Institute of Physiology, Charité Universitätsmedizin Berlin
Campus Mitte: Virchowweg 6, 10117 Berlin
Phone: +49 30 - 450 528 504
E-Mail: andrea.volka...@charite.de

________________________________
From: David Cosgrove <davidacosgrov...@gmail.com>
Sent: Wednesday, 21 July 2021 14:01:03
To: Francesca Magarotto - francesca.magarot...@studio.unibo.it
Cc: RDKit Discuss
Subject: [ext] Re: [Rdkit-discuss] Taylor-Butina clustering

Hi Francesca,

The Taylor-Butina clustering is not hierarchical. It is a type of sphere exclusion algorithm. A useful image for the results would be the "centroid" of each cluster, possibly followed by the other cluster members. You will need to generate the images from the original input molecules, not the fingerprints. You'll need to write some extra code to read the clusters and do this. The Getting Started document (https://www.rdkit.org/docs/GettingStartedInPython.html) should help you with the image generation.

Technically, the centroids aren't proper centroids; they are the molecules that each cluster is based on. The true centroid would be some sort of average of the fingerprints of the molecules in the cluster, which itself would not be a molecule.

Dealing with false singletons is a matter of taste, as they are an artifact of the clustering method. One approach I have had success with in the past is to define a second, looser similarity threshold and put each false singleton into the cluster whose centroid it is most similar to, so long as it is within this new threshold.
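To make the sphere exclusion idea and the handling of singletons concrete, here is a minimal pure-Python sketch. It is not RDKit's implementation: the function names, the toy similarity matrix, and both thresholds are hypothetical and chosen only for illustration. A true singleton is taken to be a molecule with no neighbours inside the threshold, and a false singleton one whose neighbours were all claimed by earlier clusters.

```python
def butina_sketch(sim, threshold):
    """Simplified Taylor-Butina (sphere exclusion) pass over a similarity matrix.

    sim is a square, symmetric list of lists; threshold is a similarity cutoff.
    Returns (clusters, true_singletons, false_singletons); the first index in
    each cluster is the molecule the cluster is based on (the "centroid").
    """
    n = len(sim)
    # Neighbours of each molecule within the similarity threshold (excluding itself).
    neighbours = [
        {j for j in range(n) if j != i and sim[i][j] >= threshold} for i in range(n)
    ]
    # Visit molecules with the most neighbours first (ties keep input order).
    order = sorted(range(n), key=lambda i: len(neighbours[i]), reverse=True)

    assigned = set()
    clusters, true_singletons, false_singletons = [], [], []
    for i in order:
        if i in assigned:
            continue  # already excluded by an earlier centroid
        members = [i] + sorted(neighbours[i] - assigned)
        assigned.update(members)
        if len(members) > 1:
            clusters.append(members)
        elif neighbours[i]:
            false_singletons.append(i)  # had neighbours, but all were taken
        else:
            true_singletons.append(i)   # no neighbours at this threshold
    return clusters, true_singletons, false_singletons


def reassign_false_singletons(sim, clusters, false_singletons, loose_threshold):
    """Attach each false singleton to the cluster with the most similar centroid,
    provided that similarity reaches the second, looser threshold."""
    merged = [list(c) for c in clusters]
    leftover = []
    for i in false_singletons:
        if not merged:
            leftover.append(i)
            continue
        best = max(range(len(merged)), key=lambda k: sim[i][merged[k][0]])
        if sim[i][merged[best][0]] >= loose_threshold:
            merged[best].append(i)
        else:
            leftover.append(i)
    return merged, leftover


# Toy similarity matrix for five hypothetical molecules.
sim = [
    [1.0, 0.9, 0.8, 0.6, 0.1],
    [0.9, 1.0, 0.5, 0.3, 0.1],
    [0.8, 0.5, 1.0, 0.8, 0.2],
    [0.6, 0.3, 0.8, 1.0, 0.1],
    [0.1, 0.1, 0.2, 0.1, 1.0],
]

clusters, true_s, false_s = butina_sketch(sim, threshold=0.7)
merged, leftover = reassign_false_singletons(sim, clusters, false_s, loose_threshold=0.55)
print(clusters, true_s, false_s)  # [[0, 1, 2]] [4] [3]
print(merged, leftover)           # [[0, 1, 2, 3]] []
```

In this toy case molecule 3 is similar to molecule 2, but molecule 2 is claimed by the cluster built on molecule 0, so molecule 3 ends up alone until the looser threshold lets it join that cluster. Molecule 4 has no neighbours at all and stays a true singleton.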
False singletons are certainly more common than true ones in my experience.

The threshold you use for the clustering should be chosen with some care, and will depend on the fingerprint type more than anything else. Greg did a blog post recently (https://greglandrum.github.io/rdkit-blog/similarity/reference/2021/05/26/similarity-threshold-observations1.html) on selecting a threshold for similarity searching, and those suggestions are probably a good place to start for this, too.

Best,
Dave

On Wed, Jul 21, 2021 at 8:58 AM Francesca Magarotto - francesca.magarot...@studio.unibo.it wrote:

Hi,
I managed to perform Taylor-Butina clustering on a dataset of 193 571 fragments retrieved from ZINC20. I followed the instructions at this link: https://www.macinchem.org/reviews/clustering/clustering.php
I've never used RDKit before and have never done a cluster analysis, so I'm really new to this type of work.
I've read the paper on Taylor-Butina clustering (https://pubs.acs.org/doi/10.1021/ci9803381), but I don't understand whether it can be considered a hierarchical method or not. Could someone help me understand this?

Moreover, I've had some problems generating the images after clustering. First, I don't know what images I need: if it's hierarchical I should make a dendrogram, but if it isn't hierarchical there's no need (I think). I only managed to obtain the image of a sparse similarity matrix, because there isn't enough RAM to build a dense matrix. I wasn't able to plot the clusters or to obtain images of the molecules that are centroids or false singletons (I've tried using RDKit to obtain images from fingerprints, but the images of the molecules are strange).
I have thousands of clusters and false singletons as results. Has anyone done something like this in the past? Any suggestions?
I came up with my own explanation of what false and true singletons are (I obtain only false singletons, is that normal?), but I would appreciate it if someone more expert could explain this and confirm my guess.
I'm sorry for all these questions, but I'm really new to this topic. I hope someone can help me. Kind regards.

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

--
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk
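As a rough illustration of the workflow discussed in this thread, the following sketch clusters a handful of molecules with RDKit's Butina implementation and draws the centroid of each cluster from the original molecules rather than from the fingerprints. The SMILES strings, the Morgan fingerprint settings, and the distance cutoff are all assumptions made for the example; they are not values recommended anywhere in the thread.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import Draw, rdFingerprintGenerator
from rdkit.ML.Cluster import Butina

# Hypothetical toy input; a real run would read the fragments from a file.
smiles = ["CCO", "CCCO", "CCCCO", "c1ccccc1", "Cc1ccccc1", "CC(=O)O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# Morgan fingerprints (radius and size are assumed values).
generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = [generator.GetFingerprint(m) for m in mols]

# Butina.ClusterData takes the lower triangle of the distance matrix as a
# flat list. Its length grows quadratically with the number of molecules.
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend(1.0 - s for s in sims)

cutoff = 0.65  # assumed distance cutoff, i.e. 1 - similarity threshold
clusters = Butina.ClusterData(dists, len(fps), cutoff, isDistData=True)

# The first index in each cluster is the molecule the cluster is based on.
centroids = [mols[c[0]] for c in clusters]
legends = [f"cluster {k}: {len(c)} member(s)" for k, c in enumerate(clusters)]

# Draw the centroids from the original molecules, not the fingerprints.
img = Draw.MolsToGridImage(
    centroids, molsPerRow=3, subImgSize=(200, 200), legends=legends, returnPNG=False
)
# img is a PIL image outside a notebook; img.save("centroids.png") would write it to disk.
```

With thousands of clusters it is probably more practical to draw only the largest ones, or to write one grid image per batch of centroids.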