Hi Francesca,

Taylor-Butina clustering is not hierarchical.  It is a type of sphere
exclusion algorithm, so it gives you a flat set of clusters and there is
no dendrogram to draw.  A useful image for the results would be the
"centroid" of each cluster, possibly followed by the other cluster
members.  You will need to generate the images from the original input
molecules, not the fingerprints, so you'll need to write some extra code
to read the clusters and do this.  The Getting Started document
(https://www.rdkit.org/docs/GettingStartedInPython.html)
should help you with the image generation.

Technically, the centroids aren't proper centroids; they are the molecules
that each cluster is built around.  A true centroid would be some sort of
average of the fingerprints of the molecules in the cluster, which would
not itself be a molecule.

Dealing with false singletons is a matter of taste, as they are an
artifact of the clustering method.  One approach I have had success with
in the past is to define a second, looser similarity threshold and put
each false singleton into the cluster whose centroid it is most similar
to, so long as it is within this new threshold.  False singletons are
certainly more common than true ones in my experience.
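
As a rough illustration of the singleton reassignment described above, here
is a minimal sketch in plain Python.  The function name, the `similarity`
callable and the toy bit sets are all made up for the example.  It assumes
the clusters are tuples of molecule indices with the centroid first (which
is how RDKit's Butina.ClusterData returns them), it only merges singletons
into clusters that already have more than one member, and it does not try
to tell true singletons from false ones.  With RDKit fingerprints the
`similarity` function would typically wrap DataStructs.TanimotoSimilarity.

```python
def reassign_false_singletons(clusters, similarity, loose_threshold):
    """Merge singletons into the most similar multi-member cluster.

    clusters: iterable of index tuples, centroid first (assumed layout).
    similarity: hypothetical callable, similarity(i, j) -> float in [0, 1].
    loose_threshold: the second, looser similarity threshold.
    """
    multi = [list(c) for c in clusters if len(c) > 1]
    singletons = [c[0] for c in clusters if len(c) == 1]
    leftover = []
    for s in singletons:
        best_sim, best_cluster = 0.0, None
        for c in multi:
            # c[0] is the centroid; appended members never displace it
            sim = similarity(s, c[0])
            if sim > best_sim:
                best_sim, best_cluster = sim, c
        if best_cluster is not None and best_sim >= loose_threshold:
            best_cluster.append(s)
        else:
            leftover.append([s])  # stays a singleton
    return multi + leftover


# Toy data: each "fingerprint" is just a set of on-bits (illustration only).
bits = {0: {1, 2, 3, 4}, 1: {1, 2, 3, 4, 5}, 2: {1, 2, 3, 6, 7}, 3: {20, 21, 22}}


def tanimoto(i, j):
    a, b = bits[i], bits[j]
    return len(a & b) / len(a | b)


clusters = ((0, 1), (2,), (3,))
merged = reassign_false_singletons(clusters, tanimoto, loose_threshold=0.4)
print(merged)
```

In the toy data, molecule 2 has a Tanimoto similarity of 0.5 to centroid 0,
so it joins that cluster, while molecule 3 shares no bits with any centroid
and remains a singleton.
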
The threshold you use for the clustering should be chosen with some care,
and will depend on the fingerprint type more than anything else.  Greg did
a blog post recently (
https://greglandrum.github.io/rdkit-blog/similarity/reference/2021/05/26/similarity-threshold-observations1.html)
on selecting a threshold for similarity searching, and those suggestions
are probably a good place to start for this, too.
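
To make the image suggestion a bit more concrete, here is a minimal sketch
of clustering a handful of molecules and drawing one cluster with its
"centroid" first.  The helper names, the SMILES, the Morgan fingerprint
settings and the 0.65 similarity threshold are placeholders I picked for
the example, not recommendations.  Note that Butina.ClusterData takes a
distance cutoff, so a similarity threshold has to be converted with
1 - similarity.  The sketch also builds the full lower triangle of
distances in memory, so it is only practical for modest set sizes.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import Draw, rdFingerprintGenerator
from rdkit.ML.Cluster import Butina


def butina_clusters(mols, sim_threshold=0.65, radius=2, fp_size=2048):
    """Return clusters as tuples of molecule indices, centroid first."""
    gen = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=fp_size)
    fps = [gen.GetFingerprint(m) for m in mols]
    # Flattened lower triangle of the distance matrix (1 - Tanimoto)
    dists = []
    for i in range(1, len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend(1.0 - s for s in sims)
    return Butina.ClusterData(dists, len(fps), 1.0 - sim_threshold, isDistData=True)


def draw_cluster(mols, cluster, max_members=8):
    """Grid image of one cluster, drawn from the molecules themselves."""
    idxs = list(cluster[:max_members])
    legends = ["centroid"] + [f"member {i}" for i in idxs[1:]]
    return Draw.MolsToGridImage(
        [mols[i] for i in idxs],
        molsPerRow=4,
        subImgSize=(200, 200),
        legends=legends,
    )


# Placeholder input; in practice these would be the original input molecules.
smiles = ["CCO", "CCCO", "CCCCO", "c1ccccc1", "Cc1ccccc1", "CC(=O)O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
clusters = butina_clusters(mols)
for n, cluster in enumerate(clusters):
    print(n, "centroid:", smiles[cluster[0]], "size:", len(cluster))
```

Outside a notebook, draw_cluster(mols, clusters[0]) should give a PIL image
that can be written out with .save("cluster_0.png").
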

Best,
Dave


On Wed, Jul 21, 2021 at 8:58 AM Francesca Magarotto -
francesca.magarot...@studio.unibo.it <francesca.magarot...@studio.unibo.it>
wrote:

> Hi,
> I managed to perform Taylor-Butina clustering on a dataset of 193 571
> fragments retrieved from ZINC20.
> I used the indications in this link
> https://www.macinchem.org/reviews/clustering/clustering.php
> Actually, I've never used RDKit before and never did a cluster analysis,
> so I'm really new to this type of work. I've read the paper related to
> Taylor-Butina clustering (https://pubs.acs.org/doi/10.1021/ci9803381),
> but I don't understand if it can be considered a hierarchical method or not.
> Could someone help me understand this?
> Moreover, I've got some problems generating the images after clustering.
> First, I don't know what images I need: if it's hierarchical I should do a
> dendrogram, but if it isn't hierarchical there's no need (I think).
> I only managed to obtain the image of a sparse similarity matrix, but the
> RAM is too small to obtain a dense matrix.
> I wasn't able to do the plot of the clusters or to obtain the images of
> the molecules that are centroids or false singletons (I've tried using
> RDKit to obtain images from fingerprints but the images of the molecules
> are strange). I have thousands of clusters and false singletons as results.
> Has someone done something like that in the past? Any suggestions?
> I worked out my own explanation of what false and true singletons are
> (I obtain only false singletons, is that normal?), but I would appreciate
> it if someone more expert could explain this and confirm my guess.
> I'm sorry for all these questions, but I'm really new to this topic.
> Hope someone can help me,
> kind regards.
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>


-- 
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk