Hi,

I have a question about the cutoff in the Taylor-Butina algorithm.

I retrieved a set of 190,792 molecules in SMILES format from ZINC15. I split this dataset so that I could first run the cluster analysis on two small subsets only (one contains 310 molecules, the other 1396 molecules). Then I ran the cluster analysis on the full dataset of 190,792 molecules as well, adapting the protocol. I followed the examples at this link: https://www.macinchem.org/reviews/clustering/clustering.php

For the two small subsets (310 and 1396 molecules, respectively) I performed the cluster analysis with this:

    clusters=ClusterFps(fps,cutoff=0.4)

For both small subsets I first chose a cutoff of 0.4 (just like in the example) and then a cutoff of 0.2. I think (but I'm not sure) that this cutoff is a distance value, so 0.4 corresponds to a similarity of 0.6 and 0.2 corresponds to a similarity of 0.8. This would mean that, for instance, with a cutoff of 0.4 the algorithm creates clusters around centroids that have a similarity with the other members of the cluster equal to 0.6. Is that correct?

Then, when performing the cluster analysis on the full dataset of 190,792 molecules, I followed the advice in the link and used chemfp. Judging from the example in the link, I think (but again I'm not sure) that the cutoff there is different: it is no longer a distance value but a similarity value. So, to match the cluster analysis already performed on the two small subsets, I assumed I have to use 0.6 (corresponding to a distance of 0.4) and 0.8 (corresponding to a distance of 0.2).

Have I applied (and understood) the cutoff values correctly?

Thank you, regards.
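
P.S. To make the question concrete, here is a minimal sketch of how I understand the two conventions. It is plain Python and is neither the RDKit nor the chemfp implementation: the fingerprints are made-up sets of on-bits, and the function names (tanimoto, butina_by_distance, butina_by_similarity) and the inclusive handling of the boundary (<= and >=) are my own assumptions for illustration. If my reading is right, a distance cutoff d and a similarity threshold 1 - d should give the same clusters.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def _assign(neighbours):
    """Taylor-Butina assignment: largest neighbour list first, each item used once."""
    order = sorted(range(len(neighbours)), key=lambda i: -len(neighbours[i]))
    seen, clusters = set(), []
    for centroid in order:
        if centroid in seen:
            continue
        members = [centroid] + sorted(m for m in neighbours[centroid] if m not in seen)
        seen.update(members)
        clusters.append(tuple(members))  # centroid is always the first element
    return clusters


def _neighbour_lists(fps, is_neighbour):
    neighbours = [set() for _ in fps]
    for i, j in combinations(range(len(fps)), 2):
        if is_neighbour(tanimoto(fps[i], fps[j])):
            neighbours[i].add(j)
            neighbours[j].add(i)
    return neighbours


def butina_by_distance(fps, cutoff):
    """Neighbours are pairs whose Tanimoto DISTANCE (1 - similarity) is <= cutoff."""
    return _assign(_neighbour_lists(fps, lambda sim: 1.0 - sim <= cutoff))


def butina_by_similarity(fps, threshold):
    """Neighbours are pairs whose Tanimoto SIMILARITY is >= threshold."""
    return _assign(_neighbour_lists(fps, lambda sim: sim >= threshold))


# Hypothetical toy fingerprints (sets of on-bit indices), not real molecules.
fps = [
    {1, 2, 3, 4, 5},
    {1, 2, 3, 4, 6},
    {1, 2, 3, 4, 5, 6},
    {10, 11, 12},
    {10, 11, 13},
]

for d in (0.4, 0.2):
    by_dist = butina_by_distance(fps, d)
    by_sim = butina_by_similarity(fps, 1.0 - d)
    print(d, by_dist, by_dist == by_sim)
```

In this sketch every non-centroid member of a cluster is within the distance cutoff of its centroid, so with a cutoff of 0.4 its similarity to the centroid is 0.6 or higher rather than exactly 0.6. Is this also what happens in the real implementations?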
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss