Hi,
I have a question about the cutoff in the Taylor-Butina algorithm.
I retrieved a set of 190,792 molecules in SMILES format from ZINC15.
I first ran the cluster analysis on two small subsets of this dataset 
(one contains 310 molecules and the other 1396 molecules).
Then I ran the cluster analysis on the full dataset of 190,792 molecules, 
adapting the protocol.
I followed the examples at this link:
https://www.macinchem.org/reviews/clustering/clustering.php
For the two small subsets (310 and 1396 molecules, respectively), I ran the 
cluster analysis with:

clusters=ClusterFps(fps,cutoff=0.4)

For both small subsets, I first chose a cutoff of 0.4 (just like in the 
example) and then a cutoff of 0.2.
I think (but I'm not sure) that this cutoff is a distance value, so 0.4 
corresponds to a similarity of 0.6 and 0.2 corresponds to a similarity of 0.8. 
This would mean that, for instance, with a cutoff of 0.4 the algorithm creates 
clusters around centroids that have a similarity with the other members of the 
cluster equal to 0.6.
Is that correct?
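To make the question concrete, the interpretation above can be sketched in plain Python. This is only a toy illustration under stated assumptions: fingerprints are represented as sets of on-bits, `tanimoto` and `butina` are hypothetical helper names, and the code is not RDKit's implementation of the algorithm.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def butina(fps, cutoff):
    """Toy Taylor-Butina clustering using a *distance* cutoff.

    Returns a list of tuples of indices; the first index of each tuple
    is the cluster centroid.
    """
    n = len(fps)
    # Neighbours of i: every j with distance (1 - similarity) <= cutoff.
    neighbours = [
        {j for j in range(n)
         if j != i and 1.0 - tanimoto(fps[i], fps[j]) <= cutoff}
        for i in range(n)
    ]
    unassigned = set(range(n))
    clusters = []
    # Visit molecules from most to fewest neighbours (ties broken by index).
    for i in sorted(range(n), key=lambda k: (-len(neighbours[k]), k)):
        if i not in unassigned:
            continue
        members = sorted(neighbours[i] & unassigned)
        clusters.append((i, *members))
        unassigned -= {i, *members}
    return clusters


# Made-up fingerprints, for illustration only.
fps = [
    {1, 2, 3, 4, 5},
    {1, 2, 3, 4, 6},
    {1, 2, 3, 4, 5, 7},
    {10, 11, 12},
    {10, 11, 12, 13},
]

for cutoff in (0.4, 0.2):
    print(cutoff, butina(fps, cutoff))
# cutoff 0.4 -> [(0, 1, 2), (3, 4)]
# cutoff 0.2 -> [(0, 2), (1,), (3,), (4,)]
```

In this toy version, the members of cluster (0, 1, 2) at cutoff 0.4 have similarities of about 0.67 and 0.83 to the centroid, so the cutoff behaves as a maximum distance (a minimum similarity of 0.6) rather than an exact value. Whether the real implementation behaves the same way is part of what I am asking.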
Then, when performing the cluster analysis on the full dataset of 190,792 
molecules, I followed the advice in the link (I used Chemfp). As you can see 
from the example in the link, I think (but again I'm not sure) that for this 
cluster analysis the cutoff is different (it is no longer a distance value, 
but a similarity value).
So I thought that in this case, to match the cluster analysis already 
performed on the two small subsets, I have to use 0.6 (which corresponds to a 
distance of 0.4) and 0.8 (which corresponds to a distance of 0.2).
Have I applied (and understood) the cutoff values correctly?
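The conversion I am assuming is threshold = 1 - cutoff. A minimal sketch of that assumption in plain Python follows. It does not call RDKit or Chemfp, the helper names are hypothetical, and it assumes both tools compute Tanimoto similarity on the same fingerprints, which I have not verified.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def neighbours_by_distance(fps, cutoff):
    """Neighbour sets when the cutoff is a maximum distance."""
    n = len(fps)
    return [
        {j for j in range(n)
         if j != i and 1.0 - tanimoto(fps[i], fps[j]) <= cutoff}
        for i in range(n)
    ]


def neighbours_by_similarity(fps, threshold):
    """Neighbour sets when the cutoff is a minimum similarity."""
    n = len(fps)
    return [
        {j for j in range(n)
         if j != i and tanimoto(fps[i], fps[j]) >= threshold}
        for i in range(n)
    ]


# Made-up fingerprints, for illustration only.
fps = [
    {1, 2, 3, 4, 5},
    {1, 2, 3, 4, 6},
    {1, 2, 3, 4, 5, 7},
    {10, 11, 12},
    {10, 11, 12, 13},
]

for cutoff in (0.4, 0.2):
    threshold = 1.0 - cutoff  # 0.4 -> 0.6, 0.2 -> 0.8
    same = (neighbours_by_distance(fps, cutoff)
            == neighbours_by_similarity(fps, threshold))
    print(cutoff, threshold, same)
```

If the two neighbour lists agree, as they do for this toy data, then a Taylor-Butina run that depends only on those lists should give the same clusters with either convention. Pairs that fall exactly on the boundary could still differ because of floating-point rounding.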
Thank you,
regards.
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
