Hi JP,

On Wed, Jun 13, 2012 at 12:03 PM, JP <jeanpaul.ebe...@inhibox.com> wrote:
>
> I am trying to tackle the most classical of cheminformatics problems -
> clustering based on molecule similarity.
> I have a few thousand molecules in a smiles file and I know how to
> compute the similarity using my fingerprint of choice using RDKit.
> But how do I cluster the results using the toolkit? (I have found
> some code in R for the Butina from Noel -
> http://www.redbrick.dcu.ie/~noel/R_clustering.html - but considering
> this algorithm seems to be implemented already in RDKit)

Yes it is. The RDKit has implementations of Butina clustering (suitable
for large data sets) and hierarchical clustering (probably not
practically useful beyond a couple thousand data points).

> I can see that there is some clustering code in rdkit.Chem.ML.Cluster
> - but I can hardly find any examples/documentation (one question is
> what is the "Data" parameter like in ClusterData(...)
> http://www.rdkit.org/docs/api/rdkit.ML.Cluster.Butina-module.html).

By default that code uses a Euclidean distance measure. In this case
data should be the points to be clustered (a list of lists). If you have
fingerprints, you could pass in the list of fingerprints and pass a
Tanimoto-based distance function (one minus
DataStructs.TanimotoSimilarity, since distFunc has to return a distance
rather than a similarity) as the distFunc argument. This would be
comparatively slow though. With fingerprints it's more efficient to
pass in a precomputed distance matrix.

> Is there a recommended algorithm? Is it possible to generate exactly
> n clusters (like kmeans) ?

No, there's not currently a kmeans implementation.

> Can someone offer a brief overview?
> Perhaps something to cut and paste in a wiki page on the google code site?

Hmm, nice idea. How about this:
http://code.google.com/p/rdkit/wiki/ClusteringMolecules

-greg

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
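
For anyone wondering what "pass in the distance matrix" looks like in practice, here is a rough, RDKit-free sketch. It is not the RDKit implementation. It only illustrates the idea under a few assumptions: fingerprints are modelled as plain Python sets of on-bit indices, the distance is one minus Tanimoto similarity, and the matrix is stored as a flat lower triangle (row i holds the distances to points 0..i-1). The helper names tanimoto, lower_triangle_distances and butina_sketch are made up for this example. With the toolkit itself, a flat list in this layout is, as far as I know, what Butina.ClusterData(dists, nPts, cutoff, isDistData=True) takes.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Convention chosen for this sketch only: two empty fingerprints score 0.0.
    return len(a & b) / union if union else 0.0


def lower_triangle_distances(fps):
    """Flat lower-triangle distance list: d(1,0), d(2,0), d(2,1), d(3,0), ..."""
    dists = []
    for i in range(1, len(fps)):
        for j in range(i):
            dists.append(1.0 - tanimoto(fps[i], fps[j]))
    return dists


def butina_sketch(dists, n_pts, cutoff):
    """Simplified Butina-style clustering on a flat lower-triangle distance list.

    Returns a list of tuples of point indices; the first index in each
    tuple is the cluster centroid.
    """
    # 1. Build neighbor lists: two points are neighbors if their distance <= cutoff.
    neighbors = [[] for _ in range(n_pts)]
    idx = 0
    for i in range(1, n_pts):
        for j in range(i):
            if dists[idx] <= cutoff:
                neighbors[i].append(j)
                neighbors[j].append(i)
            idx += 1

    # 2. Visit points from most to fewest neighbors (ties broken by index).
    order = sorted(range(n_pts), key=lambda k: (-len(neighbors[k]), k))

    # 3. Each unassigned point becomes a centroid and claims its unassigned neighbors.
    assigned = [False] * n_pts
    clusters = []
    for centroid in order:
        if assigned[centroid]:
            continue
        members = [centroid] + [m for m in neighbors[centroid] if not assigned[m]]
        for m in members:
            assigned[m] = True
        clusters.append(tuple(members))
    return clusters


# Toy fingerprints (sets of on-bit indices), purely illustrative.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5},
       {10, 11, 12}, {10, 11, 13}, {20, 21}]
dists = lower_triangle_distances(fps)
clusters = butina_sketch(dists, len(fps), cutoff=0.45)
print(clusters)
```

The cutoff is a distance, so a smaller value gives tighter clusters and more singletons. The number of clusters falls out of the cutoff and cannot be fixed in advance, which is the difference from kmeans mentioned above.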