Hi JP,

On Wed, Jun 13, 2012 at 12:03 PM, JP <jeanpaul.ebe...@inhibox.com> wrote:
>
> I am trying to tackle the most classical of cheminformatics problems -
> clustering based on molecule similarity.
> I have a few thousand molecules in a smiles file and I know how to
> compute the similarity using my fingerprint of choice using RDKit.
> But how do I cluster the results using the toolkit?  (I have found
> some code in R for the Butina from Noel -
> http://www.redbrick.dcu.ie/~noel/R_clustering.html - but considering
> this algorithm seems to be implemented already in RDKit)

Yes it is.
The RDKit has implementations of Butina clustering (suitable for large
data sets) and hierarchical clustering (probably not practically
useful beyond a couple thousand data points).

> I can see that there is some clustering code in rdkit.Chem.ML.Cluster
> - but I can hardly find any examples/documentation (one question is
> what is the "Data" parameter like in ClusterData(...)
> http://www.rdkit.org/docs/api/rdkit.ML.Cluster.Butina-module.html).

By default that code uses a Euclidean distance measure. In that case
data should be the points to be clustered (a list of lists).
If you have fingerprints, you could pass in the list of fingerprints
and supply a Tanimoto distance function (1 - DataStructs.TanimotoSimilarity)
as the distFunc argument; distFunc has to return a distance, not a
similarity. This would be comparatively slow though. With fingerprints
it's more efficient to compute the distance matrix yourself and pass
that in with isDistData=True.
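
Here is a minimal sketch of the distance-matrix route. It assumes
RDKit topological fingerprints (Chem.RDKFingerprint), an arbitrary
distance cutoff of 0.4, and a handful of made-up SMILES. The helper
names are only illustrative, they are not part of the RDKit:

```python
from rdkit import Chem, DataStructs
from rdkit.ML.Cluster import Butina


def tanimoto_distance_matrix(fps):
    """Return the lower triangle of the Tanimoto distance matrix as a flat list.

    The order is d(1,0), d(2,0), d(2,1), d(3,0), ... which is the layout
    Butina.ClusterData expects when isDistData=True.
    """
    dists = []
    for i in range(1, len(fps)):
        # Similarity of fingerprint i to every earlier fingerprint, in one call.
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend(1.0 - s for s in sims)
    return dists


def cluster_fingerprints(fps, cutoff=0.4):
    """Butina-cluster fingerprints; cutoff is a distance (assumed value)."""
    dists = tanimoto_distance_matrix(fps)
    return Butina.ClusterData(dists, len(fps), cutoff, isDistData=True)


# Made-up example input; in practice read the molecules from the SMILES file.
smiles = ["CCO", "CCCO", "CCCCO", "c1ccccc1", "Cc1ccccc1", "CCc1ccccc1"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [Chem.RDKFingerprint(m) for m in mols]

clusters = cluster_fingerprints(fps, cutoff=0.4)
for cluster in clusters:
    # Each cluster is a tuple of indices into fps; the first is the centroid.
    print([smiles[i] for i in cluster])
```

The number of clusters is not fixed in advance here; it falls out of
the cutoff, so a smaller cutoff gives more, tighter clusters.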

> Is there a recommended algorithm?  Is it possible to generate exactly
> n clusters (like kmeans) ?

No, there's not currently a kmeans implementation.

> Can someone offer a brief overview?
> Perhaps something to cut and paste in a wiki page on the google code site?

hmm, nice idea. How about this:
http://code.google.com/p/rdkit/wiki/ClusteringMolecules

-greg

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
