On Aug 23, 2015, at 3:43 AM, Jing Lu wrote:
> If I want to cluster more than 1M molecules by ECFP4. How could I do it? If I 
> calculate the distance between every pair of molecules, the size of distance 
> matrix will be too big. Does RDKit support any heuristic clustering algorithm 
> without calculating the distance matrix of the whole library?

You should look to a third-party package, like scikit-learn from 
http://scikit-learn.org/ , for clustering. That has a very extensive set of 
clustering algorithms, including k-means at 
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html .

Though you may be interested in the note on that page: "For large scale 
learning (say n_samples > 10k) MiniBatchKMeans is probably much faster to than 
the default batch implementation." 
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html

My memory is that k-means depends on a Euclidean distance between the records, 
which is different from the usual Tanimoto (or metric-like 1-Tanimoto) in 
cheminformatics. 
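
To make the difference concrete, here is a minimal sketch in plain Python. It 
assumes each fingerprint is represented as a set of on-bit indices, which is 
only an illustration and not how RDKit or scikit-learn store them; the 
function names are made up for this example.

```python
import math

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)

def euclidean(a, b):
    """Euclidean distance between the same fingerprints viewed as 0/1 vectors."""
    # Each bit set in exactly one fingerprint adds 1 to the squared distance.
    return math.sqrt(len(a ^ b))

# Two sparse fingerprints that differ in two bits ...
fp_small_1 = {1, 2}
fp_small_2 = {1, 3}
# ... and two dense fingerprints that also differ in two bits.
fp_large_1 = set(range(100)) | {200}
fp_large_2 = set(range(100)) | {201}

d_small = euclidean(fp_small_1, fp_small_2)
d_large = euclidean(fp_large_1, fp_large_2)
t_small = tanimoto(fp_small_1, fp_small_2)
t_large = tanimoto(fp_large_1, fp_large_2)
```

Both pairs have the same Euclidean distance, because it only counts the bits 
that differ, while the Tanimoto similarity is 1/3 for the first pair and 
100/102 for the second, because it also accounts for the shared bits.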

If you would rather use Tanimoto, then perhaps try a method like 
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
 ?

If you go that route, and you want to build the full 1M x 1M distance matrix, 
the usual approach is to ignore similarities below a given threshold T (e.g., 
T=0.8). This can be thought of as either setting those entries to 0.0, or 
specifying an "ignore" flag. In either case, the result can be stored in a 
sparse matrix, which is efficient at storing only the data of interest.
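
As a rough sketch of the thresholding idea, assuming fingerprints are sets of 
on-bit indices and using a plain dict as the sparse store (both are choices 
made for this example, as are the function names):

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)

def sparse_similarities(fps, threshold):
    """Return {(i, j): similarity} for pairs i < j with similarity >= threshold."""
    kept = {}
    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            s = tanimoto(fps[i], fps[j])
            if s >= threshold:
                kept[(i, j)] = s  # pairs below the threshold are never stored
    return kept

fps = [
    {1, 2, 3, 4, 5},
    {1, 2, 3, 4, 5, 6},   # 5/6 similar to the first fingerprint
    {10, 11, 12},
    {10, 11, 12, 13},     # 3/4 similar to the third, below the threshold
    {20},
]
sparse = sparse_similarities(fps, threshold=0.8)
```

Only the (0, 1) pair survives the 0.8 threshold, so one entry is stored 
instead of ten. The double loop is still quadratic in time and is only meant 
to show what ends up in the sparse structure.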

Using my package, chemfp, from http://chemfp.com/ you can compute a sparse 
matrix for 1M x 1M fingerprints in about an hour using a laptop or desktop.

The question would then be how to adapt the sparse output format from chemfp to 
the sparse input format for your clustering method of choice.
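
One possible shape for that adapter, as a sketch only: it assumes the 
similarity search results have already been read into (i, j, similarity) 
triples, which is a hypothetical intermediate form and not chemfp's actual 
output format. It builds the three CSR arrays of 1 - similarity distances, the 
same (data, indices, indptr) layout that scipy.sparse.csr_matrix accepts.

```python
def to_csr_distance(n, hits):
    """Build CSR arrays (data, indices, indptr) holding 1 - similarity.

    `hits` is an iterable of (i, j, similarity) with i != j. Each pair is
    stored in both directions so the result is symmetric.
    """
    rows = [[] for _ in range(n)]
    for i, j, sim in hits:
        d = 1.0 - sim
        rows[i].append((j, d))
        rows[j].append((i, d))

    data, indices, indptr = [], [], [0]
    for row in rows:
        for j, d in sorted(row):      # column indices in increasing order
            indices.append(j)
            data.append(d)
        indptr.append(len(indices))   # where the next row starts
    return data, indices, indptr

# Hypothetical search hits for 4 fingerprints at a 0.8 threshold.
hits = [(0, 1, 0.9), (0, 2, 0.85), (2, 3, 0.8)]
data, indices, indptr = to_csr_distance(4, hits)
```

One thing to check for whichever clustering method you pick: a missing entry 
in a sparse matrix normally reads as 0, which for a distance would mean 
"identical", so the method has to treat missing entries as "too far apart" 
rather than as zero distance.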

Best regards,

                                Andrew
                                da...@dalkescientific.com



------------------------------------------------------------------------------
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
