On Aug 23, 2015, at 3:43 AM, Jing Lu wrote:
> If I want to cluster more than 1M molecules by ECFP4. How could I do it? If I
> calculate the distance between every pair of molecules, the size of distance
> matrix will be too big. Does RDKit support any heuristic clustering algorithm
> without calculating the distance matrix of the whole library?

You should look to a third-party package, like scikit-learn from http://scikit-learn.org/ , for clustering. It has a very extensive set of clustering algorithms, including k-means at http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html .

You may be interested in the note on that page, though: "For large scale learning (say n_samples > 10k) MiniBatchKMeans is probably much faster to than the default batch implementation."
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html

My memory is that k-means depends on a Euclidean distance between the records, which is different from the usual Tanimoto (or metric-like 1-Tanimoto) in cheminformatics. If you would rather use Tanimoto, then perhaps try a method like http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html ?

If you go that route, and you want to build the full 1M x 1M distance matrix, the usual approach is to ignore similarities below a given threshold T (e.g., T = 0.8). This can be thought of as either setting those entries to 0.0, or specifying an "ignore" flag. In either case, the result can be stored in a sparse matrix, which is efficient because it stores only the data of interest.

Using my package, chemfp, from http://chemfp.com/ you can compute a sparse matrix for 1M x 1M fingerprints in about an hour using a laptop or desktop. The question would then be how to adapt the sparse output format from chemfp to the sparse input format for your clustering method of choice.

Best regards,

Andrew
da...@dalkescientific.com
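
To make the thresholded sparse-matrix idea above concrete, here is a minimal sketch in plain Python. It is an illustration only and is not chemfp's or RDKit's API: the function names (tanimoto, sparse_similarity), the dict-of-dicts storage, and the toy 8-bit fingerprints are all assumptions made for this example. The all-pairs loop is only meant to show what gets kept and what gets dropped; it would be far too slow for 1M fingerprints, where an optimized search tool would be needed.

```python
from collections import defaultdict


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints (bit sets)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return bin(a & b).count("1") / union


def sparse_similarity(fingerprints, threshold=0.8):
    """Return {i: {j: similarity}} holding only pairs with similarity >= threshold.

    Pairs below the threshold are simply not stored, which is the
    "ignore" interpretation of the sparse matrix.
    """
    neighbors = defaultdict(dict)
    n = len(fingerprints)
    for i in range(n):
        for j in range(i + 1, n):
            score = tanimoto(fingerprints[i], fingerprints[j])
            if score >= threshold:
                # Store both directions so each row lists all of its neighbors.
                neighbors[i][j] = score
                neighbors[j][i] = score
    return dict(neighbors)


# Hypothetical toy fingerprints: two pairs of near-duplicates.
fps = [0b11110000, 0b11110001, 0b00001111, 0b00001110]
matrix = sparse_similarity(fps, threshold=0.75)
print(matrix)
```

Only 4 of the 12 off-diagonal entries survive the threshold here. A clustering method that accepts a precomputed sparse neighbor structure could consume something like this after conversion to its own input format, and 1 - similarity gives the corresponding distance for the stored pairs.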