Thanks, Andrew!

Yes, I was also thinking about using scikit-learn. I guess I will need a
sparse matrix data structure and a function that defines the connectivity.
I hope memory won't be a problem.
Most agglomerative clustering algorithms have a time complexity that grows
as N^2. Will that be a problem?
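
For a rough sense of scale, here is a back-of-the-envelope sketch in Python
of the memory involved. The function names, the 8-byte value, the 4-byte
index, and the figure of 100 retained neighbors per molecule are all
assumptions chosen for illustration, not measurements. The sparse estimate
assumes a CSR-style layout.

```python
def dense_matrix_bytes(n, bytes_per_entry=8):
    """Bytes needed for a full n x n matrix."""
    return n * n * bytes_per_entry


def condensed_matrix_bytes(n, bytes_per_entry=8):
    """Bytes needed if only the upper triangle (i < j) is stored."""
    return n * (n - 1) // 2 * bytes_per_entry


def sparse_matrix_bytes(n, avg_neighbors, bytes_per_value=8, bytes_per_index=4):
    """Approximate bytes for a CSR-style sparse matrix.

    Each stored entry costs one value plus one column index, and there
    are n + 1 row pointers. avg_neighbors is a hypothetical average
    number of entries kept per row after thresholding.
    """
    nnz = n * avg_neighbors
    return nnz * (bytes_per_value + bytes_per_index) + (n + 1) * bytes_per_index


n = 1_000_000
dense = dense_matrix_bytes(n)
condensed = condensed_matrix_bytes(n)
sparse = sparse_matrix_bytes(n, avg_neighbors=100)  # 100 is an assumption

print(f"dense:     {dense / 1e12:.1f} TB")
print(f"condensed: {condensed / 1e12:.1f} TB")
print(f"sparse:    {sparse / 1e9:.1f} GB")
```

Under these assumptions the full matrix needs about 8 TB, while the sparse
one fits in roughly 1.2 GB. The real number of neighbors per molecule
depends on the library and on the threshold.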



Best,
Jing

On Sun, Aug 23, 2015 at 3:13 AM, Andrew Dalke <da...@dalkescientific.com>
wrote:

> On Aug 23, 2015, at 3:43 AM, Jing Lu wrote:
> > If I want to cluster more than 1M molecules by ECFP4. How could I do it?
> If I calculate the distance between every pair of molecules, the size of
> distance matrix will be too big. Does RDKit support any heuristic
> clustering algorithm without calculating the distance matrix of the whole
> library?
>
> You should look to a third-party package, like scikit-learn from
> http://scikit-learn.org/ , for clustering. That has a very extensive set
> of clustering algorithms, including k-means at
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
> .
>
> Though you may be interested in the note on that page: "For large scale
> learning (say n_samples > 10k) MiniBatchKMeans is probably much faster
> than the default batch implementation."
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html
>
> My memory is that k-means depends on a Euclidean distance between the
> records, which is different from the usual Tanimoto (or metric-like
> 1-Tanimoto) in cheminformatics.
>
> If you would rather use Tanimoto, then perhaps try a method like
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
> ?
>
> If you go that route, and you want to build the full 1M x 1M distance
> matrix, the usual approach is to ignore similarities below a given
> threshold T (e.g., T = 0.8). This can be thought of as either setting
> those entries to 0.0, or specifying an "ignore" flag. In either case, the
> result can be stored in a sparse matrix, which stores only the data of
> interest.
>
> Using my package, chemfp, from http://chemfp.com/ you can compute a
> sparse matrix for 1M x 1M fingerprints in about an hour using a laptop or
> desktop.
>
> The question would then be how to adapt the sparse output format from
> chemfp to the sparse input format for your clustering method of choice.
>
> Best regards,
>
>                                 Andrew
>                                 da...@dalkescientific.com
>
>
>
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
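
The thresholding idea in the quoted message can be tried on a toy example.
This is only a sketch: fingerprints are represented as Python sets of
on-bit indices, the function names are made up for illustration, and the
all-pairs loop is quadratic, so it stands in for whatever tool produces
the thresholded similarities at scale. For the clustering step it uses
connected components of the threshold graph, which is what single-linkage
clustering gives when cut at distance 1 - T. It is a stand-in, not the
scikit-learn method discussed above.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def sparse_similarities(fps, threshold):
    """Return (i, j, similarity) triples with i < j and similarity >= threshold.

    Pairs below the threshold are simply not stored.
    """
    triples = []
    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            s = tanimoto(fps[i], fps[j])
            if s >= threshold:
                triples.append((i, j, s))
    return triples


def connected_components(n, triples):
    """Label each of n items by the smallest index in its component (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, _ in triples:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)
    return [find(i) for i in range(n)]


# Toy fingerprints, invented for this example.
fps = [
    {1, 2, 3, 4, 5},
    {1, 2, 3, 4, 5, 6},
    {1, 2, 3, 4, 6},
    {10, 11, 12, 13, 14},
    {10, 11, 12, 13, 14, 15},
    {20, 21},
]

triples = sparse_similarities(fps, threshold=0.8)
labels = connected_components(len(fps), triples)
print(triples)
print(labels)
```

The (i, j, similarity) triples are in coordinate form, so they could be
handed to something like scipy.sparse.coo_matrix((data, (row, col))) if a
clustering routine expects a SciPy sparse matrix. Whether a given
scikit-learn method accepts that as input is something to check in its
documentation.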