Sorry to bother you again...

Now the most time-consuming part is clustering. Generating the
fingerprints takes less than 1 h, but the clustering has already been
running for more than 30 h, and I am not sure when it will finish.

Currently, I use scikit-learn's DBSCAN, which has time complexity
O(n log(n)) in the best case. A more efficient clustering algorithm is
MiniBatchKMeans, but MiniBatchKMeans only takes a matrix as input.

So, is there any way to convert a fingerprint to a numpy vector?
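
To make the question concrete, here is a minimal sketch of the kind of
conversion meant. It assumes each fingerprint is available as a list of
its on-bit indices (for an RDKit bit vector, something like
list(fp.GetOnBits()) should give this). The helper names
on_bits_to_numpy and fingerprints_to_matrix, the toy data, and the
512-bit length are hypothetical and only for illustration. RDKit also
appears to offer DataStructs.ConvertToNumpyArray(fp, arr), which fills a
preallocated numpy array directly and may be the simpler route.

```python
import numpy as np


def on_bits_to_numpy(on_bits, n_bits=2048):
    """Return a dense 0/1 vector with ones at the given bit positions."""
    vec = np.zeros(n_bits, dtype=np.uint8)
    idx = np.asarray(list(on_bits), dtype=np.intp)
    vec[idx] = 1
    return vec


def fingerprints_to_matrix(fingerprints, n_bits=2048):
    """Stack one dense row per fingerprint into an (n_samples, n_bits) matrix."""
    return np.vstack([on_bits_to_numpy(fp, n_bits) for fp in fingerprints])


# Toy data (hypothetical): each fingerprint is the list of its set bits.
toy_fps = [[1, 5, 9], [1, 5, 10], [100, 200, 300]]
X = fingerprints_to_matrix(toy_fps, n_bits=512)
print(X.shape)
```

A matrix like X is the shape of input that MiniBatchKMeans expects (one
row per molecule). Note that k-means works with Euclidean distance, not
Tanimoto similarity, so the clusters may differ from those of a
Tanimoto-based method.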


Thanks,
Jing


On Sun, Aug 23, 2015 at 5:07 PM, Andrew Dalke <da...@dalkescientific.com>
wrote:

> On Aug 23, 2015, at 6:38 PM, Jing Lu wrote:
> > I hope the memory issue won't be a problem.
>
> That's up to you and your choice of threshold.
>
> > Most agglomerative clustering algorithms have O(N^2) time complexity.
> > Will that be a problem?
>
> You have to decide for yourself what counts as a problem. If you want to
> get it done in 1 minute with a threshold of 0.2, then you've got a problem.
> If you're willing to take a month, then there's no problem.
>
> With chemfp, Taylor-Butina clustering, at
> http://chemfp.readthedocs.org/en/latest/using-api.html#taylor-butina-clustering
> , took 35 seconds for 100,000 fingerprints. The NxN calculation is also N^2
> time, so it should only take about an hour for 1 million fingerprints.
>
> Best of course is to start with a smaller system first, see if it works,
> and only then try to scale up. Then you'll have experience of which methods
> are appropriate and what your time constraints are.
>
>
>                                 Andrew
>                                 da...@dalkescientific.com
>
>
>
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>