On Aug 23, 2015, at 6:38 PM, Jing Lu wrote:
> I hope the memory issue won't be a problem.

That's up to you and your choice of threshold.

>  Most AgglomerativeClustering algorithms have time complexity with N^2. Will 
> that be a problem?

You have to decide for yourself what counts as a problem. If you want to get 
it done in 1 minute with a threshold of 0.2, then you've got a problem. If 
you're willing to take a month, then there's no problem.

With chemfp, Taylor-Butina clustering, at 
http://chemfp.readthedocs.org/en/latest/using-api.html#taylor-butina-clustering 
, took 35 seconds for 100,000 fingerprints. The NxN calculation is also N^2 
time, so 10 times as many fingerprints means about 100 times the work: roughly 
3,500 seconds, or about an hour, for 1 million fingerprints.

Best of course is to start with a smaller system first, see if it works, and 
only then try to scale up. Then you'll have experience of which methods are 
appropriate and what your time constraints are.
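
To make "start with a smaller system" concrete, here is a minimal pure-Python 
sketch of a simplified Taylor-Butina style clustering. It is not the chemfp 
code from the link above. It assumes each fingerprint is a Python set of 
on-bit positions, and the names tanimoto and butina_clusters are made up for 
this example. The pairwise loop is the N^2 part, so it is only practical for 
small test sets.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina_clusters(fps, threshold):
    """Simplified Taylor-Butina clustering (illustrative sketch only).

    fps is a list of sets. Returns a list of clusters, each a list of
    indices into fps with the cluster center first.
    """
    n = len(fps)
    neighbors = [set() for _ in range(n)]
    # The N^2 step: find every pair at or above the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fps[i], fps[j]) >= threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)
    # Consider candidates with the most neighbors first; ties by index.
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for center in order:
        if center in assigned:
            continue
        # The center takes every neighbor not already in a cluster.
        members = [center] + sorted(neighbors[center] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Tiny made-up data set: two groups and one singleton.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6},
       {7, 8, 9}, {7, 8, 10}, {20, 21}]
print(butina_clusters(fps, 0.5))
```

Time this on a few hundred or a few thousand of your own fingerprints at the 
threshold you care about, then decide whether the full data set is feasible.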


                                Andrew
                                da...@dalkescientific.com



------------------------------------------------------------------------------
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
