On Aug 23, 2015, at 6:38 PM, Jing Lu wrote:
> I hope the memory issue won't be a problem.

That's up to you and your choice of threshold.

> Most AgglomerativeClustering algorithms have time complexity with N^2. Will
> that be a problem?

You have to decide for yourself what counts as a problem. If you want to get it done in 1 minute with a threshold of 0.2, then you've got a problem. If you're willing to take a month, then there's no problem.

With chemfp, Taylor-Butina clustering, at
http://chemfp.readthedocs.org/en/latest/using-api.html#taylor-butina-clustering ,
took 35 seconds for 100,000 fingerprints. The NxN calculation is also N^2 time, so it should only take about an hour for 1 million fingerprints.

Best of course is to start with a smaller system first, see if it works, and only then try to scale up. Then you'll have experience of which methods are appropriate and what your time constraints are.

Andrew
da...@dalkescientific.com

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
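
For anyone who wants to redo the "about an hour" estimate with their own numbers, here is a rough sketch of the arithmetic. It assumes run time grows purely as N^2 and ignores constant overheads, memory limits, and the effect of the threshold, so treat the result as an order-of-magnitude guess only. The 35 second timing for 100,000 fingerprints is the one quoted in the message; the function name `quadratic_time_estimate` is made up for this example and is not part of chemfp or RDKit.

```python
def quadratic_time_estimate(known_n, known_seconds, target_n):
    """Extrapolate a run time to target_n items, assuming cost grows as N^2.

    known_n and known_seconds describe one measured run. This is a
    hypothetical helper for back-of-the-envelope estimates only.
    """
    scale = (target_n / known_n) ** 2  # 10x the items means 100x the work
    return known_seconds * scale


# Measured point from the message: 100,000 fingerprints in 35 seconds.
seconds = quadratic_time_estimate(100_000, 35.0, 1_000_000)
print(f"{seconds:.0f} seconds, about {seconds / 60:.0f} minutes")
# prints: 3500 seconds, about 58 minutes
```

Timing a small run of your own first and plugging it in as the known point gives an estimate for your hardware and threshold rather than for the numbers quoted here.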