On 25 March 2012 at 20:40, Gael Varoquaux <[email protected]> wrote:
> Hi list,
>
> I am working on a summary table on clustering methods. It is not
> finished, I need to do a bit more literature review, however, I'd love
> some feedback on the current status:
> https://github.com/GaelVaroquaux/scikit-learn/blob/master/doc/modules/clustering.rst

Thanks for working on this. The comparative table combined with the plots is very useful for newcomers IMHO.

However, I would not state that most clustering algorithms are "very scalable" without running some extensive benchmarks. I know that MiniBatchKMeans can work with n_samples > 100 000 without any issue (e.g. in less than a minute, although I don't remember the exact timings) with n_centers=100 and n_features ~= 100000 with a sparsity level of 0.001 to 0.01. I have no idea whether the batch KMeans can do so, or whether it copies too much memory around. The same remark applies to the other algorithms.

Also, it might be interesting to have 2 columns: one for n_samples scalability and the other for n_clusters scalability. AFAIK SpectralClustering is not scalable at all for many samples, while MiniBatchKMeans will work as expected as long as the model (cluster centers) fits in memory.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
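
For a rough sense of scale behind the remarks above, here is a back-of-the-envelope memory estimate, not a benchmark. It is only a sketch under stated assumptions: float64 values, int32 indices, a CSR layout for the sparse input, "sparsity level" read as the fraction of non-zero entries, and a fully dense n_samples x n_samples affinity matrix for the spectral case. The helper names `dense_bytes` and `csr_bytes` are hypothetical and not part of scikit-learn.

```python
# Back-of-the-envelope memory estimates. All sizes are assumptions
# (float64 values, int32 indices), not measurements of scikit-learn.

FLOAT64_BYTES = 8
INT32_BYTES = 4


def dense_bytes(n_rows, n_cols, itemsize=FLOAT64_BYTES):
    """Memory of a dense n_rows x n_cols array, in bytes."""
    return n_rows * n_cols * itemsize


def csr_bytes(n_rows, n_cols, density):
    """Approximate memory of a CSR matrix, in bytes.

    Counts the non-zero values, one column index per non-zero,
    and the row pointer array of length n_rows + 1.
    """
    nnz = round(n_rows * n_cols * density)
    return nnz * (FLOAT64_BYTES + INT32_BYTES) + (n_rows + 1) * INT32_BYTES


# Sizes taken from the figures quoted in the message.
n_samples, n_features, n_centers = 100_000, 100_000, 100

# The MiniBatchKMeans model: one dense row per cluster center.
centers = dense_bytes(n_centers, n_features)

# The sparse input at the lower end of the quoted range.
data_sparse = csr_bytes(n_samples, n_features, 0.001)

# Assumed dense pairwise affinity matrix over all samples.
affinity = dense_bytes(n_samples, n_samples)

for name, size in [
    ("cluster centers", centers),
    ("sparse input (density 0.001)", data_sparse),
    ("dense affinity matrix", affinity),
]:
    print(f"{name}: {size / 1e6:,.1f} MB")
```

Under these assumptions the cluster centers take about 80 MB and the sparse input about 120 MB, while a dense affinity matrix over 100 000 samples would need about 80 GB. Whether a given estimator actually materializes such a dense matrix depends on the affinity used, so actual timings and memory use would still need to be measured.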
