On 25 March 2012 at 20:40, Gael Varoquaux <[email protected]> wrote:
> Hi list,
>
> I am working on a summary table on clustering methods. It is not
> finished, I need to do a bit more literature review, however, I'd love
> some feedback on the current status:
> https://github.com/GaelVaroquaux/scikit-learn/blob/master/doc/modules/clustering.rst

Thanks for working on this. The comparative table combined with the
plots is very useful for newcomers IMHO.

However, I would not claim that most clustering algorithms are "very
scalable" without running extensive benchmarks. I know that
MiniBatchKMeans can handle n_samples > 100 000 without any issue
(e.g. in less than a minute, although I don't remember the exact timings)
with n_centers=100 and n_features ~= 100000 at a sparsity level of
0.001 to 0.01. I have no idea whether the batch KMeans can do the same,
or whether it copies too much memory around. The same remark applies to
the other algorithms.
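
A minimal sketch of the kind of benchmark meant above, assuming a
recent scikit-learn and SciPy. The helper names (time_fit,
run_benchmark) and the default sizes are placeholders of mine, far
smaller than the figures quoted above, and the random sparse matrix
merely stands in for real data:

```python
import time

import numpy as np
import scipy.sparse as sp
from sklearn.cluster import KMeans, MiniBatchKMeans


def time_fit(estimator, X):
    """Return the wall-clock seconds spent in estimator.fit(X)."""
    start = time.perf_counter()
    estimator.fit(X)
    return time.perf_counter() - start


def run_benchmark(n_samples=2000, n_features=1000, density=0.01,
                  n_clusters=10, seed=0):
    """Fit batch and mini-batch k-means on the same random sparse matrix.

    Returns {estimator name: (fit time in seconds, cluster_centers_ shape)}.
    """
    # Random CSR matrix; `density` is the fraction of non-zero entries.
    X = sp.random(n_samples, n_features, density=density, format="csr",
                  random_state=seed, dtype=np.float64)
    estimators = {
        "KMeans": KMeans(n_clusters=n_clusters, n_init=1,
                         random_state=seed),
        "MiniBatchKMeans": MiniBatchKMeans(n_clusters=n_clusters, n_init=1,
                                           random_state=seed),
    }
    results = {}
    for name, estimator in estimators.items():
        seconds = time_fit(estimator, X)
        results[name] = (seconds, estimator.cluster_centers_.shape)
    return results


if __name__ == "__main__":
    for name, (seconds, shape) in run_benchmark().items():
        print(f"{name}: {seconds:.2f}s, centers shape {shape}")
```

Scaling n_samples, n_features and n_clusters up step by step, and
watching memory usage alongside the timings, would show where each
algorithm stops being practical.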

Also, it might be interesting to have two columns: one for n_samples
scalability and the other for n_clusters scalability. AFAIK
SpectralClustering is not scalable at all for many samples, while
MiniBatchKMeans will work as expected as long as the model (cluster
centers) fits in memory.
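
A back-of-the-envelope sketch of why the two axes differ, under my
own assumptions: dense float64 storage (8 bytes per value), and a
spectral method that materialises a dense n_samples x n_samples
affinity matrix, which is not necessarily true of every
configuration. dense_bytes is a hypothetical helper, not a
scikit-learn function:

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed for a dense n_rows x n_cols array of `itemsize`-byte values."""
    return n_rows * n_cols * itemsize


# Model of (MiniBatch)KMeans: one dense center per cluster.
centers_bytes = dense_bytes(100, 100_000)       # n_clusters x n_features

# Assumed dense affinity matrix for a spectral method.
affinity_bytes = dense_bytes(100_000, 100_000)  # n_samples x n_samples

print(f"cluster centers: {centers_bytes / 1e6:.0f} MB")   # 80 MB
print(f"affinity matrix: {affinity_bytes / 1e9:.0f} GB")  # 80 GB
```

The centers grow with n_clusters * n_features, while the assumed
affinity matrix grows with n_samples squared, hence the suggestion
of separate columns.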

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
