On 30 March 2012 07:19, Gael Varoquaux <gael.varoqu...@normalesup.org> wrote:
> Hi Lee,
>
> Welcome! Thanks for preparing a proposal. My impression, looking at it, is
> that it seems a bit light for 2.5 months of work. It is pretty much
> centered around implementing one algorithm, weighted k-means.

One way to complement this proposal would be to take over the
development of Power Iteration Clustering (PIC). I am pretty sure PIC
can be a scalable alternative to Spectral Clustering.

  https://github.com/scikit-learn/scikit-learn/pull/138

Here is the paper:

  http://www.cs.cmu.edu/~wcohen/postscript/icml2010-pic-final.pdf
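
To make the idea concrete, here is a minimal sketch of the embedding
step of power iteration clustering as described in the paper above. It
is not the code from the pull request. The function name, the default
`max_iter` and `tol` values, and the use of a dense NumPy array are
assumptions made for illustration; every row of the affinity matrix is
assumed to have a strictly positive sum.

```python
import numpy as np


def power_iteration_embedding(affinity, max_iter=100, tol=1e-6):
    """Return a 1-D embedding from truncated power iteration (illustrative)."""
    affinity = np.asarray(affinity, dtype=float)
    degree = affinity.sum(axis=1)  # assumed strictly positive for every row
    # Row-normalize the affinity matrix: W = D^-1 A, so each row sums to one.
    W = affinity / degree[:, np.newaxis]
    # Start from the degree vector, scaled to unit L1 norm.
    v = degree / degree.sum()
    prev_delta = np.full_like(v, np.inf)
    for _ in range(max_iter):
        v_next = W @ v
        v_next /= np.abs(v_next).sum()
        delta = np.abs(v_next - v)
        v = v_next
        # Stop once the per-step change stops varying, which happens well
        # before v collapses onto the uninformative constant vector.
        if np.max(np.abs(delta - prev_delta)) < tol:
            break
        prev_delta = delta
    return v


# Toy affinity matrix: two groups (3 and 5 points), strong links inside a
# group, weak links across groups, no self-affinity.
n1, n2 = 3, 5
A = np.full((n1 + n2, n1 + n2), 0.01)
A[:n1, :n1] = 1.0
A[n1:, n1:] = 1.0
np.fill_diagonal(A, 0.0)
embedding = power_iteration_embedding(A)
```

The final step would be to run k-means on the one-dimensional
embedding, for instance with `sklearn.cluster.KMeans` on
`embedding.reshape(-1, 1)`. Only matrix-vector products are needed, so
a sparse affinity matrix could be used in place of the dense one.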

Another interesting task related to clustering is to try to implement
model selection using the stability of the clustering algorithm across
partially overlapping training sets as a meta performance metric.

http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5211310

http://jmlr.csail.mit.edu/papers/volume11/vinh10a/vinh10a.pdf

I am aware that this is still a bit experimental (still an open
research area) but I would really like to invest some time to check on
some realistic datasets whether this unsupervised model selection
strategy works in practice. If it proves useful, then its practicality
could motivate the inclusion of such tooling in the scikit-learn
project despite not being established yet (disclaimer: this is my own
opinion and is subject to debate).
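
A rough sketch of what such a stability measure could look like: draw
pairs of partially overlapping subsamples, cluster each one, and compare
the two labelings on the samples they share. All names below
(`stability`, `simple_kmeans`, `adjusted_rand_index`) are hypothetical,
the tiny k-means with farthest-point initialization is only there to
keep the example self-contained, and the adjusted Rand index is used as
a stand-in agreement score; an adjusted mutual information score could
be substituted.

```python
import numpy as np


def n_pairs(counts):
    """Number of unordered pairs that can be formed within each count."""
    counts = np.asarray(counts)
    return (counts * (counts - 1) // 2).sum()


def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting agreement between two labelings, corrected for chance."""
    _, a = np.unique(labels_a, return_inverse=True)
    _, b = np.unique(labels_b, return_inverse=True)
    table = np.zeros((a.max() + 1, b.max() + 1), dtype=int)
    np.add.at(table, (a, b), 1)  # contingency table of the two labelings
    sum_cells = n_pairs(table)
    sum_rows = n_pairs(table.sum(axis=1))
    sum_cols = n_pairs(table.sum(axis=0))
    expected = sum_rows * sum_cols / n_pairs(len(a))
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return float((sum_cells - expected) / (max_index - expected))


def simple_kmeans(X, k, n_iter=20):
    """Small Lloyd's k-means with deterministic farthest-point init."""
    centers = [X[0]]
    for _ in range(k - 1):
        dist = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[dist.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if np.any(mask):
                centers[j] = X[mask].mean(axis=0)
    return labels


def stability(X, k, n_pairs_drawn=10, fraction=0.8, seed=0):
    """Mean agreement of clusterings fitted on overlapping subsamples."""
    rng = np.random.default_rng(seed)
    n = len(X)
    size = int(fraction * n)
    scores = []
    for _ in range(n_pairs_drawn):
        idx_a = rng.choice(n, size=size, replace=False)
        idx_b = rng.choice(n, size=size, replace=False)
        labels_a = simple_kmeans(X[idx_a], k)
        labels_b = simple_kmeans(X[idx_b], k)
        # Compare the two labelings only on samples present in both subsets.
        _, pos_a, pos_b = np.intersect1d(idx_a, idx_b, return_indices=True)
        scores.append(adjusted_rand_index(labels_a[pos_a], labels_b[pos_b]))
    return float(np.mean(scores))


# Toy data: two tight, well-separated blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.1, size=(30, 2)),
               rng.normal(10.0, 0.1, size=(30, 2))])
scores_by_k = {k: stability(X, k) for k in (2, 3, 4)}
```

Model selection would then amount to preferring the number of clusters
(or any other hyperparameter) with the highest stability score. Whether
that choice is meaningful on realistic datasets is exactly the open
question.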

Anyway, to strengthen the GSoC proposal it would be necessary to make
some actual code contributions before the GSoC proposal submission
deadline.

That can involve fixing bugs in master, contributing small
improvements in new pull requests, or even starting some work on the
Power Iteration Clustering branch, such as rebasing it on top of the
current master and starting to write some tests.

Lee, what is your GitHub account? Do you have prior experience with
NumPy / SciPy / Cython development?

Also, about kernel k-means: I don't know this algorithm myself. Do you
have practical evidence that this approach really works in a
scalable way? For example, an implementation in another language that
works and beats spectral clustering on realistic datasets?

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general