On 30 March 2012 at 07:19, Gael Varoquaux <gael.varoqu...@normalesup.org> wrote:
> Hi Lee,
>
> Welcome! Thanks for preparing a proposal. My impression looking at it, is
> that it seems a bit light for 2.5 months of work. It is pretty much
> centered around implementing one algorithm, weighted k-means.

One way to complement this proposal would be to take over the development of
Power Iteration Clustering (PIC). I am pretty sure PIC can be a scalable
alternative to Spectral Clustering.

  https://github.com/scikit-learn/scikit-learn/pull/138

Here is the paper:

  http://www.cs.cmu.edu/~wcohen/postscript/icml2010-pic-final.pdf

Another interesting task related to clustering would be to implement model
selection that uses the stability of the clustering algorithm across partially
overlapping training sets as a meta performance metric.

  http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5211310
  http://jmlr.csail.mit.edu/papers/volume11/vinh10a/vinh10a.pdf

I am aware that this is still a bit experimental (it is still an open research
area), but I would really like to invest some time in checking on some
realistic datasets whether this unsupervised model selection strategy works in
practice. If it proves useful, its practicality could motivate the inclusion
of such tooling in the scikit-learn project despite it not being established
yet (disclaimer: this is my own opinion and is subject to debate).

Anyway, to strengthen the GSoC proposal it would be necessary to make some
actual code contributions before the GSoC proposal submission deadline. That
can involve fixing bugs in master, contributing small improvements in new pull
requests, or even starting some work on the Power Iteration Clustering branch,
such as rebasing it on top of the current master and starting to write some
tests.

Lee, what is your github account? Do you have prior experience with Numpy /
Scipy / Cython development?

Also, about kernel k-means: I don't know this algorithm myself. Do you have
practical evidence that this approach really works in a scalable way? E.g. an
implementation in another language that works and beats spectral clustering on
realistic datasets?

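For readers who have not looked at the paper, here is a rough sketch of the
power iteration idea as I understand it from Lin & Cohen: row-normalise the
affinity matrix, run a few power iterations from the degree vector, stop early
once the updates settle, and cluster the resulting one-dimensional embedding.
This is only an illustration, not the code from the pull request above. The
function names, the `tol` value, the Gaussian affinity, the toy data and the
tiny 1-D k-means helper are all my own assumptions.

```python
import numpy as np


def power_iteration_embedding(affinity, max_iter=100, tol=1e-5):
    """Return a 1-D embedding from truncated power iteration (sketch)."""
    affinity = np.asarray(affinity, dtype=float)
    degrees = affinity.sum(axis=1)
    # Row-normalised affinity matrix W = D^-1 A.
    W = affinity / degrees[:, np.newaxis]
    # Start from the L1-normalised degree vector.
    v = degrees / degrees.sum()
    prev_delta = np.full(v.shape, np.inf)
    for _ in range(max_iter):
        v_new = W.dot(v)
        v_new /= np.abs(v_new).sum()
        delta = np.abs(v_new - v)
        v = v_new
        # Stop early once the change between successive updates is tiny,
        # i.e. before v collapses to the uninformative constant vector.
        if np.max(np.abs(delta - prev_delta)) < tol:
            break
        prev_delta = delta
    return v


def kmeans_1d(values, n_clusters, n_iter=50):
    """Plain Lloyd iterations on 1-D data (hypothetical helper)."""
    # Initialise the centres at evenly spaced percentiles of the data.
    centers = np.percentile(values, np.linspace(0, 100, n_clusters))
    for _ in range(n_iter):
        dists = np.abs(values[:, np.newaxis] - centers[np.newaxis, :])
        labels = np.argmin(dists, axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = values[labels == k].mean()
    return labels


# Toy data: two well-separated blobs of different sizes.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(40, 2)),
])

# Dense Gaussian affinity with unit bandwidth (an arbitrary choice here).
sq_dists = ((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2).sum(axis=2)
affinity = np.exp(-sq_dists / 2.0)

embedding = power_iteration_embedding(affinity)
labels = kmeans_1d(embedding, n_clusters=2)
print(labels)
```

The dense affinity matrix above is only for illustration. The scalability
argument relies on each iteration being a single matrix-vector product, so
a realistic implementation would presumably use a sparse affinity matrix.
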
-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general