Hi everyone,

Thanks for all your comments on my proposal. I apologize for not responding earlier; I'll try to address each of your concerns and comments in this mail.
@Olivier: my GitHub account is lzamparo. I don't have any prior Cython development experience, but I do have some exposure to NumPy and SciPy through contributions to the CellProfiler project.

The kernel k-means algorithm works by replacing the Euclidean distance from points to their cluster centres in the input space with the Euclidean distance in the kernel feature space (section 2.1 of [1] in my proposal; the first sketch below shows the computation). The authors show that it is equivalent to the normalized-cut formulation of spectral clustering. While I have not implemented it myself, the paper shows that it performs well (and quickly) on the Pendigits data set from the UCI machine learning repository, as well as on the Rosetta Inpharmatics yeast gene expression data set. The reason I think it will beat other formulations of spectral clustering is that the affinity matrix need not be stored in memory, which can be a problem for very large data sets. Also, the kernel matrix need not be sparsified a priori, which is sometimes required for spectral clustering. I think it would be a nice addition to sklearn.

@Gael: I agree, my proposal is a bit light for 2.5 months of work. I had prepared a list of 'nice to haves' for my original proposal, but left it out for the sake of brevity. The idea was to implement the large margin multi-class metric learning algorithm (K. Q. Weinberger, L. K. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. JMLR 10 (2009) 207-244), which is intended to learn a metric for multi-way nearest neighbour classification, but which I thought could also be a nice pre-processing step for clustering. The gist is that it learns a Mahalanobis distance that optimizes a multi-class hinge loss. The learned metric is applied to the training set as a linear transformation, which could then be followed by K-means in the transformed space (second sketch below). However, in light of the suggestions by Bertrand and Olivier, I'm more inclined to include an implementation of power iteration clustering (see Olivier's reply) or exemplar-based clustering (see Bertrand's reply).

@Mathieu: my proposal for speeding up kernel k-means is two-fold. The first part would be caching values of the kernel function (third sketch below); the second is a triangle-inequality-based scheme to cut down on the number of distance evaluations required (fourth sketch below). The authors maintain a K x K matrix and a K x N matrix that are used to estimate lower bounds on the distances from points to all new potential cluster centres, and only compute a distance when the lower bound is smaller than the distance from a point to its current centre. The experiments in the paper show that this saves a great deal of distance-calculation time, which dominates the running time of K-means.

Regarding the suggested additions, I'm interested in Olivier's suggestion of power iteration clustering, and in seeing how it fares against kernel K-means as well as the convex exemplar-based clustering paper suggested by Bertrand. I'll revise my proposal accordingly.
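To make the kernel k-means step concrete, here is a rough NumPy sketch of the centre-distance computation, using only kernel evaluations (untested; the function name and layout are mine, not the paper's):

import numpy as np

def kernel_kmeans_distances(K, labels, n_clusters):
    # Squared feature-space distance from every point to every centre:
    # ||phi(x_i) - m_c||^2 = K[i, i]
    #                        - 2/|c| * sum_{j in c} K[i, j]
    #                        + 1/|c|^2 * sum_{j, l in c} K[j, l]
    n_samples = K.shape[0]
    dist = np.empty((n_samples, n_clusters))
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        size = members.shape[0]
        # mean kernel value between each point and the members of c
        cross = K[:, members].sum(axis=1) / size
        # mean kernel value within cluster c (the centre's self term)
        within = K[np.ix_(members, members)].sum() / (size * size)
        dist[:, c] = np.diag(K) - 2.0 * cross + within
    return dist

K is precomputed here only for clarity; the point of the approach is that entries of K can be computed on demand (and cached, see the third sketch), so the full affinity matrix never has to be held in memory.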
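For the metric-learning idea, the clustering part is trivial once the LMNN solver (the real work, not shown here) has produced a transformation matrix L with M = L^T L; this is only a sketch under that assumption:

import numpy as np
from sklearn.cluster import KMeans

def cluster_with_learned_metric(X, L, n_clusters=8):
    # LMNN learns M = L^T L, so the Mahalanobis distance
    # (x - y)^T M (x - y) is just the squared Euclidean distance
    # between L x and L y; running K-means in the transformed space
    # therefore clusters under the learned metric.
    X_transformed = np.dot(X, L.T)
    return KMeans(n_clusters=n_clusters).fit(X_transformed).labels_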
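The kernel caching I have in mind is nothing fancier than memoizing rows of the kernel matrix as they are requested; a toy sketch (the class name is a placeholder, and a real implementation would need a bounded cache with an eviction policy):

class KernelRowCache(object):

    def __init__(self, X, kernel_func):
        # kernel_func(x, Y) should return the 1-D array of kernel
        # values between a single point x and all rows of Y, e.g. an
        # RBF: lambda x, Y: np.exp(-gamma * ((Y - x) ** 2).sum(axis=1))
        self.X = X
        self.kernel_func = kernel_func
        self._rows = {}

    def row(self, i):
        # compute a row of the kernel matrix at most once, so the
        # full N x N matrix is never materialized
        if i not in self._rows:
            self._rows[i] = self.kernel_func(self.X[i], self.X)
        return self._rows[i]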
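And the triangle-inequality pruning, in the spirit of Elkan-style accelerated k-means (again a sketch with my own names, shown in the input space to keep it short): if d(c, c') >= 2 * d(x, c) for a point x currently assigned to centre c, the triangle inequality gives d(x, c') >= d(x, c), so the distance to c' never needs to be evaluated.

import numpy as np

def assign_with_pruning(X, centres):
    # K x K inter-centre distances: the small matrix that is kept up
    # to date between iterations
    diff = centres[:, None, :] - centres[None, :, :]
    centre_dist = np.sqrt((diff ** 2).sum(axis=2))
    labels = np.empty(X.shape[0], dtype=int)
    for i in range(X.shape[0]):
        best = 0
        best_dist = np.linalg.norm(X[i] - centres[0])
        for c in range(1, centres.shape[0]):
            # triangle inequality: d(x, c) >= d(best, c) - d(x, best),
            # so if d(best, c) >= 2 * d(x, best) then c cannot be closer
            if centre_dist[best, c] >= 2.0 * best_dist:
                continue  # pruned: no distance evaluation needed
            d = np.linalg.norm(X[i] - centres[c])
            if d < best_dist:
                best, best_dist = c, d
        labels[i] = best
    return labels

In kernel k-means the same test would be applied to feature-space distances computed as in the first sketch.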
Thanks, and apologies for the long reply; I made the mistake of getting the list in digest mode.

Lee.

> Date: Mon, 2 Apr 2012 16:51:56 +0200
> From: Olivier Grisel <olivier.gri...@ensta.org>
> Subject: Re: [Scikit-learn-general] GSoC 2012 pre-application
>
> On 30 March 2012 at 15:21, Olivier Grisel <olivier.gri...@ensta.org> wrote:
>>
>> Lee, what is your github account? Do you have prior experience with
>> Numpy / Scipy / Cython development?
>>
>> Also about kernel k-means: I don't know this algorithm myself. Do you
>> have practical evidence that this approach is really working in a
>> scalable way? e.g. an implementation in another language that works
>> and beats spectral clustering on realistic datasets?
>
> Lee, could you answer those questions and the comments by Mathieu,
> Bertrand and Gael?
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel