On 20 March 2012 at 20:51, Immanuel <[email protected]> wrote:
> Hello all,
>
> I have followed the mailing list and poked around in the source code for the
> last couple of weeks.
> Now I'm absolutely sure that I would enjoy working on scikit-learn as a
> GSoC project.
>
> I especially like the proposed online NMF project. Could you enlighten
> me on the following points?
>
> There was some discussion about the integration of some NMF code in
> scikit-learn. How will this influence the proposed online NMF project?
>
> @Vlad
> Looks like we have the same interest; I like the robust PCA project too.
> Do you already have a preference? I guess it makes little sense to pitch
> against you ;).
>
> @Olivier
> I did some preliminary reading on the topic and found the following
> paper interesting:
> "Efficient Document Clustering via Online Nonnegative Matrix Factorizations"
> source: http://research.microsoft.com/apps/pubs/default.aspx?id=143211
>
> It claims:
> * to efficiently handle very large-scale and/or streaming datasets
> * low memory consumption
> Different algorithm versions are presented in the paper. I don't know
> which one would be the most attractive for scikit.

Sounds like a good starting point. Please add your name as a potential candidate on the wiki and the article as a reference in the proposal on the wiki.

If we are to extend this proposal, I would also include extending the existing MiniBatchSparseDictionaryLearning code (which does online block coordinate descent) to accept sparse inputs and positivity constraints.

We could also compare those algorithms with MiniBatchKMeans extended to perform soft assignments with cosine similarity as the metric instead of Euclidean distance. Maybe @mblondel knows some references for this part.

But rather than implementing 3 different algorithms, I would prefer to focus on one implementation, make it scale to large datasets (large enough that it has to work out-of-core), and make it work as well as possible on a bunch of realistic datasets.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
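
As a rough illustration of the MiniBatchKMeans idea mentioned above (soft assignments with cosine similarity instead of Euclidean distance), here is a minimal NumPy sketch. It is not scikit-learn code: the function names `soft_cosine_assignments` and `minibatch_update`, the `temperature` parameter, and the softmax weighting are all assumptions made for illustration, and the inputs are dense arrays only.

```python
import numpy as np


def _normalize_rows(A, eps=1e-12):
    # Scale each row to unit L2 norm so a dot product is a cosine similarity.
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    return A / np.maximum(norms, eps)


def soft_cosine_assignments(X, centers, temperature=0.1):
    """Return an (n_samples, n_clusters) matrix of soft memberships.

    Hypothetical helper: memberships are a softmax over cosine
    similarities; `temperature` controls how sharp they are.
    """
    sims = _normalize_rows(X) @ _normalize_rows(centers).T
    logits = sims / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def minibatch_update(centers, counts, X_batch, temperature=0.1):
    """One soft mini-batch step; returns updated (centers, counts).

    Each center moves toward the membership-weighted mean of the
    normalized batch, with a per-center step size that decays as the
    center accumulates mass (an assumed, k-means-like schedule).
    """
    R = soft_cosine_assignments(X_batch, centers, temperature)
    batch_mass = R.sum(axis=0)                      # mass per cluster
    new_counts = counts + batch_mass
    batch_mean = (R.T @ _normalize_rows(X_batch)) / np.maximum(
        batch_mass, 1e-12)[:, None]
    step = (batch_mass / np.maximum(new_counts, 1e-12))[:, None]
    new_centers = (1.0 - step) * centers + step * batch_mean
    return new_centers, new_counts


if __name__ == "__main__":
    rng = np.random.RandomState(0)
    X = np.abs(rng.randn(200, 5))       # toy non-negative data
    centers = X[:3].copy()              # naive initialization
    counts = np.zeros(3)
    for start in range(0, X.shape[0], 50):   # stream over mini-batches
        centers, counts = minibatch_update(
            centers, counts, X[start:start + 50])
    print(centers.shape, counts.sum())
```

Only one mini-batch is held in memory at a time, which is the property that would matter for the out-of-core setting; handling sparse inputs would need extra work not shown here.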
