Hi,
Here's an update about my possible participation in GSoC this year.
If they have time, I think that Gilles Louppe or Arnaud Joly would
make good mentors since they are from the University of Liège too.
Gilles wrote his master's thesis on recommender systems, so he has
good knowledge of matrix factorization problems.
I talked to Gilles, who agreed to supervise me for GSoC (if I'm
accepted ;-) ).
* To keep things simple, I would first focus on sequential algorithms
and keep the parallel version as a bonus if time permits. Remember
that you need to provide us not only with an implementation but also
with complete unit tests and documentation.
Thanks for your advice. In the proposal I'll send to the GSoC website
I'll make sure to include some time to add unit tests to the sequential
version before adding the parallelization. For the documentation, in
addition to the comments and docstrings for the classes/methods, I
would like to add an example on the MovieLens dataset. This dataset was
proposed by one of the authors of the ideas page. I could maybe also
add the different versions of the dataset to the sklearn.datasets package.
* Please start contributing to the project as soon as possible. It is a
requirement for eligibility for GSoC.
I wasn't active on the mailing list during the last few weeks, but I
continued to read the documentation and some parts of the code. I also
proposed three modest pull requests (one was rejected, one was accepted,
and the last one has not yet been reviewed).
* Online and parallelisation are different things, both interesting and to
keep in mind, but they should not be confused.
Sorry for the confusion; I shouldn't have put "parallelization" and
"online" in the same sentence. I wasn't confusing online learning with
parallelization, but rather two different uses of online learning with
each other. From what I understand, a gradient descent algorithm can be
online in the sense that it runs sequentially
over all the data to update the model. In this case, it can be compared
to a mini-batch approach where the gradient is computed (or the update
is done) using chunks of the data. The online learning approach can be
done internally in the fit() method by simply iterating over the available
data. By "and could be adapted to online learning" in my first mail I
really meant "also implement the partial_fit() method", giving the user
the opportunity to update the model (for example, when new ratings arrive
in the case of a recommender system). One approach is hidden from the user
and the other is exposed to them, but the concepts are mostly the same.
Note that I'm not quite sure which API I should use;
I'll discuss it below.
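To make the distinction concrete, here is a rough sketch of what I have in
mind. Everything in it (the MatrixCompletion name, its parameters, the SGD
update rule) is a placeholder for illustration only, not the algorithm from
the paper or an existing scikit-learn estimator:

import numpy as np

class MatrixCompletion:
    """Toy sketch: rank-k factorization X ~ U V^T fitted by SGD on the
    observed (non-NaN) entries only. Name, parameters and update rule are
    placeholders, not an existing scikit-learn estimator."""

    def __init__(self, n_components=10, lr=0.01, reg=0.1, n_epochs=20,
                 random_state=0):
        self.n_components = n_components
        self.lr = lr
        self.reg = reg
        self.n_epochs = n_epochs
        self.random_state = random_state

    def fit(self, X):
        # Online learning hidden inside fit(): sweep sequentially over the
        # observed entries, several times.
        rng = np.random.RandomState(self.random_state)
        n_rows, n_cols = X.shape
        self.U_ = 0.1 * rng.randn(n_rows, self.n_components)
        self.V_ = 0.1 * rng.randn(n_cols, self.n_components)
        rows, cols = np.where(~np.isnan(X))
        for _ in range(self.n_epochs):
            for i, j in zip(rows, cols):
                self._sgd_step(i, j, X[i, j])
        return self

    def partial_fit(self, X_new):
        # Online learning exposed to the user: update the existing factors
        # with freshly observed ratings, without refitting from scratch.
        # (Assumes X_new has the same shape/indexing as the training matrix.)
        rows, cols = np.where(~np.isnan(X_new))
        for i, j in zip(rows, cols):
            self._sgd_step(i, j, X_new[i, j])
        return self

    def _sgd_step(self, i, j, x_ij):
        # One stochastic gradient step for a single observed entry; a
        # mini-batch variant would average such updates over chunks of entries.
        err = x_ij - np.dot(self.U_[i], self.V_[j])
        u_old = self.U_[i].copy()
        self.U_[i] += self.lr * (err * self.V_[j] - self.reg * self.U_[i])
        self.V_[j] += self.lr * (err * u_old - self.reg * self.V_[j])

In this sketch the sequential sweep over the data is hidden inside fit(),
while partial_fit() exposes exactly the same update to the user.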
* For non-negative matrix factorization, Julien Mairal's algorithm for
online dictionary learning can also be used (see the JMLR paper). It
needs a small modification compared to what we currently have, but it
shouldn't be too much work.
I guess you are talking about this paper
<http://jmlr.csail.mit.edu/papers/volume11/mairal10a/mairal10a.pdf>.
Honestly, I'm not quite familiar with dictionary learning. I'll make a
proposal on the matrix completion problem and, more generally, on data
imputation in scikit-learn, keeping in mind the non-negative matrix
factorization problem and this paper. By the way, reading the code and
the documentation I have found that scikit-learn already implements a
non-negative matrix factorization algorithm (here
<https://github.com/scikit-learn/scikit-learn/blob/9cf9e9f3357653878766cbee6040dc7c475c2cba/sklearn/decomposition/nmf.py>),
so I don't know whether I'll have to implement one after all. If someone could
clarify that point it would be great.
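For reference, this is roughly how that existing estimator is used (a quick
sketch; the exact class name, defaults and initialisation may differ between
versions, and as far as I can tell it expects a fully observed X, which is
the main difference with the matrix completion setting I have in mind):

import numpy as np
from sklearn.decomposition import NMF

# A small, fully observed non-negative matrix (e.g. users x items ratings).
X = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)

model = NMF(n_components=2, random_state=0)
W = model.fit_transform(X)    # (n_samples, n_components), non-negative
H = model.components_         # (n_components, n_features), non-negative
X_approx = np.dot(W, H)       # low-rank, non-negative reconstruction of X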
* For matrix factorization to be useful in the context of recommender
systems, there needs to be an API for recommender systems. While I'd
love to see this, I am afraid that it might be premature and should
probably happen after the release of 1.0.
It seems to me that inverse_transform would do the job:
X_transformed = estimator.fit_transform(X) # X contains missing values
X = estimator.inverse_transform(X_transformed) # missing values were imputed
AFAIK, you might not want all the missing values to be imputed at once,
especially if the dimensions of X are large. Maybe something like:
X_transformed = estimator.fit_transform(X) # X contains missing values
X_subset = estimator.inverse_transform(X_transformed, row_subset) # impute
only a subset of the rows of X
Can't you just do estimator.inverse_transform(X[:subset])?
Assuming that the model is already constructed (with fit() or
partial_fit()), I think three things can be done:
1. Add some new columns.
2. Add some new rows.
3. Predict values in the existing matrix.
(1) (adding new movies, for example) and (2) (adding new users, for
example) can be addressed naively by refitting an entirely new model. I
think clever solutions could be found to address this problem (maybe:
extend the matrix and make the model learn the new rows/columns) and that
we should discuss it.
(3) doesn't need any new matrix, just a subset of the rows/columns. It
could be done by passing a matrix of the same shape as the matrix used to
fit the model and two additional parameters. Is it allowed in
scikit-learn to add more parameters to the transform() or
fit_transform() methods? What would the behavior be with pipelines?
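To make this concrete, here is the kind of call signature I have in mind.
Everything below is hypothetical: the class name, the rows/cols parameters
and the dummy column-mean implementation are only there so the proposed
signature can be shown and actually run; a real estimator would use the
learned factors instead.

import numpy as np

class MatrixCompletionAPI:
    """Hypothetical API sketch: what matters is the transform() signature,
    not the (dummy) implementation."""

    def fit(self, X):
        # A real estimator would learn low-rank factors here; this dummy
        # just stores column means so that the example runs.
        self.col_means_ = np.nanmean(X, axis=0)
        return self

    def transform(self, X, rows=None, cols=None):
        # Proposed extension: a matrix of the same shape as the one used to
        # fit the model, plus two extra parameters restricting which
        # rows/columns get their missing values imputed.
        rows = range(X.shape[0]) if rows is None else rows
        cols = range(X.shape[1]) if cols is None else cols
        X = X.astype(float).copy()
        for i in rows:
            for j in cols:
                if np.isnan(X[i, j]):
                    X[i, j] = self.col_means_[j]
        return X

# Usage: impute the missing values of rows 0 and 2 only.
X = np.array([[5.0, np.nan, 1.0],
              [4.0, 1.0, np.nan],
              [np.nan, 1.0, 5.0]])
est = MatrixCompletionAPI().fit(X)
X_sub = est.transform(X, rows=[0, 2])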
I saw on the upcoming events page that there's a sprint in July in Paris,
and I could be present. I could start working on the project by adding two
additional parameters to the transform() method, and we could define the
final API at the sprint or on the mailing list.
In summary, I think I'll include in the proposal:
* Basic data imputation for rows/columns, for example (see the sketch
after this list):
o For continuous data:
+ Using the mean/median
+ Using a random value in an interval (for example
[mean-n*se;mean+n*se])
o For discrete, unordered data:
+ Select the most frequent value
o For both:
+ Select a value randomly from the row/column
* The implementation of the matrix completion algorithm described in
the paper I used in my first email:
o Online, not parallelized
o Using mini-batches, not parallelized
o Parallelized
* Tests
* Documentation
* Example(s)
* Speed test (for example on the Netflix dataset)
* A discussion with the community about a new API for recommender systems.
* A discussion of how missing values can be handled.
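Here is the sketch announced in the first item of the list above: the basic
column-wise strategies, using only NumPy. The function name and signature
are placeholders for illustration, not a proposed API.

import numpy as np

def impute_column(col, strategy="mean", n=2, random_state=0):
    # Fill the np.nan entries of a 1-d column with a basic strategy.
    # (Illustrative sketch only; assumes discrete categories are encoded
    # as numbers so that np.nan can mark the missing entries.)
    rng = np.random.RandomState(random_state)
    col = col.astype(float).copy()
    missing = np.isnan(col)
    observed = col[~missing]
    if strategy == "mean":                  # continuous data
        col[missing] = observed.mean()
    elif strategy == "median":              # continuous data
        col[missing] = np.median(observed)
    elif strategy == "random_interval":     # uniform in [mean - n*se, mean + n*se]
        se = observed.std(ddof=1) / np.sqrt(len(observed))
        low, high = observed.mean() - n * se, observed.mean() + n * se
        col[missing] = rng.uniform(low, high, size=missing.sum())
    elif strategy == "most_frequent":       # discrete, unordered data
        values, counts = np.unique(observed, return_counts=True)
        col[missing] = values[counts.argmax()]
    elif strategy == "random_value":        # draw randomly from the column
        col[missing] = rng.choice(observed, size=missing.sum())
    return col

# For example:
col = np.array([1.0, np.nan, 3.0, 4.0, np.nan])
print(impute_column(col, strategy="median"))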
Regards,
Nicolas