Hi,
Here's an update about my possible participation in GSoC this year.
If they have time, I think that Gilles Louppe or Arnaud Joly would
make good mentors since they are from the University of Liège too.
Gilles wrote his master's thesis on recommender systems, so he has
good knowledge of matrix factorization problems.
I talked to Gilles, who agreed to supervise me for GSoC (if I'm
accepted ;-) ).
* To keep things simple, I would first focus on sequential algorithms
and keep the parallel version as a bonus if time permits. Remember
that you need to provide us not only with an implementation but also
with complete unit tests and documentation.
Thanks for your advice. In the proposal I'll send to the GSoC website
I'll make sure to include some time to add unit tests to the sequential
version before adding the parallelization. For the documentation, in
addition to the comments and docstrings for the classes/methods, I
would like to add an example on the MovieLens dataset. This dataset was
proposed by one of the authors of the ideas page. I could maybe also
add the different versions of the dataset to the sklearn.datasets package.
* Please start contributing to the project as soon as possible. It is a
requirement for eligibility for GSoC.
I wasn't active on the mailing list during the last few weeks, but I
continued to read the documentation and some parts of the code. I also
proposed three modest pull requests (one was rejected, one was accepted,
and the last one has not yet been reviewed).
* Online and parallelisation are different things, both interesting and to
keep in mind, but they should not be confused.
Sorry for the confusion; I shouldn't have put "parallelization" and
"online" in the same sentence. I wasn't confusing online learning with
parallelization, but rather two different uses of online learning with
each other. From what I understand, a gradient descent algorithm can be
online in the sense that it runs sequentially
over all the data to update the model. In this case, it can be compared
to a mini-batch approach where the gradient is computed (or the update
is done) using chunks of the data. The online learning approach can be
done internally in the fit() method by simply iterating over the available
data. By "and could be adapted to online learning" in my first mail I
really meant "also implement the partial_fit() method", giving the user
the opportunity to update the model (for example, when new ratings arrive
in the case of a recommender system). One approach is hidden from the user
and the other is exposed to them, but the concepts are mostly the same.
Note that I'm not quite sure which API I should use;
I'll discuss it below.
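To make the distinction concrete, here is a rough sketch of what I have in
mind. Everything in it (the MatrixCompletion name, its parameters, the SGD
update rule) is a placeholder for illustration only, not the algorithm from
the paper or an existing scikit-learn estimator:

import numpy as np

class MatrixCompletion:
    """Toy sketch: rank-k factorization X ~ U V^T fitted by SGD on the
    observed (non-NaN) entries only. Name, parameters and update rule are
    placeholders, not an existing scikit-learn estimator."""

    def __init__(self, n_components=10, lr=0.01, reg=0.1, n_epochs=20,
                 random_state=0):
        self.n_components = n_components
        self.lr = lr
        self.reg = reg
        self.n_epochs = n_epochs
        self.random_state = random_state

    def fit(self, X):
        # Online learning hidden inside fit(): sweep sequentially over the
        # observed entries, several times.
        rng = np.random.RandomState(self.random_state)
        n_rows, n_cols = X.shape
        self.U_ = 0.1 * rng.randn(n_rows, self.n_components)
        self.V_ = 0.1 * rng.randn(n_cols, self.n_components)
        rows, cols = np.where(~np.isnan(X))
        for _ in range(self.n_epochs):
            for i, j in zip(rows, cols):
                self._sgd_step(i, j, X[i, j])
        return self

    def partial_fit(self, X_new):
        # Online learning exposed to the user: update the existing factors
        # with freshly observed ratings, without refitting from scratch.
        # (Assumes X_new has the same shape/indexing as the training matrix.)
        rows, cols = np.where(~np.isnan(X_new))
        for i, j in zip(rows, cols):
            self._sgd_step(i, j, X_new[i, j])
        return self

    def _sgd_step(self, i, j, x_ij):
        # One stochastic gradient step for a single observed entry; a
        # mini-batch variant would average such updates over chunks of entries.
        err = x_ij - np.dot(self.U_[i], self.V_[j])
        u_old = self.U_[i].copy()
        self.U_[i] += self.lr * (err * self.V_[j] - self.reg * self.U_[i])
        self.V_[j] += self.lr * (err * u_old - self.reg * self.V_[j])

In this sketch the sequential sweep over the data is hidden inside fit(),
while partial_fit() exposes exactly the same update to the user.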
* For non-negative matrix factorization, Julien Mairal's algorithm for
online dictionary learning can also be used (see the JMLR paper). It
needs a small modification compared to what we currently have, but it
shouldn't be too much work.
I guess you are talking about this paper
<http://jmlr.csail.mit.edu/papers/volume11/mairal10a/mairal10a.pdf>.
Honestly, I'm not quite familiar with dictionary learning. I'll make a
proposal on the matrix completion problem and, more generally, on data
imputation in scikit-learn, keeping in mind the non-negative matrix
factorization problem and this paper. By the way, reading the code and
the documentation I have found that scikit-learn already implements a
non-negative matrix factorization algorithm (here
<https://github.com/scikit-learn/scikit-learn/blob/9cf9e9f3357653878766cbee6040dc7c475c2cba/sklearn/decomposition/nmf.py>),
so I don't know whether I'll have to implement one after all. If someone could
clarify that point it would be great.
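For reference, this is roughly how that existing estimator is used (a quick
sketch; the exact class name, defaults and initialisation may differ between
versions, and as far as I can tell it expects a fully observed X, which is
the main difference with the matrix completion setting I have in mind):

import numpy as np
from sklearn.decomposition import NMF

# A small, fully observed non-negative matrix (e.g. users x items ratings).
X = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)

model = NMF(n_components=2, random_state=0)
W = model.fit_transform(X)    # (n_samples, n_components), non-negative
H = model.components_         # (n_components, n_features), non-negative
X_approx = np.dot(W, H)       # low-rank, non-negative reconstruction of X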
* For matrix factorization to be useful in the context of recommender
systems, there needs to be an API for recommender systems. While I'd
love to see this, I am afraid that it might be premature and should
probably happen after the release of 1.0.
It seems to me that inverse_transform would do the job:
X_transformed = estimator.fit_transform(X) # X contains missing values
X = estimator.inverse_transform(X_transformed) # missing values were imputed
AFAIK, you might not want all the missing values to be imputed at once,
especially if the dimensions of X are large. Maybe something like:
X_transformed = estimator.fit_transform(X) # X contains missing values
X_subset = estimator.inverse_transform(X_transformed, row_subset) # impute
only a subset of the rows of X
Can't you just do estimator.inverse_transform(X[:subset])?
Assuming that the model is already constructed (with fit() or
partial_fit()), I think three things can be done:
1. Add some new columns.
2. Add some new rows.
3. Predict values in the existing matrix.
(1) (adding new movies, for example) and (2) (adding new users, for
example) can be addressed naively by refitting an entirely new model. I
think clever solutions could be found to address this problem (maybe:
extend the matrix and make the model learn the new rows/columns) and that
we should discuss it.
(3) doesn't need any new matrix, just a subset of the rows/columns. It
could be done by passing a matrix of the same shape as the matrix used to
fit the model and two additional parameters. Is it allowed in
scikit-learn to add more parameters to the transform() or
fit_transform() methods? What would the behavior be with pipelines?
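To make this concrete, here is the kind of call signature I have in mind.
Everything below is hypothetical: the class name, the rows/cols parameters
and the dummy column-mean implementation are only there so the proposed
signature can be shown and actually run; a real estimator would use the
learned factors instead.

import numpy as np

class MatrixCompletionAPI:
    """Hypothetical API sketch: what matters is the transform() signature,
    not the (dummy) implementation."""

    def fit(self, X):
        # A real estimator would learn low-rank factors here; this dummy
        # just stores column means so that the example runs.
        self.col_means_ = np.nanmean(X, axis=0)
        return self

    def transform(self, X, rows=None, cols=None):
        # Proposed extension: a matrix of the same shape as the one used to
        # fit the model, plus two extra parameters restricting which
        # rows/columns get their missing values imputed.
        rows = range(X.shape[0]) if rows is None else rows
        cols = range(X.shape[1]) if cols is None else cols
        X = X.astype(float).copy()
        for i in rows:
            for j in cols:
                if np.isnan(X[i, j]):
                    X[i, j] = self.col_means_[j]
        return X

# Usage: impute the missing values of rows 0 and 2 only.
X = np.array([[5.0, np.nan, 1.0],
              [4.0, 1.0, np.nan],
              [np.nan, 1.0, 5.0]])
est = MatrixCompletionAPI().fit(X)
X_sub = est.transform(X, rows=[0, 2])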
I saw on the upcoming events page that there's a sprint in July in Paris,
and I could be present. I could start working on the project by adding two
additional parameters to the transform() method, and we could define the
final API at the sprint or on the mailing list.
In summary, I think I'll include in the proposal:
* Basic data imputation for rows/columns, for example (see the sketch
after this list):
o For continuous data:
+ Using the mean/median
+ Using a random value in an interval (for example
[mean-n*se;mean+n*se])
o For discrete, unordered data:
+ Select the most frequent value
o For both:
+ Select a value randomly from the row/column
* The implementation of the matrix completion algorithm described in
the paper I used in my first email:
o Online, not parallelized
o Using mini-batches, not parallelized
o Parallelized
* Tests
* Documentation
* Example(s)
* Speed test (for example on the Netflix dataset)
* A discussion with the community about a new API for recommender systems.
* A discussion of how missing values can be handled.
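Here is the sketch announced in the first item of the list above: the basic
column-wise strategies, using only NumPy. The function name and signature
are placeholders for illustration, not a proposed API.

import numpy as np

def impute_column(col, strategy="mean", n=2, random_state=0):
    # Fill the np.nan entries of a 1-d column with a basic strategy.
    # (Illustrative sketch only; assumes discrete categories are encoded
    # as numbers so that np.nan can mark the missing entries.)
    rng = np.random.RandomState(random_state)
    col = col.astype(float).copy()
    missing = np.isnan(col)
    observed = col[~missing]
    if strategy == "mean":                  # continuous data
        col[missing] = observed.mean()
    elif strategy == "median":              # continuous data
        col[missing] = np.median(observed)
    elif strategy == "random_interval":     # uniform in [mean - n*se, mean + n*se]
        se = observed.std(ddof=1) / np.sqrt(len(observed))
        low, high = observed.mean() - n * se, observed.mean() + n * se
        col[missing] = rng.uniform(low, high, size=missing.sum())
    elif strategy == "most_frequent":       # discrete, unordered data
        values, counts = np.unique(observed, return_counts=True)
        col[missing] = values[counts.argmax()]
    elif strategy == "random_value":        # draw randomly from the column
        col[missing] = rng.choice(observed, size=missing.sum())
    return col

# For example:
col = np.array([1.0, np.nan, 3.0, 4.0, np.nan])
print(impute_column(col, strategy="median"))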
Regards,
Nicolas