Chunnan Yao created SPARK-6711:
----------------------------------
Summary: Support parallelized online matrix factorization for
Collaborative Filtering
Key: SPARK-6711
URL: https://issues.apache.org/jira/browse/SPARK-6711
Project: Spark
Issue Type: Improvement
Components: MLlib, Streaming
Reporter: Chunnan Yao
Online collaborative filtering (CF) has been widely used and studied.
Retraining a CF model from scratch every time new data arrives is very
inefficient
(http://stackoverflow.com/questions/27734329/apache-spark-incremental-training-of-als-model).
However, we see little discussion in the Spark community about collaborative
filtering on streaming data. Given streaming k-means, streaming logistic
regression, and the ongoing work on incremental model training for the Naive
Bayes classifier (SPARK-4144), we think it is worthwhile to consider streaming
collaborative filtering support in MLlib.
We have been looking into this issue over the past week. We plan to use this
paper as a reference
(https://www.cs.utexas.edu/~cjohnson/ParallelCollabFilt.pdf). It is based on
SGD rather than ALS, which makes it easier to apply to streaming data.
Fortunately, the authors of the paper have implemented their algorithm as a
GitHub project built on Storm:
https://github.com/MrChrisJohnson/CollabStream
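
To illustrate why SGD suits streaming data, here is a minimal sketch of a
generic per-rating SGD update for matrix factorization. It is not the
algorithm from the paper or the CollabStream code, and it is single-process
Python rather than Spark. The names OnlineMF and sgd_update, and the
parameters rank, lr, and reg, are hypothetical and chosen only for this
example.

```python
import random


def sgd_update(user_vec, item_vec, rating, lr=0.05, reg=0.02):
    """Apply one in-place SGD step for a single (user, item, rating) event.

    Returns the prediction error measured before the update.
    """
    pred = sum(u * v for u, v in zip(user_vec, item_vec))
    err = rating - pred
    for k in range(len(user_vec)):
        u, v = user_vec[k], item_vec[k]
        # Gradient step on squared error with L2 regularization.
        user_vec[k] += lr * (err * v - reg * u)
        item_vec[k] += lr * (err * u - reg * v)
    return err


class OnlineMF:
    """Hypothetical incremental model: factors are created lazily per ID."""

    def __init__(self, rank=8, lr=0.05, reg=0.02, seed=0):
        self.rank, self.lr, self.reg = rank, lr, reg
        self.rng = random.Random(seed)
        self.users, self.items = {}, {}

    def _vec(self, table, key):
        # New users/items get a small random factor vector on first sight.
        if key not in table:
            table[key] = [self.rng.uniform(-0.1, 0.1) for _ in range(self.rank)]
        return table[key]

    def update(self, user, item, rating):
        # Only the two factor vectors touched by this rating are modified,
        # so no retraining over historical data is needed.
        return sgd_update(self._vec(self.users, user),
                          self._vec(self.items, item),
                          rating, self.lr, self.reg)

    def predict(self, user, item):
        # Unseen users or items have no factors yet; return 0.0 for them.
        if user not in self.users or item not in self.items:
            return 0.0
        return sum(u * v for u, v in zip(self.users[user], self.items[item]))
```

Each incoming rating touches only one user vector and one item vector, which
is what allows the model to be updated as data arrives. ALS, by contrast,
solves a least-squares problem over all ratings of a user or item at each
step.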