On-line Collaborative Filtering(CF) has been widely used and studied. To re-train a CF model from scratch every time when new data comes in is very inefficient (http://stackoverflow.com/questions/27734329/apache-spark-incremental-training-of-als-model). However, in Spark community we see few discussion about collaborative filtering on streaming data. Given streaming k-means, streaming logistic regression, and the on-going incremental model training of Naive Bayes Classifier (SPARK-4144), we think it is meaningful to consider streaming Collaborative Filtering support on MLlib.
I've created an issue on JIRA (SPARK-6711) for possible discussions. We suggest to refer to this paper (https://www.cs.utexas.edu/~cjohnson/ParallelCollabFilt.pdf). It is based on SGD instead of ALS, which is easier to be tackled under streaming data. Fortunately, the authors of this paper have implemented their algorithm as a Github Project, based on Storm: https://github.com/MrChrisJohnson/CollabStream Please don't hesitate to give your opinions on this issue and our planned approach. We'd like to work on this in the next few weeks. ----- Feel the sparking Spark! -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Support-parallelized-online-matrix-factorization-for-Collaborative-Filtering-tp11413.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org