Hi Ryan,

I have done some research on the collaborative filtering literature, and here are my thoughts.
I think removing global effects, as described in the reference paper, is the most worthwhile thing to implement. It is relatively easy to implement, and it significantly improves rating prediction. I ran a small experiment on the GroupLens-100k dataset: I centered the ratings by subtracting the overall mean and then ran the cf algorithm with different factorizers. The RMSE of the default factorizer (NMFALSFactorizer) decreases from 2.83887 to 1.08704, and that of RegularizedSVD decreases from 1.1613 to 1.11595. I am still thinking about a good way to incorporate this into the current cf code so that it can be extended easily to remove other global effects.

One problem I also noticed is that the default factorizer (NMFALSFactorizer) gives poor rating predictions (RMSE 2.83887, as shown above). I had a look at the predicted ratings and found that most of them are close to zero (the rating scale is 1-5). I am not familiar with the mathematics behind the update rule of this factorizer, but my guess is that the factorizer is trying to fit zeros where ratings are missing. That could also explain the significant improvement after normalizing the raw ratings.

There are other SVD-related algorithms that are worth implementing:

1) BiasSVD is similar to RegularizedSVD. The difference is that BiasSVD also models the user/item rating bias.

2) SVD++ improves on BiasSVD by taking implicit feedback into account. It allows modelling the effect of boolean-valued implicit feedback. A nice aspect is that the rating itself can be regarded as a kind of implicit feedback (whether the user rated the item). So if no other implicit feedback (e.g. whether the user browsed the item) is provided, SVD++ can still be used with the rating as implicit feedback.

But these two algorithms are not matrix factorizations of the form V = W * H, which we could plug directly into the current cf code.
One solution is to add a new member like "bool UseFactorizerSpecificRatingFunction" to struct FactorizerTraits, and use SFINAE to select the code path. Then a function like "double getRating(user, item)" needs to be implemented in the BiasSVD/SVD++ class. I would like to hear some suggestions on this :)

These two algorithms can be found in this paper:
http://www.cs.rochester.edu/twiki/pub/Main/HarpSeminar/Factorization_Meets_the_Neighborhood-_a_Multifaceted_Collaborative_Filtering_Model.pdf

A progress indicator should be a useful tool. Some algorithms take a relatively long time to compute, such as cf with SVDBatchFactorizer. With a progress indicator (maybe in the form of a progress bar?), the user will have a rough idea of how much time the process needs.

For now, I think I can focus on removing simple global effects (overall mean, user/item main effect) or on BiasSVD. What do you think?

Thank you!

Best,
Wenhao

On Wed, Feb 7, 2018 at 10:53 PM Ryan Curtin <[email protected]> wrote:

> On Tue, Feb 06, 2018 at 05:31:14PM +0000, Wenhao Huang wrote:
> > Thanks a lot Ryan! I am going through the code in the cf module. And do you
> > know any current relevant issues that I can have a look or even start
> > working on, to better my understanding of the implementation of cf
> > algorithm in mlpack?
>
> Hi Wenhao,
>
> At this time there are not any open issues that I am aware of for CF.
> However, there are always improvements that can be made to the code, so
> I would encourage you to explore it and see if you can find any speedups
> or propose any functionality improvements. For instance, maybe one idea
> is adding another simple factorizer (unfortunately I don't have one
> handy to suggest), or to profile the code and see if you can find any
> slow parts.
>
> I hope this is helpful. :)
>
> Thanks,
>
> Ryan
>
> --
> Ryan Curtin | "Moo."
> [email protected] | - Eugene Belford
_______________________________________________
mlpack mailing list
[email protected]
http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack
