[ https://issues.apache.org/jira/browse/MAHOUT-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861159#action_12861159 ]
Richard Simon Just commented on MAHOUT-371:
-------------------------------------------

Excellent! I haven't downloaded the latest MEAP version of MiA yet, so that would be great. I'm not sure if it has changed much, but I will re-read the version I have and start looking at a more detailed design before consulting mahout-dev.

[GSoC] Proposal to implement Distributed SVD++ Recommender using Hadoop
------------------------------------------------------------------------

Key: MAHOUT-371
URL: https://issues.apache.org/jira/browse/MAHOUT-371
Project: Mahout
Issue Type: New Feature
Components: Collaborative Filtering
Reporter: Richard Simon Just

Proposal Title: [MAHOUT-371] Proposal to implement Distributed SVD++ Recommender using Hadoop
Student Name: Richard Simon Just
Student E-mail: i...@richardsimonjust.co.uk
Organization/Project: Apache Mahout
Assigned Mentor:

Proposal Abstract:

During the Netflix Prize Challenge one of the most popular forms of recommender algorithm was matrix factorisation, in particular Singular Value Decomposition (SVD). This proposal therefore looks to implement a distributed version of one of the most successful SVD-based recommender algorithms from the Netflix competition: the SVD++ algorithm.

SVD++ improves upon other basic SVD algorithms by incorporating implicit feedback [1]. That is, it takes into account not just explicit user preferences but also feedback such as, in the case of a company like Netflix, whether a movie has been rented. Implicit feedback assumes that the existence of some interaction between a user and an item is more important than whether that interaction is positive or negative: it accounts for the fact that an item has been rated, but not what the rating was.

The implementation will include testing, in-depth documentation and a demo/tutorial. If there is time, I will also look at developing the algorithm into the timeSVD++ algorithm [2,3]. timeSVD++ further improves the results of the SVD algorithm by taking temporal dynamics into account, i.e. the way users' preferences for items, and the way they rate them, can change over time. According to [2], the accuracy gains from implementing timeSVD++ are significantly bigger than the gains from going from SVD to SVD++.

The overall project will provide three deliverables:
1. The basic framework for a distributed SVD-based recommender
2. A distributed SVD++ implementation and demo
3. A distributed timeSVD++ implementation

Detailed Description:

The SVD++ algorithm uses the principle of categorising users and items into factors, combined with regularisation and implicit feedback, to predict how well a user is likely to match an item. Factors are abstract categories created from the data presented, and factor values grade how strongly each user or item relates to a category. For example, with the Netflix data a factor could loosely correspond to a movie genre, a director or a story line. The more factors used, the more detailed the categories are likely to become, and thus the more accurate the predictions are likely to become.

Implicit feedback is based on the theory that a user having any sort of relationship to an item is more important than whether they rated it, or what rating they gave. (The prediction rule sketched below makes this concrete.)
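For reference, this is the SVD++ prediction rule from [1], in that paper's notation:

\hat{r}_{ui} = \mu + b_u + b_i + q_i^T \left( p_u + |N(u)|^{-1/2} \sum_{j \in N(u)} y_j \right)

Here \mu is the global mean rating, b_u and b_i are the user and item biases, p_u and q_i are the user and item factor vectors, N(u) is the set of items for which user u has shown any implicit feedback, and each y_j is a second per-item factor vector that folds implicit feedback into the user representation. The sum over N(u) is what lets the model learn from the mere fact of an interaction, independently of the rating value.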
The assumption is that even if a user does not like an item, or has not rated it, the very fact that they chose to have some interaction with it indicates something about their preferences. In the Netflix case this would be represented by whether a user has rated a movie or not; it could also take the user's rental history into account.

As well as the actual implementation of the code, the project has two other deliverable focuses: the readability, documentation and testing of the code; and a full tutorial and demo of the code. Without these, the implementation is not really complete or fully usable.

The recommender consists of five main parts: the job class that runs the code, an input conversion section, a training section, a re-trainer section and a prediction section. A brief overview of these sections follows.

The SVD++ Classes:

The Recommender Job class:
This class is the foundation of the recommender and allows it to run on Hadoop by implementing the Tool interface through AbstractJob. It will parse any user arguments and set up the jobs that run the algorithm on Map/Reduce, much as Mahout's other distributed recommenders, such as RecommenderJob, do.

Input Mapper/Reducer Classes:
The Mapper will take the input data and convert it to key-value pairs in the form of Hadoop Writables. The Reducer will take the Mapper's Writables and create sparse vectors. This is in keeping with Mahout's ToItemPrefsMapper and ToUserVectorReducer.

Training Mapper/Reducer Classes:
This phase creates the factor matrices that will later be used to create the predictions. It does this by predicting a known rating using the SVD, comparing the prediction against the known rating, calculating the error and updating the factors accordingly (see the sketch after this overview). The Mapper will loop through the training of the factor matrices; the Reducer will collect the Mapper's outputs to create the dense factor matrices.

Re-trainer Mapper/Reducer Classes:
This is a separate job that would be used to update the factor matrices to take any new user ratings into account. The Mapper/Reducer follow a similar pattern to the training section.

Prediction Mapper/Reducer Classes:
This job would take the factor matrices from training and use them in the SVD to create the required predictions. Most likely this would be done for all user/item preferences, to create a dense matrix of all user/item combinations; potentially it could also be done for individual users. Just as in training, the Mapper would handle the predictions and the Reducer would collect them into a dense matrix.
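Below is a minimal, non-distributed sketch of the per-rating stochastic gradient step the training Mapper would loop over, following the SVD++ update rules in [1]. The class and method names and the learning-rate/regularisation constants are illustrative assumptions, not existing Mahout API:

{code:java}
/**
 * Illustrative sketch (not Mahout API): one stochastic gradient descent
 * step for a single known rating r_ui, following the SVD++ updates in [1].
 * Factor vectors are updated in place; the new biases are returned.
 */
public final class SvdPlusPlusStep {

  private static final double GAMMA = 0.007;   // learning rate (placeholder)
  private static final double LAMBDA = 0.015;  // regularisation (placeholder)

  /**
   * @param rating observed rating r_ui
   * @param mu     global mean rating
   * @param bu     user bias b_u
   * @param bi     item bias b_i
   * @param p      user factor vector p_u, updated in place
   * @param q      item factor vector q_i, updated in place
   * @param y      implicit feedback vectors y_j, one per item j in N(u),
   *               updated in place
   * @return the updated biases {b_u, b_i}
   */
  public static double[] train(double rating, double mu, double bu, double bi,
                               double[] p, double[] q, double[][] y) {
    int k = p.length;
    double norm = y.length > 0 ? 1.0 / Math.sqrt(y.length) : 0.0; // |N(u)|^-1/2

    // Implicit-feedback-adjusted user vector: p_u + |N(u)|^{-1/2} * sum_j y_j
    double[] pPlus = new double[k];
    for (int f = 0; f < k; f++) {
      double sum = 0.0;
      for (double[] yj : y) {
        sum += yj[f];
      }
      pPlus[f] = p[f] + norm * sum;
    }

    // Prediction r^_ui = mu + b_u + b_i + q_i . pPlus, and its error e_ui
    double prediction = mu + bu + bi;
    for (int f = 0; f < k; f++) {
      prediction += q[f] * pPlus[f];
    }
    double err = rating - prediction;

    // Regularised gradient step on biases and factors
    double newBu = bu + GAMMA * (err - LAMBDA * bu);
    double newBi = bi + GAMMA * (err - LAMBDA * bi);
    for (int f = 0; f < k; f++) {
      double qf = q[f]; // keep the pre-update value for the other updates
      q[f] += GAMMA * (err * pPlus[f] - LAMBDA * q[f]);
      p[f] += GAMMA * (err * qf - LAMBDA * p[f]);
      for (double[] yj : y) {
        yj[f] += GAMMA * (err * norm * qf - LAMBDA * yj[f]);
      }
    }
    return new double[] {newBu, newBi};
  }
}
{code}

In the distributed version, this step would run inside the training Mapper for each rating in its input split, with the Reducer merging the partial factor matrices as described above.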
Timeline:

The Warm Up/Bonding Period (<= May 23rd):
- familiarise myself further with Mahout's and Hadoop's code bases and documentation
- discuss with the community the proposal, design and implementation, as well as related code tests, optimisations and documentation they would like to see incorporated into the project
- build a more detailed design of the algorithm implementation and tweak the timeline based on feedback
- familiarise myself more with unit testing
- finish building a 3-4 node Hadoop cluster and play with all the examples

Week 1 (May 24th - 30th):
- start writing the backbone of the code in the form of comments and skeleton code
- implement SVDppRecommenderJob
- start to integrate DistributedLanczosSolver

Week 2 (May 31st - June 6th):
- complete the DistributedLanczosSolver integration
- start implementing distributed training, prediction and regularisation

Week 3 - 5 (June 7th - 27th):
- complete the implementation of distributed training, prediction and regularisation
- work on testing, cleaning up code, and tying up any loose documentation ends
- work on any documentation, tests and optimisation requested by the community
- Deliverable: basic framework for a distributed SVD-based recommender

Week 6 - 7 (June 28th - July 11th):
- start the implementation of SVD++ (keeping documentation and tests up to date)
- prepare the demo

Week 8 (July 12th - 18th): Mid-Term Report by the 16th
- complete SVD++ and iron out bugs
- implement and document the demo
- write wiki articles and a tutorial for what has been implemented, including the demo

Week 9 (July 19th - 25th):
- work on any documentation, tests and optimisation requested by the community during the project
- work on testing, cleaning up code, and tying up any loose documentation ends
- Deliverable: Distributed SVD++ Recommender (including demo)

Week 10 - 11 (July 26th - Aug 8th):
- incorporate temporal dynamics
- write the temporal dynamics documentation, including a wiki article

Week 12 (Aug 9th - 15th): Suggested Pencils Down
- last optimisation and tidy-up of code, documentation, tests and demo
- discuss with the community what comes next, and consider which JIRA issues to contribute to
- Deliverable: Distributed SVD++ Recommender with temporal dynamics

Final Evaluations Hand-in: Aug 16th - 20th.

Additional Information:

I am currently a final-year undergraduate at the University of Reading, UK, studying BSc Applied Computer Science. As part of this degree I did a 9-month industrial placement programming in Java for an international non-profit organisation. In this final year of study I have been fortunate enough to take modules in Distributed Computing, Evolutionary Computation and Concurrent Systems, which have proven to be the best of my university career. Each module required me to design and implement algorithms of either a distributed or machine learning nature [A], two in Java, one in C. For my final-year project I have implemented a Gaming Persistence Platform in ActionScript, PHP and MySQL. It allows the user to play computer games integrated with the platform from different devices and/or different software platforms while retaining game progress across all instances. I am looking forward to the day when mobile computing goes distributed.

I am a passionate user of open source software and am now looking to become involved as a developer.
Having enjoyed my Distributed Computing and Evolutionary Computation modules so much, and after reading all the introductory pages about the ASF, I realised Mahout would be a great place to start. After graduation (and GSoC) I hope to continue contributing to Mahout while working in a related field.

Outside of uni, when I'm not coding, I like to push my limits through meditation, mountaineering and running. I also love to cook!

After my last exam on May 21st my only commitment over the summer will be GSoC. As such, GSoC will be treated like a full-time job.

[1] - Y. Koren, "Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model", ACM Press, 2008, http://public.research.att.com/~volinsky/netflix/kdd08koren.pdf
[2] - Y. Koren, "Collaborative Filtering with Temporal Dynamics", ACM Press, 2009, http://research.yahoo.com/files/kdd-fp074-koren.pdf
[3] - R. M. Bell, Y. Koren, and C. Volinsky, "The BellKor 2008 Solution to the Netflix Prize", http://www.netflixprize.com/assets/ProgressPrize2008_BellKor.pdf
[A] - Example of code I've implemented as part of my University degree, http://www.richardsimonjust.com/code/

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.