[ https://issues.apache.org/jira/browse/MAHOUT-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861159#action_12861159 ]
Richard Simon Just commented on MAHOUT-371:
-------------------------------------------

Excellent! I haven't downloaded the latest MEAP version of MiA yet, so that would be great. I'm not sure if it has changed much, but I will re-read the version I have and start looking at a more detailed design before consulting mahout-dev.

[GSoC] Proposal to implement Distributed SVD++ Recommender using Hadoop
------------------------------------------------------------------------

Key: MAHOUT-371
URL: https://issues.apache.org/jira/browse/MAHOUT-371
Project: Mahout
Issue Type: New Feature
Components: Collaborative Filtering
Reporter: Richard Simon Just

Proposal Title: [MAHOUT-371] Proposal to implement Distributed SVD++ Recommender using Hadoop
Student Name: Richard Simon Just
Student E-mail: i...@richardsimonjust.co.uk
Organization/Project: Apache Mahout
Assigned Mentor:

Proposal Abstract:

During the Netflix Prize Challenge one of the most popular forms of recommender algorithm was matrix factorisation, in particular Singular Value Decomposition (SVD). This proposal therefore looks to implement a distributed version of one of the most successful SVD-based recommender algorithms from the Netflix competition: the SVD++ algorithm.

SVD++ improves upon other basic SVD algorithms by incorporating implicit feedback [1]. That is, it takes into account not just explicit user preferences but also feedback such as, in the case of a company like Netflix, whether a movie has been rented. Implicit feedback assumes that the existence of some interaction between a user and an item is more important than whether that interaction is positive or negative: it accounts for the fact that an item has been rated, but not what the rating was.

The implementation will include testing, in-depth documentation and a demo/tutorial. If there is time, I will also look at developing the algorithm into the timeSVD++ algorithm [2,3]. timeSVD++ further improves the results of the SVD algorithm by taking temporal dynamics into account, i.e. the way users' preferences for items, and the way they rate them, can change over time. According to [2], the accuracy gains from implementing timeSVD++ are significantly bigger than the gains from going from SVD to SVD++.

The overall project will provide three deliverables:
1. The basic framework for a distributed SVD-based recommender
2. A distributed SVD++ implementation and demo
3. A distributed timeSVD++ implementation

Detailed Description:

The SVD++ algorithm uses the principle of categorising users and items into factors, combined with regularisation and implicit feedback, to predict how well a user is likely to match an item. Factors are abstract categories created from the data presented, and factor values grade how strongly each user or item relates to a category. For example, with the Netflix data a factor could loosely correspond to a movie genre, a director or a story line. The more factors used, the more detailed the categories are likely to become, and thus the more accurate the predictions are likely to become.

Implicit feedback is based on the theory that a user having any sort of relationship to an item is more important than whether they rated it, or what rating they gave. (The prediction rule sketched below makes this concrete.)
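For reference, this is the SVD++ prediction rule from [1], in that paper's notation:

\hat{r}_{ui} = \mu + b_u + b_i + q_i^T \left( p_u + |N(u)|^{-1/2} \sum_{j \in N(u)} y_j \right)

Here \mu is the global mean rating, b_u and b_i are the user and item biases, p_u and q_i are the user and item factor vectors, N(u) is the set of items for which user u has shown any implicit feedback, and each y_j is a second per-item factor vector that folds implicit feedback into the user representation. The sum over N(u) is what lets the model learn from the mere fact of an interaction, independently of the rating value.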
The assumption is that even if a user does not like an item, or has not rated it, the very fact that they chose to have some interaction with it indicates something about their preferences. In the Netflix case this would be represented by whether a user has rated a movie or not; it could also take the user's rental history into account.

As well as the actual implementation of the code, the project has two other deliverable focuses: the readability, documentation and testing of the code; and a full tutorial and demo of the code. Without these, the implementation is not really complete or fully usable.

The recommender consists of five main parts: the job class that runs the code, an input conversion section, a training section, a re-trainer section and a prediction section. A brief overview of these sections follows.

The SVD++ Classes:

The Recommender Job class:
This class is the foundation of the recommender and allows it to run on Hadoop by implementing the Tool interface through AbstractJob. It will parse any user arguments and set up the jobs that run the algorithm on Map/Reduce, much as Mahout's other distributed recommenders, such as RecommenderJob, do.

Input Mapper/Reducer Classes:
The Mapper will take the input data and convert it to key-value pairs in the form of Hadoop Writables. The Reducer will take the Mapper's Writables and create sparse vectors. This is in keeping with Mahout's ToItemPrefsMapper and ToUserVectorReducer.

Training Mapper/Reducer Classes:
This phase creates the factor matrices that will later be used to create the predictions. It does this by predicting a known rating using the SVD, comparing the prediction against the known rating, calculating the error and updating the factors accordingly (see the sketch after this overview). The Mapper will loop through the training of the factor matrices; the Reducer will collect the Mapper's outputs to create the dense factor matrices.

Re-trainer Mapper/Reducer Classes:
This is a separate job that would be used to update the factor matrices to take any new user ratings into account. The Mapper/Reducer follow a similar pattern to the training section.

Prediction Mapper/Reducer Classes:
This job would take the factor matrices from training and use them in the SVD to create the required predictions. Most likely this would be done for all user/item preferences, to create a dense matrix of all user/item combinations; potentially it could also be done for individual users. Just as in training, the Mapper would handle the predictions and the Reducer would collect them into a dense matrix.
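Below is a minimal, non-distributed sketch of the per-rating stochastic gradient step the training Mapper would loop over, following the SVD++ update rules in [1]. The class and method names and the learning-rate/regularisation constants are illustrative assumptions, not existing Mahout API:

{code:java}
/**
 * Illustrative sketch (not Mahout API): one stochastic gradient descent
 * step for a single known rating r_ui, following the SVD++ updates in [1].
 * Factor vectors are updated in place; the new biases are returned.
 */
public final class SvdPlusPlusStep {

  private static final double GAMMA = 0.007;   // learning rate (placeholder)
  private static final double LAMBDA = 0.015;  // regularisation (placeholder)

  /**
   * @param rating observed rating r_ui
   * @param mu     global mean rating
   * @param bu     user bias b_u
   * @param bi     item bias b_i
   * @param p      user factor vector p_u, updated in place
   * @param q      item factor vector q_i, updated in place
   * @param y      implicit feedback vectors y_j, one per item j in N(u),
   *               updated in place
   * @return the updated biases {b_u, b_i}
   */
  public static double[] train(double rating, double mu, double bu, double bi,
                               double[] p, double[] q, double[][] y) {
    int k = p.length;
    double norm = y.length > 0 ? 1.0 / Math.sqrt(y.length) : 0.0; // |N(u)|^-1/2

    // Implicit-feedback-adjusted user vector: p_u + |N(u)|^{-1/2} * sum_j y_j
    double[] pPlus = new double[k];
    for (int f = 0; f < k; f++) {
      double sum = 0.0;
      for (double[] yj : y) {
        sum += yj[f];
      }
      pPlus[f] = p[f] + norm * sum;
    }

    // Prediction r^_ui = mu + b_u + b_i + q_i . pPlus, and its error e_ui
    double prediction = mu + bu + bi;
    for (int f = 0; f < k; f++) {
      prediction += q[f] * pPlus[f];
    }
    double err = rating - prediction;

    // Regularised gradient step on biases and factors
    double newBu = bu + GAMMA * (err - LAMBDA * bu);
    double newBi = bi + GAMMA * (err - LAMBDA * bi);
    for (int f = 0; f < k; f++) {
      double qf = q[f]; // keep the pre-update value for the other updates
      q[f] += GAMMA * (err * pPlus[f] - LAMBDA * q[f]);
      p[f] += GAMMA * (err * qf - LAMBDA * p[f]);
      for (double[] yj : y) {
        yj[f] += GAMMA * (err * norm * qf - LAMBDA * yj[f]);
      }
    }
    return new double[] {newBu, newBi};
  }
}
{code}

In the distributed version, this step would run inside the training Mapper for each rating in its input split, with the Reducer merging the partial factor matrices as described above.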
Timeline:

The Warm Up/Bonding Period (<= May 23rd):
- familiarise myself further with Mahout's and Hadoop's code bases and documentation
- discuss with the community the proposal, design and implementation, as well as related code tests, optimisations and documentation they would like to see incorporated into the project
- build a more detailed design of the algorithm implementation and tweak the timeline based on feedback
- familiarise myself more with unit testing
- finish building a 3-4 node Hadoop cluster and play with all the examples

Week 1 (May 24th - 30th):
- start writing the backbone of the code in the form of comments and skeleton code
- implement SVDppRecommenderJob
- start to integrate DistributedLanczosSolver

Week 2 (May 31st - June 6th):
- complete the DistributedLanczosSolver integration
- start implementing distributed training, prediction and regularisation

Week 3 - 5 (June 7th - 27th):
- complete the implementation of distributed training, prediction and regularisation
- work on testing, cleaning up code, and tying up any loose documentation ends
- work on any documentation, tests and optimisation requested by the community
- Deliverable: basic framework for a distributed SVD-based recommender

Week 6 - 7 (June 28th - July 11th):
- start the implementation of SVD++ (keeping documentation and tests up to date)
- prepare the demo

Week 8 (July 12th - 18th): Mid-Term Report by the 16th
- complete SVD++ and iron out bugs
- implement and document the demo
- write wiki articles and a tutorial for what has been implemented, including the demo

Week 9 (July 19th - 25th):
- work on any documentation, tests and optimisation requested by the community during the project
- work on testing, cleaning up code, and tying up any loose documentation ends
- Deliverable: Distributed SVD++ Recommender (including demo)

Week 10 - 11 (July 26th - Aug 8th):
- incorporate temporal dynamics
- write the temporal dynamics documentation, including a wiki article

Week 12 (Aug 9th - 15th): Suggested Pencils Down
- last optimisation and tidy-up of code, documentation, tests and demo
- discuss with the community what comes next, and consider which JIRA issues to contribute to
- Deliverable: Distributed SVD++ Recommender with temporal dynamics

Final Evaluations Hand-in: Aug 16th - 20th.

Additional Information:

I am currently a final-year undergraduate at the University of Reading, UK, studying BSc Applied Computer Science. As part of this degree I did a 9-month industrial placement programming in Java for an international non-profit organisation. In this final year of study I have been fortunate enough to take modules in Distributed Computing, Evolutionary Computation and Concurrent Systems, which have proven to be the best of my university career. Each module required me to design and implement algorithms of either a distributed or machine learning nature [A], two in Java, one in C. For my final-year project I have implemented a Gaming Persistence Platform in ActionScript, PHP and MySQL. It allows the user to play computer games integrated with the platform from different devices and/or different software platforms while retaining game progress across all instances. I am looking forward to the day when mobile computing goes distributed.

I am a passionate user of open source software and am now looking to become involved as a developer.
Having enjoyed my Distributed Computing and Evolutionary Computation modules so much, and after reading all the introductory pages about the ASF, I realised Mahout would be a great place to start. After graduation (and GSoC) I hope to continue contributing to Mahout while working in a related field.

Outside of uni, when I'm not coding, I like to push my limits through meditation, mountaineering and running. I also love to cook!

After my last exam on May 21st my only commitment over the summer will be GSoC. As such, GSoC will be treated like a full-time job.

[1] - Y. Koren, "Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model", ACM Press, 2008, http://public.research.att.com/~volinsky/netflix/kdd08koren.pdf
[2] - Y. Koren, "Collaborative Filtering with Temporal Dynamics", ACM Press, 2009, http://research.yahoo.com/files/kdd-fp074-koren.pdf
[3] - R. M. Bell, Y. Koren, and C. Volinsky, "The BellKor 2008 Solution to the Netflix Prize", http://www.netflixprize.com/assets/ProgressPrize2008_BellKor.pdf
[A] - Example of code I've implemented as part of my University degree, http://www.richardsimonjust.com/code/

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.