[ https://issues.apache.org/jira/browse/MAHOUT-375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855694#action_12855694 ]
Sisir Koppaka commented on MAHOUT-375:
--------------------------------------

Moved the proposal to JIRA. I've processed the Netflix dataset in the format Sean suggested on mahout-dev and put it up at https://mahout-rbm-netflix.s3.amazonaws.com/netflix-dataset-wodates.csv (click to view in browser/download). The format is [movieID, userID, rating]. During the proposal consideration period, I am implementing a version of the RBM that is not yet conditional, factored, or parallelized. I will submit this tomorrow after testing. Since testing on my machine alone is currently quite time-consuming for the Netflix dataset, are there any datasets you could suggest for quicker testing of the RBM, at least for now? If the test dataset has published RBM results that I can compare against, that would really help with the testing.

> [GSOC] Restricted Boltzmann Machines in Apache Mahout
> -----------------------------------------------------
>
>                 Key: MAHOUT-375
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-375
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Sisir Koppaka
>
> Proposal Title: Restricted Boltzmann Machines in Apache Mahout (addresses issue MAHOUT-329)
> Student Name: Sisir Koppaka
> Student E-mail: sisir.kopp...@gmail.com
> Organization/Project:
> Assigned Mentor:
>
> Abstract
>
> This is a proposal to implement Restricted Boltzmann Machines in Apache Mahout as part of Google Summer of Code 2010. The demo for the code would be built on the Netflix dataset.
>
> 1 Introduction
>
> The Grand Prize solution to the Netflix Prize offered several new lessons in the application of traditional machine learning techniques to very large-scale datasets. The most significant among these were the impact of temporal models, the remarkable contribution of RBMs to the overall model, and the great success of ensemble models in achieving superior predictions.
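For concreteness, one row of the processed file linked above is a single [movieID, userID, rating] triple per line. A minimal parser for that format might look like the following sketch; the class and field names here are illustrative, not taken from the Mahout codebase:

```java
// Parses one line of the processed Netflix CSV (format: movieID,userID,rating).
// Illustrative sketch only; not part of the Mahout read-in code.
public class RatingLine {
    final int movieId;
    final int userId;
    final byte rating; // Netflix ratings are integers 1..5

    RatingLine(int movieId, int userId, byte rating) {
        this.movieId = movieId;
        this.userId = userId;
        this.rating = rating;
    }

    static RatingLine parse(String line) {
        String[] f = line.split(",");
        return new RatingLine(Integer.parseInt(f[0].trim()),
                              Integer.parseInt(f[1].trim()),
                              Byte.parseByte(f[2].trim()));
    }
}
```

Storing the rating as a byte keeps the in-memory footprint small, which matters when all ~100 million ratings are loaded at once.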
> The present proposal seeks to implement a conditional factored RBM [4] in Apache Mahout as a project under Google Summer of Code 2010.
>
> 2 Background
>
> The Netflix dataset takes the form of a sparse N x M matrix of ratings that N users assign to M movies. Matrix decompositions, such as variants of Singular Value Decomposition (SVD), form the first family of methods applied. This has also motivated several recent works in applied mathematics relevant to the Netflix Prize, including [1, 2]. A second family of techniques comprises k-nearest-neighbour approaches - user-user and movie-movie, using different distance measures such as Pearson and cosine. The third family, which arguably offers the most divergent predictions and thus the largest improvement in prediction RMSE when blended, is the RBM and its variants.
>
> [4] describes the algorithm that the author proposes to implement this summer in Apache Mahout. Parallelization can be done by updating all the hidden units in parallel, followed by all the visible units in parallel, owing to the conditional independence of the hidden units given a visible binary indicator matrix. Rather than implementing a naive RBM, the conditional factored RBM is chosen for its useful combination of effectiveness and speed. Minor variations could, in any case, be developed later with little difficulty.
>
> The training dataset consists of nearly 100 million ratings from 480,000 users on 17,770 movie titles. As part of the training data, Netflix also provides validation data (called the probe set), containing nearly 1.4 million ratings. In addition to the training and validation data, Netflix also provides a test set containing 2.8 million user/movie pairs (called the qualifying set) whose ratings were previously withheld, but have now been released after the conclusion of the Prize.
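The conditional independence mentioned above is what makes the layer-wise parallel update possible: given the visible layer, each hidden unit's activation probability can be computed independently of every other hidden unit. The sketch below shows this step for the plain (unfactored) RBM of [4], where each rated movie contributes a 5-way softmax visible unit; all names and the array layout are illustrative assumptions, not Mahout or paper code:

```java
// Sketch of the hidden-layer update in an RBM for collaborative filtering,
// following the model of Salakhutdinov, Mnih & Hinton [4] in its plain,
// unfactored form. Given the visible ratings, every hidden unit is
// conditionally independent, so this loop over j could run in parallel.
public class RbmStep {
    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // weights[i][k][j]: coupling between rating value (k+1) of movie i
    //                   and hidden unit j (assumed layout, for illustration)
    // ratings[i]: rating 1..5 for movie i, or 0 if the movie is unrated
    static double[] hiddenProbabilities(int[] ratings, double[][][] weights,
                                        double[] hiddenBias) {
        int numHidden = hiddenBias.length;
        double[] p = new double[numHidden];
        for (int j = 0; j < numHidden; j++) {
            double act = hiddenBias[j];
            for (int i = 0; i < ratings.length; i++) {
                if (ratings[i] > 0) {               // only rated movies contribute
                    act += weights[i][ratings[i] - 1][j];
                }
            }
            p[j] = sigmoid(act);  // independent of all other hidden units given V
        }
        return p;
    }
}
```

The symmetric step - reconstructing the visible softmax units given sampled hidden states - has the same independence structure, which is exactly what the alternating parallel update in the proposal exploits.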
> 3 Milestones
>
> 3.1 April 26-May 24: Community Bonding Period
>
> Certain boilerplate code for the Netflix dataset exists at org.apache.mahout.cf.taste.example.netflix. However, this code is non-distributed and unrelated to Hadoop. Parts of this code, such as the file read-in based on the Netflix format, will be reused to match the processed Netflix dataset file linked above.
>
> Test some of the already-implemented Mahout algorithms, such as SVD or k-Means, on the whole dataset to make sure that everything works as advertised, and make a note of the testing time. If the testing time is very large, then make a 10% training set and use the 10% probe set, which already exists as a standardized Netflix Prize community contribution. This is only so that iterations can be faster and a multi-node Hadoop installation need not always be required.
>
> Work on the map-reduce version of the RBM and evaluate whether parallelization beyond the alternating hidden-unit and visible-unit computation can be implemented. Get the community's approval for the map-reduce version of the RBM, and then proceed.
>
> 3.2 May 24-July 12: Pre-midterm
>
> Implementation time! Write code, test code, rewrite code. Should have working code with decent predictions by the end of this segment.
>
> Design details
>
> The RBM code would live at org.apache.mahout.classifier.rbm. Classify.java would need to be written to support the RBM, similar to the one in discriminative. An equivalent of BayesFileFormatter.java would not be required because of the pre-written Netflix read-in code mentioned above. ConfusionMatrix.java, ResultAnalyzer.java and ClassifyResult.java would be reused as-is from discriminative.
>
> algorithm would contain the actual conditional factored RBM algorithm. common would contain the relevant code common to various files in algorithm.
> mapreduce.rbm would contain the driver, mapper and reducer for the parallelized updating of the hidden-unit layer, followed by the visible units, and appropriately refactored code would be placed in mapreduce.common. The algorithm would be implemented generically, and the demo would be on the Netflix dataset.
>
> 3.3 July 12-July 31: Post-midterm
>
> If testing was on the 10% set, run multiple times on the whole dataset and ensure the results match. Test a two-method ensemble of SVD (already in Mahout) and RBM, and confirm that the RBM offers a unique, perpendicular dimension to the problem. Make sure unit tests are all in.
>
> Test on the Netflix dataset linked above and prepare for the demo.
>
> 3.4 July 31-August 16: Pencils down
>
> Refactor, tune and clean code. Final demo done. Write documentation and add a wiki page.
>
> 4 About Myself
>
> I'm a 19-year-old student hailing from the beautiful, sea-hugging coastal city of Visakhapatnam in Andhra Pradesh, South India. In 2006, I was one of only 110 students in the country to be awarded a scholarship from the Indian Institute of Science and the Department of Science and Technology, Government of India, under the KVPY programme, and I attended their summer camp that year.
>
> I interned in the Prediction Algorithms Lab at GE Research, Bangalore, last summer. I worked on a custom toolkit in C++ that implemented various Netflix algorithms and used data parallelization for some of the lengthier algorithms, such as user-user kNN. Our team stood at rank 409 at the end of the two-month internship, when the Grand Prize was awarded.
>
> I have also published independent work in GECCO 2010 [3]. GECCO is ACM SIGEVO's annual conference, and is ranked 11th out of 701 international conferences in AI, ML, Robotics, and HCI based on its impact factor.
> I have also contributed code to FFmpeg, and was part of my Hall of Residence's award-winning Java-based Open Soft project team, whose work we have now open-sourced. I am also an avid quizzer and have won several prestigious quizzes during my schooling days. I also got 4th rank in the regional qualifications for the Indian National Mathematics Olympiad.
>
> References
>
> [1] E. J. Candes and T. Tao. The Power of Convex Relaxation: Near-Optimal Matrix Completion. arXiv, 2009.
> [2] R. H. Keshavan, A. Montanari, and S. Oh. Matrix Completion from Noisy Entries. arXiv, 2009.
> [3] S. Koppaka and A. R. Hota. Superior Exploration-Exploitation Balance with Quantum-Inspired Hadamard Walks. Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation (GECCO '10), to appear, 2010.
> [4] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann Machines for Collaborative Filtering. Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007.
>
> Open Source Java Project - http://blunet.googlecode.com
> FFmpeg patches
> http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=16a043535b91595bf34d7e044ef398067e7443e0
> http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=9dde37a150ce2e5c53e2295d09efe289cebea9cd

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira