[GSOC] Restricted Boltzmann Machines in Apache Mahout
-----------------------------------------------------

                 Key: MAHOUT-375
                 URL: https://issues.apache.org/jira/browse/MAHOUT-375
             Project: Mahout
          Issue Type: New Feature
            Reporter: Sisir Koppaka


Proposal Title: Restricted Boltzmann Machines in Apache Mahout (addresses issue MAHOUT-329)

Student Name: Sisir Koppaka

Student E-mail: sisir.kopp...@gmail.com

Organization/Project:

Assigned Mentor:

Abstract
This is a proposal to implement Restricted Boltzmann Machines in Apache Mahout 
as a part of Google Summer of Code 2010. The demo for the code would be built 
on the Netflix dataset.

1 Introduction
The Grand Prize solution to the Netflix Prize offered several new lessons in 
applying traditional machine learning techniques to very large-scale datasets. 
The most significant among these were the impact of temporal models, the 
remarkable contribution of RBMs to the overall model, and the great success of 
ensemble models in achieving superior predictions. The present proposal seeks 
to implement a conditional factored RBM [4] in Apache Mahout as a project 
under Google Summer of Code 2010.

2 Background
The Netflix dataset takes the form of a sparse N x M matrix of the ratings 
that N users assign to M movies. Matrix decompositions, such as variants of 
the Singular Value Decomposition (SVD), form the first family of methods 
applied. The Prize has also induced several recent works in applied 
mathematics, including [1, 2]. A second family of techniques has been 
k-nearest-neighbour approaches, both user-user and movie-movie, using 
different distance measures such as Pearson correlation and cosine similarity. 
The third family, which arguably offers the most divergent predictions and 
therefore the largest reduction in prediction RMSE when blended with the 
others, is the RBM and its variants.
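
For concreteness, the two distance measures named above can be written in a 
few lines of plain Java. This is an illustrative sketch only, not Mahout's 
implementation; the class and method names are invented:

{code:java}
/** Illustrative sketch of the two user-user similarity measures named
 *  above, computed over two users' ratings of the movies they co-rated. */
public final class SimilaritySketch {

  /** Cosine similarity between two equally-indexed rating vectors. */
  static double cosine(double[] a, double[] b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  /** Pearson correlation: cosine similarity of the mean-centered vectors. */
  static double pearson(double[] a, double[] b) {
    double meanA = 0.0, meanB = 0.0;
    for (int i = 0; i < a.length; i++) {
      meanA += a[i];
      meanB += b[i];
    }
    meanA /= a.length;
    meanB /= b.length;
    double[] ca = new double[a.length];
    double[] cb = new double[b.length];
    for (int i = 0; i < a.length; i++) {
      ca[i] = a[i] - meanA;
      cb[i] = b[i] - meanB;
    }
    return cosine(ca, cb);
  }
}
{code}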

[4] describes the algorithm that the author proposes to implement this summer 
in Apache Mahout. Parallelization is possible by updating all the hidden units 
in parallel, followed by all the visible units, owing to the conditional 
independence of the hidden units given a visible binary indicator matrix. 
Rather than implementing a naive RBM, the conditional factored RBM is chosen 
for its useful combination of effectiveness and speed. Minor variations could, 
in any case, be developed later with little difficulty.
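
To make the parallelization argument concrete, here is a minimal plain-Java 
sketch of one alternating (block Gibbs) step, simplified to a plain binary RBM 
rather than the conditional factored variant of [4]; all names here are 
illustrative assumptions, not proposed class names:

{code:java}
import java.util.Random;

/** Sketch of one alternating block-Gibbs step in a binary RBM. The
 *  conditional factored RBM of [4] additionally uses per-rating softmax
 *  visible units and a factored weight matrix, omitted here for brevity. */
public final class RbmGibbsStepSketch {
  final int numVisible;
  final int numHidden;
  final double[][] weights;   // numVisible x numHidden
  final double[] visBias;
  final double[] hidBias;
  final Random rng = new Random();

  RbmGibbsStepSketch(int v, int h) {
    numVisible = v;
    numHidden = h;
    weights = new double[v][h];
    visBias = new double[v];
    hidBias = new double[h];
  }

  static double sigmoid(double x) {
    return 1.0 / (1.0 + Math.exp(-x));
  }

  /** Sample h ~ p(h | v). Each hidden unit depends only on the visible
   *  layer, so this loop is embarrassingly parallel. */
  double[] sampleHidden(double[] v) {
    double[] h = new double[numHidden];
    for (int j = 0; j < numHidden; j++) {
      double act = hidBias[j];
      for (int i = 0; i < numVisible; i++) {
        act += v[i] * weights[i][j];
      }
      h[j] = rng.nextDouble() < sigmoid(act) ? 1.0 : 0.0;
    }
    return h;
  }

  /** Sample v ~ p(v | h); likewise independent across visible units. */
  double[] sampleVisible(double[] h) {
    double[] v = new double[numVisible];
    for (int i = 0; i < numVisible; i++) {
      double act = visBias[i];
      for (int j = 0; j < numHidden; j++) {
        act += h[j] * weights[i][j];
      }
      v[i] = rng.nextDouble() < sigmoid(act) ? 1.0 : 0.0;
    }
    return v;
  }
}
{code}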

The training data set consists of nearly 100 million ratings from 480,000 
users on 17,770 movie titles, meaning just over 1% of the roughly 8.5 billion 
possible user/movie pairs are rated. As part of the training data, Netflix 
also provides validation data (called the probe set) containing nearly 1.4 
million ratings. In addition to the training and validation data, Netflix 
provides a test set of 2.8 million user/movie pairs (called the qualifying 
set) whose ratings were previously withheld but have been released following 
the conclusion of the Prize.

3 Milestones 

3.1     April 26-May 24 Community Bonding Period
Some boilerplate code for the Netflix dataset exists at 
org.apache.mahout.cf.taste.example.netflix. However, this code is 
non-distributed and unrelated to Hadoop. Parts of it, such as the file read-in 
for the Netflix format, will be reused to match the processed Netflix dataset 
file linked below.
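
For reference, each per-movie file in the raw Netflix training set begins with 
"movieId:" on its own line, followed by one "userId,rating,date" triple per 
line. A hypothetical reader along these lines (the RatingHandler callback is 
invented for this sketch and is not part of the existing Mahout code):

{code:java}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

/** Hypothetical sketch of the read-in step for one raw Netflix file. */
public final class NetflixFileReaderSketch {

  public static void read(String path, RatingHandler handler)
      throws IOException {
    try (BufferedReader in = new BufferedReader(new FileReader(path))) {
      // First line is the movie id followed by a colon, e.g. "1:".
      String header = in.readLine();
      int movieId = Integer.parseInt(header.substring(0, header.length() - 1));
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split(",");
        // parts[2] is the rating date, ignored in this sketch.
        handler.handle(movieId,
                       Integer.parseInt(parts[0]),
                       Integer.parseInt(parts[1]));
      }
    }
  }

  /** Callback so the same reader can feed different consumers. */
  public interface RatingHandler {
    void handle(int movieId, int userId, int rating);
  }
}
{code}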

Test some of the already-implemented Mahout algorithms, such as SVD or 
k-Means, on the whole dataset to make sure that everything works as 
advertised, and make a note of the running time. If the running time is very 
large, switch to a 10% training set and the 10% probe, which already exists as 
a standardized Netflix Prize community contribution; this is only so that 
iterations can be faster and a multi-node Hadoop installation need not always 
be required. Work on the map-reduce version of the RBM and evaluate whether 
parallelization beyond the alternating hidden-unit and visible-unit 
computation can be implemented. Get the community's approval for the 
map-reduce design, then proceed.

3.2     May 24-July 12 Pre-midterm      
Implementation time! Write code, test code, rewrite code. I should have 
working code with decent predictions by the end of this segment.

Design details
The RBM code would live at org.apache.mahout.classifier.rbm. A Classify.java 
would need to be written to support the RBM, similar to the one in 
discriminative. An equivalent of BayesFileFormatter.java would not be required 
because of the pre-written Netflix read-in code mentioned above. 
ConfusionMatrix.java, ResultAnalyzer.java and ClassifyResult.java would be 
reused as-is from discriminative.

The algorithm package would contain the actual conditional factored RBM 
algorithm, and common would contain the code shared across the files in 
algorithm. mapreduce.rbm would contain the driver, mapper and reducer for the 
parallelized updating of the hidden-unit layer followed by the visible units, 
with appropriately refactored code placed in mapreduce.common.

The algorithm would be implemented generically, and the demo would be on the 
Netflix dataset. 
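
To give a feel for the mapreduce.rbm piece, here is a rough, non-binding 
sketch of what the hidden-unit mapper might look like on the Hadoop 0.20 API. 
The input record layout, key/value choices and class name are assumptions for 
illustration only, not settled design:

{code:java}
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Illustrative sketch: each input record is assumed to hold one user's
 *  ratings (userId -> "movieId:rating,movieId:rating,..."). The mapper
 *  would score that user's hidden units against the current weights
 *  (shipped to the nodes, e.g. via the DistributedCache) and emit
 *  per-weight gradient contributions for the reducer to sum. */
public class RbmHiddenUnitMapperSketch
    extends Mapper<IntWritable, Text, IntWritable, Text> {

  @Override
  protected void map(IntWritable userId, Text ratings, Context context)
      throws IOException, InterruptedException {
    // 1. Parse this user's (movie, rating) pairs from the value.
    // 2. Compute p(h_j = 1 | ratings) for every hidden unit j; these are
    //    independent given the visible layer, so users can be processed
    //    in parallel with no coordination between map tasks.
    // 3. Emit (weightIndex, partialGradient) pairs; the reducer sums them
    //    and applies the contrastive-divergence update.
    context.write(userId, ratings); // placeholder for the real emission
  }
}
{code}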

3.3     July 12-July 31 Post-midterm    

If testing was done on the 10% set, run multiple times on the whole dataset 
and ensure the results match. Test a two-method ensemble of SVD (already in 
Mahout) and the RBM, and confirm that the RBM contributes a largely orthogonal 
dimension to the problem, i.e. that its errors are sufficiently decorrelated 
from SVD's for blending to help. Make sure all unit tests are in.
Test on the Netflix dataset linked above and prepare for the demo.
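
The ensemble test itself can be a simple linear blend whose single weight is 
fit on the probe set; if the RBM's errors really are decorrelated from SVD's, 
the blended RMSE should drop noticeably. A minimal sketch, with hypothetical 
prediction arrays rather than Mahout API:

{code:java}
/** Illustrative linear blend of two predictors on the probe set: finds
 *  the weight w minimizing the RMSE of w*svd + (1-w)*rbm against the
 *  true probe ratings, via the closed-form least-squares solution. */
public final class TwoWayBlendSketch {

  static double bestWeight(double[] svd, double[] rbm, double[] truth) {
    // Minimize sum_i (w*(svd_i - rbm_i) + rbm_i - truth_i)^2 over w.
    double num = 0.0, den = 0.0;
    for (int i = 0; i < truth.length; i++) {
      double d = svd[i] - rbm[i];
      num += d * (truth[i] - rbm[i]);
      den += d * d;
    }
    return num / den;
  }
}
{code}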

3.4     July 31-August 16 Pencils down 
Refactor, tune and clean code. Final demo done. Write documentation and add a 
wiki page.

4       About Myself
I'm a 19-year-old student hailing from the beautiful, sea-hugging coastal city 
of Visakhapatnam in Andhra Pradesh, South India. In 2006, I was one of only 
110 students in the country to be awarded a scholarship from the Indian 
Institute of Science and the Department of Science and Technology, Government 
of India, under the KVPY programme, and I attended their summer camp that 
year.

I interned in the Prediction Algorithms Lab at GE Research, Bangalore, last 
summer. I worked on a custom toolkit in C++ that implemented various Netflix 
algorithms and used data parallelization for some of the lengthier algorithms, 
such as user-user kNN. Our team stood at rank 409 when the Grand Prize was 
awarded at the end of the two-month internship.

I have also published independent work at GECCO 2010 [3]. GECCO is ACM 
SIGEVO's annual conference, and is ranked 11th out of 701 international 
conferences in AI, ML, Robotics, and HCI based on its impact factor.

I have also contributed code to FFmpeg, and was part of my Hall of Residence's 
award-winning Java-based Open Soft project team, whose work we have now open 
sourced. I am also an avid quizzer and have won several prestigious quizzes 
during my school days. I also secured 4th rank in the regional qualifications 
for the Indian National Mathematics Olympiad.

References
[1] E. J. Candès and T. Tao. The Power of Convex Relaxation: Near-Optimal 
Matrix Completion. arXiv, 2009.
[2] R. H. Keshavan, A. Montanari, and S. Oh. Matrix Completion from Noisy 
Entries. arXiv, 2009.
[3] S. Koppaka and A. R. Hota. Superior Exploration-Exploitation Balance with 
Quantum-Inspired Hadamard Walks. In Proceedings of the 12th Annual Conference 
on Genetic and Evolutionary Computation (GECCO '10), to appear, 2010.
[4] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann Machines 
for Collaborative Filtering. In Proceedings of the 24th International 
Conference on Machine Learning, Corvallis, OR, 2007.

Open Source Java Project - http://blunet.googlecode.com
FFmpeg patches 
http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=16a043535b91595bf34d7e044ef398067e7443e0
http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=9dde37a150ce2e5c53e2295d09efe289cebea9cd
