[ https://issues.apache.org/jira/browse/MAHOUT-375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855694#action_12855694 ]
Sisir Koppaka commented on MAHOUT-375:
--------------------------------------

Moved the proposal to JIRA. I've processed the Netflix dataset in the format Sean suggested on mahout-dev and put it up at https://mahout-rbm-netflix.s3.amazonaws.com/netflix-dataset-wodates.csv (click to view in browser/download). The format is [movieID, userID, rating]. During the proposal consideration period, I am implementing a version of the RBM that is not yet conditional, factored, or parallelized. I will submit this tomorrow after testing. Since testing on my machine alone is currently quite time-consuming for the Netflix dataset, are there any datasets you could suggest for quicker testing of the RBM, at least for now? If the test dataset has published RBM results that I can compare against, that would really help with the testing.

> [GSOC] Restricted Boltzmann Machines in Apache Mahout
> -----------------------------------------------------
>
>                 Key: MAHOUT-375
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-375
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Sisir Koppaka
>
> Proposal Title: Restricted Boltzmann Machines in Apache Mahout (addresses issue MAHOUT-329)
> Student Name: Sisir Koppaka
> Student E-mail: sisir.kopp...@gmail.com
> Organization/Project:
> Assigned Mentor:
>
> Abstract
>
> This is a proposal to implement Restricted Boltzmann Machines in Apache Mahout as part of Google Summer of Code 2010. The demo for the code would be built on the Netflix dataset.
>
> 1 Introduction
>
> The Grand Prize solution to the Netflix Prize offered several new lessons in the application of traditional machine learning techniques to very large-scale datasets. The most significant among these were the impact of temporal models, the remarkable contribution of RBMs to the overall model, and the great success of ensemble models in achieving superior predictions.
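For concreteness, one row of the processed file linked above is a single [movieID, userID, rating] triple per line. A minimal parser for that format might look like the following sketch; the class and field names here are illustrative, not taken from the Mahout codebase:

```java
// Parses one line of the processed Netflix CSV (format: movieID,userID,rating).
// Illustrative sketch only; not part of the Mahout read-in code.
public class RatingLine {
    final int movieId;
    final int userId;
    final byte rating; // Netflix ratings are integers 1..5

    RatingLine(int movieId, int userId, byte rating) {
        this.movieId = movieId;
        this.userId = userId;
        this.rating = rating;
    }

    static RatingLine parse(String line) {
        String[] f = line.split(",");
        return new RatingLine(Integer.parseInt(f[0].trim()),
                              Integer.parseInt(f[1].trim()),
                              Byte.parseByte(f[2].trim()));
    }
}
```

Storing the rating as a byte keeps the in-memory footprint small, which matters when all ~100 million ratings are loaded at once.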
> The present proposal seeks to implement a conditional factored RBM [4] in Apache Mahout as a project under Google Summer of Code 2010.
>
> 2 Background
>
> The Netflix dataset takes the form of a sparse N x M matrix of ratings that N users assign to M movies. Matrix decompositions, such as variants of Singular Value Decomposition (SVD), form the first family of methods applied. This has also motivated several recent works in applied mathematics relevant to the Netflix Prize, including [1, 2]. A second family of techniques comprises k-nearest-neighbour approaches - user-user and movie-movie, using different distance measures such as Pearson and cosine. The third family, which arguably offers the most divergent predictions and thus the largest improvement in prediction RMSE when blended, is the RBM and its variants.
>
> [4] describes the algorithm that the author proposes to implement this summer in Apache Mahout. Parallelization can be done by updating all the hidden units in parallel, followed by all the visible units in parallel, owing to the conditional independence of the hidden units given a visible binary indicator matrix. Rather than implementing a naive RBM, the conditional factored RBM is chosen for its useful combination of effectiveness and speed. Minor variations could, in any case, be developed later with little difficulty.
>
> The training dataset consists of nearly 100 million ratings from 480,000 users on 17,770 movie titles. As part of the training data, Netflix also provides validation data (called the probe set), containing nearly 1.4 million ratings. In addition to the training and validation data, Netflix also provides a test set containing 2.8 million user/movie pairs (called the qualifying set) whose ratings were previously withheld, but have now been released after the conclusion of the Prize.
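The conditional independence mentioned above is what makes the layer-wise parallel update possible: given the visible layer, each hidden unit's activation probability can be computed independently of every other hidden unit. The sketch below shows this step for the plain (unfactored) RBM of [4], where each rated movie contributes a 5-way softmax visible unit; all names and the array layout are illustrative assumptions, not Mahout or paper code:

```java
// Sketch of the hidden-layer update in an RBM for collaborative filtering,
// following the model of Salakhutdinov, Mnih & Hinton [4] in its plain,
// unfactored form. Given the visible ratings, every hidden unit is
// conditionally independent, so this loop over j could run in parallel.
public class RbmStep {
    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // weights[i][k][j]: coupling between rating value (k+1) of movie i
    //                   and hidden unit j (assumed layout, for illustration)
    // ratings[i]: rating 1..5 for movie i, or 0 if the movie is unrated
    static double[] hiddenProbabilities(int[] ratings, double[][][] weights,
                                        double[] hiddenBias) {
        int numHidden = hiddenBias.length;
        double[] p = new double[numHidden];
        for (int j = 0; j < numHidden; j++) {
            double act = hiddenBias[j];
            for (int i = 0; i < ratings.length; i++) {
                if (ratings[i] > 0) {               // only rated movies contribute
                    act += weights[i][ratings[i] - 1][j];
                }
            }
            p[j] = sigmoid(act);  // independent of all other hidden units given V
        }
        return p;
    }
}
```

The symmetric step - reconstructing the visible softmax units given sampled hidden states - has the same independence structure, which is exactly what the alternating parallel update in the proposal exploits.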
> 3 Milestones
>
> 3.1 April 26-May 24: Community Bonding Period
>
> Certain boilerplate code for the Netflix dataset exists at org.apache.mahout.cf.taste.example.netflix. However, this code is non-distributed and unrelated to Hadoop. Parts of this code, such as the file read-in based on the Netflix format, will be reused to match the processed Netflix dataset file linked above.
>
> Test some of the already-implemented Mahout algorithms, such as SVD or k-Means, on the whole dataset to make sure that everything works as advertised, and make a note of the testing time. If the testing time is very large, then make a 10% training set and use the 10% probe set, which already exists as a standardized Netflix Prize community contribution. This is only so that iterations can be faster and a multi-node Hadoop installation need not always be required.
>
> Work on the map-reduce version of the RBM and evaluate whether parallelization beyond the alternating hidden-unit and visible-unit computation can be implemented. Get the community's approval for the map-reduce version of the RBM, and then proceed.
>
> 3.2 May 24-July 12: Pre-midterm
>
> Implementation time! Write code, test code, rewrite code. Should have working code with decent predictions by the end of this segment.
>
> Design details
>
> The RBM code would live at org.apache.mahout.classifier.rbm. Classify.java would need to be written to support the RBM, similar to the one in discriminative. An equivalent of BayesFileFormatter.java would not be required because of the pre-written Netflix read-in code mentioned above. ConfusionMatrix.java, ResultAnalyzer.java and ClassifyResult.java would be reused as-is from discriminative.
>
> algorithm would contain the actual conditional factored RBM algorithm. common would contain the relevant code common to various files in algorithm.
> mapreduce.rbm would contain the driver, mapper and reducer for the parallelized updating of the hidden-unit layer, followed by the visible units, and appropriately refactored code would be placed in mapreduce.common. The algorithm would be implemented generically, and the demo would be on the Netflix dataset.
>
> 3.3 July 12-July 31: Post-midterm
>
> If testing was on the 10% set, run multiple times on the whole dataset and ensure the results match. Test a two-method ensemble of SVD (already in Mahout) and RBM, and confirm that the RBM offers a unique, perpendicular dimension to the problem. Make sure unit tests are all in.
>
> Test on the Netflix dataset linked above and prepare for the demo.
>
> 3.4 July 31-August 16: Pencils down
>
> Refactor, tune and clean code. Final demo done. Write documentation and add a wiki page.
>
> 4 About Myself
>
> I'm a 19-year-old student hailing from the beautiful, sea-hugging coastal city of Visakhapatnam in Andhra Pradesh, South India. In 2006, I was one of only 110 students in the country to be awarded a scholarship from the Indian Institute of Science and the Department of Science and Technology, Government of India, under the KVPY programme, and I attended their summer camp that year.
>
> I interned in the Prediction Algorithms Lab at GE Research, Bangalore, last summer. I worked on a custom toolkit in C++ that implemented various Netflix algorithms and used data parallelization for some of the lengthier algorithms, such as user-user kNN. Our team stood at rank 409 at the end of the two-month internship, when the Grand Prize was awarded.
>
> I have also published independent work in GECCO 2010 [3]. GECCO is ACM SIGEVO's annual conference, and is ranked 11th out of 701 international conferences in AI, ML, Robotics, and HCI based on its impact factor.
> I have also contributed code to FFmpeg, and was part of my Hall of Residence's award-winning Java-based Open Soft project team, whose work we have now open-sourced. I am also an avid quizzer and have won several prestigious quizzes during my schooling days. I also got 4th rank in the regional qualifications for the Indian National Mathematics Olympiad.
>
> References
>
> [1] E. J. Candes and T. Tao. The Power of Convex Relaxation: Near-Optimal Matrix Completion. arXiv, 2009.
> [2] R. H. Keshavan, A. Montanari, and S. Oh. Matrix Completion from Noisy Entries. arXiv, 2009.
> [3] S. Koppaka and A. R. Hota. Superior Exploration-Exploitation Balance with Quantum-Inspired Hadamard Walks. Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation (GECCO '10), to appear, 2010.
> [4] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann Machines for Collaborative Filtering. Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007.
>
> Open Source Java Project - http://blunet.googlecode.com
> FFmpeg patches
> http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=16a043535b91595bf34d7e044ef398067e7443e0
> http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=9dde37a150ce2e5c53e2295d09efe289cebea9cd

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira