[
https://issues.apache.org/jira/browse/MAHOUT-375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved MAHOUT-375.
------------------------------
Resolution: Won't Fix
No plans for this to be completed at the moment; GSoC project did not complete.
> [GSOC] Restricted Boltzmann Machines in Apache Mahout
> -----------------------------------------------------
>
> Key: MAHOUT-375
> URL: https://issues.apache.org/jira/browse/MAHOUT-375
> Project: Mahout
> Issue Type: New Feature
> Reporter: Sisir Koppaka
> Assignee: Grant Ingersoll
> Attachments: MAHOUT-375.diff, MAHOUT-375.patch, MAHOUT-375.patch,
> MAHOUT-375.patch, MAHOUT-375.patch
>
>
> Proposal Title: Restricted Boltzmann Machines in Apache Mahout (addresses
> issue Mahout-329)
> Student Name: Sisir Koppaka
> Student E-mail: [email protected]
> Organization/Project:
> Assigned Mentor:
> Abstract
> This is a proposal to implement Restricted Boltzmann Machines in Apache
> Mahout as a part of Google Summer of Code 2010. The demo for the code would
> be built on the Netflix dataset.
> 1 Introduction
> The Grand Prize solution to the Netflix Prize offered several new lessons in
> the application of traditional machine learning techniques to very large
> scale datasets. The most significant among these were the impact of temporal
> models, the remarkable contribution of RBM's to the solution in the overall
> model, and the great success in applying ensemble models to achieve superior
> predictions. The present proposal seeks to implement a conditional factored
> RBM[4] in Apache Mahout as a project under Google Summer of Code 2010.
> 2 Background
> The Netflix dataset takes the form of a sparse N x M matrix of ratings
> that N users assign to M movies. Matrix decompositions, such as variants
> of Singular Value Decomposition (SVD), form the first type of method
> applied. This has also induced several recent works in applied mathematics
> relevant to the Netflix Prize, including [1, 2]. Another genre of
> techniques has been k-nearest neighbour approaches - user-user and
> movie-movie - using different distance measures such as Pearson and
> Cosine. The third set of techniques, which offers arguably the most
> divergent predictions and thus aids most in reducing prediction RMSE, is
> the RBM and its variants.
> [4] describes the algorithm that the author proposes to implement this
> summer in Apache Mahout. Parallelization can be done by updating all the
> hidden units, followed by all the visible units, in parallel, due to the
> conditional independence of the hidden units given a visible binary
> indicator matrix. Rather than implementing a naive RBM, the conditional
> factored RBM is chosen due to its useful combination of effectiveness and
> speed. Minor variations, in any case, could be developed later with little
> difficulty.
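As a rough sketch of the alternating block update described above (class and method names here are hypothetical, not the eventual Mahout API), each layer of a binary RBM can be sampled as one block because its units are conditionally independent given the other layer:

```java
import java.util.Random;

// Minimal binary RBM sketch. Given the visible layer, the hidden units are
// conditionally independent of one another (and vice versa), so each inner
// loop below could be run in parallel across units.
public class RbmSketch {
    final int numVisible, numHidden;
    final double[][] w;      // w[i][j]: weight between visible i and hidden j
    final double[] vBias, hBias;
    final Random rng = new Random(42);

    RbmSketch(int numVisible, int numHidden) {
        this.numVisible = numVisible;
        this.numHidden = numHidden;
        w = new double[numVisible][numHidden];
        vBias = new double[numVisible];
        hBias = new double[numHidden];
    }

    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // Sample h ~ P(h | v): each hidden unit depends only on the visible
    // vector, so the loop over j parallelizes trivially.
    double[] sampleHidden(int[] v) {
        double[] h = new double[numHidden];
        for (int j = 0; j < numHidden; j++) {
            double act = hBias[j];
            for (int i = 0; i < numVisible; i++) act += v[i] * w[i][j];
            h[j] = rng.nextDouble() < sigmoid(act) ? 1.0 : 0.0;
        }
        return h;
    }

    // Sample v ~ P(v | h): the symmetric block update for the visible layer.
    double[] sampleVisible(double[] h) {
        double[] v = new double[numVisible];
        for (int i = 0; i < numVisible; i++) {
            double act = vBias[i];
            for (int j = 0; j < numHidden; j++) act += h[j] * w[i][j];
            v[i] = rng.nextDouble() < sigmoid(act) ? 1.0 : 0.0;
        }
        return v;
    }
}
```

The conditional factored RBM of [4] additionally conditions on which movies a user rated and factors the weight matrix; the block-alternating structure shown here is what makes the layer updates parallelizable either way.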
> The training data set consists of nearly 100 million ratings from 480,000
> users on 17,770 movie titles. As part of the training data, Netflix also
> provides validation data (called the probe set), containing nearly 1.4
> million ratings. In addition to the training and validation data, Netflix
> also provides a test set containing 2.8 million user/movie pairs (called
> the qualifying set) whose ratings were previously withheld, but have now
> been released after the conclusion of the Prize.
> 3 Milestones
> 3.1 April 26-May 24
> Community Bonding Period
> Certain boilerplate code for the Netflix dataset exists at
> org.apache.mahout.cf.taste.example.netflix. However, this code is
> non-distributed and is unrelated to Hadoop. Certain parts of this code,
> like the file read-in based on the Netflix format, will be reused to match
> the processed Netflix dataset file linked below.
> Test out some of the already-implemented Mahout algorithms, like SVD or
> k-Means, on the whole dataset to make sure that everything works as
> advertised. Make a note of testing time. If testing time is very large,
> then make a 10% training set, and use the 10% probe, which already exists
> as a standardized Netflix Prize community contribution. This is only so
> that iterations can be faster and a multiple-node Hadoop installation need
> not always be required. Work on the map-reduce version of the RBM and
> evaluate whether parallelization beyond the alternating hidden-unit and
> visible-unit computation can be implemented. Get the community's approval
> for the map-reduce version of the RBM, and then proceed.
> 3.2 May 24-July 12 Pre-midterm
> Implementation time! Write code, test code, rewrite code.
> Should have working code with decent predictions by end of this segment.
> Design details
> The RBM code would live at org.apache.mahout.classifier.rbm. A
> Classify.java would need to be written to support the RBM, similar to the
> one in discriminative. An equivalent of BayesFileFormatter.java would not
> be required because of the pre-written Netflix read-in code mentioned
> above. ConfusionMatrix.java, ResultAnalyzer.java and ClassifyResult.java
> would be reused as-is from discriminative.
> algorithm would contain the actual conditional factored RBM algorithm.
> common would contain the relevant code common to various files in
> algorithm. mapreduce.rbm would contain the driver, mapper and reducer for
> the parallelized updating of the hidden units layer, followed by the
> visible units, and appropriately refactored code would be placed in
> mapreduce.common.
> The algorithm would be implemented generically, and the demo would be on the
> Netflix dataset.
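The division of labour between mapreduce.rbm's mapper and reducer can be sketched in plain Java (hypothetical class and method names; the real classes would extend Hadoop's Mapper and Reducer): users are independent given the model, so the mapper processes one user's ratings at a time and the reducer sums per-weight gradient contributions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the per-user map/reduce logic for RBM training.
// Hypothetical names; not the eventual mapreduce.rbm classes.
public class RbmMapReduceSketch {

    // Mapper side: one user's ratings yield one contribution per
    // (movie, hidden unit) weight, keyed so the shuffle groups all
    // contributions for the same weight together.
    static Map<String, Double> mapUser(int[] movieIds, double[] hiddenProbs) {
        Map<String, Double> grads = new HashMap<>();
        for (int movie : movieIds) {
            for (int j = 0; j < hiddenProbs.length; j++) {
                // In a real CD step this would be the positive-phase term
                // minus the negative-phase term; one term shown for brevity.
                grads.put(movie + ":" + j, hiddenProbs[j]);
            }
        }
        return grads;
    }

    // Reducer side: sum the per-user contributions for a single weight key.
    static double reduce(List<Double> contributions) {
        double sum = 0.0;
        for (double c : contributions) sum += c;
        return sum;
    }
}
```

The driver would then apply the summed gradients to the shared weights between passes, which is the step that requires synchronization across the cluster.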
> 3.3 July 12-July 31 Post-midterm
> If testing was on the 10% set, run multiple times on the whole dataset and
> ensure the results match. Test a two-method ensemble of SVD (already in
> Mahout) and RBM, and confirm that the RBM contributes a unique, largely
> orthogonal dimension to the problem. Make sure all unit tests are in.
> Test on Netflix dataset linked above and prepare for demo.
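The two-method ensemble mentioned above could be as simple as a convex combination of the SVD and RBM predictions, with the mixing weight chosen on the probe set (a minimal sketch with hypothetical names, not a Mahout API):

```java
// Sketch of a two-model blend and the RMSE metric used to pick the
// mixing weight alpha on the probe set.
public class BlendSketch {
    // Convex combination: alpha in [0, 1] weights the SVD prediction.
    static double blend(double svdPred, double rbmPred, double alpha) {
        return alpha * svdPred + (1.0 - alpha) * rbmPred;
    }

    // Root-mean-square error over a prediction set, the Netflix Prize metric.
    static double rmse(double[] predicted, double[] actual) {
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double d = predicted[i] - actual[i];
            sum += d * d;
        }
        return Math.sqrt(sum / predicted.length);
    }
}
```

If the blended RMSE beats both single-model RMSEs for some alpha strictly between 0 and 1, the RBM's errors are substantially decorrelated from the SVD's, which is the orthogonality check described above.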
> 3.4 July 31-August 16 Pencils down
> Refactor, tune and clean code. Final demo done. Write documentation and add a
> wiki page.
> 4 About Myself
> I'm a 19-year-old student hailing from the beautiful, sea-hugging coastal
> city of Visakhapatnam in Andhra Pradesh, South India. In 2006, I was one
> of only 110 students in the country to be awarded a scholarship from the
> Indian Institute of Science and the Department of Science and Technology,
> Government of India, under the KVPY programme, and I attended their summer
> camp that year.
> I interned in the Prediction Algorithms Lab at GE Research, Bangalore,
> last summer. I worked on a custom toolkit in C++ that implemented various
> Netflix algorithms and used data parallelization for some of the more
> lengthy algorithms, like user-user kNN. Our team stood at rank 409 at the
> end of the two-month internship, when the Grand Prize was awarded.
> I have also published independent work in GECCO 2010 [3]. GECCO is ACM
> SIGEVO's annual conference, and is ranked 11th out of 701 international
> conferences in AI, ML, Robotics, and HCI based on its impact factor.
> I have also contributed code to FFmpeg, and was part of my Hall of
> Residence's award-winning Java-based Open Soft project team, whose work we
> have now open sourced. I am an avid quizzer and have won several
> prestigious quizzes during my schooling days. I also took 4th rank in the
> regional qualifications for the Indian National Mathematics Olympiad.
> References
> [1] E. J. Candes and T. Tao. The Power of Convex Relaxation: Near-Optimal
> Matrix Completion. arXiv, 2009.
> [2] R. H. Keshavan, A. Montanari, and S. Oh. Matrix Completion from Noisy
> Entries. arXiv, 2009.
> [3] S. Koppaka and A. R. Hota. Superior Exploration-Exploitation Balance
> with Quantum-Inspired Hadamard Walks. Proceedings of the 12th Annual
> Conference on Genetic and Evolutionary Computation (GECCO '10), to appear,
> 2010.
> [4] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann
> Machines for Collaborative Filtering. Proceedings of the 24th
> International Conference on Machine Learning, Corvallis, OR, 2007.
> Open Source Java Project - http://blunet.googlecode.com
> FFmpeg patches
> http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=16a043535b91595bf34d7e044ef398067e7443e0
> http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=9dde37a150ce2e5c53e2295d09efe289cebea9cd
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.