Dear Robin and the Apache Mahout team,

I'm Sisir Koppaka, a third-year student at IIT Kharagpur, India. I have contributed to open source projects such as FFmpeg (repository diff links are here <http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=16a043535b91595bf34d7e044ef398067e7443e0> and here <http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=9dde37a150ce2e5c53e2295d09efe289cebea9cd>), and I am very interested in working on a project for Apache Mahout this year - the Netflix algorithms project, to be precise, mentored by Robin. Kindly let me explain my background so that I can show why it is relevant in this context.
I've done research work in meta-heuristics, including proposing equivalents of local search and mutation for quantum-inspired algorithms, in my paper titled "*Superior Exploration-Exploitation Balance With Quantum-Inspired Hadamard Walks*", which was accepted as a late-breaking paper at GECCO 2010. We (a friend and I - it was independent work) hope to submit an expanded version of the communication to a journal in the near future. We implemented that project in Mathematica, since we needed a combination of functional paradigms, mathematically sound built-in resources (biased random number generation, simple linear programming functions, etc.), and rapid prototyping ability.

Last year I interned at GE Research in their Computing and Decision Sciences Lab <http://ge.geglobalresearch.com/technologies/computing-decision-sciences/>, where I worked on machine learning techniques for large-scale databases - specifically on the Netflix Prize itself. Over a two-month internship we rose from 1800th to 409th position on the leaderboard, and implemented at least one variant of each of the major algorithms. The contest ended at the same time as our internship concluded, and the winning result was a combination of multiple variants of the algorithms we had implemented. Interestingly, we did try to use Hadoop and the MapReduce model, prompted by a talk from a visitor from Yahoo! during that time. However, not having access to a cluster proved an impediment to fast iterative development. We had a single 16-core machine, so we developed a C++ toolkit that could run up to 16 threads (parallelizing data input, rather than restructuring the algorithms to suit the MapReduce model), and implemented all our algorithms on top of that toolkit.
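The partition-and-reduce structure of that data-input parallelization can be sketched as follows (a minimal Python sketch of the idea only - our actual toolkit was in C++, and every name below is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 16  # one worker per core, as on our 16-core machine


def partial_sum(chunk):
    """Each worker accumulates statistics over its own slice of the ratings."""
    total = sum(rating for _user, _item, rating in chunk)
    return total, len(chunk)


def global_mean(ratings):
    """Split the (user, item, rating) triples across NUM_WORKERS workers
    and reduce the partial results.

    The algorithm itself is untouched; only the input is partitioned,
    which is the approach we took instead of a MapReduce restructuring.
    """
    chunks = [ratings[i::NUM_WORKERS] for i in range(NUM_WORKERS)]
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        parts = list(pool.map(partial_sum, chunks))
    total = sum(s for s, _ in parts)
    count = sum(n for _, n in parts)
    return total / count
```

(In Python, true CPU parallelism would need processes rather than threads; the sketch is only meant to show the partition-and-reduce shape of the toolkit.)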
Specifically, the major algorithms we implemented were SVD, kNN movie-movie, kNN user-user, and NSVD (the BellKor variant and others such as Paterek's SVD, plus the temporal SVD++). Some algorithms had readily available open source code for the Netflix Prize, like NSVD1, so we used those as well. We also worked on certain regression schemes that can improve prediction accuracy, such as kernel ridge regression and its optimization. Towards the end, we also attempted to verify the results of the well-known paper showing that IMDB-Netflix correlation could destroy privacy and identify users: we imported IMDB datasets into a database, correlated the IMDB entries to Netflix (we matched double the number of movies the paper mentioned), and verified the results. We also identified genre-wise trends and recorded them as such. Unfortunately, the paper led to a lawsuit, in the settlement of which Netflix gave up holding future Prizes of this kind in return for withdrawal of the charges. The case effectively closed the possibility of Netflix or any other company releasing similar datasets to the public pending further advances in privacy-enforcement techniques, leaving the Netflix dataset the largest of its kind.

Naturally, it would be very interesting to implement the same algorithms on Hadoop: since the Netflix dataset is the largest of its kind, this would be of multiple uses - as a kickstart, as a tutorial, and (using an included segment of the full dataset) as a performance-testing tool for grids. In addition, the Netflix and IMDB framework base code could be incredibly useful as a prototyping tool for algorithm designers who want to design better machine learning algorithms and ensembles for large-scale databases. The lack of a decent scalable solution for the Netflix Prize that people can learn from and add to is also a major disappointment that this proposal hopes to correct.
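For reference, kernel ridge regression itself is compact enough to sketch (a minimal NumPy sketch from the standard textbook definitions - nothing here is taken from our internship code, and the parameter values are illustrative):

```python
import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)


def krr_fit(X, y, lam=1e-2, gamma=1.0):
    """Solve (K + lam * I) alpha = y for the dual coefficients alpha."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)


def krr_predict(X_train, alpha, X_new, gamma=1.0):
    """A prediction is a kernel-weighted combination of the training targets."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```

In the blending setting, X would hold the per-rating predictions of the individual algorithms and y the true ratings, so the regression learns how to combine the ensemble members.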
Specifically, the challenge I am personally looking forward to is implementing some of these algorithms with the MapReduce model, which is what I missed out on last time. I am looking forward to:

1. Account for the 12 global effects. The global effects were 12 corrections made to the dataset to account for time, multiple votes on a single day, etc., and these alone give a major boost to the accuracy of the predictions.
2. Implement at least one kNN-based approach. Most of them differ in their parameters or in their choice of distance definitions (Pearson, cosine, Euclidean, etc.).
3. Implement at least one SVD-based temporal approach. While Mahout already possesses SVD implementations - slight variations of which could replicate several of the SVD approaches used for the Netflix Prize - the temporal SVD++ has, in my past experience, been a very promising candidate that contributes a lot of orthogonal insight to the predictions.
4. Implement an RBM. RBMs were found to offer distinct insights into the predictions, and Mahout does not currently have an implementation (as Ankur C. Goel mentioned on this list a few days ago), so this is a no-brainer, must-do implementation for the Netflix Prize GSoC project.
5. Implement at least one regression scheme relevant to the Netflix Prize as part of the boilerplate code this project will require, to facilitate further additions of algorithm variants. I'd like to do kernel ridge regression, since it has, again, provided decent results on the Netflix dataset compared to other options.
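As an illustration of item 2 above, an item-item kNN predictor with Pearson similarity can be sketched like this (a toy in-memory Python sketch, not the MapReduce formulation the project would actually need; all names and the k value are mine):

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the users who rated both items.

    ratings_a / ratings_b map user -> rating for two items."""
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0
    xa = [ratings_a[u] for u in common]
    xb = [ratings_b[u] for u in common]
    ma = sum(xa) / len(xa)
    mb = sum(xb) / len(xb)
    num = sum((a - ma) * (b - mb) for a, b in zip(xa, xb))
    den = math.sqrt(sum((a - ma) ** 2 for a in xa) *
                    sum((b - mb) ** 2 for b in xb))
    return num / den if den else 0.0


def predict(user, item, item_ratings, k=2):
    """Similarity-weighted average of the user's ratings
    on the k most similar items that the user has rated."""
    neighbours = []
    for other, ratings in item_ratings.items():
        if other == item or user not in ratings:
            continue
        s = pearson(item_ratings[item], ratings)
        if s > 0:  # keep only positively correlated neighbours
            neighbours.append((s, ratings[user]))
    neighbours.sort(reverse=True)
    top = neighbours[:k]
    if not top:
        return 0.0
    return sum(s * r for s, r in top) / sum(s for s, _ in top)
```

Swapping `pearson` for a cosine or Euclidean-based similarity gives the other variants mentioned in item 2; the MapReduce version would compute the item-item similarity matrix as one job and the weighted predictions as another.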
I need to discuss one part with my mentor and the Apache Mahout team: while multiple regression schemes are required for this project, I am not sure it is necessary to implement them as MapReduce jobs, *at least within the scope of the GSoC project*. Regression is not very compute-intensive - certainly not compared to the central Netflix Prize algorithms - and would not be a critical limiting factor in successfully reaching the project goal of an 8% improvement. The idea is that items 1, 2, and 5 would take no more than two weeks, item 3 another two weeks, and the rest of the time would be spent on item 4. A demonstrated 8% improvement over Cinematch on the Netflix dataset would be the project goal.

There are plenty of references on the Grand Prize solution - Yehuda Koren's final summary <http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf> is a great starting point. However, when I last worked on the problem this was not yet available, so I referred to papers by Arkadiusz Paterek and some of BellKor's earlier papers.

Looking forward to a discussion about the proposal.

Thanks a lot,
Best regards,
Sisir Koppaka

--
SK