Dear Robin and the Apache Mahout team,

I'm Sisir Koppaka, a third-year student at IIT Kharagpur, India. I have contributed to open source projects such as FFmpeg (repository diff links are here <http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=16a043535b91595bf34d7e044ef398067e7443e0> and here <http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=9dde37a150ce2e5c53e2295d09efe289cebea9cd>), and I am very interested in working on a project for Apache Mahout this year - the Netflix algorithms project, to be precise, mentored by Robin. Kindly let me explain my background so that I can show why it is relevant in this context.
I've done research work in meta-heuristics, including proposing equivalents of local search and mutation for quantum-inspired algorithms, in my paper titled "*Superior Exploration-Exploitation Balance With Quantum-Inspired Hadamard Walks*", which was accepted as a late-breaking paper at GECCO 2010. We (a friend and I - it was independent work) hope to submit an expanded version of the communication to a journal in the near future. We implemented that project in Mathematica, since we needed a combination of functional paradigms, mathematically sound built-in resources (biased random number generation, simple linear programming functions, etc.), and rapid prototyping ability.

Last year I interned at GE Research in their Computing and Decision Sciences Lab <http://ge.geglobalresearch.com/technologies/computing-decision-sciences/>, where I worked on machine learning techniques for large-scale databases - specifically on the Netflix Prize itself. Over a two-month internship we rose from 1800th to 409th position on the leaderboard, and implemented at least one variant of each of the major algorithms. The contest ended at the same time as our internship concluded, and the winning result was a combination of multiple variants of the algorithms we had implemented. Interestingly, we did try to use Hadoop and the MapReduce model, prompted by a talk from a visitor from Yahoo! during that time. However, not having access to a cluster proved an impediment to fast iterative development. We had a single 16-core machine, so we developed a C++ toolkit that could run up to 16 threads (parallelizing data input, rather than restructuring the algorithms to suit the MapReduce model), and implemented all our algorithms on top of that toolkit.
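The partition-and-reduce structure of that data-input parallelization can be sketched as follows (a minimal Python sketch of the idea only - our actual toolkit was in C++, and every name below is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 16  # one worker per core, as on our 16-core machine


def partial_sum(chunk):
    """Each worker accumulates statistics over its own slice of the ratings."""
    total = sum(rating for _user, _item, rating in chunk)
    return total, len(chunk)


def global_mean(ratings):
    """Split the (user, item, rating) triples across NUM_WORKERS workers
    and reduce the partial results.

    The algorithm itself is untouched; only the input is partitioned,
    which is the approach we took instead of a MapReduce restructuring.
    """
    chunks = [ratings[i::NUM_WORKERS] for i in range(NUM_WORKERS)]
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        parts = list(pool.map(partial_sum, chunks))
    total = sum(s for s, _ in parts)
    count = sum(n for _, n in parts)
    return total / count
```

(In Python, true CPU parallelism would need processes rather than threads; the sketch is only meant to show the partition-and-reduce shape of the toolkit.)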
Specifically, the major algorithms we implemented were SVD, kNN movie-movie, kNN user-user, and NSVD (the BellKor variant and others such as Paterek's SVD, plus the temporal SVD++). Some algorithms had readily available open source code for the Netflix Prize, like NSVD1, so we used those as well. We also worked on certain regression schemes that can improve prediction accuracy, such as kernel ridge regression and its optimization. Towards the end, we also attempted to verify the results of the well-known paper showing that IMDB-Netflix correlation could destroy privacy and identify users: we imported IMDB datasets into a database, correlated the IMDB entries to Netflix (we matched double the number of movies the paper mentioned), and verified the results. We also identified genre-wise trends and recorded them as such. Unfortunately, the paper led to a lawsuit, in the settlement of which Netflix gave up holding future Prizes of this kind in return for withdrawal of the charges. The case effectively closed the possibility of Netflix or any other company releasing similar datasets to the public pending further advances in privacy-enforcement techniques, leaving the Netflix dataset the largest of its kind.

Naturally, it would be very interesting to implement the same algorithms on Hadoop: since the Netflix dataset is the largest of its kind, this would be of multiple uses - as a kickstart, as a tutorial, and (using an included segment of the full dataset) as a performance-testing tool for grids. In addition, the Netflix and IMDB framework base code could be incredibly useful as a prototyping tool for algorithm designers who want to design better machine learning algorithms and ensembles for large-scale databases. The lack of a decent scalable solution for the Netflix Prize that people can learn from and add to is also a major disappointment that this proposal hopes to correct.
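For reference, kernel ridge regression itself is compact enough to sketch (a minimal NumPy sketch from the standard textbook definitions - nothing here is taken from our internship code, and the parameter values are illustrative):

```python
import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)


def krr_fit(X, y, lam=1e-2, gamma=1.0):
    """Solve (K + lam * I) alpha = y for the dual coefficients alpha."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)


def krr_predict(X_train, alpha, X_new, gamma=1.0):
    """A prediction is a kernel-weighted combination of the training targets."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```

In the blending setting, X would hold the per-rating predictions of the individual algorithms and y the true ratings, so the regression learns how to combine the ensemble members.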
Specifically, the challenge I am personally looking forward to is implementing some of these algorithms with the MapReduce model, which is what I missed out on last time. I am looking forward to:

1. Account for the 12 global effects. The global effects were 12 corrections made to the dataset to account for time, multiple votes on a single day, etc., and these alone give a major boost to the accuracy of the predictions.
2. Implement at least one kNN-based approach. Most of them differ in their parameters or in their choice of distance definitions (Pearson, cosine, Euclidean, etc.).
3. Implement at least one SVD-based temporal approach. While Mahout already possesses SVD implementations - slight variations of which could replicate several of the SVD approaches used for the Netflix Prize - the temporal SVD++ has, in my past experience, been a very promising candidate that contributes a lot of orthogonal insight to the predictions.
4. Implement an RBM. RBMs were found to offer distinct insights into the predictions, and Mahout does not currently have an implementation (as Ankur C. Goel mentioned on this list a few days ago), so this is a no-brainer, must-do implementation for the Netflix Prize GSoC project.
5. Implement at least one regression scheme relevant to the Netflix Prize as part of the boilerplate code this project will require, to facilitate further additions of algorithm variants. I'd like to do kernel ridge regression, since it has, again, provided decent results on the Netflix dataset compared to other options.
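As an illustration of item 2 above, an item-item kNN predictor with Pearson similarity can be sketched like this (a toy in-memory Python sketch, not the MapReduce formulation the project would actually need; all names and the k value are mine):

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the users who rated both items.

    ratings_a / ratings_b map user -> rating for two items."""
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0
    xa = [ratings_a[u] for u in common]
    xb = [ratings_b[u] for u in common]
    ma = sum(xa) / len(xa)
    mb = sum(xb) / len(xb)
    num = sum((a - ma) * (b - mb) for a, b in zip(xa, xb))
    den = math.sqrt(sum((a - ma) ** 2 for a in xa) *
                    sum((b - mb) ** 2 for b in xb))
    return num / den if den else 0.0


def predict(user, item, item_ratings, k=2):
    """Similarity-weighted average of the user's ratings
    on the k most similar items that the user has rated."""
    neighbours = []
    for other, ratings in item_ratings.items():
        if other == item or user not in ratings:
            continue
        s = pearson(item_ratings[item], ratings)
        if s > 0:  # keep only positively correlated neighbours
            neighbours.append((s, ratings[user]))
    neighbours.sort(reverse=True)
    top = neighbours[:k]
    if not top:
        return 0.0
    return sum(s * r for s, r in top) / sum(s for s, _ in top)
```

Swapping `pearson` for a cosine or Euclidean-based similarity gives the other variants mentioned in item 2; the MapReduce version would compute the item-item similarity matrix as one job and the weighted predictions as another.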
I need to discuss one part with my mentor and the Apache Mahout team: while multiple regression schemes are required for this project, I am not sure it is necessary to implement them as MapReduce jobs, *at least within the scope of the GSoC project*. Regression is not very compute-intensive - certainly not compared to the central Netflix Prize algorithms - and would not be a critical limiting factor in successfully reaching the project goal of an 8% improvement. The idea is that items 1, 2, and 5 would take no more than two weeks, item 3 another two weeks, and the rest of the time would be spent on item 4. A demonstrated 8% improvement over Cinematch on the Netflix dataset would be the project goal.

There are plenty of references on the Grand Prize solution - Yehuda Koren's final summary <http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf> is a great starting point. However, when I last worked on the problem this was not yet available, so I referred to papers by Arkadiusz Paterek and some of BellKor's earlier papers.

Looking forward to a discussion about the proposal.

Thanks a lot,
Best regards,
Sisir Koppaka

--
SK