Hi Sisir, I'm the one who added the most recent SVD stuff to Mahout, and while I'd love to see improvements in that, and its incorporation into Netflix-style recommenders, I would even more like to see a stacked-RBM implementation if you think you can do that. We don't currently have anything like that in Mahout, and even if it is completely serial (no Hadoop hooks), it would be an excellent addition to the project.
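To give a concrete sense of the scope I mean: here's a rough, untested sketch of CD-1 training for a single binary RBM layer. None of these class or method names exist in Mahout - they're placeholders - and "stacked" would just mean feeding each trained layer's hidden activations to the next layer as its visible input.

import java.util.Random;

// Placeholder sketch: one binary-binary RBM layer trained with CD-1.
// Not an existing Mahout class; purely to illustrate the intended scope.
public class SimpleRbm {

  final int numVisible;
  final int numHidden;
  final double[][] w;       // weight matrix, numVisible x numHidden
  final double[] vBias;     // visible-unit biases
  final double[] hBias;     // hidden-unit biases
  final Random rng = new Random(42);
  final double learningRate = 0.1;

  SimpleRbm(int numVisible, int numHidden) {
    this.numVisible = numVisible;
    this.numHidden = numHidden;
    this.w = new double[numVisible][numHidden];
    this.vBias = new double[numVisible];
    this.hBias = new double[numHidden];
    for (int i = 0; i < numVisible; i++) {
      for (int j = 0; j < numHidden; j++) {
        w[i][j] = 0.01 * rng.nextGaussian();  // small random init
      }
    }
  }

  static double sigmoid(double x) {
    return 1.0 / (1.0 + Math.exp(-x));
  }

  double[] hiddenProbs(double[] v) {
    double[] h = new double[numHidden];
    for (int j = 0; j < numHidden; j++) {
      double a = hBias[j];
      for (int i = 0; i < numVisible; i++) a += v[i] * w[i][j];
      h[j] = sigmoid(a);
    }
    return h;
  }

  double[] visibleProbs(double[] h) {
    double[] v = new double[numVisible];
    for (int i = 0; i < numVisible; i++) {
      double a = vBias[i];
      for (int j = 0; j < numHidden; j++) a += h[j] * w[i][j];
      v[i] = sigmoid(a);
    }
    return v;
  }

  double[] sample(double[] probs) {
    double[] s = new double[probs.length];
    for (int k = 0; k < probs.length; k++) {
      s[k] = rng.nextDouble() < probs[k] ? 1.0 : 0.0;
    }
    return s;
  }

  // One CD-1 update on a single binary training vector v0: up, sample,
  // down, up again, then move the weights toward the data statistics and
  // away from the one-step reconstruction statistics.
  void update(double[] v0) {
    double[] h0 = hiddenProbs(v0);
    double[] v1 = visibleProbs(sample(h0));  // one Gibbs step
    double[] h1 = hiddenProbs(v1);
    for (int i = 0; i < numVisible; i++) {
      for (int j = 0; j < numHidden; j++) {
        w[i][j] += learningRate * (v0[i] * h0[j] - v1[i] * h1[j]);
      }
      vBias[i] += learningRate * (v0[i] - v1[i]);
    }
    for (int j = 0; j < numHidden; j++) {
      hBias[j] += learningRate * (h0[j] - h1[j]);
    }
  }
}

For Netflix-style data the visible layer would instead be a softmax over the five rating values, as in Salakhutdinov, Mnih, and Hinton's RBM-for-collaborative-filtering paper, but the training loop has the same shape.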
That's my opinion on the matter. :) -jake

On Mon, Mar 22, 2010 at 1:53 PM, Sisir Koppaka <sisir.kopp...@gmail.com> wrote:

> Hi,
> Thanks a lot for taking the time out to reply. I understand that it's important to get the proposal right - that's why I wanted to bounce all the possibilities for Netflix - the methods that I've worked with before - off this list, and see what would be of priority interest to the team. If global effects and temporal SVD would be of interest, then I'd incorporate them into my final proposal accordingly. On the other hand, I've read that an RBM is something the team is interested in, so I could also implement a very well-performing (approximately 0.91 RMSE) RBM for Netflix as the GSoC project. I'd like to know which of the Netflix algorithms the Mahout team would like to see implemented first.
>
> Depending on the feedback, I'll prepare the final proposal. I'll definitely work with the code now and post any queries that I get on the list.
>
> Thanks a lot,
> Best regards,
> Sisir Koppaka
>
> On Tue, Mar 23, 2010 at 1:22 AM, Robin Anil <robin.a...@gmail.com> wrote:
>
> > Hi Sisir,
> > I am currently on vacation, so I won't be able to review your proposal fully. But from the looks of it, what I would suggest is that you target a somewhat more modest and practical proposal. Trust me, converting these algorithms to map/reduce is not as easy as it sounds, and you would spend most of the time debugging your code. Your work history is quite impressive, but what's more important here is getting your proposal right. Sean has written most of the recommender code in Mahout and would be the best person to give you feedback, as he has tried quite a number of approaches to recommenders on map/reduce and knows very well some of the constraints of the framework. Feel free to explore the current Mahout recommender code and ask on the list if you find anything confusing. But remember, you are trying to reproduce over two years of cutting-edge work in recommendations in a span of 10 weeks :) so stop and ponder the feasibility. If you are still good to go, then you probably need to demonstrate something in terms of code during the proposal period (which is optional).
> >
> > Don't take this the wrong way; it's not meant to demotivate you. If we can get this into Mahout, I am sure no one here would object to it. So your best next step would be to read, explore, think, and discuss.
> >
> > Regards
> > Robin
> >
> >
> > On Mon, Mar 22, 2010 at 4:36 PM, Sisir Koppaka <sisir.kopp...@gmail.com> wrote:
> >
> > > Dear Robin & the Apache Mahout team,
> > > I'm Sisir Koppaka, a third-year student from IIT Kharagpur, India. I've contributed to open source projects like FFmpeg earlier (repository diff links are here <http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=16a043535b91595bf34d7e044ef398067e7443e0> and here <http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=9dde37a150ce2e5c53e2295d09efe289cebea9cd>), and I am very interested in working on a project for Apache Mahout this year (the Netflix algorithms project, to be precise, mentored by Robin). Kindly let me explain my background so that I can make myself relevant in this context.
> > >
> > > I've done research work in meta-heuristics, including proposing the equivalents of local search and mutation for quantum-inspired algorithms, in my paper titled "*Superior Exploration-Exploitation Balance With Quantum-Inspired Hadamard Walks*", which was accepted as a late-breaking paper at GECCO 2010. We (a friend and I - it was independent work) hope to send an expanded version of the communication to a journal in the near future. For this project, our language of implementation was Mathematica, as we needed the combination of functional paradigms, readily available and mathematically sound resources (like biased random number generation, simple linear programming functions, etc.), and rapid prototyping ability.
> > >
> > > I interned at GE Research in their Computing and Decision Sciences Lab <http://ge.geglobalresearch.com/technologies/computing-decision-sciences/> last year, where I worked on machine learning techniques for large-scale databases - specifically on the Netflix Prize itself. Over a 2-month internship we rose from 1800th to 409th position on the leaderboard, and had implemented at least one variant of each of the major algorithms. The contest ended at the same time as our internship concluded, and the winning result was a combination of multiple variants of the algorithms we had implemented.
> > >
> > > Interestingly, we did try to use Hadoop and the Map-Reduce model for the purpose, based on a talk given by a visitor from Yahoo! during that time. However, not having access to a cluster proved to be an impediment to fast iterative development. We had one 16-core machine, so we developed a toolkit in C++ that could multiprocess up to 16 threads (data-input parallelization, rather than modifying the algorithms to suit the Map-Reduce model), and implemented all our algorithms using that toolkit. Specifically, SVD, kNN Movie-Movie, kNN User-User, and NSVD (BellKor and other variants like the Paterek SVD, and the temporal SVD++ too) were the major algorithms that we implemented. Some algorithms had readily available open source code for the Netflix Prize, like NSVD1, so we used them as well. We also worked on certain regression schemes that could improve prediction accuracy, like kernel ridge regression and its optimization.
> > >
> > > Towards the end, we also attempted to verify the results of the infamous paper that showed that IMDB-Netflix correlation could destroy privacy and identify users. We imported IMDB datasets into a database, correlated the IMDB entries to Netflix entries (we matched double the number of movies that the paper mentioned), and then verified the results. We also identified genre-wise trends and recorded them as such. Unfortunately, the paper resulted in a privacy lawsuit, wherein Netflix surrendered its rights to hold future Prizes of this kind in return for withdrawal of the charges. The case effectively closed the possibility of Netflix or any other company releasing similar datasets to the public, pending further advances in privacy-enforcement techniques, leaving the Netflix database as the largest of its kind.
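> > >
> > > (To be concrete about the SVD variants mentioned above: the core they all share is a regularized stochastic-gradient pass over the known ratings. Below is a rough, untested sketch of one such epoch - the names are made up for illustration and come from neither Mahout nor our toolkit:)
> > >
> > > // One SGD epoch of plain regularized SVD (Funk-style factorization).
> > > // userF and itemF are numUsers x k and numItems x k factor matrices;
> > > // users[n], items[n], ratings[n] describe the n-th known rating.
> > > void trainEpoch(int[] users, int[] items, double[] ratings,
> > >                 double[][] userF, double[][] itemF,
> > >                 double learningRate, double lambda) {
> > >   int k = userF[0].length;
> > >   for (int n = 0; n < ratings.length; n++) {
> > >     double[] pu = userF[users[n]];  // this rating's user factors
> > >     double[] qi = itemF[items[n]];  // this rating's item factors
> > >     double pred = 0.0;
> > >     for (int f = 0; f < k; f++) pred += pu[f] * qi[f];
> > >     double err = ratings[n] - pred;
> > >     for (int f = 0; f < k; f++) {   // gradient step with L2 shrinkage
> > >       double puf = pu[f];
> > >       pu[f] += learningRate * (err * qi[f] - lambda * puf);
> > >       qi[f] += learningRate * (err * puf - lambda * qi[f]);
> > >     }
> > >   }
> > > }
> > >
> > > (NSVD1, SVD++, and the temporal variants differ mainly in what extra terms enter the prediction, not in this basic loop.)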
> > >
> > > Naturally, it would be very interesting to implement the same for Hadoop, since the Netflix dataset is the largest of its kind, and this would serve multiple uses - as a kickstart, a tutorial, and a performance-testing tool for grids (using an included segment of the total database). In addition, the Netflix and IMDB framework base code could be incredibly useful as a prototyping tool for algorithm designers who want to design better machine learning algorithms/ensembles for large-scale databases. The lack of a decent scalable solution for the Netflix Prize that people can learn from and add to is also a major disappointment that this proposal hopes to correct.
> > >
> > > Specifically, the challenge I am personally looking forward to is implementing some of these algorithms using the Map-Reduce model, which is what I missed out on last time. I am looking forward to:
> > > 1. Accounting for the 12 global effects. The global effects were 12 corrections made to the dataset to account for time, multiple votes on a single day, etc., and these alone give a major boost to the accuracy of the predictions.
> > > 2. Implementing at least one kNN-based approach. Most of them differ in their parameters or in their choice of distance definitions (Pearson, Cosine, Euclidean, etc.); a Pearson sketch follows below.
> > > 3. Implementing at least one SVD-based temporal approach. While Mahout already possesses SVD implementations - slight variations of which could replicate multiple SVD approaches for the Netflix Prize - the temporal SVD++ has, in my past experience, been a very promising candidate that contributes a lot of perpendicular insight into the predictions.
> > > 4. An RBM implementation. RBMs were found to offer distinct insights into the predictions, and Mahout doesn't currently possess one (as Ankur C. Goel mentioned on this list a few days ago), so this is a no-brainer, must-do implementation for the Netflix Prize GSoC project.
> > > 5. Implementing at least one regression scheme relevant to the Netflix Prize as part of the boilerplate code required for this project, to facilitate further additions of algorithm variants. I'd like to do kernel ridge regression, since it has, again, provided decent results on the Netflix dataset compared to other options. I need to discuss this part with my mentor/the Apache Mahout team, because while multiple regression schemes are required for this project, I am not sure it is necessary to implement this as a Map-Reduce based scheme, *at least within the scope of the GSoC project.* It is not very compute-intensive - at least not as much as the central algorithms for the Netflix Prize - and so wouldn't be a critical limiting factor in reaching the project goal of an 8% improvement.
> > >
> > > The idea is that 1, 2, and 5 would not take more than two weeks, with 3 taking another two weeks, and the rest of the time being spent on 4. A demonstrated 8% improvement over Cinematch on the Netflix dataset would be the project goal.
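> > >
> > > (As referenced in point 2: the kNN variants mostly differ in the similarity measure. A rough, untested sketch of Pearson correlation over co-rated users for a pair of items - purely illustrative names, not an existing Mahout signature:)
> > >
> > > // Pearson correlation between two items over users who rated both.
> > > // ratingsA[u] / ratingsB[u]: user u's rating of each item, NaN if unrated.
> > > static double pearson(double[] ratingsA, double[] ratingsB) {
> > >   double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
> > >   int n = 0;
> > >   for (int u = 0; u < ratingsA.length; u++) {
> > >     if (Double.isNaN(ratingsA[u]) || Double.isNaN(ratingsB[u])) continue;
> > >     double a = ratingsA[u], b = ratingsB[u];
> > >     sumA += a; sumB += b;
> > >     sumAA += a * a; sumBB += b * b; sumAB += a * b;
> > >     n++;
> > >   }
> > >   if (n < 2) return 0.0;                  // too little overlap to correlate
> > >   double cov = sumAB - sumA * sumB / n;   // n * covariance
> > >   double varA = sumAA - sumA * sumA / n;  // n * variance
> > >   double varB = sumBB - sumB * sumB / n;
> > >   if (varA <= 0 || varB <= 0) return 0.0;
> > >   return cov / Math.sqrt(varA * varB);    // the n factors cancel
> > > }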
> > >
> > > There are plenty of references on the Grand Prize solution - Yehuda Koren's final summary <http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf> is a great starting point. However, when I last worked on the problem this wasn't available, so I referred to papers by Arik Paterek and some of BellKor's earlier papers.
> > >
> > > Looking forward to a discussion about the proposal,
> > > Thanks a lot,
> > > Best regards,
> > > Sisir Koppaka
> > >
> > > --
> > > SK
>
> --
> SK