Hi Sisir, I'm the one who added the most recent SVD stuff to Mahout, and while I'd love to see improvements in that, and its incorporation into Netflix-style recommenders, I would even more like to see a stacked-RBM implementation if you think you can do that. We don't currently have anything like that in Mahout, and even if it is completely serial (no Hadoop hooks), it would be an excellent addition to the project.
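To give a concrete sense of the scope I mean: here's a rough, untested sketch of CD-1 training for a single binary RBM layer. None of these class or method names exist in Mahout - they're placeholders - and "stacked" would just mean feeding each trained layer's hidden activations to the next layer as its visible input.

import java.util.Random;

// Placeholder sketch: one binary-binary RBM layer trained with CD-1.
// Not an existing Mahout class; purely to illustrate the intended scope.
public class SimpleRbm {

  final int numVisible;
  final int numHidden;
  final double[][] w;       // weight matrix, numVisible x numHidden
  final double[] vBias;     // visible-unit biases
  final double[] hBias;     // hidden-unit biases
  final Random rng = new Random(42);
  final double learningRate = 0.1;

  SimpleRbm(int numVisible, int numHidden) {
    this.numVisible = numVisible;
    this.numHidden = numHidden;
    this.w = new double[numVisible][numHidden];
    this.vBias = new double[numVisible];
    this.hBias = new double[numHidden];
    for (int i = 0; i < numVisible; i++) {
      for (int j = 0; j < numHidden; j++) {
        w[i][j] = 0.01 * rng.nextGaussian();  // small random init
      }
    }
  }

  static double sigmoid(double x) {
    return 1.0 / (1.0 + Math.exp(-x));
  }

  double[] hiddenProbs(double[] v) {
    double[] h = new double[numHidden];
    for (int j = 0; j < numHidden; j++) {
      double a = hBias[j];
      for (int i = 0; i < numVisible; i++) a += v[i] * w[i][j];
      h[j] = sigmoid(a);
    }
    return h;
  }

  double[] visibleProbs(double[] h) {
    double[] v = new double[numVisible];
    for (int i = 0; i < numVisible; i++) {
      double a = vBias[i];
      for (int j = 0; j < numHidden; j++) a += h[j] * w[i][j];
      v[i] = sigmoid(a);
    }
    return v;
  }

  double[] sample(double[] probs) {
    double[] s = new double[probs.length];
    for (int k = 0; k < probs.length; k++) {
      s[k] = rng.nextDouble() < probs[k] ? 1.0 : 0.0;
    }
    return s;
  }

  // One CD-1 update on a single binary training vector v0: up, sample,
  // down, up again, then move the weights toward the data statistics and
  // away from the one-step reconstruction statistics.
  void update(double[] v0) {
    double[] h0 = hiddenProbs(v0);
    double[] v1 = visibleProbs(sample(h0));  // one Gibbs step
    double[] h1 = hiddenProbs(v1);
    for (int i = 0; i < numVisible; i++) {
      for (int j = 0; j < numHidden; j++) {
        w[i][j] += learningRate * (v0[i] * h0[j] - v1[i] * h1[j]);
      }
      vBias[i] += learningRate * (v0[i] - v1[i]);
    }
    for (int j = 0; j < numHidden; j++) {
      hBias[j] += learningRate * (h0[j] - h1[j]);
    }
  }
}

For Netflix-style data the visible layer would instead be a softmax over the five rating values, as in Salakhutdinov, Mnih, and Hinton's RBM-for-collaborative-filtering paper, but the training loop has the same shape.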
That's my opinion on the matter. :) -jake

On Mon, Mar 22, 2010 at 1:53 PM, Sisir Koppaka <sisir.kopp...@gmail.com> wrote:

> Hi,
> Thanks a lot for taking the time out to reply. I understand that it's important to get the proposal right - that's why I wanted to bounce all the possibilities for Netflix - the methods that I've worked with before - off this list, and see what would be of priority interest to the team. If global effects and temporal SVD would be of interest, then I'd incorporate them into my final proposal accordingly. On the other hand, I've read that an RBM is something the team is interested in, so I could also implement a very well-performing (approximately 0.91 RMSE) RBM for Netflix as the GSoC project. I'd like to know which of the Netflix algorithms the Mahout team would like to see implemented first.
>
> Depending on the feedback, I'll prepare the final proposal. I'll definitely work with the code now and post any queries that I get on the list.
>
> Thanks a lot,
> Best regards,
> Sisir Koppaka
>
> On Tue, Mar 23, 2010 at 1:22 AM, Robin Anil <robin.a...@gmail.com> wrote:
>
> > Hi Sisir,
> > I am currently on vacation, so I won't be able to review your proposal fully. But from the looks of it, what I would suggest is that you target a somewhat more modest and practical proposal. Trust me, converting these algorithms to map/reduce is not as easy as it sounds, and you would spend most of the time debugging your code. Your work history is quite impressive, but what's more important here is getting your proposal right. Sean has written most of the recommender code in Mahout and would be the best person to give you feedback, as he has tried quite a number of approaches to recommenders on map/reduce and knows very well some of the constraints of the framework. Feel free to explore the current Mahout recommender code and ask on the list if you find anything confusing. But remember, you are trying to reproduce over two years of cutting-edge work in recommendations in a span of 10 weeks :) so stop and ponder the feasibility. If you are still good to go, then you probably need to demonstrate something in terms of code during the proposal period (which is optional).
> >
> > Don't take this the wrong way; it's not meant to demotivate you. If we can get this into Mahout, I am sure no one here would object to it. So your best next step would be to read, explore, think, and discuss.
> >
> > Regards
> > Robin
> >
> >
> > On Mon, Mar 22, 2010 at 4:36 PM, Sisir Koppaka <sisir.kopp...@gmail.com> wrote:
> >
> > > Dear Robin & the Apache Mahout team,
> > > I'm Sisir Koppaka, a third-year student from IIT Kharagpur, India. I've contributed to open source projects like FFmpeg earlier (repository diff links are here <http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=16a043535b91595bf34d7e044ef398067e7443e0> and here <http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=9dde37a150ce2e5c53e2295d09efe289cebea9cd>), and I am very interested in working on a project for Apache Mahout this year (the Netflix algorithms project, to be precise, mentored by Robin). Kindly let me explain my background so that I can make myself relevant in this context.
> > >
> > > I've done research work in meta-heuristics, including proposing the equivalents of local search and mutation for quantum-inspired algorithms, in my paper titled "*Superior Exploration-Exploitation Balance With Quantum-Inspired Hadamard Walks*", which was accepted as a late-breaking paper at GECCO 2010. We (a friend and I - it was independent work) hope to send an expanded version of the communication to a journal in the near future. For this project, our language of implementation was Mathematica, as we needed the combination of functional paradigms, readily available and mathematically sound resources (like biased random number generation, simple linear programming functions, etc.), and rapid prototyping ability.
> > >
> > > I interned at GE Research in their Computing and Decision Sciences Lab <http://ge.geglobalresearch.com/technologies/computing-decision-sciences/> last year, where I worked on machine learning techniques for large-scale databases - specifically on the Netflix Prize itself. Over a 2-month internship we rose from 1800th to 409th position on the leaderboard, and had implemented at least one variant of each of the major algorithms. The contest ended at the same time as our internship concluded, and the winning result was a combination of multiple variants of the algorithms we had implemented.
> > >
> > > Interestingly, we did try to use Hadoop and the Map-Reduce model for the purpose, based on a talk given by a visitor from Yahoo! during that time. However, not having access to a cluster proved to be an impediment to fast iterative development. We had one 16-core machine, so we developed a toolkit in C++ that could multiprocess up to 16 threads (data-input parallelization, rather than modifying the algorithms to suit the Map-Reduce model), and implemented all our algorithms using that toolkit. Specifically, SVD, kNN Movie-Movie, kNN User-User, and NSVD (BellKor and other variants like the Paterek SVD, and the temporal SVD++ too) were the major algorithms that we implemented. Some algorithms had readily available open source code for the Netflix Prize, like NSVD1, so we used them as well. We also worked on certain regression schemes that could improve prediction accuracy, like kernel ridge regression and its optimization.
> > >
> > > Towards the end, we also attempted to verify the results of the infamous paper that showed that IMDB-Netflix correlation could destroy privacy and identify users. We imported IMDB datasets into a database, correlated the IMDB entries to Netflix entries (we matched double the number of movies that the paper mentioned), and then verified the results. We also identified genre-wise trends and recorded them as such. Unfortunately, the paper resulted in a privacy lawsuit, wherein Netflix surrendered its rights to hold future Prizes of this kind in return for withdrawal of the charges. The case effectively closed the possibility of Netflix or any other company releasing similar datasets to the public, pending further advances in privacy-enforcement techniques, leaving the Netflix database as the largest of its kind.
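> > >
> > > (To be concrete about the SVD variants mentioned above: the core they all share is a regularized stochastic-gradient pass over the known ratings. Below is a rough, untested sketch of one such epoch - the names are made up for illustration and come from neither Mahout nor our toolkit:)
> > >
> > > // One SGD epoch of plain regularized SVD (Funk-style factorization).
> > > // userF and itemF are numUsers x k and numItems x k factor matrices;
> > > // users[n], items[n], ratings[n] describe the n-th known rating.
> > > void trainEpoch(int[] users, int[] items, double[] ratings,
> > >                 double[][] userF, double[][] itemF,
> > >                 double learningRate, double lambda) {
> > >   int k = userF[0].length;
> > >   for (int n = 0; n < ratings.length; n++) {
> > >     double[] pu = userF[users[n]];  // this rating's user factors
> > >     double[] qi = itemF[items[n]];  // this rating's item factors
> > >     double pred = 0.0;
> > >     for (int f = 0; f < k; f++) pred += pu[f] * qi[f];
> > >     double err = ratings[n] - pred;
> > >     for (int f = 0; f < k; f++) {   // gradient step with L2 shrinkage
> > >       double puf = pu[f];
> > >       pu[f] += learningRate * (err * qi[f] - lambda * puf);
> > >       qi[f] += learningRate * (err * puf - lambda * qi[f]);
> > >     }
> > >   }
> > > }
> > >
> > > (NSVD1, SVD++, and the temporal variants differ mainly in what extra terms enter the prediction, not in this basic loop.)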
> > >
> > > Naturally, it would be very interesting to implement the same for Hadoop, since the Netflix dataset is the largest of its kind, and this would serve multiple uses - as a kickstart, a tutorial, and a performance-testing tool for grids (using an included segment of the total database). In addition, the Netflix and IMDB framework base code could be incredibly useful as a prototyping tool for algorithm designers who want to design better machine learning algorithms/ensembles for large-scale databases. The lack of a decent scalable solution for the Netflix Prize that people can learn from and add to is also a major disappointment that this proposal hopes to correct.
> > >
> > > Specifically, the challenge I am personally looking forward to is implementing some of these algorithms using the Map-Reduce model, which is what I missed out on last time. I am looking forward to:
> > > 1. Accounting for the 12 global effects. The global effects were 12 corrections made to the dataset to account for time, multiple votes on a single day, etc., and these alone give a major boost to the accuracy of the predictions.
> > > 2. Implementing at least one kNN-based approach. Most of them differ in their parameters or in their choice of distance definitions (Pearson, Cosine, Euclidean, etc.); a Pearson sketch follows below.
> > > 3. Implementing at least one SVD-based temporal approach. While Mahout already possesses SVD implementations - slight variations of which could replicate multiple SVD approaches for the Netflix Prize - the temporal SVD++ has, in my past experience, been a very promising candidate that contributes a lot of perpendicular insight into the predictions.
> > > 4. An RBM implementation. RBMs were found to offer distinct insights into the predictions, and Mahout doesn't currently possess one (as Ankur C. Goel mentioned on this list a few days ago), so this is a no-brainer, must-do implementation for the Netflix Prize GSoC project.
> > > 5. Implementing at least one regression scheme relevant to the Netflix Prize as part of the boilerplate code required for this project, to facilitate further additions of algorithm variants. I'd like to do kernel ridge regression, since it has, again, provided decent results on the Netflix dataset compared to other options. I need to discuss this part with my mentor/the Apache Mahout team, because while multiple regression schemes are required for this project, I am not sure it is necessary to implement this as a Map-Reduce based scheme, *at least within the scope of the GSoC project.* It is not very compute-intensive - at least not as much as the central algorithms for the Netflix Prize - and so wouldn't be a critical limiting factor in reaching the project goal of an 8% improvement.
> > >
> > > The idea is that 1, 2, and 5 would not take more than two weeks, with 3 taking another two weeks, and the rest of the time being spent on 4. A demonstrated 8% improvement over Cinematch on the Netflix dataset would be the project goal.
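> > >
> > > (As referenced in point 2: the kNN variants mostly differ in the similarity measure. A rough, untested sketch of Pearson correlation over co-rated users for a pair of items - purely illustrative names, not an existing Mahout signature:)
> > >
> > > // Pearson correlation between two items over users who rated both.
> > > // ratingsA[u] / ratingsB[u]: user u's rating of each item, NaN if unrated.
> > > static double pearson(double[] ratingsA, double[] ratingsB) {
> > >   double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
> > >   int n = 0;
> > >   for (int u = 0; u < ratingsA.length; u++) {
> > >     if (Double.isNaN(ratingsA[u]) || Double.isNaN(ratingsB[u])) continue;
> > >     double a = ratingsA[u], b = ratingsB[u];
> > >     sumA += a; sumB += b;
> > >     sumAA += a * a; sumBB += b * b; sumAB += a * b;
> > >     n++;
> > >   }
> > >   if (n < 2) return 0.0;                  // too little overlap to correlate
> > >   double cov = sumAB - sumA * sumB / n;   // n * covariance
> > >   double varA = sumAA - sumA * sumA / n;  // n * variance
> > >   double varB = sumBB - sumB * sumB / n;
> > >   if (varA <= 0 || varB <= 0) return 0.0;
> > >   return cov / Math.sqrt(varA * varB);    // the n factors cancel
> > > }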
> > >
> > > There are plenty of references on the Grand Prize solution - Yehuda Koren's final summary <http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf> is a great starting point. However, when I last worked on the problem this wasn't available, so I referred to papers by Arik Paterek and some of BellKor's earlier papers.
> > >
> > > Looking forward to a discussion about the proposal,
> > > Thanks a lot,
> > > Best regards,
> > > Sisir Koppaka
> > >
> > > --
> > > SK
>
> --
> SK