[jira] Created: (MAHOUT-390) Quickstart script for kmeans algorithm

2010-05-04 Thread Sisir Koppaka (JIRA)
Quickstart script for kmeans algorithm
--

 Key: MAHOUT-390
 URL: https://issues.apache.org/jira/browse/MAHOUT-390
 Project: Mahout
  Issue Type: Improvement
  Components: Utils
Reporter: Sisir Koppaka


Contains a quickstart shell script for the kmeans algorithm on the Reuters dataset,
as described at https://cwiki.apache.org/MAHOUT/k-means.html

The script attached to this JIRA issue is a slightly modified and cleaner version.




[jira] Updated: (MAHOUT-390) Quickstart script for kmeans algorithm

2010-05-04 Thread Sisir Koppaka (JIRA)

 [ https://issues.apache.org/jira/browse/MAHOUT-390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sisir Koppaka updated MAHOUT-390:
-

Attachment: quickstart-kmeans.sh

The quickstart script for the kmeans algorithm.





Re: Quickstart for kMeans

2010-05-04 Thread Sisir Koppaka
Hi,
I've put up a slightly cleaner version of the script on JIRA at
https://issues.apache.org/jira/browse/MAHOUT-390

Best regards,
Sisir Koppaka

On Mon, May 3, 2010 at 11:28 PM, Grant Ingersoll gsing...@apache.org wrote:

 Sisir,

 Thanks for the script.  I think it would be great to open a JIRA issue for
 this and we can check in the shell script under the examples.

 I think LDA also has similar tools to download Reuters; we should try to
 reuse them if possible.

 On May 2, 2010, at 3:42 PM, Sisir Koppaka wrote:

  For GSOC students,
  In case anyone was going through the code and finding it difficult to run
  things, I have updated the kMeans page on the wiki
  (https://cwiki.apache.org/confluence/display/MAHOUT/k-Means) with a short
  quickstart shell script that will run it for you. You can tweak the
  settings and reuse it. Reading the code after running it will hopefully
  help in understanding the codebase.

  If any of you have any tips to share, or have made notes of
  quirks-to-be-aware-of, do post them here for everyone's benefit.





Quickstart for kMeans

2010-05-02 Thread Sisir Koppaka
For GSOC students,
In case anyone was going through the code and finding it difficult to run
things, I have updated the kMeans page on the wiki
(https://cwiki.apache.org/confluence/display/MAHOUT/k-Means) with a short
quickstart shell script that will run it for you. You can tweak the
settings and reuse it. Reading the code after running it will hopefully
help in understanding the codebase.

If any of you have any tips to share, or have made notes of
quirks-to-be-aware-of, do post them here for everyone's benefit.


Re: Quickstart for kMeans

2010-05-02 Thread Sisir Koppaka
Two more useful resources for quickstarting with the code -
http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/
http://www.lucenebootcamp.com/lucene-boot-camp-preclass-training/

On Mon, May 3, 2010 at 1:14 AM, Robin Anil robin.a...@gmail.com wrote:

 Nice work!

 On Mon, May 3, 2010 at 1:12 AM, Sisir Koppaka sisir.kopp...@gmail.com
 wrote:

  For GSOC students,
  In case anyone was going through the code and finding it difficult to run
  things, I have updated the kMeans page on the wiki
  (https://cwiki.apache.org/confluence/display/MAHOUT/k-Means) with a short
  quickstart shell script that will run it for you. You can tweak the
  settings and reuse it. Reading the code after running it will hopefully
  help in understanding the codebase.

  If any of you have any tips to share, or have made notes of
  quirks-to-be-aware-of, do post them here for everyone's benefit.
 



Re: [GSOC] Congrats to all students

2010-04-27 Thread Sisir Koppaka
+1 for shared blog!


Re: [GSOC] Congrats to all students

2010-04-26 Thread Sisir Koppaka
Thanks everyone!

This is a fantastic opportunity, and I'll try to make the best of it for
myself as well as for Mahout. Hopefully, we'll have a great collection of deep
learning networks within the next few releases.

BTW, congrats to everyone on Mahout becoming a TLP!

On Tue, Apr 27, 2010 at 1:13 AM, Grant Ingersoll gsing...@apache.org wrote:

 Looks like student GSOC announcements are up (
 http://socghop.appspot.com/gsoc/program/list_projects/google/gsoc2010).
  Mahout got quite a few projects (5) accepted this year, which is a true
 credit to the ASF, Mahout, the mentors, and most of all the students!  We
 had a good number of very high quality student proposals for Mahout this
 year and it was very difficult to choose.  Of the ones selected, I think
 they all bode well for the future of Mahout and the students.

 For those who didn't make the cut, I know it's small consolation, but I
 would encourage you all to stay involved in open source, if not Mahout
 specifically.  We'd certainly love to see you contributing here as many of
 you had very good ideas.

 At any rate, for everyone, keep an eye out on the Mahout project, as you
 should be seeing lots of exciting features coming to Mahout soon in the form
 of scalable Neural Networks, Restricted Boltzmann Machines (recommenders),
 SVD-based recommenders, EigenCuts Spectral Clustering and Support Vector
 Machines (SVM)!

 Should be an exciting summer!

 -Grant




-- 
SK


[jira] Created: (MAHOUT-375) [GSOC] Restricted Boltzmann Machines in Apache Mahout

2010-04-11 Thread Sisir Koppaka (JIRA)
[GSOC] Restricted Boltzmann Machines in Apache Mahout
-

 Key: MAHOUT-375
 URL: https://issues.apache.org/jira/browse/MAHOUT-375
 Project: Mahout
  Issue Type: New Feature
Reporter: Sisir Koppaka


Proposal Title: Restricted Boltzmann Machines in Apache Mahout (addresses issue 
MAHOUT-329)

Student Name: Sisir Koppaka

Student E-mail: sisir.kopp...@gmail.com

Organization/Project:

Assigned Mentor:

Abstract
This is a proposal to implement Restricted Boltzmann Machines in Apache Mahout 
as a part of Google Summer of Code 2010. The demo for the code would be built 
on the Netflix dataset.

1 Introduction
The Grand Prize solution to the Netflix Prize offered several new lessons in 
the application of traditional machine learning techniques to very large scale 
datasets. The most significant among these were the impact of temporal models, 
the remarkable contribution of RBMs to the overall model, and the great 
success of ensemble models in achieving superior predictions. The present 
proposal seeks to implement a conditional factored RBM [4] in Apache Mahout 
as a project under Google Summer of Code 2010.

2 Background
The Netflix dataset takes the form of a sparse N x M matrix of the ratings that 
N users assign to M movies. Matrix decompositions, such as variants of Singular 
Value Decomposition (SVD), form the first family of methods applied. These have 
also prompted several recent works in applied mathematics relevant to the 
Netflix Prize, including [1, 2]. Another family of techniques comprises 
k-nearest-neighbour approaches - user-user and movie-movie - using different 
distance measures such as Pearson correlation and cosine similarity. The third 
set of techniques, which offers arguably the most divergent predictions and 
thereby the largest improvement in blended prediction RMSE, is the RBM and its 
variants.

[4] describes the algorithm that the author proposes to implement this 
summer in Apache Mahout. Parallelization can be achieved by updating all the 
hidden units in parallel, followed by all the visible units in parallel, owing 
to the conditional independence of the hidden units given the visible binary 
indicator matrix. Rather than implementing a naive RBM, the conditional 
factored RBM is chosen for its useful combination of effectiveness and speed. 
Minor variations could, in any case, be developed later with little difficulty.
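
For context, the conditionals that make this alternation possible are the 
standard RBM-for-collaborative-filtering equations (a sketch in the style of 
[4]; the notation below is illustrative, not taken from this issue). With 
binary hidden units h_j, softmax visible units v_i^k over the K=5 rating 
values, and \sigma(x) = 1/(1 + e^{-x}):

    % All hidden units can be sampled in one parallel pass, since they are
    % conditionally independent given the visible ratings V:
    p(h_j = 1 \mid V) = \sigma\Big(b_j + \sum_i \sum_{k=1}^{K} v_i^k W_{ij}^k\Big)

    % Symmetrically, the visible softmax units are independent given h:
    p(v_i^k = 1 \mid h) =
        \frac{\exp\big(b_i^k + \sum_j h_j W_{ij}^k\big)}
             {\sum_{l=1}^{K} \exp\big(b_i^l + \sum_j h_j W_{ij}^l\big)}

The factored variant additionally replaces each weight slice W^k with a 
low-rank product, roughly W_{ij}^k = \sum_{c=1}^{C} A_{ic}^k B_{cj}, which is 
what keeps the parameter count manageable at Netflix scale.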

The training data set consists of nearly 100 million ratings from 480,000 users 
on 17,770 movie titles. As part of the training data, Netflix also provides 
validation data (called the probe set) containing nearly 1.4 million ratings. 
In addition to the training and validation data, Netflix provides a test set 
containing 2.8 million user/movie pairs (called the qualifying set) whose 
ratings were previously withheld, but have now been released following the 
conclusion of the Prize.

3 Milestones 

3.1 April 26-May 24
Community Bonding Period. Certain boilerplate code for the Netflix dataset 
exists at org.apache.mahout.cf.taste.example.netflix. However, this code is 
non-distributed and unrelated to Hadoop. Parts of this code, such as the file 
read-in based on the Netflix format, will be reused to match the processed 
Netflix dataset file linked below.

Test out any of the already-implemented Mahout algorithms, such as SVD or 
k-Means, on the whole dataset to make sure that everything works as advertised, 
and make a note of the running time. If the running time is very large, then 
make a 10% training set and use the 10% probe, which already exists as a 
standardized Netflix Prize community contribution. This is only so that 
iterations can be faster and a multi-node Hadoop installation is not always 
required. Work on the map-reduce version of the RBM and evaluate whether 
parallelization beyond the alternating hidden-unit and visible-unit computation 
can be implemented. Get the community's approval for the map-reduce version of 
the RBM, and then proceed.

3.2 May 24-July 12: Pre-midterm
Implementation time! Write code, test code, rewrite code.
Working code with decent predictions should be ready by the end of this segment.

Design details
The RBM code would live at org.apache.mahout.classifier.rbm. A Classify.java 
would need to be written to support the RBM, similar to the one in 
discriminative. An equivalent of BayesFileFormatter.java would not be required 
because of the pre-written Netflix read-in code mentioned above. 
ConfusionMatrix.java, ResultAnalyzer.java and ClassifyResult.java would be 
reused as-is from discriminative.
algorithm would contain the actual conditional factored RBM algorithm; common 
would contain the code shared by the various files in algorithm. mapreduce.rbm 
would contain the driver, mapper and reducer for the parallelized updating of 
the hidden-unit layer followed by the visible units, and appropriately 
refactored code would be placed in mapreduce.common.
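
Since the proposal only names the packages, here is a minimal sketch of the 
kind of mapper the hidden-layer pass implies (a hedged illustration - the class 
name and layout below are assumptions, not code from this issue): ratings are 
regrouped by user, so that a single reduce call sees one user's whole visible 
vector and can update all of that user's hidden units at once.

    // Hypothetical sketch for mapreduce.rbm: regroup "movieID,userID,rating"
    // lines by user, so a reducer can sample each user's hidden units.
    package org.apache.mahout.classifier.rbm.mapreduce;

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class HiddenLayerMapper
        extends Mapper<LongWritable, Text, LongWritable, Text> {

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // Input line format, per this issue: movieID,userID,rating
        String[] fields = line.toString().split(",");
        long userId = Long.parseLong(fields[1]);
        // Key by user; the reducer collects the user's visible vector and
        // samples all hidden units in one pass, which is valid because they
        // are conditionally independent given the visible units.
        context.write(new LongWritable(userId),
                      new Text(fields[0] + ',' + fields[2]));
      }
    }

The corresponding reducer would hold the current weights (shipped via the job 
configuration or the distributed cache) and emit the sampled hidden activations 
for the subsequent visible-unit pass.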

[jira] Commented: (MAHOUT-375) [GSOC] Restricted Boltzmann Machines in Apache Mahout

2010-04-11 Thread Sisir Koppaka (JIRA)

[ https://issues.apache.org/jira/browse/MAHOUT-375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855694#action_12855694 ]

Sisir Koppaka commented on MAHOUT-375:
--

Moved the proposal to JIRA. I've processed the Netflix dataset into the format 
Sean suggested on mahout-dev and put it up at 
https://mahout-rbm-netflix.s3.amazonaws.com/netflix-dataset-wodates.csv (click 
to view in browser/download). The format is [movieID, userID, rating].

During the proposal consideration period, I am implementing a version of the 
RBM that is not conditional, not factored, and not yet parallelized. I will 
submit it tomorrow after testing. Since testing on my machine alone is 
currently rather time-consuming for the Netflix dataset, are there any datasets 
you could suggest for quicker testing of the RBM - at least for now? If the 
test dataset has published RBM results that I can compare against, that would 
really help with the testing.


[jira] Issue Comment Edited: (MAHOUT-375) [GSOC] Restricted Boltzmann Machines in Apache Mahout

2010-04-11 Thread Sisir Koppaka (JIRA)

[ https://issues.apache.org/jira/browse/MAHOUT-375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855694#action_12855694 ]

Sisir Koppaka edited comment on MAHOUT-375 at 4/11/10 5:09 AM:
---

Moved the proposal to JIRA. I've processed the Netflix dataset into the format 
Sean suggested on mahout-dev and put it up at 
https://mahout-rbm-netflix.s3.amazonaws.com/netflix-dataset-wodates.csv (click 
to view in browser/download). The format is [movieID, userID, rating]. [The 
complete file is 1.5GB, so view in browser unless you need the whole file! :)]

During the proposal consideration period, I am implementing a version of the 
RBM that is not conditional, not factored, and not yet parallelized. I will 
submit it tomorrow after testing. Since testing on my machine alone is 
currently rather time-consuming for the Netflix dataset, are there any datasets 
you could suggest for quicker testing of the RBM - at least for now? If the 
test dataset has published RBM results that I can compare against, that would 
really help with the testing.


Re: Reg. Netflix Prize Apache Mahout GSoC Application

2010-04-04 Thread Sisir Koppaka
Thanks Robin, Ted, Jake and Sean for your feedback. I've refined my
proposal, added a milestone timeline with design details, and have
submitted it at the GSoC site. The title of the proposal is *Restricted
Boltzmann Machines on the Netflix Dataset*. Please do give me your feedback
on the proposal, located here:
https://docs.google.com/fileview?id=0B-jUrudTSg7-ZTg3YTU5YTktZDBhZC00NWFiLTk4MTQtNzVlODZhOWEzYTU0&hl=en

I have a couple of queries that'd help me further refine my proposal.
Firstly, I am expecting to reuse the code at
org.apache.mahout.cf.taste.example.netflix,
and have mentioned so in my proposal. Please let me know if this is OK, or
if you foresee any problems with doing this. Secondly, I will implement an
HBase-based datastore as well as an InMemory-based one, but is the
InMemory-based one a prerequisite for the HBase-based one to be used?
(Eventually everything has to go to memory, so is this being done elsewhere,
or does the InMemory datastore do it?)

Thanking you,
Best regards,
Sisir Koppaka


Re: Reg. Netflix Prize Apache Mahout GSoC Application

2010-04-04 Thread Sisir Koppaka
Thanks, this is what I wanted to know. So, now, there would be a separate
example that reads in the Netflix dataset in a distributed way, which would
be utilized by the RBM implementation. Would that be right?

The datastore I was referring to in the proposal was based on
mahout.classifier.bayes.datastore. I understand the HBase, Cassandra and
other adapters are being refactored out in a separate ticket, so I'll just
stick with HDFS and S3.

If there's anything else that I need to add to the proposal, do let me
know.

On Sun, Apr 4, 2010 at 3:09 PM, Sean Owen sro...@gmail.com wrote:

 Reusing code is fine, in principle. The code you mention, however,
 will not help you much. It is non-distributed and has nothing to do
 with Hadoop. You might reuse a bit of code to parse the input files,
 that's about it.

 Which data store are you referring to... if I understand right, you
 are implementing an algorithm on Hadoop. You would definitely not
 implement anything to load into memory, and I think you want to work
 with HDFS and Amazon S3, not HBase.

 On Sun, Apr 4, 2010 at 9:29 AM, Sisir Koppaka sisir.kopp...@gmail.com
 wrote:
  Firstly, I am expecting to reuse the code at
  org.apache.mahout.cf.taste.example.netflix,
  and have mentioned so in my proposal. Please let me know if this is OK, or
  if you foresee any problems with doing this. Secondly, I will implement an
  HBase-based datastore as well as an InMemory-based one, but is the
  InMemory-based one a prerequisite for the HBase-based one to be used?
  (Eventually everything has to go to memory, so is this being done elsewhere,
  or does the InMemory datastore do it?)




-- 
SK


Re: Reg. Netflix Prize Apache Mahout GSoC Application

2010-04-04 Thread Sisir Koppaka
That should read "would *be utilized by* the RBM implementation" - sorry!

I'll start off by implementing the distributed Netflix read-in, if that's
OK by you.


Re: Reg. Netflix Prize Apache Mahout GSoC Application

2010-04-04 Thread Sisir Koppaka
On Sun, Apr 4, 2010 at 4:10 PM, Sean Owen sro...@gmail.com wrote:

 I think you want to write this to accept generic data, and not
 necessarily assume the Netflix input format. I suggest you accept CSV
 data, in the form userID,itemID,value, since that is what all the
 recommenders do.

Sure, I'll write it for userID, movieID, rating. Netflix also provides
dates, but we can ignore them for the time being.


 You may need a quick utility program to convert the Netflix data format to
 this. This wouldn't be part of the project; or else, we can put it in
 utils later.

 I have done this already. I have a 1.2GB CSV file containing all the 100
million records in the Netflix dataset as userID, movieID, rating, date.
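
For concreteness, a minimal sketch of such a converter (an illustration under 
stated assumptions, not the actual utility from this thread): it assumes the 
published Netflix training-set layout, where each mv_*.txt file opens with a 
"movieID:" line followed by "userID,rating,date" lines, and writes 
userID,movieID,rating as Sean suggested.

    // Hypothetical Netflix-to-CSV converter sketch; args[0] is the
    // training_set directory, args[1] is the output CSV path.
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class NetflixToCsv {
      public static void main(String[] args) throws IOException {
        File trainingDir = new File(args[0]);
        try (PrintWriter out = new PrintWriter(new FileWriter(args[1]))) {
          for (File f : trainingDir.listFiles()) {
            try (BufferedReader in = new BufferedReader(new FileReader(f))) {
              // First line is "movieID:"; strip the trailing colon.
              String movieId = in.readLine().replace(":", "");
              String line;
              while ((line = in.readLine()) != null) {
                String[] parts = line.split(",");  // userID,rating,date
                // Drop the date and reorder to userID,movieID,rating.
                out.println(parts[0] + ',' + movieId + ',' + parts[1]);
              }
            }
          }
        }
      }
    }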


-- 
SK


Re: Reg. Netflix Prize Apache Mahout GSoC Application

2010-04-04 Thread Sisir Koppaka
I have put up the processed Netflix dataset here:
https://mahout-rbm-netflix.s3.amazonaws.com/netflix-dataset-wodates.csv
This file does not contain dates and is 1.5GB in size. A torrent is also
available: https://mahout-rbm-netflix.s3.amazonaws.com/netflix-dataset-wodates.csv?torrent


Re: Reg. Netflix Prize Apache Mahout GSoC Application

2010-03-22 Thread Sisir Koppaka
Hi,
Thanks a lot for taking the time to reply. I understand that it's important
to get the proposal right - that's why I wanted to bounce all the
possibilities for Netflix - the methods that I've worked with before - off
this list, and see what would be of priority interest to the team. If global
effects and temporal SVD would be of interest, then I'd incorporate them into
my final proposal accordingly. On the other hand, I've read that RBM is
something the team is interested in, so I could also implement a very
well-performing RBM (approximately 0.91 RMSE) for Netflix as the GSoC
project. I'd like to know which of the Netflix algorithms the Mahout team
would like to see implemented first.

Depending on the feedback, I'll prepare the final proposal. I'll definitely
work with the code now and post any queries that I get on the list.

Thanks a lot,
Best regards,
Sisir Koppaka

On Tue, Mar 23, 2010 at 1:22 AM, Robin Anil robin.a...@gmail.com wrote:

 Hi Sisir,
  I am currently on vacation, so I won't be able to review your
 proposal fully. But from the looks of it, what I would suggest is to
 target a somewhat more modest and practical proposal. Trust me, converting
 these algorithms to map/reduce is not as easy as it sounds, and you would
 spend most of the time debugging your code. Your work history is quite
 impressive, but what's more important here is getting your proposal right.
 Sean has written most of the recommender code in Mahout and would be best
 placed to give you feedback, as he has tried quite a number of approaches to
 recommenders on map/reduce and knows very well some of the constraints of
 the framework. Feel free to explore the current Mahout recommender code and
 ask on the list if you find anything confusing. But remember, you are trying
 to reproduce some of the cutting-edge work of the last two years in
 recommendations in a span of 10 weeks :) so stop and ponder the feasibility.
 If you are still good to go, then you probably need to demonstrate something
 in terms of code during the proposal period (which is optional).

 Don't take this the wrong way; it's not meant to demotivate you. If we can
 get this into Mahout, I am sure no one here would object to it. So your
 best next step would be: read, explore, think, discuss.

 Regards
 Robin


 On Mon, Mar 22, 2010 at 4:36 PM, Sisir Koppaka sisir.kopp...@gmail.com
 wrote:

  Dear Robin and the Apache Mahout team,
  I'm Sisir Koppaka, a third-year student from IIT Kharagpur, India. I've
  contributed to open source projects like FFmpeg earlier (repository diff
  links are here
  http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=16a043535b91595bf34d7e044ef398067e7443e0
  and here
  http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=9dde37a150ce2e5c53e2295d09efe289cebea9cd
  ), and I am very interested in working on a project for Apache Mahout this
  year (the Netflix algorithms project, to be precise - mentored by Robin).
  Kindly let me explain my background so that I can show its relevance in
  this context.
 
  I've done research work in meta-heuristics, including proposing the
  equivalents of local search and mutation for quantum-inspired algorithms,
  in my paper titled *Superior Exploration-Exploitation Balance With
  Quantum-Inspired Hadamard Walks*, which was accepted as a late-breaking
  paper at GECCO 2010. We (myself and a friend - it was independent work)
  hope to send an expanded version of the work to a journal in the near
  future. For this project, our implementation language was Mathematica, as
  we needed the combination of functional paradigms and mathematically sound
  built-in resources (like biased random number generation, simple linear
  programming functions, etc.) as well as rapid prototyping ability.
 
  I previously interned at GE Research in their Computing and Decision
  Sciences Lab
  http://ge.geglobalresearch.com/technologies/computing-decision-sciences/
  last year, where I worked on machine learning techniques for large-scale
  databases - specifically on the Netflix Prize itself. Over a two-month
  internship we rose from 1800th to 409th position on the leaderboard, and
  implemented at least one variant of each of the major algorithms. The
  contest ended at the same time as our internship concluded, and the
  winning result was a combination of multiple variants of the algorithms
  we had implemented.
 
  Interestingly, we did try to use Hadoop and the Map-Reduce model for this
  purpose, based on a talk by a visitor from Yahoo! during that time.
  However, not having access to a cluster proved to be an impediment to fast
  iterative development. We had one 16-core machine, so we developed a
  toolkit in C++ that could multiprocess up to 16 threads (data-input
  parallelization, rather than modifying the algorithms to suit the
  Map-Reduce model), and implemented all our algorithms using the same
  toolkit. Specifically, SVD, kNN Movie