Dear Mahout community,

My name is Cristi Prodan. I'm 23 years old and currently a 2nd-year student 
pursuing an MSc degree in Computer Science. 
I started studying machine learning in the past year, and during my research I 
found out about the MapReduce model. Then I discovered Hadoop and Mahout. I was 
very impressed by the power of these frameworks and their great potential. For 
this reason I would like to submit a proposal for this year's Google Summer of 
Code. 

I have looked at the proposals made by Robin on JIRA 
(https://issues.apache.org/jira/secure/IssueNavigator.jspa?mode=hide&requestId=12314021).
I have narrowed them down to two ideas, and I would like to ask for your help 
in deciding which one would be best to pick. Since I've never done GSoC before, 
I'm hoping someone can advise on the size of the project (too small or too big 
for the summer period) and, most of all, its importance for the Mahout 
framework. After hearing your answers, I intend to focus fully on thorough 
research of a single idea.


IDEA 1 - MinHash clustering
---------------------------
The first idea came after taking a look at Google's paper on collaborative 
filtering for their news system [2]. In that paper, I looked at MinHash 
clustering. 
My first question is: is MinHash clustering considered cool? If yes, then I 
would like to take a stab at implementing it. 
The paper also describes the implementation in a MapReduce style. Since this is 
only a suggestion, I will not elaborate very much on the solution now. I would 
like to ask you whether this might be considered a good choice (i.e. important 
for the framework to have something like this) and whether this is a big enough 
project.
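
To make the idea a bit more concrete, here is a minimal single-machine Python 
sketch of MinHash clustering as I understand it from [2]: each user's click 
history is hashed under several seeded hash functions, the minimum value per 
function is kept, and the concatenated minima act as the cluster id. The names 
item_hash, minhash_signature and minhash_clusters, the SHA-256 based hashing 
and the toy data are my own assumptions for illustration; none of this is 
Mahout code or the paper's exact procedure.

```python
import hashlib
from collections import defaultdict


def item_hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed (illustrative choice)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, seeds):
    """One min-hash value per seed: the smallest hash over the item set."""
    return tuple(min(item_hash(seed, item) for item in items) for seed in seeds)


def minhash_clusters(histories, num_hashes=2):
    """Group users whose concatenated min-hash keys are identical."""
    seeds = range(num_hashes)
    clusters = defaultdict(list)
    for user, items in histories.items():
        if items:  # a user with an empty history has no min-hash, so skip
            clusters[minhash_signature(items, seeds)].append(user)
    return dict(clusters)


if __name__ == "__main__":
    # Hypothetical click histories, made up for this example.
    histories = {
        "alice": {"story1", "story2", "story3"},
        "bob": {"story1", "story2", "story3"},
        "carol": {"story7", "story8"},
    }
    for users in minhash_clusters(histories).values():
        print(sorted(users))
```

In a MapReduce setting, computing the signature for one user would roughly 
correspond to the map step (emitting the cluster id as the key), and collecting 
the users that share a key would correspond to the reduce step.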

IDEA 2 - Additions to Taste Recommender
---------------------------------------
My second idea for this competition is to add some capabilities to the 
Taste framework. I have reviewed a couple of papers from the Netflix contest 
winning teams, read chapters 1 through 6 of [1] and looked into Taste's code. 
My idea is to implement parallel prediction blending support using linear 
regression or another machine learning method, but so far I haven't reached a 
point where I have a clear solution for this. I'm preparing my dissertation 
on recommender systems, and this was the first idea I had when thinking 
about participating in GSoC. If you have any ideas on this and want to share 
them, I would be very thankful.
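
As a rough illustration of what I mean by blending with linear regression, 
here is a small Python sketch that fits least-squares weights for combining 
the outputs of several recommenders. The function names (fit_blend_weights, 
blend, solve_linear_system) and the sample numbers are hypothetical and not 
part of Taste; this is only a sketch under the assumption that each 
recommender's predicted rating is available alongside the true rating on a 
held-out set.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def fit_blend_weights(predictions, ratings):
    """Least-squares weights (intercept first) via the normal equations."""
    rows = [[1.0] + list(p) for p in predictions]
    n = len(rows[0])
    # X^T X and X^T y are plain sums over the training rows.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * y for r, y in zip(rows, ratings)) for i in range(n)]
    return solve_linear_system(xtx, xty)


def blend(weights, prediction):
    """Blended rating: intercept plus weighted sum of the individual predictions."""
    return weights[0] + sum(w * p for w, p in zip(weights[1:], prediction))


if __name__ == "__main__":
    # Hypothetical held-out data: (recommender A, recommender B) predictions.
    predictions = [(3.0, 4.0), (2.0, 5.0), (4.5, 1.0), (1.0, 1.5), (5.0, 3.5)]
    ratings = [3.5, 3.0, 4.0, 1.5, 4.5]
    weights = fit_blend_weights(predictions, ratings)
    print(weights, blend(weights, (4.0, 2.0)))
```

Because X^T X and X^T y are sums over individual rows, partial sums could be 
computed per data split and then added together, which is where I think a 
parallel version might fit.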

Thank you in advance.

Best regards,
Cristi. 

BIBLIOGRAPHY:
---------------
[1] Owen, Anil - Mahout in Action. Manning, 2010. 

[2] Abhinandan Das, Mayur Datar, Ashutosh Garg, Shyam Rajaram - Google News 
Personalization: Scalable Online Collaborative Filtering, WWW 2007.
