Dear Mahout community,

My name is Cristi Prodan. I'm 23 years old and currently a second-year student pursuing an MSc degree in Computer Science. I started studying machine learning in the past year, and during my research I found out about the MapReduce model. Then I discovered Hadoop and Mahout. I was very impressed by the power of these frameworks and their great potential. For this reason I would like to submit a proposal for this year's Google Summer of Code competition.
I have looked at the proposals made by Robin on JIRA (https://issues.apache.org/jira/secure/IssueNavigator.jspa?mode=hide&requestId=12314021) and have settled on two ideas. I would like to ask for your help in deciding which one would be best to pick. Since I've never done GSoC before, I'm hoping someone can advise on the size of the project (too small or too big for the summer period) and, above all, on its importance for the Mahout framework. After hearing your answers, I intend to focus fully on thorough research of a single idea.

IDEA 1 - MinHash clustering
---------------------------
The first idea came after taking a look at Google's paper on collaborative filtering for their news system [2]. In that paper, I looked at MinHash clustering. My first question is: is MinHash clustering considered cool? If yes, then I would like to take a stab at implementing it. The paper also describes the implementation in a MapReduce style. Since this is only a suggestion, I will not elaborate very much on the solution now. I would like to ask you whether this might be considered a good choice (i.e. whether it is important for the framework to have something like this) and whether this is a big enough project.

IDEA 2 - Additions to Taste Recommender
---------------------------------------
My second idea for this competition is to add some capabilities to the Taste framework. I have reviewed a couple of papers from the Netflix contest winning teams, read chapters 1 through 6 of [1] and looked into Taste's code. My idea is to implement parallel prediction blending support using linear regression or another machine learning method, but so far I haven't reached a clear solution for this. I'm preparing my dissertation on recommender systems, and this was the first idea I had when thinking about participating in GSoC. If you have any ideas on this and want to share them, I would be very thankful.

Thank you in advance.

Best regards,
Cristi.
BIBLIOGRAPHY:
---------------
[1] Owen, Anil - Mahout in Action. Manning, 2010.
[2] Abhinandan Das, Mayur Datar, Ashutosh Garg, Shyam Rajaram - Google News Personalization: Scalable Online Collaborative Filtering, WWW 2007.
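
P.S. To make IDEA 1 a bit more concrete, here is a minimal single-machine sketch of the MinHash clustering idea from [2]. It is only an illustration under stated assumptions, not a proposed Mahout design: the function names (item_hash, minhash_signature, minhash_clusters), the sample click histories and the use of seeded SHA-256 in place of true random permutations are all hypothetical choices made for this example. Users whose item sets produce the same concatenated min-hash values end up in the same cluster.

```python
import hashlib
from collections import defaultdict


def item_hash(seed, item):
    """Deterministic 64-bit hash of (seed, item).

    Assumption: a seeded hash stands in for a random permutation of items.
    """
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes):
    """One min-hash per seed: the smallest hash value over the item set."""
    return tuple(
        min(item_hash(seed, item) for item in items)
        for seed in range(num_hashes)
    )


def minhash_clusters(user_items, num_hashes=2):
    """Group users whose concatenated min-hash signatures are identical."""
    clusters = defaultdict(list)
    for user, items in user_items.items():
        if items:  # a user with no items has no min-hash, so skip them
            clusters[minhash_signature(items, num_hashes)].append(user)
    return list(clusters.values())


# Hypothetical click histories, invented for this example.
history = {
    "alice": {"a", "b", "c"},
    "bob": {"a", "b", "c"},
    "carol": {"x", "y"},
}
clusters = minhash_clusters(history)
```

In MapReduce terms, computing the signature of one user would be the map step (emitting the signature as key and the user as value), and collecting the users that share a key would be the reduce step. Using more hash functions per signature makes clusters tighter, since two users must then agree on every min-hash value.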