Dear all, I have submitted my application on Google. It seems that all students also post their applications here, so I hope it is not too late for me to post mine. Please give me some suggestions on my proposal. Thanks.
Coincidentally, Yun Jiang, another applicant, and I are in the same lab :-).

Application Abstract

I have solid background knowledge of Machine Learning. Naive Bayes, Neural Networks, Logistic Regression, Locally Weighted Linear Regression, and k-Means are easy for me to implement, while SVM, PCA, ICA, EM, and GDA may cost me some effort. For each algorithm, I plan to first find an existing stable implementation for reference. Second, I will implement a single-machine version and verify its correctness against the reference implementation. Then I will implement a Map/Reduce version and verify its correctness against the reference implementation/the single-machine version. Finally, I will find some large datasets to benchmark the Map/Reduce version and measure its speedup. Of course, I will also write documentation and unit tests during each step.

I am interested in Open Source development, and I am eager to participate in an open source project. I have used a lot of open source software and tools for a long time. GSoC is a good opportunity for me to contribute to the open source community. I want to get started here and continue to contribute even after GSoC.

1. Biography

I am a graduate student in the CS department of Shanghai JiaoTong University, Shanghai, China. I have read through the Mahout dev mailing list, and I have a rough idea of the project's progress. My research interests include Social Annotation, Information Retrieval, Web Mining, Semantic Web, Web 2.0, etc. Statistical Learning and Machine Learning are fundamental knowledge for me, because I have to deal with many tasks in data and knowledge management. My resume can be accessed at http://www.apexlab.org/apex_wiki/hzheng.

Besides my research in the lab, I have taken two highly related courses on Machine Learning: Machine Learning (textbook: Machine Learning. Tom Mitchell, McGraw Hill, 1997.
http://www.amazon.com/Machine-Learning-Tom-M-Mitchell/dp/0070428077), and Statistical Learning (textbook: The Elements of Statistical Learning. T Hastie, R Tibshirani, J Friedman, Springer, 2001. http://www.amazon.com/Elements-Statistical-Learning-T-Hastie/dp/0387952845). So I believe I have solid background knowledge of Machine Learning. My plan for the Mahout project of GSoC is detailed in Section 2.

Recently, I have become interested in Open Source development, and I am eager to participate in an open source project. I have used a lot of open source software and tools for a long time. GSoC is a good opportunity for me to contribute to the open source community. I want to get started here and continue to contribute even after GSoC.

2. Plan

2.1. Preparation Phase

About Machine Learning: I believe that Naive Bayes, Neural Networks, Logistic Regression, Locally Weighted Linear Regression, and k-Means are easy for me to implement, while SVM, PCA, ICA, EM, and GDA may cost me much more effort. I notice that there are JIRA issues for Naive Bayes, k-Means, and EM, while the svn trunk only has code for k-Means. I can help the current committers with these existing algorithms and also contribute new ones.

About Map/Reduce: I have read the Google paper "MapReduce: Simplified Data Processing on Large Clusters" and the NIPS paper "Map-Reduce for Machine Learning on Multicore". I have learned the general idea of Map/Reduce, but I have to admit that I have no hands-on experience with it. I will learn Hadoop first and try some trivial use cases on it to get familiar with Map/Reduce programming. Once I am familiar with Hadoop, I expect to have no problem in this respect either.

About general programming skills: I have about 4 years of experience in Java programming. I am proficient in Java and have taken part in several large projects.

2.2. Coding Phase

I predict it will take me about 2 weeks each to implement Naive Bayes, Neural Networks, Logistic Regression, Locally Weighted Linear Regression, and k-Means, and about 4 weeks each to implement SVM, PCA, ICA, EM, and GDA. By "implement", I mean the following steps:

a). find an existing stable implementation for reference
b). implement a single-machine version, and verify its correctness against the reference implementation
c). implement a Map/Reduce version, and verify its correctness against the reference implementation/the single-machine version
d). find some large datasets to benchmark the Map/Reduce version, and measure its speedup

* During each step, I will also write documentation and unit tests.

Steps a), b), and c) ensure the correctness of our Map/Reduce implementation, while step d) measures its performance.

2.3. Miscellaneous but Non-trivial Aspects

I can also help Mahout with some miscellaneous but non-trivial aspects, for example input/output format standards, input/output utilities, and anything proposed on JIRA. My software engineering experience may help here.

3. Schedule

now - May 26: Preparation Phase. Learn more about Map/Reduce. Consult mentors on what to start with. Take part in the discussion on the dev mailing list.

May 27 - August 11: Coding Phase. In these 11 weeks, I plan to implement 3-4 algorithms, help with some miscellaneous aspects, and write documentation and unit tests.

August 12 - August 18: Fix minor errors. Complete the remaining documentation.
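
To illustrate steps b) and c) of Section 2.2, here is a minimal sketch of one k-Means centroid update in plain Java. It is only an illustration under stated assumptions: it does not use Hadoop, the "map" and "reduce" stages are simulated in memory over round-robin splits, and the class and method names (KMeansStepSketch, singleMachineStep, mapReduceStep, maxAbsDiff) are hypothetical and not taken from Mahout. The point is the verification idea: the Map/Reduce-style result should match the single-machine result up to floating-point rounding.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KMeansStepSketch {

    /** Index of the centroid closest to p (squared Euclidean distance). */
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0.0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - centroids[c][j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    /** Step b): single-machine centroid update, used as the reference. */
    static double[][] singleMachineStep(double[][] points, double[][] centroids) {
        int k = centroids.length, dim = centroids[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            int c = nearest(p, centroids);
            counts[c]++;
            for (int j = 0; j < dim; j++) sums[c][j] += p[j];
        }
        return divide(sums, counts, centroids);
    }

    /** "Map": one split emits, per cluster id, a partial sum with the count in the last slot. */
    static Map<Integer, double[]> map(List<double[]> split, double[][] centroids) {
        int dim = centroids[0].length;
        Map<Integer, double[]> partial = new HashMap<>();
        for (double[] p : split) {
            double[] acc = partial.computeIfAbsent(nearest(p, centroids), id -> new double[dim + 1]);
            for (int j = 0; j < dim; j++) acc[j] += p[j];
            acc[dim] += 1.0;
        }
        return partial;
    }

    /** Step c): split the data, map each split, then "reduce" by merging the partial sums. */
    static double[][] mapReduceStep(double[][] points, double[][] centroids, int numSplits) {
        int k = centroids.length, dim = centroids[0].length;
        List<List<double[]>> splits = new ArrayList<>();
        for (int s = 0; s < numSplits; s++) splits.add(new ArrayList<>());
        for (int i = 0; i < points.length; i++) splits.get(i % numSplits).add(points[i]);

        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (List<double[]> split : splits) {
            for (Map.Entry<Integer, double[]> e : map(split, centroids).entrySet()) {
                int c = e.getKey();
                double[] acc = e.getValue();
                for (int j = 0; j < dim; j++) sums[c][j] += acc[j];
                counts[c] += (int) acc[dim];
            }
        }
        return divide(sums, counts, centroids);
    }

    /** Turns sums into means; an empty cluster keeps its old centroid. */
    static double[][] divide(double[][] sums, int[] counts, double[][] old) {
        double[][] result = new double[sums.length][];
        for (int c = 0; c < sums.length; c++) {
            result[c] = old[c].clone();
            if (counts[c] > 0) {
                for (int j = 0; j < sums[c].length; j++) result[c][j] = sums[c][j] / counts[c];
            }
        }
        return result;
    }

    /** Verification helper: largest absolute difference between two centroid sets. */
    static double maxAbsDiff(double[][] a, double[][] b) {
        double max = 0.0;
        for (int c = 0; c < a.length; c++) {
            for (int j = 0; j < a[c].length; j++) {
                max = Math.max(max, Math.abs(a[c][j] - b[c][j]));
            }
        }
        return max;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 10}, {10, 12}};
        double[][] centroids = {{1, 1}, {9, 9}};
        double[][] reference = singleMachineStep(points, centroids);
        double[][] distributed = mapReduceStep(points, centroids, 2);
        System.out.println("max difference: " + maxAbsDiff(reference, distributed));
    }
}
```

In a unit test, the same comparison would be made with a small tolerance rather than exact equality, because merging partial sums changes the order of floating-point additions. A real Hadoop version would replace the in-memory splits with input splits and the merge loop with a reducer.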
