Also, this proposal should itself be a JIRA ticket with the GSOC tag applied to it. That will make it visible to the Apache-wide Summer of Code administrators.
On Fri, Mar 12, 2010 at 3:57 AM, zhao zhendong <zhaozhend...@gmail.com> wrote:

> Sure, I will revise it tonight.
>
> Thanks, Robin.
>
> On Fri, Mar 12, 2010 at 7:22 PM, Robin Anil <robin.a...@gmail.com> wrote:
>
> > Hi Zhao,
> > Some quick feedback.
> >
> > 1) Can you update the GSoC issue on the classifier with a Nabble link
> > to this thread, or a link from any other aggregator?
> > 2) I hope you will read the GSoC timelines more carefully: there is a
> > mid-term evaluation, an end-term evaluation and some buffer time. You
> > will have to change your current timeline to reflect that accurately.
> >
> > I will post more queries about the design choices later.
> >
> > Robin
> >
> > On Fri, Mar 12, 2010 at 4:18 PM, zhao zhendong <zhaozhend...@gmail.com> wrote:
> >
> > > Hi all,
> > > The updated proposal for GSoC 2010 is as follows; any comment is
> > > welcome.
> > >
> > > Title/Summary: Linear SVM Package (LIBLINEAR) for Mahout
> > > Student: Zhen-Dong Zhao
> > > Student e-mail: zha...@comp.nus.edu.sg
> > > Student Major: Multimedia Information Retrieval / Computer Science
> > > Student Degree: Master
> > > Student Graduation: NUS’10
> > > Organization: Hadoop
> > >
> > > 0 Abstract
> > >
> > > Linear Support Vector Machines (SVMs) are useful in applications
> > > with large-scale datasets or high-dimensional features. This
> > > proposal will port one of the best-known linear SVM solvers,
> > > LIBLINEAR [1], to Mahout, with the same unified interface as the
> > > Pegasos [2] package on Mahout, another linear SVM solver that I
> > > have almost finished. The two contributions are: 1) introducing
> > > LIBLINEAR to Mahout; 2) unified interfaces for linear SVM
> > > classifiers.
> > >
> > > 1 Motivation
> > >
> > > As one of the top 10 algorithms in the data mining community [3],
> > > the Support Vector Machine is a very powerful machine learning tool,
> > > widely adopted in data mining, pattern recognition and information
> > > retrieval.
> > >
> > > SVM training is slow, however, especially on large-scale datasets.
> > > Several recent papers propose linear-kernel SVM solvers that can
> > > handle large-scale learning problems, for instance LIBLINEAR [1]
> > > and Pegasos [2]. I have implemented a prototype linear SVM
> > > classifier based on Pegasos [2] for Mahout (issue: MAHOUT-232).
> > > As the winner of the linear SVM track of the ICML 2008 large-scale
> > > learning challenge
> > > (http://largescale.first.fraunhofer.de/summary/), LIBLINEAR [1]
> > > should be incorporated in Mahout too. Currently, the LIBLINEAR
> > > package supports:
> > >
> > > - L2-regularized classifiers: L2-loss linear SVM, L1-loss linear
> > >   SVM, and logistic regression (LR)
> > > - L1-regularized classifiers: L2-loss linear SVM and logistic
> > >   regression (LR)
> > >
> > > The main features of LIBLINEAR are the following:
> > >
> > > - Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
> > > - Cross validation for model selection
> > > - Probability estimates (logistic regression only)
> > > - Weights for unbalanced data
> > >
> > > *All of these functionalities are to be implemented except
> > > probability estimates and weights for unbalanced data* (time
> > > permitting, I would like to do those as well).
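To make the solver list in section 1 of the quoted proposal concrete, here is a minimal Python sketch of the objectives it names. As I understand the LIBLINEAR formulation, the L2-regularized problems minimize 0.5 * w.w + C * sum(loss) and the L1-regularized ones replace the first term with the 1-norm of w. The function names and the parameter C's default are hypothetical, chosen only for illustration; this is not LIBLINEAR's API or its solver, only the quantities being minimized.

```python
import math


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def l1_loss(margin):
    """L1-loss (hinge): max(0, 1 - y * w.x)."""
    return max(0.0, 1.0 - margin)


def l2_loss(margin):
    """L2-loss (squared hinge): max(0, 1 - y * w.x) ** 2."""
    return max(0.0, 1.0 - margin) ** 2


def logistic_loss(margin):
    """Logistic regression loss: log(1 + exp(-y * w.x)), computed stably."""
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def total_loss(w, data, loss, C):
    """C times the summed loss over (x, y) pairs with y in {-1, +1}."""
    return C * sum(loss(y * dot(w, x)) for x, y in data)


def l2_regularized_objective(w, data, loss, C=1.0):
    """0.5 * ||w||_2^2 + C * sum of losses."""
    return 0.5 * dot(w, w) + total_loss(w, data, loss, C)


def l1_regularized_objective(w, data, loss, C=1.0):
    """||w||_1 + C * sum of losses."""
    return sum(abs(wi) for wi in w) + total_loss(w, data, loss, C)
```

Passing `l1_loss`, `l2_loss` or `logistic_loss` to `l2_regularized_objective` gives the three L2-regularized classifiers in the list; passing `l2_loss` or `logistic_loss` to `l1_regularized_objective` gives the two L1-regularized ones.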
> > >
> > > 2 Unified Interfaces
> > >
> > > The Pegasos-based linear SVM classifier package on Mahout
> > > (http://issues.apache.org/jira/browse/MAHOUT-232) already provides
> > > the following functionality:
> > >
> > > - Sequential binary classification (two-class classification),
> > >   including sequential training and prediction;
> > > - Sequential regression;
> > > - Parallel & sequential multi-class classification, including
> > >   One-vs.-One and One-vs.-Others schemes.
> > >
> > > The functionalities of the Pegasos package on Mahout and of
> > > LIBLINEAR are quite similar. In this section I introduce unified
> > > interfaces for linear SVM classifiers on Mahout, which will
> > > incorporate Pegasos and LIBLINEAR. The whole picture of the
> > > interfaces is illustrated in Figure 1.
> > >
> > > The unified interfaces have two main parts: 1) dataset loader;
> > > 2) algorithms. I will introduce them separately.
> > >
> > > *2.1 Data Handler*
> > >
> > > The dataset can be stored on a personal computer or on a Hadoop
> > > cluster. This framework provides a high-performance Random Loader
> > > and Sequential Loader for accessing large-scale data.
> > >
> > > Figure 1: The framework of linear SVM on Mahout
> > >
> > > *2.2 Sequential Algorithms*
> > >
> > > Sequential algorithms will include binary classification and
> > > regression based on Pegasos and LIBLINEAR, with a unified
> > > interface.
> > >
> > > *2.3 Parallel Algorithms*
> > >
> > > It is widely accepted that parallelizing a binary SVM classifier is
> > > hard. For multi-class classification, however, a coarse-grained
> > > scheme (e.g. each Mapper or Reducer has one independent binary SVM
> > > classifier) can achieve great improvement more easily. Cross
> > > validation for model selection can also take advantage of such
> > > coarse-grained parallelism.
> > > I will introduce a unified interface for all of them.
> > >
> > > 3 Biography:
> > >
> > > I am a graduating master's student in Multimedia Information
> > > Retrieval Systems at the National University of Singapore. My
> > > research has involved large-scale SVM classifiers.
> > >
> > > I have worked with Hadoop and MapReduce for the past year, and I
> > > have dedicated much of my spare time to the sequential SVM
> > > (Pegasos) package for Mahout
> > > (http://issues.apache.org/jira/browse/MAHOUT-232). I have taken
> > > part in setting up and maintaining a Hadoop cluster of around 70
> > > nodes in our group.
> > >
> > > 4 Timeline:
> > >
> > > Weeks 1-4: Implement binary classifier
> > >
> > > Weeks 5-6: Implement parallel multi-class classification and cross
> > > validation for model selection
> > >
> > > Weeks 7-8: Interface refactoring and performance tuning
> > >
> > > Weeks 9-10: Clean up / prepare for end of GSoC
> > >
> > > References
> > >
> > > [1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and
> > > Chih-Jen Lin. LIBLINEAR: A library for large linear classification.
> > > J. Mach. Learn. Res., 9:1871–1874, 2008.
> > >
> > > [2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos:
> > > Primal estimated sub-gradient solver for SVM. In ICML ’07:
> > > Proceedings of the 24th International Conference on Machine
> > > Learning, pages 807–814, New York, NY, USA, 2007. ACM.
> > >
> > > [3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang
> > > Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu,
> > > Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and
> > > Dan Steinberg. Top 10 algorithms in data mining. Knowl. Inf. Syst.,
> > > 14(1):1–37, 2007.
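For the coarse-grained scheme in section 2.3 of the quoted proposal, here is a minimal Python sketch, assuming one independent binary trainer per class in a one-vs-rest setup. The binary trainer uses the Pegasos sub-gradient step from [2] with step size 1 / (lambda * t). Several things here are my own simplifications and not taken from the proposal or from Mahout: all function names are hypothetical, a thread pool stands in for Mappers, examples are visited in shuffled epochs rather than sampled with replacement, and the optional projection step and the bias term are left out.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def pegasos_step(w, x, y, lam, t):
    """One Pegasos sub-gradient step at iteration t (t starts at 1)."""
    eta = 1.0 / (lam * t)
    violated = y * dot(w, x) < 1.0  # margin is checked against the old w
    w = [(1.0 - eta * lam) * wi for wi in w]  # shrink from the regularizer
    if violated:
        w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w


def train_binary(examples, lam=0.1, epochs=20, seed=0):
    """Train on (x, y) pairs with y in {-1, +1}; returns the weight list."""
    rng = random.Random(seed)
    w = [0.0] * len(examples[0][0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(examples)))
        rng.shuffle(order)  # assumption: shuffled epochs, not i.i.d. sampling
        for i in order:
            t += 1
            x, y = examples[i]
            w = pegasos_step(w, x, y, lam, t)
    return w


def train_one_vs_rest(xs, labels, workers=4):
    """Train one independent binary classifier per class, in parallel."""
    classes = sorted(set(labels))

    def task(c):
        # Relabel: class c is +1, every other class is -1.
        relabeled = [(x, 1 if l == c else -1) for x, l in zip(xs, labels)]
        return c, train_binary(relabeled)

    # Each task shares no state with the others, which is what makes the
    # scheme coarse-grained; the pool is only a stand-in for Mappers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(task, classes))


def predict(model, x):
    """Return the class whose classifier gives the highest score."""
    return max(model, key=lambda c: dot(model[c], x))
```

A call such as `model = train_one_vs_rest(xs, labels)` followed by `predict(model, x)` is the whole workflow. Cross validation parallelizes the same way: each (fold, parameter) pair is an independent `train_binary` call.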
> > >
> > > -------------------------------------------------------------
> > > Zhen-Dong Zhao (Maxim)
> > > Department of Computer Science
> > > School of Computing
> > > National University of Singapore