Sure, I will revise it tonight. Thanks, Robin.
On Fri, Mar 12, 2010 at 7:22 PM, Robin Anil <robin.a...@gmail.com> wrote:

> Hi Zhao,
> Some quick feedback.
>
> 1) Can you update the GSoC issue on the classifier with a Nabble link to
> this thread, or a link from any other aggregator?
> 2) I hope you will read the GSoC timeline more carefully: there is a
> mid-term evaluation, an end-term evaluation, and some buffer time. You
> will have to change your current timeline to reflect that accurately.
>
> I will post more queries about the design choices later.
>
> Robin
>
> On Fri, Mar 12, 2010 at 4:18 PM, zhao zhendong <zhaozhend...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > The updated proposal for GSoC 2010 is as follows; any comment is welcome.
> >
> > Title/Summary: Linear SVM Package (LIBLINEAR) for Mahout
> > Student: Zhen-Dong Zhao
> > Student e-mail: zha...@comp.nus.edu.sg
> > Student Major: Multimedia Information Retrieval / Computer Science
> > Student Degree: Master
> > Student Graduation: NUS’10
> > Organization: Hadoop
> >
> > 0 Abstract
> >
> > Linear Support Vector Machines (SVMs) are useful in applications with
> > large-scale datasets or high-dimensional features. This proposal will
> > port one of the best-known linear SVM solvers, LIBLINEAR [1], to Mahout,
> > with the same unified interface as the Pegasos [2] implementation in
> > Mahout, another linear SVM solver that I have almost finished. The two
> > distinct contributions would be: 1) introducing LIBLINEAR to Mahout;
> > 2) unified interfaces for linear SVM classifiers.
> >
> > 1 Motivation
> >
> > As one of the top 10 algorithms in the data mining community [3], the
> > Support Vector Machine is a very powerful machine learning tool, widely
> > adopted in the data mining, pattern recognition, and information
> > retrieval domains.
> >
> > The SVM training procedure is slow, however, especially on large-scale
> > datasets.
> > Recently, several papers have proposed SVM solvers with a linear kernel
> > that can handle large-scale learning problems, for instance LIBLINEAR
> > [1] and Pegasos [2]. I have implemented a prototype linear SVM
> > classifier based on Pegasos [2] for Mahout (issue: MAHOUT-232).
> > Nevertheless, as the winner of the ICML 2008 large-scale learning
> > challenge (linear SVM track,
> > http://largescale.first.fraunhofer.de/summary/), LIBLINEAR [1] ought to
> > be incorporated in Mahout too. Currently, the LIBLINEAR package
> > supports:
> >
> >    - L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM,
> >      and logistic regression (LR)
> >    - L1-regularized classifiers: L2-loss linear SVM and logistic
> >      regression (LR)
> >
> > The main features of LIBLINEAR are the following:
> >
> >    - Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
> >    - Cross validation for model selection
> >    - Probability estimates (logistic regression only)
> >    - Weights for unbalanced data
> >
> > *All of these functionalities are to be implemented except probability
> > estimates and weights for unbalanced data* (time permitting, I would
> > like to implement those too).
> >
> > 2 Unified Interfaces
> >
> > The linear SVM classifier based on Pegasos in Mahout already provides
> > the following functionalities
> > *(http://issues.apache.org/jira/browse/MAHOUT-232)*:
> >
> >    - Sequential binary classification (two-class classification),
> >      including sequential training and prediction;
> >    - Sequential regression;
> >    - Parallel and sequential multi-class classification, including
> >      one-vs.-one and one-vs.-others schemes.
> >
> > Evidently, the functionalities of the Pegasos package in Mahout and of
> > LIBLINEAR are quite similar to each other.
> > As mentioned above, in this section I will introduce a unified
> > interface for linear SVM classifiers in Mahout, which will incorporate
> > Pegasos and LIBLINEAR. The overall picture of the interfaces is
> > illustrated in Figure 1.
> >
> > The unified interface has two main parts: 1) dataset loader;
> > 2) algorithms. I will introduce them separately.
> >
> > *2.1 Data Handler*
> >
> > The dataset can be stored on a personal computer or on a Hadoop cluster.
> > This framework provides a high-performance Random Loader and Sequential
> > Loader for accessing large-scale data.
> >
> > Figure 1: The framework of linear SVM on Mahout
> >
> > *2.2 Sequential Algorithms*
> >
> > The sequential algorithms will include binary classification and
> > regression based on Pegasos and LIBLINEAR, behind the unified interface.
> >
> > *2.3 Parallel Algorithms*
> >
> > It is widely accepted that parallelizing a binary SVM classifier is
> > hard. For multi-class classification, however, a coarse-grained scheme
> > (e.g. each Mapper or Reducer runs one independent binary SVM classifier)
> > makes a large speed-up easier to achieve. Cross validation for model
> > selection can also take advantage of such coarse-grained parallelism. I
> > will introduce a unified interface for all of them.
> >
> > 3 Biography:
> >
> > I am a graduating master's student in Multimedia Information Retrieval
> > Systems at the National University of Singapore. My research has
> > involved large-scale SVM classifiers.
> >
> > I have worked with Hadoop and MapReduce for about a year, and I have
> > dedicated much of my spare time to the sequential SVM (Pegasos) for
> > Mahout *(http://issues.apache.org/jira/browse/MAHOUT-232)*. I have taken
> > part in setting up and maintaining a Hadoop cluster of around 70 nodes
> > in our group.
> >
> > 4 Timeline:
> >
> > Weeks 1-4: Implement binary classifier
> >
> > Weeks 5-6: Implement parallel multi-class classification and cross
> > validation for model selection
> >
> > Weeks 7-8: Interface refactoring and performance tuning
> >
> > Weeks 9-10: Clean up / prepare for end of GSoC
> >
> > References
> >
> > [1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and
> > Chih-Jen Lin. LIBLINEAR: A library for large linear classification.
> > J. Mach. Learn. Res., 9:1871–1874, 2008.
> >
> > [2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos:
> > Primal estimated sub-gradient solver for SVM. In ICML ’07: Proceedings
> > of the 24th International Conference on Machine Learning, pages
> > 807–814, New York, NY, USA, 2007. ACM.
> >
> > [3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang,
> > Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu,
> > Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg.
> > Top 10 algorithms in data mining. Knowl. Inf. Syst., 14(1):1–37, 2007.
> >
> > -------------------------------------------------------------
> > Zhen-Dong Zhao (Maxim)
> > Department of Computer Science
> > School of Computing
> > National University of Singapore

--
-------------------------------------------------------------
Zhen-Dong Zhao (Maxim)
Department of Computer Science
School of Computing
National University of Singapore
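
P.S. A minimal sketch of the coarse-grained scheme described in section 2.3 of the quoted proposal: each one-vs-rest binary classifier is trained as its own independent task. This is only an illustration under stated assumptions, not the proposed Mahout code. The binary trainer is a plain perceptron used as a stand-in (it is neither Pegasos nor LIBLINEAR), a thread pool stands in for Mapper/Reducer tasks, and the names `train_binary`, `train_one_vs_rest`, and `predict` as well as the toy data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def train_binary(samples, targets, max_epochs=1000):
    """Stand-in binary trainer: a plain perceptron (NOT Pegasos or LIBLINEAR)."""
    dim = len(samples[0]) + 1              # +1 for a constant bias feature
    w = [0.0] * dim
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, targets):
            xb = list(x) + [1.0]
            score = sum(wi * xi for wi, xi in zip(w, xb))
            if y * score <= 0:             # misclassified or on the boundary
                w = [wi + y * xi for wi, xi in zip(w, xb)]
                mistakes += 1
        if mistakes == 0:                  # all training points on the right side
            break
    return w


def train_one_vs_rest(samples, labels, workers=3):
    """Train one independent binary classifier per class, one task per class."""
    classes = sorted(set(labels))

    def task(cls):
        # Relabel: +1 for the current class, -1 for all the others.
        targets = [1 if lab == cls else -1 for lab in labels]
        return cls, train_binary(samples, targets)

    # The tasks share no state, so they can run in any order or in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(task, classes))


def predict(models, x):
    """Pick the class whose binary classifier gives the highest score."""
    xb = list(x) + [1.0]
    return max(models, key=lambda c: sum(wi * xi for wi, xi in zip(models[c], xb)))


# Toy, linearly separable data (hypothetical): three well-separated clusters.
samples = [(0, 4), (1, 5), (-1, 5),
           (4, -4), (5, -5), (5, -3),
           (-4, -4), (-5, -5), (-5, -3)]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

models = train_one_vs_rest(samples, labels)
predictions = [predict(models, x) for x in samples]
```

Because the per-class tasks are independent, the same structure would apply to running the folds of a cross validation in parallel; only the body of `task` would change.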