Zhao has been busy with his new work. He was doing some optimisations and tweaks to the code, but I haven't heard from him for the last couple of months. I have cc'ed him on this email.
Robin

On Wed, Jan 19, 2011 at 1:44 PM, Ted Dunning (JIRA) <[email protected]> wrote:

>
> [https://issues.apache.org/jira/browse/MAHOUT-334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983601#action_12983601]
>
> Ted Dunning commented on MAHOUT-334:
> ------------------------------------
>
> It didn't sound real ready to me. Having a non-scalable SVM just
> replicates capabilities elsewhere like in R.
>
> I do think that we should lift the evaluation metrics and apply them to SGD
> even if this code doesn't get committed as is.
>
> > Proposal for GSoC2010 (Linear SVM for Mahout)
> > ---------------------------------------------
> >
> > Key: MAHOUT-334
> > URL: https://issues.apache.org/jira/browse/MAHOUT-334
> > Project: Mahout
> > Issue Type: Task
> > Affects Versions: 0.4
> > Reporter: zhao zhendong
> > Assignee: Robin Anil
> > Fix For: 0.5
> >
> > Attachments: Mahout-issue334-0.2.patch, Mahout-issue334-0.3.patch,
> > Mahout-issue334-0.5.patch, Mahout-issue334.patch, Mahout-issue334.patch,
> > Utils_LibsvmFormat_Convertor.patch
> >
> >
> > Title/Summary: Linear SVM Package (LIBLINEAR) for Mahout
> > Student: Zhen-Dong Zhao
> > Student e-mail: [email protected]
> > Student Major: Multimedia Information Retrieval / Computer Science
> > Student Degree: Master
> > Student Graduation: NUS'10
> > Organization: Hadoop
> >
> > 0 Abstract
> > Linear Support Vector Machines (SVMs) are useful in applications with
> > large-scale datasets or datasets with high-dimensional features. This
> > proposal will port one of the best-known linear SVM solvers, LIBLINEAR
> > [1], to Mahout with the same unified interface as Pegasos [2] on Mahout,
> > which is another linear SVM solver that I have almost finished. The two
> > distinct contributions would be: 1) introducing LIBLINEAR to Mahout;
> > 2) unified interfaces for linear SVM classifiers.
> >
> > 1 Motivation
> > As one of the top 10 algorithms in the data mining community [3], the
> > Support Vector Machine is a very powerful machine learning tool, widely
> > adopted in the data mining, pattern recognition and information retrieval
> > domains.
> >
> > The SVM training procedure is pretty slow, however, especially with
> > large-scale datasets. Several recent papers propose SVM solvers with a
> > linear kernel that can handle large-scale learning problems, for instance
> > LIBLINEAR [1] and Pegasos [2]. I have implemented a prototype of a linear
> > SVM classifier based on Pegasos [2] for Mahout (issue: MAHOUT-232).
> > Nevertheless, as the winner of the ICML 2008 large-scale learning
> > challenge (linear SVM track,
> > http://largescale.first.fraunhofer.de/summary/), LIBLINEAR [1] should be
> > incorporated into Mahout too. Currently, the LIBLINEAR package supports:
> > (1) L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM,
> > and logistic regression (LR)
> > (2) L1-regularized classifiers: L2-loss linear SVM and logistic
> > regression (LR)
> >
> > The main features of LIBLINEAR are the following:
> > (1) Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
> > (2) Cross validation for model selection
> > (3) Probability estimates (logistic regression only)
> > (4) Weights for unbalanced data
> >
> > All of these functionalities will be implemented except probability
> > estimates and weights for unbalanced data (if time permits, I would like
> > to implement those as well).
> >
> > 2 Unified Interfaces
> > The linear SVM classifier based on the Pegasos package on Mahout can
> > already provide the following functionalities
> > (http://issues.apache.org/jira/browse/MAHOUT-232):
> > (1) Sequential Binary Classification (Two-class Classification),
> > including sequential training and prediction;
> > (2) Sequential Regression;
> > (3) Parallel & Sequential Multi-Classification, including One-vs.-One
> > and One-vs.-Others schemes.
> >
> > Apparently, the functionalities of the Pegasos package on Mahout and
> > LIBLINEAR are quite similar to each other. In this section I will
> > introduce a unified interface for linear SVM classifiers on Mahout, which
> > will incorporate Pegasos and LIBLINEAR.
> >
> > The unified interface has two main parts: 1) Dataset loader;
> > 2) Algorithms. I will introduce them separately.
> >
> > 2.1 Data Handler
> > The dataset can be stored on a personal computer or on a Hadoop cluster.
> > This framework provides a high-performance Random Loader and Sequential
> > Loader for accessing large-scale data.
> >
> > 2.2 Sequential Algorithms
> > Sequential algorithms will include binary classification and regression
> > based on Pegasos and LIBLINEAR with a unified interface.
> >
> > 2.3 Parallel Algorithms
> > It is widely accepted that parallelizing a binary SVM classifier is hard.
> > For multi-class classification, however, a coarse-grained scheme (e.g.
> > each Mapper or Reducer runs one independent binary SVM classifier) makes
> > a large improvement easier to achieve. Besides, cross validation for
> > model selection can also take advantage of such coarse-grained
> > parallelism. I will introduce a unified interface for all of them.
> >
> > 3 Biography:
> > I am a graduating master's student in Multimedia Information Retrieval
> > Systems at the National University of Singapore. My research has involved
> > large-scale SVM classifiers.
> >
> > I have worked with Hadoop and MapReduce for a year, and I have dedicated
> > much of my spare time to Sequential SVM (Pegasos) based on Mahout
> > (http://issues.apache.org/jira/browse/MAHOUT-232). I have taken part in
> > setting up and maintaining a Hadoop cluster with around 70 nodes in our
> > group.
> >
> > 4 Timeline:
> > Weeks 1-4 (May 24 ~ June 18): Implement binary classifier
> > Weeks 5-7 (June 21 ~ July 12): Implement parallel multi-class
> > classification and cross validation for model selection.
> > Week 8 (July 12 ~ July 16): Submit mid-term evaluation
> > Weeks 9-11 (July 16 ~ August 9): Interface refactoring and performance
> > tuning
> > Weeks 11-12 (August 9 ~ August 16): Code cleaning, documentation and
> > testing.
> >
> > 5 References
> > [1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and
> > Chih-Jen Lin. LIBLINEAR: A library for large linear classification.
> > J. Mach. Learn. Res., 9:1871-1874, 2008.
> > [2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal
> > estimated sub-gradient solver for SVM. In ICML '07: Proceedings of the
> > 24th international conference on Machine learning, pages 807-814, New
> > York, NY, USA, 2007. ACM.
> > [3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang,
> > Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu,
> > Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10
> > algorithms in data mining. Knowl. Inf. Syst., 14(1):1-37, 2007.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
