Re: [jira] Commented: (MAHOUT-334) Proposal for GSoC2010 (Linear SVM for Mahout)

Robin Anil Wed, 19 Jan 2011 00:25:49 -0800

Zhao has been busy with his new work. He was doing some optimisations and
tweaks to the code, but I haven't heard from him for the last couple of months.
I have cc'ed him on this email.


Robin

On Wed, Jan 19, 2011 at 1:44 PM, Ted Dunning (JIRA) <[email protected]> wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983601#action_12983601]
>
> Ted Dunning commented on MAHOUT-334:
> ------------------------------------
>
> It didn't sound real ready to me.  Having a non-scalable SVM just
> replicates capabilities elsewhere like in R.
>
> I do think that we should lift the evaluation metrics and apply them to SGD
> even if this code doesn't get committed as is.
>
> > Proposal for GSoC2010 (Linear SVM for Mahout)
> > ---------------------------------------------
> >
> >                 Key: MAHOUT-334
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-334
> >             Project: Mahout
> >          Issue Type: Task
> >    Affects Versions: 0.4
> >            Reporter: zhao zhendong
> >            Assignee: Robin Anil
> >             Fix For: 0.5
> >
> >         Attachments: Mahout-issue334-0.2.patch,
> Mahout-issue334-0.3.patch, Mahout-issue334-0.5.patch, Mahout-issue334.patch,
> Mahout-issue334.patch, Utils_LibsvmFormat_Convertor.patch
> >
> >
> > Title/Summary: Linear SVM Package (LIBLINEAR) for Mahout
> > Student: Zhen-Dong Zhao
> > Student e-mail: [email protected]
> > Student Major: Multimedia Information Retrieval / Computer Science
> > Student Degree: Master
> > Student Graduation: NUS'10
> > Organization: Hadoop
> > 0 Abstract
> > A linear Support Vector Machine (SVM) is useful in applications with
> large-scale datasets or high-dimensional features. This proposal will port
> one of the best-known linear SVM solvers, LIBLINEAR [1], to Mahout, with the
> same unified interface as Pegasos [2] on Mahout, another linear SVM solver
> that I have almost finished implementing. The two distinct contributions
> would be: 1) introducing LIBLINEAR to Mahout; 2) unified interfaces for
> linear SVM classifiers.
> > 1 Motivation
> > As one of the top 10 algorithms in the data mining community [3], the
> Support Vector Machine is a very powerful machine learning tool, widely
> adopted in the data mining, pattern recognition and information retrieval
> domains.
> > SVM training is slow, however, especially on large-scale datasets.
> Several recent papers propose SVM solvers with a linear kernel that can
> handle large-scale learning problems, for instance LIBLINEAR [1] and
> Pegasos [2]. I have implemented a prototype linear SVM classifier based on
> Pegasos [2] for Mahout (issue: MAHOUT-232). Nevertheless, as the winner of
> the ICML 2008 large-scale learning challenge (linear SVM track,
> http://largescale.first.fraunhofer.de/summary/), LIBLINEAR [1] should be
> incorporated into Mahout too. Currently, the LIBLINEAR package supports:
> >   (1) L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM,
> and logistic regression (LR)
> >   (2) L1-regularized classifiers: L2-loss linear SVM and logistic
> regression (LR)
> > The main features of LIBLINEAR are as follows:
> >   (1) Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
> >   (2) Cross validation for model selection
> >   (3) Probability estimates (logistic regression only)
> >   (4) Weights for unbalanced data
> > All of these features are to be implemented except probability
> estimates and weights for unbalanced data (time permitting, I would like
> to implement those as well).
> > 2 Unified Interfaces
> > The linear SVM classifier based on the Pegasos package on Mahout already
> provides the following functionality
> (http://issues.apache.org/jira/browse/MAHOUT-232):
> >   (1) Sequential binary classification (two-class classification),
> including sequential training and prediction;
> >   (2) Sequential regression;
> >   (3) Parallel and sequential multi-class classification, including
> One-vs.-One and One-vs.-Others schemes.
> > The functionality of the Pegasos package on Mahout and that of LIBLINEAR
> are quite similar. As mentioned above, this section introduces a unified
> interface for linear SVM classifiers on Mahout, which will incorporate both
> Pegasos and LIBLINEAR.
> > The unified interface has two main parts: 1) dataset loader; 2)
> algorithms. I will introduce them separately.
> > 2.1 Data Handler
> > The dataset can be stored on a personal computer or on a Hadoop cluster.
> This framework provides a high-performance Random Loader and Sequential
> Loader for accessing large-scale data.
> > 2.2 Sequential Algorithms
> > Sequential algorithms will include binary classification and regression
> based on Pegasos and LIBLINEAR, with a unified interface.
> > 2.3 Parallel Algorithms
> > It is widely accepted that parallelizing a binary SVM classifier is hard.
> For multi-class classification, however, a coarse-grained scheme (e.g. each
> Mapper or Reducer runs one independent binary SVM classifier) makes a large
> improvement easier to achieve. Cross validation for model selection can
> also take advantage of such coarse-grained parallelism. I will introduce a
> unified interface for all of them.
> > 3 Biography:
> > I am a graduating master's student in Multimedia Information Retrieval
> Systems at the National University of Singapore. My research has involved
> large-scale SVM classifiers.
> > I have worked with Hadoop and MapReduce for the past year, and I have
> dedicated much of my spare time to the sequential SVM (Pegasos) for Mahout
> (http://issues.apache.org/jira/browse/MAHOUT-232). I have taken part in
> setting up and maintaining a Hadoop cluster of around 70 nodes in our
> group.
> > 4 Timeline:
> > Weeks 1-4 (May 24 ~ June 18): Implement binary classifier
> > Weeks 5-7 (June 21 ~ July 12): Implement parallel multi-class
> classification and cross validation for model selection.
> > Week 8 (July 12 ~ July 16): Submit mid-term evaluation
> > Weeks 9-11 (July 16 ~ August 9): Interface refactoring and performance
> tuning
> > Weeks 11-12 (August 9 ~ August 16): Code cleanup, documentation and
> testing.
> > 5 References
> > [1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and
> Chih-Jen Lin. LIBLINEAR: A library for large linear classification. J. Mach.
> Learn. Res., 9:1871-1874, 2008.
> > [2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal
> estimated sub-gradient solver for SVM. In ICML '07: Proceedings of the 24th
> International Conference on Machine Learning, pages 807-814, New York, NY,
> USA, 2007. ACM.
> > [3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang,
> Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu,
> Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10
> algorithms in data mining. Knowl. Inf. Syst., 14(1):1-37, 2007.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
