Re: Updated Proposal (LIBLINEAR on Mahout) for GSoC 2010

Ted Dunning Fri, 12 Mar 2010 09:21:37 -0800

Also, this proposal should itself be a JIRA ticket with the GSOC tag
applied to it.  That will make it visible to the Apache-wide Summer of Code
administrators.


On Fri, Mar 12, 2010 at 3:57 AM, zhao zhendong <zhaozhend...@gmail.com> wrote:

> Sure, I will revise it tonight.
>
> Thanks, Robin.
>
>
> On Fri, Mar 12, 2010 at 7:22 PM, Robin Anil <robin.a...@gmail.com> wrote:
>
> > Hi Zhao,
> >      Some quick feedback.
> >
> > 1) Can you update the GSoC issue on the classifier with a Nabble link
> > to this thread, or a link from any other aggregator?
> > 2) I hope you will read the GSoC timeline more carefully: there is a
> > mid-term evaluation, an end-term evaluation and some buffer time. You
> > will have to change your current timeline to accurately reflect that.
> >
> > I will post more queries about the design choices later.
> >
> > Robin
> >
> > On Fri, Mar 12, 2010 at 4:18 PM, zhao zhendong <zhaozhend...@gmail.com>
> > wrote:
> >
> > > Hi all,
> > > The updated proposal for GSoC 2010 is as follows; any comments are
> > > welcome.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > Title/Summary: Linear SVM Package (LIBLINEAR) for Mahout
> > > Student: Zhen-Dong Zhao
> > > Student e-mail: zha...@comp.nus.edu.sg
> > > Student Major: Multimedia Information Retrieval / Computer Science
> > > Student Degree: Master
> > > Student Graduation: NUS’10
> > > Organization: Hadoop
> > >
> > > 0 Abstract
> > >
> > > Linear Support Vector Machines (SVMs) are useful in applications with
> > > large-scale datasets or datasets with high-dimensional features. This
> > > proposal will port one of the best-known linear SVM solvers, LIBLINEAR
> > > [1], to Mahout, with the same unified interface as Pegasos [2] on
> > > Mahout, which is another linear SVM solver that I have almost
> > > finished. The two distinct contributions would be: 1) introducing
> > > LIBLINEAR to Mahout; 2) unified interfaces for linear SVM classifiers.
> > >
> > > 1 Motivation
> > >
> > > As one of the top 10 algorithms in the data mining community [3], the
> > > Support Vector Machine is a powerful machine learning tool, widely
> > > adopted in the Data Mining, Pattern Recognition and Information
> > > Retrieval domains.
> > >
> > > SVM training is slow, however, especially on large-scale datasets.
> > > Several recent papers propose SVM solvers with a linear kernel that
> > > can handle large-scale learning problems, for instance LIBLINEAR [1]
> > > and Pegasos [2]. I have implemented a prototype of a linear SVM
> > > classifier based on Pegasos [2] for Mahout (issue: MAHOUT-232).
> > > Nevertheless, as the winner of the linear SVM track of the ICML 2008
> > > large-scale learning challenge
> > > (http://largescale.first.fraunhofer.de/summary/), LIBLINEAR [1] ought
> > > to be incorporated into Mahout too. Currently, the LIBLINEAR package
> > > supports:
> > >
> > >   - L2-regularized classifiers: L2-loss linear SVM, L1-loss linear
> > >     SVM, and logistic regression (LR)
> > >   - L1-regularized classifiers: L2-loss linear SVM and logistic
> > >     regression (LR)
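
As a rough illustration of what these formulations mean (this is not LIBLINEAR's own code), each one pairs a regularizer on the weight vector w with a per-example loss, weighted by a cost parameter C. The sketch below assumes labels y in {-1, +1} and plain Python lists for vectors; all function names are made up for the example.

```python
import math

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def l1_loss(w, x, y):
    """Hinge loss max(0, 1 - y * w.x); 'L1-loss' in the terminology above."""
    return max(0.0, 1.0 - y * dot(w, x))

def l2_loss(w, x, y):
    """Squared hinge loss; 'L2-loss' in the terminology above."""
    return l1_loss(w, x, y) ** 2

def logistic_loss(w, x, y):
    """log(1 + exp(-y * w.x)), written to avoid overflow for large margins."""
    z = y * dot(w, x)
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def l2_regularizer(w):
    return 0.5 * dot(w, w)

def l1_regularizer(w):
    return sum(abs(wi) for wi in w)

def objective(w, data, C, regularizer, loss):
    """regularizer(w) + C * sum of losses over (x, y) pairs in data."""
    return regularizer(w) + C * sum(loss(w, x, y) for x, y in data)
```

For example, `objective(w, data, C, l2_regularizer, l2_loss)` corresponds to the L2-regularized L2-loss SVM, and `objective(w, data, C, l1_regularizer, logistic_loss)` to L1-regularized logistic regression.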
> > >
> > >
> > > The main features of LIBLINEAR are as follows:
> > >
> > >   - Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
> > >   - Cross validation for model selection
> > >   - Probability estimates (logistic regression only)
> > >   - Weights for unbalanced data
> > >
> > > *All of these features are to be implemented except probability
> > > estimates and weights for unbalanced data* (time permitting, I would
> > > like to implement those as well).
> > >
> > > 2 Unified Interfaces
> > >
> > > The linear SVM classifier based on the Pegasos package on Mahout
> > > (http://issues.apache.org/jira/browse/MAHOUT-232) already provides
> > > the following functionality:
> > >
> > >   - Sequential Binary Classification (Two-class Classification),
> > >     including sequential training and prediction;
> > >   - Sequential Regression;
> > >   - Parallel & Sequential Multi-Classification, including One-vs.-One
> > >     and One-vs.-Others schemes.
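
To make the two multi-class schemes concrete, here is a minimal sketch of how a labelled dataset could be split into binary problems. It is only an illustration, not code from MAHOUT-232; the function names and the task layout are assumptions.

```python
from itertools import combinations

def one_vs_others_tasks(labels):
    """One binary problem per class: that class is +1, every other class is -1."""
    classes = sorted(set(labels))
    return [(c, [1 if y == c else -1 for y in labels]) for c in classes]

def one_vs_one_tasks(labels):
    """One binary problem per pair of classes, restricted to their examples.

    Each task is ((a, b), example_indices, binary_labels) with class a as +1.
    """
    classes = sorted(set(labels))
    tasks = []
    for a, b in combinations(classes, 2):
        idx = [i for i, y in enumerate(labels) if y in (a, b)]
        tasks.append(((a, b), idx, [1 if labels[i] == a else -1 for i in idx]))
    return tasks
```

With k classes, One-vs.-Others yields k binary problems over the full dataset, while One-vs.-One yields k(k-1)/2 smaller problems. Either way the problems are independent of each other, which is what makes them easy to distribute.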
> > >
> > > The functionality of the Pegasos package on Mahout and that of
> > > LIBLINEAR are quite similar. As mentioned above, in this section I
> > > will introduce a unified interface for linear SVM classifiers on
> > > Mahout, which will incorporate both Pegasos and LIBLINEAR. The overall
> > > picture of the interfaces is illustrated in Figure 1:
> > >
> > > The unified interface has two main parts: 1) Dataset loader; 2)
> > > Algorithms. I will introduce them separately.
> > >
> > > *2.1 Data Handler*
> > >
> > > The dataset can be stored on a personal computer or on a Hadoop
> > > cluster. This framework provides a high-performance Random Loader and
> > > Sequential Loader for accessing large-scale data.
> > >
> > >  Figure 1: The framework of linear SVM on Mahout
> > >
> > > *2.2 Sequential Algorithms*
> > >
> > > Sequential Algorithms will include binary classification and
> > > regression based on Pegasos and LIBLINEAR, with a unified interface.
> > >
> > > *2.3 Parallel Algorithms*
> > >
> > > It is widely accepted that parallelizing a binary SVM classifier is
> > > hard. For multi-class classification, however, a coarse-grained scheme
> > > (e.g. each Mapper or Reducer runs one independent binary SVM
> > > classifier) makes a large speedup much easier to achieve. Cross
> > > validation for model selection can also take advantage of such
> > > coarse-grained parallelism. I will introduce a unified interface for
> > > all of them.
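
A minimal sketch of the coarse-grained idea, under several assumptions: a thread pool stands in for Hadoop Mappers, each task trains one independent binary classifier with a simplified Pegasos-style sub-gradient update [2] (no projection step, no bias term), and prediction picks the class with the highest score. None of the names below come from Mahout or LIBLINEAR.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def train_pegasos(data, lam=0.1, iterations=2000, seed=0):
    """Simplified Pegasos-style training of a binary linear SVM.

    data: list of (x, y) pairs, x a list of floats, y in {-1, +1}.
    """
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for t in range(1, iterations + 1):
        x, y = rng.choice(data)           # one random example per step
        eta = 1.0 / (lam * t)             # decreasing step size
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        w = [(1.0 - eta * lam) * wi for wi in w]   # shrink (regularizer)
        if margin < 1.0:                  # hinge loss is active
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def train_one_vs_others(examples, labels, workers=4):
    """Train one independent binary model per class, one task per class."""
    def task(c):
        binary = [(x, 1 if y == c else -1) for x, y in zip(examples, labels)]
        return c, train_pegasos(binary)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(task, sorted(set(labels))))

def predict(models, x):
    """Return the class whose model gives x the highest score."""
    return max(models, key=lambda c: sum(wi * xi for wi, xi in zip(models[c], x)))
```

Because the per-class tasks share no state, the same structure would apply to cross validation, with one task per fold or per parameter setting instead of one per class.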
> > >
> > > 3 Biography:
> > >
> > > I am a graduating master's student in Multimedia Information Retrieval
> > > Systems at the National University of Singapore. My research has
> > > involved large-scale SVM classifiers.
> > >
> > > I have worked with Hadoop and MapReduce for the past year, and I have
> > > dedicated much of my spare time to a sequential SVM (Pegasos) on
> > > Mahout (http://issues.apache.org/jira/browse/MAHOUT-232). I have taken
> > > part in setting up and maintaining a Hadoop cluster with around 70
> > > nodes in our group.
> > >
> > > 4 Timeline:
> > >
> > > Weeks 1-4: Implement binary classifier
> > >
> > > Weeks 5-6: Implement parallel multi-class classification and cross
> > > validation for model selection
> > >
> > > Weeks 7-8: Interface refactoring and performance tuning
> > >
> > > Weeks 9-10: Clean up / prepare for the end of GSoC
> > >
> > > References
> > >
> > > [1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and
> > > Chih-Jen Lin. LIBLINEAR: A library for large linear classification.
> > > J. Mach. Learn. Res., 9:1871–1874, 2008.
> > >
> > > [2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos:
> > > Primal estimated sub-gradient solver for SVM. In ICML ’07: Proceedings
> > > of the 24th International Conference on Machine Learning, pages
> > > 807–814, New York, NY, USA, 2007. ACM.
> > >
> > > [3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang
> > > Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu,
> > > Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan
> > > Steinberg. Top 10 algorithms in data mining. Knowl. Inf. Syst.,
> > > 14(1):1–37, 2007.
> > >
> > > -------------------------------------------------------------
> > >
> > > Zhen-Dong Zhao (Maxim)
> > >
> > > <><<><><><><><><><>><><><><><>>>>>>
> > >
> > > Department of Computer Science
> > > School of Computing
> > > National University of Singapore
> > >
> > > >>>>>>><><><><><><><><<><>><><<<<<<
> > >
> >
>
>
>
> --
> -------------------------------------------------------------
>
> Zhen-Dong Zhao (Maxim)
>
> <><<><><><><><><><>><><><><><>>>>>>
>
> Department of Computer Science
> School of Computing
> National University of Singapore
>
> >>>>>>><><><><><><><><<><>><><<<<<<
>
