Re: Updated Proposal (LIBLINEAR on Mahout) for GSoC 2010

zhao zhendong Fri, 12 Mar 2010 03:58:10 -0800

Sure, I will revise it tonight.

Thanks, Robin.



On Fri, Mar 12, 2010 at 7:22 PM, Robin Anil <robin.a...@gmail.com> wrote:

> Hi Zhao,
>      Some quick feedback.
>
> 1) Can you update the GSoC issue on the classifier with a Nabble link to
> this thread, or a link from any other aggregator?
> 2) I hope you will read the GSoC timelines more carefully: there is a
> mid-term evaluation, an end-term evaluation and some buffer time. You will
> have to change your current timeline to reflect that accurately.
>
>  I will post more queries about the design choice later
>
> Robin
>
> On Fri, Mar 12, 2010 at 4:18 PM, zhao zhendong <zhaozhend...@gmail.com> wrote:
>
> >  Hi all,
> > The updated proposal for GSoC 2010 is as follows; any comments are welcome.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Title/Summary: Linear SVM Package (LIBLINEAR) for Mahout
> > Student: Zhen-Dong Zhao
> > Student e-mail: zha...@comp.nus.edu.sg
> > Student Major: Multimedia Information Retrieval / Computer Science
> > Student Degree: Master
> > Student Graduation: NUS'10
> > Organization: Hadoop
> >
> > 0 Abstract
> >
> > Linear Support Vector Machines (SVMs) are useful in applications with
> > large-scale datasets or high-dimensional features. This proposal will
> > port one of the best-known linear SVM solvers, LIBLINEAR [1], to Mahout,
> > behind the same unified interface as Pegasos [2] on Mahout, another
> > linear SVM solver whose implementation I have almost finished. The two
> > distinct contributions would be: 1) introducing LIBLINEAR to Mahout;
> > 2) unified interfaces for linear SVM classifiers.
> >
> > 1 Motivation
> >
> > As one of the top 10 algorithms in the data mining community [3], the
> > Support Vector Machine is a powerful machine learning tool, widely
> > adopted in data mining, pattern recognition and information retrieval.
> >
> > SVM training is slow, however, especially on large-scale datasets.
> > Several recent papers propose linear-kernel SVM solvers that can handle
> > large-scale learning problems, for instance LIBLINEAR [1] and Pegasos
> > [2]. I have implemented a prototype linear SVM classifier based on
> > Pegasos [2] for Mahout (issue: MAHOUT-232). As the winner of the linear
> > SVM track of the ICML 2008 large-scale learning challenge
> > (http://largescale.first.fraunhofer.de/summary/), LIBLINEAR [1] ought to
> > be incorporated in Mahout too. Currently, the LIBLINEAR package supports:
> >
> >   - L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM,
> >     and logistic regression (LR)
> >   - L1-regularized classifiers: L2-loss linear SVM and logistic
> >     regression (LR)
> >
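For reference, the formulations listed above are usually written as the
following optimization problems. This is only a sketch in notation introduced
here, not in the proposal: training pairs (x_i, y_i), i = 1..l, with labels
y_i in {-1, +1}, a weight vector w, and a penalty parameter C > 0.

```latex
% L2-regularized classifiers
\min_{w}\ \tfrac{1}{2} w^{T} w + C \sum_{i=1}^{l} \max(0,\ 1 - y_i w^{T} x_i)^{2}
  \quad \text{(L2-loss linear SVM)}
\min_{w}\ \tfrac{1}{2} w^{T} w + C \sum_{i=1}^{l} \max(0,\ 1 - y_i w^{T} x_i)
  \quad \text{(L1-loss linear SVM)}
\min_{w}\ \tfrac{1}{2} w^{T} w + C \sum_{i=1}^{l} \log\bigl(1 + e^{-y_i w^{T} x_i}\bigr)
  \quad \text{(logistic regression)}

% L1-regularized classifiers
\min_{w}\ \|w\|_{1} + C \sum_{i=1}^{l} \max(0,\ 1 - y_i w^{T} x_i)^{2}
  \quad \text{(L2-loss linear SVM)}
\min_{w}\ \|w\|_{1} + C \sum_{i=1}^{l} \log\bigl(1 + e^{-y_i w^{T} x_i}\bigr)
  \quad \text{(logistic regression)}
```

The two groups differ only in the regularizer (squared L2 norm versus L1
norm); the loss terms are shared, which is part of why one interface can
plausibly cover all of them.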
> >
> > The main features of LIBLINEAR are the following:
> >
> >   - Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
> >   - Cross validation for model selection
> >   - Probability estimates (logistic regression only)
> >   - Weights for unbalanced data
> >
> > *All of these functionalities are to be implemented except probability
> > estimates and weights for unbalanced data* (if time permits, I would
> > like to add those too).
> >
> > 2 Unified Interfaces
> >
> > The linear SVM classifier based on the Pegasos package on Mahout already
> > provides the following functionality
> > *(http://issues.apache.org/jira/browse/MAHOUT-232)*:
> >
> >   - Sequential binary classification (two-class classification),
> >     including sequential training and prediction;
> >   - Sequential regression;
> >   - Parallel and sequential multi-class classification, including the
> >     One-vs.-One and One-vs.-Others schemes.
> >
> > The functionality of the Pegasos package on Mahout and that of LIBLINEAR
> > are clearly quite similar. As mentioned above, this section introduces a
> > unified interface for linear SVM classifiers on Mahout that will
> > incorporate both Pegasos and LIBLINEAR. The overall picture of the
> > interfaces is illustrated in Figure 1.
> >
> > The unified interface has two main parts: 1) dataset loader;
> > 2) algorithms. I will introduce them separately.
> >
> > *2.1 Data Handler*
> >
> > The dataset can be stored on a personal computer or on a Hadoop cluster.
> > The framework provides a high-performance Random Loader and a Sequential
> > Loader for accessing large-scale data.
> >
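A minimal sketch of what a sequential loader might look like, assuming the
data is in the sparse "label index:value ..." text format that LIBLINEAR
reads. The proposal does not fix a file format, and the names
SequentialLoaderSketch, Example, parseLine and loadAll are hypothetical, not
part of Mahout or MAHOUT-232.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sequential loader; the class and method names are assumptions.
final class SequentialLoaderSketch {

    /** One sparse example: a label plus parallel index/value arrays. */
    static final class Example {
        final double label;
        final int[] indices;
        final double[] values;

        Example(double label, int[] indices, double[] values) {
            this.label = label;
            this.indices = indices;
            this.values = values;
        }
    }

    /** Parses one line such as "+1 1:0.5 3:2" (assumed input format). */
    static Example parseLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        double label = Double.parseDouble(tokens[0]);
        int n = tokens.length - 1;
        int[] indices = new int[n];
        double[] values = new double[n];
        for (int k = 0; k < n; k++) {
            String token = tokens[k + 1];
            int colon = token.indexOf(':');
            indices[k] = Integer.parseInt(token.substring(0, colon));
            values[k] = Double.parseDouble(token.substring(colon + 1));
        }
        return new Example(label, indices, values);
    }

    /** Reads the examples one line at a time, skipping blank lines. */
    static List<Example> loadAll(BufferedReader reader) {
        List<Example> examples = new ArrayList<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                examples.add(parseLine(line));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return examples;
    }

    public static void main(String[] args) {
        String data = "+1 1:0.5 3:2\n-1 2:1.5\n";
        List<Example> examples = loadAll(new BufferedReader(new StringReader(data)));
        System.out.println("loaded " + examples.size() + " examples");
    }
}
```

A real loader for large-scale data would hand each example to the trainer as
it is read rather than collecting everything in a list; the list is only
there to keep the sketch short.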
> >  Figure 1: The framework of linear SVM on Mahout
> >
> > *2.2 Sequential Algorithms*
> >
> > The sequential algorithms will include binary classification and
> > regression based on Pegasos and LIBLINEAR, behind a unified interface.
> >
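A rough sketch of how a sequential binary trainer could sit behind one
shared contract, using the Pegasos update from [2] (without the optional
projection step). The LinearTrainer interface, the PegasosTrainer class and
the parameter names lambda, iterations and seed are assumptions for
illustration; they are not the MAHOUT-232 code or an existing Mahout API.

```java
import java.util.Random;

/** Hypothetical unified contract: labels are +1/-1, result is a weight vector. */
interface LinearTrainer {
    double[] train(double[][] x, int[] y);
}

/** Sketch of the Pegasos stochastic sub-gradient step on dense features. */
final class PegasosTrainer implements LinearTrainer {
    private final double lambda;   // regularization strength
    private final int iterations;  // number of stochastic steps
    private final long seed;       // fixed seed for reproducible sampling

    PegasosTrainer(double lambda, int iterations, long seed) {
        this.lambda = lambda;
        this.iterations = iterations;
        this.seed = seed;
    }

    @Override
    public double[] train(double[][] x, int[] y) {
        double[] w = new double[x[0].length];
        Random random = new Random(seed);
        for (int t = 1; t <= iterations; t++) {
            int i = random.nextInt(x.length);      // pick one example at random
            double eta = 1.0 / (lambda * t);       // step size 1 / (lambda * t)
            double margin = y[i] * dot(w, x[i]);   // margin under the current w
            double shrink = 1.0 - eta * lambda;    // regularization shrinkage
            for (int j = 0; j < w.length; j++) {
                w[j] *= shrink;
            }
            if (margin < 1.0) {                    // hinge loss is active
                for (int j = 0; j < w.length; j++) {
                    w[j] += eta * y[i] * x[i][j];
                }
            }
        }
        return w;
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            sum += a[j] * b[j];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] x = {{2, 0}, {3, 1}, {-2, 0}, {-3, -1}};
        int[] y = {1, 1, -1, -1};
        LinearTrainer trainer = new PegasosTrainer(0.1, 1000, 42L);
        double[] w = trainer.train(x, y);
        for (int i = 0; i < x.length; i++) {
            System.out.println(y[i] + " -> score " + dot(w, x[i]));
        }
    }
}
```

A LIBLINEAR port would then be a second implementation of the same interface,
so callers would not need to know which solver they are using.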
> > *2.3 Parallel Algorithms*
> >
> > It is widely accepted that parallelizing a binary SVM classifier is
> > hard. For multi-class classification, however, a coarse-grained scheme
> > (e.g. each Mapper or Reducer runs one independent binary SVM classifier)
> > can achieve a large improvement much more easily. Cross validation for
> > model selection can also take advantage of such coarse-grained
> > parallelism. I will introduce a unified interface for all of them.
> >
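The coarse-grained scheme can be sketched on a single machine, with a thread
pool standing in for the Mappers or Reducers: each task trains one
independent one-vs-rest binary classifier. This is only an illustration under
that assumption. OneVsRestSketch, BinaryTrainer and signedSum are hypothetical
names, and signedSum is a deliberately trivial placeholder trainer, not
Pegasos or LIBLINEAR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: threads stand in for Hadoop tasks.
final class OneVsRestSketch {

    /** Any binary linear trainer: labels are +1/-1, result is a weight vector. */
    interface BinaryTrainer {
        double[] train(double[][] x, int[] y);
    }

    /** Trains one independent binary model per class, one task per class. */
    static double[][] trainParallel(double[][] x, int[] labels, int numClasses,
                                    BinaryTrainer trainer, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<double[]>> jobs = new ArrayList<>();
            for (int c = 0; c < numClasses; c++) {
                final int positive = c;
                Callable<double[]> job = () -> {
                    // Relabel: the current class is +1, every other class is -1.
                    int[] y = new int[labels.length];
                    for (int i = 0; i < labels.length; i++) {
                        y[i] = labels[i] == positive ? 1 : -1;
                    }
                    return trainer.train(x, y);
                };
                jobs.add(pool.submit(job));
            }
            double[][] models = new double[numClasses][];
            for (int c = 0; c < numClasses; c++) {
                models[c] = jobs.get(c).get();
            }
            return models;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Predicts the class whose binary model gives the highest score. */
    static int predict(double[][] models, double[] x) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < models.length; c++) {
            double score = 0.0;
            for (int j = 0; j < x.length; j++) {
                score += models[c][j] * x[j];
            }
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }

    /** Placeholder trainer (sum of y_i * x_i); a real solver would go here. */
    static double[] signedSum(double[][] x, int[] y) {
        double[] w = new double[x[0].length];
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < w.length; j++) {
                w[j] += y[i] * x[i][j];
            }
        }
        return w;
    }

    public static void main(String[] args) {
        double[][] x = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};
        int[] labels = {0, 1, 2};
        double[][] models = trainParallel(x, labels, 3, OneVsRestSketch::signedSum, 2);
        System.out.println("predicted class: " + predict(models, new double[] {0, 1, 0}));
    }
}
```

The same pattern would apply to cross validation, with one task per fold or
per parameter setting instead of one task per class.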
> > 3 Biography:
> >
> > I am a graduating master's student in Multimedia Information Retrieval
> > System at the National University of Singapore. My research has involved
> > large-scale SVM classifiers.
> >
> > I have worked with Hadoop and MapReduce for about a year, and I have
> > dedicated much of my spare time to the sequential SVM (Pegasos) for
> > Mahout *(http://issues.apache.org/jira/browse/MAHOUT-232)*. I have taken
> > part in setting up and maintaining a Hadoop cluster of around 70 nodes in
> > our group.
> >
> > 4 Timeline:
> >
> > Weeks 1-4: Implement binary classifier
> >
> > Weeks 5-6: Implement parallel multi-class classification and cross
> > validation for model selection
> >
> > Weeks 7-8: Interface refactoring and performance tuning
> >
> > Weeks 9-10: Clean up and prepare for the end of GSoC
> >
> > References
> >
> > [1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and
> > Chih-Jen Lin. LIBLINEAR: A library for large linear classification.
> > J. Mach. Learn. Res., 9:1871–1874, 2008.
> >
> > [2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal
> > estimated sub-gradient solver for SVM. In ICML '07: Proceedings of the
> > 24th International Conference on Machine Learning, pages 807–814, New
> > York, NY, USA, 2007. ACM.
> >
> > [3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang,
> > Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu,
> > Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10
> > algorithms in data mining. Knowl. Inf. Syst., 14(1):1–37, 2007.
> >
> > -------------------------------------------------------------
> >
> > Zhen-Dong Zhao (Maxim)
> >
> > <><<><><><><><><><>><><><><><>>>>>>
> >
> > Department of Computer Science
> > School of Computing
> > National University of Singapore
> >
> > >>>>>>><><><><><><><><<><>><><<<<<<
> >
>



-- 
-------------------------------------------------------------

Zhen-Dong Zhao (Maxim)

<><<><><><><><><><>><><><><><>>>>>>

Department of Computer Science
School of Computing
National University of Singapore

>>>>>>><><><><><><><><<><>><><<<<<<
