[ 
https://issues.apache.org/jira/browse/MAHOUT-334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851727#action_12851727
 ] 

zhao zhendong commented on MAHOUT-334:
--------------------------------------


Proposal Title: Linear SVM Package (LIBLINEAR) for Mahout

Student Name: Zhendong Zhao

Student E-mail: zha...@comp.nus.edu.sg

Organization/Project:
Assigned Mentor:

Proposal Abstract:
Linear Support Vector Machines (SVMs) are useful in many applications with 
large-scale datasets or high-dimensional features. This proposal will port 
LIBLINEAR [1], one of the best-known linear SVM solvers, to Mahout and give it 
the same interface as the Pegasos [2] implementation for Mahout, another linear 
SVM solver that I have almost finished (MAHOUT-232). The two distinct 
contributions would be: 1) introducing LIBLINEAR to Mahout; 2) a unified 
interface for linear SVM classifiers on Mahout.

Detailed Description:

1 Motivation
As one of the top 10 algorithms identified by the data mining community [3], 
the Support Vector Machine is a powerful machine learning tool that is widely 
adopted in data mining, pattern recognition, and information retrieval.

SVM training is slow, however, especially on large-scale datasets. Several 
recent papers propose solvers for linear-kernel SVMs that can handle 
large-scale learning problems, for instance LIBLINEAR [1] and Pegasos [2]. I 
have already implemented a prototype linear SVM classifier based on Pegasos [2] 
for Mahout (issue MAHOUT-232, 
http://issues.apache.org/jira/browse/MAHOUT-232). Since LIBLINEAR [1] won the 
linear SVM track of the ICML 2008 large-scale learning challenge 
(http://largescale.first.fraunhofer.de/summary/), it should be incorporated 
into Mahout as well.

2 Functionalities to Be Implemented
Currently, the LIBLINEAR package supports:
- L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM, and 
logistic regression (LR)
- L1-regularized classifiers: L2-loss linear SVM and logistic regression (LR)
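
For reference, the objectives behind these solver names can be written out as follows. This is a summary in the notation of the LIBLINEAR paper [1], with training pairs (x_i, y_i), y_i in {-1, +1}, and a penalty parameter C > 0; it adds nothing to the scope of the proposal.

```latex
% L2-regularized and L1-regularized problems
\min_{w} \; \tfrac{1}{2} w^{T} w + C \sum_{i=1}^{l} \xi(w; x_i, y_i)
\qquad
\min_{w} \; \|w\|_{1} + C \sum_{i=1}^{l} \xi(w; x_i, y_i)

% Loss functions \xi(w; x_i, y_i)
\text{L1-loss SVM:} \quad \max(0,\; 1 - y_i w^{T} x_i)
\qquad
\text{L2-loss SVM:} \quad \max(0,\; 1 - y_i w^{T} x_i)^{2}
\qquad
\text{LR:} \quad \log\!\left(1 + e^{-y_i w^{T} x_i}\right)
```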

The main features of LIBLINEAR are as follows:
- Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
- Cross validation for model selection
- Probability estimates (logistic regression only)
- Weights for unbalanced data

All of these functionalities are to be implemented except probability estimates 
and weights for unbalanced data (I would like to add them if time permits). In 
addition, the Crammer & Singer scheme for multi-class classification will be 
replaced by the One-vs.-One method. The proposal thus covers the L1- and 
L2-regularized/loss linear binary SVM solvers, LR, multi-class classification, 
and cross validation for model selection.

3 Implementation Details
The linear SVM classifier based on Pegasos for Mahout already provides the 
following functionalities (http://issues.apache.org/jira/browse/MAHOUT-232):
- Sequential binary (two-class) classification, including sequential training 
and prediction;
- Sequential regression;
- Parallel and sequential multi-class classification, including the 
One-vs.-One and One-vs.-Others schemes.

The functionalities of the Pegasos package for Mahout and of LIBLINEAR are 
evidently quite similar. As mentioned above, this section introduces a unified 
interface for linear SVM classifiers on Mahout that will incorporate both 
Pegasos and LIBLINEAR. The overall picture of the interface is illustrated in 
Figure 1:

http://sites.google.com/site/zhaozhendong2/home/project/framework-gsoc-small.png
Figure 1: The framework of linear SVM on Mahout

The unified interface has two main parts: 1) Dataset loader, 2) Algorithms. I 
will introduce them separately.

3.1 Data Handler
Because all the packages in Mahout Core are built on the high-performance 
Vector classes (DenseVector or SparseVector), the data handler is based on them 
as well. The dataset (Vectors) can be stored on a personal computer or on a 
Hadoop cluster. The framework provides a high-performance Random Loader and 
Sequential Loader for accessing large-scale data.
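
The difference between the two loaders can be sketched as follows. The interface `SampleLoader` and both class names are hypothetical and only illustrate the access patterns; they hand out sample indices rather than Mahout Vectors to stay self-contained.

```java
import java.util.Random;

/** Hypothetical loader contract; the names are illustrative, not Mahout's API. */
interface SampleLoader {
    /** Returns the index of the next sample to hand to the solver. */
    int nextIndex();
}

/** Visits samples in storage order and wraps around at the end. */
final class SequentialLoader implements SampleLoader {
    private final int size;
    private int cursor = 0;

    SequentialLoader(int size) {
        this.size = size;
    }

    @Override
    public int nextIndex() {
        int index = cursor;
        cursor = (cursor + 1) % size;
        return index;
    }
}

/** Draws sample indices uniformly at random, the access pattern a stochastic solver uses. */
final class RandomLoader implements SampleLoader {
    private final int size;
    private final Random random;

    RandomLoader(int size, long seed) {
        this.size = size;
        this.random = new Random(seed);
    }

    @Override
    public int nextIndex() {
        return random.nextInt(size);
    }
}
```

A stochastic solver such as Pegasos would draw from the random loader, while a solver that sweeps the whole dataset would use the sequential one.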

3.2 Sequential Algorithms
The sequential algorithms will include binary classification and regression 
based on Pegasos and LIBLINEAR, behind a unified interface.
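
A minimal sketch of what such a unified interface could look like, in plain Java. The interface name `LinearClassifierTrainer`, the classes `PegasosTrainer` and `LinearModels`, and the dense `double[]` representation are assumptions made for illustration; they are not taken from the MAHOUT-232 patch or from Mahout's API. The update is the basic single-sample Pegasos step from [2] without the optional projection; a LIBLINEAR-based solver would implement the same interface.

```java
import java.util.Random;

/** Hypothetical unified contract for linear SVM solvers (not Mahout's actual API). */
interface LinearClassifierTrainer {
    /** Trains on dense rows x with labels y in {-1, +1} and returns the weight vector. */
    double[] train(double[][] x, int[] y);
}

/** Pegasos-style stochastic sub-gradient solver for the L2-regularized hinge loss. */
final class PegasosTrainer implements LinearClassifierTrainer {
    private final double lambda;   // regularization strength
    private final int iterations;  // number of stochastic steps
    private final long seed;       // fixed seed keeps runs reproducible

    PegasosTrainer(double lambda, int iterations, long seed) {
        this.lambda = lambda;
        this.iterations = iterations;
        this.seed = seed;
    }

    @Override
    public double[] train(double[][] x, int[] y) {
        Random random = new Random(seed);
        double[] w = new double[x[0].length];
        for (int t = 1; t <= iterations; t++) {
            int i = random.nextInt(x.length);          // pick one sample at random
            double eta = 1.0 / (lambda * t);           // step size 1 / (lambda * t)
            double margin = y[i] * LinearModels.dot(w, x[i]);
            for (int j = 0; j < w.length; j++) {
                w[j] *= 1.0 - eta * lambda;            // shrink from the regularizer
            }
            if (margin < 1.0) {                        // hinge loss is active
                for (int j = 0; j < w.length; j++) {
                    w[j] += eta * y[i] * x[i][j];
                }
            }
        }
        return w;
    }
}

/** Shared helpers that any solver behind the interface can reuse. */
final class LinearModels {
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            sum += a[j] * b[j];
        }
        return sum;
    }

    /** Predicts +1 or -1 from the sign of the linear score. */
    static int predict(double[] w, double[] x) {
        return dot(w, x) >= 0 ? 1 : -1;
    }
}
```

Callers depend only on `LinearClassifierTrainer`, so swapping the solver does not change the training or prediction code around it.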

3.3 Parallel Algorithms
It is widely accepted that parallelizing a binary SVM classifier is hard. For 
multi-class classification, however, a coarse-grained scheme (e.g. each Mapper 
or Reducer runs one independent binary SVM classifier) makes a speedup much 
easier to achieve. Cross validation for model selection can also take advantage 
of such coarse-grained parallelism. I will introduce a unified interface for 
all of them.

3.3.1 Multi-class Classification based on MapReduce Framework
In SVM, multi-class classification can be decomposed into a set of binary 
classifiers that are independent of each other. In this sense, multi-class 
classification can take advantage of the MapReduce framework.

http://sites.google.com/site/zhaozhendong2/home/project/multi-classification-small.jpg
Figure 2. Parallel Multi-class Classifier

This package will include two distinct schemes: (1) One versus One 
(One-vs.-One); (2) One versus Others (One-vs.-Others). I will explain the 
latter scheme because it is a bit easier to understand.
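
For comparison, the One-vs.-One scheme trains one binary classifier per unordered pair of categories, so N categories give N(N-1)/2 sub-problems instead of N. A small sketch of enumerating those pairs (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

final class OneVsOnePairs {
    /** Returns every category pair {a, b} with a < b; each pair is one binary sub-problem. */
    static List<int[]> pairs(int numCategories) {
        List<int[]> result = new ArrayList<>();
        for (int a = 0; a < numCategories; a++) {
            for (int b = a + 1; b < numCategories; b++) {
                result.add(new int[] {a, b});
            }
        }
        return result;
    }
}
```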

Consider first an intuitive example. As Figure 2 shows, the Mappers act as 
emit controllers. Each sample is emitted N times, each time with a different 
category's label, where N is the number of categories in the dataset.

The class label is emitted as the key of the Mappers' output. After sorting, 
all the samples with the same class label are sent to the same Reducer. At this 
point the samples in one Reducer carry either a "+1" or a "-1" label, where 
"+1" denotes a sample that belongs to that Reducer's category and "-1" denotes 
a sample that belongs to any of the other categories.

The Reducer then calls a binary SVM classifier to train a model for this 
category and emits the model as its output.
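
The One-vs.-Others data flow described above can be simulated in plain Java without any Hadoop classes. In this sketch the class `OneVsOthersSketch` and its methods are hypothetical, an in-memory map stands in for the shuffle/sort phase, and a simple perceptron stands in for the binary SVM solver that each Reducer would call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of the One-vs.-Others job; no Hadoop classes are used. */
final class OneVsOthersSketch {

    /** A sample relabeled for one binary sub-problem: label is +1 or -1. */
    static final class Labeled {
        final double[] features;
        final int label;

        Labeled(double[] features, int label) {
            this.features = features;
            this.label = label;
        }
    }

    /** Map phase plus shuffle: emit every sample once per category, keyed by that category. */
    static Map<Integer, List<Labeled>> mapAndShuffle(double[][] x, int[] category, int numCategories) {
        Map<Integer, List<Labeled>> groups = new TreeMap<>();
        for (int i = 0; i < x.length; i++) {
            for (int k = 0; k < numCategories; k++) {
                int label = (category[i] == k) ? 1 : -1;
                groups.computeIfAbsent(k, key -> new ArrayList<>()).add(new Labeled(x[i], label));
            }
        }
        return groups;
    }

    /** Reduce phase: train one binary model (perceptron stand-in for the SVM solver). */
    static double[] reduce(List<Labeled> samples, int epochs) {
        double[] w = new double[samples.get(0).features.length];
        for (int e = 0; e < epochs; e++) {
            for (Labeled s : samples) {
                if (s.label * dot(w, s.features) <= 0) {   // misclassified: update
                    for (int j = 0; j < w.length; j++) {
                        w[j] += s.label * s.features[j];
                    }
                }
            }
        }
        return w;
    }

    /** Runs map, shuffle, and one reduce call per category; returns one model per category. */
    static Map<Integer, double[]> train(double[][] x, int[] category, int numCategories, int epochs) {
        Map<Integer, double[]> models = new TreeMap<>();
        for (Map.Entry<Integer, List<Labeled>> group : mapAndShuffle(x, category, numCategories).entrySet()) {
            models.put(group.getKey(), reduce(group.getValue(), epochs));
        }
        return models;
    }

    /** Predicts the category whose binary model gives the highest score. */
    static int predict(Map<Integer, double[]> models, double[] features) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Integer, double[]> model : models.entrySet()) {
            double score = dot(model.getValue(), features);
            if (score > bestScore) {
                bestScore = score;
                best = model.getKey();
            }
        }
        return best;
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            sum += a[j] * b[j];
        }
        return sum;
    }
}
```

Because each `reduce` call touches only its own group, the N calls are independent and can run on N Reducers; the cost is that the map output is N times the size of the input.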

3.3.2 Parallel Model Selection
Like multi-class classification, SVM model selection consists of training a 
set of independent binary classifiers, one per candidate parameter setting and 
cross-validation fold. Thus, we may leverage the MapReduce framework to 
accelerate model selection.
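
A sketch of this decomposition, under stated assumptions: every (parameter, fold) pair is one independent task, a Java parallel stream stands in for the MapReduce job, and the regularization parameter lambda is the only parameter searched. The class `ModelSelectionSketch` and its methods are hypothetical, and the trainer is a Pegasos-style update that visits samples in order rather than the solver the proposal would actually call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class ModelSelectionSketch {

    /** One independent unit of work: a candidate lambda evaluated on one held-out fold. */
    static final class Task {
        final double lambda;
        final int fold;

        Task(double lambda, int fold) {
            this.lambda = lambda;
            this.fold = fold;
        }
    }

    /** Enumerates the full grid of (lambda, fold) tasks. */
    static List<Task> tasks(double[] lambdas, int folds) {
        List<Task> result = new ArrayList<>();
        for (double lambda : lambdas) {
            for (int f = 0; f < folds; f++) {
                result.add(new Task(lambda, f));
            }
        }
        return result;
    }

    /** Trains on all folds but one and returns the accuracy on the held-out fold. */
    static double evaluate(Task task, double[][] x, int[] y, int folds, int epochs) {
        double[] w = new double[x[0].length];
        int step = 0;
        for (int e = 0; e < epochs; e++) {
            for (int i = 0; i < x.length; i++) {
                if (i % folds == task.fold) continue;        // held out, not used for training
                step++;
                double eta = 1.0 / (task.lambda * step);
                double margin = y[i] * dot(w, x[i]);
                for (int j = 0; j < w.length; j++) w[j] *= 1.0 - eta * task.lambda;
                if (margin < 1.0) {
                    for (int j = 0; j < w.length; j++) w[j] += eta * y[i] * x[i][j];
                }
            }
        }
        int correct = 0;
        int total = 0;
        for (int i = 0; i < x.length; i++) {
            if (i % folds != task.fold) continue;            // score only the held-out fold
            total++;
            int predicted = dot(w, x[i]) >= 0 ? 1 : -1;
            if (predicted == y[i]) correct++;
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    /** Runs all tasks in parallel and averages the fold accuracies per lambda. */
    static Map<Double, Double> crossValidate(double[] lambdas, double[][] x, int[] y, int folds, int epochs) {
        return tasks(lambdas, folds).parallelStream().collect(
                Collectors.groupingBy(t -> t.lambda,
                        Collectors.averagingDouble(t -> evaluate(t, x, y, folds, epochs))));
    }

    /** Returns the lambda with the highest mean accuracy. */
    static double best(Map<Double, Double> meanAccuracy) {
        return meanAccuracy.entrySet().stream().max(Map.Entry.comparingByValue()).get().getKey();
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) sum += a[j] * b[j];
        return sum;
    }
}
```

In a MapReduce version the task would be the key, the held-out accuracy the value, and the per-lambda averaging a final aggregation step.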

4 Biography
I am a graduating master's student in Multimedia Information Retrieval System 
at the National University of Singapore. My research involves large-scale SVM 
classifiers.

I have worked with Hadoop and MapReduce for one year, and I have dedicated much 
of my spare time to the sequential SVM (Pegasos) implementation for Mahout 
(http://issues.apache.org/jira/browse/MAHOUT-232). I have taken part in setting 
up and maintaining a Hadoop cluster of around 75 nodes in our group.

5 Timeline
Weeks 1-4 (May 24 ~ June 18): Implement the binary classifier.

Weeks 5-7 (June 21 ~ July 12): Implement parallel multi-class classification 
and cross validation for model selection.

Week 8 (July 12 ~ July 16): Submission of the mid-term evaluation.

Weeks 9-11 (July 16 ~ August 9): Interface refactoring and performance tuning.

Weeks 11-12 (August 9 ~ August 16): Code cleanup, documentation, and testing.

6 References
[1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen 
Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 
9:1871-1874, 2008.

[2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal 
estimated sub-gradient solver for SVM. In ICML '07: Proceedings of the 24th 
International Conference on Machine Learning, pages 807-814, New York, NY, USA, 
2007. ACM.

[3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, 
Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua 
Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10 algorithms in 
data mining. Knowl. Inf. Syst., 14(1):1-37, 2007. 

Additional Information:
PDF edition of this Proposal:
http://sites.google.com/site/zhaozhendong2/home/project/GSoC2010-SVMonMahout.pdf?attredirects=0&d=1

> Proposal for GSoC2010 (Linear SVM for Mahout)
> ---------------------------------------------
>
>                 Key: MAHOUT-334
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-334
>             Project: Mahout
>          Issue Type: Task
>            Reporter: zhao zhendong
>
> Title/Summary: Linear SVM Package (LIBLINEAR) for Mahout
> Student: Zhen-Dong Zhao
> Student e-mail: zha...@comp.nus.edu.sg
> Student Major: Multimedia Information Retrieval /Computer Science
> Student Degree: Master
> Student Graduation: NUS'10
> Organization: Hadoop
> 0 Abstract
> Linear Support Vector Machines (SVMs) are useful in some applications with 
> large-scale datasets or high-dimensional features. This proposal will port 
> LIBLINEAR [1], one of the best-known linear SVM solvers, to Mahout with the 
> same unified interface as the Pegasos [2] implementation for Mahout, another 
> linear SVM solver that I have almost finished. The two distinct contributions 
> would be: 1) introducing LIBLINEAR to Mahout; 2) a unified interface for 
> linear SVM classifiers.
> 1 Motivation
> As one of the top 10 algorithms identified by the data mining community [3], 
> the Support Vector Machine is a powerful machine learning tool that is widely 
> adopted in data mining, pattern recognition, and information retrieval.
> SVM training is slow, however, especially on large-scale datasets. Several 
> recent papers propose solvers for linear-kernel SVMs that can handle 
> large-scale learning problems, for instance LIBLINEAR [1] and Pegasos [2]. I 
> have implemented a prototype linear SVM classifier based on Pegasos [2] for 
> Mahout (issue MAHOUT-232). Since LIBLINEAR [1] won the linear SVM track of 
> the ICML 2008 large-scale learning challenge 
> (http://largescale.first.fraunhofer.de/summary/), it should be incorporated 
> into Mahout as well. Currently, the LIBLINEAR package supports:
>   (1) L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM, and 
> logistic regression (LR)
>   (2) L1-regularized classifiers: L2-loss linear SVM and logistic regression 
> (LR)
> The main features of LIBLINEAR are as follows:
>   (1) Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
>   (2) Cross validation for model selection
>   (3) Probability estimates (logistic regression only)
>   (4) Weights for unbalanced data
> All of these functionalities are to be implemented except probability 
> estimates and weights for unbalanced data (I would like to add them if time 
> permits).
> 2 Unified Interfaces
> The linear SVM classifier based on Pegasos for Mahout already provides the 
> following functionalities (http://issues.apache.org/jira/browse/MAHOUT-232):
>   (1) Sequential binary (two-class) classification, including sequential 
> training and prediction;
>   (2) Sequential regression;
>   (3) Parallel and sequential multi-class classification, including the 
> One-vs.-One and One-vs.-Others schemes.
> The functionalities of the Pegasos package for Mahout and of LIBLINEAR are 
> evidently quite similar. As mentioned above, this section introduces a 
> unified interface for linear SVM classifiers on Mahout that will incorporate 
> both Pegasos and LIBLINEAR.
> The unified interface has two main parts: 1) Dataset loader; 2) Algorithms. I 
> will introduce them separately.
> 2.1 Data Handler
> The dataset can be stored on a personal computer or on a Hadoop cluster. The 
> framework provides a high-performance Random Loader and Sequential Loader for 
> accessing large-scale data.
> 2.2 Sequential Algorithms
> The sequential algorithms will include binary classification and regression 
> based on Pegasos and LIBLINEAR, behind a unified interface.
> 2.3 Parallel Algorithms
> It is widely accepted that parallelizing a binary SVM classifier is hard. For 
> multi-class classification, however, a coarse-grained scheme (e.g. each 
> Mapper or Reducer runs one independent binary SVM classifier) makes a large 
> speedup much easier to achieve. Cross validation for model selection can also 
> take advantage of such coarse-grained parallelism. I will introduce a unified 
> interface for all of them.
> 3 Biography:
> I am a graduating master's student in Multimedia Information Retrieval System 
> at the National University of Singapore. My research involves large-scale SVM 
> classifiers.
> I have worked with Hadoop and MapReduce for one year, and I have dedicated 
> much of my spare time to the sequential SVM (Pegasos) implementation for 
> Mahout (http://issues.apache.org/jira/browse/MAHOUT-232). I have taken part 
> in setting up and maintaining a Hadoop cluster of around 70 nodes in our group.
> 4 Timeline:
> Weeks 1-4 (May 24 ~ June 18): Implement the binary classifier.
> Weeks 5-7 (June 21 ~ July 12): Implement parallel multi-class classification 
> and cross validation for model selection.
> Week 8 (July 12 ~ July 16): Submission of the mid-term evaluation.
> Weeks 9-11 (July 16 ~ August 9): Interface refactoring and performance 
> tuning.
> Weeks 11-12 (August 9 ~ August 16): Code cleanup, documentation, and testing.
> 5 References
> [1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen 
> Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn. 
> Res., 9:1871-1874, 2008.
> [2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal 
> estimated sub-gradient solver for SVM. In ICML '07: Proceedings of the 24th 
> International Conference on Machine Learning, pages 807-814, New York, NY, 
> USA, 2007. ACM.
> [3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, 
> Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, 
> Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10 
> algorithms in data mining. Knowl. Inf. Syst., 14(1):1-37, 2007.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
