[ 
https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhao zhendong updated MAHOUT-232:
---------------------------------

          Description: 
After discussion with members of this community, I decided to re-implement a 
sequential SVM solver based on Pegasos for the Mahout platform (Mahout command 
line style, SparseMatrix, SparseVector, etc.). Eventually, it will support 
HDFS. 

Sequential SVM based on Pegasos.
Maxim zhao (zhaozhendong at gmail dot com)

-------------------------------------------------------------------------------------------
Currently, this package provides (Features):
-------------------------------------------------------------------------------------------

1. Sequential linear SVM solver, including training and testing.

2. Support for both the general file system and HDFS.

3. Support for training on large-scale data sets.
Because Pegasos only needs to sample a limited number of examples, this package 
can pre-fetch a fixed number of samples (e.g. the maximum number of iterations) 
into memory.
For example, if the data set has 100,000,000 samples and the default maximum 
number of iterations is 10,000, this package randomly loads only 10,000 samples 
into memory.

4. Sequential testing over the data set, so the package supports large-scale 
data sets for both training and testing.

5. Support for parallel classification (testing phase only) based on the 
Map-Reduce framework.

6. Support for multi-class classification based on the Map-Reduce framework 
(fully parallelized version).

7. Support for regression.
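
To make feature 3 concrete, here is a minimal sketch of how pre-fetching and 
Pegasos training could fit together. It is not the code from the attached 
patches: the class PegasosSketch, the Example type, the use of reservoir 
sampling for the pre-fetch, and the dense double[] vectors are all assumptions 
made for illustration (the package itself works with SparseVector). The update 
is the basic Pegasos stochastic sub-gradient step without a bias term or the 
optional projection step.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Illustrative sketch only; names and structure are hypothetical. */
public class PegasosSketch {

    /** A labelled dense example; the label y is +1 or -1. */
    public static final class Example {
        final double[] x;
        final int y;

        public Example(double[] x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Reservoir sampling (an assumed strategy): keeps a uniform random
     * subset of at most k examples while reading the stream once.
     */
    public static List<Example> prefetch(Iterator<Example> stream, int k, Random rnd) {
        List<Example> reservoir = new ArrayList<>();
        long seen = 0;
        while (stream.hasNext()) {
            Example e = stream.next();
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(e);
            } else {
                long j = (long) (rnd.nextDouble() * seen); // uniform in [0, seen)
                if (j < k) {
                    reservoir.set((int) j, e);
                }
            }
        }
        return reservoir;
    }

    /** Basic Pegasos: one randomly chosen example per iteration, step size 1 / (lambda * t). */
    public static double[] train(List<Example> data, int dim, double lambda,
                                 int maxIter, Random rnd) {
        if (data.isEmpty()) {
            throw new IllegalArgumentException("no training examples");
        }
        double[] w = new double[dim];
        for (int t = 1; t <= maxIter; t++) {
            Example e = data.get(rnd.nextInt(data.size()));
            double eta = 1.0 / (lambda * t);
            double margin = e.y * dot(w, e.x);   // computed before w is changed
            double shrink = 1.0 - eta * lambda;  // regularization shrinkage
            for (int d = 0; d < dim; d++) {
                w[d] *= shrink;
                if (margin < 1.0) {              // hinge loss is active
                    w[d] += eta * e.y * e.x[d];
                }
            }
        }
        return w;
    }

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            s += a[i] * b[i];
        }
        return s;
    }

    /** Predicts +1 or -1 from the sign of the decision value. */
    public static int predict(double[] w, double[] x) {
        return dot(w, x) >= 0 ? 1 : -1;
    }
}
```

With k set to the maximum number of iterations, the pre-fetch never holds more 
than that many samples in memory, however large the input is.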

-------------------------------------------------------------------------------------------
TODO:
-------------------------------------------------------------------------------------------
1. Multi-classification Probability Prediction
2. Performance Testing

-------------------------------------------------------------------------------------------
Usage:
-------------------------------------------------------------------------------------------
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Classification:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ Training: @@
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SVMPegasosTraining.java
The default arguments are:

-tr ../examples/src/test/resources/svmdataset/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model

~~~~~~~~~~~~~~~~~~~~~~
@ When the training data set is on HDFS: @
~~~~~~~~~~~~~~~~~~~~~~

1. Make sure that your training data set has been uploaded to HDFS:
hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset

2. Revise the arguments:
-tr /user/hadoop/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model -hdfs hdfs://localhost:12009

~~~~~~~~~~~~~~~~~~~~~~
@ Multi-class Training [Based on MapReduce Framework]:@
~~~~~~~~~~~~~~~~~~~~~~
bin/hadoop jar mahout-core-0.3-SNAPSHOT.job \
  org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassifierTrainDriver \
  -if /user/maximzhao/dataset/protein -of /user/maximzhao/protein \
  -m /user/maximzhao/proteinmodel -s 1000000 -c 3 -nor 3 -ms 923179 \
  -mhs -Xmx1000M -ttt 1080

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ Testing: @@
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

SVMPegasosTesting.java
I have hard-coded the arguments in this file. If you want to customize the 
arguments yourself, please uncomment the first line of the main function.
The default arguments are:
-te ../examples/src/test/resources/svmdataset/test.dat -m ../examples/src/test/resources/svmdataset/SVM.model

~~~~~~~~~~~~~~~~~~~~~~
@ Parallel Testing (Classification): @
~~~~~~~~~~~~~~~~~~~~~~
ParallelClassifierDriver.java
bin/hadoop jar mahout-core-0.3-SNAPSHOT.job \
  org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelClassifierDriver \
  -if /user/maximzhao/dataset/rcv1_test.binary -of /user/maximzhao/rcv.result \
  -m /user/maximzhao/rcv1.model -nor 1 -ms 241572968 -mhs -Xmx500M -ttt 1080

~~~~~~~~~~~~~~~~~~~~~~
@ Parallel multi-classification: @
~~~~~~~~~~~~~~~~~~~~~~
bin/hadoop jar mahout-core-0.3-SNAPSHOT.job \
  org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassPredictionDriver \
  -if /user/maximzhao/dataset/protein.t -of /user/maximzhao/proteinpredictionResult \
  -m /user/maximzhao/proteinmodel -c 3 -nor 1 -ms 2226917 -mhs -Xmx1000M -ttt 1080

Note: the parameter -ms 241572968 is obtained from the equation: 
ms = input file size / number of mappers.
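
The -ms equation in the note above can be sketched as a small helper. The class 
and method names are hypothetical, and rounding up (so that the number of 
splits never exceeds the requested number of mappers) is an assumption; the 
note only states the plain division.

```java
/** Illustrative sketch only; not part of the attached patches. */
public class MapperSplitSize {

    /**
     * ms = input file size / number of mappers, rounded up (assumption)
     * so that no trailing bytes fall into an extra split.
     */
    public static long splitSize(long inputBytes, int numMappers) {
        if (inputBytes < 0 || numMappers <= 0) {
            throw new IllegalArgumentException("invalid input size or mapper count");
        }
        return (inputBytes + numMappers - 1) / numMappers;
    }

    public static void main(String[] args) {
        // With a single mapper, ms is simply the input size in bytes.
        System.out.println(splitSize(241572968L, 1));
    }
}
```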

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Regression: 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
SVMPegasosTraining.java
-tr ../examples/src/test/resources/svmdataset/abalone_scale -m ../examples/src/test/resources/svmdataset/SVMregression.model -s 1

-------------------------------------------------------------------------------------------
Experimental Results:
-------------------------------------------------------------------------------------------
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Classification:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Data set:
name            | source  | type           | classes | training size | testing size | features
-----------------------------------------------------------------------------------------------
rcv1.binary     | [DL04b] | classification | 2       | 20,242        | 677,399      | 47,236
covtype.binary  | UCI     | classification | 2       | 581,012       |              | 54
a9a             | UCI     | classification | 2       | 32,561        | 16,281       | 123
w8a             | [JP98a] | classification | 2       | 49,749        | 14,951       | 300

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Data set        | Accuracy | Training Time | Testing Time
rcv1.binary     | 94.67%   | 19 Sec        | 2 min 25 Sec
covtype.binary  |          | 19 Sec        |
a9a             | 84.72%   | 14 Sec        | 12 Sec
w8a             | 89.8%    | 14 Sec        | 8 Sec

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Parallel Classification (Testing)
Data set        | Accuracy | Training Time | Testing Time
rcv1.binary     | 94.98%   | 19 Sec        | 3 min 29 Sec (one node)

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Parallel Multi-classification Based on MapReduce Framework:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Data set:
name     | source   | type           | classes | training size | testing size | features
-----------------------------------------------------------------------------------------------
poker    | UCI      | classification | 10      | 25,010        | 1,000,000    | 10
protein  | [JYW02a] | classification | 3       | 17,766        | 6,621        | 357

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Data set | Accuracy vs. (Libsvm with linear kernel)
poker    | 50.14% vs. (49.952%)
protein  | 68.14% vs. (64.93%)

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Regression:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Data set:
name      | source  | type       | classes | training size | testing size | features
-----------------------------------------------------------------------------------------------
abalone   | UCI     | regression |         | 4,177         |              | 8
triazines | UCI     | regression |         | 186           |              | 60
cadata    | StatLib | regression |         | 20,640        |              | 8
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Data set  | Mean Squared Error vs. (Libsvm with linear kernel) | Training Time | Test Time
abalone   | 6.01 vs. (5.25)                                    | 13 Sec        |
triazines | 0.031 vs. (0.0276)                                 | 14 Sec        |
cadata    | 5.61e+10 vs. (1.40e+10)                            | 20 Sec        |
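
For reference, the mean squared error reported for the regression data sets is 
the average of the squared differences between predicted and true target 
values. The helper below is a minimal sketch with a hypothetical class name; 
the numbers in main are made up for illustration and are not taken from any of 
the data sets above.

```java
/** Illustrative sketch only; not part of the attached patches. */
public class MseSketch {

    /** Mean squared error: (1/n) * sum over i of (predicted[i] - actual[i])^2. */
    public static double meanSquaredError(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = predicted[i] - actual[i];
            sum += diff * diff;
        }
        return sum / actual.length;
    }

    public static void main(String[] args) {
        double[] actual = {1.0, 2.0, 3.0};
        double[] predicted = {1.0, 2.0, 5.0};
        System.out.println(meanSquaredError(actual, predicted));
    }
}
```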

  was:
After discussed with guys in this community, I decided to re-implement a 
Sequential SVM solver based on Pegasos  for Mahout platform (mahout command 
line style,  SparseMatrix and SparseVector etc.) , Eventually, it will support 
HDFS. 

Sequential SVM based on Pegasos.
Maxim zhao (zhaozhendong at gmail dot com)

-------------------------------------------------------------------------------------------
Currently, this package provides (Features):
-------------------------------------------------------------------------------------------

1. Sequential SVM linear solver, include training and testing.

2. Support general file system and HDFS right now.

3. Supporting large-scale data set training.
Because of the Pegasos only need to sample certain samples, this package 
supports to pre-fetch
the certain size (e.g. max iteration) of samples to memory.
For example: if the size of data set has 100,000,000 samples, due to the 
default maximum iteration is 10,000,
as the result, this package only random load 10,000 samples to memory.

4. Sequential Data set testing, then the package can support large-scale data 
set both on training and testing.

5. Supporting parallel classification (only testing phrase) based on Map-Reduce 
framework.

6. Supoorting Multi-classfication based on Map-Reduce framework (whole 
parallelized version).

7. Supporting Regression.

-------------------------------------------------------------------------------------------
TODO:
-------------------------------------------------------------------------------------------
1. Multi-classification Probability Prediction
2. Performance Testing

-------------------------------------------------------------------------------------------
Usage:
-------------------------------------------------------------------------------------------
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Classification:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ Training: @@
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SVMPegasosTraining.java
I have hard coded the arguments in this file, if you want to custom the 
arguments by youself, please uncomment the first line in main function. 
The default argument is:

-tr ../examples/src/test/resources/svmdataset/train.dat -m 
../examples/src/test/resources/svmdataset/SVM.model

~~~~~~~~~~~~~~~~~~~~~~
@ For the case that training data set on HDFS:@
~~~~~~~~~~~~~~~~~~~~~~

1 Assure that your training data set has been submitted to hdfs
hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset

2 revise the argument:
-tr /user/hadoop/train.dat -m 
../examples/src/test/resources/svmdataset/SVM.model -hdfs hdfs://localhost:12009

~~~~~~~~~~~~~~~~~~~~~~
@ Multi-class Training [Based on MapReduce Framework]:@
~~~~~~~~~~~~~~~~~~~~~~
bin/hadoop jar mahout-core-0.3-SNAPSHOT.job 
org.apache.mahout.classifier.svm.ParallelMultiClassifierTrainDriver -if 
/user/maximzhao/dataset/protein -of /user/maximzhao/protein -m 
/user/maximzhao/proteinmodel -s 1000000 -c 3 -nor 3 -ms 923179 -mhs -Xmx1000M 
-ttt 1080

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ Testing: @@
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

SVMPegasosTesting.java
I have hard coded the arguments in this file, if you want to custom the 
arguments by youself, please uncomment the first line in main function.
The default argument is:
-te ../examples/src/test/resources/svmdataset/test.dat -m 
../examples/src/test/resources/svmdataset/SVM.model

~~~~~~~~~~~~~~~~~~~~~~
@ Parallel Testing (Classification): @
~~~~~~~~~~~~~~~~~~~~~~
ParallelClassifierDriver.java
bin/hadoop jar mahout-core-0.3-SNAPSHOT.job 
org.apache.mahout.classifier.svm.ParallelClassifierDriver -if 
/user/maximzhao/dataset/rcv1_test.binary -of /user/maximzhao/rcv.result -m 
/user/maximzhao/rcv1.model -nor 1 -ms 241572968 -mhs -Xmx500M -ttt 1080

~~~~~~~~~~~~~~~~~~~~~~
@ Parallel multi-classification: @
~~~~~~~~~~~~~~~~~~~~~~
bin/hadoop jar mahout-core-0.3-SNAPSHOT.job 
org.apache.mahout.classifier.svm.ParallelMultiClassPredictionDriver -if 
/user/maximzhao/dataset/protein.t -of /user/maximzhao/proteinpredictionResult 
-m /user/maximzhao/proteinmodel -c 3 -nor 1 -ms 2226917 -mhs -Xmx1000M -ttt 1080

Note: the parameter -ms 241572968 is obtained by equation : ms = input files 
size / number of mapper.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Regression: 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
SVMPegasosTraining.java
-tr ../examples/src/test/resources/svmdataset/abalone_scale -m 
../examples/src/test/resources/svmdataset/SVMregression.model -s 1

-------------------------------------------------------------------------------------------
Experimental Results:
-------------------------------------------------------------------------------------------
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Classsification:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Data set:
name              source            type        class   training size   testing 
size    feature
-----------------------------------------------------------------------------------------------
rcv1.binary      [DL04b]        classification  2          20,242         
677,399       47,236
covtype.binary    UCI           classification  2         581,012               
         54
a9a               UCI           classification  2          32,561          
16,281       123
w8a              [JP98a]        classification  2          49,749          
14,951       300

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Data set                 |        Accuracy         |       Training Time      | 
   Testing Time     |
rcv1.binary              |          94.67%         |         19 Sec           | 
    2 min 25 Sec    |
covtype.binary           |                         |         19 Sec           | 
                    |
a9a                      |          84.72%         |         14 Sec           | 
    12 Sec          |
w8a                      |          89.8 %         |         14 Sec           | 
    8  Sec          |

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Parallel Classification (Testing)
Data set                 |        Accuracy         |       Training Time      | 
   Testing Time            |
rcv1.binary              |          94.98%         |         19 Sec           | 
    3 min 29 Sec (one node)|

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Parallel Multi-classification Based on MapReduce Framework:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Data set:
name      |        source           | type      | class | training size | 
testing size  | feature
-----------------------------------------------------------------------------------------------
poker   | UCI   | classification        | 10    | 25,010        | 1,000,000     
| 10
protein  | [JYW02a]     | classification        | 3     | 17,766        | 6,621 
| 357

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Data set                 |        Accuracy  vs. (Libsvm with linear kernel)
poker | 50.14 %  vs. ( 49.952% ) |
protein | 68.14% vs. ( 64.93% ) |

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Regression:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Data set:
name    |          source       |    type |     class   | training size |       
testing size |  feature
-----------------------------------------------------------------------------------------------
abalone |       UCI     | regression            | 4,177         | | 8
triazines |     UCI     | regression            | 186           | | 60
cadata  | StatLib       | regression            | 20,640        | | 8
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Data set                 |        Mean Squared error vs. (Libsvm with linear 
kernel)   |       Training Time      | Test Time |
abalone | 6.01 vs. (5.25) | 13 Sec |
triazines | 0.031  vs. (0.0276) | 14 Sec |
cadata | 5.61 e +10 vs. (1.40 e+10) | 20 Sec |

    Affects Version/s:     (was: 0.3)
                       0.4

I have changed the class directory of parallel algorithms.

> Implementation of sequential SVM solver based on Pegasos
> --------------------------------------------------------
>
>                 Key: MAHOUT-232
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-232
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: zhao zhendong
>             Fix For: 0.3
>
>         Attachments: SequentialSVM_0.1.patch, SequentialSVM_0.2.2.patch, 
> SequentialSVM_0.3.patch, SequentialSVM_0.4.patch
>
