[
https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001577#comment-13001577
]
Ted Dunning commented on MAHOUT-232:
------------------------------------
The ASF is applying. I haven't heard anything from our committers, but I
would imagine that somebody would be interested in mentoring 1-4 students.
More news anon.
> Implementation of sequential SVM solver based on Pegasos
> --------------------------------------------------------
>
> Key: MAHOUT-232
> URL: https://issues.apache.org/jira/browse/MAHOUT-232
> Project: Mahout
> Issue Type: New Feature
> Components: Classification
> Affects Versions: 0.4
> Reporter: zhao zhendong
> Assignee: Ted Dunning
> Attachments: 0001-Rename-DatastoreSequenenceFile-class.patch,
> 0002-Renamed-HADOOP_MODLE_PATH-to-HADOOP_MODEL_PATH.patch,
> 0003-Change-MultiClassifierDrivers-type-to-AbstractJob.patch,
> 0004-A-script-for-svm-classification-on-20news-dataset.patch,
> Mahout-232-0.8.patch, SVMDataset.patch, SVMonMahout0.5.1.patch,
> SVMonMahout0.5.patch, SequentialSVM_0.1.patch, SequentialSVM_0.2.2.patch,
> SequentialSVM_0.3.patch, SequentialSVM_0.4.patch, a2a.mvc
>
>
> After discussed with guys in this community, I decided to re-implement a
> Sequential SVM solver based on Pegasos for Mahout platform (mahout command
> line style, SparseMatrix and SparseVector etc.) , Eventually, it will
> support HDFS.
> Sequential SVM based on Pegasos.
> Maxim zhao (zhaozhendong at gmail dot com)
> -------------------------------------------------------------------------------------------
> Currently, this package provides (Features):
> -------------------------------------------------------------------------------------------
> 1. Sequential SVM linear solver, include training and testing.
> 2. Support general file system and HDFS right now.
> 3. Supporting large-scale data set training.
> Because of the Pegasos only need to sample certain samples, this package
> supports to pre-fetch
> the certain size (e.g. max iteration) of samples to memory.
> For example: if the size of data set has 100,000,000 samples, due to the
> default maximum iteration is 10,000,
> as the result, this package only random load 10,000 samples to memory.
> 4. Sequential Data set testing, then the package can support large-scale data
> set both on training and testing.
> 5. Supporting parallel classification (only testing phrase) based on
> Map-Reduce framework.
> 6. Supoorting Multi-classfication based on Map-Reduce framework (whole
> parallelized version).
> 7. Supporting Regression.
> -------------------------------------------------------------------------------------------
> TODO:
> -------------------------------------------------------------------------------------------
> 1. Multi-classification Probability Prediction
> 2. Performance Testing
> -------------------------------------------------------------------------------------------
> Usage:
> -------------------------------------------------------------------------------------------
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Classification:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ Training: @@
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> SVMPegasosTraining.java
> The default argument is:
> -tr ../examples/src/test/resources/svmdataset/train.dat -m
> ../examples/src/test/resources/svmdataset/SVM.model
> ~~~~~~~~~~~~~~~~~~~~~~
> @ For the case that training data set on HDFS:@
> ~~~~~~~~~~~~~~~~~~~~~~
> 1 Assure that your training data set has been submitted to hdfs
> hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset
> 2 revise the argument:
> -tr /user/hadoop/train.dat -m
> ../examples/src/test/resources/svmdataset/SVM.model -hdfs
> hdfs://localhost:12009
> ~~~~~~~~~~~~~~~~~~~~~~
> @ Multi-class Training [Based on MapReduce Framework]:@
> ~~~~~~~~~~~~~~~~~~~~~~
> bin/hadoop jar mahout-core-0.3-SNAPSHOT.job
> org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassifierTrainDriver
> -if /user/maximzhao/dataset/protein -of /user/maximzhao/protein -m
> /user/maximzhao/proteinmodel -s 1000000 -c 3 -nor 3 -ms 923179 -mhs -Xmx1000M
> -ttt 1080
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ Testing: @@
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> SVMPegasosTesting.java
> I have hard coded the arguments in this file, if you want to custom the
> arguments by youself, please uncomment the first line in main function.
> The default argument is:
> -te ../examples/src/test/resources/svmdataset/test.dat -m
> ../examples/src/test/resources/svmdataset/SVM.model
> ~~~~~~~~~~~~~~~~~~~~~~
> @ Parallel Testing (Classification): @
> ~~~~~~~~~~~~~~~~~~~~~~
> ParallelClassifierDriver.java
> bin/hadoop jar mahout-core-0.3-SNAPSHOT.job
> org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelClassifierDriver
> -if /user/maximzhao/dataset/rcv1_test.binary -of /user/maximzhao/rcv.result
> -m /user/maximzhao/rcv1.model -nor 1 -ms 241572968 -mhs -Xmx500M -ttt 1080
> ~~~~~~~~~~~~~~~~~~~~~~
> @ Parallel multi-classification: @
> ~~~~~~~~~~~~~~~~~~~~~~
> bin/hadoop jar mahout-core-0.3-SNAPSHOT.job
> org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassPredictionDriver
> -if /user/maximzhao/dataset/protein.t -of
> /user/maximzhao/proteinpredictionResult -m /user/maximzhao/proteinmodel -c 3
> -nor 1 -ms 2226917 -mhs -Xmx1000M -ttt 1080
> Note: the parameter -ms 241572968 is obtained by equation : ms = input files
> size / number of mapper.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Regression:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> SVMPegasosTraining.java
> -tr ../examples/src/test/resources/svmdataset/abalone_scale -m
> ../examples/src/test/resources/svmdataset/SVMregression.model -s 1
> -------------------------------------------------------------------------------------------
> Experimental Results:
> -------------------------------------------------------------------------------------------
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Classsification:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set:
> name source type class training size testing
> size feature
> -----------------------------------------------------------------------------------------------
> rcv1.binary [DL04b] classification 2 20,242
> 677,399 47,236
> covtype.binary UCI classification 2 581,012
> 54
> a9a UCI classification 2 32,561
> 16,281 123
> w8a [JP98a] classification 2 49,749
> 14,951 300
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set | Accuracy | Training Time
> | Testing Time |
> rcv1.binary | 94.67% | 19 Sec
> | 2 min 25 Sec |
> covtype.binary | | 19 Sec
> | |
> a9a | 84.72% | 14 Sec
> | 12 Sec |
> w8a | 89.8 % | 14 Sec
> | 8 Sec |
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Parallel Classification (Testing)
> Data set | Accuracy | Training Time
> | Testing Time |
> rcv1.binary | 94.98% | 19 Sec
> | 3 min 29 Sec (one node)|
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Parallel Multi-classification Based on MapReduce Framework:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set:
> name | source | type | class | training size |
> testing size | feature
> -----------------------------------------------------------------------------------------------
> poker | UCI | classification | 10 | 25,010 | 1,000,000
> | 10
> protein | [JYW02a] | classification | 3 | 17,766
> | 6,621 | 357
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set | Accuracy vs. (Libsvm with linear kernel)
> poker | 50.14 % vs. ( 49.952% ) |
> protein | 68.14% vs. ( 64.93% ) |
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Regression:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set:
> name | source | type | class | training size |
> testing size | feature
> -----------------------------------------------------------------------------------------------
> abalone | UCI | regression | 4,177 | | 8
> triazines | UCI | regression | 186 | | 60
> cadata | StatLib | regression | 20,640 | | 8
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set | Mean Squared error vs. (Libsvm with linear
> kernel) | Training Time | Test Time |
> abalone | 6.01 vs. (5.25) | 13 Sec |
> triazines | 0.031 vs. (0.0276) | 14 Sec |
> cadata | 5.61 e +10 vs. (1.40 e+10) | 20 Sec |
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira