Re: [jira] Commented: (MAHOUT-232) Implementation of sequential SVM solver based on Pegasos

zhao zhendong Fri, 25 Feb 2011 22:46:27 -0800

Hi guys,

I'm sorry that I haven't checked this issue for a long time. I have just set
up my old laptop to work on both the liblinear and Pegasos patches for Mahout.


I would be very glad if Viktor Gal could help get these two packages into
Mahout.

Cheers,
Zhendong

On Sat, Feb 26, 2011 at 1:56 PM, Robin Anil (JIRA) wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999723#comment-12999723]
>
> Robin Anil commented on MAHOUT-232:
> -----------------------------------
>
> Let me list some tasks that may be necessary to whip the patch into a
> committable state.
>
> There were some spelling mistakes in the function/method names; maybe you
> can fix any obvious ones when you are reading through the code. Try to make
> the code run within a Hadoop job; I am sure some fixes are necessary
> to do that. Remove any hardcoded paths or ports that you see. Finally, get an
> end-to-end example running, maybe using 20newsgroups.
>
> > Implementation of sequential SVM solver based on Pegasos
> > --------------------------------------------------------
> >
> >                 Key: MAHOUT-232
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-232
> >             Project: Mahout
> >          Issue Type: New Feature
> >          Components: Classification
> >    Affects Versions: 0.4
> >            Reporter: zhao zhendong
> >            Assignee: Ted Dunning
> >         Attachments: Mahout-232-0.8.patch, SVMDataset.patch,
> SVMonMahout0.5.1.patch, SVMonMahout0.5.patch, SequentialSVM_0.1.patch,
> SequentialSVM_0.2.2.patch, SequentialSVM_0.3.patch, SequentialSVM_0.4.patch,
> a2a.mvc
> >
> >
> > After discussing with people in this community, I decided to re-implement a
> > sequential SVM solver based on Pegasos for the Mahout platform (Mahout
> > command-line style, SparseMatrix, SparseVector, etc.). Eventually, it will
> > support HDFS.
> > Sequential SVM based on Pegasos.
> > Maxim zhao (zhaozhendong at gmail dot com)
> >
> > -------------------------------------------------------------------------------------------
> > Currently, this package provides (Features):
> > -------------------------------------------------------------------------------------------
> > 1. Sequential linear SVM solver, including training and testing.
> > 2. Support for both a general file system and HDFS.
> > 3. Support for training on large-scale data sets.
> >    Because Pegasos only needs to sample a limited number of examples, this
> >    package can pre-fetch a fixed number of samples (e.g. the maximum number
> >    of iterations) into memory.
> >    For example, if the data set has 100,000,000 samples and the default
> >    maximum number of iterations is 10,000, the package loads only 10,000
> >    randomly chosen samples into memory.
> > 4. Sequential testing over the data set, so the package supports large-scale
> >    data sets for both training and testing.
> > 5. Parallel classification (testing phase only) based on the Map-Reduce
> >    framework.
> > 6. Multi-class classification based on the Map-Reduce framework (fully
> >    parallelized version).
> > 7. Supporting Regression.
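
The pre-fetching described in item 3 can be pictured with a small sketch. The quoted description does not say how the package chooses its random subset, so the reservoir sampling used here is an assumption, and `prefetch_samples` is a hypothetical helper rather than anything from the patch. It makes one pass over a stream of unknown length and never holds more than `k` lines in memory.

```python
import random
from typing import Iterable, List, Optional


def prefetch_samples(lines: Iterable[str], k: int,
                     seed: Optional[int] = None) -> List[str]:
    """Return a uniform random subset of at most k lines (reservoir sampling).

    Hypothetical helper: an illustration of bounded-memory pre-fetching,
    not the method used by the Mahout patch.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Replace a stored line with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                reservoir[j] = line
    return reservoir


# Usage: keep 1,000 of 50,000 generated rows without materialising them all.
picked = prefetch_samples((f"row{i}" for i in range(50000)), 1000, seed=7)
print(len(picked))
```

With `k` set to the maximum number of iterations, memory use depends on `k` only and not on the size of the training file, which is the property item 3 relies on.
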
> >
> > -------------------------------------------------------------------------------------------
> > TODO:
> > -------------------------------------------------------------------------------------------
> > 1. Multi-classification Probability Prediction
> > 2. Performance Testing
> > -------------------------------------------------------------------------------------------
> > Usage:
> > -------------------------------------------------------------------------------------------
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Classification:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > @@ Training: @@
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > SVMPegasosTraining.java
> > The default arguments are:
> > -tr ../examples/src/test/resources/svmdataset/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model
> > ~~~~~~~~~~~~~~~~~~~~~~
> > @ For the case where the training data set is on HDFS: @
> > ~~~~~~~~~~~~~~~~~~~~~~
> > 1. Make sure that your training data set has been uploaded to HDFS:
> >    hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset
> > 2. Revise the arguments:
> >    -tr /user/hadoop/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model -hdfs hdfs://localhost:12009
> > ~~~~~~~~~~~~~~~~~~~~~~
> > @ Multi-class Training [Based on MapReduce Framework]:@
> > ~~~~~~~~~~~~~~~~~~~~~~
> > bin/hadoop jar mahout-core-0.3-SNAPSHOT.job \
> >   org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassifierTrainDriver \
> >   -if /user/maximzhao/dataset/protein -of /user/maximzhao/protein \
> >   -m /user/maximzhao/proteinmodel -s 1000000 -c 3 -nor 3 -ms 923179 \
> >   -mhs -Xmx1000M -ttt 1080
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > @@ Testing: @@
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > SVMPegasosTesting.java
> > I have hard-coded the arguments in this file. If you want to customize the
> > arguments yourself, please uncomment the first line in the main function.
> > The default arguments are:
> > -te ../examples/src/test/resources/svmdataset/test.dat -m ../examples/src/test/resources/svmdataset/SVM.model
> > ~~~~~~~~~~~~~~~~~~~~~~
> > @ Parallel Testing (Classification): @
> > ~~~~~~~~~~~~~~~~~~~~~~
> > ParallelClassifierDriver.java
> > bin/hadoop jar mahout-core-0.3-SNAPSHOT.job \
> >   org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelClassifierDriver \
> >   -if /user/maximzhao/dataset/rcv1_test.binary -of /user/maximzhao/rcv.result \
> >   -m /user/maximzhao/rcv1.model -nor 1 -ms 241572968 -mhs -Xmx500M -ttt 1080
> > ~~~~~~~~~~~~~~~~~~~~~~
> > @ Parallel multi-classification: @
> > ~~~~~~~~~~~~~~~~~~~~~~
> > bin/hadoop jar mahout-core-0.3-SNAPSHOT.job \
> >   org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassPredictionDriver \
> >   -if /user/maximzhao/dataset/protein.t \
> >   -of /user/maximzhao/proteinpredictionResult \
> >   -m /user/maximzhao/proteinmodel -c 3 -nor 1 -ms 2226917 \
> >   -mhs -Xmx1000M -ttt 1080
> > Note: the parameter -ms 241572968 is obtained from the equation:
> > ms = input file size / number of mappers.
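
The equation in the note is plain arithmetic and can be sketched as below. Two things here are assumptions rather than facts from the patch: that both sizes are counted in bytes, and that the result is rounded up so the input does not spill into one extra split. `split_size` is a hypothetical helper name, and the 1 GB input is an illustrative figure only.

```python
def split_size(input_size_bytes: int, num_mappers: int) -> int:
    """Value for -ms: input size divided by the number of mappers.

    Rounds up (an assumption) so that num_mappers splits cover the input.
    """
    if input_size_bytes < 0:
        raise ValueError("input size must be non-negative")
    if num_mappers <= 0:
        raise ValueError("number of mappers must be positive")
    # Ceiling division using integers only, to avoid float rounding.
    return -(-input_size_bytes // num_mappers)


# Usage: a hypothetical 1,000,000,000-byte input spread over 4 mappers.
print(split_size(1_000_000_000, 4))  # 250000000
```
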
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Regression:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > SVMPegasosTraining.java
> > -tr ../examples/src/test/resources/svmdataset/abalone_scale -m
> ../examples/src/test/resources/svmdataset/SVMregression.model -s 1
> >
> > -------------------------------------------------------------------------------------------
> > Experimental Results:
> > -------------------------------------------------------------------------------------------
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Classification:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Data set:
> > name           | source  | type           | class | training size | testing size | feature
> > -----------------------------------------------------------------------------------------------
> > rcv1.binary    | [DL04b] | classification | 2     | 20,242        | 677,399      | 47,236
> > covtype.binary | UCI     | classification | 2     | 581,012       |              | 54
> > a9a            | UCI     | classification | 2     | 32,561        | 16,281       | 123
> > w8a            | [JP98a] | classification | 2     | 49,749        | 14,951       | 300
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Data set       | Accuracy | Training Time | Testing Time
> > rcv1.binary    | 94.67%   | 19 Sec        | 2 min 25 Sec
> > covtype.binary |          | 19 Sec        |
> > a9a            | 84.72%   | 14 Sec        | 12 Sec
> > w8a            | 89.8%    | 14 Sec        | 8 Sec
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Parallel Classification (Testing)
> > Data set       | Accuracy | Training Time | Testing Time
> > rcv1.binary    | 94.98%   | 19 Sec        | 3 min 29 Sec (one node)
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Parallel Multi-classification Based on MapReduce Framework:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Data set:
> > name    | source   | type           | class | training size | testing size | feature
> > -----------------------------------------------------------------------------------------------
> > poker   | UCI      | classification | 10    | 25,010        | 1,000,000    | 10
> > protein | [JYW02a] | classification | 3     | 17,766        | 6,621        | 357
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Data set | Accuracy vs. (Libsvm with linear kernel)
> > poker    | 50.14% vs. (49.952%)
> > protein  | 68.14% vs. (64.93%)
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Regression:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Data set:
> > name      | source  | type       | class | training size | testing size | feature
> > -----------------------------------------------------------------------------------------------
> > abalone   | UCI     | regression |       | 4,177         |              | 8
> > triazines | UCI     | regression |       | 186           |              | 60
> > cadata    | StatLib | regression |       | 20,640        |              | 8
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Data set  | Mean squared error vs. (Libsvm with linear kernel) | Training Time | Test Time
> > abalone   | 6.01 vs. (5.25)                                    | 13 Sec        |
> > triazines | 0.031 vs. (0.0276)                                 | 14 Sec        |
> > cadata    | 5.61e+10 vs. (1.40e+10)                            | 20 Sec        |
>
> --
> This message is automatically generated by JIRA.
> -
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
>
