Hi guys,

I'm sorry that I haven't checked this issue for a long time. I have just set up my old laptop to work on both the liblinear and Pegasos patches for Mahout.
I would be very glad if Viktor Gal could help get these two packages into Mahout.

Cheers,
Zhendong

On Sat, Feb 26, 2011 at 1:56 PM, Robin Anil (JIRA) <[email protected]> wrote:

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999723#comment-12999723]
>
> Robin Anil commented on MAHOUT-232:
> -----------------------------------
>
> Let me list some tasks that may be necessary to whip the patch into a
> committable state.
>
> There were some spelling mistakes in the function/method names; maybe you
> can fix any apparent ones when you are reading through the code. Try to make
> the code run within a Hadoop job; I am sure some fixes are necessary
> to do that. Remove any hardcoded paths or ports that you see. Finally, get an
> end-to-end example running, maybe using 20newsgroups.
>
> > Implementation of sequential SVM solver based on Pegasos
> > --------------------------------------------------------
> >
> > Key: MAHOUT-232
> > URL: https://issues.apache.org/jira/browse/MAHOUT-232
> > Project: Mahout
> > Issue Type: New Feature
> > Components: Classification
> > Affects Versions: 0.4
> > Reporter: zhao zhendong
> > Assignee: Ted Dunning
> > Attachments: Mahout-232-0.8.patch, SVMDataset.patch,
> >   SVMonMahout0.5.1.patch, SVMonMahout0.5.patch, SequentialSVM_0.1.patch,
> >   SequentialSVM_0.2.2.patch, SequentialSVM_0.3.patch, SequentialSVM_0.4.patch,
> >   a2a.mvc
> >
> >
> > After discussing with people in this community, I decided to re-implement a
> > sequential SVM solver based on Pegasos for the Mahout platform (Mahout command
> > line style, SparseMatrix, SparseVector, etc.). Eventually, it will
> > support HDFS.
> > Sequential SVM based on Pegasos.
> > Maxim zhao (zhaozhendong at gmail dot com)
> > -------------------------------------------------------------------------------------------
> > Currently, this package provides (Features):
> > -------------------------------------------------------------------------------------------
> > 1. Sequential SVM linear solver, including training and testing.
> > 2. Support for both general file systems and HDFS.
> > 3. Support for large-scale data set training.
> >    Because Pegasos only needs to sample a limited number of examples, this package
> >    can pre-fetch a fixed number of samples (e.g. the maximum iteration count) into memory.
> >    For example, if the data set has 100,000,000 samples and the default maximum
> >    iteration count is 10,000, the package randomly loads only 10,000 samples into memory.
> > 4. Sequential data set testing, so the package supports large-scale
> >    data sets for both training and testing.
> > 5. Parallel classification (testing phase only) based on the
> >    Map-Reduce framework.
> > 6. Multi-classification based on the Map-Reduce framework (fully
> >    parallelized version).
> > 7. Regression.
> > -------------------------------------------------------------------------------------------
> > TODO:
> > -------------------------------------------------------------------------------------------
> > 1. Multi-classification probability prediction
> > 2. Performance testing
> > -------------------------------------------------------------------------------------------
> > Usage:
> > -------------------------------------------------------------------------------------------
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Classification:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > @@ Training: @@
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > SVMPegasosTraining.java
> > The default argument is:
> > -tr ../examples/src/test/resources/svmdataset/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model
> > ~~~~~~~~~~~~~~~~~~~~~~
> > @ For the case where the training data set is on HDFS: @
> > ~~~~~~~~~~~~~~~~~~~~~~
> > 1. Make sure that your training data set has been uploaded to HDFS:
> >    hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset
> > 2. Revise the arguments:
> >    -tr /user/hadoop/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model -hdfs hdfs://localhost:12009
> > ~~~~~~~~~~~~~~~~~~~~~~
> > @ Multi-class Training [Based on MapReduce Framework]: @
> > ~~~~~~~~~~~~~~~~~~~~~~
> > bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassifierTrainDriver -if /user/maximzhao/dataset/protein -of /user/maximzhao/protein -m /user/maximzhao/proteinmodel -s 1000000 -c 3 -nor 3 -ms 923179 -mhs -Xmx1000M -ttt 1080
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > @@ Testing: @@
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > SVMPegasosTesting.java
> > I have hard-coded the arguments in this file; if you want to customize the
> > arguments yourself, please uncomment the first line in the main function.
> > The default argument is:
> > -te ../examples/src/test/resources/svmdataset/test.dat -m ../examples/src/test/resources/svmdataset/SVM.model
> > ~~~~~~~~~~~~~~~~~~~~~~
> > @ Parallel Testing (Classification): @
> > ~~~~~~~~~~~~~~~~~~~~~~
> > ParallelClassifierDriver.java
> > bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelClassifierDriver -if /user/maximzhao/dataset/rcv1_test.binary -of /user/maximzhao/rcv.result -m /user/maximzhao/rcv1.model -nor 1 -ms 241572968 -mhs -Xmx500M -ttt 1080
> > ~~~~~~~~~~~~~~~~~~~~~~
> > @ Parallel multi-classification: @
> > ~~~~~~~~~~~~~~~~~~~~~~
> > bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassPredictionDriver -if /user/maximzhao/dataset/protein.t -of /user/maximzhao/proteinpredictionResult -m /user/maximzhao/proteinmodel -c 3 -nor 1 -ms 2226917 -mhs -Xmx1000M -ttt 1080
> > Note: the parameter -ms 241572968 is obtained from the equation
> > ms = input file size / number of mappers.
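A side note on the -ms value in the note just above: the arithmetic can be sketched as below. This is only an illustration, not code from the patch. The class and method names are hypothetical, I am assuming the sizes are in bytes, and I round up so that the input is never cut into more pieces than the requested number of mappers (the note itself only states a plain division).

```java
/** Hypothetical helper, not part of the MAHOUT-232 patch: derives a value for -ms. */
final class SplitSizeSketch {

    /**
     * Returns input size divided by the number of mappers, rounded up.
     * Rounding up is an assumption; the quoted note only says "size / mappers".
     */
    static long maxSplitSize(long inputBytes, int mappers) {
        if (inputBytes < 0 || mappers <= 0) {
            throw new IllegalArgumentException("need inputBytes >= 0 and mappers > 0");
        }
        return (inputBytes + mappers - 1) / mappers; // ceiling division
    }

    public static void main(String[] args) {
        // Assumption for illustration: a 241572968-byte input and one mapper.
        System.out.println("-ms " + maxSplitSize(241572968L, 1));
    }
}
```

With a single mapper the result equals the input size, which would match the -ms value in the rcv1 command if that input is 241572968 bytes; the thread does not state the file size or the mapper count, so treat that as a guess.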
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Regression:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > SVMPegasosTraining.java
> > -tr ../examples/src/test/resources/svmdataset/abalone_scale -m ../examples/src/test/resources/svmdataset/SVMregression.model -s 1
> > -------------------------------------------------------------------------------------------
> > Experimental Results:
> > -------------------------------------------------------------------------------------------
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Classification:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Data set:
> > name           | source  | type           | class | training size | testing size | feature
> > -----------------------------------------------------------------------------------------------
> > rcv1.binary    | [DL04b] | classification | 2     | 20,242        | 677,399      | 47,236
> > covtype.binary | UCI     | classification | 2     | 581,012       |              | 54
> > a9a            | UCI     | classification | 2     | 32,561        | 16,281       | 123
> > w8a            | [JP98a] | classification | 2     | 49,749        | 14,951       | 300
> >
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Data set       | Accuracy | Training Time | Testing Time |
> > rcv1.binary    | 94.67%   | 19 Sec        | 2 min 25 Sec |
> > covtype.binary |          | 19 Sec        |              |
> > a9a            | 84.72%   | 14 Sec        | 12 Sec       |
> > w8a            | 89.8%    | 14 Sec        | 8 Sec        |
> >
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Parallel Classification (Testing)
> > Data set       | Accuracy | Training Time | Testing Time            |
> > rcv1.binary    | 94.98%   | 19 Sec        | 3 min 29 Sec (one node) |
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Parallel Multi-classification Based on MapReduce Framework:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Data set:
> > name    | source   | type           | class | training size | testing size | feature
> > -----------------------------------------------------------------------------------------------
> > poker   | UCI      | classification | 10    | 25,010        | 1,000,000    | 10
> > protein | [JYW02a] | classification | 3     | 17,766        | 6,621        | 357
> >
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Data set | Accuracy vs. (Libsvm with linear kernel) |
> > poker    | 50.14% vs. (49.952%)                     |
> > protein  | 68.14% vs. (64.93%)                      |
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Regression:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Data set:
> > name      | source  | type       | class | training size | testing size | feature
> > -----------------------------------------------------------------------------------------------
> > abalone   | UCI     | regression |       | 4,177         |              | 8
> > triazines | UCI     | regression |       | 186           |              | 60
> > cadata    | StatLib | regression |       | 20,640        |              | 8
> >
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > Data set  | Mean squared error vs. (Libsvm with linear kernel) | Training Time | Test Time |
> > abalone   | 6.01 vs. (5.25)                                    | 13 Sec        |           |
> > triazines | 0.031 vs. (0.0276)                                 | 14 Sec        |           |
> > cadata    | 5.61e+10 vs. (1.40e+10)                            | 20 Sec        |           |
>
> --
> This message is automatically generated by JIRA.
> -
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
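P.S. For readers new to the thread: feature 3 in the quoted description relies on the fact that Pegasos uses one randomly chosen sample per iteration, so T iterations need at most T samples in memory. Below is a minimal sketch of the standard Pegasos update for a linear SVM. It is not code from the attached patches: the class name, the dense arrays (the patch uses Mahout sparse vectors) and the seed parameter are my own assumptions, and the optional projection step of the algorithm is left out.

```java
import java.util.Random;

/** Illustrative Pegasos sketch; hypothetical class, not taken from the MAHOUT-232 patches. */
final class PegasosSketch {

    /**
     * Runs `iterations` single-sample sub-gradient steps.
     * x: feature vectors, y: labels in {-1, +1}, lambda: regularization strength.
     */
    public static double[] train(double[][] x, int[] y, double lambda,
                                 int iterations, long seed) {
        Random rnd = new Random(seed);
        double[] w = new double[x[0].length];
        for (int t = 1; t <= iterations; t++) {
            int i = rnd.nextInt(x.length);        // one random sample per step
            double eta = 1.0 / (lambda * t);      // step size 1 / (lambda * t)
            double margin = y[i] * dot(w, x[i]);  // computed with the old w
            // Shrink w: contribution of the regularization term.
            for (int j = 0; j < w.length; j++) {
                w[j] *= 1.0 - eta * lambda;
            }
            // Add the sample only if it violates the margin (hinge loss active).
            if (margin < 1.0) {
                for (int j = 0; j < w.length; j++) {
                    w[j] += eta * y[i] * x[i][j];
                }
            }
        }
        return w;
    }

    /** Plain dot product of two equal-length vectors. */
    public static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int j = 0; j < a.length; j++) {
            s += a[j] * b[j];
        }
        return s;
    }

    public static void main(String[] args) {
        // Toy, linearly separable data made up for this example.
        double[][] x = {{1, 1}, {2, 2}, {-1, -1}, {-2, -2}};
        int[] y = {1, 1, -1, -1};
        double[] w = train(x, y, 0.1, 100, 42L);
        for (int i = 0; i < x.length; i++) {
            System.out.println(y[i] + " -> score " + dot(w, x[i]));
        }
    }
}
```

Since each step reads exactly one sample, a run with the default 10,000 iterations never needs more than 10,000 samples, which is why pre-fetching that many is enough.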
