Hi Gangadhar, right. I did the same to run TrainClassifier, but since the default datasource is hdfs, we should not be required to provide this parameter. I haven't finished running TrainClassifier yet. I'll do it tonight and let you know if I run into trouble.
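To illustrate the point about defaults, here is a minimal sketch in plain Java. It is not the actual TrainClassifier code and does not use its option builder; the class name OptionDefaults and the helpers parse, dataSource and classifierType are hypothetical, made up for this example. It only shows the intended behavior: an omitted --dataSource falls back to hdfs and an omitted --classifierType falls back to bayes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Mahout.
public class OptionDefaults {

    // Collects "--name value" pairs into a map.
    // Assumes args strictly alternate between a flag and its value.
    public static Map<String, String> parse(String[] args) {
        Map<String, String> opts = new HashMap<>();
        for (int i = 0; i + 1 < args.length; i += 2) {
            if (args[i].startsWith("--")) {
                opts.put(args[i].substring(2), args[i + 1]);
            }
        }
        return opts;
    }

    // Anything other than "hbase" (including a missing flag) resolves to "hdfs".
    public static String dataSource(Map<String, String> opts) {
        return "hbase".equalsIgnoreCase(opts.get("dataSource")) ? "hbase" : "hdfs";
    }

    // Anything other than "cbayes" (including a missing flag) resolves to "bayes".
    public static String classifierType(Map<String, String> opts) {
        return "cbayes".equalsIgnoreCase(opts.get("classifierType")) ? "cbayes" : "bayes";
    }

    public static void main(String[] args) {
        // Same flags as the command in the thread, minus the two optional ones.
        String[] sample = {
            "--gramSize", "3",
            "--input", "wikipediainput10",
            "--output", "wikipediamodel10"
        };
        Map<String, String> opts = parse(sample);
        System.out.println("dataSource=" + dataSource(opts)
            + " classifierType=" + classifierType(opts));
    }
}
```

With this shape, neither flag needs to be marked as required, because a missing value is handled the same way as an explicit hdfs or bayes.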
reg,
Joe.

On Wed, Sep 15, 2010 at 9:41 PM, Gangadhar Nittala <[email protected]> wrote:

> I ran into the issue that Joe mentioned about the command line
> parameters. I just added the datasource to the command line to execute
> thus
> $HADOOP_HOME/bin/hadoop jar
> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3
> --input wikipediainput10 --output wikipediamodel10 --classifierType
> bayes --dataSource hdfs
>
> On a related note, Joe, were you able to run the TrainClassifier
> without any errors ? When I tried this, the map-reduce job would abort
> always at 99%. I tried the example that was given in the wiki with
> both subjects and countries. I even reduced the list of countries in
> the country.txt assuming that was what was causing the issue. No
> matter what, the classifier task fails. And the exception in the task
> log :
>
> 10-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: bufstart = 41271492; bufend = 58259002; bufvoid = 99614720
> 2010-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: kvstart = 196379; kvend = 130842; length = 327680
> 2010-09-14 08:25:48,136 INFO org.apache.hadoop.mapred.MapTask: Finished spill 287
> 2010-09-14 08:25:48,417 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
> 2010-09-14 08:26:00,386 INFO org.apache.hadoop.mapred.MapTask: Finished spill 288
> 2010-09-14 08:26:08,765 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_201009132133_0002/attempt_201009132133_0002_m_000001_3/output/file.out
>     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
>     at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1469)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> I checked the hadoop JIRA and this seems to be fixed already
> https://issues.apache.org/jira/browse/HADOOP-4963. I am not sure what
> I am doing wrong. Any suggestions to what I need to change to get this
> fixed will be very helpful. I have been struggling with this for a
> while now.
>
> Thank you
>
> On Wed, Sep 15, 2010 at 1:16 AM, Joe Kumar <[email protected]> wrote:
> > Robin,
> >
> > sure. I'll submit a patch.
> >
> > The command line flag already has the default behavior specified.
> >   --classifierType (-type) classifierType    Type of classifier: bayes|cbayes.
> >                                              Default: bayes
> >   --dataSource (-source) dataSource          Location of model: hdfs|hbase.
> >                                              Default Value: hdfs
> > So there is no change in the flag description.
> >
> > reg,
> > Joe.
> >
> > On Wed, Sep 15, 2010 at 1:10 AM, Robin Anil <[email protected]> wrote:
> >
> >> On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <[email protected]> wrote:
> >>
> >> > Hi all,
> >> >
> >> > As I was going through wikipedia example, I encountered a situation with
> >> > TrainClassifier wherein some of the options with default values are
> >> > actually mandatory.
> >> > The documentation / command line help says that
> >> >
> >> > 1. default source (--datasource) is hdfs but TrainClassifier
> >> >    has withRequired(true) while building the --datasource option. We are
> >> >    checking if the dataSourceType is hbase else set it to hdfs. so
> >> >    ideally withRequired should be set to false
> >> > 2. default --classifierType is bayes but withRequired is set to true and
> >> >    we have code like
> >> >
> >> >    if ("bayes".equalsIgnoreCase(classifierType)) {
> >> >      log.info("Training Bayes Classifier");
> >> >      trainNaiveBayes(inputPath, outputPath, params);
> >> >
> >> >    } else if ("cbayes".equalsIgnoreCase(classifierType)) {
> >> >      log.info("Training Complementary Bayes Classifier");
> >> >      // setup the HDFS and copy the files there, then run the trainer
> >> >      trainCNaiveBayes(inputPath, outputPath, params);
> >> >    }
> >> >
> >> >    which should be changed to
> >> >
> >> >    *if ("cbayes".equalsIgnoreCase(classifierType)) {*
> >> >      log.info("Training Complementary Bayes Classifier");
> >> >      trainCNaiveBayes(inputPath, outputPath, params);
> >> >
> >> >    } *else {*
> >> >      log.info("Training Bayes Classifier");
> >> >      // setup the HDFS and copy the files there, then run the trainer
> >> >      trainNaiveBayes(inputPath, outputPath, params);
> >> >    }
> >> >
> >> > Please let me know if this looks valid and I'll submit a patch for a JIRA
> >> > issue.
> >> >
> >> > +1 all valid. , Go ahead and fix it and in the cmdline flags write the
> >> default behavior in the flag description
> >>
> >> > reg
> >> > Joe.
> >> >
