Re: Options in TrainClassifier.java

Joe Kumar Wed, 15 Sep 2010 19:21:06 -0700

Hi Gangadhar,

Right. I did the same to run TrainClassifier, but since the default
data source is hdfs, we should not be required to provide this
parameter.
I haven't finished running TrainClassifier yet. I'll do it tonight and
let you know if I run into trouble.
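
To make the intended behavior concrete, here is a minimal standalone sketch of "optional flag with a default". It is not the Mahout source: the class TrainOptionDefaults and its helper methods are hypothetical names used only for illustration, and a plain map stands in for the real option parser. Only the flag names and the values hdfs|hbase and bayes|cbayes come from the command line help.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Mahout.
public class TrainOptionDefaults {

    // Collects "--name value" pairs into a map (simplified stand-in for a real parser).
    static Map<String, String> parse(String[] argv) {
        Map<String, String> opts = new HashMap<>();
        for (int i = 0; i + 1 < argv.length; i += 2) {
            if (argv[i].startsWith("--")) {
                opts.put(argv[i].substring(2), argv[i + 1]);
            }
        }
        return opts;
    }

    // hbase only when explicitly requested; a missing flag falls back to hdfs.
    static String resolveDataSource(Map<String, String> opts) {
        return "hbase".equalsIgnoreCase(opts.get("dataSource")) ? "hbase" : "hdfs";
    }

    // cbayes only when explicitly requested; a missing flag falls back to bayes.
    static String resolveClassifierType(Map<String, String> opts) {
        return "cbayes".equalsIgnoreCase(opts.get("classifierType")) ? "cbayes" : "bayes";
    }

    public static void main(String[] args) {
        Map<String, String> opts = parse(args);
        System.out.println("dataSource=" + resolveDataSource(opts));
        System.out.println("classifierType=" + resolveClassifierType(opts));
    }
}
```

With this shape, leaving out --dataSource and --classifierType would behave the same as passing hdfs and bayes, which is what the help text already promises.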


reg,
Joe.

On Wed, Sep 15, 2010 at 9:41 PM, Gangadhar Nittala
<[email protected]> wrote:

> I ran into the issue that Joe mentioned about the command line
> parameters. I just added the data source to the command line and ran
> it like this:
>  $HADOOP_HOME/bin/hadoop jar
> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3
> --input wikipediainput10 --output wikipediamodel10 --classifierType
> bayes --dataSource hdfs
>
> On a related note, Joe, were you able to run TrainClassifier
> without any errors? When I tried this, the map-reduce job always
> aborted at 99%. I tried the example given in the wiki with
> both subjects and countries. I even reduced the list of countries in
> country.txt, assuming that was causing the issue. No
> matter what, the classifier task fails. The exception in the task
> log is:
>
> 10-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: bufstart
> = 41271492; bufend = 58259002; bufvoid = 99614720
> 2010-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: kvstart
> = 196379; kvend = 130842; length = 327680
> 2010-09-14 08:25:48,136 INFO org.apache.hadoop.mapred.MapTask:
> Finished spill 287
> 2010-09-14 08:25:48,417 INFO org.apache.hadoop.mapred.MapTask:
> Starting flush of map output
> 2010-09-14 08:26:00,386 INFO org.apache.hadoop.mapred.MapTask:
> Finished spill 288
> 2010-09-14 08:26:08,765 WARN org.apache.hadoop.mapred.TaskTracker:
> Error running child
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_201009132133_0002/attempt_201009132133_0002_m_000001_3/output/file.out
>        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
>        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
>        at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1469)
>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> I checked the Hadoop JIRA and this seems to be fixed already:
> https://issues.apache.org/jira/browse/HADOOP-4963
> I am not sure what I am doing wrong. Any suggestions on what I need
> to change to get this fixed would be very helpful. I have been
> struggling with this for a while now.
>
> Thank you
>
> On Wed, Sep 15, 2010 at 1:16 AM, Joe Kumar <[email protected]> wrote:
> > Robin,
> >
> > sure. I'll submit a patch.
> >
> > The command line flag already has the default behavior specified.
> >  --classifierType (-type) classifierType    Type of classifier: bayes|cbayes.
> >                                             Default: bayes
> >
> >  --dataSource (-source) dataSource          Location of model: hdfs|hbase.
> >                                             Default Value: hdfs
> > So there is no change in the flag description.
> >
> > reg,
> > Joe.
> >
> >
> > On Wed, Sep 15, 2010 at 1:10 AM, Robin Anil <[email protected]> wrote:
> >
> >> On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <[email protected]> wrote:
> >>
> >> > Hi all,
> >> >
> >> > As I was going through the wikipedia example, I encountered a situation
> >> > with TrainClassifier wherein some of the options with default values are
> >> > actually mandatory.
> >> > The documentation / command line help says that
> >> >
> >> >   1. The default source (--dataSource) is hdfs, but TrainClassifier
> >> >   has withRequired(true) while building the --dataSource option. We
> >> >   check whether the dataSourceType is hbase and otherwise set it to
> >> >   hdfs, so ideally withRequired should be set to false.
> >> >   2. The default --classifierType is bayes, but withRequired is set to
> >> >   true and we have code like
> >> > if ("bayes".equalsIgnoreCase(classifierType)) {
> >> >   log.info("Training Bayes Classifier");
> >> >   trainNaiveBayes(inputPath, outputPath, params);
> >> > } else if ("cbayes".equalsIgnoreCase(classifierType)) {
> >> >   log.info("Training Complementary Bayes Classifier");
> >> >   // setup the HDFS and copy the files there, then run the trainer
> >> >   trainCNaiveBayes(inputPath, outputPath, params);
> >> > }
> >> >
> >> > which should be changed to
> >> >
> >> > if ("cbayes".equalsIgnoreCase(classifierType)) {
> >> >   log.info("Training Complementary Bayes Classifier");
> >> >   trainCNaiveBayes(inputPath, outputPath, params);
> >> > } else {
> >> >   log.info("Training Bayes Classifier");
> >> >   // setup the HDFS and copy the files there, then run the trainer
> >> >   trainNaiveBayes(inputPath, outputPath, params);
> >> > }
> >> >
> >> > Please let me know if this looks valid and I'll submit a patch for a
> >> > JIRA issue.
> >> >
> >> +1, all valid. Go ahead and fix it, and in the cmdline flags write the
> >> default behavior in the flag description.
> >>
> >>
> >> > reg
> >> > Joe.
> >> >
> >>
> >
>
