Re: Options in TrainClassifier.java

Gangadhar Nittala Wed, 15 Sep 2010 18:42:20 -0700

I ran into the issue that Joe mentioned about the command line
parameters. I worked around it by passing the data source explicitly
on the command line:
 $HADOOP_HOME/bin/hadoop jar \
   $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
   org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3 \
   --input wikipediainput10 --output wikipediamodel10 \
   --classifierType bayes --dataSource hdfs


On a related note, Joe, were you able to run TrainClassifier without
any errors? When I tried this, the map-reduce job always aborted at
99%. I tried the example given in the wiki with both subjects and
countries. I even shortened the list of countries in country.txt,
assuming that was what was causing the issue. No matter what, the
classifier task fails. This is the exception in the task log:

2010-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: bufstart = 41271492; bufend = 58259002; bufvoid = 99614720
2010-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: kvstart = 196379; kvend = 130842; length = 327680
2010-09-14 08:25:48,136 INFO org.apache.hadoop.mapred.MapTask: Finished spill 287
2010-09-14 08:25:48,417 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
2010-09-14 08:26:00,386 INFO org.apache.hadoop.mapred.MapTask: Finished spill 288
2010-09-14 08:26:08,765 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_201009132133_0002/attempt_201009132133_0002_m_000001_3/output/file.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1469)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

I checked the Hadoop JIRA, and this seems to have been fixed already
(https://issues.apache.org/jira/browse/HADOOP-4963), so I am not sure
what I am doing wrong. Any suggestions as to what I need to change to
get this fixed would be very helpful. I have been struggling with this
for a while now.
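
One cause worth ruling out for a "Could not find any valid local
directory" error is that none of the task tracker's local directories
(the mapred.local.dir setting) is writable with enough free space for
the merged map output; the 288 spills in the log suggest a large amount
of intermediate data. The following is only a minimal sketch of such a
check, not part of Hadoop or Mahout: the class name LocalDirCheck, the
method usableDirs, and the 1 GiB threshold are all hypothetical, and
the directory list has to be supplied by hand.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Hadoop or Mahout class.
class LocalDirCheck {

    // Returns the entries of a comma-separated directory list that exist,
    // are writable, and report at least minFreeBytes of usable space.
    static List<String> usableDirs(String commaSeparatedDirs, long minFreeBytes) {
        List<String> usable = new ArrayList<>();
        for (String name : commaSeparatedDirs.split(",")) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty entries such as a trailing comma
            }
            File dir = new File(trimmed);
            if (dir.isDirectory() && dir.canWrite()
                    && dir.getUsableSpace() >= minFreeBytes) {
                usable.add(trimmed);
            }
        }
        return usable;
    }

    public static void main(String[] args) {
        // Pass the value of mapred.local.dir as the first argument;
        // the JVM temp directory is used only as a stand-in default.
        String dirs = args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir");
        long needed = 1L << 30; // assumed threshold of 1 GiB, adjust as needed
        List<String> usable = usableDirs(dirs, needed);
        if (usable.isEmpty()) {
            System.out.println("No local directory has " + needed + " bytes free");
        } else {
            System.out.println("Usable local directories: " + usable);
        }
    }
}
```

If the check comes back empty on the node that ran the failing attempt,
freeing space or pointing mapred.local.dir at a larger volume would be
the first thing to try.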

Thank you

On Wed, Sep 15, 2010 at 1:16 AM, Joe Kumar <[email protected]> wrote:
> Robin,
>
> sure. I'll submit a patch.
>
> The command line flag already has the default behavior specified.
>  --classifierType (-type) classifierType    Type of classifier: bayes|cbayes.
>                                             Default: bayes
>  --dataSource (-source) dataSource          Location of model: hdfs|hbase.
>                                             Default Value: hdfs
> So there is no change in the flag description.
>
> reg,
> Joe.
>
>
> On Wed, Sep 15, 2010 at 1:10 AM, Robin Anil <[email protected]> wrote:
>
>> On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <[email protected]> wrote:
>>
>> > Hi all,
>> >
>> > As I was going through the wikipedia example, I encountered a situation
>> > with TrainClassifier wherein some of the options with default values
>> > are actually mandatory.
>> > The documentation / command line help says that
>> >
>> >   1. the default source (--dataSource) is hdfs, but TrainClassifier
>> >   has withRequired(true) while building the --dataSource option. We
>> >   check whether the dataSourceType is hbase and otherwise set it to
>> >   hdfs, so ideally withRequired should be set to false
>> >   2. the default --classifierType is bayes, but withRequired is set to
>> >   true and we have code like
>> >
>> > if ("bayes".equalsIgnoreCase(classifierType)) {
>> >        log.info("Training Bayes Classifier");
>> >        trainNaiveBayes(inputPath, outputPath, params);
>> >
>> >      } else if ("cbayes".equalsIgnoreCase(classifierType)) {
>> >        log.info("Training Complementary Bayes Classifier");
>> >        // setup the HDFS and copy the files there, then run the trainer
>> >        trainCNaiveBayes(inputPath, outputPath, params);
>> >      }
>> >
>> > which should be changed to
>> >
>> > *if ("cbayes".equalsIgnoreCase(classifierType)) {*
>> >        log.info("Training Complementary Bayes Classifier");
>> >        trainCNaiveBayes(inputPath, outputPath, params);
>> >
>> >      } *else  {*
>> >        log.info("Training  Bayes Classifier");
>> >        // setup the HDFS and copy the files there, then run the trainer
>> >        trainNaiveBayes(inputPath, outputPath, params);
>> >      }
>> >
>> > Please let me know if this looks valid and I'll submit a patch for a JIRA
>> > issue.
>> >
>> +1 all valid. Go ahead and fix it, and in the cmdline flags write the
>> default behavior in the flag description
>>
>>
>> > reg
>> > Joe.
>> >
>>
>
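
The fallback behavior discussed in the quoted thread can be sketched
independently of the option-parsing library. This is only an
illustration under stated assumptions: the class TrainOptionsSketch
and its resolve method are hypothetical and are not TrainClassifier
code; only the flag names and the documented defaults (bayes, hdfs)
come from the thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not the actual TrainClassifier code.
class TrainOptionsSketch {

    // Parses "--name value" pairs and applies the documented defaults,
    // so neither --classifierType nor --dataSource has to be supplied.
    static Map<String, String> resolve(String[] args) {
        Map<String, String> given = new HashMap<>();
        for (int i = 0; i + 1 < args.length; i += 2) {
            if (args[i].startsWith("--")) {
                given.put(args[i].substring(2), args[i + 1]);
            }
        }

        Map<String, String> resolved = new HashMap<>(given);
        // Only the non-default value is tested; anything else, including
        // a missing flag, falls back to the default.
        resolved.put("dataSource",
                "hbase".equalsIgnoreCase(given.get("dataSource")) ? "hbase" : "hdfs");
        resolved.put("classifierType",
                "cbayes".equalsIgnoreCase(given.get("classifierType")) ? "cbayes" : "bayes");
        return resolved;
    }

    public static void main(String[] args) {
        System.out.println(resolve(args));
    }
}
```

Testing for the non-default value first mirrors the proposed
if ("cbayes"...) / else ordering, so an omitted flag takes the default
branch instead of falling through both conditions.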
