Re: Options in TrainClassifier.java

Joe Kumar Thu, 16 Sep 2010 20:34:49 -0700

Gangadhar,

After some system issues, I finally ran the TrainClassifier. About 65% of
the way into the map job, I got the same error that you mentioned.
INFO mapred.JobClient: Task Id : attempt_201009160819_0002_m_000000_0, Status : FAILED
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_201009160819_0002/attempt_201009160819_0002_m_000000_0/output/file.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
...
I haven't analyzed the root cause or a solution yet, but I wanted to confirm
that I am facing the same issue as you.
I'll search and analyze further and post more details.
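
One thing that may be worth ruling out first (an assumption, not a diagnosis): LocalDirAllocator raises this exception when none of the directories listed in the mapred.local.dir property is writable with enough free space for the map output. Below is a minimal sketch that checks a comma-separated list of directories. The class name LocalDirCheck, the method unusableDirs, and the way the required size is passed in are made up for illustration and are not part of Hadoop or Mahout.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop or Mahout.
public class LocalDirCheck {

    // Returns the entries that are not writable directories or that have
    // fewer than requiredBytes of usable space.
    static List<String> unusableDirs(String commaSeparatedDirs, long requiredBytes) {
        List<String> bad = new ArrayList<String>();
        for (String entry : commaSeparatedDirs.split(",")) {
            String path = entry.trim();
            if (path.isEmpty()) {
                continue; // ignore empty entries such as a trailing comma
            }
            File dir = new File(path);
            boolean ok = dir.isDirectory()
                    && dir.canWrite()
                    && dir.getUsableSpace() >= requiredBytes;
            if (!ok) {
                bad.add(path);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: LocalDirCheck <dir1,dir2,...> <requiredBytes>");
            return;
        }
        long requiredBytes = Long.parseLong(args[1]);
        for (String path : unusableDirs(args[0], requiredBytes)) {
            System.out.println("Not usable for " + requiredBytes + " bytes: " + path);
        }
    }
}
```

The first argument would be the value of mapred.local.dir from the cluster configuration, and the second a rough estimate of the merged map output size in bytes. An empty result only rules out this one cause.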


reg,
Joe.

On Wed, Sep 15, 2010 at 10:20 PM, Joe Kumar <[email protected]> wrote:

> Hi Gangadhar,
>
> Right. I did the same to execute the TrainClassifier, but since the
> default datasource is hdfs, we should not be required to provide this
> parameter.
> I haven't finished executing the TrainClassifier yet. I'll do it tonight and
> let you know if I run into trouble.
>
> reg,
> Joe.
>
>
> On Wed, Sep 15, 2010 at 9:41 PM, Gangadhar Nittala <
> [email protected]> wrote:
>
>> I ran into the issue that Joe mentioned about the command line
>> parameters. I just added the datasource to the command line to execute
>> thus
>>  $HADOOP_HOME/bin/hadoop jar
>> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3
>> --input wikipediainput10 --output wikipediamodel10 --classifierType
>> bayes --dataSource hdfs
>>
>> On a related note, Joe, were you able to run the TrainClassifier
>> without any errors? When I tried this, the map-reduce job would always
>> abort at 99%. I tried the example that was given in the wiki with
>> both subjects and countries. I even reduced the list of countries in
>> country.txt, assuming that was what was causing the issue. No
>> matter what, the classifier task fails. Here is the exception in the task
>> log:
>>
>> 2010-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: bufstart
>> = 41271492; bufend = 58259002; bufvoid = 99614720
>> 2010-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: kvstart
>> = 196379; kvend = 130842; length = 327680
>> 2010-09-14 08:25:48,136 INFO org.apache.hadoop.mapred.MapTask:
>> Finished spill 287
>> 2010-09-14 08:25:48,417 INFO org.apache.hadoop.mapred.MapTask:
>> Starting flush of map output
>> 2010-09-14 08:26:00,386 INFO org.apache.hadoop.mapred.MapTask:
>> Finished spill 288
>> 2010-09-14 08:26:08,765 WARN org.apache.hadoop.mapred.TaskTracker:
>> Error running child
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> any valid local directory for
>> taskTracker/jobcache/job_201009132133_0002/attempt_201009132133_0002_m_000001_3/output/file.out
>>        at
>> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
>>        at
>> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
>>        at
>> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
>>        at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1469)
>>        at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
>>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>
>> I checked the hadoop JIRA and this seems to be fixed already
>> https://issues.apache.org/jira/browse/HADOOP-4963. I am not sure what
>> I am doing wrong. Any suggestions to what I need to change to get this
>> fixed will be very helpful. I have been struggling with this for a
>> while now.
>>
>> Thank you
>>
>> On Wed, Sep 15, 2010 at 1:16 AM, Joe Kumar <[email protected]> wrote:
>> > Robin,
>> >
>> > sure. I'll submit a patch.
>> >
>> > The command line flag already has the default behavior specified.
>> >  --classifierType (-type) classifierType    Type of classifier: bayes|cbayes.
>> >                                             Default: bayes
>> >
>> >  --dataSource (-source) dataSource          Location of model: hdfs|hbase.
>> >                                             Default Value: hdfs
>> > So there is no change in the flag description.
>> >
>> > reg,
>> > Joe.
>> >
>> >
>> > On Wed, Sep 15, 2010 at 1:10 AM, Robin Anil <[email protected]> wrote:
>> >
>> >> On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <[email protected]> wrote:
>> >>
>> >> > Hi all,
>> >> >
>> >> > As I was going through wikipedia example, I encountered a situation
>> >> > with TrainClassifier wherein some of the options with default values are
>> >> > actually mandatory.
>> >> > The documentation / command line help says that
>> >> >
>> >> >   1. default source (--datasource) is hdfs but TrainClassifier
>> >> >   has withRequired(true) while building the --datasource option. We are
>> >> >   checking if the dataSourceType is hbase else set it to hdfs. so
>> >> >   ideally withRequired should be set to false
>> >> >   2. default --classifierType is bayes but withRequired is set to true
>> >> >   and we have code like
>> >> >
>> >> > if ("bayes".equalsIgnoreCase(classifierType)) {
>> >> >        log.info("Training Bayes Classifier");
>> >> >        trainNaiveBayes(inputPath, outputPath, params);
>> >> >
>> >> >      } else if ("cbayes".equalsIgnoreCase(classifierType)) {
>> >> >        log.info("Training Complementary Bayes Classifier");
>> >> >        // setup the HDFS and copy the files there, then run the trainer
>> >> >        trainCNaiveBayes(inputPath, outputPath, params);
>> >> >      }
>> >> >
>> >> > which should be changed to
>> >> >
>> >> > *if ("cbayes".equalsIgnoreCase(classifierType)) {*
>> >> >        log.info("Training Complementary Bayes Classifier");
>> >> >        trainCNaiveBayes(inputPath, outputPath, params);
>> >> >
>> >> >      } *else  {*
>> >> >        log.info("Training  Bayes Classifier");
>> >> >        // setup the HDFS and copy the files there, then run the trainer
>> >> >        trainNaiveBayes(inputPath, outputPath, params);
>> >> >      }
>> >> >
>> >> > Please let me know if this looks valid and I'll submit a patch for a
>> >> > JIRA issue.
>> >> >
>> >> +1 all valid. Go ahead and fix it, and in the cmdline flags write the
>> >> default behavior in the flag description
>> >>
>> >>
>> >> > reg
>> >> > Joe.
>> >> >
>> >>
>> >
>>
