I don't know if it's related, but I remember hitting a similar exception a year ago while working on the Random Forests implementation. In my case it was caused by SequenceFile.Sorter.merge(). I ended up writing my own merge function because I didn't actually need the output sorted.
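For reference, a sort-free merge of that kind can be sketched roughly as below. This is only an illustration using the Hadoop 0.20-era SequenceFile API, not the actual Random Forests code; the class name and paths are made up, and it needs a Hadoop installation on the classpath to compile.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * Sketch: concatenate several SequenceFiles into one without sorting,
 * avoiding SequenceFile.Sorter.merge(). Illustrative only.
 */
public class UnsortedMerge {

  public static void merge(Configuration conf, Path[] inputs, Path output)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = null;
    try {
      for (Path in : inputs) {
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
        try {
          Writable key = (Writable)
              ReflectionUtils.newInstance(reader.getKeyClass(), conf);
          Writable value = (Writable)
              ReflectionUtils.newInstance(reader.getValueClass(), conf);
          if (writer == null) {
            // create the output with the same key/value classes as the inputs
            writer = SequenceFile.createWriter(fs, conf, output,
                reader.getKeyClass(), reader.getValueClass());
          }
          // copy records in input order; no sort, so no Sorter.merge()
          while (reader.next(key, value)) {
            writer.append(key, value);
          }
        } finally {
          reader.close();
        }
      }
    } finally {
      if (writer != null) {
        writer.close();
      }
    }
  }
}
```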
On Mon, Sep 20, 2010 at 6:14 AM, Joe Kumar <[email protected]> wrote:
> Gangadhar,
>
> Just to eliminate the usual suspects: I am using Mac OS X 10.5.8, Mahout 0.4
> (revision 986659), Hadoop 0.20.2, 2 GB of memory for Hadoop, and 80 GB of
> free space. These are the commands that I executed.
>
> I had issues with my namenode, so I reformatted it with hadoop namenode
> -format.
> $MAHOUT_HOME/examples/src/test/resources/country.txt had just one entry
> (spain). I haven't tried with multiple entries.
>
> $> hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
>      org.apache.mahout.classifier.bayes.WikipediaXmlSplitter \
>      -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles10.xml \
>      -o wikipedia/chunks -c 64
>
> $> hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
>      org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver \
>      -i wikipedia/chunks -o wikipediainput \
>      -c $MAHOUT_HOME/examples/src/test/resources/country.txt
>
> $> hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
>      org.apache.mahout.classifier.bayes.TrainClassifier \
>      -i wikipediainput -o wikipediamodel -type bayes -source hdfs
>
> $> hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
>      org.apache.mahout.classifier.bayes.TestClassifier \
>      -m wikipediamodel -d wikipediainput -ng 3 -type bayes -source hdfs
>
> Please try the above and let me know; we'll try to find out what is going
> wrong.
> Reg,
> Joe.
>
> On Sun, Sep 19, 2010 at 11:13 PM, Gangadhar Nittala <[email protected]>
> wrote:
>
>> Joe,
>> I also tried reducing the number of countries in country.txt.
>> That didn't help. And in my case, I was monitoring the disk space and
>> at no time did it reach 0%, so I am not sure that is the cause. To
>> remove the dependency on the number of countries, I even tried
>> subjects.txt as the classification - that also did not help.
>> I think this problem is due to the type of the data being processed,
>> but what I am not sure of is what I need to change to get the data
>> processed successfully.
>>
>> The experienced folks on Mahout will be able to tell us what is
>> missing, I guess.
>>
>> Thank you
>> Gangadhar
>>
>> On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <[email protected]> wrote:
>> > Gangadhar,
>> >
>> > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to
>> > have just one entry (spain), used WikipediaDatasetCreatorDriver to
>> > create the wikipediainput data set, and then ran TrainClassifier,
>> > and it worked. When I ran TestClassifier as below, I got blank
>> > results in the output.
>> >
>> > hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
>> >      org.apache.mahout.classifier.bayes.TestClassifier \
>> >      -m wikipediamodel -d wikipediainput -ng 3 -type bayes -source hdfs
>> >
>> > Summary
>> > -------------------------------------------------------
>> > Correctly Classified Instances   : 0    ?%
>> > Incorrectly Classified Instances : 0    ?%
>> > Total Classified Instances       : 0
>> >
>> > =======================================================
>> > Confusion Matrix
>> > -------------------------------------------------------
>> > a    <--Classified as
>> > 0    | 0    a = spain
>> > Default Category: unknown: 1
>> >
>> > I am not sure if I am doing something wrong; I have to figure out
>> > why my output is so blank.
>> > I'll document these steps and mention country.txt in the wiki.
>> >
>> > A question for everyone:
>> > Should we have two country.txt files?
>> >
>> > 1. country_full_list.txt - the existing list
>> > 2. country_sample_list.txt - a list with 2 or 3 countries
>> >
>> > To get a flavor of the wikipedia bayes example, we could use
>> > country_sample_list.txt. When new people want to just try out the
>> > example, they can reference this file as a parameter.
>> > To run the example on a robust, scalable infrastructure, we could
>> > use country_full_list.txt.
>> > Any thoughts?
>> >
>> > regards
>> > Joe.
>> >
>> > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <[email protected]> wrote:
>> >
>> >> Gangadhar,
>> >>
>> >> After running TrainClassifier again, the map task failed with the
>> >> same exception, and I am pretty sure it is an issue with disk
>> >> space.
>> >> As the map was progressing, I watched my free disk space drop from
>> >> 81 GB. It came down to 0 when the map task was almost 66% done, and
>> >> then the exception happened. After the exception, another map task
>> >> resumed at 33% and I got close to 15 GB of free space back (I guess
>> >> the first map task freed up some space), and I am sure it would
>> >> drop to zero again and throw the same exception.
>> >> I am going to modify country.txt to just one country, recreate
>> >> wikipediainput, and run TrainClassifier. Will let you know how it
>> >> goes.
>> >>
>> >> Do we have any benchmarks / system requirements for running this
>> >> example? Has anyone else had success running it? Would appreciate
>> >> your input / thoughts.
>> >>
>> >> Should we look at tuning the code to handle these situations? Any
>> >> quick suggestions on where to start looking?
>> >>
>> >> regards,
>> >> Joe.
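The disk-space watching described in the thread can also be done programmatically. As a minimal stand-alone sketch (plain Java, nothing Mahout-specific; the path and sampling interval are illustrative), one can poll the usable space of the partition holding the job's temp directory, similar to watching df while the map task runs:

```java
import java.io.File;

/**
 * Poll free disk space on a partition, similar to watching `df` while a
 * job runs. Path and interval are illustrative, not from the thread.
 */
public class DiskSpaceWatch {

  /** Bytes available to this JVM on the partition containing {@code path}. */
  public static long freeBytes(String path) {
    return new File(path).getUsableSpace();
  }

  public static void main(String[] args) throws InterruptedException {
    // e.g. point this at hadoop.tmp.dir to watch intermediate output grow
    String path = args.length > 0 ? args[0] : "/tmp";
    for (int i = 0; i < 3; i++) {
      long gib = freeBytes(path) / (1024L * 1024L * 1024L);
      System.out.println(path + ": " + gib + " GiB free");
      Thread.sleep(1000); // sample once a second
    }
  }
}
```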
