Re: Options in TrainClassifier.java

Gangadhar Nittala Sun, 19 Sep 2010 20:14:08 -0700

Joe,
I also tried reducing the number of countries in country.txt, but that
didn't help. In my case, I was monitoring the disk space and at no time
did the free space reach 0%, so I am not sure that disk space is the
cause. To remove the dependency on the number of countries, I even tried
using subjects.txt for the classification; that did not help either.
I think this problem is due to the type of data being processed,
but I am not sure what I need to change to get the data processed
successfully.
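
For anyone who wants to repeat the disk-space check while the map task
runs, here is a minimal sketch in plain Java. It is not part of Mahout;
the class name DiskSpaceCheck and its arguments are made up for
illustration, and the directory to watch is an assumption (point it at
wherever Hadoop writes its local intermediate data on your machine).

```java
import java.io.File;

public class DiskSpaceCheck {

    // Percentage of free space, given usable and total byte counts.
    // Returns 0.0 when the total is 0 (for example, the path does not exist).
    public static double freePercent(long usableBytes, long totalBytes) {
        if (totalBytes <= 0) {
            return 0.0;
        }
        return 100.0 * usableBytes / totalBytes;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical arguments: directory to watch, number of samples,
        // and seconds between samples. Defaults take a single reading.
        File dir = new File(args.length > 0 ? args[0] : ".");
        int samples = args.length > 1 ? Integer.parseInt(args[1]) : 1;
        long intervalSeconds = args.length > 2 ? Long.parseLong(args[2]) : 10L;

        for (int i = 0; i < samples; i++) {
            long usable = dir.getUsableSpace();
            long total = dir.getTotalSpace();
            System.out.printf("%s: %d MB free (%.1f%%)%n",
                    dir.getAbsolutePath(),
                    usable / (1024L * 1024L),
                    freePercent(usable, total));
            // Sleep only between samples, not after the last one.
            if (i < samples - 1) {
                Thread.sleep(intervalSeconds * 1000L);
            }
        }
    }
}
```

Running it with a directory, a sample count, and an interval (for
example: java DiskSpaceCheck /tmp 360 10) prints one reading every 10
seconds, which makes it easy to see whether free space really heads
toward zero during the map phase.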


I guess the experienced folks on the Mahout list will be able to tell us
what is missing.

Thank you
Gangadhar

On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <[email protected]> wrote:
> Gangadhar,
>
> I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to have just
> 1 entry (spain), used WikipediaDatasetCreatorDriver to create the
> wikipediainput data set, and then ran TrainClassifier, and it worked. When I
> ran TestClassifier as below, I got blank results in the output.
>
> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel -d
>  wikipediainput  -ng 3 -type bayes -source hdfs
>
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :          0         ?%
> Incorrectly Classified Instances        :          0         ?%
> Total Classified Instances              :          0
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------
> a     <--Classified as
> 0     |  0     a     = spain
> Default Category: unknown: 1
>
> I am not sure if I am doing something wrong. I have to figure out why my
> output is so blank.
> I'll document these steps and mention country.txt in the wiki.
>
> Question to all:
> Should we have 2 country.txt files?
>
>   1. country_full_list.txt - this is the existing list
>   2. country_sample_list.txt - a list with 2 or 3 countries
>
> To get a flavor of the wikipedia bayes example, we can use
> country_sample_list.txt. When new people just want to try out the example,
> they can reference this txt file as a parameter.
> To run the example on a robust, scalable infrastructure, we could use
> country_full_list.txt.
> Any thoughts?
>
> regards
> Joe.
>
> On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <[email protected]> wrote:
>
>> Gangadhar,
>>
>> After running TrainClassifier again, the map task just failed with the same
>> exception and I am pretty sure it is an issue with disk space.
>> As the map was progressing, I was monitoring my free disk space dropping
>> from 81 GB. It came down to 0 when the map task was almost 66% done, and
>> then the exception happened. After the exception, another map task resumed
>> at 33% and I had close to 15 GB of free space (I guess the first map task
>> freed up some space), and I am sure it would drop down to zero again and
>> throw the same exception.
>> I am going to modify country.txt to have just 1 country, recreate
>> wikipediainput, and run TrainClassifier. Will let you know how it goes.
>>
>> Do we have any benchmarks / system requirements for running this example?
>> Has anyone else had success running this example? I would appreciate
>> your inputs / thoughts.
>>
>> Should we look at tuning the code to handle these situations? Any quick
>> suggestions on where to start looking?
>>
>> regards,
>> Joe.
>>
>>
>>
>>
>
