Robin / Gangadhar,
With ngram set to 1 and all the countries in country.txt, the model is
created without any issues.
$MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 -i wikipediainput
-o wikipediamodel -type bayes -source hdfs

Robin,
For the ngram parameter, the default value is documented as 1, but it is
marked as mandatory in TrainClassifier, so I'll modify the code to default
ngram to 1 and make it a non-mandatory param.

That aside, when I try to test the model, the summary is printed as below.
Summary
-------------------------------------------------------
Correctly Classified Instances          :          0         ?%
Incorrectly Classified Instances        :          0         ?%
Total Classified Instances              :          0
Need to figure out the reason..
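I am guessing the '?%' itself is just the percentage being undefined when zero instances are classified. Roughly like this (my sketch of the formatting, not the actual result-analyzer code):

```python
def summary_line(label, count, total):
    """Format one summary row; '?' when the percentage is undefined (total == 0)."""
    pct = "?" if total == 0 else "%.4f" % (100.0 * count / total)
    return "%-40s:%11d%10s%%" % (label, count, pct)

# With zero total instances every row prints '?%', so the real question is
# why no documents reached the classifier at all.
```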

Since TestClassifier has the same params and settings as TrainClassifier,
can I modify it to set default values for ngram, classifierType &
dataSource as well?
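Concretely, since the two drivers take the same options, I was thinking of one shared table of defaults consulted by both. A sketch of the idea (the names and structure are mine, not the current code; the default values are the ones we've been passing on the command line):

```python
# Hypothetical shared defaults for the options common to
# TrainClassifier and TestClassifier.
SHARED_DEFAULTS = {
    "-ng": "1",         # gram size
    "-type": "bayes",   # classifierType
    "-source": "hdfs",  # dataSource
}

def apply_defaults(parsed, defaults=SHARED_DEFAULTS):
    """Fill in any option the user did not supply on the command line."""
    merged = dict(defaults)
    merged.update(parsed)
    return merged
```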

reg,
Joe.

On Mon, Sep 20, 2010 at 1:09 PM, Joe Kumar <[email protected]> wrote:

> Robin,
>
> Thanks for your tip.
> Will try it out and post updates.
>
> reg
> Joe.
>
>
> On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <[email protected]> wrote:
>
>> Hi guys, sorry about not replying. I see two possible problems. First, you
>> need at least 2 countries; otherwise there is no classification. Second,
>> ngram = 3 is a bit too high: with wikipedia this will result in a huge
>> number of features. Why don't you try with 1 and see?
>>
>> Robin
>>
>> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <[email protected]> wrote:
>>
>> > Hi Ted,
>> >
>> > sure. will keep digging..
>> >
>> > About SGD, I don't have an idea of how it works. If there is some
>> > documentation / reference / quick summary to read about it, that would
>> > be great. I just saw one reference at
>> > https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
>> >
>> > I am assuming we should be able to create a model from wikipedia
>> > articles and label the country of a new article. If so, could you
>> > please provide a note on how to do this. We already have the wikipedia
>> > data being extracted for specific countries using
>> > WikipediaDatasetCreatorDriver. How do we go about training the
>> > classifier using SGD?
>> >
>> > thanks for your help,
>> > Joe.
>> >
>> >
>> > On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <[email protected]>
>> > wrote:
>> >
>> > > I am watching these efforts with interest, but have been unable to
>> > > contribute much to the process.  I would encourage Joe and others to
>> > > keep whittling this problem down so that we can understand what is
>> > > causing it.
>> > >
>> > > In the meantime, I think that the SGD classifiers are close to
>> > > production quality.  For problems with less than several million
>> > > training examples, and especially problems with many sparse features,
>> > > I think that these classifiers might be easier to get started with
>> > > than the Naive Bayes classifiers.  To make a virtue of a defect, the
>> > > SGD based classifiers do not use Hadoop for training.  This makes
>> > > deployment of a classification training workflow easier, but limits
>> > > the total size of data that can be handled.
>> > >
>> > > What would you guys need to get started with trying these alternative
>> > > models?
>> > >
>> > > On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
>> > > <[email protected]>wrote:
>> > >
>> > > > Joe,
>> > > > Even I tried reducing the number of countries in country.txt.
>> > > > That didn't help. And in my case, I was monitoring the disk space,
>> > > > and at no time did it reach 0%, so I am not sure that is the cause.
>> > > > To remove the dependency on the number of countries, I even tried
>> > > > with subjects.txt as the classification - that also did not help.
>> > > > I think this problem is due to the type of data being processed,
>> > > > but what I am not sure of is what I need to change to get the data
>> > > > processed successfully.
>> > > >
>> > > > The experienced folks on Mahout will be able to tell us what is
>> > > > missing, I guess.
>> > > >
>> > > > Thank you
>> > > > Gangadhar
>> > > >
>> > > > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <[email protected]>
>> wrote:
>> > > > > Gangadhar,
>> > > > >
>> > > > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to
>> > > > > have just 1 entry (spain), used WikipediaDatasetCreatorDriver to
>> > > > > create the wikipediainput data set, and then ran TrainClassifier,
>> > > > > which worked. When I ran TestClassifier as below, I got blank
>> > > > > results in the output.
>> > > > >
>> > > > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>> > > > > org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel
>> > > > > -d wikipediainput -ng 3 -type bayes -source hdfs
>> > > > >
>> > > > > Summary
>> > > > > -------------------------------------------------------
>> > > > > Correctly Classified Instances          :          0         ?%
>> > > > > Incorrectly Classified Instances        :          0         ?%
>> > > > > Total Classified Instances              :          0
>> > > > >
>> > > > > =======================================================
>> > > > > Confusion Matrix
>> > > > > -------------------------------------------------------
>> > > > > a     <--Classified as
>> > > > > 0     |  0     a     = spain
>> > > > > Default Category: unknown: 1
>> > > > >
>> > > > > I am not sure if I am doing something wrong; I have to figure out
>> > > > > why my output is so blank.
>> > > > > I'll document these steps and mention country.txt in the wiki.
>> > > > >
>> > > > > Question to all
>> > > > > Should we have 2 versions of country.txt?
>> > > > >
>> > > > >   1. country_full_list.txt - this is the existing list
>> > > > >   2. country_sample_list.txt - a list with 2 or 3 countries
>> > > > >
>> > > > > To get a flavor of the wikipedia bayes example, we can use
>> > > > > country_sample_list.txt. When new people want to just try out the
>> > > > > example, they can reference this txt file as a parameter.
>> > > > > To run the example on a robust, scalable infrastructure, we could
>> > > > > use country_full_list.txt.
>> > > > > Any thoughts?
>> > > > >
>> > > > > regards
>> > > > > Joe.
>> > > > >
>> > > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <[email protected]>
>> > wrote:
>> > > > >
>> > > > >> Gangadhar,
>> > > > >>
>> > > > >> After running TrainClassifier again, the map task just failed
>> > > > >> with the same exception, and I am pretty sure it is an issue
>> > > > >> with disk space.
>> > > > >> As the map was progressing, I was monitoring my free disk space
>> > > > >> dropping from 81GB. It came down to 0 after almost 66% of the map
>> > > > >> task, and then the exception happened. After the exception,
>> > > > >> another map task was resuming at 33% and I had close to 15GB of
>> > > > >> free space (I guess the first map task freed up some space), and
>> > > > >> I am sure it will drop down to zero again and throw the same
>> > > > >> exception.
>> > > > >> I am going to modify country.txt to have just 1 country,
>> > > > >> recreate wikipediainput, and run TrainClassifier. Will let you
>> > > > >> know how it goes.
>> > > > >>
>> > > > >> Do we have any benchmarks / system requirements for running this
>> > > > >> example? Has anyone else had success running this example? Would
>> > > > >> appreciate your inputs / thoughts.
>> > > > >>
>> > > > >> Should we look at tuning the code to handle these situations?
>> > > > >> Any quick suggestions on where to start looking?
>> > > > >>
>> > > > >> regards,
>> > > > >> Joe.
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
>
>
>
