Joe,
I will try with the ngram setting of 1 and let you know how it goes.
Robin, the ngram parameter controls the length of the subsequences used as
features, doesn't it? Or is it evaluated differently with respect to the
Bayes classifier?
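
For what it's worth, here is my mental model of what -ng does, assuming the
grams are built over word tokens (that is how I read the feature code, but I
may be wrong; the class below is just an illustrative sketch of mine, not
Mahout's actual implementation):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NGrams {
    // Collect all token n-grams of sizes 1..maxGram, which is roughly what
    // a gramSize (-ng) setting of maxGram would generate as features.
    // Illustrative sketch only - not Mahout's actual code.
    public static List<String> ngrams(String text, int maxGram) {
        String[] tokens = text.toLowerCase().split("\\s+");
        List<String> grams = new ArrayList<String>();
        for (int n = 1; n <= maxGram; n++) {
            for (int i = 0; i + n <= tokens.length; i++) {
                grams.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
            }
        }
        return grams;
    }
}
```

If this is right, -ng 1 yields only the 4 unigrams of a 4-token sentence,
while -ng 3 yields more than twice as many features per sentence, which would
fit Robin's point about Wikipedia producing a huge number of features.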

Ted, as Joe mentioned, if you could point us to some information on SGD,
we could try it and report the results back to the list.
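
While we wait for pointers, here is my rough understanding of the idea behind
SGD training, as a self-contained toy (plain online logistic regression; this
class is my own sketch and not Mahout's SGD API, so please correct me if I
have it wrong):

```java
// Toy stochastic gradient descent for binary logistic regression.
// Sketch of the general technique only - not Mahout's implementation.
public class SgdSketch {
    private final double[] w;   // weight vector, one entry per feature
    private final double rate;  // learning rate

    public SgdSketch(int numFeatures, double rate) {
        this.w = new double[numFeatures];
        this.rate = rate;
    }

    private static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Probability that the instance belongs to class 1.
    public double predict(double[] x) {
        double z = 0;
        for (int i = 0; i < w.length; i++) z += w[i] * x[i];
        return sigmoid(z);
    }

    // One online update: examples are consumed one at a time, so no
    // batch pass over the whole corpus is needed.
    public void train(double[] x, int label) {
        double err = label - predict(x);
        for (int i = 0; i < w.length; i++) w[i] += rate * err * x[i];
    }
}
```

Because updates happen one example at a time, no MapReduce pass is needed,
which I take to be the reason the SGD trainers run without Hadoop.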

Thank you
Gangadhar

On Mon, Sep 20, 2010 at 10:30 PM, Joe Kumar <[email protected]> wrote:
> Robin / Gangadhar,
> With ngram as 1 and all the countries in country.txt, the model is
> created without any issues.
> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 -i wikipediainput
> -o wikipediamodel -type bayes -source hdfs
>
> Robin,
> For the ngram parameter, the default value is documented as 1, but it is set
> as a mandatory parameter in TrainClassifier, so I'll modify the code to
> default ngram to 1 and make it a non-mandatory param.
>
> That aside, when I try to test the model, the summary below is printed.
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :          0         ?%
> Incorrectly Classified Instances        :          0         ?%
> Total Classified Instances              :          0
> I need to figure out the reason.
>
> Since TestClassifier has the same params and settings as TrainClassifier,
> can I modify it to set the default values for ngram, classifierType &
> dataSource?
>
> reg,
> Joe.
>
> On Mon, Sep 20, 2010 at 1:09 PM, Joe Kumar <[email protected]> wrote:
>
>> Robin,
>>
>> Thanks for your tip.
>> Will try it out and post updates.
>>
>> reg
>> Joe.
>>
>>
>> On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <[email protected]> wrote:
>>
>>> Hi guys, sorry about not replying. I see two possible problems. First,
>>> you need at least 2 countries; otherwise there is no classification.
>>> Second, ngram = 3 is a bit too high: with Wikipedia it will result in a
>>> huge number of features. Why don't you try with 1 and see?
>>>
>>> Robin
>>>
>>> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <[email protected]> wrote:
>>>
>>> > Hi Ted,
>>> >
>>> > sure, will keep digging..
>>> >
>>> > About SGD, I don't have an idea of how it works. If there is some
>>> > documentation / reference / quick summary to read about it, that'll be
>>> > great. I just saw one reference in
>>> > https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
>>> >
>>> > I am assuming we should be able to create a model from Wikipedia
>>> > articles and label the country of a new article. If so, could you please
>>> > provide a note on how to do this? We already have the Wikipedia data
>>> > being extracted for specific countries using WikipediaDatasetCreatorDriver.
>>> > How do we go about training the classifier using SGD?
>>> >
>>> > thanks for your help,
>>> > Joe.
>>> >
>>> >
>>> > On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <[email protected]>
>>> > wrote:
>>> >
>>> > > I am watching these efforts with interest, but have been unable to
>>> > > contribute much to the process. I would encourage Joe and others to
>>> > > keep whittling this problem down so that we can understand what is
>>> > > causing it.
>>> > >
>>> > > In the meantime, I think that the SGD classifiers are close to
>>> > > production quality. For problems with fewer than several million
>>> > > training examples, and especially problems with many sparse features,
>>> > > I think that these classifiers might be easier to get started with
>>> > > than the Naive Bayes classifiers. To make a virtue of a defect, the
>>> > > SGD-based classifiers do not use Hadoop for training. This makes
>>> > > deployment of a classification training workflow easier, but limits
>>> > > the total size of data that can be handled.
>>> > >
>>> > > What would you guys need to get started with trying these alternative
>>> > > models?
>>> > >
>>> > > On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
>>> > > <[email protected]>wrote:
>>> > >
>>> > > > Joe,
>>> > > > Even I tried reducing the number of countries in country.txt.
>>> > > > That didn't help. And in my case, I was monitoring the disk space,
>>> > > > and at no time did it reach 0%, so I am not sure that is the cause.
>>> > > > To remove the dependency on the number of countries, I even tried
>>> > > > with subjects.txt as the classification - that also did not help.
>>> > > > I think this problem is due to the type of data being processed,
>>> > > > but what I am not sure of is what I need to change to get the data
>>> > > > to be processed successfully.
>>> > > >
>>> > > > The experienced folks on Mahout will be able to tell us what is
>>> > > > missing, I guess.
>>> > > >
>>> > > > Thank you
>>> > > > Gangadhar
>>> > > >
>>> > > > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <[email protected]>
>>> wrote:
>>> > > > > Gangadhar,
>>> > > > >
>>> > > > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to
>>> > > > > have just 1 entry (spain) and used WikipediaDatasetCreatorDriver to
>>> > > > > create the wikipediainput data set, then ran TrainClassifier, and
>>> > > > > it worked. When I ran TestClassifier as below, I got blank results
>>> > > > > in the output.
>>> > > > >
>>> > > > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>>> > > > > org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel
>>> > > > > -d wikipediainput -ng 3 -type bayes -source hdfs
>>> > > > >
>>> > > > > Summary
>>> > > > > -------------------------------------------------------
>>> > > > > Correctly Classified Instances          :          0         ?%
>>> > > > > Incorrectly Classified Instances        :          0         ?%
>>> > > > > Total Classified Instances              :          0
>>> > > > >
>>> > > > > =======================================================
>>> > > > > Confusion Matrix
>>> > > > > -------------------------------------------------------
>>> > > > > a     <--Classified as
>>> > > > > 0     |  0     a     = spain
>>> > > > > Default Category: unknown: 1
>>> > > > >
>>> > > > > I am not sure if I am doing something wrong; I have to figure out
>>> > > > > why my output is so blank.
>>> > > > > I'll document these steps and mention country.txt in the wiki.
>>> > > > >
>>> > > > > Question to all:
>>> > > > > Should we have two country.txt files?
>>> > > > >
>>> > > > >   1. country_full_list.txt - the existing list
>>> > > > >   2. country_sample_list.txt - a list with 2 or 3 countries
>>> > > > >
>>> > > > > To get a flavor of the wikipedia bayes example, new people who just
>>> > > > > want to try it out can reference country_sample_list.txt as a
>>> > > > > parameter. To run the example on a robust, scalable infrastructure,
>>> > > > > we could use country_full_list.txt.
>>> > > > > Any thoughts?
>>> > > > >
>>> > > > > regards
>>> > > > > Joe.
>>> > > > >
>>> > > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <[email protected]>
>>> > wrote:
>>> > > > >
>>> > > > >> Gangadhar,
>>> > > > >>
>>> > > > >> After running TrainClassifier again, the map task failed with the
>>> > > > >> same exception, and I am pretty sure it is an issue with disk
>>> > > > >> space. As the map was progressing, I watched my free disk space
>>> > > > >> drop from 81GB. It came down to 0 when the map task was almost 66%
>>> > > > >> through, and then the exception happened. After the exception,
>>> > > > >> another map task resumed at 33% and I had close to 15GB of free
>>> > > > >> space (I guess the first map task freed up some space), and I am
>>> > > > >> sure it will drop down to zero again and throw the same exception.
>>> > > > >> I am going to reduce country.txt to just 1 country, recreate
>>> > > > >> wikipediainput, and run TrainClassifier. Will let you know how it
>>> > > > >> goes..
>>> > > > >>
>>> > > > >> Do we have any benchmarks / system requirements for running this
>>> > > > >> example? Has anyone else had success running it? Would appreciate
>>> > > > >> your inputs / thoughts.
>>> > > > >>
>>> > > > >> Should we look at tuning the code to handle these situations? Any
>>> > > > >> quick suggestions on where to start looking?
>>> > > > >>
>>> > > > >> regards,
>>> > > > >> Joe.
>>> > > > >>
>>> > > > >>
>>> > > > >>
>>> > > > >>
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>
