Re: Options in TrainClassifier.java

Ted Dunning Mon, 20 Sep 2010 22:42:19 -0700

There is a test program called TrainNewsGroups in
org.apache.mahout.classifier.sgd in the examples module.

I would love to work with you to get better documentation pulled together.
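
For anyone following the thread who wants a feel for what SGD training does
before reading TrainNewsGroups, here is a minimal sketch in plain Java. It is
not Mahout's implementation and does not use the Mahout API. The class name
SgdLogisticSketch, the fixed learning rate, and the toy data are all
assumptions made up for illustration. It shows binary logistic regression
updated one example at a time, in memory, with no Hadoop involved.

```java
// Illustrative sketch only: a hypothetical class, not part of Mahout.
final class SgdLogisticSketch {
    private final double[] weights;
    private final double learningRate;

    SgdLogisticSketch(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];   // all weights start at zero
        this.learningRate = learningRate;
    }

    /** Estimated probability that x belongs to class 1 (sigmoid of w . x). */
    double predict(double[] x) {
        double z = 0.0;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** One stochastic gradient step on a single example; label is 0 or 1. */
    void train(double[] x, int label) {
        double error = label - predict(x);
        for (int i = 0; i < weights.length; i++) {
            weights[i] += learningRate * error * x[i];
        }
    }

    /** Trains on two made-up examples: [bias, featureA, featureB]. */
    static SgdLogisticSketch trainToy() {
        double[][] examples = {{1, 1, 0}, {1, 0, 1}};
        int[] labels = {1, 0};
        SgdLogisticSketch model = new SgdLogisticSketch(3, 0.5);
        for (int epoch = 0; epoch < 200; epoch++) {
            for (int i = 0; i < examples.length; i++) {
                model.train(examples[i], labels[i]);
            }
        }
        return model;
    }

    public static void main(String[] args) {
        SgdLogisticSketch model = trainToy();
        // Prints the estimated probability of class 1 for each toy example.
        System.out.println(model.predict(new double[] {1, 1, 0}));
        System.out.println(model.predict(new double[] {1, 0, 1}));
    }
}
```

Because each update touches only one example, the whole data set never has to
be held or shuffled across a cluster, which is the trade-off described in the
quoted thread: simpler deployment, but a cap on how much data one machine can
stream through.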

On Mon, Sep 20, 2010 at 8:13 PM, Gangadhar Nittala
<[email protected]> wrote:

> Joe,
> I will try with the ngram setting of 1 and let you know how it goes.
> Robin, the ngram parameter sets the length of the character
> subsequences that are counted, doesn't it? Or is it evaluated
> differently w.r.t. the Bayesian classifier?
>
> Ted, like Joe mentioned, if you could point us to some information on
> SGD we could try it and report back the results to the list.
>
> Thank you
> Gangadhar
>
> On Mon, Sep 20, 2010 at 10:30 PM, Joe Kumar <[email protected]> wrote:
> > Robin / Gangadhar,
> > With ngram as 1 and all the countries in the country.txt , the model is
> > getting created without any issues.
> > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> > org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 -i wikipediainput
> > -o wikipediamodel -type bayes -source hdfs
> >
> > Robin,
> > Even for the ngram parameter, the default value is documented as 1, but it
> > is set as a mandatory parameter in TrainClassifier. So I'll modify the code
> > to set the default ngram to 1 and make it a non-mandatory param.
> >
> > That aside, when I try to test the model, the summary is printed
> > as below.
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances          :          0         ?%
> > Incorrectly Classified Instances        :          0         ?%
> > Total Classified Instances              :          0
> > Need to figure out the reason..
> >
> > Since TestClassifier has the same params and settings as
> > TrainClassifier, can I modify it to set default values for ngram,
> > classifierType & dataSource?
> >
> > reg,
> > Joe.
> >
> > On Mon, Sep 20, 2010 at 1:09 PM, Joe Kumar <[email protected]> wrote:
> >
> >> Robin,
> >>
> >> Thanks for your tip.
> >> Will try it out and post updates.
> >>
> >> reg
> >> Joe.
> >>
> >>
> >> On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <[email protected]> wrote:
> >>
> >>> Hi guys, sorry about not replying. I see two possible problems. First,
> >>> you need at least 2 countries; otherwise there is no classification.
> >>> Second, ngram = 3 is a bit too high. With Wikipedia this will result in
> >>> a huge number of features. Why don't you try with 1 and see?
> >>>
> >>> Robin
> >>>
> >>> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <[email protected]> wrote:
> >>>
> >>> > Hi Ted,
> >>> >
> >>> > Sure, will keep digging.
> >>> >
> >>> > About SGD, I don't have an idea of how it works. If there is some
> >>> > documentation / reference / quick summary to read about it, that would
> >>> > be great. I just saw one reference in
> >>> > https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
> >>> >
> >>> > I am assuming we should be able to create a model from Wikipedia
> >>> > articles and label the country of a new article. If so, could you
> >>> > please provide a note on how to do this? We already have the Wikipedia
> >>> > data being extracted for specific countries using
> >>> > WikipediaDatasetCreatorDriver. How do we go about training the
> >>> > classifier using SGD?
> >>> >
> >>> > thanks for your help,
> >>> > Joe.
> >>> >
> >>> >
> >>> > On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <[email protected]> wrote:
> >>> >
> >>> > > I am watching these efforts with interest, but have been unable to
> >>> > > contribute much to the process.  I would encourage Joe and others to
> >>> > > keep whittling this problem down so that we can understand what is
> >>> > > causing it.
> >>> > >
> >>> > > In the meantime, I think that the SGD classifiers are close to
> >>> > > production quality.  For problems with fewer than several million
> >>> > > training examples, and especially problems with many sparse features,
> >>> > > I think that these classifiers might be easier to get started with
> >>> > > than the Naive Bayes classifiers.  To make a virtue of a defect, the
> >>> > > SGD-based classifiers do not use Hadoop for training.  This makes
> >>> > > deployment of a classification training workflow easier, but limits
> >>> > > the total size of data that can be handled.
> >>> > >
> >>> > > What would you guys need to get started with trying these alternative
> >>> > > models?
> >>> > >
> >>> > > On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
> >>> > > <[email protected]> wrote:
> >>> > >
> >>> > > > Joe,
> >>> > > > I also tried reducing the number of countries in country.txt.
> >>> > > > That didn't help. In my case, I was monitoring the disk space, and
> >>> > > > at no time did it reach 0%, so I am not sure that is the cause. To
> >>> > > > remove the dependency on the number of countries, I even tried
> >>> > > > subjects.txt as the classification; that did not help either.
> >>> > > > I think this problem is due to the type of data being processed,
> >>> > > > but I am not sure what I need to change to get the data
> >>> > > > processed successfully.
> >>> > > >
> >>> > > > The experienced folks on Mahout will be able to tell us what is
> >>> > > > missing, I guess.
> >>> > > >
> >>> > > > Thank you
> >>> > > > Gangadhar
> >>> > > >
> >>> > > > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <[email protected]> wrote:
> >>> > > > > Gangadhar,
> >>> > > > >
> >>> > > > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to
> >>> > > > > have just 1 entry (spain), used WikipediaDatasetCreatorDriver to
> >>> > > > > create the wikipediainput data set, and then ran TrainClassifier,
> >>> > > > > which worked. When I ran TestClassifier as below, I got blank
> >>> > > > > results in the output.
> >>> > > > >
> >>> > > > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> >>> > > > > org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel
> >>> > > > > -d wikipediainput -ng 3 -type bayes -source hdfs
> >>> > > > >
> >>> > > > > Summary
> >>> > > > > -------------------------------------------------------
> >>> > > > > Correctly Classified Instances          :          0         ?%
> >>> > > > > Incorrectly Classified Instances        :          0         ?%
> >>> > > > > Total Classified Instances              :          0
> >>> > > > >
> >>> > > > > =======================================================
> >>> > > > > Confusion Matrix
> >>> > > > > -------------------------------------------------------
> >>> > > > > a     <--Classified as
> >>> > > > > 0     |  0     a     = spain
> >>> > > > > Default Category: unknown: 1
> >>> > > > >
> >>> > > > > I am not sure if I am doing something wrong. I have to figure out
> >>> > > > > why my output is so blank.
> >>> > > > > I'll document these steps and mention country.txt in the wiki.
> >>> > > > > Question to all:
> >>> > > > > Should we have 2 country.txt files?
> >>> > > > >
> >>> > > > >   1. country_full_list.txt - this is the existing list
> >>> > > > >   2. country_sample_list.txt - a list with 2 or 3 countries
> >>> > > > >
> >>> > > > > To get a flavor of the Wikipedia Bayes example, we can use
> >>> > > > > country_sample_list.txt. When new people want to just try out the
> >>> > > > > example, they can reference this txt file as a parameter.
> >>> > > > > To run the example on a robust, scalable infrastructure, we could
> >>> > > > > use country_full_list.txt.
> >>> > > > > Any thoughts?
> >>> > > > >
> >>> > > > > regards
> >>> > > > > Joe.
> >>> > > > >
> >>> > > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <[email protected]> wrote:
> >>> > > > >
> >>> > > > >> Gangadhar,
> >>> > > > >>
> >>> > > > >> After running TrainClassifier again, the map task failed with the
> >>> > > > >> same exception, and I am pretty sure it is an issue with disk space.
> >>> > > > >> As the map was progressing, I was monitoring my free disk space
> >>> > > > >> dropping from 81GB. It came down to 0 almost 66% of the way through
> >>> > > > >> the map task, and then the exception happened. After the exception,
> >>> > > > >> another map task resumed at 33% and I got back close to 15GB of free
> >>> > > > >> space (I guess the first map task freed up some space), and I am sure
> >>> > > > >> it would drop down to zero again and throw the same exception.
> >>> > > > >> I am going to modify country.txt to have just 1 country, recreate
> >>> > > > >> wikipediainput, and run TrainClassifier. Will let you know how it
> >>> > > > >> goes.
> >>> > > > >>
> >>> > > > >> Do we have any benchmarks / system requirements for running this
> >>> > > > >> example? Has anyone else had success running this example? I would
> >>> > > > >> appreciate your inputs / thoughts.
> >>> > > > >>
> >>> > > > >> Should we look at tuning the code to handle these situations? Any
> >>> > > > >> quick suggestions on where to start looking?
> >>> > > > >>
> >>> > > > >> regards,
> >>> > > > >> Joe.
> >>> > > > >>
> >>> > > > >>
> >>> > > > >>
> >>> > > > >>
> >>> > > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>
> >>
> >>
> >>
> >>
> >
>
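
On Robin's point in the quoted thread that ngram = 3 produces a huge number
of features: a rough sketch of why, in plain Java. This assumes word-level
n-grams and assumes that a setting of n contributes every gram of length 1
through n; neither is confirmed in this thread as how TrainClassifier
behaves. NgramSketch and its methods are hypothetical names, not Mahout code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: a hypothetical class, not part of Mahout.
final class NgramSketch {

    /** All contiguous runs of exactly n tokens, joined with single spaces. */
    static List<String> ngrams(String[] tokens, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return grams;
    }

    /** Number of distinct features when grams of length 1..maxN are kept. */
    static int distinctFeatures(String[] tokens, int maxN) {
        Set<String> features = new HashSet<>();
        for (int n = 1; n <= maxN; n++) {
            features.addAll(ngrams(tokens, n));
        }
        return features.size();
    }

    public static void main(String[] args) {
        String text = "the capital of spain is madrid and "
                + "the capital of france is paris";
        String[] tokens = text.split(" ");
        for (int maxN = 1; maxN <= 3; maxN++) {
            System.out.println("ngram=" + maxN + " distinct features: "
                    + distinctFeatures(tokens, maxN));
        }
    }
}
```

Even on a 13-word sentence the distinct feature count roughly triples from
ngram = 1 to ngram = 3, since longer grams repeat far less often than single
words. On a Wikipedia-sized corpus the same effect is much larger.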
