Hi Gangadhar,

I ran TestClassifier with similar parameters. It didn't take me 2 hours, though.
I have documented the steps that worked for me at
https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example

Could you please apply the patch available at MAHOUT-509 and then try the
steps in the wiki? Please let me know if you still face issues.

reg,
Joe.

On Thu, Sep 23, 2010 at 10:43 PM, Gangadhar Nittala <[email protected]> wrote:
> Joe,
> Can you let me know what was the command you used to test the
> classifier? With the ngrams set to 1 as suggested by Robin, I was
> able to train the classifier. The command:
> $HADOOP_HOME/bin/hadoop jar
> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 1
> --input wikipediainput10 --output wikipediamodel10 --classifierType
> bayes --dataSource hdfs
>
> After this, as per the wiki, we need to get the data from HDFS. I did that:
> <HADOOP_HOME>/bin/hadoop dfs -get wikipediainput10 wikipediainput10
>
> After this, the classifier is to be tested:
> $HADOOP_HOME/bin/hadoop jar
> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel10
> -d wikipediainput10 -ng 1 -type bayes -source hdfs
>
> When I run this, it runs for close to 2 hours and then errors out with
> a java.io.FileException saying that logs_ is a directory in the
> wikipediainput10 folder. I am sorry I can't provide the stack trace
> right now because I accidentally closed the terminal window before I
> could copy it. I will run this again and send the stack trace.
>
> But if you can send me the steps that you followed after running the
> classifier, I can repeat those and see if I am able to successfully
> execute the classifier.
>
> Thank you
> Gangadhar
>
> On Mon, Sep 20, 2010 at 11:13 PM, Gangadhar Nittala
> <[email protected]> wrote:
> > Joe,
> > I will try with the ngram setting of 1 and let you know how it goes.
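The java.io.FileException Gangadhar reports is consistent with TestClassifier opening every entry under wikipediainput10 as a file and tripping over a subdirectory such as logs_. Since the stack trace wasn't captured, that cause is an assumption; the sketch below only reproduces the failure mode in miniature in Python, together with the defensive filter that avoids it:

```python
import os
import tempfile

def read_all_entries(input_dir):
    """Naive reader: opens every directory entry as a file.
    Raises IsADirectoryError when an entry is a subdirectory
    (analogous to the reported exception on the logs_ directory)."""
    texts = []
    for name in sorted(os.listdir(input_dir)):
        with open(os.path.join(input_dir, name)) as f:  # fails on a subdir
            texts.append(f.read())
    return texts

def read_plain_files(input_dir):
    """Defensive reader: skips subdirectories, reading only plain files."""
    texts = []
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        if os.path.isfile(path):
            with open(path) as f:
                texts.append(f.read())
    return texts

# Simulate a copied job output directory: one data file plus a logs_ subdir.
d = tempfile.mkdtemp()
with open(os.path.join(d, "part-00000"), "w") as f:
    f.write("some data")
os.mkdir(os.path.join(d, "logs_"))

try:
    read_all_entries(d)
    failed = False
except (IsADirectoryError, PermissionError):  # PermissionError on Windows
    failed = True

print(failed)               # the naive reader blows up on the subdirectory
print(read_plain_files(d))  # the filtered reader succeeds
```

If this is indeed the cause, removing the subdirectory from the local copy (or skipping directories when listing the input) would let TestClassifier proceed.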
> > Robin, the ngram parameter is used to count the number of subsequences
> > of characters, isn't it? Or is it evaluated differently w.r.t. the
> > Bayesian classifier?
> >
> > Ted, like Joe mentioned, if you could point us to some information on
> > SGD we could try it and report back the results to the list.
> >
> > Thank you
> > Gangadhar
> >
> > On Mon, Sep 20, 2010 at 10:30 PM, Joe Kumar <[email protected]> wrote:
> >> Robin / Gangadhar,
> >> With ngram as 1 and all the countries in country.txt, the model is
> >> getting created without any issues:
> >> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> >> org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 -i wikipediainput
> >> -o wikipediamodel -type bayes -source hdfs
> >>
> >> Robin,
> >> For the ngram parameter, the default value is documented as 1 but it is
> >> set as a mandatory parameter in TrainClassifier, so I'll modify the code
> >> to set the default ngram to 1 and make it a non-mandatory param.
> >>
> >> That aside, when I try to test the model, the summary is getting printed
> >> like below:
> >> Summary
> >> -------------------------------------------------------
> >> Correctly Classified Instances   : 0   ?%
> >> Incorrectly Classified Instances : 0   ?%
> >> Total Classified Instances       : 0
> >> Need to figure out the reason..
> >>
> >> Since TestClassifier also has the same params and settings as
> >> TrainClassifier, can I modify it to set the default values for ngram,
> >> classifierType & dataSource?
> >>
> >> reg,
> >> Joe.
> >>
> >> On Mon, Sep 20, 2010 at 1:09 PM, Joe Kumar <[email protected]> wrote:
> >>
> >>> Robin,
> >>>
> >>> Thanks for your tip.
> >>> Will try it out and post updates.
> >>>
> >>> reg,
> >>> Joe.
> >>>
> >>> On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <[email protected]> wrote:
> >>>
> >>>> Hi Guys, sorry about not replying. I see two possible problems. First, you
> >>>> need at least 2 countries,
> >>>> otherwise there is no classification. Secondly,
> >>>> ngram = 3 is a bit too high. With wikipedia this will result in a huge
> >>>> number of features. Why don't you try with one and see?
> >>>>
> >>>> Robin
> >>>>
> >>>> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <[email protected]> wrote:
> >>>>
> >>>> > Hi Ted,
> >>>> >
> >>>> > sure, will keep digging..
> >>>> >
> >>>> > About SGD, I don't have an idea about how it works et al. If there is
> >>>> > some documentation / reference / quick summary to read about it,
> >>>> > that'll be great.
> >>>> > Just saw one reference in
> >>>> > https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
> >>>> >
> >>>> > I am assuming we should be able to create a model from wikipedia
> >>>> > articles and label the country of a new article. If so, could you
> >>>> > please provide a note on how to do this. We already have the wikipedia
> >>>> > data being extracted for specific countries using
> >>>> > WikipediaDatasetCreatorDriver. How do we go about training the
> >>>> > classifier using SGD?
> >>>> >
> >>>> > thanks for your help,
> >>>> > Joe.
> >>>> >
> >>>> > On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <[email protected]> wrote:
> >>>> >
> >>>> > > I am watching these efforts with interest, but have been unable to
> >>>> > > contribute much to the process. I would encourage Joe and others to
> >>>> > > keep whittling this problem down so that we can understand what is
> >>>> > > causing it.
> >>>> > >
> >>>> > > In the meantime, I think that the SGD classifiers are close to
> >>>> > > production quality. For problems with fewer than several million
> >>>> > > training examples, and especially problems with many sparse features,
> >>>> > > I think that these classifiers might be easier to get started with
> >>>> > > than the Naive Bayes classifiers.
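Robin's warning about ngram = 3 is easiest to see by counting features directly. In the Bayes example the gram size controls word n-grams (token sequences) rather than character subsequences; the sketch below is illustrative Python, not Mahout's actual tokenization, and assumes a gram size of n emits all grams of length 1 through n:

```python
def word_ngrams(tokens, n):
    """All word n-grams of length 1..n - roughly what a gram size of n implies."""
    grams = []
    for k in range(1, n + 1):
        for i in range(len(tokens) - k + 1):
            grams.append(tuple(tokens[i:i + k]))
    return grams

tokens = "madrid is the capital of spain".split()

unigrams = word_ngrams(tokens, 1)
trigrams = word_ngrams(tokens, 3)

print(len(unigrams))  # 6 tokens -> 6 unigram features
print(len(trigrams))  # 6 + 5 + 4 = 15 features from the same sentence
```

Across millions of Wikipedia sentences the 2- and 3-grams are also far less likely to repeat than single words, so the feature dictionary grows much faster than the unigram one, which fits the disk-space blowups seen in this thread.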
> >>>> > > To make a virtue of a defect, the SGD based classifiers do not
> >>>> > > use Hadoop for training. This makes deployment of a classification
> >>>> > > training workflow easier, but limits the total size of data that
> >>>> > > can be handled.
> >>>> > >
> >>>> > > What would you guys need to get started with trying these
> >>>> > > alternative models?
> >>>> > >
> >>>> > > On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
> >>>> > > <[email protected]> wrote:
> >>>> > >
> >>>> > > > Joe,
> >>>> > > > Even I tried reducing the number of countries in country.txt.
> >>>> > > > That didn't help. And in my case, I was monitoring the disk space
> >>>> > > > and at no time did it reach 0%, so I am not sure if that is the
> >>>> > > > cause. To remove the dependency on the number of countries, I even
> >>>> > > > tried with subjects.txt as the classification - that also did not
> >>>> > > > help.
> >>>> > > > I think this problem is due to the type of the data being
> >>>> > > > processed, but what I am not sure of is what I need to change to
> >>>> > > > get the data to be processed successfully.
> >>>> > > >
> >>>> > > > The experienced folks on Mahout will be able to tell us what is
> >>>> > > > missing, I guess.
> >>>> > > >
> >>>> > > > Thank you
> >>>> > > > Gangadhar
> >>>> > > >
> >>>> > > > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <[email protected]> wrote:
> >>>> > > > > Gangadhar,
> >>>> > > > >
> >>>> > > > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt
> >>>> > > > > to have just 1 entry (spain) and used
> >>>> > > > > WikipediaDatasetCreatorDriver to create the wikipediainput data
> >>>> > > > > set and then ran TrainClassifier and it worked. When I ran
> >>>> > > > > TestClassifier as below, I got blank results in the output.
> >>>> > > > >
> >>>> > > > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> >>>> > > > > org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel
> >>>> > > > > -d wikipediainput -ng 3 -type bayes -source hdfs
> >>>> > > > >
> >>>> > > > > Summary
> >>>> > > > > -------------------------------------------------------
> >>>> > > > > Correctly Classified Instances   : 0   ?%
> >>>> > > > > Incorrectly Classified Instances : 0   ?%
> >>>> > > > > Total Classified Instances       : 0
> >>>> > > > >
> >>>> > > > > =======================================================
> >>>> > > > > Confusion Matrix
> >>>> > > > > -------------------------------------------------------
> >>>> > > > > a    <--Classified as
> >>>> > > > > 0 | 0    a = spain
> >>>> > > > > Default Category: unknown: 1
> >>>> > > > >
> >>>> > > > > I am not sure if I am doing something wrong.. have to figure out
> >>>> > > > > why my o/p is so blank.
> >>>> > > > > I'll document these steps and mention country.txt in the wiki.
> >>>> > > > >
> >>>> > > > > Question to all:
> >>>> > > > > Should we have 2 country.txt files?
> >>>> > > > >
> >>>> > > > > 1. country_full_list.txt - this is the existing list
> >>>> > > > > 2. country_sample_list.txt - a list with 2 or 3 countries
> >>>> > > > >
> >>>> > > > > To get a flavor of the wikipedia bayes example, we can use
> >>>> > > > > country_sample_list.txt. When new people want to just try out the
> >>>> > > > > example, they can reference this txt file as a parameter.
> >>>> > > > > To run the example on a robust, scalable infrastructure, we could
> >>>> > > > > use country_full_list.txt.
> >>>> > > > > Any thoughts?
> >>>> > > > >
> >>>> > > > > regards,
> >>>> > > > > Joe.
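Joe's all-zero summary is what a one-category setup tends to produce: with only spain in country.txt there is no second label to separate it from, and documents that can't be matched fall into the default "unknown" bucket without ever being counted as classified. A small Python sketch of how such a summary could be tallied (the summarize helper is hypothetical, not Mahout's code):

```python
def summarize(results, known_labels, default="unknown"):
    """Tally a TestClassifier-style summary from (actual, predicted) pairs.
    Pairs whose actual label is not a known category fall into the default
    bucket and are excluded from the classified totals."""
    correct = incorrect = defaulted = 0
    for actual, predicted in results:
        if actual not in known_labels:
            defaulted += 1
        elif actual == predicted:
            correct += 1
        else:
            incorrect += 1
    return {"correct": correct, "incorrect": incorrect,
            "total": correct + incorrect, default: defaulted}

# One known category, as in the spain-only run: everything defaults,
# so the classified totals are all zero.
print(summarize([("unknown", "spain")], {"spain"}))

# Two categories: the summary becomes meaningful.
print(summarize([("spain", "spain"), ("india", "spain")], {"spain", "india"}))
```

This matches Robin's "you need at least 2 countries" point: the evaluation only says something once there are two or more labels to confuse.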
> >>>> > > > >
> >>>> > > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <[email protected]> wrote:
> >>>> > > > >
> >>>> > > > >> Gangadhar,
> >>>> > > > >>
> >>>> > > > >> After running TrainClassifier again, the map task just failed
> >>>> > > > >> with the same exception and I am pretty sure it is an issue
> >>>> > > > >> with disk space.
> >>>> > > > >> As the map was progressing, I was monitoring my free disk space
> >>>> > > > >> dropping from 81GB. It came down to 0 after almost 66% through
> >>>> > > > >> the map task and then the exception happened. After the
> >>>> > > > >> exception, another map task was resuming at 33% and I got close
> >>>> > > > >> to 15GB free space (I guess the first map task freed up some
> >>>> > > > >> space), and I am sure it would drop down to zero again and
> >>>> > > > >> throw the same exception.
> >>>> > > > >> I am going to modify country.txt to just 1 country, recreate
> >>>> > > > >> wikipediainput, and run TrainClassifier. Will let you know how
> >>>> > > > >> it goes..
> >>>> > > > >>
> >>>> > > > >> Do we have any benchmarks / system requirements for running
> >>>> > > > >> this example? Has anyone else had success running this example?
> >>>> > > > >> Would appreciate your inputs / thoughts.
> >>>> > > > >>
> >>>> > > > >> Should we look at tuning the code to handle these situations?
> >>>> > > > >> Any quick suggestions on where to start looking?
> >>>> > > > >>
> >>>> > > > >> regards,
> >>>> > > > >> Joe.
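Ted's point that the SGD classifiers train without Hadoop can be illustrated with a minimal logistic-regression SGD loop in plain Python. This is a toy sketch of the technique, not Mahout's OnlineLogisticRegression API, and the dense features here merely stand in for sparse token counts:

```python
import math
import random

def sgd_train(examples, dim, epochs=50, rate=0.1):
    """Logistic regression via stochastic gradient descent: one example
    at a time, no cluster needed - the whole model is one weight vector."""
    w = [0.0] * dim
    rng = random.Random(42)
    for _ in range(epochs):
        rng.shuffle(examples)
        for x, y in examples:            # y is 0 or 1
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for i in range(dim):         # gradient step on this one example
                w[i] += rate * (y - p) * x[i]
    return w

def predict(w, x):
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if z > 0 else 0

# Toy "country" task: feature 1 marks one class, feature 2 the other;
# feature 0 is a bias term. Purely illustrative data.
examples = [([1.0, 1.0, 0.0], 1), ([1.0, 0.9, 0.1], 1),
            ([1.0, 0.0, 1.0], 0), ([1.0, 0.1, 0.9], 0)]
w = sgd_train([list(e) for e in examples], dim=3)

print(predict(w, [1.0, 1.0, 0.0]))  # class 1
print(predict(w, [1.0, 0.0, 1.0]))  # class 0
```

Because each update touches one example and one weight vector, the whole training workflow runs on a single machine; the trade-off, as Ted notes, is that the training data must fit through that one machine.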
