Joe, I am out of town this week and won't have access to my machine. I will check this over the weekend and get back to you. I will follow the steps in the wiki.
Thank you

On Fri, Sep 24, 2010 at 8:44 AM, Joe Kumar <[email protected]> wrote:
> Hi Gangadhar,
>
> I ran TestClassifier with similar parameters. It didn't take me 2 hrs,
> though.
>
> I have documented the steps that worked for me at
> https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example
> Can you please get the patch available at MAHOUT-509, apply it, and then
> try the steps in the wiki?
> Please let me know if you still face issues.
>
> reg
> Joe.
>
> On Thu, Sep 23, 2010 at 10:43 PM, Gangadhar Nittala <[email protected]> wrote:
>> Joe,
>> Can you let me know what command you used to test the classifier?
>> With the ngrams set to 1 as suggested by Robin, I was able to train
>> the classifier. The command:
>>
>>   $HADOOP_HOME/bin/hadoop jar \
>>     $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
>>     org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 1 \
>>     --input wikipediainput10 --output wikipediamodel10 \
>>     --classifierType bayes --dataSource hdfs
>>
>> After this, as per the wiki, we need to get the data from HDFS. I did that:
>>
>>   <HADOOP_HOME>/bin/hadoop dfs -get wikipediainput10 wikipediainput10
>>
>> After this, the classifier is to be tested:
>>
>>   $HADOOP_HOME/bin/hadoop jar \
>>     $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
>>     org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel10 \
>>     -d wikipediainput10 -ng 1 -type bayes -source hdfs
>>
>> When I run this, it runs for close to 2 hours and then errors out with
>> a java.io.FileException saying that logs_ is a directory in the
>> wikipediainput10 folder. I am sorry I can't provide the stack trace
>> right now because I accidentally closed the terminal window before I
>> could copy it. I will run this again and send the stack trace.
>>
>> But if you can send me the steps that you followed after running the
>> classifier, I can repeat those and see if I am able to successfully
>> execute the classifier.
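[Editor's note] The commands quoted above, collected in one place as a sketch rather than a verified recipe: the paths, `$HADOOP_HOME`/`$MAHOUT_HOME`, and the 0.4-SNAPSHOT job file are exactly as reported in the thread, while the `-rmr` line is a hypothetical workaround, assuming the FileException comes from TestClassifier trying to read the `_logs` subdirectory that the training job leaves inside the input as if it were data.

```shell
# Train the Bayes classifier (ngram size 1, as Robin suggested).
$HADOOP_HOME/bin/hadoop jar \
  $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
  org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 1 \
  --input wikipediainput10 --output wikipediamodel10 \
  --classifierType bayes --dataSource hdfs

# Copy the input out of HDFS, per the wiki.
$HADOOP_HOME/bin/hadoop dfs -get wikipediainput10 wikipediainput10

# Hypothetical workaround (untested): remove the _logs directory from the
# test input before running TestClassifier, assuming it is the
# "is a directory" culprit reported below.
$HADOOP_HOME/bin/hadoop dfs -rmr wikipediainput10/_logs

# Test the trained model.
$HADOOP_HOME/bin/hadoop jar \
  $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
  org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel10 \
  -d wikipediainput10 -ng 1 -type bayes -source hdfs
```

This requires a running Hadoop installation with Mahout built from trunk; it is not runnable standalone.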
>>
>> Thank you
>> Gangadhar
>>
>> On Mon, Sep 20, 2010 at 11:13 PM, Gangadhar Nittala <[email protected]> wrote:
>>> Joe,
>>> I will try with the ngram setting of 1 and let you know how it goes.
>>> Robin, the ngram parameter is used to check the number of subsequences
>>> of characters, isn't it? Or is it evaluated differently w.r.t. the
>>> Bayesian classifier?
>>>
>>> Ted, like Joe mentioned, if you could point us to some information on
>>> SGD, we could try it and report back the results to the list.
>>>
>>> Thank you
>>> Gangadhar
>>>
>>> On Mon, Sep 20, 2010 at 10:30 PM, Joe Kumar <[email protected]> wrote:
>>>> Robin / Gangadhar,
>>>> With ngram as 1 and all the countries in country.txt, the model is
>>>> getting created without any issues:
>>>>
>>>>   $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
>>>>     org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 \
>>>>     -i wikipediainput -o wikipediamodel -type bayes -source hdfs
>>>>
>>>> Robin,
>>>> Even for the ngram parameter, the default value is mentioned as 1,
>>>> but it is set as a mandatory parameter in TrainClassifier, so I'll
>>>> modify the code to set the default ngram as 1 and make it a
>>>> non-mandatory param.
>>>>
>>>> That aside, when I try to test the model, the summary is getting
>>>> printed like below:
>>>>
>>>>   Summary
>>>>   -------------------------------------------------------
>>>>   Correctly Classified Instances   : 0    ?%
>>>>   Incorrectly Classified Instances : 0    ?%
>>>>   Total Classified Instances       : 0
>>>>
>>>> Need to figure out the reason.
>>>>
>>>> Since TestClassifier also has the same params and settings as
>>>> TrainClassifier, can I modify it to set the default values for ngram,
>>>> classifierType & dataSource?
>>>>
>>>> reg,
>>>> Joe.
>>>>
>>>> On Mon, Sep 20, 2010 at 1:09 PM, Joe Kumar <[email protected]> wrote:
>>>>> Robin,
>>>>>
>>>>> Thanks for your tip.
>>>>> Will try it out and post updates.
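[Editor's note] On Gangadhar's n-gram question above: in this example the n-grams are over tokens (words), not character subsequences, which is why raising `-ng` blows up the feature count on Wikipedia-sized text. A tiny plain-shell illustration (nothing Mahout-specific; the corpus and helper name are made up) of how the number of distinct n-grams grows with n:

```shell
# Count distinct word n-grams in a toy corpus to see why a larger n
# means many more features for the classifier to track.
corpus="the quick brown fox jumps over the lazy dog the quick fox"

count_ngrams() {
  n=$1
  echo "$corpus" | tr ' ' '\n' | awk -v n="$n" '
    { w[NR] = $0 }                     # collect one word per line
    END {
      for (i = 1; i + n - 1 <= NR; i++) {
        g = w[i]
        for (j = 1; j < n; j++) g = g " " w[i + j]
        seen[g] = 1                    # de-duplicate n-grams
      }
      c = 0; for (g in seen) c++
      print c
    }'
}

count_ngrams 1   # prints 8  (distinct unigrams)
count_ngrams 3   # prints 10 (distinct trigrams, from only 12 tokens)
```

On a 12-token corpus the jump from 8 to 10 is small, but on millions of Wikipedia articles the trigram vocabulary dwarfs the unigram one.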
>>>>>
>>>>> reg
>>>>> Joe.
>>>>>
>>>>> On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <[email protected]> wrote:
>>>>>> Hi Guys, sorry about not replying. I see two possible problems.
>>>>>> First, you need at least 2 countries; otherwise there is no
>>>>>> classification. Secondly, ngram = 3 is a bit too high. With
>>>>>> Wikipedia this will result in a huge number of features. Why don't
>>>>>> you try with one and see?
>>>>>>
>>>>>> Robin
>>>>>>
>>>>>> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <[email protected]> wrote:
>>>>>>> Hi Ted,
>>>>>>>
>>>>>>> Sure, will keep digging.
>>>>>>>
>>>>>>> About SGD, I don't have an idea about how it works et al. If there
>>>>>>> is some documentation / reference / quick summary to read about
>>>>>>> it, that'll be gr8. Just saw one reference in
>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
>>>>>>>
>>>>>>> I am assuming we should be able to create a model from Wikipedia
>>>>>>> articles and label the country of a new article. If so, could you
>>>>>>> please provide a note on how to do this? We already have the
>>>>>>> Wikipedia data being extracted for specific countries using
>>>>>>> WikipediaDatasetCreatorDriver. How do we go about training the
>>>>>>> classifier using SGD?
>>>>>>>
>>>>>>> thanks for your help,
>>>>>>> Joe.
>>>>>>>
>>>>>>> On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <[email protected]> wrote:
>>>>>>>> I am watching these efforts with interest, but have been unable
>>>>>>>> to contribute much to the process. I would encourage Joe and
>>>>>>>> others to keep whittling this problem down so that we can
>>>>>>>> understand what is causing it.
>>>>>>>>
>>>>>>>> In the meantime, I think that the SGD classifiers are close to
>>>>>>>> production quality. For problems with less than several million
>>>>>>>> training examples, and especially problems with many sparse
>>>>>>>> features, I think that these classifiers might be easier to get
>>>>>>>> started with than the Naive Bayes classifiers. To make a virtue
>>>>>>>> of a defect, the SGD-based classifiers do not use Hadoop for
>>>>>>>> training. This makes deployment of a classification training
>>>>>>>> workflow easier, but limits the total size of data that can be
>>>>>>>> handled.
>>>>>>>>
>>>>>>>> What would you guys need to get started with trying these
>>>>>>>> alternative models?
>>>>>>>>
>>>>>>>> On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala <[email protected]> wrote:
>>>>>>>>> Joe,
>>>>>>>>> Even I tried with reducing the number of countries in
>>>>>>>>> country.txt. That didn't help. And in my case, I was monitoring
>>>>>>>>> the disk space, and at no time did it reach 0%. So I am not sure
>>>>>>>>> if that is the case. To remove the dependency on the number of
>>>>>>>>> countries, I even tried with subjects.txt as the classification
>>>>>>>>> - that also did not help. I think this problem is due to the
>>>>>>>>> type of the data being processed, but what I am not sure of is
>>>>>>>>> what I need to change to get the data processed successfully.
>>>>>>>>>
>>>>>>>>> The experienced folks on Mahout will be able to tell us what is
>>>>>>>>> missing, I guess.
>>>>>>>>>
>>>>>>>>> Thank you
>>>>>>>>> Gangadhar
>>>>>>>>>
>>>>>>>>> On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <[email protected]> wrote:
>>>>>>>>>> Gangadhar,
>>>>>>>>>>
>>>>>>>>>> I modified $MAHOUT_HOME/examples/src/test/resources/country.txt
>>>>>>>>>> to just have 1 entry (spain) and used
>>>>>>>>>> WikipediaDatasetCreatorDriver to create the wikipediainput data
>>>>>>>>>> set, and then ran TrainClassifier and it worked. When I ran
>>>>>>>>>> TestClassifier as below, I got blank results in the output:
>>>>>>>>>>
>>>>>>>>>>   $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
>>>>>>>>>>     org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel \
>>>>>>>>>>     -d wikipediainput -ng 3 -type bayes -source hdfs
>>>>>>>>>>
>>>>>>>>>>   Summary
>>>>>>>>>>   -------------------------------------------------------
>>>>>>>>>>   Correctly Classified Instances   : 0    ?%
>>>>>>>>>>   Incorrectly Classified Instances : 0    ?%
>>>>>>>>>>   Total Classified Instances       : 0
>>>>>>>>>>
>>>>>>>>>>   =======================================================
>>>>>>>>>>   Confusion Matrix
>>>>>>>>>>   -------------------------------------------------------
>>>>>>>>>>   a       <--Classified as
>>>>>>>>>>   0       | 0     a = spain
>>>>>>>>>>   Default Category: unknown: 1
>>>>>>>>>>
>>>>>>>>>> I am not sure if I am doing something wrong; I have to figure
>>>>>>>>>> out why my o/p is so blank.
>>>>>>>>>> I'll document these steps and mention country.txt in the wiki.
>>>>>>>>>>
>>>>>>>>>> Question to all:
>>>>>>>>>> Should we have 2 country.txt files?
>>>>>>>>>>
>>>>>>>>>>   1. country_full_list.txt - this is the existing list
>>>>>>>>>>   2. country_sample_list.txt - a list with 2 or 3 countries
>>>>>>>>>>
>>>>>>>>>> To get a flavor of the wikipedia bayes example, we can use
>>>>>>>>>> country_sample_list.txt. When new people want to just try out
>>>>>>>>>> the example, they can reference this txt file as a parameter.
>>>>>>>>>> To run the example on a robust, scalable infrastructure, we
>>>>>>>>>> could use country_full_list.txt.
>>>>>>>>>> Any thoughts?
>>>>>>>>>>
>>>>>>>>>> regards
>>>>>>>>>> Joe.
>>>>>>>>>>
>>>>>>>>>> On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <[email protected]> wrote:
>>>>>>>>>>> Gangadhar,
>>>>>>>>>>>
>>>>>>>>>>> After running TrainClassifier again, the map task just failed
>>>>>>>>>>> with the same exception, and I am pretty sure it is an issue
>>>>>>>>>>> with disk space.
>>>>>>>>>>> As the map was progressing, I was monitoring my free disk
>>>>>>>>>>> space dropping from 81GB. It came down to 0 after almost 66%
>>>>>>>>>>> of the map task, and then the exception happened. After the
>>>>>>>>>>> exception, another map task was resuming at 33% and I got
>>>>>>>>>>> close to 15GB free space (I guess the first map task freed up
>>>>>>>>>>> some space), and I am sure it would drop down to zero again
>>>>>>>>>>> and throw the same exception.
>>>>>>>>>>> I am going to modify country.txt to just 1 country, recreate
>>>>>>>>>>> wikipediainput, and run TrainClassifier. Will let you know how
>>>>>>>>>>> it goes.
>>>>>>>>>>>
>>>>>>>>>>> Do we have any benchmarks / system requirements for running
>>>>>>>>>>> this example?
>>>>>>>>>>> Has anyone else had success running this example? Would
>>>>>>>>>>> appreciate your inputs / thoughts.
>>>>>>>>>>>
>>>>>>>>>>> Should we look at tuning the code to handle these situations?
>>>>>>>>>>> Any quick suggestions on where to start looking?
>>>>>>>>>>>
>>>>>>>>>>> regards,
>>>>>>>>>>> Joe.
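[Editor's note] Since the disk-space theory keeps coming up in this thread, here is a trivial polling helper (plain `df`, nothing Mahout-specific; the function name is made up) so the drop to zero during the map phase can be observed rather than guessed:

```shell
# Report free kilobytes on the filesystem holding a given path
# (defaults to the current directory). -P forces the POSIX
# one-line-per-filesystem format so awk can rely on column positions.
free_space_kb() {
  df -Pk "${1:-.}" | awk 'NR == 2 { print $4 }'
}

# While TrainClassifier runs, poll e.g. the Hadoop tmp dir every 30s:
#   while true; do free_space_kb /tmp; sleep 30; done
free_space_kb
```

Watching this alongside the job's map progress would show whether the exception really coincides with free space hitting zero.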
