Joe / others,

I was finally able to test the changes made as part of
MAHOUT-509 [https://issues.apache.org/jira/browse/MAHOUT-509] and
follow the wiki instructions for the Bayes example
[https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example].
The instructions in the wiki work only if testclassifier.props has
values for the required options; otherwise, the user needs to provide
the values for the data source, classifier type and n-gram size on
the command line. TestClassifier executed and printed a large matrix
of values (though I still don't know how to interpret the results :) )
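In case it helps others following along, here is roughly what a
filled-in testclassifier.props could look like. The key names are my
guess, mirroring the command-line flags used elsewhere in this
thread; the actual file in the patch may use different names:

```properties
# Hypothetical testclassifier.props entries (key names illustrative,
# mirroring the -ng / -type / -source command-line flags).
# Note: no leading or trailing spaces around the values.
gramSize=1
classifierType=bayes
dataSource=hdfs
```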

Also, I found a minor problem in TestClassifier.java, where there is
an Integer.parseInt on a command-line option that is read in. If
there are any leading or trailing spaces in testclassifier.props,
this results in a NumberFormatException. The attached patch trims
the string before calling parseInt.
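To illustrate the failure mode (a minimal, self-contained sketch,
not the actual Mahout code - the property name is made up):
java.util.Properties keeps trailing whitespace in values, so an
entry like "gramSize=1 " survives as the string "1 " and breaks
Integer.parseInt unless it is trimmed first:

```java
import java.io.StringReader;
import java.util.Properties;

public class TrimBeforeParse {
    public static void main(String[] args) throws Exception {
        // Properties.load skips whitespace before a value but keeps
        // trailing whitespace, so "gramSize=1 " yields the string "1 ".
        Properties props = new Properties();
        props.load(new StringReader("gramSize=1 \n"));
        String raw = props.getProperty("gramSize");

        try {
            Integer.parseInt(raw); // throws NumberFormatException
        } catch (NumberFormatException e) {
            System.out.println("without trim: NumberFormatException");
        }

        // The fix: trim before parsing.
        System.out.println("with trim: " + Integer.parseInt(raw.trim()));
    }
}
```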

The attached patch contains the modified testclassifier.props and
the parseInt fix. I think both belong to MAHOUT-509. If you think
the wiki should instead document the parameters on the command line
rather than rely on settings in a .props file (preferring clarity
for the user over ease of use), then I can modify the wiki
instructions and remove the .props file from the patch.

The fix in TestClassifier.java, though, I think is required - it
sanitizes the user input.

I am not sure what the preferred approach is for providing patches
for a resolved issue. Should I create a new issue just for this, or
would it be easier to add this patch to the existing issue itself?
Please let me know; if a new issue is preferred, I shall create one
and attach the modified patch file to it.

Thank you
Gangadhar
P.S.: I named the patch file with an underscore, as the existing
issue already has a MAHOUT-509.patch.

On Sun, Sep 26, 2010 at 9:28 AM, Gangadhar Nittala
<[email protected]> wrote:
> Joe,
> I am out of town for this week and won't have access to my machine. I
> will check this during the weekend and will get back to you. Will
> follow the steps in the wiki.
>
> Thank you
>
> On Fri, Sep 24, 2010 at 8:44 AM, Joe Kumar <[email protected]> wrote:
>> Hi Gangadhar,
>>
>> I ran TestClassifier with similar parameters. It didn't take me 2 hours, though.
>>
>> I have documented the steps that worked for me at
>> https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example
>> Can you please get the patch available at MAHOUT-509 and apply it and then
>> try the steps in the wiki.
>> Please let me know if you still face issues.
>>
>> reg
>> Joe.
>>
>>
>> On Thu, Sep 23, 2010 at 10:43 PM, Gangadhar Nittala <[email protected]
>>> wrote:
>>
>>> Joe,
>>> Can you let me know what was the command you used to test the
>>> classifier ? With the ngrams set to 1 as suggested by Robin, I was
>>> able to train the classifier. The command:
>>> $HADOOP_HOME/bin/hadoop jar
>>> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>>> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 1
>>> --input wikipediainput10 --output wikipediamodel10 --classifierType
>>> bayes --dataSource hdfs
>>>
>>> After this, as per the wiki, we need to get the data from HDFS. I did that
>>> <HADOOP_HOME>/bin/hadoop dfs -get wikipediainput10 wikipediainput10
>>>
>>> After this, the classifier is to be tested:
>>> $HADOOP_HOME/bin/hadoop jar
>>> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>>> org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel10
>>> -d wikipediainput10  -ng 1 -type bayes -source hdfs
>>>
>>> When I run this, this runs for close to 2 hours and after 2 hours, it
>>> errors out with a java.io.FileException saying that the logs_ is a
>>> directory in the wikipediainput10 folder. I am sorry I can't provide
>>> the stack trace right now because I accidentally closed the terminal
>>> window before I could copy it. I will run this again and send the
>>> stack trace.
>>>
>>> But, if you can send me the steps that you followed after running the
>>> classifier, I can repeat those and see if I am able to successfully
>>> execute the classifier.
>>>
>>> Thank you
>>> Gangadhar
>>>
>>>
>>> On Mon, Sep 20, 2010 at 11:13 PM, Gangadhar Nittala
>>> <[email protected]> wrote:
>>> > Joe,
>>> > I will try with the ngram setting of 1 and let you know how it goes.
>>> > Robin, the ngram parameter is used to check the number of subsequences
>>> > of characters isn't it ? Or is it evaluated differently w.r.t to the
>>> > Bayesian classifier ?
>>> >
>>> > Ted, like Joe mentioned, if you could point us to some information on
>>> > SGD we could try it and report back the results to the list.
>>> >
>>> > Thank you
>>> > Gangadhar
>>> >
>>> > On Mon, Sep 20, 2010 at 10:30 PM, Joe Kumar <[email protected]> wrote:
>>> >> Robin / Gangadhar,
>>> >> With ngram as 1 and all the countries in the country.txt , the model is
>>> >> getting created without any issues.
>>> >> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>>> >> org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 -i
>>> wikipediainput
>>> >> -o wikipediamodel -type bayes -source hdfs
>>> >>
>>> >> Robin,
>>> >> Even for ngram parameter, the default value is mentioned as 1 but it is
>>> set
>>> >> as a mandatory parameter in TrainClassifier. so i'll modify the code to
>>> set
>>> >> the default ngram as 1 and make it as a non mandatory param.
>>> >>
>>> >> That aside, When I try to test the model, the summary is getting printed
>>> >> like below.
>>> >> Summary
>>> >> -------------------------------------------------------
>>> >> Correctly Classified Instances          :          0         ?%
>>> >> Incorrectly Classified Instances        :          0         ?%
>>> >> Total Classified Instances              :          0
>>> >> Need to figure out the reason..
>>> >>
>>> >> Since TestClassifier also has the same params and settings like
>>> >> TrainClassifier, can i modify it to set the default values for ngram,
>>> >> classifierType & dataSource ?
>>> >>
>>> >> reg,
>>> >> Joe.
>>> >>
>>> >> On Mon, Sep 20, 2010 at 1:09 PM, Joe Kumar <[email protected]> wrote:
>>> >>
>>> >>> Robin,
>>> >>>
>>> >>> Thanks for your tip.
>>> >>> Will try it out and post updates.
>>> >>>
>>> >>> reg
>>> >>> Joe.
>>> >>>
>>> >>>
>>> >>> On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <[email protected]>
>>> wrote:
>>> >>>
>>> >>>> Hi Guys, Sorry about not replying, I see two problems(possible). 1st.
>>> You
>>> >>>> need atleast 2 countries. otherwise there is no classification.
>>> Secondly
>>> >>>> ngram =3 is a bit too high. With wikipedia this will result in a huge
>>> >>>> number
>>> >>>> of features. Why dont you try with one and see.
>>> >>>>
>>> >>>> Robin
>>> >>>>
>>> >>>> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <[email protected]>
>>> wrote:
>>> >>>>
>>> >>>> > Hi Ted,
>>> >>>> >
>>> >>>> > sure. will keep digging..
>>> >>>> >
>>> >>>> > About SGD, I dont have an idea about how it works et al. If there is
>>> >>>> some
>>> >>>> > documentation / reference / quick summary to read about it that'll
>>> be
>>> >>>> gr8.
>>> >>>> > Just saw one reference in
>>> >>>> >
>>> https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
>>> >>>> >
>>> >>>> > I am assuming we should be able to create a model from wikipedia
>>> >>>> articles
>>> >>>> > and label the country of a new article. If so, could you please
>>> provide
>>> >>>> a
>>> >>>> > note on how to do this. We already have the wikipedia data being
>>> >>>> extracted
>>> >>>> > for specific countries using WikipediaDatasetCreatorDriver. How do
>>> we go
>>> >>>> > about training the classifier using SGD ?
>>> >>>> >
>>> >>>> > thanks for your help,
>>> >>>> > Joe.
>>> >>>> >
>>> >>>> >
>>> >>>> > On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <
>>> [email protected]>
>>> >>>> > wrote:
>>> >>>> >
>>> >>>> > > I am watching these efforts with interest, but have been unable to
>>> >>>> > > contribute much to the process.  I would encourage Joe and others
>>> to
>>> >>>> keep
>>> >>>> > > whittling this problem down so that we can understand what is
>>> causing
>>> >>>> it.
>>> >>>> > >
>>> >>>> > > In the meantime, I think that the SGD classifiers are close to
>>> >>>> production
>>> >>>> > > quality.  For problems with less than several million training
>>> >>>> examples,
>>> >>>> > > and
>>> >>>> > > especially problems with many sparse features, I think that these
>>> >>>> > > classifiers might be easier to get started with than the Naive
>>> Bayes
>>> >>>> > > classifiers.  To make a virtue of a defect, the SGD based
>>> classifiers
>>> >>>> to
>>> >>>> > > not
>>> >>>> > > use Hadoop for training.  This makes deployment of a
>>> classification
>>> >>>> > > training
>>> >>>> > > workflow easier, but limits the total size of data that can be
>>> >>>> handled.
>>> >>>> > >
>>> >>>> > > What would you guys need to get started with trying these
>>> alternative
>>> >>>> > > models?
>>> >>>> > >
>>> >>>> > > On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
>>> >>>> > > <[email protected]>wrote:
>>> >>>> > >
>>> >>>> > > > Joe,
>>> >>>> > > > Even I tried with reducing the number of countries in the
>>> >>>> country.txt.
>>> >>>> > > > That didn't help. And in my case, I was monitoring the disk
>>> space
>>> >>>> and
>>> >>>> > > > at no time did it reach 0%. So, I am not sure if that is the
>>> case.
>>> >>>> To
>>> >>>> > > > remove the dependency on the number of countries, I even tried
>>> with
>>> >>>> > > > the subjects.txt as the classification - that also did not help.
>>> >>>> > > > I think this problem is due to the type of the data being
>>> processed,
>>> >>>> > > > but what I am not sure of is what I need to change to get the
>>> data
>>> >>>> to
>>> >>>> > > > be processed successfully.
>>> >>>> > > >
>>> >>>> > > > The experienced folks on Mahout will be able to tell us what is
>>> >>>> missing
>>> >>>> > I
>>> >>>> > > > guess.
>>> >>>> > > >
>>> >>>> > > > Thank you
>>> >>>> > > > Gangadhar
>>> >>>> > > >
>>> >>>> > > > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <[email protected]>
>>> >>>> wrote:
>>> >>>> > > > > Gangadhar,
>>> >>>> > > > >
>>> >>>> > > > > I modified
>>> $MAHOUT_HOME/examples/src/test/resources/country.txt to
>>> >>>> > just
>>> >>>> > > > have
>>> >>>> > > > > 1 entry (spain) and used WikipediaDatasetCreatorDriver to
>>> create
>>> >>>> the
>>> >>>> > > > > wikipediainput data set and then ran TrainClassifier and it
>>> >>>> worked.
>>> >>>> > > when
>>> >>>> > > > I
>>> >>>> > > > > ran TestClassifier as below, I got blank results in the
>>> output.
>>> >>>> > > > >
>>> >>>> > > > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>>> >>>> > > > > org.apache.mahout.classifier.bayes.TestClassifier -m
>>> >>>> wikipediamodel
>>> >>>> > -d
>>> >>>> > > > >  wikipediainput  -ng 3 -type bayes -source hdfs
>>> >>>> > > > >
>>> >>>> > > > > Summary
>>> >>>> > > > > -------------------------------------------------------
>>> >>>> > > > > Correctly Classified Instances          :          0
>>> ?%
>>> >>>> > > > > Incorrectly Classified Instances        :          0
>>> ?%
>>> >>>> > > > > Total Classified Instances              :          0
>>> >>>> > > > >
>>> >>>> > > > > =======================================================
>>> >>>> > > > > Confusion Matrix
>>> >>>> > > > > -------------------------------------------------------
>>> >>>> > > > > a     <--Classified as
>>> >>>> > > > > 0     |  0     a     = spain
>>> >>>> > > > > Default Category: unknown: 1
>>> >>>> > > > >
>>> >>>> > > > > I am not sure if I am doing something wrong.. have to figure
>>> out
>>> >>>> why
>>> >>>> > my
>>> >>>> > > > o/p
>>> >>>> > > > > is so blank.
>>> >>>> > > > > I'll document these steps and mention about country.txt in the
>>> >>>> wiki.
>>> >>>> > > > >
>>> >>>> > > > > Question to all
>>> >>>> > > > > Should we have 2 country.txt
>>> >>>> > > > >
>>> >>>> > > > >   1. country_full_list.txt - this is the existing list
>>> >>>> > > > >   2. country_sample_list.txt - a list with 2 or 3 countries
>>> >>>> > > > >
>>> >>>> > > > > To get a flavor of the wikipedia bayes example, we can use
>>> >>>> > > > > country_sample.txt. When new people want to just try out the
>>> >>>> example,
>>> >>>> > > > they
>>> >>>> > > > > can reference this txt file  as a parameter.
>>> >>>> > > > > To run the example in a robust scalable infrastructure, we
>>> could
>>> >>>> use
>>> >>>> > > > > country_full_list.txt.
>>> >>>> > > > > any thots ?
>>> >>>> > > > >
>>> >>>> > > > > regards
>>> >>>> > > > > Joe.
>>> >>>> > > > >
>>> >>>> > > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <
>>> [email protected]>
>>> >>>> > wrote:
>>> >>>> > > > >
>>> >>>> > > > >> Gangadhar,
>>> >>>> > > > >>
>>> >>>> > > > >> After running TrainClassifier again, the map task just failed
>>> >>>> with
>>> >>>> > the
>>> >>>> > > > same
>>> >>>> > > > >> exception and I am pretty sure it is an issue with disk
>>> space.
>>> >>>> > > > >> As the map was progressing, I was monitoring my free disk
>>> space
>>> >>>> > > dropping
>>> >>>> > > > >> from 81GB. It came down to 0 after almost 66% through the map
>>> >>>> task
>>> >>>> > and
>>> >>>> > > > then
>>> >>>> > > > >> the exception happened. After the exception, another map task
>>> was
>>> >>>> > > > resuming
>>> >>>> > > > >> at 33% and I got close to 15GB free space (i guess the first
>>> map
>>> >>>> > task
>>> >>>> > > > freed
>>> >>>> > > > >> up some space) and I am sure they would drop down to zero
>>> again
>>> >>>> and
>>> >>>> > > > throw
>>> >>>> > > > >> the same exception.
>>> >>>> > > > >> I am going to modify the country.txt to just 1 country and
>>> >>>> recreate
>>> >>>> > > > >> wikipediainput and run TrainClassifier. Will let you know how
>>> it
>>> >>>> > > goes..
>>> >>>> > > > >>
>>> >>>> > > > >> Do we have any benchmarks / system requirements for running
>>> this
>>> >>>> > > example
>>> >>>> > > > ?
>>> >>>> > > > >> Has anyone else had success running this example anytime.
>>> Would
>>> >>>> > > > appreciate
>>> >>>> > > > >> your inputs / thots.
>>> >>>> > > > >>
>>> >>>> > > > >> Should we look at tuning the code for handling these
>>> situations ?
>>> >>>> > Any
>>> >>>> > > > quick
>>> >>>> > > > >> suggestions on where to start looking at ?
>>> >>>> > > > >>
>>> >>>> > > > >> regards,
>>> >>>> > > > >> Joe.
>>> >>>> > > > >>
>>> >>>> > > > >>
>>> >>>> > > > >>
>>> >>>> > > > >>
>>> >>>> > > > >
>>> >>>> > > >
>>> >>>> > >
>>> >>>> >
>>> >>>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>
>>> >
>>>
>>
>
