Hi guys, sorry about not replying. I see two possible problems. First, you
need at least 2 countries; otherwise there is nothing to classify. Second,
ngram = 3 is a bit too high. With Wikipedia this will result in a huge number
of features. Why don't you try with 1 and see?
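
To give a rough feel for the feature growth, here is a minimal Python sketch
(not Mahout code). It assumes that an n-gram size of N means every word n-gram
of length 1 through N becomes a feature; the helper name distinct_ngrams and
the sample sentence are made up for illustration.

```python
def distinct_ngrams(tokens, max_n):
    """Return the set of all word n-grams of length 1..max_n (assumed feature set)."""
    grams = set()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the sequence.
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams


# Hypothetical toy sentence standing in for article text.
text = "madrid is the capital of spain and lisbon is the capital of portugal"
tokens = text.split()

for max_n in (1, 2, 3):
    print("max n-gram size", max_n, "->", len(distinct_ngrams(tokens, max_n)), "features")
```

Even on one short sentence the feature count keeps climbing as N goes up; on
real Wikipedia text, longer n-grams repeat far less often than single words,
so the growth is much steeper.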

Robin

On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <[email protected]> wrote:

> Hi Ted,
>
> sure. will keep digging..
>
> About SGD, I don't have an idea of how it works. If there is some
> documentation / reference / quick summary to read about it, that would be
> great. I just saw one reference in
> https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
>
> I am assuming we should be able to create a model from wikipedia articles
> and label the country of a new article. If so, could you please provide a
> note on how to do this. We already have the wikipedia data being extracted
> for specific countries using WikipediaDatasetCreatorDriver. How do we go
> about training the classifier using SGD ?
>
> thanks for your help,
> Joe.
>
>
> On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <[email protected]>
> wrote:
>
> > I am watching these efforts with interest, but have been unable to
> > contribute much to the process.  I would encourage Joe and others to keep
> > whittling this problem down so that we can understand what is causing it.
> >
> > In the meantime, I think that the SGD classifiers are close to production
> > quality.  For problems with fewer than several million training examples,
> > and especially problems with many sparse features, I think that these
> > classifiers might be easier to get started with than the Naive Bayes
> > classifiers.  To make a virtue of a defect, the SGD-based classifiers do
> > not use Hadoop for training.  This makes deployment of a classification
> > training workflow easier, but limits the total size of data that can be
> > handled.
> >
> > What would you guys need to get started with trying these alternative
> > models?
> >
> > On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
> > <[email protected]> wrote:
> >
> > > Joe,
> > > > I also tried reducing the number of countries in country.txt.
> > > > That didn't help. In my case, I was monitoring the disk space and
> > > > at no time did free space reach 0%, so I am not sure that is the cause.
> > > > To remove the dependency on the number of countries, I even tried using
> > > > subjects.txt as the classification; that did not help either.
> > > I think this problem is due to the type of the data being processed,
> > > but what I am not sure of is what I need to change to get the data to
> > > be processed successfully.
> > >
> > > > The experienced folks on Mahout will be able to tell us what is missing,
> > > > I guess.
> > >
> > > Thank you
> > > Gangadhar
> > >
> > > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <[email protected]> wrote:
> > > > Gangadhar,
> > > >
> > > > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to have
> > > > > just 1 entry (spain), used WikipediaDatasetCreatorDriver to create the
> > > > > wikipediainput data set, and then ran TrainClassifier, and it worked.
> > > > > When I ran TestClassifier as below, I got blank results in the output.
> > > >
> > > > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> > > > > org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel -d
> > > > >  wikipediainput  -ng 3 -type bayes -source hdfs
> > > >
> > > > Summary
> > > > -------------------------------------------------------
> > > > Correctly Classified Instances          :          0         ?%
> > > > Incorrectly Classified Instances        :          0         ?%
> > > > Total Classified Instances              :          0
> > > >
> > > > =======================================================
> > > > Confusion Matrix
> > > > -------------------------------------------------------
> > > > a     <--Classified as
> > > > 0     |  0     a     = spain
> > > > Default Category: unknown: 1
> > > >
> > > > > I am not sure if I am doing something wrong. I have to figure out why my
> > > > > output is so blank.
> > > > > I'll document these steps and mention country.txt in the wiki.
> > > >
> > > > > Question to all:
> > > > > Should we have 2 country.txt files?
> > > >
> > > >   1. country_full_list.txt - this is the existing list
> > > >   2. country_sample_list.txt - a list with 2 or 3 countries
> > > >
> > > > > To get a flavor of the wikipedia bayes example, we can use
> > > > > country_sample_list.txt. When new people want to just try out the
> > > > > example, they can reference this txt file as a parameter.
> > > > > To run the example in a robust scalable infrastructure, we could use
> > > > > country_full_list.txt.
> > > > > Any thoughts?
> > > >
> > > > regards
> > > > Joe.
> > > >
> > > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <[email protected]> wrote:
> > > >
> > > >> Gangadhar,
> > > >>
> > > > >> After running TrainClassifier again, the map task just failed with the
> > > > >> same exception and I am pretty sure it is an issue with disk space.
> > > > >> As the map was progressing, I was monitoring my free disk space dropping
> > > > >> from 81GB. It came down to 0 almost 66% of the way through the map task,
> > > > >> and then the exception happened. After the exception, another map task
> > > > >> resumed at 33% and I got close to 15GB of free space back (I guess the
> > > > >> first map task freed up some space), but I am sure it would drop down to
> > > > >> zero again and throw the same exception.
> > > > >> I am going to modify country.txt to just 1 country, recreate
> > > > >> wikipediainput, and run TrainClassifier. Will let you know how it goes.
> > > >>
> > > > >> Do we have any benchmarks / system requirements for running this
> > > > >> example? Has anyone else had success running this example? I would
> > > > >> appreciate your input / thoughts.
> > > > >>
> > > > >> Should we look at tuning the code to handle these situations? Any quick
> > > > >> suggestions on where to start looking?
> > > >>
> > > >> regards,
> > > >> Joe.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >
> > >
> >
>
