Robin / Gangadhar,

With ngram set to 1 and all the countries in country.txt, the model is created without any issues:

$MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 -i wikipediainput -o wikipediamodel -type bayes -source hdfs
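A side note on the -ng setting: Robin's remark later in the thread is that ngram = 3 over Wikipedia produces a huge number of features. The toy sketch below (plain Java, not Mahout code; the class name NGramDemo is made up for illustration) shows what word n-grams look like. Across a real corpus the number of distinct n-grams grows much faster with n than the number of distinct words, which is why -ng 1 is far cheaper than -ng 3.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration (not Mahout code): extract contiguous word n-grams
// from a token list. Over a large corpus like Wikipedia, the number of
// DISTINCT n-grams explodes as n grows, so -ng 3 yields a huge feature
// space while -ng 1 stays manageable.
public class NGramDemo {
    // Emit all contiguous n-grams of order n from the token list.
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("spain", "is", "a", "country", "in", "europe");
        for (int n = 1; n <= 3; n++) {
            System.out.println(n + "-grams: " + ngrams(tokens, n));
        }
    }
}
```

For a single sentence the n-gram counts look tame; the blow-up only appears when you accumulate distinct n-grams over millions of Wikipedia sentences.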
Robin,

For the ngram parameter, the default value is documented as 1, but it is set as a mandatory parameter in TrainClassifier, so I'll modify the code to default ngram to 1 and make it a non-mandatory param.

That aside, when I try to test the model, the summary is printed like below:

Summary
-------------------------------------------------------
Correctly Classified Instances : 0 ?%
Incorrectly Classified Instances : 0 ?%
Total Classified Instances : 0

Need to figure out the reason. Since TestClassifier has the same params and settings as TrainClassifier, can I modify it to set default values for ngram, classifierType & dataSource as well?

reg,
Joe.

On Mon, Sep 20, 2010 at 1:09 PM, Joe Kumar <[email protected]> wrote:
> Robin,
>
> Thanks for your tip.
> Will try it out and post updates.
>
> reg
> Joe.
>
> On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <[email protected]> wrote:
>
>> Hi Guys, sorry about not replying. I see two possible problems. First, you
>> need at least 2 countries; otherwise there is no classification. Second,
>> ngram = 3 is a bit too high: with Wikipedia this will result in a huge
>> number of features. Why don't you try with one and see?
>>
>> Robin
>>
>> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <[email protected]> wrote:
>>
>>> Hi Ted,
>>>
>>> Sure, will keep digging.
>>>
>>> About SGD, I don't have an idea about how it works at all. If there is
>>> some documentation / reference / quick summary to read about it, that
>>> would be great. I just saw one reference at
>>> https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
>>>
>>> I am assuming we should be able to create a model from Wikipedia
>>> articles and label the country of a new article. If so, could you please
>>> provide a note on how to do this. We already have the Wikipedia data
>>> being extracted for specific countries using WikipediaDatasetCreatorDriver.
>>> How do we go about training the classifier using SGD?
>>>
>>> Thanks for your help,
>>> Joe.
>>>
>>> On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <[email protected]> wrote:
>>>
>>>> I am watching these efforts with interest, but have been unable to
>>>> contribute much to the process. I would encourage Joe and others to
>>>> keep whittling this problem down so that we can understand what is
>>>> causing it.
>>>>
>>>> In the meantime, I think that the SGD classifiers are close to
>>>> production quality. For problems with fewer than several million
>>>> training examples, and especially problems with many sparse features,
>>>> I think that these classifiers might be easier to get started with
>>>> than the Naive Bayes classifiers. To make a virtue of a defect, the
>>>> SGD-based classifiers do not use Hadoop for training. This makes
>>>> deployment of a classification training workflow easier, but limits
>>>> the total size of data that can be handled.
>>>>
>>>> What would you guys need to get started with trying these alternative
>>>> models?
>>>>
>>>> On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
>>>> <[email protected]> wrote:
>>>>
>>>>> Joe,
>>>>> I also tried reducing the number of countries in country.txt.
>>>>> That didn't help, and in my case I was monitoring the disk space and
>>>>> at no time did it reach 0%, so I am not sure that is the cause. To
>>>>> remove the dependency on the number of countries, I even tried with
>>>>> subjects.txt as the classification; that also did not help.
>>>>> I think this problem is due to the type of the data being processed,
>>>>> but what I am not sure of is what I need to change to get the data
>>>>> to be processed successfully.
>>>>>
>>>>> The experienced folks on Mahout will be able to tell us what is
>>>>> missing, I guess.
>>>>>
>>>>> Thank you
>>>>> Gangadhar
>>>>>
>>>>> On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <[email protected]> wrote:
>>>>>> Gangadhar,
>>>>>>
>>>>>> I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to
>>>>>> just have 1 entry (spain), used WikipediaDatasetCreatorDriver to
>>>>>> create the wikipediainput data set, and then ran TrainClassifier,
>>>>>> and it worked. When I ran TestClassifier as below, I got blank
>>>>>> results in the output:
>>>>>>
>>>>>> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>>>>>> org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel
>>>>>> -d wikipediainput -ng 3 -type bayes -source hdfs
>>>>>>
>>>>>> Summary
>>>>>> -------------------------------------------------------
>>>>>> Correctly Classified Instances : 0 ?%
>>>>>> Incorrectly Classified Instances : 0 ?%
>>>>>> Total Classified Instances : 0
>>>>>>
>>>>>> =======================================================
>>>>>> Confusion Matrix
>>>>>> -------------------------------------------------------
>>>>>> a <--Classified as
>>>>>> 0 | 0 a = spain
>>>>>> Default Category: unknown: 1
>>>>>>
>>>>>> I am not sure if I am doing something wrong; I have to figure out
>>>>>> why my output is so blank.
>>>>>> I'll document these steps and mention country.txt in the wiki.
>>>>>>
>>>>>> Question to all:
>>>>>> Should we have 2 country.txt files?
>>>>>>
>>>>>> 1. country_full_list.txt - the existing list
>>>>>> 2. country_sample_list.txt - a list with 2 or 3 countries
>>>>>>
>>>>>> To get a flavor of the wikipedia bayes example, we can use
>>>>>> country_sample_list.txt. When new people want to just try out the
>>>>>> example, they can reference this txt file as a parameter.
>>>>>> To run the example on a robust, scalable infrastructure, we could
>>>>>> use country_full_list.txt.
>>>>>> Any thoughts?
>>>>>>
>>>>>> regards
>>>>>> Joe.
>>>>>>
>>>>>> On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <[email protected]> wrote:
>>>>>>
>>>>>>> Gangadhar,
>>>>>>>
>>>>>>> After running TrainClassifier again, the map task just failed with
>>>>>>> the same exception, and I am pretty sure it is an issue with disk
>>>>>>> space.
>>>>>>> As the map was progressing, I was monitoring my free disk space
>>>>>>> dropping from 81GB. It came down to 0 after almost 66% of the map
>>>>>>> task, and then the exception happened. After the exception, another
>>>>>>> map task was resuming at 33% and I got close to 15GB free space (I
>>>>>>> guess the first map task freed up some space), and I am sure it
>>>>>>> would drop down to zero again and throw the same exception.
>>>>>>> I am going to modify country.txt to just 1 country, recreate
>>>>>>> wikipediainput, and run TrainClassifier. Will let you know how it
>>>>>>> goes.
>>>>>>>
>>>>>>> Do we have any benchmarks / system requirements for running this
>>>>>>> example? Has anyone else had success running this example? Would
>>>>>>> appreciate your inputs / thoughts.
>>>>>>>
>>>>>>> Should we look at tuning the code to handle these situations? Any
>>>>>>> quick suggestions on where to start looking?
>>>>>>>
>>>>>>> regards,
>>>>>>> Joe.
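On the "?%" in the summaries above: with Total Classified Instances at 0, the accuracy ratio is 0/0, so there is no number to print. A minimal sketch of that arithmetic (plain Java; this is not Mahout's actual result-summary code, just an illustration of why the percentage shows up as "?" when nothing was classified):

```java
// Minimal sketch (not Mahout code): when zero documents are classified,
// correct/total is 0/0, so the percentage is undefined and is rendered
// as "?" rather than a number.
public class SummaryDemo {
    static String percent(long correct, long total) {
        if (total == 0) {
            return "?%"; // nothing classified: 0/0 is undefined
        }
        return String.format("%.4f%%", 100.0 * correct / total);
    }

    public static void main(String[] args) {
        System.out.println("Correctly Classified Instances : 0 " + percent(0, 0));
        System.out.println("Correctly Classified Instances : 75 " + percent(75, 100));
    }
}
```

So the "?%" itself is just a symptom: the real question is why the test run classified zero instances in the first place.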
