Hi Gangadhar,

I ran TestClassifier with similar parameters. It didn't take me 2 hours, though.
I have documented the steps that worked for me at
https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example

Could you please apply the patch available at MAHOUT-509 and then try the
steps in the wiki? Please let me know if you still face issues.

reg,
Joe.

On Thu, Sep 23, 2010 at 10:43 PM, Gangadhar Nittala <[email protected]> wrote:
> Joe,
> Can you let me know what was the command you used to test the
> classifier? With the ngrams set to 1 as suggested by Robin, I was
> able to train the classifier. The command:
> $HADOOP_HOME/bin/hadoop jar
> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 1
> --input wikipediainput10 --output wikipediamodel10 --classifierType
> bayes --dataSource hdfs
>
> After this, as per the wiki, we need to get the data from HDFS. I did that:
> <HADOOP_HOME>/bin/hadoop dfs -get wikipediainput10 wikipediainput10
>
> After this, the classifier is to be tested:
> $HADOOP_HOME/bin/hadoop jar
> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel10
> -d wikipediainput10 -ng 1 -type bayes -source hdfs
>
> When I run this, it runs for close to 2 hours and then errors out with
> a java.io.FileException saying that logs_ is a directory in the
> wikipediainput10 folder. I am sorry I can't provide the stack trace
> right now because I accidentally closed the terminal window before I
> could copy it. I will run this again and send the stack trace.
>
> But if you can send me the steps that you followed after running the
> classifier, I can repeat those and see if I am able to successfully
> execute the classifier.
>
> Thank you
> Gangadhar
>
> On Mon, Sep 20, 2010 at 11:13 PM, Gangadhar Nittala
> <[email protected]> wrote:
> > Joe,
> > I will try with the ngram setting of 1 and let you know how it goes.
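The java.io.FileException Gangadhar reports is consistent with TestClassifier opening every entry under wikipediainput10 as a file and tripping over a subdirectory such as logs_. Since the stack trace wasn't captured, that cause is an assumption; the sketch below only reproduces the failure mode in miniature in Python, together with the defensive filter that avoids it:

```python
import os
import tempfile

def read_all_entries(input_dir):
    """Naive reader: opens every directory entry as a file.
    Raises IsADirectoryError when an entry is a subdirectory
    (analogous to the reported exception on the logs_ directory)."""
    texts = []
    for name in sorted(os.listdir(input_dir)):
        with open(os.path.join(input_dir, name)) as f:  # fails on a subdir
            texts.append(f.read())
    return texts

def read_plain_files(input_dir):
    """Defensive reader: skips subdirectories, reading only plain files."""
    texts = []
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        if os.path.isfile(path):
            with open(path) as f:
                texts.append(f.read())
    return texts

# Simulate a copied job output directory: one data file plus a logs_ subdir.
d = tempfile.mkdtemp()
with open(os.path.join(d, "part-00000"), "w") as f:
    f.write("some data")
os.mkdir(os.path.join(d, "logs_"))

try:
    read_all_entries(d)
    failed = False
except (IsADirectoryError, PermissionError):  # PermissionError on Windows
    failed = True

print(failed)               # the naive reader blows up on the subdirectory
print(read_plain_files(d))  # the filtered reader succeeds
```

If this is indeed the cause, removing the subdirectory from the local copy (or skipping directories when listing the input) would let TestClassifier proceed.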
> > Robin, the ngram parameter is used to count the number of subsequences
> > of characters, isn't it? Or is it evaluated differently w.r.t. the
> > Bayesian classifier?
> >
> > Ted, like Joe mentioned, if you could point us to some information on
> > SGD we could try it and report back the results to the list.
> >
> > Thank you
> > Gangadhar
> >
> > On Mon, Sep 20, 2010 at 10:30 PM, Joe Kumar <[email protected]> wrote:
> >> Robin / Gangadhar,
> >> With ngram as 1 and all the countries in country.txt, the model is
> >> getting created without any issues:
> >> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> >> org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 -i wikipediainput
> >> -o wikipediamodel -type bayes -source hdfs
> >>
> >> Robin,
> >> For the ngram parameter, the default value is documented as 1 but it is
> >> set as a mandatory parameter in TrainClassifier, so I'll modify the code
> >> to set the default ngram to 1 and make it a non-mandatory param.
> >>
> >> That aside, when I try to test the model, the summary is getting printed
> >> like below:
> >> Summary
> >> -------------------------------------------------------
> >> Correctly Classified Instances   : 0   ?%
> >> Incorrectly Classified Instances : 0   ?%
> >> Total Classified Instances       : 0
> >> Need to figure out the reason..
> >>
> >> Since TestClassifier also has the same params and settings as
> >> TrainClassifier, can I modify it to set the default values for ngram,
> >> classifierType & dataSource?
> >>
> >> reg,
> >> Joe.
> >>
> >> On Mon, Sep 20, 2010 at 1:09 PM, Joe Kumar <[email protected]> wrote:
> >>
> >>> Robin,
> >>>
> >>> Thanks for your tip.
> >>> Will try it out and post updates.
> >>>
> >>> reg,
> >>> Joe.
> >>>
> >>> On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <[email protected]> wrote:
> >>>
> >>>> Hi Guys, sorry about not replying. I see two possible problems. First, you
> >>>> need at least 2 countries,
> >>>> otherwise there is no classification. Secondly,
> >>>> ngram = 3 is a bit too high. With wikipedia this will result in a huge
> >>>> number of features. Why don't you try with one and see?
> >>>>
> >>>> Robin
> >>>>
> >>>> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <[email protected]> wrote:
> >>>>
> >>>> > Hi Ted,
> >>>> >
> >>>> > sure, will keep digging..
> >>>> >
> >>>> > About SGD, I don't have an idea about how it works et al. If there is
> >>>> > some documentation / reference / quick summary to read about it,
> >>>> > that'll be great.
> >>>> > Just saw one reference in
> >>>> > https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
> >>>> >
> >>>> > I am assuming we should be able to create a model from wikipedia
> >>>> > articles and label the country of a new article. If so, could you
> >>>> > please provide a note on how to do this. We already have the wikipedia
> >>>> > data being extracted for specific countries using
> >>>> > WikipediaDatasetCreatorDriver. How do we go about training the
> >>>> > classifier using SGD?
> >>>> >
> >>>> > thanks for your help,
> >>>> > Joe.
> >>>> >
> >>>> > On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <[email protected]> wrote:
> >>>> >
> >>>> > > I am watching these efforts with interest, but have been unable to
> >>>> > > contribute much to the process. I would encourage Joe and others to
> >>>> > > keep whittling this problem down so that we can understand what is
> >>>> > > causing it.
> >>>> > >
> >>>> > > In the meantime, I think that the SGD classifiers are close to
> >>>> > > production quality. For problems with fewer than several million
> >>>> > > training examples, and especially problems with many sparse features,
> >>>> > > I think that these classifiers might be easier to get started with
> >>>> > > than the Naive Bayes classifiers.
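Robin's warning about ngram = 3 is easiest to see by counting features directly. In the Bayes example the gram size controls word n-grams (token sequences) rather than character subsequences; the sketch below is illustrative Python, not Mahout's actual tokenization, and assumes a gram size of n emits all grams of length 1 through n:

```python
def word_ngrams(tokens, n):
    """All word n-grams of length 1..n - roughly what a gram size of n implies."""
    grams = []
    for k in range(1, n + 1):
        for i in range(len(tokens) - k + 1):
            grams.append(tuple(tokens[i:i + k]))
    return grams

tokens = "madrid is the capital of spain".split()

unigrams = word_ngrams(tokens, 1)
trigrams = word_ngrams(tokens, 3)

print(len(unigrams))  # 6 tokens -> 6 unigram features
print(len(trigrams))  # 6 + 5 + 4 = 15 features from the same sentence
```

Across millions of Wikipedia sentences the 2- and 3-grams are also far less likely to repeat than single words, so the feature dictionary grows much faster than the unigram one, which fits the disk-space blowups seen in this thread.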
> >>>> > > To make a virtue of a defect, the SGD based classifiers do not
> >>>> > > use Hadoop for training. This makes deployment of a classification
> >>>> > > training workflow easier, but limits the total size of data that
> >>>> > > can be handled.
> >>>> > >
> >>>> > > What would you guys need to get started with trying these
> >>>> > > alternative models?
> >>>> > >
> >>>> > > On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
> >>>> > > <[email protected]> wrote:
> >>>> > >
> >>>> > > > Joe,
> >>>> > > > Even I tried reducing the number of countries in country.txt.
> >>>> > > > That didn't help. And in my case, I was monitoring the disk space
> >>>> > > > and at no time did it reach 0%, so I am not sure if that is the
> >>>> > > > cause. To remove the dependency on the number of countries, I even
> >>>> > > > tried with subjects.txt as the classification - that also did not
> >>>> > > > help.
> >>>> > > > I think this problem is due to the type of the data being
> >>>> > > > processed, but what I am not sure of is what I need to change to
> >>>> > > > get the data to be processed successfully.
> >>>> > > >
> >>>> > > > The experienced folks on Mahout will be able to tell us what is
> >>>> > > > missing, I guess.
> >>>> > > >
> >>>> > > > Thank you
> >>>> > > > Gangadhar
> >>>> > > >
> >>>> > > > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <[email protected]> wrote:
> >>>> > > > > Gangadhar,
> >>>> > > > >
> >>>> > > > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt
> >>>> > > > > to have just 1 entry (spain) and used
> >>>> > > > > WikipediaDatasetCreatorDriver to create the wikipediainput data
> >>>> > > > > set and then ran TrainClassifier and it worked. When I ran
> >>>> > > > > TestClassifier as below, I got blank results in the output.
> >>>> > > > >
> >>>> > > > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> >>>> > > > > org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel
> >>>> > > > > -d wikipediainput -ng 3 -type bayes -source hdfs
> >>>> > > > >
> >>>> > > > > Summary
> >>>> > > > > -------------------------------------------------------
> >>>> > > > > Correctly Classified Instances   : 0   ?%
> >>>> > > > > Incorrectly Classified Instances : 0   ?%
> >>>> > > > > Total Classified Instances       : 0
> >>>> > > > >
> >>>> > > > > =======================================================
> >>>> > > > > Confusion Matrix
> >>>> > > > > -------------------------------------------------------
> >>>> > > > > a    <--Classified as
> >>>> > > > > 0 | 0    a = spain
> >>>> > > > > Default Category: unknown: 1
> >>>> > > > >
> >>>> > > > > I am not sure if I am doing something wrong.. have to figure out
> >>>> > > > > why my o/p is so blank.
> >>>> > > > > I'll document these steps and mention country.txt in the wiki.
> >>>> > > > >
> >>>> > > > > Question to all:
> >>>> > > > > Should we have 2 country.txt files?
> >>>> > > > >
> >>>> > > > > 1. country_full_list.txt - this is the existing list
> >>>> > > > > 2. country_sample_list.txt - a list with 2 or 3 countries
> >>>> > > > >
> >>>> > > > > To get a flavor of the wikipedia bayes example, we can use
> >>>> > > > > country_sample_list.txt. When new people want to just try out the
> >>>> > > > > example, they can reference this txt file as a parameter.
> >>>> > > > > To run the example on a robust, scalable infrastructure, we could
> >>>> > > > > use country_full_list.txt.
> >>>> > > > > Any thoughts?
> >>>> > > > >
> >>>> > > > > regards,
> >>>> > > > > Joe.
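Joe's all-zero summary is what a one-category setup tends to produce: with only spain in country.txt there is no second label to separate it from, and documents that can't be matched fall into the default "unknown" bucket without ever being counted as classified. A small Python sketch of how such a summary could be tallied (the summarize helper is hypothetical, not Mahout's code):

```python
def summarize(results, known_labels, default="unknown"):
    """Tally a TestClassifier-style summary from (actual, predicted) pairs.
    Pairs whose actual label is not a known category fall into the default
    bucket and are excluded from the classified totals."""
    correct = incorrect = defaulted = 0
    for actual, predicted in results:
        if actual not in known_labels:
            defaulted += 1
        elif actual == predicted:
            correct += 1
        else:
            incorrect += 1
    return {"correct": correct, "incorrect": incorrect,
            "total": correct + incorrect, default: defaulted}

# One known category, as in the spain-only run: everything defaults,
# so the classified totals are all zero.
print(summarize([("unknown", "spain")], {"spain"}))

# Two categories: the summary becomes meaningful.
print(summarize([("spain", "spain"), ("india", "spain")], {"spain", "india"}))
```

This matches Robin's "you need at least 2 countries" point: the evaluation only says something once there are two or more labels to confuse.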
> >>>> > > > >
> >>>> > > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <[email protected]> wrote:
> >>>> > > > >
> >>>> > > > >> Gangadhar,
> >>>> > > > >>
> >>>> > > > >> After running TrainClassifier again, the map task just failed
> >>>> > > > >> with the same exception and I am pretty sure it is an issue
> >>>> > > > >> with disk space.
> >>>> > > > >> As the map was progressing, I was monitoring my free disk space
> >>>> > > > >> dropping from 81GB. It came down to 0 after almost 66% through
> >>>> > > > >> the map task and then the exception happened. After the
> >>>> > > > >> exception, another map task was resuming at 33% and I got close
> >>>> > > > >> to 15GB free space (I guess the first map task freed up some
> >>>> > > > >> space), and I am sure it would drop down to zero again and
> >>>> > > > >> throw the same exception.
> >>>> > > > >> I am going to modify country.txt to just 1 country, recreate
> >>>> > > > >> wikipediainput, and run TrainClassifier. Will let you know how
> >>>> > > > >> it goes..
> >>>> > > > >>
> >>>> > > > >> Do we have any benchmarks / system requirements for running
> >>>> > > > >> this example? Has anyone else had success running this example?
> >>>> > > > >> Would appreciate your inputs / thoughts.
> >>>> > > > >>
> >>>> > > > >> Should we look at tuning the code to handle these situations?
> >>>> > > > >> Any quick suggestions on where to start looking?
> >>>> > > > >>
> >>>> > > > >> regards,
> >>>> > > > >> Joe.
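Ted's point that the SGD classifiers train without Hadoop can be illustrated with a minimal logistic-regression SGD loop in plain Python. This is a toy sketch of the technique, not Mahout's OnlineLogisticRegression API, and the dense features here merely stand in for sparse token counts:

```python
import math
import random

def sgd_train(examples, dim, epochs=50, rate=0.1):
    """Logistic regression via stochastic gradient descent: one example
    at a time, no cluster needed - the whole model is one weight vector."""
    w = [0.0] * dim
    rng = random.Random(42)
    for _ in range(epochs):
        rng.shuffle(examples)
        for x, y in examples:            # y is 0 or 1
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for i in range(dim):         # gradient step on this one example
                w[i] += rate * (y - p) * x[i]
    return w

def predict(w, x):
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if z > 0 else 0

# Toy "country" task: feature 1 marks one class, feature 2 the other;
# feature 0 is a bias term. Purely illustrative data.
examples = [([1.0, 1.0, 0.0], 1), ([1.0, 0.9, 0.1], 1),
            ([1.0, 0.0, 1.0], 0), ([1.0, 0.1, 0.9], 0)]
w = sgd_train([list(e) for e in examples], dim=3)

print(predict(w, [1.0, 1.0, 0.0]))  # class 1
print(predict(w, [1.0, 0.0, 1.0]))  # class 0
```

Because each update touches one example and one weight vector, the whole training workflow runs on a single machine; the trade-off, as Ted notes, is that the training data must fit through that one machine.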
