Joe, I am out of town this week and won't have access to my machine. I will check this over the weekend and get back to you. I will follow the steps in the wiki.
Thank you

On Fri, Sep 24, 2010 at 8:44 AM, Joe Kumar <[email protected]> wrote:
> Hi Gangadhar,
>
> I ran TestClassifier with similar parameters. It didn't take me 2 hrs,
> though.
>
> I have documented the steps that worked for me at
> https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example
> Can you please get the patch available at MAHOUT-509, apply it, and then
> try the steps in the wiki?
> Please let me know if you still face issues.
>
> reg
> Joe.
>
> On Thu, Sep 23, 2010 at 10:43 PM, Gangadhar Nittala <[email protected]> wrote:
>> Joe,
>> Can you let me know what command you used to test the classifier?
>> With the ngrams set to 1 as suggested by Robin, I was able to train
>> the classifier. The command:
>>
>>   $HADOOP_HOME/bin/hadoop jar \
>>     $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
>>     org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 1 \
>>     --input wikipediainput10 --output wikipediamodel10 \
>>     --classifierType bayes --dataSource hdfs
>>
>> After this, as per the wiki, we need to get the data from HDFS. I did that:
>>
>>   <HADOOP_HOME>/bin/hadoop dfs -get wikipediainput10 wikipediainput10
>>
>> After this, the classifier is to be tested:
>>
>>   $HADOOP_HOME/bin/hadoop jar \
>>     $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
>>     org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel10 \
>>     -d wikipediainput10 -ng 1 -type bayes -source hdfs
>>
>> When I run this, it runs for close to 2 hours and then errors out with
>> a java.io.FileException saying that logs_ is a directory in the
>> wikipediainput10 folder. I am sorry I can't provide the stack trace
>> right now because I accidentally closed the terminal window before I
>> could copy it. I will run this again and send the stack trace.
>>
>> But if you can send me the steps that you followed after running the
>> classifier, I can repeat those and see if I am able to successfully
>> execute the classifier.
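[Editor's note] The commands quoted above, collected in one place as a sketch rather than a verified recipe: the paths, `$HADOOP_HOME`/`$MAHOUT_HOME`, and the 0.4-SNAPSHOT job file are exactly as reported in the thread, while the `-rmr` line is a hypothetical workaround, assuming the FileException comes from TestClassifier trying to read the `_logs` subdirectory that the training job leaves inside the input as if it were data.

```shell
# Train the Bayes classifier (ngram size 1, as Robin suggested).
$HADOOP_HOME/bin/hadoop jar \
  $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
  org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 1 \
  --input wikipediainput10 --output wikipediamodel10 \
  --classifierType bayes --dataSource hdfs

# Copy the input out of HDFS, per the wiki.
$HADOOP_HOME/bin/hadoop dfs -get wikipediainput10 wikipediainput10

# Hypothetical workaround (untested): remove the _logs directory from the
# test input before running TestClassifier, assuming it is the
# "is a directory" culprit reported below.
$HADOOP_HOME/bin/hadoop dfs -rmr wikipediainput10/_logs

# Test the trained model.
$HADOOP_HOME/bin/hadoop jar \
  $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
  org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel10 \
  -d wikipediainput10 -ng 1 -type bayes -source hdfs
```

This requires a running Hadoop installation with Mahout built from trunk; it is not runnable standalone.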
>>
>> Thank you
>> Gangadhar
>>
>> On Mon, Sep 20, 2010 at 11:13 PM, Gangadhar Nittala <[email protected]> wrote:
>>> Joe,
>>> I will try with the ngram setting of 1 and let you know how it goes.
>>> Robin, the ngram parameter is used to check the number of subsequences
>>> of characters, isn't it? Or is it evaluated differently w.r.t. the
>>> Bayesian classifier?
>>>
>>> Ted, like Joe mentioned, if you could point us to some information on
>>> SGD, we could try it and report back the results to the list.
>>>
>>> Thank you
>>> Gangadhar
>>>
>>> On Mon, Sep 20, 2010 at 10:30 PM, Joe Kumar <[email protected]> wrote:
>>>> Robin / Gangadhar,
>>>> With ngram as 1 and all the countries in country.txt, the model is
>>>> getting created without any issues:
>>>>
>>>>   $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
>>>>     org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 \
>>>>     -i wikipediainput -o wikipediamodel -type bayes -source hdfs
>>>>
>>>> Robin,
>>>> Even for the ngram parameter, the default value is mentioned as 1,
>>>> but it is set as a mandatory parameter in TrainClassifier, so I'll
>>>> modify the code to set the default ngram as 1 and make it a
>>>> non-mandatory param.
>>>>
>>>> That aside, when I try to test the model, the summary is getting
>>>> printed like below:
>>>>
>>>>   Summary
>>>>   -------------------------------------------------------
>>>>   Correctly Classified Instances   : 0    ?%
>>>>   Incorrectly Classified Instances : 0    ?%
>>>>   Total Classified Instances       : 0
>>>>
>>>> Need to figure out the reason.
>>>>
>>>> Since TestClassifier also has the same params and settings as
>>>> TrainClassifier, can I modify it to set the default values for ngram,
>>>> classifierType & dataSource?
>>>>
>>>> reg,
>>>> Joe.
>>>>
>>>> On Mon, Sep 20, 2010 at 1:09 PM, Joe Kumar <[email protected]> wrote:
>>>>> Robin,
>>>>>
>>>>> Thanks for your tip.
>>>>> Will try it out and post updates.
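[Editor's note] On Gangadhar's n-gram question above: in this example the n-grams are over tokens (words), not character subsequences, which is why raising `-ng` blows up the feature count on Wikipedia-sized text. A tiny plain-shell illustration (nothing Mahout-specific; the corpus and helper name are made up) of how the number of distinct n-grams grows with n:

```shell
# Count distinct word n-grams in a toy corpus to see why a larger n
# means many more features for the classifier to track.
corpus="the quick brown fox jumps over the lazy dog the quick fox"

count_ngrams() {
  n=$1
  echo "$corpus" | tr ' ' '\n' | awk -v n="$n" '
    { w[NR] = $0 }                     # collect one word per line
    END {
      for (i = 1; i + n - 1 <= NR; i++) {
        g = w[i]
        for (j = 1; j < n; j++) g = g " " w[i + j]
        seen[g] = 1                    # de-duplicate n-grams
      }
      c = 0; for (g in seen) c++
      print c
    }'
}

count_ngrams 1   # prints 8  (distinct unigrams)
count_ngrams 3   # prints 10 (distinct trigrams, from only 12 tokens)
```

On a 12-token corpus the jump from 8 to 10 is small, but on millions of Wikipedia articles the trigram vocabulary dwarfs the unigram one.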
>>>>>
>>>>> reg
>>>>> Joe.
>>>>>
>>>>> On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <[email protected]> wrote:
>>>>>> Hi Guys, sorry about not replying. I see two possible problems.
>>>>>> First, you need at least 2 countries; otherwise there is no
>>>>>> classification. Secondly, ngram = 3 is a bit too high. With
>>>>>> Wikipedia this will result in a huge number of features. Why don't
>>>>>> you try with one and see?
>>>>>>
>>>>>> Robin
>>>>>>
>>>>>> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <[email protected]> wrote:
>>>>>>> Hi Ted,
>>>>>>>
>>>>>>> Sure, will keep digging.
>>>>>>>
>>>>>>> About SGD, I don't have an idea about how it works et al. If there
>>>>>>> is some documentation / reference / quick summary to read about
>>>>>>> it, that'll be gr8. Just saw one reference in
>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
>>>>>>>
>>>>>>> I am assuming we should be able to create a model from Wikipedia
>>>>>>> articles and label the country of a new article. If so, could you
>>>>>>> please provide a note on how to do this? We already have the
>>>>>>> Wikipedia data being extracted for specific countries using
>>>>>>> WikipediaDatasetCreatorDriver. How do we go about training the
>>>>>>> classifier using SGD?
>>>>>>>
>>>>>>> thanks for your help,
>>>>>>> Joe.
>>>>>>>
>>>>>>> On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <[email protected]> wrote:
>>>>>>>> I am watching these efforts with interest, but have been unable
>>>>>>>> to contribute much to the process. I would encourage Joe and
>>>>>>>> others to keep whittling this problem down so that we can
>>>>>>>> understand what is causing it.
>>>>>>>>
>>>>>>>> In the meantime, I think that the SGD classifiers are close to
>>>>>>>> production quality. For problems with less than several million
>>>>>>>> training examples, and especially problems with many sparse
>>>>>>>> features, I think that these classifiers might be easier to get
>>>>>>>> started with than the Naive Bayes classifiers. To make a virtue
>>>>>>>> of a defect, the SGD-based classifiers do not use Hadoop for
>>>>>>>> training. This makes deployment of a classification training
>>>>>>>> workflow easier, but limits the total size of data that can be
>>>>>>>> handled.
>>>>>>>>
>>>>>>>> What would you guys need to get started with trying these
>>>>>>>> alternative models?
>>>>>>>>
>>>>>>>> On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala <[email protected]> wrote:
>>>>>>>>> Joe,
>>>>>>>>> Even I tried with reducing the number of countries in
>>>>>>>>> country.txt. That didn't help. And in my case, I was monitoring
>>>>>>>>> the disk space, and at no time did it reach 0%. So I am not sure
>>>>>>>>> if that is the case. To remove the dependency on the number of
>>>>>>>>> countries, I even tried with subjects.txt as the classification
>>>>>>>>> - that also did not help. I think this problem is due to the
>>>>>>>>> type of the data being processed, but what I am not sure of is
>>>>>>>>> what I need to change to get the data processed successfully.
>>>>>>>>>
>>>>>>>>> The experienced folks on Mahout will be able to tell us what is
>>>>>>>>> missing, I guess.
>>>>>>>>>
>>>>>>>>> Thank you
>>>>>>>>> Gangadhar
>>>>>>>>>
>>>>>>>>> On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <[email protected]> wrote:
>>>>>>>>>> Gangadhar,
>>>>>>>>>>
>>>>>>>>>> I modified $MAHOUT_HOME/examples/src/test/resources/country.txt
>>>>>>>>>> to just have 1 entry (spain) and used
>>>>>>>>>> WikipediaDatasetCreatorDriver to create the wikipediainput data
>>>>>>>>>> set, and then ran TrainClassifier and it worked. When I ran
>>>>>>>>>> TestClassifier as below, I got blank results in the output:
>>>>>>>>>>
>>>>>>>>>>   $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
>>>>>>>>>>     org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel \
>>>>>>>>>>     -d wikipediainput -ng 3 -type bayes -source hdfs
>>>>>>>>>>
>>>>>>>>>>   Summary
>>>>>>>>>>   -------------------------------------------------------
>>>>>>>>>>   Correctly Classified Instances   : 0    ?%
>>>>>>>>>>   Incorrectly Classified Instances : 0    ?%
>>>>>>>>>>   Total Classified Instances       : 0
>>>>>>>>>>
>>>>>>>>>>   =======================================================
>>>>>>>>>>   Confusion Matrix
>>>>>>>>>>   -------------------------------------------------------
>>>>>>>>>>   a       <--Classified as
>>>>>>>>>>   0       | 0     a = spain
>>>>>>>>>>   Default Category: unknown: 1
>>>>>>>>>>
>>>>>>>>>> I am not sure if I am doing something wrong; I have to figure
>>>>>>>>>> out why my o/p is so blank.
>>>>>>>>>> I'll document these steps and mention country.txt in the wiki.
>>>>>>>>>>
>>>>>>>>>> Question to all:
>>>>>>>>>> Should we have 2 country.txt files?
>>>>>>>>>>
>>>>>>>>>>   1. country_full_list.txt - this is the existing list
>>>>>>>>>>   2. country_sample_list.txt - a list with 2 or 3 countries
>>>>>>>>>>
>>>>>>>>>> To get a flavor of the wikipedia bayes example, we can use
>>>>>>>>>> country_sample_list.txt. When new people want to just try out
>>>>>>>>>> the example, they can reference this txt file as a parameter.
>>>>>>>>>> To run the example on a robust, scalable infrastructure, we
>>>>>>>>>> could use country_full_list.txt.
>>>>>>>>>> Any thoughts?
>>>>>>>>>>
>>>>>>>>>> regards
>>>>>>>>>> Joe.
>>>>>>>>>>
>>>>>>>>>> On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <[email protected]> wrote:
>>>>>>>>>>> Gangadhar,
>>>>>>>>>>>
>>>>>>>>>>> After running TrainClassifier again, the map task just failed
>>>>>>>>>>> with the same exception, and I am pretty sure it is an issue
>>>>>>>>>>> with disk space.
>>>>>>>>>>> As the map was progressing, I was monitoring my free disk
>>>>>>>>>>> space dropping from 81GB. It came down to 0 after almost 66%
>>>>>>>>>>> of the map task, and then the exception happened. After the
>>>>>>>>>>> exception, another map task was resuming at 33% and I got
>>>>>>>>>>> close to 15GB free space (I guess the first map task freed up
>>>>>>>>>>> some space), and I am sure it would drop down to zero again
>>>>>>>>>>> and throw the same exception.
>>>>>>>>>>> I am going to modify country.txt to just 1 country, recreate
>>>>>>>>>>> wikipediainput, and run TrainClassifier. Will let you know how
>>>>>>>>>>> it goes.
>>>>>>>>>>>
>>>>>>>>>>> Do we have any benchmarks / system requirements for running
>>>>>>>>>>> this example?
>>>>>>>>>>> Has anyone else had success running this example? Would
>>>>>>>>>>> appreciate your inputs / thoughts.
>>>>>>>>>>>
>>>>>>>>>>> Should we look at tuning the code to handle these situations?
>>>>>>>>>>> Any quick suggestions on where to start looking?
>>>>>>>>>>>
>>>>>>>>>>> regards,
>>>>>>>>>>> Joe.
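[Editor's note] Since the disk-space theory keeps coming up in this thread, here is a trivial polling helper (plain `df`, nothing Mahout-specific; the function name is made up) so the drop to zero during the map phase can be observed rather than guessed:

```shell
# Report free kilobytes on the filesystem holding a given path
# (defaults to the current directory). -P forces the POSIX
# one-line-per-filesystem format so awk can rely on column positions.
free_space_kb() {
  df -Pk "${1:-.}" | awk 'NR == 2 { print $4 }'
}

# While TrainClassifier runs, poll e.g. the Hadoop tmp dir every 30s:
#   while true; do free_space_kb /tmp; sleep 30; done
free_space_kb
```

Watching this alongside the job's map progress would show whether the exception really coincides with free space hitting zero.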
