There is a test program called TrainNewsGroups in org.apache.mahout.classifier.sgd in the examples module.
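For anyone who wants a feel for what an SGD-trained classifier does before reading TrainNewsGroups, here is a minimal sketch of binary logistic regression updated one example at a time over sparse word features. It is not Mahout's API and not how TrainNewsGroups is written: the class name SgdSketch, its methods, the learning rate, and the toy sentences are all hypothetical, chosen only to illustrate the idea of in-memory, non-Hadoop training on sparse features.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only: binary logistic regression trained by SGD. */
public class SgdSketch {
    // Sparse weight vector: one weight per word seen so far (1-gram features).
    private final Map<String, Double> weights = new HashMap<>();
    private final double learningRate;

    public SgdSketch(double learningRate) {
        this.learningRate = learningRate;
    }

    private static String[] tokens(String text) {
        return text.toLowerCase().trim().split("\\s+");
    }

    /** Probability that the text belongs to class 1. */
    public double probability(String text) {
        double dot = 0.0;
        for (String token : tokens(text)) {
            dot += weights.getOrDefault(token, 0.0);
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    /** One SGD step on a single example; label must be 0 or 1. */
    public void train(String text, int label) {
        double error = label - probability(text);
        for (String token : tokens(text)) {
            // Gradient of the log-likelihood with respect to this word's weight.
            weights.merge(token, learningRate * error, Double::sum);
        }
    }

    public static void main(String[] args) {
        SgdSketch model = new SgdSketch(0.5);
        String[] texts = {
            "madrid is the capital of spain",
            "barcelona is a city in spain",
            "paris is the capital of france",
            "lyon is a city in france"
        };
        int[] labels = {1, 1, 0, 0}; // 1 = spain, 0 = france
        for (int epoch = 0; epoch < 50; epoch++) {
            for (int i = 0; i < texts.length; i++) {
                model.train(texts[i], labels[i]);
            }
        }
        System.out.println("P(spain | 'madrid spain') = " + model.probability("madrid spain"));
        System.out.println("P(spain | 'paris france') = " + model.probability("paris france"));
    }
}
```

Each training example is seen and discarded in turn, so the whole thing runs in a single process with memory proportional to the vocabulary. The same property makes the approach easy to deploy, and it also limits how much data it can handle.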
I would love to work with you to get better documentation pulled together.

On Mon, Sep 20, 2010 at 8:13 PM, Gangadhar Nittala <[email protected]> wrote:

> Joe,
> I will try with the ngram setting of 1 and let you know how it goes.
> Robin, the ngram parameter is used to check the number of subsequences
> of characters isn't it ? Or is it evaluated differently w.r.t to the
> Bayesian classifier ?
>
> Ted, like Joe mentioned, if you could point us to some information on
> SGD we could try it and report back the results to the list.
>
> Thank you
> Gangadhar
>
> On Mon, Sep 20, 2010 at 10:30 PM, Joe Kumar <[email protected]> wrote:
> > Robin / Gangadhar,
> > With ngram as 1 and all the countries in the country.txt , the model is
> > getting created without any issues.
> > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> > org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 -i wikipediainput
> > -o wikipediamodel -type bayes -source hdfs
> >
> > Robin,
> > Even for ngram parameter, the default value is mentioned as 1 but it is set
> > as a mandatory parameter in TrainClassifier. so i'll modify the code to set
> > the default ngram as 1 and make it as a non mandatory param.
> >
> > That aside, When I try to test the model, the summary is getting printed
> > like below.
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances : 0 ?%
> > Incorrectly Classified Instances : 0 ?%
> > Total Classified Instances : 0
> > Need to figure out the reason..
> >
> > Since TestClassifier also has the same params and settings like
> > TrainClassifier, can i modify it to set the default values for ngram,
> > classifierType & dataSource ?
> >
> > reg,
> > Joe.
> >
> > On Mon, Sep 20, 2010 at 1:09 PM, Joe Kumar <[email protected]> wrote:
> >
> >> Robin,
> >>
> >> Thanks for your tip.
> >> Will try it out and post updates.
> >>
> >> reg
> >> Joe.
> >>
> >>
> >> On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <[email protected]> wrote:
> >>
> >>> Hi Guys, Sorry about not replying, I see two problems(possible). 1st. You
> >>> need atleast 2 countries. otherwise there is no classification. Secondly
> >>> ngram =3 is a bit too high. With wikipedia this will result in a huge number
> >>> of features. Why dont you try with one and see.
> >>>
> >>> Robin
> >>>
> >>> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <[email protected]> wrote:
> >>>
> >>> > Hi Ted,
> >>> >
> >>> > sure. will keep digging..
> >>> >
> >>> > About SGD, I dont have an idea about how it works et al. If there is some
> >>> > documentation / reference / quick summary to read about it that'll be gr8.
> >>> > Just saw one reference in
> >>> > https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
> >>> >
> >>> > I am assuming we should be able to create a model from wikipedia articles
> >>> > and label the country of a new article. If so, could you please provide a
> >>> > note on how to do this. We already have the wikipedia data being extracted
> >>> > for specific countries using WikipediaDatasetCreatorDriver. How do we go
> >>> > about training the classifier using SGD ?
> >>> >
> >>> > thanks for your help,
> >>> > Joe.
> >>> >
> >>> >
> >>> > On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <[email protected]> wrote:
> >>> >
> >>> > > I am watching these efforts with interest, but have been unable to
> >>> > > contribute much to the process. I would encourage Joe and others to keep
> >>> > > whittling this problem down so that we can understand what is causing it.
> >>> > >
> >>> > > In the meantime, I think that the SGD classifiers are close to production
> >>> > > quality.
> >>> > > For problems with less than several million training examples, and
> >>> > > especially problems with many sparse features, I think that these
> >>> > > classifiers might be easier to get started with than the Naive Bayes
> >>> > > classifiers. To make a virtue of a defect, the SGD based classifiers to not
> >>> > > use Hadoop for training. This makes deployment of a classification training
> >>> > > workflow easier, but limits the total size of data that can be handled.
> >>> > >
> >>> > > What would you guys need to get started with trying these alternative
> >>> > > models?
> >>> > >
> >>> > > On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
> >>> > > <[email protected]> wrote:
> >>> > >
> >>> > > > Joe,
> >>> > > > Even I tried with reducing the number of countries in the country.txt.
> >>> > > > That didn't help. And in my case, I was monitoring the disk space and
> >>> > > > at no time did it reach 0%. So, I am not sure if that is the case. To
> >>> > > > remove the dependency on the number of countries, I even tried with
> >>> > > > the subjects.txt as the classification - that also did not help.
> >>> > > > I think this problem is due to the type of the data being processed,
> >>> > > > but what I am not sure of is what I need to change to get the data to
> >>> > > > be processed successfully.
> >>> > > >
> >>> > > > The experienced folks on Mahout will be able to tell us what is missing I
> >>> > > > guess.
> >>> > > >
> >>> > > > Thank you
> >>> > > > Gangadhar
> >>> > > >
> >>> > > > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <[email protected]> wrote:
> >>> > > > > Gangadhar,
> >>> > > > >
> >>> > > > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to just have
> >>> > > > > 1 entry (spain) and used WikipediaDatasetCreatorDriver to create the
> >>> > > > > wikipediainput data set and then ran TrainClassifier and it worked. when I
> >>> > > > > ran TestClassifier as below, I got blank results in the output.
> >>> > > > >
> >>> > > > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> >>> > > > > org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel -d
> >>> > > > > wikipediainput -ng 3 -type bayes -source hdfs
> >>> > > > >
> >>> > > > > Summary
> >>> > > > > -------------------------------------------------------
> >>> > > > > Correctly Classified Instances : 0 ?%
> >>> > > > > Incorrectly Classified Instances : 0 ?%
> >>> > > > > Total Classified Instances : 0
> >>> > > > >
> >>> > > > > =======================================================
> >>> > > > > Confusion Matrix
> >>> > > > > -------------------------------------------------------
> >>> > > > > a <--Classified as
> >>> > > > > 0 | 0 a = spain
> >>> > > > > Default Category: unknown: 1
> >>> > > > >
> >>> > > > > I am not sure if I am doing something wrong.. have to figure out why my o/p
> >>> > > > > is so blank.
> >>> > > > > I'll document these steps and mention about country.txt in the wiki.
> >>> > > > >
> >>> > > > > Question to all
> >>> > > > > Should we have 2 country.txt
> >>> > > > >
> >>> > > > > 1. country_full_list.txt - this is the existing list
> >>> > > > > 2. country_sample_list.txt - a list with 2 or 3 countries
> >>> > > > >
> >>> > > > > To get a flavor of the wikipedia bayes example, we can use
> >>> > > > > country_sample.txt.
> >>> > > > > When new people want to just try out the example, they
> >>> > > > > can reference this txt file as a parameter.
> >>> > > > > To run the example in a robust scalable infrastructure, we could use
> >>> > > > > country_full_list.txt.
> >>> > > > > any thots ?
> >>> > > > >
> >>> > > > > regards
> >>> > > > > Joe.
> >>> > > > >
> >>> > > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <[email protected]> wrote:
> >>> > > > >
> >>> > > > >> Gangadhar,
> >>> > > > >>
> >>> > > > >> After running TrainClassifier again, the map task just failed with the same
> >>> > > > >> exception and I am pretty sure it is an issue with disk space.
> >>> > > > >> As the map was progressing, I was monitoring my free disk space dropping
> >>> > > > >> from 81GB. It came down to 0 after almost 66% through the map task and then
> >>> > > > >> the exception happened. After the exception, another map task was resuming
> >>> > > > >> at 33% and I got close to 15GB free space (i guess the first map task freed
> >>> > > > >> up some space) and I am sure they would drop down to zero again and throw
> >>> > > > >> the same exception.
> >>> > > > >> I am going to modify the country.txt to just 1 country and recreate
> >>> > > > >> wikipediainput and run TrainClassifier. Will let you know how it goes..
> >>> > > > >>
> >>> > > > >> Do we have any benchmarks / system requirements for running this example ?
> >>> > > > >> Has anyone else had success running this example anytime. Would appreciate
> >>> > > > >> your inputs / thots.
> >>> > > > >>
> >>> > > > >> Should we look at tuning the code for handling these situations ? Any quick
> >>> > > > >> suggestions on where to start looking at ?
> >>> > > > >>
> >>> > > > >> regards,
> >>> > > > >> Joe.
