Robin,

Thanks for your tip. Will try it out and post updates.
regards,
Joe.

On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <[email protected]> wrote:

> Hi guys, sorry about not replying. I see two possible problems. First, you
> need at least 2 countries; otherwise there is no classification. Second,
> ngram = 3 is a bit too high. With Wikipedia this will result in a huge
> number of features. Why don't you try with 1 and see?
>
> Robin
>
> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <[email protected]> wrote:
>
> > Hi Ted,
> >
> > Sure, will keep digging.
> >
> > About SGD, I don't have much of an idea about how it works. If there is
> > some documentation / reference / quick summary to read about it, that'll
> > be great. Just saw one reference in
> > https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
> >
> > I am assuming we should be able to create a model from Wikipedia
> > articles and label the country of a new article. If so, could you please
> > provide a note on how to do this. We already have the Wikipedia data
> > being extracted for specific countries using
> > WikipediaDatasetCreatorDriver. How do we go about training the
> > classifier using SGD?
> >
> > Thanks for your help,
> > Joe.
> >
> > On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <[email protected]> wrote:
> >
> > > I am watching these efforts with interest, but have been unable to
> > > contribute much to the process. I would encourage Joe and others to
> > > keep whittling this problem down so that we can understand what is
> > > causing it.
> > >
> > > In the meantime, I think that the SGD classifiers are close to
> > > production quality. For problems with fewer than several million
> > > training examples, and especially problems with many sparse features,
> > > I think that these classifiers might be easier to get started with
> > > than the Naive Bayes classifiers. To make a virtue of a defect, the
> > > SGD-based classifiers do not use Hadoop for training.
> > > This makes deployment of a classification training workflow easier,
> > > but limits the total size of data that can be handled.
> > >
> > > What would you guys need to get started with trying these alternative
> > > models?
> > >
> > > On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
> > > <[email protected]> wrote:
> > >
> > > > Joe,
> > > > I also tried reducing the number of countries in country.txt.
> > > > That didn't help. And in my case, I was monitoring the disk space
> > > > and at no time did it reach 0%, so I am not sure if that is the
> > > > case. To remove the dependency on the number of countries, I even
> > > > tried with subjects.txt as the classification - that also did not
> > > > help.
> > > > I think this problem is due to the type of the data being
> > > > processed, but what I am not sure of is what I need to change to
> > > > get the data to be processed successfully.
> > > >
> > > > The experienced folks on Mahout will be able to tell us what is
> > > > missing, I guess.
> > > >
> > > > Thank you
> > > > Gangadhar
> > > >
> > > > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <[email protected]> wrote:
> > > > > Gangadhar,
> > > > >
> > > > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt
> > > > > to have just 1 entry (spain), used WikipediaDatasetCreatorDriver
> > > > > to create the wikipediainput data set, and then ran
> > > > > TrainClassifier, and it worked. When I ran TestClassifier as
> > > > > below, I got blank results in the output.
> > > > >
> > > > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> > > > > org.apache.mahout.classifier.bayes.TestClassifier -m
> > > > > wikipediamodel -d wikipediainput -ng 3 -type bayes -source hdfs
> > > > >
> > > > > Summary
> > > > > -------------------------------------------------------
> > > > > Correctly Classified Instances   : 0    ?%
> > > > > Incorrectly Classified Instances : 0    ?%
> > > > > Total Classified Instances       : 0
> > > > >
> > > > > =======================================================
> > > > > Confusion Matrix
> > > > > -------------------------------------------------------
> > > > > a <--Classified as
> > > > > 0 | 0   a = spain
> > > > > Default Category: unknown: 1
> > > > >
> > > > > I am not sure if I am doing something wrong; I have to figure
> > > > > out why my output is so blank.
> > > > > I'll document these steps and mention country.txt in the wiki.
> > > > >
> > > > > Question to all: should we have two country.txt files?
> > > > >
> > > > > 1. country_full_list.txt - the existing list
> > > > > 2. country_sample_list.txt - a list with 2 or 3 countries
> > > > >
> > > > > To get a flavor of the wikipedia bayes example, we can use
> > > > > country_sample_list.txt. When new people want to just try out
> > > > > the example, they can reference this txt file as a parameter. To
> > > > > run the example on a robust, scalable infrastructure, we could
> > > > > use country_full_list.txt.
> > > > > Any thoughts?
> > > > >
> > > > > regards,
> > > > > Joe.
> > > > >
> > > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <[email protected]> wrote:
> > > > >
> > > > > > Gangadhar,
> > > > > >
> > > > > > After running TrainClassifier again, the map task just failed
> > > > > > with the same exception, and I am pretty sure it is an issue
> > > > > > with disk space. As the map was progressing, I was monitoring
> > > > > > my free disk space dropping from 81GB.
> > > > > > It came down to 0 after almost 66% of the map task, and then
> > > > > > the exception happened. After the exception, another map task
> > > > > > resumed at 33% and I had close to 15GB free (I guess the first
> > > > > > map task freed up some space), and I am sure it would drop
> > > > > > down to zero again and throw the same exception.
> > > > > > I am going to modify country.txt to just 1 country, recreate
> > > > > > wikipediainput, and run TrainClassifier. Will let you know how
> > > > > > it goes.
> > > > > >
> > > > > > Do we have any benchmarks / system requirements for running
> > > > > > this example? Has anyone else had success running it? Would
> > > > > > appreciate your inputs / thoughts.
> > > > > >
> > > > > > Should we look at tuning the code to handle these situations?
> > > > > > Any quick suggestions on where to start looking?
> > > > > >
> > > > > > regards,
> > > > > > Joe.
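[Editor's note: Joe asks above how SGD works. Stripped of Mahout specifics, the training Ted describes is the stochastic-gradient update for logistic regression: for each labeled example, nudge the weights by the prediction error times the features. The sketch below is illustrative only; the class and method names are invented for this note and are not Mahout's API (Mahout's SGD trainer lives in the org.apache.mahout.classifier.sgd package).]

```java
/**
 * Minimal sketch of SGD-trained binary logistic regression.
 * Hypothetical names; not Mahout's API.
 */
public class SgdSketch {
    private final double[] weights;
    private final double learningRate;

    public SgdSketch(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** Probability that the instance belongs to class 1. */
    public double classify(double[] features) {
        double dot = 0.0;
        for (int i = 0; i < weights.length; i++) {
            dot += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot)); // logistic link
    }

    /**
     * One stochastic gradient step on a single labeled example:
     * w <- w + rate * (label - p) * x
     * No Hadoop job here: training is a sequential pass over examples.
     */
    public void train(int label, double[] features) {
        double gradientScale = learningRate * (label - classify(features));
        for (int i = 0; i < weights.length; i++) {
            weights[i] += gradientScale * features[i];
        }
    }

    public static void main(String[] args) {
        SgdSketch model = new SgdSketch(2, 0.5);
        // Tiny separable data set: high feature[0] means class 1.
        double[][] xs = {{1, 0}, {0.9, 0.1}, {0, 1}, {0.1, 0.9}};
        int[] ys = {1, 1, 0, 0};
        for (int pass = 0; pass < 200; pass++) {
            for (int i = 0; i < xs.length; i++) {
                model.train(ys[i], xs[i]);
            }
        }
        System.out.println(model.classify(new double[]{1, 0}));
        System.out.println(model.classify(new double[]{0, 1}));
    }
}
```

Mahout's real implementation adds regularization, annealed learning rates, and sparse feature vectors, but the per-example update is the same idea, which is why it runs on a single machine rather than as a Hadoop job.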
