The target variable is included in the test file so that the classifier's output can be compared with the correct answer in the same run. Without it, you would only get the classifier's predictions, and you would have to run a separate program to find out which of them were correct and which were not. Having classification and verification happen together is simply easier.
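To make the point concrete, here is a minimal sketch of the bookkeeping a test run does with those labels. It is not Mahout's code: `SimpleConfusionMatrix` and its methods are hypothetical stand-ins written for illustration, and the assumption that the matrix is built from the training labels is inferred from the stack trace, not confirmed against the Mahout source. Under that assumption, a test label the model has never seen (such as a folder renamed to "20") cannot be scored and fails with the same kind of message.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a confusion matrix; NOT Mahout's class.
class SimpleConfusionMatrix {
    // Maps each label known from training to a row/column index.
    private final Map<String, Integer> index = new LinkedHashMap<>();
    private final int[][] counts;

    SimpleConfusionMatrix(List<String> trainingLabels) {
        for (String label : trainingLabels) {
            index.putIfAbsent(label, index.size());
        }
        counts = new int[index.size()][index.size()];
    }

    // Rejects any label that was not present at training time.
    private int indexOf(String label) {
        Integer i = index.get(label);
        if (i == null) {
            throw new IllegalArgumentException("Label not found: " + label);
        }
        return i;
    }

    // correctLabel comes from the test file; predictedLabel from the classifier.
    void addInstance(String correctLabel, String predictedLabel) {
        counts[indexOf(correctLabel)][indexOf(predictedLabel)]++;
    }

    // Fraction of instances whose prediction matched the test-file label.
    double accuracy() {
        int correct = 0;
        int total = 0;
        for (int r = 0; r < counts.length; r++) {
            for (int c = 0; c < counts.length; c++) {
                total += counts[r][c];
                if (r == c) {
                    correct += counts[r][c];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        SimpleConfusionMatrix matrix =
            new SimpleConfusionMatrix(Arrays.asList("alt.atheism", "sci.space"));
        matrix.addInstance("alt.atheism", "alt.atheism"); // prediction was right
        matrix.addInstance("sci.space", "alt.atheism");   // prediction was wrong
        System.out.println("accuracy = " + matrix.accuracy());

        try {
            // A renamed test folder yields a label the model never saw.
            matrix.addInstance("20", "sci.space");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If this reading of the trace is right, the renamed folders did not stop the classifier from predicting. They only broke the scoring step, which needs test labels that match the training labels.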
On Fri, Oct 28, 2011 at 7:58 PM, Sam Cunningham <sam_cun...@yahoo.com> wrote:

> I have a text classification project. So, I am going through the examples
> provided in Mahout in Action book. 20news example works fine for me.
> However, I don't understand something: Why do we include the target
> variables in the test data files? (target variable - tab - text content). I
> understand that in order for us to train the program we need to provide
> target variables but I don't understand why we include target variables in
> the test files? Isn't Mahout supposed to determine them by using the model
> created from training? Just to test that, I renamed the folder names under
> 20news-bydate-test to 1, 2, 3, ...20. Then I ran prepare20newsgroups to
> generate the files required for naive bayes classifier. The new files
> included renamed folder names as target variables such that 1, 2, 3, ... 20.
> When I ran the testclassifier after training the classifier, I received the
> following error. Why? Please help me understand. Also, is there Java
> source code for 20newsgroup bayes classification (instead of command line)?
>
> Exception in thread "main" java.lang.IllegalArgumentException: Label not found: 20
>     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>     at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:93)
>     at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:113)
>     at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:117)
>     at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:85)
>     at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:67)
>     at org.apache.mahout.classifier.bayes.TestClassifier.classifySequential(TestClassifier.java:252)
>     at org.apache.mahout.classifier.bayes.TestClassifier.main(TestClassifier.java:185)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Mahout-20news-example-tp3462754p3462754.html
> Sent from the Lucene - General mailing list archive at Nabble.com.