[
https://issues.apache.org/jira/browse/MAHOUT-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288304#comment-13288304
]
Robin Anil commented on MAHOUT-1006:
------------------------------------
Phew! After weeks of re-reading the code, I finally figured out that the
theta normalization driver is broken. The current code is somehow different.
Right now I do not have a fix for that. However, theta normalization
will only take some 1-2% off accuracy, so I would not be too worried about
that. The solution will just work with seq2sparse and tfidf vectors. It assumes
that the vectors in the input sequence file are named in the format
"/class-name/filename", and it will expect this naming even if used otherwise.
Sorry, I don't have a better representation for the class name and vector name,
given the short amount of time I have to make this a working replacement for
bayes for this release. I have gone ahead and added deprecation messages for
when people try to run prepare20newsgroups via the command line.
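To make the naming assumption concrete, here is a minimal sketch of pulling the class name out of a vector name of the form "/class-name/filename". The class `VectorNames`, the method `classNameOf`, and the file name "53261" are made up for illustration; this is not the code in the patch, and it assumes the name is available as a plain string.

```java
final class VectorNames {
    private VectorNames() {}

    /**
     * Returns the class name from a vector name shaped like "/class-name/filename".
     * Hypothetical helper for illustration only, not part of Mahout.
     */
    static String classNameOf(String vectorName) {
        // Limit -1 keeps trailing empty strings, so "/a/b/" is rejected below.
        String[] parts = vectorName.split("/", -1);
        // A well-formed name splits into exactly ["", class-name, filename].
        boolean wellFormed = parts.length == 3
            && parts[0].isEmpty()
            && !parts[1].isEmpty()
            && !parts[2].isEmpty();
        if (!wellFormed) {
            throw new IllegalArgumentException(
                "Expected \"/class-name/filename\" but got: " + vectorName);
        }
        return parts[1];
    }

    public static void main(String[] args) {
        // "53261" is an invented file name.
        System.out.println(classNameOf("/alt.atheism/53261"));
    }
}
```

The sketch is deliberately strict: anything that is not exactly two non-empty segments after a leading slash is rejected rather than guessed at.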

So for an 80-20 random split, the classifier gives 91% accuracy on the
20newsgroups data. The example is added to classify-20newsgroups.sh.
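For readers unfamiliar with the term, an 80-20 random split can be sketched as below. This is a generic illustration under my own assumptions (an in-memory list, a fixed seed, the names `RandomSplit` and `split`); it is not how classify-20newsgroups.sh or Mahout performs the split.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class RandomSplit {
    private RandomSplit() {}

    /**
     * Shuffles a copy of the items with the given seed and returns two lists:
     * index 0 holds the training part, index 1 the held-out test part.
     */
    static <T> List<List<T>> split(List<T> items, double trainFraction, long seed) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be in [0, 1]");
        }
        List<T> shuffled = new ArrayList<T>(items);
        Collections.shuffle(shuffled, new Random(seed));
        // Number of items that go to the training part.
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<List<T>> parts = new ArrayList<List<T>>();
        parts.add(new ArrayList<T>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<T>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<Integer>();
        for (int i = 0; i < 10; i++) {
            docs.add(i);
        }
        List<List<Integer>> parts = split(docs, 0.8, 42L);
        System.out.println("train=" + parts.get(0).size() + " test=" + parts.get(1).size());
    }
}
```

A fixed seed makes the split repeatable, which helps when comparing accuracy numbers between runs.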

If I get more time this week during Buzzwords, I will try to fix the issue with
the theta normalizer as well. But this should be releasable even if I am not
able to do that in time.
{noformat}
Standard NB Results:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances   : 3357  91.0991%
Incorrectly Classified Instances :  328   8.9009%
Total Classified Instances       : 3685
=======================================================
Confusion Matrix
-------------------------------------------------------
   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   <--Classified as
 159   0   0   0   1   0   0   0   0   0   0   0   0   0   0   1   0   0   4   0 | 165  a = alt.atheism
   0 155   0   8   3   6   3   0   0   0   0   1   2   0   2   0   0   0   0   0 | 180  b = comp.graphics
   0  26 104  36   7   6   1   0   0   0   0   0   1   0   2   0   0   0   0   0 | 183  c = comp.os.ms-windows.misc
   0   4   2 139  11   0   5   0   0   0   0   1   2   0   0   0   0   0   0   0 | 164  d = comp.sys.ibm.pc.hardware
   0   2   1   2 165   0   3   0   0   0   0   1   3   0   0   0   0   0   0   0 | 177  e = comp.sys.mac.hardware
   1  13   0   5   2 175   3   0   0   0   0   0   0   0   0   0   0   0   0   0 | 199  f = comp.windows.x
   0   2   0   5   0   0 168   3   0   2   2   0   1   0   0   0   0   0   1   0 | 184  g = misc.forsale
   0   0   0   1   2   0   2 182   3   0   0   0   4   0   0   1   0   0   0   0 | 195  h = rec.autos
   0   0   0   0   0   1   5   2 199   0   0   0   1   0   0   0   0   0   0   0 | 208  i = rec.motorcycles
   0   0   0   0   0   0   1   0   0 177   1   0   0   0   0   0   0   0   0   0 | 179  j = rec.sport.baseball
   0   0   0   1   0   0   0   0   0   0 183   0   0   0   0   0   0   0   0   1 | 185  k = rec.sport.hockey
   0   1   0   0   0   3   0   1   0   1   0 193   0   2   0   0   0   1   1   2 | 205  l = sci.crypt
   0   3   0   9   4   2   3   1   0   0   1   2 171   0   0   0   0   0   0   0 | 196  m = sci.electronics
   0   2   1   1   0   0   1   0   0   0   0   0   1 190   2   0   0   0   0   0 | 198  n = sci.med
   0   3   0   0   0   1   0   0   0   0   0   0   2   0 190   0   0   0   2   1 | 199  o = sci.space
   4   1   0   1   1   0   0   0   0   0   1   0   0   1   0 212   0   0   1   0 | 222  p = soc.religion.christian
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 170   1   0   1 | 172  q = talk.politics.mideast
   0   0   1   0   0   0   1   0   1   0   0   2   0   0   0   0   0 165   0   5 | 175  r = talk.politics.guns
  14   0   0   0   0   0   0   0   0   0   0   0   0   0   0   8   0   3 116   5 | 146  s = talk.religion.misc
   0   0   0   0   0   0   0   0   0   2   0   0   1   0   0   0   2   3   1 144 | 153  t = talk.politics.misc

Complementary Results:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances   : 3357  91.0991%
Incorrectly Classified Instances :  328   8.9009%
Total Classified Instances       : 3685
=======================================================
Confusion Matrix
-------------------------------------------------------
   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   <--Classified as
 159   0   0   0   1   0   0   0   0   0   0   0   0   0   0   1   0   0   4   0 | 165  a = alt.atheism
   0 155   0   8   3   6   3   0   0   0   0   1   2   0   2   0   0   0   0   0 | 180  b = comp.graphics
   0  26 104  36   7   6   1   0   0   0   0   0   1   0   2   0   0   0   0   0 | 183  c = comp.os.ms-windows.misc
   0   4   2 139  11   0   5   0   0   0   0   1   2   0   0   0   0   0   0   0 | 164  d = comp.sys.ibm.pc.hardware
   0   2   1   2 165   0   3   0   0   0   0   1   3   0   0   0   0   0   0   0 | 177  e = comp.sys.mac.hardware
   1  13   0   5   2 175   3   0   0   0   0   0   0   0   0   0   0   0   0   0 | 199  f = comp.windows.x
   0   2   0   5   0   0 168   3   0   2   2   0   1   0   0   0   0   0   1   0 | 184  g = misc.forsale
   0   0   0   1   2   0   2 182   3   0   0   0   4   0   0   1   0   0   0   0 | 195  h = rec.autos
   0   0   0   0   0   1   5   2 199   0   0   0   1   0   0   0   0   0   0   0 | 208  i = rec.motorcycles
   0   0   0   0   0   0   1   0   0 177   1   0   0   0   0   0   0   0   0   0 | 179  j = rec.sport.baseball
   0   0   0   1   0   0   0   0   0   0 183   0   0   0   0   0   0   0   0   1 | 185  k = rec.sport.hockey
   0   1   0   0   0   3   0   1   0   1   0 193   0   2   0   0   0   1   1   2 | 205  l = sci.crypt
   0   3   0   9   4   2   3   1   0   0   1   2 171   0   0   0   0   0   0   0 | 196  m = sci.electronics
   0   2   1   1   0   0   1   0   0   0   0   0   1 190   2   0   0   0   0   0 | 198  n = sci.med
   0   3   0   0   0   1   0   0   0   0   0   0   2   0 190   0   0   0   2   1 | 199  o = sci.space
   4   1   0   1   1   0   0   0   0   0   1   0   0   1   0 212   0   0   1   0 | 222  p = soc.religion.christian
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 170   1   0   1 | 172  q = talk.politics.mideast
   0   0   1   0   0   0   1   0   1   0   0   2   0   0   0   0   0 165   0   5 | 175  r = talk.politics.guns
  14   0   0   0   0   0   0   0   0   0   0   0   0   0   0   8   0   3 116   5 | 146  s = talk.religion.misc
   0   0   0   0   0   0   0   0   0   2   0   0   1   0   0   0   2   3   1 144 | 153  t = talk.politics.misc
{noformat}
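As a cross-check on the summary figures, accuracy can be recomputed from a confusion matrix as the sum of the diagonal (correctly classified) divided by the sum of the per-class row totals. The sketch below copies the diagonal entries and row totals from the Standard NB matrix above; the class and method names (`ConfusionSummary`, `sum`, `accuracyPercent`) are illustrative and not part of Mahout.

```java
final class ConfusionSummary {
    private ConfusionSummary() {}

    // Diagonal entries (a..t) of the Standard NB confusion matrix reported above.
    static final int[] DIAGONAL = {
        159, 155, 104, 139, 165, 175, 168, 182, 199, 177,
        183, 193, 171, 190, 190, 212, 170, 165, 116, 144
    };

    // Per-class row totals (a..t) from the same matrix.
    static final int[] ROW_TOTALS = {
        165, 180, 183, 164, 177, 199, 184, 195, 208, 179,
        185, 205, 196, 198, 199, 222, 172, 175, 146, 153
    };

    static int sum(int[] values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** Accuracy as a percentage: 100 * correct / total. */
    static double accuracyPercent(int correct, int total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return 100.0 * correct / total;
    }

    public static void main(String[] args) {
        int correct = sum(DIAGONAL);
        int total = sum(ROW_TOTALS);
        System.out.println(correct + " of " + total + " correct, accuracy % = "
            + accuracyPercent(correct, total));
    }
}
```

The off-diagonal entries are the misclassifications, so their count is simply the total minus the diagonal sum.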
> Example from book no longer works - prepare20newsgroups broken with Lucene
> upgrade
> ----------------------------------------------------------------------------------
>
> Key: MAHOUT-1006
> URL: https://issues.apache.org/jira/browse/MAHOUT-1006
> Project: Mahout
> Issue Type: Bug
> Affects Versions: 0.7
> Reporter: Ted Dunning
> Assignee: Robin Anil
> Priority: Critical
> Fix For: 0.7
>
> Attachments: MAHOUT-1006.patch
>
>
> The StandardAnalyzer from Lucene no longer has a no-args constructor. Our
> code uses reflection to create this class, but looks for a no-args
> constructor and that causes this:
> {code}
> ./bin/mahout prepare20newsgroups -p 20news-bydate-train/ -o 20news-train/ -a org.apache.lucene.analysis.standard.StandardAnalyzer -c UTF-8
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> no HADOOP_HOME set, running locally
> Unable to find a $JAVA_HOME at "/usr", continuing with system-provided Java...
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/Users/hadoop/mahout/examples/target/mahout-examples-0.7-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/Users/hadoop/mahout/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/Users/hadoop/mahout/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> Exception in thread "main" java.lang.IllegalStateException: java.lang.NoSuchMethodException: org.apache.lucene.analysis.standard.StandardAnalyzer.<init>()
>     at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:68)
>     at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:28)
>     at org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups.main(PrepareTwentyNewsgroups.java:89)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
> Caused by: java.lang.NoSuchMethodException: org.apache.lucene.analysis.standard.StandardAnalyzer.<init>()
>     at java.lang.Class.getConstructor0(Class.java:2706)
>     at java.lang.Class.getConstructor(Class.java:1657)
>     at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:62)
>     ... 9 more
> {code}
> This is really bad.
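
The failure mode in the quoted trace (a reflective lookup of a no-args constructor that does not exist, surfaced as an IllegalStateException wrapping a NoSuchMethodException) can be reproduced with plain JDK reflection. The sketch below uses a made-up class, `NeedsArgument`, as a stand-in for any class that only has a one-argument constructor, and shows one possible fallback. It is an illustration under those assumptions, not the fix applied in MAHOUT-1006 and not the code in ClassUtils.

```java
final class ReflectiveFactory {
    private ReflectiveFactory() {}

    /** Hypothetical stand-in for a class that offers no no-args constructor. */
    public static final class NeedsArgument {
        private final String setting;

        public NeedsArgument(String setting) {
            this.setting = setting;
        }

        public String setting() {
            return setting;
        }
    }

    /**
     * Tries the public no-args constructor first; if there is none, falls back
     * to a public one-argument constructor taking argType.
     */
    static <T> T instantiate(Class<T> type, Class<?> argType, Object arg) {
        try {
            try {
                return type.getConstructor().newInstance();
            } catch (NoSuchMethodException noDefaultConstructor) {
                // No no-args constructor: look for the one-argument form instead.
                return type.getConstructor(argType).newInstance(arg);
            }
        } catch (ReflectiveOperationException e) {
            // Wrap in an unchecked exception, as the trace above does.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        NeedsArgument made = instantiate(NeedsArgument.class, String.class, "demo");
        System.out.println(made.setting());
    }
}
```

If neither constructor exists, the caller still gets an IllegalStateException, but only after both lookups have been tried.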