[ 
https://issues.apache.org/jira/browse/MAHOUT-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288304#comment-13288304
 ] 

Robin Anil commented on MAHOUT-1006:
------------------------------------

Phew! After weeks of re-reading the code, I finally figured out that the 
theta normalization driver is broken. The current code is somehow different. 
Right now I do not have a fix for it. However, the theta-normalization 
problem should cost only some 1-2% of accuracy, so I would not be too worried 
about it. The solution works only with seq2sparse and tfidf vectors. It 
assumes that the vectors in the input sequence file are named in the format 
"/class-name/filename", and it expects this to hold however it is used. 
Sorry, I don't have a better representation for class name and vector name; 
time was short, and this has to be a working replacement for bayes in this 
release. I have gone ahead and added deprecation messages that appear if 
people try to run prepare20newsgroups via the command line.
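
As an illustration of the "/class-name/filename" naming convention, here is a 
minimal sketch of pulling the class name out of such a vector name. The class 
LabelFromKey, the method labelOf and the sample file name 51060 are 
hypothetical and are not taken from the patch.

```java
final class LabelFromKey {

    private LabelFromKey() {
    }

    /**
     * Returns the class name from a vector name of the form
     * "/class-name/filename". Hypothetical helper, not Mahout code.
     */
    static String labelOf(String vectorName) {
        // Splitting "/alt.atheism/51060" on "/" yields ["", "alt.atheism", "51060"].
        String[] parts = vectorName.split("/");
        if (parts.length < 3 || !parts[0].isEmpty() || parts[1].isEmpty()) {
            throw new IllegalArgumentException(
                "Expected /class-name/filename but got: " + vectorName);
        }
        return parts[1];
    }

    public static void main(String[] args) {
        // The file name 51060 is made up for the example.
        System.out.println(labelOf("/alt.atheism/51060"));
    }
}
```

Rejecting names that do not match the convention makes a mislabeled input 
fail early instead of silently training on a wrong class.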

For an 80-20 random split, the classifier gives 91% accuracy on the 
20newsgroups data. The example has been added to classify-20newsgroups.sh.
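
For reference, the accuracy figure is just the correctly classified count 
divided by the total, and a per-class figure can be read off one row of the 
confusion matrix the same way. The sketch below uses the counts from the 
summary and the comp.os.ms-windows.misc row of the output that follows; the 
class and method names are hypothetical.

```java
final class AccuracyFromCounts {

    private AccuracyFromCounts() {
    }

    /** Overall accuracy in percent: correct / total. */
    static double accuracyPercent(long correct, long total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return 100.0 * correct / total;
    }

    /** Share of one confusion-matrix row that landed in the right column, in percent. */
    static double rowAccuracyPercent(int[] row, int classIndex) {
        long sum = 0;
        for (int count : row) {
            sum += count;
        }
        return accuracyPercent(row[classIndex], sum);
    }

    public static void main(String[] args) {
        // Counts from the summary: 3357 correct out of 3685.
        System.out.printf("overall: %.4f%%%n", accuracyPercent(3357, 3685));

        // Row c (comp.os.ms-windows.misc) from the confusion matrix; index 2 is column c.
        int[] rowC = {0, 26, 104, 36, 7, 6, 1, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0};
        System.out.printf("row c:   %.2f%%%n", rowAccuracyPercent(rowC, 2));
    }
}
```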

If I get more time this week during Buzzwords, I will try to fix the issue 
with the theta normalizer as well. But this should be releasable even if I am 
not able to do that in time.

{noformat} 

Standard NB Results: =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       3357       91.0991%
Incorrectly Classified Instances        :        328        8.9009%
Total Classified Instances              :       3685

=======================================================
Confusion Matrix
-------------------------------------------------------
a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p    q    r    s    t    <--Classified as
159  0    0    0    1    0    0    0    0    0    0    0    0    0    0    1    0    0    4    0    |  165  a = alt.atheism
0    155  0    8    3    6    3    0    0    0    0    1    2    0    2    0    0    0    0    0    |  180  b = comp.graphics
0    26   104  36   7    6    1    0    0    0    0    0    1    0    2    0    0    0    0    0    |  183  c = comp.os.ms-windows.misc
0    4    2    139  11   0    5    0    0    0    0    1    2    0    0    0    0    0    0    0    |  164  d = comp.sys.ibm.pc.hardware
0    2    1    2    165  0    3    0    0    0    0    1    3    0    0    0    0    0    0    0    |  177  e = comp.sys.mac.hardware
1    13   0    5    2    175  3    0    0    0    0    0    0    0    0    0    0    0    0    0    |  199  f = comp.windows.x
0    2    0    5    0    0    168  3    0    2    2    0    1    0    0    0    0    0    1    0    |  184  g = misc.forsale
0    0    0    1    2    0    2    182  3    0    0    0    4    0    0    1    0    0    0    0    |  195  h = rec.autos
0    0    0    0    0    1    5    2    199  0    0    0    1    0    0    0    0    0    0    0    |  208  i = rec.motorcycles
0    0    0    0    0    0    1    0    0    177  1    0    0    0    0    0    0    0    0    0    |  179  j = rec.sport.baseball
0    0    0    1    0    0    0    0    0    0    183  0    0    0    0    0    0    0    0    1    |  185  k = rec.sport.hockey
0    1    0    0    0    3    0    1    0    1    0    193  0    2    0    0    0    1    1    2    |  205  l = sci.crypt
0    3    0    9    4    2    3    1    0    0    1    2    171  0    0    0    0    0    0    0    |  196  m = sci.electronics
0    2    1    1    0    0    1    0    0    0    0    0    1    190  2    0    0    0    0    0    |  198  n = sci.med
0    3    0    0    0    1    0    0    0    0    0    0    2    0    190  0    0    0    2    1    |  199  o = sci.space
4    1    0    1    1    0    0    0    0    0    1    0    0    1    0    212  0    0    1    0    |  222  p = soc.religion.christian
0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    170  1    0    1    |  172  q = talk.politics.mideast
0    0    1    0    0    0    1    0    1    0    0    2    0    0    0    0    0    165  0    5    |  175  r = talk.politics.guns
14   0    0    0    0    0    0    0    0    0    0    0    0    0    0    8    0    3    116  5    |  146  s = talk.religion.misc
0    0    0    0    0    0    0    0    0    2    0    0    1    0    0    0    2    3    1    144  |  153  t = talk.politics.misc

Complementary Results: =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       3357       91.0991%
Incorrectly Classified Instances        :        328        8.9009%
Total Classified Instances              :       3685

=======================================================
Confusion Matrix
-------------------------------------------------------
a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p    q    r    s    t    <--Classified as
159  0    0    0    1    0    0    0    0    0    0    0    0    0    0    1    0    0    4    0    |  165  a = alt.atheism
0    155  0    8    3    6    3    0    0    0    0    1    2    0    2    0    0    0    0    0    |  180  b = comp.graphics
0    26   104  36   7    6    1    0    0    0    0    0    1    0    2    0    0    0    0    0    |  183  c = comp.os.ms-windows.misc
0    4    2    139  11   0    5    0    0    0    0    1    2    0    0    0    0    0    0    0    |  164  d = comp.sys.ibm.pc.hardware
0    2    1    2    165  0    3    0    0    0    0    1    3    0    0    0    0    0    0    0    |  177  e = comp.sys.mac.hardware
1    13   0    5    2    175  3    0    0    0    0    0    0    0    0    0    0    0    0    0    |  199  f = comp.windows.x
0    2    0    5    0    0    168  3    0    2    2    0    1    0    0    0    0    0    1    0    |  184  g = misc.forsale
0    0    0    1    2    0    2    182  3    0    0    0    4    0    0    1    0    0    0    0    |  195  h = rec.autos
0    0    0    0    0    1    5    2    199  0    0    0    1    0    0    0    0    0    0    0    |  208  i = rec.motorcycles
0    0    0    0    0    0    1    0    0    177  1    0    0    0    0    0    0    0    0    0    |  179  j = rec.sport.baseball
0    0    0    1    0    0    0    0    0    0    183  0    0    0    0    0    0    0    0    1    |  185  k = rec.sport.hockey
0    1    0    0    0    3    0    1    0    1    0    193  0    2    0    0    0    1    1    2    |  205  l = sci.crypt
0    3    0    9    4    2    3    1    0    0    1    2    171  0    0    0    0    0    0    0    |  196  m = sci.electronics
0    2    1    1    0    0    1    0    0    0    0    0    1    190  2    0    0    0    0    0    |  198  n = sci.med
0    3    0    0    0    1    0    0    0    0    0    0    2    0    190  0    0    0    2    1    |  199  o = sci.space
4    1    0    1    1    0    0    0    0    0    1    0    0    1    0    212  0    0    1    0    |  222  p = soc.religion.christian
0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    170  1    0    1    |  172  q = talk.politics.mideast
0    0    1    0    0    0    1    0    1    0    0    2    0    0    0    0    0    165  0    5    |  175  r = talk.politics.guns
14   0    0    0    0    0    0    0    0    0    0    0    0    0    0    8    0    3    116  5    |  146  s = talk.religion.misc
0    0    0    0    0    0    0    0    0    2    0    0    1    0    0    0    2    3    1    144  |  153  t = talk.politics.misc



{noformat}
                
> Example from book no longer works - prepare20newsgroups broken with Lucene 
> upgrade
> ----------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1006
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1006
>             Project: Mahout
>          Issue Type: Bug
>    Affects Versions: 0.7
>            Reporter: Ted Dunning
>            Assignee: Robin Anil
>            Priority: Critical
>             Fix For: 0.7
>
>         Attachments: MAHOUT-1006.patch
>
>
> The StandardAnalyzer from Lucene no longer has a no-args constructor.  Our 
> code uses reflection to create this class, but looks for a no-args 
> constructor and that causes this:
> {code}
> ./bin/mahout prepare20newsgroups -p 20news-bydate-train/ -o 20news-train/ -a org.apache.lucene.analysis.standard.StandardAnalyzer -c UTF-8
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> no HADOOP_HOME set, running locally
> Unable to find a $JAVA_HOME at "/usr", continuing with system-provided Java...
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/Users/hadoop/mahout/examples/target/mahout-examples-0.7-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/Users/hadoop/mahout/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/Users/hadoop/mahout/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> Exception in thread "main" java.lang.IllegalStateException: java.lang.NoSuchMethodException: org.apache.lucene.analysis.standard.StandardAnalyzer.<init>()
>       at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:68)
>       at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:28)
>       at org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups.main(PrepareTwentyNewsgroups.java:89)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>       at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>       at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
> Caused by: java.lang.NoSuchMethodException: org.apache.lucene.analysis.standard.StandardAnalyzer.<init>()
>       at java.lang.Class.getConstructor0(Class.java:2706)
>       at java.lang.Class.getConstructor(Class.java:1657)
>       at org.apache.mahout.common.ClassUtils.instantiateAs(ClassUtils.java:62)
>       ... 9 more
> {code}
> This is really bad.
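
The failure mode in the trace above can be reproduced without Lucene. The 
sketch below is only an illustration under stated assumptions: 
VersionedAnalyzer is a hypothetical stand-in for a class that has no no-args 
constructor, and the fallback to a one-argument String constructor is an 
assumed workaround, not what ClassUtils or the attached patch does.

```java
import java.lang.reflect.Constructor;

final class ReflectiveInstantiation {

    private ReflectiveInstantiation() {
    }

    /** Hypothetical stand-in for a class that only offers a one-argument constructor. */
    public static final class VersionedAnalyzer {
        final String version;

        public VersionedAnalyzer(String version) {
            this.version = version;
        }
    }

    /**
     * Tries the no-args constructor first. If the class has none,
     * getConstructor() throws NoSuchMethodException, and we fall back
     * to a public constructor taking a single String.
     */
    static <T> T instantiate(Class<T> clazz, String fallbackArg) {
        try {
            try {
                return clazz.getConstructor().newInstance();
            } catch (NoSuchMethodException noDefaultConstructor) {
                Constructor<T> ctor = clazz.getConstructor(String.class);
                return ctor.newInstance(fallbackArg);
            }
        } catch (ReflectiveOperationException e) {
            // Wrapped in an unchecked exception, as the trace above shows for the original failure.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        VersionedAnalyzer analyzer = instantiate(VersionedAnalyzer.class, "example-version");
        System.out.println(analyzer.version);
    }
}
```

A lookup that only ever calls getConstructor() with no parameter types fails 
for such a class no matter which arguments the caller has available, which is 
why the trace ends in NoSuchMethodException for <init>().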


        
