Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: TwentyNewsgroups (http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups)


Edited by Robin Anil:
---------------------------------------------------------------------
h1. Twenty Newsgroups Classification

[Get Mahout|http://cwiki.apache.org/confluence/display/MAHOUT/index#index-Installation%2FSetup]
 

Assume the environment variable $HADOOP_HOME refers to the directory where you checked out or installed Hadoop.
Assume the environment variable $MAHOUT_HOME refers to the directory where you checked out or installed Mahout.
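
Before running any of the commands on this page, it can help to confirm that both variables are set. The following is a minimal sketch, not part of Mahout; the helper name {{missing_homes}} is made up for illustration, and it only checks that each variable points at an existing directory.

```python
import os

def missing_homes(env, names=("HADOOP_HOME", "MAHOUT_HOME")):
    """Return the variable names that are unset or do not point at a directory."""
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

if __name__ == "__main__":
    problems = missing_homes(os.environ)
    if problems:
        print("Set these to valid directories first:", ", ".join(problems))
    else:
        print("HADOOP_HOME and MAHOUT_HOME both point at existing directories.")
```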

After downloading the distribution, unzip/untar it into the directory of your 
choice and do:

h2. Setup:

# In trunk, mvn install  // This will compile everything and create the Hadoop job.
# cd examples

NOTE: For the Mahout 0.1 release, do the following:

# If you've run this before, you may want to rm -rf the work and temp directories
# ant -f build-deprecated.xml get-files  // Note: we are in the process of updating to Maven
# mkdir lib  // NOTE: The next few steps are an interim workaround while we fully migrate to Maven
# mvn dependency:copy-dependencies -DoutputDirectory=lib
# ant -f build-deprecated.xml extract-20news-18828 -Ddest=target
# mv 20news-18828-collapse 20news-input

NOTE: After you have done this, skip to the Hadoop section to run the 20newsgroups example with the Mahout 0.1 release.

For Mahout releases >0.2, run the following commands in order to execute the 20newsgroups example without a Hadoop cluster. We assume that the 20newsgroups dataset has been downloaded into the examples directory.
To generate the input dataset:
{noformat}
$ tar zxf 20news-18828.tar.gz  # extract the 20newsgroups dataset
$ mkdir 20news-input
$ mvn -e exec:java \
    -Dexec.mainClass=org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
    -Dexec.args="-p 20news-18828 -o 20news-input -a org.apache.lucene.analysis.standard.StandardAnalyzer -c UTF-8"
{noformat}
To train the classifier:
{noformat}
$ mvn -e exec:java \
    -Dexec.mainClass=org.apache.mahout.classifier.bayes.TrainClassifier \
    -Dexec.args="-i 20news-input -o 20news-model -type cbayes -ng 1 -source hdfs"
{noformat}
To test the classifier over the input:
{noformat}
$ mvn -e exec:java \
    -Dexec.mainClass=org.apache.mahout.classifier.bayes.TestClassifier \
    -Dexec.args="-m 20news-model -d 20news-input -type cbayes -ng 1 -source hdfs -method sequential"
{noformat}
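
The three steps above have to run in order (prepare, train, test), and each passes its whole option string to Maven as a single {{-Dexec.args}} argument. A small driver script could look like the sketch below. It is only an illustration: the helper {{mvn_exec}} and the {{STEPS}} list are made-up names, and the script prints the commands rather than running them.

```python
import shlex

BAYES_PKG = "org.apache.mahout.classifier.bayes"

def mvn_exec(main_class, args):
    """Build the argv for `mvn -e exec:java`; `args` stays one single argument."""
    return ["mvn", "-e", "exec:java",
            f"-Dexec.mainClass={BAYES_PKG}.{main_class}",
            f"-Dexec.args={args}"]

# Same classes and options as the commands above, in the order they must run.
STEPS = [
    mvn_exec("PrepareTwentyNewsgroups",
             "-p 20news-18828 -o 20news-input "
             "-a org.apache.lucene.analysis.standard.StandardAnalyzer -c UTF-8"),
    mvn_exec("TrainClassifier",
             "-i 20news-input -o 20news-model -type cbayes -ng 1 -source hdfs"),
    mvn_exec("TestClassifier",
             "-m 20news-model -d 20news-input -type cbayes -ng 1 -source hdfs "
             "-method sequential"),
]

for argv in STEPS:
    # Print a shell-quoted command line; to execute it instead, one could use
    # subprocess.run(argv, check=True) from the examples directory.
    print(shlex.join(argv))
```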

h2. Running the 20newsgroups example over a Hadoop cluster

h3. Set Up Hadoop Cluster
# emacs $HADOOP_HOME/conf/hadoop-site.xml (add in local settings per [quickstart|http://hadoop.apache.org/core/docs/current/quickstart.html])
# $HADOOP_HOME/bin/hadoop namenode -format  // Format the HDFS
# $HADOOP_HOME/bin/start-all.sh  // Start Hadoop
# $HADOOP_HOME/bin/hadoop dfs -put $MAHOUT_HOME/examples/20news-input 20news-input  // Copies the extracted text to HDFS
 
Example:
Train the Bayes classifier using tri-grams:
{code}
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.1.job \
    org.apache.mahout.classifier.bayes.TrainClassifier \
    -i 20news-input -o newsmodel -ng 3 -type bayes
{code}
This runs 4 MapReduce jobs on Hadoop to train the classifier and will take a while on a single-node machine. You can monitor the status of these jobs by opening a web browser on your JobTracker node: http://localhost:50030/jobtracker.jsp
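
The {{-ng 3}} option in the training command corresponds to tri-grams. As a rough illustration of what a word n-gram is, the sketch below slides a window of n tokens over a sentence. It is not Mahout's feature extraction: the real tokens depend on the analyzer, Mahout may also emit shorter grams, and {{word_ngrams}} is a made-up helper name.

```python
def word_ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps".split()

print(word_ngrams(tokens, 1))
# ['the', 'quick', 'brown', 'fox', 'jumps']
print(word_ngrams(tokens, 3))
# ['the quick brown', 'quick brown fox', 'brown fox jumps']
```

Larger n captures more word order but produces many more distinct features, which is one reason training with tri-grams takes longer than with single words.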

Test the classifier over the input folder:
{code}
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.1.job \
    org.apache.mahout.classifier.bayes.TestClassifier \
    -p newsmodel -t work/20news-input -ng 3 -type bayes
{code}

Output might look like:
{code}
08/11/07 16:52:39 INFO bayes.TestClassifier: Done loading model: # labels: 20
08/11/07 16:52:39 INFO bayes.TestClassifier: Done generating Model
08/11/07 16:52:57 INFO bayes.TestClassifier: alt.atheism               96.9962453066333   775/799.0
08/11/07 16:53:15 INFO bayes.TestClassifier: comp.graphics             99.28057553956835  966/973.0
08/11/07 16:53:45 INFO bayes.TestClassifier: comp.os.ms-windows.misc   96.95431472081218  955/985.0
08/11/07 16:53:59 INFO bayes.TestClassifier: comp.sys.ibm.pc.hardware  99.59266802443992  978/982.0
08/11/07 16:54:10 INFO bayes.TestClassifier: comp.sys.mac.hardware     99.47970863683663  956/961.0
08/11/07 16:54:28 INFO bayes.TestClassifier: comp.windows.x            99.59183673469387  976/980.0
08/11/07 16:54:38 INFO bayes.TestClassifier: misc.forsale              98.45679012345678  957/972.0
08/11/07 16:54:50 INFO bayes.TestClassifier: rec.autos                 99.4949494949495   985/990.0
08/11/07 16:55:04 INFO bayes.TestClassifier: rec.motorcycles           100.0              994/994.0
08/11/07 16:55:16 INFO bayes.TestClassifier: rec.sport.baseball        99.89939637826961  993/994.0
08/11/07 16:55:36 INFO bayes.TestClassifier: rec.sport.hockey          99.89989989989989  998/999.0
08/11/07 16:55:54 INFO bayes.TestClassifier: sci.crypt                 99.39455095862765  985/991.0
08/11/07 16:56:05 INFO bayes.TestClassifier: sci.electronics           98.98063200815494  971/981.0
08/11/07 16:56:27 INFO bayes.TestClassifier: sci.med                   99.79797979797979  988/990.0
08/11/07 16:56:44 INFO bayes.TestClassifier: sci.space                 99.3920972644377   981/987.0
08/11/07 16:57:06 INFO bayes.TestClassifier: soc.religion.christian    99.49849548645938  992/997.0
08/11/07 16:57:24 INFO bayes.TestClassifier: talk.politics.guns        99.45054945054945  905/910.0
08/11/07 16:57:51 INFO bayes.TestClassifier: talk.politics.mideast     98.82978723404256  929/940.0
08/11/07 16:58:13 INFO bayes.TestClassifier: talk.politics.misc        89.93548387096774  697/775.0
08/11/07 16:58:25 INFO bayes.TestClassifier: talk.religion.misc        61.78343949044586  388/628.0
08/11/07 16:58:25 INFO bayes.TestClassifier: 
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :      18369       97.5621%
Incorrectly Classified Instances        :        459        2.4379%
Total Classified Instances              :      18828

=======================================================
Confusion Matrix
-------------------------------------------------------
a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p    q    r    s    t    <--Classified as
994  0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0     |  994  a = rec.motorcycles
0    976  0    0    0    0    0    0    0    0    1    0    0    0    0    0    0    0    2    1     |  980  b = comp.windows.x
7    0    929  1    0    0    0    0    0    0    0    0    1    0    2    0    0    0    0    0     |  940  c = talk.politics.mideast
0    0    0    905  0    0    1    0    0    0    0    0    0    0    0    0    3    0    1    0     |  910  d = talk.politics.guns
4    1    4    27   388  1    0    1    0    5    1    1    2    2    149  7    2    33   0    0     |  628  e = talk.religion.misc
3    0    0    0    0    985  0    1    0    0    0    0    0    1    0    0    0    0    0    0     |  990  f = rec.autos
0    0    0    0    0    0    993  1    0    0    0    0    0    0    0    0    0    0    0    0     |  994  g = rec.sport.baseball
0    0    0    0    0    0    1    998  0    0    0    0    0    0    0    0    0    0    0    0     |  999  h = rec.sport.hockey
0    0    0    0    0    0    0    0    956  0    2    0    0    0    0    0    0    0    2    1     |  961  i = comp.sys.mac.hardware
0    0    0    0    0    0    0    0    0    981  0    0    5    0    0    1    0    0    0    0     |  987  j = sci.space
0    0    0    0    0    0    0    0    0    0    978  0    1    0    0    0    0    0    2    1     |  982  k = comp.sys.ibm.pc.hardware
1    0    3    36   0    1    2    1    0    5    0    697  4    0    3    3    19   0    0    0     |  775  l = talk.politics.misc
0    2    0    0    0    0    0    0    0    0    2    0    966  0    0    0    0    0    2    1     |  973  m = comp.graphics
1    0    0    0    0    0    0    0    0    0    6    0    0    971  0    0    0    0    3    0     |  981  n = sci.electronics
1    0    0    0    0    0    0    0    1    0    0    0    0    0    992  1    0    1    0    1     |  997  o = soc.religion.christian
0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    988  0    0    0    1     |  990  p = sci.med
0    0    0    2    0    0    0    0    0    0    0    0    2    1    0    0    985  0    1    0     |  991  q = sci.crypt
0    0    0    1    1    0    0    0    0    1    0    0    1    0    19   0    1    775  0    0     |  799  r = alt.atheism
1    0    0    0    0    3    1    2    0    0    3    0    0    5    0    0    0    0    957  0     |  972  s = misc.forsale
0    0    0    8    0    0    0    0    0    0    6    0    6    0    0    0    0    0    10   955   |  985  t = comp.os.ms-windows.misc

{code}
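
In the confusion matrix above, each row is a true newsgroup and each column is the label the classifier assigned, so the diagonal holds the correctly classified documents. The sketch below shows how the per-class percentages and the summary figures follow from such a matrix. The function name {{accuracy_report}} and the 3-class toy matrix are made up for illustration; only the last two lines use figures quoted in the output above.

```python
def accuracy_report(matrix):
    """matrix[i][j] = documents with true label i that were classified as label j.

    Returns (per-class accuracy in percent, correct count, total count).
    """
    per_class = [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]
    correct = sum(row[i] for i, row in enumerate(matrix))
    total = sum(sum(row) for row in matrix)
    return per_class, correct, total

# Toy 3-class matrix with made-up counts (not taken from the run above).
toy = [[8, 1, 1],
       [0, 9, 1],
       [2, 0, 8]]
print(accuracy_report(toy))               # ([80.0, 90.0, 80.0], 25, 30)

# Figures quoted in the output above.
print(round(100.0 * 775 / 799, 4))        # alt.atheism: 96.9962
print(round(100.0 * 18369 / 18828, 4))    # overall: 97.5621
```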




h2. Complementary Naive Bayes

To train a CBayes classifier using bi-grams:
{code}
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.1.job \
    org.apache.mahout.classifier.bayes.TrainClassifier \
    -i 20news-input -o 20news-model -ng 2 -type cbayes -source <hdfs|hbase>
{code}

To test a CBayes classifier using bi-grams:
{code}
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.1.job \
    org.apache.mahout.classifier.bayes.TestClassifier \
    -p 20news-model -t work/20news-input -ng 2 -type cbayes -source <hdfs|hbase>
{code}


