Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Wikipedia Bayes Example 
(https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example)


Edited by Robin Anil:
---------------------------------------------------------------------
h1. Intro

The Mahout Examples source comes with tools for classifying a Wikipedia data 
dump using either the Naive Bayes or Complementary Naive Bayes implementations 
in Mahout. The example described below fetches a Wikipedia dump and splits it 
into chunks. The articles in these chunks are then grouped by country. From 
these per-country splits, a classifier is trained to predict which country an 
unseen article should be categorized under.


h1. Running the example
NOTE: Substitute the appropriate version of Mahout in the commands below (i.e. 
replace 0.1-dev with the version you are using).

# Change to the examples directory: {code}cd <MAHOUT_HOME>/examples{code}
# Fetch the Wikipedia dump: {code}ant -f build-deprecated.xml enwiki-files{code}
# Chunk the data into pieces: {code}<HADOOP_HOME>/bin/hadoop jar \
  <MAHOUT_HOME>/examples/target/apache-mahout-0.1-dev-ex.jar \
  org.apache.mahout.classifier.bayes.WikipediaXmlSplitter \
  -d <MAHOUT_HOME>/examples/temp/enwiki-latest-pages-articles.xml \
  -o <MAHOUT_HOME>/examples/work/wikipedia/chunks/ -c 64{code} {quote}*We strongly suggest you back up the results to some other place so that you don't have to repeat this step if they are accidentally erased.*{quote}
# Move the chunks to HDFS: {code}<HADOOP_HOME>/bin/hadoop dfs -put \
  <MAHOUT_HOME>/examples/work/wikipedia/chunks/ wikipediadump{code}
# Create the country-based split of the Wikipedia dataset: {code}<HADOOP_HOME>/bin/hadoop jar \
  <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.job \
  org.apache.mahout.classifier.bayes.WikipediaDatasetCreator \
  -i wikipediadump -o wikipediainput \
  -c <MAHOUT_HOME>/examples/src/test/resources/country.txt{code}
# Train the classifier: {code}<HADOOP_HOME>/bin/hadoop jar \
  <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.job \
  org.apache.mahout.classifier.bayes.TrainClassifier \
  -i wikipediainput -o wikipediamodel --gramSize 3 --classifierType bayes{code}
# Fetch the input files for testing: {code}<HADOOP_HOME>/bin/hadoop dfs -get wikipediainput wikipediainput{code}
# Test the classifier: {code}<HADOOP_HOME>/bin/hadoop jar \
  <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.jar \
  org.apache.mahout.classifier.bayes.TestClassifier \
  -p wikipediamodel -t wikipediainput{code}
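
The steps above can also be collected into one script, which avoids retyping the long paths and the version string. The following is a minimal sketch, not part of the Mahout distribution. The default paths, the {{BACKUP_DIR}} location and the {{DRY_RUN}} switch are assumptions made for illustration, and the artifact names are copied from the steps above, so check them against what your build actually produced. By default the script only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch of the steps above as a single script.
# All defaults are assumptions; point them at your own installation.
HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop}"    # placeholder path
MAHOUT_HOME="${MAHOUT_HOME:-/opt/mahout}"    # placeholder path
MAHOUT_VERSION="${MAHOUT_VERSION:-0.1-dev}"  # version string used in artifact names
BACKUP_DIR="${BACKUP_DIR:-$HOME/wikipedia-chunks-backup}"  # hypothetical backup location
DRY_RUN="${DRY_RUN:-1}"                      # hypothetical switch: 1 = print, 0 = execute

HADOOP="$HADOOP_HOME/bin/hadoop"
EXAMPLES="$MAHOUT_HOME/examples"
TARGET="$EXAMPLES/target"
CHUNKS="$EXAMPLES/work/wikipedia/chunks/"
PKG="org.apache.mahout.classifier.bayes"

# Print the command in dry-run mode; otherwise execute it and stop on failure.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@" || exit 1
  fi
}

wikipedia_bayes_steps() {
  # Steps 1-2: fetch the Wikipedia dump.
  run cd "$EXAMPLES"
  run ant -f build-deprecated.xml enwiki-files

  # Step 3: chunk the dump, then keep a copy of the chunks as the page advises.
  run "$HADOOP" jar "$TARGET/apache-mahout-$MAHOUT_VERSION-ex.jar" \
    "$PKG.WikipediaXmlSplitter" \
    -d "$EXAMPLES/temp/enwiki-latest-pages-articles.xml" -o "$CHUNKS" -c 64
  run cp -r "$CHUNKS" "$BACKUP_DIR"

  # Step 4: move the chunks to HDFS.
  run "$HADOOP" dfs -put "$CHUNKS" wikipediadump

  # Step 5: create the country-based split.
  run "$HADOOP" jar "$TARGET/apache-mahout-examples-$MAHOUT_VERSION.job" \
    "$PKG.WikipediaDatasetCreator" \
    -i wikipediadump -o wikipediainput -c "$EXAMPLES/src/test/resources/country.txt"

  # Step 6: train the classifier.
  run "$HADOOP" jar "$TARGET/apache-mahout-examples-$MAHOUT_VERSION.job" \
    "$PKG.TrainClassifier" \
    -i wikipediainput -o wikipediamodel --gramSize 3 --classifierType bayes

  # Steps 7-8: fetch the test input and test the classifier.
  run "$HADOOP" dfs -get wikipediainput wikipediainput
  run "$HADOOP" jar "$TARGET/apache-mahout-examples-$MAHOUT_VERSION.jar" \
    "$PKG.TestClassifier" -p wikipediamodel -t wikipediainput
}

wikipedia_bayes_steps
```

Saved under a name of your choice, the script prints the nine commands when run as is. Setting {{DRY_RUN=0}} together with real values for {{HADOOP_HOME}} and {{MAHOUT_HOME}} executes them in order and stops at the first failure.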

