Author: srowen
Date: Tue Dec  6 21:11:26 2011
New Revision: 1211153

URL: http://svn.apache.org/viewvc?rev=1211153&view=rev
Log:
Clear or merge some duplicated content

Removed:
    mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/bayes/package-info.java
    mahout/trunk/integration/src/main/java/org/apache/mahout/cf/taste/example/
Modified:
    mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaXmlSplitter.java

Modified: mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaXmlSplitter.java
URL: http://svn.apache.org/viewvc/mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaXmlSplitter.java?rev=1211153&r1=1211152&r2=1211153&view=diff
==============================================================================
--- mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaXmlSplitter.java (original)
+++ mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/bayes/WikipediaXmlSplitter.java Tue Dec  6 21:11:26 2011
@@ -46,8 +46,29 @@ import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
 /**
- * Splits the wikipedia xml file in to chunks of size as specified by command line parameter
- * 
- * 
+ * <p>The Bayes example package provides some helper classes for training the Naive Bayes classifier
+ * on the Twenty Newsgroups data. See {@link org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups}
+ * for details on running the trainer and
+ * formatting the Twenty Newsgroups data properly for the training.</p>
+ *
+ * <p>The easiest way to prepare the data is to use the ant task in core/build.xml:</p>
+ *
+ * <p>{@code ant extract-20news-18828}</p>
+ *
+ * <p>This runs the arg line:</p>
+ *
+ * <p>{@code -p $\{working.dir\}/20news-18828/ -o $\{working.dir\}/20news-18828-collapse -a $\{analyzer\} -c UTF-8}</p>
+ *
+ * <p>To Run the Wikipedia examples (assumes you've built the Mahout Job jar):</p>
+ *
+ * <ol>
+ *  <li>Download the Wikipedia Dataset. Use the Ant target: {@code ant enwiki-files}</li>
+ *  <li>Chunk the data using the WikipediaXmlSplitter (from the Hadoop home):
+ *   {@code bin/hadoop jar $MAHOUT_HOME/target/mahout-examples-0.x
+ *   org.apache.mahout.classifier.bayes.WikipediaXmlSplitter
+ *   -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml
+ *   -o $MAHOUT_HOME/examples/work/wikipedia/chunks/ -c 64}</li>
+ * </ol>
  */
 public final class WikipediaXmlSplitter {
   

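For readers following along outside the diff, the two steps the new Javadoc documents can be sketched as a shell session. The class name, paths, jar name (`mahout-examples-0.x`, a placeholder in the Javadoc itself), and the 64 MB chunk size are taken from the Javadoc text above; `MAHOUT_HOME` pointing at a built checkout is an assumption, and the command is only assembled and echoed here, not executed:

```shell
# Sketch of the workflow the updated javadoc describes (assumes a built
# Mahout job jar and a Hadoop installation; adjust paths to your setup).

MAHOUT_HOME="${MAHOUT_HOME:-$HOME/mahout}"   # assumption: your Mahout checkout

# 1. Download the Wikipedia dataset via the Ant target:
#      ant enwiki-files
# 2. Chunk the dump into 64 MB pieces with WikipediaXmlSplitter,
#    run from the Hadoop home directory. The command line, per the javadoc:
CHUNK_CMD="bin/hadoop jar $MAHOUT_HOME/target/mahout-examples-0.x \
  org.apache.mahout.classifier.bayes.WikipediaXmlSplitter \
  -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml \
  -o $MAHOUT_HOME/examples/work/wikipedia/chunks/ -c 64"

echo "$CHUNK_CMD"
```

The `-d` flag names the input dump, `-o` the output chunk directory, and `-c` the chunk size in megabytes, matching the option usage shown in the Javadoc.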
