All,

I am following the details given in the Mahout wiki to run the Bayes example [https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html] with the 0.4 trunk code. I had to make a few modifications to the commands to match the 0.4 snapshot, and everything through Step 5 worked. However, when I run Step 6 to train the classifier:

  $HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
    org.apache.mahout.classifier.bayes.TrainClassifier \
    --gramSize 3 --input wikipediainput10 --output wikipediamodel10 \
    --classifierType bayes --dataSource hdfs

the machine runs out of disk space.
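In case the numbers are useful, I can watch where the space goes while the job runs with something like the following (this assumes a standard Hadoop 0.20-style single-node setup; /tmp/hadoop-* is only the default hadoop.tmp.dir location, so substitute whatever your configuration uses):

  # Overall DFS capacity and per-datanode usage
  $HADOOP_HOME/bin/hadoop dfsadmin -report

  # Total size of the example's input and output directories in HDFS
  $HADOOP_HOME/bin/hadoop fs -dus wikipediainput10 wikipediamodel10

  # Local disk consumed by intermediate map output and logs
  du -sh /tmp/hadoop-*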
I did not run this against the complete enwiki-latest-pages-articles.xml, but only against a part of it - enwiki-latest-pages-articles10.xml [http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml.bz2]. Even with this subset, HDFS fills up my 50 GB disk. Is this normal? Does training the classifier really consume that much space, or is this something that can be controlled via Hadoop settings (for example, the compression properties sketched below)? I ask because when I terminated the classifier process, stopped Hadoop (executed $HADOOP_HOME/bin/stop-all.sh), and checked the disk space, it was back to where it started (around 43 GB free).
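These are properties I found in the Hadoop 0.20 documentation that compress the intermediate map output; I have not verified that they make a difference for this particular job, so treat this as a guess. They would go in conf/mapred-site.xml:

  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  </property>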
If the space usage is normal, is there a smaller data set over which I can run the classifier? I want to see the classifier's output before I try to understand the code (the intent is also for me to learn how to run Mahout algorithms and write example code).

Should I be asking this sort of question on the mahout-users list?

Thank you,
Gangadhar