All,

I am following the details given in the Mahout wiki to run the Bayes example [https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html] with the 0.4 trunk code. I had to make a few modifications to the commands to match the 0.4 snapshot, and everything through Step 5 worked. However, when I run Step 6 to train the classifier:

  $HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
    org.apache.mahout.classifier.bayes.TrainClassifier \
    --gramSize 3 --input wikipediainput10 --output wikipediamodel10 \
    --classifierType bayes --dataSource hdfs

the machine runs out of disk space.
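In case the numbers are useful, I can watch where the space goes while the job runs with something like the following (this assumes a standard Hadoop 0.20-style single-node setup; /tmp/hadoop-* is only the default hadoop.tmp.dir location, so substitute whatever your configuration uses):

  # Overall DFS capacity and per-datanode usage
  $HADOOP_HOME/bin/hadoop dfsadmin -report

  # Total size of the example's input and output directories in HDFS
  $HADOOP_HOME/bin/hadoop fs -dus wikipediainput10 wikipediamodel10

  # Local disk consumed by intermediate map output and logs
  du -sh /tmp/hadoop-*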
I did not run this against the complete enwiki-latest-pages-articles.xml, but only against a part of it - enwiki-latest-pages-articles10.xml [http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml.bz2]. Even with this subset, HDFS fills up my 50 GB disk. Is this normal? Does training the classifier really consume that much space, or is this something that can be controlled via Hadoop settings (for example, the compression properties sketched below)? I ask because when I terminated the classifier process, stopped Hadoop (executed $HADOOP_HOME/bin/stop-all.sh), and checked the disk space, it was back to where it started (around 43 GB free).
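These are properties I found in the Hadoop 0.20 documentation that compress the intermediate map output; I have not verified that they make a difference for this particular job, so treat this as a guess. They would go in conf/mapred-site.xml:

  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  </property>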
If the space usage is normal, is there a smaller data set over which I can run the classifier? I want to see the classifier's output before I try to understand the code (the intent is also for me to learn how to run Mahout algorithms and write example code).

Should I be asking this sort of question on the mahout-users list?

Thank you,
Gangadhar