On Mon, Sep 13, 2010 at 3:41 AM, Gangadhar Nittala
<[email protected]>wrote:

> All,
>
> I am following the details given in the Mahout wiki to run the Bayes
> example [ https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html
> ] with the 0.4 trunk code. I had to make a few modifications to the
> commands to match the 0.4 snapshot, but when I run the Step 6 - to
> train the classifier thus (I was able to get everything till Step 5
> right), $HADOOP_HOME/bin/hadoop jar
> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3
> --input wikipediainput10 --output wikipediamodel10 --classifierType
> bayes --dataSource hdfs, the machine runs out of disk-space.
>
> I did not run this for the complete enwiki-latest-pages-articles.xml
> but only a part of the complete articles -
> enwiki-latest-pages-articles10.xml.
> [
> http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml.bz2
> ].
> Even with this, the HDFS fills up my 50 GB disk. Is this normal? Does
> the training of the classifier consume so much space? Or is this
> something that can be controlled via hadoop settings? I ask this
> because, when I terminated the classifier process, stopped hadoop
> (executed $HADOOP_HOME/bin/stop-all.sh) and checked the disk space, it
> was back to what it was (around 43 GB free).
>

Yes, for now the classifier doesn't delete its intermediate files. The final
model itself is much smaller, under 1 GB.

>
> If the space usage is normal, is there a smaller set over which I can
> run the classifier? I want to see the output of the classifier
> before I try to understand the code (the intent was also for me to
> understand how to run Mahout algorithms and write example code).
> Should I be asking this sort of question on the mahout-users list?
>

Try using the WikipediaDatasetCreator to select only articles from a given
category list. See the code for more details.
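A rough sketch of how that might look, following the same hadoop-jar invocation style as the TrainClassifier command above. The driver class name (WikipediaDatasetCreatorDriver), the option flags, and the input/output paths here are my best recollection of the 0.4 trunk, not verified against it, so check the driver's usage output before relying on them:

```shell
# Hypothetical sketch: list only the categories you want to keep,
# one per line; a small list yields a much smaller training set.
cat > categories.txt <<'EOF'
United States
United Kingdom
EOF

# Run the dataset creator over the chunked Wikipedia XML, keeping
# only articles whose categories match an entry in categories.txt.
# Class name and flags are assumptions based on the 0.4 trunk.
$HADOOP_HOME/bin/hadoop jar \
  $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
  org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver \
  --input wikipediaxml10 \
  --output wikipediainput10 \
  --categories categories.txt
```

The resulting output directory can then be fed to TrainClassifier as --input, which should shrink the intermediate data the training job writes to HDFS.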

>
> Thank you
> Gangadhar
>
