Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: dirichlet-commandline
(http://cwiki.apache.org/confluence/display/MAHOUT/dirichlet-commandline)
Edited by Jeff Eastman:
---------------------------------------------------------------------
h1. Running Dirichlet Process Clustering from the Command Line
Mahout's Dirichlet clustering is launched with the same command line
invocation whether you are running on a single machine in stand-alone mode or
on a larger Hadoop cluster. The mode is determined by the $HADOOP_HOME
and $HADOOP_CONF_DIR environment variables. If both are set and point to a
working Hadoop cluster on the target machine, the invocation runs Dirichlet on
that cluster. If either environment variable is missing, the stand-alone
Hadoop configuration is used instead.
{code}
./bin/mahout dirichlet <OPTIONS>
{code}
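The rule above can be checked before launching a job. The following is a minimal sketch, not part of Mahout: mahout_mode is a hypothetical helper that only tests whether both variables are non-empty. It does not verify that they point to a working cluster.

```shell
# mahout_mode is a hypothetical helper (not shipped with Mahout).
# It mirrors the rule described above: both variables set => cluster,
# otherwise stand-alone. It only checks that the values are non-empty.
mahout_mode() {
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    echo "stand-alone"
  fi
}
```

Calling mahout_mode just before ./bin/mahout gives a quick hint about where the job is likely to run.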
* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job will
be generated in $MAHOUT_HOME/core/target/ and its name will contain the Mahout
version number. For example, when using the Mahout 0.3 release, the job will be
mahout-core-0.3.job
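To confirm that the build produced the job file, something like the following sketch may help. find_job_file is a hypothetical helper, and the mahout-core-*.job pattern is an assumption generalized from the 0.3 example above.

```shell
# find_job_file is a hypothetical helper: print the job file(s) expected
# under <mahout home>/core/target after "mvn install".
# $1 = Mahout home directory.
find_job_file() {
  for f in "$1"/core/target/mahout-core-*.job; do
    # An unmatched glob stays literal, so test that the file exists.
    if [ -f "$f" ]; then
      echo "$f"
    fi
  done
  return 0
}
```

For example, find_job_file "$MAHOUT_HOME" prints nothing if the build has not been run yet.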
h2. Testing it on a single machine without a cluster
* Put the data: cp <PATH TO DATA> testdata
* Run the Job:
{code}
./bin/mahout dirichlet -i testdata <OTHER OPTIONS>
{code}
h2. Running it on the cluster
* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
* Run the Job:
{code}
export HADOOP_HOME=<Hadoop Home Directory>
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
./bin/mahout dirichlet -i testdata <OTHER OPTIONS>
{code}
* Get the data out of HDFS and have a look. Use
$HADOOP_HOME/bin/hadoop fs -lsr output to view all outputs.
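The cluster steps above can be strung together in a small script. This is a sketch under stated assumptions: run and cluster_run are hypothetical helpers, HADOOP_HOME is already set, the script is started from $MAHOUT_HOME, and the job writes to a directory named output. Setting DRY_RUN=1 prints the commands instead of executing them.

```shell
# run: hypothetical helper. Echo the command when DRY_RUN=1,
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# cluster_run: hypothetical wrapper around the steps listed above.
# $1 = local path to the data; remaining arguments are passed to
# "mahout dirichlet" as <OTHER OPTIONS>.
cluster_run() {
  data=$1
  shift
  export HADOOP_CONF_DIR="$HADOOP_HOME/conf"
  run "$HADOOP_HOME/bin/hadoop" fs -put "$data" testdata
  run ./bin/mahout dirichlet -i testdata "$@"
  # Assumes the output directory is named "output".
  run "$HADOOP_HOME/bin/hadoop" fs -lsr output
}
```

With DRY_RUN=1, a call such as cluster_run <PATH TO DATA> -o output -k 10 only lists the three commands, which is a cheap way to review them before touching the cluster.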
h1. Command line options
{code}
--input (-i) input                          Path to job input directory. Must be a
                                            SequenceFile of VectorWritable
--output (-o) output                        The directory pathname for output.
--overwrite (-ow)                           If present, overwrite the output
                                            directory before running job
--modelDistClass (-md) modelDistClass       The ModelDistribution class name.
                                            Defaults to NormalModelDistribution
--modelPrototypeClass (-mp) prototypeClass  The ModelDistribution prototype Vector
                                            class name. Defaults to
                                            RandomAccessSparseVector
--maxIter (-x) maxIter                      The maximum number of iterations.
--alpha (-m) alpha                          The alpha0 value for the
                                            DirichletDistribution. Defaults to 1.0
--k (-k) k                                  The number of clusters to create
--help (-h)                                 Print out help
--maxRed (-r) maxRed                        The number of reduce tasks. Defaults
                                            to 2
--clustering (-cl)                          If present, run clustering after the
                                            iterations have taken place
--emitMostLikely (-e) emitMostLikely        True if clustering should emit the
                                            most likely point only, false for
                                            threshold clustering. Default is true
--threshold (-t) threshold                  The pdf threshold used for cluster
                                            determination. Default is 0
{code}
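As an illustration of how the options listed above combine, the sketch below assembles one possible command line. dirichlet_cmd is a hypothetical helper that only prints the command. The fixed choices (alpha 1.0, overwrite, clustering after the iterations) are illustrative and not recommended settings.

```shell
# dirichlet_cmd is a hypothetical helper that prints a command line
# built from the short options listed above. It does not run Mahout.
# $1 = input dir, $2 = output dir, $3 = k, $4 = maxIter.
# -m 1.0, -ow and -cl are illustrative choices, not recommendations.
dirichlet_cmd() {
  printf './bin/mahout dirichlet -i %s -o %s -k %s -x %s -m 1.0 -ow -cl\n' \
    "$1" "$2" "$3" "$4"
}
```

For example, dirichlet_cmd testdata output 10 5 prints ./bin/mahout dirichlet -i testdata -o output -k 10 -x 5 -m 1.0 -ow -cl, which can be reviewed and then pasted into a shell in $MAHOUT_HOME.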