[CONF] Apache Lucene Mahout > dirichlet-commandline

confluence Fri, 04 Jun 2010 08:55:22 -0700

Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: dirichlet-commandline 
(http://cwiki.apache.org/confluence/display/MAHOUT/dirichlet-commandline)



Edited by Jeff Eastman:
---------------------------------------------------------------------
h1. Running Dirichlet Process Clustering from the Command Line
Mahout's Dirichlet clustering is launched with the same command-line 
invocation whether you are running on a single machine in stand-alone mode or 
on a larger Hadoop cluster. The mode is determined by the $HADOOP_HOME and 
$HADOOP_CONF_DIR environment variables. If both are set and point to an 
operating Hadoop cluster on the target machine, the invocation runs Dirichlet 
on that cluster. If either environment variable is missing, the stand-alone 
Hadoop configuration is used instead.

{code}
./bin/mahout dirichlet <OPTIONS>
{code}
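
The mode selection described above can be sketched as a small shell check. This is only an illustration: the function name mahout_mode is hypothetical and not part of Mahout, and it tests only that both variables are non-empty, not that they point to a working Hadoop cluster.

```shell
# Hypothetical helper (not part of Mahout): reports which mode the rule
# described above would select for the current environment.
# Assumption: a variable counts as "set" when it is non-empty. This does not
# verify that the values point to an operating Hadoop cluster.
mahout_mode() {
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    echo "stand-alone"
  fi
}
```

Calling mahout_mode just before ./bin/mahout is a quick way to confirm which environment a run is likely to use.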

* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job 
will be generated in $MAHOUT_HOME/core/target/ and its name will contain the 
Mahout version number. For example, when using the Mahout 0.3 release, the job 
will be mahout-core-0.3.job
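
As a minimal sketch of the naming rule above, the following function prints the path where the job file would be expected for a given version. The function name mahout_job_path is hypothetical, and falling back to the current directory when $MAHOUT_HOME is unset is an assumption made for illustration.

```shell
# Hypothetical helper (not part of Mahout): prints the expected location of
# the job file for a given Mahout version, following the naming rule above.
# Assumption: falls back to the current directory if MAHOUT_HOME is unset.
mahout_job_path() {
  version="$1"
  echo "${MAHOUT_HOME:-.}/core/target/mahout-core-${version}.job"
}
```

After mvn install completes, running ls -l "$(mahout_job_path 0.3)" is one way to check that the job file was generated.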


h2. Testing it on a single machine without a cluster

* Put the data: cp <PATH TO DATA> testdata
* Run the Job: 
{code}
./bin/mahout dirichlet -i testdata <OTHER OPTIONS>
{code}

h2. Running it on the cluster

* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
* Run the Job: 
{code}
export HADOOP_HOME=<Hadoop Home Directory>
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
./bin/mahout dirichlet -i testdata <OTHER OPTIONS>
{code}
* Get the data out of HDFS and have a look. Use 
$HADOOP_HOME/bin/hadoop fs -lsr output to list all output files.

h1. Command line options
{code}
  --input (-i) input                            Path to job input directory.    
                                                Must be a SequenceFile of       
                                                VectorWritable                  
  --output (-o) output                          The directory pathname for      
                                                output.                         
  --overwrite (-ow)                             If present, overwrite the       
                                                output directory before running 
                                                job                             
  --modelDistClass (-md) modelDistClass         The ModelDistribution class     
                                                name. Defaults to               
                                                NormalModelDistribution         
  --modelPrototypeClass (-mp) prototypeClass    The ModelDistribution prototype 
                                                Vector class name. Defaults to  
                                                RandomAccessSparseVector        
  --maxIter (-x) maxIter                        The maximum number of           
                                                iterations.                     
  --alpha (-m) alpha                            The alpha0 value for the        
                                                DirichletDistribution. Defaults 
                                                to 1.0                          
  --k (-k) k                                    The number of clusters to       
                                                create                          
  --help (-h)                                   Print out help                  
  --maxRed (-r) maxRed                          The number of reduce tasks.     
                                                Defaults to 2                   
  --clustering (-cl)                            If present, run clustering      
                                                after the iterations have taken 
                                                place                           
  --emitMostLikely (-e) emitMostLikely          True if clustering should emit  
                                                the most likely point only,     
                                                false for threshold clustering. 
                                                Default is true                 
  --threshold (-t) threshold                    The pdf threshold used for      
                                                cluster determination. Default  
                                                is 0                            
{code}
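
To show how the options above combine, here is a sketch of one possible invocation. The flags are taken from the option list; every value (10 clusters, 5 iterations, the testdata and output directories) is an illustrative assumption, not a recommendation. The function name dirichlet_example_cmd is hypothetical, and it only prints the command line so it can be reviewed before being run.

```shell
# Hypothetical helper: prints an example Dirichlet command line (dry run).
# Flags come from the option list above; all values are illustrative only.
#   -i / -o : input and output directories
#   -ow     : overwrite the output directory before running the job
#   -k      : number of clusters to create
#   -x      : maximum number of iterations
#   -m      : alpha0 value (short form of --alpha as listed above)
#   -cl     : run clustering after the iterations have taken place
dirichlet_example_cmd() {
  echo ./bin/mahout dirichlet \
    -i testdata \
    -o output \
    -ow \
    -k 10 \
    -x 5 \
    -m 1.0 \
    -cl
}
```

Run dirichlet_example_cmd to print the command, then paste the printed line into a shell in $MAHOUT_HOME to execute it.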
