Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Creating Vectors from Weka's ARFF Format 
(https://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Weka%27s+ARFF+Format)


Edited by Joe Prasanna Kumar:
---------------------------------------------------------------------
h1. Introduction

Mahout can convert files in Weka's 
[ARFF|http://www.cs.waikato.ac.nz/~ml/weka/arff.html] (2.1) format to Mahout's 
Vector format.
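
For orientation, an ARFF file consists of a header (an @relation line followed by @attribute declarations) and an @data section with one comma-separated row per instance. The sketch below writes a tiny sample file that could serve as converter input. The directory and file names are hypothetical, and the sample only illustrates the general ARFF layout; it has not been checked against Mahout's converter.

```shell
#!/bin/sh
# Hypothetical location for the sample; override ARFF_DIR to change it.
ARFF_DIR="${ARFF_DIR:-./arff-sample}"
mkdir -p "$ARFF_DIR"

# Write a minimal ARFF file: two numeric attributes and one nominal class.
# The quoted 'EOF' keeps the shell from expanding anything in the body.
cat > "$ARFF_DIR/sample.arff" <<'EOF'
% Lines starting with a percent sign are comments.
@relation weather-sample

@attribute temperature numeric
@attribute humidity numeric
@attribute play {yes,no}

@data
85,85,no
80,90,no
83,86,yes
EOF

echo "wrote $ARFF_DIR/sample.arff"
```

Each @data row lists its values in the same order as the @attribute declarations.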

h1. Running the Converter

ARFF files are converted with the org.apache.mahout.utils.arff.Driver 
program.  To list its input arguments, run it with the \--help argument, 
which produces output similar to:
{noformat}
Usage:
 [--input <input> --output <output> --max <max> --help --dictOut <dictOut>
--outputWriter <outputWriter> --delimiter <delimiter>]
Options
  --input (-d) input                  The file or directory containing the ARFF
                                      files.  If it is a directory, all .arff
                                      files will be converted. (Mandatory
                                      parameter)
  --output (-o) output                The output directory.  Files will have
                                      the same name as the input, but with the
                                      extension .mvc (Mandatory parameter)
  --max (-m) max                      The maximum number of vectors to output.
                                      If not specified, then it will loop over
                                      all docs (Optional parameter)
  --help (-h)                         Print out help (Optional parameter)
  --dictOut (-t) dictOut              The file to output the label bindings
                                      (Mandatory parameter)
  --outputWriter (-e) outputWriter    The VectorWriter to use, either seq
                                      (SequenceFileVectorWriter - default) or
                                      file (Writes to a File using JSON format)
                                      (Optional parameter)
  --delimiter (-l) delimiter          The delimiter for outputing the
                                      dictionary (Optional parameter)

{noformat}

Each parameter can be given in its long form, such as \--input, or in the 
equivalent short form, such as \-d.  From here, running the Driver is as simple 
as pointing it at an ARFF file or a directory of ARFF files:
{noformat}
$MAHOUT_HOME/bin/mahout arff.vector -d ./content/reuters-modapte/ \
      -t ./content/reuters-modapte/output/dict.txt \
      -o ./content/reuters-modapte/output/convert
{noformat}
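
The same invocation can also be written with the long option names, and the optional parameters from the \--help listing can be appended. The following is a minimal sketch rather than a tested recipe: the paths mirror the example above, while the limit of 100 vectors, the choice of the file writer, and the /path/to/mahout placeholder are assumptions made for illustration. When no Mahout launcher is found, the script prints the command instead of running it.

```shell
#!/bin/sh
# Sketch only: the paths mirror the example on this page; the value 100 and
# the choice of the "file" writer are illustrative assumptions.
INPUT=./content/reuters-modapte/
OUTPUT=./content/reuters-modapte/output/convert
DICT=./content/reuters-modapte/output/dict.txt

# Fall back to a placeholder path when MAHOUT_HOME is unset.
MAHOUT_BIN="${MAHOUT_HOME:-/path/to/mahout}/bin/mahout"

# Build the argument list from the long option names in the --help output.
set -- arff.vector \
  --input "$INPUT" \
  --output "$OUTPUT" \
  --dictOut "$DICT" \
  --max 100 \
  --outputWriter file

if [ -x "$MAHOUT_BIN" ]; then
  "$MAHOUT_BIN" "$@"
else
  # Mahout launcher not found: print the command instead of running it.
  echo "dry run: $MAHOUT_BIN $*"
fi
```

Limiting the run with \--max is a convenient way to try the conversion on a small subset before processing a whole directory.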
