[CONF] Apache Mahout > File Format Integrations

Isabel Drost (Confluence) Mon, 23 Dec 2013 09:53:13 -0800

	Isabel Drost edited the page:
	File Format Integrations

There are several importers and exporters for common file formats.

General-purpose converters

Importer 'bin/mahout' jobs

Run any of these with --help to see its options.

bin/mahout arff.vector
bin/mahout lucene.vector
bin/mahout seqdirectory
- turns text files into sequence files, one file per key/value pair
bin/mahout SequenceFilesFromMailArchives
- parses mailboxes and emits one text body per mail message
bin/mahout regexconverter
- reads lines of text, applies a regular expression to each, and writes the matches to SequenceFiles

Exporter 'bin/mahout' jobs

Some programs exist to dump text versions of SequenceFiles for perusal. Run these with --help to see options.

bin/mahout clusterdump
bin/mahout cmdump
bin/mahout matrixdump
bin/mahout seqdumper
bin/mahout vectordump

Note: the name of any class with a 'main' method can be used as a bin/mahout job name.

Importer classes

These classes have no main() method, so they cannot be run as bin/mahout jobs; call them from your own code.

CSVVectorIterator imports CSV files into vectors.

Exporter classes

GraphMLClusterWriter saves cluster data in the GraphML format.
CSVClusterWriter saves clusters in a CSV-based format.

Both of these formats are read by the Gephi program, an interactive graph explorer.

There are many file importers which are custom-made for particular algorithms:

The various text -> Lucene index converters

Examples

Regex Converter

For example, the following command extracts search queries from logs of HTTP requests to Solr and prepares them for use by Frequent Itemset Mining.


bin/mahout regexconverter --input /Users/grantingersoll/projects/content/lucid/lucidfind/logs --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg

See tutorial and cheat sheet for this marvelously opaque toolkit.
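
To make the regular expression in the command above less opaque, here is a minimal sketch of what it matches. It uses Python's re module as a stand-in for the Java regex engine that Mahout runs on; the two are assumed to agree for this particular pattern. The function name extract_queries and the sample log lines are hypothetical, and the URL-decoding step is only an assumption about what the "url" transformer does, not a description of Mahout's implementation.

```python
import re
from urllib.parse import unquote_plus

# Same pattern as in the regexconverter command:
#   (?<=(\?|&)q=)  lookbehind: the match must be preceded by "?q=" or "&q="
#   .*?            the query text itself, matched lazily
#   (?=&|$)        lookahead: stop at the next "&" or at the end of the line
QUERY_PATTERN = re.compile(r"(?<=(\?|&)q=).*?(?=&|$)")


def extract_queries(line):
    """Return the URL-decoded value of every q= parameter found in a log line.

    Hypothetical helper for illustration only. The decoding step assumes
    that the "url" transformer URL-decodes each match.
    """
    return [unquote_plus(m.group(0)) for m in QUERY_PATTERN.finditer(line)]


if __name__ == "__main__":
    # Hypothetical request lines, not taken from real Solr logs.
    samples = [
        "/solr/select?q=apache+mahout&rows=10",
        "/solr/select?fq=type:doc&q=lucene",
        "/solr/admin/ping",
    ]
    for sample in samples:
        print(sample, "->", extract_queries(sample))
```

The lookbehind requires "?" or "&" directly before "q=", so a filter parameter such as "fq=" is not mistaken for the query. Because the lookahead stops only at "&" or the end of the line, a q= parameter that comes last in a line with trailing fields would capture those fields too.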
