There are several importers and exporters for common file formats.
General-purpose converters
Importer 'bin/mahout' jobs
Run these with --help to see options.
- bin/mahout arff.vector
- bin/mahout lucene.vector
- bin/mahout seqdirectory
- turns text files into sequence files, with one key/value pair per input file
- bin/mahout SequenceFilesFromMailArchives
- parses mailboxes and emits one text body per mail message
- bin/mahout regexconverter
- reads text lines, applies a regex to each line, and writes the matched output into SequenceFiles
Exporter 'bin/mahout' jobs
Some programs exist to dump text versions of SequenceFiles for perusal. Run these with --help to see options.
- bin/mahout clusterdump
- bin/mahout cmdump
- bin/mahout matrixdump
- bin/mahout seqdumper
- bin/mahout vectordump
Note: all classes with a 'main' method can be used as a bin/mahout job name.
Importer classes
These classes have no main() method, so they cannot be run as bin/mahout jobs and must be called from your own code.
- CSVVectorIterator imports CSV files into vectors.
Exporter classes
- GraphMLClusterWriter saves cluster data in the GraphML format.
- CSVClusterWriter saves clusters in a CSV-based format.
Both of these formats are read by the Gephi program, an interactive graph explorer.
There are many file importers which are custom-made for particular algorithms:
- The various text -> Lucene index converters
Examples
Regex Converter
For example, the following command extracts queries from the logs of HTTP requests sent to Solr and prepares them for use by Frequent Itemset Mining.
bin/mahout regexconverter --input /Users/grantingersoll/projects/content/lucid/lucidfind/logs --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg
See tutorial and cheat sheet for this marvelously opaque toolkit.
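To see what the --regex argument in the command above matches, it can help to try the pattern on a single line outside Mahout. The following is a minimal Python sketch, not Mahout code. It assumes that Python's re module treats this particular pattern the same way as the Java regex engine Mahout runs on, and that the url transformer URL-decodes each match (approximated here with unquote_plus). The log line and the function name extract_queries are made up for illustration, and nothing is written to a SequenceFile.

```python
import re
from urllib.parse import unquote_plus

# Same pattern as the --regex argument in the command above:
# a lookbehind for "?q=" or "&q=", a lazy match, and a lookahead for "&" or end of line.
QUERY_PATTERN = re.compile(r"(?<=(\?|&)q=).*?(?=&|$)")


def extract_queries(line):
    """Return the raw (still URL-encoded) q= values found in one log line."""
    # finditer + group(0) returns the whole match; findall would return
    # only the capture group inside the lookbehind.
    return [m.group(0) for m in QUERY_PATTERN.finditer(line)]


# Hypothetical log line, invented for this example.
sample = "GET /solr/select?q=apache+mahout&rows=10&fq=type%3Adoc"

raw = extract_queries(sample)
# Assumption: stands in for what --transformerClass url does to each match.
decoded = [unquote_plus(q) for q in raw]

print(raw)      # ['apache+mahout']
print(decoded)  # ['apache mahout']
```

Note that the fq= parameter is not picked up: the lookbehind requires the q= to follow a ? or & directly.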