Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: File Format Integrations
(https://cwiki.apache.org/confluence/display/MAHOUT/File+Format+Integrations)
Edited by Lance Norskog:
---------------------------------------------------------------------
There are several importers and exporters for common file formats.
{toc}
h2. General-purpose converters
h3. Importer 'bin/mahout' jobs
Run these with --help to see options.
* bin/mahout arff.vector
* bin/mahout lucene.vector
* bin/mahout seqdirectory
** turns a directory of text files into SequenceFiles, with one key/value pair per input file
* bin/mahout SequenceFilesFromMailArchives
** parses mailboxes and emits one text body per mail message
* bin/mahout regexconverter
** applies a regex to each line of text and writes the matches to SequenceFiles
h3. Exporter 'bin/mahout' jobs
Some programs exist to dump text versions of SequenceFiles for perusal. Run
these with --help to see options.
* bin/mahout clusterdump
* bin/mahout cmdump
* bin/mahout matrixdump
* bin/mahout seqdumper
* bin/mahout vectordump
*Note:* all classes with a 'main' method can be used as a bin/mahout job name.
h3. Importer classes
These classes have no main() method, so they cannot be run as bin/mahout jobs; call them from your own code.
* *CSVVectorIterator* imports CSV files into vectors.
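To make "imports CSV files into vectors" concrete, here is a minimal plain-Java sketch of that kind of transformation. It does not use Mahout at all: the class name {{CsvRows}} and its {{read}} method are hypothetical, and a {{double[]}} stands in for Mahout's vector type. It only shows the idea of turning each CSV row of numbers into one numeric vector.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout: one double[] per CSV row.
class CsvRows {
    /** Parses each non-blank line of comma-separated numbers into one row. */
    static List<double[]> read(Reader source) {
        List<double[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                String[] cells = line.split(",");
                double[] row = new double[cells.length];
                for (int i = 0; i < cells.length; i++) {
                    row[i] = Double.parseDouble(cells[i].trim());
                }
                rows.add(row);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}
```

The sketch assumes every cell is numeric and that there is no header row or quoting; real CSV input usually needs a proper CSV parser.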
h3. Exporter classes
* *GraphMLClusterWriter* saves cluster data in the [GraphML|http://graphml.graphdrawing.org/] format.
* *CSVClusterWriter* saves clusters in a CSV-based format.
Both of these formats are read by the [Gephi|http://gephi.org/] program, an
interactive graph explorer.
There are also many file importers that are custom-made for particular algorithms:
* The various text -> Lucene index converters
h2. Examples
h3. Regex Converter
The following command extracts the queries from logs of HTTP requests made to
[Solr|http://lucene.apache.org] and prepares them for use by Frequent Itemset
Mining.
{code}
bin/mahout regexconverter \
  --input /Users/grantingersoll/projects/content/lucid/lucidfind/logs \
  --output /tmp/solr/output \
  --regex "(?<=(\?|&)q=).*?(?=&|$)" \
  --overwrite \
  --transformerClass url \
  --formatterClass fpg
{code}
See the Java regular expression
[tutorial|http://download.oracle.com/javase/tutorial/essential/regex/] and
[cheat sheet|http://www.omicentral.com/cheatsheets/JavaRegularExpressionsCheatSheet.pdf]
for this marvelously opaque toolkit.
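To see what the regex in the command above matches, the following sketch applies the same pattern to a request string using only java.util.regex. It is an illustration, not Mahout's code: the class name {{QueryExtractor}} and the sample request strings are made up, and the URL-decoding step is an assumption about what {{--transformerClass url}} does to each match.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Mahout.
class QueryExtractor {
    // Same pattern as the --regex argument; backslashes are doubled for a Java string literal.
    // Lookbehind: preceded by "?q=" or "&q=". Lookahead: followed by "&" or end of input.
    private static final Pattern QUERY = Pattern.compile("(?<=(\\?|&)q=).*?(?=&|$)");

    /** Returns every match of QUERY in the line, URL-decoded (assumed behaviour of the "url" transformer). */
    static List<String> extract(String logLine) {
        List<String> queries = new ArrayList<>();
        Matcher m = QUERY.matcher(logLine);
        while (m.find()) {
            try {
                queries.add(URLDecoder.decode(m.group(), "UTF-8"));
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException(e); // UTF-8 is always supported
            }
        }
        return queries;
    }
}
```

For example, {{QueryExtractor.extract("/solr/select?q=apache+mahout&rows=10")}} returns a list containing the single string {{apache mahout}}. Note that when {{q}} is the last parameter, the lazy match runs to the end of the line, so anything after the query on that log line is captured too.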