Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: File Format Integrations (https://cwiki.apache.org/confluence/display/MAHOUT/File+Format+Integrations)
Edited by Lance Norskog:
---------------------------------------------------------------------
There are several importers and exporters for common file formats.

h2. General-purpose converters

h3. Importer 'bin/mahout' jobs

Run these with --help to see options.
* bin/mahout arff.vector
* bin/mahout lucene.vector
* bin/mahout seqdirectory
** turns text files into sequence files, one key/value pair per file
* bin/mahout SequenceFilesFromMailArchives
** parses mailboxes and emits one text body per mail message
* bin/mahout regexconverter
** reads text lines, applies a regex to each one, and writes the matches into SequenceFiles

h3. Exporter 'bin/mahout' jobs

Some programs exist to dump text versions of SequenceFiles for perusal. Run these with --help to see options.
* bin/mahout clusterdump
* bin/mahout cmdump
* bin/mahout matrixdump
* bin/mahout seqdumper
* bin/mahout vectordump

*Note:* any class with a 'main' method can be used as a bin/mahout job name.

h3. Importer classes

These are not main() classes and must be coded against.
* *CSVVectorIterator* imports CSV files into vectors.

h3. Exporter classes

* *GraphMLClusterWriter* saves cluster data in the [GraphML|http://graphml.graphdrawing.org/] format.
* *CSVClusterWriter* saves clusters in a CSV-based format.

Both of these formats are read by the [Gephi|http://gephi.org/] program, an interactive graph explorer.

There are also many file importers which are custom-made for particular algorithms:
* The various text -> Lucene index converters

h2. Examples

h5. Regex Converter

For example, the following will extract queries from the logs of HTTP requests to [Solr|http://lucene.apache.org] and prepare them for use by Frequent Itemset Mining.
{code}
bin/mahout regexconverter \
  --input /Users/grantingersoll/projects/content/lucid/lucidfind/logs \
  --output /tmp/solr/output \
  --regex "(?<=(\?|&)q=).*?(?=&|$)" \
  --overwrite \
  --transformerClass url \
  --formatterClass fpg
{code}

See the Java regular expression [tutorial|http://download.oracle.com/javase/tutorial/essential/regex/] and [cheat sheet|http://www.omicentral.com/cheatsheets/JavaRegularExpressionsCheatSheet.pdf] for this marvelously opaque toolkit.
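
To see what the regex in the regexconverter example picks out of a log line, it can help to try it in isolation with java.util.regex. The following is only a minimal sketch: the class name QueryRegexSketch, the method extractQueries, and the sample request line are hypothetical and not part of Mahout, and the sketch does not model whatever the 'url' transformer and 'fpg' formatter do to each match afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Mahout.
class QueryRegexSketch {
    // Same pattern as the --regex argument, with backslashes doubled
    // because this is a Java string literal.
    // (?<=(\?|&)q=)  lookbehind: the match must be preceded by "?q=" or "&q="
    // .*?            the query value itself, matched lazily
    // (?=&|$)        lookahead: stop before the next "&" or at end of input
    static final Pattern QUERY = Pattern.compile("(?<=(\\?|&)q=).*?(?=&|$)");

    // Returns every q= value found in one line of text.
    static List<String> extractQueries(String line) {
        List<String> matches = new ArrayList<>();
        Matcher m = QUERY.matcher(line);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up log line, for illustration only.
        String line = "GET /solr/select?q=apache+mahout&rows=10 HTTP/1.1";
        System.out.println(extractQueries(line));
    }
}
```

Because the lookbehind and lookahead are zero-width, the match contains only the query value, without the leading "q=" or the trailing "&". When q is the last parameter on a line, the "$" alternative ends the match at the end of the input instead.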
