Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: File Format Integrations 
(https://cwiki.apache.org/confluence/display/MAHOUT/File+Format+Integrations)


Edited by Lance Norskog:
---------------------------------------------------------------------
There are several importers and exporters for common file formats.
h2. General-purpose converters
h3. Importer 'bin/mahout' jobs
Run these with --help to see their options.
* bin/mahout arff.vector
* bin/mahout lucene.vector
* bin/mahout regexconverter reads text lines and emits the "captured" regex 
output into LongWritable/Text SequenceFiles. 

h3. Exporter 'bin/mahout' jobs
Some programs exist to dump text versions of SequenceFiles for perusal. Run 
these with --help to see options.
* bin/mahout clusterdump
* bin/mahout cmdump
* bin/mahout matrixdump
* bin/mahout seqdumper
* bin/mahout vectordump

*Note:* all classes with a 'main' method can be used as a bin/mahout job name.

h3. Importer classes

These classes have no main() method, so they must be called from your own code.
* CSVVectorIterator imports CSV files into vectors. 
* MailProcessor parses text-only mailboxes into a SequenceFile with a numbered 
key and the text body in the value.

h3. Exporter classes

* *GraphMLClusterWriter* saves cluster data in the 
[GraphML|http://graphml.graphdrawing.org/] format.
* *CSVClusterWriter* saves clusters in a CSV-based format.

Both of these formats are read by the [Gephi|http://gephi.org/] program, an 
interactive graph explorer. 

There are also many file importers custom-made for particular algorithms:
* The various text -> Lucene index converters

h2. Examples
h3. Regex Converter
The following command extracts the queries (the value of the q parameter) from 
logs of HTTP requests to [Solr|http://lucene.apache.org] and prepares them for 
use by Frequent Itemset Mining.
{code}
bin/mahout regexconverter \
  --input /Users/grantingersoll/projects/content/lucid/lucidfind/logs \
  --output /tmp/solr/output \
  --regex "(?<=(\?|&)q=).*?(?=&|$)" \
  --overwrite \
  --transformerClass url --formatterClass fpg
{code}
See this [tutorial|http://download.oracle.com/javase/tutorial/essential/regex/] 
and [cheat sheet|http://www.omicentral.com/cheatsheets/JavaRegularExpressionsCheatSheet.pdf] 
for Java regular expressions, a marvelously opaque toolkit.
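
To see what the regex in the command above matches, it can help to try it with plain java.util.regex before running the job. The following is a minimal sketch, not Mahout code: the class name QueryRegexSketch, the method matches, and the sample log line are all hypothetical, and the sketch does not reproduce whatever the url transformer or fpg formatter do afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for trying out the pattern; not part of Mahout.
public class QueryRegexSketch {

    // Same pattern as the --regex argument, with backslashes doubled
    // for a Java string literal.
    private static final Pattern QUERY =
        Pattern.compile("(?<=(\\?|&)q=).*?(?=&|$)");

    /** Returns every substring of the line that the pattern matches. */
    public static List<String> matches(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = QUERY.matcher(line);
        while (m.find()) {
            // group() is the whole match: the text after "?q=" or "&q=",
            // up to the next '&' or the end of the line.
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up log line for illustration only.
        String line = "GET /solr/select?q=mahout+clustering&rows=10 HTTP/1.1";
        System.out.println(matches(line)); // prints [mahout+clustering]
    }
}
```

The lookbehind (?<=(\?|&)q=) anchors the match just after the q= parameter without including it, and the lazy .*? together with the lookahead (?=&|$) stops the match at the next parameter or at the end of the line. In this sample the match is still URL-encoded (the + is kept).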



