Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: File Format Integrations 
(https://cwiki.apache.org/confluence/display/MAHOUT/File+Format+Integrations)


Edited by Lance Norskog:
---------------------------------------------------------------------
There are several importers and exporters for common file formats.
h2. General-purpose converters
h3. Importer 'bin/mahout' jobs
Run these with --help to see options.
* bin/mahout arff.vector
* bin/mahout lucene.vector
* bin/mahout seqdirectory
** turns a directory of text files into SequenceFiles, one key/value pair per input file
* bin/mahout SequenceFilesFromMailArchives
** parses mailboxes and emits one text body per mail message
* bin/mahout regexconverter 
** reads text lines and emits the regex output lines into SequenceFiles.

h3. Exporter 'bin/mahout' jobs
Some programs exist to dump text versions of SequenceFiles for perusal. Run 
these with --help to see options.
* bin/mahout clusterdump
* bin/mahout cmdump
* bin/mahout matrixdump
* bin/mahout seqdumper
* bin/mahout vectordump

*Note:* any class with a 'main' method can be used as a bin/mahout job name.

h3. Importer classes

These classes have no main() method, so they cannot be run as bin/mahout jobs; call them from your own code.
* CSVVectorIterator imports CSV files into vectors. 

h3. Exporter classes

* *GraphMLClusterWriter* saves cluster data in the [GraphML|http://graphml.graphdrawing.org/] format.
* *CSVClusterWriter* saves clusters in a CSV-based format.

Both of these formats are read by the [Gephi|http://gephi.org/] program, an 
interactive graph explorer. 

There are many file importers which are custom-made for particular algorithms:
* The various text -> Lucene index converters

h2. Examples
h3. Regex Converter
The following command extracts the query strings from logs of HTTP requests made to [Solr|http://lucene.apache.org] and prepares them for use by Frequent Itemset Mining.
{code}
bin/mahout regexconverter \
  --input /Users/grantingersoll/projects/content/lucid/lucidfind/logs \
  --output /tmp/solr/output \
  --regex "(?<=(\?|&)q=).*?(?=&|$)" \
  --overwrite \
  --transformerClass url \
  --formatterClass fpg
{code}
See the Java regular expression [tutorial|http://download.oracle.com/javase/tutorial/essential/regex/] and [cheat sheet|http://www.omicentral.com/cheatsheets/JavaRegularExpressionsCheatSheet.pdf] for this marvelously opaque toolkit.
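
To see what the regular expression in the command above selects, it can help to run it on its own with plain java.util.regex, outside Mahout. The following is a minimal sketch under stated assumptions: the class name QueryRegexDemo, the extract helper, and the sample log lines are all invented for illustration, and the code covers only the regex match step. It does not reproduce whatever the url transformer and fpg formatter do to the matched text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Mahout.
public class QueryRegexDemo {

    // Same pattern as in the regexconverter command:
    //   (?<=(\?|&)q=)  lookbehind: the match must directly follow "?q=" or "&q="
    //   .*?            the parameter value, matched lazily
    //   (?=&|$)        lookahead: stop before the next "&" or at end of line
    static final Pattern QUERY = Pattern.compile("(?<=(\\?|&)q=).*?(?=&|$)");

    // Returns every substring of the line that the pattern matches.
    public static List<String> extract(String line) {
        List<String> matches = new ArrayList<>();
        Matcher m = QUERY.matcher(line);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Invented sample lines; real Solr logs will look different.
        String[] lines = {
            "GET /solr/select?q=mahout+clustering&rows=10",
            "GET /solr/select?rows=10&q=lucene",
            "GET /solr/admin/ping"
        };
        for (String line : lines) {
            System.out.println(extract(line));
        }
    }
}
```

For these sample lines the program prints [mahout+clustering], [lucene], and []. Because the lookbehind requires "?" or "&" immediately before "q=", a parameter such as "fq=" is not picked up, and the matched value is still URL-encoded at this point.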



