Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Converting Content
(https://cwiki.apache.org/confluence/display/MAHOUT/Converting+Content)
Added by Grant Ingersoll:
---------------------------------------------------------------------
{toc}
h1. Intro
Mahout has some tools for converting content into formats that Mahout can
consume more easily. While they shouldn't be mistaken for a full ETL layer,
they can be useful for things like converting text files and log files. All of
these can be accessed via the $MAHOUT_HOME/bin/mahout command line driver.
h1. SequenceFilesFrom*
* SequenceFilesFromDirectory -- Converts
* SequenceFilesFromMailArchives -- works
h1. RegexConverterDriver
RegexConverterDriver is useful for converting things like log files from one
format to another. For instance, you could convert Solr log files containing
query requests to a format consumable by [FrequentItemsetMining].
For example, the following will extract queries from HTTP request logs to
[Solr|http://lucene.apache.org] and prepare them for use by Frequent Itemset
Mining.
{noformat}
bin/mahout regexconverter \
  --input /Users/grantingersoll/projects/content/lucid/lucidfind/logs \
  --output /tmp/solr/output \
  --regex "(?<=(\?|&)q=).*?(?=&|$)" \
  --overwrite \
  --transformerClass url \
  --formatterClass fpg
{noformat}
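To see what the regular expression in the command above selects, here is a minimal standalone sketch in Python. It is not Mahout code and does not call the driver. The sample request paths are hypothetical, and the URL-decoding step is an assumption about what the {{url}} transformer does. The {{fpg}} formatting step is not reproduced.

```python
import re
from urllib.parse import unquote_plus

# Same pattern as the --regex argument in the command above:
# match whatever follows "?q=" or "&q=" up to the next "&" or end of line.
QUERY_PATTERN = re.compile(r"(?<=(\?|&)q=).*?(?=&|$)")


def extract_queries(log_line):
    """Return the URL-decoded value of each q= parameter in log_line.

    The decoding is an assumption about the "url" transformer, not a
    reproduction of Mahout's implementation.
    """
    return [unquote_plus(m.group(0)) for m in QUERY_PATTERN.finditer(log_line)]


# Hypothetical request path, not taken from a real Solr log.
sample = "/solr/select?q=apache+mahout&rows=10"
print(extract_queries(sample))  # prints ['apache mahout']
```

The lookbehind requires {{q=}} to be preceded directly by {{?}} or {{&}}, so parameters such as {{fq=}} are not picked up. The lazy {{.*?}} stops at the first following {{&}}, or at the end of the line when {{q}} is the last parameter.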