Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Data Formats 
(https://cwiki.apache.org/confluence/display/MAHOUT/Data+Formats)


Edited by Lance Norskog:
---------------------------------------------------------------------
Mahout's job implementations rely heavily on a small set of file formats.
{toc}
h2. File formats
h3. Raw formats for import
* Text files
** can be parsed into SequenceFiles of:
*** (line number, text of line)
*** (file name, full contents of file)
*** (line number, parts of line extracted with regex patterns)
** can also be parsed into Lucene indexes:
*** _precise index design ???_
* ARFF files
** Weka project text data format
** Parsed into SequenceFile of <Int,Vector>
* Mailbox files
** can be parsed into SequenceFiles of:
*** (mail message id, text body of mail message)
*** no html or attachment support
* CSV files
** generally without column or row headers
** no "multiple values per column" options
* Hadoop SequenceFile
** canonical format, no variations. SequenceFile metadata is currently unused.
* Lucene indexes
** translated into SequenceFiles
*** _precise index design ???_
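
The three key/value layouts listed above for text files can be sketched in plain Python. This is only an illustration of the pair shapes, not Mahout's import code: the function names, zero-based line numbering, and the choice to skip lines the regex does not match are all assumptions.

```python
import re


def lines_as_pairs(text):
    """Yield (line number, text of line) pairs. Zero-based numbering is an assumption."""
    for number, line in enumerate(text.splitlines()):
        yield number, line


def file_as_pair(name, text):
    """Return a single (file name, full contents of file) pair."""
    return name, text


def regex_parts_as_pairs(text, pattern):
    """Yield (line number, extracted parts) for each line the pattern matches.

    Skipping non-matching lines is an assumption made for this sketch.
    """
    compiled = re.compile(pattern)
    for number, line in enumerate(text.splitlines()):
        match = compiled.search(line)
        if match:
            yield number, match.groups()


if __name__ == "__main__":
    sample = "a=1\nb=2\nnoise"
    print(list(lines_as_pairs(sample)))
    print(file_as_pair("sample.txt", sample))
    print(list(regex_parts_as_pairs(sample, r"(\w+)=(\d+)")))
```

In a real job each pair would be written as one SequenceFile record; here they are simply Python tuples.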

h3. Raw formats for export
* SequenceFiles
* Text lines, mostly of the toString() variety
* MatrixWritable for ConfusionMatrix
* CSV for MatrixWritable
* A special CSV format for Clusters
* [GraphML XML|http://graphml.graphdrawing.org/] for Clusters

h2. Who Stores What in a SequenceFile? 
h3. "Simple" Text Vectors
Simple text vectors represent documents. The dimensions are the set of terms in 
the entire document set. Each document vector stores a number in the position 
of each term it contains. This number may be derived from the count of the term 
inside the document.
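
As a rough illustration of the idea, the sketch below builds a term dictionary over a tiny document set and turns one document into a sparse term-count vector. Whitespace tokenization, alphabetical index assignment, and the function names are assumptions of this sketch, not a description of Mahout's vectorizer.

```python
from collections import Counter


def build_dictionary(documents):
    """Assign one dimension (index) to every distinct term in the document set."""
    terms = sorted({term for doc in documents for term in doc.split()})
    return {term: index for index, term in enumerate(terms)}


def term_count_vector(document, dictionary):
    """Return a sparse vector {term index: count} for one document."""
    counts = Counter(document.split())
    return {
        dictionary[term]: float(count)
        for term, count in counts.items()
        if term in dictionary
    }


if __name__ == "__main__":
    docs = ["the cat sat", "the cat saw the dog"]
    dictionary = build_dictionary(docs)
    print(dictionary)
    print(term_count_vector(docs[1], dictionary))
```

Only the positions of terms that occur in the document are stored; every other dimension is implicitly zero.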
h3. Encoded Text Vectors
Each vector represents a document. However, term dimensions are "collapsed" 
stochastically: each term in the full term set is mapped randomly to several 
indexes in a smaller index space. 
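
One common way to get this kind of collapsing is feature hashing with several probes per term. The sketch below is an assumption-laden illustration: MD5 is a stand-in hash chosen because it is in the Python standard library, and the vector size, probe count, and function names are hypothetical rather than taken from Mahout.

```python
import hashlib


def probe_index(term, probe, size):
    """Map (term, probe number) to an index in a vector of the given size."""
    digest = hashlib.md5(f"{probe}:{term}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % size


def encode_document(document, size=16, probes=2):
    """Return a fixed-size dense vector; each term adds 1.0 at `probes` hashed positions."""
    vector = [0.0] * size
    for term in document.split():
        for probe in range(probes):
            vector[probe_index(term, probe, size)] += 1.0
    return vector


if __name__ == "__main__":
    print(encode_document("the cat saw the dog"))
```

Because the index space is smaller than the term set, different terms can land on the same position. Using more than one probe per term spreads each term over several positions, which reduces the impact of any single collision.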
h3. Directories
Directories are <Integer,Text> pairs that map matrix row numbers to input text 
keys such as movie names or document file names. They are produced by RowIdJob.
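
A minimal sketch of how such a directory can be used, assuming rows are numbered from 0 in the order the keys are encountered (an assumption of this example, as are the function names):

```python
def build_directory(keys):
    """Return (row number, text key) pairs, numbering rows in input order."""
    return list(enumerate(keys))


def label_rows(directory, row_values):
    """Translate {row number: value} results back to {text key: value}."""
    names = dict(directory)
    return {names[row]: value for row, value in row_values.items()}


if __name__ == "__main__":
    directory = build_directory(["Alien", "Brazil", "Casablanca"])
    print(directory)
    print(label_rows(directory, {2: 0.75, 0: 0.5}))
```

The matrix itself only carries integer row numbers, so the directory is what allows job output to be reported in terms of the original keys.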
h3. Matrices
Matrices are almost universally stored as LongWritable/VectorWritable pairs, 
where VectorWritable can be sparse or dense.
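
The layout can be pictured as one record per matrix row, keyed by the row index. The sketch below models a sparse vector as a dict and a dense vector as a list; these Python types and the function names are stand-ins chosen for illustration, not Mahout's classes.

```python
def as_row_pairs(rows, sparse=True):
    """Yield (row index, vector) pairs, one per matrix row.

    A sparse vector keeps only non-zero entries as {column: value};
    a dense vector keeps every entry in a list.
    """
    for index, values in enumerate(rows):
        if sparse:
            yield index, {col: v for col, v in enumerate(values) if v != 0.0}
        else:
            yield index, list(values)


def to_dense(vector, width):
    """Expand either representation to a full list of `width` values."""
    if isinstance(vector, dict):
        return [vector.get(col, 0.0) for col in range(width)]
    return list(vector)


if __name__ == "__main__":
    matrix = [[1.0, 0.0, 0.0], [0.0, 2.0, 3.0]]
    for row_index, vector in as_row_pairs(matrix):
        print(row_index, vector, to_dense(vector, 3))
```

Both representations describe the same row; the sparse form just avoids storing zeros.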
h3. Clusters
Clusters are stored in complex data structures.
h3. FPGrowth Clusters
These are stored in a custom data structure.

h2. Life cycle
Mahout jobs generally treat the files they generate as transient. Any Writable 
format may change, and some may disappear. There are no file compatibility 
requirements.

