[CONF] Apache Mahout > Import Export Sequence File Formats

confluence Sun, 11 Sep 2011 20:12:35 -0700

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Import Export Sequence File Formats 
(https://cwiki.apache.org/confluence/display/MAHOUT/Import+Export+Sequence+File+Formats)



Edited by Lance Norskog:
---------------------------------------------------------------------
h5. Status
This is a talk page.
h1. Scope of Project
There are different kinds of import/export problems. One class of problem is 
defining a set of SequenceFile formats that a "Mahout Job" will import and 
export. This page is limited to the SequenceFile problem.
h1. Purpose
This feature would make the suite of Mahout jobs far more useful, because jobs 
could cross-connect with each other. Right now each job is a large, complex 
beast that does everything a use case might need. Shared file formats would 
allow smaller, modular job designs.

The feature should not create more "I am a confused beginner" traffic on the 
mahout-user list.
h1. Use Cases
h3. Lucene "Bag-of-words" vector
This is a NamedVector file containing a String key and a sparse-encoded vector. 
There may be an external dictionary defining documents and/or terms.
h5. Import
The various Bayes text classification jobs, such as the Wikipedia example, 
import Lucene bag-of-words Vector files.
h5. Export
Feature vectors derived from text vectors are useful to text-oriented machine 
learning research. An example:

* Compare a feature vector to all of the original text vectors. This searches 
for "exemplar" documents which seem to most comprehensively match the given 
feature. A bunch of papers discuss this for creating document abstracts from 
sentence vectors.
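
The exemplar search above can be sketched in a few lines. This is a minimal illustration, not Mahout code: it assumes each vector is held in memory as a plain index-to-weight map rather than read from a SequenceFile, it assumes cosine similarity as the comparison measure (the page does not name one), and the class and method names (ExemplarSearch, bestExemplar) are hypothetical.

```java
import java.util.Map;

/** Hypothetical helper: finds the document vector that best matches a feature vector. */
final class ExemplarSearch {

    /** Dot product of two sparse vectors stored as index -> weight maps. */
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        // Walk the smaller map so the cost depends on the sparser vector.
        Map<Integer, Double> small = a.size() <= b.size() ? a : b;
        Map<Integer, Double> large = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    /** Cosine similarity; defined as 0 when either vector has zero length. */
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double denom = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
        return denom == 0.0 ? 0.0 : dot(a, b) / denom;
    }

    /** Returns the key of the document most similar to the feature, or null if docs is empty. */
    static String bestExemplar(Map<String, Map<Integer, Double>> docs,
                               Map<Integer, Double> feature) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<Integer, Double>> doc : docs.entrySet()) {
            double score = cosine(doc.getValue(), feature);
            if (score > bestScore) {
                bestScore = score;
                best = doc.getKey();
            }
        }
        return best;
    }
}
```

The String keys play the role of the NamedVector names, so the result can be looked up in an external document dictionary if one exists.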

h3. Confusion Matrix 
A classification job creates, among other things, a Confusion Matrix. The 
current example jobs log a text version of the confusion matrix.
h5. Import
Comparing confusion matrices from different classification runs lets you 
evaluate tuning knobs for a classifier.
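
As a rough sketch of what such a comparison could look like once two matrices have been imported: the code below assumes each matrix is a plain int[][] of counts with rows as actual labels and columns as predicted labels, and that both runs use the same label order. The class and method names are hypothetical and no Mahout class is used.

```java
/** Hypothetical helper for comparing confusion matrices from two classification runs. */
final class ConfusionMatrixCompare {

    /** Fraction of instances on the diagonal, i.e. classified correctly. */
    static double accuracy(int[][] counts) {
        long correct = 0;
        long total = 0;
        for (int i = 0; i < counts.length; i++) {
            for (int j = 0; j < counts[i].length; j++) {
                total += counts[i][j];
                if (i == j) {
                    correct += counts[i][j];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    /** Cell-by-cell difference (runB - runA); both matrices must have the same shape. */
    static int[][] delta(int[][] runA, int[][] runB) {
        if (runA.length != runB.length) {
            throw new IllegalArgumentException("Row counts differ");
        }
        int[][] out = new int[runA.length][];
        for (int i = 0; i < runA.length; i++) {
            if (runA[i].length != runB[i].length) {
                throw new IllegalArgumentException("Column counts differ in row " + i);
            }
            out[i] = new int[runA[i].length];
            for (int j = 0; j < runA[i].length; j++) {
                out[i][j] = runB[i][j] - runA[i][j];
            }
        }
        return out;
    }
}
```

A positive diagonal entry in the delta means the second run classified more instances of that label correctly; off-diagonal entries show which specific confusions a tuning change reduced or increased.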
h5. Export
Classification jobs export a confusion matrix defining misclassification 
events. (Recommender jobs have an analogous output: the user/item matrix of 
preference deltas when comparing training and test data. I would use the same 
tool to visualize both matrices.)
h1. Contract
All "Mahout Jobs" have to honor a contract around SequenceFile types.
h3. Proposal #1
There is a small list of simple SequenceFiles. All jobs are required to accept 
at least some of these file types. The job must have its own interpretation of 
what it means to import each one. There can be several interpretations for a 
particular type.

The job must log a list of what file types it imports and exports, and 
descriptions of each interpretation.
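
One way to picture this contract is as a small interface that every participating job implements. Everything below is a hypothetical sketch for discussion: CommonFormat, FormatInterpretation, InterchangeJob and ContractLogger are invented names, not existing Mahout types, and the enum lists only the two file types named so far on this page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical: the short list of common SequenceFile types. */
enum CommonFormat { LABELED_MATRIX, NAMED_VECTOR }

/** Hypothetical: one way a job interprets one of the common file types. */
final class FormatInterpretation {
    final CommonFormat format;
    final String name;
    final String description;

    FormatInterpretation(CommonFormat format, String name, String description) {
        this.format = format;
        this.name = name;
        this.description = description;
    }
}

/** Hypothetical contract: a job declares what it can import and export. */
interface InterchangeJob {
    List<FormatInterpretation> imports();
    List<FormatInterpretation> exports();
}

/** Builds the lines a job would log to describe its side of the contract. */
final class ContractLogger {
    static List<String> describe(InterchangeJob job) {
        List<String> lines = new ArrayList<>();
        for (FormatInterpretation f : job.imports()) {
            lines.add("import " + f.format + " [" + f.name + "]: " + f.description);
        }
        for (FormatInterpretation f : job.exports()) {
            lines.add("export " + f.format + " [" + f.name + "]: " + f.description);
        }
        return lines;
    }
}
```

Because a job may list the same CommonFormat more than once under different names, this shape allows several interpretations for a particular type.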
h5. Limits
All jobs still have the current parameters and their meanings. Participating in 
the Import/Export feature occurs outside of the "native" file formats for input 
& output.
A job is not limited to the list of file types. It can import and export any 
other types. For example, the FPGrowth job exports a complex tree structure.

h5. Types of SequenceFiles
There should be a very short list of SequenceFile types.
* Matrix with optional row and column labels.
* NamedVector
* ??

h5. Information structure
Data exported to the common formats is not expected to include full 
information. Under this proposal, a job would create various "flattened" 
versions of its main data structure.
h5. Parameters
There is a common set of parameters for the Import/Export service. It should be 
as simple as possible.
* Use the import service
* Where is the file or Hadoop directory?
* Which interpretation should the job use?
* The same three parameters for export
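
To make the parameter list concrete, here is a minimal sketch of how a job might parse it. The flag names (--importPath, --importAs, --exportPath, --exportAs) are assumptions invented for this example, not existing Mahout options. The sketch also assumes that supplying a path is what switches the import or export service on, and that it only ever sees the import/export flags, not the job's native parameters.

```java
/** Hypothetical holder for the common Import/Export parameters. */
final class ImportExportOptions {
    final String importPath; // file or Hadoop directory; null means import is not used
    final String importAs;   // name of the interpretation to apply on import
    final String exportPath; // file or Hadoop directory; null means export is not used
    final String exportAs;   // name of the interpretation to apply on export

    private ImportExportOptions(String importPath, String importAs,
                                String exportPath, String exportAs) {
        this.importPath = importPath;
        this.importAs = importAs;
        this.exportPath = exportPath;
        this.exportAs = exportAs;
    }

    boolean usesImport() { return importPath != null; }
    boolean usesExport() { return exportPath != null; }

    /** Parses "--flag value" pairs; unknown flags and dangling flags are rejected. */
    static ImportExportOptions parse(String[] args) {
        String importPath = null, importAs = null, exportPath = null, exportAs = null;
        for (int i = 0; i < args.length; i += 2) {
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("Missing value for " + args[i]);
            }
            String value = args[i + 1];
            switch (args[i]) {
                case "--importPath": importPath = value; break;
                case "--importAs":   importAs = value;   break;
                case "--exportPath": exportPath = value; break;
                case "--exportAs":   exportAs = value;   break;
                default:
                    throw new IllegalArgumentException("Unknown flag: " + args[i]);
            }
        }
        // An interpretation only makes sense when there is a path to apply it to.
        if (importAs != null && importPath == null) {
            throw new IllegalArgumentException("--importAs requires --importPath");
        }
        if (exportAs != null && exportPath == null) {
            throw new IllegalArgumentException("--exportAs requires --exportPath");
        }
        return new ImportExportOptions(importPath, importAs, exportPath, exportAs);
    }
}
```

Folding "use the import service" into the presence of a path keeps the set at four flags; a separate boolean switch would work equally well if the list should mirror the bullets one-to-one.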





