Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT) Page: Import Export Sequence File Formats (https://cwiki.apache.org/confluence/display/MAHOUT/Import+Export+Sequence+File+Formats)
Added by Lance Norskog:
---------------------------------------------------------------------
h5. Status

This is a talk page.

h1. Scope of Project

There are different kinds of import/export problems. One class of problem is defining a set of SequenceFile formats that a "Mahout Job" will import and export. This page is limited to the SequenceFile problem.

h1. Use Cases

h3. Lucene "Bag-of-words" vector

This is a NamedVector file containing a String key and a sparse-encoded vector. There may be an external dictionary defining documents and/or terms.

h5. Import

The various Bayes text classification jobs, such as the Wikipedia example, import Lucene bag-of-words Vector files.

h5. Export

Feature vectors derived from text vectors are useful to text-oriented machine learning research. An example:
* Compare a feature vector to all of the original text vectors. This searches for "exemplar" documents that most comprehensively match the given feature. Several papers discuss this approach for creating document abstracts from sentence vectors.

h3. Confusion Matrix

A classification job creates, among other things, a Confusion Matrix. The current example jobs log a text version of the confusion matrix.

h5. Import

Comparing confusion matrices from different classification runs lets you evaluate tuning knobs for a classifier.

h5. Export

Comparing confusion matrices from different classification runs lets you evaluate tuning knobs for a classifier.
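
To make the confusion matrix use case concrete, here is a minimal sketch of what comparing two classification runs could look like once the matrices are available as data rather than log text. It is not Mahout code and does not read SequenceFiles. The nested-dictionary layout (actual label, then predicted label, then count), the function names, and the spam/ham labels are all assumptions made for illustration.

```python
from typing import Dict

# Hypothetical in-memory layout: matrix[actual_label][predicted_label] = count.
# This is an illustrative assumption, not a format defined by Mahout.
ConfusionMatrix = Dict[str, Dict[str, int]]


def accuracy(matrix: ConfusionMatrix) -> float:
    """Fraction of all instances whose predicted label equals the actual label."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(row.get(label, 0) for label, row in matrix.items())
    return correct / total if total else 0.0


def recall_per_label(matrix: ConfusionMatrix) -> Dict[str, float]:
    """For each actual label, the fraction of its instances classified correctly."""
    result = {}
    for label, row in matrix.items():
        row_total = sum(row.values())
        result[label] = row.get(label, 0) / row_total if row_total else 0.0
    return result


def compare_runs(baseline: ConfusionMatrix, candidate: ConfusionMatrix) -> dict:
    """Return accuracy and per-label recall deltas (candidate minus baseline)."""
    base_recall = recall_per_label(baseline)
    cand_recall = recall_per_label(candidate)
    labels = sorted(set(base_recall) | set(cand_recall))
    return {
        "accuracy_delta": accuracy(candidate) - accuracy(baseline),
        "recall_delta": {
            label: cand_recall.get(label, 0.0) - base_recall.get(label, 0.0)
            for label in labels
        },
    }


# Two made-up runs of the same classifier with different tuning settings.
run_a = {"spam": {"spam": 40, "ham": 10}, "ham": {"spam": 5, "ham": 45}}
run_b = {"spam": {"spam": 45, "ham": 5}, "ham": {"spam": 5, "ham": 45}}

report = compare_runs(run_a, run_b)
print(f"accuracy delta: {report['accuracy_delta']:+.3f}")
for label, delta in report["recall_delta"].items():
    print(f"recall delta for {label}: {delta:+.3f}")
```

A positive delta means the candidate run improved on the baseline for that measure. An exported confusion matrix file would need to carry at least the label names and the counts for this kind of comparison to work across runs.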
