[ 
https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated MAHOUT-61:
------------------------------

    Attachment: MAHOUT-61.txt

This is how it works now:

 1. InstanceHandler gathers instances.
 2. TokenizationMapper, Reducer and Combiner create one intermediate 
MapWritable instance (see [4]). These are reduced down to unique feature names 
and class values.
 3. The features and class values are placed in maps, assigned column indices 
and numeric values, and stored as a MapFile on DFS.
 4. VectorBuilderMapper is a map-only job that uses the results from [2] and 
[3] to produce sparse vectors.
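
To make the data flow in steps 2 to 4 concrete, here is a minimal in-memory 
sketch in plain Java. It is not the code in the attached patch and it does not 
use Hadoop. The class name TextMatrixSketch, the helper methods, the simple 
tokenizer and the use of raw term counts as vector values are all assumptions 
made for illustration only.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names come from the patch.
public class TextMatrixSketch {

    // Assumed tokenizer: lower-case the text and split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Steps 2 and 3: reduce names to the unique set, then assign each one a
    // column index. Sorting makes the assignment deterministic.
    static Map<String, Integer> buildIndex(Collection<String> names) {
        Map<String, Integer> index = new TreeMap<>();
        for (String name : new TreeSet<>(names)) {
            index.put(name, index.size());
        }
        return index;
    }

    // Step 4: map one instance to a sparse vector (column index -> term count).
    // Tokens that are missing from the index are skipped.
    static Map<Integer, Double> toSparseVector(String text,
                                               Map<String, Integer> featureIndex) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String token : tokenize(text)) {
            Integer column = featureIndex.get(token);
            if (column != null) {
                vector.merge(column, 1.0, Double::sum);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> instances = List.of("the cat sat", "The dog sat; the dog");

        // Gather every token from every instance, then index the unique ones.
        List<String> allTokens = new ArrayList<>();
        for (String instance : instances) {
            allTokens.addAll(tokenize(instance));
        }
        Map<String, Integer> featureIndex = buildIndex(allTokens);

        System.out.println("features: " + featureIndex);
        for (String instance : instances) {
            System.out.println(toSparseVector(instance, featureIndex));
        }
    }
}
```

In the real jobs the feature index would be read back from the MapFile on DFS 
rather than held in a local map, and class values would get their own index in 
the same way.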


> Text problem matrix builder 
> ----------------------------
>
>                 Key: MAHOUT-61
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-61
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-61.txt, MAHOUT-61.txt, MAHOUT-61.txt
>
>
> A set of classes that builds matrices from text.
> Currently the API consists of TokenMatrixBuilder and TokenInstanceBuilder. 
> Should be thread safe.
> PostReader imports 20news-bydate. This takes several GB of heap. It would be 
> nice to bounce the data via JDBM or perhaps via the PersistentHashMap in 
> MAHOUT-19.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
