[ https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wettin updated MAHOUT-61:
------------------------------

    Attachment: MAHOUT-61.txt

This is what it is now:

1. InstanceHandler gathers instances.
2. TokenizationMapper, Reducer and Combiner create one intermediate MapWritable instance (see [4]). These are reduced down to unique feature names and class values.
3. The features and class values are placed in maps, assigned column indices and numeric values, and stored as a MapFile on DFS.
4. VectorBuilderMapper is a map-only job that uses the results from [2] and [3] to produce sparse vectors.

> Text problem matrix builder
> ----------------------------
>
>                 Key: MAHOUT-61
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-61
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-61.txt, MAHOUT-61.txt, MAHOUT-61.txt
>
>
> A set of classes that builds matrices from text.
> Currently the API consists of TokenMatrixBuilder and TokenInstanceBuilder.
> Should be thread safe.
> PostReader imports 20news-bydate. This takes several GB heap. It would be
> nice to bounce the data via JDBM or perhaps using the PersistentHashMap in
> MAHOUT-19.
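
For readers who want the shape of steps 2 to 4 of the update without the Hadoop plumbing, here is a minimal in-memory sketch. It is not taken from the attached patch: the class name TextMatrixSketch and both method names are hypothetical, plain Java collections stand in for MapWritable and MapFile, and using raw term counts as the vector values is an assumption, since the update does not say how values are computed.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names come from MAHOUT-61.txt.
final class TextMatrixSketch {

    /**
     * First pass (roughly steps 2 and 3): reduce all tokens to the set of
     * unique feature names, then assign each name a column index.
     */
    static Map<String, Integer> buildFeatureIndex(List<List<String>> tokenizedInstances) {
        // A sorted set makes the column assignment deterministic.
        TreeSet<String> uniqueNames = new TreeSet<>();
        for (List<String> tokens : tokenizedInstances) {
            uniqueNames.addAll(tokens);
        }
        Map<String, Integer> index = new TreeMap<>();
        int column = 0;
        for (String name : uniqueNames) {
            index.put(name, column++);
        }
        return index;
    }

    /**
     * Second pass (roughly step 4): turn one tokenized instance into a sparse
     * vector, stored as column index -> value. Only non-zero columns appear.
     * Assumption: the value is the raw term count.
     */
    static Map<Integer, Double> toSparseVector(List<String> tokens, Map<String, Integer> index) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String token : tokens) {
            Integer column = index.get(token);
            if (column != null) { // skip tokens missing from the index
                vector.merge(column, 1.0, Double::sum);
            }
        }
        return vector;
    }
}
```

With the instances [b, a, b] and [c, a], buildFeatureIndex assigns a=0, b=1, c=2, and toSparseVector maps the first instance to {0=1.0, 1=2.0}. The second pass needs only the finished index and a single instance at a time, which is why it can run without a reduce phase. Class values could be given numeric ids by the same indexing method.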