[ https://issues.apache.org/jira/browse/MAHOUT-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722105#action_12722105 ]
Jeff Eastman commented on MAHOUT-136:
-------------------------------------
Committed the following changes in r786738:
- Modified CanopyMapper and CanopyReducer to produce and consume Canopy
centroids as Writable values rather than the previous format strings
- Modified CanopyMapper to specify SparseVector output from mapper
- Fixed hash() bug in SparseVector when the name is null
- Modified Canopy.emitPointToExistingCanopies to emit only canopy id and not
full serialized canopy.
- This eliminates the need for the OutputDriver and OutputMapper in the
synthetic control example, so they have been deleted.
- Updated unit tests; all tests run
- Synthetic control example runs
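For readers unfamiliar with the Writable approach, here is a minimal sketch of binary serialization in that style. It uses only java.io so it runs without Hadoop on the classpath. CenterRecord and its fields are hypothetical stand-ins, not the actual Canopy or SparseVector classes; only the write(DataOutput) and readFields(DataInput) method shapes mirror Hadoop's Writable interface.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class CenterRecordDemo {

    /** Hypothetical record; NOT Mahout's Canopy or SparseVector. */
    static final class CenterRecord {
        int canopyId;
        double[] center = new double[0];

        // Same method shape as Writable.write: emit fields in binary form.
        public void write(DataOutput out) throws IOException {
            out.writeInt(canopyId);
            out.writeInt(center.length);
            for (double v : center) {
                out.writeDouble(v);
            }
        }

        // Same method shape as Writable.readFields: overwrite this instance.
        public void readFields(DataInput in) throws IOException {
            canopyId = in.readInt();
            center = new double[in.readInt()];
            for (int i = 0; i < center.length; i++) {
                center[i] = in.readDouble();
            }
        }
    }

    static byte[] toBytes(CenterRecord record) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            record.write(out);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    static CenterRecord fromBytes(byte[] bytes) {
        try {
            CenterRecord record = new CenterRecord();
            record.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return record;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        CenterRecord record = new CenterRecord();
        record.canopyId = 7;
        record.center = new double[] {0.25, 1.5, 3.0};

        byte[] bytes = toBytes(record);
        CenterRecord copy = fromBytes(bytes);

        // Two ints (4 bytes each) plus three doubles (8 bytes each) = 32 bytes.
        System.out.println(bytes.length);
        System.out.println(copy.canopyId + " " + copy.center[2]);
    }
}
```

The binary size is fixed by the number of fields, whereas a text rendering grows with the number of digits printed for each value.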
NOTE: When passing Vectors between Map and Reduce steps using Writable format,
Hadoop uses the *same instance* to do all of the deserializations. I had to
change the Canopy constructors to clone() their center arguments so that the
same instance would not be reused for multiple canopies.
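The instance-reuse pitfall can be reproduced without Hadoop. The sketch below is a simulation under stated assumptions: MutableVector and reusingIterator are hypothetical names that mimic a framework refilling one shared object per record; they are not Mahout or Hadoop classes. It shows why keeping the reference goes wrong and why a clone() at the point of capture fixes it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReusedInstanceDemo {

    /** Hypothetical stand-in for a mutable, Writable-style vector. */
    static final class MutableVector implements Cloneable {
        double[] values = new double[0];

        void set(double[] newValues) {
            values = newValues.clone();
        }

        @Override
        public MutableVector clone() {
            MutableVector copy = new MutableVector();
            copy.values = values.clone();
            return copy;
        }
    }

    /** Mimics the framework: ONE shared instance is refilled for every record. */
    static Iterator<MutableVector> reusingIterator(final double[][] records) {
        final MutableVector shared = new MutableVector();
        return new Iterator<MutableVector>() {
            private int index = 0;

            public boolean hasNext() {
                return index < records.length;
            }

            public MutableVector next() {
                shared.set(records[index++]);
                return shared; // always the same object
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    /** Collects centers, either keeping the raw reference or a defensive copy. */
    static List<MutableVector> collect(double[][] records, boolean copy) {
        List<MutableVector> centers = new ArrayList<MutableVector>();
        Iterator<MutableVector> it = reusingIterator(records);
        while (it.hasNext()) {
            MutableVector v = it.next();
            centers.add(copy ? v.clone() : v);
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] records = { {1.0, 2.0}, {3.0, 4.0} };

        // Without a copy, every stored center aliases the shared instance.
        System.out.println(Arrays.toString(collect(records, false).get(0).values)); // [3.0, 4.0]

        // With clone(), each center keeps the values it was created with.
        System.out.println(Arrays.toString(collect(records, true).get(0).values)); // [1.0, 2.0]
    }
}
```

Cloning inside the constructor, as described above, puts the defensive copy in one place so callers cannot forget it.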
> Change Canopy MR Implementation to use Vector Writable
> ------------------------------------------------------
>
> Key: MAHOUT-136
> URL: https://issues.apache.org/jira/browse/MAHOUT-136
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.1
> Reporter: Jeff Eastman
> Assignee: Jeff Eastman
> Fix For: 0.1
>
>
> Internal serialization of Canopy currently uses asFormatString rather than
> just making the Canopy writable. This is storage inefficient.