[ 
https://issues.apache.org/jira/browse/MAHOUT-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722105#action_12722105
 ] 

Jeff Eastman commented on MAHOUT-136:
-------------------------------------

r786738 committed the following changes.
- Modified CanopyMapper and CanopyReducer to produce and consume Canopy 
centroids as Writable values instead of the previous format strings
- Modified CanopyMapper to specify SparseVector output from mapper
- Fixed null name hash() bug in SparseVector
- Modified Canopy.emitPointToExistingCanopies to emit only the canopy id 
rather than the full serialized canopy. 
- This eliminates the need for the OutputDriver and OutputMapper in the 
synthetic control example, so they have been deleted.
- Updated unit tests; all tests run
- Synthetic control example runs

NOTE: When Vectors are passed between the Map and Reduce steps in Writable 
format, Hadoop reuses the *same instance* for every deserialization. I had to 
change the Canopy constructors to clone() their center arguments so that a 
single instance would not be shared by multiple canopies.
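
The instance-reuse pitfall in the note above can be sketched without any 
Hadoop dependency. This is a minimal illustration only, not the Mahout code: 
MutableVector, Canopy, buildCanopies and the sample records are hypothetical 
stand-ins, and the single reused object plays the role of the instance that 
Hadoop refills on each deserialization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CanopyReuseDemo {

    /** Hypothetical stand-in for a Writable vector that is refilled in place. */
    static final class MutableVector implements Cloneable {
        double[] values = new double[0];

        /** Mimics readFields(): overwrites the contents of this same instance. */
        void readFields(double[] serialized) {
            values = Arrays.copyOf(serialized, serialized.length);
        }

        @Override
        public MutableVector clone() {
            MutableVector copy = new MutableVector();
            copy.values = Arrays.copyOf(values, values.length);
            return copy;
        }
    }

    /** Hypothetical canopy that only stores its center. */
    static final class Canopy {
        final MutableVector center;

        Canopy(MutableVector center, boolean copyCenter) {
            // With copyCenter == false the canopy aliases the caller's instance.
            this.center = copyCenter ? center.clone() : center;
        }
    }

    /** Builds one canopy per record while reusing a single vector instance. */
    static List<Canopy> buildCanopies(double[][] records, boolean copyCenter) {
        MutableVector reused = new MutableVector(); // one instance for all records
        List<Canopy> canopies = new ArrayList<>();
        for (double[] record : records) {
            reused.readFields(record);
            canopies.add(new Canopy(reused, copyCenter));
        }
        return canopies;
    }

    public static void main(String[] args) {
        double[][] records = {{1.0, 1.0}, {5.0, 5.0}};
        for (boolean copyCenter : new boolean[] {false, true}) {
            List<Canopy> canopies = buildCanopies(records, copyCenter);
            System.out.println("clone=" + copyCenter + " first center="
                    + Arrays.toString(canopies.get(0).center.values));
        }
    }
}
```

Without the clone, every canopy ends up pointing at the same vector, so the 
first canopy's center silently takes on the values of the last record read. 
With the clone, each canopy keeps the center it was constructed with.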

> Change Canopy MR Implementation to use Vector Writable
> ------------------------------------------------------
>
>                 Key: MAHOUT-136
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-136
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>             Fix For: 0.1
>
>
> Internal serialization of Canopy currently uses asFormatString rather than 
> just making the Canopy writable. This is storage inefficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
