So, should we just make everything binary and then provide
Input/Output utilities that take the binary format and output
GSON? It seems that Canopy, since its output feeds into other
algorithms, should output Writable as well; otherwise we're
still going to be round-tripping through Text.
Then, it would be pretty easy to write a M/R job that takes Vectors
and outputs asFormatString(), right?
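To make the idea concrete, here is a minimal sketch of such a utility in plain Java. It is not Mahout or Hadoop code: the class BinaryToText, its record layout (an int length followed by that many doubles), and the use of Arrays.toString() as a stand-in for asFormatString() are all assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical utility: the names and the record layout are assumptions.
final class BinaryToText {

    // Writes each vector as an int length followed by that many doubles.
    static byte[] writeBinary(List<double[]> vectors) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (double[] vector : vectors) {
                out.writeInt(vector.length);
                for (double value : vector) {
                    out.writeDouble(value);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the binary records back and renders one text line per vector.
    // Arrays.toString() stands in for a real asFormatString().
    static List<String> toText(byte[] data) {
        List<String> lines = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                int length = in.readInt();
                double[] vector = new double[length];
                for (int i = 0; i < length; i++) {
                    vector[i] = in.readDouble();
                }
                lines.add(Arrays.toString(vector));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

In an M/R setting the body of toText() would presumably live in a map-only job, with the binary format staying the only thing passed between jobs.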
On Jun 19, 2009, at 8:54 PM, Jeff Eastman (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722105#action_12722105 ]
Jeff Eastman commented on MAHOUT-136:
-------------------------------------
r786738 committed the following changes.
- Modified CanopyMapper and CanopyReducer to produce and consume
  Canopy centroids as Writable values vs. previous formatStrings
- Modified CanopyMapper to specify SparseVector output from mapper
- Fixed null name hash() bug in SparseVector
- Modified Canopy.emitPointToExistingCanopies to emit only canopy id
  and not full serialized canopy.
- This eliminates the need for the OutputDriver and OutputMapper in
  synthetic control example so they are deleted.
- Updated unit tests; all tests run
- Synthetic control example runs
NOTE: When passing Vectors between Map and Reduce steps using
Writable format, Hadoop uses the *same instance* to do all of the
deserializations. I had to change the Canopy constructors to clone()
their center arguments so that the same instance would not be reused
for multiple canopies.
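The instance-reuse pitfall in the note above can be reproduced without Hadoop. The sketch below is plain Java under stated assumptions: ReusableVector and ReuseDemo are hypothetical names, and set() only mimics what a Writable's readFields() does when the framework deserializes the next value into the same object.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a Writable vector; not Mahout's real class.
final class ReusableVector implements Cloneable {
    double[] values = new double[0];

    // Mimics readFields(): overwrites this instance's state in place.
    void set(double... newValues) {
        values = newValues.clone();
    }

    @Override
    public ReusableVector clone() {
        ReusableVector copy = new ReusableVector();
        copy.values = values.clone();
        return copy;
    }
}

final class ReuseDemo {

    // Collects one center per record while a single shared instance is
    // refilled for every record, as the note says Hadoop does.
    static List<ReusableVector> collect(double[][] records, boolean cloneEach) {
        ReusableVector shared = new ReusableVector();
        List<ReusableVector> centers = new ArrayList<>();
        for (double[] record : records) {
            shared.set(record);
            // Without clone(), every list entry is the same object.
            centers.add(cloneEach ? shared.clone() : shared);
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] records = {{1.0, 1.0}, {5.0, 5.0}};
        // No clone: the first center has been overwritten by the last record.
        System.out.println(Arrays.toString(collect(records, false).get(0).values));
        // With clone: the first center keeps its own values.
        System.out.println(Arrays.toString(collect(records, true).get(0).values));
    }
}
```

Cloning at the point where the value is stored, as the Canopy constructors now do, keeps the fix next to the code that holds on to the reference.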
Change Canopy MR Implementation to use Vector Writable
------------------------------------------------------
Key: MAHOUT-136
URL: https://issues.apache.org/jira/browse/MAHOUT-136
Project: Mahout
Issue Type: Improvement
Components: Clustering
Affects Versions: 0.1
Reporter: Jeff Eastman
Assignee: Jeff Eastman
Fix For: 0.1
Internal serialization of Canopy currently uses asFormatString
rather than just making the Canopy writable. This is storage
inefficient.
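As a rough illustration of the storage point, the sketch below compares the size of a dense vector written in binary with the size of the same vector rendered as text. It is plain Java, not Mahout code: SizeComparison is a hypothetical name, the binary layout (int length plus doubles) is an assumption, and Arrays.toString() only approximates a format string.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; the layout and names are assumptions.
final class SizeComparison {

    // Binary form: a 4-byte length followed by 8 bytes per double.
    static int binarySize(double[] values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(values.length);
            for (double value : values) {
                out.writeDouble(value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Text form: the UTF-8 byte length of the printed vector.
    static int textSize(double[] values) {
        return Arrays.toString(values).getBytes(StandardCharsets.UTF_8).length;
    }
}
```

The binary size is always 4 + 8n bytes for n values, while the text size depends on how many digits each value prints. Short values such as 1.0 can be smaller as text, so the saving shows up mainly with full-precision doubles, and text additionally costs a parse on every read.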