So, should we just move everything to binary and then have input/output utilities that take the binary format and output GSON? It seems like Canopy, since it feeds into other algorithms, should output Writable as well; otherwise we're still going to be round-tripping through Text.

Then, it would be pretty easy to write an M/R job that takes Vectors and outputs asFormatString(), right?

On Jun 19, 2009, at 8:54 PM, Jeff Eastman (JIRA) wrote:


[ https://issues.apache.org/jira/browse/MAHOUT-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722105#action_12722105 ]

Jeff Eastman commented on MAHOUT-136:
-------------------------------------

r786738 committed the following changes.
- Modified CanopyMapper and CanopyReducer to produce and consume Canopy centroids as Writable values vs. previous formatStrings
- Modified CanopyMapper to specify SparseVector output from mapper
- Fixed null name hash() bug in SparseVector
- Modified Canopy.emitPointToExistingCanopies to emit only the canopy id and not the full serialized canopy. This eliminates the need for the OutputDriver and OutputMapper in the synthetic control example, so they are deleted.
- Updated unit tests; all tests run
- Synthetic control example runs

NOTE: When passing Vectors between Map and Reduce steps using Writable format, Hadoop uses the *same instance* to do all of the deserializations. I had to change the Canopy constructors to clone() their center arguments so that the same instance would not be reused for multiple canopies.
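The aliasing problem in the note above can be reproduced without Hadoop. The following is only a sketch: Vec, Canopy, readFrom and build are hypothetical stand-ins invented for illustration, not the Mahout or Hadoop classes. One Vec instance is refilled for every record, the way a reused Writable would be, and the Canopy constructor either keeps the reference or clones it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReusedInstanceDemo {

    /** Hypothetical stand-in for a Writable vector; not the Mahout class. */
    public static final class Vec implements Cloneable {
        public double[] values = new double[0];

        /** Overwrites this instance in place, as a Writable's readFields() would. */
        public void readFrom(double[] serialized) {
            values = Arrays.copyOf(serialized, serialized.length);
        }

        @Override
        public Vec clone() {
            Vec copy = new Vec();
            copy.values = values.clone();
            return copy;
        }
    }

    /** Hypothetical stand-in for Canopy; 'copy' toggles the defensive clone. */
    public static final class Canopy {
        public final Vec center;

        public Canopy(Vec center, boolean copy) {
            this.center = copy ? center.clone() : center;
        }
    }

    /** Simulates a framework that deserializes every record into one shared object. */
    public static List<Canopy> build(double[][] records, boolean copy) {
        Vec reused = new Vec(); // single instance, refilled for each record
        List<Canopy> canopies = new ArrayList<>();
        for (double[] record : records) {
            reused.readFrom(record);
            canopies.add(new Canopy(reused, copy));
        }
        return canopies;
    }

    public static void main(String[] args) {
        double[][] records = {{1.0, 1.0}, {5.0, 5.0}};
        for (Canopy c : build(records, false)) {
            System.out.println("aliased: " + Arrays.toString(c.center.values));
        }
        for (Canopy c : build(records, true)) {
            System.out.println("cloned:  " + Arrays.toString(c.center.values));
        }
    }
}
```

Without the clone, every canopy holds the same object and therefore sees whatever record was read last; with the clone, each canopy keeps the center it was created with.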

Change Canopy MR Implementation to use Vector Writable
------------------------------------------------------

               Key: MAHOUT-136
               URL: https://issues.apache.org/jira/browse/MAHOUT-136
           Project: Mahout
        Issue Type: Improvement
        Components: Clustering
  Affects Versions: 0.1
          Reporter: Jeff Eastman
          Assignee: Jeff Eastman
           Fix For: 0.1


Internal serialization of Canopy currently uses asFormatString rather than just making the Canopy writable. This is storage inefficient.
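To give a rough feel for the size difference, here is a small sketch that encodes the same three doubles once through java.io.DataOutputStream (the kind of DataOutput a Writable writes to) and once as text. The class and method names are made up for illustration, and Arrays.toString is only a stand-in for a text format; it is not what asFormatString actually produces, so treat the numbers as indicative at best.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingSizeDemo {

    /** Binary form: a 4-byte length followed by 8 bytes per double. */
    public static byte[] toBinary(double[] values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(values.length);
            for (double v : values) {
                out.writeDouble(v);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Text form: a stand-in for a string format, NOT Mahout's asFormatString. */
    public static byte[] toText(double[] values) {
        return Arrays.toString(values).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        double[] center = {Math.PI, Math.E, 1.0 / 3.0};
        System.out.println("binary bytes: " + toBinary(center).length);
        System.out.println("text bytes:   " + toText(center).length);
    }
}
```

The binary size is fixed by the element count, while the text size depends on how many digits each value needs, so values with short decimal forms would narrow or even reverse the gap.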

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
