So, are we to make these changes on all the Mappers/Reducers?
On Jun 19, 2009, at 8:54 PM, Jeff Eastman (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722105#action_12722105 ]
Jeff Eastman commented on MAHOUT-136:
-------------------------------------
r786738 committed the following changes.
- Modified CanopyMapper and CanopyReducer to produce and consume
Canopy centroids as Writable values rather than the previous format Strings
- Modified CanopyMapper to specify SparseVector output from the mapper
(see the configuration sketch after this list)
- Fixed null name hash() bug in SparseVector
- Modified Canopy.emitPointToExistingCanopies to emit only the canopy id
rather than the full serialized canopy.
- This eliminates the need for the OutputDriver and OutputMapper in the
synthetic control example, so they have been deleted.
- Updated unit tests; all tests run
- Synthetic control example runs
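For anyone wondering what this looks like on the driver side, the job setup is roughly the sketch below. This is a minimal sketch using the old mapred API; the class and package names (CanopyJobSketch, org.apache.mahout.matrix.SparseVector, and the fully qualified canopy classes) are assumptions for illustration, not the committed CanopyDriver code.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public final class CanopyJobSketch {

      private CanopyJobSketch() {
      }

      // Builds a JobConf whose intermediate values are SparseVectors and whose
      // final values are Canopies, both written as Writables rather than as
      // format Strings.
      public static JobConf createJobConf() {
        JobConf conf = new JobConf(CanopyJobSketch.class);
        // map output: point vectors as Writable values
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(org.apache.mahout.matrix.SparseVector.class);
        // reduce output: whole canopies as Writable values
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(org.apache.mahout.clustering.canopy.Canopy.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        conf.setMapperClass(org.apache.mahout.clustering.canopy.CanopyMapper.class);
        conf.setReducerClass(org.apache.mahout.clustering.canopy.CanopyReducer.class);
        return conf;
      }
    }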
NOTE: When passing Vectors between Map and Reduce steps using
Writable format, Hadoop uses the *same instance* to do all of the
deserializations. I had to change the Canopy constructors to clone()
their center arguments so that the same instance would not be reused
for multiple canopies.
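To make that concrete, the fix amounts to the pattern sketched below. It is a simplified illustration assuming the Vector interface exposes a public clone(); CanopySketch is a made-up name, not the committed Canopy class.

    import org.apache.mahout.matrix.Vector;

    public class CanopySketch {

      private final Vector center;

      public CanopySketch(Vector center) {
        // The reducer's value iterator hands back the SAME deserialized Vector
        // instance on every call to next(), so take a private copy here.
        // Without the clone(), every canopy built in one reduce call would end
        // up sharing (and mutating) a single center object.
        this.center = (Vector) center.clone();
      }

      public Vector getCenter() {
        return center;
      }
    }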
Change Canopy MR Implementation to use Vector Writable
------------------------------------------------------
Key: MAHOUT-136
URL: https://issues.apache.org/jira/browse/MAHOUT-136
Project: Mahout
Issue Type: Improvement
Components: Clustering
Affects Versions: 0.1
Reporter: Jeff Eastman
Assignee: Jeff Eastman
Fix For: 0.1
Internal serialization of Canopy currently uses asFormatString
rather than simply making Canopy Writable. This is
storage-inefficient.
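For reference, the Writable approach boils down to the sketch below: write the fields in a compact binary form instead of round-tripping through a formatted String. Field names here are illustrative only; the real Canopy carries more state and different types.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;

    public class CanopyWritableSketch implements Writable {

      private int canopyId;
      private double[] center = new double[0]; // stand-in for the centroid vector

      public void write(DataOutput out) throws IOException {
        out.writeInt(canopyId);
        out.writeInt(center.length);
        for (double d : center) {
          out.writeDouble(d); // compact binary encoding vs. asFormatString text
        }
      }

      public void readFields(DataInput in) throws IOException {
        canopyId = in.readInt();
        center = new double[in.readInt()];
        for (int i = 0; i < center.length; i++) {
          center[i] = in.readDouble();
        }
      }
    }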
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/