[ https://issues.apache.org/jira/browse/MAHOUT-82?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640559#action_12640559 ]
Jeff Eastman commented on MAHOUT-82:
------------------------------------
I applied the patch and the unit tests continue to pass. The change affects the
communication between the mappers and their combiners, not between the
combiners and the reducer, so my earlier comment referred to a different
interface. In this instance, the mapper records *can* be keyed by the canopyId
alone, since they share a common id-space. The current implementation passes
the entire canopy, including its original center, as the key, but this
information is not used when summing points into the centroid that is output
to the reducer. The key is only used to correlate the points during copy/sort
prior to the summing process and is not used within it.
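
To illustrate the point above, here is a minimal sketch in plain Java. It does
not use Hadoop or the actual Mahout classes. The class CanopyKeySketch, the
MapOutput record type, the key format "C0: [1.0, 2.0]" and the helper names are
all hypothetical stand-ins. groupByKey plays the role of copy/sort and
sumPoints plays the role of the combiner's summing step, which never looks at
the key. Under these assumptions, keying by the full canopy text and keying by
the id alone yield the same per-canopy sums.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CanopyKeySketch {

    /** Hypothetical mapper output record: a text key plus one point. */
    public static final class MapOutput {
        final String key;
        final double[] point;

        public MapOutput(String key, double[] point) {
            this.key = key;
            this.point = point;
        }
    }

    /** Stand-in for copy/sort: groups points by key, treating the key as opaque text. */
    public static Map<String, List<double[]>> groupByKey(List<MapOutput> records) {
        Map<String, List<double[]>> groups = new TreeMap<>();
        for (MapOutput r : records) {
            groups.computeIfAbsent(r.key, k -> new ArrayList<>()).add(r.point);
        }
        return groups;
    }

    /** Stand-in for the combiner's summing step: the key is never consulted here. */
    public static double[] sumPoints(List<double[]> points) {
        double[] sum = new double[points.get(0).length];
        for (double[] p : points) {
            for (int i = 0; i < sum.length; i++) {
                sum[i] += p[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] c0Points = {{1.0, 2.0}, {3.0, 4.0}};
        double[][] c1Points = {{10.0, 10.0}};

        List<MapOutput> longKeys = new ArrayList<>();
        List<MapOutput> shortKeys = new ArrayList<>();
        for (double[] p : c0Points) {
            longKeys.add(new MapOutput("C0: [1.0, 2.0]", p)); // id plus original center
            shortKeys.add(new MapOutput("C0", p));             // id only
        }
        for (double[] p : c1Points) {
            longKeys.add(new MapOutput("C1: [10.0, 10.0]", p));
            shortKeys.add(new MapOutput("C1", p));
        }

        // Both keyings produce the same groups of points, hence the same sums.
        for (Map.Entry<String, List<double[]>> e : groupByKey(longKeys).entrySet()) {
            System.out.println(e.getKey() + " -> " + Arrays.toString(sumPoints(e.getValue())));
        }
        for (Map.Entry<String, List<double[]>> e : groupByKey(shortKeys).entrySet()) {
            System.out.println(e.getKey() + " -> " + Arrays.toString(sumPoints(e.getValue())));
        }
    }
}
```

The only difference between the two runs is the length of the key that has to
be carried and compared while grouping.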
I think Edward's patch represents a small performance improvement, in that
shorter keys would presumably be cheaper to write and compare during copy/sort
than longer ones.
+1
Jeff
> Canopy map intermediate file structure should be keyed by canopyId.
> -------------------------------------------------------------------
>
> Key: MAHOUT-82
> URL: https://issues.apache.org/jira/browse/MAHOUT-82
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.1
> Reporter: Edward J. Yoon
> Fix For: 0.1
>
> Attachments: MAHOUT-82.patch
>
>
> When emitting a point to the collector, it should be keyed by canopyId
> without the computed centroid (or use another key datum instead of
> hadoop.io.Text).