[ https://issues.apache.org/jira/browse/CRUNCH-213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gabriel Reid updated CRUNCH-213:
--------------------------------
Attachment: CRUNCH-213.patch
I like the idea of using the task ID as the seed for the random number
generator to ensure deterministic behaviour, even if it isn't strictly
necessary. Here's an updated patch that uses the task ID as the seed.
> Add sharded join functionality
> ------------------------------
>
> Key: CRUNCH-213
> URL: https://issues.apache.org/jira/browse/CRUNCH-213
> Project: Crunch
> Issue Type: New Feature
> Reporter: Gabriel Reid
> Assignee: Gabriel Reid
> Attachments: CRUNCH-213.patch, CRUNCH-213.patch
>
>
> Performing joins where a large proportion of the values on one or both sides
> of the join are mapped to a single key can result in poor performance, as one
> reducer (or a small number of reducers) ends up handling most of the joining
> work, leaving the rest of the cluster idle.
> Sharded joining should be added to allow splitting up join keys, thereby
> distributing values mapped to a single key over multiple reducer partitions.
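
For anyone following along, here is a rough sketch of the general idea in plain Java. It is not the code in the attached patch: the class name, method names, and String-based record types are made up for illustration, and it runs in memory rather than on Crunch or Hadoop. One side of the join is scattered over a fixed number of shards per key using a Random seeded with a task ID, the other side is replicated to every shard, and an ordinary join is then done on the composite (key, shard) key.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration only; not the classes or API of the attached patch.
final class ShardedJoinSketch {

    // Skewed side: tag each record with one pseudo-random shard in [0, numShards).
    // Seeding the generator with the task ID means a re-run of the same task
    // assigns the same shards to the same input sequence.
    static List<Map.Entry<Map.Entry<String, Integer>, String>> scatter(
            List<Map.Entry<String, String>> records, int numShards, long taskId) {
        Random random = new Random(taskId);
        List<Map.Entry<Map.Entry<String, Integer>, String>> out = new ArrayList<>();
        for (Map.Entry<String, String> r : records) {
            Map.Entry<String, Integer> shardedKey =
                    new SimpleImmutableEntry<>(r.getKey(), random.nextInt(numShards));
            out.add(new SimpleImmutableEntry<>(shardedKey, r.getValue()));
        }
        return out;
    }

    // Other side: copy each record to every shard, so it meets every scattered
    // record that shares its original key.
    static List<Map.Entry<Map.Entry<String, Integer>, String>> replicate(
            List<Map.Entry<String, String>> records, int numShards) {
        List<Map.Entry<Map.Entry<String, Integer>, String>> out = new ArrayList<>();
        for (Map.Entry<String, String> r : records) {
            for (int shard = 0; shard < numShards; shard++) {
                Map.Entry<String, Integer> shardedKey =
                        new SimpleImmutableEntry<>(r.getKey(), shard);
                out.add(new SimpleImmutableEntry<>(shardedKey, r.getValue()));
            }
        }
        return out;
    }

    // In-memory inner hash join on the composite (key, shard) key.
    // Each result row is [original key, scattered-side value, replicated-side value].
    static List<List<String>> join(
            List<Map.Entry<Map.Entry<String, Integer>, String>> scattered,
            List<Map.Entry<Map.Entry<String, Integer>, String>> replicated) {
        Map<Map.Entry<String, Integer>, List<String>> index = new HashMap<>();
        for (Map.Entry<Map.Entry<String, Integer>, String> r : replicated) {
            index.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        List<List<String>> out = new ArrayList<>();
        for (Map.Entry<Map.Entry<String, Integer>, String> l : scattered) {
            for (String rightValue : index.getOrDefault(l.getKey(), new ArrayList<>())) {
                out.add(Arrays.asList(l.getKey().getKey(), l.getValue(), rightValue));
            }
        }
        return out;
    }
}
```

The trade-off in this sketch is that the replicated side grows by a factor of numShards, so it only pays off when the scattered side is the one with the heavily skewed keys. The joined output is the same as for an unsharded inner join; only the distribution of work across partitions changes.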