[
https://issues.apache.org/jira/browse/MAPREDUCE-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799186#comment-13799186
]
Ben Podgursky commented on MAPREDUCE-199:
-----------------------------------------
There doesn't seem to have been any progress on this recently, but this
functionality would be really helpful to us (we've been hunting for a way to do
exactly this).

Our use case is somewhat similar to the HBase one: we have a number of stores
which we keep sorted on the same keys and partitioned identically (e.g.,
partitioned into partfiles 0000-0599). When we need to join these stores,
instead of running a full map + reduce, we can just run one map task per
partfile number, which reads in the matching partfile from each side of the
join. Since we read these stores many times, it saves us a lot of cluster time
to sort the files only once.
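
As a rough illustration of why the pre-sorted, identically partitioned layout
lets a single map task do the join, here is a minimal sketch of a sort-merge
join over two key-sorted record lists. The class and method names
(SortedPartJoin, mergeJoin, kv) are hypothetical and the in-memory lists stand
in for the two partfiles; this is not our actual job code and it uses no Hadoop
APIs.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SortedPartJoin {

    /** Hypothetical helper: builds one (key, value) record. */
    public static Map.Entry<String, String> kv(String key, String value) {
        return new AbstractMap.SimpleImmutableEntry<>(key, value);
    }

    /**
     * Inner-joins two record lists that are both already sorted by key,
     * e.g. the same-numbered partfile of store A and of store B.
     * Runs in a single pass with no sort or shuffle step.
     */
    public static List<String> mergeJoin(List<Map.Entry<String, String>> left,
                                         List<Map.Entry<String, String>> right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            String leftKey = left.get(i).getKey();
            int cmp = leftKey.compareTo(right.get(j).getKey());
            if (cmp < 0) {
                i++;                       // key only on the left side
            } else if (cmp > 0) {
                j++;                       // key only on the right side
            } else {
                // Find the run of right-side records sharing this key.
                int runEnd = j;
                while (runEnd < right.size()
                        && right.get(runEnd).getKey().equals(leftKey)) {
                    runEnd++;
                }
                // Pair every left record with every right record in the run.
                while (i < left.size() && left.get(i).getKey().equals(leftKey)) {
                    for (int k = j; k < runEnd; k++) {
                        out.add(leftKey + "\t" + left.get(i).getValue()
                                + "\t" + right.get(k).getValue());
                    }
                    i++;
                }
                j = runEnd;
            }
        }
        return out;
    }
}
```

The sketch only works because both inputs share the same sort order and the
same partitioning; if either store were partitioned differently, matching keys
could sit in different partfiles and a full shuffle would be needed again.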

These files are each produced by a normal reduce task. It would be great if we
were able to give Hadoop a hint that part-0123 of store A and part-0123 of
store B should end up on the same host, so any job joining the two files will
be reading purely local data. Ideally we could accomplish this by giving
Hadoop a hint about where to run each reduce task so we don't have to shuffle
the data around later.
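
To make the request concrete, the kind of hint we have in mind could look
roughly like the sketch below. Everything in it is an assumption for
illustration: ReduceLocalityHint, preferredHosts, and ColocateWithStore are
hypothetical names, not existing Hadoop APIs, and the four-digit part-NNNN
naming simply follows the example above. It only shows the shape of a
per-partition host lookup, loosely analogous to the locations that
getSplits() reports for map inputs.

```java
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class ReduceLocalityHintSketch {

    /** Hypothetical hook: preferred hosts for the reducer of a partition. */
    public interface ReduceLocalityHint {
        List<String> preferredHosts(int partition);
    }

    /**
     * Hypothetical implementation that asks for each reducer to run where the
     * same-numbered partfile of an existing store already lives.
     */
    public static class ColocateWithStore implements ReduceLocalityHint {
        // Assumed input: partfile name -> hosts holding that file's data.
        private final Map<String, List<String>> hostsByPartFile;

        public ColocateWithStore(Map<String, List<String>> hostsByPartFile) {
            this.hostsByPartFile = hostsByPartFile;
        }

        /** Assumed naming scheme: partition 123 -> "part-0123". */
        public static String partFileName(int partition) {
            return String.format(Locale.ROOT, "part-%04d", partition);
        }

        @Override
        public List<String> preferredHosts(int partition) {
            List<String> hosts = hostsByPartFile.get(partFileName(partition));
            // No known location means no preference; the scheduler decides.
            return hosts == null ? Collections.<String>emptyList() : hosts;
        }
    }
}
```

Returning an empty list for unknown partitions keeps this a hint rather than a
constraint, so the scheduler would remain free to place the reducer anywhere.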
> Locality hints for Reduce
> -------------------------
>
> Key: MAPREDUCE-199
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-199
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Components: applicationmaster, mrv2
> Reporter: Benjamin Reed
> Assignee: Harsh J
> Attachments: MAPREDUCE-199.patch, MAPREDUCE-199.patch
>
>
> It would be nice if we could add a method to OutputFormat that would allow a
> job to indicate where a reducer for a given partition should run. This
> is similar to the getSplits() method on InputFormat. In our application the
> reducer is using other data in addition to the map outputs during processing
> and data accesses could be made more efficient if the JobTracker scheduled
> the reducers to run on specific hosts.