[ 
https://issues.apache.org/jira/browse/MAPREDUCE-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799186#comment-13799186
 ] 

Ben Podgursky commented on MAPREDUCE-199:
-----------------------------------------

There doesn't seem to have been any progress on this recently, but this 
functionality would be really helpful to us (we've been hunting for a way to do 
exactly this).

Our use case is somewhat similar to the HBase one: we have a number of stores 
which we keep sorted on the same keys and partitioned identically (e.g., 
partitioned into partfiles 0000-0599).  When we need to join these stores, 
instead of running a full map + reduce, we can just run one map task per 
partition, which reads in the matching partfile from each side of the join.  
Since we read these stores many times, sorting the files only once saves us a 
lot of cluster time.
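
As a rough illustration of what each of those map tasks does, here is a minimal 
sketch of a sort-merge join over two key-sorted record streams.  It is not our 
actual code: the class name, the entry() helper, the tab-separated output and 
the assumption that each key appears at most once per side are all made up for 
the example, and reading the partfiles themselves is left out.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class PartfileMergeJoin {

    /** Convenience helper (illustrative only) to build a key/value record. */
    public static Map.Entry<String, String> entry(String key, String value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    /**
     * Inner-joins two record streams that are already sorted by key.
     * Assumes each key appears at most once per side.
     * Emits one "key\tleftValue\trightValue" line per matching key.
     */
    public static List<String> join(Iterator<Map.Entry<String, String>> left,
                                    Iterator<Map.Entry<String, String>> right) {
        List<String> out = new ArrayList<>();
        Map.Entry<String, String> l = left.hasNext() ? left.next() : null;
        Map.Entry<String, String> r = right.hasNext() ? right.next() : null;

        // Advance whichever side has the smaller key; emit when keys match.
        while (l != null && r != null) {
            int cmp = l.getKey().compareTo(r.getKey());
            if (cmp == 0) {
                out.add(l.getKey() + "\t" + l.getValue() + "\t" + r.getValue());
                l = left.hasNext() ? left.next() : null;
                r = right.hasNext() ? right.next() : null;
            } else if (cmp < 0) {
                l = left.hasNext() ? left.next() : null;
            } else {
                r = right.hasNext() ? right.next() : null;
            }
        }
        return out;
    }
}
```

Because both inputs are already sorted and partitioned the same way, each task 
makes a single pass over its two partfiles and needs no shuffle or sort.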

These files are each produced by a normal reduce task.  It would be great if we 
were able to give Hadoop a hint that part-0123 of store A and part-0123 of 
store B should end up on the same host, so any job joining the two files will 
be reading purely local data.  Ideally we could accomplish this by giving 
Hadoop a hint about where to run each reduce task so we don't have to shuffle 
the data around later.
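
To make the kind of hint I mean concrete, here is a purely hypothetical sketch.  
Nothing in it is an existing Hadoop API: the class, the method and the 
"part-%04d" naming are assumptions for illustration, and the map of partfile 
name to hosts is assumed to have been looked up beforehand for the store that 
already exists.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class ReduceLocalityHint {

    /**
     * Hypothetical hint: the hosts on which the reduce task for the given
     * partition would ideally run, i.e. the hosts that already hold the
     * same-numbered partfile of an existing store.
     *
     * @param partition         reduce partition number, e.g. 123
     * @param existingPartHosts partfile name (e.g. "part-0123") to the hosts
     *                          storing it; assumed to be supplied by the caller
     * @return preferred hosts, or an empty list if there is no preference
     */
    public static List<String> preferredHosts(int partition,
                                              Map<String, List<String>> existingPartHosts) {
        // Assumed naming scheme: four-digit, zero-padded partition number.
        String partName = String.format("part-%04d", partition);
        return existingPartHosts.getOrDefault(partName, Collections.emptyList());
    }
}
```

A scheduler that honoured such a hint would try to place the reducer writing 
part-0123 of store B on one of the hosts returned for part-0123 of store A.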

> Locality hints for Reduce
> -------------------------
>
>                 Key: MAPREDUCE-199
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-199
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: applicationmaster, mrv2
>            Reporter: Benjamin Reed
>            Assignee: Harsh J
>         Attachments: MAPREDUCE-199.patch, MAPREDUCE-199.patch
>
>
> It would be nice if we could add a method to OutputFormat that would allow a 
> job to indicate where a reducer for a given partition should run. This 
> is similar to the getSplits() method on InputFormat. In our application the 
> reducer is using other data in addition to the map outputs during processing 
> and data accesses could be made more efficient if the JobTracker scheduled 
> the reducers to run on specific hosts.



