[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated MAPREDUCE-1220:
-------------------------------------

    Attachment: MAPREDUCE-1220_yhadoop20.patch

I spent a long (and happy) weekend building a half-baked *prototype* for this...

Essentially, I've introduced a new kind of task, called "Uber Task", half in 
jest. I've got it to mimic the old local job-runner by running all maps 
serially and then a single reduce. It needs a lot more work to fix things on 
the JobTracker, TaskTracker, Scheduler and so on. Most of the effort went into 
teasing apart the framework code in MapTask and ReduceTask so that several 
pieces, such as MapOutputBuffer, ReduceValuesIterator etc., can be used as 
'pluggable' components. 
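
To make the execution model concrete, here is a rough sketch of "all maps 
serially, then a single reduce" inside one process. It is plain Python and not 
code from the patch; run_uber_task, word_count_map and word_count_reduce are 
made-up names for illustration, and the in-memory dictionary only stands in 
for what MapOutputBuffer and the reduce-side iterator would do.

```python
from collections import defaultdict


def run_uber_task(splits, map_fn, reduce_fn):
    """Run every map serially, then one reduce, all in a single process.

    Hypothetical helper for illustration; not part of Hadoop.
    """
    grouped = defaultdict(list)

    # Map phase: process one input split after another, no parallelism.
    for split in splits:
        for record in split:
            for key, value in map_fn(record):
                # Map output stays in memory, so nothing is shuffled
                # over the network.
                grouped[key].append(value)

    # Reduce phase: a single reduce visits the keys in sorted order.
    return {key: reduce_fn(key, grouped[key]) for key in sorted(grouped)}


def word_count_map(line):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in line.split():
        yield word, 1


def word_count_reduce(key, values):
    # Sum the counts collected for one word.
    return sum(values)


# Two tiny "splits", each holding one line of input.
splits = [["a b a"], ["b c"]]
result = run_uber_task(splits, word_count_map, word_count_reduce)
```

The real work is in letting the existing map-side and reduce-side components 
run in this one-process arrangement, which the sketch glosses over entirely.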

> Implement an in-cluster LocalJobRunner
> --------------------------------------
>
>                 Key: MAPREDUCE-1220
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1220
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: client, jobtracker
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1220_yhadoop20.patch
>
>
> Currently very small map-reduce jobs suffer from latency issues due to 
> overheads in Hadoop Map-Reduce such as scheduling, JVM startup etc. We've 
> periodically tried to optimize all parts of the framework to achieve lower 
> latencies.
> I'd like to turn the problem around a little bit. I propose we allow very 
> small jobs to run as a single-task job with multiple maps and reduces, i.e. 
> similar to our current implementation of the LocalJobRunner. Thus, under 
> certain conditions (maybe user-set configuration, or if input data is small, 
> i.e. less than a DFS blocksize) we could launch a special task which will 
> run all maps in a serial manner, followed by the reduces. This would really 
> help small jobs achieve significantly lower latencies, thanks to reduced 
> scheduling overhead and JVM startup, no shuffle over the network etc. 
> This would be a huge benefit, especially on large clusters, to small Hive/Pig 
> queries.
> Thoughts?
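
The conditions floated in the description quoted above (a user-set 
configuration, or input smaller than a DFS block) could be checked along these 
lines. This is only a sketch in Python: should_run_as_single_task and its 
parameter names are hypothetical, and the issue does not say how the two 
conditions would actually be combined.

```python
def should_run_as_single_task(input_bytes, dfs_block_size, user_opt_in=False):
    """Decide whether a job is small enough to run as one task.

    Hypothetical check: the function and parameter names are assumptions.
    """
    # Assumption: an explicit user setting is enough on its own.
    if user_opt_in:
        return True
    # Otherwise require the whole input to be smaller than one DFS block.
    return input_bytes < dfs_block_size


# Example: 10 MB of input against a 64 MB block size.
MB = 1024 * 1024
small_job = should_run_as_single_task(10 * MB, 64 * MB)
large_job = should_run_as_single_task(500 * MB, 64 * MB)
```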

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
