Implement an in-cluster LocalJobRunner
--------------------------------------

                 Key: MAPREDUCE-1220
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1220
             Project: Hadoop Map/Reduce
          Issue Type: New Feature
          Components: client, jobtracker
            Reporter: Arun C Murthy
            Assignee: Arun C Murthy
             Fix For: 0.22.0


Currently, very small map-reduce jobs suffer from latency issues due to
overheads in Hadoop Map-Reduce such as scheduling, JVM startup, etc. We've
periodically tried to optimize all parts of the framework to achieve lower
latencies.

I'd like to turn the problem around a little bit. I propose we allow very small
jobs to run as a single-task job with multiple maps and reduces, i.e. similar to
our current implementation of the LocalJobRunner. Thus, under certain
conditions (maybe a user-set configuration, or if the input data is small, i.e.
less than a DFS blocksize) we could launch a special task which will run all
maps serially, followed by the reduces. This would really help small jobs
achieve significantly lower latencies, thanks to reduced scheduling overhead,
a single JVM startup, no shuffle over the network, etc.
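
To make the idea concrete, here is a rough sketch of the decision and the
serial execution. It is illustrative only and uses no Hadoop APIs: the class
SmallJobSketch, the methods shouldRunInSingleTask and runSerially, and the
64 MB block size are all assumptions made up for this example, not existing
framework code.

```java
import java.util.ArrayList;
import java.util.List;

public class SmallJobSketch {

    // Hypothetical policy: run as a single task if the user opted in,
    // or if the total input is smaller than one DFS block.
    public static boolean shouldRunInSingleTask(boolean userOptIn,
                                                long inputBytes,
                                                long dfsBlockSize) {
        return userOptIn || inputBytes < dfsBlockSize;
    }

    // Runs every map one after another, then every reduce, all in the
    // calling thread. This models "one special task in one JVM".
    public static void runSerially(List<Runnable> maps, List<Runnable> reduces) {
        for (Runnable map : maps) {
            map.run();
        }
        for (Runnable reduce : reduces) {
            reduce.run();
        }
    }

    public static void main(String[] args) {
        long dfsBlockSize = 64L * 1024 * 1024; // assumed block size (64 MB)
        long inputBytes = 10L * 1024 * 1024;   // a small 10 MB input

        // Records the order in which the stand-in tasks execute.
        List<String> log = new ArrayList<>();

        List<Runnable> maps = new ArrayList<>();
        maps.add(() -> log.add("map-0"));
        maps.add(() -> log.add("map-1"));

        List<Runnable> reduces = new ArrayList<>();
        reduces.add(() -> log.add("reduce-0"));

        if (shouldRunInSingleTask(false, inputBytes, dfsBlockSize)) {
            runSerially(maps, reduces);
        }
        System.out.println(log);
    }
}
```

In a real implementation the Runnables would be actual map and reduce task
attempts, and the map outputs would be handed to the reduces locally rather
than fetched over the network.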

This would be a huge benefit, especially on large clusters, to small Hive/Pig 
queries.

Thoughts?
