[ https://issues.apache.org/jira/browse/MAPREDUCE-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Greg Roelofs updated MAPREDUCE-1220:
------------------------------------
Attachment: MR-1220.v2.trunk-hadoop-mapreduce.patch.txt
Oops, here's the real updated patch. (Forgot to "git add" the two new files,
sigh.)
> Implement an in-cluster LocalJobRunner
> --------------------------------------
>
> Key: MAPREDUCE-1220
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1220
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Components: client, jobtracker
> Reporter: Arun C Murthy
> Assignee: Greg Roelofs
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1220_yhadoop20.patch,
> MR-1220.v2.trunk-hadoop-mapreduce.patch.txt,
> MR-1220.v2.trunk-hadoop-mapreduce.patch.txt
>
>
> Currently, very small map-reduce jobs suffer from latency issues due to
> overheads in Hadoop Map-Reduce such as scheduling, JVM startup, etc. We've
> periodically tried to optimize all parts of the framework to achieve lower
> latencies.
> I'd like to turn the problem around a little bit. I propose we allow very
> small jobs to run as a single-task job with multiple maps and reduces, i.e.
> similar to our current implementation of the LocalJobRunner. Thus, under
> certain conditions (maybe a user-set configuration, or if the input data is
> small, i.e. less than a DFS block size) we could launch a special task which
> will run all maps serially, followed by the reduces. This would really help
> small jobs achieve significantly lower latencies, thanks to less scheduling
> overhead and JVM startup, no shuffle over the network, etc.
> This would be a huge benefit, especially on large clusters, to small Hive/Pig
> queries.
> Thoughts?
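
For readers skimming the description above, here is a rough sketch of the idea
in plain Java. It is not taken from the attached patch and does not use any
Hadoop API: the class SmallJobSketch, its methods, and the "opt-in or input
smaller than one DFS block" rule are all hypothetical names and assumptions
chosen for illustration. It only shows the shape of the proposal: decide
whether a job is small enough, then run every map serially in one JVM, group
the map output in memory (no network shuffle), and run the reduces.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

/** Hypothetical sketch only; none of these names exist in Hadoop or the patch. */
final class SmallJobSketch {
    private SmallJobSketch() {}

    /**
     * Assumed policy: run as a single task if the user opted in, or if the
     * total input is smaller than one DFS block.
     */
    static boolean shouldRunInSingleTask(boolean userOptIn,
                                         long totalInputBytes,
                                         long dfsBlockSizeBytes) {
        return userOptIn || totalInputBytes < dfsBlockSizeBytes;
    }

    /** Runs all maps one after another, then all reduces, in the calling JVM. */
    static <K extends Comparable<K>, V, R> Map<K, R> runSerially(
            List<String> splits,
            Function<String, List<Map.Entry<K, V>>> mapper,
            BiFunction<K, List<V>, R> reducer) {
        // Map phase: process each split in turn; a sorted in-memory map
        // stands in for the sort/shuffle step.
        Map<K, List<V>> grouped = new TreeMap<>();
        for (String split : splits) {
            for (Map.Entry<K, V> kv : mapper.apply(split)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        // Reduce phase: one reducer call per key, in key order.
        Map<K, R> output = new TreeMap<>();
        for (Map.Entry<K, List<V>> e : grouped.entrySet()) {
            output.put(e.getKey(), reducer.apply(e.getKey(), e.getValue()));
        }
        return output;
    }

    /** Example mapper: emits (word, 1) for each whitespace-separated word. */
    static List<Map.Entry<String, Integer>> wordCountMap(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    /** Example reducer: sums the counts collected for one word. */
    static Integer sum(String key, List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }
}
```

A caller would first check shouldRunInSingleTask(...) with the job's input
size and the block size, and if it returns true, call
SmallJobSketch.runSerially(splits, SmallJobSketch::wordCountMap,
SmallJobSketch::sum) instead of scheduling separate map and reduce tasks.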