[ https://issues.apache.org/jira/browse/MAPREDUCE-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909583#action_12909583 ]

Greg Roelofs commented on MAPREDUCE-1220:
-----------------------------------------

I've been looking at what it will take to extend this from Arun's February 
sketch to something that will actually schedule and run small jobs. At the 
moment it looks like it will largely avoid the JobTracker; most of the action seems 
to be in JobInProgress (particularly {{initTasks()}} and {{obtainNew*Task()}}), 
TaskInProgress ({{findNew*Task()}}, {{getTaskToRun()}}, {{addRunningTask()}}), 
and the scheduler ({{assignTasks()}}).
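
To make the hand-off concrete: I picture {{initTasks()}} applying a gate of 
roughly this shape before collapsing the job into a single uber-task. The name 
{{isUberable()}} and every threshold in it are placeholders of mine, not 
anything in the existing patches:

{code:java}
// Hypothetical eligibility check for collapsing a job into one uber-task.
// Method name and all thresholds here are placeholders, not from a patch.
static boolean isUberable(int numMaps, int numReduces,
                          long totalInputBytes, long dfsBlockSize) {
  return numMaps <= 9                      // "small" map count (illustrative)
      && numReduces <= 1                   // serial model: at most one reduce
      && totalInputBytes <= dfsBlockSize;  // "small input" per the description
}
{code}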

I haven't asked Arun what he originally had in mind, but it seems that there 
are two fairly obvious approaches:

 * treat UberTask (or MetaTask or MultiTask or AllInOneTask or ...) as a third 
kind of Task, not exactly like either MapTask or ReduceTask but a peer (more or 
less) to both
 * treat UberTask as a variant (subclass) of MapTask

I see the first approach as conceptually cleaner, and some of its 
implementation details would be cleaner as well, but overall it's harder to 
implement: there are lots of places where there's map-vs-reduce logic 
(occasionally with special setup/cleanup cases), and most of it would require 
modifications. The second approach seems slightly hackish, but it has at least 
that one big advantage: many of the map-vs-reduce bits of code would not 
require changes - in particular, I don't believe it would be necessary to touch 
the schedulers (and we're up to, what, four at this point?) since they'd simply 
see a job containing one map and no reduces.
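
To make that second approach concrete, here's roughly the control flow I have 
in mind for the uber-task's {{run()}} method. To be clear, this is a sketch of 
mine with made-up names ({{UberTask}}, {{runMap()}}, {{runReduce()}}), not code 
from Arun's patch; the real thing would delegate to the existing 
MapTask/ReduceTask machinery rather than to stubs:

{code:java}
import java.util.List;
import org.apache.hadoop.mapred.InputSplit;

// Shape of the second approach: the schedulers see a single map, but that
// "map" internally runs every real split serially and then the lone reduce.
// UberTask, runMap(), and runReduce() are illustrative names only.
class UberTask /* extends MapTask, per approach #2 */ {
  void run(List<InputSplit> splits) throws Exception {
    for (InputSplit split : splits) {
      runMap(split);   // ordinary map logic over one real split; output
    }                  // spills to local disk exactly as a MapTask's would
    runReduce();       // at most one reduce, fed straight from the local
  }                    // map output: no network shuffle, no extra JVMs

  private void runMap(InputSplit split) { /* stub */ }
  private void runReduce()              { /* stub */ }
}
{code}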

Thoughts? (I don't yet have a strong opinion myself, though I'm interested in 
getting a proof of concept running quickly so I/we can see what works and 
whether it's a net win in the first place; from that perspective, the latter 
approach may be better.)

Either way, there will be changes needed in the UI, metrics, and other 
accounting to surface the tasks-within-a-task details, as well as the obvious 
configuration-related changes. Retry logic, speculation, etc., are still 
unclear (to me, anyway).

Stab at a couple of the other upstream questions:

bq. Will users be able to set a number of reduce tasks greater than one? Or are 
you limited to at most one reduce task?

The latter, at least for now. This is somewhat related to the question of 
serial execution; i.e., if it's serial, the only reason you would want multiple 
reduces is if a single one is too big to fit in memory, and if that's the case, 
it's arguably not a "small job" anymore.
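
In code terms the restriction is just a guard on the job's reduce count; 
{{getNumReduceTasks()}} is the real JobConf accessor, but whether we'd reject 
outright or silently fall back to normal scheduling is an open question 
(throwing below is purely for illustration):

{code:java}
import org.apache.hadoop.mapred.JobConf;

class UberEligibility {
  // "At most one reduce" as a client-side guard. Throwing here is my
  // assumption; falling back to normal scheduling is the likelier behavior.
  static void check(JobConf conf) {
    if (conf.getNumReduceTasks() > 1) {
      throw new IllegalArgumentException(
          "uber-task mode supports at most one reduce");
    }
  }
}
{code}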

bq. i am still a little dubious about whether this is ambitious enough. 
particularly - why serial execution? to me local execution == exploiting 
resources available in a single box. which are 8 core today and on the way up.

Well, generally multiple cores => multiple slots, at which point you let the 
scheduler figure it out. Chris mentioned that there's a JIRA to extend the 
LocalJobRunner in this direction (MAPREDUCE-434?), and if the currently 
proposed version of _this_ one works out, an obvious next step would be to look 
at similar extensions. But this is still an untested optimization, and all the 
usual caveats about premature optimization apply.
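
If we do go there, the serial per-split loop sketched above is the natural 
thing to widen: run the map attempts on a small thread pool and hold the 
reduce until all of them finish. Again purely a sketch of mine, and explicitly 
not part of the current proposal:

{code:java}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ParallelMapsSketch {
  // Run each map attempt (wrapped as a Runnable) on a bounded pool sized to
  // the box, then block until every map is done before the reduce may start.
  static void runMaps(List<Runnable> mapAttempts, int threads)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (Runnable map : mapAttempts) {
      pool.execute(map);
    }
    pool.shutdown();
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
  }
}
{code}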

I don't know how Arun looked at the problem - I'm guessing the motivation was 
empirical, based on the behavior of Oozie workflows and Pig jobs - but I view 
it as a way to make task granularity a bit more homogeneous by packing 
multiple too-small tasks into larger containers. My mental model is that this 
will tend to make the scheduler more efficient - but also that trying to do 
too much may begin to work at cross-purposes with the scheduler, i.e., one 
probably doesn't want to create a secondary scheduler (trying to tune both 
halves could get very messy).

bq. it does seem that a dispatcher facility at the lowest layer that can juggle 
between different hadoop clusters is useful generically (across 
Hive/Pig/native-Hadoop etc.) - but that's not quite the same as what's proposed 
here - is it?

Nope. But I think other groups are looking at stuff like that.

> Implement an in-cluster LocalJobRunner
> --------------------------------------
>
>                 Key: MAPREDUCE-1220
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1220
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: client, jobtracker
>            Reporter: Arun C Murthy
>            Assignee: Greg Roelofs
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1220_yhadoop20.patch, 
> MR-1220.v2.trunk-hadoop-mapreduce.patch.txt
>
>
> Currently very small map-reduce jobs suffer from latency issues due to 
> overheads in Hadoop Map-Reduce such as scheduling, jvm startup, etc. We've 
> periodically tried to optimize all parts of the framework to achieve lower 
> latencies.
>
> I'd like to turn the problem around a little bit. I propose we allow very 
> small jobs to run as a single-task job with multiple maps and reduces, i.e. 
> similar to our current implementation of the LocalJobRunner. Thus, under 
> certain conditions (maybe user-set configuration, or if the input data is 
> small, i.e. less than a DFS blocksize) we could launch a special task which 
> will run all the maps in a serial manner, followed by the reduces. This 
> would really help small jobs achieve significantly smaller latencies, 
> thanks to less scheduling overhead, less jvm startup cost, the lack of a 
> shuffle over the network, etc. 
> This would be a huge benefit, especially on large clusters, to small Hive/Pig 
> queries.
> Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.