William Watson created PIG-5071:
-----------------------------------
Summary: MapReduce concurrency Could Be Better
Key: PIG-5071
URL: https://issues.apache.org/jira/browse/PIG-5071
Project: Pig
Issue Type: Wish
Reporter: William Watson
We have a job that launches, after optimization, about 20 MapReduce jobs. Some
of these are quite long-running, and while Pig does an okay job of running jobs
concurrently, it could do better, at least in this very specific case.
The Pig job can be divided into 4 major sections like so:
A1 -> A2 -> A3 -> A4 -> A
B1 -> B2 -> B
C1 -> C2 -> C3 -> C
D1 -> D2 -> D3 -> D4 -> D
and the sections are joined at the end:
A + B -> AB
AB + C -> ABC
ABC + D -> ABCD
In short, if C2 finishes very quickly, C3 won't be started until A2, B2, and D2
are all also complete. This is a problem if, say, D2 takes an hour and there are
unused cluster resources that could be made available to C3 (and by extension
A3 and B, if their prerequisites also finish before D2).
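To make the cost concrete, here is a rough model of the two behaviours. It is only a sketch: the job graph is the one drawn above, but the durations (5 minutes per job, 60 for D2), the function names, and the assumption of unlimited cluster capacity are all made up for illustration, and the "phased" function models the behaviour as I observe it rather than Pig's actual code.

```python
# Job graph from the description above: job -> list of prerequisites.
deps = {
    "A1": [], "A2": ["A1"], "A3": ["A2"], "A4": ["A3"], "A": ["A4"],
    "B1": [], "B2": ["B1"], "B": ["B2"],
    "C1": [], "C2": ["C1"], "C3": ["C2"], "C": ["C3"],
    "D1": [], "D2": ["D1"], "D3": ["D2"], "D4": ["D3"], "D": ["D4"],
    "AB": ["A", "B"], "ABC": ["AB", "C"], "ABCD": ["ABC", "D"],
}

# Hypothetical durations in minutes: every job takes 5, except the slow D2.
duration = {job: 5 for job in deps}
duration["D2"] = 60


def phased_start_times(deps, duration):
    """Run jobs in waves: each wave holds every job whose prerequisites are
    done, and the next wave starts only after the whole wave has finished."""
    start, done, clock = {}, set(), 0
    while len(done) < len(deps):
        wave = [j for j in deps
                if j not in done and all(p in done for p in deps[j])]
        for j in wave:
            start[j] = clock
        clock += max(duration[j] for j in wave)  # wait for the slowest job
        done.update(wave)
    return start


def eager_start_times(deps, duration):
    """Start each job the moment all of its own prerequisites have finished."""
    start, finish = {}, {}

    def finish_time(job):
        if job not in finish:
            start[job] = max((finish_time(p) for p in deps[job]), default=0)
            finish[job] = start[job] + duration[job]
        return finish[job]

    for job in deps:
        finish_time(job)
    return start


phased = phased_start_times(deps, duration)
eager = eager_start_times(deps, duration)
for job in ("C3", "A3", "ABCD"):
    print(f"{job}: starts at minute {phased[job]} phased, {eager[job]} eager")
```

With these made-up numbers, C3 and A3 start at minute 10 instead of minute 65, and the final join starts at minute 80 instead of minute 90, because the A, B, and C sections are already finished by the time D is.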
One possible workaround is to scale D2 better, but that's beside the point. I
think Pig is capable of knowing that the prerequisites are done for certain
jobs, but since it only kicks off jobs in "phases", it won't kick off a job as
soon as its prerequisites are complete.
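For reference, the launch behaviour I am wishing for looks roughly like the sketch below. It is not Pig or Hadoop code: `run_eagerly`, `fake_job`, and the thread pool are hypothetical stand-ins, used only to show a loop that wakes up when any single job finishes rather than when the whole batch does.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_eagerly(deps, run_job, max_workers=4):
    """Submit each job as soon as all of its prerequisites have completed.

    deps maps a job name to the list of jobs it depends on; run_job is a
    callable that executes one job by name. Returns jobs in completion order.
    """
    done, running, order = set(), {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while len(done) < len(deps):
            # Launch every job that is ready and has not been started yet.
            for job, prereqs in deps.items():
                ready = all(p in done for p in prereqs)
                if ready and job not in done and job not in running.values():
                    running[pool.submit(run_job, job)] = job
            if not running:
                raise ValueError("dependency cycle or unknown prerequisite")
            # Wake up when ANY job finishes, not when the whole batch does.
            finished, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in finished:
                future.result()  # re-raise the job's exception, if it failed
                job = running.pop(future)
                done.add(job)
                order.append(job)
    return order


# Hypothetical stand-in for a MapReduce job: D2 is slow, the rest are quick.
def fake_job(name):
    time.sleep(0.3 if name == "D2" else 0.01)


deps = {
    "C1": [], "C2": ["C1"], "C3": ["C2"],
    "D1": [], "D2": ["D1"], "D3": ["D2"],
}
order = run_eagerly(deps, fake_job)
print(order)
```

In this toy run C3 is launched as soon as C2 completes, so it finishes while D2 is still running. A real change would also have to respect cluster capacity and job failure handling, which the sketch ignores.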
I've taken a look at the code, but I'm having a hard time working out where the
issue is; otherwise I would be glad to contribute a patch.
Is this a desirable feature, and is this directly controlled by Pig? If so,
could someone help point me in the right direction so I can contribute a patch?
Note: We can change this from a "wish" to an "improvement" if this feature is
desired...
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)