[
https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632365#action_12632365
]
Shravan Matthur Narayanamurthy commented on PIG-364:
----------------------------------------------------
Consider the following script:
{noformat}
a = load 'file:/etc/passwd';
b = limit a 10;
c = filter b by 2>1 parallel 10;
split c into c1 if 2>1, c2 if 2>1;
d = group c1 by $0;
e = group c2 by $0;
f = group d by $0, e by $0;
dump f;
{noformat}
This is a case where multiple MROps are generated at the split, as shown in the
figure below, if I understand the code correctly.
!https://issues.apache.org/jira/secure/attachment/12390410/limitsplit.png!
Now when the job controller sees this graph of MROps, it first schedules the LD
MROp. As a reminder, the limitadjuster has by now redirected the output of this
MROp to a temporary file. At this point, both the Lim Adj Op and the free 2-LRs
Op, whose dependency has just been resolved, are eligible to be scheduled. If
the controller chooses to execute the 2-LRs Op, that op tries to read the
original output of the split, which does not exist yet because the Lim Adj Op
hasn't run, and it fails. If the controller chooses the Lim Adj Op instead,
things work fine.
To avoid this, we need to disconnect all the successors of the LD MROp, make
the Lim Adj Op their predecessor, and connect the Lim Adj Op to LD, as
indicated in the figure.
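The rewiring can be sketched on a toy dependency graph. This is only an illustration in Python, not Pig's actual plan code: the three-node graph, the op names ("LD", "LimAdj", "2LRs") and the helpers `ready_ops` and `insert_between` are all assumptions made up for the example.

```python
def ready_ops(preds, done):
    """Return the ops not yet run whose predecessors have all completed."""
    return sorted(op for op, ps in preds.items()
                  if op not in done and ps <= done)


def insert_between(preds, producer, adjuster):
    """Re-point every successor of `producer` at `adjuster` instead,
    then make `producer` the only predecessor of `adjuster`."""
    rewired = {}
    for op, ps in preds.items():
        if op != adjuster and producer in ps:
            # Disconnect the successor from the producer and hang it
            # off the adjuster.
            rewired[op] = (ps - {producer}) | {adjuster}
        else:
            rewired[op] = set(ps)
    rewired[adjuster] = {producer}
    return rewired


# Hypothetical, simplified version of the problematic plan: both the
# limit adjuster and the 2-LRs op depend directly on LD.
buggy = {"LD": set(), "LimAdj": {"LD"}, "2LRs": {"LD"}}

# Once LD has finished, the controller may pick either op.
race = ready_ops(buggy, {"LD"})

# After rewiring, only the limit adjuster is eligible; the 2-LRs op
# becomes eligible once the adjuster has run.
fixed = insert_between(buggy, "LD", "LimAdj")
after_ld = ready_ops(fixed, {"LD"})
after_adj = ready_ops(fixed, {"LD", "LimAdj"})
```

In the rewired graph the scheduler has no choice left to get wrong: the 2-LRs op cannot start until the Lim Adj Op has produced the output it reads.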
Let me know if my understanding is wrong.
> Limit return incorrect records when we use multiple reducer
> -----------------------------------------------------------
>
> Key: PIG-364
> URL: https://issues.apache.org/jira/browse/PIG-364
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: types_branch
> Reporter: Daniel Dai
> Assignee: Daniel Dai
> Fix For: types_branch
>
> Attachments: limitsplit.png, PIG-364-2.patch, PIG-364.patch
>
>
> Currently we put the Limit(k) operator in the reducer plan. However, in the
> case of n reducers, we will get up to n*k output records.
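
The n*k behaviour described above can be reproduced with a small simulation. This is a sketch in plain Python, not Pig code: each inner list stands in for the rows one reducer receives, and both function names are hypothetical. The second function assumes the fix is an extra pass through a single reducer, which matches the role the Lim Adj Op plays in the comment but is not taken from the patch itself.

```python
def limit_per_reducer(partitions, k):
    """Apply Limit(k) independently inside each reducer and concatenate
    the results, as happens when the limit lives in the reducer plan."""
    return [row for part in partitions for row in part[:k]]


def limit_with_single_reducer(partitions, k):
    """Assumed fix: feed the per-reducer output through one more
    single-reducer limit, so at most k rows survive overall."""
    return limit_per_reducer(partitions, k)[:k]


# n = 10 reducers with 20 rows each, and k = 10, mirroring the
# 'limit a 10' and 'parallel 10' values in the script.
partitions = [list(range(i * 100, i * 100 + 20)) for i in range(10)]

too_many = limit_per_reducer(partitions, 10)
trimmed = limit_with_single_reducer(partitions, 10)
```

With these inputs the per-reducer limit lets 100 rows through, while the extra single-reducer pass brings the result back down to 10.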