[
https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015665#comment-13015665
]
Thejas M Nair commented on PIG-1963:
------------------------------------
Note that the issue is seen only when the bag used by the nested order-by
statement holds more than 20000 tuples, or more than the value of the
pig.accumulative.batchsize property if it is set.
This happens because in accumulative mode the nested relational operator is
passed only a portion of the bag at a time. That works fine for operations such
as filter or limit, but not for order-by, which is a blocking operation and
needs the whole bag. If secondary sort is used for the ordering, there is no
POSort in the plan, so that case works fine.
This issue might exist in case of nested distinct as well, because it is also
supposed to be a blocking operation.
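The effect of handing a sort only one batch at a time can be sketched with a
small simulation. This is a simplified model and not Pig's actual code: the
function names, the sample data and the batch size of 2 are all assumptions
made for illustration.

```python
from typing import Iterable, List


def batches(bag: List[str], batch_size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of the bag (stand-in for accumulative batching)."""
    for start in range(0, len(bag), batch_size):
        yield bag[start:start + batch_size]


def concat_accumulated(bag: List[str], batch_size: int) -> str:
    """Sort each batch on its own, then accumulate the results in arrival order."""
    parts: List[str] = []
    for batch in batches(bag, batch_size):
        parts.extend(sorted(batch))  # the sort only ever sees one batch
    return ",".join(parts)


def concat_blocking(bag: List[str]) -> str:
    """Sort the whole bag before concatenating, as a blocking order-by needs."""
    return ",".join(sorted(bag))


fruits = ["pear", "apple", "fig", "banana"]  # hypothetical bag contents
print(concat_accumulated(fruits, 2))  # apple,pear,banana,fig
print(concat_blocking(fruits))        # apple,banana,fig,pear
```

When the batch size is at least the size of the bag, both functions return the
same string, which matches the observation that the problem only shows up once
the bag exceeds the batch size.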
Another query that demonstrates this issue (with the property
pig.accumulative.batchsize=2 set):
{code}
a1 = load 'fruits.txt' as (cid:int, fruit:chararray);
b = group a1 by cid;
d = foreach b {
    sort1 = order a1 by fruit;
    sort2 = order a1 by fruit desc;
    -- The second instance of the udf does not get sorted results
    generate group as cid, MYCONCATBAG(sort1.fruit), MYCONCATBAG(sort2.fruit);
}
explain d;
dump d;
{code}
To fix this, if such blocking relational operators exist in the plan after
secondary-sort optimization, accumulative mode should be disabled by the
optimizer.
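A minimal sketch of the proposed check, assuming the nested plan can be viewed
as a list of operator names. Only POSort is named in this comment; PODistinct
is an assumption standing in for nested distinct. The function name and the
list representation are hypothetical and do not reflect the optimizer's real
API.

```python
from typing import Iterable

# Operators that must see the whole bag before emitting output.
# POSort comes from the comment above; PODistinct is an assumption.
BLOCKING_OPERATORS = frozenset({"POSort", "PODistinct"})


def can_use_accumulative_mode(nested_plan_ops: Iterable[str]) -> bool:
    """Return False if any blocking operator remains in the nested plan.

    The plan is expected to be the one left after secondary-sort
    optimization, so an order-by handled by secondary sort no longer
    contributes a POSort entry.
    """
    return not any(op in BLOCKING_OPERATORS for op in nested_plan_ops)


# Hypothetical nested plans, for illustration only.
print(can_use_accumulative_mode(["POFilter", "POLimit"]))  # True
print(can_use_accumulative_mode(["POSort"]))               # False
```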
> in nested foreach, accumulative udf taking input from order-by does not get
> results in order
> ------------------------------------------------------------------------------------------
>
> Key: PIG-1963
> URL: https://issues.apache.org/jira/browse/PIG-1963
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.8.0, 0.9.0
> Reporter: Thejas M Nair
> Attachments: MYCONCATBAG.java
>
>
> This happens only when secondary sort is not being used for the order-by.
> For example -
> {code}
> a1 = load 'fruits.txt' as (f1:int,f2);
> a2 = load 'fruits.txt' as (f1:int,f2);
> b = cogroup a1 by f1, a2 by f1;
> d = foreach b {
>     sort1 = order a1 by f2;
>     -- secondary sort not getting used here, MYCONCATBAG gets results in wrong order
>     sort2 = order a2 by f2;
>     generate group, MYCONCATBAG(sort1.f1), MYCONCATBAG(sort2.f2);
> }
> -- explain d;
> dump d;
> {code}
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira