[
https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817385#comment-13817385
]
Koji Noguchi commented on PIG-3347:
-----------------------------------
I wish there was a wiki/document describing how UID should be assigned. Even
after going through PIG-3492, I'm still lost on when exactly we should assign
new UIDs.
My current understanding (or guess) is that a UID represents uniqueness within a
record.
Just by looking at the UIDs from two separate relations (bags), we can tell
whether the fields were altered. (We cannot, however, tell whether tuples were
filtered.)
FilterAboveForeach (PushUpFilter) uses this property to determine whether a
FILTER can be moved before the foreach. The bug here is that nested distinct
does not assign a new UID to the bag it creates, so FilterAboveForeach
mistakenly concludes that no fields were altered within the foreach and moves
the filter in front of it.
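To make the reasoning concrete, here is a toy Java sketch of the check as I understand it. This is not Pig's code: `UidPushdownSketch`, `Field` and `canPushAboveForeach` are names made up for illustration, and the real FilterAboveForeach rule is more involved. It only assumes that the rule treats a filter as movable when the UID it refers to already exists in the input of the foreach, and it plugs in the UIDs from the schema dumps in this comment.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: hypothetical names, not classes from the Pig code base.
class UidPushdownSketch {

    // Stand-in for a schema field: an alias plus the UID the optimizer tracks.
    public static final class Field {
        public final String alias;
        public final long uid;

        public Field(String alias, long uid) {
            this.alias = alias;
            this.uid = uid;
        }
    }

    // Assumed rule: the filter may move above the foreach only if the field it
    // reads has a UID that is already present in the foreach input.
    public static boolean canPushAboveForeach(Field filterField, List<Field> foreachInput) {
        for (Field f : foreachInput) {
            if (f.uid == filterField.uid) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Without the patch: a_distinct reuses UID 12 of bag 'a'.
        List<Field> before = Arrays.asList(new Field("group", 11), new Field("a", 12));
        System.out.println(canPushAboveForeach(new Field("a_distinct", 12), before)); // true

        // With the patch: a_distinct gets the fresh UID 20, 'a' has UID 16.
        List<Field> after = Arrays.asList(new Field("group", 15), new Field("a", 16));
        System.out.println(canPushAboveForeach(new Field("a_distinct", 20), after)); // false
    }
}
```

In the first case the shared UID makes the filter look safe to move even though the bag contents changed; in the second the fresh UID blocks the move.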
The following shows the schemas BEFORE calling PushUpFilter/FilterAboveForeach,
without and with Daniel's patch. After applying the patch, relations 'c' and
'a_group' contain different UIDs for the bag.
{noformat}
(without the patch)
|---c: (Name: LOFilter Schema:
group#11:bytearray,a_distinct#12:bag{#13:tuple(#14:bytearray)})
|---b: (Name: LOForEach Schema:
group#11:bytearray,a_distinct#12:bag{#13:tuple(#14:bytearray)})
|---a_group: (Name: LOCogroup Schema:
group#11:bytearray,a#12:bag{#13:tuple()})
(with the patch)
|---c: (Name: LOFilter Schema:
group#15:bytearray,a_distinct#20:bag{#19:tuple(#18:bytearray)})
|---b: (Name: LOForEach Schema:
group#15:bytearray,a_distinct#20:bag{#19:tuple(#18:bytearray)})
|---a_group: (Name: LOCogroup Schema:
group#15:bytearray,a#16:bag{#17:tuple()})
{noformat}
So I think the patch fixes the bug described in the jira nicely. However, a
question remains for other nested operations. I believe the same bug can appear
for nested LIMIT and nested FILTER. For example,
{noformat}
a = load 'test.txt';
a_group = group a by $0;
b = foreach a_group {
    a_limit = limit a.$0 5;
    generate group, a_limit;
}
c = filter b by SIZE(a_limit) == 5;
store c into 'out';
{noformat}
{noformat}
a = load 'test3.txt' as (a0, a1);
a_group = group a by a0;
b = foreach a_group {
    newA = filter a by a1 == 2;
    generate group, newA;
}
c = filter b by SIZE(newA) == 5;
store c into 'out';
{noformat}
I confirmed that these two examples also mistakenly push the filter before the
foreach and produce empty results. The former case, nested LIMIT, is actually
covered by the current patch, since nested LIMIT uses LOLimit+LOForEach. So the
patch
{noformat}
+ // If it is nested foreach or nested distinct, generate new uid
+ if (op instanceof LOForEach || op instanceof LODistinct) {
+     needNewUid = true;
+ }
{noformat}
takes care of nested limit, although the comment doesn't mention it. Nested
filter is not covered, and the bug still exists after the current patch. Can we
cover this case as well?
> Store invocation brings side effect
> -----------------------------------
>
> Key: PIG-3347
> URL: https://issues.apache.org/jira/browse/PIG-3347
> Project: Pig
> Issue Type: Bug
> Components: grunt
> Affects Versions: 0.11
> Environment: local mode
> Reporter: Sergey
> Assignee: Daniel Dai
> Fix For: 0.12.1
>
> Attachments: PIG-3347-1.patch
>
>
> The problem is that an intermediate 'store' invocation "changes" the final
> store output. It looks like it brings some kind of side effect. We used
> 'local' mode to run the script.
> Here is the input data:
> 1
> 1
> Here is the script:
> {code}
> a = load 'test';
> a_group = group a by $0;
> b = foreach a_group {
> a_distinct = distinct a.$0;
> generate group, a_distinct;
> }
> --store b into 'b';
> c = filter b by SIZE(a_distinct) == 1;
> store c into 'out';
> {code}
> We expect output to be:
> 1 1
> The output is empty file.
> Uncomment the {code}--store b into 'b';{code} line and see the difference.
> You would get the expected output.