[
https://issues.apache.org/jira/browse/PIG-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204121#comment-15204121
]
Xianda Ke commented on PIG-4842:
--------------------------------
root cause:
Currently, when a predecessor(PhyscialOperator) of POCollectedGroup reaches at
the end of its input, the flag endOfAllInput of plan may be set as true. But
the input of POCollectedGroup has NOT reach at the end. The last record of its
input may be still cached, input.hasNext() == true now.
Take above scenario for example, when B return a record (1,{(1),(1)}) to C,
A(input of B) reaches at the end, so B set flag endOfAllInput as true, and
cache a record (2,{(2)}). When C handling the record (1,{(1),(1)}), C finds
that endOfAllInput==true, in fact, B(intput of C) still has a record. The
function getNextTuple() of C won't work.
According to the semantic of flag endOfAllInput, it seems that this flag should
be placed in a PhyscialOperator, not in it's parentPlan.
> Collected group doesn't work in some cases
> ------------------------------------------
>
> Key: PIG-4842
> URL: https://issues.apache.org/jira/browse/PIG-4842
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: Xianda Ke
> Assignee: Xianda Ke
> Fix For: spark-branch
>
>
> Scenario:
> 1. input data:
> cat collectedgroup1
> {code}
> 1
> 1
> 2
> {code}
> 2. pig script:
> {code}
> A = LOAD 'collectedgroup1' USING myudfs.DummyCollectableLoader() AS (id);
> B = GROUP A by $0 USING 'collected';
> C = GROUP B by $0 USING 'collected';
> DUMP C;
> {code}
> The expected output:
> {code}
> (1,{(1,{(1),(1)})})
> (2,{(2,{(2)})})
> {code}
> The actual output:
> {code}
> (1,{(1,{(1),(1)})})
> (1,)
> (2,{(2,{(2)})})
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)