[ 
https://issues.apache.org/jira/browse/PIG-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204121#comment-15204121
 ] 

Xianda Ke commented on PIG-4842:
--------------------------------

Root cause:
Currently, when a predecessor (PhysicalOperator) of POCollectedGroup reaches 
the end of its input, the plan's endOfAllInput flag may be set to true. But 
the input of POCollectedGroup itself has NOT reached its end: the last record 
of its input may still be cached, so input.hasNext() == true at that point.

Take the scenario from the issue description as an example. When B returns the 
record (1,{(1),(1)}) to C, A (the input of B) reaches its end, so B sets 
endOfAllInput to true and caches the record (2,{(2)}). While C is handling 
the record (1,{(1),(1)}), it finds that endOfAllInput == true, although B 
(the input of C) still holds a record. As a result, getNextTuple() of C does 
not work correctly.

Given the semantics of the endOfAllInput flag, it seems that the flag should 
belong to each PhysicalOperator, not to its parentPlan. 
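
To make the timing problem concrete, here is a minimal toy model in Python. It 
is only a sketch and is NOT Pig's actual code: the classes Plan and 
CollectedGroup, the methods next_group(), flush() and has_pending(), and the 
per-operator field input_exhausted are all hypothetical names invented for 
illustration. It assumes the shared flag is set at the moment the last upstream 
row is consumed, as described above.

```python
class Plan:
    """Toy plan holding the single flag shared by all of its operators."""

    def __init__(self):
        self.end_of_all_input = False


class CollectedGroup:
    """Toy stand-in for a collected group: groups consecutive rows by key."""

    def __init__(self, rows, plan):
        self.rows = list(rows)
        self.pos = 0
        self.plan = plan
        self.key = None
        self.bag = None               # rows buffered for the current key
        self.input_exhausted = False  # hypothetical per-operator flag

    def next_group(self):
        """Emit the previous group when the key changes, else return None."""
        while self.pos < len(self.rows):
            row = self.rows[self.pos]
            self.pos += 1
            if self.pos == len(self.rows):
                # The last upstream row was just consumed (assumed timing).
                self.plan.end_of_all_input = True
                self.input_exhausted = True
            if self.bag is None:
                self.key, self.bag = row[0], [row]
            elif row[0] == self.key:
                self.bag.append(row)
            else:
                out = (self.key, self.bag)
                self.key, self.bag = row[0], [row]  # cache the new key's row
                return out
        return None

    def has_pending(self):
        return self.bag is not None

    def flush(self):
        """Emit the cached group, if any, once the input is exhausted."""
        if self.bag is None:
            return None
        out = (self.key, self.bag)
        self.key, self.bag = None, None
        return out


plan = Plan()
b = CollectedGroup([(1,), (1,), (2,)], plan)  # rows of A from the scenario

first = b.next_group()
print(first)                  # (1, [(1,), (1,)])
print(plan.end_of_all_input)  # True: the shared flag already says "done"
print(b.has_pending())        # True: but B still caches the group for key 2

# What C would need to know instead: is C's own input (B) really finished?
c_input_done = b.input_exhausted and not b.has_pending()
print(c_input_done)           # False
```

In this sketch the plan-level flag turns true while B still holds the group 
for key 2, whereas a check based on B's own state reports that C's input is 
not finished yet. It does not attempt to reproduce the exact "(1,)" row from 
the actual output.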


> Collected group doesn't work in some cases
> ------------------------------------------
>
>                 Key: PIG-4842
>                 URL: https://issues.apache.org/jira/browse/PIG-4842
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Xianda Ke
>            Assignee: Xianda Ke
>             Fix For: spark-branch
>
>
> Scenario:
> 1. input data:
> cat collectedgroup1
> {code}
> 1
> 1
> 2
> {code}
> 2. pig script:
> {code}
> A = LOAD 'collectedgroup1' USING myudfs.DummyCollectableLoader() AS (id);
> B = GROUP A by $0 USING 'collected';
> C = GROUP B by $0 USING 'collected';
> DUMP C;
> {code}
> The expected output:
> {code}
> (1,{(1,{(1),(1)})})
> (2,{(2,{(2)})})
> {code}
> The actual output:
> {code}
> (1,{(1,{(1),(1)})})
> (1,)
> (2,{(2,{(2)})})
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
