[jira] [Commented] (PIG-4857) Last record is missing in STREAM operator

Xianda Ke (JIRA) Mon, 18 Apr 2016 18:51:09 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246955#comment-15246955
 ]


Xianda Ke commented on PIG-4857:
--------------------------------

[~mohitsabharwal], I agree with you. OutputConsumeIterator should handle this. 
I will provide a more generic solution.   
Since the fix won't touch any Stream operator code, and the Stream operator 
works well with the new solution, I will close this Stream JIRA.   
I will link both this Stream JIRA and the CollectedGroup JIRA (PIG-4842) to 
the new JIRA.

> Last record is missing in STREAM operator
> -----------------------------------------
>
>                 Key: PIG-4857
>                 URL: https://issues.apache.org/jira/browse/PIG-4857
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Xianda Ke
>            Assignee: Xianda Ke
>             Fix For: spark-branch
>
>         Attachments: PIG-4857.patch
>
>
> This bug is similar to PIG-4842.
> Scenario:
> {code}
> cat input.txt
> 1
> 1
> 2
> {code}
> Pig script:
> {code}
> REGISTER myudfs.jar;
> A = LOAD 'input.txt' USING myudfs.DummyCollectableLoader() AS (id); 
> B = GROUP A by $0 USING 'collected';    -- (1, {(1),(1)}), (2,{(2)})
> C = STREAM B THROUGH ` awk '{
>      print $0;
> }'`;
> DUMP C;
> {code}
> Expected Result:
> {code}
> (1,{(1),(1)})
> (2,{(2)})
> {code}
> Actual Result:
> {code}
> (1,{(1),(1)})
> {code}
> The last record is missing...
> Root Cause:
> When the predecessor sets the endOfAllInput flag to true, it is still 
> buffering the last record, which is the input of Stream. POStream then sees 
> that endOfAllInput is true and finishes, even though the last input has not 
> been consumed yet.
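
The root cause described above can be illustrated with a small standalone 
sketch. This is not Pig code: the class EndOfInputSketch, the 
BufferingPredecessor class, and the consume method are hypothetical names 
made up for illustration, and only the endOfAllInput flag name comes from the 
issue text. The sketch assumes a predecessor that emits a group only when the 
key changes, so the last group is still buffered at the moment the flag is 
set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class EndOfInputSketch {

    // Hypothetical stand-in for a buffering predecessor (for example a
    // collected group): a group is emitted only once the next key arrives.
    static class BufferingPredecessor {
        private final Iterator<Integer> input;
        private Integer currentKey = null;
        private List<Integer> buffer = new ArrayList<>();
        boolean endOfAllInput = false;

        BufferingPredecessor(List<Integer> records) {
            this.input = records.iterator();
        }

        // Returns a finished group, or null if no group is ready yet.
        String next() {
            if (!input.hasNext()) {
                endOfAllInput = true;
                return flush(); // hand over whatever is still buffered
            }
            int key = input.next();
            String finished = null;
            if (currentKey != null && currentKey != key) {
                finished = flush();
            }
            currentKey = key;
            buffer.add(key);
            if (!input.hasNext()) {
                // The flag is raised while the last group is still buffered.
                endOfAllInput = true;
            }
            return finished;
        }

        private String flush() {
            if (buffer.isEmpty()) {
                return null;
            }
            String group = "(" + currentKey + "," + buffer + ")";
            buffer = new ArrayList<>();
            return group;
        }
    }

    // Hypothetical consumer. With drainAfterEnd == false it stops as soon as
    // the flag is true, which loses the buffered group. With true it asks the
    // predecessor once more after the flag is set.
    static List<String> consume(List<Integer> records, boolean drainAfterEnd) {
        BufferingPredecessor predecessor = new BufferingPredecessor(records);
        List<String> output = new ArrayList<>();
        while (!predecessor.endOfAllInput) {
            String group = predecessor.next();
            if (group != null) {
                output.add(group);
            }
        }
        if (drainAfterEnd) {
            String last = predecessor.next();
            if (last != null) {
                output.add(last);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Integer> records = Arrays.asList(1, 1, 2);
        System.out.println("stop at flag:     " + consume(records, false));
        System.out.println("drain after flag: " + consume(records, true));
    }
}
```

With the input 1, 1, 2 from the scenario, the consumer that stops at the flag 
returns one group and the draining consumer returns two. The generic fix 
mentioned in the comment may differ from this sketch.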



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
