[ https://issues.apache.org/jira/browse/PIG-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246955#comment-15246955 ]
Xianda Ke commented on PIG-4857:
--------------------------------
[~mohitsabharwal], I agree with you. OutputConsumeIterator should handle this,
so I will provide a more generic solution.
Since the fix won't touch any code in the Stream operator, and the Stream
operator works well with the new solution, I will close this Stream JIRA.
I will link both this Stream JIRA and the CollectedGroup JIRA (PIG-4842) to
the new JIRA.
> Last record is missing in STREAM operator
> -----------------------------------------
>
> Key: PIG-4857
> URL: https://issues.apache.org/jira/browse/PIG-4857
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: Xianda Ke
> Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4857.patch
>
>
> This bug is similar to PIG-4842.
> Scenario:
> {code}
> cat input.txt
> 1
> 1
> 2
> {code}
> Pig script:
> {code}
> REGISTER myudfs.jar;
> A = LOAD 'input.txt' USING myudfs.DummyCollectableLoader() AS (id);
> B = GROUP A by $0 USING 'collected'; -- (1, {(1),(1)}), (2,{(2)})
> C = STREAM B THROUGH ` awk '{
> print $0;
> }'`;
> DUMP C;
> {code}
> Expected Result:
> {code}
> (1,{(1),(1)})
> (2,{(2)})
> {code}
> Actual Result:
> {code}
> (1,{(1),(1)})
> {code}
> The last record is missing...
> Root Cause:
> When the predecessor sets the endOfAllInput flag to true, it is still
> buffering the last record, which is the input of Stream. POStream then
> finds that endOfAllInput is true even though the last input has not been
> consumed yet.
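
The root cause described above can be modelled with a small, self-contained
Java sketch. This is only an illustration under stated assumptions: the class
and method names (EndOfInputSketch, BufferingPredecessor, consume) are
hypothetical and are not Pig APIs, and the real fix may look quite different.
It only shows why a consumer that stops as soon as endOfAllInput is true drops
the group that the predecessor is still holding, and how asking the
predecessor for its buffered output one more time recovers it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified model of the bug; none of these names are Pig APIs.
class EndOfInputSketch {

    // Stands in for a collected-group predecessor: it holds the current group
    // until it sees a different key, or until it is asked to flush.
    static class BufferingPredecessor {
        private Integer key = null;
        private final List<Integer> bag = new ArrayList<>();

        // Returns a finished group, or null while the group is still buffered.
        String push(int id) {
            String done = null;
            if (key != null && key != id) {
                done = flush();
            }
            key = id;
            bag.add(id);
            return done;
        }

        // Emits whatever is still buffered; returns null if nothing is buffered.
        String flush() {
            if (key == null) {
                return null;
            }
            StringBuilder sb = new StringBuilder("(" + key + ",{");
            for (int i = 0; i < bag.size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append('(').append(bag.get(i)).append(')');
            }
            sb.append("})");
            key = null;
            bag.clear();
            return sb.toString();
        }
    }

    // Consumes the input through the predecessor. With drainAtEnd == false the
    // consumer stops as soon as endOfAllInput is true (the buggy behaviour);
    // with drainAtEnd == true it flushes the predecessor once more.
    static List<String> consume(List<Integer> input, boolean drainAtEnd) {
        BufferingPredecessor pred = new BufferingPredecessor();
        List<String> out = new ArrayList<>();
        Iterator<Integer> it = input.iterator();
        boolean endOfAllInput = false;
        while (!endOfAllInput) {
            if (it.hasNext()) {
                String group = pred.push(it.next());
                if (group != null) {
                    out.add(group);
                }
            }
            // The flag flips while the last group is still inside pred.
            endOfAllInput = !it.hasNext();
        }
        if (drainAtEnd) {
            String last = pred.flush();
            if (last != null) {
                out.add(last);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 1, 2);
        System.out.println(consume(input, false)); // [(1,{(1),(1)})]
        System.out.println(consume(input, true));  // [(1,{(1),(1)}), (2,{(2)})]
    }
}
```

With the input from the scenario (1, 1, 2), the non-draining call reproduces
the "Actual Result" above and the draining call reproduces the "Expected
Result".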
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)