[
https://issues.apache.org/jira/browse/PIG-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218544#comment-15218544
]
Mohit Sabharwal commented on PIG-4857:
--------------------------------------
[~kexianda], since the StreamOperator uses OutputConsumerIterator, isn't this
just a matter of correctly implementing the endOfInput() method in the
OutputConsumerIterator object
(https://github.com/apache/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/StreamConverter.java#L106)
?
IOW, endOfInput() is supposed to be implemented to flush the last record.
> Last record is missing in STREAM operator
> -----------------------------------------
>
> Key: PIG-4857
> URL: https://issues.apache.org/jira/browse/PIG-4857
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: Xianda Ke
> Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4857.patch
>
>
> This bug is similar to PIG-4842.
> Scenario:
> {code}
> cat input.txt
> 1
> 1
> 2
> {code}
> Pig script:
> {code}
> REGISTER myudfs.jar;
> A = LOAD 'input.txt' USING myudfs.DummyCollectableLoader() AS (id);
> B = GROUP A by $0 USING 'collected'; -- (1, {(1),(1)}), (2,{(2)})
> C = STREAM B THROUGH ` awk '{
> print $0;
> }'`;
> DUMP C;
> {code}
> Expected Result:
> {code}
> (1,{(1),(1)})
> (2,{(2)})
> {code}
> Actual Result:
> {code}
> (1,{(1),(1)})
> {code}
> The last record is missing...
> Root Cause:
> When the flag endOfAllInput was set as true by the predecessor, the
> predecessor buffers the last record which is the input of Stream. Then
> POStream find endOfAllInput is true, in fact, the last input is not consumed
> yet.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)