[ 
https://issues.apache.org/jira/browse/PIG-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218544#comment-15218544
 ] 

Mohit Sabharwal commented on PIG-4857:
--------------------------------------

[~kexianda], since the StreamOperator uses OutputConsumerIterator, isn't this 
just a matter of correctly implementing the endOfInput() method in the 
OutputConsumerIterator object  
(https://github.com/apache/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/StreamConverter.java#L106)
 ?

IOW, endOfInput() is supposed to be implemented to flush the last record.

> Last record is missing in STREAM operator
> -----------------------------------------
>
>                 Key: PIG-4857
>                 URL: https://issues.apache.org/jira/browse/PIG-4857
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Xianda Ke
>            Assignee: Xianda Ke
>             Fix For: spark-branch
>
>         Attachments: PIG-4857.patch
>
>
> This bug is similar to PIG-4842.
> Scenario:
> {code}
> cat input.txt
> 1
> 1
> 2
> {code}
> Pig script:
> {code}
> REGISTER myudfs.jar;
> A = LOAD 'input.txt' USING myudfs.DummyCollectableLoader() AS (id); 
> B = GROUP A by $0 USING 'collected';    -- (1, {(1),(1)}), (2,{(2)})
> C = STREAM B THROUGH ` awk '{
>      print $0;
> }'`;
> DUMP C;
> {code}
> Expected Result:
> {code}
> (1,{(1),(1)})
> (2,{(2)})
> {code}
> Actual Result:
> {code}
> (1,{(1),(1)})
> {code}
> The last record is missing...
> Root Cause:
> When the flag endOfAllInput was set as true by the predecessor,  the 
> predecessor buffers the last record which is the input of Stream.   Then 
> POStream find endOfAllInput is true, in fact, the last input is not consumed 
> yet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to