[ 
https://issues.apache.org/jira/browse/PIG-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535769#comment-14535769
 ] 

Mohit Sabharwal commented on PIG-4542:
--------------------------------------

FYI [~kellyzly]. [~praveenr019], [~xuefuz], this patch:

- Gets rid of the use of RDD.count() in CollectedGroupConverter and 
StreamConverter.
- Adds an abstract method endInput() which is executed before we call 
getNextTuple() for the last time.
- Deletes POStreamSpark since it was just handling the last record.
- Removes special code in POCollectedGroup to handle the last record.
- While I was here, renamed POOutputConsumerIterator to OutputConsumerIterator. 
The "PO" prefix should only be used for physical operators.

> OutputConsumerIterator should flush buffered records
> ----------------------------------------------------
>
>                 Key: PIG-4542
>                 URL: https://issues.apache.org/jira/browse/PIG-4542
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>    Affects Versions: spark-branch
>            Reporter: Mohit Sabharwal
>            Assignee: Mohit Sabharwal
>             Fix For: spark-branch
>
>         Attachments: PIG-4542.patch
>
>
> Certain operators may buffer the output. We need to flush the last set of 
> records from such operators, when we encounter the last input record, before 
> calling getNextTuple() for the last time.
> Currently, to flush the last set of records, we compute RDD.count() and 
> compare the count with a running counter to determine if we have reached the 
> last record. This is an unnecessary and inefficient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to