[ 
https://issues.apache.org/jira/browse/PIG-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206458#comment-15206458
 ] 

Xianda Ke commented on PIG-4842:
--------------------------------

Hi [~kellyzly]
Take above scenario as example:
MR mode: (Sequence Diagram)
{code}
  Mapper                       C                         B
    |                          |                         |
    | 1. runPipeline (1)       |   attach (1)            |
    |--------------------------------------------------->| (1,{(1)})
    |                          |                         |
    |       NULL EOP           | 2. NULL EOP             |
    |<-------------------------|<----------------------- |
    |                          |                         |
    | 3. runPipeline (1)       |   attach (1)            |
    |--------------------------------------------------->| (1,{(1),(1)})
    |                          |                         |
    |       NULL EOP           | 4. NULL EOP             |
    |<-------------------------|<----------------------- |
    |                          |                         |
    | 5. runPipeline (2)       |   attach (2)            |
    |--------------------------------------------------->| (2, {(2)})
    |                          |                         |
    |                          | 6. (1,{(1),(1)})        |
    |                          |<----------------------- |
    |     7.  NULL EOP         |                         |
    |<-------------------------|(1,{(1,{(1),(1)})})      |
    |                          |                         |
    | 8.endOfInput runPipeline |                         |
    |------------------------->|------------------------>|
    |                          |                         | 
    |                          | 9. (2, {(2)})           |
    |                          |<----------------------- |
    |                          |                         |
    | 10. (1,{(1,{(1),(1)})})  |                         |
    |<------------------------ |(2,{(2,{(2)})})          |
    |                          |                         |
    | 11. (2,{(2,{(2)})})      |                         |
    |<------------------------ |                         |
    |                          |                         |

in MR mode, endOfInput is set after step 6 & 7, (1,{(1),(1)}) is consumed by C, 
and (1,{(1,{(1),(1)})}) is buffered in C. It works. On the other hand, in spark 
mode, when the last record (2) attached to B, endOfInput is set as true, before 
(1,{(1),(1)}) is consumed by C. That's why it fails in spark mode.

{code}

> Collected group doesn't work in some cases
> ------------------------------------------
>
>                 Key: PIG-4842
>                 URL: https://issues.apache.org/jira/browse/PIG-4842
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Xianda Ke
>            Assignee: Xianda Ke
>             Fix For: spark-branch
>
>         Attachments: PIG-4842.patch
>
>
> Scenario:
> 1. input data:
> cat collectedgroup1
> {code}
> 1
> 1
> 2
> {code}
> 2. pig script:
> {code}
> A = LOAD 'collectedgroup1' USING myudfs.DummyCollectableLoader() AS (id);
> B = GROUP A by $0 USING 'collected';
> C = GROUP B by $0 USING 'collected';
> DUMP C;
> {code}
> The expected output:
> {code}
> (1,{(1,{(1),(1)})})
> (2,{(2,{(2)})})
> {code}
> The actual output:
> {code}
> (1,{(1,{(1),(1)})})
> (1,)
> (2,{(2,{(2)})})
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to