[ 
https://issues.apache.org/jira/browse/PIG-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061480#comment-16061480
 ] 

Rohini Palaniswamy edited comment on PIG-4548 at 6/23/17 8:46 PM:
------------------------------------------------------------------

+1. Patch looks good.  

I noticed an inefficiency: we attach the input to just one child plan based on 
the record index, but instead of processing only that plan, we iterate through 
all the child plans and process each of them. I am not that familiar with 
PODemux, so I am not suggesting we optimize it now; it might break something, 
such as the case of nested PODemux. 
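The pattern can be sketched in Python (illustrative only; the real code is Pig's Java PODemux operator, and {{Plan}}, {{demux_all}}, and {{demux_one}} are hypothetical names):

```python
# Hypothetical sketch of the pattern described above -- not Pig's actual code.

class Plan:
    """Stand-in for a PODemux child plan."""
    def __init__(self):
        self.input = None
        self.process_calls = 0

    def attach(self, record):
        self.input = record

    def process(self):
        # Count how often this plan is driven, with or without input.
        self.process_calls += 1
        record, self.input = self.input, None
        return record

def demux_all(plans, index, record):
    """Current behavior: attach to one plan, but drive every plan."""
    plans[index].attach(record)
    return [r for p in plans for r in [p.process()] if r is not None]

def demux_one(plans, index, record):
    """Possible optimization: drive only the plan that got the input."""
    plans[index].attach(record)
    return [plans[index].process()]
```

In this sketch both functions return the same records, but demux_all drives every child plan for each record while demux_one drives only the one that received the input.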

bq. I was able to reproduce the behavior on trunk but only for map-reduce mode 
and not tez
 PODemux is only used for multi-query, e.g. combining multiple group-bys on the 
same input into one Hadoop job, which is the case mentioned in this jira. We do 
not use the PODemux operator at all in Tez. Tez supports multiple outputs and 
inputs, so multi-query processing is very different there. In this case, Pig 
will create three vertices: 
{code}
           V1 (Load) 
           /\
          /  \
         /    \
       V2     V3  
{code}
where V2 is the reducer for the T2 group-by and V3 for the F2 group-by.
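The same DAG can be modeled with a small Python sketch (the dict layout and helper are hypothetical, not Pig's actual Tez plan representation):

```python
# Illustrative model of the three-vertex Tez DAG above: V1 loads and splits
# the input, V2 and V3 are the two group-by reducers. Names are illustrative.
dag = {
    "V1": {"op": "load + split", "outputs": ["V2", "V3"]},
    "V2": {"op": "group T by f1 (T2)", "outputs": []},
    "V3": {"op": "group F by f1 (F2)", "outputs": []},
}

def reducer_vertices(dag):
    """Return the sink vertices, i.e. the group-by reducers."""
    return sorted(v for v, spec in dag.items() if not spec["outputs"])
```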

bq. I have many Pig scripts in production where I have to use inefficient 
work-arounds like storing and reloading to avoid loosing data.
  That is bad. As [~knoguchi] said, we always treat data loss as critical. 
Sorry we missed fixing this earlier. There are surely others experiencing this 
who simply have not noticed that records are lost. 



> Records Lost With Specific Combination of Commands and Streaming Function
> -------------------------------------------------------------------------
>
>                 Key: PIG-4548
>                 URL: https://issues.apache.org/jira/browse/PIG-4548
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.12.0, 0.14.0
>         Environment: Amazon EMR (Elastic Map-Reduce) AMI 3.3.1
>            Reporter: Steve T
>            Assignee: Koji Noguchi
>             Fix For: 0.18.0
>
>         Attachments: pig-4548-v1.patch
>
>
> The below is the bare minimum I was able to extract from my original
> problem in order to demonstrate the bug.  So, don't expect the following
> code to serve any practical purpose.  :)
> My input file (test_in) is two columns with a tab delimiter:
> 1   F
> 2   F
> My streaming function (sf.py) ignores the actual input and simply generates
> 2 records:
> #!/usr/bin/python
> if __name__ == '__main__':
>     print 'x'
>     print 'y'
> (But I should mention that in my original problem the input to output was
> one-to-one.  I just ignored the input here to get to the bare minimum
> effect.)
> My pig script:
> MY_INPUT = load 'test_in' as ( f1, f2);
> split MY_INPUT into T if (f2 == 'T'), F otherwise;
> T2 = group T by f1;
> store T2 into 'test_out/T2';
> F2 = group F by f1;
> store F2 into 'test_out/F2';  -- (this line is actually optional to demo
> the bug)
> F3 = stream F2 through `sf.py`;
> store F3 into 'test_out/F3';
> My expected output for test/out/F3 is two records that come directly from
> sf.py:
> x
> y
> However, I only get:
> x
> I've tried all of the following to get the expected behavior:
>    - upgraded Pig from 0.12.0 to 0.14.0
>    - local vs. distributed mode
>    - flush sys.stdout in the streaming function
>    - replace sf.py with sf.sh, which is a bash script that used "echo x;
>    echo y" to do the same thing.  In this case, the final contents of
>    test_out/F3 would vary - sometimes I would get both x and y, and sometimes
>    I would just get x.
> Aside from removing the one Pig line that I've marked optional, any other
> attempts to simplify the Pig script or input file cause the bug to not
> manifest.
> Log files can be found at 
> http://www.mail-archive.com/user@pig.apache.org/msg10195.html
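The expected behavior described in the report can be checked outside Pig with a quick sketch (run_streamer is a hypothetical helper, not Pig's streaming reader; per the report, Pig returned only "x" for this step under the plan shape above):

```python
# Run a tiny script equivalent to sf.py and collect every line it prints,
# which is what the streaming step should hand back to the Pig plan.
import subprocess
import sys

def run_streamer(code):
    """Run a small script and collect every line it prints on stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

lines = run_streamer("print('x'); print('y')")  # expected: ['x', 'y']
```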



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
