[ 
https://issues.apache.org/jira/browse/PIG-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4175:
----------------------------
    Attachment: PIG-4175-1.patch

Sure. In the meantime, I tried the script with Pig 0.14 and it produces the 
right result. However, we can do better, since CROSS is using only one reducer. 
I will use Rohini's suggestion: "One way to fix this would be to always have 
GFCross UDF as part of map task of the actual cross job and never do it as part 
of previous job's map or reduce." Patch attached.

> PIG CROSS operation follow by STORE produces non-deterministic results each 
> run
> -------------------------------------------------------------------------------
>
>                 Key: PIG-4175
>                 URL: https://issues.apache.org/jira/browse/PIG-4175
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11, 0.12.0
>         Environment: RHEL 6/64-bit
>            Reporter: Jim Huang
>         Attachments: PIG-4175-1.patch, mktestdata.py, pig_testcross_plan.png, 
> test_cross.out, test_cross.pig
>
>
> Three files are attached to help visualize this issue.
> 1. mktestdata.py - generates test data to feed the Pig script
> 2. test_cross.pig - the Pig script using CROSS and STORE
> 3. test_cross.out - the Pig console output showing the input/output record 
> delta
> To reproduce this PIG CROSS operation problem, use the supplied Python 
> script, mktestdata.py, to generate an input file that is at least 
> 13,948,228,930 bytes (> 13GB).
> The CROSS between raw_data (m records) and cross_count (1 record) should 
> yield exactly m records as the output.
> Instead, the STORE results from the CROSS operations yielded only about 1/3 
> of the input records in raw_data.
> If I joined both of the CROSS operations together, the STORE results from 
> the CROSS operations yielded about 2/3 of the input records in raw_data.
> -- data = CROSS raw_data, field04s_count, subsection1_field04s_count, 
> subsection2_field04s_count;
> We have reproduced this using both Pig 0.11 (Hadoop 1.x) and Pig 0.12 (Hadoop 
> 2.x) clusters.  
> The default HDFS block size is 128MB.  
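
For reference, the record count the description expects can be checked with a 
small sketch. This is plain Python, not Pig, and it says nothing about how Pig 
implements CROSS internally. The relation names follow the description above; 
the helper cross() and the tiny in-memory data are hypothetical stand-ins for 
the real 13GB input.

```python
from itertools import product

def cross(*relations):
    """Full Cartesian product of the relations, one flat tuple per row."""
    # product() yields one tuple of rows per combination; sum(..., ())
    # concatenates those rows into a single flat output tuple.
    return [sum(rows, ()) for rows in product(*relations)]

# Hypothetical stand-in data: m = 1000 input records, one-record counters.
raw_data = [("rec%d" % i, i) for i in range(1000)]
field04s_count = [(1000,)]
subsection1_field04s_count = [(400,)]
subsection2_field04s_count = [(600,)]

# Crossing m records with any number of one-record relations keeps m records.
data = cross(raw_data, field04s_count,
             subsection1_field04s_count, subsection2_field04s_count)
print(len(data) == len(raw_data))
```

Any run that stores fewer than m records, as in the 1/3 and 2/3 cases above, 
has therefore dropped rows rather than produced a different valid result.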



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
