Nathan Smith created PIG-4834:
---------------------------------

             Summary: Left Outer Skewed Join produces incorrect results
                 Key: PIG-4834
                 URL: https://issues.apache.org/jira/browse/PIG-4834
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.15.0
         Environment: HDP 2.3.2
                      Pig 0.15.0.2.3.2.0-2950
                      5 node cluster (2 name, 3 data)
            Reporter: Nathan Smith


While working on a Pig script that joins several datasets, I believe I found 
a bug in Left Outer Join with USING 'skewed'. To speed up some joins on what 
appeared to be skewed data, I added the 'skewed' keyword, but the skewed 
version produced a different number of output records. The dataflow is quite 
complicated, but I've isolated the jobs where the results start to differ.

Non-skewed version:

* 36 map tasks
* 5 reduce tasks
* shortest reducer: 46sec
* longest reducer: 7min, 9sec
* input records: 16,903,866
* output records: 16,891,935

{code}
out = JOIN leftrel BY prevrel::f1 LEFT OUTER, rightrel BY f1;
{code}

Skewed version:

* 36 map tasks
* 5 reduce tasks
* shortest reducer: 1min, 34sec
* longest reducer: 2min, 15sec
* input records: 16,903,866
* output records: 7,916,768

{code}
out = JOIN leftrel BY prevrel::f1 LEFT OUTER, rightrel BY f1 USING 'skewed';
{code}

The two scripts are identical except that each join in the second one adds 
{{USING 'skewed'}}. My understanding is that a skewed join should produce the 
same results as a regular join; the only difference should be a preliminary 
scan that determines the best reducer distribution scheme.
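
A minimal sketch of how the two variants could be compared side by side, 
assuming leftrel and rightrel are loaded as in the snippets above. The aliases 
out_plain, out_skewed and the *_grp / *_cnt relations are hypothetical names 
used only for illustration; they are not from the original script. Because a 
left outer join emits at least one row per left-side row, both join counts 
should be at least as large as the leftrel count.

{code}
-- Sketch only: the aliases below are illustrative, not from the original script.
out_plain  = JOIN leftrel BY prevrel::f1 LEFT OUTER, rightrel BY f1;
out_skewed = JOIN leftrel BY prevrel::f1 LEFT OUTER, rightrel BY f1 USING 'skewed';

-- Count every row in the left input and in each join output.
left_grp   = GROUP leftrel ALL;
left_cnt   = FOREACH left_grp GENERATE COUNT_STAR(leftrel);

plain_grp  = GROUP out_plain ALL;
plain_cnt  = FOREACH plain_grp GENERATE COUNT_STAR(out_plain);

skewed_grp = GROUP out_skewed ALL;
skewed_cnt = FOREACH skewed_grp GENERATE COUNT_STAR(out_skewed);

-- Both join counts are expected to be >= the left input count.
DUMP left_cnt;
DUMP plain_cnt;
DUMP skewed_cnt;
{code}

If either join count comes out below the leftrel count, that join is dropping 
left-side rows, which a left outer join should never do.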

See attached for screenshots of the counters page for both versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
