Nathan Smith created PIG-4834:
---------------------------------
Summary: Left Outer Skewed Join produces incorrect results
Key: PIG-4834
URL: https://issues.apache.org/jira/browse/PIG-4834
Project: Pig
Issue Type: Bug
Affects Versions: 0.15.0
Environment: HDP 2.3.2
Pig 0.15.0.2.3.2.0-2950
5-node cluster (2 name nodes, 3 data nodes)
Reporter: Nathan Smith
While working on a Pig script that joins several datasets, I think I found a
bug in LEFT OUTER JOIN when it is combined with USING 'skewed'. I added the
'skewed' keyword to speed up joins on what appeared to be skewed data, but the
skewed version produced a different number of output records. The dataflow is
quite complicated, but I've isolated the jobs where the results start to differ.
Non-skewed version:
* 36 map tasks
* 5 reduce tasks
* shortest reducer: 46sec
* longest reducer: 7min, 9sec
* input records: 16,903,866
* output records: 16,891,935
{code}
out = JOIN leftrel BY prevrel::f1 LEFT OUTER, rightrel BY f1;
{code}
Skewed version:
* 36 map tasks
* 5 reduce tasks
* shortest reducer: 1min, 34sec
* longest reducer: 2min, 15sec
* input records: 16,903,866
* output records: 7,916,768
{code}
out = JOIN leftrel BY prevrel::f1 LEFT OUTER, rightrel BY f1 USING 'skewed';
{code}
The two scripts are identical except that each join in the second one adds
{{USING 'skewed'}}. My understanding is that a skewed join should produce the
same results as a regular join; the only difference is that it does a
preliminary scan to determine the best reducer distribution scheme.
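A minimal sketch of how the row counts could be compared directly in Pig, rather than read from the job counters. It relies on one property of left outer joins: every row of the left relation yields at least one output row, so both join outputs should contain at least as many rows as {{leftrel}}, and the two outputs should match each other. The aliases {{out_plain}}, {{out_skewed}}, {{left_cnt}}, {{plain_cnt}}, {{skewed_cnt}} and {{counts}} are hypothetical names introduced for illustration; {{leftrel}}, {{rightrel}} and the join keys are taken from the snippets above.
{code}
-- Run both join variants on the same inputs (aliases are illustrative).
out_plain  = JOIN leftrel BY prevrel::f1 LEFT OUTER, rightrel BY f1;
out_skewed = JOIN leftrel BY prevrel::f1 LEFT OUTER, rightrel BY f1 USING 'skewed';

-- Count rows in the left input and in each join output.
left_grp   = GROUP leftrel ALL;
left_cnt   = FOREACH left_grp GENERATE 'leftrel' AS name, COUNT_STAR(leftrel) AS n;
plain_grp  = GROUP out_plain ALL;
plain_cnt  = FOREACH plain_grp GENERATE 'plain' AS name, COUNT_STAR(out_plain) AS n;
skewed_grp = GROUP out_skewed ALL;
skewed_cnt = FOREACH skewed_grp GENERATE 'skewed' AS name, COUNT_STAR(out_skewed) AS n;

-- Expected if both joins are correct: plain == skewed, and both >= leftrel.
counts = UNION left_cnt, plain_cnt, skewed_cnt;
DUMP counts;
{code}
If the 'skewed' count comes out below the 'leftrel' count, left rows are being dropped, which a left outer join should never do.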
See attached for screenshots of the counters page for both versions.