Xianda Ke created PIG-4848:
------------------------------

             Summary: pig.noSplitCombination=true should always be set 
internally for a merge join
                 Key: PIG-4848
                 URL: https://issues.apache.org/jira/browse/PIG-4848
             Project: Pig
          Issue Type: Sub-task
            Reporter: Xianda Ke
            Assignee: Xianda Ke


In Spark mode, pig.noSplitCombination is not set to true internally for a 
merge join. As a result, the input splits are ordered by file size, and the 
output is out of order.

Scenario:
cat input1
{code}
1       1
{code}

cat input2
{code}
2       2
{code}

cat input3
{code}
33      33
{code}

A = LOAD 'input*' as (a:int, b:int);
B = LOAD 'input*' as (a:int, b:int);
C = JOIN A BY $0, B BY $0 USING 'merge';
DUMP C;

expected result:
{code}
(1,1,1,1)
(2,2,2,2)
(33,33,33,33)
{code}
actual result:
{code}
(33,33,33,33)
(1,1,1,1)
(2,2,2,2)
{code}
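
The reordering above can be reproduced with a small stand-alone model. This is only a sketch, not Pig or Spark code: it assumes one split per input file and that "in the order of file size" means largest first with ties keeping their original order, which is consistent with the actual result shown. The names files, read_rows and merge_join_output are hypothetical.

```python
# Hypothetical model of the scenario: one split per input file.
files = {
    "input1": "1\t1\n",
    "input2": "2\t2\n",
    "input3": "33\t33\n",
}


def read_rows(content):
    """Parse tab-separated integer rows from one file's content."""
    return [tuple(int(x) for x in line.split("\t"))
            for line in content.splitlines()]


def merge_join_output(split_order):
    """Emit the self-join rows in the order the splits are visited.

    Both relations load the same files, so every row joins with itself.
    """
    out = []
    for name in split_order:
        for row in read_rows(files[name]):
            out.append(row + row)
    return out


# Splits kept in their original (file name) order.
by_name = sorted(files)

# ASSUMPTION: splits sorted by size, largest first; Python's sort is stable,
# so input1 and input2 (same size) keep their relative order.
by_size = sorted(files, key=lambda n: len(files[n]), reverse=True)

print(merge_join_output(by_name))
print(merge_join_output(by_size))
```

With the splits in file order the rows come out sorted on the join key. With the size-based order, input3 is the largest file and is read first, which matches the out-of-order result reported here.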

In MR mode, the flag was set to true internally for a merge join (see 
PIG-2773). However, that no longer works: the output is still out of order 
because hadoop-client sorts the splits again. In Spark mode, we can solve 
this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
