[
https://issues.apache.org/jira/browse/PIG-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219232#comment-15219232
]
liyunzhang_intel commented on PIG-4848:
---------------------------------------
[~xuefuz]: +1 for PIG-4848-2.patch. Please commit it.
> pig.noSplitCombination=true should always be set internally for a merge join
> ----------------------------------------------------------------------------
>
> Key: PIG-4848
> URL: https://issues.apache.org/jira/browse/PIG-4848
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: Xianda Ke
> Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4848-2.patch, PIG-4848.patch
>
>
> In Spark mode, the flag is not set to true internally for a merge join, so
> the input splits are ordered by file size and the output is out of order.
> Scenario:
> cat input1
> {code}
> 1 1
> {code}
> cat input2
> {code}
> 2 2
> {code}
> cat input3
> {code}
> 33 33
> {code}
> A = LOAD 'input*' as (a:int, b:int);
> B = LOAD 'input*' as (a:int, b:int);
> C = JOIN A BY $0, B BY $0 USING 'merge';
> DUMP C;
> expected result:
> {code}
> (1,1,1,1)
> (2,2,2,2)
> (33,33,33,33)
> {code}
> actual result:
> {code}
> (33,33,33,33)
> (1,1,1,1)
> (2,2,2,2)
> {code}
> In MR mode, the flag is set to true internally for a merge join (see
> PIG-2773). However, that no longer works: the output is still out of order
> because hadoop-client sorts the splits again. In Spark mode, we
> can solve this issue.
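
The ordering effect described above can be reproduced with a small simulation. This is only a sketch, not Pig or Hadoop code: the `files` dictionary, the `merge_join_output` helper, and the largest-first ordering rule are assumptions chosen to match the scenario in the description, and the self-join is simplified to pairing each row with itself because every key is unique.

```python
# Hypothetical simulation of the scenario in the issue description.
# File contents mirror input1, input2 and input3 (tab-separated fields).
files = {
    "input1": "1\t1\n",
    "input2": "2\t2\n",
    "input3": "33\t33\n",
}


def merge_join_output(split_order):
    """Emit the self-join rows in the order the splits are visited.

    Assumption: one split per file, and each key is unique, so every
    row joins only with itself.
    """
    rows = []
    for name in split_order:
        for line in files[name].splitlines():
            a, b = (int(field) for field in line.split())
            rows.append((a, b, a, b))
    return rows


# Order that keeps the sorted input sorted: splits visited by file name.
by_name = sorted(files)

# Assumed problematic order: largest file first, ties broken by name.
by_size = sorted(files, key=lambda name: (-len(files[name]), name))

print(merge_join_output(by_name))
print(merge_join_output(by_size))
```

Under these assumptions, visiting the splits by name yields the expected result, while visiting them largest-first puts the `33` row ahead of the others, as in the actual result reported above.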