[ 
https://issues.apache.org/jira/browse/PIG-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217248#comment-15217248
 ] 

Xianda Ke commented on PIG-4848:
--------------------------------

Hi [~kellyzly], thanks for your tips. Here, we only set the flag internally 
for those POLoad operators that have a POMergeJoin successor. We do not set 
the flag for every POLoad operator just because there is a POMergeJoin 
somewhere in the PhysicalPlan, so PlanHelper.getPhysicalOperators() is not 
suitable here.
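
To illustrate the distinction, here is a minimal sketch on a toy plan model. 
It is not Pig's API: the Op class, its fields, and mark_merge_join_loads() 
are hypothetical names made up for this example. It only shows the idea of 
flagging loads by looking at their successors instead of checking whether 
the plan contains a merge join at all.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """Hypothetical stand-in for a physical operator (not a Pig class)."""
    kind: str                      # e.g. "POLoad", "POMergeJoin", "POStore"
    name: str
    successors: list = field(default_factory=list)
    no_split_combination: bool = False


def mark_merge_join_loads(operators):
    """Set the flag only on loads whose direct successor is a merge join."""
    for op in operators:
        feeds_merge_join = any(s.kind == "POMergeJoin" for s in op.successors)
        if op.kind == "POLoad" and feeds_merge_join:
            op.no_split_combination = True


join = Op("POMergeJoin", "C")
store = Op("POStore", "out")
load_a = Op("POLoad", "A", [join])
load_b = Op("POLoad", "B", [join])
load_d = Op("POLoad", "D", [store])    # unrelated load in the same plan
plan = [load_a, load_b, load_d, join, store]

mark_merge_join_loads(plan)
flagged = [op.name for op in plan if op.no_split_combination]
print(flagged)
```

In this toy plan, load D shares the plan with a merge join but does not feed 
it, so it keeps its default setting.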

> pig.noSplitCombination=true should always be set internally for a merge join
> ----------------------------------------------------------------------------
>
>                 Key: PIG-4848
>                 URL: https://issues.apache.org/jira/browse/PIG-4848
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Xianda Ke
>            Assignee: Xianda Ke
>             Fix For: spark-branch
>
>         Attachments: PIG-4848.patch
>
>
> In Spark mode, the flag is NOT set to true internally for a merge join. The 
> input splits are then ordered by file size, so the output is out of order.
> Scenario:
> cat input1
> {code}
> 1     1
> {code}
> cat input2
> {code}
> 2     2
> {code}
> cat input3
> {code}
> 33    33
> {code}
> A = LOAD 'input*' as (a:int, b:int);
> B = LOAD 'input*' as (a:int, b:int);
> C = JOIN A BY $0, B BY $0 USING 'merge';
> DUMP C;
> expected result:
> {code}
> (1,1,1,1)
> (2,2,2,2)
> (33,33,33,33)
> {code}
> actual result:
> {code}
> (33,33,33,33)
> (1,1,1,1)
> (2,2,2,2)
> {code}
> In MR mode, the flag was set to true internally for a merge join (see 
> PIG-2773). However, it no longer works: the output is still out of order 
> because the splits are ordered again by hadoop-client. In Spark mode, we 
> can solve this issue.
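
The reordering in the scenario above can be reproduced with a small 
simulation. This is only a sketch under assumptions: each input file is 
treated as one split, and "order of file size" is taken to mean largest 
first with ties keeping their original order, which matches the actual 
result shown in the description. The dictionary and the helper name are 
hypothetical.

```python
# Contents of the three input files from the scenario (tab-separated).
files = {
    "input1": "1\t1\n",
    "input2": "2\t2\n",
    "input3": "33\t33\n",
}


def read_in_split_order(split_names):
    """Read one row per split, in the order the splits are processed."""
    return [tuple(int(x) for x in files[name].split()) for name in split_names]


# Splits kept in file-name order: what a merge join needs for sorted input.
by_name = sorted(files)

# Splits reordered by size, largest first (assumption); sorted() is stable,
# so input1 and input2, which have equal size, keep their relative order.
by_size = sorted(files, key=lambda name: len(files[name]), reverse=True)

rows_by_name = read_in_split_order(by_name)
rows_by_size = read_in_split_order(by_size)
print(rows_by_name)
print(rows_by_size)
```

Because input3 is the largest file, it is read first once the splits are 
sorted by size, so the key 33 ends up ahead of 1 and 2.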



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
