[ 
https://issues.apache.org/jira/browse/PIG-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259738#comment-15259738
 ] 

liyunzhang_intel commented on PIG-4810:
---------------------------------------

[~kexianda]: some comments:
1. Add joinOp.setIndexFile(strFile.getFileName()) in Spark mode, as is done in 
MR mode. Later the index file is uploaded to the distributed cache 
(org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.JoinDistributedCacheVisitor#visitMergeJoin)
so that nodes in the cluster can access it more efficiently. I think we could 
later also make more copies of the index file with 
FileSystem.setReplication(indexFile, 10), so that other nodes can access the 
file more efficiently.
2. In TestMerge#testMergeJoinWithReplicatedJoin, there is no need to add an 
ORDER BY before the regular join (a regular join does not require sorted input):
{code}
            if (!Util.isSparkExecType(cluster.getExecType())) {
                pigServer.registerQuery("D = join A by f1, B by f1 using 'replicated';");
            } else {
                // currently, the implementation of FRJoin can't guarantee the order in spark mode
                // the input for MergeJoin should be in asc order.
                pigServer.registerQuery("D0 = join A by f1, B by f1 using 'replicated';");
                pigServer.registerQuery("D = ORDER D0 BY A::f1 ASC;");
            }
{code}
can be simplified to:
{code}
   pigServer.registerQuery("D = join A by f1, B by f1 using 'replicated';");
{code}
3. Please fix the code formatting, e.g. the indentation.
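
As background for the code comment quoted in point 2 ("the input for MergeJoin 
should be in asc order"), here is a minimal sketch of a sort-merge join over 
plain int keys. It is not Pig's implementation; the class and method names 
(MergeJoinSketch, mergeJoin) are made up for illustration. It only shows why a 
merge join relies on both inputs being sorted ascending: the two cursors only 
move forward, so unsorted input silently loses matches.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the Pig code base.
class MergeJoinSketch {

    // Joins two key arrays that are ASSUMED to be sorted ascending.
    // Returns the key once for every matching (left row, right row) pair.
    static List<Integer> mergeJoin(int[] left, int[] right) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.length && j < right.length) {
            if (left[i] < right[j]) {
                i++;                       // left key too small, advance left cursor
            } else if (left[i] > right[j]) {
                j++;                       // right key too small, advance right cursor
            } else {
                int key = left[i];
                int runStart = j;          // start of the run of equal keys on the right
                // Pair every left row of this key with every right row of this key.
                while (i < left.length && left[i] == key) {
                    j = runStart;
                    while (j < right.length && right[j] == key) {
                        out.add(key);
                        j++;
                    }
                    i++;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sorted inputs: all matches are found.
        System.out.println(mergeJoin(new int[]{1, 2, 2, 4}, new int[]{2, 2, 3, 4}));
        // Unsorted left input: the matches for keys 1 and 2 are missed.
        System.out.println(mergeJoin(new int[]{4, 2, 1}, new int[]{1, 2, 4}));
    }
}
```

With the sorted inputs the sketch yields [2, 2, 2, 2, 4]; with the unsorted left 
input it yields only [4], with no error raised. A hash-based (regular or 
replicated) join has no such ordering requirement on its input.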

> Implement Merge join for spark engine
> -------------------------------------
>
>                 Key: PIG-4810
>                 URL: https://issues.apache.org/jira/browse/PIG-4810
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: Xianda Ke
>             Fix For: spark-branch
>
>         Attachments: PIG-4810-2.patch, PIG-4810-3.patch, PIG-4810-4.patch, 
> PIG-4810-5.patch, PIG-4810.patch
>
>
> In the current code base (a9151ac), we use a regular join to implement merge 
> join in Spark mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
