[
https://issues.apache.org/jira/browse/PIG-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259738#comment-15259738
]
liyunzhang_intel commented on PIG-4810:
---------------------------------------
[~kexianda]: some comments:
1. Add joinOp.setIndexFile(strFile.getFileName()) in the spark code path, as is
done in mr. The index file is later uploaded to the distributed
cache (org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.JoinDistributedCacheVisitor#visitMergeJoin)
so that nodes in the cluster can access it more efficiently. I think we could
later also keep more copies of the index file by calling
FileSystem.setReplication(indexFile, 10), so that other nodes can read the
file more efficiently.
2. For TestMerge#testMergeJoinWithReplicatedJoin, there is no need to add an
order by before the regular join (a regular join does not require sorted input)
{code}
if (!Util.isSparkExecType(cluster.getExecType())) {
    pigServer.registerQuery("D = join A by f1, B by f1 using 'replicated';");
} else {
    // currently, the implementation of FRJoin can't guarantee the order in spark mode
    // the input for MergeJoin should be in asc order.
    pigServer.registerQuery("D0 = join A by f1, B by f1 using 'replicated';");
    pigServer.registerQuery("D = ORDER D0 BY A::f1 ASC;");
}
{code}
can be simplified to
{code}
pigServer.registerQuery("D = join A by f1, B by f1 using 'replicated';");
{code}
3. Please fix the code formatting, e.g. the indentation.
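To illustrate the point in item 2 about which join needs sorted input, here is a minimal plain-Java sketch. It is not Pig's FRJoin or MergeJoin code: the class JoinSketch and both methods are hypothetical, and the joins work on bare int keys only. The hash-style join works on unsorted input, while the merge-style join assumes both inputs are in ascending order and silently misses matches otherwise.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Pig's FRJoin or MergeJoin implementation.
final class JoinSketch {

    // Hash-style join: build a key -> count table from the right side, then
    // probe it with each left key. Neither input has to be sorted.
    static List<Integer> hashJoin(int[] left, int[] right) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int key : right) {
            counts.merge(key, 1, Integer::sum);
        }
        List<Integer> out = new ArrayList<>();
        for (int key : left) {
            int matches = counts.getOrDefault(key, 0);
            for (int n = 0; n < matches; n++) {
                out.add(key); // one output entry per matching (left, right) pair
            }
        }
        return out;
    }

    // Merge-style join: advance two cursors in lockstep.
    // Assumes BOTH inputs are sorted in ascending order.
    static List<Integer> mergeJoin(int[] left, int[] right) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.length && j < right.length) {
            if (left[i] < right[j]) {
                i++;
            } else if (left[i] > right[j]) {
                j++;
            } else {
                int key = left[i];
                int iEnd = i;
                while (iEnd < left.length && left[iEnd] == key) iEnd++;
                int jEnd = j;
                while (jEnd < right.length && right[jEnd] == key) jEnd++;
                // Emit the cross product of the two runs of equal keys.
                for (int n = 0; n < (iEnd - i) * (jEnd - j); n++) {
                    out.add(key);
                }
                i = iEnd;
                j = jEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] unsorted = {3, 1, 2};
        int[] sorted = {1, 2, 3};
        int[] other = {2, 3, 4};
        System.out.println(hashJoin(unsorted, other));  // [3, 2]
        System.out.println(mergeJoin(sorted, other));   // [2, 3]
        System.out.println(mergeJoin(unsorted, other)); // [3], key 2 is missed
    }
}
```

The last call shows why a merge join downstream cares about the order of its input even though the replicated join that produces that input does not.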
> Implement Merge join for spark engine
> -------------------------------------
>
> Key: PIG-4810
> URL: https://issues.apache.org/jira/browse/PIG-4810
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4810-2.patch, PIG-4810-3.patch, PIG-4810-4.patch,
> PIG-4810-5.patch, PIG-4810.patch
>
>
> In the current code base (a9151ac), we use a regular join to implement merge join in
> spark mode.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)