[
https://issues.apache.org/jira/browse/PIG-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286004#comment-15286004
]
Mohit Sabharwal commented on PIG-4771:
--------------------------------------
+1 (non-binding)
> Implement FR Join for spark engine
> ----------------------------------
>
> Key: PIG-4771
> URL: https://issues.apache.org/jira/browse/PIG-4771
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4771.patch, PIG-4771_2.patch, PIG-4771_3.patch
>
>
> We use regular join to replace FR join in current code base(fd31fda). We need
> to implement FR join.
> Some info collected from
> https://pig.apache.org/docs/r0.11.0/perf.html#replicated-joins:
> *Replicated Joins*
> Fragment replicate join is a special type of join that works well if one or
> more relations are small enough to fit into main memory. In such cases, Pig
> can perform a very efficient join because all of the hadoop work is done on
> the map side. In this type of join the large relation is followed by one or
> more small relations. The small relations must be small enough to fit into
> main memory; if they don't, the process fails and an error is generated.
> *Usage*
> Perform a replicated join with the USING clause (see JOIN (inner) and JOIN
> (outer)). In this example, a large relation is joined with two smaller
> relations. Note that the large relation comes first followed by the smaller
> relations; and, all small relations together must fit into main memory,
> otherwise an error is generated.
> big = LOAD 'big_data' AS (b1,b2,b3);
> tiny = LOAD 'tiny_data' AS (t1,t2,t3);
> mini = LOAD 'mini_data' AS (m1,m2,m3);
> C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
> *Conditions*
> Fragment replicate joins are experimental; we don't have a strong sense of
> how small the small relation must be to fit into memory. In our tests with a
> simple query that involves just a JOIN, a relation of up to 100 M can be used
> if the process overall gets 1 GB of memory.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)