[ https://issues.apache.org/jira/browse/PIG-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rohini Palaniswamy updated PIG-3631:
------------------------------------
Fix Version/s: (was: 0.14.0)
> Improve performance of replicate-join
> -------------------------------------
>
> Key: PIG-3631
> URL: https://issues.apache.org/jira/browse/PIG-3631
> Project: Pig
> Issue Type: Sub-task
> Components: tez
> Reporter: Rohini Palaniswamy
>
> Replicated join is implemented in Tez as follows:
> - POFRJoinTez extends POFRJoin. The difference between the two is that the
> replication hash table is constructed from broadcast edges in Tez
> instead of from files in the distributed cache as in MR.
> - TezCompiler adds a vertex per replicated table and connects it to the
> POFRJoin vertex via a broadcast edge.
> Verify no performance regression with MR:
> - The above approach is good when the replicate join is not the first
> vertex of the DAG (i.e., in MR terms, the replicate join is part of a reduce).
> If it is the first vertex of the DAG, we need to verify that this approach
> does not regress in performance compared with MR's map-only replicate join
> using the distributed cache.
> Evaluate:
> - Instead of broadcasting key-value pairs and constructing the hashmap,
> evaluate broadcasting (or distributing via the cache, depending on
> performance) a serialized hashmap and loading it as is, similar to Hive.
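To make the two options in the "Evaluate" item concrete, here is a minimal, self-contained Java sketch. It is not Pig or Tez code: the class `ReplicatedJoinSketch` and all of its methods are hypothetical names, rows are simplified to `String[]{key, value}`, and plain Java serialization stands in for whatever wire format would actually be used. Option A builds the hash table on the receiving side from individual key-value pairs; option B ships one serialized table and loads it in a single call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Pig or Tez.
class ReplicatedJoinSketch {

    // Option A: the join side receives raw (key, value) pairs and
    // constructs the replication hash table itself.
    static HashMap<String, ArrayList<String>> buildFromPairs(List<String[]> pairs) {
        HashMap<String, ArrayList<String>> table = new HashMap<>();
        for (String[] kv : pairs) {
            table.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return table;
    }

    // Option B, sender side: build the table once and serialize it whole.
    static byte[] serialize(HashMap<String, ArrayList<String>> table) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(table);
        }
        return bytes.toByteArray();
    }

    // Option B, receiver side: load the serialized table in one call.
    @SuppressWarnings("unchecked")
    static HashMap<String, ArrayList<String>> deserialize(byte[] data)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (HashMap<String, ArrayList<String>>) in.readObject();
        }
    }

    // Probe phase, identical for both options: stream the large relation
    // and emit one joined row per match in the replicated table.
    static List<String> probe(Map<String, ArrayList<String>> table, List<String[]> bigRows) {
        List<String> joined = new ArrayList<>();
        for (String[] row : bigRows) {
            List<String> matches = table.get(row[0]);
            if (matches == null) {
                continue; // inner join: rows without a match are dropped
            }
            for (String match : matches) {
                joined.add(row[0] + "," + row[1] + "," + match);
            }
        }
        return joined;
    }
}
```

One caveat on the stand-in format: `java.util.HashMap` re-inserts every entry when it is deserialized, so this sketch shows the shape of option B but would not by itself show the savings the issue hopes for. A real evaluation would need a format whose in-memory layout can be loaded without rehashing.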