[ 
https://issues.apache.org/jira/browse/PIG-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3631:
----------------------------

    Fix Version/s:     (was: tez-branch)
                   0.14.0

> Improve performance of replicate-join
> -------------------------------------
>
>                 Key: PIG-3631
>                 URL: https://issues.apache.org/jira/browse/PIG-3631
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>            Reporter: Rohini Palaniswamy
>             Fix For: 0.14.0
>
>
> Replicated join is implemented in Tez as follows:
> - POFRJoinTez extends POFRJoin. The difference between the two is that the 
> replication hash table is built from broadcast edges in Tez instead of 
> from files in the distributed cache in MR.
> - TezCompiler adds a vertex per replicated table and connects it to the 
> POFRJoin vertex via a broadcast edge.
> Verify no performance regression with MR:
>   - The above approach is good when the replicate join is not the first 
> vertex of the DAG (i.e., in MR terms, the replicate join is part of a reduce). 
> If it is the first vertex of the DAG, we need to compare and confirm that 
> this approach does not regress against MR's map-only replicate join, which 
> uses the distributed cache. 
> Evaluate:
>    - Instead of broadcasting key-value pairs and constructing the hashmap, 
> evaluate broadcasting (or distributing via the cache, depending on 
> performance) a serialized hashmap and loading it as is, similar to Hive.
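
For illustration only, the two options in the description can be sketched in a few lines of Python. This is not Pig or Tez code: the function names are hypothetical, plain lists stand in for the broadcast edge and the large input, and pickle stands in for whatever serialization format would actually be used. It only shows the difference between building the hash table from key-value records in the join task and loading a table that was built and serialized once upstream.

```python
import pickle
from collections import defaultdict


def build_hash_table(broadcast_records):
    """Build the replication hash table from (key, value) records,
    as a join task would after receiving them (hypothetical helper)."""
    table = defaultdict(list)
    for key, value in broadcast_records:
        table[key].append(value)
    return dict(table)


def replicated_join(big_records, table):
    """Probe the in-memory table once per record of the large input (inner join)."""
    for key, left in big_records:
        for right in table.get(key, ()):
            yield (key, left, right)


def serialize_table(table):
    """Alternative to evaluate: build the table once and ship it serialized.
    pickle is an assumption made for this sketch only."""
    return pickle.dumps(table, protocol=pickle.HIGHEST_PROTOCOL)


def load_table(payload):
    """Load the shipped table as is, with no per-record rebuild in the task."""
    return pickle.loads(payload)


# Toy data: the small (replicated) relation and the large relation.
small = [(1, "a"), (2, "b"), (1, "c")]
big = [(1, "x"), (3, "y"), (2, "z")]

per_task = build_hash_table(small)               # option 1: build in the join task
shipped = load_table(serialize_table(per_task))  # option 2: load a prebuilt table
joined = list(replicated_join(big, shipped))
```

Both options produce the same table, so the join output is identical; what would need measuring is the per-task build cost against the cost of shipping and deserializing the prebuilt map.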



--
This message was sent by Atlassian JIRA
(v6.2#6252)
