[ 
https://issues.apache.org/jira/browse/PIG-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297098#comment-14297098
 ] 

Rohini Palaniswamy commented on PIG-3856:
-----------------------------------------

bq. Skew join also is impacted (TEZC-Union-6) Is this expected ? 
    Yes. Similar to the small table in replicate join, the SkewedJoin sample 
that was broadcast to the union vertex now needs to be broadcast to all the 
union predecessors.

bq. The new DAG for TEZ-Union-4 is as following. But I think v4 is not 
necessary, just group v2 and v3 together as one vertex group should be enough.
     Currently, without this patch, the plan is v2 (load), v3 (load), v1 (small 
table) -> v4 (union vertex). With this patch, the plan should be v1 -> v2, v3, 
where v2 and v3 are in one vertex group for the union and there is no v4. 
However, v1's output is written twice: once for v2 and once for v3. See the 
plan in TEZC-Union-4.gld. 
{code}
Tez vertex scope-40     ->      Tez vertex scope-34,Tez vertex scope-35,
Tez vertex scope-34     ->      Tez vertex scope-41,
Tez vertex scope-35     ->      Tez vertex scope-41,
Tez vertex scope-41
{code}

  Tez vertex scope-41 is the vertex group, so there are only v1, v2 and v3 
(scope-40 is v1; scope-34 and scope-35 are v2 and v3). Maybe you are looking at 
TEZC-Union-4-OPTOFF.gld, where the UnionOptimizer is turned off.
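
To make the rewiring concrete, here is a rough sketch in Python (not Pig's actual Java code). The function name, the edge tuples and the "union"/"broadcast" labels are all made up for illustration. It only shows the idea of moving the broadcast edge from the union vertex to each union predecessor, and it ignores any edges leaving the union vertex.

```python
def rewire_broadcast_inputs(edges, union_vertex):
    """Drop the union vertex and send broadcast inputs to its predecessors.

    edges is a list of (source, target, kind) tuples. The kind labels
    "union" and "broadcast" are hypothetical names for this sketch only.
    Returns (new_edges, vertex_group_members).
    """
    # Vertices whose output is unioned; they become the vertex group.
    members = [s for s, t, k in edges
               if t == union_vertex and k == "union"]
    # Vertices (e.g. the small table) that were broadcast to the union vertex.
    broadcasters = [s for s, t, k in edges
                    if t == union_vertex and k == "broadcast"]

    # Keep every edge that does not touch the union vertex.
    # Simplification: edges leaving the union vertex are dropped too.
    new_edges = [e for e in edges
                 if e[0] != union_vertex and e[1] != union_vertex]

    # Each broadcaster now feeds every union predecessor directly,
    # so its output is written once per predecessor.
    for src in broadcasters:
        for member in members:
            new_edges.append((src, member, "broadcast"))
    return new_edges, members


# Plan without the patch: v2, v3 and v1 all feed the union vertex v4.
before = [
    ("v2", "v4", "union"),
    ("v3", "v4", "union"),
    ("v1", "v4", "broadcast"),
]
after, group = rewire_broadcast_inputs(before, "v4")
print(after)  # [('v1', 'v2', 'broadcast'), ('v1', 'v3', 'broadcast')]
print(group)  # ['v2', 'v3']
```

The two edges out of v1 in the result correspond to the small table being written twice, which is what a shared edge would avoid.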
 
Once the shared edge is done, the patch should be rewritten so that v1 writes 
the small table once but sends it to both v2 and v3. That will have to be a new 
optimizer that runs before UnionOptimizer. UnionOptimizer creates vertex groups 
for consuming inputs; the new optimizer will create vertex groups for sending 
outputs.

> UnionOptimizer in Tez should optimize the case of replicated join
> -----------------------------------------------------------------
>
>                 Key: PIG-3856
>                 URL: https://issues.apache.org/jira/browse/PIG-3856
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>            Reporter: Rohini Palaniswamy
>         Attachments: PIG-3856-1.patch
>
>
> The replicate join input that was broadcast to the union vertex now needs to 
> be broadcast to all the union predecessors. So we need to:
>     - Create edges from the replicate join input to all the union predecessors.
>     - Change the replicate join input to write to multiple outputs.
> This can be further optimized by using a shared edge, which is yet to be 
> implemented in Tez (TEZ-391).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
