[
https://issues.apache.org/jira/browse/PIG-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955445#comment-13955445
]
Rohini Palaniswamy commented on PIG-3835:
-----------------------------------------
To give more details:
- Without the UnionOptimizer, there is a separate union vertex that just
merges the two inputs.
UnionOptimizer:
This is modeled on MultiQueryOptimizer. It creates as many vertex groups as
there are outputs from the union vertex. It merges the plan of the union
vertex into its predecessor vertices, connects the predecessor and successor
vertices to the appropriate vertex groups, and finally removes the union
vertex.
For example:
1) Union followed by store
Vertex 1 (Load), Vertex 2 (Load) -> Vertex 3 (Union + Store) will be
optimized to
Vertex 1 (Load + Store), Vertex 2 (Load + Store). Both vertices write their
output directly to the same store location, which Tez supports.
2) Union followed by groupby
Vertex 1 (Load), Vertex 2 (Load) -> Vertex 3 (Union + POLocalRearrange) ->
Vertex 4 (Group by)
will be optimized to
Vertex 1 (Load + POLR), Vertex 2 (Load + POLR) -> (Vertex Group) ->
Vertex 4 (Group by), where POLR is short for POLocalRearrange.
In case 1), the plan is reduced from 3 to 2 vertices, and in case 2), from
4 to 3 vertices. Either way, this saves one vertex's worth of tasks
(scheduling/CPU/IO).
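
To make the rewrite concrete, here is a minimal toy sketch of the steps
described above. It is not the code in the patch: the Plan class, the
optimize_union function, and the operator strings are all hypothetical, and
a "vertex group" is modeled only as a (member vertices, output) pair.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Toy DAG (hypothetical model, not Pig's TezOperPlan)."""
    ops: dict                                   # vertex name -> list of operators
    edges: set = field(default_factory=set)     # (source, destination) pairs
    groups: list = field(default_factory=list)  # (member vertices, output) pairs


def optimize_union(plan, union):
    """Fold the union vertex into its predecessors and drop it."""
    preds = sorted(s for s, d in plan.edges if d == union)
    succs = sorted(d for s, d in plan.edges if s == union)
    # Operators that ran after the union inside the union vertex.
    tail = [op for op in plan.ops[union] if op != "Union"]

    # Merge the rest of the union vertex plan into every predecessor.
    new_ops = {v: list(o) for v, o in plan.ops.items() if v != union}
    for p in preds:
        new_ops[p].extend(tail)

    # Outputs of the union vertex: successor vertices plus any stores.
    outputs = succs + [op for op in tail if op == "Store"]
    # One vertex group per output, made up of the former predecessors.
    new_groups = list(plan.groups) + [(frozenset(preds), out) for out in outputs]

    # Remove the union vertex by dropping every edge that touched it.
    new_edges = {(s, d) for s, d in plan.edges if union not in (s, d)}
    return Plan(new_ops, new_edges, new_groups)


# Example 2) from above: union followed by group by.
before = Plan(
    ops={"V1": ["Load"], "V2": ["Load"], "V3": ["Union", "POLR"], "V4": ["GroupBy"]},
    edges={("V1", "V3"), ("V2", "V3"), ("V3", "V4")},
)
after = optimize_union(before, "V3")
print(after.ops)
print(after.groups)
```

In this sketch the group-by plan goes from 4 vertices to 3, with V1 and V2
feeding V4 through a single group, and the store plan goes from 3 to 2.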
Will do some basic performance runs after TEZ-1003 is fixed.
> Improve performance of union
> ----------------------------
>
> Key: PIG-3835
> URL: https://issues.apache.org/jira/browse/PIG-3835
> Project: Pig
> Issue Type: Sub-task
> Components: tez
> Affects Versions: tez-branch
> Reporter: Cheolsoo Park
> Assignee: Rohini Palaniswamy
> Fix For: tez-branch
>
> Attachments: PIG-3835-Initial-1.patch
>
>
> PIG-3743 implements union using VertexGroup. But there are a couple of
> optimizations that we can apply to it.
> * Union followed by store
> Union is a blocking operator, meaning that a new vertex is added for its
> succeeding operators. But if there is only one store in the succeeding
> vertex, MROutput could be directly attached to VertexGroup instead of adding
> a new vertex for it. Then, each union source vertex will write directly to
> the destination, and therefore, it will be faster.
> * Replace POLocalRearrangeTez with POValueOutputTez
> Union uses POLocalRearrange with the whole record set as the key. But since
> union only needs to partition records evenly across tasks, it might make more
> sense to use POValueOutputTez with a round-robin (RR) partitioner instead.
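
As a rough illustration of the round-robin idea in the description above,
the sketch below deals records out to tasks in turn without computing any
key. The function name round_robin_partition is hypothetical, and this is
not how POValueOutputTez or any Tez partitioner is actually implemented.

```python
from itertools import cycle


def round_robin_partition(records, num_tasks):
    """Deal records out to tasks in turn; no key is extracted or hashed."""
    partitions = [[] for _ in range(num_tasks)]
    # cycle() yields 0, 1, ..., num_tasks - 1, 0, 1, ... as target task ids.
    for target, record in zip(cycle(range(num_tasks)), records):
        partitions[target].append(record)
    return partitions


parts = round_robin_partition(list(range(7)), 3)
print([len(p) for p in parts])
```

Partition sizes differ by at most one record regardless of the record
contents, which is all that union needs here.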