[
https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553693#comment-14553693
]
Jeff Zhang commented on TEZ-391:
--------------------------------
The following shows the different edge types we may need to support.
| | Vertex | VertexGroup |
| Vertex | Common Edge | SharedOutputEdge |
| VertexGroup | GroupInputEdge | Both SharedOutputEdge & GroupInputEdge (not implemented yet) |
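To make the table concrete, here is a minimal sketch of the same lookup in Java. It assumes the table's rows are the source of the edge and its columns are the destination. The class, enum, and method names are hypothetical and are not taken from the patch.

```java
// Illustrative only: EdgeKinds, Endpoint and edgeKind are hypothetical names,
// not part of Tez or of this patch.
class EdgeKinds {
    enum Endpoint { VERTEX, VERTEX_GROUP }

    // Assumption: table rows are the edge source, table columns the destination.
    static String edgeKind(Endpoint source, Endpoint destination) {
        if (source == Endpoint.VERTEX && destination == Endpoint.VERTEX) {
            return "Common Edge";
        }
        if (source == Endpoint.VERTEX) {
            // Vertex -> VertexGroup
            return "SharedOutputEdge";
        }
        if (destination == Endpoint.VERTEX) {
            // VertexGroup -> Vertex
            return "GroupInputEdge";
        }
        // VertexGroup -> VertexGroup would need both edge kinds at once.
        throw new UnsupportedOperationException(
            "VertexGroup -> VertexGroup is not implemented yet");
    }

    public static void main(String[] args) {
        System.out.println(edgeKind(Endpoint.VERTEX, Endpoint.VERTEX_GROUP));
    }
}
```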
The main changes in this patch:
* Currently SharedOutputEdge supports only One-to-One and Broadcast.
ScatterGather requires the two downstream vertices to have the same
parallelism, otherwise the shuffle breaks. I made some changes to get
ScatterGather working, but it still needs more work, especially around reducer
auto-parallelism. For Pig's usage scenario, One-to-One and Broadcast should be
sufficient for now.
* Workflow for the shared output edge
** The client specifies the shared output edge when building the DAG.
** The AM gets the shared output edge from the DAGPlan and passes the
SharedOutputSpec through the TaskSpec to TezChild.
** LogicalIOProcessorRuntimeTask gets the TaskSpec, which contains the
SharedOutputSpec. It creates the corresponding SharedLogicOutput and
SharedOutputContext, which are very similar to the common LogicOutput and
OutputContext. The only difference is that SharedLogicOutput and
SharedOutputContext are associated with the downstream vertex group name rather
than a downstream vertex name. The key point is that although we generate only
one copy of the DataMovementEvent, we send that one copy to each member of the
downstream vertex group. (This is done in
LogicalIOProcessorRuntimeTask.close().)
* Refactoring changes
** I renamed many occurrences of MergedInput to GroupedInput to align with
SharedOutput.
** Renamed VertexImpl#sharedOutput to VertexImpl#mergedOutput.
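The fan-out step in the workflow above can be sketched roughly as follows. This is not code from the patch: the Event, RoutedEvent, and DataMovement types and the route and supportedOnSharedOutput methods are hypothetical stand-ins, and the vertex group membership is assumed to be available as a simple map. The sketch only shows the idea of generating one event and sending the same copy to every member of the downstream vertex group, plus the current restriction to One-to-One and Broadcast.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: every type here is a hypothetical stand-in,
// not a Tez class and not code from this patch.
class SharedOutputFanOut {
    // Local enum for the three data movement kinds discussed in the comment.
    enum DataMovement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }

    // Stand-in for a data movement event produced once by the shared output.
    static final class Event {
        final String payload;
        Event(String payload) { this.payload = payload; }
    }

    // An event paired with the downstream vertex it is sent to.
    static final class RoutedEvent {
        final String destinationVertex;
        final Event event;
        RoutedEvent(String destinationVertex, Event event) {
            this.destinationVertex = destinationVertex;
            this.event = event;
        }
    }

    // Per the comment, only One-to-One and Broadcast are supported for now.
    static boolean supportedOnSharedOutput(DataMovement movement) {
        return movement == DataMovement.ONE_TO_ONE || movement == DataMovement.BROADCAST;
    }

    // Sends the same event instance to each member of the named vertex group.
    static List<RoutedEvent> route(Event event, String groupName,
                                   Map<String, List<String>> groups) {
        List<String> members = groups.get(groupName);
        if (members == null) {
            throw new IllegalArgumentException("Unknown vertex group: " + groupName);
        }
        List<RoutedEvent> routed = new ArrayList<>();
        for (String member : members) {
            routed.add(new RoutedEvent(member, event)); // no copy of the event is made
        }
        return routed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> groups = Map.of("group1", List.of("v2", "v3"));
        Event event = new Event("partition-0");
        for (RoutedEvent r : route(event, "group1", groups)) {
            System.out.println(r.destinationVertex + " <- " + r.event.payload);
        }
    }
}
```

Keying the output on the group name rather than on each member vertex is what lets the upstream task write its output once, which is the point of the shared edge.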
> SharedEdge - Support for passing same output from a vertex as input to two
> different vertices
> ---------------------------------------------------------------------------------------------
>
> Key: TEZ-391
> URL: https://issues.apache.org/jira/browse/TEZ-391
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Rohini Palaniswamy
> Assignee: Jeff Zhang
> Attachments: Shared Edge Design.pdf, TEZ-391-WIP-1.patch,
> TEZ-391-WIP-2.patch, TEZ-391-WIP-3.patch, TEZ-391-WIP-4.patch,
> TEZ-391-WIP-5.patch, TEZ-391-WIP-6.patch, TEZ-391-WIP-7.patch
>
>
> We need this for a lot of use cases, such as when multi-query is turned off
> and when optimizing unions. Currently those use BROADCAST or ONE-ONE edges
> and we write the output multiple times.