Ming Ma created TEZ-3500:
----------------------------
Summary: Support for multiple source vertices
Key: TEZ-3500
URL: https://issues.apache.org/jira/browse/TEZ-3500
Project: Apache Tez
Issue Type: Sub-task
Reporter: Ming Ma
For the fair_parallelism policy, where multiple destination tasks process the
data from different source tasks of the same partition, the current
implementation supports only one source vertex.
Supporting multiple source vertices would enable skewed shuffle joins, as
mentioned in
https://issues.apache.org/jira/browse/TEZ-3209?focusedCommentId=15385449&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15385449.
Some rough ideas:
* For a large partition, if the volume comes mostly from one source vertex,
apply fair routing on that primary source vertex and have the other vertices
broadcast their output to the destination tasks processing that partition.
* If more than one source vertex contributes a large volume to the partition,
we will need something like a cartesian product to join the sub-partition data
from the different vertices.
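
A minimal sketch of how the two ideas above might be combined for a single
large partition, written in Python for brevity. It is not Tez code: the
function names, the input layout (per-source-task byte counts keyed by vertex
name), the greedy grouping, and the 0.8 dominance threshold are all
assumptions made for illustration.

```python
from itertools import product


def split_source_tasks(sizes, target_bytes):
    """Greedily group consecutive source task indices into sub-partitions
    of roughly target_bytes each (hypothetical helper, not a Tez API)."""
    groups, current, acc = [], [], 0
    for idx, size in enumerate(sizes):
        # Close the current group once adding this task would exceed the target.
        if current and acc + size > target_bytes:
            groups.append(current)
            current, acc = [], 0
        current.append(idx)
        acc += size
    if current:
        groups.append(current)
    # Always return at least one (possibly empty) group.
    return groups or [[]]


def plan_partition(sizes_by_vertex, target_bytes, dominance=0.8):
    """Return one dict per destination task, mapping each source vertex
    to the source task indices that destination task should read.

    sizes_by_vertex: {vertex_name: [bytes from each source task]}
    dominance: assumed threshold for "mostly from one source vertex".
    """
    totals = {v: sum(s) for v, s in sizes_by_vertex.items()}
    grand_total = sum(totals.values())
    primary = max(totals, key=totals.get)

    if grand_total == 0 or totals[primary] / grand_total >= dominance:
        # Idea 1: fair routing on the primary vertex; every other vertex
        # broadcasts all of its output to each destination task.
        plan = []
        for group in split_source_tasks(sizes_by_vertex[primary], target_bytes):
            task = {primary: group}
            for v, sizes in sizes_by_vertex.items():
                if v != primary:
                    task[v] = list(range(len(sizes)))
            plan.append(task)
        return plan

    # Idea 2: several vertices are large, so split each one and pair up
    # every combination of sub-partitions (cartesian product).
    names = sorted(sizes_by_vertex)
    per_vertex = [split_source_tasks(sizes_by_vertex[v], target_bytes)
                  for v in names]
    return [dict(zip(names, combo)) for combo in product(*per_vertex)]
```

With one dominant vertex the number of destination tasks grows with the number
of sub-partitions of that vertex only. In the cartesian product case it grows
with the product of the sub-partition counts, so that path is presumably worth
taking only when broadcasting is too expensive.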