[
https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304797#comment-14304797
]
Jeff Zhang commented on TEZ-391:
--------------------------------
bq. We should probably name it GroupOutputEdge to be symmetric with
GroupInputEdge.
I still think ShareOutputEdge is more suitable. Because for GroupInputEdge,
there's multiple inputs from upstream vertices, we group them together into
GroupInput. While for ShareOutputEdge, there's actually only one output from
upstream vertex. So from semantic perspective I think ShareOutputEdge is
better. Besides, there's one concept of SharedOutput in VertexImpl
(VertexImpl:: addSharedOutputs ) for output to a data sink. I think this kind
of output be renamed as GroupOutput is much better.
bq. Is the design suggesting that a group output edge expand into standard
edges with additional metadata at the source vertex which will enable its
TezChild to provide a single output to its tasks even though there are multiple
consumers?
Yes, TezChild only has one output but would send multiple events to AM based on
the additional metadata about the share edge.
bq. What happens to fault tolerance? If a destination vertex reports an error
about a shared source then what should happen in other destination vertices
that are sharing that source?
The upstream vertex will get the the InputReadErrorEvent and would send the
InputFailedEvent to both downstream vertices. In theory it should be no
problem. But you are right, I think I need to highlight these case and verify
it in unittest.
bq. Related: When an output of a task is marked bad then it sends an
InputFailed event to its destination tasks. This happens in the AM and needs to
be sent to all destination tasks of a shared output. So the AM routing would
need to take into account shared outputs for this case.
For the AM, it knows the standard edges that are expanded from share edge. so
all the downstream vertices will get the InputFailed event.
bq. Can it happen that a VertexGroup is connected to another VertexGroup? What
use case would that be?
Good question. This case would be 2 union join together and one of them is
replicated part. In this case the edges between these vertex group would be
both GroupInputEdge and ShareOutputEdge. Need to look into it more deeply.
{code}
a = load 'file:///tmp/input' as (x:int, y:chararray);
b = load 'file:///tmp/input' as (y:chararray, x:int);
c = union onschema a, b;
d = load 'file:///tmp/input1' as (x:int, z:chararray);
e = load 'file:///tmp/input2' as (x:int, z:chararray);
f = union onschema d,e;
g = join c by x, d by f using 'replicated';
store g into 'file:///tmp/output';
{code}
Besides, I am thinking is it necessary to expose the
GroupInputEdge/ShareOutputEdge as public API. User just need to create edge by
connecting one Vertex/VertexGroup and another Vertex/VertexGroup (2 by 2
cases).,
* If the destination is vertex group, then that mean they share the one copy of
output from source no matter the source is vertex or vertex group.
* Meanwhile, If the source is vertex group, then that mean destination use the
merged input from the destination no matter the destination is vertex or vertex
group.
> SharedEdge - Support for passing same output from a vertex as input to two
> different vertices
> ---------------------------------------------------------------------------------------------
>
> Key: TEZ-391
> URL: https://issues.apache.org/jira/browse/TEZ-391
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Rohini Palaniswamy
> Assignee: Jeff Zhang
> Attachments: Shared Edge Design.pdf, TEZ-391-WIP-1.patch,
> TEZ-391-WIP-2.patch, TEZ-391-WIP-3.patch
>
>
> We need this for lot of usecases. For cases where multi-query is turned off
> and for optimizing unions. Currently those are BROADCAST or ONE-ONE edges and
> we write the output multiple times.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)