[
https://issues.apache.org/jira/browse/TEZ-678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879095#comment-13879095
]
Siddharth Seth commented on TEZ-678:
------------------------------------
Looks like we're effectively grouping equivalent or similar vertices together
as a convenience for users, after which they can define similar operations on
all of these vertices as a group rather than having to set them up individually.
Instead of AliasVertex extending Vertex, this could be a separate construct
in itself (something like VertexGroup). I'm not sure an AliasVertex fits very
well into a Graph, since it's not really a vertex. A separate construct removes
this concern, and also removes all the additional methods on Vertex which
don't apply to a VertexGroup. Since this is getting converted into a helper,
it should be fairly clear from the API itself that a vertex group doesn't have
a physical representation, cannot be monitored individually, etc. Edges could
be set up between Vertices, or between a VertexGroup and a Vertex.
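
A rough sketch of the separation described above, to make the idea concrete. Only the name VertexGroup comes from the suggestion; the fields, the Graph class and both addEdge overloads are hypothetical stand-ins, not the Tez API or the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified stand-ins; not the actual Tez classes.
final class Vertex {
    private final String name;
    private final int parallelism; // per-vertex setting with no meaning for a group

    Vertex(String name, int parallelism) {
        this.name = name;
        this.parallelism = parallelism;
    }

    String getName() { return name; }
    int getParallelism() { return parallelism; }
}

// A named set of vertices. It does not extend Vertex, so it exposes no
// parallelism, processor or monitoring methods.
final class VertexGroup {
    private final String groupName;
    private final List<Vertex> members;

    VertexGroup(String groupName, Vertex... members) {
        this.groupName = groupName;
        this.members = Collections.unmodifiableList(new ArrayList<>(Arrays.asList(members)));
    }

    String getGroupName() { return groupName; }
    List<Vertex> getMembers() { return members; }
}

// The graph accepts two kinds of edges: Vertex -> Vertex and VertexGroup -> Vertex.
final class Graph {
    private final List<String> edges = new ArrayList<>();

    void addEdge(Vertex source, Vertex destination) {
        edges.add(source.getName() + " -> " + destination.getName());
    }

    // Recorded as one logical edge; no vertex is created for the group itself.
    void addEdge(VertexGroup source, Vertex destination) {
        edges.add("group:" + source.getGroupName() + " -> " + destination.getName());
    }

    List<String> getEdges() { return Collections.unmodifiableList(edges); }
}
```

Because VertexGroup is not a Vertex, a call such as getParallelism() on a group does not compile, which is the kind of API-level clarity argued for above.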
On the InputDescriptor associated with an AliasVertex: I'm assuming an
AliasVertex could potentially generate multiple outputs, which can be linked to
different downstream vertices via different edges. Associating an
InputDescriptor with the vertex itself won't allow this. Unless I'm missing
something, to achieve this users would have to set up multiple Aliases/Groups
for the same set of Vertices (can a vertex belong to multiple Aliases?). If the
descriptor were associated with the edge itself (which is where
Input/OutputDescriptors are defined), it should be possible to use the same
alias/group for different Outputs and Edges generated by the same set of
vertices. Something like
addEdge(VertexGroup, Vertex, EdgeProperty, GroupInputDescriptor)
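
To illustrate the edge-level association, here is a minimal sketch in which one group feeds two downstream vertices with a different descriptor on each edge. The addEdge parameter list follows the signature suggested above; every class body, field name and the GroupEdge holder are assumptions made up for illustration and do not reflect the real Tez types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified stand-ins; not the actual Tez classes.
final class Vertex {
    final String name;
    Vertex(String name) { this.name = name; }
}

final class VertexGroup {
    final String name;
    final List<Vertex> members;
    VertexGroup(String name, Vertex... members) {
        this.name = name;
        this.members = Collections.unmodifiableList(new ArrayList<>(Arrays.asList(members)));
    }
}

// Placeholder: only records which output of the group members this edge carries.
final class EdgeProperty {
    final String sourceOutputName;
    EdgeProperty(String sourceOutputName) { this.sourceOutputName = sourceOutputName; }
}

// Placeholder: names the class the destination would use to read the grouped input.
final class GroupInputDescriptor {
    final String className;
    GroupInputDescriptor(String className) { this.className = className; }
}

// The descriptor is stored per edge, not on the group.
final class GroupEdge {
    final VertexGroup source;
    final Vertex destination;
    final EdgeProperty property;
    final GroupInputDescriptor groupInput;

    GroupEdge(VertexGroup source, Vertex destination,
              EdgeProperty property, GroupInputDescriptor groupInput) {
        this.source = source;
        this.destination = destination;
        this.property = property;
        this.groupInput = groupInput;
    }
}

final class DAG {
    private final List<GroupEdge> groupEdges = new ArrayList<>();

    DAG addEdge(VertexGroup source, Vertex destination,
                EdgeProperty property, GroupInputDescriptor groupInput) {
        groupEdges.add(new GroupEdge(source, destination, property, groupInput));
        return this;
    }

    List<GroupEdge> getGroupEdges() { return Collections.unmodifiableList(groupEdges); }
}
```

With this shape, a single VertexGroup can appear as the source of several edges, each with its own GroupInputDescriptor, so there is no need to define a second alias over the same vertices.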
Nit: When adding an AliasVertex/GroupedVertex to a DAG (whether this is via
addVertex or addVertexGroup) - I don't think users should need to add the
individual vertices separately.
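
As a sketch of that behaviour: adding a group could register its member vertices implicitly. The method names addVertex and addVertexGroup are the ones mentioned above; the class bodies and the choice to make re-adding the same Vertex instance a no-op are assumptions for illustration, not what the patch does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-ins; not the actual Tez classes.
final class Vertex {
    final String name;
    Vertex(String name) { this.name = name; }
}

final class VertexGroup {
    final String name;
    final List<Vertex> members;
    VertexGroup(String name, Vertex... members) {
        this.name = name;
        this.members = Collections.unmodifiableList(new ArrayList<>(Arrays.asList(members)));
    }
}

final class DAG {
    private final Map<String, Vertex> vertices = new LinkedHashMap<>();
    private final Map<String, VertexGroup> groups = new LinkedHashMap<>();

    // Assumption: adding the same instance twice is allowed; a different
    // vertex with an already-used name is rejected.
    DAG addVertex(Vertex vertex) {
        Vertex existing = vertices.putIfAbsent(vertex.name, vertex);
        if (existing != null && existing != vertex) {
            throw new IllegalStateException("duplicate vertex name: " + vertex.name);
        }
        return this;
    }

    // Registers the group and, implicitly, every member vertex.
    DAG addVertexGroup(VertexGroup group) {
        if (groups.containsKey(group.name)) {
            throw new IllegalStateException("duplicate group name: " + group.name);
        }
        for (Vertex member : group.members) {
            addVertex(member);
        }
        groups.put(group.name, group);
        return this;
    }

    Set<String> getVertexNames() { return Collections.unmodifiableSet(vertices.keySet()); }
}
```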
Output handling: I was expecting users would be able to specify a single
committer which would run once for all vertices in the group, rather than each
vertex running a committer. Currently the output semantics just end up
creating a group of committers which will always be executed together. If we
didn't have semantics to commit early, this wouldn't even be required?
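
A minimal sketch of the "single committer, run once" expectation, assuming the framework reports each member vertex's success exactly as a name. GroupCommitter and GroupCommitTracker are hypothetical names invented here; neither is a Tez interface, and early-commit handling is deliberately left out.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical callback; not the actual Tez committer interface.
interface GroupCommitter {
    void commit(String groupName);
}

// Tracks which member vertices are still running and invokes the single
// group committer once, after the last member has succeeded.
final class GroupCommitTracker {
    private final String groupName;
    private final Set<String> pending;
    private final GroupCommitter committer;
    private boolean committed = false;

    GroupCommitTracker(String groupName, Set<String> memberNames, GroupCommitter committer) {
        this.groupName = groupName;
        this.pending = new HashSet<>(memberNames);
        this.committer = committer;
    }

    // Returns true only for the call that triggers the commit.
    boolean vertexSucceeded(String vertexName) {
        if (!pending.remove(vertexName)) {
            return false; // not a member, or already reported
        }
        if (pending.isEmpty() && !committed) {
            committed = true;
            committer.commit(groupName);
            return true;
        }
        return false;
    }

    boolean isCommitted() { return committed; }
}
```

The contrast with the current semantics is that here there is one committer for the group rather than one per vertex executed together.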
I'll probably have some more comments on the patch itself as I go through it
in detail; it is rather big!
> Support for union operations
> ----------------------------
>
> Key: TEZ-678
> URL: https://issues.apache.org/jira/browse/TEZ-678
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Attachments: TEZ-678.1.patch, TEZ-678.2.patch, TEZ-678.3.patch,
> TEZ-678.4.patch, TEZ-678.5.patch
>
>
> Unions represent a collection of results obtained from different branches of
> computation. The collection is a virtual operation that does not need to
> execute any tasks. Subsequent operations can conveniently work on the union
> named data set instead of each individual member of the union. While unions
> can be implemented efficiently without additional support from Tez, having
> API support can make it easier and less error-prone to implement.