[
https://issues.apache.org/jira/browse/TEZ-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15668758#comment-15668758
]
Zhiyuan Yang commented on TEZ-3458:
-----------------------------------
Thanks for the review!
{quote}
Grouper abstraction is nice. Maybe
FairShuffleVertexManager#PartitionsGroupingCalculator should use that as well.
If you agree, it can be moved out of cartesian product.
{quote}
I've moved it to
/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/utils.
{quote}
The grouping is based on static per-source-vertex desiredBytesPerGroup, not the
aggregated size across all source vertices. So applications will config the
value based on the desired aggregate input size and the number of source
vertices?
{quote}
Aggregated size across all source vertices isn't closer to the true amount of
work than per-vertex grouping. Ideally we want each task to have the same
number of combinations of input entries, but current stats only support data
size instead of number of entries, so the best we can do is to assume each
entry has the same size and that output size is a good estimate of the number
of entries.
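To make the per-vertex grouping concrete, here is a minimal sketch in plain
Java. It is not the Grouper from the patch: the class name, the method names,
the greedy packing strategy and the byte counts are all assumptions made for
illustration. It groups each source vertex's tasks by output size on its own,
then takes the product of the group counts as the resulting parallelism.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from the Tez code base.
final class PerVertexGroupingSketch {

  // Packs consecutive source tasks into a group until the group's total
  // output size reaches desiredBytesPerGroup. Sizes are long, not int.
  static List<List<Integer>> group(long[] taskOutputBytes, long desiredBytesPerGroup) {
    List<List<Integer>> groups = new ArrayList<>();
    List<Integer> current = new ArrayList<>();
    long currentBytes = 0;
    for (int task = 0; task < taskOutputBytes.length; task++) {
      current.add(task);
      currentBytes += taskOutputBytes[task];
      if (currentBytes >= desiredBytesPerGroup) {
        groups.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
    }
    if (!current.isEmpty()) {
      groups.add(current); // trailing group, possibly smaller than desired
    }
    return groups;
  }

  // Parallelism of the cartesian product vertex: the product of the number
  // of groups formed independently for each source vertex.
  static long parallelism(long[][] bytesPerSourceVertex, long desiredBytesPerGroup) {
    long product = 1;
    for (long[] taskOutputBytes : bytesPerSourceVertex) {
      product = Math.multiplyExact(product, group(taskOutputBytes, desiredBytesPerGroup).size());
    }
    return product;
  }

  public static void main(String[] args) {
    long[][] sources = {
      {40, 60, 100, 30}, // made-up sizes for a source vertex with 4 tasks
      {100, 100, 100}    // made-up sizes for a source vertex with 3 tasks
    };
    System.out.println(parallelism(sources, 1));   // 12: every task is its own group
    System.out.println(parallelism(sources, 100)); // 9: 3 groups x 3 groups
  }
}
```

Because each vertex is grouped on its own, the same desiredBytesPerGroup
applies no matter how many source vertices there are.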
{quote}
Wonder if there is any scenario to assign different values of
desiredBytesPerGroup for different source vertex dynamically.
{quote}
A per-vertex config should be more accurate than a single static config. But
one config is easy for users and good enough as a first step. Actually, if we
are going to support per-vertex config, I'd rather implement stats for the
number of entries and still keep a single config.
{quote}
vertexOutputBytes should be long
{quote}
Fixed.
{quote}
What if some source vertex doesn’t generate output?
{quote}
A source vertex should always generate output because it has an edge connected
to the cartesian product vertex. The problem is that an output is not required
to generate a VertexManagerEvent on close, although all existing outputs do
this. With this in mind, we can say that if a vertex manager depends on this
behavior, it shouldn't be used with outputs that don't generate a
VertexManagerEvent.
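The dependency described above can be sketched as a small bookkeeping class.
This is plain Java, not the Tez VertexManagerPlugin API: the class, its
methods and the vertex names are hypothetical, and ignoring reports from
vertices outside the expected set is an assumption of this sketch. It shows
that a manager relying on size reports cannot estimate a source's output
until that source has sent at least one report.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the Tez API.
final class SourceStatsSketch {
  private final Set<String> expectedSources;
  // Accumulated output size per source vertex, kept as long so that
  // totals above Integer.MAX_VALUE bytes do not overflow.
  private final Map<String, Long> outputBytes = new HashMap<>();

  SourceStatsSketch(Set<String> expectedSources) {
    this.expectedSources = new HashSet<>(expectedSources);
  }

  // Called once per size report received from a source task.
  void onSizeReport(String sourceVertex, long bytes) {
    if (!expectedSources.contains(sourceVertex)) {
      return; // assumption: reports from other vertices are ignored
    }
    outputBytes.merge(sourceVertex, bytes, Long::sum);
  }

  // True only when every expected source has sent at least one report.
  boolean allSourcesReported() {
    return outputBytes.keySet().containsAll(expectedSources);
  }

  long bytesFor(String sourceVertex) {
    return outputBytes.getOrDefault(sourceVertex, 0L);
  }

  public static void main(String[] args) {
    SourceStatsSketch stats =
        new SourceStatsSketch(new HashSet<>(Arrays.asList("v1", "v2")));
    stats.onSizeReport("v1", 3_000_000_000L); // larger than Integer.MAX_VALUE
    System.out.println(stats.allSourcesReported()); // false: v2 is silent
    stats.onSizeReport("v2", 10L);
    System.out.println(stats.allSourcesReported()); // true
  }
}
```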
{quote}
What if the VME comes from some broadcast vertex?
{quote}
Nice catch. I forgot this case while rebasing. Thanks!
> Auto grouping for cartesian product edge(unpartitioned case)
> ------------------------------------------------------------
>
> Key: TEZ-3458
> URL: https://issues.apache.org/jira/browse/TEZ-3458
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Zhiyuan Yang
> Assignee: Zhiyuan Yang
> Attachments: TEZ-3458.1.patch, TEZ-3458.2.patch, TEZ-3458.3.patch,
> TEZ-3458.4.patch, TEZ-3458.5.patch
>
>
> The original CartesianProductVertexManagerUnpartitioned sets parallelism to
> the product of all source vertices' parallelism, which may explode to an
> insane number. We should do auto reduce as in ShuffleVertexManager to avoid
> this.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)