[
https://issues.apache.org/jira/browse/TEZ-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393624#comment-14393624
]
Siddharth Seth commented on TEZ-2251:
-------------------------------------
It's not related to this issue, but it may need to be tracked as a separate jira.
Without multiple threads, tasks would always be created before a downstream
vertex is re-configured, implying that all tasks would generate the same
number of output partitions.
With multiple threads causing the re-configuration, some tasks on a vertex may
generate the original number of output partitions while others generate the
revised number. Generating the revised number of partitions can be problematic
since it can affect the partition function. There are advantages to it, though,
since fewer events are generated and transmitted.
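To illustrate the partition-function concern (a hypothetical sketch with made-up class names and counts, not Tez code): with a hash-style partition function, the partition chosen for a key depends on the number of output partitions the producer task was configured with, so producers that mix the original and revised counts route the same key to different downstream tasks.

{code:java}
// Hypothetical sketch, not Tez code: shows how mixing the original and
// revised partition counts across producer tasks mis-routes keys.
public class PartitionMismatchSketch {

  // Simple hash-style partition function: the chosen partition depends
  // on the number of partitions the producer believes exists.
  static int partitionFor(String key, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  public static void main(String[] args) {
    int originalPartitions = 100; // downstream parallelism before auto-reduce
    int revisedPartitions = 10;   // downstream parallelism after auto-reduce

    String key = "customer_42"; // any key works for the illustration

    // A producer task that ran before the re-configuration.
    int p1 = partitionFor(key, originalPartitions);
    // A producer task that ran after the re-configuration.
    int p2 = partitionFor(key, revisedPartitions);

    // The same key lands in different downstream partitions, so the
    // downstream task that owns this key never sees part of its data.
    System.out.println("before reconfig -> partition " + p1);
    System.out.println("after  reconfig -> partition " + p2);
  }
}
{code}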
Tasks on the downstream vertex will likely see a consistent number of source
physical inputs, which is the revised parallelism or whatever is returned by
the revised edge manager.
Does the ShuffleEdgeManager handle this correctly, in terms of providing the
correct number of physical inputs? Otherwise we run into DAGs hanging because
not enough events are available.
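For context, a hypothetical sketch of why an incorrect physical-input count turns into a hang (a CountDownLatch stands in for the event bookkeeping; this is not the real ShuffleEdgeManager or input API): the consumer waits for one event per declared physical input, so if more inputs are declared than events will ever be sent, the wait never completes.

{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not Tez code: a consumer waiting for one
// data-movement event per declared physical input. If the declared count
// exceeds the number of events actually produced, the vertex (and hence
// the DAG) appears to hang.
public class MissingEventsSketch {

  public static void main(String[] args) throws InterruptedException {
    int declaredPhysicalInputs = 100; // what the edge manager reported
    int eventsActuallySent = 10;      // what the producers really generated

    CountDownLatch pendingEvents = new CountDownLatch(declaredPhysicalInputs);

    // Simulate the producers delivering their events.
    for (int i = 0; i < eventsActuallySent; i++) {
      pendingEvents.countDown();
    }

    // The consumer waits for all declared inputs; with 90 events missing
    // this would block forever, so the wait is bounded for the demo.
    boolean allArrived = pendingEvents.await(2, TimeUnit.SECONDS);
    System.out.println(allArrived
        ? "all inputs available, task can proceed"
        : "still waiting for events that will never arrive -> hang");
  }
}
{code}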
> Enabling auto reduce parallelism in certain jobs causes DAG to hang
> -------------------------------------------------------------------
>
> Key: TEZ-2251
> URL: https://issues.apache.org/jira/browse/TEZ-2251
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Attachments: TEZ-2251.2.patch, TEZ-2251.VertexImpl.patch,
> TEZ-2251.VertexImpl.readlock.patch, TEZ-2251.fix_but_slows_down.patch,
> hive_console.png, tez-2251.vertexpatch.am.log.gz, tez_2251_dag.png
>
>
> Scenario:
> - Run TPCH query20
> (https://github.com/cartershanklin/hive-testbench/blob/master/sample-queries-tpch/tpch_query20.sql)
> at 1 TB scale (tez-master branch, hive trunk)
> - Enable auto reduce parallelism
> - DAG didn't complete and got stuck in "Reducer 6"
> Vertex parallelism of "Reducer 5 & 6" is updated within a span of 3
> milliseconds, and tasks of "Reducer 5" end up producing wrong partition
> details as they see the updated task numbers of Reducer 6 when scheduled.
> This causes the job to hang.