[ 
https://issues.apache.org/jira/browse/TEZ-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393624#comment-14393624
 ] 

Siddharth Seth commented on TEZ-2251:
-------------------------------------

This isn't directly related to this issue, but it may need to be addressed in 
a separate jira.

Without multiple threads, tasks would always be created before a downstream 
vertex is re-configured - implying that all tasks would generate the same 
number of output partitions.
With multiple threads performing the re-configuration, some tasks in a vertex 
may generate the original number of output partitions, while others generate 
the revised number. Generating the revised number of partitions can be 
problematic since it affects the partition function. It does have the 
advantage that fewer events are generated and transmitted.
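The dependence of the partition function on the partition count can be sketched as follows. This is illustrative only, not Tez code - the partition helper is a hypothetical stand-in for a hash partitioner, and the key and counts are made up:

```java
// Illustrative sketch, not Tez code: with hash partitioning, the partition a
// key routes to depends on the total partition count. Upstream tasks that run
// with the original count and tasks that run with the revised count will, in
// general, route the same key to different downstream tasks.
public class PartitionSketch {

    // Hypothetical stand-in for a hash partitioner (hash mod numPartitions).
    static int partition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String key = "customer_42";   // made-up key
        int original = 100;           // partitions before auto-reduce kicks in
        int revised = 10;             // partitions after re-configuration

        // A task using the original count and a task using the revised count
        // generally disagree on which downstream task should receive this key:
        System.out.println("original count -> partition " + partition(key, original));
        System.out.println("revised count  -> partition " + partition(key, revised));
    }
}
```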
Tasks on the downstream vertex will likely see a consistent number of source 
physical inputs - either the revised parallelism or whatever is returned by 
the revised edge manager.
Does the ShuffleEdgeManager handle this correctly, in terms of providing the 
correct number of physical inputs?

Otherwise we run into DAGs hanging because insufficient events are available.

> Enabling auto reduce parallelism in certain jobs causes DAG to hang
> -------------------------------------------------------------------
>
>                 Key: TEZ-2251
>                 URL: https://issues.apache.org/jira/browse/TEZ-2251
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>         Attachments: TEZ-2251.2.patch, TEZ-2251.VertexImpl.patch, 
> TEZ-2251.VertexImpl.readlock.patch, TEZ-2251.fix_but_slows_down.patch, 
> hive_console.png, tez-2251.vertexpatch.am.log.gz, tez_2251_dag.png
>
>
> Scenario:
> - Run TPCH query20 
> (https://github.com/cartershanklin/hive-testbench/blob/master/sample-queries-tpch/tpch_query20.sql)
>  at 1 TB scale (tez-master branch, hive trunk)
> - Enable auto reduce parallelism
> - DAG didn't complete and got stuck in "Reducer 6"
> Vertex parallelism changes for "Reducer 5" and "Reducer 6" happen within a 
> span of 3 milliseconds, and tasks of "Reducer 5" end up producing wrong 
> partition details, as they see the updated task count of "Reducer 6" when 
> scheduled. This causes the job to hang.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
