[
https://issues.apache.org/jira/browse/TEZ-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390478#comment-14390478
]
Rajesh Balamohan commented on TEZ-2251:
---------------------------------------
update:
=====
- "Reducer 4 " gets auto-reduced and sends CONFIGURED signal to downstream.
- "Reducer 5" reduces from 12 -> 1 task & sends CONFIGURED signal to downstream
"Reducer 6". This is handled by "App Shared Pool" threads in AM.
- "Reducer 6" reduces from 42 -> 6 tasks & sends CONFIGURED signal. This is
handled by "App Shared Pool" threads in AM.
- In the meantime, a task in "Reducer 5" gets scheduled. This lands on the
"central dispatcher" thread in the AM. For some reason this thread is slow,
while the events in the "App Shared Pool" get processed fast, so the pool
threads are in the middle of updating the CustomManager for "Reducer 6". Note
that "numTasks" has already been updated by this time.
- "central dispatcher" looks for the value of tasks in "Reducer 5". This
would end up seeing the updated value of "numTasks" and end up generating wrong
partitions. This causes "Reduce 6" to hang indefinitely
- Tried fixing VertexImpl.setParallelism(), which didn't work.
- Ideally, synchronizing Edge.getSourceSpec() and Edge.getDestinationSpec()
should help. However, this ends up in a deadlock pretty soon, as VertexImpl
acquires the write lock and then tries to get the read lock via
EdgeManagerPluginContextImpl.getDestinationVertexNumTasks().
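The deadlock is a classic lock-ordering cycle, which the sketch below
illustrates with made-up stand-in locks (not actual Tez code): one thread takes
the vertex's write lock and would next need the Edge monitor, while another
thread already holds the Edge monitor and needs the vertex's read lock. A timed
tryLock stands in for the blocked acquisition so the demo terminates instead of
hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class EdgeLockDeadlock {
    static final ReentrantReadWriteLock vertexLock = new ReentrantReadWriteLock();
    static final Object edgeMonitor = new Object();

    // Returns true when the cross-thread cycle blocks the read-lock acquisition.
    public static boolean wouldDeadlock() throws InterruptedException {
        CountDownLatch bothHeld = new CountDownLatch(2);
        final boolean[] gotReadLock = new boolean[1];

        Thread edgeThread = new Thread(() -> {
            synchronized (edgeMonitor) {       // holds the Edge lock...
                bothHeld.countDown();
                try {
                    bothHeld.await();
                    // ...and now needs the vertex read lock (as in
                    // getDestinationVertexNumTasks); times out instead of hanging
                    gotReadLock[0] =
                        vertexLock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
                    if (gotReadLock[0]) vertexLock.readLock().unlock();
                } catch (InterruptedException ignored) {}
            }
        });
        edgeThread.start();

        vertexLock.writeLock().lock();         // holds the vertex write lock...
        bothHeld.countDown();
        bothHeld.await();
        edgeThread.join();                     // ...while the other side is stuck
        vertexLock.writeLock().unlock();
        return !gotReadLock[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlock cycle detected: " + wouldDeadlock());
    }
}
```

With real blocking locks, neither thread could ever proceed, which is why the
naive synchronization attempt deadlocks "pretty soon" under load.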
> Enabling auto reduce parallelism in certain jobs causes DAG to hang
> -------------------------------------------------------------------
>
> Key: TEZ-2251
> URL: https://issues.apache.org/jira/browse/TEZ-2251
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Attachments: hive_console.png, tez_2251_dag.png
>
>
> Scenario:
> - Run TPCH query20
> (https://github.com/cartershanklin/hive-testbench/blob/master/sample-queries-tpch/tpch_query20.sql)
> at 1 TB scale (tez-master branch, hive trunk)
> - Enable auto reduce parallelism
> - DAG didn't complete and got stuck in "Reducer 6"
> Vertex parallelism of "Reducer 5 & 6" happens within a span of 3
> milliseconds, and tasks of "reducer 5" ends up producing wrong partition
> details as it sees the updated task numbers of reducer 6 when scheduled.
> This causes, job to hang.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)