[
https://issues.apache.org/jira/browse/TEZ-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393046#comment-14393046
]
Siddharth Seth commented on TEZ-2251:
-------------------------------------
Was looking at this a bit.
There are visibility issues in the Edge - the AppSharedPool may update the
edge manager while it's being accessed on the main dispatcher thread. I believe
Rajesh's patch may be fixing this indirectly - adding explicit synchronization
/ locks on the edge may be useful, since it's shared between two different
vertices and effectively updated under different locks.
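Roughly what I have in mind (a minimal sketch with made-up names - not the real
Edge / EdgeManager API):
{code:java}
// Illustrative sketch only - made-up names, not the actual
// org.apache.tez.dag.app.dag.impl.Edge class. It just shows the kind of
// explicit lock that could guard the edge manager, which is read on the
// central dispatcher thread and replaced on the AppSharedPool.
public class EdgeSketch {

  // Placeholder for the real edge manager plugin interface.
  interface EdgeManagerSketch {
    int getNumDestinationTasks();
  }

  private final Object edgeManagerLock = new Object();
  private EdgeManagerSketch edgeManager; // shared between two vertices

  // AppSharedPool: VertexManager reconfigures the edge.
  public void setEdgeManager(EdgeManagerSketch newManager) {
    synchronized (edgeManagerLock) {
      this.edgeManager = newManager;
    }
  }

  // Central dispatcher: building a task's input/output spec.
  public int getDestinationTaskCount() {
    synchronized (edgeManagerLock) {
      return edgeManager.getNumDestinationTasks();
    }
  }
}
{code}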
After the sync fixes - is it possible for some tasks to get their spec from the
old edge manager, and others to get it from the new one? Tasks are created on
the central dispatcher, while the VM plugin decides to update the EdgeManager,
which happens on the AppSharedPool.
e.g.
Reducer 5 starts running its tasks (Reducer 6 hasn't changed parallelism yet) -
these are queued on the central dispatcher.
Some of these tasks get created and use the original parallelism.
Reducer 6 gets configured, and its parallelism and edge manager are changed (on
the AppSharedPool).
The rest of the tasks for Reducer 5 pick up parallelism from the updated edge
manager.
Is there something which avoids this scenario?
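To make the interleaving concrete - a toy example (plain Java, nothing
Tez-specific) of how half the reads can see the old parallelism and half the
new:
{code:java}
// Toy illustration of the interleaving above - nothing Tez-specific.
// A "dispatcher" thread creates 10 task specs, reading a shared parallelism
// value each time, while a "VM plugin" thread updates the value part-way through.
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelismRaceSketch {
  public static void main(String[] args) throws InterruptedException {
    AtomicInteger reducer6Parallelism = new AtomicInteger(1000); // original value

    Thread dispatcher = new Thread(() -> {
      for (int task = 0; task < 10; task++) {
        // Each Reducer 5 task picks up whatever value is current when created.
        System.out.println("Task " + task + " sees parallelism "
            + reducer6Parallelism.get());
        try { Thread.sleep(1); } catch (InterruptedException ignored) { }
      }
    });

    Thread vmPlugin = new Thread(() -> {
      try { Thread.sleep(5); } catch (InterruptedException ignored) { }
      reducer6Parallelism.set(200); // auto reduce parallelism kicks in
      System.out.println("Parallelism updated to 200");
    });

    dispatcher.start();
    vmPlugin.start();
    dispatcher.join();
    vmPlugin.join();
    // Typically the early tasks print 1000 and the later ones print 200,
    // i.e. the two halves would partition data inconsistently.
  }
}
{code}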
> Enabling auto reduce parallelism in certain jobs causes DAG to hang
> -------------------------------------------------------------------
>
> Key: TEZ-2251
> URL: https://issues.apache.org/jira/browse/TEZ-2251
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Attachments: TEZ-2251.2.patch, TEZ-2251.VertexImpl.patch,
> TEZ-2251.VertexImpl.readlock.patch, TEZ-2251.fix_but_slows_down.patch,
> hive_console.png, tez-2251.vertexpatch.am.log.gz, tez_2251_dag.png
>
>
> Scenario:
> - Run TPCH query20
> (https://github.com/cartershanklin/hive-testbench/blob/master/sample-queries-tpch/tpch_query20.sql)
> at 1 TB scale (tez-master branch, hive trunk)
> - Enable auto reduce parallelism
> - DAG didn't complete and got stuck in "Reducer 6"
> Vertex parallelism changes for "Reducer 5" and "Reducer 6" happen within a
> span of 3 milliseconds, and tasks of "Reducer 5" end up producing wrong
> partition details as they see the updated task numbers of Reducer 6 when
> scheduled. This causes the job to hang.