[ 
https://issues.apache.org/jira/browse/TEZ-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393391#comment-14393391
 ] 

Bikas Saha commented on TEZ-2251:
---------------------------------

That's not an avoidable scenario, and it is probably orthogonal to the thread 
locking issue here. The source vertex can start some tasks well before the 
edge changes and then start other tasks after the edge has been changed 
(irrespective of any threading races that happened here). Indeed, that's what 
happens in auto-reduce. The maps are started with the original edge in place. 
Then, after some maps complete (with specs from the original edge), the edge 
is changed (for the new parallelism) and the maps started after that point 
get their specs via the new edge. It is the responsibility of the new edge to 
provide a consistent view across the change. If the old and new behaviors 
cannot be aligned at runtime, then the source vertex can be prevented from 
running until the edge is defined, by specifying a null edge initially.
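
To make "a consistent view across the change" concrete, here is a minimal 
toy sketch. It is not Tez code: ToyEdge, partitionsForSourceTask, 
partitionRangeForConsumer and canStartSource are hypothetical names invented 
for illustration. It assumes one possible strategy, in which source tasks 
always produce the original number of partitions regardless of when they 
start, and each of the (fewer) consumer tasks reads a contiguous range of 
those partitions after the change.

```java
public class EdgeChangeSketch {

    /** Hypothetical stand-in for an edge; NOT the Tez Edge class. */
    static final class ToyEdge {
        final int originalPartitions; // partition count fixed before any source task ran
        final int consumerTasks;      // downstream parallelism (may shrink later)

        ToyEdge(int originalPartitions, int consumerTasks) {
            this.originalPartitions = originalPartitions;
            this.consumerTasks = consumerTasks;
        }

        /** Spec for a source task: the same answer before and after the change. */
        int partitionsForSourceTask() {
            return originalPartitions;
        }

        /** Half-open range [start, end) of original partitions read by one consumer. */
        int[] partitionRangeForConsumer(int consumerIndex) {
            int base = originalPartitions / consumerTasks;
            int rem = originalPartitions % consumerTasks;
            // The first 'rem' consumers each take one extra partition.
            int start = consumerIndex * base + Math.min(consumerIndex, rem);
            int length = base + (consumerIndex < rem ? 1 : 0);
            return new int[] {start, start + length};
        }
    }

    /** Assumption: a null edge means "not defined yet", so the source must wait. */
    static boolean canStartSource(ToyEdge edge) {
        return edge != null;
    }

    public static void main(String[] args) {
        ToyEdge before = new ToyEdge(10, 10); // original parallelism
        ToyEdge after = new ToyEdge(10, 4);   // parallelism reduced to 4

        // Early and late source tasks receive the same partition count.
        System.out.println("early map partitions: " + before.partitionsForSourceTask());
        System.out.println("late map partitions:  " + after.partitionsForSourceTask());

        // Every original partition is assigned to exactly one consumer.
        for (int i = 0; i < after.consumerTasks; i++) {
            int[] r = after.partitionRangeForConsumer(i);
            System.out.println("consumer " + i + " reads [" + r[0] + ", " + r[1] + ")");
        }

        System.out.println("source may start with null edge: " + canStartSource(null));
    }
}
```

In this sketch a map scheduled after the change still writes 10 partitions, 
so its output lines up with the output of maps that finished earlier. If no 
such alignment is possible for a given edge, the canStartSource check stands 
in for the null-edge approach described above.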

> Enabling auto reduce parallelism in certain jobs causes DAG to hang
> -------------------------------------------------------------------
>
>                 Key: TEZ-2251
>                 URL: https://issues.apache.org/jira/browse/TEZ-2251
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>         Attachments: TEZ-2251.2.patch, TEZ-2251.VertexImpl.patch, 
> TEZ-2251.VertexImpl.readlock.patch, TEZ-2251.fix_but_slows_down.patch, 
> hive_console.png, tez-2251.vertexpatch.am.log.gz, tez_2251_dag.png
>
>
> Scenario:
> - Run TPCH query20 
> (https://github.com/cartershanklin/hive-testbench/blob/master/sample-queries-tpch/tpch_query20.sql)
>  at 1 TB scale (tez-master branch, hive trunk)
> - Enable auto reduce parallelism
> - DAG didn't complete and got stuck in "Reducer 6"
> Vertex parallelism changes for "Reducer 5" and "Reducer 6" happen within a 
> span of 3 milliseconds, and tasks of "Reducer 5" end up producing wrong 
> partition details because they see the updated task numbers of "Reducer 6" 
> when scheduled. This causes the job to hang.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
