[ 
https://issues.apache.org/jira/browse/TEZ-3297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326660#comment-15326660
 ] 

Bikas Saha commented on TEZ-3297:
---------------------------------

Looking at the code further, it looks like the crucial change is to stop 
holding a vertex's own lock while trying to take the src/dest vertex lock. 
That makes sense; the old pattern was a lock ordering issue waiting to happen. 
Perhaps a quick scan for other such nested locking is in order, in case that 
has not already been done.
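
To make the nested-locking point concrete, here is a minimal sketch of the two 
patterns. The Vertex class, its fields, and the setParallelism* method names 
are hypothetical stand-ins for illustration, not the actual Tez classes or the 
code in the patch; only getTotalTasks() is a name taken from this discussion.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockOrderingSketch {

    /** Hypothetical stand-in for a vertex; NOT the real Tez implementation. */
    public static final class Vertex {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private int totalTasks;
        private int sourceTasksSeen = -1;

        public Vertex(int totalTasks) {
            this.totalTasks = totalTasks;
        }

        /** Handles its own locking, so callers need no outer lock. */
        public int getTotalTasks() {
            lock.readLock().lock();
            try {
                return totalTasks;
            } finally {
                lock.readLock().unlock();
            }
        }

        public int getSourceTasksSeen() {
            lock.readLock().lock();
            try {
                return sourceTasksSeen;
            } finally {
                lock.readLock().unlock();
            }
        }

        /** Risky: takes the source's read lock while holding our own write lock. */
        public void setParallelismNested(Vertex source, int newTasks) {
            lock.writeLock().lock();
            try {
                sourceTasksSeen = source.getTotalTasks(); // nested lock acquisition
                totalTasks = newTasks;
            } finally {
                lock.writeLock().unlock();
            }
        }

        /** Safer: read from the source first, then take only our own lock. */
        public void setParallelismUnnested(Vertex source, int newTasks) {
            int sourceTasks = source.getTotalTasks(); // our own lock is not held here
            lock.writeLock().lock();
            try {
                sourceTasksSeen = sourceTasks;
                totalTasks = newTasks;
            } finally {
                lock.writeLock().unlock();
            }
        }
    }

    public static void main(String[] args) {
        Vertex v1 = new Vertex(4);
        Vertex v2 = new Vertex(10);
        v2.setParallelismUnnested(v1, 3);
        System.out.println(v2.getTotalTasks() + " tasks, source had "
                + v2.getSourceTasksSeen());
    }
}
```

The trade-off in the unnested version is that the value read from the source 
is a snapshot and may be stale by the time the vertex's own lock is taken, so 
this only fits cases where that is acceptable.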

The removal of the overall lock is fine since each internal method invocation, 
such as getTotalTasks(), already handles its own locking. 

lgtm.

Moving VM-invoked sync calls onto the dispatcher is a good idea, but it would 
need new callbacks into the VM to notify it when the requested vertex state 
change operation has completed. Since most current VMs don't do much after 
changing parallelism, the change might be simpler to implement now. Not sure 
about Hive custom VMs.

> Deadlock scenario in AM during ShuffleVertexManager auto reduce
> ---------------------------------------------------------------
>
>                 Key: TEZ-3297
>                 URL: https://issues.apache.org/jira/browse/TEZ-3297
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Zhiyuan Yang
>            Priority: Critical
>         Attachments: TEZ-3297.1.patch, TEZ-3297.2.patch, am_log, thread_dump
>
>
> Here is what's happening in the attached thread dump.
> App Pool thread #9 does the auto reduce on V2 and initializes the new edge 
> manager, it holds the V2 write lock and wants read lock of source vertex V1. 
> At the same time, another App Pool thread #2 schedules a task of V1 and gets 
> the output spec, so it holds the V1 read lock and wants V2 read lock. 
> Also, dispatcher thread wants the V1 write lock to begin the state machine 
> transition. Since dispatcher thread is at the head of V1 ReadWriteLock queue, 
> thread #9 cannot get V1 read lock even though thread #2 is holding V1 read lock. 
> This is a circular lock scenario. #2 blocks dispatcher, dispatcher blocks #9, 
> and #9 blocks #2.
> There is no problem with ReadWriteLock behavior in this case. Please see this 
> Java bug report: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6816565.
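
The key step in the quoted description, a new reader being refused the read 
lock while another reader holds it because a writer is already queued, can be 
sketched with a plain java.util.concurrent ReentrantReadWriteLock. This is an 
illustrative reproduction only, assuming the default (non-fair) lock; the 
class, method, and thread names are made up for the example and nothing here 
is taken from the Tez code or the attached thread dump.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class QueuedWriterBlocksReader {

    /** Returns true if a second reader got the read lock while a writer was queued. */
    public static boolean secondReaderAcquires() {
        final ReentrantReadWriteLock v1Lock = new ReentrantReadWriteLock();
        final CountDownLatch readHeld = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);

        // Plays the role of thread #2: holds the V1 read lock for a while.
        Thread firstReader = new Thread(() -> {
            v1Lock.readLock().lock();
            try {
                readHeld.countDown();
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                v1Lock.readLock().unlock();
            }
        }, "thread-2");

        // Plays the role of the dispatcher: queues up for the V1 write lock.
        Thread writer = new Thread(() -> {
            v1Lock.writeLock().lock();
            v1Lock.writeLock().unlock();
        }, "dispatcher");

        try {
            firstReader.start();
            readHeld.await();
            writer.start();
            // Wait until the writer is parked in the lock's wait queue.
            while (!(v1Lock.hasQueuedThread(writer)
                    && writer.getState() == Thread.State.WAITING)) {
                Thread.sleep(5);
            }

            // Plays the role of thread #9: a fresh reader arriving behind the writer.
            boolean acquired = v1Lock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                v1Lock.readLock().unlock();
            }

            // Let everything drain so the example terminates cleanly.
            release.countDown();
            firstReader.join();
            writer.join();
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second reader acquired: " + secondReaderAcquires());
    }
}
```

In this sketch the timed tryLock is used instead of lock() so the example 
gives up and exits rather than hanging; in the scenario above the equivalent 
call blocks indefinitely because the first reader is itself waiting on a lock 
held by the blocked thread.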



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
