[ 
https://issues.apache.org/jira/browse/TEZ-3297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323836#comment-15323836
 ] 

Siddharth Seth commented on TEZ-3297:
-------------------------------------

This looks good to me. The readLock to obtain information from the Input/Output 
vertices is obtained outside of the read/write lock of the same vertex. +1.
A couple of things may be worth adding to the patch:
- Some commentary around why this change is being made.
- Potentially some asserts (maybe Preconditions) around the state of the locks 
during calls to upstream / downstream vertices in these two methods.
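
To illustrate the kind of lock-state check suggested above, here is a minimal
sketch. It assumes the vertex lock is a java.util.concurrent
ReentrantReadWriteLock; the class and method names (LockStateChecks,
holdsNoLock, checkNoLockHeld) are hypothetical and not part of the Tez code
base. The same condition could be expressed with Guava's Preconditions instead
of the explicit exception.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical helper for illustration only; not taken from the Tez sources.
final class LockStateChecks {
  private LockStateChecks() {}

  /** True if the calling thread holds neither the read nor the write lock. */
  static boolean holdsNoLock(ReentrantReadWriteLock lock) {
    return lock.getReadHoldCount() == 0 && !lock.isWriteLockedByCurrentThread();
  }

  /** Throws if the calling thread still holds the given vertex lock. */
  static void checkNoLockHeld(ReentrantReadWriteLock lock, String vertexName) {
    if (!holdsNoLock(lock)) {
      throw new IllegalStateException(
          "Lock of " + vertexName + " must not be held while calling into another vertex");
    }
  }
}
```

A call such as checkNoLockHeld(vertexLock, "V2") placed just before reading
from an upstream or downstream vertex would fail fast if a later change moved
that call back inside the vertex's own lock.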

I believe this fixes things for now. Would be useful to look at how to make 
this simpler in the future. One potential option is to have all write 
transitions on a vertex go through the central dispatcher (instead of direct 
calls in from the VertexManager). This should work fairly well in terms of 
restricting the number of write requests.

Going through the code a little more, here are some potential issues unrelated
to this patch:
I think any method which invokes a method on the EdgeManager, which in turn
invokes the actual Edge, from within a lock is prone to such deadlocks, since
the Edges themselves have access to 'getSourceVertexNumTasks' and
'getDestinationVertexNumTasks'.
Another potential issue is with handleInitEvent which accesses upstream 
totalTasks from within a writeLock on the current vertex.
setInputVertices / setOutputVertices should acquire a writeLock for visibility.
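
As a rough sketch of the last point, a setter that publishes its update under
the write lock guarantees that readers taking the read lock see the new value.
VertexSketch and its String-based vertex list are hypothetical stand-ins, not
the actual Tez vertex implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a vertex; names are illustrative only.
final class VertexSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private List<String> inputVertices = new ArrayList<>();

  /** Replaces the input vertices while holding the write lock. */
  void setInputVertices(List<String> inputs) {
    lock.writeLock().lock();
    try {
      inputVertices = new ArrayList<>(inputs);
    } finally {
      lock.writeLock().unlock();
    }
  }

  /** Returns a copy of the input vertices while holding the read lock. */
  List<String> getInputVertices() {
    lock.readLock().lock();
    try {
      return new ArrayList<>(inputVertices);
    } finally {
      lock.readLock().unlock();
    }
  }
}
```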

> Deadlock scenario in AM during ShuffleVertexManager auto reduce
> ---------------------------------------------------------------
>
>                 Key: TEZ-3297
>                 URL: https://issues.apache.org/jira/browse/TEZ-3297
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Zhiyuan Yang
>            Priority: Critical
>         Attachments: TEZ-3297.1.patch, am_log, thread_dump
>
>
> Here is what's happening in the attached thread dump.
> App Pool thread #9 does the auto reduce on V2 and initializes the new edge 
> manager, it holds the V2 write lock and wants read lock of source vertex V1. 
> At the same time, another App Pool thread #2 schedules a task of V1 and gets 
> the output spec, so it holds the V1 read lock and wants V2 read lock. 
> Also, dispatcher thread wants the V1 write lock to begin the state machine 
> transition. Since dispatcher thread is at the head of V1 ReadWriteLock queue, 
> thread #9 cannot get V1 read lock even though thread #2 is holding V1 read lock. 
> This is a circular lock scenario. #2 blocks dispatcher, dispatcher blocks #9, 
> and #9 blocks #2.
> There is no problem with ReadWriteLock behavior in this case. Please see this 
> java bug report, http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6816565.
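
The lock behavior described in the issue can be reproduced in isolation. The
sketch below assumes a default (non-fair) ReentrantReadWriteLock standing in
for the V1 lock; QueuedWriterDemo and its method name are hypothetical. The
main thread plays thread #2 (holding the read lock), the writer thread plays
the dispatcher, and the second reader plays thread #9. A timed tryLock is used
so the demo terminates instead of hanging.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; not taken from the Tez sources.
final class QueuedWriterDemo {
  /** Returns whether a new reader got the read lock while a writer was queued. */
  static boolean newReaderAcquires() throws InterruptedException {
    ReentrantReadWriteLock v1 = new ReentrantReadWriteLock(); // non-fair by default
    AtomicBoolean acquired = new AtomicBoolean(false);

    v1.readLock().lock(); // plays thread #2: holds the V1 read lock

    // Plays the dispatcher: blocks waiting for the V1 write lock.
    Thread writer = new Thread(() -> {
      v1.writeLock().lock();
      v1.writeLock().unlock();
    });
    writer.start();
    while (!v1.hasQueuedThread(writer)) {
      Thread.sleep(10); // wait until the writer is parked in the lock queue
    }

    // Plays thread #9: a new reader arriving behind the queued writer.
    Thread reader = new Thread(() -> {
      try {
        if (v1.readLock().tryLock(200, TimeUnit.MILLISECONDS)) {
          acquired.set(true);
          v1.readLock().unlock();
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    reader.start();
    reader.join();

    v1.readLock().unlock(); // let the writer proceed so the demo can finish
    writer.join();
    return acquired.get();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("new reader acquired read lock: " + newReaderAcquires());
  }
}
```

The new reader is expected to time out even though only a read lock is held,
because a writer is waiting at the head of the queue. In the thread dump, the
existing reader never releases, since it is itself waiting on V2, which closes
the cycle.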



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
