Daniel Dai created TEZ-2310:
-------------------------------

             Summary: AM Deadlock in VertexImpl
                 Key: TEZ-2310
                 URL: https://issues.apache.org/jira/browse/TEZ-2310
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Daniel Dai
             Fix For: 0.7.0


I saw the following deadlock in testing:

Thread #1:
{code}
Daemon Thread [App Shared Pool - #3] (Suspended)
        owns: VertexManager$VertexManagerPluginContextImpl  (id=327)
        owns: ShuffleVertexManager  (id=328)
        owns: VertexManager  (id=329)
        waiting for: VertexManager$VertexManagerPluginContextImpl  (id=326)
        VertexManager$VertexManagerPluginContextImpl.onStateUpdated(VertexStateUpdate) line: 344
        StateChangeNotifier$ListenerContainer.sendStateUpdate(VertexStateUpdate) line: 138
        StateChangeNotifier$ListenerContainer.access$100(StateChangeNotifier$ListenerContainer, VertexStateUpdate) line: 122
        StateChangeNotifier.sendStateUpdate(TezVertexID, VertexStateUpdate) line: 116
        StateChangeNotifier.stateChanged(TezVertexID, VertexStateUpdate) line: 106
        VertexImpl.maybeSendConfiguredEvent() line: 3385
        VertexImpl.doneReconfiguringVertex() line: 1634
        VertexManager$VertexManagerPluginContextImpl.doneReconfiguringVertex() line: 339
        ShuffleVertexManager.schedulePendingTasks(int) line: 561
        ShuffleVertexManager.schedulePendingTasks() line: 620
        ShuffleVertexManager.handleVertexStateUpdate(VertexStateUpdate) line: 731
        ShuffleVertexManager.onVertexStateUpdated(VertexStateUpdate) line: 744
        VertexManager$VertexManagerEventOnVertexStateUpdate.invoke() line: 527
        VertexManager$VertexManagerEvent$1.run() line: 612
        VertexManager$VertexManagerEvent$1.run() line: 607
        AccessController.doPrivileged(PrivilegedExceptionAction<T>, AccessControlContext) line: not available [native method]
        Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 415
        UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1548
        VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call() line: 607
        VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call() line: 596
        ListenableFutureTask<V>(FutureTask<V>).run() line: 262
        ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1145
        ThreadPoolExecutor$Worker.run() line: 615
        Thread.run() line: 745
{code}
Thread #2:
{code}
Daemon Thread [App Shared Pool - #2] (Suspended)
        owns: VertexManager$VertexManagerPluginContextImpl  (id=326)
        owns: PigGraceShuffleVertexManager  (id=344)
        owns: VertexManager  (id=345)
        Unsafe.park(boolean, long) line: not available [native method]
        LockSupport.park(Object) line: 186
        ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).parkAndCheckInterrupt() line: 834
        ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).doAcquireShared(int) line: 964
        ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).acquireShared(int) line: 1282
        ReentrantReadWriteLock$ReadLock.lock() line: 731
        VertexImpl.getTotalTasks() line: 952
        VertexManager$VertexManagerPluginContextImpl.getVertexNumTasks(String) line: 162
        PigGraceShuffleVertexManager(ShuffleVertexManager).updateSourceTaskCount() line: 435
        PigGraceShuffleVertexManager(ShuffleVertexManager).onVertexStarted(Map<String,List<Integer>>) line: 353
        VertexManager$VertexManagerEventOnVertexStarted.invoke() line: 541
        VertexManager$VertexManagerEvent$1.run() line: 612
        VertexManager$VertexManagerEvent$1.run() line: 607
        AccessController.doPrivileged(PrivilegedExceptionAction<T>, AccessControlContext) line: not available [native method]
        Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 415
        UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1548
        VertexManager$VertexManagerEventOnVertexStarted(VertexManager$VertexManagerEvent).call() line: 607
        VertexManager$VertexManagerEventOnVertexStarted(VertexManager$VertexManagerEvent).call() line: 596
        ListenableFutureTask<V>(FutureTask<V>).run() line: 262
        ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1145
        ThreadPoolExecutor$Worker.run() line: 615
        Thread.run() line: 745
{code}
What happens is that thread #1 holds a writeLock (VertexImpl:1628) and then
tries to enter a synchronized block (ShuffleVertexManager.onVertexStateUpdated).
In the meantime, thread #2 is already inside a synchronized block
(ShuffleVertexManager.onVertexStarted) and tries to acquire a readLock
(VertexImpl:952). Each thread is waiting for a lock that the other one holds.
Holding a lock and then entering a synchronized block can be dangerous.

I have attached a patch that avoids this pattern, and with it the deadlock goes
away. I am not sure whether that is the right fix, or whether there are other
patterns like this elsewhere.
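
To illustrate the lock-ordering problem described above, here is a minimal
sketch of one common way to avoid it: mutate state under the write lock,
release the lock, and only then call into listeners that use synchronized
methods. This is not the attached patch. The class names (LockOrderSketch,
Vertex, Manager) and their methods are hypothetical stand-ins, not the real Tez
classes.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockOrderSketch {

    /** Hypothetical stand-in for a vertex guarded by a read/write lock. */
    static class Vertex {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private int totalTasks;
        private volatile Manager listener;

        void setListener(Manager listener) {
            this.listener = listener;
        }

        int getTotalTasks() {
            lock.readLock().lock();
            try {
                return totalTasks;
            } finally {
                lock.readLock().unlock();
            }
        }

        boolean isWriteLockedByCurrentThread() {
            return lock.isWriteLockedByCurrentThread();
        }

        void doneReconfiguring(int newTotalTasks) {
            lock.writeLock().lock();
            try {
                // Only the state change happens under the write lock.
                totalTasks = newTotalTasks;
            } finally {
                lock.writeLock().unlock();
            }
            // The listener is called after the write lock is released, so this
            // thread never waits for the listener's monitor while holding it.
            Manager current = listener;
            if (current != null) {
                current.onStateUpdated();
            }
        }
    }

    /** Hypothetical stand-in for a vertex manager with synchronized callbacks. */
    static class Manager {
        private final Vertex source;
        private boolean notifiedUnderWriteLock;
        private int observedTasks = -1;

        Manager(Vertex source) {
            this.source = source;
        }

        synchronized void onStateUpdated() {
            // Record whether the notifying thread still held the write lock.
            notifiedUnderWriteLock = source.isWriteLockedByCurrentThread();
            observedTasks = source.getTotalTasks();
        }

        synchronized void onVertexStarted() {
            // Takes the vertex read lock while holding this object's monitor.
            observedTasks = source.getTotalTasks();
        }

        synchronized boolean wasNotifiedUnderWriteLock() {
            return notifiedUnderWriteLock;
        }

        synchronized int getObservedTasks() {
            return observedTasks;
        }
    }

    public static void main(String[] args) {
        Vertex vertex = new Vertex();
        Manager manager = new Manager(vertex);
        vertex.setListener(manager);
        manager.onVertexStarted();
        vertex.doneReconfiguring(5);
        System.out.println("observed tasks: " + manager.getObservedTasks());
        System.out.println("notified under write lock: "
                + manager.wasNotifiedUnderWriteLock());
    }
}
```

With this ordering, a thread inside Manager.onVertexStarted can always obtain
the read lock once the writer finishes its short critical section, because the
writer never blocks on the manager's monitor while holding the write lock. The
trade-off is that listeners may observe the notification slightly after other
readers have already seen the new state.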



