[ 
https://issues.apache.org/jira/browse/TEZ-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498972#comment-14498972
 ] 

Hitesh Shah commented on TEZ-2310:
----------------------------------

bq. Because we are not using a bounded queue, we will never block on the put 
method. But the base API declares an exception that must be caught for compilation.

Is there any reason we cannot leave the exception uncaught and let the calling 
code handle it?
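
To make the two options concrete, here is a minimal sketch. It is not taken from the patch: the class and method names (QueuePutSketch, enqueueCatching, enqueuePropagating) are hypothetical, and it assumes the queue in question is an unbounded java.util.concurrent.LinkedBlockingQueue. One point worth keeping in mind: even on an unbounded queue, put() acquires its lock interruptibly, so it can still throw InterruptedException if the calling thread has been interrupted.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration only; names are not from the Tez code base.
class QueuePutSketch {
    // Unbounded queue: put() never waits for free space, but the
    // BlockingQueue API still declares InterruptedException.
    private final BlockingQueue<String> events = new LinkedBlockingQueue<>();

    // Option A: catch locally. Restore the interrupt flag so callers can
    // still observe the interruption, then fail loudly.
    void enqueueCatching(String event) {
        try {
            events.put(event);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while enqueueing", e);
        }
    }

    // Option B: do not catch; the calling code decides how to handle it.
    void enqueuePropagating(String event) throws InterruptedException {
        events.put(event);
    }

    int size() {
        return events.size();
    }

    public static void main(String[] args) throws InterruptedException {
        QueuePutSketch sketch = new QueuePutSketch();
        sketch.enqueueCatching("first");
        sketch.enqueuePropagating("second");
        System.out.println("queued events: " + sketch.size());
    }
}
```

Option B keeps the method body simpler, at the cost of adding a checked exception to every caller's signature. Which is preferable depends on how many call sites exist.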

 

> AM Deadlock in VertexImpl
> -------------------------
>
>                 Key: TEZ-2310
>                 URL: https://issues.apache.org/jira/browse/TEZ-2310
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Daniel Dai
>            Assignee: Bikas Saha
>         Attachments: TEZ-2310-0.patch, TEZ-2310.1.patch, TEZ-2310.2.patch
>
>
> See the following deadlock in testing:
> Thread #1:
> {code}
> Daemon Thread [App Shared Pool - #3] (Suspended)
>       owns: VertexManager$VertexManagerPluginContextImpl  (id=327)
>       owns: ShuffleVertexManager  (id=328)
>       owns: VertexManager  (id=329)
>       waiting for: VertexManager$VertexManagerPluginContextImpl  (id=326)
>       VertexManager$VertexManagerPluginContextImpl.onStateUpdated(VertexStateUpdate) line: 344
>       StateChangeNotifier$ListenerContainer.sendStateUpdate(VertexStateUpdate) line: 138
>       StateChangeNotifier$ListenerContainer.access$100(StateChangeNotifier$ListenerContainer, VertexStateUpdate) line: 122
>       StateChangeNotifier.sendStateUpdate(TezVertexID, VertexStateUpdate) line: 116
>       StateChangeNotifier.stateChanged(TezVertexID, VertexStateUpdate) line: 106
>       VertexImpl.maybeSendConfiguredEvent() line: 3385
>       VertexImpl.doneReconfiguringVertex() line: 1634
>       VertexManager$VertexManagerPluginContextImpl.doneReconfiguringVertex() line: 339
>       ShuffleVertexManager.schedulePendingTasks(int) line: 561
>       ShuffleVertexManager.schedulePendingTasks() line: 620
>       ShuffleVertexManager.handleVertexStateUpdate(VertexStateUpdate) line: 731
>       ShuffleVertexManager.onVertexStateUpdated(VertexStateUpdate) line: 744
>       VertexManager$VertexManagerEventOnVertexStateUpdate.invoke() line: 527
>       VertexManager$VertexManagerEvent$1.run() line: 612
>       VertexManager$VertexManagerEvent$1.run() line: 607
>       AccessController.doPrivileged(PrivilegedExceptionAction<T>, AccessControlContext) line: not available [native method]
>       Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 415
>       UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1548
>       VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call() line: 607
>       VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call() line: 596
>       ListenableFutureTask<V>(FutureTask<V>).run() line: 262
>       ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1145
>       ThreadPoolExecutor$Worker.run() line: 615
>       Thread.run() line: 745
> {code}
> Thread #2:
> {code}
> Daemon Thread [App Shared Pool - #2] (Suspended)
>       owns: VertexManager$VertexManagerPluginContextImpl  (id=326)
>       owns: PigGraceShuffleVertexManager  (id=344)
>       owns: VertexManager  (id=345)
>       Unsafe.park(boolean, long) line: not available [native method]
>       LockSupport.park(Object) line: 186
>       ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).parkAndCheckInterrupt() line: 834
>       ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).doAcquireShared(int) line: 964
>       ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).acquireShared(int) line: 1282
>       ReentrantReadWriteLock$ReadLock.lock() line: 731
>       VertexImpl.getTotalTasks() line: 952
>       VertexManager$VertexManagerPluginContextImpl.getVertexNumTasks(String) line: 162
>       PigGraceShuffleVertexManager(ShuffleVertexManager).updateSourceTaskCount() line: 435
>       PigGraceShuffleVertexManager(ShuffleVertexManager).onVertexStarted(Map<String,List<Integer>>) line: 353
>       VertexManager$VertexManagerEventOnVertexStarted.invoke() line: 541
>       VertexManager$VertexManagerEvent$1.run() line: 612
>       VertexManager$VertexManagerEvent$1.run() line: 607
>       AccessController.doPrivileged(PrivilegedExceptionAction<T>, AccessControlContext) line: not available [native method]
>       Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 415
>       UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1548
>       VertexManager$VertexManagerEventOnVertexStarted(VertexManager$VertexManagerEvent).call() line: 607
>       VertexManager$VertexManagerEventOnVertexStarted(VertexManager$VertexManagerEvent).call() line: 596
>       ListenableFutureTask<V>(FutureTask<V>).run() line: 262
>       ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1145
>       ThreadPoolExecutor$Worker.run() line: 615
>       Thread.run() line: 745
> {code}
> What happens is that thread #1 holds a writeLock (VertexImpl:1628) and tries 
> to enter a synchronized block (ShuffleVertexManager.onVertexStateUpdated); in 
> the meantime, thread #2 is already in the synchronized block 
> (ShuffleVertexManager.onVertexStarted) and tries to get a 
> readLock (VertexImpl:952). Holding a lock and then entering a synchronized 
> block can be dangerous. 
> I am attaching a patch that avoids this, and with it the deadlock goes away. 
> I am not sure whether that is the right fix, or whether there are other 
> patterns like this.
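
One common way to avoid holding a lock while entering a synchronized block is to snapshot the listeners under the write lock and invoke the callbacks only after releasing it. The sketch below illustrates that general pattern under stated assumptions; it is not the attached patch, and every name in it (LockOrderSketch, Vertex, Manager, Listener, reconfigure) is a hypothetical stand-in rather than the real Tez class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only; these are not the Tez classes.
class LockOrderSketch {

    interface Listener {
        void onConfigured(Vertex vertex);
    }

    /** Stand-in for a vertex whose state is guarded by a read/write lock. */
    static class Vertex {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private final List<Listener> listeners = new ArrayList<>();
        private int totalTasks;

        Vertex(int totalTasks) {
            this.totalTasks = totalTasks;
        }

        void addListener(Listener listener) {
            lock.writeLock().lock();
            try {
                listeners.add(listener);
            } finally {
                lock.writeLock().unlock();
            }
        }

        int getTotalTasks() {
            lock.readLock().lock();
            try {
                return totalTasks;
            } finally {
                lock.readLock().unlock();
            }
        }

        void reconfigure(int newTotalTasks) {
            List<Listener> toNotify;
            lock.writeLock().lock();
            try {
                totalTasks = newTotalTasks;
                // Snapshot the listeners while the state is consistent.
                toNotify = new ArrayList<>(listeners);
            } finally {
                lock.writeLock().unlock();
            }
            // Callbacks run with the write lock released, so a listener's
            // monitor is never requested while this lock is held.
            for (Listener listener : toNotify) {
                listener.onConfigured(this);
            }
        }
    }

    /** Stand-in for a vertex manager that uses synchronized methods. */
    static class Manager implements Listener {
        private int lastSeenTasks = -1;

        @Override
        public synchronized void onConfigured(Vertex vertex) {
            lastSeenTasks = vertex.getTotalTasks();
        }

        // Takes the manager monitor first, then the vertex read lock.
        synchronized int onStarted(Vertex vertex) {
            return vertex.getTotalTasks();
        }

        synchronized int lastSeenTasks() {
            return lastSeenTasks;
        }
    }

    public static void main(String[] args) {
        Vertex vertex = new Vertex(4);
        Manager manager = new Manager();
        vertex.addListener(manager);
        vertex.reconfigure(2);
        System.out.println("manager saw tasks: " + manager.lastSeenTasks());
    }
}
```

With this ordering, a thread may hold a manager monitor and then take the vertex lock, but no thread holds the vertex write lock while waiting for a manager monitor, so the cycle seen in the two stack traces cannot form. The trade-off is that listeners may observe state that changed again between the unlock and the callback.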



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
