[ 
https://issues.apache.org/jira/browse/TEZ-3117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15151303#comment-15151303
 ] 

Bikas Saha commented on TEZ-3117:
---------------------------------

Looking at it further, these methods are called by the destination vertex, so
visibility ends up being guarded by the read/write locks in the vertex. The
synchronization in Edge is probably superfluous, since we already read these
same variables in other places outside of any lock. In this patch, the
synchronized sections have been moved around to prevent out-of-order locking
like the one that caused this deadlock.
We can follow up separately to take another look at the synchronization and
remove it altogether (TEZ-3122). Sounds good?
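
To make the out-of-order locking concrete, here is a minimal sketch of the
pattern in the trace below and one way to avoid it. It is not the TEZ-3117
patch: the Vertex and Edge classes are hypothetical, simplified stand-ins that
only borrow method names from the stack trace, and the fix shown (read from the
vertex before entering the Edge monitor) is an assumption about one reasonable
approach, not a description of what the patch actually does.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified stand-ins; not the real VertexImpl/Edge classes.
public class LockOrderSketch {

    static class Vertex {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private final int totalTasks;

        Vertex(int totalTasks) {
            this.totalTasks = totalTasks;
        }

        // Reader path: takes the vertex read lock.
        int getTotalTasks() {
            lock.readLock().lock();
            try {
                return totalTasks;
            } finally {
                lock.readLock().unlock();
            }
        }

        // State-transition path: holds the vertex write lock, then enters the
        // edge monitor. Lock order here: vertex lock -> edge monitor.
        String start(Edge edge) {
            lock.writeLock().lock();
            try {
                return edge.getEdgeProperty();
            } finally {
                lock.writeLock().unlock();
            }
        }
    }

    static class Edge {
        private final Vertex destination;
        private final String edgeProperty;
        private int routedTasks = -1;

        Edge(Vertex destination, String edgeProperty) {
            this.destination = destination;
            this.edgeProperty = edgeProperty;
        }

        synchronized String getEdgeProperty() {
            return edgeProperty;
        }

        synchronized int getRoutedTasks() {
            return routedTasks;
        }

        // Deadlock-prone shape (edge monitor -> vertex lock), kept as a comment:
        //   synchronized void routingToBegin() {
        //       routedTasks = destination.getTotalTasks();
        //   }
        //
        // Safer shape: query the vertex before entering the edge monitor, so
        // this thread never waits for the vertex lock while holding the monitor.
        void routingToBegin() {
            int numTasks = destination.getTotalTasks();
            synchronized (this) {
                routedTasks = numTasks;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Vertex vertex = new Vertex(4);
        Edge edge = new Edge(vertex, "SCATTER_GATHER");

        // Two threads exercising both paths concurrently, loosely mirroring the
        // "App Shared Pool" and "Dispatcher" threads in the report.
        Thread pool = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) {
                edge.routingToBegin();
            }
        });
        Thread dispatcher = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) {
                vertex.start(edge);
            }
        });
        pool.start();
        dispatcher.start();
        pool.join();
        dispatcher.join();
        System.out.println("routed tasks: " + edge.getRoutedTasks());
    }
}
```

With the commented-out synchronized variant of routingToBegin, one thread can
hold the Edge monitor while waiting for the vertex read lock, and the other can
hold the vertex write lock while waiting for the Edge monitor, which is the
cycle reported below. In the sketch, neither path acquires the two locks in
opposite orders, so the cycle cannot form.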

> Deadlock in Edge and Vertex code
> --------------------------------
>
>                 Key: TEZ-3117
>                 URL: https://issues.apache.org/jira/browse/TEZ-3117
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Yesha Vora
>            Assignee: Bikas Saha
>             Fix For: 0.7.1, 0.8.3
>
>         Attachments: TEZ-3117.1.patch
>
>
> {code}
> Java-level deadlocks detected
>  
> This means that some threads are blocked waiting to enter a synchronization block or
> waiting to reenter a synchronization block after an Object.wait() call, where each thread
> owns one monitor while trying to obtain another monitor already held by another thread.
>  
> Deadlock:
> App Shared Pool - #1 is waiting to lock java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@18a7c819 which is held by Dispatcher thread {Central}
> Dispatcher thread {Central} is waiting to lock org.apache.tez.dag.app.dag.impl.Edge@3e6ba2db which is held by App Shared Pool - #1
>  
> Deadlock:
> Dispatcher thread {Central} is waiting to lock org.apache.tez.dag.app.dag.impl.Edge@3e6ba2db which is held by App Shared Pool - #1
> App Shared Pool - #1 is waiting to lock java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@18a7c819 which is held by Dispatcher thread {Central}
> Thread stacks
> App Shared Pool - #1 [WAITING]
>  sun.misc.Unsafe.park(native method)
>  java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>  java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>  java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>  org.apache.tez.dag.app.dag.impl.VertexImpl.getTotalTasks(VertexImpl.java:1098)
>  org.apache.tez.dag.app.dag.impl.Edge$EdgeManagerPluginContextImpl.getDestinationVertexNumTasks(Edge.java:99)
>  org.apache.tez.dag.app.dag.impl.Edge.routingToBegin(Edge.java:214)
>  org.apache.tez.dag.app.dag.impl.VertexImpl.setupEdgeRouting(VertexImpl.java:1447)
>  org.apache.tez.dag.app.dag.impl.VertexImpl.unsetTasksNotYetScheduled(VertexImpl.java:1453)
>  org.apache.tez.dag.app.dag.impl.VertexImpl.scheduleTasks(VertexImpl.java:1496)
>  org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerPluginContextImpl.scheduleTasks(VertexManager.java:216)
>  org.apache.tez.dag.library.vertexmanager.InputReadyVertexManager.handleSourceTaskFinished(InputReadyVertexManager.java:275)
>  org.apache.tez.dag.library.vertexmanager.InputReadyVertexManager.onSourceTaskCompleted(InputReadyVertexManager.java:196)
>  org.apache.tez.dag.library.vertexmanager.InputReadyVertexManager.trySchedulingPendingCompletions(InputReadyVertexManager.java:146)
>  org.apache.tez.dag.library.vertexmanager.InputReadyVertexManager.onVertexStarted(InputReadyVertexManager.java:187)
>  org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEventOnVertexStarted.invoke(VertexManager.java:578)
>  org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:647)
>  org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:642)
>  java.security.AccessController.doPrivileged(native method)
>  javax.security.auth.Subject.doAs(Subject.java:422)
>  org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>  org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent.call(VertexManager.java:642)
>  org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent.call(VertexManager.java:631)
>  java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  java.lang.Thread.<null>(unknown source)
> Dispatcher thread {Central} [BLOCKED; waiting to lock org.apache.tez.dag.app.dag.impl.Edge@3e6ba2db]
>  org.apache.tez.dag.app.dag.impl.Edge.getEdgeProperty(Edge.java:241)
>  org.apache.tez.dag.app.dag.impl.VertexImpl.logVertexConfigurationDoneEvent(VertexImpl.java:1886)
>  org.apache.tez.dag.app.dag.impl.VertexImpl.maybeSendConfiguredEvent(VertexImpl.java:3020)
>  org.apache.tez.dag.app.dag.impl.VertexImpl.startVertex(VertexImpl.java:3055)
>  org.apache.tez.dag.app.dag.impl.VertexImpl.access$4500(VertexImpl.java:204)
>  org.apache.tez.dag.app.dag.impl.VertexImpl$StartTransition.transition(VertexImpl.java:3007)
>  org.apache.tez.dag.app.dag.impl.VertexImpl$StartTransition.transition(VertexImpl.java:2996)
>  org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>  org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>  org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>  org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>  org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:59)
>  org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1799)
>  org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:203)
>  org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:2214)
>  org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:2200)
>  org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
>  org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114)
>  java.lang.Thread.<null>(unknown source)
> Frozen threads found (potential deadlock)
>  
> It seems that the following threads have not changed their stack for more than 10 seconds.
> These threads are possibly (but not necessarily!) in a deadlock or hung.
>  
> client DomainSocketWatcher <--- Frozen for at least 20m 33 sec
> org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(int, DomainSocketWatcher$FdSet) DomainSocketWatcher.java (native)
> org.apache.hadoop.net.unix.DomainSocketWatcher.access$900(int, DomainSocketWatcher$FdSet) DomainSocketWatcher.java:52
> org.apache.hadoop.net.unix.DomainSocketWatcher$2.run() DomainSocketWatcher.java:511
> java.lang.Thread.run() Thread.java:745
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
