Daniel Dai created TEZ-2310:
-------------------------------
Summary: AM Deadlock in VertexImpl
Key: TEZ-2310
URL: https://issues.apache.org/jira/browse/TEZ-2310
Project: Apache Tez
Issue Type: Bug
Reporter: Daniel Dai
Fix For: 0.7.0
I see the following deadlock in testing:
Thread #1:
{code}
Daemon Thread [App Shared Pool - #3] (Suspended)
owns: VertexManager$VertexManagerPluginContextImpl (id=327)
owns: ShuffleVertexManager (id=328)
owns: VertexManager (id=329)
waiting for: VertexManager$VertexManagerPluginContextImpl (id=326)
VertexManager$VertexManagerPluginContextImpl.onStateUpdated(VertexStateUpdate) line: 344
StateChangeNotifier$ListenerContainer.sendStateUpdate(VertexStateUpdate) line: 138
StateChangeNotifier$ListenerContainer.access$100(StateChangeNotifier$ListenerContainer, VertexStateUpdate) line: 122
StateChangeNotifier.sendStateUpdate(TezVertexID, VertexStateUpdate) line: 116
StateChangeNotifier.stateChanged(TezVertexID, VertexStateUpdate) line: 106
VertexImpl.maybeSendConfiguredEvent() line: 3385
VertexImpl.doneReconfiguringVertex() line: 1634
VertexManager$VertexManagerPluginContextImpl.doneReconfiguringVertex() line: 339
ShuffleVertexManager.schedulePendingTasks(int) line: 561
ShuffleVertexManager.schedulePendingTasks() line: 620
ShuffleVertexManager.handleVertexStateUpdate(VertexStateUpdate) line: 731
ShuffleVertexManager.onVertexStateUpdated(VertexStateUpdate) line: 744
VertexManager$VertexManagerEventOnVertexStateUpdate.invoke() line: 527
VertexManager$VertexManagerEvent$1.run() line: 612
VertexManager$VertexManagerEvent$1.run() line: 607
AccessController.doPrivileged(PrivilegedExceptionAction<T>, AccessControlContext) line: not available [native method]
Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 415
UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1548
VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call() line: 607
VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call() line: 596
ListenableFutureTask<V>(FutureTask<V>).run() line: 262
ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1145
ThreadPoolExecutor$Worker.run() line: 615
Thread.run() line: 745
{code}
Thread #2:
{code}
Daemon Thread [App Shared Pool - #2] (Suspended)
owns: VertexManager$VertexManagerPluginContextImpl (id=326)
owns: PigGraceShuffleVertexManager (id=344)
owns: VertexManager (id=345)
Unsafe.park(boolean, long) line: not available [native method]
LockSupport.park(Object) line: 186
ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).parkAndCheckInterrupt() line: 834
ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).doAcquireShared(int) line: 964
ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).acquireShared(int) line: 1282
ReentrantReadWriteLock$ReadLock.lock() line: 731
VertexImpl.getTotalTasks() line: 952
VertexManager$VertexManagerPluginContextImpl.getVertexNumTasks(String) line: 162
PigGraceShuffleVertexManager(ShuffleVertexManager).updateSourceTaskCount() line: 435
PigGraceShuffleVertexManager(ShuffleVertexManager).onVertexStarted(Map<String,List<Integer>>) line: 353
VertexManager$VertexManagerEventOnVertexStarted.invoke() line: 541
VertexManager$VertexManagerEvent$1.run() line: 612
VertexManager$VertexManagerEvent$1.run() line: 607
AccessController.doPrivileged(PrivilegedExceptionAction<T>, AccessControlContext) line: not available [native method]
Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 415
UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1548
VertexManager$VertexManagerEventOnVertexStarted(VertexManager$VertexManagerEvent).call() line: 607
VertexManager$VertexManagerEventOnVertexStarted(VertexManager$VertexManagerEvent).call() line: 596
ListenableFutureTask<V>(FutureTask<V>).run() line: 262
ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1145
ThreadPoolExecutor$Worker.run() line: 615
Thread.run() line: 745
{code}
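What happens is that thread #1, running under
ShuffleVertexManager.onVertexStateUpdated, holds a writeLock (VertexImpl:1628)
and then tries to enter a synchronized block; per its stack it is waiting on
VertexManager$VertexManagerPluginContextImpl (id=326) in onStateUpdated. In the
meantime, thread #2 is already inside a synchronized block
(ShuffleVertexManager.onVertexStarted), owns that same monitor (id=326), and is
trying to get a readLock (VertexImpl:952). Each thread is waiting for something
the other holds. Holding a lock and then entering a synchronized block can be
dangerous.
I have attached a patch that avoids this pattern, and with it the deadlock goes
away. I am not sure whether this is the right fix, or whether there are other
places with the same pattern.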
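To make the lock ordering concrete, here is a minimal sketch of one way to avoid the pattern: update state under the write lock, release it, and only then call into synchronized listener code. This is not the attached patch and not Tez code. The classes Vertex and Manager, their methods, and the iteration counts are all hypothetical stand-ins chosen for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-ins for illustration; none of these are the real Tez classes.
class LockOrderSketch {

    /** Plays the role of a vertex whose state is guarded by a read/write lock. */
    static class Vertex {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private int totalTasks = 4;
        private volatile Manager listener;

        void register(Manager m) { listener = m; }

        int getTotalTasks() {
            lock.readLock().lock();
            try {
                return totalTasks;
            } finally {
                lock.readLock().unlock();
            }
        }

        void doneReconfiguring(int newTotal) {
            lock.writeLock().lock();
            try {
                totalTasks = newTotal;
            } finally {
                lock.writeLock().unlock();
            }
            // Notify only after the write lock is released, so this thread never
            // waits for a monitor while holding the write lock.
            Manager m = listener;
            if (m != null) {
                m.onStateUpdated();
            }
        }
    }

    /** Plays the role of a vertex manager whose callbacks are synchronized. */
    static class Manager {
        private final Vertex source;
        private int seenTasks = -1;

        Manager(Vertex source) { this.source = source; }

        // Both callbacks take the monitor first and the read lock second.
        synchronized void onStateUpdated() { seenTasks = source.getTotalTasks(); }
        synchronized void onVertexStarted() { seenTasks = source.getTotalTasks(); }
        synchronized int seenTasks() { return seenTasks; }
    }

    /** Runs both code paths concurrently; returns true if both threads finish. */
    static boolean runDemo() {
        Vertex vertex = new Vertex();
        Manager manager = new Manager(vertex);
        vertex.register(manager);
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) vertex.doneReconfiguring(i);
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) manager.onVertexStarted();
        });
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();
        try {
            t1.join(TimeUnit.SECONDS.toMillis(10));
            t2.join(TimeUnit.SECONDS.toMillis(10));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !t1.isAlive() && !t2.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("finished without deadlock: " + runDemo());
    }
}
```

In this sketch every thread acquires the monitor before the read lock, and nothing waits for a monitor while holding the write lock, so the cycle shown in the two stacks cannot form. The trade-off is that listeners are notified slightly after the state change rather than atomically with it, which may or may not be acceptable in the real code.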