[ https://issues.apache.org/jira/browse/TEZ-3193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220835#comment-15220835 ]

Jason Lowe commented on TEZ-3193:
---------------------------------

Here are the relevant portions of the AM stacktrace when it deadlocks:
{noformat}
"TaskSchedulerAppCaller #0" #106 daemon prio=5 os_prio=0 tid=0x00007fb1cc1bb800 nid=0x4619 waiting on condition [0x00007fb1b6509000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000c5d8e398> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
        at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
        at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.getState(TaskAttemptImpl.java:630)
        at org.apache.tez.dag.app.dag.impl.TaskImpl.selectBestAttempt(TaskImpl.java:745)
        at org.apache.tez.dag.app.dag.impl.TaskImpl.getProgress(TaskImpl.java:483)
        at org.apache.tez.dag.app.dag.impl.VertexImpl.computeProgress(VertexImpl.java:1285)
        at org.apache.tez.dag.app.dag.impl.VertexImpl.getProgress(VertexImpl.java:1195)
        at org.apache.tez.dag.app.dag.impl.DAGImpl.getProgress(DAGImpl.java:829)
        at org.apache.tez.dag.app.DAGAppMaster.getProgress(DAGAppMaster.java:1181)
        at org.apache.tez.dag.app.rm.TaskSchedulerEventHandler.getProgress(TaskSchedulerEventHandler.java:560)
        at org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper$GetProgressCallable.call(TaskSchedulerAppCallbackWrapper.java:291)
        at org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper$GetProgressCallable.call(TaskSchedulerAppCallbackWrapper.java:282)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

"IPC Server handler 4 on 52743" #64 daemon prio=5 os_prio=0 tid=0x00007fb1c454c800 nid=0x45ca waiting on condition [0x00007fb1b920e000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000c1421810> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
        at org.apache.tez.dag.app.dag.impl.TaskImpl.canCommit(TaskImpl.java:768)
        at org.apache.tez.dag.app.TaskAttemptListenerImpTezDag.canCommit(TaskAttemptListenerImpTezDag.java:274)
        at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at org.apache.hadoop.ipc.WritableRpcEngine$Server$WritableRpcInvoker.call(WritableRpcEngine.java:514)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2096)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2092)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2090)

"Dispatcher thread {Central}" #37 prio=5 os_prio=0 tid=0x00007fb1c422f000 nid=0x45aa waiting on condition [0x00007fb1ba722000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000c1421810> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
        at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
        at org.apache.tez.dag.app.dag.impl.TaskImpl.getTaskLocationHint(TaskImpl.java:541)
        at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.getTaskLocationHint(TaskAttemptImpl.java:1068)
        at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.access$100(TaskAttemptImpl.java:116)
        at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl$StartedTransition.transition(TaskAttemptImpl.java:1349)
        at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl$StartedTransition.transition(TaskAttemptImpl.java:1308)
        at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        - locked <0x00000000c5d8e600> (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
        at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:786)
        at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:116)
        at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:2097)
        at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:2082)
        at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
        at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114)
        at java.lang.Thread.run(Thread.java:745)
{noformat}

This is the sequence of events that I believe leads to the deadlock:
# The attempt's started transition grabs the write lock on the TaskAttemptImpl and starts processing.
# Progress is being requested on the task, so another thread grabs the read lock on the TaskImpl and tries to get the attempt state. Since the attempt write lock is held, it blocks.
# The task attempt runs very fast and/or the AM is running slowly, so the attempt heartbeats in requesting a commit go/no-go. That thread blocks trying to grab the TaskImpl write lock, since the progress thread holds the read lock.
# The TaskAttemptImpl processing proceeds and eventually calls getTaskLocationHint on the task. Its read-lock acquisition blocks as well, because the lock queues new readers behind the waiting writer to avoid starving it. All three threads are now waiting on locks that will never be released.
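Assuming the locks involved are plain ReentrantReadWriteLocks in non-fair mode (as the NonfairSync frames suggest), the cycle can be reproduced in isolation. The sketch below uses made-up class and thread names, not the real Tez code, and timed tryLock calls in place of the blocking lock() calls so it terminates and reports the three stalled acquisitions instead of hanging like the AM:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class TezDeadlockSketch {
    static final long SHORT_MS = 500;   // the two read acquisitions give up first
    static final long LONG_MS = 1500;   // the queued writer waits longest

    public static List<String> demo() {
        ReentrantReadWriteLock taskLock = new ReentrantReadWriteLock();
        ReentrantReadWriteLock attemptLock = new ReentrantReadWriteLock();
        List<String> blocked = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch held = new CountDownLatch(2);   // both base locks taken
        CountDownLatch tried = new CountDownLatch(3);  // all acquisitions attempted

        // Dispatcher: holds the attempt write lock (StartedTransition), then wants
        // the task read lock (getTaskLocationHint). Only a read lock is held on the
        // task, but the queued canCommit writer makes new readers wait, so this
        // read acquisition blocks too.
        Thread dispatcher = new Thread(() -> {
            attemptLock.writeLock().lock();
            try {
                held.countDown();
                held.await();
                Thread.sleep(300);  // give the canCommit writer time to queue up first
                if (!taskLock.readLock().tryLock(SHORT_MS, TimeUnit.MILLISECONDS)) {
                    blocked.add("dispatcher blocked on task read lock");
                } else {
                    taskLock.readLock().unlock();  // only under pathological scheduling
                }
                tried.countDown();
                tried.await();      // hold the write lock until everyone has tried
            } catch (InterruptedException ignored) {
            } finally {
                attemptLock.writeLock().unlock();
            }
        });

        // Progress poll: holds the task read lock (TaskImpl.getProgress), then wants
        // the attempt read lock (TaskAttemptImpl.getState), held for write above.
        Thread progress = new Thread(() -> {
            taskLock.readLock().lock();
            try {
                held.countDown();
                held.await();
                if (!attemptLock.readLock().tryLock(SHORT_MS, TimeUnit.MILLISECONDS)) {
                    blocked.add("progress blocked on attempt read lock");
                } else {
                    attemptLock.readLock().unlock();
                }
                tried.countDown();
                tried.await();
            } catch (InterruptedException ignored) {
            } finally {
                taskLock.readLock().unlock();
            }
        });

        // IPC handler: wants the task write lock (TaskImpl.canCommit). It can never
        // get it while the progress thread holds the read lock, and its queued
        // request is what stalls the dispatcher's read acquisition.
        Thread ipc = new Thread(() -> {
            try {
                held.await();
                if (!taskLock.writeLock().tryLock(LONG_MS, TimeUnit.MILLISECONDS)) {
                    blocked.add("canCommit blocked on task write lock");
                }
                tried.countDown();
            } catch (InterruptedException ignored) {
            }
        });

        dispatcher.start(); progress.start(); ipc.start();
        try {
            dispatcher.join(); progress.join(); ipc.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return blocked;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Note the dispatcher's stall is the subtle one: a read lock normally coexists with other readers, but the non-fair sync parks a new reader whenever the head of the wait queue is a writer, which is exactly the anti-starvation behavior step 4 runs into.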

It looks like one potential fix would be to have TaskImpl#getTaskLocationHint and TaskImpl#getBaseTaskSpec avoid grabbing the task read lock, simply marking those task fields volatile so the TaskAttempt can safely call them without taking any lock. I didn't see any other invocations of TaskImpl methods from TaskAttemptImpl that could cause reverse lock-ordering issues.
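A rough sketch of what that fix would look like; the class, field, and type names below are hypothetical stand-ins, not the actual TaskImpl members:

```java
public class VolatileHintSketch {

    // Stand-in for Tez's TaskLocationHint.
    static class TaskLocationHint {
        final String host;
        TaskLocationHint(String host) { this.host = host; }
    }

    // Stand-in for TaskImpl.
    static class TaskSketch {
        // volatile gives readers visibility of the latest write without taking
        // the task read lock, removing the attempt -> task lock acquisition
        // that closes the cycle at step 4.
        private volatile TaskLocationHint locationHint;

        // Writers (running under the task write lock) still assign normally.
        void setLocationHint(TaskLocationHint hint) {
            this.locationHint = hint;
        }

        // No readLock().lock() here: safe for TaskAttemptImpl to call while it
        // holds its own write lock.
        TaskLocationHint getTaskLocationHint() {
            return locationHint;
        }
    }

    public static void main(String[] args) {
        TaskSketch task = new TaskSketch();
        task.setLocationHint(new TaskLocationHint("host1"));
        System.out.println(task.getTaskLocationHint().host);
    }
}
```

This trades a stronger consistency guarantee for deadlock freedom: a volatile read sees the most recent completed write but is not atomic with respect to any other task state, which is fine here since the hint is effectively set once.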


> Deadlock in AM during task commit request
> -----------------------------------------
>
>                 Key: TEZ-3193
>                 URL: https://issues.apache.org/jira/browse/TEZ-3193
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.1, 0.8.2
>            Reporter: Jason Lowe
>            Priority: Blocker
>
> The AM can deadlock between TaskImpl and TaskAttemptImpl.  Stacktrace and 
> details in a followup comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
