[
https://issues.apache.org/jira/browse/SPARK-11928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-11928.
-------------------------------
Resolution: Cannot Reproduce
> Master retry deadlock
> ---------------------
>
> Key: SPARK-11928
> URL: https://issues.apache.org/jira/browse/SPARK-11928
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.2
> Reporter: Richard Marscher
>
> In our hardening testing during upgrade to Spark 1.5.2 we are noticing that
> there is a deadlock issue in the master connection retry code in AppClient:
> https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L96
> On a cursory analysis, the Runnable blocks at line 102 waiting on the rpcEnv
> for the masterRef. This Runnable is not checking for interruption on the
> current thread so it seems like it will continue to block as long as the
> rpcEnv is blocking and not respect the cancel(true) call in
> registerWithMaster. The thread pool only has enough threads to run
> tryRegisterAllMasters once at a time.
> It is being called from registerWithMaster which is going to retry every 20
> seconds by default
> (https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L49).
> Meanwhile the rpcEnv default timeout is 120 seconds by default
> (https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/util/RpcUtils.scala#L60).
> So once registerWithMaster recursively calls itself it will provoke a
> deadlock in tryRegisterAllMasters.
> Exception
> {code}
> 15/11/23 15:42:19 INFO o.a.s.d.c.AppClient$ClientEndpoint: Connecting to
> master spark://ip-172-22-121-44:7077...
> 15/11/23 15:42:39 ERROR o.a.s.u.SparkUncaughtExceptionHandler: Uncaught
> exception in thread Thread[appclient-registration-retry-thread,5,main]
> java.util.concurrent.RejectedExecutionException: Task
> java.util.concurrent.FutureTask@1d96b2c9 rejected from
> java.util.concurrent.ThreadPoolExecutor@10b3b94c[Running, pool size = 1,
> active threads = 0, queued tasks = 0, completed tasks = 1]
> at
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
> ~[na:1.7.0_91]
> at
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
> [na:1.7.0_91]
> at
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
> [na:1.7.0_91]
> at
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
> ~[na:1.7.0_91]
> at
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96)
> ~[org.apache.spark.spark-core_2.10-1.5.2.jar:1.5.2]
> at
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:95)
> ~[org.apache.spark.spark-core_2.10-1.5.2.jar:1.5.2]
> {code}
> Thread dump excerpt
> {code}
> "appclient-registration-retry-thread" daemon prio=10 tid=0x00007f92dc00f800
> nid=0x6939 waiting on condition [0x00007f927dada000]
> java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000000f7e7aef8> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1090)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
> at
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> "appclient-register-master-threadpool-0" daemon prio=10
> tid=0x00007f92dc004000 nid=0x6938 waiting on condition [0x00007f927dbdb000]
> java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000000f7e7a078> (a
> java.util.concurrent.SynchronousQueue$TransferStack)
> at
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
> at
> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
> at
> java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:359)
> at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:942)
> at
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]