Richard Marscher created SPARK-11928:
----------------------------------------
Summary: Master retry deadlock
Key: SPARK-11928
URL: https://issues.apache.org/jira/browse/SPARK-11928
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.5.2
Reporter: Richard Marscher
In our hardening testing during upgrade to Spark 1.5.2 we are noticing that
there is a deadlock issue in the master connection retry code in AppClient:
https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L96
On a cursory analysis, the Runnable blocks at line 102 waiting on the rpcEnv
for the masterRef. This Runnable is not checking for interruption on the
current thread so it seems like it will continue to block as long as the rpcEnv
is blocking and not respect the cancel(true) call in registerWithMaster. The
thread pool only has enough threads to run tryRegisterAllMasters once at a time.
It is being called from registerWithMaster which is going to retry every 20
seconds by default
(https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L49).
Meanwhile the rpcEnv default timeout is 120 seconds by default
(https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/util/RpcUtils.scala#L60).
So once registerWithMaster recursively calls itself it will provoke a deadlock
in tryRegisterAllMasters.
{code}
15/11/23 15:42:19 INFO o.a.s.d.c.AppClient$ClientEndpoint: Connecting to master
spark://ip-172-22-121-44:7077...
15/11/23 15:42:39 ERROR o.a.s.u.SparkUncaughtExceptionHandler: Uncaught
exception in thread Thread[appclient-registration-retry-thread,5,main]
java.util.concurrent.RejectedExecutionException: Task
java.util.concurrent.FutureTask@1d96b2c9 rejected from
java.util.concurrent.ThreadPoolExecutor@10b3b94c[Running, pool size = 1, active
threads = 0, queued tasks = 0, completed tasks = 1]
at
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
~[na:1.7.0_91]
at
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
[na:1.7.0_91]
at
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
[na:1.7.0_91]
at
java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
~[na:1.7.0_91]
at
org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96)
~[org.apache.spark.spark-core_2.10-1.5.2.jar:1.5.2]
at
org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:95)
~[org.apache.spark.spark-core_2.10-1.5.2.jar:1.5.2]
{code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]