[ 
https://issues.apache.org/jira/browse/SPARK-11928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Marscher updated SPARK-11928:
-------------------------------------
    Description: 
During hardening tests for our upgrade to Spark 1.5.2 we are seeing a deadlock 
in the master connection retry code in AppClient: 
https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L96

On a cursory analysis, the Runnable blocks at line 102 waiting on the rpcEnv 
for the masterRef. The Runnable never checks the current thread's interrupt 
status, so it appears to block for as long as the rpcEnv call does, ignoring 
the cancel(true) issued in registerWithMaster. The thread pool only has enough 
threads to run tryRegisterAllMasters once at a time.
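A minimal stand-alone sketch of that behavior in plain java.util.concurrent 
(illustration only, not Spark code): Future.cancel(true) merely sets the worker 
thread's interrupt flag, so a task that never checks the flag runs to 
completion anyway and keeps the pool's only thread occupied.

```java
import java.util.concurrent.*;

public class CancelIgnoredDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        // Like the registration Runnable, this task never checks the
        // interrupt flag while it is busy (here: a ~1s busy-wait standing
        // in for the blocking rpcEnv call).
        Future<?> f = pool.submit(() -> {
            long end = System.nanoTime() + 1_000_000_000L;
            while (System.nanoTime() < end) { /* ignores interruption */ }
        });
        Thread.sleep(100);  // let the task start
        f.cancel(true);     // only sets the interrupt flag; task keeps running
        System.out.println("cancelled=" + f.isCancelled());
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("terminated=" + pool.isTerminated());
    }
}
```

The future reports cancelled, but the worker thread is still tied up until the 
task finishes on its own.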

tryRegisterAllMasters is invoked from registerWithMaster, which by default 
retries every 20 seconds 
(https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L49), 
while the rpcEnv timeout defaults to 120 seconds 
(https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/util/RpcUtils.scala#L60). 
So once registerWithMaster recursively calls itself while the previous attempt 
is still blocked on the rpcEnv, it provokes a deadlock in 
tryRegisterAllMasters.
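The resulting failure mode can be reproduced in isolation with a pool shaped 
like appclient-register-master-threadpool appears in the thread dump below: one 
thread and a SynchronousQueue, i.e. no task queue. This is a hypothetical 
stand-alone sketch, not Spark code: while the only worker is blocked on the 
first attempt, the next submit has nowhere to go and is rejected, matching the 
RejectedExecutionException below.

```java
import java.util.concurrent.*;

public class RetryRejectDemo {
    public static void main(String[] args) throws Exception {
        // One thread, no queue capacity: a second submit while the first
        // task is still blocked can neither be queued nor get a new thread.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 60L, TimeUnit.SECONDS, new SynchronousQueue<Runnable>());
        CountDownLatch stuck = new CountDownLatch(1);
        // First registration attempt, blocked (stand-in for the rpcEnv wait).
        pool.submit(() -> { stuck.await(); return null; });
        try {
            pool.submit(() -> null);  // the 20s retry fires
            System.out.println("accepted");
        } catch (RejectedExecutionException e) {
            System.out.println("rejected");
        }
        stuck.countDown();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Because the SynchronousQueue has no capacity and the pool is already at its 
maximum size, ThreadPoolExecutor's default AbortPolicy throws, which is the 
uncaught exception the handler reports.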

Exception
{code}
15/11/23 15:42:19 INFO o.a.s.d.c.AppClient$ClientEndpoint: Connecting to master spark://ip-172-22-121-44:7077...
15/11/23 15:42:39 ERROR o.a.s.u.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[appclient-registration-retry-thread,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@1d96b2c9 rejected from java.util.concurrent.ThreadPoolExecutor@10b3b94c[Running, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 1]
        at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048) ~[na:1.7.0_91]
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821) [na:1.7.0_91]
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372) [na:1.7.0_91]
        at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110) ~[na:1.7.0_91]
        at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96) ~[org.apache.spark.spark-core_2.10-1.5.2.jar:1.5.2]
        at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:95) ~[org.apache.spark.spark-core_2.10-1.5.2.jar:1.5.2]
{code}

Thread dump excerpt
{code}
"appclient-registration-retry-thread" daemon prio=10 tid=0x00007f92dc00f800 nid=0x6939 waiting on condition [0x00007f927dada000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000f7e7aef8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1090)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

"appclient-register-master-threadpool-0" daemon prio=10 tid=0x00007f92dc004000 nid=0x6938 waiting on condition [0x00007f927dbdb000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000f7e7a078> (a java.util.concurrent.SynchronousQueue$TransferStack)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:359)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:942)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{code}

> Master retry deadlock
> ---------------------
>
>                 Key: SPARK-11928
>                 URL: https://issues.apache.org/jira/browse/SPARK-11928
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.2
>            Reporter: Richard Marscher
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
