[ https://issues.apache.org/jira/browse/SPARK-29475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikita Gorbachevski updated SPARK-29475:
----------------------------------------
    Description: 
My spark-streaming application runs in YARN cluster mode with 
``spark.streaming.concurrentJobs`` set to 50. At one point I observed that lots 
of batches were scheduled but the application did not make any progress.
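For reference, the relevant setup looks roughly like this (app name, batch interval and the rest of the configuration are placeholders, not my exact code). With ``spark.streaming.concurrentJobs`` at 50, up to 50 batch jobs can run in parallel, so up to 50 streaming-job-executor-* threads can be blocked in ``runJob`` at the same time.
{code:java}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical reconstruction of the relevant settings, not the exact config.
val conf = new SparkConf()
  .setAppName("normalize")                        // placeholder name
  .set("spark.streaming.concurrentJobs", "50")    // up to 50 concurrent batch jobs
val ssc = new StreamingContext(conf, Seconds(5))  // batch interval is a placeholder
{code}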

A thread dump showed that all the streaming threads were blocked, waiting 
indefinitely for a result from the executors on 
``ThreadUtils.awaitReady(waiter.completionFuture, *Duration.Inf*)``. 
{code:java}
"streaming-job-executor-11" - Thread t@324
   java.lang.Thread.State: WAITING
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <7fd76f11> (a scala.concurrent.impl.Promise$CompletionLatch)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
        at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:222)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:633)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099){code}
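The park itself is nothing Spark-specific; it is the plain Scala pattern of awaiting a promise with an infinite timeout. A minimal illustration (my own sketch, not Spark code) of why the thread can never recover on its own if nothing completes the promise:
{code:java}
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object InfiniteAwaitDemo extends App {
  // Stands in for JobWaiter.completionFuture inside DAGScheduler.runJob.
  val completion = Promise[Unit]()

  // Complete the promise from another thread after 2s. Remove this and the
  // Await.ready below parks forever, matching the WAITING state in the dump.
  Future { Thread.sleep(2000); completion.success(()) }

  // Duration.Inf: the only way out is for someone to complete the promise.
  // If the driver never learns that the executor running the job has died,
  // nothing ever does.
  Await.ready(completion.future, Duration.Inf)
  println("unblocked only because the promise was completed")
}
{code}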
My tasks are short-running and pretty simple, e.g. read raw data from Kafka, 
normalize it and write it back to Kafka.
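Roughly, each job looks like this (a simplified sketch with made-up topic names, brokers and a trivial normalize step, reusing the ``ssc`` from the snippet above; not my actual code):
{code:java}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "kafka:9092",            // placeholder broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "normalizer")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("raw-events"), kafkaParams))

stream
  .map(record => record.value.trim.toLowerCase)    // trivial "normalize" step
  .foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      // One short-lived producer per partition, just for the sketch.
      val props = new Properties()
      props.put("bootstrap.servers", "kafka:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      partition.foreach(v => producer.send(new ProducerRecord[String, String]("normalized-events", v)))
      producer.close()
    }
  }

ssc.start()
ssc.awaitTermination()
{code}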

After analyzing the logs I noticed that there were lots of ``RpcTimeoutException`` 
errors, on both the executor side
{code:java}
driver-heartbeater WARN executor.Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
        at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
{code}
and the driver side during context cleanup.
{code:java}
WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from bi-prod-hadoop-17.va2:25109 in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
        at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
{code}
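For reference, these are the timeouts involved (the values below match the documented defaults as far as I can tell). Raising them is only something I considered as a mitigation while an executor is still alive but GC-bound; it cannot help once YARN has actually killed the container.
{code:java}
// Illustrative values only, not a recommendation.
val tunedConf = new org.apache.spark.SparkConf()
  .set("spark.executor.heartbeatInterval", "10s") // executor -> driver heartbeat timeout
  .set("spark.network.timeout", "120s")           // umbrella network timeout
  .set("spark.rpc.askTimeout", "120s")            // driver-side ask timeout (falls back to network timeout)
{code}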
Also, the only error on the executors was
{code:java}
SIGTERM handler ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
{code}
with no exceptions at all.

After digging into the YARN logs I noticed that the executors had been killed by 
the AM because of high memory usage.
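(For completeness: the usual knob for this on YARN is the executor memory overhead; the values below are just an illustration, not something I have verified fixes my case.)
{code:java}
// Illustrative only: give the container more off-heap headroom so YARN does not
// kill it. spark.executor.memoryOverhead defaults to max(384m, 10% of executor
// memory) in 2.3, as far as I know.
val memConf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "4g")          // placeholder
  .set("spark.executor.memoryOverhead", "1g")  // placeholder headroom
{code}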

Also, there were no logs on the driver side saying that executors were lost, so 
it seems the driver wasn't notified about this.

Unfortunately I can't find the line of code in CoarseGrainedExecutorBackend which 
logs such a message. The ``exitExecutor`` method looks similar, but its message 
should look different.

However, I believe the driver is notified that an executor has been lost via an 
asynchronous RPC call, and if the executor is having RPC trouble because of high 
GC pressure, that message is never sent.
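To illustrate the pattern I suspect (my own sketch with hypothetical names, not Spark's actual code): if the "executor lost" notification is a single asynchronous send, any RPC failure on a GC-thrashed, dying executor silently drops it and the driver keeps waiting forever.
{code:java}
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

// Hypothetical sketch, not CoarseGrainedExecutorBackend: the shutdown path
// fires one async notification and then exits. If the RPC layer is already
// timing out because the JVM is buried in GC, the failure callback may never
// run before the process dies, and the driver is never told.
def notifyDriverAndExit(sendToDriver: String => Future[Unit])
                       (implicit ec: ExecutionContext): Unit = {
  sendToDriver("executor lost").onComplete {
    case Success(_) => // driver acknowledged, safe to exit
    case Failure(_) => // send failed; nobody retries, the driver still waits on the job
  }
  sys.exit(1) // the JVM may be gone before the send ever completes
}
{code}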

 


> Executors aren't marked as dead on OutOfMemoryError in YARN mode
> ----------------------------------------------------------------
>
>                 Key: SPARK-29475
>                 URL: https://issues.apache.org/jira/browse/SPARK-29475
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams, Spark Core, YARN
>    Affects Versions: 2.3.0
>            Reporter: Nikita Gorbachevski
>            Priority: Major


