[ 
https://issues.apache.org/jira/browse/SPARK-29475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikita Gorbachevski updated SPARK-29475:
----------------------------------------
    Summary: Executors aren't marked as lost on OutOfMemoryError in YARN mode  
(was: Executors aren't marked as dead on OutOfMemoryError in YARN mode)

> Executors aren't marked as lost on OutOfMemoryError in YARN mode
> ----------------------------------------------------------------
>
>                 Key: SPARK-29475
>                 URL: https://issues.apache.org/jira/browse/SPARK-29475
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams, Spark Core, YARN
>    Affects Versions: 2.3.0
>            Reporter: Nikita Gorbachevski
>            Priority: Major
>
> My Spark Streaming application runs in YARN cluster mode with 
> ``spark.streaming.concurrentJobs`` set to 50. At some point I observed that lots of 
> batches were scheduled but the application did not make any progress.
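> For reference, the relevant part of the configuration is roughly the following 
> (the app name and batch interval here are illustrative, not the real values):
> {code:scala}
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming.{Seconds, StreamingContext}
> 
> val conf = new SparkConf()
>   .setAppName("streaming-app")                  // illustrative name
>   .set("spark.streaming.concurrentJobs", "50")  // run up to 50 batch jobs in parallel
> 
> val ssc = new StreamingContext(conf, Seconds(10)) // illustrative batch interval
> {code}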
> A thread dump showed that all the streaming threads were blocked, waiting 
> indefinitely for a result from an executor in 
> ``ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)``. 
> {code:java}
> "streaming-job-executor-11" - Thread t@324
>    java.lang.Thread.State: WAITING
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <7fd76f11> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>         at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
>         at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>         at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
>         at 
> org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:222)
>         at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:633)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099){code}
> My tasks are short-running and pretty simple, e.g. they read raw data from Kafka, 
> normalize it and write it back to Kafka.
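> A simplified sketch of such a job is shown below (topic names, Kafka settings and 
> the normalize function are illustrative, not the real ones):
> {code:scala}
> import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
> import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}
> import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
> 
> def normalize(raw: String): String = raw.trim.toLowerCase  // stand-in for the real logic
> 
> val kafkaParams = Map[String, Object](
>   "bootstrap.servers"  -> "kafka:9092",                    // illustrative
>   "key.deserializer"   -> classOf[StringDeserializer],
>   "value.deserializer" -> classOf[StringDeserializer],
>   "group.id"           -> "streaming-app")
> 
> val stream = KafkaUtils.createDirectStream[String, String](
>   ssc,
>   LocationStrategies.PreferConsistent,
>   ConsumerStrategies.Subscribe[String, String](Seq("raw-events"), kafkaParams))
> 
> stream.map(record => normalize(record.value)).foreachRDD { rdd =>
>   rdd.foreachPartition { partition =>
>     val props = new java.util.Properties()
>     props.put("bootstrap.servers", "kafka:9092")           // illustrative
>     props.put("key.serializer", classOf[StringSerializer].getName)
>     props.put("value.serializer", classOf[StringSerializer].getName)
>     val producer = new KafkaProducer[String, String](props)
>     partition.foreach(v => producer.send(new ProducerRecord[String, String]("normalized-events", v)))
>     producer.close()
>   }
> }
> {code}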
> After analyzing the logs I noticed lots of ``RpcTimeoutException`` errors on both 
> the executor side
> {code:java}
> driver-heartbeater WARN executor.Executor: Issue communicating with driver in 
> heartbeater
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 
> seconds]. This timeout is controlled by spark.executor.heartbeatInterval
>       at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
> {code}
> and the driver side, during context cleaning.
> {code:java}
> WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive 
> any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
> WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive 
> any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
> org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from 
> bi-prod-hadoop-17.va2:25109 in 120 seconds. This timeout is controlled by 
> spark.rpc.askTimeout
>       at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
> {code}
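> Both timeouts are standard Spark settings; for illustration, they could be raised 
> like this (the values are only examples, and raising them would mask rather than 
> fix the underlying problem):
> {code:scala}
> // spark.executor.heartbeatInterval defaults to 10s;
> // spark.rpc.askTimeout falls back to spark.network.timeout (120s by default).
> val conf = new org.apache.spark.SparkConf()
>   .set("spark.executor.heartbeatInterval", "30s")
>   .set("spark.network.timeout", "600s")
> {code}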
> Also, the only error on the executors was
> {code:java}
> SIGTERM handler ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 
> TERM
> {code}
> i.e. there were no exceptions at all.
> After digging into the YARN logs I noticed that the executors were killed by the 
> AM because of high memory usage. 
> There were also no logs on the driver side saying that the executors were lost, 
> so it seems the driver was never notified about this.
> Unfortunately I can't find the line of code in CoarseGrainedExecutorBackend 
> which logs such a message. The ``exitExecutor`` method looks similar, but its 
> message should look different.
> However, I believe the driver is notified that an executor is lost via an async 
> RPC call, and if the executor runs into RPC issues because of high GC pressure 
> this message is never sent.
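> Roughly, the pattern I suspect looks like the sketch below (the names are 
> illustrative, not Spark's actual classes): a one-way, best-effort send that never 
> reaches the driver if the executor JVM is killed or its RPC calls are already 
> timing out.
> {code:scala}
> // Illustrative sketch only, with hypothetical names (not Spark's real code).
> final case class ExecutorLost(executorId: String, reason: String)
> 
> trait DriverEndpoint {
>   def send(msg: ExecutorLost): Unit  // one-way send: no ack, no retry
> }
> 
> def exitExecutor(driver: Option[DriverEndpoint], executorId: String, reason: String): Unit = {
>   // Best-effort notification: under heavy GC pressure or after YARN kills the
>   // container, this message may never be delivered, so the driver keeps treating
>   // the executor as alive and tasks block in awaitReady forever.
>   driver.foreach(_.send(ExecutorLost(executorId, reason)))
>   sys.exit(1)
> }
> {code}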
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
