Nikita Gorbachevski updated SPARK-29475:
----------------------------------------
    Summary: Executors aren't marked as lost on OutOfMemoryError in YARN mode  (was: Executors aren't marked as dead on OutOfMemoryError in YARN mode)

> Executors aren't marked as lost on OutOfMemoryError in YARN mode
> ----------------------------------------------------------------
>
>                 Key: SPARK-29475
>                 URL: https://issues.apache.org/jira/browse/SPARK-29475
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams, Spark Core, YARN
>    Affects Versions: 2.3.0
>            Reporter: Nikita Gorbachevski
>            Priority: Major
>
> My spark-streaming application runs in YARN cluster mode with ``spark.streaming.concurrentJobs`` set to 50. At one point I observed that lots of batches were scheduled but the application did not make any progress.
> A thread dump showed that all the streaming threads were blocked, waiting indefinitely for a result from an executor in ``ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)``.
> {code:java}
> "streaming-job-executor-11" - Thread t@324
>    java.lang.Thread.State: WAITING
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for <7fd76f11> (a scala.concurrent.impl.Promise$CompletionLatch)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>     at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
>     at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:222)
>     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:633)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
> {code}
> My tasks are short-running and pretty simple, e.g. read raw data from Kafka, normalize it, and write it back to Kafka.
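For context, a minimal sketch of the kind of job described above, written against the spark-streaming-kafka-0-10 API. The broker address, topic names, group id, batch interval and the "normalize" step are placeholders for illustration, not details taken from the report.

{code:scala}
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object NormalizeJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kafka-normalize")
      // The setting from the report: allow up to 50 concurrent streaming jobs.
      .set("spark.streaming.concurrentJobs", "50")
    val ssc = new StreamingContext(conf, Seconds(10))   // placeholder batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",             // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "normalize-job")                    // placeholder

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("raw-events"), kafkaParams))

    stream.map(record => record.value.trim.toLowerCase) // stand-in for the normalize step
      .foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          // One producer per partition; a production job would cache/pool these.
          val props = new Properties()
          props.put("bootstrap.servers", "broker:9092") // placeholder
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          val producer = new KafkaProducer[String, String](props)
          partition.foreach(v => producer.send(new ProducerRecord[String, String]("normalized-events", v)))
          producer.close()
        }
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}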
> After analyzing the logs I noticed that there were lots of ``RpcTimeoutException``s on both the executor side
> {code:java}
> driver-heartbeater WARN executor.Executor: Issue communicating with driver in heartbeater
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
>     at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
> {code}
> and the driver side during context clearing.
> {code:java}
> WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
> WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
> org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from bi-prod-hadoop-17.va2:25109 in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
>     at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
> {code}
> The only error on the executors was
> {code:java}
> SIGTERM handler ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> {code}
> with no exceptions at all.
> After digging into the YARN logs I noticed that the executors had been killed by the AM because of high memory usage.
> There were also no logs on the driver side saying that executors were lost, so it seems the driver was never notified about this.
> Unfortunately I can't find the line of code in CoarseGrainedExecutorBackend which logs such a message. The ``exitExecutor`` method looks similar, but its message should look different.
> However, I believe the driver is notified that an executor is lost via an async RPC call, and if the executor runs into RPC issues because of high GC pressure this message won't be sent.
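To make that hypothesis concrete, here is a deliberately simplified, hypothetical sketch of the suspected failure mode. It is not Spark's actual CoarseGrainedExecutorBackend or RPC code; the names ExecutorExitSketch, sendToDriver and exitAfterOom are invented for illustration. The only point is that a fire-and-forget async send gives no guarantee the driver ever learns the executor is gone if the sending JVM is killed or stalled by GC first.

{code:scala}
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical stand-ins, not Spark classes.
object ExecutorExitSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Pretend async RPC send to the driver; returns before anything is on the wire.
  def sendToDriver(msg: String): Future[Unit] = Future {
    println(s"driver received: $msg")
  }

  // Fire-and-forget notification followed by an immediate exit. If the process
  // is killed by YARN (SIGTERM) or stalled in a long GC pause before the send
  // actually happens, the driver never sees the message and the streaming
  // threads keep waiting in ThreadUtils.awaitReady.
  def exitAfterOom(reason: String): Unit = {
    sendToDriver(s"RemoveExecutor: $reason") // nothing awaits, acks or retries this
    sys.exit(1)
  }
}
{code}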