[
https://issues.apache.org/jira/browse/SPARK-35625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Liran updated SPARK-35625:
--------------------------
Attachment: image-2021-06-03-12-16-12-573.png
> Spark on k8s zombie executors
> -----------------------------
>
> Key: SPARK-35625
> URL: https://issues.apache.org/jira/browse/SPARK-35625
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 3.0.1
> Reporter: Liran
> Priority: Major
> Attachments: image-2021-06-03-12-15-57-095.png,
> image-2021-06-03-12-16-03-621.png, image-2021-06-03-12-16-12-573.png
>
>
> We are running a POC of a Spark on K8s setup for one of our apps. The app
> scales up and down quite a lot, and after a while we started noticing many
> log entries like these:
> {code:java}
> Error trying to remove broadcast 8805 from block manager BlockManagerId(79,
> 10.244.248.23, 46681, None)
> java.io.IOException: Failed to send RPC RPC 6006709312311899870 to
> /10.244.248.23:54004: io.netty.channel.StacklessClosedChannelException
> {code}
> {code:java}
> Error trying to remove RDD 32952 from block manager BlockManagerId(79,
> 10.244.248.23, 46681, None)
> java.io.IOException: Failed to send RPC RPC 7506603739599355778 to
> /10.244.248.23:54004: io.netty.channel.StacklessClosedChannelException
> {code}
>
> All the errors/warnings are related to trying to *remove* (shuffle/broadcast/RDD)
> files/blocks, which doesn't seem to be harmful at this point, other than
> spamming our logs.
>
> The interesting part is that in kubectl the executors don't appear to be
> alive (as expected), whereas in the Spark UI they do show up as "active"
> with 0 cores:
>
> !image-2021-06-03-12-13-20-140.png!
>
> !image-2021-06-03-12-00-04-471.png!
>
> All the executors marked above are long dead, but for some reason the driver
> app still tries to send RPC requests to them.
>
> According to our event logs, one of the pods was created on May 21 at 20:11
> and was killed 9 minutes later at 20:20, but we are still seeing new log
> entries for it on Jun 3.
> !image-2021-06-03-12-07-29-032.png!
>
>
> Sample of one of the errors:
> {code:java}
> Error trying to remove RDD 33178 from block manager BlockManagerId(79, 10.244.248.23, 46681, None)
> java.io.IOException: Failed to send RPC RPC 7684271332363250835 to /10.244.248.23:54004: io.netty.channel.StacklessClosedChannelException
>     at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:363)
>     at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:340)
>     at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
>     at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
>     at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
>     at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
>     at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
>     at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>     at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:998)
>     at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:866)
>     at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>     at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>     at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>     at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>     at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: io.netty.channel.StacklessClosedChannelException
>     at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
> {code}
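
The mismatch described in the report (executors gone from kubectl but still "active" in the Spark UI) could be checked mechanically along the following lines. This is only a rough sketch, not part of the original report. It assumes the executor list has already been fetched from Spark's monitoring REST API (`/api/v1/applications/[app-id]/executors`) and that its `id`, `hostPort`, and `isActive` fields are used as shown. The helper name `find_zombie_executors`, the driver and executor 80 entries, and the list of live pod IPs are hypothetical sample data; only executor 79 and its address come from the logs above.

```python
from typing import Iterable, List, Mapping


def find_zombie_executors(
    executors: Iterable[Mapping[str, object]],
    live_pod_ips: Iterable[str],
) -> List[str]:
    """Return IDs of executors the driver reports as active whose IP
    does not belong to any currently running pod."""
    live = set(live_pod_ips)
    zombies = []
    for ex in executors:
        # The driver appears in the executor list too; skip it.
        if ex.get("id") == "driver":
            continue
        # Only executors the driver still believes are active are of interest.
        if not ex.get("isActive", False):
            continue
        # hostPort is assumed to look like "ip:port"; keep only the IP part.
        host = str(ex.get("hostPort", "")).rsplit(":", 1)[0]
        if host not in live:
            zombies.append(str(ex["id"]))
    return zombies


# Hypothetical sample data. Executor 79 mirrors the BlockManagerId in the
# logs above; the other entries and the pod IPs are made up for illustration.
sample_executors = [
    {"id": "driver", "hostPort": "10.244.0.5:7079", "isActive": True, "totalCores": 0},
    {"id": "79", "hostPort": "10.244.248.23:46681", "isActive": True, "totalCores": 0},
    {"id": "80", "hostPort": "10.244.1.7:40001", "isActive": True, "totalCores": 4},
]
live_pod_ips = ["10.244.0.5", "10.244.1.7"]

print(find_zombie_executors(sample_executors, live_pod_ips))
```

In practice the pod IPs would come from the cluster itself (for example the output of `kubectl get pods -o wide`), and a pod IP being reused by a newer pod would hide a zombie from this check, so it is a first-pass diagnostic rather than a definitive test.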