[ 
https://issues.apache.org/jira/browse/SPARK-35625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liran updated SPARK-35625:
--------------------------
    Attachment: image-2021-06-03-12-16-12-573.png

> Spark on k8s zombie executors
> -----------------------------
>
>                 Key: SPARK-35625
>                 URL: https://issues.apache.org/jira/browse/SPARK-35625
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.0.1
>            Reporter: Liran
>            Priority: Major
>         Attachments: image-2021-06-03-12-15-57-095.png, 
> image-2021-06-03-12-16-03-621.png, image-2021-06-03-12-16-12-573.png
>
>
> We are running a POC of Spark on K8s for one of our apps. It scales up and 
> down quite a lot, and after a while we started seeing many log entries like 
> these:
> {code:java}
> Error trying to remove broadcast 8805 from block manager BlockManagerId(79, 10.244.248.23, 46681, None)
> java.io.IOException: Failed to send RPC RPC 6006709312311899870 to /10.244.248.23:54004: io.netty.channel.StacklessClosedChannelException
> {code}
> {code:java}
> Error trying to remove RDD 32952 from block manager BlockManagerId(79, 10.244.248.23, 46681, None)
> java.io.IOException: Failed to send RPC RPC 7506603739599355778 to /10.244.248.23:54004: io.netty.channel.StacklessClosedChannelException
> {code}
>  
> All the errors and warnings come from attempts to *remove* shuffle, 
> broadcast, or RDD files/blocks. At this point they do not seem harmful 
> beyond spamming our logs.
>  
> The interesting part is that in kubectl the executors do not appear to be 
> alive (as expected), yet in the Spark UI they still show up as "active" 
> with 0 cores:
>  
> !image-2021-06-03-12-13-20-140.png!
>  
> !image-2021-06-03-12-00-04-471.png!
>  
> All the executors marked above are long dead, but for some reason the driver 
> app still tries to send RPC requests to them.
>  
> According to our event logs, one of the pods was created on May 21 at 20:11 
> and killed 9 minutes later at 20:20, but we are still seeing new logs on Jun 3.
> !image-2021-06-03-12-07-29-032.png!
>  
>  
> Sample of one of the errors:
> {code:java}
> Error trying to remove RDD 33178 from block manager BlockManagerId(79, 10.244.248.23, 46681, None)
> Error trying to remove RDD 33178 from block manager BlockManagerId(79, 10.244.248.23, 46681, None)
> java.io.IOException: Failed to send RPC RPC 7684271332363250835 to /10.244.248.23:54004: io.netty.channel.StacklessClosedChannelException
>   at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:363)
>   at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:340)
>   at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
>   at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
>   at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
>   at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
>   at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
>   at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>   at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:998)
>   at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:866)
>   at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>   at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>   at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>   at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>   at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>   at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
>   at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: io.netty.channel.StacklessClosedChannelException
>   at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
> Failed to send RPC RPC 7684271332363250835 to /10.244.248.23:54004: io.netty.channel.StacklessClosedChannelException
> io.netty.channel.StacklessClosedChannelException
>   at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
> {code}
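
The symptom described above (executors that are gone from Kubernetes but still listed as active by the driver) could be cross-checked with a small script. The following is only a sketch and not part of the original report. It assumes the driver log is available as plain text, that executor summaries have been fetched from Spark's monitoring REST API (`/api/v1/applications/[app-id]/executors`, which exposes the `id`, `hostPort` and `isActive` fields), and that the set of live pod IPs has been collected separately, for example from `kubectl get pods -o wide`. The function names `count_failed_removals` and `find_zombie_executors` are hypothetical.

```python
import re
from collections import Counter

# Matches the driver-side messages quoted in this report, e.g.
# "Error trying to remove RDD 33178 from block manager
#  BlockManagerId(79, 10.244.248.23, 46681, None)".
REMOVE_ERROR = re.compile(
    r"Error trying to remove (?P<kind>\w+) (?P<block>\d+) from block manager "
    r"BlockManagerId\((?P<executor>\w+),\s*(?P<host>[\d.]+),\s*(?P<port>\d+),"
)


def count_failed_removals(log_text):
    """Count failed remove attempts per (executor id, host) pair."""
    # Collapse whitespace so messages wrapped across lines still match.
    normalized = " ".join(log_text.split())
    counts = Counter()
    for match in REMOVE_ERROR.finditer(normalized):
        counts[(match.group("executor"), match.group("host"))] += 1
    return counts


def find_zombie_executors(executors, live_pod_ips):
    """Return ids of executors reported active whose host has no live pod.

    `executors` is a list of dicts shaped like the REST API executor
    summaries (assumed fields: "id", "hostPort", "isActive").
    `live_pod_ips` is a set of pod IPs gathered outside this script.
    """
    zombies = []
    for executor in executors:
        if executor["id"] == "driver" or not executor.get("isActive", False):
            continue
        host = executor["hostPort"].rsplit(":", 1)[0]
        if host not in live_pod_ips:
            zombies.append(executor["id"])
    return zombies
```

An executor id returned by `find_zombie_executors` that also has a non-zero count in `count_failed_removals` would match the pattern in this report: the pod is gone, but the driver keeps sending remove requests to its block manager.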



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
