[ https://issues.apache.org/jira/browse/SPARK-40320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mridul Muralidharan resolved SPARK-40320.
-----------------------------------------
    Resolution: Fixed

> When the Executor plugin fails to initialize, the Executor shows active but
> does not accept tasks forever, just like being hung
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-40320
>                 URL: https://issues.apache.org/jira/browse/SPARK-40320
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 3.0.0
>            Reporter: Mars
>            Priority: Major
>             Fix For: 3.4.0
>
>
> *Steps to reproduce:*
> Set `spark.plugins=ErrorSparkPlugin`.
> The `ErrorSparkPlugin` and `ErrorExecutorPlugin` classes are shown below
> (abbreviated for clarity):
> {code:java}
> import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, SparkPlugin}
>
> class ErrorSparkPlugin extends SparkPlugin {
>   // ErrorDriverPlugin is omitted from this abbreviated report.
>   override def driverPlugin(): DriverPlugin = new ErrorDriverPlugin()
>
>   override def executorPlugin(): ExecutorPlugin = new ErrorExecutorPlugin()
> }{code}
> {code:java}
> import java.util
> import org.apache.spark.api.plugin.{ExecutorPlugin, PluginContext}
>
> class ErrorExecutorPlugin extends ExecutorPlugin {
>   private val checkingInterval: Long = 1
>
>   // Always throws: UnsatisfiedLinkError is a fatal error, not an ordinary exception.
>   override def init(_ctx: PluginContext, extraConf: util.Map[String, String]): Unit = {
>     if (checkingInterval == 1) {
>       throw new UnsatisfiedLinkError("My Exception error")
>     }
>   }
> }
> {code}
> The Executor appears active in the Spark UI, but it is broken and never
> receives any tasks.
> *Root Cause:*
> I checked the code and found that `org.apache.spark.rpc.netty.Inbox#safelyCall`
> throws the fatal error (`UnsatisfiedLinkError` is a fatal error) in the method
> `dealWithFatalError`. The `CoarseGrainedExecutorBackend` JVM process is still
> alive, but its communication thread is no longer working (see
> `MessageLoop#receiveLoopRunnable`: `receiveLoop()` was broken, so the executor
> no longer receives any messages).
> Some ideas:
> It is very hard to tell what happened here without reading the code. The
> Executor is active but cannot do anything.
> We cannot tell whether the driver or the Executor is at fault. At the very
> least, the Executor's status should not be active here, or the Executor
> should call exitExecutor (kill itself).
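
The failure mode described under Root Cause can be sketched without Spark. The toy model below is an illustration only: `FatalErrorLoopSketch`, `Message`, `Outcome`, `simulate` and `initOrExit` are hypothetical names and not Spark classes. It assumes a single-threaded receive loop that swallows non-fatal errors but lets fatal ones (as classified by `scala.util.control.NonFatal`, which treats `LinkageError` subclasses such as `UnsatisfiedLinkError` as fatal) escape. Once that happens the loop thread ends, while the process itself keeps running and later messages stay queued. `initOrExit` shows one possible fail-fast remedy in the spirit of the reporter's suggestion; it is not necessarily how the issue was actually fixed.

```scala
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.atomic.{AtomicInteger, AtomicReference}
import scala.util.control.NonFatal

// Toy model only: none of these names are Spark classes.
object FatalErrorLoopSketch {
  // A message is either ordinary work or a plugin init that fails to link.
  final case class Message(name: String, failsWithLinkError: Boolean = false)

  // What the simulation observed after the loop thread ended.
  final case class Outcome(handled: Int, stillQueued: Int, loopAlive: Boolean,
                           diedWith: Option[String])

  // Receive loop: non-fatal errors are swallowed and the loop continues;
  // fatal errors are not matched by NonFatal, so they escape and end the thread.
  private def receiveLoop(inbox: LinkedBlockingQueue[Message], handled: AtomicInteger): Unit = {
    while (true) {
      val msg = inbox.take()
      try {
        if (msg.failsWithLinkError) throw new UnsatisfiedLinkError("My Exception error")
        handled.incrementAndGet()
      } catch {
        case NonFatal(_) => () // recoverable: keep looping
      }
    }
  }

  def simulate(): Outcome = {
    val inbox = new LinkedBlockingQueue[Message]()
    val handled = new AtomicInteger(0)
    val cause = new AtomicReference[Throwable](null)
    inbox.put(Message("task-1"))
    inbox.put(Message("init-plugin", failsWithLinkError = true))
    inbox.put(Message("task-2"))

    val loop = new Thread(new Runnable {
      def run(): Unit = receiveLoop(inbox, handled)
    }, "message-loop")
    // Record the error instead of printing the default stack trace.
    loop.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      def uncaughtException(t: Thread, e: Throwable): Unit = cause.set(e)
    })
    loop.setDaemon(true)
    loop.start()
    loop.join() // returns once the fatal error has ended the loop thread

    // The calling "process" is still running here, but nothing drains the inbox.
    Outcome(handled.get(), inbox.size(), loop.isAlive,
      Option(cause.get()).map(_.getClass.getSimpleName))
  }

  // Hypothetical fail-fast remedy: if init throws anything, ask to exit
  // rather than staying registered while unable to receive messages.
  def initOrExit(init: () => Unit, exit: Int => Unit): Unit =
    try init() catch { case _: Throwable => exit(1) }

  def main(args: Array[String]): Unit =
    println(simulate())
}
```

In this model the message queued after the failing init is never handled, which mirrors an Executor that looks active but accepts no tasks. Passing an exit callback to `initOrExit` keeps the sketch testable; a real process would terminate itself at that point.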