[
https://issues.apache.org/jira/browse/SPARK-40320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mridul Muralidharan resolved SPARK-40320.
-----------------------------------------
Resolution: Fixed
> When the Executor plugin fails to initialize, the Executor shows active but
> does not accept tasks forever, just like being hung
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-40320
> URL: https://issues.apache.org/jira/browse/SPARK-40320
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 3.0.0
> Reporter: Mars
> Priority: Major
> Fix For: 3.4.0
>
>
> *Steps to reproduce:*
> Set `spark.plugins=ErrorSparkPlugin`.
> The `ErrorSparkPlugin` and `ErrorExecutorPlugin` classes are shown below
> (abbreviated for clarity):
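> {code:scala}
> import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, SparkPlugin}
>
> class ErrorSparkPlugin extends SparkPlugin {
>   // Driver-side plugin (ErrorDriverPlugin is omitted from this abbreviated report).
>   override def driverPlugin(): DriverPlugin = new ErrorDriverPlugin()
>
>   // Executor-side plugin whose init() always fails.
>   override def executorPlugin(): ExecutorPlugin = new ErrorExecutorPlugin()
> }{code}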
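> {code:scala}
> import java.util
> import org.apache.spark.api.plugin.{ExecutorPlugin, PluginContext}
>
> class ErrorExecutorPlugin extends ExecutorPlugin {
>   private val checkingInterval: Long = 1
>
>   // Always throws, because checkingInterval is fixed at 1.
>   override def init(_ctx: PluginContext, extraConf: util.Map[String, String]): Unit = {
>     if (checkingInterval == 1) {
>       // UnsatisfiedLinkError is a LinkageError, which Scala's NonFatal treats as fatal.
>       throw new UnsatisfiedLinkError("My Exception error")
>     }
>   }
> } {code}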
> The Spark UI shows the Executor as active, but it is broken and never
> receives any tasks.
> *Root Cause:*
> I checked the code and found that
> `org.apache.spark.rpc.netty.Inbox#safelyCall` rethrows fatal errors
> (`UnsatisfiedLinkError` is a fatal error) in its `dealWithFatalError` method.
> As a result, the `CoarseGrainedExecutorBackend` JVM process stays alive but
> its communication thread stops working (see
> `MessageLoop#receiveLoopRunnable`: `receiveLoop()` has terminated, so the
> executor no longer receives any messages).
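
The failure mode described above can be reproduced outside Spark. The following is a minimal sketch, not Spark's actual `Inbox` or `MessageLoop` code: `FatalErrorDemo` and `runLoop` are hypothetical names, and the only library feature relied on is `scala.util.control.NonFatal`, which does not match `LinkageError` subclasses such as `UnsatisfiedLinkError`.

```scala
import scala.util.control.NonFatal

// Hypothetical stand-in for an RPC message loop; not Spark's actual code.
object FatalErrorDemo {
  // Runs each handler in order and returns how many handlers were attempted
  // before the loop stopped.
  def runLoop(handlers: Seq[() => Unit]): Int = {
    var attempted = 0
    try {
      handlers.foreach { handler =>
        attempted += 1
        try handler()
        catch {
          // Non-fatal exceptions are swallowed, so the loop keeps going.
          case NonFatal(e) => println(s"recovered from: ${e.getMessage}")
        }
      }
    } catch {
      // A fatal error escapes the inner handler and ends the loop early,
      // while the surrounding JVM process stays alive.
      case t: Throwable => println(s"loop terminated by: $t")
    }
    attempted
  }

  def main(args: Array[String]): Unit = {
    val nonFatal = Seq[() => Unit](
      () => throw new RuntimeException("boom"),
      () => println("second message handled"))
    val fatal = Seq[() => Unit](
      () => throw new UnsatisfiedLinkError("My Exception error"),
      () => println("never reached"))
    println(s"handlers attempted (non-fatal case): ${runLoop(nonFatal)}")
    println(s"handlers attempted (fatal case): ${runLoop(fatal)}")
  }
}
```

In the non-fatal case both handlers run; in the fatal case the loop stops after the first handler, which mirrors an executor whose process is alive but whose message loop is gone.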
> *Some ideas:*
> It is very hard to tell what happened here without reading the code. The
> Executor is active but cannot do anything, so it is unclear whether the
> driver or the Executor is at fault. At the very least, the Executor status
> should not be active in this case, or the Executor should call exitExecutor
> (kill itself).
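
One way to picture the "kill itself" idea is a guard around initialization that exits on any `Throwable`, fatal or not. This is only a sketch under stated assumptions: `InitGuardDemo` and `initOrExit` are hypothetical names, the `exit` callback stands in for something like `exitExecutor` or `sys.exit`, and nothing here is taken from the fix that was actually merged.

```scala
// Hypothetical sketch; not the change that resolved SPARK-40320.
object InitGuardDemo {
  // Runs `init` and calls `exit` with a non-zero code if anything at all is
  // thrown, including fatal errors such as LinkageError subclasses.
  def initOrExit(init: () => Unit, exit: Int => Unit): Boolean =
    try {
      init()
      true
    } catch {
      case t: Throwable =>
        System.err.println(s"initialization failed: $t")
        exit(1)
        false
    }

  def main(args: Array[String]): Unit = {
    // In a real process `exit` could be `sys.exit`; here it only records the code.
    var exitCode: Option[Int] = None
    val ok = initOrExit(
      () => throw new UnsatisfiedLinkError("My Exception error"),
      code => exitCode = Some(code))
    println(s"initialized=$ok exitCode=$exitCode")
  }
}
```

Catching `Throwable` is deliberately broad here: the point is that the process terminates visibly instead of lingering in an active-but-unusable state.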
>