kevin85421 opened a new pull request, #37400: URL: https://github.com/apache/spark/pull/37400
### What changes were proposed in this pull request?

When `onDisconnected` is triggered:

1. Delay `RemoveExecutor` by 5 seconds so that the driver can receive the `ExecutorExitCode` from the slow path.
2. Prevent the task scheduler from assigning tasks to the lost executor by adding the executor to `executorsPendingLossReason`.

### Why are the changes needed?

There are two ways to detect executor loss:

1. (fast path) `onDisconnected`, Executor -> Driver: when the executor's JVM exits, its socket (Netty's channel) is closed. `onDisconnected` is triggered once the channel is known to be closed.
2. (slow path) ExecutorRunner -> Worker -> Master -> Driver (see #37385 for details): when the executor exits with an `ExecutorExitCode`, the exit code is passed from ExecutorRunner to the driver.

Because the fast path decides that an executor is lost without knowing its `ExecutorExitCode`, the two paths may classify the same case differently. For example, when an executor exits with the `ExecutorExitCode` `HEARTBEAT_FAILURE`, `onDisconnected` treats the executor loss as a task failure, but the slow path treats it as a network failure. `HEARTBEAT_FAILURE` is clearly a network failure.

[Notice] For more details about `ExecutorExitCode`, see #37385.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

```bash
bazel run //core:org.apache.spark.SparkContextSuite -- -z "ExitCode"
```
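The interaction between the two detection paths described under "What changes were proposed" can be illustrated with a small, deterministic simulation. This is only a sketch under stated assumptions, not the Spark implementation: `DriverModel`, its methods, and the reason strings are hypothetical names invented for illustration, and time is passed in explicitly instead of using a real timer. It also assumes that an exit code arriving via the slow path removes the executor immediately and cancels the delayed removal.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

REMOVE_DELAY_SECONDS = 5.0  # the 5-second delay mentioned in the PR description


@dataclass
class DriverModel:
    """Toy model of the driver's bookkeeping (all names are hypothetical)."""

    executors: Set[str]
    # Disconnected executors whose loss reason is not yet known.
    executors_pending_loss_reason: Set[str] = field(default_factory=set)
    # executor id -> time at which the delayed removal should fire
    removal_deadline: Dict[str, float] = field(default_factory=dict)
    # (executor id, reason) pairs, in removal order
    removed: List[Tuple[str, str]] = field(default_factory=list)

    def on_disconnected(self, executor_id: str, now: float) -> None:
        """Fast path: the channel closed, but the exit code is unknown."""
        if executor_id not in self.executors:
            return
        # Stop scheduling onto the executor right away...
        self.executors_pending_loss_reason.add(executor_id)
        # ...but postpone the actual removal to give the slow path a chance.
        self.removal_deadline.setdefault(executor_id, now + REMOVE_DELAY_SECONDS)

    def on_exit_code(self, executor_id: str, exit_code: str) -> None:
        """Slow path: the exit code arrived (assumed to remove immediately)."""
        if executor_id in self.executors:
            self._remove(executor_id, exit_code)

    def tick(self, now: float) -> None:
        """Fire any delayed removals whose deadline has passed."""
        for executor_id, deadline in list(self.removal_deadline.items()):
            if now >= deadline:
                self._remove(executor_id, "unknown")

    def schedulable(self) -> List[str]:
        """Executors that may still be offered tasks."""
        return sorted(self.executors - self.executors_pending_loss_reason)

    def _remove(self, executor_id: str, reason: str) -> None:
        self.executors.discard(executor_id)
        self.executors_pending_loss_reason.discard(executor_id)
        self.removal_deadline.pop(executor_id, None)
        self.removed.append((executor_id, reason))


if __name__ == "__main__":
    driver = DriverModel(executors={"exec-1", "exec-2"})
    driver.on_disconnected("exec-1", now=0.0)
    print(driver.schedulable())  # exec-1 is no longer offered tasks
    driver.on_exit_code("exec-1", "HEARTBEAT_FAILURE")
    driver.tick(now=6.0)         # the delayed removal was cancelled
    print(driver.removed)
```

In this model, an executor that disconnects is excluded from scheduling at once, while the loss reason is recorded either from the exit code (if it arrives within the delay) or as `"unknown"` once the delay expires.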
