yadavay-amzn opened a new pull request, #56628: URL: https://github.com/apache/spark/pull/56628
### What changes were proposed in this pull request?

Add a dedicated `ExecutorShutdownFailure` TaskFailedReason with `countTowardsTaskFailures = false`. When the executor thread pool rejects a task with `RejectedExecutionException` while the executor is shutting down (gated on the `executorShutdown` flag), this reason is reported so the attempt is not counted toward `spark.task.maxFailures` and the task is rescheduled elsewhere.

### Why are the changes needed?

Tasks launched onto an executor whose thread pool is shutting down are rejected with `RejectedExecutionException`. Today this surfaces as a generic `ExceptionFailure` (which counts toward maxFailures), so repeated shutdown races can exhaust the retry budget and abort the stage even though no real task fault occurred. See SPARK-57465 (reporter: Thomas Newton).

### Design notes

- New dedicated reason mirrors existing special reasons (`ExecutorLostFailure`, `TaskCommitDenied`, `TaskKilled`).
- Gated narrowly on the executor shutdown flag so a genuine non-shutdown `RejectedExecutionException` still counts.
- Resubmission is naturally bounded since the shutting-down executor leaves the cluster.
- `JsonProtocol` ser/deser added. Older event logs without this reason still parse correctly. The forward-compat `MatchError` for older readers of new logs follows the same pre-existing pattern as all `TaskEndReason` additions.
- `TaskSetManager` logs `ExecutorShutdownFailure` at INFO (not WARN) since it is an expected/benign reschedule.

### Does this PR introduce _any_ user-facing change?

No. This is scheduler-internal; the visible effect is fewer spurious stage failures under executor churn.

### How was this patch tested?

- `TaskSetManagerSuite`: the new reason does not increment `numFailures` and the task is resubmitted; contrasted with `ExceptionFailure` which does.
- `JsonProtocolSuite`: round-trip serialization and back-compat deserialization.

### Was this patch authored or co-authored using generative AI tooling?
Yes.
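
For readers who want to see the gating idea in isolation, here is a minimal sketch in plain Java using only the JDK. It is not the code in this PR: the `Reason` enum and the `classify`, `launch`, and `recordFailure` helpers are hypothetical stand-ins for Spark's Scala `TaskFailedReason` classes and `TaskSetManager` bookkeeping. It only illustrates that a `RejectedExecutionException` is mapped to a non-counting reason when the shutdown flag is set, and to a counting reason otherwise.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicBoolean;

public class ShutdownRejectionSketch {
    // Hypothetical stand-ins for Spark's failure reasons; not the real classes.
    public enum Reason {
        EXCEPTION_FAILURE(true),           // counts toward the retry budget
        EXECUTOR_SHUTDOWN_FAILURE(false);  // benign reschedule, not counted

        public final boolean countTowardsTaskFailures;

        Reason(boolean countTowardsTaskFailures) {
            this.countTowardsTaskFailures = countTowardsTaskFailures;
        }
    }

    // Gate narrowly on the shutdown flag: a rejection that happens while the
    // executor is NOT shutting down is still treated as a real failure.
    public static Reason classify(RejectedExecutionException e, boolean executorShutdown) {
        return executorShutdown ? Reason.EXECUTOR_SHUTDOWN_FAILURE : Reason.EXCEPTION_FAILURE;
    }

    // Try to hand the task to the pool; return null on success, or the
    // failure reason if the pool rejected the task.
    public static Reason launch(ExecutorService pool, AtomicBoolean executorShutdown, Runnable task) {
        try {
            pool.execute(task);
            return null;
        } catch (RejectedExecutionException e) {
            return classify(e, executorShutdown.get());
        }
    }

    // Assumed scheduler-side bookkeeping: only counting reasons bump the failure count.
    public static int recordFailure(int numFailures, Reason reason) {
        return reason.countTowardsTaskFailures ? numFailures + 1 : numFailures;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        AtomicBoolean executorShutdown = new AtomicBoolean(false);

        // Simulate the race: the executor begins shutting down, then a task arrives.
        executorShutdown.set(true);
        pool.shutdown();

        Reason reason = launch(pool, executorShutdown, () -> { });
        int numFailures = recordFailure(0, reason);
        System.out.println(reason + " numFailures=" + numFailures);
        // prints "EXECUTOR_SHUTDOWN_FAILURE numFailures=0"
    }
}
```

A pool created by `Executors.newFixedThreadPool` uses the default abort policy, so `execute` after `shutdown()` throws `RejectedExecutionException`; that is the condition the sketch relies on.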
