yadavay-amzn opened a new pull request, #56628:
URL: https://github.com/apache/spark/pull/56628

   ### What changes were proposed in this pull request?
   
   Add a dedicated `ExecutorShutdownFailure` `TaskFailedReason` with 
`countTowardsTaskFailures = false`. When the executor thread pool rejects a 
task with `RejectedExecutionException` while the executor is shutting down 
(gated on the `executorShutdown` flag), this reason is reported so the attempt 
is not counted toward `spark.task.maxFailures` and the task is rescheduled 
elsewhere.
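
A minimal sketch of the accounting this relies on, using simplified stand-ins rather than Spark's real classes. The `executorId` field and the `FailureAccounting` helper are hypothetical and exist only for illustration; the actual constructor of `ExecutorShutdownFailure` is not shown in this description.

```scala
// Simplified stand-ins for Spark's TaskFailedReason hierarchy, not the real classes.
sealed trait TaskFailedReason {
  def toErrorString: String
  // By default a failed attempt counts toward the retry budget.
  def countTowardsTaskFailures: Boolean = true
}

// Reduced to two fields for illustration; the real ExceptionFailure carries more.
final case class ExceptionFailure(className: String, description: String)
    extends TaskFailedReason {
  override def toErrorString: String = s"$className: $description"
}

// Hypothetical shape: the `executorId` field is an assumption.
final case class ExecutorShutdownFailure(executorId: String) extends TaskFailedReason {
  override def toErrorString: String =
    s"Task rejected because executor $executorId is shutting down"
  // Opt out of the retry budget, as the PR describes.
  override def countTowardsTaskFailures: Boolean = false
}

// Hypothetical helper standing in for the scheduler's per-task failure counter.
object FailureAccounting {
  // Returns the updated failure count after one attempt ends with `reason`.
  def nextFailureCount(current: Int, reason: TaskFailedReason): Int =
    if (reason.countTowardsTaskFailures) current + 1 else current
}
```

Under these assumptions, an attempt that ends with `ExecutorShutdownFailure` leaves the count unchanged, while an `ExceptionFailure` increments it.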
   
   ### Why are the changes needed?
   
   Tasks launched onto an executor whose thread pool is shutting down are 
rejected with `RejectedExecutionException`. Today this surfaces as a generic 
`ExceptionFailure` (which counts toward `spark.task.maxFailures`), so repeated 
shutdown races can exhaust the retry budget and abort the stage even though no 
real task fault occurred. See SPARK-57465 (reporter: Thomas Newton).
   
   ### Design notes
   
   - New dedicated reason mirrors existing special reasons 
(`ExecutorLostFailure`, `TaskCommitDenied`, `TaskKilled`).
   - Gated narrowly on the executor shutdown flag so a genuine non-shutdown 
`RejectedExecutionException` still counts.
   - Resubmission is naturally bounded since the shutting-down executor leaves 
the cluster.
   - `JsonProtocol` serialization and deserialization added. Older event logs 
without this reason still parse correctly. Older readers that encounter the new 
reason in newer logs hit a `MatchError`, which is the same pre-existing behavior 
as for every earlier `TaskEndReason` addition.
   - `TaskSetManager` logs `ExecutorShutdownFailure` at INFO (not WARN) since 
it is an expected/benign reschedule.
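
The narrow gating in the second bullet can be sketched roughly as follows. This is an illustrative miniature, not the code in this PR: `MiniExecutor`, `LaunchOutcome`, and its cases are hypothetical names, and only the `java.util.concurrent` calls are real APIs.

```scala
import java.util.concurrent.{ExecutorService, RejectedExecutionException}
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical outcome type used only in this sketch.
sealed trait LaunchOutcome
object LaunchOutcome {
  case object Launched extends LaunchOutcome
  case object RejectedDuringShutdown extends LaunchOutcome // would not count toward the budget
  case object RejectedOther extends LaunchOutcome          // would still count as a failure
}

// Hypothetical miniature of an executor that owns a thread pool and a shutdown flag.
final class MiniExecutor(pool: ExecutorService) {
  private val executorShutdown = new AtomicBoolean(false)

  def stop(): Unit = {
    // Set the flag before shutting the pool down, so any rejection caused by
    // this shutdown observes the flag as true.
    executorShutdown.set(true)
    pool.shutdown()
  }

  def launch(task: Runnable): LaunchOutcome =
    try {
      pool.execute(task)
      LaunchOutcome.Launched
    } catch {
      // Only a rejection that coincides with shutdown is treated as benign.
      case _: RejectedExecutionException if executorShutdown.get() =>
        LaunchOutcome.RejectedDuringShutdown
      // Any other rejection is treated as a genuine failure.
      case _: RejectedExecutionException =>
        LaunchOutcome.RejectedOther
    }
}
```

In this sketch a caller would map `RejectedDuringShutdown` to the new reason and `RejectedOther` to the existing generic failure path.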
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. This is scheduler-internal; the visible effect is fewer spurious stage 
failures under executor churn.
   
   ### How was this patch tested?
   
   - `TaskSetManagerSuite`: the new reason does not increment `numFailures` and 
the task is resubmitted; contrasted with `ExceptionFailure` which does.
   - `JsonProtocolSuite`: round-trip serialization and back-compat 
deserialization.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]