fe2s opened a new pull request, #53600:
URL: https://github.com/apache/spark/pull/53600

   …for executor pod requests when pods repeatedly fail to start
   
   
   ### What changes were proposed in this pull request?
   This PR introduces exponential backoff delays for executor pod requests when pods repeatedly fail to start. It tracks the following kinds of startup failures (a rough sketch of the tracking follows the list):
   * **API request failures** - Pod creation requests to the Kubernetes API 
server throw exceptions
   * **Pod startup failures** - Pods transition to `PodFailed` status before 
the executor registers with the driver (indicating the executor never 
successfully started)
   * **Creation timeouts** - Pods do not appear within the timeout period 
(existing mechanism)
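   
   A minimal sketch of how these failure kinds might be counted inside a sliding time window is shown below; all identifiers here are illustrative assumptions, not the names used in the PR:
   
   ```scala
   import scala.collection.mutable
   
   // Hypothetical names for the three tracked startup failure kinds.
   sealed trait ExecutorStartupFailure
   case object ApiRequestFailed extends ExecutorStartupFailure            // API server request threw an exception
   case object PodFailedBeforeRegistration extends ExecutorStartupFailure // PodFailed before the executor registered
   case object CreationTimedOut extends ExecutorStartupFailure            // pod never appeared within the timeout
   
   // Counts startup failures inside a sliding time window.
   class StartupFailureWindow(windowMs: Long, threshold: Int, clock: () => Long = () => System.currentTimeMillis()) {
     private val failures = mutable.Queue.empty[(Long, ExecutorStartupFailure)]
   
     def record(failure: ExecutorStartupFailure): Unit = synchronized {
       failures.enqueue((clock(), failure))
       evictExpired()
     }
   
     // True when the failure count inside the window exceeds the configured threshold.
     def thresholdExceeded: Boolean = synchronized {
       evictExpired()
       failures.size > threshold
     }
   
     private def evictExpired(): Unit = {
       val cutoff = clock() - windowMs
       while (failures.nonEmpty && failures.head._1 < cutoff) failures.dequeue()
     }
   }
   ```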
   
   The mechanism operates as a state machine with two states (sketched after this list):
   * **Normal state** - Executor pods are requested without extra delay. Startup failures are tracked within a sliding time window; when the failure count exceeds the configured threshold, the controller transitions to the Backoff state.
   * **Backoff state** - Executor pod requests are throttled with exponentially increasing delays between requests. The controller transitions back to the Normal state as soon as an executor that was requested during the Backoff state starts successfully, indicating that the infrastructure issue has resolved.
   
   When backoff is enabled, two new metrics are added. I will update the `monitoring.md` doc with the new metrics source if the patch looks good.
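   
   As a rough illustration (the source name and metric names below are placeholders, not the ones this PR introduces), such a source would plug into Spark's existing `Source` trait backed by a Dropwizard `MetricRegistry`:
   
   ```scala
   package org.apache.spark.scheduler.cluster.k8s
   
   import com.codahale.metrics.{Counter, MetricRegistry}
   
   import org.apache.spark.metrics.source.Source
   
   // Hypothetical metrics source; the actual source and metric names are defined in the PR diff.
   private[spark] class ExecutorPodsBackoffSource extends Source {
     override val sourceName: String = "executorPodsBackoff"
     override val metricRegistry: MetricRegistry = new MetricRegistry
   
     // How many times the allocator entered the Backoff state.
     val backoffTransitions: Counter = metricRegistry.counter("backoffTransitions")
     // How many executor pod requests were delayed while in the Backoff state.
     val delayedRequests: Counter = metricRegistry.counter("delayedRequests")
   }
   ```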
   
   
   ### Why are the changes needed?
   When executor pods repeatedly fail to start due to Kubernetes infrastructure problems (control plane overload, resource exhaustion, service mesh issues), the current implementation keeps requesting pods at full speed, amplifying the load on already stressed infrastructure.
   
   
   **Relationship to ExecutorFailureTracker**:
   This backoff mechanism complements the existing `ExecutorFailureTracker`. `ExecutorFailureTracker` counts all executor pod failures (including pods that started successfully but failed later during task execution) to decide when to abort the application (`spark.executor.maxNumFailures`), whereas the backoff controller tracks only startup failures, in order to throttle allocation requests and protect the infrastructure.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, when the feature is enabled: executor pod allocation is throttled with exponentially increasing delays once startup failures exceed the configured threshold.
   Observability changes when enabled:
   * New log messages indicating backoff state transitions and the current delay
   * New metrics exposing the backoff behavior
   
   ### How was this patch tested?
   1. Unit tests
   2. **API request failures** simulated with `kubectl create quota test 
--hard=cpu=1,memory=1G`
   3. **Pod startup failures** simulated with an init container that fails to 
start.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   

