[PR] Fail fast for unrecoverable Kubernetes jobs (druid)

via GitHub Thu, 19 Jun 2025 03:09:26 -0700


GWphua opened a new pull request, #18159:
URL: https://github.com/apache/druid/pull/18159


   ### Description
   
   <!-- Describe the goal of this PR, what problem are you fixing. If there is 
a corresponding issue (referenced above), it's not necessary to repeat the 
description here, however, you may choose to keep one summary sentence. -->
   
   <!-- Describe your patch: what did you change in code? How did you fix the 
problem? -->
   
   <!-- If there are several relatively logically separate changes in this PR, 
create a mini-section for each of them. For example: -->
   
   Currently, Kubernetes jobs will retry for a total of 10 times to create a 
peon pod for ingestion tasks. However, there are some pain points that surface 
during periods of heavy configuration, or when starting new clusters using 
similar Druid helm charts (copy-paste):
   1. The error `Error when looking for K8s pod with label[job-name=%s]` is not 
useful for us. We will need to further discover what is going on by doing 
`kubectl describe job ${jobname}`.
   2. Given the 10 retries for pod creation, along with Kubernetes exponential 
backoff, jobs will be pending for ~2minutes before failing and giving the 
above-mentioned logs.
   
   This PR improves error handling and retry logic in 
`druid-kubernetes-overlord-extension` when the Kubernetes job fails to create 
peon pods. 
   
   #### Enhanced Error Handling for Pod Creation Failures
   
   1. Provides more descriptive error messages that include the related job's 
latest Kubernetes event message.
   2. Better error categorization using `DruidException` with appropriate 
persona (OPERATOR) and category (NOT_FOUND)
   
   **Important Note: Operators will need to allow event logging for Druid 
service accounts to allow the fail-fast feature to work properly.** If you 
somehow forget to do this, the behaviour of jobs that successfully spin up pods 
will not be affected, but you will get a warning `Failed to get events for 
job[%s]` and receive the old `K8s pod with label[job-name=%s] not 
found"`message.
   
   #### Improved Retry Logic using Blacklisted Error Message
   
   Implements intelligent retry logic that avoids retrying when the failure is 
due to known unrecoverable conditions. A list of unrecoverable event message 
substrings are specified under `BLACKLISTED_PEON_POD_ERROR_MESSAGES`. The 
`shouldRetryStartingPeonPod()` method checks exception messages (in the form of 
Kubernetes Job event messages) against this blacklist to determine if retrying 
would be futile.
   
   `BLACKLISTED_PEON_POD_ERROR_MESSAGES` currently includes: "max limit to 
request ratio" - which catches failures when resource (cpu, memory, etc.) 
request-limits ratio is beyond the allowable amount. 
   
   This is how the exception message will look like now:
   ```
   Job[XXX] failed to start up pods. Latest event: [Error creating: pods "XXX" 
is forbidden: memory max limit to request ratio per Container is 1, but 
provided ratio is 1.333333]
   ```
   
   I have only added one event message substring that really hits me very 
often. Feel free to add upon this constant should you discover more 
unrecoverable issues (or even allow this list to be configurable?). 
   
   #### Release note
   <!-- Give your best effort to summarize your changes in a couple of 
sentences aimed toward Druid users. 
   
   If your change doesn't have end user impact, you can skip this section.
   
   For tips about how to write a good release note, see [Release 
notes](https://github.com/apache/druid/blob/master/CONTRIBUTING.md#release-notes).
   
   -->
   
   Kubernetes jobs will have clearer failure messages during pod creation, and 
will fail fast under unrecoverable conditions. 
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `KubernetesPeonClient.java`
    * `DruidK8sConstants.java`
    * `KubernetesPeonClientTest.java`
   
   <hr>
   
   <!-- Check the items by putting "x" in the brackets for the done things. Not 
all of these items apply to every PR. Remove the items which are not done or 
not relevant to the PR. None of the items from the checklist below are strictly 
necessary, but it would be very helpful if you at least self-review the PR. -->
   
   This PR has:
   
   - [x] been self-reviewed.
   - [ ] added documentation for new or modified features or behaviors.
   - [x] a release note entry in the PR description.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [x] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [x] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] Fail fast for unrecoverable Kubernetes jobs (druid)

Reply via email to