georgew5656 opened a new pull request, #17431: URL: https://github.com/apache/druid/pull/17431
Bugfix for an issue with k8s-based ingestion.

### Description

If there is a communication error between the overlord and k8s during the `KubernetesPeonLifecycle.join` method (for example, when an overlord comes up and makes a lot of calls to k8s), the overlord may attempt to create a log watcher for a still-running task and stream this watcher to a file. This hangs indefinitely as long as the task is running, since new logs are constantly being generated.

#### Fixed the bug
...

#### Renamed the class
...

#### Added a forbidden-apis entry
...

In the situation above (where the overlord can't get the location of a task on startup), the overlord should just kill the task instead of hanging forever.

Currently, when saving task logs, we always create a log watcher for the pod and then use this log watcher to load the task logs. This makes sense in `shutdown()`, before we actually delete the running k8s job, because we want to capture the logs up until the point the pod is killed (hence the log watcher). It is not actually necessary in `join()`, since `saveLogs` is called in `join` after the task lifecycle has already completed. Instead, we should just get a stream to the logs at the time we call `saveLogs` and upload all the logs at that point. This fixes the bug described above because the overlord will simply upload the pod's logs as of the time of the initial communication failure and then mark the task as failed.

The other bug here is that we don't actually log the reason for the communication failure between the overlord and k8s. I saw this happen in one of my clusters and it only logged an unhelpful "job not found" message, so I updated the exception to include information on the underlying exception. Because of this issue, I am not actually sure whether the fabric8 client retried requests to Kubernetes, even though according to the config it should have.
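To illustrate the `join()` fix described above, here is a minimal, stdlib-only sketch. The method name `saveLogsSnapshot` and the `Supplier`-based log source are hypothetical stand-ins (not the actual `KubernetesPeonLifecycle` code): the key point is that copying from a finite, point-in-time log stream terminates, whereas copying from a log watcher's stream blocks for as long as the pod keeps producing output.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Supplier;

public class LogSnapshotSketch
{
  // Copies a point-in-time log snapshot to a file. Because the supplier
  // returns a finite stream (the logs as of "now"), this copy terminates.
  // Copying from a watcher-backed stream in the same way would never
  // return while the pod is still running and emitting logs.
  static void saveLogsSnapshot(Supplier<InputStream> logSource, Path target) throws IOException
  {
    try (InputStream logs = logSource.get()) {
      Files.copy(logs, target); // fails if target already exists
    }
  }

  public static void main(String[] args) throws IOException
  {
    Path target = Files.createTempFile("task-logs", ".log");
    Files.delete(target); // Files.copy refuses to overwrite an existing file

    // Stand-in for a one-shot "fetch the pod's logs right now" call.
    saveLogsSnapshot(
        () -> new ByteArrayInputStream("line 1\nline 2\n".getBytes(StandardCharsets.UTF_8)),
        target
    );
    System.out.print(Files.readString(target));
  }
}
```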
I think if we are planning on still aggressively loading task locations on overlord startup (as in https://github.com/apache/druid/pull/17419/files#diff-bb902bcc2fa097a13509038cd5ae6987b355c2bcf50f7a558bf9c1a3f5d521db), it may make sense to add an extra retry block around the `getPeonPodWithRetries` method that catches the generic fabric8 Kubernetes client exception.

#### Release note

Fix some bugs with Kubernetes tasks.

##### Key changed/added classes in this PR
* `KubernetesPeonLifecycle`
* `KubernetesPeonClient`

This PR has:
- [X] been self-reviewed.
- [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [X] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
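The extra retry block suggested above could be sketched as follows. This is a hypothetical, stdlib-only stand-in: with fabric8 the caught type would be `KubernetesClientException`, but a plain `RuntimeException` is used here to keep the sketch dependency-free, and the wrapper preserves the underlying failure as the cause so operators see the real error rather than a generic "job not found" message.

```java
import java.util.function.Supplier;

public class RetrySketch
{
  // Hypothetical wrapper: retries a call that may throw a client-level
  // exception, with linear backoff, and rethrows with the last failure
  // attached as the cause if all attempts are exhausted.
  static <T> T callWithRetries(Supplier<T> call, int maxAttempts, long backoffMillis)
  {
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return call.get();
      }
      catch (RuntimeException e) {
        last = e;
        if (attempt < maxAttempts) {
          try {
            Thread.sleep(backoffMillis * attempt); // linear backoff
          }
          catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted while retrying", ie);
          }
        }
      }
    }
    // Keep the underlying exception so the real reason for the
    // overlord <-> k8s communication failure is logged.
    throw new RuntimeException("call failed after " + maxAttempts + " attempts", last);
  }

  public static void main(String[] args)
  {
    int[] attempts = {0};
    String pod = callWithRetries(
        () -> {
          if (++attempts[0] < 3) {
            throw new RuntimeException("transient API error");
          }
          return "peon-pod";
        },
        5,
        1L
    );
    System.out.println(pod + " after " + attempts[0] + " attempts");
  }
}
```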
