georgew5656 opened a new pull request, #14030: URL: https://github.com/apache/druid/pull/14030
### Description

Discovered a race condition in the shutdown code while doing some additional testing around https://github.com/apache/druid/commit/f60f377e5f67221f8a4a34163b257ecffece60d7.

If a task is shut down, has its K8s job deleted, and is then removed from the tasks map at https://github.com/apache/druid/blob/master/extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java#L218, it is possible for the pod to still be running in a terminating state for a little while afterward. This is because K8s deletes the controller (the job) and then lets the pod clean itself up.

If `getKnownTasks` (https://github.com/apache/druid/blob/master/extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java#L338) is called by the TaskQueue main loop between the cleanup of the tasks map and the pod finishing its deletion, `getKnownTasks` will trigger another `run()`. Normally `run()` sees the existing future in the tasks map and returns it, but since the task ID has already been removed from the tasks map, the task runner actually starts another job with the same name as the original one that was deleted. This new job is quickly deleted again (because `getKnownTasks` returns it, the TaskQueue sees that the task returned by the task runner is not in TaskStorage, and submits another shutdown request to delete the job), but this is a poor user experience. Additionally, DuplicateKeyErrors may be thrown because multiple pods with the same job name are running, which can make ingestion briefly nonresponsive.

It is also possible to get these DuplicateKeyErrors if a pod (not a job) is manually deleted in K8s. Normally, since we set backoffLimit to 0, if a pod fails, the job doesn't retry creating the pod and just fails out.
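The check-then-act race in the first scenario above can be sketched with a minimal, self-contained simulation. The names (`tasks`, `run`, `shutdown`, `getKnownTasks`, `livePods`) only loosely mirror `KubernetesTaskRunner`; this is an illustration of the race, not the actual Druid code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ShutdownRace {
    // taskId -> "future" placeholder, standing in for the runner's tasks map
    static final ConcurrentMap<String, String> tasks = new ConcurrentHashMap<>();
    // pods still visible in the cluster; a terminating pod lingers here
    static final ConcurrentMap<String, String> livePods = new ConcurrentHashMap<>();

    static String run(String taskId) {
        // Like the real runner: reuse the existing entry, else start a new job.
        return tasks.computeIfAbsent(taskId, id -> {
            System.out.println("starting job for " + id);
            return "future-" + id;
        });
    }

    static void shutdown(String taskId) {
        // Job deleted and entry removed -- but the pod may still be terminating.
        tasks.remove(taskId);
        System.out.println("job deleted, tasks map cleaned for " + taskId);
    }

    static void getKnownTasks() {
        // Rehydrating from pods: the lingering terminating pod re-triggers run().
        for (String taskId : livePods.keySet()) {
            run(taskId);
        }
    }

    public static void main(String[] args) {
        livePods.put("task-1", "pod-task-1");
        run("task-1");        // starts the original job
        shutdown("task-1");   // K8s job deleted; pod still terminating
        getKnownTasks();      // races in and starts a second job named task-1
    }
}
```

Because `getKnownTasks` keys off pods, the still-terminating pod looks like a task that needs a `run()`, and the empty tasks map lets a duplicate job through.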
But it seems that K8s treats manual termination differently from a pod failing: when a pod is manually terminated, the job actually starts another pod.

IMO it makes more sense for `getKnownTasks` to rehydrate state from Kubernetes via the list of jobs rather than the list of pods. This solves both of the above problems, because we can be sure the K8s job has been deleted before deleting the task future from the tasks map, and there will never be duplicate jobs with the same name.

#### Release note

Bug fixes for the Kubernetes overlord extension.

##### Key changed/added classes in this PR
 * Changed `toTask` to take in the `Job` object rather than the `Pod` object, and used this change to have `getKnownTasks` and `getRunningTasks` list jobs rather than pods when reading the K8s state.

I explored some other options for fixing this issue, such as excluding terminating pods from `getKnownTasks` and `getRunningTasks`, or having the main run loop wait for pods to delete if the job has been deleted, but this seemed like the cleanest solution.

This PR has:
- [X] been self-reviewed.
   - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [X] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [ ] added integration tests.
- [X] been tested in a test Druid cluster.

I have tested more thoroughly with the PodTemplateTaskAdapter and done some smoke tests with the MultiContainerTaskAdapter. I'm currently doing some additional testing in my local cluster to try to catch any additional issues.
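The core of the fix described above, rehydrating runner state from the list of Jobs rather than Pods, can be contrasted with the old behavior in a small simulation. All names here are illustrative assumptions, not the actual Druid code:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RehydrateFromJobs {
    // New behavior: rehydration keyed off jobs. A Terminating pod whose job
    // is already gone contributes nothing, so run() is never re-triggered.
    static Set<String> tasksToRehydrate(Set<String> jobs, Map<String, String> podPhases) {
        return new HashSet<>(jobs);
    }

    // Old behavior for contrast: keyed off pods, so a lingering
    // Terminating pod would be picked up and restarted.
    static Set<String> tasksFromPods(Map<String, String> podPhases) {
        return new HashSet<>(podPhases.keySet());
    }

    public static void main(String[] args) {
        // Cluster state right after a shutdown: job deleted, pod lingering.
        Set<String> jobs = new HashSet<>();
        Map<String, String> podPhases = new HashMap<>();
        podPhases.put("task-1", "Terminating");

        System.out.println("pod-based: " + tasksFromPods(podPhases));          // [task-1]
        System.out.println("job-based: " + tasksToRehydrate(jobs, podPhases)); // []
    }
}
```

Because a job is deleted before its pod finishes terminating, listing jobs guarantees that a task whose job is gone is never resubmitted, and job names stay unique.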
