georgew5656 opened a new pull request, #17431: URL: https://github.com/apache/druid/pull/17431
Bugfix for an issue with k8s-based ingestion.

### Description

If there is a communication error between the overlord and k8s during the `KubernetesPeonLifecycle.join` method (for example, when an overlord comes up and makes a lot of calls to k8s), the overlord may attempt to create a log watcher for a still-running task and stream this watcher to a file. This hangs indefinitely as long as the task is running, since new logs are constantly being generated.

#### Fixed the bug
...

#### Renamed the class
...

#### Added a forbidden-apis entry
...

In the situation above (where the overlord can't get the location of a task on startup), the overlord should just kill the task instead of hanging forever.

Currently, when saving task logs, we always create a log watcher for the pod and then use this log watcher to load the task logs. This makes sense in `shutdown()`, before we actually delete the running k8s job, because we want to capture the logs up until the point the pod is killed (hence the log watcher). It is not actually necessary in `join()`, since `saveLogs` is called in `join` after the task lifecycle has already completed. Instead, we should just get a stream to the logs at the time we call `saveLogs` and upload all the logs at that point. This fixes the bug described above because the overlord will simply upload the pod's logs as of the time of the initial communication failure and then mark the task as failed.

The other bug here is that we don't actually log the reason for the communication failure between the overlord and k8s. I saw this happen in one of my clusters and it only logged an unhelpful "job not found" message, so I updated the exception to include information on the underlying exception. Because of this issue, I am not actually sure whether the fabric8 client retried requests to Kubernetes, even though according to the config it should have.
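To illustrate the `join()` fix described above, here is a minimal, stdlib-only sketch. The method name `saveLogsSnapshot` and the `Supplier`-based log source are hypothetical stand-ins (not the actual `KubernetesPeonLifecycle` code): the key point is that copying from a finite, point-in-time log stream terminates, whereas copying from a log watcher's stream blocks for as long as the pod keeps producing output.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Supplier;

public class LogSnapshotSketch
{
  // Copies a point-in-time log snapshot to a file. Because the supplier
  // returns a finite stream (the logs as of "now"), this copy terminates.
  // Copying from a watcher-backed stream in the same way would never
  // return while the pod is still running and emitting logs.
  static void saveLogsSnapshot(Supplier<InputStream> logSource, Path target) throws IOException
  {
    try (InputStream logs = logSource.get()) {
      Files.copy(logs, target); // fails if target already exists
    }
  }

  public static void main(String[] args) throws IOException
  {
    Path target = Files.createTempFile("task-logs", ".log");
    Files.delete(target); // Files.copy refuses to overwrite an existing file

    // Stand-in for a one-shot "fetch the pod's logs right now" call.
    saveLogsSnapshot(
        () -> new ByteArrayInputStream("line 1\nline 2\n".getBytes(StandardCharsets.UTF_8)),
        target
    );
    System.out.print(Files.readString(target));
  }
}
```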
I think if we are planning on still aggressively loading task locations on overlord startup (as in https://github.com/apache/druid/pull/17419/files#diff-bb902bcc2fa097a13509038cd5ae6987b355c2bcf50f7a558bf9c1a3f5d521db), it may make sense to add an extra retry block around the `getPeonPodWithRetries` method that catches the generic fabric8 Kubernetes client exception.

#### Release note

Fix some bugs with Kubernetes tasks.

##### Key changed/added classes in this PR
* `KubernetesPeonLifecycle`
* `KubernetesPeonClient`

This PR has:
- [X] been self-reviewed.
- [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [X] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
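The extra retry block suggested above could be sketched as follows. This is a hypothetical, stdlib-only stand-in: with fabric8 the caught type would be `KubernetesClientException`, but a plain `RuntimeException` is used here to keep the sketch dependency-free, and the wrapper preserves the underlying failure as the cause so operators see the real error rather than a generic "job not found" message.

```java
import java.util.function.Supplier;

public class RetrySketch
{
  // Hypothetical wrapper: retries a call that may throw a client-level
  // exception, with linear backoff, and rethrows with the last failure
  // attached as the cause if all attempts are exhausted.
  static <T> T callWithRetries(Supplier<T> call, int maxAttempts, long backoffMillis)
  {
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return call.get();
      }
      catch (RuntimeException e) {
        last = e;
        if (attempt < maxAttempts) {
          try {
            Thread.sleep(backoffMillis * attempt); // linear backoff
          }
          catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted while retrying", ie);
          }
        }
      }
    }
    // Keep the underlying exception so the real reason for the
    // overlord <-> k8s communication failure is logged.
    throw new RuntimeException("call failed after " + maxAttempts + " attempts", last);
  }

  public static void main(String[] args)
  {
    int[] attempts = {0};
    String pod = callWithRetries(
        () -> {
          if (++attempts[0] < 3) {
            throw new RuntimeException("transient API error");
          }
          return "peon-pod";
        },
        5,
        1L
    );
    System.out.println(pod + " after " + attempts[0] + " attempts");
  }
}
```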
