georgew5656 opened a new pull request, #17431:
URL: https://github.com/apache/druid/pull/17431

   Bugfix for an issue with k8s-based ingestion.
   
   ### Description
   If there is a communication error between the overlord and k8s during the 
`KubernetesPeonLifecycle.join` method (for example, when an overlord comes up and 
makes many calls to k8s), the overlord may attempt to create a log watcher 
for a still-running task and stream the watcher's output to a file. This hangs 
indefinitely for as long as the task is running, since new logs are constantly 
being generated.
   
   #### Fixed the bug ...
   #### Renamed the class ...
   #### Added a forbidden-apis entry ...
   In the above situation (where the overlord can't get the location of a task 
on startup), the overlord should just kill the task instead of hanging forever. 
Currently, when saving task logs, we always create a log watcher for the 
pod and then use that watcher to load the task logs.
   
   This makes sense to do in shutdown(), before we actually delete the running 
k8s job, because we want to capture the logs up until the point the pod is killed 
(hence the log watcher).
   
   This is not actually necessary in join(), since saveLogs is called in join() 
after the task lifecycle has already completed. Instead, we should just open a 
stream to the logs at the time we call saveLogs and upload all the logs at that 
point.
   
   This fixes the bug described above: the overlord simply uploads the logs 
for the pod at the time of the initial communication failure and then marks 
the task as failed.
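   The distinction above can be sketched as follows. This is an illustrative example, not Druid's actual implementation: `snapshotLogs` is a hypothetical name, and a plain `InputStream` stands in for both kinds of log source. The point is that a one-shot "get logs" stream reaches EOF once the completed task's logs are exhausted, while a LogWatch-style stream never reaches EOF while the pod is still running, so the same copy loop would hang forever.

   ```java
   import java.io.ByteArrayInputStream;
   import java.io.IOException;
   import java.io.InputStream;
   import java.io.OutputStream;
   import java.nio.charset.StandardCharsets;
   import java.nio.file.Files;
   import java.nio.file.Path;

   public class LogSnapshotSketch
   {
     // Copies a pod's log stream to a file and returns the byte count. With a
     // watcher-style stream that never hits EOF while the pod is running, this
     // loop would block indefinitely; with a plain "get logs" stream opened
     // after the task lifecycle has completed, it ends when the logs end.
     public static long snapshotLogs(InputStream logStream, Path target) throws IOException
     {
       try (InputStream in = logStream;
            OutputStream out = Files.newOutputStream(target)) {
         byte[] buf = new byte[8192];
         long total = 0;
         int n;
         while ((n = in.read(buf)) != -1) {
           out.write(buf, 0, n);
           total += n;
         }
         return total;
       }
     }

     public static void main(String[] args) throws IOException
     {
       // Simulated finite log stream from a completed task's pod.
       byte[] logs = "task completed\n".getBytes(StandardCharsets.UTF_8);
       Path tmp = Files.createTempFile("peon-logs", ".log");
       long copied = snapshotLogs(new ByteArrayInputStream(logs), tmp);
       System.out.println("copied " + copied + " bytes");
     }
   }
   ```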
   
   The other bug here is that we don't actually log the reason for the 
communication failure between the overlord <-> k8s. I saw this happen in one of 
my clusters and it only logged an unhelpful "job not found" message, so I updated 
the exception to include info on the underlying exception. Because of this issue 
I am not sure whether the fabric8 client actually retried requests to kubernetes, 
even though according to the config it should have. If we are planning 
on still aggressively loading task locations on overlord startup (as in 
https://github.com/apache/druid/pull/17419/files#diff-bb902bcc2fa097a13509038cd5ae6987b355c2bcf50f7a558bf9c1a3f5d521db)
 it may make sense to add an extra retry block around the getPeonPodWithRetries 
method that catches the generic fabric8 kubernetes client exception.
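   A minimal sketch of what such a retry block could look like. Everything here is hypothetical: `withRetries`, `maxAttempts`, and `sleepMillis` are illustrative names, and a plain `RuntimeException` stands in for fabric8's `KubernetesClientException`. Note the final exception wraps the last underlying failure, so the real cause is logged rather than just a "job not found" message.

   ```java
   import java.util.concurrent.atomic.AtomicInteger;
   import java.util.function.Supplier;

   public class RetrySketch
   {
     // Retries the supplied call on RuntimeException (standing in for the
     // generic fabric8 KubernetesClientException), sleeping between attempts.
     public static <T> T withRetries(Supplier<T> call, int maxAttempts, long sleepMillis)
     {
       RuntimeException last = null;
       for (int attempt = 1; attempt <= maxAttempts; attempt++) {
         try {
           return call.get();
         }
         catch (RuntimeException e) {
           last = e;
           if (attempt < maxAttempts) {
             try {
               Thread.sleep(sleepMillis);
             }
             catch (InterruptedException ie) {
               Thread.currentThread().interrupt();
               break;
             }
           }
         }
       }
       // Wrap the last failure so the actual cause surfaces in the logs.
       throw new RuntimeException("Call failed after " + maxAttempts + " attempts", last);
     }

     public static void main(String[] args)
     {
       AtomicInteger calls = new AtomicInteger();
       // Fails twice with a simulated transient k8s error, then succeeds.
       String pod = withRetries(() -> {
         if (calls.incrementAndGet() < 3) {
           throw new RuntimeException("transient k8s error");
         }
         return "peon-pod";
       }, 5, 1L);
       System.out.println(pod + " found after " + calls.get() + " attempts");
     }
   }
   ```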
   
   
   #### Release note
   Fixed a bug where the overlord could hang indefinitely while saving logs for kubernetes tasks, and improved error logging for overlord <-> k8s communication failures.
   
   ##### Key changed/added classes in this PR
    * `KubernetesPeonLifecycle`
    * `KubernetesPeonClient`
   
   
   This PR has:
   
   - [X] been self-reviewed.
      - [ ] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [X] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

