georgew5656 opened a new pull request, #14001:
URL: https://github.com/apache/druid/pull/14001

   ### Description
   With the KubernetesTaskRunner, if a task is manually shut down while running 
or its job is manually deleted, the thread responsible for overseeing the job 
gets stuck waiting. When the job is deleted, the fabric8 client sends the thread 
a single event in which the job is null, but a null job does not satisfy the 
wait condition.
   
   This means the thread is stuck waiting on a fabric8 event (the job 
succeeding) that will never arrive, and it stays stuck until maxTaskDuration 
(default 4 hours) elapses. If a user of the extension sets a limited taskqueue 
maxSize, this can cause problems because the k8s executor pool is unable to 
pick up additional tasks (its threads are stuck waiting on old tasks whose jobs 
have already been deleted).
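
   The starvation effect can be reproduced outside Druid with a plain 
`ExecutorService`. This is only an illustrative sketch: the class name, the 
pool size of 1, and the `jobEvent` latch standing in for the fabric8 event are 
all assumptions, not code from the extension.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of the Druid extension.
public class ExecutorStarvationSketch {

    /** Returns {queuedTaskDoneWhileBlocked, queuedTaskDoneAfterRelease}. */
    public static boolean[] run() throws Exception {
        // A pool of one thread stands in for a small k8s executor pool.
        ExecutorService pool = Executors.newFixedThreadPool(1);
        CountDownLatch started = new CountDownLatch(1);
        // Stands in for the completion event that never arrives for a deleted job.
        CountDownLatch jobEvent = new CountDownLatch(1);
        try {
            pool.submit(() -> {
                started.countDown();
                try {
                    jobEvent.await(); // blocks the only worker thread
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // A second task is accepted but has no free thread to run on.
            Future<String> queued = pool.submit(() -> "ran");

            started.await();
            boolean doneWhileBlocked = queued.isDone();

            // Delivering the event frees the thread, and the queued task runs.
            jobEvent.countDown();
            boolean doneAfterRelease = "ran".equals(queued.get(5, TimeUnit.SECONDS));
            return new boolean[] {doneWhileBlocked, doneAfterRelease};
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        boolean[] result = run();
        System.out.println("queued task done while blocked: " + result[0]);
        System.out.println("queued task done after release: " + result[1]);
    }
}
```

   In the real runner there is nothing to deliver the missing event, so the 
thread is only released when the wait times out.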
   
   An alternative would be to have the shutdown method in the 
KubernetesTaskRunner cancel running futures so they don't get stuck when the 
job is deleted, but this would not address the case where a k8s job is manually 
deleted outside of Druid.
   
   #### Release notes
   Fix a bug where KubernetesTaskRunner threads hang after a running task is 
shut down or its k8s job is deleted.
   
   ##### Key changed/added classes in this PR
   Update waitForJobCompletion to exit with a failed status if the job has 
been deleted. This function is only called after a job has been confirmed to 
have been launched (either right after launchJobAndWaitForStart has been called 
or after a job that is already running has been run again), so a null job at 
this point means the job was deleted rather than not yet created, and there 
should be no issues with race conditions here.
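
   The change in the wait condition can be sketched as follows. This is a 
simplified model and not the code in this PR: `JobSnapshot`, `Phase`, 
`OLD_DONE`, `NEW_DONE` and `toPhase` are hypothetical names, and the old 
condition's exact shape is an assumption based on the description above. It 
does not use the fabric8 types.

```java
import java.util.function.Predicate;

// Hypothetical model of the wait condition; not the actual Druid or fabric8 code.
public class JobCompletionSketch {

    /** Stand-in for the job object reported by the client; null means the job was deleted. */
    public static final class JobSnapshot {
        final Integer succeeded;
        final Integer failed;

        public JobSnapshot(Integer succeeded, Integer failed) {
            this.succeeded = succeeded;
            this.failed = failed;
        }
    }

    public enum Phase { SUCCEEDED, FAILED }

    // Assumed old behavior: a null job never satisfies the condition, so the wait continues.
    public static final Predicate<JobSnapshot> OLD_DONE =
        job -> job != null && (job.succeeded != null || job.failed != null);

    // Sketch of the fix: a null (deleted) job also ends the wait.
    public static final Predicate<JobSnapshot> NEW_DONE =
        job -> job == null || job.succeeded != null || job.failed != null;

    /** Maps the job seen when the wait ends to a result; a deleted job counts as failed. */
    public static Phase toPhase(JobSnapshot job) {
        if (job == null) {
            return Phase.FAILED;
        }
        return job.succeeded != null ? Phase.SUCCEEDED : Phase.FAILED;
    }

    public static void main(String[] args) {
        System.out.println("old condition ends wait on deleted job: " + OLD_DONE.test(null));
        System.out.println("new condition ends wait on deleted job: " + NEW_DONE.test(null));
        System.out.println("phase for deleted job: " + toPhase(null));
    }
}
```

   Treating null as terminal is only safe because of the launch guarantee 
described above; if the function could be called before the job exists, a null 
job would be ambiguous.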
   
   This PR has:
   
   - [X] been self-reviewed.
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [X] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [X] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

