churromorales commented on code in PR #13804:
URL: https://github.com/apache/druid/pull/13804#discussion_r1112268989
##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/DruidKubernetesPeonClient.java:
##########
@@ -106,13 +104,15 @@ public JobResponse waitForJobCompletion(K8sTaskId taskId,
long howLong, TimeUnit
.inNamespace(namespace)
.withName(taskId.getK8sTaskId())
.waitUntilCondition(
- x -> x != null && x.getStatus() != null &&
x.getStatus().getActive() == null,
+ x -> x != null && x.getStatus() != null &&
x.getStatus().getActive() == null
+ && (x.getStatus().getFailed() != null ||
x.getStatus().getSucceeded() !=null),
Review Comment:
This was a tricky one. For some background
the overlord launches a task, waits for it to complete and then returns the
status. Now in k8s < 1.25 it would just mean once the pod is not active, go
grab the status.
Now the contact has changed a bit, because in 1.25 they introduced
finalizers:
https://github.com/kubernetes/kubernetes/pull/110948
And we noticed that when we ran tasks on a 1.25 k8s druid cluster they
would complete fine but marked as failure.
We outputted the job status and noticed that the job was not active, but
both success and failure were not set, but there was this field that had
```
uncountedTerminatedPods=UncountedTerminatedPods(failed=[],
succeeded=[e916cbf9-467a-45f3-86a7-3767145d6384], additionalProperties={})
```
which from the docs:
```
UncountedTerminatedPods holds the UIDs of Pods that have terminated but the
job controller hasn't yet accounted for in the status counters. The job
controller creates pods with a finalizer. When a pod terminates (succeeded or
failed), the controller does three steps to account for it in the job status:
(1) Add the pod UID to the arrays in this field. (2) Remove the pod finalizer.
(3) Remove the pod UID from the arrays while increasing the corresponding
counter. This field is beta-level. The job controller only makes use of this
field when the feature gate JobTrackingWithFinalizers is enabled (enabled by
default). Old jobs might not be tracked using this field, in which case the
field remains null.
```
So now what happens is the job goes from a state where it is not active, to
having uncountedTerminatedPods to then having a status with success or failure.
I will push up a one-line fix to make this work, but for those of you working
with 1.25 version of k8s, I’m sure you will be affected as well.
Basically add another check to wait on,
Right now we wait for this:
```
// block until
job.getStatus() != null && job.getActive() == null
// then return
return job.getStatus().getSucceeded() != null
```
So the change has to become
```
// block until
job.getStatus() != null && job.getActive() == null &&
(job.getStatus().getFailed() != null || job.getStatus().getSucceeded() !=null)
// then return
return job.getStatus().getSucceeded() != null
```
This should keep things backwards compatible and working in all versions of
k8s
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]