georgew5656 commented on code in PR #17419:
URL: https://github.com/apache/druid/pull/17419#discussion_r1821183763
##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesPeonLifecycle.java:
##########
@@ -393,4 +402,9 @@ protected TaskLocation getTaskLocationFromK8s()
);
}
+
+ public ListenableFuture<?> locatedFuture()
Review Comment:
duplicate function here i think
##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesPeonLifecycle.java:
##########
@@ -260,11 +261,19 @@ protected TaskLocation getTaskLocation()
since Druid doesn't support retrying tasks from a external system (K8s).
We can explore adding a fabric8 watcher
if we decide we need to change this later.
**/
- if (taskLocation == null) {
+ if (!taskLocation.isDone()) {
log.warn("Unknown task location for [%s]", taskId);
Review Comment:
i'm a little unsure about this logic because i have seen instances where the
k8s calls can temporarily fail. i am also not sure if fabric8 is actually
retrying these calls.
see the initial trigger for (https://github.com/apache/druid/pull/17431,
https://github.com/apache/druid/pull/17417)
i think i would feel better if we still fell back to querying k8s for the
pod location in this block for now (returning a immediate future). i think
another option would be to wrap getTaskLocationFromK8s in more retries, i can
put up a separate pr if you don't want to do it in this one though.
##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesPeonLifecycle.java:
##########
@@ -183,7 +184,7 @@ protected synchronized TaskStatus join(long timeout) throws
IllegalStateExceptio
since Druid doesn't support retrying tasks from a external system
(K8s). We can explore adding a fabric8 watcher
if we decide we need to change this later.
**/
- taskLocation = getTaskLocationFromK8s();
+ taskLocation.set(getTaskLocationFromK8s());
updateState(new State[]{State.NOT_STARTED, State.PENDING},
State.RUNNING);
Review Comment:
i think we should reverse the order of these two lines. technically if
updateState -> RUNNING is not called the runner will still not know where the
task is even if the getTaskLocationAsync future returns (b/c getTaskLocation
does a check on state first). in any case we should always be marking a task as
running immediately after we join it (i looked through all the references to
state)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]