georgew5656 commented on code in PR #17419:
URL: https://github.com/apache/druid/pull/17419#discussion_r1821183763


##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesPeonLifecycle.java:
##########
@@ -393,4 +402,9 @@ protected TaskLocation getTaskLocationFromK8s()
     );
 
   }
+
+  public ListenableFuture<?> locatedFuture()

Review Comment:
   I think this is a duplicate function.



##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesPeonLifecycle.java:
##########
@@ -260,11 +261,19 @@ protected TaskLocation getTaskLocation()
     since Druid doesn't support retrying tasks from a external system (K8s). We can explore adding a fabric8 watcher
     if we decide we need to change this later.
     **/
-    if (taskLocation == null) {
+    if (!taskLocation.isDone()) {
       log.warn("Unknown task location for [%s]", taskId);

Review Comment:
   I'm a little unsure about this logic because I have seen instances where
   the K8s calls temporarily fail, and I'm not sure whether fabric8 actually
   retries these calls.

   See the initial trigger for https://github.com/apache/druid/pull/17431 and
   https://github.com/apache/druid/pull/17417.

   I think I would feel better if we still fell back to querying K8s for the
   pod location in this block for now (returning an immediate future). Another
   option would be to wrap getTaskLocationFromK8s in more retries; I can put
   up a separate PR if you don't want to do it in this one.
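
   A rough sketch of both options, to make the suggestion concrete. This is
   not the PR's code: LocationFallbackSketch, k8sLookup, maxAttempts,
   lookupWithRetries and markLocated are hypothetical names, the location is a
   plain String, and java.util.concurrent.CompletableFuture stands in for the
   Guava future the PR uses.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical stand-in for the lifecycle class; not the PR's actual code.
final class LocationFallbackSketch
{
  // Stand-in for the PR's taskLocation future (assumption: CompletableFuture instead of Guava).
  private final CompletableFuture<String> taskLocation = new CompletableFuture<>();
  // Stand-in for getTaskLocationFromK8s(); may throw on a transient failure.
  private final Supplier<String> k8sLookup;
  private final int maxAttempts;

  LocationFallbackSketch(Supplier<String> k8sLookup, int maxAttempts)
  {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    this.k8sLookup = k8sLookup;
    this.maxAttempts = maxAttempts;
  }

  // Option 2: wrap the K8s lookup in a bounded number of attempts.
  String lookupWithRetries()
  {
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return k8sLookup.get();
      }
      catch (RuntimeException e) {
        last = e; // remember the failure and try again
      }
    }
    throw last; // every attempt failed
  }

  // Option 1: if the future is not done yet, query K8s directly and
  // return an already-completed ("immediate") future.
  CompletableFuture<String> getTaskLocationAsync()
  {
    if (!taskLocation.isDone()) {
      return CompletableFuture.completedFuture(lookupWithRetries());
    }
    return taskLocation;
  }

  // Called once the location is known (what join() would do in the PR).
  void markLocated(String location)
  {
    taskLocation.complete(location);
  }
}
```

   In this sketch the lookup failure propagates to the caller once the
   attempts are used up; whether to log and return an unknown location
   instead is a separate decision.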



##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesPeonLifecycle.java:
##########
@@ -183,7 +184,7 @@ protected synchronized TaskStatus join(long timeout) throws IllegalStateExceptio
         since Druid doesn't support retrying tasks from a external system (K8s). We can explore adding a fabric8 watcher
         if we decide we need to change this later.
       **/
-      taskLocation = getTaskLocationFromK8s();
+      taskLocation.set(getTaskLocationFromK8s());
       updateState(new State[]{State.NOT_STARTED, State.PENDING}, State.RUNNING);

Review Comment:
   I think we should reverse the order of these two lines. Technically, if
   updateState -> RUNNING has not been called, the runner will still not know
   where the task is even if the getTaskLocationAsync future returns (because
   getTaskLocation checks the state first). In any case, we should always mark
   a task as running immediately after we join it (I looked through all the
   references to state).
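
   A small sketch of why the order matters, under simplifying assumptions.
   JoinOrderSketch, the stateFirst flag and the String location are
   hypothetical and exist only for this demonstration, and
   java.util.concurrent.CompletableFuture stands in for the Guava future in
   the PR. A callback registered on the future runs as soon as the future
   completes, so it sees whatever state was set at that moment.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical stand-in for the lifecycle class; not the PR's actual code.
final class JoinOrderSketch
{
  enum State { NOT_STARTED, PENDING, RUNNING }

  private final AtomicReference<State> state = new AtomicReference<>(State.PENDING);
  private final CompletableFuture<String> taskLocation = new CompletableFuture<>();

  // Mirrors the behaviour described in the comment: the state is checked first,
  // so a task that is not RUNNING yet reports an unknown location.
  String getTaskLocation()
  {
    State current = state.get();
    if (current == State.NOT_STARTED || current == State.PENDING) {
      return "unknown";
    }
    return taskLocation.getNow("unknown");
  }

  CompletableFuture<String> getTaskLocationAsync()
  {
    return taskLocation;
  }

  // stateFirst exists only to compare the two orders side by side.
  void join(Supplier<String> k8sLookup, boolean stateFirst)
  {
    if (stateFirst) {
      state.set(State.RUNNING);               // proposed order: mark RUNNING first
      taskLocation.complete(k8sLookup.get());
    } else {
      taskLocation.complete(k8sLookup.get()); // order in the diff: callbacks fire here
      state.set(State.RUNNING);
    }
  }
}
```

   With stateFirst set to false, a callback on getTaskLocationAsync() that
   calls getTaskLocation() still gets "unknown", because the state is not
   RUNNING when the future completes. With stateFirst set to true, the same
   callback sees the real location.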



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

