georgew5656 commented on code in PR #17446:
URL: https://github.com/apache/druid/pull/17446#discussion_r1833113432


##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java:
##########
@@ -321,16 +333,40 @@ public List<Pair<Task, ListenableFuture<TaskStatus>>> 
restore()
   public void start()
   {
     log.info("Starting K8sTaskRunner...");
-    // Load tasks from previously running jobs and wait for their statuses to 
be updated asynchronously.
-    for (Job job : client.getPeonJobs()) {
+    // Load tasks from previously running jobs and wait for their statuses to 
start running.
+    final List<ListenableFuture<Boolean>> taskStatusActiveList = new 
ArrayList<>();
+    final List<Job> peonJobs = client.getPeonJobs();
+
+    log.info("Locating [%,d] active tasks.", peonJobs.size());
+    for (Job job : peonJobs) {
       try {
-        joinAsync(adapter.toTask(job));
+        KubernetesWorkItem kubernetesWorkItem = joinAsync(adapter.toTask(job));
+        
taskStatusActiveList.add(kubernetesWorkItem.getPeonLifeycle().getTaskStartedSuccessfullyFuture());
       }
       catch (IOException e) {
         log.error(e, "Error deserializing task from job [%s]", 
job.getMetadata().getName());
       }
     }
-    log.info("Loaded %,d tasks from previous run", tasks.size());
+
+    try {
+      final DateTime nowUtc = DateTimes.nowUtc();
+      final long timeoutMs = 
nowUtc.plus(config.getTaskJoinTimeout()).getMillis() - nowUtc.getMillis();
+      if (timeoutMs > 0) {
+        FutureUtils.coalesce(taskStatusActiveList).get(timeoutMs, 
TimeUnit.MILLISECONDS);
+      }
+      log.info("Located [%,d] active tasks.", taskStatusActiveList.size());
+    }
+    catch (Exception e) {
+      final long numInitialized =
+          tasks.values().stream().filter(item -> 
item.getPeonLifeycle().getTaskStartedSuccessfullyFuture().isDone()).count();

Review Comment:
   yeah i think looking for ones that eval to true makes sense, updated this.
   
   on the second comment, the main logic of the task lifecycle is performed by 
the exec thread.
   `      return tasks.computeIfAbsent(task.getId(), k -> new 
KubernetesWorkItem(
             task,
             exec.submit(() -> joinTask(task)),
             peonLifecycleFactory.build(
                 task,
                 this::emitTaskStateMetrics
             )
         ));`
   
   so if the task fails to join the KubernetesWorkItem.getResult() future will 
complete as failed and the taskQueue will mark the task as failed and shut it 
down, i don't believe we need to do this in the start() method too.
   
   for the new "did task start successfully future"
   i updated the logic so the catch blocks in run/join catch any errors during 
run/join and mark the future as failed if it didn't start properly. also added 
a test for the scenario 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to