georgew5656 commented on code in PR #17446:
URL: https://github.com/apache/druid/pull/17446#discussion_r1833113432
##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java:
##########
@@ -321,16 +333,40 @@ public List<Pair<Task, ListenableFuture<TaskStatus>>> restore()
public void start()
{
log.info("Starting K8sTaskRunner...");
-    // Load tasks from previously running jobs and wait for their statuses to be updated asynchronously.
- for (Job job : client.getPeonJobs()) {
+    // Load tasks from previously running jobs and wait for their statuses to start running.
+    final List<ListenableFuture<Boolean>> taskStatusActiveList = new ArrayList<>();
+ final List<Job> peonJobs = client.getPeonJobs();
+
+ log.info("Locating [%,d] active tasks.", peonJobs.size());
+ for (Job job : peonJobs) {
try {
- joinAsync(adapter.toTask(job));
+        KubernetesWorkItem kubernetesWorkItem = joinAsync(adapter.toTask(job));
+
+        taskStatusActiveList.add(kubernetesWorkItem.getPeonLifeycle().getTaskStartedSuccessfullyFuture());
}
catch (IOException e) {
        log.error(e, "Error deserializing task from job [%s]", job.getMetadata().getName());
}
}
- log.info("Loaded %,d tasks from previous run", tasks.size());
+
+ try {
+ final DateTime nowUtc = DateTimes.nowUtc();
+      final long timeoutMs = nowUtc.plus(config.getTaskJoinTimeout()).getMillis() - nowUtc.getMillis();
+ if (timeoutMs > 0) {
+        FutureUtils.coalesce(taskStatusActiveList).get(timeoutMs, TimeUnit.MILLISECONDS);
+ }
+ log.info("Located [%,d] active tasks.", taskStatusActiveList.size());
+ }
+ catch (Exception e) {
+      final long numInitialized =
+          tasks.values().stream().filter(item -> item.getPeonLifeycle().getTaskStartedSuccessfullyFuture().isDone()).count();
Review Comment:
Yeah, I think looking for the ones that evaluate to true makes sense; updated this.
On the second comment: the main logic of the task lifecycle is performed by the exec thread.
```java
return tasks.computeIfAbsent(task.getId(), k -> new KubernetesWorkItem(
    task,
    exec.submit(() -> joinTask(task)),
    peonLifecycleFactory.build(
        task,
        this::emitTaskStateMetrics
    )
));
```
So if the task fails to join, the KubernetesWorkItem.getResult() future will complete as failed, and the taskQueue will mark the task as failed and shut it down; we don't need to do this in the start() method.
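That failure path can be sketched with a `java.util.concurrent` stand-in (`CompletableFuture` rather than Guava's `ListenableFuture`); the class and method names below are hypothetical, not Druid's:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: how a failed join surfaces through the work item's
// result future. CompletableFuture stands in for Guava's ListenableFuture.
public class JoinFailureSketch
{
  static final ExecutorService exec = Executors.newCachedThreadPool();
  static final Map<String, CompletableFuture<String>> tasks = new ConcurrentHashMap<>();

  // Mirrors the computeIfAbsent pattern above: the exec thread runs the task
  // lifecycle, and the returned future carries either success or the failure.
  static CompletableFuture<String> joinAsync(String taskId, boolean failJoin)
  {
    return tasks.computeIfAbsent(taskId, k -> CompletableFuture.supplyAsync(() -> {
      if (failJoin) {
        // Throwing here completes the result future exceptionally, so an
        // observer watching the result can mark the task FAILED on its own.
        throw new IllegalStateException("could not join task " + k);
      }
      return "SUCCESS";
    }, exec));
  }

  // What a task-queue-like observer of the result future would see.
  static String awaitStatus(String taskId, boolean failJoin)
  {
    try {
      return joinAsync(taskId, failJoin).join();
    }
    catch (CompletionException e) {
      return "FAILED";
    }
  }

  public static void main(String[] args)
  {
    System.out.println(awaitStatus("task-ok", false));  // SUCCESS
    System.out.println(awaitStatus("task-bad", true));  // FAILED
    exec.shutdown();
  }
}
```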
For the new "did the task start successfully" future, I updated the logic so the catch blocks in run/join catch any errors during run/join and mark the future as failed if the task didn't start properly. I also added a test for that scenario.
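The bounded wait that start() performs over the restored tasks' futures can be sketched like this, using `CompletableFuture.allOf` as a stand-in for `FutureUtils.coalesce`; the names here are illustrative, not Druid's:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hedged sketch of the bounded wait in start(): block on every restored
// task's "started" future, but give up after the join timeout so startup
// never hangs forever on a task that cannot report in.
public class StartupWaitSketch
{
  // Waits up to timeoutMs for all futures, then reports how many completed.
  static long awaitStarted(List<CompletableFuture<Boolean>> started, long timeoutMs)
  {
    try {
      CompletableFuture.allOf(started.toArray(new CompletableFuture[0]))
                       .get(timeoutMs, TimeUnit.MILLISECONDS);
    }
    catch (TimeoutException | InterruptedException | ExecutionException e) {
      // Analogous to the catch block in the diff: don't fail startup, just
      // fall through and count how many tasks initialized in time.
    }
    return started.stream().filter(CompletableFuture::isDone).count();
  }

  public static void main(String[] args)
  {
    List<CompletableFuture<Boolean>> futures = List.of(
        CompletableFuture.completedFuture(true),  // restored task that started
        new CompletableFuture<>()                 // task that never reports in
    );
    System.out.println(awaitStarted(futures, 100));  // 1
  }
}
```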
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]