suneet-s commented on code in PR #14771:
URL: https://github.com/apache/druid/pull/14771#discussion_r1293401529


##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/KubernetesPeonClient.java:
##########
@@ -69,12 +85,14 @@ public Pod launchPeonJobAndWaitForStart(Job job, long 
howLong, TimeUnit timeUnit
                          }, howLong, timeUnit);
       long duration = System.currentTimeMillis() - start;
       log.info("Took task %s %d ms for pod to startup", jobName, duration);
+      emitK8sPodMetrics(job, "peon/startup/time", duration);

Review Comment:
   I think it would be good to preface the metric with `k8s` so that it is 
clear that the metric only applies to peons started by the kubernetes task 
runner.
   
   I also like adding the unit to the end of the metric name to make it easier 
for operators to understand what unit the metric is reported in.
   
   ```suggestion
         emitK8sPodMetrics(job, "k8s/peon/startup/timeMillis", duration);
   ```



##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/KubernetesPeonClient.java:
##########
@@ -87,6 +105,8 @@ public JobResponse waitForPeonJobCompletion(K8sTaskId 
taskId, long howLong, Time
                           howLong,
                           unit
                       );
+      long duration = System.currentTimeMillis() - start;
+      emitK8sPodMetrics(job, "peon/running/time", duration);

Review Comment:
   How is this different than the `task/run/time` metric?



##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java:
##########
@@ -370,13 +370,13 @@ public Optional<ScalingStats> getScalingStats()
   @Override
   public Map<String, Long> getIdleTaskSlotCount()
   {
-    return Collections.emptyMap();
+    return ImmutableMap.of(WORKER_CATEGORY, Long.valueOf(config.getCapacity() 
- tasks.size()));

Review Comment:
   The other implementations of getIdleTaskSlotCount do not use negative 
numbers to indicate being over capacity. I think we should follow the same 
pattern here.
   
   ```suggestion
       return ImmutableMap.of(WORKER_CATEGORY, Math.max(0L, 
Long.valueOf(config.getCapacity() - tasks.size())));
   ```



##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/KubernetesPeonClient.java:
##########
@@ -42,12 +48,22 @@ public class KubernetesPeonClient
   private final KubernetesClientApi clientApi;
   private final String namespace;
   private final boolean debugJobs;
+  private final TaskAdapter adapter;
+  private final ServiceEmitter emitter;
 
-  public KubernetesPeonClient(KubernetesClientApi clientApi, String namespace, 
boolean debugJobs)
+  public KubernetesPeonClient(
+      KubernetesClientApi clientApi,
+      String namespace,
+      boolean debugJobs,
+      TaskAdapter adapter,
+      ServiceEmitter emitter
+  )
   {
     this.clientApi = clientApi;
     this.namespace = namespace;
     this.debugJobs = debugJobs;
+    this.adapter = adapter;
+    this.emitter = emitter;
   }
 
   public Pod launchPeonJobAndWaitForStart(Job job, long howLong, TimeUnit 
timeUnit)

Review Comment:
   The job object that is being passed in here is calculated by using the 
adapter to convert a task to the Job in `KubernetesTaskRunner#doTask`. And then 
later we are using the adapter to convert it back to a task.
   
   Would it be cleaner to just pass in the Task object here instead?



##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java:
##########
@@ -370,13 +370,15 @@ public Optional<ScalingStats> getScalingStats()
   @Override
   public Map<String, Long> getIdleTaskSlotCount()
   {
-    return Collections.emptyMap();
+    // A negative return value indicates tasks are queued in the thread pool
+    // but have not been scheduled to run in K8s yet.
+    return ImmutableMap.of(WORKER_CATEGORY, Long.valueOf(config.getCapacity() 
- tasks.size()));
   }
 
   @Override
   public Map<String, Long> getUsedTaskSlotCount()
   {
-    return Collections.emptyMap();
+    return ImmutableMap.of(WORKER_CATEGORY, 
Long.valueOf(Math.min(config.getCapacity(), tasks.size())));

Review Comment:
   There can be a delay between tasks being added to the `tasks` map and when 
they are actually running. Can you explain how an operator should think about 
using this metric?



##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java:
##########
@@ -106,7 +106,7 @@ public class KubernetesTaskRunner implements 
TaskLogStreamer, TaskRunner
   private final HttpClient httpClient;
   private final PeonLifecycleFactory peonLifecycleFactory;
   private final ServiceEmitter emitter;
-
+  private static String WORKER_CATEGORY = "_k8s_worker_category";

Review Comment:
   Can you add a javadoc here explaining why this is the worker category.
   
   I believe it is because the KubernetesTaskRunner does not support worker 
categories currently, so we are just using this name to report metrics since 
that is what the interface calls for.
   
   `getTotalTaskSlotCount` uses the category `taskQueue`. Which one do you 
think we should standardize on?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to