suneet-s commented on code in PR #14771:
URL: https://github.com/apache/druid/pull/14771#discussion_r1293401529
##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/KubernetesPeonClient.java:
##########
@@ -69,12 +85,14 @@ public Pod launchPeonJobAndWaitForStart(Job job, long
howLong, TimeUnit timeUnit
}, howLong, timeUnit);
long duration = System.currentTimeMillis() - start;
log.info("Took task %s %d ms for pod to startup", jobName, duration);
+ emitK8sPodMetrics(job, "peon/startup/time", duration);
Review Comment:
I think it would be good to preface the metric with `k8s` so that it is
clear that the metric only applies to peons started by the kubernetes task
runner.
I also like adding the unit to the end of the metric name to make it easier
for operators to understand what unit the metric is reported in.
```suggestion
emitK8sPodMetrics(job, "k8s/peon/startup/timeMillis", duration);
```
##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/KubernetesPeonClient.java:
##########
@@ -87,6 +105,8 @@ public JobResponse waitForPeonJobCompletion(K8sTaskId
taskId, long howLong, Time
howLong,
unit
);
+ long duration = System.currentTimeMillis() - start;
+ emitK8sPodMetrics(job, "peon/running/time", duration);
Review Comment:
How is this different than the `task/run/time` metric?
##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java:
##########
@@ -370,13 +370,13 @@ public Optional<ScalingStats> getScalingStats()
@Override
public Map<String, Long> getIdleTaskSlotCount()
{
- return Collections.emptyMap();
+ return ImmutableMap.of(WORKER_CATEGORY, Long.valueOf(config.getCapacity()
- tasks.size()));
Review Comment:
The other implementations of getIdleTaskSlotCount do not use negative
numbers to indicate being over capacity. I think we should follow the same
pattern here.
```suggestion
return ImmutableMap.of(WORKER_CATEGORY, Math.max(0L,
Long.valueOf(config.getCapacity() - tasks.size())));
```
##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/KubernetesPeonClient.java:
##########
@@ -42,12 +48,22 @@ public class KubernetesPeonClient
private final KubernetesClientApi clientApi;
private final String namespace;
private final boolean debugJobs;
+ private final TaskAdapter adapter;
+ private final ServiceEmitter emitter;
- public KubernetesPeonClient(KubernetesClientApi clientApi, String namespace,
boolean debugJobs)
+ public KubernetesPeonClient(
+ KubernetesClientApi clientApi,
+ String namespace,
+ boolean debugJobs,
+ TaskAdapter adapter,
+ ServiceEmitter emitter
+ )
{
this.clientApi = clientApi;
this.namespace = namespace;
this.debugJobs = debugJobs;
+ this.adapter = adapter;
+ this.emitter = emitter;
}
public Pod launchPeonJobAndWaitForStart(Job job, long howLong, TimeUnit
timeUnit)
Review Comment:
The job object that is being passed in here is calculated by using the
adapter to convert a task to the Job in `KubernetesTaskRunner#doTask`. And then
later we are using the adapter to convert it back to a task.
Would it be cleaner to just pass in the Task object here instead?
##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java:
##########
@@ -370,13 +370,15 @@ public Optional<ScalingStats> getScalingStats()
@Override
public Map<String, Long> getIdleTaskSlotCount()
{
- return Collections.emptyMap();
+ // A negative return value indicates tasks are queued in the thread pool
+ // but have not been scheduled to run in K8s yet.
+ return ImmutableMap.of(WORKER_CATEGORY, Long.valueOf(config.getCapacity()
- tasks.size()));
}
@Override
public Map<String, Long> getUsedTaskSlotCount()
{
- return Collections.emptyMap();
+ return ImmutableMap.of(WORKER_CATEGORY,
Long.valueOf(Math.min(config.getCapacity(), tasks.size())));
Review Comment:
There can be a delay between tasks being added to the `tasks` map and when
they are actually running. Can you explain how an operator should think about
using this metric?
##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java:
##########
@@ -106,7 +106,7 @@ public class KubernetesTaskRunner implements
TaskLogStreamer, TaskRunner
private final HttpClient httpClient;
private final PeonLifecycleFactory peonLifecycleFactory;
private final ServiceEmitter emitter;
-
+ private static String WORKER_CATEGORY = "_k8s_worker_category";
Review Comment:
Can you add a javadoc here explaining why this is the worker category.
I believe it is because the KubernetesTaskRunner does not support worker
categories currently, so we are just using this name to report metrics since
that is what the interface calls for.
`getTotalTaskSlotCount` uses the category `taskQueue`. Which one do you
think we should standardize on?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]