Thank you for your reply. 

There's no GPU sharing between pods at the moment (this is the general 
situation in Kubernetes, except for NVIDIA MIG). The goal is to have the HPA 
increase/decrease the replicas of a deployment, which in turn triggers the 
cluster autoscaler to provision a new node if needed.
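
For what it's worth, here's a rough sketch of the manifests involved (all 
names, images, and thresholds are hypothetical, and the custom metric is 
assumed to already be exposed through prometheus-adapter):

```yaml
# Hypothetical GPU inference deployment plus an HPA scaling it on an
# external metric. The nvidia.com/gpu limit is what makes each new
# replica demand GPU capacity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mydeploy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mydeploy
  template:
    metadata:
      labels:
        app: mydeploy
    spec:
      containers:
        - name: inference
          image: example/inference-server:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1  # forces scheduling onto a GPU node
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: mydeploy-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mydeploy
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: External
      external:
        metric:
          name: mydeploy_gpu_avg  # assumes an adapter rule exposes this
        target:
          type: Value
          value: "80"
```

Once replicas exceed the free GPU capacity, it's the pending pods' 
`nvidia.com/gpu` request that prompts the cluster autoscaler to add a node.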

We're using plain GKE (and also RKE on-prem).
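
The per-deployment average asked about in the quoted question below could 
perhaps be approximated with a Prometheus recording rule that joins the DCGM 
metric against kube-state-metrics' kube_pod_info to find the nodes currently 
running the deployment's pods. This is only a sketch: the label names follow 
the dcgm_gpu_utilization output quoted below, and the created_by_name filter 
is a simplification (it matches the ReplicaSet name, not the Deployment 
directly):

```yaml
# Hypothetical recording rule producing mydeploy_gpu_avg: the average
# GPU utilization across nodes that run a pod of deployment "mydeploy".
groups:
  - name: gpu-hpa
    rules:
      - record: mydeploy_gpu_avg
        expr: |
          avg(
            dcgm_gpu_utilization
            * on(kubernetes_node) group_left
              max by (kubernetes_node) (
                label_replace(
                  kube_pod_info{created_by_kind="ReplicaSet", created_by_name=~"mydeploy-.*"},
                  "kubernetes_node", "$1", "node", "(.+)"
                )
              )
          )
```

An external-metric rule in prometheus-adapter could then surface 
mydeploy_gpu_avg to the HPA.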
On Saturday, May 1, 2021 at 4:44:38 PM UTC+1 [email protected] wrote:

> Hi,
> It depends on how the pods on the same node share the GPU, but I 
> think it is doable if you configure the HPA to spawn new pods and the pods 
> to `request` GPU resources; this will force the GKE cluster autoscaler into 
> creating new nodes to schedule the new pods.
> Are you using KubeFlow on top of GKE or a homemade platform?
>
> On Saturday, May 1, 2021 at 3:36:37 PM UTC+2 [email protected] wrote:
>
>> Hi all.
>>
>> I'm trying to implement HPA based on GPU utilization metrics. 
>>
>> My initial approach is to use DCGM Exporter, a DaemonSet that runs a 
>> pod on every GPU node and exports GPU metrics. 
>>
>> By adding an extra scrape config when installing the prometheus-community 
>> kube-prometheus-stack chart and a custom rule when installing 
>> prometheus-adapter, I'm able to query the Prometheus API and get 
>> dcgm_gpu_utilization for each node:
>> dcgm_gpu_utilization{Hostname="dcgm-exporter-dmrff", 
>> UUID="GPU-e26f8adc-c4aa-4a46-b3d3-ff4599da50a3", device="nvidia0", gpu="0", 
>> instance="10.28.0.50:9400", job="gpu-metrics", 
>> kubernetes_node="gke-test-hpa-gpu-nodes-0f879509-qth8"} 3
>> dcgm_gpu_utilization{Hostname="dcgm-exporter-rxjfm", 
>> UUID="GPU-0446c63e-3843-62fa-56db-423958021f5c", device="nvidia0", gpu="0", 
>> instance="10.28.1.27:9400", job="gpu-metrics", 
>> kubernetes_node="gke-test-hpa-gpu-nodes-0f879509-8bgb"} 0
>>
>> What I'd like to ask is this: Is it possible to configure HPA for a 
>> deployment based on this metric (even though it's being exported for each 
>> node through dcgm-exporter pods and not the pods corresponding to the 
>> deployment we want to autoscale)?
>>
>> Perhaps there's a way to generate a metric like mydeploy_gpu_avg which 
>> is equal to avg(dcgm_gpu_utilization) over all nodes that have a replica 
>> of the deployment mydeploy? That would make it possible to configure HPA 
>> with a custom object that targets this mydeploy_gpu_avg metric of 
>> mydeploy.
>>
>>
>> I hope I'm making sense so far. Surprisingly, this seems to be quite a 
>> rare scenario. In case it helps to know, our use case is autoscaling 
>> GPU-based machine-learning inference servers.
>>
>>
>> I would really appreciate any advice on this. I've documented my 
>> current progress in a GitHub repo: 
>> https://github.com/ashrafgt/k8s-gpu-hpa
>>
>> Thank you.
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/581b6040-f1cd-4e5f-b8c3-4d170fc3db3fn%40googlegroups.com.
