I realized that using the request metrics may not work because they can
only be updated once a request is complete. Ideally you'd have a direct "is
this pod occupied" 1/0 metric from each model pod, but I don't know if
that's possible with the framework.

For the GPU metrics, we need to match the per-node utilization back to the
pods running on each node. Fortunately, kube-state-metrics provides the
kube_pod_info metric, which does just that. Since it always has a value of 1,
we can multiply it with any metric without changing the value. Normally
PromQL expects a 1:1 correspondence between the two sides of the
multiplication, but we can do many-to-one matching with the on, group_left,
and group_right modifiers. With these we tell Prometheus which labels we
expect to match, and which extra labels, if any, to copy from the "one" side
to the result. This works best if we first aggregate away all the labels we
are not interested in.

First we need the source metrics, aggregated down accordingly.
Structurally, any aggregation will work; choose the one that semantically
works best for the given metric. For example, we can use

  avg by(node) (gpu_utilization)

and

  max by(node, pod, namespace) (kube_pod_info)

Now we need to combine them to create a new metric for GPU utilization by
pod and namespace:

  avg by(node) (gpu_utilization)
  * on(node) group_right()
  max by(node, pod, namespace) (kube_pod_info)

Since kube_pod_info is the "many" side of this match, its pod and namespace
labels are kept in the result automatically; the group_right label list only
needs entries for labels you want to copy from the "one" side.

If you record this, or configure it as an expression in the adapter, you
should be able to autoscale on it.
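
As a rough sketch, that could look something like the following. The
recording-rule name, the exposed metric name, and the adapter rule format
(I'm assuming the k8s prometheus-adapter here) are placeholders to adapt to
your setup:

  # Prometheus recording rule: store the joined expression under a new
  # metric name so the adapter and the HPA have something to reference.
  groups:
    - name: gpu-utilization-by-pod
      rules:
        - record: namespace_pod:gpu_utilization:avg
          expr: |
            avg by(node) (gpu_utilization)
            * on(node) group_right()
            max by(node, pod, namespace) (kube_pod_info)

  # prometheus-adapter rule exposing the recorded series as a per-pod
  # custom metric named "gpu_utilization".
  rules:
    - seriesQuery: 'namespace_pod:gpu_utilization:avg{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "namespace_pod:gpu_utilization:avg"
        as: "gpu_utilization"
      metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

The HPA can then reference the exposed name as a Pods-type metric with an
averageValue target.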

/MR


On Sat, May 1, 2021, 22:39 Ashraf Guitouni <a.guito...@instadeep.com> wrote:

> I'm glad to discuss this from different angles.
>
> Before further detailing this particular use-case with seldon-core, I want
> to say that we have a few other use-cases where we may also be interested
> in horizontal scaling based on usage metrics, so finding a way to implement
> this (even if it's not the final solution that we reach) has a lot of value.
>
> To answer the first question: "How many requests can one pod handle in
> parallel?". The answer is, in most cases, just one request. This is
> because, for these requests, the input isn't a single instance that we'd
> like to run inference on (which is how it usually works for a lot of ML
> systems), but rather a bulk of instances.
>
> To perform the computation on this bulk, we usually split it into batches,
> where the batch size depends on the model that we're serving, the GPU type,
> etc. Then the batch inference is executed sequentially, batch after batch.
> Because we work like this, every request will usually consume the GPU
> resource to nearly 100% until the process is done.
>
> Now, to be frank, this way of using an ML model server is strange, because
> if requests take minutes or hours to process, a workflow (argo, kubeflow,
> ...) that requests the needed resources on the spot and releases them once
> done is more fitting. That's how I initially implemented older versions of
> such systems.
>
> The issue is, among this majority of requests that take hours to process,
> there are requests that take less than a second, and for those, having a
> model server like seldon-core (or kfserving, ...) makes sense.
>
> Ideally, we'd have both methods implemented and deployed, forwarding small
> requests to the always-on seldon-core model server (and not worrying a lot
> about the need to autoscale, because a few requests having to wait an extra
> second or two is not a big issue for us) and triggering asynchronous
> workflows to process big requests, each requesting its own resources and using
> them exclusively.
>
> Because the current load on our system is very low, I decided that to make
> the best use of expensive cloud Nvidia GPUs, I can use just the
> synchronous seldon-core model server to handle both types, and if a big
> request happens to fully consume the server's resources for a long time
> (longer than 30 secs for example), a new replica would be created to be
> ready for any potential future requests.
>
> I hope the use-case makes a bit more sense now.
>
> I'll try to look into the default seldon-core metrics, because if they can
> indicate that the server has been under load for the last 30 secs or so
> (one or many requests are being processed), we can use that. Still, I hope
> you agree that there is merit to figuring out how to autoscale based on
> hardware resource usage (GPUs, TPUs, etc ...)
>
>
>
