Hi all! 

I hope I am not too late for the discussion. I would like to revive it, as I 
find it really useful for Knative and any other serverless framework. As a 
Knative contributor who also works on the monitoring side of the project, 
here is my point of view: 

a) OpenFaaS (mentioned earlier) might not be the best example to consider, 
as it seems to provide metrics only at the ingress side (the Gateway), 
similar to what you get from a service mesh like Istio when you monitor its 
ingress. I don't see any option to collect user metrics, at least not out of 
the box. Another serverless system, Dapr, uses a sidecar for tracing that, 
among other things, pushes traces to the OTel Collector 
(https://docs.dapr.io/operations/monitoring/tracing/open-telemetry-collector). 
Although Dapr still uses a pull model for metrics, this highlights the 
direction they are taking. Knative, by the way, supports different 
exporters, so it can use either a pull model or a push model; it is not 
restricted to OpenTelemetry at all.
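
To make the pull-vs-push point concrete, here is a rough Go sketch (my own 
illustration, not Knative's exporter code; the push URL and job name are 
placeholders) where the same counter can either be scraped from /metrics or 
pushed to a push-capable receiver using the Prometheus Go client:

    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
        "github.com/prometheus/client_golang/prometheus/push"
    )

    func main() {
        // One registry and one counter; whether it is pulled or pushed is a
        // deployment choice, not a change to the instrumentation itself.
        reg := prometheus.NewRegistry()
        invocations := prometheus.NewCounter(prometheus.CounterOpts{
            Name: "function_invocations_total",
            Help: "Total number of function invocations.",
        })
        reg.MustRegister(invocations)
        invocations.Inc()

        // Pull model: expose /metrics and let Prometheus (or an agent) scrape it.
        go func() {
            http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
            log.Fatal(http.ListenAndServe(":8080", nil))
        }()

        // Push model: ship the same registry to a push-capable receiver
        // ("http://push-receiver:9091" is a placeholder address).
        if err := push.New("http://push-receiver:9091", "my-function").
            Gatherer(reg).
            Push(); err != nil {
            log.Printf("push failed: %v", err)
        }

        select {} // keep the scrape endpoint alive in this sketch
    }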

b) What is the target latency for serverless? In cloud environments it is 
possible to get invocation latency down to milliseconds 
(https://aws.amazon.com/blogs/compute/creating-low-latency-high-volume-apis-with-provisioned-concurrency) 
for simple functions and also to minimize cold-start issues. As a rule, any 
solution that ships metrics should take far less time than the function run 
itself and should not add considerable resource overhead. Also, depending on 
the cost model, users should not have to pay for that overhead, so you at 
least need to be able to distinguish it somehow. Regarding latency, some 
apps can tolerate seconds or even minutes of delay, so it depends on how 
people want to ship metrics given their scenario. As background information, 
Knative cold-start time is a few seconds 
(https://groups.google.com/g/knative-users/c/vqkP95ibq60).
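
To illustrate keeping that overhead bounded (again just a sketch; the 
localhost receiver address is hypothetical), the metrics flush can be given 
a hard deadline of a few milliseconds so that, in the worst case, data is 
dropped rather than the invocation slowed down:

    package main

    import (
        "log"
        "net/http"
        "time"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/push"
    )

    // flushMetrics ships whatever is in the registry with a strict time budget,
    // so shipping always costs far less than the function run itself.
    func flushMetrics(reg *prometheus.Registry) {
        client := &http.Client{Timeout: 25 * time.Millisecond}

        start := time.Now()
        err := push.New("http://localhost:9091", "my-function"). // placeholder
            Client(client).
            Gatherer(reg).
            Push()
        if err != nil {
            // Budget exceeded or receiver down: drop the data and note it.
            log.Printf("metrics flush skipped after %v: %v", time.Since(start), err)
        }
    }

    func main() {
        reg := prometheus.NewRegistry()
        // ... register and update metrics during the invocation ...
        flushMetrics(reg)
    }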

c) There is a question of whether the serverless runtime should provide 
metrics forwarding/collection. I would say it is possible for at least the 
end-to-end traffic metrics, i.e. metrics related to requests entering the 
system, e.g. at ingress, where usually each request corresponds to a 
function invocation (Knative has this 1-1 mapping). Ingress seems the right 
collection point for robustness reasons: a request may fail at different 
stages, and this is also true for Knative, where several components may be 
on the request path. For any other metric, including user metrics, a 
different, localized approach to gathering metrics seems preferable. 
Separation of concerns is one reason behind this, as we don't want 
centralized components to become a metric sink like a collector while also 
doing other things such as scaling apps. 
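
As an illustration of what such end-to-end traffic metrics at ingress could 
look like (a sketch only, not Knative's ingress code; metric and label names 
are made up), a simple HTTP middleware at the entry point records one count 
and one latency sample per request, which with the 1-1 mapping also means 
per invocation:

    package main

    import (
        "net/http"
        "strconv"
        "time"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    var (
        requests = promauto.NewCounterVec(prometheus.CounterOpts{
            Name: "ingress_requests_total",
            Help: "Requests entering the system, i.e. function invocations.",
        }, []string{"function", "code"})

        latency = promauto.NewHistogramVec(prometheus.HistogramOpts{
            Name:    "ingress_request_duration_seconds",
            Help:    "End-to-end request latency observed at ingress.",
            Buckets: prometheus.DefBuckets,
        }, []string{"function"})
    )

    // statusRecorder captures the status code written by the wrapped handler.
    type statusRecorder struct {
        http.ResponseWriter
        code int
    }

    func (s *statusRecorder) WriteHeader(code int) {
        s.code = code
        s.ResponseWriter.WriteHeader(code)
    }

    // instrument records traffic metrics once, at ingress, no matter which
    // internal component a request later fails in.
    func instrument(function string, next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            start := time.Now()
            rec := &statusRecorder{ResponseWriter: w, code: http.StatusOK}
            next.ServeHTTP(rec, r)
            requests.WithLabelValues(function, strconv.Itoa(rec.code)).Inc()
            latency.WithLabelValues(function).Observe(time.Since(start).Seconds())
        })
    }

    func main() {
        http.Handle("/metrics", promhttp.Handler())
        http.Handle("/", instrument("hello", http.HandlerFunc(
            func(w http.ResponseWriter, r *http.Request) { w.Write([]byte("ok")) })))
        http.ListenAndServe(":8080", nil)
    }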

Looking at a possible generic solution, I would guess it would be based on a 
local agent. As far as I know, a local TCP connection is at that millisecond 
scale, including the time to send a few KB of metrics data. Of course this 
is not the only option; metrics could also be written to a local file whose 
contents are then streamed (the log solution mentioned above). Ideally, an 
architecture that ships metrics locally to an agent on the node would 
roughly satisfy the requirements (which, by the way, should be captured in 
detail). That agent could then push metrics to a metrics collector, either 
via remote write if it is Prometheus based, or in some other way if it is an 
OTel node agent 
(https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/design.md#running-as-an-agent). 
This is already done elsewhere, for example at AWS 
(https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-open-telemetry.html).
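
To sketch the function-facing half of such a node agent (purely 
illustrative; a real agent would aggregate, resolve collisions and forward 
via remote write or OTLP, and the port is arbitrary), a small localhost 
server can accept pushed text-format payloads and re-expose the most recent 
one per job for the forwarding side to pick up:

    package main

    import (
        "io"
        "log"
        "net/http"
        "sync"
    )

    // lastPush keeps the most recent text-format payload per push path.
    // The forwarding side would read these back and remote-write them on.
    var (
        mu       sync.Mutex
        lastPush = map[string][]byte{}
    )

    func main() {
        // Functions POST a few KB of metrics to /push/<job> on the local agent.
        http.HandleFunc("/push/", func(w http.ResponseWriter, r *http.Request) {
            body, err := io.ReadAll(io.LimitReader(r.Body, 1<<20)) // cap at 1 MiB
            if err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
            }
            mu.Lock()
            lastPush[r.URL.Path] = body
            mu.Unlock()
            w.WriteHeader(http.StatusAccepted)
        })

        // Whatever forwards metrics off the node (Prometheus agent, OTel
        // collector, ...) reads everything back from /metrics.
        http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
            w.Header().Set("Content-Type", "text/plain; version=0.0.4")
            mu.Lock()
            defer mu.Unlock()
            for _, payload := range lastPush {
                w.Write(payload)
            }
        })

        log.Fatal(http.ListenAndServe("127.0.0.1:9099", nil))
    }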

Best,
Stavros

On Sunday, November 28, 2021 at 1:42:38 AM UTC+2 co...@quirl.co.nz wrote:

> Just to throw my 2c in, we've been battling with this problem at (company) 
> as we move more services to a serverless model for our customer facing 
> things. Chiefly the issue of metrics aggregation for services that can't 
> easily track their own state across multiple requests. For us, there's just 
> too many metric semantics for different aggregations than can be expressed 
> in Prometheus types, so we have resorted to hacks such as 
> https://github.com/sinkingpoint/gravel-gateway to be able to express 
> these. The wider variety of OpenMetrics types solves most of these issues, 
> but that requires push gateway support as above, and a non zero effort from 
> clients to migrate to OpenMetrics client libs (if those even exist for 
> their languages of choice).
>
> For the above questions, _we_ answer in the following way:
>
> > What tradeoff would it make when metric ingestion is slower than metric 
> production? Backpressure or drop data?
>
> Just drop it, with metrics to indicate as such
>
> > What are the semantics of pushing a counter?
>
> Aggregation by summing by default with different options available, 
> configurable by the client
>
> > Where would the data move from there, and how?
>
> Exposed as per the push gateway as a regular Prometheus scrape
>
> > How many of these receivers would you typically run? How much 
> coordination is necessary between them?
>
> This gets complicated. In our setup we have a daemonset in k8s and an 
> ingress that does consistent hashing on the service name so that any given 
> service is routed to two different instances
>
> Having run this setup in production for about a year and a half now it 
> works for us in practice although it's definitely not ideal. We'd welcome 
> some sort of official OpenMetrics solution
>
> - Colin
>
>
> On Sun, Nov 28, 2021 at 10:22 AM Matthias Rampke <matt...@prometheus.io> 
> wrote:
>
>> What properties would an ideal OpenMetrics push receiver have? In 
>> particular, I am wondering:
>>
>> - What tradeoff would it make when metric ingestion is slower than metric 
>> production? Backpressure or drop data?
>> - What are the semantics of pushing a counter?
>> - Where would the data move from there, and how?
>> - How many of these receivers would you typically run? How much 
>> coordination is necessary between them?
>>
>> From observing the use of the statsd exporter, I see a few cases where it 
>> covers ground that is not very compatible with the in-process aggregation 
>> implied by the pull model. It has the downside of mapping through a 
>> different metrics model, and its tradeoffs are informed by the ones statsd 
>> made 10+ years ago. I wonder what it would look like, remade in 2022 
>> starting from OpenMetrics.
>>
>>
>> /MR
>>
>> On Sat, 27 Nov 2021, 12:50 Rob Skillington, <rob.ski...@gmail.com> wrote:
>>
>>> Here’s the documentation for using M3 coordinator (with or without M3 
>>> aggregator) with a backend that has a Prometheus Remote Write receiver:
>>> https://m3db.io/docs/how_to/any_remote_storage/
>>>
>>> Would be more than happy to do a call some time on this topic, the more 
>>> we’ve looked at this it’s a client library issue primarily way before you 
>>> consider the backend/receiver aspect (which there are options out there and 
>>> are fairly mechanical to overcome, vs the client library concerns which 
>>> have a lot of ergonomic and practical issues especially in a serverless 
>>> environment where you may need to wait for publishing before finishing your 
>>> request - perhaps an async process like publishing a message to local 
>>> serverless message queue like SQS is an option and having a reader read 
>>> that and use another client library to push that data out is ideal - it 
>>> would be more type safe and probably less lossy than logs and reading the 
>>> logs then publishing but would need good client library support for both 
>>> the serverless producers and the readers/pushers).
>>>
>>> Rob
>>>
>>> On Sat, Nov 27, 2021 at 1:41 AM Rob Skillington <r...@chronosphere.io> 
>>> wrote:
>>>
>>>> FWIW we have been experimenting with users pushing OpenMetrics protobuf 
>>>> payloads quite successfully, but only sophisticated exporters that can 
>>>> guarantee no collisions of time series and generate their own monotonic 
>>>> counters, etc are using this at this time.
>>>>
>>>> If you're looking for a solution that also involves aggregation 
>>>> support, M3 Coordinator (either standalone or combined with M3 Aggregator) 
>>>> supports Remote Write as a backend (and is thus compatible with Thanos, 
>>>> Cortex and of course Prometheus itself too due to the PRW receiver).
>>>>
>>>> M3 Coordinator however does not have any nice support to publish to it 
>>>> from a serverless environment (since the primary protocol it supports is 
>>>> Prometheus Remote Write which has no metrics clients, etc I would assume).
>>>>
>>>> Rob
>>>>
>>>>
>>>> On Mon, Nov 15, 2021 at 9:54 PM Bartłomiej Płotka <bwpl...@gmail.com> 
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I would love to resurrect this thread. I think we are missing a good 
>>>>> push-gateway like a product that would ideally live in Prometheus 
>>>>> (repo/binary or can be recommended by us) and convert events to metrics 
>>>>> in 
>>>>> a cheap way. Because this is what it is when we talk about short-living 
>>>>> containers and serverless functions. What's the latest Rob? I would be 
>>>>> interested in some call for this if that is still on the table. (: 
>>>>>
>>>>> I think we have some new options on the table like supporting Otel 
>>>>> metrics as such potential high-cardinal event push, given there are more 
>>>>> and more clients for that API. Potentially Otel collector can work as 
>>>>> such 
>>>>> "push gateway" proxy, but at this point, it's extremely generic, so we 
>>>>> might want to consider something more focused/efficient/easier to 
>>>>> maintain. 
>>>>> Let's see (: The other problem is that Otel metrics is yet another 
>>>>> protocol. Users might want to use push gateway API, remote write or 
>>>>> logs/traces as per @Tobias Schmidt idea 
>>>>>
>>>>> Another service "loggateway" (or otherwise named) would then stream 
>>>>>> the logs, aggregate them and either expose them on the common /metrics 
>>>>>> endpoint or push them with remote write right away to a Prometheus 
>>>>>> instance 
>>>>>> hosted somewhere (like Grafana Cloud)."
>>>>>
>>>>>
>>>>> Kind Regards,
>>>>> Bartek Płotka (@bwplotka)
>>>>>
>>>>>
>>>>> On Fri, Jun 25, 2021 at 6:11 AM Rob Skillington <r...@chronosphere.io> 
>>>>> wrote:
>>>>>
>>>>>> With respect to OpenMetrics push, we had something very similar at 
>>>>>> $prevco that pushed something that looked very similar to the protobuf 
>>>>>> payload of OpenMetrics (but was Thrift snapshot of an aggregated set of 
>>>>>> metrics from in process) that was used by short running tasks (for 
>>>>>> Jenkins, 
>>>>>> Flink jobs, etc).
>>>>>>
>>>>>> I definitely agree it’s not ideal and ideally the platform provider 
>>>>>> can supply a collection point (there is something for Jenkins, a plug-in 
>>>>>> that can do this, but custom metrics is very hard / nigh impossible to 
>>>>>> make 
>>>>>> work with it, and this is a non-cloud provider environment that’s 
>>>>>> actually 
>>>>>> possible to make work, just no one has made it seamless).
>>>>>>
>>>>>> I agree with Richi that something that could push to a Prometheus 
>>>>>> Agent like target that supports OpenMetrics push could be a good middle 
>>>>>> ground with the right support / guidelines:
>>>>>> - A way to specify multiple Prometheus Agent targets and quickly 
>>>>>> failover from one to another if within $X ms one is not responding (you 
>>>>>> could imagine a 5ms budget for each and max 3 are tried, introducing at 
>>>>>> worst 15ms overhead when all are down in 3 local availability zones, but 
>>>>>> in 
>>>>>> general this is a disaster case)
>>>>>> - Deduplication ability so that a retried push is not double counted, 
>>>>>> this might mean timestamping the metrics… (so if written twice only 
>>>>>> first 
>>>>>> record kept, etc)
>>>>>>
>>>>>> I think it should, similar to the Push Gateway, generally be a 
>>>>>> last-resort kind of option and have clear limitations, so that pull 
>>>>>> still remains the clear choice for anything but these environments.
>>>>>>
>>>>>> Is there any interest discussing this on a call some time?
>>>>>>
>>>>>> Rob
>>>>>>
>>>>>> On Thu, Jun 24, 2021 at 5:09 PM Bjoern Rabenstein <bjo...@rabenste.in> 
>>>>>> wrote:
>>>>>>
>>>>>>> On 22.06.21 11:26, Tobias Schmidt wrote:
>>>>>>> > 
>>>>>>> > Last night I was wondering if there are any other common interfaces
>>>>>>> > available in serverless environments and noticed that all products 
>>>>>>> by AWS
>>>>>>> > (Lambda) and GCP (Functions, Run) at least provide the option to 
>>>>>>> handle log
>>>>>>> > streams, sometimes even log files on disk. I'm currently thinking 
>>>>>>> about
>>>>>>> > experimenting with an approach where containers log metrics to 
>>>>>>> stdout /
>>>>>>> > some file, get picked up by the serverless runtime and written to 
>>>>>>> some log
>>>>>>> > stream. Another service "loggateway" (or otherwise named) would 
>>>>>>> then stream
>>>>>>> > the logs, aggregate them and either expose them on the common 
>>>>>>> /metrics
>>>>>>> > endpoint or push them with remote write right away to a Prometheus 
>>>>>>> instance
>>>>>>> > hosted somewhere (like Grafana Cloud).
>>>>>>>
>>>>>>> Perhaps I'm missing something, but isn't that
>>>>>>> https://github.com/google/mtail ?
>>>>>>>
>>>>>>> -- 
>>>>>>> Björn Rabenstein
>>>>>>> [PGP-ID] 0x851C3DA17D748D03
>>>>>>> [email] bjo...@rabenste.in
>>>>>>>
