On 15.06.21 20:59, Bartłomiej Płotka wrote:
>
> Let's now talk about FaaS/Serverless.

Excellent! That's my 2nd favorite topic after histograms. (And while I have provably talked about histograms as my favorite topic since early 2015, I only started to talk about FaaS/Serverless as an important gap to fill in the Prometheus story in 2018.)

I think "true FaaS" means that the function calls are lightweight. The additional overhead of sending anything over the network defeats that purpose. So, similar to what has been said before and what Bartek has already nicely worked out, I think the metrics have to be managed by the FaaS runtime, in the same path as billing is managed.

And that's, of course, what cloud providers are doing, and it's also a formidable way of locking their customers into their own metrics and monitoring system.

And that's in turn precisely where I think Prometheus can use its weight. Prometheus has already proven that cloud providers essentially cannot get away with ignoring it, and even halfhearted integrations won't be enough. With more or less native Prometheus support by cloud providers, it might actually require just a small step to agree on a convention for how to collect and present FaaS metrics in a "Promethean" way. If all cloud providers do it the same way, the lock-in is gone.

I think it would be very valuable to study what OpenFaaS has already done: https://docs.openfaas.com/architecture/metrics/

In the simplest case, we could just say: Please, dear cloud providers, please expose exactly the same metrics for general benefit. If there is anything to improve with the OpenFaaS approach, I'm sure they will be delighted to get help. (Offhand, what I'm missing is a way to define custom metrics, e.g. how many records a function call has processed.)
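
To make the custom-metrics point concrete, here is a minimal sketch of a runtime that does the accounting itself and hands each function a callback for its own counters, so the function never touches the network. Everything in it is an assumption made for illustration: the `RuntimeMetrics` class, the `faas_*` metric names and the `function_name` label are hypothetical and are not what OpenFaaS or any cloud provider exposes. Only the rendered output follows the Prometheus text exposition format.

```python
import time
from collections import defaultdict


class RuntimeMetrics:
    """Hypothetical in-process registry kept by the FaaS runtime."""

    def __init__(self):
        # Counter values keyed by (metric name, function name).
        self._counters = defaultdict(float)

    def inc(self, name, function, amount=1.0):
        if amount < 0:
            raise ValueError("counters must not decrease")
        self._counters[(name, function)] += amount

    def invoke(self, function_name, handler, event):
        """Run one function call and account for it inside the runtime."""
        start = time.monotonic()
        try:
            # The handler receives a callback for custom counters; the
            # increment is a local dictionary update, not a network call.
            return handler(
                event,
                lambda name, amount=1.0: self.inc(name, function_name, amount),
            )
        finally:
            self.inc("faas_invocations_total", function_name)
            self.inc("faas_invocation_seconds_total", function_name,
                     time.monotonic() - start)

    def render(self):
        """Render all counters in the Prometheus text exposition format."""
        lines = []
        seen = set()
        for (name, function), value in sorted(self._counters.items()):
            if name not in seen:
                lines.append(f"# TYPE {name} counter")
                seen.add(name)
            # Label values are not escaped here; a real exporter must
            # escape backslashes, quotes and newlines.
            lines.append(f'{name}{{function_name="{function}"}} {value}')
        return "\n".join(lines) + "\n"


def process_batch(event, count):
    # Custom metric: how many records this call has processed.
    count("faas_records_processed_total", len(event["records"]))
    return len(event["records"])


metrics = RuntimeMetrics()
metrics.invoke("process_batch", process_batch, {"records": [1, 2, 3]})
metrics.invoke("process_batch", process_batch, {"records": [4, 5]})
print(metrics.render())
```

A real runtime would serve the `render()` output on a /metrics endpoint for Prometheus to scrape. The sketch only shows that invocation counts, durations and custom counters can all be collected on the same path.
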
> * Suggestion to use event aggregation proxy
>   <https://github.com/weaveworks/prom-aggregation-gateway>
> * Pushgateway improvements
>   <https://groups.google.com/g/prometheus-users/c/sm5qOrsVY80/m/nSfbzHd9AgAJ>
>   for serverless cases

Despite all of what I said above, I think there _are_ quite a few users of FaaS who have fairly heavy-weight function calls. For them, pushing counter increments etc. via the network might actually be more convenient than funneling metrics through the FaaS runtime. This is then just another use case of the "distributed counter" idea, which the Pushgateway quite prominently does not cater for. As discussed in the thread linked above and in countless other places, I strongly recommend not shoehorning the Pushgateway into this use case, but creating a separate project that is designed for it from the beginning. Perhaps weaveworks/prom-aggregation-gateway is just that. I haven't studied it in detail yet.

In a way, we need "statsd done right". Again, I would suggest looking at what others have already done. For example, there are tons of statsd users out there. What have they done in recent years to overcome the known shortcomings? Perhaps statsd instrumentation and the Prometheus statsd exporter just need a bit of development in that direction to become a viable solution.

> I think the main problem appears if those FaaS runtimes are short-living
> workloads that automatically spins up only to run some functions (batch
> jobs). In some way, this is then a problem of short-living jobs and the
> design of those workloads.
>
> For those short-living jobs, we again see users try to use the push model.
> I think there is room to either streamline those initiatives OR propose
> an alternative. A quick idea, yolo... why not killing the job after the
> first successful scrape (detecting usage on /metric path)?

Ugh, that doesn't sound right. I think this problem should be solved within the FaaS runtime in the way they prefer.
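
As an aside on the "distributed counter" idea discussed above: what sets it apart from the Pushgateway is that pushed increments are summed, whereas the Pushgateway keeps the last pushed value for a group. The following is a minimal sketch of that summing core only. The `CounterAggregator` class and the metric and label names are hypothetical, and it is not a description of how weaveworks/prom-aggregation-gateway or the statsd exporter is implemented.

```python
import threading
from collections import defaultdict


class CounterAggregator:
    """Hypothetical core of an aggregating gateway: it sums increments."""

    def __init__(self):
        self._lock = threading.Lock()
        self._totals = defaultdict(float)

    def add(self, name, labels, delta):
        """Apply one pushed increment; labels is a dict of label pairs."""
        if delta < 0:
            raise ValueError("counter increments must be non-negative")
        # Sort the label pairs so that label order does not create
        # separate series.
        key = (name, tuple(sorted(labels.items())))
        with self._lock:
            self._totals[key] += delta

    def snapshot(self):
        """Return a copy of the running totals, e.g. for a scrape handler."""
        with self._lock:
            return dict(self._totals)


agg = CounterAggregator()
# Three separate function calls, each pushing its own increment.
agg.add("records_processed_total", {"function": "resize"}, 3)
agg.add("records_processed_total", {"function": "resize"}, 2)
agg.add("records_processed_total", {"function": "thumbnail"}, 7)

for (name, labels), total in sorted(agg.snapshot().items()):
    print(name, dict(labels), total)
```

The totals live only in memory, so a restart of the aggregator looks like a counter reset to Prometheus. That is the same situation as any restarted scrape target.
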
Cloud providers need billing in any case (they want to make money after all), so they have already solved reliable metrics collection for that. They just need to hook in a simple exporter to present Prometheus metrics. See how OpenFaaS has done it.

Knative seems to have gone down the OTel path, but that could be seen as an implementation detail. If in the end they expose a /metrics endpoint with the desired metrics for Prometheus to scrape, all is good. It's just a terribly overengineered exporter, effectively. (o;

--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in