Sn0rt commented on issue #9627: URL: https://github.com/apache/apisix/issues/9627#issuecomment-1870062819
> > > It is obvious that there are problems with the mechanism of this Prometheus exporter, which can be seen from four aspects:
> > >
> > > 1. When there are more routes and upstreams, the metrics data grows exponentially.
> > > 2. Because APISIX metrics only increase and never decrease, historical data keeps accumulating.
> > > 3. Although there is an LRU mechanism to keep the Prometheus Lua shared memory within the configured size, it is not a fundamental solution. Once the LRU mechanism is triggered, `Metrics Error` keeps increasing, whereas we want `Metrics Error` to help us identify real issues.
> > > 4. Although the new version of APISIX moves the Prometheus metrics exporter server into a privileged process, which reduces the P100 latency issue, the ever-growing metrics still put significant pressure on that privileged process. Taking our production environment as an example, we have 150k metric data points; each time Prometheus pulls the data, Nginx worker CPU usage reaches 100% for about 5-10 seconds.
> > >
> > > In fact, [nginx-lua-prometheus](https://github.com/knyar/nginx-lua-prometheus) provides `counter:del()` and `gauge:del()` methods to delete labels, so the APISIX Prometheus plugin may need to delete Prometheus metric data at certain times. Our current approach is similar, but more aggressive: we only retain type-level and route-level data and remove everything else.
> > >
> > > before:
> > >
> > > ```lua
> > > metrics.latency = prometheus:histogram("http_latency",
> > >     "HTTP request latency in milliseconds per service in APISIX",
> > >     {"type", "route", "service", "consumer", "node", unpack(extra_labels("http_latency"))},
> > >     buckets)
> > > ```
> > >
> > > after:
> > >
> > > ```lua
> > > metrics.latency = prometheus:histogram("http_latency",
> > >     "HTTP request latency in milliseconds per service in APISIX",
> > >     {"type", "route", unpack(extra_labels("http_latency"))},
> > >     buckets)
> > > ```
> >
> > @hansedong well said 👍 Only retaining `type` and `route` level data is not a universal implementation, and other users may not accept it. We are trying to find a general proposal, for example: set the TTL of these Prometheus metrics in the LRU to 10 minutes (the value is adjustable; this is just an example), so that this memory issue can be solved. What do you think?
>
> Do we have a plan for TTL?

APISIX uses the knyar/nginx-lua-prometheus library to record the metrics, so a TTL solution would be better if it were supported by the underlying library. I am currently discussing this with the maintainer of knyar/nginx-lua-prometheus in https://github.com/knyar/nginx-lua-prometheus/issues/164; in any case, this issue is already being advanced.
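Purely as an illustration of the TTL idea discussed above (this is not APISIX or nginx-lua-prometheus code; the names `TTL_SECONDS`, `last_seen`, `observe_request`, `expire_stale_metrics`, and the `http_requests_total` counter are made up for the sketch), one could remember when each label set was last updated and periodically call the library's existing `del()` method on label sets that have been idle longer than the TTL:

```lua
-- Minimal per-worker sketch of TTL-based expiry of idle Prometheus label sets.
-- Assumes `lua_shared_dict prometheus_metrics 10m;` is configured in nginx.conf.
local prometheus = require("prometheus").init("prometheus_metrics")

local TTL_SECONDS = 600   -- e.g. the 10 minutes suggested above; adjustable

-- key: "type\troute" -> { labels = {type, route}, ts = last update time }
-- (a per-worker Lua table; a real implementation would need to share this)
local last_seen = {}

local requests = prometheus:counter("http_requests_total",
    "Number of HTTP requests", {"type", "route"})

-- record a request and refresh the last-seen timestamp for its label set
local function observe_request(typ, route)
    requests:inc(1, { typ, route })
    last_seen[typ .. "\t" .. route] = { labels = { typ, route }, ts = ngx.now() }
end

-- drop label sets that have been idle for longer than the TTL
local function expire_stale_metrics(premature)
    if premature then
        return
    end
    local now = ngx.now()
    for key, entry in pairs(last_seen) do
        if now - entry.ts > TTL_SECONDS then
            requests:del(entry.labels)   -- del() as provided by nginx-lua-prometheus
            last_seen[key] = nil
        end
    end
end

-- sweep once a minute, e.g. started from init_worker_by_lua_block
ngx.timer.every(60, expire_stale_metrics)
```

A real implementation would also need to make the last-seen bookkeeping visible across workers (or run it where the metrics are actually recorded) and cover gauges and histograms, which is why having TTL support in the library itself, as proposed in knyar/nginx-lua-prometheus#164, would be the cleaner path.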
