Sn0rt commented on issue #9627:
URL: https://github.com/apache/apisix/issues/9627#issuecomment-1870062819

   > > > It is clear that the mechanism of this Prometheus exporter has problems, which can be seen from four aspects:
   > > > 
   > > > 1. As the number of routes and upstreams increases, the metrics data grows exponentially.
   > > > 2. Because APISIX metrics only ever increase and never decrease, historical data keeps accumulating.
   > > > 3. Although there is an LRU mechanism that keeps the Prometheus Lua shared memory within the configured size, it is not a fundamental solution: once the LRU mechanism is triggered, `Metrics Error` keeps increasing, whereas we want `Metrics Error` to help us identify real problems in a reasonable way.
   > > > 4. Although the new version of APISIX has moved the Prometheus metrics exporter server into a privileged process, which reduces P100 latency issues, the ever-growing metrics still put significant pressure on that privileged process. Taking our production environment as an example, we have about 150k metric data points; each time Prometheus pulls the data, Nginx worker CPU usage reaches 100% and stays there for about 5-10 seconds.
   > > > 
   > > > In fact, [nginx-lua-prometheus](https://github.com/knyar/nginx-lua-prometheus) provides `counter:del()` and `gauge:del()` methods for deleting label sets, so the APISIX Prometheus plugin may need to delete Prometheus metric data at certain points.
   > > > Our current approach is similar, but more aggressive: we retain only type-level and route-level data and remove everything else.
   > > > before:
   > > > ```lua
   > > > metrics.latency = prometheus:histogram("http_latency",  
   > > >     "HTTP request latency in milliseconds per service in APISIX",  
   > > >     {"type", "route", "service", "consumer", "node", 
unpack(extra_labels("http_latency"))},  
   > > >     buckets)
   > > > ```
   > > > 
   > > > after:
   > > > ```lua
   > > > metrics.latency = prometheus:histogram("http_latency",  
   > > >     "HTTP request latency in milliseconds per service in APISIX",  
   > > >     {"type", "route", unpack(extra_labels("http_latency"))},  
   > > >     buckets)
   > > > ```
   > > 
   > > 
   > > @hansedong well said 👍 Retaining only `type`- and `route`-level data is not a universally applicable approach, and other users may not accept it.
   > > We are trying to find a general proposal, for example: set the TTL of these Prometheus metric entries in the LRU cache to 10 minutes (adjustable, of course; this is just an example), which would solve the memory issue. What do you think?
   > 
   > Do we have a plan for the TTL?
   
   APISIX uses the knyar/nginx-lua-prometheus library to record its metrics. The TTL solution would be best implemented with support from the underlying library.
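   
   For reference, a minimal sketch of the deletion API that nginx-lua-prometheus already exposes, as mentioned in the quoted comment (the metric name, label, and route value below are illustrative only, not APISIX code):
   
   ```lua
   -- init() is normally called from init_worker_by_lua*, with a shared dict.
   local prometheus = require("prometheus").init("prometheus_metrics")
   
   -- A gauge keyed by route, for illustration only.
   local upstream_status = prometheus:gauge("upstream_status",
       "Upstream health status", {"route"})
   
   upstream_status:set(1, {"route-42"})
   
   -- When the route is removed, its label set can be dropped explicitly:
   upstream_status:del({"route-42"})
   ```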
   
   This is currently being discussed with the maintainer of knyar/nginx-lua-prometheus in https://github.com/knyar/nginx-lua-prometheus/issues/164; in any case, the issue is already moving forward.
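   
   Until the library offers native TTL support, here is a rough, plugin-side sketch of the TTL idea discussed above, for illustration only: remember when each label set was last updated and periodically `del()` the stale ones. The metric, labels, sweep interval, and the 10-minute TTL are assumptions taken from the example above, not APISIX code.
   
   ```lua
   -- init() is normally called from init_worker_by_lua*, with a shared dict.
   local prometheus = require("prometheus").init("prometheus_metrics")
   
   local requests = prometheus:counter("http_requests_total",
       "Requests per route", {"type", "route"})
   
   local last_seen = {}  -- per-worker bookkeeping: label key -> last update time
   local TTL = 600       -- 10 minutes, as in the example above
   
   -- Record a request and remember when this label set was last touched.
   local function observe(label_values)
       requests:inc(1, label_values)
       last_seen[table.concat(label_values, "|")] = ngx.now()
   end
   
   -- Delete label sets that have not been updated within the TTL.
   local function sweep(premature)
       if premature then
           return
       end
       local now = ngx.now()
       for key, ts in pairs(last_seen) do
           if now - ts > TTL then
               local label_values = {}
               for v in key:gmatch("[^|]+") do
                   label_values[#label_values + 1] = v
               end
               requests:del(label_values)
               last_seen[key] = nil
           end
       end
   end
   
   -- Register the sweep from init_worker_by_lua*, e.g. once a minute:
   -- ngx.timer.every(60, sweep)
   
   return { observe = observe, sweep = sweep }
   ```
   
   The obvious caveat is that such bookkeeping lives per worker while the metric storage is shared, which is one more reason why native TTL support in the library would be the cleaner fix.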

