moonming commented on issue #9627:
URL: https://github.com/apache/apisix/issues/9627#issuecomment-1859737688

   > There are clear problems with the mechanism of this Prometheus exporter, which can be seen from four aspects:
   > 
   > 1. As the number of routes and upstreams grows, the volume of metrics data grows exponentially.
   > 2. Because APISIX metrics only ever increase and are never removed, historical data keeps accumulating.
   > 3. Although there is an LRU mechanism that keeps the Prometheus Lua shared memory within the configured size, this is not a fundamental solution: once the LRU mechanism is triggered, `Metrics Error` keeps increasing. We hope that `Metrics Error` can help us identify issues in a reasonable manner.
   > 4. Although the new version of APISIX moves the Prometheus metrics exporter into a privileged process, which reduces the P100 latency issues, the ever-growing metrics still put significant pressure on that privileged process. Taking our production environment as an example: we have 150k metric data points, and every time Prometheus scrapes, the Nginx worker's CPU usage reaches 100% for about 5-10 seconds.
   > 
   > In fact, [nginx-lua-prometheus](https://github.com/knyar/nginx-lua-prometheus) provides `counter:del()` and `gauge:del()` methods for deleting label sets, so the APISIX Prometheus plugin may need to delete Prometheus metric data at certain points in time.
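   > 
   > For illustration, a minimal sketch of that API (the metric name and label values below are made up; only `counter:del()` itself comes from the library):
   > 
   > ```lua
   > -- del() removes the time series identified by one concrete set of label
   > -- values, so its entry in the shared dict can be reclaimed.
   > local requests = prometheus:counter("http_requests_total",
   >     "Number of HTTP requests", {"type", "route"})
   > 
   > requests:inc(1, {"http", "route-42"})  -- the series is created on first use
   > requests:del({"http", "route-42"})     -- ...and can later be deleted again
   > ```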
   > 
   > Currently, our approach is similar, but more aggressive: we retain only the type-level and route-level data and remove everything else.
   > 
   > before:
   > 
   > ```lua
   > metrics.latency = prometheus:histogram("http_latency",
   >     "HTTP request latency in milliseconds per service in APISIX",
   >     {"type", "route", "service", "consumer", "node", unpack(extra_labels("http_latency"))},
   >     buckets)
   > ```
   > 
   > after:
   > 
   > ```lua
   > metrics.latency = prometheus:histogram("http_latency",
   >     "HTTP request latency in milliseconds per service in APISIX",
   >     {"type", "route", unpack(extra_labels("http_latency"))},
   >     buckets)
   > ```
   
   @hansedong well said 👍
   Retaining only `type`- and `route`-level data is not a universal implementation, though, and other users may not accept it.
   
   We are trying to find a general proposal. For example: give these Prometheus metrics in the LRU cache a TTL of 10 minutes (adjustable, of course; this is just an example), so that stale entries expire and the memory issue is solved. What do you think?
   

