kingluo commented on issue #7949: URL: https://github.com/apache/apisix/issues/7949#issuecomment-1277311643
@hahayyum In `nginx-lua-prometheus` library, each metric instance maintains a lookup table to cache the full metric name of each label values combination. https://github.com/knyar/nginx-lua-prometheus/blob/master/prometheus.lua#L418 The prometheus plugin would update metrics at the log phase of each request and in turn update the lookup table. Because you generates the route id in random string each time, so the lookup table increases indefinitely and causes memory leak: https://github.com/apache/apisix/blob/6ae923a12fc1b7f8414b66efc053b19c7b5ddd08/apisix/plugins/prometheus/exporter.lua#L238 Check my gist page for reproduce detail: https://gist.github.com/kingluo/1df4736665a3381ce745f26cf17181d8 The systemtap script shows: ``` T:@/opt/apisix//deps/share/lua/5.1/prometheus.lua:346 @/opt/apisix//deps/share/lua/5.1/prometheus.lua:346 @/opt/apisix//deps/share/lua/5.1/prometheus.lua:540 @/opt/apisix/apisix/plugins/prometheus/exporter.lua:210 @/opt/apisix/apisix/plugin.lua:884 @/opt/apisix/apisix/plugin.lua:961 @/opt/apisix/apisix/init.lua:314 @/opt/apisix/apisix/init.lua:725 =log_by_lua(nginx.conf:410):0 : 0x0 total 544320 bytes C:json_decode C:json_protect_conversion @/opt/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:770 @/opt/apisix/apisix/core/config_etcd.lua:123 @/opt/apisix/apisix/core/config_etcd.lua:298 @/opt/apisix/apisix/core/config_etcd.lua:538 builtin#22 @/opt/apisix/apisix/core/config_etcd.lua:516 : 0xc total 358888 bytes builtin#177 =coroutine_api:7 @/opt/apisix//deps/share/lua/5.1/resty/http.lua:495 : 0xb total 323136 bytes ``` It points out the leak reason. **Note that the luajit gc does not free the unused memory immediately, so the output would include some memory footprint still in gc. We should check the codes and figure out which is fake memory leak.** **And also note that, because your url is in random string each time, so the pcre compiled lru cache would keep increasing util full of capacity. Those memory should be ignored too. That is, those memory would not increase after some times:** ``` T:@/usr/local/openresty-debug/lualib/resty/core/regex.lua:559 @/usr/local/openresty-debug/lualib/resty/core/regex.lua:559 @/usr/local/openresty-debug/lualib/ngx/re.lua:133 @/opt/apisix/apisix/admin/init.lua:119 @/opt/apisix//deps/share/lua/5.1/resty/radixtree.lua:687 @/opt/apisix/apisix/init.lua:803 =content_by_lua(nginx.conf:234):0 : 0x0 756 samples T:@/usr/local/openresty-debug/lualib/resty/lrucache.lua:104 @/opt/apisix//deps/share/lua/5.1/resty/radixtree.lua:491 @/opt/apisix//deps/share/lua/5.1/resty/radixtree.lua:610 @/opt/apisix//deps/share/lua/5.1/resty/radixtree.lua:629 @/opt/apisix//deps/share/lua/5.1/resty/radixtree.lua:687 @/opt/apisix/apisix/http/route.lua:104 @/opt/apisix/apisix/init.lua:331 =access_by_lua(nginx.conf:347):0 : 0x0 186 samples T:@/opt/apisix//deps/share/lua/5.1/resty/radixtree.lua:455 @/opt/apisix//deps/share/lua/5.1/resty/radixtree.lua:491 @/opt/apisix//deps/share/lua/5.1/resty/radixtree.lua:610 @/opt/apisix//deps/share/lua/5.1/resty/radixtree.lua:629 @/opt/apisix//deps/share/lua/5.1/resty/radixtree.lua:687 @/opt/apisix/apisix/http/route.lua:104 @/opt/apisix/apisix/init.lua:331 =access_by_lua(nginx.conf:347):0 : 0x0 3 samples / 756 samples \/http\/rtGjsw\/((.|\n)*) 1 samples \/http\/JilThq\/((.|\n)*) 1 samples ``` Currently, the `nginx-lua-prometheus` library used by apisix is old. It will cause other errors in your scenario. I would post a bugfix and tell the author to publish a new release: https://github.com/knyar/nginx-lua-prometheus/issues/150#issuecomment-1277219593 The simple bugfix is to replace the whole lookup table with a new empty table when full: ```diff --- /opt/prometheus.lua.orig 2022-10-13 17:21:34.766568987 +0800 +++ /opt/apisix//deps/share/lua/5.1/prometheus.lua 2022-10-13 17:26:05.729765362 +0800 @@ -353,6 +353,12 @@ return nil, string.format("inconsistent labels count, expected %d, got %d", self.label_count, cnt) end + + if self.lookup_size > self.lookup_max_size then + self.lookup_size = 0 + self.lookup = {} + end + local t = self.lookup if label_values then -- Don't use ipairs here to avoid inner loop generates trace first @@ -405,6 +411,9 @@ full_name = full_metric_name(self.name, self.label_names, label_values) end t[LEAF_KEY] = full_name + + self.lookup_size = self.lookup_size + 1 + local err = self._key_index:add(full_name) if err then return nil, err @@ -791,6 +800,8 @@ -- ['my.net']['200'][LEAF_KEY] = 'http_count{host="my.net",status="200"}' -- ['my.net']['500'][LEAF_KEY] = 'http_count{host="my.net",status="500"}' lookup = {}, + lookup_size = 0, + lookup_max_size = 100, parent = self, -- Store a reference for logging functions for faster lookup. _log_error = function(...) self:log_error(...) end, ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
