alptugay opened a new issue, #9775:
URL: https://github.com/apache/apisix/issues/9775

   ### Current Behavior
   
   Our APISIX instance has more than 2,500 routes and upstreams, all of which have active health checks enabled. Sometimes (once or twice a week) we see one or more workers using 100% CPU. At the same time, we see the following error log:
   
   ```
   2023/07/04 03:15:53 [error] 161670#161670: *27856946269 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/routes/461091278096960700) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/routes/461091278096960700:target_lock:*.*.*.*:8500': unlocked, context: ngx.timer, client: *.*.*.*, server: 0.0.0.0:80
   ```
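   For context on the message itself: the trailing `unlocked` is the error string that lua-resty-lock's `unlock()` returns when the lock object no longer holds the lock (for example a double release, or the lock key having already expired from the shared dict). A minimal sketch of that behaviour, assuming the stock lua-resty-lock API and a hypothetical `locks` shared dict:

   ```lua
   -- Sketch only: "locks" is a hypothetical dict declared via
   -- `lua_shared_dict locks 1m;` in nginx.conf.
   local resty_lock = require("resty.lock")

   local lock, err = resty_lock:new("locks")
   assert(lock, err)

   local elapsed, lerr = lock:lock("target_lock:demo")
   assert(elapsed, lerr)

   assert(lock:unlock())          -- first release succeeds
   local ok, uerr = lock:unlock() -- no lock is held any more:
                                  -- ok == nil, uerr == "unlocked"
   -- lua-resty-healthcheck logs this uerr verbatim as
   -- "failed to release lock '...': unlocked"
   ```

   On its own, a failed release should be harmless (the lock is already gone), so the message may be a symptom of the stuck worker rather than its cause.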
   
   We have encountered this situation multiple times, on multiple instances:
   
   At the exact same time, we can see that the CPU core running this worker jumps to 100%:

   <img width="1359" alt="Screenshot 2023-07-04 at 09 57 28" src="https://github.com/apache/apisix/assets/23238365/3b509414-5e0a-402f-8983-90a81ba30041">
   
   That CPU core also sees an increase in time spent:

   <img width="887" alt="Screenshot 2023-07-04 at 10 05 47" src="https://github.com/apache/apisix/assets/23238365/7fb39008-7c75-4260-9b8a-aade587bc521">
   
   Sockstat usage increases as well:
   
   <img width="881" alt="Screenshot 2023-07-04 at 10 07 06" 
src="https://github.com/apache/apisix/assets/23238365/5ab93658-7a77-4e89-92d8-299048825b0e";>
   
   <img width="871" alt="Screenshot 2023-07-04 at 10 07 16" 
src="https://github.com/apache/apisix/assets/23238365/867685fd-0f75-40ea-9fc4-2f2e2c0928a8";>
   
   After killing the worker process everything returns to normal (by the way, we can't kill the process gracefully; we have to use `kill -9`).

   We normally run APISIX 3.2.1, but to work around this we cherry-picked v3.0.0 of lua-resty-healthcheck, since the lock mechanism seems to have changed there. That caused a massive memory leak, so we reverted.
   
   All of our upstreams have the following timeout and healthcheck values:
   
   ```
    "timeout": {
       "connect": 60,
       "send": 60,
       "read": 60
     },
     "type": "roundrobin",
     "checks": {
       "active": {
         "concurrency": 10,
         "healthy": {
           "http_statuses": [
             200,
             302
           ],
           "interval": 1,
           "successes": 2
         },
         "http_path": "/",
         "https_verify_certificate": true,
         "timeout": 1,
         "type": "tcp",
         "unhealthy": {
           "http_failures": 5,
           "http_statuses": [
             429,
             404,
             500,
             501,
             502,
             503,
             504,
             505
           ],
           "interval": 1,
           "tcp_failures": 2,
           "timeouts": 3
         }
       }
     },
   ```
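   Note that with `"type": "tcp"`, the HTTP-specific fields above (`http_path`, `http_statuses`, `http_failures`, `https_verify_certificate`) should never be consulted; only `tcp_failures` and `timeouts` drive state changes. A minimal sketch (not APISIX's exact wiring) of how a `checks` table like this reaches lua-resty-healthcheck, assuming the stock library API:

   ```lua
   -- Sketch only: "upstream-healthcheck" matches the lua_shared_dict declared
   -- in the config below; the checker name is taken from the error log.
   local healthcheck = require("resty.healthcheck")

   local checker, err = healthcheck.new({
       name = "upstream#/apisix/routes/461091278096960700",
       shm_name = "upstream-healthcheck",
       checks = {
           active = {
               type = "tcp",     -- TCP probe: one connect() per target per interval
               timeout = 1,
               concurrency = 10,
               healthy   = { interval = 1, successes = 2 },
               unhealthy = { interval = 1, tcp_failures = 2, timeouts = 3 },
           },
       },
   })
   assert(checker, err)
   ```

   With a 1-second interval across 2,500+ upstreams, every check cycle acquires and releases per-target locks (the `target_lock` key in the error above), so lock churn is constant.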
   
   There are some other abnormalities as well. We don't know whether they are related, so we'll briefly share them too:

   We see lots of upstream timeout errors even though the upstreams are healthy and running. The connections seem to belong to the Kubernetes watch API (not sure). Note that long-lived watch connections can idle past the 60 s read timeout configured above, which would produce exactly this error.
   ```
   2023/07/04 03:15:51 [error] 161656#161656: *27855374513 upstream timed out (110: Connection timed out) while reading upstream, client: *.*.*.*, server: _, request: "GET /apis/kyverno.io/v1/clusterpolicies?allowWatchBookmarks=true&resourceVersion=280957361&watch=true HTTP/2.0", upstream: "https://*.*.*.*:6443/apis/kyverno.io/v1/clusterpolicies?allowWatchBookmarks=true&resourceVersion=280957361&watch=true", host: "example.com"
   2023/07/04 03:15:51 [error] 161656#161656: *27855374513 upstream timed out (110: Connection timed out) while reading upstream, client: *.*.*.*, server: _, request: "GET /apis/apps/v1/deployments?allowWatchBookmarks=true&resourceVersion=280957362&watch=true HTTP/2.0", upstream: "https://*.*.*.*:6443/apis/apps/v1/deployments?allowWatchBookmarks=true&resourceVersion=280957362&watch=true", host: "example.com"
   2023/07/04 03:15:51 [error] 161656#161656: *27855374513 upstream timed out (110: Connection timed out) while reading upstream, client: *.*.*.*, server: _, request: "GET /apis/wgpolicyk8s.io/v1alpha2/policyreports?allowWatchBookmarks=true&resourceVersion=280957363&watch=true HTTP/2.0", upstream: "https://*.*.*.*:6443/apis/wgpolicyk8s.io/v1alpha2/policyreports?allowWatchBookmarks=true&resourceVersion=280957363&watch=true", host: "example.com"
   2023/07/04 03:15:51 [error] 161656#161656: *27855374513 upstream timed out (110: Connection timed out) while reading upstream, client: *.*.*.*, server: _, request: "GET /apis/apps/v1/replicasets?allowWatchBookmarks=true&resourceVersion=280957365&watch=true HTTP/2.0", upstream: "https://*.*.*.*:6443/apis/apps/v1/replicasets?allowWatchBookmarks=true&resourceVersion=280957365&watch=true", host: "example.com"
   2023/07/04 03:15:52 [error] 161656#161656: *27855374513 upstream timed out (110: Connection timed out) while reading upstream, client: *.*.*.*, server: _, request: "GET /apis/kyverno.io/v1alpha2/admissionreports?allowWatchBookmarks=true&resourceVersion=280957366&watch=true HTTP/2.0", upstream: "https://*.*.*.*:6443/apis/kyverno.io/v1alpha2/admissionreports?allowWatchBookmarks=true&resourceVersion=280957366&watch=true", host: "example.com"
   2023/07/04 03:15:52 [error] 161656#161656: *27855374513 upstream timed out (110: Connection timed out) while reading upstream, client: *.*.*.*, server: _, request: "GET /apis/apiextensions.k8s.io/v1/customresourcedefinitions?allowWatchBookmarks=true&resourceVersion=280957366&watch=true HTTP/2.0", upstream: "https://*.*.*.*:6443/apis/apiextensions.k8s.io/v1/customresourcedefinitions?allowWatchBookmarks=true&resourceVersion=280957366&watch=true", host: "example.com"
   ```
   
   We have Tengine running in parallel with APISIX, and both have health checks enabled, but their connection states are very different. On Tengine, for example, we see fewer connections in TIME_WAIT and more in use:

   <img width="852" alt="Screenshot 2023-07-04 at 10 09 22" src="https://github.com/apache/apisix/assets/23238365/68d890e6-3c65-471b-9ddd-4ab98989e191">

   In APISIX, however, we see more connections in TIME_WAIT and fewer in use:

   <img width="871" alt="Screenshot 2023-07-04 at 10 30 21" src="https://github.com/apache/apisix/assets/23238365/94f3f884-affa-41e4-b8e5-23454203623c">

   When we disable health checks, the connection states return to a distribution similar to Tengine's.
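   One plausible (unverified) explanation for the difference: each active probe opens a fresh connection to its target, the prober closes it after the check, and the closing side keeps the socket in TIME_WAIT for roughly 60 s on Linux. A rough upper bound under those assumptions:

   ```
   2,500 targets × 1 probe/s × 60 s TIME_WAIT ≈ 150,000 sockets in TIME_WAIT
   ```

   which would match health-check traffic, rather than proxied traffic, dominating the TIME_WAIT counts in the graphs above.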
   
   Our config.yml:
   ```
   
   apisix:
     node_listen:                      # This style support multiple ports
       - port: 80
   
     enable_admin: true
     enable_dev_mode: false            # Sets nginx worker_processes to 1 if 
set to true
     enable_reuseport: true            # Enable nginx SO_REUSEPORT switch if 
set to true.
     show_upstream_status_in_response_header: false # when true all upstream 
status write to `X-APISIX-Upstream-Status` otherwise only 5xx code
     enable_ipv6: true
   
     enable_server_tokens: false        # Whether the APISIX version number 
should be shown in Server header.
   
     extra_lua_path: ""                # extend lua_package_path to load third 
party code
     extra_lua_cpath: ""               # extend lua_package_cpath to load third 
party code
   
     ssl_session_cache_size: 100m
   
     geoip_shared_dict_size: 100m
   
     proxy_cache_distributed_shared_dict_size: 200m
     redis_healthcheck_shared_dict_size: 50m
   
     proxy_cache:                      # Proxy Caching configuration
       cache_ttl: 10s                  # The default caching time in disk if 
the upstream does not specify the cache time
       zones:                          # The parameters of a cache
         - name: disk_cache_one        # The name of the cache, administrator 
can specify
           memory_size: 50m            # The size of shared memory, it's used 
to store the cache index for
           disk_size: 1G               # The size of disk, it's used to store 
the cache data (disk)
           disk_path: /tmp/disk_cache_one  # The path to store the cache data 
(disk)
           cache_levels: 1:2           # The hierarchy levels of a cache (disk)
         - name: memory_cache
           memory_size: 50m
   
     delete_uri_tail_slash: false    # delete the '/' at the end of the URI
     normalize_uri_like_servlet: false
     router:
       http: radixtree_host_uri         # radixtree_uri: match route by 
uri(base on radixtree)
       ssl: radixtree_sni          # radixtree_sni: match route by SNI(base on 
radixtree)
     stream_proxy:                  # TCP/UDP proxy
       only: false                   # use stream proxy only, don't enable HTTP 
stuff
       tcp:                         # TCP proxy port list
       - addr: "*.*.*.*:22"
       - addr: "*.*.*.*:22"
       - addr: "1789"
       - addr: "5000"
       - addr: "6780"
       - addr: "8000"
       - addr: "8004"
       - addr: "8041"
       - addr: "8042"
       - addr: "8774"
       - addr: "8776"
       - addr: "8780"
       - addr: "8786"
       - addr: "9001"
       - addr: "9292"
       - addr: "9311"
       - addr: "9322"
       - addr: "9511"
       - addr: "9696"
       - addr: "9876"
   
     resolver_timeout: 5             # resolver timeout
     enable_resolv_search_opt: true  # enable search option in resolv.conf
     ssl:
       enable: true
       listen:                       # APISIX listening port in https.
         - port: 443
           enable_http2: true
       ssl_protocols: TLSv1.2 TLSv1.3
    ssl_ciphers: ECDH+AESGCM:ECDH+AES256:ECDH+AES128:DHE+AES128:!ADH:!AECDH:!MD5
       ssl_session_cache: 100m
   
   
     enable_control: true
     control:
       ip: 127.0.0.1
       port: 9090
     disable_sync_configuration_during_start: false  # safe exit. Remove this 
once the feature is stable
     data_encryption:                # add `encrypt_fields = { $field },` in 
plugin schema to enable encryption
       enable: false                 # if not set, the default value is `false`.
       keyring:
         - qeddd145sfvddff3          # If not set, will save origin value into 
etcd.
   
   nginx_config:                     # config for render the template to 
generate nginx.conf
     user: apisix                     # specifies the execution user of the 
worker process.
     error_log:
       - syslog:server=unix:/dev/rsyslog,tag=lb_error_log,nohostname warn
     error_log_level:  warn          # warn,error
   
     enable_cpu_affinity: true       # enable cpu affinity, this is just work 
well only on physical machine
     worker_processes: 15          # if you want use multiple cores in 
container, you can inject the number of cpu as environment variable 
"APISIX_WORKER_PROCESSES"
     custom_cpu_affinity: |
       1111111111111110
     worker_rlimit_nofile: 1048576     # the number of files a worker process 
can open, should be larger than worker_connections
     worker_shutdown_timeout: 300s   # timeout for a graceful shutdown of 
worker processes
   
     max_pending_timers: 16384       # increase it if you see "too many pending 
timers" error
     max_running_timers: 4096        # increase it if you see 
"lua_max_running_timers are not enough" error
   
     event:
       worker_connections: 500000
   
     meta:
       lua_shared_dict:
         prometheus-metrics: 15m
   
     stream:
       enable_access_log: true         # enable access log or not, default true
       access_log:
      - syslog:server=unix:/dev/rsyslog,tag=lb_access_2xx3xx,nohostname,severity=info main if=$status_2xx3xx
      - syslog:server=unix:/dev/rsyslog,tag=lb_access_non_2xx3xx,nohostname,severity=info main if=$status_non_2xx3xx
    access_log_format: '{"bytes_sent":"$bytes_sent","connection":"$connection","protocol":"$protocol","remote_addr":"$remote_addr","remote_port":"$remote_port","server_addr":"$server_addr","server_port":"$server_port","session_time":"$session_time","ssl_server_name":"$ssl_server_name","status":"$status","upstream_addr":"$upstream_addr","upstream_bytes_received":"$upstream_bytes_received","upstream_bytes_sent":"$upstream_bytes_sent","upstream_connect_time":"$upstream_connect_time","upstream_session_time":"$upstream_session_time"}'
       access_log_format_escape: json          # allows setting json or default 
characters escaping in variables
       lua_shared_dict:
         etcd-cluster-health-check-stream: 10m
         lrucache-lock-stream: 10m
         plugin-limit-conn-stream: 10m
         upstream-healthcheck-stream: 100m
   
     main_configuration_snippet: |
     http_configuration_snippet: |
       map_hash_max_size 20480;
       map_hash_bucket_size 20480;
    map $http_cf_ipcountry $ipcountry { "" 1; default 0; tr 1; }
   
       map $status $status_2xx3xx {
           ~^[23]  1;
           default 0;
       }
       map $status $status_non_2xx3xx {
           ~^[23]  0;
        default 1;
       }
   
       sendfile        on;
       tcp_nopush      on;
       tcp_nodelay     on;
       proxy_buffers 4 16k;
       proxy_busy_buffers_size 16k;
       proxy_buffer_size 16k;
     http_server_configuration_snippet: |
       set $masked_hostname "******";
       client_body_buffer_size 128k;
       client_header_buffer_size 5120k;
       large_client_header_buffers 16 5120k;
     http_server_location_configuration_snippet: |
     http_admin_configuration_snippet: |
     http_end_configuration_snippet: |
     stream_configuration_snippet: |
       map $status $status_2xx3xx {
           ~^[23]  1;
           default 0;
       }
       map $status $status_non_2xx3xx {
           ~^[23]  0;
        default 1;
       }
   
   
     http:
       enable_access_log: true         # enable access log or not, default true
       access_log:
      - syslog:server=unix:/dev/rsyslog,tag=lb_access_2xx3xx,nohostname,severity=info main if=$status_2xx3xx
      - syslog:server=unix:/dev/rsyslog,tag=lb_access_non_2xx3xx,nohostname,severity=info main if=$status_non_2xx3xx
    access_log_format: '{"cf_ipcountry":"$http_cf_ipcountry","http_x_client_ip":"$http_x_client_ip","http_True_Client_IP":"$http_True_Client_IP","upstream_http_X_Proxy_Cache":"$upstream_http_X_Proxy_Cache","request_id":"$request_id","route_id":"$http_route_id","request_length":"$request_length","remote_addr":"$remote_addr","remote_port":"$remote_port","request":"$request","args":"$args","uri":"$uri","status":"$status","bytes_sent":"$bytes_sent","http_user_agent":"$http_user_agent","http_x_forwarded_for":"$http_x_forwarded_for","http_host":"$http_host","server_name":"$server_name","request_time":"$request_time","upstream":"$upstream_addr","upstream_connect_time":"$upstream_connect_time","upstream_status":"$upstream_status","upstream_response_time":"$upstream_response_time","upstream_cache_status":"$upstream_cache_status","ssl_protocol":"$ssl_protocol","ssl_cipher":"$ssl_cipher","scheme":"$scheme","server_port":"$server_port","request_method":"$request_method","server_protocol":"$server_protocol","http_cf_ray":"$http_cf_ray","ty_lb_waf_id":"$http_ty_lb_waf_id","ty_lb_cc":"$http_ty_lb_cc","ty_lb_asn":"$http_ty_lb_asn"}'
       access_log_format_escape: json       # allows setting json or default 
characters escaping in variables
       keepalive_timeout: 60s          # timeout during which a keep-alive 
client connection will stay open on the server side.
       client_header_timeout: 60s      # timeout for reading client request 
header, then 408 (Request Time-out) error is returned to the client
       client_body_timeout: 60s        # timeout for reading client request 
body, then 408 (Request Time-out) error is returned to the client
       client_max_body_size: 0         # The maximum allowed size of the client 
request body.
   
       send_timeout: 30s              # timeout for transmitting a response to 
the client.then the connection is closed
       underscores_in_headers: "on"   # default enables the use of underscores 
in client request header fields
       real_ip_header: X-Real-IP      # 
http://nginx.org/en/docs/http/ngx_http_realip_module.html#real_ip_header
       real_ip_recursive: "off"       # 
http://nginx.org/en/docs/http/ngx_http_realip_module.html#real_ip_recursive
       real_ip_from:                  # 
http://nginx.org/en/docs/http/ngx_http_realip_module.html#set_real_ip_from
         - 127.0.0.1
         - "unix:"
   
       proxy_ssl_server_name: true
       upstream:
         keepalive: 320                # Sets the maximum number of idle 
keepalive connections to upstream servers that are preserved in the cache of 
each worker process.
         keepalive_requests: 100000      # Sets the maximum number of requests 
that can be served through one keepalive connection.
         keepalive_timeout: 60s        # Sets a timeout during which an idle 
keepalive connection to an upstream server will stay open.
       charset: utf-8                  # Adds the specified charset to the 
"Content-Type" response header field, see
       variables_hash_max_size: 2048   # Sets the maximum size of the variables 
hash table.
   
       lua_shared_dict:
         internal-status: 100m
         plugin-limit-req: 100m
         plugin-limit-count: 100m
         prometheus-metrics: 1024m
         plugin-limit-conn: 100m
         upstream-healthcheck: 100m
         worker-events: 100m
         lrucache-lock: 100m
         balancer-ewma: 100m
         balancer-ewma-locks: 100m
         balancer-ewma-last-touched-at: 100m
         plugin-limit-count-redis-cluster-slot-lock: 100m
         tracing_buffer: 100m
         plugin-api-breaker: 100m
         etcd-cluster-health-check: 100m
         discovery: 100m
         jwks: 100m
         introspection: 100m
         access-tokens: 100m
         ext-plugin: 100m
         tars: 100m
         cas-auth: 100m
   
   
   
   
   graphql:
     max_size: 1048576               # the maximum size limitation of graphql 
in bytes, default 1MiB
   
   
   plugins:                          # plugin list (sorted by priority)
     - real-ip                        # priority: 23000
     - ai                             # priority: 22900
     - client-control                 # priority: 22000
     - proxy-control                  # priority: 21990
     - request-id                     # priority: 12015
     - zipkin                         # priority: 12011
     - ext-plugin-pre-req             # priority: 12000
     - fault-injection                # priority: 11000
     - mocking                        # priority: 10900
     - serverless-pre-function        # priority: 10000
     - cors                           # priority: 4000
     - ip-restriction                 # priority: 3000
     - ua-restriction                 # priority: 2999
     - referer-restriction            # priority: 2990
     - csrf                           # priority: 2980
     - uri-blocker                    # priority: 2900
     - request-validation             # priority: 2800
     - openid-connect                 # priority: 2599
     - cas-auth                       # priority: 2597
     - authz-casbin                   # priority: 2560
     - authz-casdoor                  # priority: 2559
     - wolf-rbac                      # priority: 2555
     - ldap-auth                      # priority: 2540
     - hmac-auth                      # priority: 2530
     - basic-auth                     # priority: 2520
     - jwt-auth                       # priority: 2510
     - key-auth                       # priority: 2500
     - consumer-restriction           # priority: 2400
     - forward-auth                   # priority: 2002
     - opa                            # priority: 2001
     - body-transformer               # priority: 1080
     - proxy-mirror                   # priority: 1010
     - proxy-cache-distributed        # priority: 1009
     - proxy-rewrite                  # priority: 1008
     - workflow                       # priority: 1006
     - api-breaker                    # priority: 1005
     - limit-conn                     # priority: 1003
     - limit-count                    # priority: 1002
     - limit-req                      # priority: 1001
     - gzip                           # priority: 995
     - server-info                    # priority: 990
     - multi-dc                       # priority: 967
     - traffic-split                  # priority: 966
     - redirect                       # priority: 900
     - response-rewrite               # priority: 899
     - degraphql                      # priority: 509
     - grpc-transcode                 # priority: 506
     - grpc-web                       # priority: 505
     - public-api                     # priority: 501
     - prometheus                     # priority: 500
     - datadog                        # priority: 495
     - elasticsearch-logger           # priority: 413
     - echo                           # priority: 412
     - loggly                         # priority: 411
     - http-logger                    # priority: 410
     - splunk-hec-logging             # priority: 409
     - skywalking-logger              # priority: 408
     - google-cloud-logging           # priority: 407
     - sls-logger                     # priority: 406
     - tcp-logger                     # priority: 405
     - kafka-logger                   # priority: 403
     - rocketmq-logger                # priority: 402
     - syslog                         # priority: 401
     - udp-logger                     # priority: 400
     - file-logger                    # priority: 399
     - clickhouse-logger              # priority: 398
     - tencent-cloud-cls              # priority: 397
     - inspect                        # priority: 200
     - example-plugin                 # priority: 0
     - aws-lambda                     # priority: -1899
     - azure-functions                # priority: -1900
     - openwhisk                      # priority: -1901
     - openfunction                   # priority: -1902
     - serverless-post-function       # priority: -2000
     - ext-plugin-post-req            # priority: -3000
     - ext-plugin-post-resp           # priority: -4000
     - ty-geoip                       # priority: -9000
   stream_plugins: # sorted by priority
     - ip-restriction                 # priority: 3000
     - limit-conn                     # priority: 1003
     - mqtt-proxy                     # priority: 1000
     - syslog                         # priority: 401
   
   
   
   plugin_attr:
     log-rotate:
       interval: 3600    # rotate interval (unit: second)
       max_kept: 168     # max number of log files will be kept
       max_size: -1      # max size bytes of log files to be rotated, size 
check would be skipped with a value less than 0
       enable_compression: true    # enable log file compression(gzip) or not, 
default false
     skywalking:
       service_name: APISIX
       service_instance_name: APISIX Instance Name
       endpoint_addr: http://127.0.0.1:12800
     opentelemetry:
       trace_id_source: x-request-id
       resource:
         service.name: APISIX
       collector:
         address: 127.0.0.1:4318
         request_timeout: 3
         request_headers:
           Authorization: token
       batch_span_processor:
         drop_on_queue_full: false
         max_queue_size: 1024
         batch_timeout: 2
         inactive_timeout: 1
         max_export_batch_size: 16
     prometheus:
       export_uri: /apisix/prometheus/metrics
       metric_prefix: apisix_
       enable_export_server: true
       export_addr:
         ip: 10.40.129.15
         port: 9091
     server-info:
       report_ttl: 60   # live time for server info in etcd (unit: second)
     dubbo-proxy:
       upstream_multiplex_count: 32
     request-id:
       snowflake:
         enable: false
         snowflake_epoc: 1609459200000   # the starting timestamp is expressed 
in milliseconds
         data_machine_bits: 12           # data machine bit, maximum 31, 
because Lua cannot do bit operations greater than 31
         sequence_bits: 10               # each machine generates a maximum of 
(1 << sequence_bits) serial numbers per millisecond
         data_machine_ttl: 30            # live time for data_machine in etcd 
(unit: second)
         data_machine_interval: 10       # lease renewal interval in etcd 
(unit: second)
     proxy-mirror:
       timeout:                          # proxy timeout in mirrored sub-request
         connect: 60s
         read: 60s
         send: 60s
     inspect:
       delay: 3            # in seconds
       hooks_file: "/usr/local/apisix/plugin_inspect_hooks.lua"
   
   deployment:
     role: traditional
     role_traditional:
       config_provider: etcd
     admin:
       admin_key:
         -
           name: admin
           key: **********
           role: admin                 # admin: manage all configuration data
   
       enable_admin_cors: true         # Admin API support CORS response 
headers.
       allow_admin:                    # 
http://nginx.org/en/docs/http/ngx_http_access_module.html#allow
         - 127.0.0.0/24                # If we don't set any IP list, then any 
IP access is allowed by default.
       admin_listen:                 # use a separate port
         ip: 127.0.0.1                 # Specific IP, if not set, the default 
value is `0.0.0.0`.
         port: 9180                  # Specific port, which must be different 
from node_listen's port.
   
   
       admin_api_mtls:               # Depends on `admin_listen` and 
`https_admin`.
         admin_ssl_cert: ""          # Path of your self-signed server side 
cert.
         admin_ssl_cert_key: ""      # Path of your self-signed server side key.
         admin_ssl_ca_cert: ""       # Path of your self-signed ca cert.The CA 
is used to sign all admin api callers' certificates.
   
       admin_api_version: v3         # The version of admin api, latest version 
is v3.
   
     etcd:
       host:                           # it's possible to define multiple etcd 
hosts addresses of the same etcd cluster.
       - "http://*.*.*.*:2379";
       - "http://*.*.*.*:2379";
       - "http://*.*.*.*2379";
   
       prefix: /apisix                 # configuration prefix in etcd
       use_grpc: false                 # enable the experimental configuration 
sync via gRPC
       timeout: 30                     # 30 seconds. Use a much higher timeout 
(like an hour) if the `use_grpc` is true.
       startup_retry: 2                # the number of retry to etcd during the 
startup, default to 2
       tls:
   
         verify: true                  # whether to verify the etcd endpoint 
certificate when setup a TLS connection to etcd,
   ```
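   For reference, since `enable_control: true` is set with `ip: 127.0.0.1` and `port: 9090` above, the state of every active health checker can be dumped from APISIX's control API when debugging this:

   ```
   curl http://127.0.0.1:9090/v1/healthcheck
   ```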
   
   ### Expected Behavior
   
   CPU usage should not be pegged at 100%.
   
   ### Error Logs
   
   2023/07/04 03:15:53 [error] 161670#161670: *27856946269 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/routes/461091278096960700) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/routes/461091278096960700:target_lock:*.*.*.*:8500': unlocked, context: ngx.timer, client: *.*.*.*, server: 0.0.0.0:80
   
   ### Steps to Reproduce
   
   Seems to be random. But: 2,500+ routes/services and upstreams, all with active health checks enabled.
   
   ### Environment
   
   - APISIX 3.2.1
   - Ubuntu 22.04
   nginx version: openresty/1.21.4.1
   built with OpenSSL 1.1.1s  1 Nov 2022
   TLS SNI support enabled
   configure arguments: --prefix=/usr/local/openresty/nginx
   --with-cc-opt='-O2 -DAPISIX_BASE_VER=1.21.4.1 -DNGX_GRPC_CLI_ENGINE_PATH=/usr/local/openresty/libgrpc_engine.so -DNGX_HTTP_GRPC_CLI_ENGINE_PATH=/usr/local/openresty/libgrpc_engine.so -DNGX_LUA_ABORT_AT_PANIC -I/usr/local/openresty/zlib/include -I/usr/local/openresty/pcre/include -I/usr/local/openresty/openssl111/include'
   --add-module=../ngx_devel_kit-0.3.1 --add-module=../echo-nginx-module-0.62 --add-module=../xss-nginx-module-0.06 --add-module=../ngx_coolkit-0.2 --add-module=../set-misc-nginx-module-0.33 --add-module=../form-input-nginx-module-0.12 --add-module=../encrypted-session-nginx-module-0.09 --add-module=../srcache-nginx-module-0.32 --add-module=../ngx_lua-0.10.21 --add-module=../ngx_lua_upstream-0.07 --add-module=../headers-more-nginx-module-0.33 --add-module=../array-var-nginx-module-0.05 --add-module=../memc-nginx-module-0.19 --add-module=../redis2-nginx-module-0.15 --add-module=../redis-nginx-module-0.3.9 --add-module=../ngx_stream_lua-0.0.11
   --with-ld-opt='-Wl,-rpath,/usr/local/openresty/luajit/lib -Wl,-rpath,/usr/local/openresty/wasmtime-c-api/lib -L/usr/local/openresty/zlib/lib -L/usr/local/openresty/pcre/lib -L/usr/local/openresty/openssl111/lib -Wl,-rpath,/usr/local/openresty/zlib/lib:/usr/local/openresty/pcre/lib:/usr/local/openresty/openssl111/lib'
   --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../mod_dubbo-1.0.2
   --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../ngx_multi_upstream_module-1.1.1
   --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../apisix-nginx-module-1.12.0
   --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../apisix-nginx-module-1.12.0/src/stream
   --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../apisix-nginx-module-1.12.0/src/meta
   --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../wasm-nginx-module-0.6.4
   --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../lua-var-nginx-module-v0.5.3
   --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../grpc-client-nginx-module-v0.4.2
   --with-poll_module --with-pcre-jit --with-stream --with-stream_ssl_module --with-stream_ssl_preread_module --with-http_v2_module --without-mail_pop3_module --without-mail_imap_module --without-mail_smtp_module --with-http_stub_status_module --with-http_realip_module --with-http_addition_module --with-http_auth_request_module --with-http_secure_link_module --with-http_random_index_module --with-http_gzip_static_module --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-threads --with-compat --with-stream --with-http_ssl_module
   - etcd version, if relevant (run `curl http://127.0.0.1:9090/v1/server_info`):
   - APISIX Dashboard version, if relevant:
   - Plugin runner version, for issues related to plugin runners:
   - LuaRocks version, for installation issues (run `luarocks --version`):
   

