[I] bug: In the case where heartbeat checking is enabled, unhealthy nodes are inexplicably added back and then kicked out again. [apisix]

via GitHub Wed, 15 Nov 2023 19:27:31 -0800


wodingyang opened a new issue, #10500:
URL: https://github.com/apache/apisix/issues/10500


   ### Current Behavior
   
   Configure two nodes in the upstream, one healthy and one unhealthy, and
   enable active health checking (heartbeat). Then send user requests. The
   health check marks the unhealthy node as down and removes it from rotation,
   as expected. After a while, however, the unhealthy node is added back and
   then evicted again. This happens whether or not there are user requests
   during the waiting period.

   In our testing we disabled retries and sent a continuous series of
   requests. Once the unhealthy node had been added back, the testing system
   received a large number of 502 errors. The API gateway logs show that these
   502 errors came from requests forwarded to the unhealthy node.
   
   ### Expected Behavior
   
   We inspected the source code and found the following in `upstream.lua`:
   
       local healthcheck_parent = upstream.parent
       if healthcheck_parent.checker and healthcheck_parent.checker_upstream == upstream then
           return healthcheck_parent.checker
       end
   
       local checker, err = healthcheck.new({
           name = get_healthchecker_name(healthcheck_parent),
           shm_name = "upstream-healthcheck",
           checks = upstream.checks,
       })
   
   The first time the health check is enabled, `healthcheck.new()` is called.
   Subsequent requests take the `if healthcheck_parent.checker and
   healthcheck_parent.checker_upstream == upstream then` branch and reuse the
   existing checker. After a certain period of time, however, this condition
   stops matching: a new checker is created and the previous one is deleted.
   We do not understand why the condition stops matching and would appreciate
   help identifying the issue.

   The time before the unhealthy node is added back varies widely, sometimes
   around one minute and sometimes around five minutes. Is there any caching
   or similar mechanism involved in this process?
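
To make the question concrete, here is a minimal Python model of the guard quoted above. It is a sketch, not APISIX code: `Checker`, `Parent`, and `fetch_checker` are hypothetical names, and the model assumes that a newly created checker starts with no failure history. In Lua, `==` on two tables compares identity, so the guard can only match while the very same `upstream` table is passed in. What would cause APISIX to hand over a different table for an unchanged upstream is exactly the open question of this report; the model only shows the consequence.

```python
import itertools


class Checker:
    """Hypothetical stand-in for a health checker instance."""

    _ids = itertools.count(1)

    def __init__(self, nodes):
        self.id = next(Checker._ids)
        self.nodes = list(nodes)
        # Assumption: a fresh checker has no failure history,
        # so every node is treated as healthy until checked again.
        self.unhealthy = set()

    def report_failure(self, node):
        self.unhealthy.add(node)

    def healthy_nodes(self):
        return [n for n in self.nodes if n not in self.unhealthy]


class Parent:
    """Hypothetical stand-in for `upstream.parent`, which caches the checker."""

    def __init__(self):
        self.checker = None
        self.checker_upstream = None


def fetch_checker(upstream, parent):
    # Same shape as the quoted guard: reuse the cached checker only if it
    # was built for this exact upstream object (identity, not equality).
    if parent.checker is not None and parent.checker_upstream is upstream:
        return parent.checker
    checker = Checker(upstream["nodes"])
    parent.checker = checker
    parent.checker_upstream = upstream
    return checker


nodes = ["10.110.3.52:8888", "10.110.3.51:8888", "10.110.3.50:8888"]
parent = Parent()

upstream_v1 = {"nodes": nodes}
first = fetch_checker(upstream_v1, parent)
first.report_failure("10.110.3.50:8888")
reused = fetch_checker(upstream_v1, parent)      # same object: cache hit

upstream_v2 = {"nodes": list(nodes)}             # equal content, new object
second = fetch_checker(upstream_v2, parent)      # cache miss: new checker
```

In this model `reused` is the same checker as `first`, while `second` is a new one that again lists `10.110.3.50:8888` as healthy. That matches the reported symptom of the node being added back and then evicted again once the new checker has counted enough failures.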
   
   
   ### Error Logs
   
   _No response_
   
   ### Steps to Reproduce
   
   Step 1: Configure the upstream and route
   {
     "nodes": {
       "10.110.3.52:8888": 1,
       "10.110.3.51:8888": 1,
       "10.110.3.50:8888": 1
     },
     "retries": 0,
     "name": "测试心跳",
     "type": "roundrobin",
     "timeout": {
       "connect": 60,
       "read": 60,
       "send": 60
     },
     "pass_host": "pass",
     "scheme": "http",
     "keepalive_pool": {
       "idle_timeout": 60,
       "requests": 1000,
       "size": 320
     },
     "checks": {
       "active": {
         "concurrency": 10,
         "healthy": {
           "http_statuses": [
             200,
             302
           ],
           "interval": 1,
           "successes": 2
         },
         "http_path": "/firstwork/healthcheck",
         "timeout": 5,
         "type": "http",
         "unhealthy": {
           "http_failures": 5,
           "http_statuses": [
             429,
             404,
             500,
             501,
             502,
             503,
             504,
             505
           ],
           "interval": 10,
           "tcp_failures": 2,
           "timeouts": 2
         }
       }
     }
   }
   
   {
     "uri": "/*",
     "upstream_id": "00021698712371002437",
     "name": "测试心跳问题",
     "host": "",
     "methods": [
       "GET",
       "HEAD",
       "POST",
       "PUT",
       "DELETE",
       "OPTIONS",
       "PATCH",
       "CONNECT",
       "TRACE"
     ],
     "priority": 0,
     "enable_websocket": false,
     "status": 1
   }
   Step 2: Continuously send user requests
   Step 3: Check the logs, which show the unhealthy node being evicted:
   2023/11/16 11:19:45 [warn] 30949#30949: *13223 [lua]  healthcheck.lua:1107: log(): [healthcheck] (upstream#/ding/upstreams/00021698712371002437) unhealthy TCP increment (1/2) for '(10.110.3.50:8888)'
   2023/11/16 11:19:46 [warn] 30949#30949: *13631 [lua] healthcheck.lua:1107: log(): [healthcheck] (upstream#/ding/upstreams/00021698712371002437) unhealthy TCP increment (2/2) for '(10.110.3.50:8888)'
   Step 4: After a certain period of time (about five minutes in this run), the same entries appear again:
   2023/11/16 11:24:46 [warn] 30949#30949: *25421 [lua] healthcheck.lua:1107: log(): [healthcheck] (upstream#/ding/upstreams/00021698712371002437) unhealthy TCP increment (1/2) for '(10.110.3.50:8888)'
   2023/11/16 11:24:47 [warn] 30949#30949: *25591 [lua] healthcheck.lua:1107: log(): [healthcheck] (upstream#/ding/upstreams/00021698712371002437) unhealthy TCP increment (2/2) for '(10.110.3.50:8888)'
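
The timing in these log entries can be checked mechanically. The following Python sketch parses the four entries above and measures the gap between evictions. The helper names `parse` and `eviction_gaps` are made up for this illustration, and the regular expression only assumes the `unhealthy TCP increment (n/m)` message format seen in these entries.

```python
import re
from datetime import datetime

LOG_LINES = [
    "2023/11/16 11:19:45 [warn] 30949#30949: *13223 [lua]  healthcheck.lua:1107: log(): [healthcheck] (upstream#/ding/upstreams/00021698712371002437) unhealthy TCP increment (1/2) for '(10.110.3.50:8888)'",
    "2023/11/16 11:19:46 [warn] 30949#30949: *13631 [lua] healthcheck.lua:1107: log(): [healthcheck] (upstream#/ding/upstreams/00021698712371002437) unhealthy TCP increment (2/2) for '(10.110.3.50:8888)'",
    "2023/11/16 11:24:46 [warn] 30949#30949: *25421 [lua] healthcheck.lua:1107: log(): [healthcheck] (upstream#/ding/upstreams/00021698712371002437) unhealthy TCP increment (1/2) for '(10.110.3.50:8888)'",
    "2023/11/16 11:24:47 [warn] 30949#30949: *25591 [lua] healthcheck.lua:1107: log(): [healthcheck] (upstream#/ding/upstreams/00021698712371002437) unhealthy TCP increment (2/2) for '(10.110.3.50:8888)'",
]

PATTERN = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) .*"
    r"unhealthy TCP increment \((\d+)/(\d+)\) for '\(([^)]+)\)'"
)


def parse(lines):
    """Return (timestamp, count, threshold, node) for each matching line."""
    events = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S")
            events.append((ts, int(m.group(2)), int(m.group(3)), m.group(4)))
    return events


def eviction_gaps(events):
    """Seconds between consecutive entries where the failure count hit the threshold."""
    times = [ts for ts, count, threshold, _ in events if count == threshold]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


events = parse(LOG_LINES)
print(eviction_gaps(events))  # [301.0]
```

Within each episode the two increments are one second apart and the threshold is 2, which agrees with `"interval": 1` under `healthy` and `"tcp_failures": 2` in the upstream configuration. The second eviction comes 301 seconds after the first, and the counter restarts at 1/2, so the failure count for `10.110.3.50:8888` was reset in between.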
   
   ### Environment
   
   - APISIX version (run `apisix version`): 2.15.3
   - Operating system (run `uname -a`): centos7
   - OpenResty / Nginx version (run `openresty -V` or `nginx -V`): 1.21.4.1
   


