wodingyang opened a new issue, #10500:
URL: https://github.com/apache/apisix/issues/10500
### Current Behavior
Configure two nodes in the upstream, one healthy and one unhealthy, and
enable active health checks. Once user requests start and the active check
is running, the unhealthy node is removed from rotation. After a while, the
unhealthy node is added back and then evicted again. This happens whether or
not there are user requests during the waiting period. In our tests, retries
were disabled. During a continuous series of requests, the unhealthy node was
added back and the test system received a large number of 502 errors. The API
gateway logs show that the requests behind these 502 errors were forwarded to
the unhealthy node.
### Expected Behavior
We inspected the source code and found the following in `upstream.lua`...
local healthcheck_parent = upstream.parent
if healthcheck_parent.checker and healthcheck_parent.checker_upstream == upstream then
    return healthcheck_parent.checker
end
local checker, err = healthcheck.new({
    name = get_healthchecker_name(healthcheck_parent),
    shm_name = "upstream-healthcheck",
    checks = upstream.checks,
})
The first time the health check is enabled, `healthcheck.new()` is called.
Subsequent requests take the `if healthcheck_parent.checker and
healthcheck_parent.checker_upstream == upstream then` branch and reuse the
existing checker. However, after a certain period of time, this condition no
longer holds: a new checker is created and the previous one is deleted. We are
unsure why this `if` condition stops holding and would appreciate help
identifying the issue.
The time before the unhealthy node is added back varies widely, from about
one minute to about five minutes. Is there any caching or other mechanism
involved in this process?
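To make the question concrete, here is a minimal sketch of how the quoted reuse check behaves in plain Lua. It is not the APISIX implementation: `new_checker`, `fetch_checker`, and the `key` field are hypothetical stand-ins, and the idea that the upstream table gets rebuilt is only an assumption used for illustration. The one certain fact it relies on is that Lua compares tables by identity, so a rebuilt table with identical contents does not satisfy `checker_upstream == upstream`.

```lua
-- Hypothetical stand-in for healthcheck.new(); counts how many checkers are built.
local created = 0
local function new_checker(name)
    created = created + 1
    return { name = name, id = created }
end

-- Mirrors the reuse check quoted above: the cached checker is reused only
-- when it was built for this exact upstream table (identity comparison).
local function fetch_checker(upstream)
    local parent = upstream.parent
    if parent.checker and parent.checker_upstream == upstream then
        return parent.checker
    end
    local checker = new_checker("upstream#" .. parent.key)
    parent.checker = checker
    parent.checker_upstream = upstream
    return checker
end

local parent = { key = "/upstreams/1" }          -- hypothetical key
local up_a = { parent = parent, checks = { active = {} } }
local first = fetch_checker(up_a)
local again = fetch_checker(up_a)                -- same table: checker is reused

-- Assumption: something rebuilds the upstream table with the same contents.
local up_b = { parent = parent, checks = up_a.checks }
local rebuilt = fetch_checker(up_b)              -- different table: new checker

print(first == again, first == rebuilt, created)
```

In this sketch the second call reuses the checker, while the call with the rebuilt table creates a second one, so `created` ends at 2. Whether anything in APISIX 2.15.3 actually replaces the upstream table in this way is exactly what we would like to have confirmed.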
### Error Logs
_No response_
### Steps to Reproduce
Step 1: Configuring the upstream and route
{
  "nodes": {
    "10.110.3.52:8888": 1,
    "10.110.3.51:8888": 1,
    "10.110.3.50:8888": 1
  },
  "retries": 0,
  "name": "测试心跳",
  "type": "roundrobin",
  "timeout": {
    "connect": 60,
    "read": 60,
    "send": 60
  },
  "pass_host": "pass",
  "scheme": "http",
  "keepalive_pool": {
    "idle_timeout": 60,
    "requests": 1000,
    "size": 320
  },
  "checks": {
    "active": {
      "concurrency": 10,
      "healthy": {
        "http_statuses": [
          200,
          302
        ],
        "interval": 1,
        "successes": 2
      },
      "http_path": "/firstwork/healthcheck",
      "timeout": 5,
      "type": "http",
      "unhealthy": {
        "http_failures": 5,
        "http_statuses": [
          429,
          404,
          500,
          501,
          502,
          503,
          504,
          505
        ],
        "interval": 10,
        "tcp_failures": 2,
        "timeouts": 2
      }
    }
  }
}
{
  "uri": "/*",
  "upstream_id": "00021698712371002437",
  "name": "测试心跳问题",
  "host": "",
  "methods": [
    "GET",
    "HEAD",
    "POST",
    "PUT",
    "DELETE",
    "OPTIONS",
    "PATCH",
    "CONNECT",
    "TRACE"
  ],
  "priority": 0,
  "enable_websocket": false,
  "status": 1
}
Step 2: Continuously initiate user requests
Step 3: Check the logs. The following warnings appear:
2023/11/16 11:19:45 [warn] 30949#30949: *13223 [lua] healthcheck.lua:1107: log(): [healthcheck] (upstream#/ding/upstreams/00021698712371002437) unhealthy TCP increment (1/2) for '(10.110.3.50:8888)'
2023/11/16 11:19:46 [warn] 30949#30949: *13631 [lua] healthcheck.lua:1107: log(): [healthcheck] (upstream#/ding/upstreams/00021698712371002437) unhealthy TCP increment (2/2) for '(10.110.3.50:8888)'
Step 4: After a certain period of time (five minutes in this run), the same warnings appear again:
2023/11/16 11:24:46 [warn] 30949#30949: *25421 [lua] healthcheck.lua:1107: log(): [healthcheck] (upstream#/ding/upstreams/00021698712371002437) unhealthy TCP increment (1/2) for '(10.110.3.50:8888)'
2023/11/16 11:24:47 [warn] 30949#30949: *25591 [lua] healthcheck.lua:1107: log(): [healthcheck] (upstream#/ding/upstreams/00021698712371002437) unhealthy TCP increment (2/2) for '(10.110.3.50:8888)'
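The counter in these log lines restarts at `(1/2)` in the second batch, which would be consistent with the failure count for this node having been reset, for example because a new checker was created. The sketch below illustrates that reading with a toy counter. It is not the lua-resty-healthcheck implementation: `new_counter` and `report_tcp_failure` are hypothetical names, and it assumes that failures are no longer counted once a node is already marked unhealthy.

```lua
-- Hypothetical minimal failure counter, for illustration only.
local function new_counter(threshold)
    return { failures = 0, threshold = threshold, healthy = true }
end

local function report_tcp_failure(c, log)
    if not c.healthy then
        return  -- assumption: an already unhealthy node is not counted again
    end
    c.failures = c.failures + 1
    log[#log + 1] = string.format("unhealthy TCP increment (%d/%d)",
                                  c.failures, c.threshold)
    if c.failures >= c.threshold then
        c.healthy = false
    end
end

local log = {}

-- tcp_failures = 2, as in the upstream configuration above.
local checker = new_counter(2)
for _ = 1, 4 do
    report_tcp_failure(checker, log)  -- only the first two failures are logged
end

-- Assumption: the checker is recreated, so the counter starts from zero.
checker = new_counter(2)
for _ = 1, 2 do
    report_tcp_failure(checker, log)
end

for _, line in ipairs(log) do
    print(line)
end
```

Under these assumptions a long-lived checker logs the `(1/2)` and `(2/2)` pair only once, and the pair reappears only after the counter state is recreated, which matches the pattern seen in Steps 3 and 4.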
### Environment
- APISIX version (run `apisix version`): 2.15.3
- Operating system (run `uname -a`): centos7
- OpenResty / Nginx version (run `openresty -V` or `nginx -V`): 1.21.4.1