hanzhenfang commented on issue #13141:
URL: https://github.com/apache/apisix/issues/13141#issuecomment-4686861553
Hi @Baoyuantop, @nic-6443, I built a small repro that makes the stale
health-check target visible through the removed node's access log. The removed
node stays alive, so the signal is not a timeout; it is simply whether APISIX
keeps sending `GET /health` to a node that is no longer in the upstream.

One thing I could not tell from the issue is how the upstream was updated.
That matters here: a full `PUT` replacement and a partial `PATCH` have
different semantics.
## Files
### docker-compose.yaml
```yaml
name: issue-13141
services:
  etcd:
    image: bitnamilegacy/etcd:3.6.4
    container_name: issue-13141-etcd
    restart: "no"
    environment:
      ALLOW_NONE_AUTHENTICATION: "yes"
      ETCD_ADVERTISE_CLIENT_URLS: http://etcd:2379
      ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
    ports:
      - "13144:2379"
  apisix:
    image: ${APISIX_IMAGE:-apache/apisix:3.14.1-debian}
    container_name: issue-13141-apisix
    restart: "no"
    depends_on:
      - etcd
      - node1
      - node2
    volumes:
      - ./config.yaml:/usr/local/apisix/conf/config.yaml:ro
    ports:
      - "13141:9080"
      - "13142:9180"
      - "13143:9090"
  node1:
    image: nginx:alpine
    container_name: issue-13141-node1
    restart: "no"
    environment:
      NODE_NAME: node1
    volumes:
      - ./mock-nginx.conf:/etc/nginx/templates/default.conf.template:ro
    networks:
      default:
        ipv4_address: 172.31.41.11
  node2:
    image: nginx:alpine
    container_name: issue-13141-node2
    restart: "no"
    environment:
      NODE_NAME: node2
    volumes:
      - ./mock-nginx.conf:/etc/nginx/templates/default.conf.template:ro
    networks:
      default:
        ipv4_address: 172.31.41.12
networks:
  default:
    ipam:
      config:
        - subnet: 172.31.41.0/24
```
### config.yaml
```yaml
apisix:
  node_listen:
    - 9080
  enable_admin: true
  control:
    ip: 0.0.0.0
    port: 9090
nginx_config:
  error_log_level: info
  worker_processes: 1
deployment:
  role: traditional
  role_traditional:
    config_provider: etcd
  admin:
    admin_listen:
      ip: 0.0.0.0
      port: 9180
    admin_key:
      - name: admin
        key: issue-13141-admin-key
        role: admin
    allow_admin:
      - 0.0.0.0/0
  etcd:
    host:
      - http://etcd:2379
```
### mock-nginx.conf
```nginx
log_format issue13141 '$time_iso8601 $hostname $request_method $uri $status';
server {
    listen 8080;
    access_log /dev/stdout issue13141;
    error_log /dev/stderr info;

    location = /health {
        add_header X-Issue-Node "${NODE_NAME}" always;
        return 200 "${NODE_NAME} healthy\n";
    }

    location / {
        add_header X-Issue-Node "${NODE_NAME}" always;
        return 200 "${NODE_NAME} response\n";
    }
}
```
## Steps
Start:
```sh
docker compose up -d --force-recreate --remove-orphans
docker compose exec -T apisix apisix version
```
Create an upstream with two nodes and active health checks:
```sh
curl -sS -X PUT http://127.0.0.1:13142/apisix/admin/upstreams/1 \
  -H 'X-API-KEY: issue-13141-admin-key' \
  -H 'Content-Type: application/json' \
  -d '{
    "type": "roundrobin",
    "nodes": {
      "172.31.41.11:8080": 1,
      "172.31.41.12:8080": 1
    },
    "checks": {
      "active": {
        "type": "http",
        "http_path": "/health",
        "healthy": {
          "interval": 1,
          "successes": 1
        },
        "unhealthy": {
          "interval": 1,
          "http_failures": 1
        }
      }
    }
  }'

curl -sS -X PUT http://127.0.0.1:13142/apisix/admin/routes/1 \
  -H 'X-API-KEY: issue-13141-admin-key' \
  -H 'Content-Type: application/json' \
  -d '{
    "uri": "/hc-cache",
    "upstream_id": "1"
  }'
```
Warm up and confirm both nodes are checked:
```sh
curl -sS http://127.0.0.1:13141/hc-cache
sleep 5
docker logs --since 10s issue-13141-node1 2>&1 | grep 'GET /health'
docker logs --since 10s issue-13141-node2 2>&1 | grep 'GET /health'
```
Remove `node2` from the upstream:
```sh
curl -sS -X PUT http://127.0.0.1:13142/apisix/admin/upstreams/1 \
  -H 'X-API-KEY: issue-13141-admin-key' \
  -H 'Content-Type: application/json' \
  -d '{
    "type": "roundrobin",
    "nodes": {
      "172.31.41.11:8080": 1
    },
    "checks": {
      "active": {
        "type": "http",
        "http_path": "/health",
        "healthy": {
          "interval": 1,
          "successes": 1
        },
        "unhealthy": {
          "interval": 1,
          "http_failures": 1
        }
      }
    }
  }'
```
Confirm the node is really gone from the stored upstream:
```sh
curl -sS http://127.0.0.1:13142/apisix/admin/upstreams/1 \
-H 'X-API-KEY: issue-13141-admin-key'
```
Hit the route again so APISIX instantiates the checker for the new upstream
version, then count health checks after the update:
```sh
curl -sS http://127.0.0.1:13141/hc-cache
MARK=$(date -u +%Y-%m-%dT%H:%M:%SZ)
sleep 8
echo "after update marker: $MARK"
echo "node1 health probes after update:"
docker logs --since "$MARK" issue-13141-node1 2>&1 | grep -c 'GET /health' || true
echo "node2 health probes after update:"
docker logs --since "$MARK" issue-13141-node2 2>&1 | grep -c 'GET /health' || true
```
Expected healthy behavior:
```text
node1 health probes after update: non-zero
node2 health probes after update: 0
```
Bug signal:
```text
node2 health probes after update: non-zero
```
If `node2` is still being probed after the Admin API shows only
`172.31.41.11:8080` in the upstream, the health checker target set is stale in
the worker and was not reconciled when `upstream.nodes` changed.
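
To script the check above instead of reading the counts by eye, here is a minimal sketch. The function names are mine and purely illustrative, and it assumes the `issue13141` log format from `mock-nginx.conf` (method and URI separated by single spaces):

```sh
# Count health-check probes in access-log text read from stdin.
# Assumes lines shaped like: <time> <hostname> <method> <uri> <status>
count_health_probes() {
  # grep -c prints 0 but exits 1 when nothing matches; `|| true` keeps
  # the function usable in scripts that run under `set -e`.
  grep -c ' GET /health ' || true
}

# Print a one-line verdict for a node that was removed from the upstream.
# $1 = number of probes seen after the update.
removed_node_verdict() {
  if [ "$1" -gt 0 ]; then
    echo "stale: removed node still probed ($1 probes)"
  else
    echo "ok: removed node no longer probed"
  fi
}
```

In this lab it could be used as `removed_node_verdict "$(docker logs --since "$MARK" issue-13141-node2 2>&1 | count_health_probes)"`.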
In my local run with `apache/apisix:3.14.1-debian`, this repro did show
stale probing when the upstream was replaced with `PUT`:
```text
before update:
  node1 -> GET /health once per second
  node2 -> GET /health once per second
after PUT with only 172.31.41.11:8080 and another route hit:
  node1 -> GET /health once per second
  node2 -> GET /health once per second
```
I also tested several update methods:
```text
admin PUT replace upstream body:
  stored config no longer contained node2
  node2 still received health checks
  => stale checker reproduced

admin PATCH partial nodes object:
  stored config still contained node2
  node2 still received health checks
  => not a stale-checker signal; node2 was not removed

admin PATCH node2 weight 0:
  stored config still contained node2 with weight 0
  node2 still received health checks
  => node remains a health-check target

direct etcd v3 put full replacement:
  stored config no longer contained node2
  node2 still received health checks
  => stale checker reproduced
```
One caveat I found while testing: `PATCH` with a partial `nodes` object
merges the object and does not remove omitted nodes. In that case the Admin API
response still contains the old node, so continued health checks are expected.
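
For completeness, a sketch of two `PATCH` forms that should actually drop a node. This relies on the PATCH behaviour described in the Admin API documentation (a `null` value deletes a key; a sub-path `PATCH` replaces that attribute), and neither form is among the update methods I tested in this lab. `ADMIN`, `KEY`, and the function names are illustrative:

```sh
ADMIN=${ADMIN:-http://127.0.0.1:13142}
KEY=${KEY:-issue-13141-admin-key}

# Build the body for a merge PATCH that deletes one node by setting it to null.
# $1 = node as "host:port"
patch_remove_body() {
  printf '{"nodes":{"%s":null}}' "$1"
}

# Option 1: merge PATCH with a null value for the node to drop.
# $1 = upstream id, $2 = node as "host:port"
patch_remove_node() {
  curl -sS -X PATCH "$ADMIN/apisix/admin/upstreams/$1" \
    -H "X-API-KEY: $KEY" \
    -H 'Content-Type: application/json' \
    -d "$(patch_remove_body "$2")"
}

# Option 2: sub-path PATCH that replaces the whole nodes object.
# $1 = upstream id, $2 = JSON object with the complete new node set
patch_replace_nodes() {
  curl -sS -X PATCH "$ADMIN/apisix/admin/upstreams/$1/nodes" \
    -H "X-API-KEY: $KEY" \
    -H 'Content-Type: application/json' \
    -d "$2"
}
```

For example, `patch_remove_node 1 172.31.41.12:8080` or `patch_replace_nodes 1 '{"172.31.41.11:8080": 1}'`, followed by a `GET` on the upstream to confirm what was stored.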
I also checked the timing. With full `PUT` replacement, stale health checks
to the removed node were visible immediately after the update, but they did not
persist indefinitely in this local lab:
```text
PUT then immediate route hit:
  first 8s window: node1_health=8, node2_health=8
  next 20s window: node1_health=21, node2_health=4
PUT long window:
  first 30s window: node1_health=29, node2_health=11
  next 60s window: node1_health=59, node2_health=0
  next 120s window: node1_health=118, node2_health=0
```
So this local repro shows a stale checker transition window after a full
upstream replacement, but it does not by itself explain a case where a removed
node is still checked days later. For that longer-lived symptom, I would first
verify whether the old node still exists in the effective data-plane upstream
config, whether only some APISIX data-plane instances missed the update, or
whether the health checks are coming from another upstream/service/route that
still references the node.
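
To measure the length of that transition window directly rather than inferring it from per-window counts, the timestamp of the last probe on the removed node can be compared with `$MARK`. A minimal sketch, assuming the `issue13141` log format where the first field is `$time_iso8601`; the function name is illustrative:

```sh
# Print the timestamp of the last health probe found in access-log text
# on stdin. Prints nothing if there was no probe at all.
last_probe_time() {
  grep ' GET /health ' | tail -n 1 | cut -d ' ' -f 1
}
```

For example, `docker logs --since "$MARK" issue-13141-node2 2>&1 | last_probe_time` shows when the stale checker stopped probing `node2`. If probing is still ongoing, the value simply keeps advancing between runs.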

**I think the next step is to establish how the old upstream nodes were actually "removed".**
```text
Admin API PUT:
  full replacement; omitted nodes should disappear from stored config

Admin API PATCH:
  partial merge; omitted nodes may remain in stored config

Dashboard/control-plane update:
  needs checking at the generated Admin API / etcd / DP config layer,
  because the UI action may be implemented as either replace or merge

direct etcd write:
  depends on whether the full upstream object is replaced or a stale object
  is written back

decoupled CP/DP sync:
  needs checking on each data plane, because one DP may still hold an older
  upstream version even if the control plane has the new object
```
The key diagnostic split is: if the removed node is still present in the
effective upstream config for the APISIX instance sending `/health`, the
behavior is explained by update semantics or config propagation. If the
effective upstream config no longer contains the node but that same APISIX
instance still sends active checks to it, then the problem is stale
health-checker state. For multi-DP or decoupled deployments, this check should
be done per data-plane instance, not only on the control-plane object.
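
The split above can be sketched as a small decision helper. This is illustrative only: the function names are mine, and `config_has_node` assumes the upstream uses the object form of `nodes` (keys like `"host:port"`), as in this repro, so it would not match the array form:

```sh
# Succeed if the upstream JSON on stdin contains the node as a quoted
# "host:port" key. $1 = node as "host:port"
config_has_node() {
  grep -qF "\"$1\""
}

# Classify one APISIX instance.
# $1 = "yes" if the node is in that instance's effective upstream config,
#      anything else otherwise
# $2 = number of probes the node received from that instance
classify() {
  if [ "$2" -eq 0 ]; then
    echo "no probes: node is not being health-checked"
  elif [ "$1" = yes ]; then
    echo "node still configured: check update semantics or config propagation"
  else
    echo "node not configured but probed: stale health-checker state"
  fi
}
```

In this single-instance lab the first argument can come from the Admin API response piped through `config_has_node 172.31.41.12:8080`. In a multi-DP deployment the input should be whatever reflects each data plane's own view of the upstream, not only the control-plane object.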