Re: [I] bug: health check keeps probing stale upstream nodes after nodes update (checker cache not invalidated) [apisix]

via GitHub Thu, 11 Jun 2026 19:46:31 -0700


hanzhenfang commented on issue #13141:
URL: https://github.com/apache/apisix/issues/13141#issuecomment-4686861553


   Hi @Baoyuantop, @nic-6443, I built a small repro that makes the stale 
health-check target visible through the removed node's access log. The removed 
node stays alive, so the signal is not a timeout; it is simply whether APISIX 
keeps sending `GET /health` to a node that is no longer in the upstream.
   
   One thing I could not tell from the issue is how the upstream was updated. 
That matters here. A full `PUT` replacement and a partial `PATCH` have 
different semantics.
   
   ## Files
   
   ### docker-compose.yaml
   
   ```yaml
   name: issue-13141
   
   services:
     etcd:
       image: bitnamilegacy/etcd:3.6.4
       container_name: issue-13141-etcd
       restart: "no"
       environment:
         ALLOW_NONE_AUTHENTICATION: "yes"
         ETCD_ADVERTISE_CLIENT_URLS: http://etcd:2379
         ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
       ports:
         - "13144:2379"
   
     apisix:
       image: ${APISIX_IMAGE:-apache/apisix:3.14.1-debian}
       container_name: issue-13141-apisix
       restart: "no"
       depends_on:
         - etcd
         - node1
         - node2
       volumes:
         - ./config.yaml:/usr/local/apisix/conf/config.yaml:ro
       ports:
         - "13141:9080"
         - "13142:9180"
         - "13143:9090"
   
     node1:
       image: nginx:alpine
       container_name: issue-13141-node1
       restart: "no"
       environment:
         NODE_NAME: node1
       volumes:
         - ./mock-nginx.conf:/etc/nginx/templates/default.conf.template:ro
       networks:
         default:
           ipv4_address: 172.31.41.11
   
     node2:
       image: nginx:alpine
       container_name: issue-13141-node2
       restart: "no"
       environment:
         NODE_NAME: node2
       volumes:
         - ./mock-nginx.conf:/etc/nginx/templates/default.conf.template:ro
       networks:
         default:
           ipv4_address: 172.31.41.12
   
   networks:
     default:
       ipam:
         config:
           - subnet: 172.31.41.0/24
   ```
   
   ### config.yaml
   
   ```yaml
   apisix:
     node_listen:
       - 9080
     enable_admin: true
   
   nginx_config:
     error_log_level: info
     worker_processes: 1
   
   deployment:
     role: traditional
     role_traditional:
       config_provider: etcd
     admin:
       admin_listen:
         ip: 0.0.0.0
         port: 9180
       admin_key:
         - name: admin
           key: issue-13141-admin-key
           role: admin
       allow_admin:
         - 0.0.0.0/0
     etcd:
       host:
         - http://etcd:2379
   
   control:
     ip: 0.0.0.0
     port: 9090
   ```
   
   ### mock-nginx.conf
   
   ```nginx
   log_format issue13141 '$time_iso8601 $hostname $request_method $uri $status';
   
   server {
       listen 8080;
       access_log /dev/stdout issue13141;
       error_log /dev/stderr info;
   
       location = /health {
           add_header X-Issue-Node "${NODE_NAME}" always;
           return 200 "${NODE_NAME} healthy\n";
       }
   
       location / {
           add_header X-Issue-Node "${NODE_NAME}" always;
           return 200 "${NODE_NAME} response\n";
       }
   }
   ```
   
   ## Steps
   
   Start:
   
   ```sh
   docker compose up -d --force-recreate --remove-orphans
   docker compose exec -T apisix apisix version
   ```
   
   Create an upstream with two nodes and active health checks:
   
   ```sh
   curl -sS -X PUT http://127.0.0.1:13142/apisix/admin/upstreams/1 \
     -H 'X-API-KEY: issue-13141-admin-key' \
     -H 'Content-Type: application/json' \
     -d '{
       "type": "roundrobin",
       "nodes": {
         "172.31.41.11:8080": 1,
         "172.31.41.12:8080": 1
       },
       "checks": {
         "active": {
           "type": "http",
           "http_path": "/health",
           "healthy": {
             "interval": 1,
             "successes": 1
           },
           "unhealthy": {
             "interval": 1,
             "http_failures": 1
           }
         }
       }
     }'
   
   curl -sS -X PUT http://127.0.0.1:13142/apisix/admin/routes/1 \
     -H 'X-API-KEY: issue-13141-admin-key' \
     -H 'Content-Type: application/json' \
     -d '{
       "uri": "/hc-cache",
       "upstream_id": "1"
     }'
   ```
   
   Warm up and confirm both nodes are checked:
   
   ```sh
   curl -sS http://127.0.0.1:13141/hc-cache
   sleep 5
   docker logs --since 10s issue-13141-node1 2>&1 | grep 'GET /health'
   docker logs --since 10s issue-13141-node2 2>&1 | grep 'GET /health'
   ```
   
   Remove `node2` from the upstream:
   
   ```sh
   curl -sS -X PUT http://127.0.0.1:13142/apisix/admin/upstreams/1 \
     -H 'X-API-KEY: issue-13141-admin-key' \
     -H 'Content-Type: application/json' \
     -d '{
       "type": "roundrobin",
       "nodes": {
         "172.31.41.11:8080": 1
       },
       "checks": {
         "active": {
           "type": "http",
           "http_path": "/health",
           "healthy": {
             "interval": 1,
             "successes": 1
           },
           "unhealthy": {
             "interval": 1,
             "http_failures": 1
           }
         }
       }
     }'
   ```
   
   Confirm the node is really gone from the stored upstream:
   
   ```sh
   curl -sS http://127.0.0.1:13142/apisix/admin/upstreams/1 \
     -H 'X-API-KEY: issue-13141-admin-key'
   ```
   
   Hit the route again so APISIX instantiates the checker for the new upstream 
version, then count health checks after the update:
   
   ```sh
   curl -sS http://127.0.0.1:13141/hc-cache
   MARK=$(date -u +%Y-%m-%dT%H:%M:%SZ)
   sleep 8
   echo "after update marker: $MARK"
   echo "node1 health probes after update:"
   docker logs --since "$MARK" issue-13141-node1 2>&1 | grep -c 'GET /health' || true
   echo "node2 health probes after update:"
   docker logs --since "$MARK" issue-13141-node2 2>&1 | grep -c 'GET /health' || true
   ```
   
   Expected healthy behavior:
   
   ```text
   node1 health probes after update: non-zero
   node2 health probes after update: 0
   ```
   
   Bug signal:
   
   ```text
   node2 health probes after update: non-zero
   ```
   
   If `node2` is still being probed after the Admin API shows only 
`172.31.41.11:8080` in the upstream, the health checker target set is stale in 
the worker and was not reconciled when `upstream.nodes` changed.
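
   The "Admin API shows only `172.31.41.11:8080`" half of that check can be scripted. This is a minimal sketch, not something APISIX ships: `node_in_upstream` is a hypothetical helper name, and it assumes the hash-style `nodes` object used in this repro, where each node address appears as a quoted JSON key.

   ```sh
   # Hypothetical helper (not part of APISIX): reads an Admin API response on
   # stdin and succeeds if the given node address appears as a quoted string.
   # Assumes hash-style nodes, e.g. {"nodes": {"172.31.41.11:8080": 1}}.
   node_in_upstream() {
     grep -q -F "\"$1\""
   }

   # Usage against the lab above:
   #   curl -sS http://127.0.0.1:13142/apisix/admin/upstreams/1 \
   #     -H 'X-API-KEY: issue-13141-admin-key' \
   #     | node_in_upstream 172.31.41.12:8080 \
   #     && echo "node2 still stored" || echo "node2 removed from stored config"
   ```

   A plain fixed-string match keeps the helper free of a JSON parser dependency; it would not work if the upstream used the array form of `nodes` with separate `host` and `port` fields.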
   
   In my local run with `apache/apisix:3.14.1-debian`, this repro did show 
stale probing when the upstream was replaced with `PUT`:
   
   ```text
   before update:
     node1 -> GET /health once per second
     node2 -> GET /health once per second
   
   after PUT with only 172.31.41.11:8080 and another route hit:
     node1 -> GET /health once per second
     node2 -> GET /health once per second
   ```
   
   I also tested several update methods:
   
   ```text
   admin PUT replace upstream body:
     stored config no longer contained node2
     node2 still received health checks
     => stale checker reproduced
   
   admin PATCH partial nodes object:
     stored config still contained node2
     node2 still received health checks
     => not a stale-checker signal; node2 was not removed
   
   admin PATCH node2 weight 0:
     stored config still contained node2 with weight 0
     node2 still received health checks
     => node remains a health-check target
   
   direct etcd v3 put full replacement:
     stored config no longer contained node2
     node2 still received health checks
     => stale checker reproduced
   ```
   
   One caveat I found while testing: `PATCH` with a partial `nodes` object 
merges the object and does not remove omitted nodes. In that case the Admin API 
response still contains the old node, so continued health checks are expected.
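
   For completeness, the Admin API documentation describes removing a single node through `PATCH` by setting its value to `null`. I did not exercise that variant in the runs above, so treat the following as an untested sketch; `remove_node_patch_body` is a hypothetical helper name that only builds the request body.

   ```sh
   # Hypothetical helper: builds a PATCH body that sets one node to null.
   # Assumption: a null value removes the key during the merge, as the
   # Admin API documentation describes; not verified in this repro.
   remove_node_patch_body() {
     printf '{"nodes":{"%s":null}}' "$1"
   }

   # Usage (untested here):
   #   curl -sS -X PATCH http://127.0.0.1:13142/apisix/admin/upstreams/1 \
   #     -H 'X-API-KEY: issue-13141-admin-key' \
   #     -H 'Content-Type: application/json' \
   #     -d "$(remove_node_patch_body 172.31.41.12:8080)"
   ```

   If this is used, the stored config should be re-read afterwards to confirm the node is actually gone before interpreting any further health checks.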
   
   I also checked the timing. With full `PUT` replacement, stale health checks 
to the removed node were visible immediately after the update, but they did not 
persist indefinitely in this local lab:
   
   ```text
   PUT then immediate route hit:
     first 8s window:  node1_health=8,  node2_health=8
     next 20s window:  node1_health=21, node2_health=4
   
   PUT long window:
     first 30s window: node1_health=29, node2_health=11
     next 60s window:  node1_health=59, node2_health=0
     next 120s window: node1_health=118, node2_health=0
   ```
   
   So this local repro shows a stale checker transition window after a full 
upstream replacement, but it does not by itself explain a case where a removed 
node is still checked days later. For that longer-lived symptom, I would first 
verify whether the old node still exists in the effective data-plane upstream 
config, whether only some APISIX data-plane instances missed the update, or 
whether the health checks are coming from another upstream/service/route that 
still references the node.
   
   **I think the first thing to establish is how the old nodes were "removed" from the upstream, because each update path behaves differently:**
   
   ```text
   Admin API PUT:
     full replacement; omitted nodes should disappear from stored config
   
   Admin API PATCH:
     partial merge; omitted nodes may remain in stored config
   
   Dashboard/control-plane update:
     needs checking at the generated Admin API / etcd / DP config layer,
     because the UI action may be implemented as either replace or merge
   
   direct etcd write:
     depends on whether the full upstream object is replaced or a stale
     object is written back
   
   decoupled CP/DP sync:
     needs checking on each data plane, because one DP may still hold an
     older upstream version even if the control plane has the new object
   ```
   
   The key diagnostic split is: if the removed node is still present in the 
effective upstream config for the APISIX instance sending `/health`, the 
behavior is explained by update semantics or config propagation. If the 
effective upstream config no longer contains the node but that same APISIX 
instance still sends active checks to it, then the problem is stale 
health-checker state. For multi-DP or decoupled deployments, this check should 
be done per data-plane instance, not only on the control-plane object.
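
   The split above can be written down as a small decision helper. This is only a sketch of the reasoning, with a hypothetical function name; the two inputs are whether the node is present in the effective upstream config of the instance sending the probes, and how many `GET /health` lines the removed node logged in the observation window.

   ```sh
   # classify_probe IN_CONFIG PROBE_COUNT
   #   IN_CONFIG:   "yes" if the node is still in the effective upstream
   #                config of the probing APISIX instance, otherwise "no"
   #   PROBE_COUNT: number of GET /health lines seen on the removed node
   classify_probe() {
     in_config=$1
     probes=$2
     if [ "$probes" -eq 0 ]; then
       echo "no probes: nothing to explain"
     elif [ "$in_config" = "yes" ]; then
       echo "update semantics or config propagation"
     else
       echo "stale health-checker state"
     fi
   }

   # Usage: classify_probe no 8
   ```

   In a multi-DP deployment the helper would be run once per data-plane instance, with both inputs collected from that instance.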
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
