[PR] fix: running stale healthchecker when new node count <= 1 [apisix]

via GitHub Wed, 02 Apr 2025 05:09:39 -0700


Revolyssup opened a new pull request, #12118:
URL: https://github.com/apache/apisix/pull/12118


   ### Description
   Problem description
   The customer has a domain name configured with health check (the upstream 
type is K8S service discovery). Now it is found that after the upstream service 
node is updated, the nodes in the health check are not updated, resulting in a 
large number of health check failure logs:
   
   *content:2025/03/28 16:36:17 [error] 42#42: *309692373 [lua] 
healthcheck.lua:1150: log(): [healthcheck] 
(upstream#/common/ontest/internal/routes/570) failed to send http request to 
'nil (10.24.64.115:14699)': connection timed out, context: ngx.timer, client: 
172.24.128.181, server: 0.0.0.0:9443*
   
   Reproduction steps
   1. Create a route and enable health check
   2. Determine how to deploy multiple pods upstream
   3. Access route triggers health check
   4. Reduce the size of the upstream node to 1 pod (when the upstream node is 
1, the health check related logic will no longer be entered, and because it is 
service discovery, there is no etcd change, so the health checker cannot be 
released 
https://github.com/apache/apisix/blob/release/3.10/apisix/upstream.lua#L352-L355)
   5. Observing the logs, you will find that health checks will still be sent 
to all pods before scaling down (you need to set the log level to info or add a 
warn log after healthcheck attempts to connect to the node, otherwise 
healthcheck will not generate logs after timeout; the customer's IP and port 
are open, but the http request cannot be processed, so there are error level 
logs)
   
   ## Solution
   There are 3 possible solutions and in this PR I have decided to go with 1st 
one though reviewers can tell their opinion.
   
   ## Fix can be any of the following
   
   ```bash
    diff orig fixed
   1,4c1,2
   <     if nodes_count > 1 then
   <         local checker = fetch_healthchecker(up_conf)
   <         api_ctx.up_checker = checker
   <     end
   ---
   >     local checker = fetch_healthchecker(up_conf)
   >     api_ctx.up_checker = checker
   
   ```
   
   OR
   
   ```bash
   diff orig fixed2
   1c1
   <     if nodes_count > 1 then
   ---
   >     if nodes_count > 0 then
   
   ```
   
   OR
   
   ```bash
        if nodes_count > 1 then
            local checker = fetch_healthchecker(up_conf)
            api_ctx.up_checker = checker
        end    
   >    if nodes_count == 1 and up_conf.parent and up_conf.parent.checker then
   >        release_checker(up_conf.parent)
   >    end
   ```
   
   ## Conclusion
   
   Solution 1 looks more general so we can go with it.
   
   ### Checklist
   
   - [ ] I have explained the need for this PR and the problem it solves
   - [ ] I have explained the changes or the new features added to this PR
   - [ ] I have added tests corresponding to this change
   - [ ] I have updated the documentation to reflect this change
   - [ ] I have verified that this change is backward compatible (If not, 
please discuss on the [APISIX mailing 
list](https://github.com/apache/apisix/tree/master#community) first)
   
   <!--
   
   Note
   
   1. Mark the PR as draft until it's ready to be reviewed.
   2. Always add/update tests for any changes unless you have a good reason.
   3. Always update the documentation to reflect the changes made in the PR.
   4. Make a new commit to resolve conversations instead of `push -f`.
   5. To resolve merge conflicts, merge master instead of rebasing.
   6. Use "request review" to notify the reviewer after making changes.
   7. Only a reviewer can mark a conversation as resolved.
   
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] fix: running stale healthchecker when new node count <= 1 [apisix]

Reply via email to