jizhuozhi opened a new issue, #12937:
URL: https://github.com/apache/apisix/issues/12937
### Current Behavior
## Bug Description
There is a bug in the Consul service discovery implementation where a single
malformed node (missing `Service` field) can cause the entire service to become
unavailable.
## Root Cause
The `:: CONTINUE ::` label is placed **outside** the inner `for` loop (line
648), but the `goto CONTINUE` statement is used **inside** the inner loop (line
578). This causes the goto statement to skip not only the current node but also:
- All remaining nodes in the result
- The sorting logic
- The critical `up_services[service_name] = nodes` assignment
## Affected Code
File: `apisix/discovery/consul/init.lua`
```lua
for service_name, _ in pairs(catalog_res.body) do
-- ... get nodes from consul ...
for _, node in ipairs(result.body) do
local service = node.Service
if not service then
goto CONTINUE -- Line 578: BUG! This jumps outside the inner
loop
end
-- ... process node ...
end -- Inner loop ends here (line 633)
-- Sorting logic
if nodes then
-- ... sort nodes ...
end
up_services[service_name] = nodes
end -- Outer loop ends here
:: CONTINUE :: -- Line 648: Label is HERE, outside both loops!
```
## Impact
When a single node in the Consul response has a missing `Service` field:
1. The `goto CONTINUE` is triggered
2. It jumps to line 648, **skipping the entire remaining processing for that
service**
3. The `up_services[service_name] = nodes` assignment is never executed
4. APISIX does not update the node list for that service
5. If this happens during startup or after a previous service removal, the
service becomes completely unavailable
## Proposed Fix
Move the `:: CONTINUE ::` label inside the inner loop, or rename the labels
to avoid confusion:
```lua
for service_name, _ in pairs(catalog_res.body) do
-- ... get nodes from consul ...
for _, node in ipairs(result.body) do
local service = node.Service
if not service then
goto CONTINUE_NODE -- Skip only this node
end
-- ... process node ...
:: CONTINUE_NODE :: -- Label inside the inner loop
end -- Inner loop ends
-- Sorting and assignment continue normally
if nodes then
-- ... sort nodes ...
end
up_services[service_name] = nodes
end
:: CONTINUE :: -- Keep outer label for service-level skipping
```
## Additional Notes
This bug has likely existed for a long time but may have gone unnoticed due
to the rarity of malformed Consul responses. However, it represents a critical
reliability issue as a single bad node entry can cause complete service
unavailability.
### Expected Behavior
When a single node is malformed, only that specific node should be skipped.
Other healthy nodes should still be processed and added to `up_services`.
### Error Logs
None
### Steps to Reproduce
1. Register a service in Consul with multiple instances
2. Somehow create a node entry with a missing `Service` field (e.g., through
API manipulation or data corruption)
3. APISIX will fail to discover ANY nodes for that service, even the healthy
ones
4. Requests to that service will return "no valid upstream node" errors
### Environment
- APISIX version (run `apisix version`):
- Operating system (run `uname -a`):
- OpenResty / Nginx version (run `openresty -V` or `nginx -V`):
- etcd version, if relevant (run `curl
http://127.0.0.1:9090/v1/server_info`):
- APISIX Dashboard version, if relevant:
- Plugin runner version, for issues related to plugin runners:
- LuaRocks version, for installation issues (run `luarocks --version`):
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]