Falven opened a new issue, #12662:
URL: https://github.com/apache/apisix/issues/12662

   ### Current Behavior
   
   When running APISIX 3.13.0 in file-driven standalone mode 
(`deployment.role: data_plane`, `config_provider: yaml`), the `/status/ready` 
health check endpoint always returns HTTP 503 with the error "worker id: X has 
not received configuration", despite:
   - Routes working correctly
   - Configuration being successfully loaded from apisix.yaml
   - All workers functioning normally
   
   Example error response:
   ```json
   {"error":"worker id: 0 has not received configuration","status":"error"}
   ```
   
   ### Expected Behavior
   
   The `/status/ready` endpoint should return HTTP 200 with `{"status":"ok"}` 
when all workers have successfully loaded the configuration from the YAML file.
   
   ### Error Logs
   
   ```
   2025/01/10 00:41:47 [warn] 33#33: *3 [lua] init.lua:1003: status_ready(): worker id: 0 has not received configuration, context: ngx.timer
   ```
   
   ### Steps to Reproduce
   
   1. Configure APISIX in file-driven standalone mode:
   ```yaml
   # config.yaml
   deployment:
     role: data_plane
     role_data_plane:
       config_provider: yaml
   apisix:
     enable_admin: false
   ```
   
   2. Create a valid apisix.yaml with routes
   3. Start APISIX
   4. Test the health check endpoint:
   ```bash
   curl http://127.0.0.1:7085/status/ready
   ```
   
   5. Observe HTTP 503 error despite routes working correctly
   
   ### Environment
   
   - APISIX version: 3.13.0
   - Operating System: Docker (apache/apisix:3.13.0-debian)
   - OpenResty / Nginx version: From official image
   - Deployment mode: data_plane with yaml config_provider
   
   ### Root Cause Analysis (UPDATED)
   
   After extensive debugging with added logging, I've identified the actual 
root cause. The issue occurs when the configuration file is rendered **before** 
APISIX starts and is not modified afterwards (common in container environments):
   
   **Timing Issue:**
   1. The configuration file (`apisix.yaml`) is created by an entrypoint script 
before APISIX starts
   2. The master process reads the file during startup, setting the 
`apisix_yaml_mtime` global variable
   3. Workers initialize and call `sync_status_to_shdict(false)`, marking 
themselves as **unhealthy**
   4. Workers create timers that call `read_apisix_config()` every second
   5. **Critical bug**: `read_apisix_config()` returns early when the file's 
mtime has not changed:
      ```lua
      if apisix_yaml_mtime == last_modification_time then
          return  -- File hasn't changed, return early
      end
      ```
   6. Because the file was rendered before startup and is not touched 
afterwards, its mtime always equals the value recorded in step 2
   7. `update_config()` is **never called** by workers
   8. Workers remain marked as unhealthy forever
   9. `/status/ready` endpoint fails perpetually
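
   The sequence above can be reproduced in a few lines of plain Lua. This is only a model of the reasoning, not APISIX code: `shdict`, `worker_mtime`, and `timer_tick` are hypothetical stand-ins for the shared dict, the per-worker copy of `apisix_yaml_mtime`, and the timer callback, and no OpenResty API is used.

   ```lua
   -- Pure-Lua model of the startup sequence (hypothetical names throughout).
   local shdict = {}            -- stands in for ngx.shared["status-report"]
   local file_mtime = 1000      -- mtime of apisix.yaml, rendered before startup
   local worker_count = 2

   -- Step 2: the master loads the file and records its mtime.
   local master_mtime = file_mtime

   -- Step 3: each worker starts with the master's mtime and marks itself
   -- unhealthy, as sync_status_to_shdict(false) would.
   local worker_mtime = {}
   for id = 0, worker_count - 1 do
       worker_mtime[id] = master_mtime
       shdict[id] = false
   end

   -- Steps 4-7: the per-second timer only proceeds when the mtime changed.
   local function timer_tick(id)
       if worker_mtime[id] == file_mtime then
           return false         -- early return: status is never updated
       end
       worker_mtime[id] = file_mtime
       shdict[id] = true        -- what update_config() would end up doing
       return true
   end

   -- Let the timers fire five times without touching the file.
   local reloads = 0
   for _ = 1, 5 do
       for id = 0, worker_count - 1 do
           if timer_tick(id) then
               reloads = reloads + 1
           end
       end
   end

   print(reloads, shdict[0], shdict[1])
   ```

   In this model no reload ever happens and both workers stay `false`, which matches steps 7-9.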
   
   **Debug Evidence:**
   Adding logging to `config_yaml.lua` confirmed:
   - `update_config()` is only called once by the master process (PID 1) during 
startup
   - Master's call to `sync_status_to_shdict(true)` does nothing because it 
checks `if process.type() ~= "worker" then return end`
   - All 12 workers successfully create timers
   - Timers fire every second but return early due to unchanged mtime
   - Workers never call `update_config()`, thus never call 
`sync_status_to_shdict(true)`
   
   ### Relevant Code
   
   **apisix/core/config_yaml.lua** - Lines ~565-585:
   ```lua
   function _M.init_worker()
       sync_status_to_shdict(false)  -- Mark worker as unhealthy
       
       if is_use_admin_api() then
           apisix_yaml = {}
           apisix_yaml_mtime = 0
           return true
       end
   
       -- sync data in each non-master process
       ngx.timer.every(1, read_apisix_config)  -- Timer created but never calls update_config
       
       return true
   end
   ```
   
   **apisix/core/config_yaml.lua** - Lines ~150-165:
   ```lua
   local function read_apisix_config(premature, pre_mtime)
       if premature then
           return
       end
       
       local attributes, err = lfs.attributes(config_file.path)
       if not attributes then
           log.error("failed to fetch ", config_file.path, " attributes: ", err)
           return
       end
   
       local last_modification_time = attributes.modification
       if apisix_yaml_mtime == last_modification_time then
           return  -- BUG: Returns early, never calls update_config()
       end
       
       -- This code is never reached if file hasn't changed since startup
       local config_new, err = config_file:parse()
       if err then
           log.error("failed to parse the content of file ", config_file.path,
                     ": ", err)
           return
       end
   
       update_config(config_new, last_modification_time)
       log.warn("config file ", config_file.path, " reloaded.")
   end
   ```
   
   **apisix/core/config_yaml.lua** - Lines ~136-148:
   ```lua
   local function sync_status_to_shdict(status)
       if process.type() ~= "worker" then
           return  -- Master process calls are ignored
       end
   
       local dict_name = "status-report"
       local key = worker_id()
       local shdict = ngx.shared[dict_name]
       local _, err = shdict:set(key, status)
       if err then
           log.error("failed to ", status and "set" or "clear",
                     " shdict " .. dict_name .. ", key=" .. key, ", err: ", err)
       end
   end
   ```
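
   To show why a single stale `false` entry is enough to fail the probe, here is a rough model of a readiness check that reads one flag per worker. It is an assumption inferred from the error message, not the actual `status_ready()` from `init.lua`; the plain table stands in for the `status-report` shared dict.

   ```lua
   -- Hypothetical readiness check over per-worker flags (not APISIX code).
   local function status_ready(shdict, worker_count)
       for id = 0, worker_count - 1 do
           -- Anything other than an explicit `true` counts as not ready.
           if shdict[id] ~= true then
               return 503, "worker id: " .. id .. " has not received configuration"
           end
       end
       return 200, nil
   end

   -- Workers that were initialised with `false` and never flipped to `true`.
   local stuck = { [0] = false, [1] = false }
   local code, err = status_ready(stuck, 2)
   print(code, err)

   -- Workers that have all reported `true`.
   local healthy = { [0] = true, [1] = true }
   local ok_code = status_ready(healthy, 2)
   print(ok_code)
   ```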
   
   ### Proposed Solution
   
   In `init_worker()`, call `update_config()` right after creating the timer 
if a configuration is already loaded, so that the worker is marked as healthy:
   
   ```lua
   function _M.init_worker()
       sync_status_to_shdict(false)
       
       if is_use_admin_api() then
           apisix_yaml = {}
           apisix_yaml_mtime = 0
           return true
       end
   
       -- sync data in each non-master process
       ngx.timer.every(1, read_apisix_config)
       
       -- FIX: Mark worker as healthy immediately if config already loaded
       if apisix_yaml then
           update_config(apisix_yaml, apisix_yaml_mtime)
       end
       
       return true
   end
   ```
   
   This ensures workers are marked healthy on initialization, before the timer 
even fires. The timer will still update configuration when the file changes.
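
   The effect of the proposed change can be checked with the same kind of plain-Lua model. Again, this is a sketch under assumptions: `new_worker`, `timer_tick`, and the `apply_fix` flag are hypothetical, and setting `shdict[id] = true` stands in for what `update_config()` does via `sync_status_to_shdict(true)`.

   ```lua
   -- Pure-Lua model of the proposed fix (hypothetical names throughout).
   local shdict = {}
   local file_mtime = 1000

   -- State already loaded by the master process before workers start.
   local master_config = { routes = {} }
   local master_mtime = file_mtime

   local function new_worker(id, apply_fix)
       local w = { id = id, config = master_config, mtime = master_mtime }
       shdict[id] = false                 -- sync_status_to_shdict(false)
       if apply_fix and w.config then
           shdict[id] = true              -- effect of update_config(...)
       end
       return w
   end

   local function timer_tick(w)
       if w.mtime == file_mtime then
           return false                   -- unchanged file: nothing to do
       end
       w.mtime = file_mtime
       shdict[w.id] = true
       return true
   end

   local unpatched = new_worker(0, false)
   local patched = new_worker(1, true)
   timer_tick(unpatched)
   timer_tick(patched)
   print(shdict[0], shdict[1])

   -- A later edit to the file is still picked up by the timer.
   file_mtime = 2000
   local reloaded = timer_tick(patched)
   print(reloaded, patched.mtime)
   ```

   Only the patched worker is healthy before the file changes, and the timer path keeps working once the mtime moves.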
   
   ### Verified Fix
   
   I patched the code in a running container and confirmed:
   - All 12 workers call `update_config()` in `init_worker_by_lua*` context
   - `/status/ready` returns `{"status":"ok"}` with HTTP 200
   - Docker health check passes (container shows "healthy" status)
   - Routes continue working correctly
   
   ### Impact
   
   This bug affects production deployments using:
   - Kubernetes readiness probes with file-driven standalone mode
   - Docker health checks
   - Load balancers that depend on the `/status/ready` endpoint
   - Any container orchestration that renders config files before starting 
APISIX
   
   The health check always fails, preventing proper deployment orchestration, 
even though APISIX is functioning correctly and serving traffic.
   
   ### Additional Context
   
   The bug is specific to the timing of when the configuration file is created 
relative to APISIX startup. If the file is created and never modified, workers 
never get marked as healthy. This is a common pattern in containerized 
deployments where entrypoint scripts render configuration from environment 
variables before starting the main process.

