wk-mls opened a new issue, #2774:
URL: https://github.com/apache/apisix-ingress-controller/issues/2774

   ### Current Behavior
   
   Occasionally, an ingress controller pod times out when accessing the 
`Lease` resource used for leadership coordination and loses leadership. The 
`apisix-ingress-controller` process in the `manager` container then terminates 
and the container restarts, but the pod itself is not terminated. When this 
same ingress controller pod is later selected as the leader again, 
`*_conf_version` (e.g. `upstreams_conf_version`) errors seem to occur during 
syncing for the resources associated with single pod deployments, starting 
immediately after the new leader pod begins syncing. This syncing error 
prevents the upstream IP address of every single pod deployment from being 
updated and causes 502/504 errors on the user end until:
   
   1. Any of the affected single pod deployments are scaled up to two or more 
pods
   2. Either `apisix` or `apisix-ingress-controller` pods are restarted
   
   It's not clear to me why scaling up resolves the issue. Restarting the 
`apisix` pods resets the `*_conf_version` values in their configurations 
(https://apisix.apache.org/docs/ingress-controller/reference/apisix-ingress-controller/configuration-troubleshoot/#inspect-synchronized-gateway-configurations)
 to 0 which allows the `apisix-ingress-controller` to write new values, and 
restarting the `apisix-ingress-controller` pods seems to correct the internal 
Unix timestamp that they use for `*_conf_version` values. There appear to be no 
differences between the configurations reflected in the ADC configurations 
viewable through the debug API 
(https://apisix.apache.org/docs/ingress-controller/reference/apisix-ingress-controller/configuration-troubleshoot/#inspect-translated-adc-configurations)
 or the debug logs before and after the `apisix-ingress-controller` process 
restarts, so it seems like the issue is isolated to the `manager` container and 
the `apisix-ingress-controller` process within. The Unix timestamp within the 
`*_conf_version` value reported in the error only seems to correspond to a 
successful sync on the leader pod at that time.
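
For reference, the value reported in the error can be decoded to a wall-clock 
time. This is a minimal Python sketch, assuming the `*_conf_version` values are 
Unix epoch milliseconds as described above; the helper name 
`conf_version_to_utc` is made up for illustration and is not part of APISIX or 
the ingress controller.

```python
from datetime import datetime, timezone


def conf_version_to_utc(conf_version_ms: int) -> datetime:
    """Interpret a *_conf_version value as Unix epoch milliseconds (assumption)."""
    seconds, millis = divmod(conf_version_ms, 1000)
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return base.replace(microsecond=millis * 1000)


# Value copied from the upstreams_conf_version error in the sync failure log.
rejected_floor = 1779434128737
print(conf_version_to_utc(rejected_floor).isoformat(timespec="milliseconds"))
# prints 2026-05-22T07:15:28.737+00:00
```

Going by the logs in this report, that instant falls after the other pod 
acquired the lease on 2026-05-21 and before the original pod re-acquired it at 
14:45:24Z on 2026-05-22, so the value looks like one written while the other 
pod was leader.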
   
   https://github.com/apache/apisix-ingress-controller/issues/2708 may be 
experiencing a similar issue but in this case, only single pod deployments are 
affected.
   
   ### Expected Behavior
   
   Single pod deployments have their upstream IPs updated correctly even after 
a pod whose `apisix-ingress-controller` process in the `manager` container was 
terminated becomes the ingress controller leader again.
   
   ### Error Logs
   
   Lease timeout and process restart:
   ```
   2026-05-21T11:06:19.710Z     E0521 11:06:19.710497       1 
leaderelection.go:429] Failed to update lock optimistically: Put 
"https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/ingress-apisix/leases/apisix-ingress-controller-leader?timeout=10s":
 context deadline exceeded (Client.Timeout exceeded while awaiting headers), 
falling back to slow path apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
   2026-05-21T11:06:29.709Z     E0521 11:06:29.709691       1 
leaderelection.go:436] error retrieving resource lock 
ingress-apisix/apisix-ingress-controller-leader: Get 
"https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/ingress-apisix/leases/apisix-ingress-controller-leader?timeout=10s":
 context deadline exceeded apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
   2026-05-21T11:06:29.709Z     I0521 11:06:29.709728       1 
leaderelection.go:297] failed to renew lease 
ingress-apisix/apisix-ingress-controller-leader: context deadline exceeded 
apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
   2026-05-21T11:06:29.710Z     Error: leader election lost 
apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
   2026-05-21T11:06:29.710Z     2026-05-21T11:06:29.709Z        DEBUG   
controller-runtime.events       recorder/recorder.go:104        
apisix-ingress-controller-aaaaaaaaaa-aaaaa_4f765088-329e-487e-b94d-b9b5f782252b 
stopped leading {"type": "Normal", "object": 
{"kind":"Lease","namespace":"ingress-apisix","name":"apisix-ingress-controller-leader","uid":"49eeacf3-2112-429d-b2ad-13c653d548e5","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1540593399"},
 "reason": "LeaderElection"} apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
   2026-05-21T11:06:29.710Z     Usage:
     apisix-ingress-controller [command] [flags]
     apisix-ingress-controller [command] 
apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
   
   2026-05-21T11:06:29.710Z     Available Commands:
     completion  Generate the autocompletion script for the specified shell
     help        Help about any command
     version     version for apisix-ingress-controller 
apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
   
   2026-05-21T11:06:29.710Z     Flags:
     -c, --config-path string                 configuration file path for 
apisix-ingress-controller
         --controller-name string             The name of the controller 
(default "apisix.apache.org/apisix-ingress-controller")
         --health-probe-bind-address string   The address the probe endpoint 
binds to. (default ":8081")
     -h, --help                               help for apisix-ingress-controller
         --log-level string                   The log level for 
apisix-ingress-controller (default "info")
         --metrics-bind-address string        The address the metrics endpoint 
binds to. Use :8443 for HTTPS or :8080 for HTTP, or leave as 0 to disable the 
metrics service. (default "0")
     -v, --version                            version for 
apisix-ingress-controller apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
   2026-05-21T11:06:29.710Z     Use "apisix-ingress-controller [command] 
--help" for more information about a command. 
apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
   2026-05-21T11:06:29.710Z     Error: leader election lost 
apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
   2026-05-21T11:06:29.709Z     INFO    controller-runtime      
manager/internal.go:538 Stopping and waiting for non leader election runnables 
apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
   2026-05-21T11:06:30.238Z     INFO    root/root.go:125        controller 
start configuration  {"config": 
{"log_level":"debug","controller_name":"apisix.apache.org/apisix-ingress-controller","leader_election_id":"apisix-ingress-controller-leader","metrics_addr":":8080","server_addr":":9092","enable_server":false,"enable_http2":false,"probe_addr":":8081","secure_metrics":false,"leader_election":{"lease_duration":"30s","renew_deadline":"20s","retry_period":"2s"},"exec_adc_timeout":"15s","provider":{"type":"apisix-standalone","sync_period":"1m0s","init_sync_delay":"20m0s"},"webhook":{"enable":true,"tls_cert_file":"tls.crt","tls_key_file":"tls.key","tls_cert_dir":"/certs","port":9443},"disable_gateway_api":true}}
 apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
   ```
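
The timing in this log appears to line up with the `renew_deadline` of 20s 
printed in the start configuration and the `timeout=10s` on each Lease request: 
two consecutive timed-out requests use up roughly the whole renew deadline. A 
rough Python check of that arithmetic, using only timestamps and durations 
shown above; the variable names are illustrative, and the start of the renew 
attempt is an estimate, not something the log states.

```python
from datetime import datetime, timedelta

request_timeout = timedelta(seconds=10)  # from "?timeout=10s" in the Lease URLs
renew_deadline = timedelta(seconds=20)   # from "renew_deadline":"20s"

first_timeout = datetime.fromisoformat("2026-05-21T11:06:19.710497")   # failed Put
second_timeout = datetime.fromisoformat("2026-05-21T11:06:29.709691")  # failed Get

# Assumption: the first request was sent one request timeout before it failed.
estimated_renew_start = first_timeout - request_timeout
elapsed = second_timeout - estimated_renew_start

print(f"elapsed since estimated renew start: {elapsed.total_seconds():.3f}s")
# prints elapsed since estimated renew start: 19.999s
```

If that estimate holds, a single slow API server response of about 20 seconds 
is enough to lose leadership with these settings.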
   Leadership change:
   ```
   2026-05-21T11:06:30.454Z     I0521 11:06:30.454706       1 
leaderelection.go:257] attempting to acquire leader lease 
ingress-apisix/apisix-ingress-controller-leader... 
apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
   2026-05-21T11:06:41.883Z     I0521 11:06:41.878897       1 
leaderelection.go:271] successfully acquired lease 
ingress-apisix/apisix-ingress-controller-leader 
apisix-ingress-controller-bbbbbbbbbb-bbbbb manager
   2026-05-21T11:06:41.879Z     INFO    provider        apisix/provider.go:254  
starting provider, waiting for readiness 
apisix-ingress-controller-bbbbbbbbbb-bbbbb manager
   2026-05-21T11:06:41.879Z     INFO    status.updater  status/updater.go:131   
started status update handler apisix-ingress-controller-bbbbbbbbbb-bbbbb manager
   2026-05-21T11:06:41.879Z     DEBUG   controller-runtime.events       
recorder/recorder.go:104        
apisix-ingress-controller-bbbbbbbbbb-bbbbb_66e0b0a2-e092-4c01-8a56-2cce983bb382 
became leader   {"type": "Normal", "object": 
{"kind":"Lease","namespace":"ingress-apisix","name":"apisix-ingress-controller-leader","uid":"49eeacf3-2112-429d-b2ad-13c653d548e5","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1540594430"},
 "reason": "LeaderElection"} apisix-ingress-controller-bbbbbbbbbb-bbbbb manager
   ```
   Leadership returns to original pod:
   ```
   2026-05-22T14:44:54.036Z     I0522 14:44:54.036708       1 
leaderelection.go:257] attempting to acquire leader lease 
ingress-apisix/apisix-ingress-controller-leader... 
apisix-ingress-controller-bbbbbbbbbb-bbbbb manager
   2026-05-22T14:45:24.996Z     I0522 14:45:24.996003       1 
leaderelection.go:271] successfully acquired lease 
ingress-apisix/apisix-ingress-controller-leader 
apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
   2026-05-22T14:45:24.996Z     DEBUG   controller-runtime.events       
recorder/recorder.go:104        
apisix-ingress-controller-aaaaaaaaaa-aaaaa_ae6f86ca-a178-46ec-b99f-e3c301bcda8d 
became leader   {"type": "Normal", "object": 
{"kind":"Lease","namespace":"ingress-apisix","name":"apisix-ingress-controller-leader","uid":"49eeacf3-2112-429d-b2ad-13c653d548e5","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1543653763"},
 "reason": "LeaderElection"} apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
   2026-05-22T14:45:24.996Z     INFO    provider        apisix/provider.go:254  
starting provider, waiting for readiness 
apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
   2026-05-22T14:45:24.996Z     INFO    status.updater  status/updater.go:131   
started status update handler apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
   ```
   Sync failure:
   ```
   2026-05-22T14:50:26.156Z     DEBUG   provider.executor       
client/executor.go:303  received HTTP response from ADC Server  {"server": 
"http://10.10.10.10:9180,http://10.10.10.20:9180", "status": 202, "response": 
"{\"status\":\"all_failed\",\"total_resources\":2,\"success_count\":0,\"failed_count\":2,\"success\":[],\"failed\":[{\"server\":\"http://10.10.10.10:9180\",\"event\":{},\"failed_at\":\"2026-05-22T14:50:26.000Z\",\"reason\":\"upstreams_conf_version
 must be greater than or equal to 
(1779434128737)\",\"response\":{\"status\":400,\"headers\":{\"date\":\"Fri, 22 
May 2026 14:50:26 
GMT\",\"content-type\":\"application/json\",\"transfer-encoding\":\"chunked\",\"connection\":\"keep-alive\",\"server\":\"APISIX/3.16.0\",\"access-control-allow-origin\":\"*\",\"access-control-allow-credentials\":\"true\",\"access-control-expose-headers\":\"*\",\"access-control-max-age\":\"3600\"},\"data\":{\"error_msg\":\"upstreams_conf_version
 must be greater than or equal to (1779434128737)\"}}},{\"server\":\"http://10.10.10.20:9180\",\"event\":{},\"failed_at\":\"2026-05-22T14:50:26.000Z\",\"reason\":\"upstreams_conf_version
 must be greater than or equal to 
(1779434128737)\",\"response\":{\"status\":400,\"headers\":{\"date\":\"Fri, 22 
May 2026 14:50:26 
GMT\",\"content-type\":\"application/json\",\"transfer-encoding\":\"chunked\",\"connection\":\"keep-alive\",\"server\":\"APISIX/3.16.0\",\"access-control-allow-origin\":\"*\",\"access-control-allow-credentials\":\"true\",\"access-control-expose-headers\":\"*\",\"access-control-max-age\":\"3600\"},\"data\":{\"error_msg\":\"upstreams_conf_version
 must be greater than or equal to (1779434128737)\"}}}]}"} 
apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
   ```
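
To make the rejection easier to read, the `response` field can be parsed to 
list, per APISIX server, which version field was rejected and the minimum value 
it will accept. This is a small Python sketch over a trimmed copy of the 
response above (HTTP headers and empty fields omitted); the function name 
`summarize_version_failures` is hypothetical and only the JSON keys that appear 
in the log are used.

```python
import json
import re

# Trimmed copy of the "response" field from the sync failure log.
adc_response = json.dumps({
    "status": "all_failed",
    "total_resources": 2,
    "success_count": 0,
    "failed_count": 2,
    "success": [],
    "failed": [
        {"server": "http://10.10.10.10:9180",
         "reason": "upstreams_conf_version must be greater than or equal to (1779434128737)"},
        {"server": "http://10.10.10.20:9180",
         "reason": "upstreams_conf_version must be greater than or equal to (1779434128737)"},
    ],
})

# Matches the error text exactly as it appears in the log.
VERSION_ERROR = re.compile(r"(\w+_conf_version) must be greater than or equal to \((\d+)\)")


def summarize_version_failures(response_text):
    """Return (server, field, minimum accepted version) for each version rejection."""
    failures = []
    for item in json.loads(response_text).get("failed", []):
        match = VERSION_ERROR.search(item.get("reason", ""))
        if match:
            failures.append((item["server"], match.group(1), int(match.group(2))))
    return failures


for server, field, floor in summarize_version_failures(adc_response):
    print(f"{server}: {field} must be >= {floor}")
```

Both servers report the same minimum, which suggests the version the 
controller tried to write was lower than the one both APISIX instances already 
hold.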
   
   ### Steps to Reproduce
   
   The bug's dependence on the apisix-ingress-controller process timing out 
when getting the Lease resource has made it difficult to reproduce, but these 
are the steps:
   1. Scale target deployment to 1 pod and the apisix-ingress-controller 
deployment to 2 pods
   2. Wait for the apisix-ingress-controller process in the manager container 
of the leader pod to self-terminate
   3. If the new leader pod selected isn't the same as the previous, terminate 
the leader until leadership returns to the previous pod
   4. Kill the target deployment's pod, spawning a new pod
   5. Observe the upstreams_conf_version error on the next sync
   
   ### Environment
   
   - APISIX version: 3.16.0
   - APISIX Ingress controller version: 2.0.1
   - APISIX Helm chart version: 2.14.0
   - Kubernetes cluster version: 1.34.2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
