wk-mls opened a new issue, #2774: URL: https://github.com/apache/apisix-ingress-controller/issues/2774
### Current Behavior

Occasionally, an ingress controller pod times out when attempting to access the `Lease` resource used for leadership coordination and loses leadership. This terminates the `apisix-ingress-controller` process in the `manager` container and restarts the container, but does not terminate the pod itself.

When this same ingress controller pod is selected as the leader again, single pod deployments seem to cause `*_conf_version` (e.g. `upstreams_conf_version`) errors during syncing for the resources associated with those deployments. The errors start immediately after the new leader pod starts syncing. They prevent the upstream IP address associated with all single pod deployments from being updated and cause 502/504 errors on the user end until:

1. Any of the affected single pod deployments are scaled up to two or more pods
2. Either the `apisix` or the `apisix-ingress-controller` pods are restarted

It's not clear to me why scaling up resolves the issue. Restarting the `apisix` pods resets the `*_conf_version` values in their configurations (https://apisix.apache.org/docs/ingress-controller/reference/apisix-ingress-controller/configuration-troubleshoot/#inspect-synchronized-gateway-configurations) to 0, which allows the `apisix-ingress-controller` to write new values, and restarting the `apisix-ingress-controller` pods seems to correct the internal Unix timestamp that they use for `*_conf_version` values.

There appear to be no differences between the configurations reflected in the ADC configurations viewable through the debug API (https://apisix.apache.org/docs/ingress-controller/reference/apisix-ingress-controller/configuration-troubleshoot/#inspect-translated-adc-configurations) or the debug logs before and after the `apisix-ingress-controller` process restarts, so it seems like the issue is isolated to the `manager` container and the `apisix-ingress-controller` process within.
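For illustration, the rejection can be modeled as a monotonic version check. This is a minimal sketch and not APISIX's implementation: `GatewayModel` and `ms_to_utc` are hypothetical names, the only rule modeled is the one implied by the error text (`upstreams_conf_version must be greater than or equal to (...)`), and the lower value sent by the re-elected leader is an assumption, since the value it actually sends does not appear in the logs.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class GatewayModel:
    """Toy stand-in for one apisix instance's stored upstreams_conf_version."""

    def __init__(self) -> None:
        # A freshly restarted apisix pod is reported to start again from 0.
        self.upstreams_conf_version = 0

    def apply(self, version: int) -> None:
        # Rule implied by the error text: reject anything lower than the
        # version that is already stored; equal or higher is accepted.
        if version < self.upstreams_conf_version:
            raise ValueError(
                "upstreams_conf_version must be greater than or equal to "
                f"({self.upstreams_conf_version})"
            )
        self.upstreams_conf_version = version


def ms_to_utc(ms: int) -> str:
    """Render a millisecond Unix timestamp as an ISO 8601 UTC string."""
    return (EPOCH + timedelta(milliseconds=ms)).isoformat(timespec="milliseconds")


if __name__ == "__main__":
    stored = 1779434128737      # value quoted in the sync failure log
    stale = stored - 1          # assumed: any lower value from the old leader

    gateway = GatewayModel()
    gateway.apply(stored)       # sync by the pod that held the lease
    try:
        gateway.apply(stale)    # sync by the re-elected pod
    except ValueError as err:
        print("rejected:", err)

    GatewayModel().apply(stale)  # after an apisix restart the same value passes
    print("stored version decodes to", ms_to_utc(stored))
```

Decoding the value from the sync failure log gives 2026-05-22T07:15:28.737 UTC, which falls inside the window in which the other controller pod held the lease (between the two lease acquisitions shown in the logs).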
The Unix timestamp within the `*_conf_version` value reported in the error only seems to correspond to a successful sync on the leader pod at that time.

https://github.com/apache/apisix-ingress-controller/issues/2708 may be experiencing a similar issue, but in this case only single pod deployments are affected.

### Expected Behavior

Single pod deployments have their upstream IPs updated correctly even when the pod whose `apisix-ingress-controller` process (in the `manager` container) terminated becomes the ingress controller leader again.

### Error Logs

Lease timeout and process restart:

```
2026-05-21T11:06:19.710Z E0521 11:06:19.710497 1 leaderelection.go:429] Failed to update lock optimistically: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/ingress-apisix/leases/apisix-ingress-controller-leader?timeout=10s": context deadline exceeded (Client.Timeout exceeded while awaiting headers), falling back to slow path apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
2026-05-21T11:06:29.709Z E0521 11:06:29.709691 1 leaderelection.go:436] error retrieving resource lock ingress-apisix/apisix-ingress-controller-leader: Get "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/ingress-apisix/leases/apisix-ingress-controller-leader?timeout=10s": context deadline exceeded apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
2026-05-21T11:06:29.709Z I0521 11:06:29.709728 1 leaderelection.go:297] failed to renew lease ingress-apisix/apisix-ingress-controller-leader: context deadline exceeded apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
2026-05-21T11:06:29.710Z Error: leader election lost apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
2026-05-21T11:06:29.710Z 2026-05-21T11:06:29.709Z DEBUG controller-runtime.events recorder/recorder.go:104 apisix-ingress-controller-aaaaaaaaaa-aaaaa_4f765088-329e-487e-b94d-b9b5f782252b stopped leading {"type": "Normal", "object": {"kind":"Lease","namespace":"ingress-apisix","name":"apisix-ingress-controller-leader","uid":"49eeacf3-2112-429d-b2ad-13c653d548e5","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1540593399"}, "reason": "LeaderElection"} apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
2026-05-21T11:06:29.710Z Usage: apisix-ingress-controller [command] [flags] apisix-ingress-controller [command] apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
2026-05-21T11:06:29.710Z Available Commands: completion Generate the autocompletion script for the specified shell help Help about any command version version for apisix-ingress-controller apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
2026-05-21T11:06:29.710Z Flags: -c, --config-path string configuration file path for apisix-ingress-controller --controller-name string The name of the controller (default "apisix.apache.org/apisix-ingress-controller") --health-probe-bind-address string The address the probe endpoint binds to. (default ":8081") -h, --help help for apisix-ingress-controller --log-level string The log level for apisix-ingress-controller (default "info") --metrics-bind-address string The address the metrics endpoint binds to. Use :8443 for HTTPS or :8080 for HTTP, or leave as 0 to disable the metrics service. (default "0") -v, --version version for apisix-ingress-controller apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
2026-05-21T11:06:29.710Z Use "apisix-ingress-controller [command] --help" for more information about a command. apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
2026-05-21T11:06:29.710Z Error: leader election lost apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
2026-05-21T11:06:29.709Z INFO controller-runtime manager/internal.go:538 Stopping and waiting for non leader election runnables apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
2026-05-21T11:06:30.238Z INFO root/root.go:125 controller start configuration {"config": {"log_level":"debug","controller_name":"apisix.apache.org/apisix-ingress-controller","leader_election_id":"apisix-ingress-controller-leader","metrics_addr":":8080","server_addr":":9092","enable_server":false,"enable_http2":false,"probe_addr":":8081","secure_metrics":false,"leader_election":{"lease_duration":"30s","renew_deadline":"20s","retry_period":"2s"},"exec_adc_timeout":"15s","provider":{"type":"apisix-standalone","sync_period":"1m0s","init_sync_delay":"20m0s"},"webhook":{"enable":true,"tls_cert_file":"tls.crt","tls_key_file":"tls.key","tls_cert_dir":"/certs","port":9443},"disable_gateway_api":true}} apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
```

Leadership change:

```
2026-05-21T11:06:30.454Z I0521 11:06:30.454706 1 leaderelection.go:257] attempting to acquire leader lease ingress-apisix/apisix-ingress-controller-leader... apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
2026-05-21T11:06:41.883Z I0521 11:06:41.878897 1 leaderelection.go:271] successfully acquired lease ingress-apisix/apisix-ingress-controller-leader apisix-ingress-controller-bbbbbbbbbb-bbbbb manager
2026-05-21T11:06:41.879Z INFO provider apisix/provider.go:254 starting provider, waiting for readiness apisix-ingress-controller-bbbbbbbbbb-bbbbb manager
2026-05-21T11:06:41.879Z INFO status.updater status/updater.go:131 started status update handler apisix-ingress-controller-bbbbbbbbbb-bbbbb manager
2026-05-21T11:06:41.879Z DEBUG controller-runtime.events recorder/recorder.go:104 apisix-ingress-controller-bbbbbbbbbb-bbbbb_66e0b0a2-e092-4c01-8a56-2cce983bb382 became leader {"type": "Normal", "object": {"kind":"Lease","namespace":"ingress-apisix","name":"apisix-ingress-controller-leader","uid":"49eeacf3-2112-429d-b2ad-13c653d548e5","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1540594430"}, "reason": "LeaderElection"} apisix-ingress-controller-bbbbbbbbbb-bbbbb manager
```

Leadership returns to original pod:

```
2026-05-22T14:44:54.036Z I0522 14:44:54.036708 1 leaderelection.go:257] attempting to acquire leader lease ingress-apisix/apisix-ingress-controller-leader... apisix-ingress-controller-bbbbbbbbbb-bbbbb manager
2026-05-22T14:45:24.996Z I0522 14:45:24.996003 1 leaderelection.go:271] successfully acquired lease ingress-apisix/apisix-ingress-controller-leader apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
2026-05-22T14:45:24.996Z DEBUG controller-runtime.events recorder/recorder.go:104 apisix-ingress-controller-aaaaaaaaaa-aaaaa_ae6f86ca-a178-46ec-b99f-e3c301bcda8d became leader {"type": "Normal", "object": {"kind":"Lease","namespace":"ingress-apisix","name":"apisix-ingress-controller-leader","uid":"49eeacf3-2112-429d-b2ad-13c653d548e5","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1543653763"}, "reason": "LeaderElection"} apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
2026-05-22T14:45:24.996Z INFO provider apisix/provider.go:254 starting provider, waiting for readiness apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
2026-05-22T14:45:24.996Z INFO status.updater status/updater.go:131 started status update handler apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
```

Sync failure:

```
2026-05-22T14:50:26.156Z DEBUG provider.executor client/executor.go:303 received HTTP response from ADC Server {"server": "http://10.10.10.10:9180,http://10.10.10.20:9180", "status": 202, "response": "{\"status\":\"all_failed\",\"total_resources\":2,\"success_count\":0,\"failed_count\":2,\"success\":[],\"failed\":[{\"server\":\"http://10.10.10.10:9180\",\"event\":{},\"failed_at\":\"2026-05-22T14:50:26.000Z\",\"reason\":\"upstreams_conf_version must be greater than or equal to (1779434128737)\",\"response\":{\"status\":400,\"headers\":{\"date\":\"Fri, 22 May 2026 14:50:26 GMT\",\"content-type\":\"application/json\",\"transfer-encoding\":\"chunked\",\"connection\":\"keep-alive\",\"server\":\"APISIX/3.16.0\",\"access-control-allow-origin\":\"*\",\"access-control-allow-credentials\":\"true\",\"access-control-expose-headers\":\"*\",\"access-control-max-age\":\"3600\"},\"data\":{\"error_msg\":\"upstreams_conf_version must be greater than or equal to (1779434128737)\"}}},{\"server\":\"http://10.10.10.20:9180\",\"event\":{},\"failed_at\":\"2026-05-22T14:50:26.000Z\",\"reason\":\"upstreams_conf_version must be greater than or equal to (1779434128737)\",\"response\":{\"status\":400,\"headers\":{\"date\":\"Fri, 22 May 2026 14:50:26 GMT\",\"content-type\":\"application/json\",\"transfer-encoding\":\"chunked\",\"connection\":\"keep-alive\",\"server\":\"APISIX/3.16.0\",\"access-control-allow-origin\":\"*\",\"access-control-allow-credentials\":\"true\",\"access-control-expose-headers\":\"*\",\"access-control-max-age\":\"3600\"},\"data\":{\"error_msg\":\"upstreams_conf_version must be greater than or equal to (1779434128737)\"}}}]}"} apisix-ingress-controller-aaaaaaaaaa-aaaaa manager
```

### Steps to Reproduce

The bug depends on the `apisix-ingress-controller` process timing out when getting the `Lease` resource, which has made it difficult to reproduce, but these are the steps:

1. Scale the target deployment to 1 pod and the `apisix-ingress-controller` deployment to 2 pods
2. Wait for the `apisix-ingress-controller` process in the `manager` container of the leader pod to self-terminate
3. If the newly selected leader pod isn't the same as the previous one, terminate the leader until leadership returns to the previous pod
4. Kill the target deployment's pod, spawning a new pod
5. Observe the `upstreams_conf_version` error

### Environment

- APISIX version: 3.16.0
- APISIX Ingress controller version: 2.0.1
- APISIX Helm chart version: 2.14.0
- Kubernetes cluster version: 1.34.2
