arjunbhut commented on issue #2749:
URL: 
https://github.com/apache/apisix-ingress-controller/issues/2749#issuecomment-4421267853

   Hi @Baoyuantop, I'd like to add a second trigger to this issue. We're 
hitting the same `*_conf_version must be greater than or equal to (X)` 
rejections, but our setup never touches the Admin API and never mixes 
management methods, yet the issue keeps recurring across sync cycles. I believe 
there's a structural multi-replica drift in API-driven standalone mode that the 
current diagnosis doesn't cover.
   
   ## Setup
   
   | Component | Version |
   |---|---|
   | Helm chart | `apisix/apisix:2.14.0` |
   | APISIX | `apache/apisix:3.16.0-ubuntu` |
   | Ingress controller | `apache/apisix-ingress-controller:2.0.1` |
   | ADC sidecar | `ghcr.io/api7/adc:0.25.0` (overridden from chart default `0.23.1`) |
   | Replicas | **10** |
   | Provider | `apisix-standalone` (per official install guide) |
   
   Deployment values follow the docs verbatim:
   
   ```yaml
   apisix.deployment.role: traditional
   apisix.deployment.role_traditional.config_provider: yaml
   etcd.enabled: false
   ingress-controller.config.provider.type: apisix-standalone
   ```
   
   ## What we *don't* do
   
   - No `curl` against the Admin API (port 9180).
   - No `kubectl exec` to write config files.
   - No APISIX Dashboard.
   - Config managed exclusively via `Gateway`, `HTTPRoute`, and `GatewayProxy` 
CRDs (Gateway API + `apisix.apache.org/v1alpha1`).
   - Fresh cluster: `helm uninstall` + reinstall, fresh CRD apply, fresh 
secrets — same drift recurs within minutes.
   
   ## Symptom
   
   ADC runs a sync cycle roughly every minute. In each cycle, **a different 
subset of pods rejects with 400**, citing some `*_conf_version must be >= 
(epoch_ms)`. Sample over a 5-minute window (output from ADC 0.25.0's 
per-endpoint status, which is much appreciated, by the way):
   
   ```
   cycle 1: success=9, failed=1   pod 10.53.4.224  ssls_conf_version must be >= 1778497564298
   cycle 2: success=8, failed=2   pod 10.53.4.224  ssls_conf_version
                                  pod 10.53.4.221  upstreams_conf_version
   cycle 3: success=9, failed=1   pod 10.53.3.55   services_conf_version
   cycle 4: success=10            (one good sync)
   cycle 5: success=8, failed=2   pod 10.53.4.222, 10.53.4.220   ssls_conf_version
   ```
   
   Note that **the stuck pod rotates**. It's not stuck forever on one pod 
(which would match the user-triggered single-spike scenario in this issue). 
External effect: the TLS handshake succeeds for traffic landing on the "good" 
pods and fails (`tlsv1 alert internal_error`, because no SSL object is loaded) 
on the "stuck" pods. The handshake success rate fluctuates between ~30% and 
~85% depending on which pods are currently behind.
   
   ## Why "don't mix management" doesn't apply here
   
   The diagnosis above suggests the user pushed a high version via Admin API, 
so ADC's lower number gets rejected. That's a single-trigger scenario, 
recoverable by restart-both-sides. We've:
   
   1. Confirmed no Admin API access from our side (verified via apisix pod 
access logs — only `axios/1.13.2` UA, the controller's ADC client).
   2. Done the recommended `delete apisix pod + restart ingress-controller` 
recovery. Works for a sync cycle or two, then drift returns.
   3. Fully torn down: `helm uninstall apisix`, delete all CRs, fresh `helm 
install`. Drift still recurs within minutes.
   
   What we're seeing is **organic, structural drift** between N independent 
in-memory version counters and ADC's local CacheKey, with no user-induced 
trigger.
   
   ## Hypothesis
   
   `*_conf_version` is per-pod in-memory state. ADC sends one canonical version 
to N pods in parallel. Any time *one* pod's accept lands a moment later, or a 
retry pushes a slightly higher version to a single pod, that pod is now 
"ahead" of ADC's local CacheKey *and* ahead of the other N-1 pods. On the 
next periodic sync, ADC's version is below the ahead-pod's required minimum → 
400. ADC may be advancing its CacheKey only from the versions it believes it 
wrote, not from the maximum version observed across pods.
   
   In API-driven standalone with multiple replicas, this divergence appears 
unbounded: each cycle can produce a new "ahead" pod (because every cycle has 
its own micro-race), so the set of stuck pods rotates over time.
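
To make the hypothesis concrete, here is a toy model of the suspected race. It is only a sketch under assumptions: `simulate`, `step`, and `retry_bump` are hypothetical names, versions are small integers instead of epoch milliseconds, and nothing here is taken from the actual ADC or APISIX code. Each pod keeps its own counter and rejects anything lower; a retry leaves one pod slightly ahead while the controller's cache only records the version it intended to send.

```python
def simulate(num_pods, retried_pod_per_cycle, step=1, retry_bump=2):
    """Toy model: return, per sync cycle, the indices of pods that reject with 400.

    retried_pod_per_cycle lists, for each cycle, the pod index that receives
    an extra retry carrying a slightly higher version (or None for no retry).
    All names and numbers are illustrative assumptions.
    """
    pod_versions = [0] * num_pods   # per-pod in-memory conf_version
    cache_version = 0               # controller's local view (the "CacheKey")
    failures = []
    for retried in retried_pod_per_cycle:
        version = cache_version + step
        failed = []
        for pod in range(num_pods):
            if version < pod_versions[pod]:
                failed.append(pod)              # pod is ahead: 400
                continue
            pod_versions[pod] = version         # accepted: 204
            if pod == retried:
                # The retry lands with a higher version than the cache records.
                pod_versions[pod] = version + retry_bump
        cache_version = version     # cache tracks only the intended version
        failures.append(failed)
    return failures


if __name__ == "__main__":
    # Pod 0 is retried in cycle 1, pod 1 in cycle 2, then two quiet cycles.
    for cycle, failed in enumerate(simulate(3, [0, 1, None, None]), start=1):
        print(f"cycle {cycle}: failed pods = {failed}")
```

With these inputs the failing pod moves from pod 0 to pod 1 and then clears, which matches the rotating pattern in the sample above in shape only; the real timing and version values will differ.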
   
   ## Reproducible?
   
   Yes, in any cluster with `replicaCount: 10` + `apisix-standalone` provider + 
only-CRD management. Happy to provide:
   
   - Helm values used.
   - ADC controller logs across many cycles showing the rotation.
   - Full per-pod 204/400 distribution.
   
   ## Questions
   
   1. Is there a documented multi-replica-safe configuration for 
`apisix-standalone`, or is this deployment mode currently single-replica by 
design?
   2. Would it be feasible for ADC to update its local CacheKey from 
`max(local, last_observed_per_pod)` rather than only from its own writes? That 
would eliminate the drift after one full cycle.
   3. Is there a `disable_conf_version_check` or similar option for 
environments where the controller is the only writer?
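
As a rough illustration of question 2, the sketch below raises the controller's local versions to the highest minimum any pod reported in a rejection. It is an assumption-laden example, not ADC's implementation: `next_versions` is a hypothetical helper, and the message pattern is taken only from the error text quoted in this comment, so the real response body may be shaped differently.

```python
import re

# Assumed message shape, based on the rejection text quoted in this comment.
_MIN_VERSION = re.compile(
    r"(\w+_conf_version) must be greater than or equal to \(?(\d+)\)?"
)


def next_versions(local_versions, rejection_messages):
    """Return versions raised to max(local, minimum reported by any pod)."""
    merged = dict(local_versions)
    for message in rejection_messages:
        for key, minimum in _MIN_VERSION.findall(message):
            merged[key] = max(merged.get(key, 0), int(minimum))
    return merged


if __name__ == "__main__":
    local = {"ssls_conf_version": 100, "routes_conf_version": 100}
    rejections = [
        "ssls_conf_version must be greater than or equal to (105)",
        "ssls_conf_version must be greater than or equal to (103)",
    ]
    print(next_versions(local, rejections))
```

Taking the maximum means a single pod that got ahead pulls the shared version forward, so the next push should be acceptable to every pod rather than only to the ones that were in step.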
   
   If the answer to (1) is "single replica only," that should probably be 
more prominent in the [Install with 
Helm](https://apisix.apache.org/docs/ingress-controller/install/) guide: the 
canonical example does not call this out, and 10 replicas seemed like a 
reasonable default for an ingress.
   
   Happy to file as a separate `bug:` if the maintainers prefer keeping this 
issue scoped to the original (Admin API mixed-management) trigger.
   
   ---
   
   *Comment drafted with Claude Code assistance, reviewed and posted by 
@arjunbhut.*

