Re: [PR] fix(limit-count): upgrade redis-cluster lib so NOSCRIPT is not treated as node failure [apisix]

via GitHub Tue, 23 Jun 2026 23:52:02 -0700


shreemaan-abhishek commented on PR #13579:
URL: https://github.com/apache/apisix/pull/13579#issuecomment-4786651822

Thanks for the careful read, you're right that this isn't behavior-neutral.
I dug into the why so the trade-off is explicit.

The staged connect timeout (100ms first attempt, 800ms on retries) and the
lower `max_redirection` / `max_connection_attempts` defaults (2/2) are
intentional, not accidental. They come from
[api7/lua-resty-redis-cluster#10](https://github.com/api7/lua-resty-redis-cluster/pull/10)
("add healthcheck for failed master nodes and refactor retry strategy"). The
goal there was to keep gateway CPU bounded when a redis node goes down: instead
of every request blocking for the full `redis_timeout` on a dead node's connect
(which piles up under load), the lib fails the connect fast, records the node
as unhealthy in a shared dict, short-circuits subsequent requests to it, and a
background timer re-probes to rehabilitate. That makes it a fast-fail circuit
breaker rather than a timeout-exhaustion path.

Two things worth noting on scope:

- `send_timeout` / `read_timeout` on these paths still honor
`redis_timeout`, and the non-retry paths (`_do_cmd_master`, pipeline) still
honor `connect_timeout` fully. So `redis_timeout` is not bypassed wholesale,
only the per-attempt connect budget in the retry/discovery loops is fixed.
- The breaker needs a `redis_cluster_health` shared dict plus
`resty.rediscluster.init()` to start its health-check timer; without them it is
inert. I've added both in this PR (dict guarded on the
limit-conn/limit-req/limit-count plugins, `init()` started in the privileged
agent), which is the same wiring this library has run with in production.

On your core concern, high-latency / TLS / cross-AZ clusters where a 100ms
first-attempt connect may routinely fail: that's a real limitation, and it's
that the staged floor is hardcoded in the library rather than configurable. I'd
prefer to fix that upstream in the library (make the connect floor
configurable, or honor `connect_timeout` as a floor when it is larger) as a
follow-up, rather than reverting the fast-fail design here, since reverting
would re-introduce the CPU problem #10 solved. I'll open that on the library
and add a note on the connect-timeout semantics to the limit-count docs so the
change is disclosed.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] fix(limit-count): upgrade redis-cluster lib so NOSCRIPT is not treated as node failure [apisix]

Reply via email to