shreemaan-abhishek commented on PR #13579:
URL: https://github.com/apache/apisix/pull/13579#issuecomment-4786651822

   Thanks for the careful read, you're right that this isn't behavior-neutral. 
I dug into the why so the trade-off is explicit.
   
   The staged connect timeout (100ms first attempt, 800ms on retries) and the 
lower `max_redirection` / `max_connection_attempts` defaults (2/2) are 
intentional, not accidental. They come from 
[api7/lua-resty-redis-cluster#10](https://github.com/api7/lua-resty-redis-cluster/pull/10)
 ("add healthcheck for failed master nodes and refactor retry strategy"). The 
goal there was to keep gateway CPU bounded when a redis node goes down: instead 
of every request blocking for the full `redis_timeout` on a dead node's connect 
(which piles up under load), the lib fails the connect fast, records the node 
as unhealthy in a shared dict, short-circuits subsequent requests to it, and a 
background timer re-probes to rehabilitate. That makes it a fast-fail circuit 
breaker rather than a timeout-exhaustion path.
   
   Two things worth noting on scope:
   
   - `send_timeout` / `read_timeout` on these paths still honor 
`redis_timeout`, and the non-retry paths (`_do_cmd_master`, pipeline) still 
honor `connect_timeout` fully. So `redis_timeout` is not bypassed wholesale, 
only the per-attempt connect budget in the retry/discovery loops is fixed.
   - The breaker needs a `redis_cluster_health` shared dict plus 
`resty.rediscluster.init()` to start its health-check timer; without them it is 
inert. I've added both in this PR (dict guarded on the 
limit-conn/limit-req/limit-count plugins, `init()` started in the privileged 
agent), which is the same wiring this library has run with in production.
   
   On your core concern, high-latency / TLS / cross-AZ clusters where a 100ms 
first-attempt connect may routinely fail: that's a real limitation, and it's 
that the staged floor is hardcoded in the library rather than configurable. I'd 
prefer to fix that upstream in the library (make the connect floor 
configurable, or honor `connect_timeout` as a floor when it is larger) as a 
follow-up, rather than reverting the fast-fail design here, since reverting 
would re-introduce the CPU problem #10 solved. I'll open that on the library 
and add a note on the connect-timeout semantics to the limit-count docs so the 
change is disclosed.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to