GitHub user bseto created a discussion: SetClusterNodes causing timeouts and mass disconnect/reconnects when under heavy load
Hi, I'm wondering if anyone else is having this problem, and whether there's anything I can do to help find a resolution.

# Problem

Note: this only happens on our production cluster under heavy load.

When adding an empty shard (so no migrations are involved) using the kvrocks-controller `create shard ...`, our readers and writers start mass disconnecting and reconnecting, and they keep doing so until we move reads off the cluster. The same thing happens when we migrate slots, but I think I've narrowed it down to `SetClusterNodes`: perhaps the `kCmdExclusive` lock is held too long and commands time out?

We deploy to 4 regions. During testing in November and December, the 2nd and 3rd largest clusters showed this behaviour whenever I added an empty shard or did any migration. As a hail-mary, I had Claude Opus read the kvrocks codebase to see if it could find anything, and it suggested setting `persist-cluster-nodes-enabled no`, since persisting the nodes file could add to how long the exclusive lock needs to be held. It's January now and the load is similar (slightly lower than in November/December). After setting `persist-cluster-nodes-enabled no`, migrations on the 2nd and 3rd largest clusters no longer show the issue, but the largest cluster still does.

# Testing

## Attempt at reproducing outside production

I spent a week in November trying to reproduce this issue on a non-production cluster and was unable to. I had 8 r6idn.large nodes in the cluster and 16 c6gn.2xlarge nodes driving the load test (a mix of readers and writers), and could not trigger the issue.

## In production

Details of our largest cluster:

- kvrocks version: 2.12.1
- Instance type: r6idn.8xlarge
- Number of nodes: 44
- Operations: 3.4M ops/s (3M read, 130k hsetexpire, 130k hmget); each node averages 80k ops/s at peak times

We were originally using the [rueidis](https://github.com/redis/rueidis) client to connect to the cluster, and we tried changing its timeout settings and pipelining. I originally thought it might be the way the client handled `MOVED` errors, but getting the timeouts even when we just add an empty shard ruled that out (I had also modified the client to track whether we were getting MOVED errors, and we weren't).

We then moved to [go-redis](https://github.com/redis/go-redis) and played with its timeouts, including ignoring context timeouts. Following this [document](https://uptrace.dev/blog/golang-context-timeout) about Go context timeouts being potentially harmful, we set go-redis to a 10-second timeout and added our own wrapping function that times out the original request while letting the underlying Redis connection live rather than being torn down. The idea was that future requests to the go-redis client would fail fast with pool-exhaustion errors from go-redis instead of spamming the kvrocks cluster. This still didn't work; somehow requests would still time out.

Currently we're using go-redis with the `Limiter` interface to implement a circuit breaker: whenever we see a large burst of errors, we open the breaker for a while before connecting again. But recovery still takes 5-10 minutes, and it isn't ideal, since we'd still have 40+ migrations to do, each ending in a disconnect/reconnect storm.
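For anyone unfamiliar with go-redis's `Limiter` hook, here is a minimal sketch of the kind of breaker I mean. It is simplified and not our production code: the threshold, cool-down, and node address are placeholder values, and since (as far as I can tell) `ClusterOptions` doesn't expose `Limiter` directly, it wires the breaker in through the `NewClient` hook.

```go
package main

import (
	"errors"
	"sync"
	"time"

	"github.com/redis/go-redis/v9"
)

// ErrCircuitOpen is returned by Allow while the breaker is open; go-redis
// then fails the command immediately without touching the connection pool.
var ErrCircuitOpen = errors.New("circuit breaker open")

// breaker is a small failure-counting circuit breaker that satisfies the
// go-redis Limiter interface (Allow / ReportResult).
type breaker struct {
	mu          sync.Mutex
	failures    int
	openedAt    time.Time
	maxFailures int           // placeholder threshold
	openFor     time.Duration // placeholder cool-down
}

func (b *breaker) Allow() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if !b.openedAt.IsZero() {
		if time.Since(b.openedAt) < b.openFor {
			return ErrCircuitOpen // still cooling down: reject immediately
		}
		// Cool-down elapsed: close the breaker and let traffic through again.
		b.openedAt = time.Time{}
		b.failures = 0
	}
	return nil
}

func (b *breaker) ReportResult(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil || errors.Is(err, redis.Nil) {
		b.failures = 0
		return
	}
	b.failures++
	if b.failures >= b.maxFailures {
		b.openedAt = time.Now() // trip the breaker
	}
}

func newClusterClient(addrs []string) *redis.ClusterClient {
	return redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: addrs,
		// Attach a breaker to every per-node client the cluster client creates.
		NewClient: func(opt *redis.Options) *redis.Client {
			opt.Limiter = &breaker{maxFailures: 100, openFor: 30 * time.Second}
			return redis.NewClient(opt)
		},
	})
}

func main() {
	cc := newClusterClient([]string{"kvrocks-node-1:6666"}) // placeholder address
	defer cc.Close()
}
```

Here each per-node client gets its own breaker, which seems to map better onto the fact that only a fraction of the nodes get hammered than a single shared breaker would.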
Note: I'm also noticing it's not every node. Usually only about 10% of the nodes in the cluster get hammered hard and have clients continuously connecting and disconnecting.

# Thoughts

Is there a way I can confirm that `SetClusterNodes` is indeed taking too long and is the culprit? Or are there any ideas on what settings we could change?

GitHub link: https://github.com/apache/kvrocks/discussions/3331
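One thing I plan to try for the first question, assuming kvrocks records `CLUSTERX SETNODES` in its slowlog like a regular command (I haven't verified that), is to lower `slowlog-log-slower-than` on a couple of the affected nodes and dump the slowlog right after a topology change. A rough go-redis sketch, with the node address, threshold, and sleep as placeholders:

```go
package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// Point a plain client at one suspect node; we want that node's own slowlog.
	node := redis.NewClient(&redis.Options{Addr: "kvrocks-node-1:6666"}) // placeholder
	defer node.Close()

	// Log anything slower than 20ms (value is in microseconds; placeholder threshold).
	if err := node.ConfigSet(ctx, "slowlog-log-slower-than", "20000").Err(); err != nil {
		panic(err)
	}

	// ... trigger the `create shard` / migration from kvrocks-controller here ...
	time.Sleep(30 * time.Second)

	// Dump recent slowlog entries and pick out the clusterx ones.
	entries, err := node.SlowLogGet(ctx, 128).Result()
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		if len(e.Args) > 0 && strings.EqualFold(e.Args[0], "clusterx") {
			fmt.Printf("%s took %s: %v\n", e.Time.Format(time.RFC3339), e.Duration, e.Args)
		}
	}
}
```

It might also be worth scanning the same window for ordinary commands with unusually long durations, since those would show up if they were stuck behind the exclusive lock (assuming the recorded duration includes that wait, which I'm not sure about).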
