GitHub user bseto added a comment to the discussion: Adding Empty shard causing 
mass timeouts when under heavy load

> This shouldn't be an issue because it only writes the cluster nodes 
> information into the local file nodes.conf.

Would it be an issue if my disk was very busy? 
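
In case it helps narrow that down, here's roughly how I plan to test it (a minimal sketch assuming redis-py; the node address is just a placeholder taken from the logs below): saturate the disk that holds nodes.conf with something like `fio` or `dd` on the same volume, and watch whether command latency on the node spikes.

```
# Minimal sketch (assumes redis-py; HOST/PORT are placeholders): time PING
# against a node while the disk holding nodes.conf is saturated, e.g. with
# `fio` or `dd`, to see whether a busy disk stalls command processing.
import time
import redis

HOST, PORT = "172.30.95.9", 6379  # placeholder address taken from the logs

r = redis.Redis(host=HOST, port=PORT, socket_timeout=5)
for _ in range(60):
    start = time.monotonic()
    try:
        r.ping()
        print(f"PING {(time.monotonic() - start) * 1000:.1f} ms")
    except redis.exceptions.TimeoutError:
        print("PING timed out (>5s)")
    time.sleep(1)
```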

> Could you please provide the logs of the cluster node? It might have the 
> potential clues about what's going on at that time. For example, whether this 
> node role was changed or not?

Here's a zipped log:
[kvrocks-node-33-1h.logs.zip](https://github.com/user-attachments/files/24465746/kvrocks-node-33-1h.logs.zip)

I included a few minutes of logs before the incident. Here's some useful information:

1. My Grafana 14:30 timestamp corresponds to the logs' 21:30 timestamp.
2. Migration finishes at around the 14:29 mark, and we can see the op/s start to go crazy.
<img width="289" height="348" alt="image" src="https://github.com/user-attachments/assets/ca043bf5-d517-4458-8492-9cd94b649be2" />

3. The kvrocks-controller starts reporting errors right after the migration finishes. I don't have the logs from when it began, but the errors are basically the same, just with more nodes.
```
{"level":"error","timestamp":"2026-01-06T22:29:43.082Z","caller":"controller/cluster.go:313","msg":"Failed
 to sync the clusterName 
info","id":"JPurQbrFcDUaULqDQsvExCI0gEB9WKipZ6gYl5Jf","is_master":true,"addr":"172.30.95.9:6379","error":"read
 tcp 172.30.86.1:58634->172.30.95.9:6379: i/o timeout"...
{"level":"error","timestamp":"2026-01-06T22:29:43.082Z","caller":"controller/cluster.go:313","msg":"Failed
 to sync the clusterName 
info","id":"YDAcU3srM7d6iMyOX44rIuKp59fUEM9IoA0t49gv","is_master":true,"addr":"172.30.95.247:6379","error":"read
 tcp 172.30.86.1:33576->172.30.95.247:6379: i/o timeout"...
{"level":"warn","timestamp":"2026-01-06T22:29:49.089Z","caller":"controller/cluster.go:304","msg":"Failed
 to probe the 
node","id":"YDAcU3srM7d6iMyOX44rIuKp59fUEM9IoA0t49gv","is_master":true,"addr":"172.30.95.247:6379","error":"read
 tcp 172.30.86.1:48212->172.30.95.247:6379: i/o timeout","failure_count":1}
```
4. We start to see `MASTER MODE enabled by cluster topology setting` at 21:29:30 (log timestamp).
5. At 21:37:30 we start to see more of the `Going to remove the client` errors, which is around when the op/s stops reporting.
<img width="913" height="359" alt="image" src="https://github.com/user-attachments/assets/ed91a3f8-4845-4ca2-bd54-f7b51b5fc748" />
Something I notice on a machine that starts removing clients is that btop (basically top) still reports a lot of data going up and down the network. Its CPU usage is still quite high too, so the node is still operating (see the sketch after this list).
6. I have some circuit breaking in place, but I don't think it worked very well for this cluster; I had to manually restart the node before things returned to normal.
<img width="1822" height="668" alt="image" src="https://github.com/user-attachments/assets/d7fcbd67-fe52-4281-81d3-b5c3e72814d9" />
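
For point 5, this is how I'm watching the connection counts while a node is in that state (a minimal sketch assuming redis-py; the address is a placeholder, and I'm assuming kvrocks exposes `connected_clients` in the `# Clients` section of INFO like Redis does):

```
# Minimal sketch (assumes redis-py; host/port are placeholders): poll the
# "# Clients" section of INFO once a second to watch connection counts
# while the node is in the client-removal state.
import time
import redis

r = redis.Redis(host="172.30.95.9", port=6379, socket_timeout=5)
while True:
    try:
        clients = r.info("clients")
        print(f"connected_clients={clients.get('connected_clients')}")
    except redis.exceptions.TimeoutError:
        print("INFO timed out (>5s)")
    time.sleep(1)
```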

> You could see the command_stats via the info command.

```
# CommandStats
cmdstat_client:calls=3204,usec=455,usec_per_call=0.1420099875156055
cmdstat_cluster:calls=13731,usec=835218,usec_per_call=60.82717937513655
cmdstat_clusterx:calls=2,usec=2693,usec_per_call=1346.5
cmdstat_command:calls=1,usec=1792,usec_per_call=1792
cmdstat_config:calls=3204,usec=242110,usec_per_call=75.5649188514357
cmdstat_hello:calls=102010,usec=380233,usec_per_call=3.7274090775414175
cmdstat_hgetall:calls=1996260836,usec=284523117119,usec_per_call=142.52802639213826
cmdstat_hmget:calls=90774738,usec=6918249866,usec_per_call=76.21338291276588
cmdstat_hsetexpire:calls=92825387,usec=32159375464,usec_per_call=346.4502169433455
cmdstat_info:calls=3205,usec=856819644,usec_per_call=267338.4224648986
cmdstat_ping:calls=131,usec=30,usec_per_call=0.22900763358778625
cmdstat_slowlog:calls=6408,usec=12585,usec_per_call=1.9639513108614233
```

Hmm, the clusterx command doesn't seem to be taking that long?
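
For reference, here's how I'm reading that table: a quick sketch that parses the pasted CommandStats block from stdin and sorts it by usec_per_call (nothing kvrocks-specific, it just reads the text above).

```
# Quick sketch: read a pasted "# CommandStats" block on stdin and sort the
# commands by usec_per_call to spot the expensive ones at a glance.
import sys

stats = []
for line in sys.stdin:
    line = line.strip()
    if not line.startswith("cmdstat_"):
        continue
    name, fields = line.split(":", 1)
    parsed = dict(kv.split("=") for kv in fields.split(","))
    stats.append((float(parsed["usec_per_call"]), int(parsed["calls"]), name))

for usec_per_call, calls, name in sorted(stats, reverse=True):
    print(f"{name:24} {usec_per_call:>12.1f} usec/call {calls:>12} calls")
```

Sorted that way, `cmdstat_info` actually tops the list at roughly 267 ms per call, which stands out a lot more than clusterx does.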

GitHub link: 
https://github.com/apache/kvrocks/discussions/3331#discussioncomment-15431225
