GitHub user bseto added a comment to the discussion: Adding Empty shard causing mass timeouts when under heavy load
> This shouldn't be an issue because it only writes the cluster nodes information into the local file nodes.conf.

Would it be an issue if my disk was very busy?

> Could you please provide the logs of the cluster node? It might have the potential clues about what's going on at that time. For example, whether this node role was changed or not?

Here's a zipped log: [kvrocks-node-33-1h.logs.zip](https://github.com/user-attachments/files/24465746/kvrocks-node-33-1h.logs.zip)

I included a few minutes before the incident, but here's some useful information:

1. My Grafana 14:30 timestamp correlates with the logs' 21:30 timestamp.
2. Migration finishes at around the 14:29 mark, and we can start to see the op/s go crazy.
   <img width="289" height="348" alt="image" src="https://github.com/user-attachments/assets/ca043bf5-d517-4458-8492-9cd94b649be2" />
3. The kvrocks-controller starts reporting errors right after the migration finishes (see the probe sketch after this list). I don't have the logs from when it began, but the errors are basically the same, just with more nodes:

   ```
   {"level":"error","timestamp":"2026-01-06T22:29:43.082Z","caller":"controller/cluster.go:313","msg":"Failed to sync the clusterName info","id":"JPurQbrFcDUaULqDQsvExCI0gEB9WKipZ6gYl5Jf","is_master":true,"addr":"172.30.95.9:6379","error":"read tcp 172.30.86.1:58634->172.30.95.9:6379: i/o timeout"...
   {"level":"error","timestamp":"2026-01-06T22:29:43.082Z","caller":"controller/cluster.go:313","msg":"Failed to sync the clusterName info","id":"YDAcU3srM7d6iMyOX44rIuKp59fUEM9IoA0t49gv","is_master":true,"addr":"172.30.95.247:6379","error":"read tcp 172.30.86.1:33576->172.30.95.247:6379: i/o timeout"...
   {"level":"warn","timestamp":"2026-01-06T22:29:49.089Z","caller":"controller/cluster.go:304","msg":"Failed to probe the node","id":"YDAcU3srM7d6iMyOX44rIuKp59fUEM9IoA0t49gv","is_master":true,"addr":"172.30.95.247:6379","error":"read tcp 172.30.86.1:48212->172.30.95.247:6379: i/o timeout","failure_count":1}
   ```

4. We start to see `MASTER MODE enabled by cluster topology setting` at 21:29:30 (log timestamp).
5. At 21:37:30 we start to see more of the `Going to remove the client` errors, which is around when the op/s stops reporting.
   <img width="913" height="359" alt="image" src="https://github.com/user-attachments/assets/ed91a3f8-4845-4ca2-bd54-f7b51b5fc748" />

   Something I notice on a machine that starts removing clients is that btop (basically top) will report that it's still uploading/downloading a lot of data over the network. Its CPU usage is also still quite high, so it's still doing work.
6. I have some circuit breaking in place, but I don't think it worked very well for this cluster. I had to manually restart the node before things returned to normal.
   <img width="1822" height="668" alt="image" src="https://github.com/user-attachments/assets/d7fcbd67-fe52-4281-81d3-b5c3e72814d9" />
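In case it helps, this is the kind of standalone probe I plan to run against the affected node the next time this happens, to see whether it still answers a topology read within a short deadline while it's busy. It's only a rough sketch using go-redis; the address and timeout values are placeholders I picked, and it is not the controller's actual probe code.

```go
// probe.go: minimal external probe sketch (not the kvrocks-controller's probe code).
// It repeatedly asks the node for CLUSTER INFO with a short deadline, so you can see
// whether the node still responds while op/s and CPU look saturated.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	rdb := redis.NewClient(&redis.Options{
		Addr:        "172.30.95.247:6379", // placeholder: one of the nodes the controller reported as timing out
		DialTimeout: 2 * time.Second,
		ReadTimeout: 2 * time.Second, // arbitrary; roughly mirrors a short controller-style i/o timeout
	})
	defer rdb.Close()

	for {
		ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
		start := time.Now()
		// CLUSTER INFO is the kind of read a topology sync needs; a bare PING can
		// still succeed even when heavier commands are queuing behind slow requests.
		_, err := rdb.ClusterInfo(ctx).Result()
		cancel()
		if err != nil {
			fmt.Printf("%s probe failed after %v: %v\n", time.Now().Format(time.RFC3339), time.Since(start), err)
		} else {
			fmt.Printf("%s probe ok in %v\n", time.Now().Format(time.RFC3339), time.Since(start))
		}
		time.Sleep(5 * time.Second)
	}
}
```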
> You could see the command_stats via the info command.

```
# CommandStats
cmdstat_client:calls=3204,usec=455,usec_per_call=0.1420099875156055
cmdstat_cluster:calls=13731,usec=835218,usec_per_call=60.82717937513655
cmdstat_clusterx:calls=2,usec=2693,usec_per_call=1346.5
cmdstat_command:calls=1,usec=1792,usec_per_call=1792
cmdstat_config:calls=3204,usec=242110,usec_per_call=75.5649188514357
cmdstat_hello:calls=102010,usec=380233,usec_per_call=3.7274090775414175
cmdstat_hgetall:calls=1996260836,usec=284523117119,usec_per_call=142.52802639213826
cmdstat_hmget:calls=90774738,usec=6918249866,usec_per_call=76.21338291276588
cmdstat_hsetexpire:calls=92825387,usec=32159375464,usec_per_call=346.4502169433455
cmdstat_info:calls=3205,usec=856819644,usec_per_call=267338.4224648986
cmdstat_ping:calls=131,usec=30,usec_per_call=0.22900763358778625
cmdstat_slowlog:calls=6408,usec=12585,usec_per_call=1.9639513108614233
```

hmm, the clusterx seems like it's not taking that long?
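To make that comparison easier, here is a rough sketch (again go-redis, placeholder address) that pulls `INFO commandstats` and ranks commands by total `usec` rather than per-call cost. It assumes the node returns the same `cmdstat_<name>:calls=...,usec=...,usec_per_call=...` line format shown above.

```go
// cmdstats.go: rough sketch (not an official kvrocks tool) that ranks commands
// by total time spent (usec), which is easier to eyeball than usec_per_call alone.
package main

import (
	"context"
	"fmt"
	"sort"
	"strconv"
	"strings"

	"github.com/redis/go-redis/v9"
)

type stat struct {
	name  string
	calls int64
	usec  int64
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "127.0.0.1:6379"}) // placeholder address
	defer rdb.Close()

	raw, err := rdb.Info(ctx, "commandstats").Result()
	if err != nil {
		panic(err)
	}

	var stats []stat
	for _, line := range strings.Split(raw, "\n") {
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, "cmdstat_") {
			continue
		}
		// e.g. cmdstat_hgetall:calls=1996260836,usec=284523117119,usec_per_call=142.5
		name, fields, ok := strings.Cut(strings.TrimPrefix(line, "cmdstat_"), ":")
		if !ok {
			continue
		}
		s := stat{name: name}
		for _, kv := range strings.Split(fields, ",") {
			k, v, _ := strings.Cut(kv, "=")
			switch k {
			case "calls":
				s.calls, _ = strconv.ParseInt(v, 10, 64)
			case "usec":
				s.usec, _ = strconv.ParseInt(v, 10, 64)
			}
		}
		stats = append(stats, s)
	}

	// Rank by total time spent in each command.
	sort.Slice(stats, func(i, j int) bool { return stats[i].usec > stats[j].usec })
	for _, s := range stats {
		var perCall float64
		if s.calls > 0 {
			perCall = float64(s.usec) / float64(s.calls)
		}
		fmt.Printf("%-12s calls=%-12d total_usec=%-15d usec_per_call=%.1f\n", s.name, s.calls, s.usec, perCall)
	}
}
```

GitHub link: https://github.com/apache/kvrocks/discussions/3331#discussioncomment-15431225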
