Zoltan Chovan has posted comments on this change. ( http://gerrit.cloudera.org:8080/22867 )

Change subject: Add option to send no-op heartbeat operations batched PART1
......................................................................


Patch Set 5:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/22867/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22867/5//COMMIT_MSG@10
PS5, Line 10: CPU and networking
maybe "CPU and network resources"?


http://gerrit.cloudera.org:8080/#/c/22867/5//COMMIT_MSG@34
PS5, Line 34: in
at


http://gerrit.cloudera.org:8080/#/c/22867/5//COMMIT_MSG@41
PS5, Line 41: sigle
single


http://gerrit.cloudera.org:8080/#/c/22867/5//COMMIT_MSG@44
PS5, Line 44: hearthbeat is waiting
heartbeats are waiting


http://gerrit.cloudera.org:8080/#/c/22867/5//COMMIT_MSG@44
PS5, Line 44: write
writes


http://gerrit.cloudera.org:8080/#/c/22867/5//COMMIT_MSG@59
PS5, Line 59: Next 2 parts that are needed:
            :
            : 1) Process response in multiple threads:
            : If we start to write multiple tablets at the same time that are in the
            : same buffer, then after the flush, when their responses arrive, the new
            : heartbeats (with operations) will be sent out on the same thread
            : (unbatched, so their responses will be multi-threaded, and Kudu will
            : return back to normal).
            :
            : This would rarely cause problems on a usual cluster. However, if you
            : have a 3 tserver setup with a single table having 30 tablets with hash
            : partitions, it can add multiple seconds of delay to the write
            : operation (but not increase the overall CPU consumption): If all 30
            : heartbeats are waiting in the buffer, one of the writes will flush it.
            : When the response arrives back, we will process it in a single thread.
            : We will send out 30 updates with actual operations on this single
            : thread.
            :
            : Possible solution:
            : + Keep track if we are called in batch mode and if there was already
            : 1-2 "send_more_immediately" cases, then request a callback instead
            : of sending the message immediately.
            :
            : 2) If a write request finds a no-op message still in the
            : buffer, it should discard it, not flush the buffer. It would make the
            : problem in 1) appear much less frequently (and stabilize the unit
            : tests that are now flaky with enable_multi_raft_heartbeat_batcher=1),
            : so this should be done after 1) is implemented (so we do not hide it).
            :
This seems like it's a duplicate of the previous part



-- 
To view, visit http://gerrit.cloudera.org:8080/22867
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie92ba4de5eae00d56cd513cb644dce8fb6e14538
Gerrit-Change-Number: 22867
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Martonka <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Zoltan Chovan <[email protected]>
Gerrit-Comment-Date: Wed, 28 May 2025 09:25:09 +0000
Gerrit-HasComments: Yes
