Hello Kudu Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/22867
to look at the new patch set (#3).
Change subject: Add option to send no-op heartbeat operations batched PART1
......................................................................
Add option to send no-op heartbeat operations batched PART1
Due to the periodically sent heartbeat messages, a Kudu cluster with
thousands of tablets uses significant CPU and network resources even
when there is no user activity.
When multiple messages are sent to the same host within a short time
frame, they can be batched to reduce the CPU impact. This results in
fewer RPC calls, and some fields can be shared between the no-op
messages.
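To illustrate the batching idea only: the sketch below buffers no-op
heartbeats per destination tserver and flushes them as one RPC once the
batch is large or old enough. All names in it (NoOpHeartbeat,
HeartbeatBatch, MultiRaftBatcherSketch, the thresholds) are made up for
this example and do not mirror the actual interface added in
multi_raft_batcher.h.

    #include <chrono>
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct NoOpHeartbeat {
      std::string tablet_id;   // per-tablet fields stay in the batch entry
      int64_t committed_index;
    };

    struct HeartbeatBatch {
      std::string dest_uuid;   // shared fields are sent once per RPC
      std::vector<NoOpHeartbeat> entries;
      std::chrono::steady_clock::time_point first_added;
    };

    class MultiRaftBatcherSketch {
     public:
      // Instead of one RPC per tablet, buffer the no-op heartbeat and send
      // the whole batch once it is large enough or old enough.
      void AddNoOpHeartbeat(const std::string& dest_uuid, NoOpHeartbeat hb) {
        HeartbeatBatch& batch = batches_[dest_uuid];
        if (batch.entries.empty()) {
          batch.dest_uuid = dest_uuid;
          batch.first_added = std::chrono::steady_clock::now();
        }
        batch.entries.push_back(std::move(hb));
        if (batch.entries.size() >= kMaxBatchSize ||
            Age(batch) >= kMaxBatchDelay) {
          Flush(&batch);
        }
      }

     private:
      static constexpr size_t kMaxBatchSize = 25;
      static constexpr std::chrono::milliseconds kMaxBatchDelay{10};

      static std::chrono::milliseconds Age(const HeartbeatBatch& batch) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - batch.first_added);
      }

      void Flush(HeartbeatBatch* batch) {
        // The real code would issue a single batched consensus RPC here;
        // this sketch just clears the buffer.
        batch->entries.clear();
      }

      std::unordered_map<std::string, HeartbeatBatch> batches_;
    };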
Measurement:
I launched an AWS cluster with 1 master and 4 tservers (t3.xlarge
instances), then created 2000 tablets (RF=3) on them. cpu_stime seems
to decrease by 10-15% while the same number of no-op messages arrive.
Here is one result with the flag turned off vs. on (change in metrics
over 500 sec):
Metric: no_op_heartbeat_count, off: 920739, on: 919910, inc: -0.0900%
Metric: heartbeat_batch_count, off: 0,      on: 30664,  inc: n/a
Metric: cpu_stime,             off: 81208,  on: 64079,  inc: -21.093%
Metric: cpu_utime,             off: 176558, on: 170430, inc: -3.471%
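For clarity, the inc column is simply the relative change between the
off and on runs, (on - off) / off * 100; for example:

    #include <cstdio>

    // Relative change between the flag-off and flag-on runs, matching the
    // "inc" column above.
    double IncPercent(double off, double on) {
      return (on - off) / off * 100.0;
    }

    int main() {
      std::printf("cpu_stime inc: %.3f%%\n",
                  IncPercent(81208, 64079));  // prints -21.093%
      return 0;
    }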
The next 2 parts needed:
1) Process responses in multiple threads:
If we start writing to multiple tablets that share the same buffer at
the same time, then after the flush, when their responses arrive, the
new heartbeats (now carrying operations) are sent out on the same
thread (unbatched, so their responses are processed on multiple
threads and Kudu returns to normal).
This would rarely cause problems on a typical cluster. However, if you
have a 3-tserver setup with a single table having 30 hash-partitioned
tablets, it can add multiple seconds of delay to the write operation
(without increasing the overall CPU consumption): if all 30 heartbeats
are waiting in the buffer, one of the writes will flush it. When the
response arrives, we process it on a single thread and send out all 30
updates with actual operations on that single thread.
Possible solution:
+ Keep track of whether we are called in batch mode; if there have
already been 1-2 "send_more_immediately" cases, request a callback
instead of sending the message immediately (a rough sketch of both
follow-ups appears after item 2 below).
2) If a write request finds a no-op message still in the buffer, it
should discard that message instead of flushing the buffer. This would
make the problem in 1) appear much less frequently (and would
stabilize the unit tests that are currently flaky with
enable_multi_raft_heartbeat_batcher=1), so it should be done after 1)
is implemented so that it does not hide the problem.
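A rough sketch of both follow-ups above; every name in it (PeerSketch,
MaybeSendMoreImmediately, DiscardPendingNoOpHeartbeat, the threshold)
is hypothetical and not taken from the patch:

    #include <functional>

    class PeerSketch {
     public:
      // Follow-up 1): while processing a batched response, after a couple
      // of immediate sends hand the next send over to another thread via a
      // callback instead of sending synchronously on this thread.
      // (immediate_sends_in_batch_ would be reset per batched response.)
      void MaybeSendMoreImmediately(
          bool in_batch_mode,
          const std::function<void()>& schedule_on_pool) {
        if (in_batch_mode &&
            ++immediate_sends_in_batch_ > kMaxImmediateSendsPerBatch) {
          schedule_on_pool();  // defer the send to another thread
          return;
        }
        SendNextRequest();
      }

      // Follow-up 2): a write that finds a stale no-op heartbeat for this
      // peer in the batch buffer drops that entry rather than flushing the
      // whole batch.
      void PrepareWriteRequest() {
        DiscardPendingNoOpHeartbeat();  // instead of flushing the buffer
        SendNextRequest();
      }

     private:
      static constexpr int kMaxImmediateSendsPerBatch = 2;
      int immediate_sends_in_batch_ = 0;

      void SendNextRequest() {}              // placeholder for the real send path
      void DiscardPendingNoOpHeartbeat() {}  // placeholder
    };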
Change-Id: Ie92ba4de5eae00d56cd513cb644dce8fb6e14538
---
M src/kudu/client/client-test.cc
M src/kudu/consensus/CMakeLists.txt
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_peers.h
A src/kudu/consensus/multi_raft_batcher.cc
A src/kudu/consensus/multi_raft_batcher.h
M src/kudu/consensus/peer_manager.cc
M src/kudu/consensus/peer_manager.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/master/sys_catalog.h
M src/kudu/tablet/tablet_replica-test-base.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_copy_source_session-test.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
23 files changed, 765 insertions(+), 36 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/67/22867/3
--
To view, visit http://gerrit.cloudera.org:8080/22867
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie92ba4de5eae00d56cd513cb644dce8fb6e14538
Gerrit-Change-Number: 22867
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Martonka <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)