Hello Zoltan Chovan, Alexey Serbin, Kudu Jenkins, Abhishek Chennaka,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/22867
to look at the new patch set (#7).
Change subject: KUDU-3665 Send no-op heartbeat operations batched PART1
......................................................................
KUDU-3665 Send no-op heartbeat operations batched PART1
Due to the periodically sent heartbeat messages, a Kudu cluster with
thousands of tablets still consumes significant CPU and network
resources, even without any user activity. When multiple messages
are sent to the same host within a short time frame, they can be
batched to reduce CPU usage.
This results in fewer RPC calls, and some fields can also be shared
between the no-op messages, further reducing network traffic.
We only batch the periodic heartbeats and still send the leadership
no-op heartbeats unbatched. Batching only the periodic no-op heartbeats
allows the following:
+ Any message in the buffer is "discardable" if another request
(e.g., due to a write) arrives, so the buffering does not increase the
response time for write requests.
+ We can process the batch request on a single thread, since an empty
periodic update does not take much time.
Note that if some of the responses in the batch trigger further
updates, ProcessResponse calls DoProcessResponse on a separate
thread. Processing the batch response on a single thread is therefore
fine as well, and that logic needs no change.
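The two properties above can be illustrated with a minimal sketch. It is not the code in multi_raft_batcher.cc: the class name, the method names, the batch size limit, and the flush window are all hypothetical and chosen only for illustration. It shows a per-destination buffer in which a pending periodic heartbeat is dropped as soon as a real request for the same tablet arrives, and in which everything still pending is sent as one batch.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatBatcher:
    """Hypothetical per-destination buffer of periodic no-op heartbeats."""

    max_batch_size: int = 30      # assumed limit, not taken from the patch
    flush_window_s: float = 0.1   # assumed buffering window, not from the patch
    pending: dict = field(default_factory=dict)       # tablet_id -> enqueue time
    sent_batches: list = field(default_factory=list)  # stands in for batch RPCs

    def enqueue_periodic(self, tablet_id, now):
        """Buffer a periodic no-op heartbeat instead of sending it at once."""
        self.pending.setdefault(tablet_id, now)
        if len(self.pending) >= self.max_batch_size:
            self.flush()

    def discard(self, tablet_id):
        """Drop a buffered heartbeat because a real request (e.g. a write)
        for the same tablet is about to be sent unbatched.
        Returns True if something was actually discarded."""
        return self.pending.pop(tablet_id, None) is not None

    def flush_if_due(self, now):
        """Send the batch once the oldest buffered heartbeat is old enough."""
        if self.pending and now - min(self.pending.values()) >= self.flush_window_s:
            self.flush()

    def flush(self):
        """Emit all pending heartbeats as a single batch."""
        if self.pending:
            self.sent_batches.append(sorted(self.pending))
            self.pending.clear()
```

Because a write only ever removes an entry from the buffer and is itself sent immediately, the buffering in this sketch adds no latency to the write path.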
Measurement:
I started a 1 master + 4 TS setup using t3.xlarge instances.
Created 200 tablets with 10 hash partitions.
Then performed the following random sampling:
+ Set the flag randomly to 0 or 1 and wait 5 seconds.
+ Start measuring CPU usage and packet count changes for 40 seconds.
+ Sometimes at the start of the 40-second window, start 5 separate write tasks
(from 5 different VMs) writing 1 million rows into a random table.
The results were the following (write_time is the call time measured by the Kudu client):
batching: True, write: True:
Run count: 41
write_time avg: 9.405540355821936, min 6.921137809753418, max 13.24696135520935, med 9.362545013427734
cpu_utime avg: 55224.170731707316, min 36642, max 65589, med 56495
cpu_stime avg: 12610.90243902439, min 8967, max 18016, med 12052
no_op_heartbeat_count avg: 70389.63414634146, min 69293, max 71336, med 70410
heartbeat_batch_count avg: 2346.1951219512193, min 2310, max 2378, med 2347
batching: False, write: True:
Run count: 38
write_time avg: 9.792724048463922, min 7.515683650970459, max 13.574491500854492, med 9.749269485473633
cpu_utime avg: 57716.68421052631, min 39489, max 66397, med 58957.0
cpu_stime avg: 15128.28947368421, min 11146, max 18691, med 14974.0
no_op_heartbeat_count avg: 70649.84210526316, min 69802, max 71540, med 70640.0
heartbeat_batch_count avg: 0.0, min 0, max 0, med 0.0
batching: True, write: False:
Run count: 46
write_time avg: n/a, min n/a, max n/a, med n/a
cpu_utime avg: 32699.347826086956, min 17489, max 51737, med 28683.0
cpu_stime avg: 10972.717391304348, min 8285, max 15309, med 10992.5
no_op_heartbeat_count avg: 68949.84782608696, min 68778, max 69293, med 68928.5
heartbeat_batch_count avg: 2298.195652173913, min 2293, max 2309, med 2298.0
batching: False, write: False:
Run count: 53
write_time avg: n/a, min n/a, max n/a, med n/a
cpu_utime avg: 35980.16981132075, min 24341, max 54436, med 31923
cpu_stime avg: 13775.792452830188, min 9336, max 17132, med 13524
no_op_heartbeat_count avg: 68985.75471698113, min 68752, max 69294, med 68950
heartbeat_batch_count avg: 0.0, min 0, max 0, med 0
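To make the four result blocks easier to compare, the averages can be reduced to relative changes. The snippet below is only a convenience sketch: the dictionary layout and the function names are made up for illustration, and the numbers are the averages reported above, rounded to two decimals.

```python
# Averages copied from the four result blocks above (rounded).
# "write" = runs with concurrent write tasks, "idle" = runs without.
AVG = {
    ("batched", "write"): {"cpu_utime": 55224.17, "cpu_stime": 12610.90,
                           "no_op": 70389.63, "batches": 2346.20},
    ("unbatched", "write"): {"cpu_utime": 57716.68, "cpu_stime": 15128.29,
                             "no_op": 70649.84, "batches": 0.0},
    ("batched", "idle"): {"cpu_utime": 32699.35, "cpu_stime": 10972.72,
                          "no_op": 68949.85, "batches": 2298.20},
    ("unbatched", "idle"): {"cpu_utime": 35980.17, "cpu_stime": 13775.79,
                            "no_op": 68985.75, "batches": 0.0},
}


def reduction_pct(metric, workload):
    """Relative drop of an averaged metric when batching is enabled."""
    base = AVG[("unbatched", workload)][metric]
    new = AVG[("batched", workload)][metric]
    return 100.0 * (base - new) / base


def heartbeats_per_batch(workload):
    """Average number of no-op heartbeats carried by one batch request."""
    row = AVG[("batched", workload)]
    return row["no_op"] / row["batches"]


for workload in ("write", "idle"):
    print(workload,
          "utime -%.1f%%" % reduction_pct("cpu_utime", workload),
          "stime -%.1f%%" % reduction_pct("cpu_stime", workload),
          "heartbeats/batch %.1f" % heartbeats_per_batch(workload))
```

On these averages, cpu_stime drops by roughly 17% with writes and 20% without, cpu_utime by roughly 4% and 9%, and each batch carries about 30 heartbeats. These are differences of means over runs with a wide min/max spread, so they indicate a trend rather than a precise gain.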
Change-Id: Ie92ba4de5eae00d56cd513cb644dce8fb6e14538
---
M src/kudu/consensus/CMakeLists.txt
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_peers.h
A src/kudu/consensus/multi_raft_batcher.cc
A src/kudu/consensus/multi_raft_batcher.h
M src/kudu/consensus/peer_manager.cc
M src/kudu/consensus/peer_manager.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/master/sys_catalog.h
M src/kudu/tablet/tablet_replica-test-base.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_copy_source_session-test.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
22 files changed, 845 insertions(+), 44 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/67/22867/7
--
To view, visit http://gerrit.cloudera.org:8080/22867
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie92ba4de5eae00d56cd513cb644dce8fb6e14538
Gerrit-Change-Number: 22867
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Martonka <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Zoltan Chovan <[email protected]>
Gerrit-Reviewer: Zoltan Martonka <[email protected]>