Grant Henke created KUDU-3134:
---------------------------------
Summary: Adjust default value for --raft_heartbeat_interval
Key: KUDU-3134
URL: https://issues.apache.org/jira/browse/KUDU-3134
Project: Kudu
Issue Type: Improvement
Affects Versions: 1.12.0
Reporter: Grant Henke
Users often increase the `--raft_heartbeat_interval` on larger clusters or on
clusters with high replica counts. This helps keep the servers from flooding
each other with heartbeat RPCs, which causes queue overflows and consumes too
much CPU while idle. Users have raised the value to anywhere from 1.5s to as
high as 10s, and we have never seen anyone report problems after doing so.
Anecdotally, I recently saw a cluster with 4k tablets per tablet server at
~150% CPU usage while idle. Increasing the `--raft_heartbeat_interval` from
500ms to 1500ms dropped the CPU usage to ~50%.
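As a rough sanity check on the anecdote above, the idle heartbeat load can be estimated with a back-of-the-envelope model. This is only a sketch under stated assumptions that do not come from this issue: a replication factor of 3, leaders spread evenly across servers, and one heartbeat RPC per follower per interval while idle. The function name is hypothetical.

```python
def idle_heartbeat_rpcs_per_sec(replicas_per_server, interval_ms, replication_factor=3):
    """Estimate outgoing idle heartbeat RPCs per second for one tablet server.

    Assumptions (not from the issue): leaders are evenly distributed, so a
    server leads 1/replication_factor of its replicas, and each leader sends
    one heartbeat to each follower every interval.
    """
    leaders = replicas_per_server / replication_factor
    followers_per_leader = replication_factor - 1
    return leaders * followers_per_leader / (interval_ms / 1000.0)


# 4k tablets per tablet server, as in the anecdote.
for interval_ms in (500, 1500):
    rate = idle_heartbeat_rpcs_per_sec(4000, interval_ms)
    print(f"{interval_ms} ms -> ~{rate:.0f} outgoing heartbeat RPCs/s per server")
```

Under this model the RPC rate scales inversely with the interval, so tripling the interval cuts the rate to a third. That is in line with the observed drop from ~150% to ~50% CPU, although the model itself says nothing about CPU cost per RPC.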
Generally speaking, users often care more about Kudu stability and scalability
than about an extremely short MTTR. Additionally, our default client RPC
timeout of 30s suggests that slightly longer failover/retry times are tolerable
in the default case.
We should consider adjusting the default value of `--raft_heartbeat_interval`
to a higher value to support larger and more efficient clusters by default.
Users who need a low MTTR can always lower the value while also adjusting
other related timeouts. We may also want to consider adjusting the default
`--heartbeat_interval_ms` accordingly.
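To make the MTTR trade-off concrete, the following sketch estimates how long a follower would wait before suspecting a dead leader at different heartbeat intervals, and compares that with the 30s client RPC timeout mentioned above. It assumes, purely for illustration, that a leader is considered failed after a fixed number of missed heartbeat periods; the `max_missed_periods` parameter and its value of 3 are assumptions, not values taken from this issue, and election backoff is ignored.

```python
CLIENT_RPC_TIMEOUT_MS = 30_000  # default client RPC timeout cited in the issue


def min_failure_detection_ms(interval_ms, max_missed_periods=3):
    """Lower bound on leader failure detection time.

    Assumption (illustrative only): a follower suspects the leader after
    max_missed_periods heartbeat intervals pass without a heartbeat.
    """
    return interval_ms * max_missed_periods


for interval_ms in (500, 1500, 10_000):
    detect_ms = min_failure_detection_ms(interval_ms)
    within_timeout = detect_ms < CLIENT_RPC_TIMEOUT_MS
    print(f"interval={interval_ms} ms -> detection >= {detect_ms} ms, "
          f"under client timeout: {within_timeout}")
```

Under these assumptions, 1500ms still leaves a wide margin below the client timeout, while 10s would consume the entire 30s budget on detection alone, which is where the other related timeouts would need to move as well.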
Note: Batching the RPCs as mentioned in KUDU-1973, or providing a
server-to-server proxy for heartbeating, may be a way to solve these issues
without adjusting the default configuration. However, adjusting the
configuration is easy and has proven effective in production deployments.
Additionally, adjusting the defaults along with a KUDU-1973-like approach could
lead to even lower idle resource usage.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)