[ 
https://issues.apache.org/jira/browse/KUDU-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke reassigned KUDU-3134:
---------------------------------

    Assignee:     (was: Grant Henke)

> Adjust default value for --raft_heartbeat_interval
> --------------------------------------------------
>
>                 Key: KUDU-3134
>                 URL: https://issues.apache.org/jira/browse/KUDU-3134
>             Project: Kudu
>          Issue Type: Improvement
>    Affects Versions: 1.12.0
>            Reporter: Grant Henke
>            Priority: Major
>
> Users often increase the `--raft_heartbeat_interval` on larger clusters or on 
> clusters with high replica counts. This keeps the servers from flooding each 
> other with heartbeat RPCs, which causes queue overflows and burns CPU even 
> when the cluster is idle. Users have set values ranging from 1.5s to as high 
> as 10s, and we have not seen complaints about problems after doing so.
> Anecdotally, I recently saw a cluster with 4k tablets per tablet server using 
> ~150% CPU while idle. Increasing the `--raft_heartbeat_interval` from 500ms 
> to 1500ms dropped the CPU usage to ~50%.
> Generally speaking, users care more about Kudu stability and scalability 
> than about an extremely short MTTR. Additionally, our default client RPC 
> timeouts of 30s suggest that slightly longer failover/retry times are 
> tolerable in the default case.
> We should consider raising the default value of `--raft_heartbeat_interval` 
> to support larger and more efficient clusters by default. Users who need a 
> low MTTR can always lower the value while also adjusting other related 
> timeouts. We may also want to consider adjusting the default 
> `--heartbeat_interval_ms` accordingly.
> Note: Batching the RPCs as mentioned in KUDU-1973, or providing a 
> server-to-server proxy for heartbeating, may be a way to solve the issues 
> without adjusting the default configuration. However, adjusting the 
> configuration is easy and has proven effective in production deployments. 
> Additionally, adjusting the defaults along with a KUDU-1973-like approach 
> could lead to even lower idle resource usage.
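
For a rough sense of scale, the idle heartbeat load described above can be estimated with a back-of-envelope model. This is only a sketch under stated assumptions, not a description of Kudu internals: it assumes each tablet leader sends one heartbeat RPC per follower per interval with no batching, that leaders are spread evenly across servers, that the replication factor is 3, and that "4k tablets per tablet server" means 4000 hosted replicas. The function name `heartbeat_rpcs_per_second` is hypothetical.

```python
def heartbeat_rpcs_per_second(replicas_per_server, interval_ms, replication_factor=3):
    """Estimate outbound idle Raft heartbeat RPCs per second for one tablet server.

    Assumptions (not taken from the Kudu source): leaders are evenly
    distributed, and every leader sends one unbatched heartbeat to each
    of its followers once per interval.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    # With even leader distribution, 1/RF of the hosted replicas are leaders.
    leaders = replicas_per_server / replication_factor
    followers_per_leader = replication_factor - 1
    intervals_per_second = 1000.0 / interval_ms
    return leaders * followers_per_leader * intervals_per_second


if __name__ == "__main__":
    # Intervals mentioned in the issue: 500ms, 1500ms, and 10s.
    for interval_ms in (500, 1500, 10000):
        rate = heartbeat_rpcs_per_second(4000, interval_ms)
        print(f"{interval_ms:>6} ms: ~{rate:,.0f} heartbeat RPCs/s per server")
```

Under this model the RPC rate scales inversely with the interval, so going from 500ms to 1500ms cuts it to one third. That is in line with the reported drop from ~150% to ~50% CPU, although the match may be partly coincidental since idle CPU has other contributors.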



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
