I am wondering if anyone has had experience tuning the following options to get faster failure detection of a storage node:
- osd heartbeat interval (default 6s)
- osd heartbeat grace (default 20s)
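
To make the trade-off between these two options concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a simplified model, not Ceph's actual implementation: a peer pings every "interval" seconds and considers the node failed once "grace" seconds have passed since the last reply it received. The function name detection_window and the alternative (interval, grace) pairs are hypothetical and chosen only for illustration. The model ignores monitor-side handling and any recovery time after the node is marked down, so the real interruption will be longer than what it prints.

```python
from typing import Tuple


def detection_window(interval_s: float, grace_s: float) -> Tuple[float, float]:
    """Return (best_case, worst_case) seconds from node failure until a peer
    considers the node failed, under the simplified model described above.

    Assumption: the last successful reply arrived somewhere between zero and
    one full heartbeat interval before the node went down.
    """
    if interval_s <= 0 or grace_s <= interval_s:
        raise ValueError("expected 0 < interval < grace")
    # Last reply arrived almost a full interval before the failure, so part
    # of the grace period had already elapsed when the node went down.
    best_case = grace_s - interval_s
    # Last reply arrived just before the failure, so the full grace period
    # has to elapse after the node goes down.
    worst_case = grace_s
    return best_case, worst_case


if __name__ == "__main__":
    # (6, 20) are the defaults quoted above; the other pairs are hypothetical.
    for interval, grace in [(6, 20), (3, 10), (1, 5)]:
        best, worst = detection_window(interval, grace)
        print(f"interval={interval}s grace={grace}s -> "
              f"peer notices after roughly {best}s to {worst}s")
```

Under this model the grace period dominates the detection time, and the interval mainly determines how many heartbeats can be missed before the grace period expires, which is what guards against false failure reports.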

I am working with a very small cluster:
- 2 storage nodes
- 1 to 6 OSDs per storage node
- replication of 2

In this configuration, losing a storage node (e.g. to a power failure) interrupts service to users of the cluster for 30 seconds or more, due to the length of the heartbeat interval and grace period.

Why are the defaults for these options so high, and does anyone have experience tuning them to reduce the service interruption on storage node failure? I know there is always a trade-off between faster failure detection and incorrectly detecting a failure; I am just wondering how much room there is to reduce these settings.

Bart Wensley, Wind River
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
