I am wondering if anyone has had experience tuning the following options to get 
faster failure detection of a storage node:
- osd heartbeat interval (default 6s)
- osd heartbeat grace (default 20s)
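
For reference, this is roughly where I understand these options would be set
in ceph.conf. It is a minimal sketch only: the values are just the defaults
listed above, not tuned recommendations, and placing them in [global] is my
assumption, made so that the monitors as well as the OSDs read the grace value.

```ini
[global]
# Illustration of placement only; values are the defaults, not recommendations.
# Assumption: [global] is used so both monitors and OSDs pick up the grace value.

# How often (in seconds) an OSD pings its peer OSDs.
osd heartbeat interval = 6

# Seconds without a heartbeat before an OSD is considered down.
osd heartbeat grace = 20
```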

I am working with a very small cluster:
- 2 storage nodes
- 1 to 6 OSDs per storage node
- replication of 2

In this configuration, losing a storage node (e.g. to a power failure)
interrupts service to users of the cluster for 30 seconds or more, due to the
length of the heartbeat interval and grace period. Why are the defaults for
these so high, and has anyone tuned them down to shorten the service
interruption when a storage node fails? I know there is always a trade-off
between detecting failures faster and reporting failures that did not happen;
I would like to know how much room there is to reduce these settings.

Bart Wensley, Wind River

_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
