osd_heartbeat_grace is the number of seconds since an OSD last received a
successful heartbeat response from another OSD before it reports that OSD
to the mons as down.  This is one you may want to lower from its default
value of 20 seconds.
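As a sketch, lowering it in ceph.conf might look like this (the value of 10
is illustrative, not a recommendation; too low and slow-but-healthy OSDs
will start flapping):

```ini
# Hedged example: report unresponsive OSDs sooner than the 20s default.
# Set under [global] so both the mons and the OSDs see the same value.
[global]
osd heartbeat grace = 10
```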

mon_osd_min_down_reporters is a setting for how many OSDs need to report an
OSD as down before the mons will mark it down.  I recommend setting this
to N+1, where N is the number of OSDs in a node or failure domain.  If
you hit a network problem where one OSD node can talk to the mons but not
to the other OSD nodes, that one node will try to mark the entire cluster
down while the rest of the cluster marks that node down.  With
min_down_reporters at N+1, a single node cannot mark down the rest of the
cluster.  The default is 1 so that small test clusters can still mark OSDs
down, but if you have 3+ nodes, you should set it to N+1 if you can.
Setting it high enough to require reporters from more than 2 nodes is
equally problematic.  However, if you just want failures reported as fast
as possible, leaving this at 1 might still be optimal for getting OSDs
marked down sooner.
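For example, assuming 10 OSDs per node (N = 10, so N+1 = 11; adjust to
your own node size), the N+1 rule would look like:

```ini
# Sketch assuming 10 OSDs per failure domain. This option is read by
# the mons, so it goes in the [mon] section.
[mon]
mon osd min down reporters = 11
```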

The downside to lowering these settings is that OSDs which are merely
running slowly can get marked down, then re-assert themselves to the mons,
causing backfilling and peering for no good reason.  You'll want to
monitor your cluster for OSDs being marked down for a few seconds before
marking themselves back up.  You can see this in the OSD logs: one line
says the OSD was wrongly marked down, and the next is where it tells the
mons it is actually up.
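A rough sketch of scanning for those flap events (the "wrongly marked me
down" message text is an assumption based on common Ceph releases; verify
it against your version's logs, and point grep at your real logs under
/var/log/ceph/ instead of the sample file created here):

```shell
# Demo against a fabricated sample line so the command is self-contained;
# in practice you would grep ceph-osd.*.log directly.
printf '2017-06-15 10:00:00 osd.3 map e123 wrongly marked me down\n' \
    > /tmp/sample-osd.log
# Count flap events per log file.
grep -c "wrongly marked me down" /tmp/sample-osd.log
```

A count that climbs steadily is a sign the thresholds are now too tight
for your hardware or network.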

On Thu, Jun 15, 2017 at 9:53 AM Oliver Humpage <[email protected]>
wrote:

>
> > On 15 Jun 2017, at 14:24, David Byte <[email protected]> wrote:
>
> > Overall, performance is good.  There are a few different approaches to
> mitigate the timeouts that happen for OSD failure detection.
> > 1 – tune the thresholds for failure detection
>
> This may have been blogged somewhere, but do you have any details on which
> config options to change?
>
> Oliver.
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>