Hi all,

We just saw an example of a single down OSD taking down a whole
(small) luminous 12.2.2 cluster.

The cluster has only 5 OSDs, on 5 different servers. Three of those
servers also run a mon/mgr combo.

First, we had one server (mon+osd) go down legitimately [1] -- I can
tell when it went down because the mon quorum broke:

2018-01-22 18:26:31.521695 mon.cephcta-mon-658cb618c9 mon.0
137.138.62.69:6789/0 121277 : cluster [WRN] Health check failed: 1/3
mons down, quorum cephcta-mon-658cb618c9,cephcta-mon-3e0d524825
(MON_DOWN)

Then there's a long pileup of slow requests until the OSD is finally
marked down due to no beacon:

2018-01-22 18:47:31.549791 mon.cephcta-mon-658cb618c9 mon.0
137.138.62.69:6789/0 121447 : cluster [WRN] Health check update: 372
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-01-22 18:47:56.671360 mon.cephcta-mon-658cb618c9 mon.0
137.138.62.69:6789/0 121448 : cluster [INF] osd.2 marked down after no
beacon for 903.538932 seconds
2018-01-22 18:47:56.672315 mon.cephcta-mon-658cb618c9 mon.0
137.138.62.69:6789/0 121449 : cluster [WRN] Health check failed: 1
osds down (OSD_DOWN)
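
(For context on that ~900s figure: as far as I understand it, OSDs
send a beacon to the mons every osd_beacon_report_interval -- 300s by
default, I believe -- and the mons only mark an OSD down via this path
after mon_osd_report_timeout without one. The values actually in play
can be confirmed over the admin sockets; the daemon names below are
just the ones from this cluster:)

   # on the osd.2 host:
   ceph daemon osd.2 config get osd_beacon_report_interval    # I expect 300
   # on a mon host:
   ceph daemon mon.cephcta-mon-658cb618c9 config get mon_osd_report_timeout   # I expect 900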


So, the first question is: why didn't that OSD get detected as failing much earlier?
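
My (possibly incomplete) understanding of the normal detection path is
that peer OSDs report a dead peer to the mons after osd_heartbeat_grace
(20s by default), and the mons mark it down once
mon_osd_min_down_reporters peers (default 2, on different hosts) agree
-- the beacon timeout should only be the fallback. A quick sanity check
of those values (daemon names are just examples from this cluster):

   # on any OSD host:
   ceph daemon osd.1 config get osd_heartbeat_grace   # I expect 20
   # on a mon host:
   ceph daemon mon.cephcta-mon-658cb618c9 config get mon_osd_min_down_reporters   # I expect 2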


The slow requests continue until, almost 10 minutes later, ceph marks
3 of the other 4 OSDs down after seeing no beacons:

2018-01-22 18:56:31.727970 mon.cephcta-mon-658cb618c9 mon.0
137.138.62.69:6789/0 121539 : cluster [INF] osd.1 marked down after no
beacon for 900.091770 seconds
2018-01-22 18:56:31.728105 mon.cephcta-mon-658cb618c9 mon.0
137.138.62.69:6789/0 121540 : cluster [INF] osd.3 marked down after no
beacon for 900.091770 seconds
2018-01-22 18:56:31.728197 mon.cephcta-mon-658cb618c9 mon.0
137.138.62.69:6789/0 121541 : cluster [INF] osd.4 marked down after no
beacon for 900.091770 seconds
2018-01-22 18:56:31.730108 mon.cephcta-mon-658cb618c9 mon.0
137.138.62.69:6789/0 121542 : cluster [WRN] Health check update: 4
osds down (OSD_DOWN)
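
(To pull all of these events out in one go, grepping the cluster log
on a mon works; the path below is the default location and might
differ on your setup:)

   grep 'marked down after no beacon' /var/log/ceph/ceph.log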


900 is the default mon_osd_report_timeout -- so why were these OSDs
all stuck not sending beacons? And why didn't they notice that osd.2
had failed and then recover on the remaining OSDs?
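
If it's useful to anyone debugging something similar, these are the
checks I'd run on the surviving OSDs next time (osd.1 is just an
example id; these are standard admin socket commands, as far as I
know):

   ceph daemon osd.1 status                 # whoami / state / newest_map -- is the OSD wedged?
   ceph daemon osd.1 dump_ops_in_flight     # what the slow requests are actually stuck on
   ceph daemon osd.1 config get osd_beacon_report_interval   # I expect the default 300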

The config [2] is pretty standard, save for one possible culprit:

   osd op thread suicide timeout = 1800

That's part of our standard config, mostly to prevent OSDs from
suiciding during FileStore splitting. (This particular cluster is 100%
BlueStore, so admittedly we could revert that here.)
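
For what it's worth, reverting it would just mean dropping that line
from ceph.conf and going back to what I believe is the shipped default
of 180s; at runtime that would look something like:

   # back to the (assumed) default of 180s on all OSDs
   ceph tell osd.* injectargs '--osd_op_thread_suicide_timeout 180'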

Any idea what went wrong here?

I can create a tracker and post logs if this is interesting.

Best Regards,

Dan

[1] The failure mode of this OSD looks like its block device simply
froze. It runs inside a VM, and the console showed several of the
typical 120s block device timeouts. The machine remained pingable but
wasn't doing any IO.
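
(The messages I mean are the usual kernel hung-task warnings; on the
VM they can be spotted with something along these lines, with the
exact wording varying by kernel version:)

   dmesg | grep -i 'blocked for more than'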

[2] https://gist.github.com/dvanders/7eca771b6a8d1164bae8ea1fe45cf9f2