Thanks JC & Greg, I've changed "mon osd min down reporters" to 1.
According to this:
http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/
the default is already 1, though. I don't remember the value before I
changed it everywhere, so I can't say for sure now, but I think it was 2
despite what the docs say. Whatever - it's now 1 everywhere.
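For reference, here is roughly how I set and verified it at runtime (a
sketch; "mon.a" is just a placeholder for your monitor's ID, and
injectargs changes don't survive a restart unless the value is also in
ceph.conf):

    # verify the value the monitor is actually running with
    ceph daemon mon.a config show | grep mon_osd_min_down_reporters

    # change it at runtime on all monitors
    ceph tell mon.* injectargs '--mon_osd_min_down_reporters 1'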
Another somewhat weird thing I found: when I check the values of an
OSD(!) with "ceph daemon osd.0 config show | sort | grep mon_osd", I see
an entry "mon osd min down reporters". I can even change it. But
according to the docs, this is a setting for monitors only. Why does it
appear there? Does it influence anything? If not: is there a way to show
only the config entries that are relevant for a daemon?
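The closest thing I've found so far is the admin socket's "config diff"
command, which, if I understand it correctly, at least limits the output
to settings that differ from the compiled-in defaults:

    # show only settings that differ from the built-in defaults
    ceph daemon osd.0 config diff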
Then, when reading the doc page mentioned above with its descriptions of
the multitude of config settings, I wonder: how can I properly estimate
the time until my cluster works again? Since I get hung requests until
the failed node is finally declared *down*, this time is obviously quite
important for me. What exactly is the sequence of events when a node
fails (i.e. someone accidentally hits the power-off button)? My
(possibly totally wrong & dumb) idea:
1) osd0 fails/doesn't answer.
2) osd1 pings osd0 every 6 seconds (osd heartbeat interval). Thus,
after at most 6 seconds, osd1 notices osd0 *could be* down.
3) After another 20 seconds (osd heartbeat grace), osd1 decides osd0 is
definitely down.
4) Up to another 120 seconds (osd mon report interval max) might elapse
until osd1 reports the bad news to the monitor.
5) The monitor receives the report about failed osd0, and since "mon osd
min down reporters" is 1, this single OSD is sufficient for the monitor
to believe the bad news that osd0 is unresponsive.
6) But since "mon osd min down reports" is 3, everything up to this
point has to happen 3 times in a row before the monitor finally accepts
that osd0 is *really* unresponsive.
7) After another 900 seconds (mon osd report timeout) of waiting in the
hope of news that osd0 is still/back alive, the monitor marks osd0 as
down.
8) After another 300 seconds (mon osd down out interval), the monitor
marks osd0 as down+out.
So, by my possibly very naive understanding, it takes 3*(6+20+120) +
900 + 300 = 1638 seconds, i.e. almost half an hour, from the event
"someone accidentally hit the power-off switch" to "osd0 is marked
down+out".
Correct? I expect not. Which config variables did I misunderstand?
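For completeness, this is how I looked up the values my calculation is
based on (osd.0 and mon.a are placeholders for whichever daemons run on
your hosts):

    # timings on the OSD side
    ceph daemon osd.0 config show | egrep \
        'osd_heartbeat_interval|osd_heartbeat_grace|osd_mon_report_interval_max'

    # thresholds and timeouts on the monitor side
    ceph daemon mon.a config show | egrep \
        'mon_osd_min_down_report|mon_osd_report_timeout|mon_osd_down_out_interval'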
Thank you
Ranjan
On 29.09.2016 at 20:48, LOPEZ Jean-Charles wrote:
mon_osd_min_down_reporters is set to 2 by default
I guess you’ll have to set it to 1 in your case
JC
On Sep 29, 2016, at 08:16, Gregory Farnum <[email protected]> wrote:
I think the problem is that Ceph requires a certain number of OSDs or
a certain number of reports of failure before it marks an OSD down.
These thresholds are not tuned for a 2-OSD cluster; you probably want
to set them to 1.
Also keep in mind that the OSDs provide a grace period of 20-30
seconds before they'll report somebody down; this helps prevent
spurious recovery but means you will get paused IO on an unclean
shutdown.
I can't recall the exact config options off-hand, but it's something
like "mon osd min down reports". Search the docs for that. :)
-Greg
On Thursday, September 29, 2016, Peter Maloney
<[email protected]> wrote:
On 09/29/16 14:07, Ranjan Ghosh wrote:
> Wow. Amazing. Thanks a lot!!! This works. 2 (hopefully) last questions
> on this issue:
>
> 1) When the first node is coming back up, I can just call "ceph osd up
> 0" and Ceph will start auto-repairing everything, right? That is, if
> there are e.g. new files that were created during the time the first
> node was down, they will (sooner or later) get replicated there?
Nope, there is no "ceph osd up <id>"; you just start the osd, and it
already gets recognized as up. (If you don't like this, you set it out,
not just down; and there is a "ceph osd in <id>" to undo that.)
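For example, with osd.0 as a placeholder:

    ceph osd out 0    # mark it out so its data gets redistributed
    # ...later, once the node is back, just start the osd daemon and:
    ceph osd in 0     # undo the "out"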
>
> 2) If I don't call "osd down" manually (perhaps at the weekend when
> I'm not at the office) when a node dies - did I understand correctly
> that the "hanging" I experienced is temporary and that after a few
> minutes (don't want to try out now) the node should also go down
> automatically?
I believe so, yes.
Also, FYI, RBD images don't seem to have this issue, and work right away
on a 3-osd cluster. Maybe cephfs would also work better with a 3rd osd,
even an empty one (weight=0). (And I had an unresolved issue testing the
same with cephfs on my virtual test cluster.)
>
> BR,
> Ranjan
>
>
> On 29.09.2016 at 13:00, Peter Maloney wrote:
>>
>> And also you could try:
>> ceph osd down <osd id>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com