Thanks JC & Greg, I've changed "mon osd min down reporters" to 1.
According to this:
http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/
the default is already 1, though. I don't remember the value before I
changed it everywhere, so I can't say for sure now, but I think it was 2
despite what the docs say. Whatever - it's now 1 everywhere.
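For reference, here is roughly how I set and verified it at runtime (a
sketch; "mon.a" is just a placeholder for your monitor's ID, and
injectargs changes don't survive a restart unless the value is also in
ceph.conf):

    # verify the value the monitor is actually running with
    ceph daemon mon.a config show | grep mon_osd_min_down_reporters

    # change it at runtime on all monitors
    ceph tell mon.* injectargs '--mon_osd_min_down_reporters 1'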
Another somewhat weird thing I found: when I check the values of an
OSD(!) with "ceph daemon osd.0 config show | sort | grep mon_osd", I see
an entry "mon osd min down reporters". I can even change it. But
according to the docs, this is a setting for monitors only. Why does it
appear there? Does it influence anything? If not: is there a way to show
only the config entries that are relevant for a daemon?
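The closest thing I've found so far is the admin socket's "config diff"
command, which, if I understand it correctly, at least limits the output
to settings that differ from the compiled-in defaults:

    # show only settings that differ from the built-in defaults
    ceph daemon osd.0 config diff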
Then, when reading the doc page mentioned above with its descriptions of
the multitude of config settings, I wonder: how can I properly estimate
the time until my cluster works again? Since I get hung requests until
the failed node is finally declared *down*, this time is obviously quite
important for me. What exactly is the sequence of events when a node
fails (i.e. someone accidentally hits the power-off button)? My
(possibly totally wrong & dumb) idea:
1) osd0 fails/doesn't answer.
2) osd1 pings osd0 every 6 seconds (osd heartbeat interval). Thus,
after at most 6 seconds, osd1 notices osd0 *could be* down.
3) After another 20 seconds (osd heartbeat grace), osd1 decides osd0 is
definitely down.
4) Up to another 120 seconds (osd mon report interval max) might elapse
until osd1 reports the bad news to the monitor.
5) The monitor receives the report about failed osd0, and since "mon osd
min down reporters" is 1, this single OSD is sufficient for the monitor
to believe the bad news that osd0 is unresponsive.
6) But since "mon osd min down reports" is 3, everything up to this
point has to happen 3 times in a row before the monitor finally accepts
that osd0 is *really* unresponsive.
7) After another 900 seconds (mon osd report timeout) of waiting in the
hope of news that osd0 is still/back alive, the monitor marks osd0 as
down.
8) After another 300 seconds (mon osd down out interval), the monitor
marks osd0 as down+out.
So, by my possibly very naive understanding, it takes 3*(6+20+120) +
900 + 300 = 1638 seconds, i.e. almost half an hour, from the event
"someone accidentally hit the power-off switch" to "osd0 is marked
down+out".
Correct? I expect not. Which config variables did I misunderstand?
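For completeness, this is how I looked up the values my calculation is
based on (osd.0 and mon.a are placeholders for whichever daemons run on
your hosts):

    # timings on the OSD side
    ceph daemon osd.0 config show | egrep \
        'osd_heartbeat_interval|osd_heartbeat_grace|osd_mon_report_interval_max'

    # thresholds and timeouts on the monitor side
    ceph daemon mon.a config show | egrep \
        'mon_osd_min_down_report|mon_osd_report_timeout|mon_osd_down_out_interval'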
Thank you
Ranjan
On 29.09.2016 at 20:48, LOPEZ Jean-Charles wrote:
mon_osd_min_down_reporters is set to 2 by default
I guess you’ll have to set it to 1 in your case
JC
On Sep 29, 2016, at 08:16, Gregory Farnum <[email protected]> wrote:
I think the problem is that Ceph requires a certain number of OSDs or
a certain number of reports of failure before it marks an OSD down.
These thresholds are not tuned for a 2-OSD cluster; you probably want
to set them to 1.
Also keep in mind that the OSDs provide a grace period of 20-30
seconds before they'll report somebody down; this helps prevent
spurious recovery but means you will get paused IO on an unclean
shutdown.
I can't recall the exact config options off-hand, but it's something
like "mon osd min down reports". Search the docs for that. :)
-Greg
On Thursday, September 29, 2016, Peter Maloney
<[email protected]> wrote:
On 09/29/16 14:07, Ranjan Ghosh wrote:
> Wow. Amazing. Thanks a lot!!! This works. 2 (hopefully) last questions
> on this issue:
>
> 1) When the first node is coming back up, I can just call "ceph osd up
> 0" and Ceph will start auto-repairing everything, right? That is, if
> there are e.g. new files that were created during the time the first
> node was down, they will (sooner or later) get replicated there?
Nope, there is no "ceph osd up <id>"; you just start the osd, and it
already gets recognized as up. (If you don't like this, you set it out,
not just down; and there is a "ceph osd in <id>" to undo that.)
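For example, with osd.0 as a placeholder:

    ceph osd out 0    # mark it out so its data gets redistributed
    # ...later, once the node is back, just start the osd daemon and:
    ceph osd in 0     # undo the "out"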
>
> 2) If I don't call "osd down" manually (perhaps at the weekend when
> I'm not at the office) when a node dies - did I understand correctly
> that the "hanging" I experienced is temporary and that after a few
> minutes (don't want to try out now) the node should also go down
> automatically?
I believe so, yes.
Also, FYI, RBD images don't seem to have this issue, and work right away
on a 3-osd cluster. Maybe cephfs would also work better with a 3rd osd,
even an empty one (weight=0). (And I had an unresolved issue testing the
same with cephfs on my virtual test cluster.)
>
> BR,
> Ranjan
>
>
> On 29.09.2016 at 13:00, Peter Maloney wrote:
>>
>> And also you could try:
>> ceph osd down <osd id>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com