Hi Dan,

'noup' now makes a lot of sense - that's probably the key piece our cluster start was missing.  Essentially this way only one map change occurs in the cluster and gets distributed when all the OSDs are marked 'in', instead of hundreds or thousands of map changes as individual OSDs boot at slightly different times.  I have a smaller cluster I can test this on and measure how the network traffic to the mons changes, and I'll plan for this at the next shutdown/restart/upgrade.
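For the test, I'm thinking of something simple along these lines to quantify the churn (just a sketch - the bond0 interface name and the counter paths are placeholders for our setup):

    # osdmap epoch and mon NIC byte counters before starting the OSDs
    ceph osd dump | head -1        # prints "epoch <N>"
    cat /sys/class/net/bond0/statistics/rx_bytes \
        /sys/class/net/bond0/statistics/tx_bytes

    # ... bring up the OSDs ...

    # same again afterwards: the epoch delta is the number of new osdmaps,
    # and the byte deltas approximate the traffic the mons had to push
    ceph osd dump | head -1
    cat /sys/class/net/bond0/statistics/rx_bytes \
        /sys/class/net/bond0/statistics/tx_bytes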

Thanks for the quick response and the tip!

Andras


On 12/19/18 6:47 PM, Dan van der Ster wrote:
Hey Andras,

Three mons is possibly too few for such a large cluster. We've had lots of good stable experience with 5-mon clusters. I've never tried 7, so I can't say if that would lead to other problems (e.g. leader/peon sync scalability).

That said, our 10000-osd bigbang tests managed with only 3 mons, and I assume that outside of this full system reboot scenario your 3 cope well enough. You should probably add 2 more, but I wouldn't expect that alone to solve this problem in the future.

Instead, with a slightly tuned procedure and a bit of osd log grepping, I think you could've booted this cluster more quickly than 4 hours with those mere 3 mons.

As you know, each osd's boot process requires downloading all known osdmaps. If all osds are booting together and the mons are saturated, the osds can become sluggish when responding to their peers, which could lead to the flapping scenario you saw. Flapping leads to new osdmap epochs that then need to be distributed, worsening the issue. It's good that you used nodown and noout, because without these the boot time would've been even longer. Next time also set noup and noin to further reduce the osdmap churn.
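Concretely, before booting any OSDs that would look something like:

    # with all four flags set, OSD boots should generate no new osdmap epochs
    ceph osd set nodown
    ceph osd set noout
    ceph osd set noup
    ceph osd set noin

    # double-check they took effect
    ceph osd dump | grep flags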

One other thing: there's a debug_osd level -- 10 or 20, I forget exactly -- that you can set to watch the maps sync up on each osd. Grep the osd logs for some variations on "map" and "epoch".
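For example (osd.123 is just a placeholder id, and the exact log wording varies between releases, so treat the grep pattern as a starting point):

    # set it in ceph.conf ([osd] debug_osd = 10) before starting, or on a
    # running daemon via its admin socket:
    ceph daemon osd.123 config set debug_osd 10

    # then watch the maps sync up
    grep -iE 'osdmap|epoch' /var/log/ceph/ceph-osd.123.log | tail -20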

In short, here's what I would've done:

0. boot the mons, waiting until they have a full quorum.
1. set nodown, noup, noin, noout   <-- with these, there should be zero new osdmaps generated while the osds boot.
2. start booting osds. set the necessary debug_osd level to see the osdmap sync progress in the ceph-osd logs.
3. if the mons are over saturated, boot progressively -- one rack at a time, for example.
4. once all osds have caught up to the current osdmap, unset noup. The osds should then all "boot" (as far as the mons are concerned) and be marked up. (this might be sluggish on a 3400 osd cluster, perhaps taking a few 10s of seconds). the pgs should be active+clean at this point.
5. unset nodown, noin, noout, which should change nothing provided all went well. (rough commands for steps 4 and 5 are sketched below.)

Hope that helps for next time!

Dan


On Wed, Dec 19, 2018 at 11:39 PM Andras Pataki <apat...@flatironinstitute.org> wrote:

    Forgot to mention: all nodes are on Luminous 12.2.8 currently on
    CentOS 7.5.

    On 12/19/18 5:34 PM, Andras Pataki wrote:
    > Dear ceph users,
    >
    > We have a large-ish ceph cluster with about 3500 osds.  We run 3 mons
    > on dedicated hosts, and the mons typically use a few percent of a
    > core, and generate about 50Mbits/sec network traffic.  They are
    > connected at 20Gbits/sec (bonded dual 10Gbit) and are running on 2x14
    > core servers.
    >
    > We recently had to shut ceph down completely for maintenance (which we
    > rarely do), and had significant difficulties starting it up.  The
    > symptoms included OSDs hanging on startup, being marked down, flapping
    > and all that bad stuff.  After some investigation we found that the
    > 20Gbit/sec network interfaces of the monitors were completely
    > saturated as the OSDs were starting, while the monitor processes were
    > using about 3 cores (300% CPU).  We ended up having to start the OSDs
    > up super slowly to make sure that the monitors could keep up - it took
    > about 4 hours to start 3500 OSDs (at a rate of about 4 seconds per OSD).
    > We've tried setting noout and nodown, but that didn't really help
    > either.  A few questions that would be good to understand in order to
    > move to a better configuration:
    >
    > 1. How does the monitor traffic scale with the number of OSDs?
    > Presumably the traffic comes from distributing cluster maps as the
    > cluster changes on OSD starts.  The cluster map is perhaps O(N) for N
    > OSDs, and each OSD needs an update on a cluster change, so that would
    > make one change O(N^2) traffic.  As OSDs start, the cluster changes
    > quite a lot (N times?), so would that make the startup traffic
    > O(N^3)?  If so, that sounds pretty scary for scalability.
    >
    > 2. Would adding more monitors help here?  I.e. presumably each OSD
    > gets its maps from one monitor, so they would share the traffic.  Would
    > the inter-monitor communication/elections/etc. be problematic for more
    > monitors (5, 7 or even more)?  Would more monitors be recommended?  If
    > so, how many is practical?
    >
    > 3. Are there any config parameters useful for tuning the traffic
    > (perhaps send mon updates less frequently, or something along those
    > lines)?
    >
    > Any other advice on this topic would also be helpful.
    >
    > Thanks,
    >
    > Andras
    >
