Hi Dan,

'noup' now makes a lot of sense - that's probably the key piece our cluster start was missing.  Essentially this way only one map change occurs in the cluster and gets distributed when all the OSDs are marked 'in', instead of hundreds or thousands of map changes as individual OSDs boot at slightly different times.  I have a smaller cluster I can test this on and measure how the network traffic to the mons changes, and I'll plan for this at the next shutdown/restart/upgrade.
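For the test, I'm thinking of something simple along these lines to quantify the churn (just a sketch - the bond0 interface name and the counter paths are placeholders for our setup):

    # osdmap epoch and mon NIC byte counters before starting the OSDs
    ceph osd dump | head -1        # prints "epoch <N>"
    cat /sys/class/net/bond0/statistics/rx_bytes \
        /sys/class/net/bond0/statistics/tx_bytes

    # ... bring up the OSDs ...

    # same again afterwards: the epoch delta is the number of new osdmaps,
    # and the byte deltas approximate the traffic the mons had to push
    ceph osd dump | head -1
    cat /sys/class/net/bond0/statistics/rx_bytes \
        /sys/class/net/bond0/statistics/tx_bytes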

Thanks for the quick response and the tip!

Andras


On 12/19/18 6:47 PM, Dan van der Ster wrote:
Hey Andras,

Three mons is possibly too few for such a large cluster. We've had lots of good stable experience with 5-mon clusters. I've never tried 7, so I can't say if that would lead to other problems (e.g. leader/peon sync scalability).

That said, our 10000-osd bigbang tests managed with only 3 mons, and I assume that outside of this full system reboot scenario your 3 cope well enough. You should probably add 2 more, but I wouldn't expect that alone to solve this problem in the future.

Instead, with a slightly tuned procedure and a bit of osd log grepping, I think you could've booted this cluster more quickly than 4 hours with those mere 3 mons.

As you know, each osd's boot process requires downloading all known osdmaps. If all osds are booting together and the mons are saturated, the osds can become sluggish when responding to their peers, which could lead to the flapping scenario you saw. Flapping leads to new osdmap epochs that then need to be distributed, worsening the issue. It's good that you used nodown and noout, because without these the boot time would've been even longer. Next time also set noup and noin to further reduce the osdmap churn.
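Concretely, before booting any OSDs that would look something like:

    # with all four flags set, OSD boots should generate no new osdmap epochs
    ceph osd set nodown
    ceph osd set noout
    ceph osd set noup
    ceph osd set noin

    # double-check they took effect
    ceph osd dump | grep flags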

One other thing: there's a debug_osd level -- 10 or 20, I forget exactly -- that you can set to watch the maps sync up on each osd. Grep the osd logs for some variations on "map" and "epoch".
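For example (osd.123 is just a placeholder id, and the exact log wording varies between releases, so treat the grep pattern as a starting point):

    # set it in ceph.conf ([osd] debug_osd = 10) before starting, or on a
    # running daemon via its admin socket:
    ceph daemon osd.123 config set debug_osd 10

    # then watch the maps sync up
    grep -iE 'osdmap|epoch' /var/log/ceph/ceph-osd.123.log | tail -20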

In short, here's what I would've done:

0. boot the mons, waiting until they have a full quorum.
1. set nodown, noup, noin, noout   <-- with these, there should be zero new osdmaps generated while the osds boot.
2. start booting osds. set the necessary debug_osd level to see the osdmap sync progress in the ceph-osd logs.
3. if the mons are over saturated, boot progressively -- one rack at a time, for example.
4. once all osds have caught up to the current osdmap, unset noup. The osds should then all "boot" (as far as the mons are concerned) and be marked up. (this might be sluggish on a 3400 osd cluster, perhaps taking a few 10s of seconds). the pgs should be active+clean at this point.
5. unset nodown, noin, noout, which should change nothing provided all went well. (rough commands for steps 4 and 5 are sketched below.)

Hope that helps for next time!

Dan


On Wed, Dec 19, 2018 at 11:39 PM Andras Pataki <apat...@flatironinstitute.org> wrote:

    Forgot to mention: all nodes are on Luminous 12.2.8 currently on
    CentOS 7.5.

    On 12/19/18 5:34 PM, Andras Pataki wrote:
    > Dear ceph users,
    >
    > We have a large-ish ceph cluster with about 3500 osds.  We run 3 mons
    > on dedicated hosts, and the mons typically use a few percent of a
    > core, and generate about 50Mbits/sec network traffic.  They are
    > connected at 20Gbits/sec (bonded dual 10Gbit) and are running on 2x14
    > core servers.
    >
    > We recently had to shut ceph down completely for maintenance (which we
    > rarely do), and had significant difficulties starting it up.  The
    > symptoms included OSDs hanging on startup, being marked down, flapping
    > and all that bad stuff.  After some investigation we found that the
    > 20Gbit/sec network interfaces of the monitors were completely
    > saturated as the OSDs were starting, while the monitor processes were
    > using about 3 cores (300% CPU).  We ended up having to start the OSDs
    > up super slowly to make sure that the monitors could keep up - it took
    > about 4 hours to start 3500 OSDs (at a rate of about 4 seconds per OSD).
    > We've tried setting noout and nodown, but that didn't really help
    > either.  A few questions that would be good to understand in order to
    > move to a better configuration:
    >
    > 1. How does the monitor traffic scale with the number of OSDs?
    > Presumably the traffic comes from distributing cluster maps as the
    > cluster changes on OSD starts.  The cluster map is perhaps O(N) for N
    > OSDs, and each OSD needs an update on a cluster change, so that would
    > make one change O(N^2) traffic.  As OSDs start, the cluster changes
    > quite a lot (N times?), so would that make the startup traffic
    > O(N^3)?  If so, that sounds pretty scary for scalability.
    >
    > 2. Would adding more monitors help here?  I.e. presumably each OSD
    > gets its maps from one monitor, so they would share the traffic.  Would
    > the inter-monitor communication/elections/etc. be problematic for more
    > monitors (5, 7 or even more)?  Would more monitors be recommended?  If
    > so, how many is practical?
    >
    > 3. Are there any config parameters useful for tuning the traffic
    > (perhaps send mon updates less frequently, or something along those
    > lines)?
    >
    > Any other advice on this topic would also be helpful.
    >
    > Thanks,
    >
    > Andras
    >
