Hey Andras,

Three mons is possibly too few for such a large cluster. We've had lots of
good stable experience with 5-mon clusters. I've never tried 7, so I can't
say if that would lead to other problems (e.g. leader/peon sync
scalability).

That said, our 10000-osd bigbang tests managed with only 3 mons, and I
assume that outside of this full-system-reboot scenario your 3 mons cope
well enough. You should probably add 2 more, but I wouldn't expect that
alone to solve this problem in the future.

Instead, with a slightly tuned procedure and a bit of osd log grepping, I
think you could've booted this cluster in well under 4 hours, even with
those mere 3 mons.

As you know, each osd's boot process requires downloading every osdmap it
is missing. If all osds are booting together, and the mons are saturated,
the osds can become sluggish when responding to their peers, which could
lead to the flapping scenario you saw. Flapping leads to new osdmap epochs that
then need to be distributed, worsening the issue. It's good that you used
nodown and noout, because without these the boot time would've been even
longer. Next time also set noup and noin to further reduce the osdmap churn.
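
Concretely, those flags are just the usual set commands, run from any host
with an admin keyring:

    # freeze osd state changes while the osds boot
    ceph osd set nodown
    ceph osd set noout
    ceph osd set noup
    ceph osd set noin

While set, they show up in the flags line of "ceph -s" and "ceph osd dump".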

One other thing: there's a debug_osd level -- 10 or 20, I forget exactly --
that you can set to watch the maps sync up on each osd. Grep the osd logs
for some variations on "map" and "epoch".
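
For example, something like the following -- the exact strings to grep for
are from memory, so treat the pattern as a starting point, and adjust the
osd id and log path (the path below is the packaging default):

    # raise the osd debug level at runtime
    ceph tell osd.* injectargs '--debug_osd 10'

    # watch an osd work through the map epochs
    tail -f /var/log/ceph/ceph-osd.42.log | grep -i osdmap

If the osds are too sluggish to answer the tell, set "debug osd = 10" in
the [osd] section of ceph.conf before starting them instead.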

In short, here's what I would've done (rough commands after the list):

0. boot the mons, waiting until they have a full quorum.
1. set nodown, noup, noin, noout   <-- with these, there should be zero new
osdmaps generated while the osds boot.
2. start booting osds. set the necessary debug_osd level to see the osdmap
sync progress in the ceph-osd logs.
3. if the mons are oversaturated, boot progressively -- one rack at a
time, for example.
4. once all osds have caught up to the current osdmap, unset noup. The osds
should then all "boot" (as far as the mons are concerned) and be marked up.
(this might be sluggish on a 3500-osd cluster, perhaps taking a few tens of
seconds). The pgs should be active+clean at this point.
5. unset nodown, noin, noout, which should change nothing provided all went
well.
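
To make steps 4 and 5 concrete, roughly (assuming the grep above shows you
each osd's epoch):

    # the current map epoch according to the mons
    ceph osd dump | grep ^epoch

    # once the osd logs have caught up to that epoch:
    ceph osd unset noup
    ceph osd stat            # wait until all osds are reported up

    # then clear the rest -- nothing should change if all went well
    ceph osd unset nodown
    ceph osd unset noin
    ceph osd unset noout
    ceph -s                  # expect all pgs active+clean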

Hope that helps for next time!

Dan


On Wed, Dec 19, 2018 at 11:39 PM Andras Pataki <[email protected]> wrote:

> Forgot to mention: all nodes are on Luminous 12.2.8 currently on CentOS
> 7.5.
>
> On 12/19/18 5:34 PM, Andras Pataki wrote:
> > Dear ceph users,
> >
> > We have a large-ish ceph cluster with about 3500 osds.  We run 3 mons
> > on dedicated hosts, and the mons typically use a few percent of a
> > core, and generate about 50Mbits/sec network traffic.  They are
> > connected at 20Gbits/sec (bonded dual 10Gbit) and are running on 2x14
> > core servers.
> >
> > We recently had to shut ceph down completely for maintenance (which we
> > rarely do), and had significant difficulties starting it up.  The
> > symptoms included OSDs hanging on startup, being marked down, flapping
> > and all that bad stuff.  After some investigation we found that the
> > 20Gbit/sec network interfaces of the monitors were completely
> > saturated as the OSDs were starting, while the monitor processes were
> > using about 3 cores (300% CPU).  We ended up having to start the OSDs
> > up super slow to make sure that the monitors could keep up - it took
> > about 4 hours to start 3500 OSDs (at a rate about 4 seconds per OSD).
> > We've tried setting noout and nodown, but that didn't really help
> > either.  A few questions it would be good to understand in order to
> > move to a better configuration:
> >
> > 1. How does the monitor traffic scale with the number of OSDs?
> > Presumably the traffic comes from distributing cluster maps as the
> > cluster changes on OSD starts.  The cluster map is perhaps O(N) for N
> > OSDs, and each OSD needs an update on each cluster change, so a single
> > change generates O(N^2) traffic.  As OSDs start, the cluster changes
> > quite a lot (N times?), so would that make the startup traffic
> > O(N^3)?  If so, that sounds pretty scary for scalability.
> >
> > 2. Would adding more monitors help here?  I.e. presumably each OSD
> > gets its maps from one monitor, so they would share the traffic. Would
> > the inter-monitor communication/elections/etc. be problematic for more
> > monitors (5, 7 or even more)?  Would more monitors be recommended?  If
> > so, how many is practical?
> >
> > 3. Are there any config parameters useful for tuning the traffic
> > (perhaps send mon updates less frequently, or something along those
> > lines)?
> >
> > Any other advice on this topic would also be helpful.
> >
> > Thanks,
> >
> > Andras
> >