> On our 6000+ HDD OSD cluster (pacific)
That’s the bleeding edge in a number of respects. Updating to at least Reef would bring various improvements, and I have some suggestions I’d *love* to run by you wrt upgrade speed in such a cluster, if you’re using cephadm / ceph orch. I’d also love to get a copy, offline, of any non-default values you’ve found valuable.

> we've been noticing it takes significantly longer for brand new OSDs to go from
> booting to active when the cluster has been in a state of flux for some time.
> It can take over an hour for a newly created OSD to be marked up in some
> cases!

That’s pretty extreme.

> We've just put up with it for some time, but I finally got annoyed enough
> with it to look into it today...
>
> Looking at the logs of a new OSD when it's starting:
>
> 2025-01-07T13:44:05.534+0000 7f0b8b830700 3 osd.2016 5165598 handle_osd_map epochs [5165599,5165638], i have 5165598, src has [5146718,5175990]
> 2025-01-07T13:44:08.988+0000 7f0b8d6ed700 10 osd.2016 5165638 msg say newest map is 5175990, requesting more
> 2025-01-07T13:44:08.990+0000 7f0b8b830700 3 osd.2016 5165638 handle_osd_map epochs [5165639,5165678], i have 5165638, src has [5146718,5175990]
> 2025-01-07T13:44:12.391+0000 7f0b8d6ed700 10 osd.2016 5165678 msg say newest map is 5175990, requesting more
> 2025-01-07T13:44:12.394+0000 7f0b8b830700 3 osd.2016 5165678 handle_osd_map epochs [5165679,5165718], i have 5165678, src has [5146718,5175990]
> 2025-01-07T13:44:16.047+0000 7f0b8d6ed700 10 osd.2016 5165718 msg say newest map is 5175990, requesting more
>
> It's pulling down OSD maps, 40 at a time, taking about 4 seconds each time.
> With the ~30,000(!) OSD maps it pulls down, it takes approximately an hour.
> At ~4MB a map, this then matches up with the ~115GB storage consumed by the
> resulting OSD with no PGs.
>
> I realise the obvious answer here is don't leave a big cluster in an unclean
> state for this long.

Patient: It hurts when I do this.
Doctor: Don’t do that.

> Currently we've got PGs that have been remapped for 5 days, which matches the
> 30,000 OSDMap epoch range perfectly.

Are you adding OSDs while the cluster is converging / backfilling / recovering?

> This is something we're always looking at from a procedure point of view e.g.
> keeping max_backfills as high as possible by default, ensuring balancer
> max_misplaced is appropriate, re-evaluating disk and node addition/removal
> processes. But the reality on this cluster is that sometimes these 'logjams'
> happen, and it would be good to understand if we can improve the OSD addition
> experience so we can continue to be flexible with our operation scheduling.

I might speculate that putting off OSD addition until the cluster has converged would help, perhaps followed by a rolling mon compaction in advance of adding OSDs (a rough sketch of that is at the bottom of this message).

> The first thing I noted was the OSD block devices aren't busy during the
> OSDmap fetching process - they're barely doing 50MB/s and 50 wr/s. I started
> looking into raising 'osd_map_share_max_epochs' to hopefully increase the
> number of maps shared with the new OSD per request and improve the rate, but
> I balked a bit after realising I would have to do this across the whole
> cluster (I think, anyway, not actually sure where the maps are being pulled
> from at this point).

From the mons, I think.

> All the tuning of this value I could find talked about reducing it, which
> further scared me.
>
> Additionally, there's clearly some interplay between 'osd_map_cache_size' and
> 'osd_map_message_max' to consider.
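On the tuning angle: if you do end up experimenting, those knobs all live in the central config store, so you can inspect and change them cluster-wide without touching ceph.conf on every host. A rough, untested sketch follows; the option names are real, the numbers are placeholders to illustrate rather than a recommendation, and osd_map_message_max_bytes is one I'd check too since, if memory serves, it caps the total size of a map message in bytes.

    # Current values -- IIRC the Pacific defaults are 40 / 40 / 50, which would
    # line up with the "40 at a time" in your log
    ceph config get osd osd_map_share_max_epochs
    ceph config get osd osd_map_message_max
    ceph config get osd osd_map_cache_size
    ceph config get osd osd_map_message_max_bytes

    # Cluster-wide change via the config store (mons and OSDs pick it up without
    # per-host edits); 80 here is purely illustrative, not a recommendation
    ceph config set global osd_map_share_max_epochs 80
    ceph config set global osd_map_message_max 80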
> These historic maps must be being pulled from disk in general (be it osd or mon),

Are your mon DBs on spinners?

> so it shouldn't make a difference if osd_map_share_max_epochs > osd_map_cache_size,
> but in general I suppose you don't want OSDs having to grab maps off disk for
> requests from peers?
>
> (There may also be a completely different dominating factor in the time to
> download and store the maps that I'm not considering here.)
>
> So, any advice on improving the speed of the OSDmap download for fresh OSDs
> would be appreciated, as would any other thoughts about this situation.
>
> Thanks,
> Tom
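PS: the rolling mon compaction sketch I promised above. Untested, it assumes your mon IDs are a through e (substitute your own) and that jq is handy, and I'm going from memory that ceph report exposes the committed osdmap range, so adjust if your release's output differs.

    # How far behind a fresh OSD starts: the gap between these two is the number
    # of epochs every new OSD has to fetch (field names from memory)
    ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'

    # Once the cluster has converged and the maps have trimmed, compact one mon
    # at a time and confirm it is back in quorum before moving on
    for m in a b c d e; do
        ceph tell mon.$m compact
        ceph quorum_status | jq -r '.quorum_names'
        sleep 30
    done

The idea being to shrink the mon DBs after a long stretch of untrimmed maps, so the next round of OSD additions isn't also fighting a bloated store.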