> On our 6000+ HDD OSD cluster (pacific)
That’s the bleeding edge in a number of respects. Updating to at least Reef would bring various improvements, and I have some suggestions I’d *love* to run by you wrt upgrade speed in such a cluster, if you’re using cephadm / ceph orch. I’d also love to get a copy, offline, of any non-default values you’ve found valuable.

> we've been noticing it takes significantly longer for brand new OSDs to go from
> booting to active when the cluster has been in a state of flux for some time.
> It can take over an hour for a newly created OSD to be marked up in some
> cases!

That’s pretty extreme.

> We've just put up with it for some time, but I finally got annoyed enough
> with it to look into it today...
>
> Looking at the logs of a new OSD when it's starting:
>
> 2025-01-07T13:44:05.534+0000 7f0b8b830700 3 osd.2016 5165598 handle_osd_map epochs [5165599,5165638], i have 5165598, src has [5146718,5175990]
> 2025-01-07T13:44:08.988+0000 7f0b8d6ed700 10 osd.2016 5165638 msg say newest map is 5175990, requesting more
> 2025-01-07T13:44:08.990+0000 7f0b8b830700 3 osd.2016 5165638 handle_osd_map epochs [5165639,5165678], i have 5165638, src has [5146718,5175990]
> 2025-01-07T13:44:12.391+0000 7f0b8d6ed700 10 osd.2016 5165678 msg say newest map is 5175990, requesting more
> 2025-01-07T13:44:12.394+0000 7f0b8b830700 3 osd.2016 5165678 handle_osd_map epochs [5165679,5165718], i have 5165678, src has [5146718,5175990]
> 2025-01-07T13:44:16.047+0000 7f0b8d6ed700 10 osd.2016 5165718 msg say newest map is 5175990, requesting more
>
> It's pulling down OSD maps, 40 at a time, taking about 4 seconds each time.
> With the ~30,000(!) OSD maps it pulls down, it takes approximately an hour.
> At ~4MB a map, this then matches up with the ~115GB storage consumed by the
> resulting OSD with no PGs.
>
> I realise the obvious answer here is don't leave a big cluster in an unclean
> state for this long.

Patient: It hurts when I do this.
Doctor: Don’t do that.

> Currently we've got PGs that have been remapped for 5 days, which matches the
> 30,000 OSDMap epoch range perfectly.

Are you adding OSDs while the cluster is converging / backfilling / recovering?

> This is something we're always looking at from a procedure point of view e.g.
> keeping max_backfills as high as possible by default, ensuring balancer
> max_misplaced is appropriate, re-evaluating disk and node addition/removal
> processes. But the reality on this cluster is that sometimes these 'logjams'
> happen, and it would be good to understand if we can improve the OSD addition
> experience so we can continue to be flexible with our operation scheduling.

I might speculate that putting off OSD addition until the cluster has converged would help, perhaps followed by a rolling mon compaction in advance of adding OSDs (a rough sketch of that is at the bottom of this message).

> The first thing I noted was the OSD block devices aren't busy during the
> OSDmap fetching process - they're barely doing 50MB/s and 50 wr/s. I started
> looking into raising 'osd_map_share_max_epochs' to hopefully increase the
> number of maps shared with the new OSD per request and improve the rate, but
> I balked a bit after realising I would have to do this across the whole
> cluster (I think, anyway, not actually sure where the maps are being pulled
> from at this point).

From the mons, I think.

> All the tuning of this value I could find talked about reducing it, which
> further scared me.
>
> Additionally, there's clearly some interplay between 'osd_map_cache_size' and
> 'osd_map_message_max' to consider.
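On the tuning angle: if you do end up experimenting, those knobs all live in the central config store, so you can inspect and change them cluster-wide without touching ceph.conf on every host. A rough, untested sketch follows; the option names are real, the numbers are placeholders to illustrate rather than a recommendation, and osd_map_message_max_bytes is one I'd check too since, if memory serves, it caps the total size of a map message in bytes.

    # Current values -- IIRC the Pacific defaults are 40 / 40 / 50, which would
    # line up with the "40 at a time" in your log
    ceph config get osd osd_map_share_max_epochs
    ceph config get osd osd_map_message_max
    ceph config get osd osd_map_cache_size
    ceph config get osd osd_map_message_max_bytes

    # Cluster-wide change via the config store (mons and OSDs pick it up without
    # per-host edits); 80 here is purely illustrative, not a recommendation
    ceph config set global osd_map_share_max_epochs 80
    ceph config set global osd_map_message_max 80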
> These historic maps must be being pulled from disk in general (be it osd or mon),

Are your mon DBs on spinners?

> so it shouldn't make a difference if osd_map_share_max_epochs > osd_map_cache_size,
> but in general I suppose you don't want OSDs having to grab maps off disk for
> requests from peers?
>
> (There may also be a completely different dominating factor in the time to
> download and store the maps that I'm not considering here.)
>
> So, any advice on improving the speed of the OSDmap download for fresh OSDs
> would be appreciated, as would any other thoughts about this situation.
>
> Thanks,
> Tom
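PS: the rolling mon compaction sketch I promised above. Untested, it assumes your mon IDs are a through e (substitute your own) and that jq is handy, and I'm going from memory that ceph report exposes the committed osdmap range, so adjust if your release's output differs.

    # How far behind a fresh OSD starts: the gap between these two is the number
    # of epochs every new OSD has to fetch (field names from memory)
    ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'

    # Once the cluster has converged and the maps have trimmed, compact one mon
    # at a time and confirm it is back in quorum before moving on
    for m in a b c d e; do
        ceph tell mon.$m compact
        ceph quorum_status | jq -r '.quorum_names'
        sleep 30
    done

The idea being to shrink the mon DBs after a long stretch of untrimmed maps, so the next round of OSD additions isn't also fighting a bloated store.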