Thanks for the list of parameters. I suppose I was just asking whether the docs' "planned carefully" meant something more than being ready to moderate the rebuild process, etc. That clears things up for me.
Sincerely
-Dave

On 2022-05-05 1:15 p.m., Anthony D'Atri wrote:
>> The balancer was driving all the weights to 1.00000 so I turned it off.
>
> Which weights (CRUSH or reweight)? And which balancer?
>
> Assuming the ceph-mgr balancer module in upmap mode, you'd want the reweight
> values to be 1.000, since it uses the newer pg-upmap functionality to
> distribute capacity. Lower reweight values have a way of confusing the
> balancer and preventing good uniformity. If you had a bunch of significantly
> adjusted reweight values, e.g. from prior runs of reweight-by-utilization,
> that could contribute to suboptimal balancing.
>
>> You mentioned that all solutions would cause data migration and would need
>> to be planned carefully. I've seen that language in the docs and other
>> messages, but what I can't find is what is meant by "planned carefully".
>
> There are many ways to proceed; documenting them all might be a bit of a
> rabbit-hole.
>
>> Doing any of these will cause data migration like crazy, but it's not
>> avoidable other than to change the number of max backfills etc., and the
>> systems should still be accessible during this time, just with reduced
>> bandwidth and higher latency. Is it just a warning that the system could be
>> degraded for a long period of time, or is it suggesting that users should
>> take an outage while the rebuild happens?
>
> Throttling recovery/backfill can reduce the impact of big data migrations, at
> the expense of increased elapsed time to complete:
>
> osd_max_backfills=1
> osd_recovery_max_active=1
> osd_recovery_op_priority=1
> osd_recovery_max_single_start=1
> osd_scrub_during_recovery=false
>
> Also, ensure that
>
> osd_op_queue_cut_off = high
>
> This will help ensure that recovery/backfill doesn't DoS client traffic.
> I'm not sure if this is the default in your release. If changed, I believe
> the OSDs would need to be restarted for the new value to take effect.
>
> PGs:
>
> pg_num = ( #OSDs * ratio ) / replication
> ratio = pg_num * replication / #OSDs
>
> On clusters with multiple pools this can get a bit complicated when more than
> one pool has significant numbers of PGs; the end goal is the total number of
> PGs on a given OSD, which `ceph osd df` reports.
>
> Your OSDs look to have ~190 PGs each on average, which is probably OK given
> your media. If you do have big empty pools, deleting them would show more
> indicative numbers. PG ratio targets are somewhat controversial, but
> depending on your media and RAM an aggregate around this range is reasonable;
> you can go higher with flash.
>
> This calculator can help when you have multiple pools:
>
> https://old.ceph.com/pgcalc/
>
> If you need to bump pg_num for a pool, you don't have to do it in one step.
> You can increase it by, say, 32 at a time.
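>
> For what it's worth, on a Nautilus-era cluster those throttles can be applied
> at runtime with `ceph config set` (or injectargs); a minimal sketch, assuming
> you want them cluster-wide for all OSDs (osd.0 below is just an example
> daemon, and 544 is simply the next 32-PG step up from your current 512):
>
> # throttle recovery/backfill with the values listed above
> ceph config set osd osd_max_backfills 1
> ceph config set osd osd_recovery_max_active 1
> ceph config set osd osd_recovery_op_priority 1
> ceph config set osd osd_recovery_max_single_start 1
> ceph config set osd osd_scrub_during_recovery false
>
> # keep recovery/backfill below client traffic; OSDs may need a restart
> # for this one to take effect
> ceph config set osd osd_op_queue_cut_off high
>
> # spot-check what a running OSD actually has
> ceph config show osd.0 osd_op_queue_cut_off
>
> # illustrating the formula above for your main data pool (EC size 9,
> # 144 OSDs, aiming for roughly 100 PGs per OSD): 144 * 100 / 9 is ~1600,
> # so 1024 or 2048 as a power of two; stepping pg_num up 32 at a time
> # limits how much data moves per step
> ceph osd pool set fsdatak7m2 pg_num 544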
>
>> Thanks for your guidance.
>>
>> -Dave
>>
>> On 2022-05-05 2:33 a.m., Erdem Agaoglu wrote:
>>
>> Hi David,
>>
>> I think you're right with your option 2: 512 PGs is just too few. You're
>> also right about the "inflation", but you should add your erasure bits to
>> the calculation, so 9x512=4608. With 144 OSDs, that averages 32 PGs per OSD.
>> Some old advice for that number was around 100.
>>
>> But your current PGs per OSD is around 180-190 according to the df output.
>> This is probably because of your empty pool 4 fsdata, which has 4096 PGs
>> with size 5, adding 5x4096=20480, i.e. 20480/144=142 more PGs per OSD.
>>
>> I'm not really sure how empty/unused PGs affect an OSD, but I think they
>> affect the balancer, which tries to balance the number of PGs, and that
>> might explain things getting worse. Also, your df output shows several
>> modifications in weights/reweights, but I'm not sure if they're manual or
>> balancer-adjusted.
>>
>> I would first delete that empty pool to get a clearer picture of PGs on
>> OSDs. Then I would increase the pg_num for pool 6 to 2048. And after
>> everything settles, if it's still too unbalanced, I'd go for the upmap
>> balancer. Needless to say, all of this would cause major data migration, so
>> it should be planned carefully.
>>
>> Best,
>>
>> On Thu, May 5, 2022 at 12:02 AM David Schulz <dsch...@ucalgary.ca> wrote:
>>
>> Hi Josh,
>>
>> We do have an old pool that is empty, so there are 4611 empty PGs, but the
>> rest seem fairly close:
>>
>> # ceph pg ls|awk '{print $7/1024/1024/10}'|cut -d "." -f 1|sed -e 's/$/0/'|sort -n|uniq -c
>>    4611 00
>>       1 1170
>>       8 1180
>>      10 1190
>>      28 1200
>>      51 1210
>>      54 1220
>>      52 1230
>>      32 1240
>>      13 1250
>>       7 1260
>>
>> Hmm, that's interesting: adding up the first column, except the 4611, gives
>> 256, but there are 512 PGs in the main data pool.
>>
>> Here are our pool settings:
>>
>> pool 3 'fsmeta' replicated size 3 min_size 1 crush_rule 0 object_hash
>> rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 35490
>> flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16
>> recovery_priority 5 application cephfs
>> pool 4 'fsdata' erasure size 5 min_size 4 crush_rule 1 object_hash
>> rjenkins pg_num 4096 pgp_num 4096 autoscale_mode warn last_change 35490
>> lfor 0/0/4742 flags hashpspool,ec_overwrites stripe_width 12288
>> application cephfs
>> pool 6 'fsdatak7m2' erasure size 9 min_size 8 crush_rule 3 object_hash
>> rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 35490
>> flags hashpspool,ec_overwrites stripe_width 28672 application cephfs
>>
>> The fsdata pool was originally created with very safe erasure coding that
>> wasted too much space, then fsdatak7m2 was created and everything was
>> migrated to it. This is why there are at least 4096 PGs with 0 bytes.
>>
>> -Dave
>>
>> On 2022-05-04 2:08 p.m., Josh Baergen wrote:
>>>
>>> Hi Dave,
>>>
>>>> This cluster was upgraded from 13.x to 14.2.9 some time ago. The entire
>>>> cluster was installed at the 13.x time and was upgraded together, so all
>>>> OSDs should have the same formatting etc.
>>>
>>> OK, thanks, that should rule out a difference in bluestore
>>> min_alloc_size, for example.
>>>
>>>> Below is pasted the ceph osd df tree output.
>>>
>>> It looks like there is some pretty significant skew in terms of the
>>> amount of bytes per active PG. If you issue "ceph pg ls", are you able
>>> to find any PGs with a significantly higher byte count?
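>>>
>>> Something along these lines should surface the outliers; a rough sketch
>>> that reuses field 7 from your awk one-liner above (worth double-checking
>>> that it really is the BYTES column on your release):
>>>
>>> # largest PGs first: pgid and bytes
>>> ceph pg ls | awk 'NR>1 {print $1, $7}' | sort -k2 -rn | head -20
>>>
>>> # or restrict it to the main data pool
>>> ceph pg ls-by-pool fsdatak7m2 | awk 'NR>1 {print $1, $7}' | sort -k2 -rn | head -20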
>>>
>>> Josh
>>
>> --
>> erdem agaoglu

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io