> On Aug 14, 2025, at 3:19 AM, Vishnu Bhaskar <vishn...@acceleronlabs.com> wrote:
>
> Hi Anthony
>
> CEPH OSD DF TREE ::
> ===========================
> ID   CLASS  WEIGHT     REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
> -51         12.22272         -   12 TiB  4.8 TiB  4.8 TiB  112 MiB   20 GiB  7.4 TiB  39.21  3.24    -          root cache_root
> -27          0.87329         -  894 GiB  306 GiB  305 GiB  8.1 MiB  1.3 GiB  588 GiB  34.25  2.83    -          host cache_node1
>   0  ssd     0.87329   1.00000  894 GiB  306 GiB  305 GiB  8.1 MiB  1.3 GiB  588 GiB  34.25  2.83    4      up          osd.0

These are nominal 960 GB SSDs?
Under the PGS column the numbers are indeed doubleplus ungood. These should be rather higher.

> -45          0.87299         -  894 GiB  382 GiB  381 GiB  6.3 MiB  1.5 GiB  512 GiB  42.76  3.53    -          host cache_node10
>  45  ssd     0.87299   1.00000  894 GiB  382 GiB  381 GiB  6.3 MiB  1.5 GiB  512 GiB  42.76  3.53    3      up          osd.45
> -47          0.87299         -  894 GiB  458 GiB  456 GiB  4.1 MiB  1.7 GiB  436 GiB  51.21  4.23    -          host cache_node11
>  50  ssd     0.87299   1.00000  894 GiB  458 GiB  456 GiB  4.1 MiB  1.7 GiB  436 GiB  51.21  4.23    6      up          osd.50
> -49          0.87299         -  894 GiB  535 GiB  533 GiB  8.0 MiB  1.7 GiB  359 GiB  59.84  4.94    -          host cache_node12
>  55  ssd     0.87299   1.00000  894 GiB  535 GiB  533 GiB  8.0 MiB  1.7 GiB  359 GiB  59.84  4.94    7      up          osd.55

The larger OSDs also have many fewer PGs than they should.

>  40  ssd     0.87299   1.00000  894 GiB  612 GiB  610 GiB  9.7 MiB  1.8 GiB  282 GiB  68.46  5.65    8      up          osd.40
>  -1        195.60869         -  196 TiB   20 TiB   20 TiB  810 MiB   77 GiB  175 TiB  10.42  0.86    -          root default
>  -3         13.97198         -   14 TiB  1.3 TiB  1.3 TiB   31 MiB  5.1 GiB   13 TiB   9.07  0.75    -          host node1
>   1  ssd     3.49300   1.00000  3.5 TiB  309 GiB  308 GiB  7.8 MiB  1.2 GiB  3.2 TiB   8.64  0.71   36      up          osd.1
>   2  ssd     3.49300   1.00000  3.5 TiB  331 GiB  330 GiB  5.2 MiB  1.4 GiB  3.2 TiB   9.26  0.76   36      up          osd.2
>   3  ssd     3.49300   1.00000  3.5 TiB  292 GiB  291 GiB  8.2 MiB  1.2 GiB  3.2 TiB   8.16  0.67   34      up          osd.3
>   4  ssd     3.49300   1.00000  3.5 TiB  365 GiB  364 GiB  9.9 MiB  1.4 GiB  3.1 TiB  10.21  0.84   38      up          osd.4

You only have 4 OSDs per node? What kind of nodes are these? Are they converged with compute? Since *all* of your OSDs appear to be SSDs, why do you have a cache tier in the first place?

> MIN/MAX VAR: 0.67/5.65  STDDEV: 14.49

The standard deviation here is perturbed by the wide variance in OSD sizes. I have an RFE in to break these figures down by device class so that they are more useful for heterogeneous clusters.
For now I suggest:

ceph config set global mon_max_pg_per_osd 500
ceph config set global mon_target_pg_per_osd 250
ceph config set mgr mgr/balancer/upmap_max_deviation 1

Then check that you don't have any of these set at a narrower scope; most things can just be set at "global", honestly:

ceph config dump | grep pg_per_osd

If you have existing entries at narrower scopes such as "mon" or "osd", I'd suggest using "ceph config rm" to clear those so the global-scope values above are the only ones in force. I expect that your warning will clear after the dust settles.

> CEPH OSD POOL LS DETAIL ::
> ====================
> pool 2 'device_health_metrics' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 44903
> flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
> pool 3 'volumes' replicated size 2 min_size 1 crush_rule 0 object_hash
> rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 63235 lfor
> 353/353/62134 flags hashpspool,selfmanaged_snaps tiers 4 read_tier 4
> write_tier 4 stripe_width 0 application rbd

I suggest enabling the autoscaler for this pool after making the above settings.

> pool 4 'volumes_cache' replicated size 2 min_size 1 crush_rule 1 object_hash
> rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 63235 lfor
> 353/353/353 flags hashpspool,incomplete_clones,selfmanaged_snaps tier_of 3
> cache_mode writeback target_bytes 3298534883328 hit_set
> bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 14400s x4
> decay_rate 0 search_last_n 0 stripe_width 0 application rbd

Unless I'm missing something, I would look up the procedure for removing the cache tier entirely. I don't think it's doing anything for you; in fact I suspect it's slowing you down.
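For reference, the usual teardown for a writeback cache tier looks roughly like the below, using the pool names from your output. Treat it as a sketch, not a runbook; verify against the cache-tiering documentation for your Ceph release before running anything.

```shell
# Switch the cache from writeback to proxy so new I/O goes straight to the
# base pool while existing objects in the cache can still be flushed.
ceph osd tier cache-mode volumes_cache proxy

# Flush and evict everything from the cache pool; this can take a while.
rados -p volumes_cache cache-flush-evict-all

# Once the cache pool is empty, detach the overlay and remove the tier.
ceph osd tier remove-overlay volumes
ceph osd tier remove volumes volumes_cache
```

After the overlay is removed, clients talk to the 'volumes' pool directly and the now-detached 'volumes_cache' pool can be deleted once you're satisfied nothing references it.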
> pool 5 'images' replicated size 2 min_size 1 crush_rule 0 object_hash
> rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 51925 lfor
> 0/0/368 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
> pool 6 'internal' replicated size 2 min_size 1 crush_rule 0 object_hash
> rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 49949 flags
> hashpspool,selfmanaged_snaps stripe_width 0 application rbd

I notice that all of these pools have size=2, min_size=1. This is dangerous; I strongly suggest setting all of these pools to size=3, min_size=2.

Each of the above steps will result in waves of peering and backfill. This is normal. Do one item at a time and let the cluster converge before proceeding.

>> _______________________________________________
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
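P.S. In case it's useful, the replication bump is just a per-pool set. The pool names below are taken from the `ceph osd pool ls detail` output above; this is a sketch, and you may prefer to do one pool at a time and wait for HEALTH_OK between them rather than loop. Skip 'volumes_cache' if you end up removing the cache tier first.

```shell
# Raise replication on each pool; each change triggers peering/backfill.
for pool in device_health_metrics volumes images internal; do
    ceph osd pool set "$pool" size 3
    ceph osd pool set "$pool" min_size 2
done
```

Note that going from size=2 to size=3 will consume roughly 50% more raw capacity per pool, so check `ceph df` headroom first.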