> On Aug 14, 2025, at 3:19 AM, Vishnu Bhaskar <vishn...@acceleronlabs.com> 
> wrote:
> 
> Hi Anthony
> 
> CEPH OSD DF TREE :: ===========================
> ID   CLASS  WEIGHT     REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     
> AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME            
> -51          12.22272         -   12 TiB  4.8 TiB  4.8 TiB  112 MiB   20 GiB  
> 7.4 TiB  39.21  3.24    -          root cache_root      
> -27           0.87329         -  894 GiB  306 GiB  305 GiB  8.1 MiB  1.3 GiB  
> 588 GiB  34.25  2.83    -              host cache_node1 
>   0    ssd    0.87329   1.00000  894 GiB  306 GiB  305 GiB  8.1 MiB  1.3 GiB  
> 588 GiB  34.25  2.83    4      up          osd.0        
These are nominal 960 GB SSDs?

Under the PGS column the numbers are indeed doubleplus ungood. These should be 
rather higher.
> -45           0.87299         -  894 GiB  382 GiB  381 GiB  6.3 MiB  1.5 GiB  
> 512 GiB  42.76  3.53    -              host cache_node10
>  45    ssd    0.87299   1.00000  894 GiB  382 GiB  381 GiB  6.3 MiB  1.5 GiB  
> 512 GiB  42.76  3.53    3      up          osd.45       
> -47           0.87299         -  894 GiB  458 GiB  456 GiB  4.1 MiB  1.7 GiB  
> 436 GiB  51.21  4.23    -              host cache_node11
>  50    ssd    0.87299   1.00000  894 GiB  458 GiB  456 GiB  4.1 MiB  1.7 GiB  
> 436 GiB  51.21  4.23    6      up          osd.50       
> -49           0.87299         -  894 GiB  535 GiB  533 GiB  8.0 MiB  1.7 GiB  
> 359 GiB  59.84  4.94    -              host cache_node12
>  55    ssd    0.87299   1.00000  894 GiB  535 GiB  533 GiB  8.0 MiB  1.7 GiB  
> 359 GiB  59.84  4.94    7      up          osd.55       
The larger OSDs also have many fewer PGs than they should.

>  40    ssd    0.87299   1.00000  894 GiB  612 GiB  610 GiB  9.7 MiB  1.8 GiB  
> 282 GiB  68.46  5.65    8      up          osd.40       
>  -1         195.60869         -  196 TiB   20 TiB   20 TiB  810 MiB   77 GiB  
> 175 TiB  10.42  0.86    -          root default         
>  -3          13.97198         -   14 TiB  1.3 TiB  1.3 TiB   31 MiB  5.1 GiB  
>  13 TiB   9.07  0.75    -              host node1       
>   1    ssd    3.49300   1.00000  3.5 TiB  309 GiB  308 GiB  7.8 MiB  1.2 GiB  
> 3.2 TiB   8.64  0.71   36      up          osd.1        
>   2    ssd    3.49300   1.00000  3.5 TiB  331 GiB  330 GiB  5.2 MiB  1.4 GiB  
> 3.2 TiB   9.26  0.76   36      up          osd.2        
>   3    ssd    3.49300   1.00000  3.5 TiB  292 GiB  291 GiB  8.2 MiB  1.2 GiB  
> 3.2 TiB   8.16  0.67   34      up          osd.3        
>   4    ssd    3.49300   1.00000  3.5 TiB  365 GiB  364 GiB  9.9 MiB  1.4 GiB  
> 3.1 TiB  10.21  0.84   38      up          osd.4   
You only have 4 OSDs per node? What kind of nodes are these?  Are they 
converged with compute?

Since *all* of your OSDs appear to be SSDs, why do you have a cache tier in the 
first place?

>            
> MIN/MAX VAR: 0.67/5.65  STDDEV: 14.49
The standard deviation here is perturbed by the wide variance in OSD sizes.  I 
have an RFE in to break down these figures by device class so that they are 
more useful for heterogeneous clusters.
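In the meantime, recent releases let you filter the df output by device class or CRUSH subtree, which at least gives you MIN/MAX VAR and STDDEV over a more homogeneous set. The filter syntax varies by release, so check `ceph osd df --help`; since everything in your cluster is class ssd, filtering by CRUSH root is the more useful cut here:

```shell
# Per-class view (less useful here since every OSD is class ssd)
ceph osd df tree class ssd

# Per-subtree view: stats computed over the default root only,
# excluding the small cache-tier OSDs (syntax varies by release)
ceph osd df tree default
```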

For now I suggest:

ceph config set global mon_max_pg_per_osd                   500
ceph config set global mon_target_pg_per_osd                250
ceph config set mgr mgr/balancer/upmap_max_deviation        1 


Then check that you don't have any of these set at a narrower scope; a value set at, say, the "mon" scope overrides the global one. Honestly, most things can just be set at "global".

ceph config dump | grep pg_per_osd

If you have existing entries at narrower scopes such as "mon" or "mgr", I'd suggest using "ceph config rm" to clear those so the global-scope values above are the only ones in force.
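For example, if the dump were to show a stale entry at the "mon" scope, the cleanup would look like this (substitute whatever scope/option pairs the dump actually prints):

```shell
# Clear narrower-scope overrides so the global values win
ceph config rm mon mon_max_pg_per_osd
ceph config rm mon mon_target_pg_per_osd

# Verify only the global-scope entries remain
ceph config dump | grep pg_per_osd
```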

I expect that your warning will clear after the dust settles.
> 
> 
> CEPH OSD POOL LS DETAIL :: ====================
> pool 2 'device_health_metrics' replicated size 2 min_size 1 crush_rule 0 
> object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 44903 
> flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
> pool 3 'volumes' replicated size 2 min_size 1 crush_rule 0 object_hash 
> rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 63235 lfor 
> 353/353/62134 
I suggest enabling the autoscaler for this pool after making the above settings.
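Something like the below; the target_size_ratio value is a placeholder, set it to the fraction of the cluster you expect the pool to eventually consume:

```shell
# Hand pg_num management for 'volumes' back to the autoscaler
ceph osd pool set volumes pg_autoscale_mode on

# Optional: hint at the pool's expected share of the cluster so the
# autoscaler sizes it in one step rather than incrementally
ceph osd pool set volumes target_size_ratio 0.8

# See what the autoscaler intends to do before it does it
ceph osd pool autoscale-status
```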


> flags hashpspool,selfmanaged_snaps tiers 4 read_tier 4 write_tier 4 
> stripe_width 0 application rbd
> pool 4 'volumes_cache' replicated size 2 min_size 1 crush_rule 1 object_hash 
> rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 63235 lfor 
> 353/353/353 flags hashpspool,incomplete_clones,selfmanaged_snaps tier_of 3 
> cache_mode writeback target_bytes 3298534883328 hit_set 
> bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 14400s x4 
> decay_rate 0 search_last_n 0 stripe_width 0 application rbd
Unless I'm missing something, I would look up procedures for removing the cache 
tier entirely.  I don't think it's doing anything for you.  Actually I suspect 
it's slowing you down.
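The usual teardown for a writeback tier is roughly the following; verify against the cache-tiering docs for your release before running it (older releases use "forward" instead of "proxy"), and note that the flush step must fully drain the cache pool before the tier can be removed:

```shell
# Stop new writes landing in the cache; reads proxy to the base pool
ceph osd tier cache-mode volumes_cache proxy --yes-i-really-mean-it

# Flush and evict every object still held in the cache pool
rados -p volumes_cache cache-flush-evict-all

# Detach the tier from 'volumes' and remove the relationship
ceph osd tier remove-overlay volumes
ceph osd tier remove volumes volumes_cache
```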


> pool 5 'images' replicated size 2 min_size 1 crush_rule 0 object_hash 
> rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 51925 lfor 
> 0/0/368 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
> pool 6 'internal' replicated size 2 min_size 1 crush_rule 0 object_hash 
> rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 49949 flags 
> hashpspool,selfmanaged_snaps stripe_width 0 application rbd

I notice that all of these pools have size=2, min_size=1. This is dangerous: I strongly suggest setting all of these pools to size=3, min_size=2.
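E.g. (add volumes_cache to the list only if you end up keeping the tier; note that size=3 needs half again as much raw space as size=2, which your ~10% utilization easily accommodates):

```shell
# Raise replication pool by pool; the loop runs sequentially, so each
# pool's peering/backfill begins before the next command is issued
for pool in device_health_metrics volumes images internal; do
    ceph osd pool set "$pool" size 3
    ceph osd pool set "$pool" min_size 2
done
```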

Each of the above steps will result in waves of peering and backfill.  This is 
normal. Do one item at a time and let the cluster converge before proceeding.
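Between steps, something like:

```shell
# Overall health plus recovery/backfill progress
ceph -s

# Compact PG-state summary; wait until everything is active+clean
ceph pg stat
```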



_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io