> On Sep 2, 2025, at 6:36 AM, Steven Vacaroaia <ste...@gmail.com> wrote:
> 
> Thanks again Anthony
> 
> I guess I assumed wrong that "osd.all-available-devices" includes all 
> available devices 

I *think* the last match wins when there are multiple OSD specs.  Please 
excerpt them from `ceph orch ls --export` so we can see them in detail.
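
For illustration only (your real specs will differ), a spec that should stay inert looks roughly like this, with unmanaged set explicitly:

    service_type: osd
    service_id: all-available-devices
    unmanaged: true
    placement:
      host_pattern: '*'
    spec:
      data_devices:
        all: true

Note that unmanaged only stops *that* spec from acting; another managed spec whose filters (rotational, size, vendor, ...) also match the new drives will still create OSDs on them.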

> It seems I should have targeted specifical ones (like osd.hdd_osds)
> 
> Going back to the capacity increase
> I am trying to estimate / calculate the final capacity  
> 
> Looking at the output of "ceph df" below
> it would be greatly appreciated if you could please confirm 
> if this is the correct interpretation for hdd_class :
> 
> "Out of a total capacity of 1.4 PiB you currently use 579 TiB .

Yes.

> The 579 TiB used are the sum of the data used by pools configured to use 
>  hdd_class e.g osd.all-available-devices (434) ,  hdd_ec_archive    
> (135)...etc

Yes.  Note that if you have any pools using a CRUSH rule that does not specify 
a device class, they will use *all* OSDs and hilarity will ensue.
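
One quick way to check for that (field names as shown by recent releases) is:

    # ceph osd crush rule dump | grep -E '"rule_name"|"item_name"'

A class-aware rule takes from a shadow root such as default~hdd_class; a bare default in item_name means the rule can place data on every device class.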

> 
> Therefore, when the backfilling / balancing is done
>  (and assuming no more data is added to the pools)
> the amount of pools "MAX Available" should be 814 TB

Ah.  Not so fast.  814 TiB is the aggregate *raw* capacity remaining in hdd_class; MAX AVAIL is a different, more nuanced figure.  This can be slippery to wrap one's mind around.  If your mind is pre-warped, that helps.

Max avail is *per pool*.  It indicates the projected amount of user data that may be written to that pool *if* no other pools mapped to the same OSDs gain or lose data.  It's a zero-sum game.

Note for example that .mgr and default.rgw.buckets.index (and a slew of others) all report 93 TiB max avail.  They're all (presumably) replicated size=3 and on the same OSDs, so that *set* of pools can accept roughly 93 TiB of new data in aggregate, split any way across them.

That 93 TiB is a function of the delta between the 90% full_ratio and the 28.58% fullness of the currently most-full OSD behind those pools, divided by 3 for replication.
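
As a rough back-of-the-envelope (the real calculation considers each OSD's headroom weighted by the share of the rule's data CRUSH maps to it, and is limited by the most constrained OSD, so treat this as an approximation):

    MAX AVAIL  ~=  (full_ratio - fullness of most-full OSD) x (raw capacity behind the rule) / (redundancy overhead)

where the overhead is 3 for replicated size=3, or (k+m)/k for an EC pool.  You can confirm the ratios themselves with:

    # ceph osd dump | grep ratio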

It becomes a bit more complicated for hdd_class, where it would seem that EC is in effect.  It's also a function of the CRUSH topology and rules.  I see, for example, that ceph-host-1 offers ~ 98 CRUSH weight of hdd_class, whereas the other hosts offer ~ 226 each.  Your latest message doesn't show the CRUSH rules and pool attributes, but I suspect multiple EC profiles are in use, which complicates the calculation a bit.  You have 7 hosts and thus likely 7 CRUSH failure domains.  If you have, say, a 5+2 EC pool, that pool's max avail will be limited by the failure domain with the lowest CRUSH weight, since each PG needs to place a shard on every host.  A 4+2 pool, however, would have more flexibility in placement and be less constrained by the unequal distribution.  Why does ceph-host-1 have fewer HDDs than the other hosts?
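
To lay that out, something along these lines should show the per-host shadow weights and which rule / EC profile each pool uses (placeholders in angle brackets):

    # ceph osd crush tree --show-shadow
    # ceph osd pool ls detail
    # ceph osd erasure-code-profile ls
    # ceph osd erasure-code-profile get <profile-name>
    # ceph osd crush rule dump <rule-name>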


> 
> "
> 
> Is the above correct ?
> 
> Note
> The "MAX available " is increasing - 202 TB now 

That's a function of the balancer reducing the fullness of the most-full OSD in that device class.
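
You can watch that happen with, for example:

    # ceph balancer status
    # ceph osd df hdd_class | tail -5

The MIN/MAX VAR and STDDEV figures at the bottom should tighten as the balancer and backfill make progress.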

> 
> Steven
> 
> 
> --- RAW STORAGE ---
> CLASS          SIZE    AVAIL     USED  RAW USED  %RAW USED
> hdd_class   1.4 PiB  814 TiB  579 TiB   579 TiB      41.54
> nvme_class  293 TiB  293 TiB  216 GiB   216 GiB       0.07
> ssd_class   587 TiB  424 TiB  163 TiB   163 TiB      27.80
> TOTAL       2.2 PiB  1.5 PiB  742 TiB   742 TiB      32.64
>  
> --- POOLS ---
> POOL                        ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
> .mgr                         1     1  277 MiB       71  831 MiB      0     93 TiB
> .rgw.root                    2    32  1.6 KiB        6   72 KiB      0     93 TiB
> default.rgw.log              3    32   63 KiB      210  972 KiB      0     93 TiB
> default.rgw.control          4    32      0 B        8      0 B      0     93 TiB
> default.rgw.meta             5    32  1.4 KiB        8   72 KiB      0     93 TiB
> default.rgw.buckets.data     6  2048  289 TiB  100.35M  434 TiB  68.04    136 TiB
> default.rgw.buckets.index    7  1024  5.4 GiB      521   16 GiB      0     93 TiB
> default.rgw.buckets.non-ec   8    32    551 B        1   13 KiB      0     93 TiB
> metadata_fs_ssd              9   128  6.1 GiB   15.69M   18 GiB      0     93 TiB
> ssd_ec_project              10  1024  108 TiB   44.46M  162 TiB  29.39    260 TiB
> metadata_fs_hdd             11   128  9.9 GiB    8.38M   30 GiB   0.01     93 TiB
> hdd_ec_archive              12  1024   90 TiB   87.94M  135 TiB  39.79    136 TiB
> 
> On Mon, 1 Sept 2025 at 17:07, Anthony D'Atri <a...@dreamsnake.net> wrote:
>> 
>> 
>>> Hi,
>>> Thanks Anthony - as always a very useful and comprehensive response
>> 
>> Most welcome.
>> 
>>> yes, there were only 139 OSD and , indeed, raw capacity increased
>>> I also noticed that the "max available "  column from "cepf df" is getting 
>>> higher (177 TB) so, it seems the capacity is being added
>> 
>> Aye, as expected.  
>> 
>>> Too many MDS daemons are due to the fact that I have over 150 CEPHFS clients
>>> and thought that deploying a daemon on each host for each filesystem is 
>>> going to
>>> provide better performance  - was I wrong ?
>> 
>> MDS strategy is nuanced, but I think deploying them redundantly on each 
>> back-end node might not give you much.  The workload AIUI is  more a 
>> function of the number and size of files.
>> 
>>> High number of  remapped PGs are due to improperly using 
>>> osd.all-available-devices unmaged command
>>> so adding the new drives triggered  automatically detecting and adding them
>> 
>> Yep, I typically suggest setting OSD services to unmanaged when not actively 
>> using them.
>> 
>>> I am not sure what I did wrong  though - see below output from "ceph orch 
>>> ls" before adding the drive 
>>> Shouldn't setting it like that prevent automatic discovery  ?
>> 
>> You do have three other OSD rules, at least one of them likely matched and 
>> took action.
>>                                                                              
>>> osd.all-available-devices       0  -         4w   <unmanaged>
>>> osd.hdd_osds                    72  10m ago   5w   *
>>> osd.nvme_osds                   25  10m ago   5w   *
>>> osd.ssd_osds                    84  10m ago   3w   ceph-host-1
>> 
>> `ceph orch ls --export` 
>> 
>>> Resources mentioned are very useful
>>> running upmap-remapped.py did  bring the number of PGs to be remapped close 
>>> to zero
>> 
>> It's a super, super useful tool. Letting the balancer do the work 
>> incrementally helps deter unexpected OSD fullness issues and if a problem 
>> arises, you're a lot closer to HEALTH_OK.
>> 
>>> 2. upmap-remapped.py | sh
>> 
>> Sometimes 2-3 runs are needed for full effect.
>> 
>>> 
>>> 3. change target_max_misplaced_ratio to a higher number than default 0.005  
>>> (since we want to rebalance faster and client performance is not a huge 
>>> issue )
>>> 
>>> 4. enable balancer
>>> 
>>> 5.wait
>>> 
>>> Doing it like this will, eventually, increase the number misplaced PGs 
>>> until it is higher than the ratio when , I guess, the balancer stops
>>> (  "optimize_result": "Too many objects (0.115742 > 0.090000) are 
>>> misplaced; try again later )
>> 
>> Exactly.
>> 
>>> Should I repeat the process when the number of objects misplaced is higher 
>>> than the ratio or what is the proper way of doing it ?
>> 
>> As the backfill progresses and the misplaced percentage drops, the balancer 
>> will kick in with another increment.
>> 
>>> 
>>> Steven
>>> 
>>> 
>>> On Sun, 31 Aug 2025 at 11:08, Anthony D'Atri <a...@dreamsnake.net> wrote:
>>>> 
>>>> 
>>>>> On Aug 31, 2025, at 4:15 AM, Steven Vacaroaia <ste...@gmail.com> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I have added 42 x 18TB HDD disks ( 6 on each of the 7 servers )
>>>> 
>>>> The ultimate answer to Ceph, the cluster, and everything!
>>>> 
>>>>> My expectation was that the pools configured to use "hdd_class" will
>>>>> have their capacity  increased ( e.g. default.rgw.buckets.data which is
>>>>> uses an EC 4+2 pool  for data )
>>>> 
>>>> First, did the raw capacity increase when you added these drives?
>>>> 
>>>>> --- RAW STORAGE ---
>>>>> CLASS          SIZE    AVAIL     USED  RAW USED  %RAW USED
>>>>> hdd_class   1.4 PiB  814 TiB  579 TiB   579 TiB      41.54
>>>> 
>>>> Was the number of OSDs previously 139?
>>>> 
>>>>> It seems it is not happening ...yet ?!
>>>>> Is it because the peering is still going ?
>>>> 
>>>> Ceph nomenclature can be mystifying at first.  And sometimes at thirteenth.
>>>> 
>>>> Peering is daemons checking in with each other to ensure they’re in 
>>>> agreement.
>>>> 
>>>> I think you mean backfill / balancing.
>>>> 
>>>> The available space reported by “ceph df” for a *pool* is a function of:
>>>> 
>>>> * Raw space available in the associated CRUSH rule’s device class (or if 
>>>> the rule isn’t ideal, all device classes)
>>>> * The cluster’s three full ratios # ceph osd dump | grep ratio
>>>> * The fullness of the single most-full OSD in the device class
>>>> 
>>>> BTW I learned only yesterday that you can restrict `ceph osd df` by 
>>>> specifying a device class, so try running
>>>> 
>>>> `ceph osd df hdd_class | tail -10`
>>>> 
>>>> Notably, this will show you the min/max variance among OSDs of just that 
>>>> device class, and the standard deviation.
>>>> When you have multiple OSD sizes, these figures are much less useful when 
>>>> calculated across the whole cluster by “ceph osd df”
>>>> 
>>>> # ceph osd df hdd
>>>> ...
>>>> 318    hdd  18.53969   1.00000   19 TiB   15 TiB   15 TiB   15 KiB  67 GiB  3.6 TiB  80.79  1.04  127      up
>>>> 319    hdd  18.53969   1.00000   19 TiB   15 TiB   14 TiB  936 KiB  60 GiB  3.7 TiB  79.87  1.03  129      up
>>>> 320    hdd  18.53969   1.00000   19 TiB   15 TiB   14 TiB   33 KiB  72 GiB  3.7 TiB  79.99  1.03  129      up
>>>>  30    hdd  18.53969   1.00000   19 TiB  3.3 TiB  2.9 TiB  129 KiB  11 GiB   15 TiB  17.55  0.23   26      up
>>>>                          TOTAL  5.4 PiB  4.2 PiB  4.1 PiB  186 MiB  17 TiB  1.2 PiB  77.81
>>>> MIN/MAX VAR: 0.23/1.09  STDDEV: 4.39
>>>> 
>>>> You can even run this for a specific OSD so you don’t have to get creative 
>>>> with an egrep regex or exercise your pattern-matching skills, though the 
>>>> summary values naturally aren’t useful.
>>>> 
>>>> # ceph osd df osd.30
>>>> ID  CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP     META    AVAIL   %USE   VAR   PGS  STATUS
>>>> 30    hdd  18.53969   1.00000  19 TiB  3.3 TiB  2.9 TiB  129 KiB  12 GiB  15 TiB  17.55  1.00   26      up
>>>>                         TOTAL  19 TiB  3.3 TiB  2.9 TiB  130 KiB  12 GiB  15 TiB  17.55
>>>> MIN/MAX VAR: 1.00/1.00  STDDEV: 0
>>>> 
>>>> Here there’s a wide variation among the hdd OSDs because osd.30 had been 
>>>> down for a while and was recently restarted due to a host reboot, so it’s 
>>>> slowly filling with data.
>>>> 
>>>> 
>>>>> ssd_class     6.98630
>>>> 
>>>> That seems like an unusual size, what are these? Are they SAN LUNs?
>>>> 
>>>>> Below are outputs from
>>>>> ceph -s
>>>>> ceph df
>>>>> ceph osd df tree
>>>> 
>>>> Thanks for providing the needful up front.
>>>> 
>>>>>  cluster:
>>>>>    id:     0cfa836d-68b5-11f0-90bf-7cc2558e5ce8
>>>>>    health: HEALTH_WARN
>>>>>            1 OSD(s) experiencing slow operations in BlueStore
>>>> 
>>>> This warning state by default persists for a long time after it clears, 
>>>> I’m not sure why but I like to set this lower:
>>>> 
>>>> # ceph config dump | grep blue
>>>> global     advanced   bluestore_slow_ops_warn_lifetime           300
>>>> 
>>>> 
>>>> 
>>>>>            1 failed cephadm daemon(s)
>>>>>            39 daemons have recently crashed
>>>> 
>>>> That’s a bit worrisome, what happened?
>>>> 
>>>> `ceph crash ls`
>>>> 
>>>> 
>>>>>            569 pgs not deep-scrubbed in time
>>>>>            2609 pgs not scrubbed in time
>>>> 
>>>> Scrubs don’t happen during recovery, when complete these should catch up.
>>>> 
>>>>>  services:
>>>>>    mon: 5 daemons, quorum
>>>>> ceph-host-1,ceph-host-2,ceph-host-3,ceph-host-7,ceph-host-6 (age 2m)
>>>>>    mgr: ceph-host-1.lqlece(active, since 18h), standbys: 
>>>>> ceph-host-2.suiuxi
>>>> 
>>>> I’m paranoid and would suggest deploying at least one more mgr.
>>>> 
>>>>>    mds: 19/19 daemons up, 7 standby
>>>> 
>>>> Yikes why so many?
>>>> 
>>>>>    osd: 181 osds: 181 up (since 4d), 181 in (since 14h)
>>>> 
>>>> What happened 14 hours ago?  It seems unusual for these durations to vary 
>>>> so much.
>>>> 
>>>>> 2770 remapped pgs
>>>> 
>>>> That’s an indication of balancing or backfill in progress.
>>>> 
>>>>>         flags noautoscale
>>>>> 
>>>>>  data:
>>>>>    volumes: 4/4 healthy
>>>>>    pools:   16 pools, 7137 pgs
>>>>>    objects: 256.82M objects, 484 TiB
>>>>>    usage:   742 TiB used, 1.5 PiB / 2.2 PiB avail
>>>>>    pgs:     575889786/1468742421 objects misplaced (39.210%)
>>>> 
>>>> 39% is a lot of misplaced objects, this would be consistent with you 
>>>> having successfully added those OSDs.
>>>> Here is where the factor of the most-full OSD comes in.
>>>> 
>>>> Technically backfill is a subset of recovery, but in practice people 
>>>> usually think in terms:
>>>> 
>>>> Recovery: PGs healing from OSDs having failed or been down
>>>> Backfill: Rebalancing of data due to topology changes, including adjusted 
>>>> CRUSH rules, expansion, etc.
>>>> 
>>>> 
>>>>>             4247 active+clean
>>>>>             2763 active+remapped+backfill_wait
>>>>>             77   active+clean+scrubbing
>>>>>             43   active+clean+scrubbing+deep
>>>>>             7    active+remapped+backfilling
>>>> 
>>>> Configuration options throttle how much backfill goes on in parallel to 
>>>> keep the cluster from DoSing itself.  Here I suspect that you’re running a 
>>>> recent release with the notorious mclock op scheduling shortcomings, which 
>>>> is a tangent.
>>>> 
>>>> 
>>>> I suggest checking out these two resources re upmap-remapped.py :
>>>> 
>>>> https://ceph.io/assets/pdfs/events/2024/ceph-days-nyc/Mastering%20Ceph%20Operations%20with%20Upmap.pdf
>>>> https://community.ibm.com/community/user/blogs/anthony-datri/2025/07/30/gracefully-expanding-your-ibm-storage-ceph
>>>>  
>>>> 
>>>> 
>>>> This tool, in conjunction with the balancer module, will do the backfill 
>>>> more elegantly with various benefits.
>>>> 
>>>> 
>>>>> --- RAW STORAGE ---
>>>>> CLASS          SIZE    AVAIL     USED  RAW USED  %RAW USED
>>>>> hdd_class   1.4 PiB  814 TiB  579 TiB   579 TiB      41.54
>>>> 
>>>> I hope the formatting below comes through, makes it a lot easier to read a 
>>>> table.
>>>> 
>>>>> 
>>>>> --- POOLS ---
>>>>> POOL                        ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
>>>>> .mgr                         1     1  277 MiB       71  831 MiB      0     93 TiB
>>>>> .rgw.root                    2    32  1.6 KiB        6   72 KiB      0     93 TiB
>>>>> default.rgw.log              3    32   63 KiB      210  972 KiB      0     93 TiB
>>>>> default.rgw.control          4    32      0 B        8      0 B      0     93 TiB
>>>>> default.rgw.meta             5    32  1.4 KiB        8   72 KiB      0     93 TiB
>>>>> default.rgw.buckets.data     6  2048  289 TiB  100.35M  434 TiB  68.04    136 TiB
>>>>> default.rgw.buckets.index    7  1024  5.4 GiB      521   16 GiB      0     93 TiB
>>>>> default.rgw.buckets.non-ec   8    32    551 B        1   13 KiB      0     93 TiB
>>>>> metadata_fs_ssd              9   128  6.1 GiB   15.69M   18 GiB      0     93 TiB
>>>>> ssd_ec_project              10  1024  108 TiB   44.46M  162 TiB  29.39    260 TiB
>>>>> metadata_fs_hdd             11   128  9.9 GiB    8.38M   30 GiB   0.01     93 TiB
>>>>> hdd_ec_archive              12  1024   90 TiB   87.94M  135 TiB  39.79    136 TiB
>>>>> metadata_fs_nvme            13    32  260 MiB      177  780 MiB      0     93 TiB
>>>>> metadata_fs_ssd_rep         14    32   17 MiB      103   51 MiB      0     93 TiB
>>>>> ssd_rep_projects            15  1024    132 B        1   12 KiB      0    130 TiB
>>>>> nvme_rep_projects           16   512  3.5 KiB       30  336 KiB      0     93 TiB
>>>> 
>>>> Do you have multiple EC RBD pools and/or multiple CephFSes?
>>>> 
>>>> 
>>>>> ID   CLASS       WEIGHT      REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
>>>>> -1              2272.93311         -   2.2 PiB  742 TiB  731 TiB   63 GiB  3.1 TiB  1.5 PiB  32.64  1.00    -          root default
>>>>> -7               254.54175         -   255 TiB  104 TiB  102 TiB  9.6 GiB  455 GiB  151 TiB  40.78  1.25    -              host ceph-host-1
>>>>> ...
>>>>> 137   hdd_class    18.43300   1.00000   18 TiB   14 TiB   14 TiB    6 KiB  50 GiB   4.3 TiB  76.47  2.34  449      up          osd.137
>>>>> 152   hdd_class    18.19040   1.00000   18 TiB  241 GiB  239 GiB   10 KiB  1.8 GiB   18 TiB   1.29  0.04    7      up          osd.152
>>>>> 3.1 TiB  1.5 PiB  32.64
>>>>> MIN/MAX VAR: 0.00/2.46  STDDEV: 26.17
>>>> 
>>>> There ya go.  osd.152 must be one of the new OSDs.  Note that only 7 PGs 
>>>> are currently resident and that it holds just 4% of the average amount of 
>>>> data on the entire set of OSDs.
>>>> Run the focused `osd df` above and that number will change slightly.
>>>> 
>>>> Here is your least full hdd_class OSD:
>>>> 
>>>> 151   hdd_class    18.19040   1.00000   18 TiB   38 GiB   37 GiB    6 KiB  1.1 GiB   18 TiB   0.20  0.01    1      up          osd.151
>>>> 
>>>> And the most full:
>>>> 
>>>> 180   hdd_class    18.19040   1.00000   18 TiB  198 GiB  197 GiB   10 KiB  1.7 GiB   18 TiB   1.07  0.03    5      up          osd.180
>>>> 
>>>> 
>>>> I suspect that the most-full is at 107% of average due to the bolus of 
>>>> backfill and/or the balancer not being active.  Using upmap-remapped as 
>>>> described above can help avoid this kind of overload.
>>>> 
>>>> In a nutshell, the available space will gradually increase as data is 
>>>> backfilled, especially if you have the balancer enabled.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>> 

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
