> On Sep 2, 2025, at 6:36 AM, Steven Vacaroaia <ste...@gmail.com> wrote:
>
> Thanks again Anthony
>
> I guess I assumed wrong that "osd.all-available-devices" includes all
> available devices
I *think* the last match wins when there are multiple OSD specs.  Please excerpt them from `ceph orch ls --export` so we can see them in detail.

> It seems I should have targeted specific ones (like osd.hdd_osds)
>
> Going back to the capacity increase
> I am trying to estimate / calculate the final capacity
>
> Looking at the output of "ceph df" below
> it would be greatly appreciated if you could please confirm
> if this is the correct interpretation for hdd_class :
>
> "Out of a total capacity of 1.4 PiB you currently use 579 TiB .

Yes.

> The 579 TiB used are the sum of the data used by pools configured to use
> hdd_class e.g osd.all-available-devices (434) , hdd_ec_archive
> (135)...etc

Yes.  Note that if you have any pools using a CRUSH rule that does not specify a device class, they will use *all* OSDs and hilarity will ensue.

> Therefore, when the backfilling / balancing is done
> (and assuming no more data is added to the pools)
> the amount of pools "MAX Available" should be 814 TB

Ah.  Not so fast.  814 TiB is aggregate raw capacity.  Max avail is nuanced.

This can be slippery to wrap one's mind around.  If your mind is pre-warped, that helps.

Max avail is *per pool*.  It indicates the projected amount of user data that may be written to that pool *if* no other pools mapped to the same OSDs gain or lose data.  It's a zero-sum game.

Note for example that .mgr and default.rgw.buckets.index (and a slew of others) all report 93 TiB max avail.  They're all (presumably) replicated size=3 and on the same OSDs.  So that *set* of pools can accept ~ 93 TiB of new data split any way across them.  This is a function of the delta between the 90% full_ratio and the fullness of the most-full current OSD (28.58%), divided by 3 for replication.

It becomes a bit more complicated for the hdd_class, where it would seem that EC is in effect.  It's also a function of the CRUSH topology and rules.  I see, for example, that ceph-host-1 offers ~ 98 CRUSH weight of hdd_class, whereas the other hosts each offer ~ 226 CRUSH weight of hdd_class.  Your latest message doesn't show the CRUSH rules and pool attributes, but I suspect multiple EC profiles are in use, which complicates the calculation a bit.

You have 7 hosts and thus likely 7 CRUSH failure domains.  If you have, say, a 5+2 EC pool, that pool's max avail will be limited by the lowest CRUSH weight of the failure domains, since each PG will need to hit one OSD on every host.  A 4+2 pool, however, would have more flexibility in placement and be less constrained by the unequal distribution.

Why does ceph-host-1 have fewer HDDs than the other hosts?

> "
>
> Is the above correct ?
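[Editor's note: to pin down the CRUSH-rule / EC-profile side of the calculation discussed above, something like the following would pull the missing detail.  This is only a sketch: the pool name is taken from the `ceph df` output quoted in this thread, and <profile> is a placeholder for whatever erasure_code_profile the pool reports.]

# Export the OSD service specs so we can see which placement filters matched:
ceph orch ls osd --export

# Which CRUSH rule and EC profile a given pool uses:
ceph osd pool get default.rgw.buckets.data crush_rule
ceph osd pool get default.rgw.buckets.data erasure_code_profile
# <profile> = the name reported by the previous command
ceph osd erasure-code-profile get <profile>

# Per-host CRUSH weight broken out by device class (shadow trees),
# useful for confirming the hdd_class imbalance on ceph-host-1:
ceph osd crush tree --show-shadow

With the k+m values and the per-host hdd_class weights in hand, the per-pool MAX AVAIL figures in `ceph df` become much easier to sanity-check.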
>
> Note
> The "MAX available" is increasing - 202 TB now

That's a function of the balancer reducing how full the most-full OSD of that device class is.

> Steven
>
>
> --- RAW STORAGE ---
> CLASS       SIZE     AVAIL    USED     RAW USED  %RAW USED
> hdd_class   1.4 PiB  814 TiB  579 TiB  579 TiB       41.54
> nvme_class  293 TiB  293 TiB  216 GiB  216 GiB        0.07
> ssd_class   587 TiB  424 TiB  163 TiB  163 TiB       27.80
> TOTAL       2.2 PiB  1.5 PiB  742 TiB  742 TiB       32.64
>
> --- POOLS ---
> POOL                        ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
> .mgr                         1     1  277 MiB       71  831 MiB      0     93 TiB
> .rgw.root                    2    32  1.6 KiB        6   72 KiB      0     93 TiB
> default.rgw.log              3    32   63 KiB      210  972 KiB      0     93 TiB
> default.rgw.control          4    32      0 B        8      0 B      0     93 TiB
> default.rgw.meta             5    32  1.4 KiB        8   72 KiB      0     93 TiB
> default.rgw.buckets.data     6  2048  289 TiB  100.35M  434 TiB  68.04    136 TiB
> default.rgw.buckets.index    7  1024  5.4 GiB      521   16 GiB      0     93 TiB
> default.rgw.buckets.non-ec   8    32    551 B        1   13 KiB      0     93 TiB
> metadata_fs_ssd              9   128  6.1 GiB   15.69M   18 GiB      0     93 TiB
> ssd_ec_project              10  1024  108 TiB   44.46M  162 TiB  29.39    260 TiB
> metadata_fs_hdd             11   128  9.9 GiB    8.38M   30 GiB   0.01     93 TiB
> hdd_ec_archive              12  1024   90 TiB   87.94M  135 TiB  39.79    136 TiB
>
> On Mon, 1 Sept 2025 at 17:07, Anthony D'Atri <a...@dreamsnake.net <mailto:a...@dreamsnake.net>> wrote:
>>
>>> Hi,
>>> Thanks Anthony - as always a very useful and comprehensive response
>>
>> Most welcome.
>>
>>> yes, there were only 139 OSD and, indeed, raw capacity increased
>>> I also noticed that the "max available" column from "ceph df" is getting
>>> higher (177 TB) so, it seems the capacity is being added
>>
>> Aye, as expected.
>>
>>> Too many MDS daemons are due to the fact that I have over 150 CEPHFS clients
>>> and thought that deploying a daemon on each host for each filesystem is going to
>>> provide better performance - was I wrong ?
>>
>> MDS strategy is nuanced, but I think deploying them redundantly on each
>> back-end node might not give you much.  The workload AIUI is more a
>> function of the number and size of files.
>>
>>> High number of remapped PGs are due to improperly using
>>> osd.all-available-devices unmanaged command
>>> so adding the new drives triggered automatically detecting and adding them
>>
>> Yep, I typically suggest setting OSD services to unmanaged when not actively
>> using them.
>>
>>> I am not sure what I did wrong though - see below output from "ceph orch
>>> ls" before adding the drive
>>> Shouldn't setting it like that prevent automatic discovery ?
>>
>> You do have three other OSD rules, at least one of them likely matched and
>> took action.
>>
>>> osd.all-available-devices    0               -   4w  <unmanaged>
>>> osd.hdd_osds                72   10m ago         5w  *
>>> osd.nvme_osds               25   10m ago         5w  *
>>> osd.ssd_osds                84   10m ago         3w  ceph-host-1
>>
>> `ceph orch ls --export`
>>
>>> Resources mentioned are very useful
>>> running upmap-remapped.py did bring the number of PGs to be remapped close
>>> to zero
>>
>> It's a super, super useful tool.  Letting the balancer do the work
>> incrementally helps deter unexpected OSD fullness issues and if a problem
>> arises, you're a lot closer to HEALTH_OK.
>>
>>> 2. upmap-remapped.py | sh
>>
>> Sometimes 2-3 runs are needed for full effect.
>>
>>> 3. change target_max_misplaced_ratio to a higher number than default 0.005
>>> (since we want to rebalance faster and client performance is not a huge
>>> issue )
>>>
>>> 4.
enable balancer >>> >>> 5.wait >>> >>> Doing it like this will, eventually, increase the number misplaced PGs >>> until it is higher than the ratio when , I guess, the balancer stops >>> ( "optimize_result": "Too many objects (0.115742 > 0.090000) are >>> misplaced; try again later ) >> >> Exactly. >> >>> Should I repeat the process when the number of objects misplaced is higher >>> than the ratio or what is the proper way of doing it ? >> >> As the backfill progresses and the misplaced percentage drops, the balancer >> will kick in with another increment. >> >>> >>> Steven >>> >>> >>> On Sun, 31 Aug 2025 at 11:08, Anthony D'Atri <a...@dreamsnake.net >>> <mailto:a...@dreamsnake.net>> wrote: >>>> >>>> >>>>> On Aug 31, 2025, at 4:15 AM, Steven Vacaroaia <ste...@gmail.com >>>>> <mailto:ste...@gmail.com>> wrote: >>>>> >>>>> Hi, >>>>> >>>>> I have added 42 x 18TB HDD disks ( 6 on each of the 7 servers ) >>>> >>>> The ultimate answer to Ceph, the cluster, and everything! >>>> >>>>> My expectation was that the pools configured to use "hdd_class" will >>>>> have their capacity increased ( e.g. default.rgw.buckets.data which is >>>>> uses an EC 4+2 pool for data ) >>>> >>>> First, did the raw capacity increase when you added these drives? >>>> >>>>> --- RAW STORAGE --- >>>>> CLASS SIZE AVAIL USED RAW USED %RAW USED >>>>> hdd_class 1.4 PiB 814 TiB 579 TiB 579 TiB 41.54 >>>> >>>> Was the number of OSDs previously 139? >>>> >>>>> It seems it is not happening ...yet ?! >>>>> Is it because the peering is still going ? >>>> >>>> Ceph nomenclature can be mystifying at first. And sometimes at thirteenth. >>>> >>>> Peering is daemons checking in with each other to ensure they’re in >>>> agreement. >>>> >>>> I think you mean backfill / balancing. >>>> >>>> The available space reported by “ceph df” for a *pool* is a function of: >>>> >>>> * Raw space available in the associated CRUSH rule’s device class (or if >>>> the rule isn’t ideal, all device classes) >>>> * The cluster’s three full ratios # ceph osd dump | grep ratio >>>> * The fullness of the single most-full OSD in the device class >>>> >>>> BTW I learned only yesterday that you can restrict `ceph osd df` by >>>> specifying a device class, so try running >>>> >>>> `ceph osd df hdd_class | tail -10` >>>> >>>> Notably, this will show you the min/max variance among OSDs of just that >>>> device class, and the standard deviation. >>>> When you have multiple OSD sizes, these figures are much less useful when >>>> calculated across the whole cluster by “ceph osd df” >>>> >>>> # ceph osd df hdd >>>> ... >>>> 318 hdd 18.53969 1.00000 19 TiB 15 TiB 15 TiB 15 KiB 67 GiB >>>> 3.6 TiB 80.79 1.04 127 up >>>> 319 hdd 18.53969 1.00000 19 TiB 15 TiB 14 TiB 936 KiB 60 GiB >>>> 3.7 TiB 79.87 1.03 129 up >>>> 320 hdd 18.53969 1.00000 19 TiB 15 TiB 14 TiB 33 KiB 72 GiB >>>> 3.7 TiB 79.99 1.03 129 up >>>> 30 hdd 18.53969 1.00000 19 TiB 3.3 TiB 2.9 TiB 129 KiB 11 >>>> GiB 15 TiB 17.55 0.23 26 up >>>> TOTAL 5.4 PiB 4.2 PiB 4.1 PiB 186 MiB 17 TiB >>>> 1.2 PiB 77.81 >>>> MIN/MAX VAR: 0.23/1.09 STDDEV: 4.39 >>>> >>>> You can even run this for a specific OSD so you don’t have to get creative >>>> with an egrep regex or exercise your pattern-matching skills, though the >>>> summary values naturally aren’t useful. 
>>>> >>>> # ceph osd df osd.30 >>>> ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META >>>> AVAIL %USE VAR PGS STATUS >>>> 30 hdd 18.53969 1.00000 19 TiB 3.3 TiB 2.9 TiB 129 KiB 12 GiB >>>> 15 TiB 17.55 1.00 26 up >>>> TOTAL 19 TiB 3.3 TiB 2.9 TiB 130 KiB 12 GiB >>>> 15 TiB 17.55 >>>> MIN/MAX VAR: 1.00/1.00 STDDEV: 0 >>>> >>>> Here there’s a wide variation among the hdd OSDs because osd.30 had been >>>> down for a while and was recently restarted due to a host reboot, so it’s >>>> slowly filling with data. >>>> >>>> >>>>> ssd_class 6.98630 >>>> >>>> That seems like an unusual size, what are these? Are they SAN LUNs? >>>> >>>>> Below are outputs from >>>>> ceph -s >>>>> ceph df >>>>> ceph osd df tree >>>> >>>> Thanks for providing the needful up front. >>>> >>>>> cluster: >>>>> id: 0cfa836d-68b5-11f0-90bf-7cc2558e5ce8 >>>>> health: HEALTH_WARN >>>>> 1 OSD(s) experiencing slow operations in BlueStore >>>> >>>> This warning state by default persists for a long time after it clears, >>>> I’m not sure why but I like to set this lower: >>>> >>>> # ceph config dump | grep blue >>>> global advanced >>>> bluestore_slow_ops_warn_lifetime 300 >>>> >>>> >>>> >>>>> 1 failed cephadm daemon(s) >>>>> 39 daemons have recently crashed >>>> >>>> That’s a bit worrisome, what happened? >>>> >>>> `ceph crash ls` >>>> >>>> >>>>> 569 pgs not deep-scrubbed in time >>>>> 2609 pgs not scrubbed in time >>>> >>>> Scrubs don’t happen during recovery, when complete these should catch up. >>>> >>>>> services: >>>>> mon: 5 daemons, quorum >>>>> ceph-host-1,ceph-host-2,ceph-host-3,ceph-host-7,ceph-host-6 (age 2m) >>>>> mgr: ceph-host-1.lqlece(active, since 18h), standbys: >>>>> ceph-host-2.suiuxi >>>> >>>> I’m paranoid and would suggest deploying at least one more mgr. >>>> >>>>> mds: 19/19 daemons up, 7 standby >>>> >>>> Yikes why so many? >>>> >>>>> osd: 181 osds: 181 up (since 4d), 181 in (since 14h) >>>> >>>> What happened 14 hours ago? It seems unusual for these durations to vary >>>> so much. >>>> >>>>> 2770 remapped pgs >>>> >>>> That’s an indication of balancing or backfill in progress. >>>> >>>>> flags noautoscale >>>>> >>>>> data: >>>>> volumes: 4/4 healthy >>>>> pools: 16 pools, 7137 pgs >>>>> objects: 256.82M objects, 484 TiB >>>>> usage: 742 TiB used, 1.5 PiB / 2.2 PiB avail >>>>> pgs: 575889786/1468742421 objects misplaced (39.210%) >>>> >>>> 39% is a lot of misplaced objects, this would be consistent with you >>>> having successfully added those OSDs. >>>> Here is where the factor of the most-full OSD comes in. >>>> >>>> Technically backfill is a subset of recovery, but in practice people >>>> usually think in terms: >>>> >>>> Recovery: PGs healing from OSDs having failed or been down >>>> Backfill: Rebalancing of data due to topology changes, including adjusted >>>> CRUSH rules, expansion, etc. >>>> >>>> >>>>> 4247 active+clean >>>>> 2763 active+remapped+backfill_wait >>>>> 77 active+clean+scrubbing >>>>> 43 active+clean+scrubbing+deep >>>>> 7 active+remapped+backfilling >>>> >>>> Configuration options throttle how much backfill goes on in parallel to >>>> keep the cluster from DoSing itself. Here I suspect that you’re running a >>>> recent release with the notorious mclock op scheduling shortcomings, which >>>> is a tangent. 
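[Editor's note: on the mclock point above, a rough sketch of how to check which op scheduler is active and, if needed, bias it toward recovery or take the classic backfill throttles back under manual control.  Option names are as of recent (Quincy/Reef-era) releases; verify them against your version before changing anything.]

# Which op scheduler and mclock profile the OSDs are running:
ceph config get osd osd_op_queue
ceph config get osd osd_mclock_profile

# One lever: let mclock favor recovery/backfill over client I/O:
ceph config set osd osd_mclock_profile high_recovery_ops

# Or override mclock and drive the classic knob by hand:
ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 2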
>>>> >>>> >>>> I suggest checking out these two resources re upmap-remapped.py : >>>> >>>> https://ceph.io/assets/pdfs/events/2024/ceph-days-nyc/Mastering Ceph >>>> Operations with Upmap.pdf >>>> <https://ceph.io/assets/pdfs/events/2024/ceph-days-nyc/Mastering%20Ceph%20Operations%20with%20Upmap.pdf> >>>> https://community.ibm.com/community/user/blogs/anthony-datri/2025/07/30/gracefully-expanding-your-ibm-storage-ceph >>>> >>>> >>>> >>>> This tool, in conjunction with the balancer module, will do the backfill >>>> more elegantly with various benefits. >>>> >>>> >>>>> --- RAW STORAGE --- >>>>> CLASS SIZE AVAIL USED RAW USED %RAW USED >>>>> hdd_class 1.4 PiB 814 TiB 579 TiB 579 TiB 41.54 >>>> >>>> I hope the formatting below comes through, makes it a lot easier to read a >>>> table. >>>> >>>>> >>>>> --- POOLS --- >>>>> POOL ID PGS STORED OBJECTS USED %USED >>>>> MAX AVAIL >>>>> .mgr 1 1 277 MiB 71 831 MiB 0 93 >>>>> TiB >>>>> .rgw.root 2 32 1.6 KiB 6 72 KiB 0 93 >>>>> TiB >>>>> default.rgw.log 3 32 63 KiB 210 972 KiB 0 93 >>>>> TiB >>>>> default.rgw.control 4 32 0 B 8 0 B 0 93 >>>>> TiB >>>>> default.rgw.meta 5 32 1.4 KiB 8 72 KiB 0 93 >>>>> TiB >>>>> default.rgw.buckets.data 6 2048 289 TiB 100.35M 434 TiB 68.04 >>>>> 136 TiB >>>>> default.rgw.buckets.index 7 1024 5.4 GiB 521 16 GiB 0 93 >>>>> TiB >>>>> default.rgw.buckets.non-ec 8 32 551 B 1 13 KiB 0 93 >>>>> TiB >>>>> metadata_fs_ssd 9 128 6.1 GiB 15.69M 18 GiB 0 93 >>>>> TiB >>>>> ssd_ec_project 10 1024 108 TiB 44.46M 162 TiB 29.39 >>>>> 260 TiB >>>>> metadata_fs_hdd 11 128 9.9 GiB 8.38M 30 GiB 0.01 93 >>>>> TiB >>>>> hdd_ec_archive 12 1024 90 TiB 87.94M 135 TiB 39.79 >>>>> 136 TiB >>>>> metadata_fs_nvme 13 32 260 MiB 177 780 MiB 0 93 >>>>> TiB >>>>> metadata_fs_ssd_rep 14 32 17 MiB 103 51 MiB 0 93 >>>>> TiB >>>>> ssd_rep_projects 15 1024 132 B 1 12 KiB 0 >>>>> 130 TiB >>>>> nvme_rep_projects 16 512 3.5 KiB 30 336 KiB 093 >>>>> TiB >>>> >>>> Do you have multiple EC RBD pools and/or multiple CephFSes? >>>> >>>> >>>>> ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP >>>>> META AVAIL %USE VAR PGS STATUS TYPE NAME >>>>> -1 2272.93311 - 2.2 PiB 742 TiB 731 TiB 63 GiB >>>>> 3.1 TiB 1.5 PiB 32.64 1.00 - root default >>>>> -7 254.54175 - 255 TiB 104 TiB 102 TiB 9.6 GiB >>>>> 455 GiB 151 TiB 40.78 1.25 - host ceph-host-1 >>>> >>>>> ... >>>> >>>>> 137 hdd_class 18.43300 1.00000 18 TiB 14 TiB 14 TiB 6 KiB >>>>> 50 GiB 4.3 TiB 76.47 2.34 449 up osd.137 >>>>> 152 hdd_class 18.19040 1.00000 18 TiB 241 GiB 239 GiB 10 KiB >>>>> 1.8 GiB 18 TiB 1.29 0.04 7 up osd.152 >>>>> 3.1 TiB 1.5 PiB 32.64 >>>>> MIN/MAX VAR: 0.00/2.46 STDDEV: 26.17 >>>> >>>> There ya go. osd.152 must be one of the new OSDs. Note that only 7 PGs >>>> are currently resident and that it holds just 4% of the average amount of >>>> data on the entire set of OSDs. >>>> Run the focused `osd df` above and that number will change slightly. >>>> >>>> Here is your least full hdd_class OSD: >>>> >>>> 151 hdd_class 18.19040 1.00000 18 TiB 38 GiB 37 GiB 6 KiB >>>> 1.1 GiB 18 TiB 0.20 0.01 1 up osd.151 >>>> >>>> And the most full: >>>> >>>> 180 hdd_class 18.19040 1.00000 18 TiB 198 GiB 197 GiB 10 KiB >>>> 1.7 GiB 18 TiB 1.07 0.03 5 up osd.180 >>>> >>>> >>>> I suspect that the most-full is at 107% of average due to the bolus of >>>> backfill and/or the balancer not being active. Using upmap-remapped as >>>> described above can help avoid this kind of overload. >>>> >>>> In a nutshell, the available space will gradually increase as data is >>>> backfilled, especially if you have the balancer enabled. 
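[Editor's note: one way to keep an eye on that process is sketched below.  The commands are standard balancer/config commands; the 0.10 ratio is only an example of "a higher number than the default", not a recommendation.]

# Confirm the balancer is enabled and see why it last paused
# (the "Too many objects ... misplaced" message quoted earlier comes from here):
ceph balancer status

# Allow each balancer increment to move a larger fraction of PGs:
ceph config set mgr target_max_misplaced_ratio 0.10

# Watch backfill drain the misplaced percentage over time:
ceph -s | grep -E 'misplaced|backfill'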
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io