Hi,
I can't say that we upgraded a lot of clusters from N to P, but those
we did upgrade didn't show any of the symptoms you describe. However,
we always did the FileStore to BlueStore conversion before the actual
upgrade. In SUSE Enterprise Storage (which we also supported at that
time) this was pointed out as a requirement. I just checked the Ceph
docs but can't find such a statement (yet).
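For what it's worth, a minimal sketch of such a per-OSD conversion
(OSD ID and device are placeholders, not our exact procedure, and the
cluster should recover before moving on to the next host):

  # drain and remove the FileStore OSD, keeping its ID
  ceph osd out <ID>
  systemctl stop ceph-osd@<ID>
  ceph osd destroy <ID> --yes-i-really-mean-it
  # wipe the old device and recreate it as BlueStore, reusing the ID
  ceph-volume lvm zap /dev/sdX --destroy
  ceph-volume lvm create --bluestore --data /dev/sdX --osd-id <ID>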
All our pools are of the following type: replicated size 3 min_size
1 crush_rule 0 (or 1).
I would recommend increasing min_size to 2, otherwise you let Ceph
lose two of three replicas before pausing IO, which can make recovery
difficult. Reducing min_size to 1 should only be a temporary measure
to avoid stalling client IO during recovery.
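A minimal example, assuming a pool name of <pool> (check your pools
with 'ceph osd pool ls detail' first):

  # pause IO after losing two replicas instead of running on one copy
  ceph osd pool set <pool> min_size 2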
Regards,
Eugen
Quoting Olivier Delcourt <olivier.delco...@uclouvain.be>:
Hi,
After reading your posts for years, I feel compelled to ask for your
help/advice. First, I need to explain the context of our CEPH
cluster, the problems we have encountered, and finally, my questions.
Thank you for taking the time to read this.
Cheers,
Olivier
*** Background ***
Our CEPH cluster was created in 2015 with version 0.94.x (Hammer),
which has been upgraded over time to version 10.2.x (Jewel), then
12.x (Luminous) and then 14.x (Nautilus). The MONitors and
CEPHstores have always run on Linux Debian, with versions updated
according to the requirements for supporting the underlying hardware
and/or CEPH releases.
In terms of hardware, we have three monitors (cephmon) and 30
storage servers (cephstore) spread across three datacenters. These
servers are connected to the network via an aggregate (LACP) of two
10 Gbps fibre connections, through which two VLANs pass, one for the
CEPH frontend network and one for the CEPH backend network. In doing
so, we have always given ourselves the option of separating the
frontend and backend into dedicated aggregates if the bandwidth
becomes insufficient.
Each of the storage servers comes with HDDs whose size varies
depending on the server generation, as well as SSDs whose size is
more consistent but still varies (depending on price).
The idea has always been to add HDD and SSD storage to the CEPH
cluster when we add storage servers to expand it or replace old
ones. At the OSD level, the basic rule has always been followed: one
device = one OSD, with metadata (FileStore) on dedicated partitioned
SSDs (up to 6 for 32 OSDs) and, for the past few years, on a
partitioned NVMe RAID1 (MD).
In total, we have:
100 hdd 10.90999 TB
48 hdd 11.00000 TB
48 hdd 14.54999 TB
24 hdd 15.00000 TB
9 hdd 5.45999 TB
108 hdd 9.09999 TB
84 ssd 0.89400 TB
198 ssd 0.89424 TB
18 ssd 0.93599 TB
32 ssd 1.45999 TB
16 ssd 1.50000 TB
48 ssd 1.75000 TB
24 ssd 1.79999 TB
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    3.6 PiB  1.7 PiB  1.9 PiB  1.9 PiB       53.45
ssd    480 TiB  321 TiB  158 TiB  158 TiB       33.04
TOTAL  4.0 PiB  2.0 PiB  2.1 PiB  2.1 PiB       51.08
Regarding the CRUSH map: since device classes (ssd/hdd) did not exist
when the CEPH cluster was launched, and we wanted to be able to
create pools on either disk or flash storage, we created two trees:
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-2 3660.19751 root main_storage
-11 1222.29883 datacenter DC1
-68 163.79984 host cephstore16
-280 109.09988 host cephstore28
-20 109.09988 host cephstore34
-289 109.09988 host cephstore31
-31 116.39990 host cephstore40
-205 116.39990 host cephstore37
-81 109.09988 host cephstore22
-71 163.79984 host cephstore19
-84 109.09988 host cephstore25
-179 116.39990 host cephstore43
-12 1222.29883 datacenter DC2
-69 163.79984 host cephstore17
-82 109.09988 host cephstore23
-295 109.09988 host cephstore32
-72 163.79984 host cephstore20
-283 109.09988 host cephstore29
-87 109.09988 host cephstore35
-85 109.09988 host cephstore26
-222 116.39990 host cephstore44
-36 116.39990 host cephstore41
-242 116.39990 host cephstore38
-25 1215.59998 datacenter DC3
-70 163.80000 host cephstore18
-74 163.80000 host cephstore21
-83 99.00000 host cephstore24
-86 110.00000 host cephstore27
-286 110.00000 host cephstore30
-298 99.00000 host cephstore33
-102 110.00000 host cephstore36
-304 120.00000 host cephstore39
-136 120.00000 host cephstore42
-307 120.00000 host cephstore45
-1 516.06305 root high-speed_storage
-21 171.91544 datacenter xDC1
-62 16.84781 host xcephstore16
-259 14.00000 host xcephstore28
-3 14.00000 host xcephstore34
-268 14.00000 host xcephstore31
-310 14.30786 host xcephstore40
-105 14.30786 host xcephstore37
-46 30.68784 host xcephstore10
-75 11.67993 host xcephstore22
-61 16.09634 host xcephstore19
-78 11.67993 host xcephstore25
-322 14.30786 host xcephstore43
-15 171.16397 datacenter xDC2
-63 16.09634 host xcephstore17
-76 11.67993 host xcephstore23
-274 14.00000 host xcephstore32
-65 16.09634 host xcephstore20
-262 14.00000 host xcephstore29
-13 14.00000 host xcephstore35
-79 11.67993 host xcephstore26
-51 30.68784 host xcephstore11
-325 14.30786 host xcephstore44
-313 14.30786 host xcephstore41
-175 14.30786 host xcephstore38
-28 172.98364 datacenter xDC3
-56 30.68784 host xcephstore12
-64 16.09200 host xcephstore18
-67 16.09200 host xcephstore21
-77 12.00000 host xcephstore24
-80 12.00000 host xcephstore27
-265 14.39999 host xcephstore30
-277 14.39990 host xcephstore33
-17 14.39999 host xcephstore36
-204 14.30399 host xcephstore39
-319 14.30399 host xcephstore42
-328 14.30396 host xcephstore45
Our allocation rules are:
# rules
rule main_storage_ruleset {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take main_storage
        step chooseleaf firstn 0 type datacenter
        step emit
}
rule high-speed_storage_ruleset {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take high-speed_storage
        step chooseleaf firstn 0 type datacenter
        step emit
}
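(For context, a pool is attached to one of these rules with something
along the following lines; the pool name is only an example:

  # bind an existing pool to the flash-backed rule
  ceph osd pool set <pool> crush_rule high-speed_storage_ruleset
)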
All our pools are of the following type: replicated size 3 min_size
1 crush_rule 0 (or 1).
This CEPH cluster is currently only used for RBD. The volumes are
used by our ~ 1,200 KVM VMs.
*** Problems ***
Everything was working fine until last August, when we scheduled an
upgrade from CEPH 14.x (Nautilus) to 16.x (Pacific) (and an upgrade
from Debian 10 to Debian 11, which was not a problem).
* First problem:
We were forced to switch from FileStore to BlueStore in an emergency
and unscheduled manner because after upgrading the CEPH packages on
the first storage server, the FileStore OSDs would no longer start.
We did not have this problem on our small test cluster, which
obviously did not have the ‘same upgrade life’ as the production
cluster. We therefore took the opportunity, DC by DC (since this is
our ‘failure domain’), not only to update CEPH but also to recreate
the OSDs in BlueStore.
* Second problem:
Since our failure domain is a DC, we had to upgrade a DC and then
wait for it to recover (~500 TB net). SSD storage recovery takes a
few hours, while HDD storage recovery takes approximately three days.
During recovery, our SSD-class OSDs filled up at a rate of ~2% every
3 hours (the phenomenon was also observed on HDD-class OSDs, but
since we have much more capacity there, it was less critical).
Manual (re)weight changes only provided temporary relief and, despite
all our attempts (OSD restarts, etc.), we reached the critical
full_ratio threshold, which is 0.97 in our case.
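(For completeness, per-OSD fill levels and the configured ratios can
be checked with the commands below; the set-full-ratio value is only
an example of a last-resort emergency valve, raising it is risky and
it has to be reverted as soon as possible:

  # per-OSD fill level and variance
  ceph osd df tree
  # current nearfull/backfillfull/full ratios
  ceph osd dump | grep ratio
  # emergency headroom only, revert afterwards
  ceph osd set-full-ratio 0.98
)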
I'll leave you to imagine the effect on the virtual machines and the
services provided to our users.
We also saw very strong growth in the size of the MONitor databases
(~3 GB -> 100 GB); compaction did not really help.
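(The compaction we refer to was along the lines of the following, run
per monitor; the mon ID is a placeholder:

  # trigger a compaction of one monitor's store
  ceph tell mon.<id> compact
)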
Once our VMs were shut down (crashed), the cluster completed its
recovery (HDD-type OSDs) and, curiously, the SSD-type OSDs began to
‘empty’.
The day after that, we began updating the storage servers in our
second DC, and the phenomenon started again. This time we did not
wait until we reached the full_ratio to shut down our virtualisation
environment, and the ‘SSD’ OSDs began to ‘empty’ after the following
commands: ceph osd unset noscrub && ceph osd unset nodeep-scrub.
We have always disabled scrubs and deep scrubs during large upgrades
and recoveries to save I/O; this never caused any problems with
FileStore.
It should be added that, since we started using the CEPH cluster
(2015), scrubs have only been enabled at night so as not to impact
production I/O, via the following options: osd_recovery_delay_start
= 5, osd_scrub_begin_hour = 19, osd_scrub_end_hour = 7,
osd_scrub_sleep = 0.1 (the latter may be removed now that device
classes are available).
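(On Nautilus and later, the same settings can also be applied
centrally through the config store instead of per-host ceph.conf
entries; just a sketch using our values:

  ceph config set osd osd_scrub_begin_hour 19
  ceph config set osd osd_scrub_end_hour 7
  ceph config set osd osd_scrub_sleep 0.1
)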
After this second total recovery of the CEPH cluster and the restart
of the virtualisation environment, we still have the third DC (10
cephstore) to upgrade from CEPH 14 to 16, and our ‘SSD’ OSDs were
filling up again until the automatic activation of scrubs/deep-scrubs
at 7 p.m. Since then, the growth has stopped and the utilisation of
the various OSDs is stable and more or less evenly distributed (via
the active upmap balancer).
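(For reference, the balancer configuration can be checked and enabled
like this; ours is already running in upmap mode:

  ceph balancer status
  ceph balancer mode upmap
  ceph balancer on
)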
*** Questions / Assumptions / Opinions ***
Have you ever encountered a similar phenomenon? We agree that having
different versions of OSDs coexisting is not a good solution and is
not desirable in the medium term, but we are dependent on recovery
time (and, in addition, on the issue I am presenting to you here).
Our current hypothesis, given the return to stability and the fact
that we never had this problem with FileStore OSDs, is that some kind
of ‘housekeeping’ of BlueStore OSDs happens via scrubs. Does that
make sense? Any clues or ideas?
I also read on the Internet (somewhere...) that in any case, when
the cluster is not ‘healthy’, scrubs are suspended by default.
Indeed, in our case:
root@cephstore16:~# ceph daemon osd.11636 config show | grep "osd_scrub_during_recovery"
    "osd_scrub_during_recovery": "false",
This could explain why, during the three days of recovery, no
scrubbing is performed, and if BlueStore does not do its maintenance,
it fills up? (It would be possible to temporarily change this
behaviour via: ceph tell "osd.*" injectargs
--osd-scrub-during-recovery=1 (to be tested).)
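(On Pacific the same setting can also be persisted via the central
config store, which survives OSD restarts; e.g.:

  ceph config set osd osd_scrub_during_recovery true
  # and reverted later with:
  ceph config rm osd osd_scrub_during_recovery

also still to be tested on our side.)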
Do you have any suggestions for things to check? Although we have
experience with FileStore, we have not yet had time to gain
experience with BlueStore.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io