Hi,
I can't say that we upgraded a lot of clusters from N to P, but those
we did upgrade didn't show any of the symptoms you describe. However,
we always did the FileStore to BlueStore conversion before the actual
upgrade. In SUSE Enterprise Storage (which we also supported at that
time) this was pointed out as a requirement. I just checked the Ceph
docs but can't find such a statement (yet).
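For what it's worth, a minimal sketch of such a per-OSD conversion
(OSD ID and device are placeholders, not our exact procedure, and the
cluster should recover before moving on to the next host):

  # drain and remove the FileStore OSD, keeping its ID
  ceph osd out <ID>
  systemctl stop ceph-osd@<ID>
  ceph osd destroy <ID> --yes-i-really-mean-it
  # wipe the old device and recreate it as BlueStore, reusing the ID
  ceph-volume lvm zap /dev/sdX --destroy
  ceph-volume lvm create --bluestore --data /dev/sdX --osd-id <ID>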
All our pools are of the following type: replicated size 3 min_size
1 crush_rule 0 (or 1).
I would recommend increasing min_size to 2, otherwise you let Ceph
lose two of three replicas before pausing IO, which can make recovery
difficult. Reducing min_size to 1 should only be a temporary measure
to avoid stalling client IO during recovery.
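A minimal example, assuming a pool name of <pool> (check your pools
with 'ceph osd pool ls detail' first):

  # pause IO after losing two replicas instead of running on one copy
  ceph osd pool set <pool> min_size 2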
Regards,
Eugen
Quoting Olivier Delcourt <olivier.delco...@uclouvain.be>:
Hi,
After reading your posts for years, I feel compelled to ask for your
help/advice. First, I need to explain the context of our CEPH
cluster, the problems we have encountered, and finally, my questions.
Thank you for taking the time to read this.
Cheers,
Olivier
*** Background ***
Our CEPH cluster was created in 2015 with version 0.94.x (Hammer),
which has been upgraded over time to version 10.2.x (Jewel), then
12.x (Luminous) and then 14.x (Nautilus). The MONitors and
CEPHstores have always run on Linux Debian, with versions updated
according to the requirements for supporting the underlying hardware
and/or CEPH releases.
In terms of hardware, we have three monitors (cephmon) and 30
storage servers (cephstore) spread across three datacenters. These
servers are connected to the network via an aggregate (LACP) of two
10 Gbps fibre connections, through which two VLANs pass, one for the
CEPH frontend network and one for the CEPH backend network. In doing
so, we have always given ourselves the option of separating the
frontend and backend into dedicated aggregates if the bandwidth
becomes insufficient.
Each of the storage servers comes with HDDs whose size varies
depending on the server generation, as well as SSDs whose size is
more consistent but still varies (depending on price).
The idea has always been to add HDD and SSD storage to the CEPH
cluster when we add storage servers to expand it or replace old
ones. At the OSD level, the basic rule has always been followed: one
device = one OSD, with metadata (FileStore) on dedicated partitioned
SSDs (up to 6 for 32 OSDs) and, for the past few years, on a
partitioned NVMe RAID1 (MD).
In total, we have:
100 hdd 10.90999 TB
48 hdd 11.00000 TB
48 hdd 14.54999 TB
24 hdd 15.00000 TB
9 hdd 5.45999 TB
108 hdd 9.09999 TB
84 ssd 0.89400 TB
198 ssd 0.89424 TB
18 ssd 0.93599 TB
32 ssd 1.45999 TB
16 ssd 1.50000 TB
48 ssd 1.75000 TB
24 ssd 1.79999 TB
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    3.6 PiB  1.7 PiB  1.9 PiB  1.9 PiB       53.45
ssd    480 TiB  321 TiB  158 TiB  158 TiB       33.04
TOTAL  4.0 PiB  2.0 PiB  2.1 PiB  2.1 PiB       51.08
Regarding the CRUSH map: since device classes (ssd/hdd) did not exist
when the CEPH cluster was launched, and we wanted to be able to
create pools on either disk or flash storage, we created two trees:
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-2 3660.19751 root main_storage
-11 1222.29883 datacenter DC1
-68 163.79984 host cephstore16
-280 109.09988 host cephstore28
-20 109.09988 host cephstore34
-289 109.09988 host cephstore31
-31 116.39990 host cephstore40
-205 116.39990 host cephstore37
-81 109.09988 host cephstore22
-71 163.79984 host cephstore19
-84 109.09988 host cephstore25
-179 116.39990 host cephstore43
-12 1222.29883 datacenter DC2
-69 163.79984 host cephstore17
-82 109.09988 host cephstore23
-295 109.09988 host cephstore32
-72 163.79984 host cephstore20
-283 109.09988 host cephstore29
-87 109.09988 host cephstore35
-85 109.09988 host cephstore26
-222 116.39990 host cephstore44
-36 116.39990 host cephstore41
-242 116.39990 host cephstore38
-25 1215.59998 datacenter DC3
-70 163.80000 host cephstore18
-74 163.80000 host cephstore21
-83 99.00000 host cephstore24
-86 110.00000 host cephstore27
-286 110.00000 host cephstore30
-298 99.00000 host cephstore33
-102 110.00000 host cephstore36
-304 120.00000 host cephstore39
-136 120.00000 host cephstore42
-307 120.00000 host cephstore45
-1 516.06305 root high-speed_storage
-21 171.91544 datacenter xDC1
-62 16.84781 host xcephstore16
-259 14.00000 host xcephstore28
-3 14.00000 host xcephstore34
-268 14.00000 host xcephstore31
-310 14.30786 host xcephstore40
-105 14.30786 host xcephstore37
-46 30.68784 host xcephstore10
-75 11.67993 host xcephstore22
-61 16.09634 host xcephstore19
-78 11.67993 host xcephstore25
-322 14.30786 host xcephstore43
-15 171.16397 datacenter xDC2
-63 16.09634 host xcephstore17
-76 11.67993 host xcephstore23
-274 14.00000 host xcephstore32
-65 16.09634 host xcephstore20
-262 14.00000 host xcephstore29
-13 14.00000 host xcephstore35
-79 11.67993 host xcephstore26
-51 30.68784 host xcephstore11
-325 14.30786 host xcephstore44
-313 14.30786 host xcephstore41
-175 14.30786 host xcephstore38
-28 172.98364 datacenter xDC3
-56 30.68784 host xcephstore12
-64 16.09200 host xcephstore18
-67 16.09200 host xcephstore21
-77 12.00000 host xcephstore24
-80 12.00000 host xcephstore27
-265 14.39999 host xcephstore30
-277 14.39990 host xcephstore33
-17 14.39999 host xcephstore36
-204 14.30399 host xcephstore39
-319 14.30399 host xcephstore42
-328 14.30396 host xcephstore45
Our allocation rules are:
# rules
rule main_storage_ruleset {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take main_storage
        step chooseleaf firstn 0 type datacenter
        step emit
}
rule high-speed_storage_ruleset {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take high-speed_storage
        step chooseleaf firstn 0 type datacenter
        step emit
}
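(For context, a pool is attached to one of these rules with something
along the following lines; the pool name is only an example:

  # bind an existing pool to the flash-backed rule
  ceph osd pool set <pool> crush_rule high-speed_storage_ruleset
)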
All our pools are of the following type: replicated size 3 min_size
1 crush_rule 0 (or 1).
This CEPH cluster is currently only used for RBD. The volumes are
used by our ~ 1,200 KVM VMs.
*** Problems ***
Everything was working fine until last August, when we scheduled an
upgrade from CEPH 14.x (Nautilus) to 16.x (Pacific) (and an upgrade
from Debian 10 to Debian 11, which was not a problem).
* First problem:
We were forced to switch from FileStore to BlueStore in an emergency
and unscheduled manner because after upgrading the CEPH packages on
the first storage server, the FileStore OSDs would no longer start.
We did not have this problem on our small test cluster, which
obviously did not have the ‘same upgrade life’ as the production
cluster. We therefore took the opportunity, DC by DC (since this is
our ‘failure domain’), not only to update CEPH but also to recreate
the OSDs in BlueStore.
* Second problem:
Since our failure domain is a DC, we had to upgrade a DC and then
wait for it to recover (~500 TB net). SSD storage recovery takes a
few hours, while HDD storage recovery takes approximately three days.
During recovery, our SSD-class OSDs filled up at a rate of ~2% every
3 hours (the phenomenon was also observed on HDD-class OSDs, but
since we have much more capacity there, it was less critical).
Manual (re)weight changes only provided temporary relief and, despite
all our attempts (OSD restarts, etc.), we reached the critical
full_ratio threshold, which is 0.97 in our case.
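(For completeness, per-OSD fill levels and the configured ratios can
be checked with the commands below; the set-full-ratio value is only
an example of a last-resort emergency valve, raising it is risky and
it has to be reverted as soon as possible:

  # per-OSD fill level and variance
  ceph osd df tree
  # current nearfull/backfillfull/full ratios
  ceph osd dump | grep ratio
  # emergency headroom only, revert afterwards
  ceph osd set-full-ratio 0.98
)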
I'll leave you to imagine the effect on the virtual machines and the
services provided to our users.
We also saw very strong growth in the size of the MONitor databases
(~3 GB -> 100 GB); compaction did not really help.
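(The compaction we refer to was along the lines of the following, run
per monitor; the mon ID is a placeholder:

  # trigger a compaction of one monitor's store
  ceph tell mon.<id> compact
)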
Once our VMs were shut down (crashed), the cluster completed its
recovery (HDD-type OSDs) and, curiously, the SSD-type OSDs began to
‘empty’.
The day after that, we began updating the storage servers in our
second DC, and the phenomenon started again. This time we did not
wait until we reached the full_ratio to shut down our virtualisation
environment, and the ‘SSD’ OSDs began to ‘empty’ after the following
commands: ceph osd unset noscrub && ceph osd unset nodeep-scrub.
We have always disabled scrubs and deep scrubs during large upgrades
and recoveries to save I/O; this never caused any problems with
FileStore.
It should be added that, since we started using the CEPH cluster
(2015), scrubs have only been enabled at night so as not to impact
production I/O, via the following options: osd_recovery_delay_start
= 5, osd_scrub_begin_hour = 19, osd_scrub_end_hour = 7,
osd_scrub_sleep = 0.1 (the latter may be removed now that device
classes are available).
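(On Nautilus and later, the same settings can also be applied
centrally through the config store instead of per-host ceph.conf
entries; just a sketch using our values:

  ceph config set osd osd_scrub_begin_hour 19
  ceph config set osd osd_scrub_end_hour 7
  ceph config set osd osd_scrub_sleep 0.1
)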
After this second total recovery of the CEPH cluster and the restart
of the virtualisation environment, we still have the third DC (10
cephstore) to upgrade from CEPH 14 to 16, and our ‘SSD’ OSDs were
filling up again until the automatic activation of scrubs/deep-scrubs
at 7 p.m. Since then, the growth has stopped and the utilisation of
the various OSDs is stable and more or less evenly distributed (via
the active upmap balancer).
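(For reference, the balancer configuration can be checked and enabled
like this; ours is already running in upmap mode:

  ceph balancer status
  ceph balancer mode upmap
  ceph balancer on
)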
*** Questions / Assumptions / Opinions ***
Have you ever encountered a similar phenomenon? We agree that having
different versions of OSDs coexisting is not a good solution and is
not desirable in the medium term, but we are dependent on recovery
time (and, in addition, on the issue I am presenting to you here).
Our current hypothesis, given the return to stability and the fact
that we never had this problem with FileStore OSDs, is that some kind
of ‘housekeeping’ of BlueStore OSDs happens via scrubs. Does that
make sense? Any clues or ideas?
I also read on the Internet (somewhere...) that in any case, when
the cluster is not ‘healthy’, scrubs are suspended by default.
Indeed, in our case:
root@cephstore16:~# ceph daemon osd.11636 config show | grep "osd_scrub_during_recovery"
    "osd_scrub_during_recovery": "false",
This could explain why, during the three days of recovery, no
scrubbing is performed, and if BlueStore does not do its maintenance,
it fills up? (It would be possible to temporarily change this
behaviour via: ceph tell "osd.*" injectargs
--osd-scrub-during-recovery=1 (to be tested).)
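(On Pacific the same setting can also be persisted via the central
config store, which survives OSD restarts; e.g.:

  ceph config set osd osd_scrub_during_recovery true
  # and reverted later with:
  ceph config rm osd osd_scrub_during_recovery

also still to be tested on our side.)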
Do you have any suggestions for things to check? Although we have
experience with FileStore, we have not yet had time to gain
experience with BlueStore.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io