Dear List,
until we upgraded our cluster three weeks ago we had a small, high-performing
production Ceph cluster running Nautilus 14.2.22 on Proxmox 6.4 (kernel 5.4-143
at that time). Then we started the upgrade to Octopus 15.2.15. Since we did an
online upgrade, we disabled the automatic OMAP conversion with
ceph config set osd bluestore_fsck_quick_fix_on_mount false
and carried out the OMAP conversion after the complete upgrade step by step,
restarting one OSD after the other (roughly as sketched below).
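Per OSD that meant something along these lines (a sketch, using osd.0 as an
example and assuming the standard ceph-osd systemd units on Proxmox; the exact
commands may have differed slightly):
# ceph config set osd.0 bluestore_fsck_quick_fix_on_mount true
# systemctl restart ceph-osd@0
(wait for the conversion to finish and the cluster to return to HEALTH_OK,
then continue with the next OSD)
# ceph config rm osd.0 bluestore_fsck_quick_fix_on_mount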
Our setup is:
5 x storage nodes, each: 16 x 2.3 GHz, 64 GB RAM, 1 x SSD OSD 1.6 TB, 1 x SSD
OSD 7.68 TB (both WD Enterprise, SAS-12), 3 x HDD OSD (10 TB, SAS-12, with
Optane cache)
4 x compute nodes
40 GbE storage network (Mellanox switch + Mellanox CX354 40 GbE dual-port
cards, Linux OSS drivers)
10 GbE cluster/management network
Our performance before the upgrade, on Ceph 14.2.22 (about 36k IOPS on the SSD
pool):
### SSD Pool on 40GE Switches
# rados bench -p SSD 30 -t 256 -b 1024 write
hints = 1
Maintaining 256 concurrent writes of 1024 bytes to objects of size 1024 for up
to 30 seconds or 0 objects
...
Total time run: 30.004
Total writes made: 1094320
Write size: 1024
Object size: 1024
Bandwidth (MB/sec): 35.6177
Stddev Bandwidth: 4.71909
Max bandwidth (MB/sec): 40.7314
Min bandwidth (MB/sec): 21.3037
Average IOPS: 36472
Stddev IOPS: 4832.35
Max IOPS: 41709
Min IOPS: 21815
Average Latency(s): 0.00701759
Stddev Latency(s): 0.00854068
Max latency(s): 0.445397
Min latency(s): 0.000909089
Cleaning up (deleting benchmark objects)
Our performance after the upgrade, on Ceph 15.2.15 (drops to at most ~17k IOPS
on the SSD pool):
# rados bench -p SSD 30 -t 256 -b 1024 write
hints = 1
Maintaining 256 concurrent writes of 1024 bytes to objects of size 1024 for up
to 30 seconds or 0 objects
...
Total time run: 30.0146
Total writes made: 468513
Write size: 1024
Object size: 1024
Bandwidth (MB/sec): 15.2437
Stddev Bandwidth: 0.78677
Max bandwidth (MB/sec): 16.835
Min bandwidth (MB/sec): 13.3184
Average IOPS: 15609
Stddev IOPS: 805.652
Max IOPS: 17239
Min IOPS: 13638
Average Latency(s): 0.016396
Stddev Latency(s): 0.00777054
Max latency(s): 0.140793
Min latency(s): 0.00106735
Cleaning up (deleting benchmark objects)
Note: OSD.17 is out on purpose.
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 208.94525 root default
-3 41.43977 host xx-ceph01
0 hdd 9.17380 osd.0 up 1.00000 1.00000
5 hdd 9.17380 osd.5 up 1.00000 1.00000
23 hdd 14.65039 osd.23 up 1.00000 1.00000
7 ssd 1.45549 osd.7 up 1.00000 1.00000
15 ssd 6.98630 osd.15 up 1.00000 1.00000
-5 41.43977 host xx-ceph02
1 hdd 9.17380 osd.1 up 1.00000 1.00000
4 hdd 9.17380 osd.4 up 1.00000 1.00000
24 hdd 14.65039 osd.24 up 1.00000 1.00000
9 ssd 1.45549 osd.9 up 1.00000 1.00000
20 ssd 6.98630 osd.20 up 1.00000 1.00000
-7 41.43977 host xx-ceph03
2 hdd 9.17380 osd.2 up 1.00000 1.00000
3 hdd 9.17380 osd.3 up 1.00000 1.00000
25 hdd 14.65039 osd.25 up 1.00000 1.00000
8 ssd 1.45549 osd.8 up 1.00000 1.00000
21 ssd 6.98630 osd.21 up 1.00000 1.00000
-17 41.43977 host xx-ceph04
10 hdd 9.17380 osd.10 up 1.00000 1.00000
11 hdd 9.17380 osd.11 up 1.00000 1.00000
26 hdd 14.65039 osd.26 up 1.00000 1.00000
6 ssd 1.45549 osd.6 up 1.00000 1.00000
22 ssd 6.98630 osd.22 up 1.00000 1.00000
-21 43.18616 host xx-ceph05
13 hdd 9.17380 osd.13 up 1.00000 1.00000
14 hdd 9.17380 osd.14 up 1.00000 1.00000
27 hdd 14.65039 osd.27 up 1.00000 1.00000
12 ssd 1.45540 osd.12 up 1.00000 1.00000
16 ssd 1.74660 osd.16 up 1.00000 1.00000
17 ssd 3.49309 osd.17 up 0 1.00000
18 ssd 1.74660 osd.18 up 1.00000 1.00000
19 ssd 1.74649 osd.19 up 1.00000 1.00000
# ceph osd df
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 0  hdd     9.17380   1.00000  9.2 TiB  2.5 TiB  2.4 TiB   28 MiB  5.0 GiB  6.6 TiB  27.56  0.96   88  up
 5  hdd     9.17380   1.00000  9.2 TiB  2.6 TiB  2.5 TiB   57 MiB  5.1 GiB  6.6 TiB  27.89  0.98   89  up
23  hdd    14.65039   1.00000   15 TiB  3.9 TiB  3.8 TiB   40 MiB  7.2 GiB   11 TiB  26.69  0.93  137  up
 7  ssd     1.45549   1.00000  1.5 TiB  634 GiB  633 GiB   33 MiB  1.8 GiB  856 GiB  42.57  1.49   64  up
15  ssd     6.98630   1.00000  7.0 TiB  2.6 TiB  2.6 TiB  118 MiB  5.9 GiB  4.4 TiB  37.70  1.32  272  up
 1  hdd     9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   31 MiB  4.7 GiB  6.8 TiB  26.04  0.91   83  up
 4  hdd     9.17380   1.00000  9.2 TiB  2.6 TiB  2.5 TiB   28 MiB  5.2 GiB  6.6 TiB  28.51  1.00   91  up
24  hdd    14.65039   1.00000   15 TiB  4.0 TiB  3.9 TiB   38 MiB  7.2 GiB   11 TiB  27.06  0.95  139  up
 9  ssd     1.45549   1.00000  1.5 TiB  583 GiB  582 GiB   30 MiB  1.6 GiB  907 GiB  39.13  1.37   59  up
20  ssd     6.98630   1.00000  7.0 TiB  2.5 TiB  2.5 TiB   81 MiB  7.4 GiB  4.5 TiB  35.45  1.24  260  up
 2  hdd     9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   26 MiB  4.8 GiB  6.8 TiB  26.01  0.91   83  up
 3  hdd     9.17380   1.00000  9.2 TiB  2.7 TiB  2.6 TiB   29 MiB  5.4 GiB  6.5 TiB  29.38  1.03   94  up
25  hdd    14.65039   1.00000   15 TiB  4.2 TiB  4.1 TiB   41 MiB  7.7 GiB   10 TiB  28.79  1.01  149  up
 8  ssd     1.45549   1.00000  1.5 TiB  637 GiB  635 GiB   34 MiB  1.7 GiB  854 GiB  42.71  1.49   65  up
21  ssd     6.98630   1.00000  7.0 TiB  2.5 TiB  2.5 TiB   96 MiB  7.5 GiB  4.5 TiB  35.49  1.24  260  up
10  hdd     9.17380   1.00000  9.2 TiB  2.2 TiB  2.1 TiB   26 MiB  4.5 GiB  7.0 TiB  24.21  0.85   77  up
11  hdd     9.17380   1.00000  9.2 TiB  2.5 TiB  2.4 TiB   30 MiB  5.0 GiB  6.7 TiB  27.24  0.95   87  up
26  hdd    14.65039   1.00000   15 TiB  3.6 TiB  3.5 TiB   37 MiB  6.6 GiB   11 TiB  24.64  0.86  127  up
 6  ssd     1.45549   1.00000  1.5 TiB  572 GiB  570 GiB   29 MiB  1.5 GiB  918 GiB  38.38  1.34   57  up
22  ssd     6.98630   1.00000  7.0 TiB  2.3 TiB  2.3 TiB   77 MiB  7.0 GiB  4.7 TiB  33.23  1.16  243  up
13  hdd     9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   25 MiB  4.8 GiB  6.8 TiB  26.07  0.91   84  up
14  hdd     9.17380   1.00000  9.2 TiB  2.3 TiB  2.2 TiB   54 MiB  4.6 GiB  6.9 TiB  25.13  0.88   80  up
27  hdd    14.65039   1.00000   15 TiB  3.7 TiB  3.6 TiB   54 MiB  6.9 GiB   11 TiB  25.55  0.89  131  up
12  ssd     1.45540   1.00000  1.5 TiB  619 GiB  617 GiB  163 MiB  2.3 GiB  871 GiB  41.53  1.45   63  up
16  ssd     1.74660   1.00000  1.7 TiB  671 GiB  669 GiB   23 MiB  2.2 GiB  1.1 TiB  37.51  1.31   69  up
17  ssd     3.49309         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0     0  up
18  ssd     1.74660   1.00000  1.7 TiB  512 GiB  509 GiB   18 MiB  2.3 GiB  1.2 TiB  28.62  1.00   52  up
19  ssd     1.74649   1.00000  1.7 TiB  709 GiB  707 GiB   64 MiB  2.0 GiB  1.1 TiB  39.64  1.39   72  up
                        TOTAL  205 TiB   59 TiB   57 TiB  1.3 GiB  128 GiB  147 TiB  28.60
MIN/MAX VAR: 0.85/1.49  STDDEV: 6.81
What we have done so far (no success; the commands are sketched after this
list):
- reformatted two of the SSD OSDs (one was still from Luminous, non-LVM)
- set bluestore_allocator from hybrid back to bitmap
- set osd_memory_target to 6442450944 for some of the SSD OSDs
- cpupower idle-set -D 11
- set bluefs_buffered_io to true
- disabled the default firewalls between the Ceph nodes (for testing only)
- disabled AppArmor
- added memory (now 128 GB per node)
- upgraded the OS, now running kernel 5.13.19-1
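The config changes above were applied roughly like this (a sketch; osd.15
stands in for "some of the SSD OSDs", and the OSDs were restarted afterwards
where a restart is required, e.g. for the allocator change):
# ceph config set osd bluestore_allocator bitmap
# ceph config set osd bluefs_buffered_io true
# ceph config set osd.15 osd_memory_target 6442450944
# cpupower idle-set -D 11
(the last one run on every node; it disables the deeper CPU idle states)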
What we observe:
- the HDD pool shows similar behaviour
- load is higher since the update, apparently due to higher CPU consumption
  (see graph1); the migration was on 10 Nov, around 10 pm
- latency on the "big" 7 TB SSDs (e.g. OSD.15) is significantly higher than on
  the small 1.6 TB SSDs (e.g. OSD.12), see graph2; presumably this is just due
  to the higher weight
- the load of OSD.15 is about 4 times higher than that of OSD.12, presumably
  also due to the higher weight
- starting OSD.15 (one of the 7 TB SSDs) takes significantly longer (~10 s)
  than starting the 1.6 TB SSDs
- increasing the block size in the benchmark to 4k, 8k or even 16k increases
  the throughput but keeps the IOPS more or less stable; even at 32k the drop
  is small, to ~14k IOPS on average (example run below)
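For reference, the larger block sizes were tested with the same kind of run,
only with a different -b value, e.g. for 4k (output omitted here):
# rados bench -p SSD 30 -t 256 -b 4096 write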
We have already checked the Proxmox list without finding a remedy so far, and
we are a bit at a loss. Any suggestions, and/or has anyone else seen similar
behaviour? We are a bit hesitant to upgrade to Pacific, given the current
situation.
Thanks,
Kai