Hi Kasper,

As Frédéric pointed out, you should consider resharding the RocksDB database to use column families (if the OSD was created pre-Pacific): https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#rocksdb-sharding The documentation also covers some preliminary steps, including making sure your RocksDB does not already use column families.
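A sketch of the procedure (the OSD must be stopped first; $ID is a placeholder for the OSD id, and the sharding string is the bluestore_rocksdb_cfs value from your config show below -- verify it against the docs for your release):

  systemctl stop ceph-osd@$ID
  # Check whether the DB already uses column families:
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-$ID show-sharding
  # If it reports no sharding, reshard to the column-family layout:
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-$ID \
    --sharding "m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L=min_write_buffer_number_to_merge=32 P=min_write_buffer_number_to_merge=32" \
    reshard
  systemctl start ceph-osd@$ID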

RocksDB options have changed in recent Reef releases and seem to be quite different from those in Pacific/Quincy. You may want to check whether any of the modified configuration options are relevant for your setup. Here is an excellent deep-dive blog post by the unequaled Mark Nelson: https://ceph.io/en/news/blog/2022/rocksdb-tuning-deep-dive/
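To spot which RocksDB-related options on a running OSD differ from the release defaults, something like this should work (a sketch; osd.110 taken from your report below):

  ceph daemon osd.110 config diff | grep -i rocksdb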

For the used vs. total bytes I have to disagree with Frédéric (sorry): 11631853568 / 88906653696 gives 13% utilization, so the OSD should not spill over to slow storage. I have seen this in the past, and rolling back to the previous bluestore_rocksdb options helped. However, I have not resharded RocksDB to column families yet. Would you please keep us posted if resharding to column families fixes the spillover for you?
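For reference, that number comes straight from your perf dump (a sketch, assuming jq is available on the OSD host):

  ceph daemon osd.110 perf dump | jq '.bluefs.db_used_bytes / .bluefs.db_total_bytes * 100'
  # 11631853568 / 88906653696 * 100 ≈ 13.08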

Cheers,
Enrico


On 5/15/25 12:35, Frédéric Nass wrote:
Hi Kasper,

Thanks for sharing.

I don't see anything wrong with this specific OSD when it comes to bluestore_rocksdb_*. Its RocksDB database is using column families, and this OSD was resharded properly (if not created or recreated in Pacific). What the perf dump shows is that the db_used_bytes is above the db_total_bytes. If this cluster makes heavy use of metadata (RGW workloads, for example), then 90 GB of DB device for 10 TB drives is less than 1%, which is not enough. The general recommendation for RGW workloads is to use a DB device of at least 4% of the size of the data device [1].
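For scale: 4% of a 10 TB data device is 400 GB, while these OSDs have ~83 GiB of DB, roughly 0.9% -- an order of magnitude below that recommendation.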

Now, your best move is probably to enable RocksDB compression (ceph config set 
osd bluestore_rocksdb_options_annex 'compression=kLZ4Compression'), restart and 
compact these OSDs to update bluefs stats, and consider giving those OSDs 
larger RocksDB partitions in the future.
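A sketch of that sequence ($ID is a placeholder; restart the daemons however you usually manage them):

  ceph config set osd bluestore_rocksdb_options_annex 'compression=kLZ4Compression'
  systemctl restart ceph-osd@$ID    # or: ceph orch daemon restart osd.$ID
  ceph tell osd.$ID compact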

Regards,
Frédéric.

[1] 
https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing

----- On 15 May 25, at 7:44, Kasper Rasmussen kasper_steenga...@hotmail.com 
wrote:

perf dump:
"bluefs": {
"db_total_bytes": 88906653696,
"db_used_bytes": 11631853568,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 9796816207872,
"slow_used_bytes": 1881341952,
"num_files": 229,
"log_bytes": 11927552,
"log_compactions": 78,
"log_write_count": 281792,
"logged_bytes": 1154220032,
"files_written_wal": 179,
"files_written_sst": 311,
"write_count_wal": 280405,
"write_count_sst": 29432,
"bytes_written_wal": 4015595520,
"bytes_written_sst": 15728308224,
"bytes_written_slow": 2691231744,
"max_bytes_wal": 0,
"max_bytes_db": 13012828160,
"max_bytes_slow": 3146252288,
"alloc_unit_slow": 65536,
"alloc_unit_db": 1048576,
"alloc_unit_wal": 0,
"read_random_count": 1871590,
"read_random_bytes": 18959576586,
"read_random_disk_count": 563421,
"read_random_disk_bytes": 17110012647,
"read_random_disk_bytes_wal": 0,
"read_random_disk_bytes_db": 11373755941,
"read_random_disk_bytes_slow": 5736256706,
"read_random_buffer_count": 1313456,
"read_random_buffer_bytes": 1849563939,
"read_count": 275731,
"read_bytes": 4825912551,
"read_disk_count": 225997,
"read_disk_bytes": 4016943104,
"read_disk_bytes_wal": 0,
"read_disk_bytes_db": 3909947392,
"read_disk_bytes_slow": 106999808,
"read_prefetch_count": 274534,
"read_prefetch_bytes": 4785141168,
"write_count": 591760,
"write_disk_count": 591838,
"write_bytes": 21062987776,
"compact_lat": {
"avgcount": 78,
"sum": 0.572247346,
"avgtime": 0.007336504
},
"compact_lock_lat": {
"avgcount": 78,
"sum": 0.182746199,
"avgtime": 0.002342899
},
"alloc_slow_fallback": 0,
"alloc_slow_size_fallback": 0,
"read_zeros_candidate": 0,
"read_zeros_errors": 0,
"wal_alloc_lat": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000
},
"db_alloc_lat": {
"avgcount": 969,
"sum": 0.006368060,
"avgtime": 0.000006571
},
"slow_alloc_lat": {
"avgcount": 39,
"sum": 0.004502210,
"avgtime": 0.000115441
},
"alloc_wal_max_lat": 0.000000000,
"alloc_db_max_lat": 0.000113831,
"alloc_slow_max_lat": 0.000301347
},


config show:
"bluestore_rocksdb_cf": "true",
"bluestore_rocksdb_cfs": "m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru}
L=min_write_buffer_number_to_merge=32 P=min_write_buffer_number_to_merge=32",
"bluestore_rocksdb_options":
"compression=kLZ4Compression,max_write_buffer_number=64,min_write_buffer_number_to_merge=6,compaction_style=kCompactionStyleLevel,write_buffer_size=16777216,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0",
"bluestore_rocksdb_options_annex": "",


Don't know if it is of any help, but I've compared the config with an OSD not 
reporting any issues, and there is no difference.
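Roughly along these lines (osd ids as examples):

  diff <(ceph config show osd.110) <(ceph config show osd.200)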


________________________________
From: Enrico Bocchi <enrico.boc...@cern.ch>
Sent: Wednesday, May 14, 2025 22:47
To: Kasper Rasmussen <kasper_steenga...@hotmail.com>; ceph-users
<ceph-users@ceph.io>
Subject: Re: BLUEFS_SPILLOVER after Reef upgrade

Hi Kasper,

Would you mind sharing the output of `perf dump` and `config show` from the
daemon socket of one of the OSDs reporting BlueFS spillover? I am interested in
the bluefs part of the former and in the bluestore_rocksdb options of the
latter.
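For instance, from the OSD's host (osd.110 as an example id):

  ceph daemon osd.110 perf dump
  ceph daemon osd.110 config show | grep bluestore_rocksdb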

The warning about slow ops in BlueStore is a different story. There have been
several messages on this mailing list recently with suggestions on how to tune
the alert threshold. In my experience, these warnings very likely relate to some
problem with the underlying storage device, so I'd recommend investigating the
root cause rather than simply silencing the warning.
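If you do need to tune the threshold while investigating, recent Reef releases expose options along these lines (a sketch; verify the option names for your exact release):

  ceph config set osd bluestore_slow_ops_warn_threshold 10
  ceph config set osd bluestore_slow_ops_warn_lifetime 600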

Cheers,
Enrico


________________________________
From: Kasper Rasmussen <kasper_steenga...@hotmail.com>
Sent: Wednesday, May 14, 2025 8:22:46 PM
To: ceph-users <ceph-users@ceph.io>
Subject: [ceph-users] BLUEFS_SPILLOVER after Reef upgrade

I've just upgraded our Ceph cluster from Pacific 16.2.15 to Reef 18.2.7.

After that I see the warnings:

[WRN] BLUEFS_SPILLOVER: 5 OSD(s) experiencing BlueFS spillover
     osd.110 spilled over 4.5 GiB metadata from 'db' device (8.0 GiB used of 83 GiB) to slow device
     osd.455 spilled over 1.1 GiB metadata from 'db' device (11 GiB used of 83 GiB) to slow device
     osd.533 spilled over 426 MiB metadata from 'db' device (10 GiB used of 83 GiB) to slow device
     osd.560 spilled over 389 MiB metadata from 'db' device (9.8 GiB used of 83 GiB) to slow device
     osd.597 spilled over 8.6 GiB metadata from 'db' device (7.7 GiB used of 83 GiB) to slow device
[WRN] BLUESTORE_SLOW_OP_ALERT: 4 OSD(s) experiencing slow operations in BlueStore
     osd.410 observed slow operation indications in BlueStore
     osd.443 observed slow operation indications in BlueStore
     osd.508 observed slow operation indications in BlueStore
     osd.593 observed slow operation indications in BlueStore

I've tried to run `ceph tell osd.XXX compact` with no result.

Bluefs stats:

ceph tell osd.110 bluefs stats
1 : device size 0x14b33fe000 : using 0x202c00000(8.0 GiB)
2 : device size 0x8e8ffc00000 : using 0x5d31d150000(5.8 TiB)
RocksDBBlueFSVolumeSelector
Settings<< extra=0 B, l0_size=1 GiB, l_base=1 GiB, l_multi=8 B
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
LOG         0 B         16 MiB      0 B         0 B         0 B         15 MiB      1
WAL         0 B         18 MiB      0 B         0 B         0 B         6.3 MiB     1
DB          0 B         8.0 GiB     0 B         0 B         0 B         8.0 GiB     140
SLOW        0 B         0 B         4.5 GiB     0 B         0 B         4.5 GiB     78
TOTAL       0 B         8.0 GiB     4.5 GiB     0 B         0 B         0 B         220
MAXIMUMS:
LOG         0 B         25 MiB      0 B         0 B         0 B         21 MiB
WAL         0 B         118 MiB     0 B         0 B         0 B         93 MiB
DB          0 B         8.2 GiB     0 B         0 B         0 B         8.2 GiB
SLOW        0 B         0 B         14 GiB      0 B         0 B         14 GiB
TOTAL       0 B         8.2 GiB     14 GiB      0 B         0 B         0 B
SIZE <<     0 B         79 GiB      8.5 TiB
Help with what to do next will be much appreciated.



--
Enrico Bocchi
CERN European Laboratory for Particle Physics
IT - Storage & Data Management  - General Storage Services
Mailbox: G20500 - Office: 31-2-010
1211 Genève 23
Switzerland
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
