Hi,

The issue is that one NVMe failed, so the related OSDs (let's say osd.1, osd.2, osd.3 and osd.4) all crashed, and in the meantime another OSD (osd.108) on the same host crashed too. This has now happened twice, with two different failed NVMe devices, each time on the same assert_line and assert_file shown below.

So why did this happen just when the NVMe failed? Are the crashes related?

I can't get a clear view of what is going on...
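
One thing that would help answer the "are they related" question is checking which physical devices back osd.108, and whether the failed NVMe is among them; a minimal sketch using the standard device commands (the grep pattern is just illustrative):

    # physical devices (data / DB / WAL) used by osd.108
    ceph device ls-by-daemon osd.108

    # the OSD metadata also lists its block, DB and WAL devices
    ceph osd metadata 108 | grep -i devices

If the failed NVMe shows up as osd.108's DB/WAL device, the crashes are directly related; if not, the kernel log around the crash time is the next place to look.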


Regards

On 8/21/25 15:57, Anthony D'Atri wrote:

We had a Ceph cluster, version 18.2.7, deployed with four OSDs per NVMe device.
Post-Quincy the benefits of multiple OSDs per NVMe device are mostly obviated.

Or are you saying that you have HDD OSDs that offload WAL+DB?
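
A quick way to show the actual per-OSD layout on that host is ceph-volume; a sketch, assuming a cephadm-style containerized deployment (with a package install, plain `ceph-volume lvm list` does the same):

    # on the OSD host: list every OSD with its block, db and wal devices
    cephadm ceph-volume lvm list

That makes it obvious whether each NVMe carries four standalone OSDs or serves as DB/WAL for HDD OSDs.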

Two weeks ago we lost one hard drive (so 4 OSDs); just after those OSDs crashed, another "healthy" OSD crashed as well (a few minutes after the initial hard drive failure).

Again this week we lost another hard drive (4 OSDs lost), and again the same OSD crashed too (a few minutes later).

Below is the crash information; it is the same for both crashes:


  ceph crash info 2025-08-21T00:31:13.929426Z_576e6e42-7c6b-49b8-90f1-9d51730f8ac2
{
     "assert_condition": "r == 0",
     "assert_file": 
"/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.7/rpm/el9/BUILD/ceph-18.2.7/src/os/bluestore/BlueStore.cc",
     "assert_func": "void BlueStore::_txc_apply_kv(BlueStore::TransContext*, 
bool)",
     "assert_line": 12944,
     "assert_msg": 
"/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.7/rpm/el9/BUILD/ceph-18.2.7/src/os/bluestore/BlueStore.cc:
 In function 'void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread 7f5fba6a3640 
time 
2025-08-21T00:31:13.926894+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.7/rpm/el9/BUILD/ceph-18.2.7/src/os/bluestore/BlueStore.cc:
 12944: FAILED ceph_assert(r == 0)\n",
     "assert_thread_name": "bstore_kv_sync",
     "backtrace": [
         "/lib64/libc.so.6(+0x3ebf0) [0x7f5fce5a5bf0]",
         "/lib64/libc.so.6(+0x8bf5c) [0x7f5fce5f2f5c]",
         "raise()",
         "abort()",
         "(ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x169) [0x560bc2d0e44d]",
         "/usr/bin/ceph-osd(+0x3cc5ae) [0x560bc2d0e5ae]",
         "/usr/bin/ceph-osd(+0x3b8864) [0x560bc2cfa864]",
         "(BlueStore::_kv_sync_thread()+0x1073) [0x560bc32bb623]",
         "/usr/bin/ceph-osd(+0x8fd3b1) [0x560bc323f3b1]",
         "/lib64/libc.so.6(+0x8a21a) [0x7f5fce5f121a]",
         "/lib64/libc.so.6(+0x10f290) [0x7f5fce676290]"
     ],
     "ceph_version": "18.2.7",
     "crash_id": 
"2025-08-21T00:31:13.929426Z_576e6e42-7c6b-49b8-90f1-9d51730f8ac2",
     "entity_name": "osd.108",
     "os_id": "centos",
     "os_name": "CentOS Stream",
     "os_version": "9",
     "os_version_id": "9",
     "process_name": "ceph-osd",
     "stack_sig": 
"0080731b49e5583e6d168903c1ea7df8bc2caded6b5e24ee4381077d54b045e2",
     "timestamp": "2025-08-21T00:31:13.929426Z",
     "utsname_hostname": "pub1-cephosd-9",
     "utsname_machine": "x86_64",
     "utsname_release": "6.1.0-31-amd64",
     "utsname_sysname": "Linux",
     "utsname_version": "#1 SMP PREEMPT_DYNAMIC Debian 6.1.128-1 (2025-02-07)"


Regards
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io