Posting this for posterity, in case someone runs into it down the line and 
finds it in the archives when trying to figure out what the heck is going on.

On a Reef 18.2.1 cluster, when nodes reboot, some OSDs fail at shutdown with 
the assertion below:


    "assert_msg": 
"/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/osd/OSD.cc:
 In function 'int OSD::shutdown()' thread 7f1286b61700 time 
2025-08-24T19:57:36.343925+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/osd/OSD.cc:
 4495: FAILED ceph_assert(end_time - start_time_func < 
cct->_conf->osd_fast_shutdown_timeout)\n",

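
For context, here's a minimal sketch (not the actual Ceph source; the logic is 
taken from the assert message itself, and 15s is the documented default for 
osd_fast_shutdown_timeout) of what that failed assert is checking:

```python
# Illustrative sketch of the check behind the failed ceph_assert().
# osd_fast_shutdown_timeout defaults to 15 seconds; the assert fires when
# the fast-shutdown path takes longer than that, aborting the OSD before
# it can write out its allocation metadata cleanly.

OSD_FAST_SHUTDOWN_TIMEOUT = 15.0  # seconds (default)

def shutdown_in_time(start_time_func: float, end_time: float,
                     timeout: float = OSD_FAST_SHUTDOWN_TIMEOUT) -> bool:
    """Mirror of: ceph_assert(end_time - start_time_func < timeout)."""
    return end_time - start_time_func < timeout

# A shutdown taking 12s passes; one taking 40s trips the assert.
print(shutdown_in_time(0.0, 12.0))  # True
print(shutdown_in_time(0.0, 40.0))  # False -> ceph_assert would abort
```

Because the OSD aborts rather than shutting down cleanly, no valid allocator 
state gets persisted, which is why the startup side then has to rebuild it.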

which then seems to lead to the below at the next OSD startup:

2025-08-25T00:04:29.669+0000 7faa68c20740  1 freelist _read_cfg
2025-08-25T00:04:29.881+0000 7faa68c20740  1 
bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty 
file)
2025-08-25T00:04:29.881+0000 7faa68c20740  0 
bluestore(/var/lib/ceph/osd/ceph-29) _init_alloc::NCB::restore_allocator() 
failed! Run Full Recovery from ONodes (might take a while) …


After some research, my understanding is that the root cause was addressed in 
18.2.6.

My sense is that once this 18.2.1 cluster is completely on 18.2.7 the problem 
should fade.  During the upgrade, however, each occurrence complicates the 
process: crashes have to be archived, and each affected OSD spends an hour or 
two (20TB spinners) on the above-described full recovery at startup, etc.

I’ve found that doubling the default timeout (15s):

ceph config set global osd_fast_shutdown_timeout 30

makes a dramatic difference.  With the above set, the upgrade is progressing 
nicely as expected.
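
Related config commands, in case they save someone a lookup (standard 
`ceph config` subcommands; these need a live cluster, so they're shown as a 
fragment only):

```shell
# Confirm the override took effect (the compiled-in default is 15 seconds):
ceph config get osd osd_fast_shutdown_timeout

# Once the whole cluster is on a fixed release, drop the override:
ceph config rm global osd_fast_shutdown_timeout
```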


_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
