Posting this for posterity, in case someone runs into it down the line and finds it in the archives when trying to figure out what the heck is going on.
On a Reef 18.2.1 cluster, when nodes reboot, some OSDs hit the following assertion:

    "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/osd/OSD.cc: In function 'int OSD::shutdown()' thread 7f1286b61700 time 2025-08-24T19:57:36.343925+0000
    /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/osd/OSD.cc: 4495: FAILED ceph_assert(end_time - start_time_func < cct->_conf->osd_fast_shutdown_timeout)"

That in turn seems to lead to the following at the next OSD startup:

    2025-08-25T00:04:29.669+0000 7faa68c20740  1 freelist _read_cfg
    2025-08-25T00:04:29.881+0000 7faa68c20740  1 bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file)
    2025-08-25T00:04:29.881+0000 7faa68c20740  0 bluestore(/var/lib/ceph/osd/ceph-29) _init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes (might take a while)
    ...

After some research, my understanding is that the root cause is addressed in 18.2.6, so once this 18.2.1 cluster is completely on 18.2.7 the problem should fade. During the upgrade, however, each occurrence complicates the process: archiving the resulting crash reports, waiting an hour or two (these are 20 TB spinners) for the full recovery described above at OSD startup, and so on.

I've found that doubling the default timeout (15s -> 30s) makes a dramatic difference:

    ceph config set global osd_fast_shutdown_timeout 30

After setting the above, the upgrade is progressing nicely, as expected.

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
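For anyone in the same spot mid-upgrade, a rough sketch of the per-node routine follows. These are all standard Ceph CLI commands; the timeout value of 30 is just the doubled default discussed above, and whether you archive crashes one at a time or all at once is a matter of taste:

```shell
# Double the fast-shutdown grace period (default is 15 seconds) so slow
# HDD OSDs can finish shutting down before the ceph_assert fires.
ceph config set global osd_fast_shutdown_timeout 30

# Confirm the value OSDs will actually use.
ceph config get osd osd_fast_shutdown_timeout

# After a reboot, list any crash reports left behind by the assert...
ceph crash ls

# ...and archive them all so the RECENT_CRASH health warning clears.
ceph crash archive-all
```

Note that setting the option at the `global` level means any future OSDs inherit it too; once the cluster is fully on a fixed release, the override can be dropped with `ceph config rm global osd_fast_shutdown_timeout`.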