Hello Ceph community,

I'm writing on behalf of a friend who is experiencing a critical cluster
issue after upgrading and would appreciate any assistance.

Environment:

   - 5 MON nodes, 2 MGR nodes, 40 OSD servers (306 OSDs total)
   - OS: CentOS 8.2 upgraded to 8.4
   - Ceph: 15.2.17 upgraded to 17.2.7
   - Upgrade method: yum update in rolling batches
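
Concretely, each OSD batch looked roughly like the following (the package
list and exact ordering are from memory, so please treat this as a sketch
rather than a record of the precise commands):

ceph osd set noout
yum update -y ceph ceph-osd ceph-common   # on every host in the batch
systemctl restart ceph-osd.target         # restart the upgraded OSDs
ceph -s                                   # wait for PGs to settle
ceph osd unset noout                      # after the final batch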

Timeline: The upgrade started on October 8th at 1:00 PM. We upgraded the
MON/MGR servers first and then upgraded the OSD nodes in batches of 5. The
process initially appeared normal, but when approximately 10 OSD servers
remained, OSDs began going down.

MON Quorum Issue: When the OSDs began failing, the monitors failed to form
a quorum. In an attempt to recover, we stopped 4 out of 5 monitors.
However, the remaining monitor (mbjson20010) then failed to start due to a
missing .ldb file. We eventually recovered this single monitor from the
OSDs using the instructions at
https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-mon/#mon-store-recovery-using-osds,
so we now have only 1 MON in the cluster instead of the original 5.
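
For reference, the rebuild followed that page fairly closely; roughly (the
temporary mon-store path, keyring location and the use of a single monitor
ID are specific to our setup, so this is an outline rather than the exact
commands we ran):

# on each OSD host, with the OSDs stopped, accumulate cluster maps
# from every OSD into a temporary mon store
ms=/root/mon-store
mkdir -p "$ms"
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path "$osd" --no-mon-config \
        --op update-mon-db --mon-store-path "$ms"
done

# after copying the accumulated store from host to host so every OSD
# contributes, rebuild it on the surviving monitor host and then swap it
# in for /var/lib/ceph/mon/ceph-mbjson20010/store.db (keeping a backup)
ceph-monstore-tool /root/mon-store rebuild -- \
    --keyring /etc/ceph/ceph.client.admin.keyring --mon-ids mbjson20010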

However, rebuilding the MON store did not help, and restarting the OSD
servers also failed to resolve the issue. The cluster remains severely
degraded, as shown below.

Current Cluster Status:

   - Only 1 MON daemon active (quorum: mbjson20010) - down from 5 MONs
   - OSDs: 91 up / 229 in (out of 306 total)
   - 88.872% of PGs are not active
   - 4.779% of PGs are unknown
   - 3,918 PGs down
   - 1,311 PGs stale+down
   - Only 12 PGs active+clean

Critical Error: When examining OSD logs, we discovered that some OSDs are
failing to start with the following error:

osd.43 39677784 init missing pg_pool_t for deleted pool 9 for pg 9.3ds3;
please downgrade to luminous and allow pg deletion to complete before
upgrading

Full error context from one of the failing OSDs:

# tail /var/log/ceph/ceph-osd.43.log
    -7> 2025-10-12T13:40:05.987+0800 7fdd13259540  1
bluestore(/var/lib/ceph/osd/ceph-43) _upgrade_super from 4, latest 4
    -6> 2025-10-12T13:40:05.987+0800 7fdd13259540  1
bluestore(/var/lib/ceph/osd/ceph-43) _upgrade_super done
    -5> 2025-10-12T13:40:05.987+0800 7fdd13259540  2 osd.43 0 journal looks
like ssd
    -4> 2025-10-12T13:40:05.987+0800 7fdd13259540  2 osd.43 0 boot
    -3> 2025-10-12T13:40:05.987+0800 7fdceb2cc700  5
bluestore.MempoolThread(0x55c7b0c66b40) _resize_shards cache_size:
8589934592 kv_alloc: 1717986918 kv_used: 91136 kv_onode_alloc: 343597383
kv_onode_used: 23328 meta_alloc: 6871947673 meta_used: 2984 data_alloc: 0
data_used: 0
    -2> 2025-10-12T13:40:05.989+0800 7fdd13259540 -1 osd.43 39677784 init
missing pg_pool_t for deleted pool 9 for pg 9.3ds3; please downgrade to
luminous and allow pg deletion to complete before upgrading
    -1> 2025-10-12T13:40:05.991+0800 7fdd13259540 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/osd/OSD.cc:
In function 'int OSD::init()' thread 7fdd13259540 time
2025-10-12T13:40:05.990845+0800
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/osd/OSD.cc:
3735: ceph_abort_msg("abort() called")

# tail /var/log/ceph/ceph-osd.51.log
    -7> 2025-10-12T13:39:36.739+0800 7f603e5f7540  1
bluestore(/var/lib/ceph/osd/ceph-51) _upgrade_super from 4, latest 4
    -6> 2025-10-12T13:39:36.739+0800 7f603e5f7540  1
bluestore(/var/lib/ceph/osd/ceph-51) _upgrade_super done
    -5> 2025-10-12T13:39:36.739+0800 7f603e5f7540  2 osd.51 0 journal looks
like ssd
    -4> 2025-10-12T13:39:36.739+0800 7f603e5f7540  2 osd.51 0 boot
    -3> 2025-10-12T13:39:36.739+0800 7f6016669700  5
bluestore.MempoolThread(0x55e839d4cb40) _resize_shards cache_size:
8589934592 kv_alloc: 1717986918 kv_used: 31232 kv_onode_alloc: 343597383
kv_onode_used: 21584 meta_alloc: 6871947673 meta_used: 1168 data_alloc: 0
data_used: 0
    -2> 2025-10-12T13:39:36.741+0800 7f603e5f7540 -1 osd.51 39677784 init
missing pg_pool_t for deleted pool 6 for pg 6.1f; please downgrade to
luminous and allow pg deletion to complete before upgrading
    -1> 2025-10-12T13:39:36.742+0800 7f603e5f7540 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/osd/OSD.cc:
In function 'int OSD::init()' thread 7f603e5f7540 time
2025-10-12T13:39:36.742527+0800
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/osd/OSD.cc:
3735: ceph_abort_msg("abort() called")

Investigation Findings: We examined all OSD instances that failed to start.
All of them exhibit the same error pattern in their logs and all contain PG
references to non-existent pools. For example, running
"ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51 --op list-pgs"
shows PG references to pools that no longer exist (e.g., pool 9, pool 10,
pool 4, pool 6, pool 8), while the current pools are numbered 101, 140,
141, 149, 212, 213, 216, 217, 218, 219. Notably, each affected OSD lists
only 2-3 PGs, all referencing these non-existent pools, which is
significantly fewer than the hundreds of PGs a regular OSD typically
contains.

For example, osd.51 lists only two PGs, both referencing non-existent pools
(a sketch of the check we ran on each affected OSD follows this output):

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51 --op list-pgs
1.0
6.1f

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51 --op list
Error getting attr on : 1.0_head,#1:00000000::::head#, (61) No data
available
Error getting attr on : 6.1f_head,#6:f8000000::::head#, (61) No data
available
["1.0",{"oid":"","key":"","snapid":-2,"hash":0,"max":0,"pool":1,"namespace":"","max":0}]
["1.0",{"oid":"main.db-journal.0000000000000000","key":"","snapid":-2,"hash":1969844440,"max":0,"pool":1,"namespace":"devicehealth","max":0}]
["1.0",{"oid":"main.db.0000000000000000","key":"","snapid":-2,"hash":1315310604,"max":0,"pool":1,"namespace":"devicehealth","max":0}]
["6.1f",{"oid":"","key":"","snapid":-2,"hash":31,"max":0,"pool":6,"namespace":"","max":0}]

We also performed a comprehensive check by listing all PGs from all OSD
nodes using "ceph-objectstore-tool --op list-pgs" and comparing the results
with the output of "ceph pg dump". This comparison revealed that quite a
few PGs are missing from the OSD listings. We suspect that some OSDs that
previously held these missing PGs may now be corrupted, which would explain
both the missing PGs and the widespread cluster degradation.
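
For reference, that comparison was done roughly as follows; the output file
names are arbitrary and the per-host listings were concatenated before
comparing, so again this is a sketch rather than the exact commands:

# on each OSD host, with the OSDs stopped: PGs actually present on disk
# (EC shard suffixes such as "s3" are stripped so the IDs match pg dump)
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path "$osd" --op list-pgs
done | sed 's/s[0-9]*$//' | sort -u > /tmp/pgs_on_disk.txt

# from the monitor: the PGs the cluster expects to exist
ceph pg dump pgs_brief 2>/dev/null \
    | awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ {print $1}' | sort -u > /tmp/pgs_in_map.txt

# PGs present in the PG map but not found on any OSD we scanned
comm -13 /tmp/pgs_on_disk.txt /tmp/pgs_in_map.txt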

In short, it appears the OSD object store metadata has been corrupted or
overwritten with stale references to deleted pools from earlier operations,
preventing these OSDs from starting and causing widespread PG state
abnormalities across the cluster.

Questions:

   1. How can we safely restore the missing PGs from these OSDs without
   data loss?
   2. Has anyone encountered similar issues when upgrading from Octopus
   (15.2.x) to Quincy (17.2.x)?

We understand that skipping the intermediate Pacific (16.2.x) release may
not be officially supported, but we urgently need guidance on the safest
recovery path at this point.

Any help would be greatly appreciated. Thank you in advance.

-- 
Regards
Kefu Chai
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
