Hi,

The MGR doesn't always report the correct PG status, so don't rely on it too much. Sometimes it's necessary to restart the primary OSDs of stuck PGs, although a repeer might have been sufficient. Your Ceph clients had to refresh their osdmap, and that's when they noticed that OSDs had been down. It's not a real-time log in this case, so there's no need to worry. It's a common question though; I think we also asked it 8 to 10 years ago. ;-)
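
To illustrate the "not a real-time log" point, here is a rough Python sketch that parses the libceph lines quoted further down and compares timestamps with osdmap epochs. The regular expression and the helper name parse_libceph_events are my own assumptions, based only on the four lines in the original message, and are not an official log format definition.

```python
import re

# Assumed line shape, taken from the quoted kernel log lines:
# [Sat May 30 22:03:46 2026] libceph (<fsid> e<epoch>): osd<N> up|down
LINE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+libceph \((?P<fsid>[0-9a-f-]+) e(?P<epoch>\d+)\): "
    r"osd(?P<osd>\d+) (?P<state>up|down)"
)


def parse_libceph_events(text):
    """Return a list of (timestamp, epoch, osd_id, state) tuples."""
    return [
        (m["ts"], int(m["epoch"]), int(m["osd"]), m["state"])
        for m in LINE_RE.finditer(text)
    ]


# The four lines from the original message, one event per line.
SAMPLE = """\
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179035): osd53 down
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179050): osd53 up
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179057): osd86 down
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179074): osd86 up
"""

events = parse_libceph_events(SAMPLE)
for ts, epoch, osd, state in events:
    print(f"{ts}  epoch {epoch}  osd.{osd} {state}")

# Compare how many distinct timestamps there are with how far the epochs spread.
epochs = [e[1] for e in events]
timestamps = {e[0] for e in events}
print(f"{len(timestamps)} distinct timestamp(s), "
      f"epoch span {max(epochs) - min(epochs)}")
```

All four events share one timestamp but their epochs are 39 apart, which is consistent with the clients working through a batch of osdmap updates in one go rather than logging the OSD state changes as they happened.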

Regards,
Eugen

Quoting Wannes Smet via ceph-users <[email protected]>:

Hi,

I'm running a Ceph cluster 19.2.2, 23 nodes, 152 OSDs, cephadm deployed. Mostly SAS SSDs, 12 NVMe SSDs.

Yesterday we experienced a total power failure and everything went down hard, including our Ceph cluster. There were a couple of issues, but this stood out after it came back up:

[ERR] OSD_UNREACHABLE: 2 osds(s) are not reachable
 osd.53's public address is not in '192.168.11.0/24' subnet
 osd.86's public address is not in '192.168.11.0/24' subnet

ceph -s did not say reduced data {availability,redundancy}, which is a bit "off", given that both OSDs are on separate hosts, failure domain=host. There must have been PGs with fewer than 3 replicas and also PGs with just one replica left?

So I manually restarted those OSDs with systemctl, a recovery process started and all our VMs "magically" started booting. I'm also surprised that the recovery process only started when those OSDs came back up.

I didn't make too much of the above, but now this morning, I'm looking at the kernel ring buffer of our PVE nodes and I notice the logs below. Just a single "blip". All at the same time on all of our PVE nodes (ceph clients):

[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179035): osd53 down
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179050): osd53 up
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179057): osd86 down
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179074): osd86 up

I don't see anything weird in the Ceph cluster itself, nor in the log files of the OSDs.

I'm not sure what to make of this. Why would this happen and what would you do?

Thanks for your insights,

Wannes Smet

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]

