Hi,

I'm running a Ceph 19.2.2 cluster: 23 nodes, 152 OSDs, deployed with cephadm.
The drives are mostly SAS SSDs, with 12 NVMe SSDs.

Yesterday we had a total power failure and everything went down hard, including
our Ceph cluster. There were a couple of issues, but this one stood out once the
cluster was back up:

[ERR] OSD_UNREACHABLE: 2 osds(s) are not reachable
 osd.53's public address is not in '192.168.11.0/24' subnet
 osd.86's public address is not in '192.168.11.0/24' subnet
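
For reference, the membership test that this message describes can be sketched in a few lines of Python. This is only an illustration of "is this address inside 192.168.11.0/24", not how Ceph implements the check, and the example addresses are made up (I have not included the real addresses of osd.53 and osd.86 here):

```python
import ipaddress

# Public network as it appears in the health message.
PUBLIC_NETWORK = ipaddress.ip_network("192.168.11.0/24")


def in_public_network(addr: str) -> bool:
    """Return True if addr lies inside PUBLIC_NETWORK."""
    return ipaddress.ip_address(addr) in PUBLIC_NETWORK


# Hypothetical addresses, for illustration only.
examples = ["192.168.11.53", "192.168.12.53", "0.0.0.0"]

for addr in examples:
    verdict = "inside" if in_public_network(addr) else "outside"
    print(f"{addr}: {verdict} {PUBLIC_NETWORK}")
```

Comparing the address an OSD actually registered against the configured public network in this way would show whether the warning is literally true or not.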

ceph -s did not report reduced data availability or redundancy, which seems a
bit "off" given that the two OSDs are on separate hosts and the failure domain
is host. Surely there must have been PGs with fewer than 3 replicas, and also
PGs with just one replica left?
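
To illustrate what I mean, here is a minimal sketch. The PG IDs and acting sets are invented for the example and are not taken from our cluster; it only assumes 3 replicas per PG and that osd.53 and osd.86 are both unavailable:

```python
# Hypothetical PG -> acting set mapping (3 replicas each); not real cluster data.
ACTING = {
    "2.1a": [53, 12, 101],  # shares only osd.53
    "2.1b": [53, 86, 7],    # shares both affected OSDs
    "2.1c": [4, 9, 120],    # shares neither
}

# The two OSDs named in the health message.
UNAVAILABLE = {53, 86}


def surviving_replicas(acting, unavailable):
    """Count the replicas of each PG that are not on an unavailable OSD."""
    return {
        pg: sum(1 for osd in osds if osd not in unavailable)
        for pg, osds in acting.items()
    }


counts = surviving_replicas(ACTING, UNAVAILABLE)
for pg, n in counts.items():
    print(f"pg {pg}: {n} of {len(ACTING[pg])} replicas left")
```

In this made-up mapping one PG drops to two replicas and one drops to a single replica, which is the kind of state I expected ceph -s to flag.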

So I manually restarted those OSDs with systemctl. A recovery process started
and all our VMs "magically" started booting. I'm also surprised that the
recovery process only started once those OSDs were back up.

I didn't make too much of the above, but this morning I was looking at the
kernel ring buffer of our PVE nodes and noticed the log lines below. Just a
single "blip", at the same time on all of our PVE nodes (Ceph clients):

[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179035): osd53 down
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179050): osd53 up
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179057): osd86 down
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179074): osd86 up
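
To make the sequence easier to read, here is a small sketch that parses these four lines and pairs each "down" with the following "up" for the same OSD. The regular expression is my own guess at the line layout based on the lines above, and I am assuming the eNNNNNN field is the osdmap epoch:

```python
import re

# The four kernel log lines from above, with the mail line-wrapping undone.
LOG = """\
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179035): osd53 down
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179050): osd53 up
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179057): osd86 down
[Sat May 30 22:03:46 2026] libceph (e8020818-2100-11f0-8a12-9cdc71772100 e179074): osd86 up
"""

# Assumed layout: "libceph (<fsid> e<epoch>): osd<N> up|down".
PATTERN = re.compile(
    r"libceph \((?P<fsid>[0-9a-f-]+) e(?P<epoch>\d+)\): "
    r"osd(?P<osd>\d+) (?P<state>up|down)"
)


def parse_events(text):
    """Return (osd_id, state, epoch) tuples in the order they appear."""
    return [
        (int(m["osd"]), m["state"], int(m["epoch"]))
        for m in PATTERN.finditer(text)
    ]


def epochs_between_down_and_up(events):
    """Map each OSD to the number of epochs between its 'down' and next 'up'."""
    went_down = {}
    gaps = {}
    for osd, state, epoch in events:
        if state == "down":
            went_down[osd] = epoch
        elif osd in went_down:
            gaps[osd] = epoch - went_down.pop(osd)
    return gaps


events = parse_events(LOG)
gaps = epochs_between_down_and_up(events)
for osd, gap in gaps.items():
    print(f"osd.{osd}: marked up {gap} epochs after being marked down")
```

So although all four lines carry the same timestamp, the down and up events for each OSD are 15 and 17 epochs apart respectively.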

I don't see anything unusual in the Ceph cluster itself, nor in the log files
of the OSDs.

I'm not sure what to make of this. Why would this happen, and what would you
do?

Thanks for your insights,

Wannes Smet

_______________________________________________
ceph-users mailing list -- [email protected]