[ceph-users] Re: Weird behavior for 2 OSDs in our cluster

Wannes Smet via ceph-users Mon, 01 Jun 2026 06:09:14 -0700

I think I know what happened. ceph -s reported `2 osds(s) are not reachable`.


What I think happened is that while those 2 nodes were booting, the network 
bond on ceph-public might not have been up yet, while the network bond on the 
ceph-cluster network was up. Those 2 OSD containers were started anyway, 
probably only just before ceph-public came up. All the other OSD containers 
were started after the network stack was fully functional.

That explains all of the rest of my observations:

  * OSDs are still UP because heartbeating works (the other bond was likely 
    up). The cluster thinks all is well except for 2 OSDs being "unreachable".
  * Hence the mgr does not report OSDs down/out or reduced data availability; 
    they are effectively not down.
  * No mention of "reduced data availability".
  * It also matches recovery not starting: not OUT means no recovery.
  * It also matches the VMs running on our Proxmox nodes (which use RBD from 
    this cluster) being stuck at very early boot. They probably needed a PG on 
    either OSD.53 or OSD.86 and kept retrying because they didn't get a new 
    cluster map.
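
A minimal sketch of the kind of check behind the OSD_UNREACHABLE warning, 
assuming it simply compares each OSD's public address with the configured 
public network. The function name and the sample addresses are hypothetical 
(only the 192.168.11.0/24 subnet and the OSD ids 53 and 86 come from this 
thread), and only IPv4 "ip:port/nonce" strings are handled:

```python
import ipaddress


def osds_outside_network(public_addrs, network):
    """Return the ids of OSDs whose public address is not inside `network`.

    public_addrs maps an OSD id to an IPv4 "ip:port/nonce" string.
    """
    net = ipaddress.ip_network(network)
    outside = []
    for osd_id, addr in sorted(public_addrs.items()):
        # Drop the "/nonce" suffix, then the ":port" suffix.
        host = addr.split("/")[0].rsplit(":", 1)[0]
        if ipaddress.ip_address(host) not in net:
            outside.append(osd_id)
    return outside


# Hypothetical addresses: the real ones bound by osd.53 and osd.86 are unknown.
SAMPLE_ADDRS = {
    12: "192.168.11.21:6800/1001",
    53: "192.168.12.34:6802/1002",
    86: "192.168.12.35:6804/1003",
}

unreachable = osds_outside_network(SAMPLE_ADDRS, "192.168.11.0/24")
print(unreachable)
```

With the sample data this flags OSDs 53 and 86, i.e. the ones that would have 
bound to an address outside ceph-public while that bond was still down.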

Wannes
________________________________
From: Eugen Block via ceph-users <[email protected]>
Sent: Sunday, May 31, 2026 11:13
To: [email protected] <[email protected]>
Subject: [ceph-users] Re: Weird behavior for 2 OSDs in our cluster

Hi,

the MGR doesn't always report the correct PG status, so don't rely on
that too much. Sometimes it's necessary to restart primary OSDs for
stuck PGs, although a repeer could have been sufficient. Your Ceph
clients had to refresh their osdmap; that's when they noticed that
OSDs had been down. It's not a real-time log in this case, so there's
no need to worry. It's a common question though, I think we also asked
it 8 to 10 years ago. ;-)
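
To illustrate the osdmap refresh: a small sketch that parses the libceph
lines quoted below, assuming the e-prefixed number in each line is the
osdmap epoch. The regular expression and function name are illustrative
only, and the sample lines are the quoted ones with the mail line wrapping
removed:

```python
import re

# Matches e.g. "libceph (<fsid> e179035): osd53 down"
LINE_RE = re.compile(
    r"libceph \((?P<fsid>[0-9a-f-]+) e(?P<epoch>\d+)\): "
    r"osd(?P<osd>\d+) (?P<state>up|down)"
)


def parse_libceph_events(lines):
    """Return (epoch, osd_id, state) tuples for lines that match."""
    events = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            events.append(
                (int(match["epoch"]), int(match["osd"]), match["state"])
            )
    return events


FSID = "e8020818-2100-11f0-8a12-9cdc71772100"
SAMPLE = [
    f"[Sat May 30 22:03:46 2026] libceph ({FSID} e179035): osd53 down",
    f"[Sat May 30 22:03:46 2026] libceph ({FSID} e179050): osd53 up",
    f"[Sat May 30 22:03:46 2026] libceph ({FSID} e179057): osd86 down",
    f"[Sat May 30 22:03:46 2026] libceph ({FSID} e179074): osd86 up",
]

events = parse_libceph_events(SAMPLE)
epoch_span = events[-1][0] - events[0][0]
for epoch, osd, state in events:
    print(f"epoch {epoch}: osd.{osd} {state}")
print(f"epochs covered within one timestamp: {epoch_span}")
```

Under that assumption, the four lines cover 39 epochs within the same
second, which fits a client catching up on older maps rather than logging
events as they happen.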

Regards,
Eugen

Quoting Wannes Smet via ceph-users <[email protected]>:

> Hi,
>
> I'm running a Ceph cluster 19.2.2, 23 nodes, 152 OSDs, cephadm
> deployed. Mostly SAS SSDs, 12 NVMe SSDs.
>
> Yesterday we experienced a total power failure and everything went
> down hard. Also our Ceph cluster. There were a couple of things, but
> this stood out after it got back up:
>
> [ERR] OSD_UNREACHABLE: 2 osds(s) are not reachable
>  osd.53's public address is not in '192.168.11.0/24' subnet
>  osd.86's public address is not in '192.168.11.0/24' subnet
>
> ceph -s did not say reduced data {availability,redundancy} which is
> a bit "off", given that both OSDs are in separate hosts, failure
> domain=host. There must have been PGs with less than 3 replicas and
> also PGs with just one replica left?
>
> So I manually restarted those OSDs with systemctl, a recovery
> process started, and all our VMs "magically" started booting.
> I'm also surprised that the recovery process only started when those
> OSDs got back up.
>
> I didn't make too much of the above, but now this morning, I'm
> looking at the kernel ring buffer of our PVE nodes and I notice the
> logs below. Just a single "blip". All at the same time on all of our
> PVE nodes (ceph clients):
>
> [Sat May 30 22:03:46 2026] libceph
> (e8020818-2100-11f0-8a12-9cdc71772100 e179035): osd53 down
> [Sat May 30 22:03:46 2026] libceph
> (e8020818-2100-11f0-8a12-9cdc71772100 e179050): osd53 up
> [Sat May 30 22:03:46 2026] libceph
> (e8020818-2100-11f0-8a12-9cdc71772100 e179057): osd86 down
> [Sat May 30 22:03:46 2026] libceph
> (e8020818-2100-11f0-8a12-9cdc71772100 e179074): osd86 up
>
> I don't see anything weird in the Ceph cluster itself, nor in
> the log files of the OSDs.
>
> I'm not sure what to make of this. Why would this happen and what
> would you do?
>
> Thanks for your insights,
>
> Wannes Smet
>
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]


_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
