On Wednesday, January 15, 2020 14:37 GMT, "Nick Fisk" <n...@fisk.me.uk> wrote: 
> Hi All,
> Running 14.2.5, currently experiencing some network blips isolated to a 
> single rack which is under investigation. However, it appears following a 
> network blip, random OSDs in unaffected racks are sometimes not recovering 
> from the incident and are left running in a zombie state. The OSDs 
> appear to be running from a process perspective, but the cluster thinks they 
> are down and will not rejoin the cluster until the OSD process is restarted, 
> which incidentally takes a lot longer than usual (systemctl command takes a 
> couple of minutes to complete).
> If the OSD is left in this state, CPU and memory usage of the process appears 
> to climb, but never rejoins, at least for several hours that I have left 
> them. Not exactly sure what the OSD is trying to do during this period. 
> There's nothing in the logs during this hung state to indicate that anything 
> is happening, but I will try and inject more verbose logging next time it 
> occurs.
> Not sure if anybody has come across this before or has any ideas? In the 
> past, as long as OSDs have been running they have always rejoined following 
> any network issues.
> Nick
> Sample from OSD and cluster logs below. Blip happened at 12:06, I restarted 
> OSD at 12:26
> OSD Logs from OSD that hung (Note this OSD was not directly affected by 
> network outage)
> 2020-01-15 12:06:32.234 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no 
> reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping 
> sent 2020-01-15 12:06:1

It's just happened again and I managed to pull this out of debug_osd 20:

2020-01-15 16:29:01.464 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.182 
v2:[2a03:25e0:253:5::76]:6839/8394683 says i am down in 2343138
2020-01-15 16:29:01.464 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.184 
v2:[2a03:25e0:253:5::76]:6814/7394522 says i am down in 2343138
2020-01-15 16:29:01.464 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.190 
v2:[2a03:25e0:253:5::76]:6860/5986687 says i am down in 2343138
2020-01-15 16:29:01.668 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.19 
v2:[2a03:25e0:253:5::12]:6815/5153900 says i am down in 2343138

And this from the daemon status output:
sudo ceph daemon osd.87 status
{
    "cluster_fsid": "c1703b54-b4cd-41ab-a3ba-4fab241b62f3",
    "osd_fsid": "0cd8fe7d-17be-4982-b76f-ef1cbed0c19b",
    "whoami": 87,
    "state": "waiting_for_healthy",
    "oldest_map": 2342407,
    "newest_map": 2343121,
    "num_pgs": 218
}

So the OSD doesn't seem to be getting the latest map from the mons. Map 2343138 
obviously has osd.87 marked down, hence the error messages from the osd_pings. 
But I'm guessing the latest map the OSD has, 2343121, still has it marked up, 
so it never tries to "re-connect"?
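For what it's worth, the epoch gap is easy to confirm by parsing the daemon 
status JSON. A minimal sketch in Python, using the values from the output 
above (2343138 being the epoch the peers report, which is my assumption about 
the cluster's current epoch, not something I've pulled from the mons):

```python
import json

# Daemon status as printed above (normally the output of
# `ceph daemon osd.87 status`)
status = json.loads("""{
    "cluster_fsid": "c1703b54-b4cd-41ab-a3ba-4fab241b62f3",
    "osd_fsid": "0cd8fe7d-17be-4982-b76f-ef1cbed0c19b",
    "whoami": 87,
    "state": "waiting_for_healthy",
    "oldest_map": 2342407,
    "newest_map": 2343121,
    "num_pgs": 218
}""")

# Epoch in which the peer OSDs say osd.87 is down
cluster_epoch = 2343138

# How many epochs behind the stuck OSD is
lag = cluster_epoch - status["newest_map"]
print(f"osd.{status['whoami']} is {lag} epochs behind "
      f"(state: {status['state']})")
# -> osd.87 is 17 epochs behind (state: waiting_for_healthy)
```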

Seems similar to this post from a few years back, which didn't seem to end 
with any resolution:

Also found this PR for Nautilus which suggested it might be a fix for the 
issue, but it should already be part of the release I'm running:


ceph-users mailing list
