On Wednesday, January 15, 2020 14:37 GMT, "Nick Fisk" <n...@fisk.me.uk> wrote: 
 
> Hi All,
> 
> Running 14.2.5, currently experiencing some network blips isolated to a 
> single rack, which is under investigation. However, it appears that following a 
> network blip, random OSDs in unaffected racks sometimes do not recover 
> from the incident and are left running in a zombie state. The OSDs 
> appear to be running from a process perspective, but the cluster thinks they 
> are down, and they will not rejoin the cluster until the OSD process is restarted, 
> which incidentally takes a lot longer than usual (the systemctl command takes a 
> couple of minutes to complete).
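> 
> For reference, this is roughly how the mismatch shows up and how I'm clearing it. 
> osd.43 here is just the example from the logs further down, and the comments are 
> mine:
> 
>   ceph osd tree down                 # cluster view: the OSD is reported down
>   systemctl status ceph-osd@43       # process view: the daemon is still active
>   systemctl restart ceph-osd@43      # clears it, but takes a couple of minutes to return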
> 
> If the OSD is left in this state, CPU and memory usage of the process appears 
> to climb, but the OSD never rejoins, at least not in the several hours I have 
> left them running. Not exactly sure what the OSD is trying to do during this 
> period. There's nothing in the logs during this hung state to indicate that 
> anything is happening, but I will try to inject more verbose logging the next 
> time it occurs.
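> 
> In case it's useful to anyone else watching for this, the plan is to bump logging 
> on the live daemon with something along these lines (either form should work; the 
> OSD id is just an example):
> 
>   ceph daemon osd.43 config set debug_osd 20        # on the OSD host, via the admin socket
>   ceph tell osd.43 injectargs '--debug_osd 20/20'   # or remotely from a mon/admin node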
> 
> Not sure if anybody has come across this before, or has any ideas? In the past, 
> as long as OSDs have been running they have always rejoined following any 
> network issues.
> 
> Nick
> 
> Sample from the OSD and cluster logs below. The blip happened at 12:06; I 
> restarted the OSD at 12:26.
> 
> OSD logs from the OSD that hung (note this OSD was not directly affected by the 
> network outage):
> 2020-01-15 12:06:32.234 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no 
> reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping 
> sent 2020-01-15 12:06:1

<snip>
 
It's just happened again, and I managed to pull this out of debug_osd 20:

2020-01-15 16:29:01.464 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.182 
v2:[2a03:25e0:253:5::76]:6839/8394683 says i am down in 2343138
2020-01-15 16:29:01.464 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.184 
v2:[2a03:25e0:253:5::76]:6814/7394522 says i am down in 2343138
2020-01-15 16:29:01.464 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.190 
v2:[2a03:25e0:253:5::76]:6860/5986687 says i am down in 2343138
2020-01-15 16:29:01.668 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.19 
v2:[2a03:25e0:253:5::12]:6815/5153900 says i am down in 2343138

And this from the daemon status output:
sudo ceph daemon osd.87 status
{
    "cluster_fsid": "c1703b54-b4cd-41ab-a3ba-4fab241b62f3",
    "osd_fsid": "0cd8fe7d-17be-4982-b76f-ef1cbed0c19b",
    "whoami": 87,
    "state": "waiting_for_healthy",
    "oldest_map": 2342407,
    "newest_map": 2343121,
    "num_pgs": 218
}

So the OSD doesn't seem to be getting the latest map from the mons. Map 2343138 
obviously has osd.87 marked down, hence the error messages from the osd_pings. 
But I'm guessing the latest map the OSD has, 2343121, still has it marked up, so 
it never tries to "re-connect"?
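
For anyone wanting to check the same thing, comparing the cluster's current osdmap 
epoch with what the stuck OSD reports makes the gap obvious (osd.87 is the stuck 
one here):

  ceph osd stat                  # cluster's current osdmap epoch (at least 2343138, per the pings)
  ceph daemon osd.87 status      # newest_map stuck at 2343121, state "waiting_for_healthy"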

Seems similar to this post from a few years back, which didn't seem to end with 
any form of resolution: 
https://www.spinics.net/lists/ceph-devel/msg31788.html

Also found this PR for Nautilus which looks like it might be a fix for the issue, 
but it should already be part of the release I'm running: 
https://github.com/ceph/ceph/pull/23958

Nick
