Hi all,

We experienced something strange and scary on our Ceph cluster yesterday. 
The cluster is back to health now. I want to share our experience here, and 
hopefully someone can help us find the root cause and prevent it from 
happening again.

TL;DR: An OSD became very slow for an unknown reason; restarting the OSD 
daemon resolved it. Afterwards, I suspect the MGR and MONs were overloaded 
by the flood of slow-ops cluster log messages.

I noticed this about one hour after the cluster went into HEALTH_WARN. As 
far as I can remember, the warnings included MDS_TRIM, 
MDS_HEALTH_CLIENT_LATE_RELEASE, MDS_CLIENT_OLDEST_TID, 
MDS_SLOW_METADATA_IO, and SLOW_OPS.

There were about 20k slow ops, spread over all OSDs. When inspecting them, I 
found that almost all of them were waiting for sub-ops from OSD.25, while 
the ops on OSD.25 itself were queued for PG. This OSD was still working, but 
exceptionally slowly: block device write IO wait time was 2 s [1], and OSD 
write op latency was over 1200 s [2]. When all of this started, about 2000 
write IOPS were hitting the cluster, but all the other OSDs were working 
well. For comparison, another OSD (located in the same host, with an 
identical configuration) had a write IO wait time of <10 ms and a write op 
latency of <50 ms.
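In case anyone wants to do a similar inspection: this is a minimal sketch of how the blocked ops can be tallied from the admin-socket output of `ceph daemon osd.25 dump_ops_in_flight`. The JSON field names (`ops`, `description`, `type_data.flag_point`) are what I remember from Pacific's output, and the `osd.N` extraction from the description is just a heuristic, not an official API; verify against your own version:

```python
import json
import re
from collections import Counter

def tally_flag_points(dump_json: str) -> Counter:
    """Count in-flight ops by the stage they are stuck at
    (e.g. "waiting for sub ops", "queued for pg").

    `dump_json` is the output of `ceph daemon osd.N dump_ops_in_flight`.
    """
    dump = json.loads(dump_json)
    return Counter(op["type_data"]["flag_point"] for op in dump["ops"])

def peer_osds(description: str) -> list[int]:
    """Extract any 'osd.N' ids mentioned in an op description, to guess
    which peer a sub-op is waiting on (heuristic only)."""
    return [int(m) for m in re.findall(r"osd\.(\d+)", description)]

# Example with a fabricated two-op dump:
sample = json.dumps({
    "num_ops": 2,
    "ops": [
        {"description": "osd_op(client.1 1.2 ...)",
         "type_data": {"flag_point": "waiting for sub ops"}},
        {"description": "osd_op(client.2 1.3 ...)",
         "type_data": {"flag_point": "queued for pg"}},
    ],
})
print(tally_flag_points(sample))
```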

Then I restarted the OSD.25 daemon, and most things went back to normal (so 
this does not look like a hardware issue). But why it ever got into this 
state is still a mystery; I can only find logs complaining about slow ops.

That left me with about 2k slow ops, spread over all 5 MONs, most of them 
logm requests. I thought this would resolve itself, but things got worse 
again: I started to get MON_DISK_BIG and MON_DISK_LOW. This was scary, 
because every MON's disk usage seemed to grow without bound, and the cluster 
cannot work without the MONs. Then one of our hosts went OOM.

As for the OOM, I suspect it was the MGR taking too much memory, because 
after the standby MGR took over, its memory usage also grew without bound. 
And I suspect the MGR has a memory leak when processing the cluster log, 
because 30 minutes after the OSD slow ops were cleared, slow-ops log entries 
from over 30 minutes earlier were still rolling in on the dashboard.

The MON cluster was also having a hard time. After I reset the OOM'd host, 
the MON daemon on that host took 30 minutes to rejoin the quorum. Almost at 
the same time, the warnings about the MONs cleared. I don't know whether it 
was restarting another MON that did it, or whether they had just finished 
processing all the logs. MON disk usage also went back down to <200 MB.

So how can we prevent this from happening again? For the slow OSD, I think 
it will be hard to find the root cause, but I would appreciate any hints. I 
suspect Ceph may have an unbounded queue for cluster log messages in the 
OSD, MON and/or MGR. If so, I think logs should be dropped when the upstream 
cannot keep up, instead of being queued indefinitely. Also, the MONs and 
MGRs should have some mechanism to protect themselves from being overloaded 
by logs.
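For what it's worth, two knobs that look relevant to the log-flood side of this, as a sketch only (option names are from the Ceph docs; I have not verified that they would have helped in this exact situation, and silencing the cluster log has obvious downsides):

```
# ceph.conf fragment (or the equivalent `ceph config set ...`)

[osd]
# Stop daemons from forwarding their cluster log messages to the MONs.
# Drastic: you lose the cluster log, but it removes this load path.
clog_to_monitors = default=false

[mon]
# Threshold for the MON_DISK_BIG warning (default is 15 GiB).
# Raising it only silences the warning; it does not bound the store.
mon_data_size_warn = 16106127360
```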

We are running Ceph 16.2.5, with ~30 OSDs, most of them HDDs with their DB on SSD.

Regards,
Weiwen Hu

[1]: PromQL: rate(node_disk_write_time_seconds_total[1m]) / 
rate(node_disk_writes_completed_total[1m])
[2]: PromQL: rate(ceph_osd_op_w_latency_sum[1m]) / 
rate(ceph_osd_op_w_latency_count[1m])
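(For anyone reproducing the numbers: [2] is just the standard "sum over count" average, i.e. the delta of `ceph_osd_op_w_latency_sum` divided by the delta of `ceph_osd_op_w_latency_count` over the window. A trivial sketch of that arithmetic, with made-up sample values:

```python
def avg_latency(sum_prev: float, sum_now: float,
                count_prev: int, count_now: int) -> float:
    """Average per-op latency over a window, from two samples of a
    Prometheus-style <metric>_sum / <metric>_count counter pair."""
    dc = count_now - count_prev
    if dc == 0:
        return float("nan")  # no ops completed in the window
    return (sum_now - sum_prev) / dc

# e.g. 300 s of additional summed latency over 25 new ops -> 12 s per op
print(avg_latency(1000.0, 1300.0, 500, 525))  # 12.0
```
)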
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]