Hi Fabian,

Thank you for your prompt response. It's important for us to understand how 
these components interact, and I appreciate your assistance.

After replacing the switch in our Ceph environment, we had three days of 
normal operation before the issue recurred this morning. I noticed that TCP 
in/out traffic became unstable, with TCP errors occurring at the same time. 
The UDP in/out rates were 70K and 150K respectively, while errors peaked at 
around 50K per second.

I reviewed the Proxmox documentation and found that separating the cluster 
network from the storage network is recommended. We currently run more than 
20 Ceph nodes across five locations, and only one location has experienced 
this issue; fortunately, it has not happened elsewhere. While we plan to 
separate the networks soon, I was wondering whether there are any temporary 
solutions or configuration changes that could limit the corosync UDP traffic 
and stabilize the cluster in the meantime.
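For example, would raising the corosync token timeout be a reasonable stopgap 
while the links are still shared? A rough sketch of the change I have in mind 
for /etc/pve/corosync.conf (the 10000 ms value is only my guess, and I 
understand config_version must be bumped on every edit):

    totem {
        ...
        config_version: 24   # example number; must be incremented on each edit
        token: 10000         # more headroom on a congested link (value is a guess)
    }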

I appreciate your help in this matter and look forward to your response.

Peter

-----Original Message-----
From: Fabian Grünbichler <f.gruenbich...@proxmox.com> 
Sent: Wednesday, April 26, 2023 12:42 AM
To: ceph-users@ceph.io; Peter <peter...@raksmart.com>
Subject: Re: [ceph-users] PVE CEPH OSD heartbeat slow

On April 25, 2023 9:03 pm, Peter wrote:
> Dear all,
> 
> We are experiencing issues with Ceph after deploying it via PVE, with the 
> network backed by a 10G Cisco switch with the vPC feature enabled. We are 
> seeing slow OSD heartbeats and have not been able to identify any network 
> traffic issues.
> 
> Upon checking, we found that ping latency is around 0.1 ms, with occasional 
> 2% packet loss under flood ping, though not consistently. We also noticed a 
> large number of UDP packets on port 5405 and the 'corosync' process consuming 
> a significant amount of CPU.
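> For reference, the checks were roughly along these lines (commands quoted 
> from memory, so treat them as approximate):
>
>     ping -f -c 10000 <peer-ip>      # flood ping; the ~2% loss showed up here
>     ss -uanp 'sport = :5405'        # 5405 is corosync's default knet port
>     top -p $(pgrep corosync)        # corosync CPU usage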
> 
> When running the 'ceph -s' command, we observed slow OSD heartbeats on both 
> the back and front networks, with the longest latency being 2250.54 ms. We 
> suspect a network issue, but we are unsure how Ceph detects such long 
> latency. Additionally, we are wondering whether 2% packet loss can 
> significantly degrade Ceph's performance and even cause OSD processes to 
> fail occasionally.
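> As a concrete example of what we are trying to interpret: is something like
>
>     ceph daemon osd.0 dump_osd_network 1000   # heartbeat pings above 1000 ms
>
> (run on the OSD host) the right way to inspect these heartbeat ping times? 
> I am guessing from the docs here.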
> 
> We have heard about potential issues with RocksDB 6 causing OSD process 
> failures, and we are curious how to check the RocksDB version. Furthermore, 
> we are wondering how severe packet loss and latency must be to cause OSD 
> process crashes, and how the monitoring system determines that an OSD is 
> offline.
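> From what I have read (please correct me if wrong), each OSD pings its peers 
> every osd_heartbeat_interval seconds and reports a peer to the monitors once 
> osd_heartbeat_grace (default 20 s) passes without a reply. Is checking the 
> current values like this sufficient?
>
>     ceph config get osd osd_heartbeat_interval
>     ceph config get osd osd_heartbeat_grace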
> 
> We would greatly appreciate any assistance or insights you could provide on 
> these matters.
> Thanks,

Are you using separate (physical) links for Corosync and Ceph traffic?
If not, they will step on each other's toes and cause problems. Corosync is 
very latency-sensitive.

https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_cluster_network_requirements
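For an existing cluster, a dedicated link can be added by editing 
/etc/pve/corosync.conf, roughly like this (a sketch with hypothetical 
addresses; remember to bump config_version):

    nodelist {
        node {
            name: node1
            nodeid: 1
            ring0_addr: 10.10.10.1   # existing shared network
            ring1_addr: 10.10.20.1   # new, physically separate corosync link
        }
        ...
    }
    totem {
        ...
        interface {
            linknumber: 0
        }
        interface {
            linknumber: 1
        }
    }

Once the second link is up, knet can fail over between the links. See the 
chapter linked above for the details.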

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
