On April 25, 2023 9:03 pm, Peter wrote:
> Dear all,
> 
> We are experiencing issues with Ceph after deploying it with PVE, with the 
> network backed by a 10G Cisco switch with the vPC feature enabled. We are 
> seeing slow OSD heartbeats and have not been able to identify any network 
> traffic issues.
> 
> Upon checking, we found that ping latency is around 0.1 ms, with an 
> occasional 2% packet loss under flood ping, though not consistently. We 
> also noticed a large number of UDP port 5405 packets and the 'corosync' 
> process using a significant amount of CPU.
> 
> When running 'ceph -s', we observed slow OSD heartbeats on both the back 
> and front networks, with the longest latency being 2250.54ms. We suspect 
> this may be a network issue, but we are unsure how Ceph detects such long 
> latency. We are also wondering whether 2% packet loss can significantly 
> affect Ceph's performance and even cause the OSD process to fail at times.
> 
> We have heard about potential issues with RocksDB 6 causing OSD process 
> failures, and we are curious how to check the RocksDB version. Furthermore, 
> we are wondering how severe packet loss and latency must be to cause OSD 
> process crashes, and how the monitoring system determines that an OSD is 
> offline.
> 
> We would greatly appreciate any assistance or insights you could provide on 
> these matters.
> Thanks,

Are you using separate (physical) links for Corosync and Ceph traffic?
If not, they will step on each other's toes and cause problems; Corosync
is very latency-sensitive.
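
If you move Corosync to its own NIC, you can keep a second link as
fallback. A sketch of the relevant parts of /etc/pve/corosync.conf
(addresses and node names are made up, adapt to your setup; remember to
bump config_version in the totem section when editing):

    nodelist {
      node {
        name: pve1
        nodeid: 1
        quorum_votes: 1
        ring0_addr: 10.10.10.1   # dedicated Corosync NIC
        ring1_addr: 10.20.20.1   # fallback link
      }
      # ... same pattern for the other nodes ...
    }

    totem {
      # ...
      interface {
        linknumber: 0
      }
      interface {
        linknumber: 1
      }
    }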

https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_cluster_network_requirements
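
To pin down the 2% loss and confirm what the UDP 5405 traffic is doing,
I'd test the links directly between the nodes, e.g. (interface and host
names are placeholders; omping is what the PVE docs suggest for
simulating Corosync traffic, run it on all nodes at the same time):

    # watch the Corosync traffic and which link it actually uses
    tcpdump -ni eno1 udp port 5405

    # short flood test with corosync-like packets, ~10 seconds
    omping -c 10000 -i 0.001 -F -q pve1 pve2 pve3

Even a few percent loss matters here: Corosync has to retransmit, which
burns CPU and fits the high 'corosync' CPU usage you are seeing.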
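
On how Ceph measures this: each OSD pings its peers on both the front and
back networks every few seconds and keeps the round-trip times, which is
where the 2250.54ms figure comes from. You can dump the recent
measurements via the admin socket, e.g. for osd.0:

    # run on the node hosting osd.0; shows pings above the slow threshold
    ceph daemon osd.0 dump_osd_network

    # pass a threshold of 0 to see all recorded ping times
    ceph daemon osd.0 dump_osd_network 0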
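
Regarding RocksDB: the version is compiled into the ceph-osd binary, but
it is normally printed when the OSD opens its DB, so grepping the OSD log
should show it (log path and OSD ID are examples):

    grep -i 'rocksdb version' /var/log/ceph/ceph-osd.0.log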
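
As for when an OSD is marked down: peers report an OSD as failed once
they get no heartbeat reply within osd_heartbeat_grace (20 s by default),
and the monitors act once enough distinct reporters agree
(mon_osd_min_down_reporters, default 2). Intermittent 2% loss mostly
inflates latencies and retries; packet loss by itself normally gets an
OSD marked down rather than crashing the process. The effective values
can be checked with:

    ceph config get osd osd_heartbeat_grace
    ceph config get mon mon_osd_min_down_reporters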
