On April 25, 2023 9:03 pm, Peter wrote:
> Dear all,
>
> We are experiencing issues with Ceph after deploying it with PVE, with the
> network backed by a 10G Cisco switch with the vPC feature enabled. We are
> seeing slow OSD heartbeats and have not been able to identify any network
> traffic issues.
>
> Upon checking, we found that ping latency is around 0.1ms, with occasional
> 2% packet loss under flood ping, but not consistently. We also noticed a
> large number of UDP port 5405 packets and the 'corosync' process using a
> significant amount of CPU.
>
> When running 'ceph -s', we observed slow OSD heartbeats on both the back
> and front networks, with the longest latency being 2250.54ms. We suspect
> this is a network issue, but we are unsure how Ceph detects such long
> latencies. We are also wondering whether 2% packet loss can significantly
> affect Ceph's performance and even cause the OSD process to fail at times.
>
> We have heard about potential issues with RocksDB 6 causing OSD process
> failures, and we are curious how to check the RocksDB version. Furthermore,
> we are wondering how severe packet loss and latency must be to cause OSD
> crashes, and how the monitoring system determines that an OSD is offline.
>
> We would greatly appreciate any assistance or insights you could provide
> on these matters.
>
> Thanks,
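to your detection question: Ceph measures this itself. Each OSD
periodically pings its heartbeat peers over both the front (public) and
back (cluster) network, and the "slow OSD heartbeat" warning fires when
those pings exceed a threshold (1 second by default, derived from
osd_heartbeat_grace). An OSD gets marked down when enough peers report
missed heartbeats for longer than osd_heartbeat_grace (20s by default),
or when it stops sending beacons to the monitors. Packet loss in the low
percent range typically causes flapping and slow-heartbeat warnings, not
OSD crashes; actual crashes point at something else (bugs, disk or
RocksDB trouble).

you can dump the recorded ping times per OSD via the admin socket. a
quick sketch (osd.0 and the 100ms threshold are just examples; run it on
the node hosting that OSD):

    # show heartbeat ping times above 100 ms (argument is a threshold in ms)
    ceph daemon osd.0 dump_osd_network 100

as for RocksDB: the version is bundled with the Ceph build, so it only
changes when you upgrade Ceph ('ceph versions' shows what you run).
RocksDB also logs its version when the OSD starts, so - assuming default
log paths and debug settings - something like this should find it:

    # search OSD startup logs for the embedded RocksDB version
    grep -i "rocksdb version" /var/log/ceph/ceph-osd.*.log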
are you using separate (physical) links for Corosync and Ceph traffic? if not, they will step on each other's toes and cause problems. Corosync is very latency sensitive.

https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_cluster_network_requirements
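the symptoms you describe fit that: UDP 5405 is corosync's default port,
and high corosync CPU usage is a sign it is fighting for the link. to
check whether corosync already considers the links unhealthy, something
like this (a sketch; run on each node, output format depends on your
corosync version):

    # per-link/ring status as corosync sees it
    corosync-cfgtool -s

    # quorum and membership from the PVE side
    pvecm status

if links show faults or nodes drop in and out of membership, give
corosync a dedicated physical link (as the docs above recommend) before
chasing Ceph-level tuning.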
