Did you try to tcpdump the connections to see who closes the connection, and how? A normal FIN/ACK, or a timeout? Maybe some network device in between? (The latter is less likely, since you said you can trigger the error with high load.)
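For example, something like this on the storage nodes should show whether the peer sends a clean FIN, a RST, or just goes silent (assuming the default glusterd management port 24007; the brick processes use separate ports):

  tcpdump -i any -nn 'port 24007 and (tcp[tcpflags] & (tcp-fin|tcp-rst) != 0)'

Running it on both sides while you reproduce the problem under high I/O load should also tell you which side initiated the close.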
<dpglus...@posteo.de> wrote on Thu, Aug 18, 2022 at 12:38:

> I just niced all glusterfsd processes on all nodes to a value of -10.
> The problem just occurred, so it seems nicing the processes didn't help.
>
> On 18.08.2022 09:54, Péter Károly JUHÁSZ wrote:
> > What if you renice the gluster processes to some negative value?
> >
> > <dpglus...@posteo.de> wrote on Thu, Aug 18, 2022 at 09:45:
> >
> >> Hi folks,
> >>
> >> I am running multiple GlusterFS servers in multiple datacenters. Every datacenter is basically the same setup: 3x storage nodes, 3x KVM hypervisors (oVirt) and 2x HPE switches which act as one logical unit. The NICs of all servers are attached to both switches with a bond of two NICs, in case one of the switches has a major problem. In one datacenter I have had strange problems with GlusterFS for nearly half a year now and I'm not able to figure out the root cause.
> >>
> >> Environment
> >> - glusterfs 9.5 running on CentOS 7.9.2009 (Core)
> >> - three gluster volumes, all options configured identically
> >>
> >> root@storage-001# gluster volume info
> >> Volume Name: g-volume-domain
> >> Type: Replicate
> >> Volume ID: ffd3baa5-6125-48da-a5a4-5ee3969cfbd0
> >> Status: Started
> >> Snapshot Count: 0
> >> Number of Bricks: 1 x 3 = 3
> >> Transport-type: tcp
> >> Bricks:
> >> Brick1: storage-003.my.domain:/mnt/bricks/g-volume-domain
> >> Brick2: storage-002.my.domain:/mnt/bricks/g-volume-domain
> >> Brick3: storage-001.my.domain:/mnt/bricks/g-volume-domain
> >> Options Reconfigured:
> >> client.event-threads: 4
> >> performance.cache-size: 1GB
> >> server.event-threads: 4
> >> server.allow-insecure: On
> >> network.ping-timeout: 42
> >> performance.client-io-threads: off
> >> nfs.disable: on
> >> transport.address-family: inet
> >> cluster.quorum-type: auto
> >> network.remote-dio: enable
> >> cluster.eager-lock: enable
> >> performance.stat-prefetch: off
> >> performance.io-cache: off
> >> performance.quick-read: off
> >> cluster.data-self-heal-algorithm: diff
> >> storage.owner-uid: 36
> >> storage.owner-gid: 36
> >> performance.readdir-ahead: on
> >> performance.read-ahead: off
> >> client.ssl: off
> >> server.ssl: off
> >> auth.ssl-allow: storage-001.my.domain,storage-002.my.domain,storage-003.my.domain,hv-001.my.domain,hv-002.my.domain,hv-003.my.domain
> >> ssl.cipher-list: HIGH:!SSLv2
> >> cluster.shd-max-threads: 4
> >> diagnostics.latency-measurement: on
> >> diagnostics.count-fop-hits: on
> >> performance.io-thread-count: 32
> >>
> >> Problem
> >> The glusterd on one storage node seems to lose the connection to another storage node. When the problem occurs, the first message in /var/log/glusterfs/glusterd.log is always the following (variable values are replaced with "x"):
> >> [2022-08-16 05:01:28.615441 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-00x.my.domain> (<xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx>), in state <Peer in Cluster>, has disconnected from glusterd.
> >>
> >> I will post a filtered log for this specific error from each of my storage nodes below.
> >> storage-001:
> >> root@storage-001# tail -n 100000 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
> >> [2022-08-16 05:01:28.615441 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
> >> [2022-08-16 05:34:47.721060 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-003.my.domain> (<a911feef-14c7-4740-a7ae-1d475a724c9f>), in state <Peer in Cluster>, has disconnected from glusterd.
> >> [2022-08-16 06:01:22.472973 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
> >> root@storage-001#
> >>
> >> storage-002:
> >> root@storage-002# tail -n 100000 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
> >> [2022-08-16 05:01:34.502322 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-003.my.domain> (<a911feef-14c7-4740-a7ae-1d475a724c9f>), in state <Peer in Cluster>, has disconnected from glusterd.
> >> [2022-08-16 05:19:16.898406 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>), in state <Peer in Cluster>, has disconnected from glusterd.
> >> [2022-08-16 06:01:22.462676 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>), in state <Peer in Cluster>, has disconnected from glusterd.
> >> [2022-08-16 10:17:52.154501 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>), in state <Peer in Cluster>, has disconnected from glusterd.
> >> root@storage-002#
> >>
> >> storage-003:
> >> root@storage-003# tail -n 100000 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
> >> [2022-08-16 05:24:18.225432 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
> >> [2022-08-16 05:27:22.683234 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
> >> [2022-08-16 10:17:50.624775 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
> >> root@storage-003#
> >>
> >> After this message it takes a couple of seconds (in the specific example of 2022-08-16 it's one to four seconds) and the disconnected node is reachable again:
> >> [2022-08-16 05:01:32.110518 +0000] I [MSGID: 106493] [glusterd-rpc-ops.c:474:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6, host: storage-002.my.domain, port: 0
> >>
> >> This behavior is the same on all nodes - there is a disconnect of a gluster node, and a couple of seconds later the disconnected node is reachable again. After the reconnect the glustershd is invoked and heals all the data. How can I figure out the root cause of these random disconnects?
> >>
> >> My debugging actions so far:
> >> - checked dmesg -> zero messages around the time of the disconnects
> >> - checked the switch -> no port down/up, no packet errors
> >> - disabled ssl on the gluster volumes -> disconnects still occurring
> >> - checked the dropped/error packets on the network interfaces of the storage nodes -> no dropped packets, no errors
> >> - constant ping check between all nodes while a disconnect occurs -> zero packet loss, no high latencies
> >> - temporarily deactivated one of the two interfaces that form the bond -> disconnects still occurring
> >> - updated gluster from 6.x to 9.5 -> disconnects still occurring
> >>
> >> Important info: I can force this error to happen if I put some high I/O load on one of the gluster volumes.
> >>
> >> I suspect there could be an issue with a network queue overflow or something like that, but that theory does not match the result of my ping check.
> >>
> >> What would be your next step to debug this error?
> >>
> >> Thanks in advance!
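Since your ping check shows no loss but you can trigger the disconnects with high I/O load, it may also be worth watching the TCP retransmission counters on the glusterd sockets while you reproduce it, for example (again assuming the default management port 24007):

  ss -tino '( sport = :24007 or dport = :24007 )'
  netstat -s | grep -i retrans

If the retransmit counters climb during the high-load window, packets are being dropped somewhere between the peers even though ICMP pings get through.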