I just reniced all glusterfsd processes on all nodes to a nice value of -10. The problem just occurred again, so renicing the processes did not help.
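
For reference, the renice was done roughly like this on every node (a sketch; the exact invocation may differ):

root@storage-001# pgrep glusterfsd | xargs -r renice -n -10 -p
root@storage-001# ps -C glusterfsd -o pid,ni,comm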

On 18.08.2022 at 09:54, Péter Károly JUHÁSZ wrote:
What if you renice the gluster processes to some negative value?

 <dpglus...@posteo.de> wrote on Thu, 18 Aug 2022 at 09:45:

Hi folks,

I am running multiple GlusterFS servers in multiple datacenters. Every datacenter has basically the same setup: 3x storage nodes, 3x KVM hypervisors (oVirt) and 2x HPE switches acting as one logical unit. The NICs of all servers are attached to both switches in a bond of two NICs, in case one of the switches has a major problem. In one datacenter I have been having strange problems with GlusterFS for nearly half a year now, and I am not able to figure out the root cause.

Environment:
- glusterfs 9.5 running on CentOS 7.9.2009 (Core)
- three gluster volumes, all options configured identically

root@storage-001# gluster volume info
Volume Name: g-volume-domain
Type: Replicate
Volume ID: ffd3baa5-6125-48da-a5a4-5ee3969cfbd0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: storage-003.my.domain:/mnt/bricks/g-volume-domain
Brick2: storage-002.my.domain:/mnt/bricks/g-volume-domain
Brick3: storage-001.my.domain:/mnt/bricks/g-volume-domain
Options Reconfigured:
client.event-threads: 4
performance.cache-size: 1GB
server.event-threads: 4
server.allow-insecure: On
network.ping-timeout: 42
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.quick-read: off
cluster.data-self-heal-algorithm: diff
storage.owner-uid: 36
storage.owner-gid: 36
performance.readdir-ahead: on
performance.read-ahead: off
client.ssl: off
server.ssl: off
auth.ssl-allow: storage-001.my.domain,storage-002.my.domain,storage-003.my.domain,hv-001.my.domain,hv-002.my.domain,hv-003.my.domain
ssl.cipher-list: HIGH:!SSLv2
cluster.shd-max-threads: 4
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
performance.io-thread-count: 32
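
For reference, these are the commands I use to check peer and brick connectivity between incidents (output omitted here):

root@storage-001# gluster peer status
root@storage-001# gluster volume status g-volume-domain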

Problem
The glusterd on one storage node seems to lose its connection to another storage node. When the problem occurs, the first message in /var/log/glusterfs/glusterd.log is always the following (variable values are filled with "x"):
[2022-08-16 05:01:28.615441 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-00x.my.domain> (<xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx>), in state <Peer in Cluster>, has disconnected from glusterd.

I will post a filtered log for this specific error from each of my storage nodes below.
storage-001:
root@storage-001# tail -n 100000 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
[2022-08-16 05:01:28.615441 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
[2022-08-16 05:34:47.721060 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-003.my.domain> (<a911feef-14c7-4740-a7ae-1d475a724c9f>), in state <Peer in Cluster>, has disconnected from glusterd.
[2022-08-16 06:01:22.472973 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
root@storage-001#

storage-002:
root@storage-002# tail -n 100000 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
[2022-08-16 05:01:34.502322 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-003.my.domain> (<a911feef-14c7-4740-a7ae-1d475a724c9f>), in state <Peer in Cluster>, has disconnected from glusterd.
[2022-08-16 05:19:16.898406 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>), in state <Peer in Cluster>, has disconnected from glusterd.
[2022-08-16 06:01:22.462676 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>), in state <Peer in Cluster>, has disconnected from glusterd.
[2022-08-16 10:17:52.154501 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>), in state <Peer in Cluster>, has disconnected from glusterd.
root@storage-002#

storage-003:
root@storage-003# tail -n 100000 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
[2022-08-16 05:24:18.225432 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
[2022-08-16 05:27:22.683234 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
[2022-08-16 10:17:50.624775 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
root@storage-003#

After this message it takes a couple of seconds (in the specific example of 2022-08-16, one to four seconds) until the disconnected node is reachable again:
[2022-08-16 05:01:32.110518 +0000] I [MSGID: 106493] [glusterd-rpc-ops.c:474:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6, host: storage-002.my.domain, port: 0

This behavior is the same on all nodes: a gluster node disconnects, and a couple of seconds later the disconnected node is reachable again. After the reconnect, glustershd is invoked and heals all the data; I verify that the heal queue drains with the check below. How can I figure out the root cause of these random disconnects?
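
A minimal heal check, using the volume name from above:

root@storage-001# gluster volume heal g-volume-domain info summary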

My debugging actions so far:
- checked dmesg -> zero messages around the time of the disconnects
- checked the switches -> no port down/up, no packet errors
- disabled SSL on the gluster volumes -> disconnects still occurring
- checked the dropped/error packets on the network interfaces of the storage nodes -> no dropped packets, no errors
- ran a constant ping check between all nodes while a disconnect occurred (see the sketch after this list) -> zero packet loss, no high latencies
- temporarily deactivated one of the two interfaces that form the bond -> disconnects still occurring
- updated gluster from 6.x to 9.5 -> disconnects still occurring
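
The ping check was essentially a timestamped ping between every pair of nodes, along these lines (file paths are examples):

root@storage-001# ping -D -i 0.2 storage-002.my.domain > /tmp/ping-storage-002.log 2>&1 &

After a disconnect I grep the log around the timestamp of the glusterd message for missing icmp_seq numbers or unusually high round-trip times.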

Important info: I can force this error to happen by putting high I/O load on one of the gluster volumes.
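
The kind of load that triggers it is, for example (paths and sizes are illustrative), a sustained sequential write against a FUSE mount of the volume:

root@hv-001# mount -t glusterfs storage-001.my.domain:/g-volume-domain /mnt/loadtest
root@hv-001# dd if=/dev/zero of=/mnt/loadtest/bigfile bs=1M count=20480 conv=fsync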

I suspect there could be an issue like a network queue overflow, but that theory does not match the results of my ping check.
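
As a next step, a rough sketch of the counters I plan to compare before and after a forced disconnect (interface names are examples; 24007 is the glusterd management port):

root@storage-001# netstat -s | grep -Ei 'retrans|overflow|pruned|collapsed'
root@storage-001# ss -timo state established '( sport = :24007 or dport = :24007 )'
root@storage-001# ethtool -S eth0 | grep -iE 'drop|discard'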

What would be your next step to debug this error?

Thanks in advance!