Hi Ravi, thanks again for your help.
Here is the output of "cat graphs/active/VM_Storage_1-client-11/private" from the same node where glustershd is complaining:

[xlator.protocol.client.VM_Storage_1-client-11.priv]
fd.0.remote_fd = 1
------ = ------
granted-posix-lock[0] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = F_RDLCK, l_start = 100, l_len = 1
granted-posix-lock[1] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 101, fl_end = 101, user_flock: l_type = F_RDLCK, l_start = 101, l_len = 1
granted-posix-lock[2] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 103, fl_end = 103, user_flock: l_type = F_RDLCK, l_start = 103, l_len = 1
granted-posix-lock[3] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = F_RDLCK, l_start = 201, l_len = 1
granted-posix-lock[4] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type = F_RDLCK, l_start = 203, l_len = 1
------ = ------
fd.1.remote_fd = 0
------ = ------
granted-posix-lock[0] = owner = b43238094746d9fe, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = F_RDLCK, l_start = 100, l_len = 1
granted-posix-lock[1] = owner = b43238094746d9fe, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = F_RDLCK, l_start = 201, l_len = 1
granted-posix-lock[2] = owner = b43238094746d9fe, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type = F_RDLCK, l_start = 203, l_len = 1
------ = ------
fd.2.remote_fd = 3
------ = ------
granted-posix-lock[0] = owner = 53526588c515153b, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = F_RDLCK, l_start = 100, l_len = 1
granted-posix-lock[1] = owner = 53526588c515153b, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = F_RDLCK, l_start = 201, l_len = 1
granted-posix-lock[2] = owner = 53526588c515153b, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type = F_RDLCK, l_start = 203, l_len = 1
------ = ------
fd.3.remote_fd = 2
------ = ------
granted-posix-lock[0] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = F_RDLCK, l_start = 100, l_len = 1
granted-posix-lock[1] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 101, fl_end = 101, user_flock: l_type = F_RDLCK, l_start = 101, l_len = 1
granted-posix-lock[2] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 103, fl_end = 103, user_flock: l_type = F_RDLCK, l_start = 103, l_len = 1
granted-posix-lock[3] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = F_RDLCK, l_start = 201, l_len = 1
granted-posix-lock[4] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type = F_RDLCK, l_start = 203, l_len = 1
------ = ------
connected = 1
total_bytes_read = 6665235356
ping_timeout = 42
total_bytes_written = 4756303549
ping_msgs_sent = 3662
msgs_sent = 16786186

So they seem to be connected there.
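In case it is useful, here is a rough sketch of how the same check could be run for every client xlator of this mount in one go (assuming the same .meta layout as above, i.e. one graphs/active/<xlator-name>/private file per brick connection):

# run from inside the .meta directory of the fuse mount
for f in graphs/active/VM_Storage_1-client-*/private; do
    printf '%s: ' "$f"
    grep '^connected' "$f"
done

This should print one "connected = 0/1" line per brick connection, which makes it easy to spot any that are not connected.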
*However* -- apparently they are not connected on server 2 (where I have just re-mounted the volume):

[root@lab-cnvirt-h02 .meta]# cat graphs/active/VM_Storage_1-client-11/private
[xlator.protocol.client.VM_Storage_1-client-11.priv]
*connected = 0*
total_bytes_read = 50020
ping_timeout = 42
total_bytes_written = 84628
ping_msgs_sent = 0
msgs_sent = 0
[root@lab-cnvirt-h02 .meta]# cat graphs/active/VM_Storage_1-client-20/private
[xlator.protocol.client.VM_Storage_1-client-20.priv]
*connected = 0*
total_bytes_read = 53300
ping_timeout = 42
total_bytes_written = 90180
ping_msgs_sent = 0
msgs_sent = 0

The other bricks look connected...

Regards,
Marco

On Thu, 20 May 2021 at 14:02, Ravishankar N <ravishan...@redhat.com> wrote:

> Hi Marco,
>
> On Wed, May 19, 2021 at 8:02 PM Marco Fais <evi...@gmail.com> wrote:
>
>> Hi Ravi,
>>
>> thanks a million for your reply.
>>
>> I have replicated the issue in my test cluster by bringing one of the nodes down, and then up again.
>> The glustershd process on the restarted node is now complaining about connectivity to two bricks in one of my volumes:
>>
>> ---
>> [2021-05-19 14:05:14.462133 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 2-VM_Storage_1-client-11: changing port to 49170 (from 0)
>> [2021-05-19 14:05:14.464971 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 2-VM_Storage_1-client-11: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
>> [2021-05-19 14:05:14.465209 +0000] W [MSGID: 114043] [client-handshake.c:727:client_setvolume_cbk] 2-VM_Storage_1-client-11: failed to set the volume [{errno=2}, {error=No such file or directory}]
>> [2021-05-19 14:05:14.465236 +0000] W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 2-VM_Storage_1-client-11: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}]
>> [2021-05-19 14:05:14.465248 +0000] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 2-VM_Storage_1-client-11: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}]
>> [2021-05-19 14:05:14.465256 +0000] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 2-VM_Storage_1-client-11: sending CHILD_CONNECTING event []
>> [2021-05-19 14:05:14.465291 +0000] I [MSGID: 114018] [client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-11: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=VM_Storage_1-client-11}]
>> [2021-05-19 14:05:14.473598 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 2-VM_Storage_1-client-20: changing port to 49173 (from 0)
>>
>
> The above logs indicate that shd is trying to connect to the bricks on ports 49170 and 49173 respectively, when it should have done so using 49172 and 49169 (as per the volume status and ps output). Shd gets the brick port numbers from glusterd, so I'm not sure what is going on here. Do you have fuse mounts on this particular node? If you don't, you can mount one temporarily, then check whether the connection to the bricks is successful from the .meta folder of the mount:
>
> cd /path-to-fuse-mount
> cd .meta
> cat graphs/active/VM_Storage_1-client-11/private
> cat graphs/active/VM_Storage_1-client-20/private
>
> etc. and check if connected=1 or 0.
>
> I just wanted to see whether it is only the shd, or whether the other clients are also unable to connect to the bricks from this node.
> FWIW, I tried upgrading from 7.9 to 8.4 on a test machine and the shd was able to connect to the bricks just fine.
>
> Regards,
> Ravi
>
>> [2021-05-19 14:05:14.476543 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 2-VM_Storage_1-client-20: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
>> [2021-05-19 14:05:14.476764 +0000] W [MSGID: 114043] [client-handshake.c:727:client_setvolume_cbk] 2-VM_Storage_1-client-20: failed to set the volume [{errno=2}, {error=No such file or directory}]
>> [2021-05-19 14:05:14.476785 +0000] W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 2-VM_Storage_1-client-20: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}]
>> [2021-05-19 14:05:14.476799 +0000] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 2-VM_Storage_1-client-20: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}]
>> [2021-05-19 14:05:14.476812 +0000] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 2-VM_Storage_1-client-20: sending CHILD_CONNECTING event []
>> [2021-05-19 14:05:14.476849 +0000] I [MSGID: 114018] [client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-20: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=VM_Storage_1-client-20}]
>> ---
>>
>> The two bricks are the following:
>> VM_Storage_1-client-20 --> Brick21: lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick (arbiter)
>> VM_Storage_1-client-11 --> Brick12: lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick (arbiter)
>> (In this case the issue is on two arbiter bricks, but that is not always the case)
>>
>> The port information via "gluster volume status VM_Storage_1" on the affected node (the same one running the glustershd that reports the issue) is:
>> Brick lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick    *49172*    0    Y    3978256
>> Brick lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick    *49169*    0    Y    3978224
>>
>> This is aligned with the actual ports of the processes:
>> root 3978256 1.5 0.0 1999568 30372 ? Ssl May18 15:56 /usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b5_arb-brick -p /var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b5_arb-brick.pid -S /var/run/gluster/2b1dd3ca06d39a59.socket --brick-name /bricks/vm_b5_arb/brick -l /var/log/glusterfs/bricks/bricks-vm_b5_arb-brick.log --xlator-option *-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 --process-name brick --brick-port *49172* --xlator-option VM_Storage_1-server.listen-port=*49172*
>> root 3978224 4.3 0.0 1867976 27928 ? Ssl May18 44:55 /usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b3_arb-brick -p /var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b3_arb-brick.pid -S /var/run/gluster/00d461b7d79badc9.socket --brick-name /bricks/vm_b3_arb/brick -l /var/log/glusterfs/bricks/bricks-vm_b3_arb-brick.log --xlator-option *-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 --process-name brick --brick-port *49169* --xlator-option VM_Storage_1-server.listen-port=*49169*
>>
>> So the issue seems to be specifically in glustershd, as the *glusterd process seems to be aware of the right port* (it matches the real port, and the brick is indeed up according to the status).
>>
>> I then requested a statedump as you suggested, and the bricks do appear to be disconnected:
>>
>> [xlator.protocol.client.VM_Storage_1-client-11.priv]
>> *connected=0*
>> total_bytes_read=341120
>> ping_timeout=42
>> total_bytes_written=594008
>> ping_msgs_sent=0
>> msgs_sent=0
>>
>> [xlator.protocol.client.VM_Storage_1-client-20.priv]
>> *connected=0*
>> total_bytes_read=341120
>> ping_timeout=42
>> total_bytes_written=594008
>> ping_msgs_sent=0
>> msgs_sent=0
>>
>> The other important thing to note is that the bricks that fail to connect are always on the same (remote) node... i.e. they are both on node 3 in this case. That always seems to be the case; I have not encountered a scenario where bricks from different nodes report this issue (at least for the same volume).
>>
>> Please let me know if you need any additional info.
>>
>> Regards,
>> Marco
>>
>> On Wed, 19 May 2021 at 06:31, Ravishankar N <ravishan...@redhat.com> wrote:
>>
>>> On Mon, May 17, 2021 at 4:22 PM Marco Fais <evi...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am having significant issues with glustershd with releases 8.4 and 9.1.
>>>>
>>>> My oVirt clusters are using gluster storage backends, and were running fine with Gluster 7.x (shipped with earlier versions of oVirt Node 4.4.x). Recently the oVirt project moved to Gluster 8.4 for the nodes, and hence I have moved to this release when upgrading my clusters.
>>>>
>>>> Since then I have been having issues whenever one of the nodes is brought down; when the node comes back online the bricks are typically back up and working, but some (random) glustershd processes on the various nodes seem to have issues connecting to some of them.
>>>>
>>>
>>> When the issue happens, can you check if the TCP port numbers of the brick (glusterfsd) processes displayed in `gluster volume status` match the actual port numbers observed (i.e. the --brick-port argument) when you run `ps aux | grep glusterfsd`? If they don't match, then glusterd has incorrect brick port information in its memory and is serving it to glustershd. Restarting glusterd instead of (killing the bricks + `volume start force`) should fix it, although we need to find out why glusterd serves incorrect port numbers.
>>>
>>> If they do match, then can you take a statedump of glustershd to check whether it is indeed disconnected from the bricks? You will need to verify that 'connected=1' in the statedump. See the "Self-heal is stuck/ not getting completed." section in https://docs.gluster.org/en/latest/Troubleshooting/troubleshooting-afr/. A statedump can be taken with `kill -SIGUSR1 $pid-of-glustershd`.
>>> It will be generated in the /var/run/gluster/ directory.
>>>
>>> Regards,
>>> Ravi
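For anyone following along, here is a rough sketch of that statedump check as concrete commands (the pgrep pattern and the glusterdump.* file names are assumptions based on the default /var/run/gluster dump location Ravi mentions above; adjust as needed):

# trigger a statedump of the self-heal daemon (assumes a single glustershd process on the node)
kill -SIGUSR1 "$(pgrep -f glustershd)"
# then inspect the connection state of the two suspect client xlators in the dump
grep -A1 'VM_Storage_1-client-\(11\|20\)\.priv' /var/run/gluster/glusterdump.*

Each .priv section header should be followed by a connected=0/1 line, as in the statedump excerpts quoted above.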