Srijan, no problem at all -- thanks for your help. If you need any additional information, please let me know.
Regards, Marco On Thu, 27 May 2021 at 18:39, Srijan Sivakumar <ssiva...@redhat.com> wrote: > Hi Marco, > > Thank you for opening the issue. I'll check the log contents and get back > to you. > > On Thu, May 27, 2021 at 10:50 PM Marco Fais <evi...@gmail.com> wrote: > >> Srijan >> >> thanks a million -- I have opened the issue as requested here: >> >> https://github.com/gluster/glusterfs/issues/2492 >> >> I have attached the glusterd.log and glustershd.log files, but please let >> me know if there is any other test I should do or logs I should provide. >> >> >> Thanks, >> Marco >> >> >> On Wed, 26 May 2021 at 18:09, Srijan Sivakumar <ssiva...@redhat.com> >> wrote: >> >>> Hi Marco, >>> >>> If possible, let's open an issue in github and track this from there. I >>> am checking the previous mails in the chain to see if I can infer something >>> about the situation. It would be helpful if we could analyze this with the >>> help of log files. Especially glusterd.log and glustershd.log. >>> >>> To open an issue, you can use this link : Open a new issue >>> <https://github.com/gluster/glusterfs/issues/new> >>> >>> On Wed, May 26, 2021 at 5:02 PM Marco Fais <evi...@gmail.com> wrote: >>> >>>> Ravi, >>>> >>>> thanks a million. >>>> @Mohit, @Srijan please let me know if you need any additional >>>> information. >>>> >>>> Thanks, >>>> Marco >>>> >>>> >>>> On Tue, 25 May 2021 at 17:28, Ravishankar N <ravishan...@redhat.com> >>>> wrote: >>>> >>>>> Hi Marco, >>>>> I haven't had any luck yet. Adding Mohit and Srijan who work in >>>>> glusterd in case they have some inputs. >>>>> -Ravi >>>>> >>>>> >>>>> On Tue, May 25, 2021 at 9:31 PM Marco Fais <evi...@gmail.com> wrote: >>>>> >>>>>> Hi Ravi >>>>>> >>>>>> just wondering if you have any further thoughts on this -- >>>>>> unfortunately it is something still very much affecting us at the moment. >>>>>> I am trying to understand how to troubleshoot it further but haven't >>>>>> been able to make much progress... >>>>>> >>>>>> Thanks, >>>>>> Marco >>>>>> >>>>>> >>>>>> On Thu, 20 May 2021 at 19:04, Marco Fais <evi...@gmail.com> wrote: >>>>>> >>>>>>> Just to complete... 
>>>>>>> >>>>>>> from the FUSE mount log on server 2 I see the same errors as in >>>>>>> glustershd.log on node 1: >>>>>>> >>>>>>> [2021-05-20 17:58:34.157971 +0000] I [MSGID: 114020] >>>>>>> [client.c:2319:notify] 0-VM_Storage_1-client-11: parent translators are >>>>>>> ready, attempting connect on transport [] >>>>>>> [2021-05-20 17:58:34.160586 +0000] I >>>>>>> [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-VM_Storage_1-client-11: changing >>>>>>> port >>>>>>> to 49170 (from 0) >>>>>>> [2021-05-20 17:58:34.160608 +0000] I >>>>>>> [socket.c:849:__socket_shutdown] 0-VM_Storage_1-client-11: intentional >>>>>>> socket shutdown(20) >>>>>>> [2021-05-20 17:58:34.161403 +0000] I [MSGID: 114046] >>>>>>> [client-handshake.c:857:client_setvolume_cbk] 0-VM_Storage_1-client-10: >>>>>>> Connected, attached to remote volume >>>>>>> [{conn-name=VM_Storage_1-client-10}, >>>>>>> {remote_subvol=/bricks/vm_b3_vol/brick}] >>>>>>> [2021-05-20 17:58:34.161513 +0000] I [MSGID: 108002] >>>>>>> [afr-common.c:6435:afr_notify] 0-VM_Storage_1-replicate-3: >>>>>>> Client-quorum is >>>>>>> met >>>>>>> [2021-05-20 17:58:34.162043 +0000] I [MSGID: 114020] >>>>>>> [client.c:2319:notify] 0-VM_Storage_1-client-13: parent translators are >>>>>>> ready, attempting connect on transport [] >>>>>>> [2021-05-20 17:58:34.162491 +0000] I >>>>>>> [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-VM_Storage_1-client-12: changing >>>>>>> port >>>>>>> to 49170 (from 0) >>>>>>> [2021-05-20 17:58:34.162507 +0000] I >>>>>>> [socket.c:849:__socket_shutdown] 0-VM_Storage_1-client-12: intentional >>>>>>> socket shutdown(26) >>>>>>> [2021-05-20 17:58:34.163076 +0000] I [MSGID: 114057] >>>>>>> [client-handshake.c:1128:select_server_supported_programs] >>>>>>> 0-VM_Storage_1-client-11: Using Program [{Program-name=GlusterFS 4.x >>>>>>> v1}, >>>>>>> {Num=1298437}, {Version=400}] >>>>>>> [2021-05-20 17:58:34.163339 +0000] W [MSGID: 114043] >>>>>>> [client-handshake.c:727:client_setvolume_cbk] 0-VM_Storage_1-client-11: >>>>>>> failed to set the volume [{errno=2}, {error=No such file or directory}] >>>>>>> [2021-05-20 17:58:34.163351 +0000] W [MSGID: 114007] >>>>>>> [client-handshake.c:752:client_setvolume_cbk] 0-VM_Storage_1-client-11: >>>>>>> failed to get from reply dict [{process-uuid}, {errno=22}, >>>>>>> {error=Invalid >>>>>>> argument}] >>>>>>> [2021-05-20 17:58:34.163360 +0000] E [MSGID: 114044] >>>>>>> [client-handshake.c:757:client_setvolume_cbk] 0-VM_Storage_1-client-11: >>>>>>> SETVOLUME on remote-host failed [{remote-error=Brick not found}, >>>>>>> {errno=2}, >>>>>>> {error=No such file or directory}] >>>>>>> [2021-05-20 17:58:34.163365 +0000] I [MSGID: 114051] >>>>>>> [client-handshake.c:879:client_setvolume_cbk] 0-VM_Storage_1-client-11: >>>>>>> sending CHILD_CONNECTING event [] >>>>>>> [2021-05-20 17:58:34.163425 +0000] I [MSGID: 114018] >>>>>>> [client.c:2229:client_rpc_notify] 0-VM_Storage_1-client-11: disconnected >>>>>>> from client, process will keep trying to connect glusterd until brick's >>>>>>> port is available [{conn-name=VM_Storage_1-client-11}] >>>>>>> >>>>>>> On Thu, 20 May 2021 at 18:54, Marco Fais <evi...@gmail.com> wrote: >>>>>>> >>>>>>>> HI Ravi, >>>>>>>> >>>>>>>> thanks again for your help. 
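[For anyone triaging a similar case: one quick way to list which client translators are failing the handshake is to pull the SETVOLUME errors (MSGID 114044) out of the relevant log. A minimal sketch, assuming the standard glustershd log location and using this thread's volume name as an example; point it at the FUSE mount log instead to check regular clients:

  # List client xlators that logged "SETVOLUME on remote-host failed"
  grep 'MSGID: 114044' /var/log/glusterfs/glustershd.log \
    | grep -o '[0-9]\+-VM_Storage_1-client-[0-9]\+' \
    | sort -u
]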
>>>>>>>> >>>>>>>> Here is the output of "cat >>>>>>>> graphs/active/VM_Storage_1-client-11/private" from the same node >>>>>>>> where glustershd is complaining: >>>>>>>> >>>>>>>> [xlator.protocol.client.VM_Storage_1-client-11.priv] >>>>>>>> fd.0.remote_fd = 1 >>>>>>>> ------ = ------ >>>>>>>> granted-posix-lock[0] = owner = 7904e87d91693fb7, cmd = F_SETLK >>>>>>>> fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = >>>>>>>> F_RDLCK, l_start = 100, l_len = 1 >>>>>>>> granted-posix-lock[1] = owner = 7904e87d91693fb7, cmd = F_SETLK >>>>>>>> fl_type = F_RDLCK, fl_start = 101, fl_end = 101, user_flock: l_type = >>>>>>>> F_RDLCK, l_start = 101, l_len = 1 >>>>>>>> granted-posix-lock[2] = owner = 7904e87d91693fb7, cmd = F_SETLK >>>>>>>> fl_type = F_RDLCK, fl_start = 103, fl_end = 103, user_flock: l_type = >>>>>>>> F_RDLCK, l_start = 103, l_len = 1 >>>>>>>> granted-posix-lock[3] = owner = 7904e87d91693fb7, cmd = F_SETLK >>>>>>>> fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = >>>>>>>> F_RDLCK, l_start = 201, l_len = 1 >>>>>>>> granted-posix-lock[4] = owner = 7904e87d91693fb7, cmd = F_SETLK >>>>>>>> fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type = >>>>>>>> F_RDLCK, l_start = 203, l_len = 1 >>>>>>>> ------ = ------ >>>>>>>> fd.1.remote_fd = 0 >>>>>>>> ------ = ------ >>>>>>>> granted-posix-lock[0] = owner = b43238094746d9fe, cmd = F_SETLK >>>>>>>> fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = >>>>>>>> F_RDLCK, l_start = 100, l_len = 1 >>>>>>>> granted-posix-lock[1] = owner = b43238094746d9fe, cmd = F_SETLK >>>>>>>> fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = >>>>>>>> F_RDLCK, l_start = 201, l_len = 1 >>>>>>>> granted-posix-lock[2] = owner = b43238094746d9fe, cmd = F_SETLK >>>>>>>> fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type = >>>>>>>> F_RDLCK, l_start = 203, l_len = 1 >>>>>>>> ------ = ------ >>>>>>>> fd.2.remote_fd = 3 >>>>>>>> ------ = ------ >>>>>>>> granted-posix-lock[0] = owner = 53526588c515153b, cmd = F_SETLK >>>>>>>> fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = >>>>>>>> F_RDLCK, l_start = 100, l_len = 1 >>>>>>>> granted-posix-lock[1] = owner = 53526588c515153b, cmd = F_SETLK >>>>>>>> fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = >>>>>>>> F_RDLCK, l_start = 201, l_len = 1 >>>>>>>> granted-posix-lock[2] = owner = 53526588c515153b, cmd = F_SETLK >>>>>>>> fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type = >>>>>>>> F_RDLCK, l_start = 203, l_len = 1 >>>>>>>> ------ = ------ >>>>>>>> fd.3.remote_fd = 2 >>>>>>>> ------ = ------ >>>>>>>> granted-posix-lock[0] = owner = 889461581e4fda22, cmd = F_SETLK >>>>>>>> fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = >>>>>>>> F_RDLCK, l_start = 100, l_len = 1 >>>>>>>> granted-posix-lock[1] = owner = 889461581e4fda22, cmd = F_SETLK >>>>>>>> fl_type = F_RDLCK, fl_start = 101, fl_end = 101, user_flock: l_type = >>>>>>>> F_RDLCK, l_start = 101, l_len = 1 >>>>>>>> granted-posix-lock[2] = owner = 889461581e4fda22, cmd = F_SETLK >>>>>>>> fl_type = F_RDLCK, fl_start = 103, fl_end = 103, user_flock: l_type = >>>>>>>> F_RDLCK, l_start = 103, l_len = 1 >>>>>>>> granted-posix-lock[3] = owner = 889461581e4fda22, cmd = F_SETLK >>>>>>>> fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = >>>>>>>> F_RDLCK, l_start = 201, l_len = 1 >>>>>>>> granted-posix-lock[4] = owner = 889461581e4fda22, cmd = F_SETLK >>>>>>>> fl_type = F_RDLCK, fl_start 
= 203, fl_end = 203, user_flock: l_type = >>>>>>>> F_RDLCK, l_start = 203, l_len = 1 >>>>>>>> ------ = ------ >>>>>>>> connected = 1 >>>>>>>> total_bytes_read = 6665235356 >>>>>>>> ping_timeout = 42 >>>>>>>> total_bytes_written = 4756303549 >>>>>>>> ping_msgs_sent = 3662 >>>>>>>> msgs_sent = 16786186 >>>>>>>> >>>>>>>> So they seem to be connected there. >>>>>>>> *However* -- they are not connected apparently in server 2 (where >>>>>>>> I have just re-mounted the volume): >>>>>>>> [root@lab-cnvirt-h02 .meta]# cat >>>>>>>> graphs/active/VM_Storage_1-client-11/private >>>>>>>> [xlator.protocol.client.VM_Storage_1-client-11.priv] >>>>>>>> *connected = 0* >>>>>>>> total_bytes_read = 50020 >>>>>>>> ping_timeout = 42 >>>>>>>> total_bytes_written = 84628 >>>>>>>> ping_msgs_sent = 0 >>>>>>>> msgs_sent = 0 >>>>>>>> [root@lab-cnvirt-h02 .meta]# cat >>>>>>>> graphs/active/VM_Storage_1-client-20/private >>>>>>>> [xlator.protocol.client.VM_Storage_1-client-20.priv] >>>>>>>> *connected = 0* >>>>>>>> total_bytes_read = 53300 >>>>>>>> ping_timeout = 42 >>>>>>>> total_bytes_written = 90180 >>>>>>>> ping_msgs_sent = 0 >>>>>>>> msgs_sent = 0 >>>>>>>> >>>>>>>> The other bricks look connected... >>>>>>>> >>>>>>>> Regards, >>>>>>>> Marco >>>>>>>> >>>>>>>> >>>>>>>> On Thu, 20 May 2021 at 14:02, Ravishankar N <ravishan...@redhat.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi Marco, >>>>>>>>> >>>>>>>>> On Wed, May 19, 2021 at 8:02 PM Marco Fais <evi...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi Ravi, >>>>>>>>>> >>>>>>>>>> thanks a million for your reply. >>>>>>>>>> >>>>>>>>>> I have replicated the issue in my test cluster by bringing one of >>>>>>>>>> the nodes down, and then up again. >>>>>>>>>> The glustershd process in the restarted node is now complaining >>>>>>>>>> about connectivity to two bricks in one of my volumes: >>>>>>>>>> >>>>>>>>>> --- >>>>>>>>>> [2021-05-19 14:05:14.462133 +0000] I >>>>>>>>>> [rpc-clnt.c:1968:rpc_clnt_reconfig] 2-VM_Storage_1-client-11: >>>>>>>>>> changing port >>>>>>>>>> to 49170 (from 0) >>>>>>>>>> [2021-05-19 14:05:14.464971 +0000] I [MSGID: 114057] >>>>>>>>>> [client-handshake.c:1128:select_server_supported_programs] >>>>>>>>>> 2-VM_Storage_1-client-11: Using Program [{Program-name=GlusterFS 4.x >>>>>>>>>> v1}, >>>>>>>>>> {Num=1298437}, {Version=400}] >>>>>>>>>> [2021-05-19 14:05:14.465209 +0000] W [MSGID: 114043] >>>>>>>>>> [client-handshake.c:727:client_setvolume_cbk] >>>>>>>>>> 2-VM_Storage_1-client-11: >>>>>>>>>> failed to set the volume [{errno=2}, {error=No such file or >>>>>>>>>> directory}] >>>>>>>>>> [2021-05-19 14:05:14.465236 +0000] W [MSGID: 114007] >>>>>>>>>> [client-handshake.c:752:client_setvolume_cbk] >>>>>>>>>> 2-VM_Storage_1-client-11: >>>>>>>>>> failed to get from reply dict [{process-uuid}, {errno=22}, >>>>>>>>>> {error=Invalid >>>>>>>>>> argument}] >>>>>>>>>> [2021-05-19 14:05:14.465248 +0000] E [MSGID: 114044] >>>>>>>>>> [client-handshake.c:757:client_setvolume_cbk] >>>>>>>>>> 2-VM_Storage_1-client-11: >>>>>>>>>> SETVOLUME on remote-host failed [{remote-error=Brick not found}, >>>>>>>>>> {errno=2}, >>>>>>>>>> {error=No such file or directory}] >>>>>>>>>> [2021-05-19 14:05:14.465256 +0000] I [MSGID: 114051] >>>>>>>>>> [client-handshake.c:879:client_setvolume_cbk] >>>>>>>>>> 2-VM_Storage_1-client-11: >>>>>>>>>> sending CHILD_CONNECTING event [] >>>>>>>>>> [2021-05-19 14:05:14.465291 +0000] I [MSGID: 114018] >>>>>>>>>> [client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-11: >>>>>>>>>> disconnected >>>>>>>>>> from client, process will keep trying to 
connect glusterd until >>>>>>>>>> brick's >>>>>>>>>> port is available [{conn-name=VM_Storage_1-client-11}] >>>>>>>>>> [2021-05-19 14:05:14.473598 +0000] I >>>>>>>>>> [rpc-clnt.c:1968:rpc_clnt_reconfig] 2-VM_Storage_1-client-20: >>>>>>>>>> changing port >>>>>>>>>> to 49173 (from 0) >>>>>>>>>> >>>>>>>>> >>>>>>>>> The above logs indicate that shd is trying to connect to the >>>>>>>>> bricks on ports 49170 and 49173 respectively, when it should have >>>>>>>>> done so using 49172 and 49169 (as per the volume status and ps >>>>>>>>> output). Shd >>>>>>>>> gets the brick port numbers info from glusterd, so I'm not sure what >>>>>>>>> is >>>>>>>>> going on here. Do you have fuse mounts on this particular node? If >>>>>>>>> you >>>>>>>>> don't, you can mount it temporarily, then check if the connection to >>>>>>>>> the >>>>>>>>> bricks is successful from the .meta folder of the mount: >>>>>>>>> >>>>>>>>> cd /path-to-fuse-mount >>>>>>>>> cd .meta >>>>>>>>> cat graphs/active/VM_Storage_1-client-11/private >>>>>>>>> cat graphs/active/VM_Storage_1-client-20/private >>>>>>>>> etc. and check if connected=1 or 0. >>>>>>>>> >>>>>>>>> I just wanted to see if it is only the shd or even the other >>>>>>>>> clients are unable to connect to the bricks from this node. FWIW, I >>>>>>>>> tried >>>>>>>>> upgrading from 7.9 to 8.4 on a test machine and the shd was able to >>>>>>>>> connect >>>>>>>>> to the bricks just fine. >>>>>>>>> Regards, >>>>>>>>> Ravi >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> [2021-05-19 14:05:14.476543 +0000] I [MSGID: 114057] >>>>>>>>>> [client-handshake.c:1128:select_server_supported_programs] >>>>>>>>>> 2-VM_Storage_1-client-20: Using Program [{Program-name=GlusterFS 4.x >>>>>>>>>> v1}, >>>>>>>>>> {Num=1298437}, {Version=400}] >>>>>>>>>> [2021-05-19 14:05:14.476764 +0000] W [MSGID: 114043] >>>>>>>>>> [client-handshake.c:727:client_setvolume_cbk] >>>>>>>>>> 2-VM_Storage_1-client-20: >>>>>>>>>> failed to set the volume [{errno=2}, {error=No such file or >>>>>>>>>> directory}] >>>>>>>>>> [2021-05-19 14:05:14.476785 +0000] W [MSGID: 114007] >>>>>>>>>> [client-handshake.c:752:client_setvolume_cbk] >>>>>>>>>> 2-VM_Storage_1-client-20: >>>>>>>>>> failed to get from reply dict [{process-uuid}, {errno=22}, >>>>>>>>>> {error=Invalid >>>>>>>>>> argument}] >>>>>>>>>> [2021-05-19 14:05:14.476799 +0000] E [MSGID: 114044] >>>>>>>>>> [client-handshake.c:757:client_setvolume_cbk] >>>>>>>>>> 2-VM_Storage_1-client-20: >>>>>>>>>> SETVOLUME on remote-host failed [{remote-error=Brick not found}, >>>>>>>>>> {errno=2}, >>>>>>>>>> {error=No such file or directory}] >>>>>>>>>> [2021-05-19 14:05:14.476812 +0000] I [MSGID: 114051] >>>>>>>>>> [client-handshake.c:879:client_setvolume_cbk] >>>>>>>>>> 2-VM_Storage_1-client-20: >>>>>>>>>> sending CHILD_CONNECTING event [] >>>>>>>>>> [2021-05-19 14:05:14.476849 +0000] I [MSGID: 114018] >>>>>>>>>> [client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-20: >>>>>>>>>> disconnected >>>>>>>>>> from client, process will keep trying to connect glusterd until >>>>>>>>>> brick's >>>>>>>>>> port is available [{conn-name=VM_Storage_1-client-20}] >>>>>>>>>> --- >>>>>>>>>> >>>>>>>>>> The two bricks are the following: >>>>>>>>>> VM_Storage_1-client-20 --> Brick21: >>>>>>>>>> lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick (arbiter) >>>>>>>>>> VM_Storage_1-client-11 --> Brick12: >>>>>>>>>> lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick (arbiter) >>>>>>>>>> (In this case it the issue is on two arbiter nodes, but it is not >>>>>>>>>> always the case) >>>>>>>>>> >>>>>>>>>> The 
port information via "gluster volume status VM_Storage_1" on >>>>>>>>>> the affected node (same as the one running the glustershd reporting >>>>>>>>>> the >>>>>>>>>> issue) is: >>>>>>>>>> Brick lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick >>>>>>>>>> *49172 *0 Y 3978256 >>>>>>>>>> Brick lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick >>>>>>>>>> *49169 *0 Y 3978224 >>>>>>>>>> >>>>>>>>>> This is aligned to the actual port of the process: >>>>>>>>>> root 3978256 1.5 0.0 1999568 30372 ? Ssl May18 >>>>>>>>>> 15:56 /usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id >>>>>>>>>> VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b5_arb-brick -p >>>>>>>>>> /var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b5_arb-brick.pid >>>>>>>>>> -S /var/run/gluster/2b1dd3ca06d39a59.socket --brick-name >>>>>>>>>> /bricks/vm_b5_arb/brick -l >>>>>>>>>> /var/log/glusterfs/bricks/bricks-vm_b5_arb-brick.log --xlator-option >>>>>>>>>> *-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 >>>>>>>>>> --process-name >>>>>>>>>> brick --brick-port *49172 *--xlator-option >>>>>>>>>> VM_Storage_1-server.listen-port=*49172* >>>>>>>>>> root 3978224 4.3 0.0 1867976 27928 ? Ssl May18 >>>>>>>>>> 44:55 /usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id >>>>>>>>>> VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b3_arb-brick -p >>>>>>>>>> /var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b3_arb-brick.pid >>>>>>>>>> -S /var/run/gluster/00d461b7d79badc9.socket --brick-name >>>>>>>>>> /bricks/vm_b3_arb/brick -l >>>>>>>>>> /var/log/glusterfs/bricks/bricks-vm_b3_arb-brick.log --xlator-option >>>>>>>>>> *-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 >>>>>>>>>> --process-name >>>>>>>>>> brick --brick-port *49169 *--xlator-option >>>>>>>>>> VM_Storage_1-server.listen-port=*49169* >>>>>>>>>> >>>>>>>>>> So the issue seems to be specifically on glustershd, as the *glusterd >>>>>>>>>> process seems to be aware of the right port *(as it matches the >>>>>>>>>> real port, and the brick is indeed up according to the status). >>>>>>>>>> >>>>>>>>>> I have then requested a statedump as you have suggested, and the >>>>>>>>>> bricks seem to be not connected: >>>>>>>>>> >>>>>>>>>> [xlator.protocol.client.VM_Storage_1-client-11.priv] >>>>>>>>>> *connected=0* >>>>>>>>>> total_bytes_read=341120 >>>>>>>>>> ping_timeout=42 >>>>>>>>>> total_bytes_written=594008 >>>>>>>>>> ping_msgs_sent=0 >>>>>>>>>> msgs_sent=0 >>>>>>>>>> >>>>>>>>>> [xlator.protocol.client.VM_Storage_1-client-20.priv] >>>>>>>>>> *connected=0* >>>>>>>>>> total_bytes_read=341120 >>>>>>>>>> ping_timeout=42 >>>>>>>>>> total_bytes_written=594008 >>>>>>>>>> ping_msgs_sent=0 >>>>>>>>>> msgs_sent=0 >>>>>>>>>> >>>>>>>>>> The important other thing to notice is that normally the bricks >>>>>>>>>> that are not connecting are always in the same (remote) node... i.e. >>>>>>>>>> they >>>>>>>>>> are both in node 3 in this case. That seems to be always the case, I >>>>>>>>>> have >>>>>>>>>> not encountered a scenario where bricks from different nodes are >>>>>>>>>> reporting >>>>>>>>>> this issue (at least for the same volume). >>>>>>>>>> >>>>>>>>>> Please let me know if you need any additional info. 
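[The port cross-check Marco does by hand above can be scripted. A rough sketch, using this thread's volume name as an example; it assumes the CLI does not wrap long brick paths onto a second line and should be run on the affected node:

  VOL=VM_Storage_1
  gluster volume status "$VOL" | awk '/^Brick / {print $2, $3, $NF}' |
  while read -r brick port pid; do
      [ "$pid" = "N/A" ] && continue          # skip bricks reported down
      # port the brick process was actually started with (--brick-port ...)
      real=$(ps -o args= -p "$pid" | sed -n 's/.*--brick-port \([0-9]*\).*/\1/p')
      [ "$port" = "$real" ] || echo "MISMATCH $brick: status=$port process=$real"
  done

If the two ports disagree, glusterd is serving stale port data and restarting glusterd on that node (rather than killing bricks plus "volume start force") is the suggested fix; if they agree, as in Marco's case, the problem is on the glustershd/client side.]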
>>>>>>>>>> >>>>>>>>>> Regards, >>>>>>>>>> Marco >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Wed, 19 May 2021 at 06:31, Ravishankar N < >>>>>>>>>> ravishan...@redhat.com> wrote: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Mon, May 17, 2021 at 4:22 PM Marco Fais <evi...@gmail.com> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> I am having significant issues with glustershd with releases >>>>>>>>>>>> 8.4 and 9.1. >>>>>>>>>>>> >>>>>>>>>>>> My oVirt clusters are using gluster storage backends, and were >>>>>>>>>>>> running fine with Gluster 7.x (shipped with earlier versions of >>>>>>>>>>>> oVirt Node >>>>>>>>>>>> 4.4.x). Recently the oVirt project moved to Gluster 8.4 for the >>>>>>>>>>>> nodes, and >>>>>>>>>>>> hence I have moved to this release when upgrading my clusters. >>>>>>>>>>>> >>>>>>>>>>>> Since then I am having issues whenever one of the nodes is >>>>>>>>>>>> brought down; when the nodes come back up online the bricks are >>>>>>>>>>>> typically >>>>>>>>>>>> back up and working, but some (random) glustershd processes in the >>>>>>>>>>>> various >>>>>>>>>>>> nodes seem to have issues connecting to some of them. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> When the issue happens, can you check if the TCP port number of >>>>>>>>>>> the brick (glusterfsd) processes displayed in `gluster volume >>>>>>>>>>> status` >>>>>>>>>>> matches with that of the actual port numbers observed (i.e. the >>>>>>>>>>> --brick-port argument) when you run `ps aux | grep glusterfsd` ? If >>>>>>>>>>> they >>>>>>>>>>> don't match, then glusterd has incorrect brick port information in >>>>>>>>>>> its >>>>>>>>>>> memory and serving it to glustershd. Restarting glusterd instead of >>>>>>>>>>> (killing the bricks + `volume start force`) should fix it, although >>>>>>>>>>> we need >>>>>>>>>>> to find why glusterd serves incorrect port numbers. >>>>>>>>>>> >>>>>>>>>>> If they do match, then can you take a statedump of glustershd >>>>>>>>>>> to check that it is indeed disconnected from the bricks? You >>>>>>>>>>> will need to verify that 'connected=1' in the statedump. See >>>>>>>>>>> "Self-heal is >>>>>>>>>>> stuck/ not getting completed." section in >>>>>>>>>>> https://docs.gluster.org/en/latest/Troubleshooting/troubleshooting-afr/. >>>>>>>>>>> Statedump can be taken by `kill -SIGUSR1 $pid-of-glustershd`. It >>>>>>>>>>> will be >>>>>>>>>>> generated in the /var/run/gluster/ directory. >>>>>>>>>>> >>>>>>>>>>> Regards, >>>>>>>>>>> Ravi >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>> >>> -- >>> Regards, >>> Srijan >>> >> > > -- > Regards, > Srijan >
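[To close the loop on the two checks Ravi describes above, here is a rough sketch of both. Assumptions: the pgrep pattern matches the glustershd process, the statedump file follows the glusterdump.<pid>.* naming in /var/run/gluster/, and /path-to-fuse-mount is a placeholder for an actual FUSE mount of the volume:

  # (a) Statedump of glustershd, then check connected= per client xlator
  pid=$(pgrep -f glustershd | head -n1)     # assumed to match the shd process
  kill -SIGUSR1 "$pid"                      # dump is written to /var/run/gluster/
  sleep 2
  dump=$(ls -t /var/run/gluster/glusterdump."$pid".* 2>/dev/null | head -n1)
  grep -E '^\[xlator\.protocol\.client\.|^connected=' "$dump"

  # (b) Same check from a FUSE mount's .meta folder, one line per client
  cd /path-to-fuse-mount/.meta/graphs/active
  for d in VM_Storage_1-client-*/; do
      printf '%s: %s\n' "${d%/}" "$(grep '^connected' "${d}private")"
  done

Any client showing connected=0 here while "gluster volume status" reports the brick online reproduces the situation described in this thread.]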