Srijan, thanks a million -- I have opened the issue as requested here:
https://github.com/gluster/glusterfs/issues/2492

I have attached the glusterd.log and glustershd.log files, but please let me know if there is any other test I should do or logs I should provide.

Thanks,
Marco

On Wed, 26 May 2021 at 18:09, Srijan Sivakumar <ssiva...@redhat.com> wrote:

> Hi Marco,
>
> If possible, let's open an issue in GitHub and track this from there. I am checking the previous mails in the chain to see if I can infer something about the situation. It would be helpful if we could analyze this with the help of log files, especially glusterd.log and glustershd.log.
>
> To open an issue, you can use this link: Open a new issue
> <https://github.com/gluster/glusterfs/issues/new>
>
> On Wed, May 26, 2021 at 5:02 PM Marco Fais <evi...@gmail.com> wrote:
>
>> Ravi,
>>
>> thanks a million.
>> @Mohit, @Srijan please let me know if you need any additional information.
>>
>> Thanks,
>> Marco
>>
>> On Tue, 25 May 2021 at 17:28, Ravishankar N <ravishan...@redhat.com> wrote:
>>
>>> Hi Marco,
>>> I haven't had any luck yet. Adding Mohit and Srijan, who work on glusterd, in case they have some inputs.
>>> -Ravi
>>>
>>> On Tue, May 25, 2021 at 9:31 PM Marco Fais <evi...@gmail.com> wrote:
>>>
>>>> Hi Ravi,
>>>>
>>>> just wondering if you have any further thoughts on this -- unfortunately it is something still very much affecting us at the moment. I am trying to understand how to troubleshoot it further but haven't been able to make much progress...
>>>>
>>>> Thanks,
>>>> Marco
>>>>
>>>> On Thu, 20 May 2021 at 19:04, Marco Fais <evi...@gmail.com> wrote:
>>>>
>>>>> Just to complete...
>>>>>
>>>>> From the FUSE mount log on server 2 I see the same errors as in glustershd.log on node 1:
>>>>>
>>>>> [2021-05-20 17:58:34.157971 +0000] I [MSGID: 114020] [client.c:2319:notify] 0-VM_Storage_1-client-11: parent translators are ready, attempting connect on transport []
>>>>> [2021-05-20 17:58:34.160586 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-VM_Storage_1-client-11: changing port to 49170 (from 0)
>>>>> [2021-05-20 17:58:34.160608 +0000] I [socket.c:849:__socket_shutdown] 0-VM_Storage_1-client-11: intentional socket shutdown(20)
>>>>> [2021-05-20 17:58:34.161403 +0000] I [MSGID: 114046] [client-handshake.c:857:client_setvolume_cbk] 0-VM_Storage_1-client-10: Connected, attached to remote volume [{conn-name=VM_Storage_1-client-10}, {remote_subvol=/bricks/vm_b3_vol/brick}]
>>>>> [2021-05-20 17:58:34.161513 +0000] I [MSGID: 108002] [afr-common.c:6435:afr_notify] 0-VM_Storage_1-replicate-3: Client-quorum is met
>>>>> [2021-05-20 17:58:34.162043 +0000] I [MSGID: 114020] [client.c:2319:notify] 0-VM_Storage_1-client-13: parent translators are ready, attempting connect on transport []
>>>>> [2021-05-20 17:58:34.162491 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-VM_Storage_1-client-12: changing port to 49170 (from 0)
>>>>> [2021-05-20 17:58:34.162507 +0000] I [socket.c:849:__socket_shutdown] 0-VM_Storage_1-client-12: intentional socket shutdown(26)
>>>>> [2021-05-20 17:58:34.163076 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 0-VM_Storage_1-client-11: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
>>>>> [2021-05-20 17:58:34.163339 +0000] W [MSGID: 114043] [client-handshake.c:727:client_setvolume_cbk] 0-VM_Storage_1-client-11: failed to set the volume [{errno=2}, {error=No such file or directory}]
>>>>> [2021-05-20 17:58:34.163351 +0000] W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 0-VM_Storage_1-client-11: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}]
>>>>> [2021-05-20 17:58:34.163360 +0000] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 0-VM_Storage_1-client-11: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}]
>>>>> [2021-05-20 17:58:34.163365 +0000] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 0-VM_Storage_1-client-11: sending CHILD_CONNECTING event []
>>>>> [2021-05-20 17:58:34.163425 +0000] I [MSGID: 114018] [client.c:2229:client_rpc_notify] 0-VM_Storage_1-client-11: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=VM_Storage_1-client-11}]
>>>>>
>>>>> On Thu, 20 May 2021 at 18:54, Marco Fais <evi...@gmail.com> wrote:
>>>>>
>>>>>> Hi Ravi,
>>>>>>
>>>>>> thanks again for your help.
>>>>>>
>>>>>> Here is the output of "cat graphs/active/VM_Storage_1-client-11/private" from the same node where glustershd is complaining:
>>>>>>
>>>>>> [xlator.protocol.client.VM_Storage_1-client-11.priv]
>>>>>> fd.0.remote_fd = 1
>>>>>> ------ = ------
>>>>>> granted-posix-lock[0] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = F_RDLCK, l_start = 100, l_len = 1
>>>>>> granted-posix-lock[1] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 101, fl_end = 101, user_flock: l_type = F_RDLCK, l_start = 101, l_len = 1
>>>>>> granted-posix-lock[2] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 103, fl_end = 103, user_flock: l_type = F_RDLCK, l_start = 103, l_len = 1
>>>>>> granted-posix-lock[3] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = F_RDLCK, l_start = 201, l_len = 1
>>>>>> granted-posix-lock[4] = owner = 7904e87d91693fb7, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type = F_RDLCK, l_start = 203, l_len = 1
>>>>>> ------ = ------
>>>>>> fd.1.remote_fd = 0
>>>>>> ------ = ------
>>>>>> granted-posix-lock[0] = owner = b43238094746d9fe, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = F_RDLCK, l_start = 100, l_len = 1
>>>>>> granted-posix-lock[1] = owner = b43238094746d9fe, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = F_RDLCK, l_start = 201, l_len = 1
>>>>>> granted-posix-lock[2] = owner = b43238094746d9fe, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type = F_RDLCK, l_start = 203, l_len = 1
>>>>>> ------ = ------
>>>>>> fd.2.remote_fd = 3
>>>>>> ------ = ------
>>>>>> granted-posix-lock[0] = owner = 53526588c515153b, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = F_RDLCK, l_start = 100, l_len = 1
>>>>>> granted-posix-lock[1] = owner = 53526588c515153b, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = F_RDLCK, l_start = 201, l_len = 1
>>>>>> granted-posix-lock[2] = owner = 53526588c515153b, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type = F_RDLCK, l_start = 203, l_len = 1
>>>>>> ------ = ------
>>>>>> fd.3.remote_fd = 2
>>>>>> ------ = ------
>>>>>> granted-posix-lock[0] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type = F_RDLCK, l_start = 100, l_len = 1
>>>>>> granted-posix-lock[1] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 101, fl_end = 101, user_flock: l_type = F_RDLCK, l_start = 101, l_len = 1
>>>>>> granted-posix-lock[2] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 103, fl_end = 103, user_flock: l_type = F_RDLCK, l_start = 103, l_len = 1
>>>>>> granted-posix-lock[3] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type = F_RDLCK, l_start = 201, l_len = 1
>>>>>> granted-posix-lock[4] = owner = 889461581e4fda22, cmd = F_SETLK fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type = F_RDLCK, l_start = 203, l_len = 1
>>>>>> ------ = ------
>>>>>> connected = 1
>>>>>> total_bytes_read = 6665235356
>>>>>> ping_timeout = 42
>>>>>> total_bytes_written = 4756303549
>>>>>> ping_msgs_sent = 3662
>>>>>> msgs_sent = 16786186
>>>>>>
>>>>>> So they seem to be connected there.
>>>>>> *However*, they apparently are not connected on server 2 (where I have just re-mounted the volume):
>>>>>>
>>>>>> [root@lab-cnvirt-h02 .meta]# cat graphs/active/VM_Storage_1-client-11/private
>>>>>> [xlator.protocol.client.VM_Storage_1-client-11.priv]
>>>>>> *connected = 0*
>>>>>> total_bytes_read = 50020
>>>>>> ping_timeout = 42
>>>>>> total_bytes_written = 84628
>>>>>> ping_msgs_sent = 0
>>>>>> msgs_sent = 0
>>>>>> [root@lab-cnvirt-h02 .meta]# cat graphs/active/VM_Storage_1-client-20/private
>>>>>> [xlator.protocol.client.VM_Storage_1-client-20.priv]
>>>>>> *connected = 0*
>>>>>> total_bytes_read = 53300
>>>>>> ping_timeout = 42
>>>>>> total_bytes_written = 90180
>>>>>> ping_msgs_sent = 0
>>>>>> msgs_sent = 0
>>>>>>
>>>>>> The other bricks look connected...
>>>>>>
>>>>>> Regards,
>>>>>> Marco
>>>>>>
>>>>>> On Thu, 20 May 2021 at 14:02, Ravishankar N <ravishan...@redhat.com> wrote:
>>>>>>
>>>>>>> Hi Marco,
>>>>>>>
>>>>>>> On Wed, May 19, 2021 at 8:02 PM Marco Fais <evi...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Ravi,
>>>>>>>>
>>>>>>>> thanks a million for your reply.
>>>>>>>>
>>>>>>>> I have replicated the issue in my test cluster by bringing one of the nodes down, and then up again.
>>>>>>>> The glustershd process in the restarted node is now complaining about connectivity to two bricks in one of my volumes:
>>>>>>>>
>>>>>>>> ---
>>>>>>>> [2021-05-19 14:05:14.462133 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 2-VM_Storage_1-client-11: changing port to 49170 (from 0)
>>>>>>>> [2021-05-19 14:05:14.464971 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 2-VM_Storage_1-client-11: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
>>>>>>>> [2021-05-19 14:05:14.465209 +0000] W [MSGID: 114043] [client-handshake.c:727:client_setvolume_cbk] 2-VM_Storage_1-client-11: failed to set the volume [{errno=2}, {error=No such file or directory}]
>>>>>>>> [2021-05-19 14:05:14.465236 +0000] W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 2-VM_Storage_1-client-11: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}]
>>>>>>>> [2021-05-19 14:05:14.465248 +0000] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 2-VM_Storage_1-client-11: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}]
>>>>>>>> [2021-05-19 14:05:14.465256 +0000] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 2-VM_Storage_1-client-11: sending CHILD_CONNECTING event []
>>>>>>>> [2021-05-19 14:05:14.465291 +0000] I [MSGID: 114018] [client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-11: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=VM_Storage_1-client-11}]
>>>>>>>> [2021-05-19 14:05:14.473598 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 2-VM_Storage_1-client-20: changing port to 49173 (from 0)
>>>>>>>>
>>>>>>>
>>>>>>> The above logs indicate that shd is trying to connect to the bricks on ports 49170 and 49173 respectively, when it should have done so using 49172 and 49169 (as per the volume status and ps output). Shd gets the brick port numbers from glusterd, so I'm not sure what is going on here. Do you have fuse mounts on this particular node? If you don't, you can mount one temporarily, then check if the connection to the bricks is successful from the .meta folder of the mount:
>>>>>>>
>>>>>>> cd /path-to-fuse-mount
>>>>>>> cd .meta
>>>>>>> cat graphs/active/VM_Storage_1-client-11/private
>>>>>>> cat graphs/active/VM_Storage_1-client-20/private
>>>>>>>
>>>>>>> etc., and check if connected=1 or 0.
>>>>>>>
>>>>>>> I just wanted to see if it is only the shd, or if the other clients are also unable to connect to the bricks from this node. FWIW, I tried upgrading from 7.9 to 8.4 on a test machine and the shd was able to connect to the bricks just fine.
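>>>>>>>
>>>>>>> If it helps, a quick way to check all the clients of the volume in one go would be a small loop over the .meta graph (just a sketch, assuming the same /path-to-fuse-mount as above):
>>>>>>>
>>>>>>> # print the connection state of every client xlator of the volume,
>>>>>>> # e.g. "VM_Storage_1-client-11: connected = 0"
>>>>>>> cd /path-to-fuse-mount/.meta/graphs/active
>>>>>>> for x in VM_Storage_1-client-*; do echo -n "$x: "; grep '^connected' "$x/private"; done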
>>>>>>> Regards,
>>>>>>> Ravi
>>>>>>>
>>>>>>>> [2021-05-19 14:05:14.476543 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 2-VM_Storage_1-client-20: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
>>>>>>>> [2021-05-19 14:05:14.476764 +0000] W [MSGID: 114043] [client-handshake.c:727:client_setvolume_cbk] 2-VM_Storage_1-client-20: failed to set the volume [{errno=2}, {error=No such file or directory}]
>>>>>>>> [2021-05-19 14:05:14.476785 +0000] W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 2-VM_Storage_1-client-20: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}]
>>>>>>>> [2021-05-19 14:05:14.476799 +0000] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 2-VM_Storage_1-client-20: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}]
>>>>>>>> [2021-05-19 14:05:14.476812 +0000] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 2-VM_Storage_1-client-20: sending CHILD_CONNECTING event []
>>>>>>>> [2021-05-19 14:05:14.476849 +0000] I [MSGID: 114018] [client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-20: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=VM_Storage_1-client-20}]
>>>>>>>> ---
>>>>>>>>
>>>>>>>> The two bricks are the following:
>>>>>>>> VM_Storage_1-client-20 --> Brick21: lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick (arbiter)
>>>>>>>> VM_Storage_1-client-11 --> Brick12: lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick (arbiter)
>>>>>>>> (In this case the issue is on two arbiter bricks, but it is not always the case)
>>>>>>>>
>>>>>>>> The port information via "gluster volume status VM_Storage_1" on the affected node (the same one running the glustershd reporting the issue) is:
>>>>>>>> Brick lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick *49172 *0 Y 3978256
>>>>>>>> Brick lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick *49169 *0 Y 3978224
>>>>>>>>
>>>>>>>> This is aligned with the actual port of the process:
>>>>>>>> root 3978256 1.5 0.0 1999568 30372 ? Ssl May18 15:56 /usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b5_arb-brick -p /var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b5_arb-brick.pid -S /var/run/gluster/2b1dd3ca06d39a59.socket --brick-name /bricks/vm_b5_arb/brick -l /var/log/glusterfs/bricks/bricks-vm_b5_arb-brick.log --xlator-option *-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 --process-name brick --brick-port *49172 *--xlator-option VM_Storage_1-server.listen-port=*49172*
>>>>>>>> root 3978224 4.3 0.0 1867976 27928 ? Ssl May18 44:55 /usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b3_arb-brick -p /var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b3_arb-brick.pid -S /var/run/gluster/00d461b7d79badc9.socket --brick-name /bricks/vm_b3_arb/brick -l /var/log/glusterfs/bricks/bricks-vm_b3_arb-brick.log --xlator-option *-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 --process-name brick --brick-port *49169 *--xlator-option VM_Storage_1-server.listen-port=*49169*
>>>>>>>>
>>>>>>>> So the issue seems to be specific to glustershd, as the *glusterd process seems to be aware of the right port* (it matches the real port, and the brick is indeed up according to the status).
>>>>>>>>
>>>>>>>> I have then requested a statedump as you suggested, and the bricks seem to be not connected:
>>>>>>>>
>>>>>>>> [xlator.protocol.client.VM_Storage_1-client-11.priv]
>>>>>>>> *connected=0*
>>>>>>>> total_bytes_read=341120
>>>>>>>> ping_timeout=42
>>>>>>>> total_bytes_written=594008
>>>>>>>> ping_msgs_sent=0
>>>>>>>> msgs_sent=0
>>>>>>>>
>>>>>>>> [xlator.protocol.client.VM_Storage_1-client-20.priv]
>>>>>>>> *connected=0*
>>>>>>>> total_bytes_read=341120
>>>>>>>> ping_timeout=42
>>>>>>>> total_bytes_written=594008
>>>>>>>> ping_msgs_sent=0
>>>>>>>>
>>>>>>>> The other important thing to notice is that the bricks that fail to connect are normally always on the same (remote) node... i.e. they are both on node 3 in this case. That seems to be always the case; I have not encountered a scenario where bricks from different nodes report this issue (at least for the same volume).
>>>>>>>>
>>>>>>>> Please let me know if you need any additional info.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Marco
>>>>>>>>
>>>>>>>> On Wed, 19 May 2021 at 06:31, Ravishankar N <ravishan...@redhat.com> wrote:
>>>>>>>>
>>>>>>>>> On Mon, May 17, 2021 at 4:22 PM Marco Fais <evi...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am having significant issues with glustershd with releases 8.4 and 9.1.
>>>>>>>>>>
>>>>>>>>>> My oVirt clusters are using gluster storage backends, and were running fine with Gluster 7.x (shipped with earlier versions of oVirt Node 4.4.x). Recently the oVirt project moved to Gluster 8.4 for the nodes, and hence I have moved to this release when upgrading my clusters.
>>>>>>>>>>
>>>>>>>>>> Since then I am having issues whenever one of the nodes is brought down; when the nodes come back up online the bricks are typically back up and working, but some (random) glustershd processes in the various nodes seem to have issues connecting to some of them.
>>>>>>>>>>
>>>>>>>>> When the issue happens, can you check if the TCP port number of the brick (glusterfsd) processes displayed in `gluster volume status` matches the actual port numbers observed (i.e. the --brick-port argument) when you run `ps aux | grep glusterfsd`? If they don't match, then glusterd has incorrect brick port information in its memory and is serving it to glustershd.
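>>>>>>>>>
>>>>>>>>> A rough way to compare the two on the affected node (just a sketch; adjust the volume name as needed):
>>>>>>>>>
>>>>>>>>> # ports glusterd advertises for the bricks
>>>>>>>>> gluster volume status VM_Storage_1 | grep '^Brick'
>>>>>>>>> # ports the brick processes were actually started with
>>>>>>>>> ps aux | grep '[g]lusterfsd' | grep -o -- '--brick-port [0-9]*'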
>>>>>>>>> Restarting glusterd (instead of killing the bricks + `volume start force`) should fix it, although we need to find out why glusterd serves incorrect port numbers.
>>>>>>>>>
>>>>>>>>> If they do match, then can you take a statedump of glustershd to check whether it is indeed disconnected from the bricks? You will need to verify that 'connected=1' in the statedump. See the "Self-heal is stuck/not getting completed." section in https://docs.gluster.org/en/latest/Troubleshooting/troubleshooting-afr/. A statedump can be taken with `kill -SIGUSR1 $pid-of-glustershd`; it will be generated in the /var/run/gluster/ directory.
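>>>>>>>>>
>>>>>>>>> For example (a sketch; assumes a single glustershd process, and the exact dump file name varies by version and pid):
>>>>>>>>>
>>>>>>>>> # trigger a statedump of the self-heal daemon
>>>>>>>>> kill -SIGUSR1 "$(pgrep -f glustershd)"
>>>>>>>>> # the newest file in /var/run/gluster/ is the dump just generated
>>>>>>>>> ls -lrt /var/run/gluster/
>>>>>>>>> # check the connection state of the two suspect clients
>>>>>>>>> grep -A 1 -e 'client-11.priv' -e 'client-20.priv' /var/run/gluster/glusterdump.*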
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Ravi
>
> --
> Regards,
> Srijan

________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users