Hi Mahdi, I already listed the steps it takes to reproduce: simply upgrading one of the four nodes from 5.13 to 7.5 and observing the logs.
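For completeness, a minimal sketch of those steps on the one node (the package/version syntax mirrors the zypper downgrade command quoted below; service names and log paths are assumptions from my openSUSE setup, adjust to yours):

    # on the node being upgraded (citadel in my case)
    systemctl stop glusterd
    # brick and client processes may also need stopping: pkill glusterfs; pkill glusterfsd
    zypper install glusterfs-7.5        # pull in the 7.5 packages
    systemctl start glusterd
    # then watch the self-heal daemon / brick logs and the heal counters from any node
    tail -f /var/log/glusterfs/glustershd.log /var/log/glusterfs/bricks/*.log
    gluster volume heal apkmirror_data1 info summary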
Sincerely, Artem -- Founder, Android Police <http://www.androidpolice.com>, APK Mirror <http://www.apkmirror.com/>, Illogical Robot LLC beerpla.net | @ArtemR <http://twitter.com/ArtemR> On Sun, Jun 21, 2020 at 12:03 PM Mahdi Adnan <ma...@sysmin.io> wrote: > I think if it's reproducible than someone can look into it, can you list > the steps to reproduce it? > > On Sun, Jun 21, 2020 at 9:12 PM Artem Russakovskii <archon...@gmail.com> > wrote: > >> There's been 0 progress or attention to this issue in a month on github >> or otherwise. >> >> Sincerely, >> Artem >> >> -- >> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >> <http://www.apkmirror.com/>, Illogical Robot LLC >> beerpla.net | @ArtemR <http://twitter.com/ArtemR> >> >> >> On Thu, May 21, 2020 at 12:43 PM Artem Russakovskii <archon...@gmail.com> >> wrote: >> >>> I've also moved this to github: >>> https://github.com/gluster/glusterfs/issues/1257. >>> >>> Sincerely, >>> Artem >>> >>> -- >>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >>> <http://www.apkmirror.com/>, Illogical Robot LLC >>> beerpla.net | @ArtemR <http://twitter.com/ArtemR> >>> >>> >>> On Fri, May 15, 2020 at 2:51 PM Artem Russakovskii <archon...@gmail.com> >>> wrote: >>> >>>> Hi, >>>> >>>> I see the team met up recently and one of the discussed items was >>>> issues upgrading to v7. What were the results of this discussion? >>>> >>>> Is the team going to respond to this thread with their thoughts and >>>> analysis? >>>> >>>> Thanks. >>>> >>>> Sincerely, >>>> Artem >>>> >>>> -- >>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >>>> <http://www.apkmirror.com/>, Illogical Robot LLC >>>> beerpla.net | @ArtemR <http://twitter.com/ArtemR> >>>> >>>> >>>> On Mon, May 4, 2020 at 10:23 PM Strahil Nikolov <hunter86...@yahoo.com> >>>> wrote: >>>> >>>>> On May 4, 2020 4:26:32 PM GMT+03:00, Amar Tumballi <a...@kadalu.io> >>>>> wrote: >>>>> >On Sat, May 2, 2020 at 10:49 PM Artem Russakovskii >>>>> ><archon...@gmail.com> >>>>> >wrote: >>>>> > >>>>> >> I don't have geo replication. >>>>> >> >>>>> >> Still waiting for someone from the gluster team to chime in. They >>>>> >used to >>>>> >> be a lot more responsive here. Do you know if there is a holiday >>>>> >perhaps, >>>>> >> or have the working hours been cut due to Coronavirus currently? >>>>> >> >>>>> >> >>>>> >It was Holiday on May 1st, and 2nd and 3rd were Weekend days! And >>>>> also >>>>> >I >>>>> >guess many of Developers from Red Hat were attending Virtual Summit! >>>>> > >>>>> > >>>>> > >>>>> >> I'm not inclined to try a v6 upgrade without their word first. >>>>> >> >>>>> > >>>>> >Fair bet! I will bring this topic in one of the community meetings, >>>>> and >>>>> >ask >>>>> >developers if they have some feedback! I personally have not seen >>>>> these >>>>> >errors, and don't have a hunch on which patch would have caused an >>>>> >increase >>>>> >in logs! >>>>> > >>>>> >-Amar >>>>> > >>>>> > >>>>> >> >>>>> >> On Sat, May 2, 2020, 12:47 AM Strahil Nikolov < >>>>> hunter86...@yahoo.com> >>>>> >> wrote: >>>>> >> >>>>> >>> On May 1, 2020 8:03:50 PM GMT+03:00, Artem Russakovskii < >>>>> >>> archon...@gmail.com> wrote: >>>>> >>> >The good news is the downgrade seems to have worked and was >>>>> >painless. >>>>> >>> > >>>>> >>> >zypper install --oldpackage glusterfs-5.13, restart gluster, and >>>>> >almost >>>>> >>> >immediately there are no heal pending entries anymore. 
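For anyone skimming the thread later, the downgrade described above boils down to roughly this (a sketch; "restart gluster" is taken here to mean glusterd, and the package names are the openSUSE ones quoted in the thread):

    # roll the upgraded node back to the 5.x series it came from
    systemctl stop glusterd
    zypper install --oldpackage glusterfs-5.13
    systemctl start glusterd
    # confirm the pending-heal backlog drains on both volumes
    gluster volume heal apkmirror_data1 info summary
    gluster volume heal androidpolice_data3 info summary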
>>>>> >>> > >>>>> >>> >The only things still showing up in the logs, besides some healing >>>>> >is >>>>> >>> >0-glusterfs-fuse: >>>>> >>> >writing to fuse device failed: No such file or directory: >>>>> >>> >==> mnt-androidpolice_data3.log <== >>>>> >>> >[2020-05-01 16:54:21.085643] E >>>>> >>> >[fuse-bridge.c:219:check_and_dump_fuse_W] >>>>> >>> >(--> >>>>> >>> >>>>> >>/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] >>>>> >>> >(--> >>>>> >>> >>>>> >>>>> >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] >>>>> >>> >(--> >>>>> >>> >>>>> >>>>> >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] >>>>> >>> >(--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> >>>>> >>> >/lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) >>>>> >0-glusterfs-fuse: >>>>> >>> >writing to fuse device failed: No such file or directory >>>>> >>> >==> mnt-apkmirror_data1.log <== >>>>> >>> >[2020-05-01 16:54:21.268842] E >>>>> >>> >[fuse-bridge.c:219:check_and_dump_fuse_W] >>>>> >>> >(--> >>>>> >>> >>>>> >>/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fdf2b0a624d] >>>>> >>> >(--> >>>>> >>> >>>>> >>>>> >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fdf2748949a] >>>>> >>> >(--> >>>>> >>> >>>>> >>>>> >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fdf274897bb] >>>>> >>> >(--> /lib64/libpthread.so.0(+0x84f9)[0x7fdf2a5f64f9] (--> >>>>> >>> >/lib64/libc.so.6(clone+0x3f)[0x7fdf2a32ef2f] ))))) >>>>> >0-glusterfs-fuse: >>>>> >>> >writing to fuse device failed: No such file or directory >>>>> >>> > >>>>> >>> >It'd be very helpful if it had more info about what failed to >>>>> write >>>>> >and >>>>> >>> >why. >>>>> >>> > >>>>> >>> >I'd still really love to see the analysis of this failed upgrade >>>>> >from >>>>> >>> >core >>>>> >>> >gluster maintainers to see what needs fixing and how we can >>>>> upgrade >>>>> >in >>>>> >>> >the >>>>> >>> >future. >>>>> >>> > >>>>> >>> >Thanks. >>>>> >>> > >>>>> >>> >Sincerely, >>>>> >>> >Artem >>>>> >>> > >>>>> >>> >-- >>>>> >>> >Founder, Android Police <http://www.androidpolice.com>, APK >>>>> Mirror >>>>> >>> ><http://www.apkmirror.com/>, Illogical Robot LLC >>>>> >>> >beerpla.net | @ArtemR <http://twitter.com/ArtemR> >>>>> >>> > >>>>> >>> > >>>>> >>> >On Fri, May 1, 2020 at 7:25 AM Artem Russakovskii >>>>> ><archon...@gmail.com> >>>>> >>> >wrote: >>>>> >>> > >>>>> >>> >> I do not have snapshots, no. I have a general file based backup, >>>>> >but >>>>> >>> >also >>>>> >>> >> the other 3 nodes are up. >>>>> >>> >> >>>>> >>> >> OpenSUSE 15.1. >>>>> >>> >> >>>>> >>> >> If I try to downgrade and it doesn't work, what's the brick >>>>> >>> >replacement >>>>> >>> >> scenario - is this still accurate? >>>>> >>> >> >>>>> >>> > >>>>> >>> >>>>> > >>>>> https://docs.gluster.org/en/latest/Administrator%20Guide/Managing%20Volumes/#replace-brick >>>>> >>> >> >>>>> >>> >> Any feedback about the issues themselves yet please? >>>>> >Specifically, is >>>>> >>> >> there a chance this is happening because of the mismatched >>>>> >gluster >>>>> >>> >> versions? Though, what's the solution then? >>>>> >>> >> >>>>> >>> >> On Fri, May 1, 2020, 1:07 AM Strahil Nikolov >>>>> ><hunter86...@yahoo.com> >>>>> >>> >> wrote: >>>>> >>> >> >>>>> >>> >>> On May 1, 2020 1:25:17 AM GMT+03:00, Artem Russakovskii < >>>>> >>> >>> archon...@gmail.com> wrote: >>>>> >>> >>> >If more time is needed to analyze this, is this an option? 
>>>>> Shut >>>>> >>> >down >>>>> >>> >>> >7.5, >>>>> >>> >>> >downgrade it back to 5.13 and restart, or would this screw >>>>> >>> >something up >>>>> >>> >>> >badly? I didn't up the op-version yet. >>>>> >>> >>> > >>>>> >>> >>> >Thanks. >>>>> >>> >>> > >>>>> >>> >>> >Sincerely, >>>>> >>> >>> >Artem >>>>> >>> >>> > >>>>> >>> >>> >-- >>>>> >>> >>> >Founder, Android Police <http://www.androidpolice.com>, APK >>>>> >Mirror >>>>> >>> >>> ><http://www.apkmirror.com/>, Illogical Robot LLC >>>>> >>> >>> >beerpla.net | @ArtemR <http://twitter.com/ArtemR> >>>>> >>> >>> > >>>>> >>> >>> > >>>>> >>> >>> >On Thu, Apr 30, 2020 at 3:13 PM Artem Russakovskii >>>>> >>> >>> ><archon...@gmail.com> >>>>> >>> >>> >wrote: >>>>> >>> >>> > >>>>> >>> >>> >> The number of heal pending on citadel, the one that was >>>>> >upgraded >>>>> >>> >to >>>>> >>> >>> >7.5, >>>>> >>> >>> >> has now gone to 10s of thousands and continues to go up. >>>>> >>> >>> >> >>>>> >>> >>> >> Sincerely, >>>>> >>> >>> >> Artem >>>>> >>> >>> >> >>>>> >>> >>> >> -- >>>>> >>> >>> >> Founder, Android Police <http://www.androidpolice.com>, APK >>>>> >>> >Mirror >>>>> >>> >>> >> <http://www.apkmirror.com/>, Illogical Robot LLC >>>>> >>> >>> >> beerpla.net | @ArtemR <http://twitter.com/ArtemR> >>>>> >>> >>> >> >>>>> >>> >>> >> >>>>> >>> >>> >> On Thu, Apr 30, 2020 at 2:57 PM Artem Russakovskii >>>>> >>> >>> ><archon...@gmail.com> >>>>> >>> >>> >> wrote: >>>>> >>> >>> >> >>>>> >>> >>> >>> Hi all, >>>>> >>> >>> >>> >>>>> >>> >>> >>> Today, I decided to upgrade one of the four servers >>>>> >(citadel) we >>>>> >>> >>> >have to >>>>> >>> >>> >>> 7.5 from 5.13. There are 2 volumes, 1x4 replicate, and fuse >>>>> >>> >mounts >>>>> >>> >>> >(I sent >>>>> >>> >>> >>> the full details earlier in another message). If everything >>>>> >>> >looked >>>>> >>> >>> >OK, I >>>>> >>> >>> >>> would have proceeded the rolling upgrade for all of them, >>>>> >>> >following >>>>> >>> >>> >the >>>>> >>> >>> >>> full heal. 
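Since the op-version was deliberately left alone during this test, the checks around a rolling upgrade look roughly like this (a sketch; cluster.max-op-version is available on gluster 3.10 and newer, and the volume names are the ones from this cluster):

    # where the cluster currently sits vs. what the installed binaries support
    gluster volume get all cluster.op-version
    gluster volume get all cluster.max-op-version
    # before touching the next node, make sure heals have drained everywhere
    gluster volume heal apkmirror_data1 info summary
    gluster volume heal androidpolice_data3 info summary
    # only once ALL nodes run the new version would the op-version be bumped, e.g.:
    # gluster volume set all cluster.op-version <value reported by max-op-version>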
>>>>> >>> >>> >>> >>>>> >>> >>> >>> However, as soon as I upgraded and restarted, the logs >>>>> >filled >>>>> >>> >with >>>>> >>> >>> >>> messages like these: >>>>> >>> >>> >>> >>>>> >>> >>> >>> [2020-04-30 21:39:21.316149] E >>>>> >>> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc >>>>> >actor >>>>> >>> >>> >>> (1298437:400:17) failed to complete successfully >>>>> >>> >>> >>> [2020-04-30 21:39:21.382891] E >>>>> >>> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc >>>>> >actor >>>>> >>> >>> >>> (1298437:400:17) failed to complete successfully >>>>> >>> >>> >>> [2020-04-30 21:39:21.442440] E >>>>> >>> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc >>>>> >actor >>>>> >>> >>> >>> (1298437:400:17) failed to complete successfully >>>>> >>> >>> >>> [2020-04-30 21:39:21.445587] E >>>>> >>> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc >>>>> >actor >>>>> >>> >>> >>> (1298437:400:17) failed to complete successfully >>>>> >>> >>> >>> [2020-04-30 21:39:21.571398] E >>>>> >>> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc >>>>> >actor >>>>> >>> >>> >>> (1298437:400:17) failed to complete successfully >>>>> >>> >>> >>> [2020-04-30 21:39:21.668192] E >>>>> >>> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc >>>>> >actor >>>>> >>> >>> >>> (1298437:400:17) failed to complete successfully >>>>> >>> >>> >>> >>>>> >>> >>> >>> >>>>> >>> >>> >>> The message "I [MSGID: 108031] >>>>> >>> >>> >>> [afr-common.c:2581:afr_local_discovery_cbk] >>>>> >>> >>> >>> 0-androidpolice_data3-replicate-0: selecting local >>>>> >read_child >>>>> >>> >>> >>> androidpolice_data3-client-3" repeated 10 times between >>>>> >>> >[2020-04-30 >>>>> >>> >>> >>> 21:46:41.854675] and [2020-04-30 21:48:20.206323] >>>>> >>> >>> >>> The message "W [MSGID: 114031] >>>>> >>> >>> >>> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] >>>>> >>> >>> >>> 0-androidpolice_data3-client-1: remote operation failed >>>>> >>> >[Transport >>>>> >>> >>> >endpoint >>>>> >>> >>> >>> is not connected]" repeated 264 times between [2020-04-30 >>>>> >>> >>> >21:46:32.129567] >>>>> >>> >>> >>> and [2020-04-30 21:48:29.905008] >>>>> >>> >>> >>> The message "W [MSGID: 114031] >>>>> >>> >>> >>> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] >>>>> >>> >>> >>> 0-androidpolice_data3-client-0: remote operation failed >>>>> >>> >[Transport >>>>> >>> >>> >endpoint >>>>> >>> >>> >>> is not connected]" repeated 264 times between [2020-04-30 >>>>> >>> >>> >21:46:32.129602] >>>>> >>> >>> >>> and [2020-04-30 21:48:29.905040] >>>>> >>> >>> >>> The message "W [MSGID: 114031] >>>>> >>> >>> >>> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] >>>>> >>> >>> >>> 0-androidpolice_data3-client-2: remote operation failed >>>>> >>> >[Transport >>>>> >>> >>> >endpoint >>>>> >>> >>> >>> is not connected]" repeated 264 times between [2020-04-30 >>>>> >>> >>> >21:46:32.129512] >>>>> >>> >>> >>> and [2020-04-30 21:48:29.905047] >>>>> >>> >>> >>> >>>>> >>> >>> >>> >>>>> >>> >>> >>> >>>>> >>> >>> >>> Once in a while, I'm seeing this: >>>>> >>> >>> >>> ==> bricks/mnt-hive_block4-androidpolice_data3.log <== >>>>> >>> >>> >>> [2020-04-30 21:45:54.251637] I [MSGID: 115072] >>>>> >>> >>> >>> [server-rpc-fops_v2.c:1681:server4_setattr_cbk] >>>>> >>> >>> >>> 0-androidpolice_data3-server: 5725811: SETATTR / >>>>> >>> >>> >>> >>>>> >>> >>> > >>>>> >>> >>> >>>>> >>> > >>>>> >>> >>>>> > >>>>> androidpolice.com/public/wp-content/uploads/2019/03/cielo-breez-plus-hero.png >>>>> >>> >>> >>> 
(d4556eb4-f15b-412c-a42a-32b4438af557), client: >>>>> >>> >>> >>> >>>>> >>> >>> >>>>> >>> >>> >>>>> >>> >>>>> >>> >>>>> >>>>> >>>CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-2-RECON_NO:-1, >>>>> >>> >>> >>> error-xlator: androidpolice_data3-access-control [Operation >>>>> >not >>>>> >>> >>> >permitted] >>>>> >>> >>> >>> [2020-04-30 21:49:10.439701] I [MSGID: 115072] >>>>> >>> >>> >>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk] >>>>> >>> >>> >>> 0-androidpolice_data3-server: 201833: SETATTR / >>>>> >>> >>> >>> androidpolice.com/public/wp-content/uploads >>>>> >>> >>> >>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: >>>>> >>> >>> >>> >>>>> >>> >>> >>>>> >>> >>> >>>>> >>> >>>>> >>> >>>>> >>>>> >>>CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, >>>>> >>> >>> >>> error-xlator: androidpolice_data3-access-control [Operation >>>>> >not >>>>> >>> >>> >permitted] >>>>> >>> >>> >>> [2020-04-30 21:49:10.453724] I [MSGID: 115072] >>>>> >>> >>> >>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk] >>>>> >>> >>> >>> 0-androidpolice_data3-server: 201842: SETATTR / >>>>> >>> >>> >>> androidpolice.com/public/wp-content/uploads >>>>> >>> >>> >>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: >>>>> >>> >>> >>> >>>>> >>> >>> >>>>> >>> >>> >>>>> >>> >>>>> >>> >>>>> >>>>> >>>CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, >>>>> >>> >>> >>> error-xlator: androidpolice_data3-access-control [Operation >>>>> >not >>>>> >>> >>> >permitted] >>>>> >>> >>> >>> [2020-04-30 21:49:16.224662] I [MSGID: 115072] >>>>> >>> >>> >>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk] >>>>> >>> >>> >>> 0-androidpolice_data3-server: 202865: SETATTR / >>>>> >>> >>> >>> androidpolice.com/public/wp-content/uploads >>>>> >>> >>> >>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: >>>>> >>> >>> >>> >>>>> >>> >>> >>>>> >>> >>> >>>>> >>> >>>>> >>> >>>>> >>>>> >>>CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, >>>>> >>> >>> >>> error-xlator: androidpolice_data3-access-control [Operation >>>>> >not >>>>> >>> >>> >permitted] >>>>> >>> >>> >>> >>>>> >>> >>> >>> There's also lots of self-healing happening that I didn't >>>>> >expect >>>>> >>> >at >>>>> >>> >>> >all, >>>>> >>> >>> >>> since the upgrade only took ~10-15s. >>>>> >>> >>> >>> [2020-04-30 21:47:38.714448] I [MSGID: 108026] >>>>> >>> >>> >>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] >>>>> >>> >>> >>> 0-apkmirror_data1-replicate-0: performing metadata selfheal >>>>> >on >>>>> >>> >>> >>> 4a6ba2d7-7ad8-4113-862b-02e4934a3461 >>>>> >>> >>> >>> [2020-04-30 21:47:38.765033] I [MSGID: 108026] >>>>> >>> >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] >>>>> >>> >>> >>> 0-apkmirror_data1-replicate-0: Completed metadata selfheal >>>>> >on >>>>> >>> >>> >>> 4a6ba2d7-7ad8-4113-862b-02e4934a3461. 
sources=[3] sinks=0 >>>>> 1 >>>>> >2 >>>>> >>> >>> >>> [2020-04-30 21:47:38.765289] I [MSGID: 108026] >>>>> >>> >>> >>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] >>>>> >>> >>> >>> 0-apkmirror_data1-replicate-0: performing metadata selfheal >>>>> >on >>>>> >>> >>> >>> f3c62a41-1864-4e75-9883-4357a7091296 >>>>> >>> >>> >>> [2020-04-30 21:47:38.800987] I [MSGID: 108026] >>>>> >>> >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] >>>>> >>> >>> >>> 0-apkmirror_data1-replicate-0: Completed metadata selfheal >>>>> >on >>>>> >>> >>> >>> f3c62a41-1864-4e75-9883-4357a7091296. sources=[3] sinks=0 >>>>> 1 >>>>> >2 >>>>> >>> >>> >>> >>>>> >>> >>> >>> >>>>> >>> >>> >>> I'm also seeing "remote operation failed" and "writing to >>>>> >fuse >>>>> >>> >>> >device >>>>> >>> >>> >>> failed: No such file or directory" messages >>>>> >>> >>> >>> [2020-04-30 21:46:34.891957] I [MSGID: 108026] >>>>> >>> >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] >>>>> >>> >>> >>> 0-androidpolice_data3-replicate-0: Completed metadata >>>>> >selfheal >>>>> >>> >on >>>>> >>> >>> >>> 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] >>>>> >sinks=3 >>>>> >>> >>> >>> [2020-04-30 21:45:36.127412] W [MSGID: 114031] >>>>> >>> >>> >>> [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] >>>>> >>> >>> >>> 0-androidpolice_data3-client-0: remote operation failed >>>>> >>> >[Operation >>>>> >>> >>> >not >>>>> >>> >>> >>> permitted] >>>>> >>> >>> >>> [2020-04-30 21:45:36.345924] W [MSGID: 114031] >>>>> >>> >>> >>> [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] >>>>> >>> >>> >>> 0-androidpolice_data3-client-1: remote operation failed >>>>> >>> >[Operation >>>>> >>> >>> >not >>>>> >>> >>> >>> permitted] >>>>> >>> >>> >>> [2020-04-30 21:46:35.291853] I [MSGID: 108031] >>>>> >>> >>> >>> [afr-common.c:2543:afr_local_discovery_cbk] >>>>> >>> >>> >>> 0-androidpolice_data3-replicate-0: selecting local >>>>> >read_child >>>>> >>> >>> >>> androidpolice_data3-client-2 >>>>> >>> >>> >>> [2020-04-30 21:46:35.977342] I [MSGID: 108026] >>>>> >>> >>> >>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] >>>>> >>> >>> >>> 0-androidpolice_data3-replicate-0: performing metadata >>>>> >selfheal >>>>> >>> >on >>>>> >>> >>> >>> 2692eeba-1ebe-49b6-927f-1dfbcd227591 >>>>> >>> >>> >>> [2020-04-30 21:46:36.006607] I [MSGID: 108026] >>>>> >>> >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] >>>>> >>> >>> >>> 0-androidpolice_data3-replicate-0: Completed metadata >>>>> >selfheal >>>>> >>> >on >>>>> >>> >>> >>> 2692eeba-1ebe-49b6-927f-1dfbcd227591. 
sources=0 1 [2] >>>>> >sinks=3 >>>>> >>> >>> >>> [2020-04-30 21:46:37.245599] E >>>>> >>> >>> >[fuse-bridge.c:219:check_and_dump_fuse_W] >>>>> >>> >>> >>> (--> >>>>> >>> >>> >>>>> >>> >>>>> >>>>> >>>/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] >>>>> >>> >>> >>> (--> >>>>> >>> >>> >>> >>>>> >>> >>> >>>>> >>> >>>>> >>>>> >>>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] >>>>> >>> >>> >>> (--> >>>>> >>> >>> >>> >>>>> >>> >>> >>>>> >>> >>>>> >>>>> >>>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] >>>>> >>> >>> >>> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> >>>>> >>> >>> >>> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) >>>>> >>> >0-glusterfs-fuse: >>>>> >>> >>> >>> writing to fuse device failed: No such file or directory >>>>> >>> >>> >>> [2020-04-30 21:46:50.864797] E >>>>> >>> >>> >[fuse-bridge.c:219:check_and_dump_fuse_W] >>>>> >>> >>> >>> (--> >>>>> >>> >>> >>>>> >>> >>>>> >>>>> >>>/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] >>>>> >>> >>> >>> (--> >>>>> >>> >>> >>> >>>>> >>> >>> >>>>> >>> >>>>> >>>>> >>>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] >>>>> >>> >>> >>> (--> >>>>> >>> >>> >>> >>>>> >>> >>> >>>>> >>> >>>>> >>>>> >>>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] >>>>> >>> >>> >>> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> >>>>> >>> >>> >>> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) >>>>> >>> >0-glusterfs-fuse: >>>>> >>> >>> >>> writing to fuse device failed: No such file or directory >>>>> >>> >>> >>> >>>>> >>> >>> >>> The number of items being healed is going up and down >>>>> >wildly, >>>>> >>> >from 0 >>>>> >>> >>> >to >>>>> >>> >>> >>> 8000+ and sometimes taking a really long time to return a >>>>> >value. >>>>> >>> >I'm >>>>> >>> >>> >really >>>>> >>> >>> >>> worried as this is a production system, and I didn't >>>>> observe >>>>> >>> >this in >>>>> >>> >>> >our >>>>> >>> >>> >>> test system. 
>>>>> >>> >>> >>> >>>>> >>> >>> >>> >>>>> >>> >>> >>> >>>>> >>> >>> >>> gluster v heal apkmirror_data1 info summary >>>>> >>> >>> >>> Brick nexus2:/mnt/nexus2_block1/apkmirror_data1 >>>>> >>> >>> >>> Status: Connected >>>>> >>> >>> >>> Total Number of entries: 27 >>>>> >>> >>> >>> Number of entries in heal pending: 27 >>>>> >>> >>> >>> Number of entries in split-brain: 0 >>>>> >>> >>> >>> Number of entries possibly healing: 0 >>>>> >>> >>> >>> >>>>> >>> >>> >>> Brick forge:/mnt/forge_block1/apkmirror_data1 >>>>> >>> >>> >>> Status: Connected >>>>> >>> >>> >>> Total Number of entries: 27 >>>>> >>> >>> >>> Number of entries in heal pending: 27 >>>>> >>> >>> >>> Number of entries in split-brain: 0 >>>>> >>> >>> >>> Number of entries possibly healing: 0 >>>>> >>> >>> >>> >>>>> >>> >>> >>> Brick hive:/mnt/hive_block1/apkmirror_data1 >>>>> >>> >>> >>> Status: Connected >>>>> >>> >>> >>> Total Number of entries: 27 >>>>> >>> >>> >>> Number of entries in heal pending: 27 >>>>> >>> >>> >>> Number of entries in split-brain: 0 >>>>> >>> >>> >>> Number of entries possibly healing: 0 >>>>> >>> >>> >>> >>>>> >>> >>> >>> Brick citadel:/mnt/citadel_block1/apkmirror_data1 >>>>> >>> >>> >>> Status: Connected >>>>> >>> >>> >>> Total Number of entries: 8540 >>>>> >>> >>> >>> Number of entries in heal pending: 8540 >>>>> >>> >>> >>> Number of entries in split-brain: 0 >>>>> >>> >>> >>> Number of entries possibly healing: 0 >>>>> >>> >>> >>> >>>>> >>> >>> >>> >>>>> >>> >>> >>> >>>>> >>> >>> >>> gluster v heal androidpolice_data3 info summary >>>>> >>> >>> >>> Brick nexus2:/mnt/nexus2_block4/androidpolice_data3 >>>>> >>> >>> >>> Status: Connected >>>>> >>> >>> >>> Total Number of entries: 1 >>>>> >>> >>> >>> Number of entries in heal pending: 1 >>>>> >>> >>> >>> Number of entries in split-brain: 0 >>>>> >>> >>> >>> Number of entries possibly healing: 0 >>>>> >>> >>> >>> >>>>> >>> >>> >>> Brick forge:/mnt/forge_block4/androidpolice_data3 >>>>> >>> >>> >>> Status: Connected >>>>> >>> >>> >>> Total Number of entries: 1 >>>>> >>> >>> >>> Number of entries in heal pending: 1 >>>>> >>> >>> >>> Number of entries in split-brain: 0 >>>>> >>> >>> >>> Number of entries possibly healing: 0 >>>>> >>> >>> >>> >>>>> >>> >>> >>> Brick hive:/mnt/hive_block4/androidpolice_data3 >>>>> >>> >>> >>> Status: Connected >>>>> >>> >>> >>> Total Number of entries: 1 >>>>> >>> >>> >>> Number of entries in heal pending: 1 >>>>> >>> >>> >>> Number of entries in split-brain: 0 >>>>> >>> >>> >>> Number of entries possibly healing: 0 >>>>> >>> >>> >>> >>>>> >>> >>> >>> Brick citadel:/mnt/citadel_block4/androidpolice_data3 >>>>> >>> >>> >>> Status: Connected >>>>> >>> >>> >>> Total Number of entries: 1149 >>>>> >>> >>> >>> Number of entries in heal pending: 1149 >>>>> >>> >>> >>> Number of entries in split-brain: 0 >>>>> >>> >>> >>> Number of entries possibly healing: 0 >>>>> >>> >>> >>> >>>>> >>> >>> >>> >>>>> >>> >>> >>> What should I do at this point? The files I tested seem to >>>>> >be >>>>> >>> >>> >replicating >>>>> >>> >>> >>> correctly, but I don't know if it's the case for all of >>>>> >them, >>>>> >>> >and >>>>> >>> >>> >the heals >>>>> >>> >>> >>> going up and down, and all these log messages are making me >>>>> >very >>>>> >>> >>> >nervous. >>>>> >>> >>> >>> >>>>> >>> >>> >>> Thank you. 
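One way to see how much those counts really swing, rather than eyeballing single runs, is to sample them in a loop (a sketch using the same two volumes; the 30-second interval is arbitrary):

    # timestamped per-brick count of entries needing heal, every 30 seconds
    while true; do
        date
        gluster volume heal apkmirror_data1 statistics heal-count
        gluster volume heal androidpolice_data3 statistics heal-count
        sleep 30
    done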
>>>>> >>> >>> >>> >>>>> >>> >>> >>> Sincerely, >>>>> >>> >>> >>> Artem >>>>> >>> >>> >>> >>>>> >>> >>> >>> -- >>>>> >>> >>> >>> Founder, Android Police <http://www.androidpolice.com>, >>>>> APK >>>>> >>> >Mirror >>>>> >>> >>> >>> <http://www.apkmirror.com/>, Illogical Robot LLC >>>>> >>> >>> >>> beerpla.net | @ArtemR <http://twitter.com/ArtemR> >>>>> >>> >>> >>> >>>>> >>> >>> >> >>>>> >>> >>> >>>>> >>> >>> I's not supported , but usually it works. >>>>> >>> >>> >>>>> >>> >>> In worst case scenario, you can remove the node, wipe gluster >>>>> >on >>>>> >>> >the >>>>> >>> >>> node, reinstall the packages and add it - it will require full >>>>> >heal >>>>> >>> >of the >>>>> >>> >>> brick and as you have previously reported could lead to >>>>> >performance >>>>> >>> >>> degradation. >>>>> >>> >>> >>>>> >>> >>> I think you are on SLES, but I could be wrong . Do you have >>>>> >btrfs or >>>>> >>> >LVM >>>>> >>> >>> snapshots to revert from ? >>>>> >>> >>> >>>>> >>> >>> Best Regards, >>>>> >>> >>> Strahil Nikolov >>>>> >>> >>> >>>>> >>> >> >>>>> >>> >>>>> >>> Hi Artem, >>>>> >>> >>>>> >>> You can increase the brick log level following >>>>> >>> >>>>> > >>>>> https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level >>>>> >>> but keep in mind that logs grow quite fast - so don't keep them >>>>> >above the >>>>> >>> current level for too much time. >>>>> >>> >>>>> >>> >>>>> >>> Do you have a geo replication running ? >>>>> >>> >>>>> >>> About the migration issue - I have no clue why this happened. Last >>>>> >time I >>>>> >>> skipped a major release(3.12 to 5.5) I got a huge trouble (all >>>>> >files >>>>> >>> ownership was switched to root) and I have the feeling that it >>>>> >won't >>>>> >>> happen again if you go through v6. >>>>> >>> >>>>> >>> Best Regards, >>>>> >>> Strahil Nikolov >>>>> >>> >>>>> >> ________ >>>>> >> >>>>> >> >>>>> >> >>>>> >> Community Meeting Calendar: >>>>> >> >>>>> >> Schedule - >>>>> >> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC >>>>> >> Bridge: https://bluejeans.com/441850968 >>>>> >> >>>>> >> Gluster-users mailing list >>>>> >> Gluster-users@gluster.org >>>>> >> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>> >> >>>>> >>>>> Hey Artem, >>>>> >>>>> I just checked if the 'replica 4' is causing the issue , but that's >>>>> not true (tested with 1 node down, but it's the same situation). >>>>> >>>>> I created 4 VMs on CentOS 7 & Gluster v7.5 (brick has only noatime >>>>> mount option) and created a 'replica 4' volume. >>>>> Then I created a dir and placed 50000 very small files there via: >>>>> for i in {1..50000}; do echo $RANDOM > $i ; done >>>>> >>>>> The find command 'finds' them in 4s and after some tuning I have >>>>> managed to lower it to 2.5s. >>>>> >>>>> What has caused some improvement was: >>>>> A) Activated the rhgs-random-io tuned profile which you can take from >>>>> ftp://ftp.redhat.com/redhat/linux/enterprise/7Server/en/RHS/SRPMS/redhat-storage-server-3.5.0.0-1.el7rhgs.src.rpm >>>>> B) using noatime for the mount option and if you use SELINUX you could >>>>> use the 'context=system_u:object_r:glusterd_brick_t:s0' mount option to >>>>> prevent selinux context lookups >>>>> C) Activation of the gluster group of settings 'metadata-cache' or >>>>> 'nl-cache' brought 'find' to the same results - lowered from 3.5s to 2.5s >>>>> after an initial run. >>>>> >>>>> I know that I'm not compairing apples to apples , but still it might >>>>> help. 
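Strahil's small-file test and tuning translate to roughly the following (a sketch run inside a fuse mount; "testvol" is a placeholder volume name, the tuned profile assumes the redhat-storage-server SRPM linked above is installed, and metadata-cache/nl-cache are the stock gluster option groups):

    # create 50000 small files and time a directory walk over the fuse mount
    mkdir smallfiles && cd smallfiles
    for i in {1..50000}; do echo $RANDOM > $i; done
    time find . | wc -l
    # apply one of the caching option groups on the volume, then re-run the find
    gluster volume set testvol group metadata-cache
    # or: gluster volume set testvol group nl-cache
    # optionally switch the tuned profile shipped in the SRPM referenced above
    tuned-adm profile rhgs-random-io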
>>>>> >>>>> I would like to learn what actually gluster does when a 'find' or 'ls' >>>>> is invoked, as I doubt it just executes it on the bricks. >>>>> >>>>> Best Regards, >>>>> Strahil Nikolov >>>>> >>>> ________ >> >> >> >> Community Meeting Calendar: >> >> Schedule - >> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC >> Bridge: https://bluejeans.com/441850968 >> >> Gluster-users mailing list >> Gluster-users@gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users >> > > > -- > Respectfully > Mahdi >
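And the brick log-level increase Strahil suggested earlier in the thread maps to these standard volume options (a sketch against one of the volumes above; remember to reset them afterwards, since DEBUG logs grow fast):

    # raise logging on one volume while reproducing, then put it back
    gluster volume set androidpolice_data3 diagnostics.brick-log-level DEBUG
    gluster volume set androidpolice_data3 diagnostics.client-log-level DEBUG
    # ...reproduce the problem, collect /var/log/glusterfs/ output...
    gluster volume reset androidpolice_data3 diagnostics.brick-log-level
    gluster volume reset androidpolice_data3 diagnostics.client-log-level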