Dear Ravi,

Thank you for your answer. I will start by sending you below the getfattr output for the first entry that does not get healed (it is in fact a directory). It is the following path/dir from the output of one of my previous mails:

/data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/oc_dir
# NODE 1
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.myvol-pro-client-1=0x000000000000000300000003
trusted.gfid=0x25e2616b4fb64b2a89451afc956fff19
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

# NODE 2
trusted.gfid=0xd9ac192ce85e4402af105551f587ed9a
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

# NODE 3 (arbiter)
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.myvol-pro-client-1=0x000000000000000300000003
trusted.gfid=0x25e2616b4fb64b2a89451afc956fff19
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

Notice here that node 2 does not have any AFR attributes, which must be problematic, and that its trusted.gfid differs from the one on nodes 1 and 3. Also, that specific directory on node 2 has the oldest timestamp (14:12), whereas the same directory on nodes 1 and 3 has 14:19 as timestamp.

I did run "gluster volume heal myvol-pro" and on the console it shows:

Launching heal operation to perform index self heal on volume myvol-pro has been successful
Use heal info commands to check status.

but then nothing new has been logged in the glustershd.log file on any of the 3 nodes. The cmd_history.log file shows:

[2018-11-08 07:20:24.481603]  : volume heal myvol-pro : SUCCESS

and glusterd.log:

[2018-11-08 07:20:24.474032] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glustershd: error returned while attempting to connect to host:(null), port:0

That's it... To me it looks like a split-brain, but GlusterFS does not report it as split-brain, and no self-heal runs on it either. What do you think?

Regards,
M.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Thursday, November 8, 2018 5:00 AM, Ravishankar N <ravishan...@redhat.com> wrote:

> Can you share the getfattr output of all 4 entries from all 3 bricks?
>
> Also, can you tailf glustershd.log on all nodes and see if anything is
> logged for these entries when you run 'gluster volume heal $volname'?
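As context for readers following along: the differing trusted.gfid values above are the usual sign of a gfid mismatch between bricks. A minimal sketch for inspecting this, assuming bash; the helper function names are made up (not Gluster tools), and the sample is the NODE 1 output above:

```shell
# Extract trusted.gfid from `getfattr -d -m . -e hex` output and map it to
# the brick-internal .glusterfs path where the same object is linked.
# Helper names are hypothetical; the gfid value is NODE 1's from above.
parse_gfid() {
    awk -F= '/^trusted\.gfid=/ {print $2}'
}

gfid_to_backend_path() {
    # 0x25e2616b... -> .glusterfs/25/e2/25e2616b-4fb6-4b2a-8945-1afc956fff19
    local g=${1#0x}
    local u="${g:0:8}-${g:8:4}-${g:12:4}-${g:16:4}-${g:20:12}"
    printf '.glusterfs/%s/%s/%s\n' "${u:0:2}" "${u:2:2}" "$u"
}

sample='trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x25e2616b4fb64b2a89451afc956fff19'

gfid=$(printf '%s\n' "$sample" | parse_gfid)
gfid_to_backend_path "$gfid"   # .glusterfs/25/e2/25e2616b-4fb6-4b2a-8945-1afc956fff19
```

Checking that path under each brick root (for directories the .glusterfs entry is a symlink rather than a hardlink) shows which bricks agree on the gfid.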
> Regards,
> Ravi
>
> On 11/07/2018 01:22 PM, mabi wrote:
>
> > To my eyes this specific case looks like a split-brain scenario, but the
> > output of "volume heal info split-brain" does not show any files. Should
> > I still use the process for split-brain files as documented in the
> > GlusterFS documentation, or what do you recommend here?
> >
> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > On Monday, November 5, 2018 4:36 PM, mabi m...@protonmail.ch wrote:
> >
> > > Ravi, I did not yet modify the cluster.data-self-heal parameter to off
> > > because in the meantime node2 of my cluster had a memory shortage (this
> > > node has 32 GB of RAM) and as such I had to reboot it. After that
> > > reboot all locks got released and there are no more files left to heal
> > > on that volume. So the reboot of node2 did the trick (but this still
> > > seems to be a bug).
> > >
> > > Now on another volume of this same cluster I have a total of 8 entries
> > > (4 of them being directories) unsynced on node1 and node3 (arbiter), as
> > > you can see below:
> > >
> > > Brick node1:/data/myvol-pro/brick
> > > /data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/oc_dir
> > > gfid:3c92459b-8fa1-4669-9a3d-b38b8d41c360
> > > /data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/le_dir
> > > gfid:aae4098a-1a71-4155-9cc9-e564b89957cf
> > > Status: Connected
> > > Number of entries: 4
> > >
> > > Brick node2:/data/myvol-pro/brick
> > > Status: Connected
> > > Number of entries: 0
> > >
> > > Brick node3:/srv/glusterfs/myvol-pro/brick
> > > /data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/oc_dir
> > > /data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/le_dir
> > > gfid:aae4098a-1a71-4155-9cc9-e564b89957cf
> > > gfid:3c92459b-8fa1-4669-9a3d-b38b8d41c360
> > > Status: Connected
> > > Number of entries: 4
> > >
> > > If I check the directory
> > > "/data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/" with an
> > > "ls -l" on the client (gluster FUSE mount) I get the following garbage:
> > >
> > > drwxr-xr-x  4 www-data www-data 4096 Nov  5 14:19 .
> > > drwxr-xr-x 31 www-data www-data 4096 Nov  5 14:23 ..
> > > d????????? ?  ?        ?           ?            ? le_dir
> > >
> > > I checked on the nodes and indeed node1 and node3 have the same
> > > directory with the timestamp 14:19, but node2 has a directory with the
> > > timestamp 14:12.
> > > Again, the self-heal daemon doesn't seem to be doing anything... What
> > > do you recommend I do in order to heal these unsynced entries?
> > >
> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > On Monday, November 5, 2018 2:42 AM, Ravishankar N ravishan...@redhat.com wrote:
> > >
> > > > On 11/03/2018 04:13 PM, mabi wrote:
> > > >
> > > > > Ravi (or anyone else who can help), I now have even more files
> > > > > which are pending heal.
> > > >
> > > > If the count is increasing, there is likely a network (disconnect)
> > > > problem between the gluster clients and the bricks that needs fixing.
> > > >
> > > > > Here is the output of "volume heal info summary":
> > > > >
> > > > > Brick node1:/data/myvol-private/brick
> > > > > Status: Connected
> > > > > Total Number of entries: 49845
> > > > > Number of entries in heal pending: 49845
> > > > > Number of entries in split-brain: 0
> > > > > Number of entries possibly healing: 0
> > > > >
> > > > > Brick node2:/data/myvol-private/brick
> > > > > Status: Connected
> > > > > Total Number of entries: 26644
> > > > > Number of entries in heal pending: 26644
> > > > > Number of entries in split-brain: 0
> > > > > Number of entries possibly healing: 0
> > > > >
> > > > > Brick node3:/srv/glusterfs/myvol-private/brick
> > > > > Status: Connected
> > > > > Total Number of entries: 0
> > > > > Number of entries in heal pending: 0
> > > > > Number of entries in split-brain: 0
> > > > > Number of entries possibly healing: 0
> > > > >
> > > > > Should I try to set the "cluster.data-self-heal" parameter of that
> > > > > volume to "off" as mentioned in the bug?
> > > >
> > > > Yes, as mentioned in the workaround in the thread that I shared.
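As an aside, the per-brick summaries above can be tallied mechanically. A minimal sketch, assuming bash and awk; the sample text is abridged from the heal info summary output above and the variable names are made up:

```shell
# Sum the "heal pending" counters across bricks from the output of
# `gluster volume heal <volname> info summary`. Sample abridged from above;
# on a live system you would pipe the gluster command itself instead:
#   gluster volume heal myvol-private info summary | awk ...
summary='Brick node1:/data/myvol-private/brick
Number of entries in heal pending: 49845
Brick node2:/data/myvol-private/brick
Number of entries in heal pending: 26644
Brick node3:/srv/glusterfs/myvol-private/brick
Number of entries in heal pending: 0'

total=$(printf '%s\n' "$summary" |
    awk -F': ' '/in heal pending/ {t += $2} END {print t}')
echo "entries pending heal across all bricks: $total"   # 76489
```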
> > > > >
> > > > > And by doing that, does it mean that my files pending heal are in
> > > > > danger of being lost?
> > > >
> > > > No.
> > > >
> > > > > Also, is it dangerous to leave "cluster.data-self-heal" off?
> > > >
> > > > No. This only disables client-side data healing. The self-heal
> > > > daemon would still heal the files.
> > > > -Ravi
> > > >
> > > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > > On Saturday, November 3, 2018 1:31 AM, Ravishankar N ravishan...@redhat.com wrote:
> > > > >
> > > > > > Mabi,
> > > > > > If bug 1637953 is what you are experiencing, then you need to
> > > > > > follow the workarounds mentioned in
> > > > > > https://lists.gluster.org/pipermail/gluster-users/2018-October/035178.html.
> > > > > > Can you see if this works?
> > > > > > -Ravi
> > > > > >
> > > > > > On 11/02/2018 11:40 PM, mabi wrote:
> > > > > >
> > > > > > > I tried again to manually run a heal by using the "gluster
> > > > > > > volume heal" command because still no files have been healed,
> > > > > > > and noticed the following warning in the glusterd.log file:
> > > > > > >
> > > > > > > [2018-11-02 18:04:19.454702] I [MSGID: 106533] [glusterd-volume-ops.c:938:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume myvol-private
> > > > > > > [2018-11-02 18:04:19.457311] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glustershd: error returned while attempting to connect to host:(null), port:0
> > > > > > >
> > > > > > > It looks like glustershd can't connect to "host:(null)"; could
> > > > > > > that be the reason why there is no healing taking place? If
> > > > > > > yes, why do I see "host:(null)" here, and what needs fixing?
> > > > > > > This seems to have happened since I upgraded from 3.12.14 to 4.1.5.
> > > > > > > I really would appreciate some help here; I suspect this is an
> > > > > > > issue with GlusterFS 4.1.5.
> > > > > > > Thank you in advance for any feedback.
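A quick way to see whether such errors keep occurring is to tally error and warning lines per log subsystem. A minimal sketch, assuming bash and awk, fed here with two lines copied from this thread; on a real node you would read /var/log/glusterfs/glustershd.log or glusterd.log instead:

```shell
# Count E/W log lines per subsystem name (the 5th whitespace-separated
# field, e.g. "0-glustershd:"). Sample lines copied from this thread.
log='[2018-11-02 18:04:19.457311] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glustershd: error returned while attempting to connect to host:(null), port:0
[2018-10-31 10:06:41.524300] E [rpc-clnt.c:184:call_bail] 0-myvol-private-client-0: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x108b sent = 2018-10-31 09:36:41.314203. timeout = 1800 for 127.0.1.1:49152'

errors=$(printf '%s\n' "$log" |
    awk '$3 == "E" || $3 == "W" {sub(/:$/, "", $5); c[$5]++}
         END {for (k in c) print k, c[k]}' | sort)
printf '%s\n' "$errors"
```

Running the same one-liner against yesterday's and today's logs shows whether the bail-outs and connection failures are historical or ongoing.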
> > > > > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > > > > On Wednesday, October 31, 2018 11:13 AM, mabi m...@protonmail.ch wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > > I have a GlusterFS 4.1.5 cluster with 3 nodes (including 1
> > > > > > > > arbiter) and currently have a volume with around 27174 files
> > > > > > > > which are not being healed. The "volume heal info" command
> > > > > > > > shows the same 27k files under the first and second nodes,
> > > > > > > > but there is nothing under the 3rd node (arbiter).
> > > > > > > > I already tried running a "volume heal" but none of the
> > > > > > > > files got healed.
> > > > > > > >
> > > > > > > > In the glfsheal log file for that particular volume the only
> > > > > > > > errors I see are a few entries like this one:
> > > > > > > >
> > > > > > > > [2018-10-31 10:06:41.524300] E [rpc-clnt.c:184:call_bail] 0-myvol-private-client-0: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x108b sent = 2018-10-31 09:36:41.314203. timeout = 1800 for 127.0.1.1:49152
> > > > > > > >
> > > > > > > > and then a few warnings like this one:
> > > > > > > >
> > > > > > > > [2018-10-31 10:08:12.161498] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.5/xlator/cluster/replicate.so(+0x6734a) [0x7f2a6dff434a] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x5da84) [0x7f2a798e8a84] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x58) [0x7f2a798a37f8] ) 0-dict: dict is NULL [Invalid argument]
> > > > > > > >
> > > > > > > > The glustershd.log file shows the following:
> > > > > > > >
> > > > > > > > [2018-10-31 10:10:52.502453] E [rpc-clnt.c:184:call_bail] 0-myvol-private-client-0: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0xaa398 sent = 2018-10-31 09:40:50.927816.
> > > > > > > > timeout = 1800 for 127.0.1.1:49152
> > > > > > > > [2018-10-31 10:10:52.502502] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-myvol-private-client-0: remote operation failed [Transport endpoint is not connected]
> > > > > > > >
> > > > > > > > Any idea what could be wrong here?
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Mabi
> > > > > > > >
> > > > > > > > _______________________________________________
> > > > > > > > Gluster-users mailing list
> > > > > > > > Gluster-users@gluster.org
> > > > > > > > https://lists.gluster.org/mailman/listinfo/gluster-users

_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users