Dear Ravi,

Thank you for your answer. I will start by sending you below the getfattr output for the first entry that does not get healed (it is in fact a directory). It is the following path/dir from the output of one of my previous mails:

/data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/oc_dir
# NODE 1
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.myvol-pro-client-1=0x000000000000000300000003
trusted.gfid=0x25e2616b4fb64b2a89451afc956fff19
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

# NODE 2
trusted.gfid=0xd9ac192ce85e4402af105551f587ed9a
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

# NODE 3 (arbiter)
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.myvol-pro-client-1=0x000000000000000300000003
trusted.gfid=0x25e2616b4fb64b2a89451afc956fff19
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

Notice here that node 2 does not have any AFR attributes, which must be problematic, and that its trusted.gfid differs from the one on nodes 1 and 3. Also, that specific directory on node 2 has the oldest timestamp (14:12), whereas the same directory on nodes 1 and 3 has 14:19 as timestamp.

I did run "gluster volume heal myvol-pro" and on the console it shows:

Launching heal operation to perform index self heal on volume myvol-pro has been successful
Use heal info commands to check status.

but then nothing new has been logged in the glustershd.log file on any of the 3 nodes. The cmd_history.log file shows:

[2018-11-08 07:20:24.481603]  : volume heal myvol-pro : SUCCESS

and glusterd.log:

[2018-11-08 07:20:24.474032] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glustershd: error returned while attempting to connect to host:(null), port:0

That's it... To me it looks like a split-brain, but GlusterFS does not report it as split-brain, and no self-heal runs on it either. What do you think?

Regards,
M.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Thursday, November 8, 2018 5:00 AM, Ravishankar N <ravishan...@redhat.com> wrote:

> Can you share the getfattr output of all 4 entries from all 3 bricks?
>
> Also, can you tailf glustershd.log on all nodes and see if anything is
> logged for these entries when you run 'gluster volume heal $volname'?
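As context for readers following along: the differing trusted.gfid values above are the usual sign of a gfid mismatch between bricks. A minimal sketch for inspecting this, assuming bash; the helper function names are made up (not Gluster tools), and the sample is the NODE 1 output above:

```shell
# Extract trusted.gfid from `getfattr -d -m . -e hex` output and map it to
# the brick-internal .glusterfs path where the same object is linked.
# Helper names are hypothetical; the gfid value is NODE 1's from above.
parse_gfid() {
    awk -F= '/^trusted\.gfid=/ {print $2}'
}

gfid_to_backend_path() {
    # 0x25e2616b... -> .glusterfs/25/e2/25e2616b-4fb6-4b2a-8945-1afc956fff19
    local g=${1#0x}
    local u="${g:0:8}-${g:8:4}-${g:12:4}-${g:16:4}-${g:20:12}"
    printf '.glusterfs/%s/%s/%s\n' "${u:0:2}" "${u:2:2}" "$u"
}

sample='trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x25e2616b4fb64b2a89451afc956fff19'

gfid=$(printf '%s\n' "$sample" | parse_gfid)
gfid_to_backend_path "$gfid"   # .glusterfs/25/e2/25e2616b-4fb6-4b2a-8945-1afc956fff19
```

Checking that path under each brick root (for directories the .glusterfs entry is a symlink rather than a hardlink) shows which bricks agree on the gfid.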
> Regards,
> Ravi
>
> On 11/07/2018 01:22 PM, mabi wrote:
>
> > To my eyes this specific case looks like a split-brain scenario, but the
> > output of "volume heal info split-brain" does not show any files. Should
> > I still use the process for split-brain files as documented in the
> > GlusterFS documentation, or what do you recommend here?
> >
> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > On Monday, November 5, 2018 4:36 PM, mabi m...@protonmail.ch wrote:
> >
> > > Ravi, I did not yet modify the cluster.data-self-heal parameter to off
> > > because in the meantime node2 of my cluster had a memory shortage (this
> > > node has 32 GB of RAM) and as such I had to reboot it. After that
> > > reboot all locks got released and there are no more files left to heal
> > > on that volume. So the reboot of node2 did the trick (but this still
> > > seems to be a bug).
> > >
> > > Now on another volume of this same cluster I have a total of 8 entries
> > > (4 of them being directories) unsynced on node1 and node3 (arbiter), as
> > > you can see below:
> > >
> > > Brick node1:/data/myvol-pro/brick
> > > /data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/oc_dir
> > > gfid:3c92459b-8fa1-4669-9a3d-b38b8d41c360
> > > /data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/le_dir
> > > gfid:aae4098a-1a71-4155-9cc9-e564b89957cf
> > > Status: Connected
> > > Number of entries: 4
> > >
> > > Brick node2:/data/myvol-pro/brick
> > > Status: Connected
> > > Number of entries: 0
> > >
> > > Brick node3:/srv/glusterfs/myvol-pro/brick
> > > /data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/oc_dir
> > > /data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/le_dir
> > > gfid:aae4098a-1a71-4155-9cc9-e564b89957cf
> > > gfid:3c92459b-8fa1-4669-9a3d-b38b8d41c360
> > > Status: Connected
> > > Number of entries: 4
> > >
> > > If I check the directory
> > > "/data/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/" with an
> > > "ls -l" on the client (gluster FUSE mount) I get the following garbage:
> > >
> > > drwxr-xr-x  4 www-data www-data 4096 Nov  5 14:19 .
> > > drwxr-xr-x 31 www-data www-data 4096 Nov  5 14:23 ..
> > > d????????? ?  ?        ?           ?            ? le_dir
> > >
> > > I checked on the nodes and indeed node1 and node3 have the same
> > > directory with the timestamp 14:19, but node2 has a directory with the
> > > timestamp 14:12.
> > > Again, the self-heal daemon doesn't seem to be doing anything... What
> > > do you recommend I do in order to heal these unsynced entries?
> > >
> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > On Monday, November 5, 2018 2:42 AM, Ravishankar N ravishan...@redhat.com wrote:
> > >
> > > > On 11/03/2018 04:13 PM, mabi wrote:
> > > >
> > > > > Ravi (or anyone else who can help), I now have even more files
> > > > > which are pending heal.
> > > >
> > > > If the count is increasing, there is likely a network (disconnect)
> > > > problem between the gluster clients and the bricks that needs fixing.
> > > >
> > > > > Here is the output of "volume heal info summary":
> > > > >
> > > > > Brick node1:/data/myvol-private/brick
> > > > > Status: Connected
> > > > > Total Number of entries: 49845
> > > > > Number of entries in heal pending: 49845
> > > > > Number of entries in split-brain: 0
> > > > > Number of entries possibly healing: 0
> > > > >
> > > > > Brick node2:/data/myvol-private/brick
> > > > > Status: Connected
> > > > > Total Number of entries: 26644
> > > > > Number of entries in heal pending: 26644
> > > > > Number of entries in split-brain: 0
> > > > > Number of entries possibly healing: 0
> > > > >
> > > > > Brick node3:/srv/glusterfs/myvol-private/brick
> > > > > Status: Connected
> > > > > Total Number of entries: 0
> > > > > Number of entries in heal pending: 0
> > > > > Number of entries in split-brain: 0
> > > > > Number of entries possibly healing: 0
> > > > >
> > > > > Should I try to set the "cluster.data-self-heal" parameter of that
> > > > > volume to "off" as mentioned in the bug?
> > > >
> > > > Yes, as mentioned in the workaround in the thread that I shared.
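As an aside, the per-brick summaries above can be tallied mechanically. A minimal sketch, assuming bash and awk; the sample text is abridged from the heal info summary output above and the variable names are made up:

```shell
# Sum the "heal pending" counters across bricks from the output of
# `gluster volume heal <volname> info summary`. Sample abridged from above;
# on a live system you would pipe the gluster command itself instead:
#   gluster volume heal myvol-private info summary | awk ...
summary='Brick node1:/data/myvol-private/brick
Number of entries in heal pending: 49845
Brick node2:/data/myvol-private/brick
Number of entries in heal pending: 26644
Brick node3:/srv/glusterfs/myvol-private/brick
Number of entries in heal pending: 0'

total=$(printf '%s\n' "$summary" |
    awk -F': ' '/in heal pending/ {t += $2} END {print t}')
echo "entries pending heal across all bricks: $total"   # 76489
```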
> > > > >
> > > > > And by doing that, does it mean that my files pending heal are in
> > > > > danger of being lost?
> > > >
> > > > No.
> > > >
> > > > > Also, is it dangerous to leave "cluster.data-self-heal" off?
> > > >
> > > > No. This only disables client-side data healing. The self-heal
> > > > daemon would still heal the files.
> > > > -Ravi
> > > >
> > > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > > On Saturday, November 3, 2018 1:31 AM, Ravishankar N ravishan...@redhat.com wrote:
> > > > >
> > > > > > Mabi,
> > > > > > If bug 1637953 is what you are experiencing, then you need to
> > > > > > follow the workarounds mentioned in
> > > > > > https://lists.gluster.org/pipermail/gluster-users/2018-October/035178.html.
> > > > > > Can you see if this works?
> > > > > > -Ravi
> > > > > >
> > > > > > On 11/02/2018 11:40 PM, mabi wrote:
> > > > > >
> > > > > > > I tried again to manually run a heal by using the "gluster
> > > > > > > volume heal" command because still no files have been healed,
> > > > > > > and noticed the following warning in the glusterd.log file:
> > > > > > >
> > > > > > > [2018-11-02 18:04:19.454702] I [MSGID: 106533] [glusterd-volume-ops.c:938:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume myvol-private
> > > > > > > [2018-11-02 18:04:19.457311] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glustershd: error returned while attempting to connect to host:(null), port:0
> > > > > > >
> > > > > > > It looks like glustershd can't connect to "host:(null)"; could
> > > > > > > that be the reason why there is no healing taking place? If
> > > > > > > yes, why do I see "host:(null)" here, and what needs fixing?
> > > > > > > This seems to have happened since I upgraded from 3.12.14 to 4.1.5.
> > > > > > > I really would appreciate some help here; I suspect this is an
> > > > > > > issue with GlusterFS 4.1.5.
> > > > > > > Thank you in advance for any feedback.
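A quick way to see whether such errors keep occurring is to tally error and warning lines per log subsystem. A minimal sketch, assuming bash and awk, fed here with two lines copied from this thread; on a real node you would read /var/log/glusterfs/glustershd.log or glusterd.log instead:

```shell
# Count E/W log lines per subsystem name (the 5th whitespace-separated
# field, e.g. "0-glustershd:"). Sample lines copied from this thread.
log='[2018-11-02 18:04:19.457311] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glustershd: error returned while attempting to connect to host:(null), port:0
[2018-10-31 10:06:41.524300] E [rpc-clnt.c:184:call_bail] 0-myvol-private-client-0: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x108b sent = 2018-10-31 09:36:41.314203. timeout = 1800 for 127.0.1.1:49152'

errors=$(printf '%s\n' "$log" |
    awk '$3 == "E" || $3 == "W" {sub(/:$/, "", $5); c[$5]++}
         END {for (k in c) print k, c[k]}' | sort)
printf '%s\n' "$errors"
```

Running the same one-liner against yesterday's and today's logs shows whether the bail-outs and connection failures are historical or ongoing.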
> > > > > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > > > > On Wednesday, October 31, 2018 11:13 AM, mabi m...@protonmail.ch wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > > I have a GlusterFS 4.1.5 cluster with 3 nodes (including 1
> > > > > > > > arbiter) and currently have a volume with around 27174 files
> > > > > > > > which are not being healed. The "volume heal info" command
> > > > > > > > shows the same 27k files under the first and second nodes,
> > > > > > > > but there is nothing under the 3rd node (arbiter).
> > > > > > > > I already tried running a "volume heal" but none of the
> > > > > > > > files got healed.
> > > > > > > >
> > > > > > > > In the glfsheal log file for that particular volume the only
> > > > > > > > errors I see are a few entries like this one:
> > > > > > > >
> > > > > > > > [2018-10-31 10:06:41.524300] E [rpc-clnt.c:184:call_bail] 0-myvol-private-client-0: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x108b sent = 2018-10-31 09:36:41.314203. timeout = 1800 for 127.0.1.1:49152
> > > > > > > >
> > > > > > > > and then a few warnings like this one:
> > > > > > > >
> > > > > > > > [2018-10-31 10:08:12.161498] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.5/xlator/cluster/replicate.so(+0x6734a) [0x7f2a6dff434a] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x5da84) [0x7f2a798e8a84] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x58) [0x7f2a798a37f8] ) 0-dict: dict is NULL [Invalid argument]
> > > > > > > >
> > > > > > > > The glustershd.log file shows the following:
> > > > > > > >
> > > > > > > > [2018-10-31 10:10:52.502453] E [rpc-clnt.c:184:call_bail] 0-myvol-private-client-0: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0xaa398 sent = 2018-10-31 09:40:50.927816.
> > > > > > > > timeout = 1800 for 127.0.1.1:49152
> > > > > > > > [2018-10-31 10:10:52.502502] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-myvol-private-client-0: remote operation failed [Transport endpoint is not connected]
> > > > > > > >
> > > > > > > > Any idea what could be wrong here?
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Mabi
> > > > > > > >
> > > > > > > > _______________________________________________
> > > > > > > > Gluster-users mailing list
> > > > > > > > Gluster-users@gluster.org
> > > > > > > > https://lists.gluster.org/mailman/listinfo/gluster-users

_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users