First, to answer your question about how this happened: I ran into the issue 
simply by rebooting my arbiter node yesterday morning to do some routine 
maintenance, which I do on a regular basis and which was never a problem 
before GlusterFS 7.8.

I have now removed the arbiter brick from all of my volumes (I have 3 volumes, 
and only one of them uses quota). I was then able to do a "detach" and then a 
"probe" of my arbiter node.

So far so good, so I decided to add an arbiter brick back to one of my smallest 
volumes, which does not have quota enabled, but I get the following error message:

$ gluster volume add-brick othervol replica 3 arbiter 1 
arbiternode.domain.tld:/srv/glusterfs/othervol/brick

volume add-brick: failed: Commit failed on arbiternode.domain.tld. Please check 
log file for details.

Checking the glusterd.log file of the arbiter node shows the following:

[2020-10-27 06:25:36.011955] I [MSGID: 106578] 
[glusterd-brick-ops.c:1024:glusterd_op_perform_add_bricks] 0-management: 
replica-count is set 3
[2020-10-27 06:25:36.011988] I [MSGID: 106578] 
[glusterd-brick-ops.c:1029:glusterd_op_perform_add_bricks] 0-management: 
arbiter-count is set 1
[2020-10-27 06:25:36.012017] I [MSGID: 106578] 
[glusterd-brick-ops.c:1033:glusterd_op_perform_add_bricks] 0-management: type 
is set 0, need to change it
[2020-10-27 06:25:36.093551] E [MSGID: 106053] 
[glusterd-utils.c:13790:glusterd_handle_replicate_brick_ops] 0-management: 
Failed to set extended attribute trusted.add-brick : Transport endpoint is not 
connected [Transport endpoint is not connected]
[2020-10-27 06:25:36.104897] E [MSGID: 101042] [compat.c:605:gf_umount_lazy] 
0-management: Lazy unmount of /tmp/mntQQVzyD [Transport endpoint is not 
connected]
[2020-10-27 06:25:36.104973] E [MSGID: 106073] 
[glusterd-brick-ops.c:2051:glusterd_op_add_brick] 0-glusterd: Unable to add 
bricks
[2020-10-27 06:25:36.105001] E [MSGID: 106122] 
[glusterd-mgmt.c:317:gd_mgmt_v3_commit_fn] 0-management: Add-brick commit 
failed.
[2020-10-27 06:25:36.105023] E [MSGID: 106122] 
[glusterd-mgmt-handler.c:594:glusterd_handle_commit_fn] 0-management: commit 
failed on operation Add brick
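
From the lazy unmount of /tmp/mntQQVzyD in the log above I assume that glusterd 
temporarily mounts the volume as a client during add-brick and then fails to set 
the extended attribute on that mount. So I guess one thing I could still check is 
whether the arbiter node can mount the volume as a client at all, with something 
along these lines (the mount point is just an example; getfattr is from the attr 
package):

$ mkdir -p /mnt/test
$ mount -t glusterfs node1.domain.tld:/othervol /mnt/test
$ getfattr -d -m . -e hex /mnt/test
$ umount /mnt/test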

After that I tried restarting the glusterd service on my arbiter node, and now 
it is again rejected by the other nodes with exactly the same error message as 
yesterday about the quota checksums differing, as you can see here:

[2020-10-27 06:30:21.729577] E [MSGID: 106012] 
[glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums of 
quota configuration of volume myvol-private differ. local cksum = 0, remote  
cksum = 66908910 on peer node2.domain.tld
[2020-10-27 06:30:21.731966] E [MSGID: 106012] 
[glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums of 
quota configuration of volume myvol-private differ. local cksum = 0, remote  
cksum = 66908910 on peer node1.domain.tld

This is really weird, because at this stage I have not even tried yet to add 
the arbiter brick back to the volume which has quota enabled...

After detaching the arbiter node, am I supposed to delete something on it?
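
In case it matters, I assume the checksum glusterd is comparing comes from the 
quota configuration files kept under /var/lib/glusterd on each node (the exact 
file names below are my assumption), so my plan was to compare something like 
this on the arbiter node and on node1/node2:

$ md5sum /var/lib/glusterd/vols/myvol-private/quota.conf
$ cat /var/lib/glusterd/vols/myvol-private/quota.cksum

A local cksum of 0 would then presumably just mean that the arbiter node has no 
(or an empty) quota.conf for that volume.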

Something is really wrong here and I am stuck in a loop somehow... any help 
would be greatly appreciated.


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, October 27, 2020 1:26 AM, Strahil Nikolov <hunter86...@yahoo.com> 
wrote:

> You need to fix that "reject" issue before trying anything else.
> Have you tried to "detach" the arbiter and then "probe" it again ?
>
> I have no idea what you did to reach that state - can you provide the details?
>
> Best Regards,
> Strahil Nikolov
>
> On Monday, 26 October 2020 at 20:38:38 GMT+2, mabi 
> m...@protonmail.ch wrote:
>
> Ok I see I won't go down that path of disabling quota.
>
> I was now able to remove the arbiter brick from my volume which has the quota 
> issue, so it is now a simple 2-node replica with 1 brick per node.
>
> Now I would like to add the brick back but I get the following error:
>
> volume add-brick: failed: Host arbiternode.domain.tld is not in 'Peer in 
> Cluster' state
>
> In fact I checked and the arbiter node is still rejected as you can see here:
>
> State: Peer Rejected (Connected)
>
> In the glusterd.log file on the arbiter node I see the following errors:
>
> [2020-10-26 18:35:05.605124] E [MSGID: 106012] 
> [glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums 
> of quota configuration of volume woelkli-private differ. local cksum = 0, 
> remote  cksum = 66908910 on peer node1.domain.tld
> [2020-10-26 18:35:05.617009] E [MSGID: 106012] 
> [glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums 
> of quota configuration of volume myvol-private differ. local cksum = 0, 
> remote  cksum = 66908910 on peer node2.domain.tld
>
> So although I have removed the arbiter brick from my volume, it still 
> complains about the checksum of the quota configuration. I also tried 
> restarting glusterd on my arbiter node but that does not help. The peer is 
> still rejected.
>
> What should I do at this stage?
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>
> On Monday, October 26, 2020 6:06 PM, Strahil Nikolov hunter86...@yahoo.com 
> wrote:
>
> > Detaching the arbiter is pointless...
> > Quota is an extended file attribute, and thus disabling and reenabling 
> > quota on a volume with millions of files will take a lot of time and lots 
> > of IOPS. I would leave it as a last resort.
> > Also, the following script was mentioned on the list and might 
> > help you:
> > https://github.com/gluster/glusterfs/blob/devel/extras/quota/quota_fsck.py
> > You can take a look in the mailing list for usage and more details.
> > Best Regards,
> > Strahil Nikolov
> > On Monday, 26 October 2020 at 16:40:06 GMT+2, Diego Zuccato 
> > diego.zucc...@unibo.it wrote:
> > On 26/10/20 15:09, mabi wrote:
> >
> > > Right, seen like that this sounds reasonable. Do you actually remember 
> > > the exact command you ran in order to remove the brick? I was thinking 
> > > this should be it:
> > > gluster volume remove-brick <VOLNAME> <BRICK> force
> > > but should I use "force" or "start"?
> >
> > Memory does not serve me well (there are 28 disks, not 26!), but bash
> > history does :)
> > gluster volume remove-brick BigVol replica 2 str957-biostq:/srv/arbiters/{00..27}/BigVol force
> > gluster peer detach str957-biostq
> > gluster peer probe str957-biostq
> > gluster volume add-brick BigVol replica 3 arbiter 1 str957-biostq:/srv/arbiters/{00..27}/BigVol
> > You obviously have to wait for remove-brick to complete before detaching
> > arbiter.
> >
> > > > IIRC it took about 3 days, but the arbiters are on a VM (8CPU, 8GB RAM)
> > > > that uses an iSCSI disk. More than 80% continuous load on both CPUs and 
> > > > RAM.
> > > That's quite long I must say and I am in the same case as you, my 
> > > arbiter is a VM.
> >
> > Give all the CPU and RAM you can. Less than 8GB RAM is asking for
> > troubles (in my case).
> >
> > Diego Zuccato
> > DIFA - Dip. di Fisica e Astronomia
> > Servizi Informatici
> > Alma Mater Studiorum - Università di Bologna
> > V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> > tel.: +39 051 20 95786

