OK, log files attached.

Boris


From: Karthik Subrahmanya <[email protected]>
Date: Tuesday, April 16, 2019 at 2:52 AM
To: Atin Mukherjee <[email protected]>, Boris Goldowsky 
<[email protected]>
Cc: Gluster-users <[email protected]>
Subject: Re: [Gluster-users] Volume stuck unable to add a brick



On Mon, Apr 15, 2019 at 9:43 PM Atin Mukherjee <[email protected]> wrote:
+Karthik Subrahmanya <[email protected]>

Didn't we fix this problem recently? "Failed to set extended attribute" indicates that the temp mount is failing and we don't have a quorum number of bricks up.

We had two fixes which handle two kinds of add-brick scenarios.
[1] Fails add-brick when increasing the replica count if any of the bricks is down, to avoid data loss. This can be overridden by using the force option.
[2] Allows add-brick to set the extended attributes via the temp mount if the volume is already mounted (has clients).
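
For illustration, a minimal sketch of how scenario [1] plays out on the CLI, using a hypothetical 1x3 replica volume "myvol" and host "host4" (not the reporter's actual names):

  # With patch [1], raising the replica count while an existing brick is
  # down is rejected to avoid data loss:
  gluster volume add-brick myvol replica 4 host4:/data/gluster/myvol
  # The safety check can be overridden explicitly:
  gluster volume add-brick myvol replica 4 host4:/data/gluster/myvol force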

They are on version 3.12.2, so patch [1] is present there. But since they are using the force option, it should not be a problem even if a brick is down. The error message they are getting is also different, so I guess it is not because of any brick being down.
Patch [2] is not present in 3.12.2, but this is not a conversion from a plain distribute to a replicate volume, so that scenario does not apply here either.
It seems like they are hitting some other issue.

@Boris,
Can you attach the add-brick temp mount log? The file name should look something like "dockervols-add-brick-mount.log". Can you also provide all the brick logs of that volume from around that time? (The usual locations are sketched after the links below.)

[1] https://review.gluster.org/#/c/glusterfs/+/16330/
[2] https://review.gluster.org/#/c/glusterfs/+/21791/
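
(For reference, on a default installation these logs should be under /var/log/glusterfs/: the temp mount log at the top level and the brick logs under the bricks/ subdirectory, named after the brick path. A collection sketch, assuming that layout and the brick path /data/gluster/dockervols:)

  # Assumes default log locations; adjust if glusterd logs elsewhere.
  ls /var/log/glusterfs/dockervols-add-brick-mount.log
  tar czf dockervols-logs.tar.gz \
      /var/log/glusterfs/dockervols-add-brick-mount.log \
      /var/log/glusterfs/bricks/data-gluster-dockervols.log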

Regards,
Karthik

Boris - what gluster version are you using?



On Mon, Apr 15, 2019 at 7:35 PM Boris Goldowsky <[email protected]> wrote:
Atin, thank you for the reply.  Here are all of those pieces of information:


[bgoldowsky@webserver9 ~]$ gluster --version

glusterfs 3.12.2
(same on all nodes)


[bgoldowsky@webserver9 ~]$ sudo gluster peer status

Number of Peers: 3

Hostname: webserver11.cast.org
Uuid: c2b147fd-cab4-4859-9922-db5730f8549d
State: Peer in Cluster (Connected)

Hostname: webserver1.cast.org
Uuid: 4b918f65-2c9d-478e-8648-81d1d6526d4c
State: Peer in Cluster (Connected)
Other names:
192.168.200.131
webserver1

Hostname: webserver8.cast.org
Uuid: be2f568b-61c5-4016-9264-083e4e6453a2
State: Peer in Cluster (Connected)
Other names:
webserver8


[bgoldowsky@webserver1 ~]$ sudo gluster v info

Volume Name: dockervols
Type: Replicate
Volume ID: 6093a9c6-ec6c-463a-ad25-8c3e3305b98a
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: webserver1:/data/gluster/dockervols
Brick2: webserver11:/data/gluster/dockervols
Brick3: webserver9:/data/gluster/dockervols
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
auth.allow: 127.0.0.1

Volume Name: testvol
Type: Replicate
Volume ID: 4d5f00f5-00ea-4dcf-babf-1a76eca55332
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: webserver1:/data/gluster/testvol
Brick2: webserver9:/data/gluster/testvol
Brick3: webserver11:/data/gluster/testvol
Brick4: webserver8:/data/gluster/testvol
Options Reconfigured:
transport.address-family: inet
nfs.disable: on


[bgoldowsky@webserver8 ~]$ sudo gluster v info

Volume Name: dockervols
Type: Replicate
Volume ID: 6093a9c6-ec6c-463a-ad25-8c3e3305b98a
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: webserver1:/data/gluster/dockervols
Brick2: webserver11:/data/gluster/dockervols
Brick3: webserver9:/data/gluster/dockervols
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
auth.allow: 127.0.0.1

Volume Name: testvol
Type: Replicate
Volume ID: 4d5f00f5-00ea-4dcf-babf-1a76eca55332
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: webserver1:/data/gluster/testvol
Brick2: webserver9:/data/gluster/testvol
Brick3: webserver11:/data/gluster/testvol
Brick4: webserver8:/data/gluster/testvol
Options Reconfigured:
nfs.disable: on
transport.address-family: inet


[bgoldowsky@webserver9 ~]$ sudo gluster v info

Volume Name: dockervols
Type: Replicate
Volume ID: 6093a9c6-ec6c-463a-ad25-8c3e3305b98a
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: webserver1:/data/gluster/dockervols
Brick2: webserver11:/data/gluster/dockervols
Brick3: webserver9:/data/gluster/dockervols
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
auth.allow: 127.0.0.1

Volume Name: testvol
Type: Replicate
Volume ID: 4d5f00f5-00ea-4dcf-babf-1a76eca55332
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: webserver1:/data/gluster/testvol
Brick2: webserver9:/data/gluster/testvol
Brick3: webserver11:/data/gluster/testvol
Brick4: webserver8:/data/gluster/testvol
Options Reconfigured:
nfs.disable: on
transport.address-family: inet


[bgoldowsky@webserver11 ~]$ sudo gluster v info

Volume Name: dockervols
Type: Replicate
Volume ID: 6093a9c6-ec6c-463a-ad25-8c3e3305b98a
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: webserver1:/data/gluster/dockervols
Brick2: webserver11:/data/gluster/dockervols
Brick3: webserver9:/data/gluster/dockervols
Options Reconfigured:
auth.allow: 127.0.0.1
transport.address-family: inet
nfs.disable: on

Volume Name: testvol
Type: Replicate
Volume ID: 4d5f00f5-00ea-4dcf-babf-1a76eca55332
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: webserver1:/data/gluster/testvol
Brick2: webserver9:/data/gluster/testvol
Brick3: webserver11:/data/gluster/testvol
Brick4: webserver8:/data/gluster/testvol
Options Reconfigured:
transport.address-family: inet
nfs.disable: on


[bgoldowsky@webserver9 ~]$ sudo gluster volume add-brick dockervols replica 4 
webserver8:/data/gluster/dockervols force

volume add-brick: failed: Commit failed on webserver8.cast.org. Please check log file for details.

Webserver8 glusterd.log:


[2019-04-15 13:55:42.338197] I [MSGID: 106488] 
[glusterd-handler.c:1559:__glusterd_handle_cli_get_volume] 0-management: 
Received get vol req

The message "I [MSGID: 106488] 
[glusterd-handler.c:1559:__glusterd_handle_cli_get_volume] 0-management: 
Received get vol req" repeated 2 times between [2019-04-15 13:55:42.338197] and 
[2019-04-15 13:55:42.341618]

[2019-04-15 14:00:20.445011] I [run.c:190:runner_log] 
(-->/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x3a215) 
[0x7fe697764215] 
-->/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0xe3e9d) 
[0x7fe69780de9d] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fe6a2d16ea5] 
) 0-management: Ran script: 
/var/lib/glusterd/hooks/1/add-brick/pre/S28Quota-enable-root-xattr-heal.sh 
--volname=dockervols --version=1 --volume-op=add-brick 
--gd-workdir=/var/lib/glusterd

[2019-04-15 14:00:20.445148] I [MSGID: 106578] 
[glusterd-brick-ops.c:1354:glusterd_op_perform_add_bricks] 0-management: 
replica-count is set 4

[2019-04-15 14:00:20.445184] I [MSGID: 106578] 
[glusterd-brick-ops.c:1364:glusterd_op_perform_add_bricks] 0-management: type 
is set 0, need to change it

[2019-04-15 14:00:20.672347] E [MSGID: 106054] 
[glusterd-utils.c:13863:glusterd_handle_replicate_brick_ops] 0-management: 
Failed to set extended attribute trusted.add-brick : Transport endpoint is not 
connected [Transport endpoint is not connected]

[2019-04-15 14:00:20.693491] E [MSGID: 101042] [compat.c:569:gf_umount_lazy] 
0-management: Lazy unmount of /tmp/mntmvdFGq [Transport endpoint is not 
connected]

[2019-04-15 14:00:20.693597] E [MSGID: 106074] 
[glusterd-brick-ops.c:2590:glusterd_op_add_brick] 0-glusterd: Unable to add 
bricks

[2019-04-15 14:00:20.693637] E [MSGID: 106123] 
[glusterd-mgmt.c:312:gd_mgmt_v3_commit_fn] 0-management: Add-brick commit 
failed.

[2019-04-15 14:00:20.693667] E [MSGID: 106123] 
[glusterd-mgmt-handler.c:616:glusterd_handle_commit_fn] 0-management: commit 
failed on operation Add brick

Webserver11 log file:


[2019-04-15 13:56:29.563270] I [MSGID: 106488] 
[glusterd-handler.c:1559:__glusterd_handle_cli_get_volume] 0-management: 
Received get vol req

The message "I [MSGID: 106488] 
[glusterd-handler.c:1559:__glusterd_handle_cli_get_volume] 0-management: 
Received get vol req" repeated 2 times between [2019-04-15 13:56:29.563270] and 
[2019-04-15 13:56:29.566209]

[2019-04-15 14:00:33.996866] I [run.c:190:runner_log] 
(-->/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x3a215) 
[0x7f36de924215] 
-->/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0xe3e9d) 
[0x7f36de9cde9d] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7f36e9ed6ea5] 
) 0-management: Ran script: 
/var/lib/glusterd/hooks/1/add-brick/pre/S28Quota-enable-root-xattr-heal.sh 
--volname=dockervols --version=1 --volume-op=add-brick 
--gd-workdir=/var/lib/glusterd

[2019-04-15 14:00:33.996979] I [MSGID: 106578] 
[glusterd-brick-ops.c:1354:glusterd_op_perform_add_bricks] 0-management: 
replica-count is set 4

[2019-04-15 14:00:33.997004] I [MSGID: 106578] 
[glusterd-brick-ops.c:1364:glusterd_op_perform_add_bricks] 0-management: type 
is set 0, need to change it

[2019-04-15 14:00:34.013789] I [MSGID: 106132] 
[glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: nfs already stopped

[2019-04-15 14:00:34.013849] I [MSGID: 106568] 
[glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: nfs service is stopped

[2019-04-15 14:00:34.017535] I [MSGID: 106568] 
[glusterd-proc-mgmt.c:88:glusterd_proc_stop] 0-management: Stopping glustershd 
daemon running in pid: 6087

[2019-04-15 14:00:35.018783] I [MSGID: 106568] 
[glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: glustershd service is 
stopped

[2019-04-15 14:00:35.018952] I [MSGID: 106567] 
[glusterd-svc-mgmt.c:211:glusterd_svc_start] 0-management: Starting glustershd 
service

[2019-04-15 14:00:35.028306] I [MSGID: 106132] 
[glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: bitd already stopped

[2019-04-15 14:00:35.028408] I [MSGID: 106568] 
[glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: bitd service is 
stopped

[2019-04-15 14:00:35.028601] I [MSGID: 106132] 
[glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: scrub already stopped

[2019-04-15 14:00:35.028645] I [MSGID: 106568] 
[glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: scrub service is 
stopped

Thank you for taking a look!

Boris


From: Atin Mukherjee <[email protected]>
Date: Friday, April 12, 2019 at 1:10 PM
To: Boris Goldowsky <[email protected]>
Cc: Gluster-users <[email protected]>
Subject: Re: [Gluster-users] Volume stuck unable to add a brick



On Fri, 12 Apr 2019 at 22:32, Boris Goldowsky <[email protected]> wrote:
I’ve got a replicated volume with three bricks (“1x3=3”); the idea is to have a common set of files locally available on all the machines in a cluster (Scientific Linux 7, which is essentially CentOS 7).
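
(For context, a minimal sketch of how each machine would typically mount such a volume over FUSE; the mount point below is hypothetical and not taken from the original setup:)

  sudo mount -t glusterfs localhost:/dockervols /mnt/dockervols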

I tried to add a fourth machine, so I used a command like this:


sudo gluster volume add-brick dockervols replica 4 
webserver8:/data/gluster/dockervols force

but the result is:

volume add-brick: failed: Commit failed on webserver1. Please check log file for details.
Commit failed on webserver8. Please check log file for details.
Commit failed on webserver11. Please check log file for details.

Tried: removing the new brick (this also fails) and trying again.
Tried: checking the logs. The log files are not enlightening to me – I don’t 
know what’s normal and what’s not.

Could you attach the glusterd log files from webserver8 and webserver11?

Also please share the following:
- gluster version (gluster --version)
- Output of ‘gluster peer status’
- Output of ‘gluster v info’ from all 4 nodes.

Tried: deleting the brick directory from previous attempt, so that it’s not in 
the way.
Tried: restarting gluster services
Tried: rebooting
Tried: setting up a new volume, replicated to all four machines. This works, so I’m assuming it’s not a networking issue. But add-brick still fails on this existing volume, which holds the critical data.

Running out of ideas. Any suggestions?  Thank you!

Boris

--
--Atin

Attachment: data-gluster-dockervols.log-webserver1
Attachment: data-gluster-dockervols.log-webserver11
Attachment: data-gluster-dockervols.log-webserver9
Attachment: dockervols-add-brick-mount.log

_______________________________________________
Gluster-users mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-users
