Re: [Gluster-users] Distribute rebalance issues

Nithya Balachandran Tue, 17 Oct 2017 02:27:49 -0700

On 17 October 2017 at 14:48, Stephen Remde <[email protected]>
wrote:


> Hi,
>
>
> I have a rebalance that has failed on one peer twice now. Rebalance logs 
> below (directories anonymised and some irrelevant log lines cut). It looks 
> like it loses connection to the brick, but immediately stops the rebalance on 
> that peer instead of waiting for reconnection - which happens a second or so 
> later.
> Is this normal behaviour? So far it has been the same server and the same 
> (remote) brick.
>
>
> The brick shows a high number of disconnects compared to the other bricks on 
> the same server:
>
>
> ./export-md0-brick.log.1      2
> ./export-md1-brick.log.1      2
> ./export-md2-brick.log.1    181
> ./export-md3-brick.log.1      2
>
>
> Any clues as to what could be causing this? There is nothing in the log to 
> indicate the cause.
>
The rebalance process requires all DHT child subvols to be up for the whole
operation, because it needs to reapply the directory layouts. As this is a
pure distribute volume, even a single brick getting disconnected is enough
to cause the process to stop.
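
To check how long the brick was actually unreachable, the disconnect and
reconnect entries in the rebalance log can be paired up. This is a minimal
Python sketch, not an official tool: the function name reconnect_gaps is
hypothetical, the message IDs 114018 and 114046 are taken from the log quoted
below, and it assumes each log entry has been joined back onto a single line
(the mail client wrapped them).

```python
import re
from datetime import datetime

# One rebalance-log entry, e.g.
# [2017-10-12 23:00:55.099709] I [MSGID: 114018] [client.c:2280:...] 0-video-client-4: ...
ENTRY = re.compile(
    r"^\[(?P<ts>[\d-]+ [\d:.]+)\] \w \[MSGID: (?P<msgid>\d+)\] "
    r"\[[^\]]*\] (?P<name>\d+-[\w-]+): "
)

# Message IDs as they appear in the quoted log (assumed stable for this version).
DISCONNECTED = "114018"
CONNECTED = "114046"


def reconnect_gaps(lines):
    """Return (client name, seconds down) for each disconnect/reconnect pair."""
    down_since = {}
    gaps = []
    for line in lines:
        m = ENTRY.match(line)
        if not m:
            continue  # entries without a MSGID, wrapped fragments, etc.
        when = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f")
        name = m["name"]
        if m["msgid"] == DISCONNECTED:
            down_since[name] = when
        elif m["msgid"] == CONNECTED and name in down_since:
            gaps.append((name, (when - down_since.pop(name)).total_seconds()))
    return gaps


# Two entries from the first failure, joined onto one line each.
sample = [
    "[2017-10-12 23:00:55.099709] I [MSGID: 114018] "
    "[client.c:2280:client_rpc_notify] 0-video-client-4: disconnected from "
    "video-client-4. Client process will keep trying to connect to glusterd "
    "until brick's port is available",
    "[2017-10-12 23:01:05.482630] I [MSGID: 114046] "
    "[client-handshake.c:1222:client_setvolume_cbk] 0-video-client-4: "
    "Connected to video-client-4, attached to remote volume "
    "'/export/md2/brick'.",
]

for name, seconds in reconnect_gaps(sample):
    print(f"{name}: down for {seconds:.3f} s")
```

For the first failure quoted below, this reports a gap of about 10.4 seconds
between the disconnect and the reconnect.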

You would need to figure out why that brick is disconnecting so often. The
brick logs might help with that.
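
As a starting point, per-brick counts like the ones listed above can be
collected with a short script. This is only a sketch under assumptions: the
function name count_matches is hypothetical, and both the log directory
(/var/log/glusterfs/bricks) and the search phrase ("disconnecting connection
from") are guesses that may differ with the GlusterFS version and packaging,
so adjust them to whatever your brick logs actually contain.

```python
from pathlib import Path


def count_matches(log_dir, phrase="disconnecting connection from",
                  glob="*.log*"):
    """Count lines containing `phrase` in each log file under `log_dir`.

    The default phrase is an assumption about how a brick logs a client
    disconnect. Compressed rotations (e.g. .gz) are read as-is and not
    decompressed, so they will normally report 0.
    """
    counts = {}
    for path in sorted(Path(log_dir).glob(glob)):
        if not path.is_file():
            continue
        with path.open(errors="replace") as fh:
            counts[path.name] = sum(phrase in line for line in fh)
    return counts


if __name__ == "__main__":
    # Assumed default location of the brick logs.
    for name, n in count_matches("/var/log/glusterfs/bricks").items():
        print(f"{name:40} {n:6}")
```

Once the noisy brick is identified, the entries logged just before each
matching line in its log are the first place to look for a cause.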

Regards,
Nithya


>
> Steve
>
>
> gluster volume info video
>
> Volume Name: video
> Type: Distribute
> Volume ID: ccdac37f-9b0e-415f-b62e-9071d8168199
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 9
> Transport-type: tcp
> Bricks:
> Brick1: 10.0.0.31:/export/md0/brick
> Brick2: 10.0.0.32:/export/md0/brick
> Brick3: 10.0.0.31:/export/md1/brick
> Brick4: 10.0.0.32:/export/md1/brick
> Brick5: 10.0.0.31:/export/md2/brick
> Brick6: 10.0.0.32:/export/md2/brick
> Brick7: 10.0.0.31:/export/md3/brick
> Brick8: 10.0.0.32:/export/md3/brick
> Brick9: 10.0.0.33:/export/md0/brick
> Options Reconfigured:
> network.ping-timeout: 10
> cluster.min-free-disk: 1%
> transport.address-family: inet
> performance.readdir-ahead: on
> nfs.disable: on
> cluster.rebal-throttle: lazy
>
> [2017-10-12 23:00:55.099153] W [socket.c:590:__socket_rwv] 0-video-client-4: 
> readv on 10.0.0.31:49164 failed (Connection reset by peer)
> [2017-10-12 23:00:55.099709] I [MSGID: 114018] 
> [client.c:2280:client_rpc_notify] 0-video-client-4: disconnected from 
> video-client-4. Client process will keep trying to connect to glusterd until 
> brick's port is available
> [2017-10-12 23:00:55.099741] W [MSGID: 109073] [dht-common.c:8839:dht_notify] 
> 0-video-dht: Received CHILD_DOWN. Exiting
> [2017-10-12 23:00:55.099752] I [MSGID: 109029] 
> [dht-rebalance.c:4195:gf_defrag_stop] 0-: Received stop command on rebalance
> [2017-10-12 23:01:05.478462] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 
> 0-video-client-4: changing port to 49164 (from 0)
> [2017-10-12 23:01:05.481180] I [MSGID: 114057] 
> [client-handshake.c:1446:select_server_supported_programs] 0-video-client-4: 
> Using Program GlusterFS 3.3, Num (1298437), Version (330)
> [2017-10-12 23:01:05.482630] I [MSGID: 114046] 
> [client-handshake.c:1222:client_setvolume_cbk] 0-video-client-4: Connected to 
> video-client-4, attached to remote volume '/export/md2/brick'.
> [2017-10-12 23:01:05.482659] I [MSGID: 114047] 
> [client-handshake.c:1233:client_setvolume_cbk] 0-video-client-4: Server and 
> Client lk-version numbers are not same, reopening the fds
> [2017-10-12 23:01:05.483365] I [MSGID: 114035] 
> [client-handshake.c:201:client_set_lk_version_cbk] 0-video-client-4: Server 
> lk version = 1
> [2017-10-12 23:01:30.310089] I [dht-rebalance.c:2819:gf_defrag_process_dir] 
> 0-DHT: Found critical error from gf_defrag_get_entry
> [2017-10-12 23:01:30.310166] E [MSGID: 109111] 
> [dht-rebalance.c:3090:gf_defrag_fix_layout] 0-video-dht: 
> gf_defrag_process_dir failed for directory: /y/y/y/y/y
> [2017-10-12 23:01:30.380574] E [MSGID: 109016] 
> [dht-rebalance.c:3267:gf_defrag_fix_layout] 0-video-dht: Fix layout failed 
> for /y/y/y/y/y
> [2017-10-12 23:01:30.380756] E [MSGID: 109016] 
> [dht-rebalance.c:3267:gf_defrag_fix_layout] 0-video-dht: Fix layout failed 
> for /y/y/y/y
> [2017-10-12 23:01:30.380879] E [MSGID: 109016] 
> [dht-rebalance.c:3267:gf_defrag_fix_layout] 0-video-dht: Fix layout failed 
> for /y/y/y
> [2017-10-12 23:01:30.380965] E [MSGID: 109016] 
> [dht-rebalance.c:3267:gf_defrag_fix_layout] 0-video-dht: Fix layout failed 
> for /y/y
> [2017-10-12 23:03:09.285157] W [glusterfsd.c:1327:cleanup_and_exit] 
> (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f112b6d16ba] 
> -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x55b325019545] 
> -->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x55b3250193b4] ) 0-: received 
> signum (15), shutting down
>
> [2017-10-17 03:20:28.921512] W [socket.c:590:__socket_rwv] 0-video-client-4: 
> readv on 10.0.0.31:49164 failed (Connection reset by peer)
> [2017-10-17 03:20:28.921554] I [MSGID: 114018] 
> [client.c:2280:client_rpc_notify] 0-video-client-4: disconnected from 
> video-client-4. Client process will keep trying to connect to glusterd until 
> brick's port is available
> [2017-10-17 03:20:28.921570] W [MSGID: 109073] [dht-common.c:8839:dht_notify] 
> 0-video-dht: Received CHILD_DOWN. Exiting
> [2017-10-17 03:20:28.921578] I [MSGID: 109029] 
> [dht-rebalance.c:4195:gf_defrag_stop] 0-: Received stop command on rebalance
> [2017-10-17 03:20:39.344417] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 
> 0-video-client-4: changing port to 49164 (from 0)
> [2017-10-17 03:20:39.347440] I [MSGID: 114057] 
> [client-handshake.c:1446:select_server_supported_programs] 0-video-client-4: 
> Using Program GlusterFS 3.3, Num (1298437), Version (330)
> [2017-10-17 03:20:39.349244] I [MSGID: 114046] 
> [client-handshake.c:1222:client_setvolume_cbk] 0-video-client-4: Connected to 
> video-client-4, attached to remote volume '/export/md2/brick'.
> [2017-10-17 03:20:39.349261] I [MSGID: 114047] 
> [client-handshake.c:1233:client_setvolume_cbk] 0-video-client-4: Server and 
> Client lk-version numbers are not same, reopening the fds
> [2017-10-17 03:20:39.350611] I [MSGID: 114035] 
> [client-handshake.c:201:client_set_lk_version_cbk] 0-video-client-4: Server 
> lk version = 1
> [2017-10-17 03:27:17.231133] I [dht-rebalance.c:2819:gf_defrag_process_dir] 
> 0-DHT: Found critical error from gf_defrag_get_entry
> [2017-10-17 03:27:17.231214] E [MSGID: 109111] 
> [dht-rebalance.c:3090:gf_defrag_fix_layout] 0-video-dht: 
> gf_defrag_process_dir failed for directory: /x/x/x/x/x
> [2017-10-17 03:27:17.562481] E [MSGID: 109016] 
> [dht-rebalance.c:3267:gf_defrag_fix_layout] 0-video-dht: Fix layout failed 
> for /x/x/x/x/x
> [2017-10-17 03:27:17.562619] E [MSGID: 109016] 
> [dht-rebalance.c:3267:gf_defrag_fix_layout] 0-video-dht: Fix layout failed 
> for /x/x/x/x
> [2017-10-17 03:27:17.562726] E [MSGID: 109016] 
> [dht-rebalance.c:3267:gf_defrag_fix_layout] 0-video-dht: Fix layout failed 
> for /x/x/x
> [2017-10-17 03:27:17.562810] E [MSGID: 109016] 
> [dht-rebalance.c:3267:gf_defrag_fix_layout] 0-video-dht: Fix layout failed 
> for /x/x
> [2017-10-17 03:27:18.379825] W [glusterfsd.c:1327:cleanup_and_exit] 
> (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f700b9696ba] 
> -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x55f9c0022545] 
> -->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x55f9c00223b4] ) 0-: received 
> signum (15), shutting down
>
>
>
> _______________________________________________
> Gluster-users mailing list
> [email protected]
> http://lists.gluster.org/mailman/listinfo/gluster-users
>

