On Sat, Dec 16, 2017 at 12:45 AM, Matt Waymack <[email protected]> wrote:
> Hi all,
>
> I have an issue where our volume will not start from any node. When
> attempting to start the volume it will eventually return:
>
> Error: Request timed out
>
> For some time after that, the volume is locked and we either have to wait
> or restart the Gluster services. The glusterd.log shows the following:
>
> [2017-12-15 18:00:12.423478] I [glusterd-utils.c:5926:glusterd_brick_start] 0-management: starting a fresh brick process for brick /exp/b1/gv0
> [2017-12-15 18:03:12.673885] I [glusterd-locks.c:729:gd_mgmt_v3_unlock_timer_cbk] 0-management: In gd_mgmt_v3_unlock_timer_cbk
> [2017-12-15 18:06:34.304868] I [MSGID: 106499] [glusterd-handler.c:4303:__glusterd_handle_status_volume] 0-management: Received status volume req for volume gv0
> [2017-12-15 18:06:34.306603] E [MSGID: 106301] [glusterd-syncop.c:1353:gd_stage_op_phase] 0-management: Staging of operation 'Volume Status' failed on localhost : Volume gv0 is not started
> [2017-12-15 18:11:39.412700] I [glusterd-utils.c:5926:glusterd_brick_start] 0-management: starting a fresh brick process for brick /exp/b2/gv0
> [2017-12-15 18:11:42.405966] I [MSGID: 106143] [glusterd-pmap.c:280:pmap_registry_bind] 0-pmap: adding brick /exp/b2/gv0 on port 49153
> [2017-12-15 18:11:42.406415] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
> [2017-12-15 18:11:42.406669] I [glusterd-utils.c:5926:glusterd_brick_start] 0-management: starting a fresh brick process for brick /exp/b3/gv0
> [2017-12-15 18:14:39.737192] I [glusterd-locks.c:729:gd_mgmt_v3_unlock_timer_cbk] 0-management: In gd_mgmt_v3_unlock_timer_cbk
> [2017-12-15 18:35:20.856849] I [MSGID: 106143] [glusterd-pmap.c:280:pmap_registry_bind] 0-pmap: adding brick /exp/b1/gv0 on port 49152
> [2017-12-15 18:35:20.857508] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
> [2017-12-15 18:35:20.858277] I [glusterd-utils.c:5926:glusterd_brick_start] 0-management: starting a fresh brick process for brick /exp/b4/gv0
> [2017-12-15 18:46:07.953995] I [MSGID: 106143] [glusterd-pmap.c:280:pmap_registry_bind] 0-pmap: adding brick /exp/b3/gv0 on port 49154
> [2017-12-15 18:46:07.954432] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
> [2017-12-15 18:46:07.971355] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
> [2017-12-15 18:46:07.989392] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600
> [2017-12-15 18:46:07.989543] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
> [2017-12-15 18:46:07.989562] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: nfs service is stopped
> [2017-12-15 18:46:07.989575] I [MSGID: 106600] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
> [2017-12-15 18:46:07.989601] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600
> [2017-12-15 18:46:08.003011] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: glustershd already stopped
> [2017-12-15 18:46:08.003039] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: glustershd service is stopped
> [2017-12-15 18:46:08.003079] I [MSGID: 106567] [glusterd-svc-mgmt.c:197:glusterd_svc_start] 0-management: Starting glustershd service
> [2017-12-15 18:46:09.005173] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-quotad: setting frame-timeout to 600
> [2017-12-15 18:46:09.005569] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-bitd: setting frame-timeout to 600
> [2017-12-15 18:46:09.005673] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
> [2017-12-15 18:46:09.005689] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: bitd service is stopped
> [2017-12-15 18:46:09.005712] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600
> [2017-12-15 18:46:09.005892] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped
> [2017-12-15 18:46:09.005912] I [MSGID: 106568] [glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: scrub service is stopped
> [2017-12-15 18:46:09.026559] I [socket.c:3672:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
> [2017-12-15 18:46:09.026568] E [rpcsvc.c:1364:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x2, Program: GlusterD svc cli, ProgVers: 2, Proc: 27) to rpc-transport (socket.management)
> [2017-12-15 18:46:09.026582] E [MSGID: 106430] [glusterd-utils.c:568:glusterd_submit_reply] 0-glusterd: Reply submission failed
> [2017-12-15 18:56:17.962251] E [rpc-clnt.c:185:call_bail] 0-management: bailing out frame type(glusterd mgmt v3) op(--(4)) xid = 0x14 sent = 2017-12-15 18:46:09.005976. timeout = 600 for 10.17.100.208:24007

There's a call bail here, which means glusterd never got a callback (cbk) response back from nsgtpcfs02.corp.nsgdv.com. My guess is that you have ended up with a duplicate peerinfo entry for nsgtpcfs02.corp.nsgdv.com in the /var/lib/glusterd/peers folder on the node where the CLI failed. Can you please share the output of "gluster peer status" along with the contents of "cat /var/lib/glusterd/peers/*" from all the nodes? (A quick way to gather all of that in one go is sketched at the bottom of this mail.)

> [2017-12-15 18:56:17.962324] E [MSGID: 106116] [glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: Commit failed on nsgtpcfs02.corp.nsgdv.com. Please check log file for details.
> [2017-12-15 18:56:17.962408] E [MSGID: 106123] [glusterd-mgmt.c:1677:glusterd_mgmt_v3_commit] 0-management: Commit failed on peers
> [2017-12-15 18:56:17.962656] E [MSGID: 106123] [glusterd-mgmt.c:2209:glusterd_mgmt_v3_initiate_all_phases] 0-management: Commit Op Failed
> [2017-12-15 18:56:17.964004] E [MSGID: 106116] [glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on nsgtpcfs02.corp.nsgdv.com. Please check log file for details.
> [2017-12-15 18:56:17.965184] E [MSGID: 106116] [glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on tpc-arbiter1-100617. Please check log file for details.
> [2017-12-15 18:56:17.965277] E [MSGID: 106118] [glusterd-mgmt.c:2087:glusterd_mgmt_v3_release_peer_locks] 0-management: Unlock failed on peers
> [2017-12-15 18:56:17.965372] W [glusterd-locks.c:843:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xe5631) [0x7f48e44a1631] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xe543e) [0x7f48e44a143e] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xe4625) [0x7f48e44a0625] ) 0-management: Lock for vol gv0 not held
> [2017-12-15 18:56:17.965394] E [MSGID: 106118] [glusterd-locks.c:356:glusterd_mgmt_v3_unlock_entity] 0-management: Failed to release lock for vol gv0 on behalf of 711ffb0c-57b7-46ec-ba8d-185de969e6cc.
> [2017-12-15 18:56:17.965409] E [MSGID: 106147] [glusterd-locks.c:483:glusterd_multiple_mgmt_v3_unlock] 0-management: Unable to unlock all vol
> [2017-12-15 18:56:17.965424] E [MSGID: 106118] [glusterd-mgmt.c:2240:glusterd_mgmt_v3_initiate_all_phases] 0-management: Failed to release mgmt_v3 locks on localhost
> [2017-12-15 18:56:17.965469] I [socket.c:3672:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
> [2017-12-15 18:56:17.965474] E [rpcsvc.c:1364:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x2, Program: GlusterD svc cli, ProgVers: 2, Proc: 8) to rpc-transport (socket.management)
> [2017-12-15 18:56:17.965486] E [MSGID: 106430] [glusterd-utils.c:568:glusterd_submit_reply] 0-glusterd: Reply submission failed
>
> This issue started after a "gluster volume stop" followed by a reboot of all
> nodes. We also updated to the latest version available in the CentOS repo and
> are now at 3.12.3. I'm not sure where else to look, as the log doesn't seem
> to show me anything beyond the operation simply not working.
>
> "gluster peer status" shows all peers connected from all nodes; the firewall
> has all the required ports open and was even disabled for troubleshooting.
> The volume is distributed-replicated with an arbiter, for a total of 3 nodes.
>
> The volume is a production volume with over 120TB of data, so I'd really
> like to not have to start over with the volume. Does anyone have any
> suggestions on where else to look?
>
> Thank you!
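If it helps, something along these lines run from any one of the nodes should gather all of the requested output in one pass. This is only a rough sketch: node1/node2/node3 are placeholders for your three hosts, it assumes passwordless root ssh between the nodes, and reading /var/lib/glusterd/peers normally requires root.

    # Placeholder hostnames; replace node1 node2 node3 with your three peers.
    # Prints peer status plus the raw peerinfo files from each node so we can
    # check whether any hostname shows up in more than one peer file.
    for host in node1 node2 node3; do
        echo "================ $host ================"
        ssh root@"$host" 'gluster peer status; echo "--- /var/lib/glusterd/peers ---"; cat /var/lib/glusterd/peers/*'
    done

What I'd be looking for in the peer files is the same hostname (for example nsgtpcfs02.corp.nsgdv.com) appearing under more than one uuid entry; that would confirm the duplicate peerinfo theory.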
_______________________________________________
Gluster-users mailing list
[email protected]
http://lists.gluster.org/mailman/listinfo/gluster-users
