Hi, thank you for the reply.  The volume did eventually start, about 1.5 hours 
after the volume start command was issued.  Could it have something to do with 
the number of files on the volume?

From: Atin Mukherjee [mailto:[email protected]]
Sent: Monday, December 18, 2017 1:26 AM
To: Matt Waymack <[email protected]>
Cc: gluster-users <[email protected]>
Subject: Re: [Gluster-users] Production Volume will not start



On Sat, Dec 16, 2017 at 12:45 AM, Matt Waymack <[email protected]> wrote:

Hi all,



I have an issue where our volume will not start from any node.  When attempting 
to start the volume it will eventually return:

Error: Request timed out



For some time after that, the volume is locked and we either have to wait or 
restart Gluster services.  In the glusterd.log, it shows the following:



[2017-12-15 18:00:12.423478] I [glusterd-utils.c:5926:glusterd_brick_start] 
0-management: starting a fresh brick process for brick /exp/b1/gv0

[2017-12-15 18:03:12.673885] I 
[glusterd-locks.c:729:gd_mgmt_v3_unlock_timer_cbk] 0-management: In 
gd_mgmt_v3_unlock_timer_cbk

[2017-12-15 18:06:34.304868] I [MSGID: 106499] 
[glusterd-handler.c:4303:__glusterd_handle_status_volume] 0-management: 
Received status volume req for volume gv0

[2017-12-15 18:06:34.306603] E [MSGID: 106301] 
[glusterd-syncop.c:1353:gd_stage_op_phase] 0-management: Staging of operation 
'Volume Status' failed on localhost : Volume gv0 is not started

[2017-12-15 18:11:39.412700] I [glusterd-utils.c:5926:glusterd_brick_start] 
0-management: starting a fresh brick process for brick /exp/b2/gv0

[2017-12-15 18:11:42.405966] I [MSGID: 106143] 
[glusterd-pmap.c:280:pmap_registry_bind] 0-pmap: adding brick /exp/b2/gv0 on 
port 49153

[2017-12-15 18:11:42.406415] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 
0-management: setting frame-timeout to 600

[2017-12-15 18:11:42.406669] I [glusterd-utils.c:5926:glusterd_brick_start] 
0-management: starting a fresh brick process for brick /exp/b3/gv0

[2017-12-15 18:14:39.737192] I 
[glusterd-locks.c:729:gd_mgmt_v3_unlock_timer_cbk] 0-management: In 
gd_mgmt_v3_unlock_timer_cbk

[2017-12-15 18:35:20.856849] I [MSGID: 106143] 
[glusterd-pmap.c:280:pmap_registry_bind] 0-pmap: adding brick /exp/b1/gv0 on 
port 49152

[2017-12-15 18:35:20.857508] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 
0-management: setting frame-timeout to 600

[2017-12-15 18:35:20.858277] I [glusterd-utils.c:5926:glusterd_brick_start] 
0-management: starting a fresh brick process for brick /exp/b4/gv0

[2017-12-15 18:46:07.953995] I [MSGID: 106143] 
[glusterd-pmap.c:280:pmap_registry_bind] 0-pmap: adding brick /exp/b3/gv0 on 
port 49154

[2017-12-15 18:46:07.954432] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 
0-management: setting frame-timeout to 600

[2017-12-15 18:46:07.971355] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 
0-snapd: setting frame-timeout to 600

[2017-12-15 18:46:07.989392] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 
0-nfs: setting frame-timeout to 600

[2017-12-15 18:46:07.989543] I [MSGID: 106132] 
[glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped

[2017-12-15 18:46:07.989562] I [MSGID: 106568] 
[glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: nfs service is stopped

[2017-12-15 18:46:07.989575] I [MSGID: 106600] 
[glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so 
xlator is not installed

[2017-12-15 18:46:07.989601] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 
0-glustershd: setting frame-timeout to 600

[2017-12-15 18:46:08.003011] I [MSGID: 106132] 
[glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: glustershd already 
stopped

[2017-12-15 18:46:08.003039] I [MSGID: 106568] 
[glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: glustershd service is 
stopped

[2017-12-15 18:46:08.003079] I [MSGID: 106567] 
[glusterd-svc-mgmt.c:197:glusterd_svc_start] 0-management: Starting glustershd 
service

[2017-12-15 18:46:09.005173] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 
0-quotad: setting frame-timeout to 600

[2017-12-15 18:46:09.005569] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 
0-bitd: setting frame-timeout to 600

[2017-12-15 18:46:09.005673] I [MSGID: 106132] 
[glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped

[2017-12-15 18:46:09.005689] I [MSGID: 106568] 
[glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: bitd service is 
stopped

[2017-12-15 18:46:09.005712] I [rpc-clnt.c:1044:rpc_clnt_connection_init] 
0-scrub: setting frame-timeout to 600

[2017-12-15 18:46:09.005892] I [MSGID: 106132] 
[glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped

[2017-12-15 18:46:09.005912] I [MSGID: 106568] 
[glusterd-svc-mgmt.c:229:glusterd_svc_stop] 0-management: scrub service is 
stopped

[2017-12-15 18:46:09.026559] I [socket.c:3672:socket_submit_reply] 
0-socket.management: not connected (priv->connected = -1)

[2017-12-15 18:46:09.026568] E [rpcsvc.c:1364:rpcsvc_submit_generic] 
0-rpc-service: failed to submit message (XID: 0x2, Program: GlusterD svc cli, 
ProgVers: 2, Proc: 27) to rpc-transport (socket.management)

[2017-12-15 18:46:09.026582] E [MSGID: 106430] 
[glusterd-utils.c:568:glusterd_submit_reply] 0-glusterd: Reply submission failed

[2017-12-15 18:56:17.962251] E [rpc-clnt.c:185:call_bail] 0-management: bailing 
out frame type(glusterd mgmt v3) op(--(4)) xid = 0x14 sent = 2017-12-15 
18:46:09.005976. timeout = 600 for 10.17.100.208:24007

There's a call bail here, which means glusterd never received a callback (cbk) 
response from nsgtpcfs02.corp.nsgdv.com.
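
As a quick sanity check (the commands below are only a suggestion, assuming the 
default glusterd management port 24007 shown in the log), you could confirm that 
the node where the CLI was run can actually reach glusterd on that peer:

    # from the node where 'gluster volume start' was issued
    ping -c 3 nsgtpcfs02.corp.nsgdv.com
    nc -zv nsgtpcfs02.corp.nsgdv.com 24007   # is the management port reachable?

    # on nsgtpcfs02 itself
    systemctl status glusterd                # is glusterd running?
    ss -tlnp | grep 24007                    # is it listening on 24007?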

I am guessing you have ended up with a duplicate peerinfo entry for 
nsgtpcfs02.corp.nsgdv.com in the /var/lib/glusterd/peers directory on the node 
where the CLI command failed. Can you please share the output of gluster peer 
status along with the contents of "cat /var/lib/glusterd/peers/*" from all the 
nodes?
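
For example, something along these lines on each node would collect everything 
in one go (the glusterd.info check is an extra suggestion so the peer UUIDs can 
be compared against each node's own UUID):

    gluster peer status
    cat /var/lib/glusterd/glusterd.info   # this node's own UUID
    cat /var/lib/glusterd/peers/*         # one file per known peer; look for two
                                          # entries referring to nsgtpcfs02.corp.nsgdv.com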


[2017-12-15 18:56:17.962324] E [MSGID: 106116] 
[glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: Commit failed on 
nsgtpcfs02.corp.nsgdv.com. Please check log file for details.

[2017-12-15 18:56:17.962408] E [MSGID: 106123] 
[glusterd-mgmt.c:1677:glusterd_mgmt_v3_commit] 0-management: Commit failed on 
peers

[2017-12-15 18:56:17.962656] E [MSGID: 106123] 
[glusterd-mgmt.c:2209:glusterd_mgmt_v3_initiate_all_phases] 0-management: 
Commit Op Failed

[2017-12-15 18:56:17.964004] E [MSGID: 106116] 
[glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed 
on nsgtpcfs02.corp.nsgdv.com. Please check log file for details.

[2017-12-15 18:56:17.965184] E [MSGID: 106116] 
[glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed 
on tpc-arbiter1-100617. Please check log file for details.

[2017-12-15 18:56:17.965277] E [MSGID: 106118] 
[glusterd-mgmt.c:2087:glusterd_mgmt_v3_release_peer_locks] 0-management: Unlock 
failed on peers

[2017-12-15 18:56:17.965372] W [glusterd-locks.c:843:glusterd_mgmt_v3_unlock] 
(-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xe5631) 
[0x7f48e44a1631] 
-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xe543e) 
[0x7f48e44a143e] 
-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xe4625) 
[0x7f48e44a0625] ) 0-management: Lock for vol gv0 not held

[2017-12-15 18:56:17.965394] E [MSGID: 106118] 
[glusterd-locks.c:356:glusterd_mgmt_v3_unlock_entity] 0-management: Failed to 
release lock for vol gv0 on behalf of 711ffb0c-57b7-46ec-ba8d-185de969e6cc.

[2017-12-15 18:56:17.965409] E [MSGID: 106147] 
[glusterd-locks.c:483:glusterd_multiple_mgmt_v3_unlock] 0-management: Unable to 
unlock all vol

[2017-12-15 18:56:17.965424] E [MSGID: 106118] 
[glusterd-mgmt.c:2240:glusterd_mgmt_v3_initiate_all_phases] 0-management: 
Failed to release mgmt_v3 locks on localhost

[2017-12-15 18:56:17.965469] I [socket.c:3672:socket_submit_reply] 
0-socket.management: not connected (priv->connected = -1)

[2017-12-15 18:56:17.965474] E [rpcsvc.c:1364:rpcsvc_submit_generic] 
0-rpc-service: failed to submit message (XID: 0x2, Program: GlusterD svc cli, 
ProgVers: 2, Proc: 8) to rpc-transport (socket.management)

[2017-12-15 18:56:17.965486] E [MSGID: 106430] 
[glusterd-utils.c:568:glusterd_submit_reply] 0-glusterd: Reply submission failed



This issue started after a gluster volume stop followed by a reboot of all 
nodes.  We also updated to the latest version available in the CentOS repo and 
are now at 3.12.3.  I’m not sure where else to look, as the log doesn’t seem to 
show me anything beyond the fact that it isn’t working.



gluster peer status shows all peers connected across all nodes; the firewall has 
all required ports open and was disabled during troubleshooting.  The volume is 
a distributed-replicated volume with an arbiter across a total of 3 nodes.



This is a production volume with over 120TB of data, so I’d really rather not 
have to start over with the volume.  Does anyone have any suggestions on where 
else to look?



Thank you!

_______________________________________________
Gluster-users mailing list
[email protected]<mailto:[email protected]>
http://lists.gluster.org/mailman/listinfo/gluster-users
