Re: [Gluster-users] Unable to make HA work; mounts hang on remote node reboot

CJ Baar Fri, 24 Apr 2015 15:51:29 -0700

Corey—
I was able to get a third node set up, and I recreated the volume as “replica 3”. 
The hang still happens (now on two nodes) when I reboot a single node, even 
though two nodes are still up, which should constitute a quorum.
—CJ
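
The quorum expectation above comes down to simple majority arithmetic. Here is a minimal sketch in Python, assuming a strict-majority rule. The function names are hypothetical, and GlusterFS's own server-quorum and client-quorum options may apply different ratios, so treat this as an illustration of the reasoning rather than a description of what gluster does internally.

```python
def majority_quorum(total_nodes: int) -> int:
    """Smallest node count that is a strict majority of total_nodes.

    Assumption: plain "more than half" quorum, not gluster's actual logic.
    """
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, nodes_up: int) -> bool:
    """True if the nodes still up form a strict majority of the cluster."""
    return nodes_up >= majority_quorum(total_nodes)


# Two-node cluster: rebooting one node leaves 1 of 2, below the majority of 2.
print(has_quorum(total_nodes=2, nodes_up=1))  # False

# Three-node cluster: rebooting one node leaves 2 of 3, which meets the majority of 2.
print(has_quorum(total_nodes=3, nodes_up=2))  # True
```

Under this rule a three-node cluster should tolerate one node rebooting, which is why the hang on the two surviving nodes is unexpected.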



> On Apr 17, 2015, at 6:18 AM, Corey Kovacs <[email protected]> wrote:
> 
> Typically you need to meet a quorum requirement to run just about any 
> cluster. By definition, two nodes don't make a good cluster. A third node 
> would let you start with just two, since two of three would meet quorum. 
> Can you add a third node to at least test?
> 
> Corey
> 
> On Apr 16, 2015 6:52 PM, "CJ Baar" <[email protected]> wrote:
> I appreciate the info. I have tried adjusting the ping-timeout setting, and it 
> seems to have no effect. The whole system hangs for 45+ seconds, which is 
> about how long the second node takes to reboot, no matter what the value of 
> ping-timeout is. The output of the mnt-log is below. It shows the adjusted 
> value I am currently testing (30s), but the system still hangs for longer 
> than that.
> 
> Also, I have realized that the problem is deeper than I originally thought. 
> It’s not just the mount that hangs when a node reboots… it appears to be 
> the entire system. I cannot use my SSH connection, no matter where I am in 
> the system, and services such as httpd become unresponsive. I can ping the 
> “surviving” system, but other than that it appears pretty unusable. This is 
> a major drawback to using gluster. I can’t afford to lose two entire systems 
> if one dies.
> 
> [2015-04-16 22:59:21.281365] C 
> [rpc-clnt-ping.c:109:rpc_clnt_ping_timer_expired] 0-common-client-0: server 
> 172.31.64.200:49152 has not responded in the last 30 seconds, disconnecting.
> [2015-04-16 22:59:21.281560] E [rpc-clnt.c:362:saved_frames_unwind] (--> 
> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7fce96450550] (--> 
> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7fce96225787] (--> 
> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fce9622589e] (--> 
> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fce96225951] 
> (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f)[0x7fce96225f1f] ))))) 
> 0-common-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) 
> called at 2015-04-16 22:58:45.830962 (xid=0x6d)
> [2015-04-16 22:59:21.281588] W [client-rpc-fops.c:2766:client3_3_lookup_cbk] 
> 0-common-client-0: remote operation failed: Transport endpoint is not 
> connected. Path: / (00000000-0000-0000-0000-000000000001)
> [2015-04-16 22:59:21.281788] E [rpc-clnt.c:362:saved_frames_unwind] (--> 
> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7fce96450550] (--> 
> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7fce96225787] (--> 
> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fce9622589e] (--> 
> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fce96225951] 
> (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f)[0x7fce96225f1f] ))))) 
> 0-common-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 
> 2015-04-16 22:58:51.277528 (xid=0x6e)
> [2015-04-16 22:59:21.281806] W [rpc-clnt-ping.c:154:rpc_clnt_ping_cbk] 
> 0-common-client-0: socket disconnected
> [2015-04-16 22:59:21.281816] I [client.c:2215:client_rpc_notify] 
> 0-common-client-0: disconnected from common-client-0. Client process will 
> keep trying to connect to glusterd until brick's port is available
> [2015-04-16 22:59:21.283637] I [socket.c:3292:socket_submit_request] 
> 0-common-client-0: not connected (priv->connected = 0)
> [2015-04-16 22:59:21.283663] W [rpc-clnt.c:1562:rpc_clnt_submit] 
> 0-common-client-0: failed to submit rpc-request (XID: 0x6f Program: GlusterFS 
> 3.3, ProgVers: 330, Proc: 27) to rpc-transport (common-client-0)
> [2015-04-16 22:59:21.283674] W [client-rpc-fops.c:2766:client3_3_lookup_cbk] 
> 0-common-client-0: remote operation failed: Transport endpoint is not 
> connected. Path: /src (63fc077b-869d-4928-8819-a79cc5c5ffa6)
> [2015-04-16 22:59:21.284219] W [client-rpc-fops.c:2766:client3_3_lookup_cbk] 
> 0-common-client-0: remote operation failed: Transport endpoint is not 
> connected. Path: (null) (00000000-0000-0000-0000-000000000000)
> [2015-04-16 22:59:52.322952] E 
> [client-handshake.c:1496:client_query_portmap_cbk] 0-common-client-0: failed 
> to get the port number for [root@cfm-c glusterfs]#
> 
> 
> —CJ
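
To compare the configured ping timeout with how long a request actually sat waiting, the two relevant log lines can be parsed. The sketch below is in Python and rests on a few assumptions: the excerpt is reassembled from the mount log quoted above with line wraps removed and the backtrace omitted, and the regular expressions and function names are hypothetical, written only against the message formats visible in this log. It measures how long one unwound request was outstanding, not the full system hang.

```python
import re
from datetime import datetime

# Excerpt from the quoted mount log (wraps removed, backtrace omitted).
LOG = """\
[2015-04-16 22:59:21.281365] C [rpc-clnt-ping.c:109:rpc_clnt_ping_timer_expired] 0-common-client-0: server 172.31.64.200:49152 has not responded in the last 30 seconds, disconnecting.
[2015-04-16 22:59:21.281560] E [rpc-clnt.c:362:saved_frames_unwind] 0-common-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-04-16 22:58:45.830962 (xid=0x6d)
"""

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
TIMEOUT_RE = re.compile(r"has not responded in the last (\d+) seconds")
UNWIND_RE = re.compile(
    r"^\[([^\]]+)\].*forced unwinding frame.*called at (\S+ \S+) \(xid="
)


def ping_timeouts(log: str) -> list[int]:
    """Return the ping timeout (in seconds) reported by each expiry message."""
    return [int(m.group(1)) for m in TIMEOUT_RE.finditer(log)]


def stalled_seconds(log: str) -> list[float]:
    """Return how long each forcibly unwound request had been outstanding."""
    durations = []
    for line in log.splitlines():
        match = UNWIND_RE.search(line)
        if match:
            unwound_at = datetime.strptime(match.group(1), TS_FORMAT)
            called_at = datetime.strptime(match.group(2), TS_FORMAT)
            durations.append((unwound_at - called_at).total_seconds())
    return durations


print(ping_timeouts(LOG))    # configured timeout as logged by the client
print(stalled_seconds(LOG))  # seconds the LOOKUP waited before being unwound
```

For this excerpt the client reports a 30-second timeout, and the LOOKUP had been outstanding for about 35 seconds when it was unwound. That shows the new ping-timeout value was picked up, even though the overall hang lasts longer.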
> 
> 
> 
>> On Apr 7, 2015, at 10:26 PM, Ravishankar N <[email protected]> wrote:
>> 
>> 
>> 
>> On 04/07/2015 10:11 PM, CJ Baar wrote:
>>> Then, I issue “init 0” on node2, and the mount on node1 becomes 
>>> unresponsive. This is the log from node1
>>> [2015-04-07 16:36:04.250693] W 
>>> [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx 
>>> modification failed
>>> [2015-04-07 16:36:04.251102] I 
>>> [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: 
>>> Received status volume req for volume test1
>>> The message "I [MSGID: 106004] 
>>> [glusterd-handler.c:4365:__glusterd_peer_rpc_notify] 0-management: Peer 
>>> 1069f037-13eb-458e-a9c4-0e7e79e595d0, in Peer in Cluster state, has 
>>> disconnected from glusterd." repeated 39 times between [2015-04-07 
>>> 16:34:40.609878] and [2015-04-07 16:36:37.752489]
>>> [2015-04-07 16:36:40.755989] I [MSGID: 106004] 
>>> [glusterd-handler.c:4365:__glusterd_peer_rpc_notify] 0-management: Peer 
>>> 1069f037-13eb-458e-a9c4-0e7e79e595d0, in Peer in Cluster state, has 
>>> disconnected from glusterd.
>> This is the glusterd log. Could you also share the mount log of the healthy 
>> node for the interval between the mount becoming non-responsive and 
>> responsive again?
>> If this is indeed the ping timer issue, you should see something like: 
>> "server xxx has not responded in the last 42 seconds, disconnecting."
>> Have you, for testing's sake, tried reducing the network.ping-timeout value 
>> to something lower and checked that the hang lasts only for that time?
>>> 
>>> This does not seem like desired behaviour. I was trying to create this 
>>> cluster because I was under the impression it would be more resilient than 
>>> a single-point-of-failure NFS server. However, if the mount halts when one 
>>> node in the cluster dies, then I’m no better off.
>>> 
>>> I also can’t seem to figure out how to bring a volume online if only one 
>>> node in the cluster is running; again, not really functioning as HA. The 
>>> gluster service runs and the volume “starts”, but it is not “online” or 
>>> mountable until both nodes are running. In a situation where a node fails 
>>> and we need storage online before we can troubleshoot the cause of the node 
>>> failure, how do I get a volume to go online?
>> This is expected behavior. In a two-node cluster, if only one node is powered 
>> on, glusterd will not start the other gluster processes (brick, nfs, shd) 
>> until the glusterd of the other node is also up (i.e. quorum is met). If you 
>> want to override this behavior, run `gluster vol start <volname> force` on 
>> the node that is up.
>> 
>> -Ravi
>>> 
>>> Thanks.
>> 
> 
> 
> _______________________________________________
> Gluster-users mailing list
> [email protected]
> http://www.gluster.org/mailman/listinfo/gluster-users
