Thanks, Joe. So what is starting glusterfsd if it's not started by the init scripts? And if something else starts it, why do the init scripts exist? I don’t know what the timing on the network shutdown is, but if I manually touch “/var/lock/subsys/glusterfsd”, then the glusterfsd K script runs and the system shuts down cleanly. This seems like an issue with the gluster process management, not my network scripts.

—CJ
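P.S. For anyone else hitting this, the workaround I described boils down to something like the following. This is only a sketch of the steps from this thread, assuming the stock RHEL/CentOS-style init scripts shipped with the gluster packages; paths may differ on other distributions.

    # link the packaged glusterfsd K/S scripts into the rc.d runlevels
    chkconfig glusterfsd on

    # glusterd (I assume) spawns the brick process itself, so the S script
    # never creates the subsys lock file; create it by hand so the K script
    # actually runs at shutdown
    touch /var/lock/subsys/glusterfsd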
> On Apr 28, 2015, at 6:09 AM, Joe Julian <[email protected]> wrote:
>
> No, self-heal daemon is glusterfs (client) with the glustershd vol file.
>
> glusterfsd is the brick server.
>
> Normally the network would stay up through the final process kill as part of shutdown. That kill gracefully shuts down the brick process(es), allowing the clients to continue without waiting for the tcp connection.
>
> Apparently your init shutdown process disconnects the network. This is uncommon and may be considered a bug in whatever K script is doing it.
>
> On April 28, 2015 12:28:40 AM PDT, Corey Kovacs <[email protected]> wrote:
> Someone correct me if I am wrong, but glusterfsd is for self healing as I recall. It's launched when it's needed.
>
> On Mon, Apr 27, 2015 at 1:59 PM, CJ Baar <[email protected]> wrote:
> FYI, I’ve tried with both glusterfs and NFS mounts, and the reaction is the same. The value of ping-timeout seems to have no effect at all.
>
> I did discover one thing that makes a difference on reboot. There is a second service descriptor for “glusterfsd”, which is not enabled by default, but is started by something else (glusterd, I assume?). However, whatever it is that starts the process does not shut it down cleanly during a reboot… and it appears to be the loss of that process, without de-registration in the peer group, that causes the other nodes to hang. If I enable the service (chkconfig glusterfsd on), it does nothing by default because the config is commented out (/etc/sysconfig/glusterfsd). But, having those K scripts in place in rc.d, I can manually touch /var/lock/subsys/glusterfsd, and then I can successfully reboot one node without the others hanging. This at least helps when I need to take a node down for maintenance; it obviously still does nothing for a true node failure.
>
> I guess my next step is to figure out how to modify the init script for glusterd to touch the other lock file on startup as well. It does not seem a very elegant solution, but having the lock file in place and the init scripts enabled seems to solve at least half of the issue.
>
> —CJ
>
>
>> On Apr 25, 2015, at 11:34 AM, Corey Kovacs <[email protected]> wrote:
>>
>> That's not cool... you certainly have a quorum. Are you using the fuse client or regular old NFS?
>>
>> C
>>
>> On Apr 24, 2015 4:50 PM, "CJ Baar" <[email protected]> wrote:
>> Corey—
>> I was able to get a third node set up. I recreated the volume as “replica 3”. The hang still happens (on two nodes, now) when I reboot a single node, even though two are still surviving, which should constitute a quorum.
>> —CJ
>>
>>
>>> On Apr 17, 2015, at 6:18 AM, Corey Kovacs <[email protected]> wrote:
>>>
>>> Typically you need to meet a quorum requirement to run just about any cluster. By definition, two nodes doesn't make a good cluster. A third node would let you start with just two, since that would allow you to meet quorum. Can you add a third node to at least test?
>>>
>>> Corey
>>>
>>> On Apr 16, 2015 6:52 PM, "CJ Baar" <[email protected]> wrote:
>>> I appreciate the info. I have tried adjusting the ping-timeout setting, and it seems to have no effect. The whole system hangs for 45+ seconds, which is about how long it takes the second node to reboot, no matter what the value of ping-timeout is. The output of the mnt-log is below. It shows the adjusted value I am currently testing (30s), but the system still hangs for longer than that.
>>>
>>> Also, I have realized that the problem is deeper than I originally thought. It’s not just the mount that is hanging when a node reboots… it appears to be the entire system. I cannot use my SSH connection, no matter where I am in the system, and services such as httpd become unresponsive. I can ping the “surviving” system, but other than that it appears pretty unusable. This is a major drawback to using gluster. I can’t afford to lose two entire systems if one dies.
>>>
>>> [2015-04-16 22:59:21.281365] C [rpc-clnt-ping.c:109:rpc_clnt_ping_timer_expired] 0-common-client-0: server 172.31.64.200:49152 has not responded in the last 30 seconds, disconnecting.
>>> [2015-04-16 22:59:21.281560] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7fce96450550] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7fce96225787] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fce9622589e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fce96225951] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f)[0x7fce96225f1f] ))))) 0-common-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-04-16 22:58:45.830962 (xid=0x6d)
>>> [2015-04-16 22:59:21.281588] W [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-common-client-0: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
>>> [2015-04-16 22:59:21.281788] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7fce96450550] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7fce96225787] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fce9622589e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fce96225951] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f)[0x7fce96225f1f] ))))) 0-common-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2015-04-16 22:58:51.277528 (xid=0x6e)
>>> [2015-04-16 22:59:21.281806] W [rpc-clnt-ping.c:154:rpc_clnt_ping_cbk] 0-common-client-0: socket disconnected
>>> [2015-04-16 22:59:21.281816] I [client.c:2215:client_rpc_notify] 0-common-client-0: disconnected from common-client-0. Client process will keep trying to connect to glusterd until brick's port is available
>>> [2015-04-16 22:59:21.283637] I [socket.c:3292:socket_submit_request] 0-common-client-0: not connected (priv->connected = 0)
>>> [2015-04-16 22:59:21.283663] W [rpc-clnt.c:1562:rpc_clnt_submit] 0-common-client-0: failed to submit rpc-request (XID: 0x6f Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (common-client-0)
>>> [2015-04-16 22:59:21.283674] W [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-common-client-0: remote operation failed: Transport endpoint is not connected. Path: /src (63fc077b-869d-4928-8819-a79cc5c5ffa6)
>>> [2015-04-16 22:59:21.284219] W [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-common-client-0: remote operation failed: Transport endpoint is not connected. Path: (null) (00000000-0000-0000-0000-000000000000)
>>> [2015-04-16 22:59:52.322952] E [client-handshake.c:1496:client_query_portmap_cbk] 0-common-client-0: failed to get the port number for [root@cfm-c glusterfs]#
>>>
>>> —CJ
>>>
>>>
>>>> On Apr 7, 2015, at 10:26 PM, Ravishankar N <[email protected]> wrote:
>>>>
>>>> On 04/07/2015 10:11 PM, CJ Baar wrote:
>>>>> Then, I issue “init 0” on node2, and the mount on node1 becomes unresponsive. This is the log from node1:
>>>>> [2015-04-07 16:36:04.250693] W [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed
>>>>> [2015-04-07 16:36:04.251102] I [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received status volume req for volume test1
>>>>> The message "I [MSGID: 106004] [glusterd-handler.c:4365:__glusterd_peer_rpc_notify] 0-management: Peer 1069f037-13eb-458e-a9c4-0e7e79e595d0, in Peer in Cluster state, has disconnected from glusterd." repeated 39 times between [2015-04-07 16:34:40.609878] and [2015-04-07 16:36:37.752489]
>>>>> [2015-04-07 16:36:40.755989] I [MSGID: 106004] [glusterd-handler.c:4365:__glusterd_peer_rpc_notify] 0-management: Peer 1069f037-13eb-458e-a9c4-0e7e79e595d0, in Peer in Cluster state, has disconnected from glusterd.
>>>> This is the glusterd log. Could you also share the mount log of the healthy node in the non-responsive --> responsive time interval?
>>>> If this is indeed the ping timer issue, you should see something like: "server xxx has not responded in the last 42 seconds, disconnecting."
>>>> Have you, for testing's sake, tried reducing the network.ping-timeout value to something lower and checked that the hang happens only for that time?
>>>>>
>>>>> This does not seem like desired behaviour. I was trying to create this cluster because I was under the impression it would be more resilient than a single-point-of-failure NFS server. However, if the mount halts when one node in the cluster dies, then I’m no better off.
>>>>>
>>>>> I also can’t seem to figure out how to bring a volume online if only one node in the cluster is running; again, not really functioning as HA. The gluster service runs and the volume “starts”, but it is not “online” or mountable until both nodes are running. In a situation where a node fails and we need storage online before we can troubleshoot the cause of the node failure, how do I get a volume to go online?
>>>> This is expected behavior. In a two-node cluster, if only one is powered on, glusterd will not start other gluster processes (brick, nfs, shd) until the glusterd of the other node is also up (i.e. quorum is met). If you want to override this behavior, do a `gluster vol start <volname> force` on the node that is up.
>>>>
>>>> -Ravi
>>>>>
>>>>> Thanks.
>>>
>>
>
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
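For the archives, the two settings discussed in the thread quoted above can be applied roughly like this. This is only a sketch; <volname> is a placeholder for your own volume name.

    # lower the client ping timeout for testing (the default is 42 seconds)
    gluster volume set <volname> network.ping-timeout 30

    # bring a volume online on a lone surviving node when quorum is not met
    gluster volume start <volname> force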
_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
