Re: [Gluster-users] One node goes offline, the other node can't see the replicated volume anymore

Greg Scott Sat, 13 Jul 2013 19:13:57 -0700

Hmmm – I wonder what’s different now when it behaves as expected versus before 
when it behaved badly?


Well – by now both systems have been up and running in my testbed for several 
days.  I’ve umounted and mounted the volumes a bunch of times.  But thinking 
back – the behavior changed when I mounted the volume on each node with the 
other node as the backupvolfile-server.

On fw1:
mount -t glusterfs -o backupvolfile-server=192.168.253.2 192.168.253.1:/firewall-scripts /firewall-scripts

And on fw2:
mount -t glusterfs -o backupvolfile-server=192.168.253.1 192.168.253.2:/firewall-scripts /firewall-scripts

Since then, I’ve stopped and restarted glusterd and umounted and mounted the 
volumes again as set up in fstab, without the backupvolfile-server option.  But 
maybe that backupvolfile-server switch set some parameter permanently.

Here is the rc.local I set up in each node.  I wonder if some kind of timing 
thing is going on?  Or if -o backupvolfile-server=(the other node) permanently 
cleared a glitch from the initial setup?  I guess I could try some reboots and 
see what happens.

#!/bin/sh
#
# This script will be executed *after* all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.
#
# Note removed by default starting in Fedora 16.

touch /var/lock/subsys/local

#***********************************
# Local stuff below

echo "Making sure the Gluster stuff is mounted"
mount -av
# The fstab mounts happen early in startup, then Gluster starts up later.
# By now, Gluster should be up and running and the mounts should work.
# That _netdev option is supposed to account for the delay but doesn't seem
# to work right.

echo "Starting up firewall common items"
/firewall-scripts/etc/rc.d/common-rc.local
[root@chicago-fw1 log]#
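
One way to test the timing theory would be to retry the mount in rc.local instead of running it once. This is only a minimal sketch: the `retry` helper, the attempt count, and the delay are assumptions made up for illustration, not anything Gluster provides.

```shell
#!/bin/sh
# Hypothetical helper (not part of Gluster): run a command up to $1 times,
# sleeping $2 seconds between failed attempts. Returns 0 on the first success,
# 1 if every attempt fails.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $n of $attempts failed: $*" >&2
        n=$((n + 1))
        # Only sleep if another attempt is still coming.
        if [ "$n" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Possible use in rc.local (assumed values: 10 tries, 3 seconds apart).
# mountpoint -q skips the mount if the volume is already mounted:
# retry 10 3 sh -c 'mountpoint -q /firewall-scripts || mount /firewall-scripts'
```

If the mount only succeeds after a few attempts, that would point at startup ordering rather than at anything the backupvolfile-server option changed.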

Here is what fstab looks like on each node.

From fw1:

[root@chicago-fw1 log]# more /etc/fstab

#
# /etc/fstab
# Created by anaconda on Sat Jul  6 04:26:01 2013
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/fedora-root /                       ext4    defaults        1 1
UUID=818c4142-e389-4f28-a28e-6e26df3caa32 /boot                   ext4    defaults        1 2
UUID=C57B-BCF9          /boot/efi               vfat    umask=0077,shortname=winnt 0 0
/dev/mapper/fedora-gluster--fw1 /gluster-fw1            xfs     defaults        1 2
/dev/mapper/fedora-swap swap                    swap    defaults        0 0
# Added gluster stuff Greg Scott
192.168.253.1:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 0 0

[root@chicago-fw1 log]#

And fw2:

[root@chicago-fw2 log]# more /etc/fstab

#
# /etc/fstab
# Created by anaconda on Sat Jul  6 05:08:55 2013
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/fedora-root /                       ext4    defaults        1 1
UUID=f0cceb6a-61c4-409b-b882-5d6779a52505 /boot                   ext4    defaults        1 2
UUID=665D-DF0B          /boot/efi               vfat    umask=0077,shortname=winnt 0 0
/dev/mapper/fedora-gluster--fw2 /gluster-fw2            ext4    defaults        1 2
/dev/mapper/fedora-swap swap                    swap    defaults        0 0
# Added gluster stuff Greg Scott
192.168.253.2:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 0 0

[root@chicago-fw2 log]#

- Greg

From: Joe Julian [mailto:[email protected]] 
Sent: Saturday, July 13, 2013 7:38 PM
To: Greg Scott
Cc: '[email protected]'
Subject: Re: [Gluster-users] One node goes offline, the other node can't see 
the replicated volume anymore

These logs show different results. The results you reported and pasted earlier 
included, "[2013-07-09 00:59:04.706390] I [afr-common.c:3856:afr_local_init] 
0-firewall-scripts-replicate-0: no subvolumes up", which would produce the 
"Transport endpoint not connected" error you reported at first. These results 
look normal and should have produced the behavior I described.

42 is The Answer to Life, The Universe, and Everything.

Re-establishing FDs and locks is an expensive operation. The ping-timeout is 
long because it should not happen, but if there is temporary network congestion 
you'd (normally) rather have your volume remain up and pause than have to 
re-establish everything. Typically, unless you expect your servers to crash 
often, leaving ping-timeout at the default is best. YMMV and it's configurable 
in case you know what you're doing and why.
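
For reference, the setting discussed here is the volume option network.ping-timeout. A minimal sketch of changing it and putting it back follows; the two wrapper function names are made up for illustration, and the volume name in the usage comments is the one from this thread.

```shell
#!/bin/sh
# Hypothetical wrappers around the gluster CLI; the function names are
# illustrative only.

# Set network.ping-timeout (in seconds) on a volume.
set_ping_timeout() {
    gluster volume set "$1" network.ping-timeout "$2"
}

# Drop the override so the volume goes back to the default.
reset_ping_timeout() {
    gluster volume reset "$1" network.ping-timeout
}

# Example (run on a server node as root):
# set_ping_timeout firewall-scripts 42
# reset_ping_timeout firewall-scripts
```

Unless the servers are expected to crash often, the advice above still applies: leave the option alone.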


_______________________________________________
Gluster-users mailing list
[email protected]
http://supercolony.gluster.org/mailman/listinfo/gluster-users
