Re: [Linux-HA] Problem with migration on a nfs/exportfs setup while copying via rsync

Dejan Muhamedagic Fri, 28 May 2010 03:29:29 -0700

Hi,

On Thu, May 27, 2010 at 11:23:16AM +0200, RaSca wrote:
> Hi all,
> I've got some problems with my setup and I'm trying to understand whether 
> I am missing something or this is a bug. Here is how to reproduce the error:
> 
> node debian-lenny-nodo1
> node debian-lenny-nodo2
> primitive drbd0 ocf:linbit:drbd \
>       params drbd_resource="r0" \
>       op monitor interval="20s" timeout="40s" \
>       op start interval="0" timeout="240s" \
>       op stop interval="0" timeout="100s"
> primitive nfs-common lsb:nfs-common
> primitive nfs-kernel-server lsb:nfs-kernel-server
> primitive ping ocf:pacemaker:ping \
>       params host_list="192.168.1.1" name="ping" \
>       op monitor interval="60s" timeout="60s" \
>       op start interval="0" timeout="60s"
> primitive portmap lsb:portmap
> primitive store-LVM ocf:heartbeat:LVM \
>       params volgrpname="vg_drbd" \
>       op monitor interval="10s" timeout="30s" \
>       op start interval="0" timeout="30s" \
>       op stop interval="0" timeout="30s"
> primitive store-exportfs ocf:heartbeat:exportfs \
>       params directory="/store/share" clientspec="192.168.1.0/24" options="rw,sync,no_subtree_check,no_root_squash" fsid="1" \
>       op monitor interval="10s" timeout="30s" \
>       op start interval="0" timeout="40s" \
>       op stop interval="0" timeout="40s" \
>       meta target-role="Started"
> primitive store-fs ocf:heartbeat:Filesystem \
>       params device="/dev/vg_drbd/lv_store" directory="/store" fstype="ext3" \
>       op monitor interval="20s" timeout="40s" \
>       op start interval="0" timeout="60s" \
>       op stop interval="0" timeout="60s" \
>       meta is-managed="true"
> primitive store-ip ocf:heartbeat:IPaddr2 \
>       params ip="192.168.1.53" nic="bond0" \
>       op monitor interval="20s" timeout="40s"
> group nfs portmap nfs-common nfs-kernel-server
> group store store-ip store-LVM store-fs store-exportfs
> ms ms-drbd0 drbd0 \
>       meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> clone nfs_clone nfs \
>       meta globally-unique="false"
> clone ping_clone ping \
>       meta globally-unique="false"
> location cli-prefer-store store \
>       rule $id="cli-prefer-rule-store" inf: #uname eq debian-lenny-nodo1
> location store_on_connected_node store \
>       rule $id="store_on_connected_node-rule" -inf: not_defined ping or ping lte 0
> colocation store_on_ms-drbd0 inf: store ms-drbd0:Master
> order store_after_ms-drbd0 inf: ms-drbd0:promote store:start
> property $id="cib-bootstrap-options" \
>       dc-version="1.0.8-2c98138c2f070fcb6ddeab1084154cffbf44ba75" \
>       no-quorum-policy="ignore" \
>       stonith-enabled="false" \
>       cluster-infrastructure="openais" \
>       expected-quorum-votes="2" \
>       last-lrm-refresh="1274949951"
> 
> Everything comes up smoothly:
> 
> Online: [ debian-lenny-nodo1 debian-lenny-nodo2 ]
> 
>   Clone Set: ping_clone
>       Started: [ debian-lenny-nodo1 debian-lenny-nodo2 ]
>   Master/Slave Set: ms-drbd0
>       Masters: [ debian-lenny-nodo1 ]
>       Slaves: [ debian-lenny-nodo2 ]
>   Resource Group: store
>       store-ip   (ocf::heartbeat:IPaddr2):       Started debian-lenny-nodo1
>       store-LVM  (ocf::heartbeat:LVM):   Started debian-lenny-nodo1
>       store-fs   (ocf::heartbeat:Filesystem):    Started debian-lenny-nodo1
>       store-exportfs     (ocf::heartbeat:exportfs):      Started debian-lenny-nodo1
>   Clone Set: nfs_clone
>       Started: [ debian-lenny-nodo2 debian-lenny-nodo1 ]
> 
> I mount the share on a network client with default options, and then 
> start copying with the cp command.
> The copy runs, and after a while I migrate the store group to the 
> second node:
> 
> crm resource migrate store debian-lenny-nodo2
> 
> Everything goes smoothly: on the client the copy hangs for a minute or 
> two, and then resumes.
> After that, from the client I copy something else to the NFS storage, 
> this time with the rsync command.
> The copy starts, and after a while I launch the migration command.
> This time the cluster hangs, reporting a failure on the filesystem resource:
> 
> store-fs   (ocf::heartbeat:Filesystem):    Started debian-lenny-nodo2 (unmanaged) FAILED
> 
> The only way to make things work again is to clean up the nfs_clone 
> resource (or restart the nfs-kernel-server daemon) and then clean up the 
> store group. It seems that the filesystem is kept open by the NFS daemon.


Which it shouldn't be. The lsb:nfs-kernel-server stop action
should have returned only once the server had really stopped.

> So, what's the difference between a simple copy and an rsync? Why, with 
> rsync, is the fs resource unable to unmount the filesystem?

Did you try strace with rsync to see what is different?
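
In case it helps, here is a rough sketch of one way to compare the
two runs. It assumes both traces were written with "strace -f -o FILE"
on the NFS client; the helper names and file paths are made up for
illustration, and the result only tells you which system calls rsync
makes that cp does not, not which of them matters.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any package.

# Print the distinct syscall names found in an strace log whose lines
# look like "PID  name(args) = result" (the "strace -f -o FILE" format).
syscall_names() {
    sed -n 's/^[0-9]*[[:space:]]*\([a-z_][a-z_0-9]*\)(.*/\1/p' "$1" | sort -u
}

# Print the syscalls that appear in the second log but not in the first.
syscalls_only_in_second() {
    a=$(mktemp) || return 1
    b=$(mktemp) || { rm -f "$a"; return 1; }
    syscall_names "$1" >"$a"
    syscall_names "$2" >"$b"
    comm -13 "$a" "$b"          # lines unique to the second list
    rm -f "$a" "$b"
}

# Example, run on the client (all paths are placeholders):
#   strace -f -o /tmp/cp.trace    cp bigfile /mnt/store/
#   strace -f -o /tmp/rsync.trace rsync -a bigfile /mnt/store/
#   syscalls_only_in_second /tmp/cp.trace /tmp/rsync.trace
```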

> Is there something I am missing, or is this an fs resource agent bug?

Not strictly a Filesystem RA bug, though the RA could behave better.
Currently, the stop operation fails quickly (after six seconds) if
something that won't go away is still using the filesystem, as is
the case with kernel threads. You can try the attached patch, with
which the Filesystem RA keeps trying until the defined stop timeout
for the filesystem to be unmounted. Short instructions: set the
fast_stop parameter to "no" and set the timeout for the stop
operation of the filesystem to however long the nfsd takes to exit.
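
To make the arithmetic concrete, here is a rough sketch (not part of
the patch) of how the patched stop action splits its time budget.
The helper name stop_budget is made up for illustration; the 6 second
fast path and the 20000 ms default are taken from the patch, and the
timeout argument is in milliseconds, as in OCF_RESKEY_CRM_meta_timeout.

```shell
#!/bin/sh
# Illustrative only: mirrors the timeout handling in the patched stop
# action. stop_budget is a hypothetical name, not an RA function.
# The real RA uses ocf_is_true, which accepts more spellings than "yes".
stop_budget() {
    fast_stop=$1                # "yes" or "no"
    timeout_ms=${2:-20000}      # stop timeout in ms (patch default: 20000)
    if [ "$fast_stop" = "yes" ]; then
        timeout=6               # fast path: give up after about 6 attempts
    else
        timeout=$((timeout_ms / 1000))
    fi
    per_signal=$((timeout / 2)) # half the attempts with TERM, half with KILL
    echo "TERM=$per_signal KILL=$per_signal"
}

stop_budget yes            # prints TERM=3 KILL=3
stop_budget no 120000      # prints TERM=60 KILL=60
```

Each attempt is followed by a one second sleep, so the loop runs for
roughly the configured timeout. In the configuration quoted above,
the change might look like this (120s is only a placeholder, pick
whatever your nfsd actually needs):

      primitive store-fs ocf:heartbeat:Filesystem \
            params device="/dev/vg_drbd/lv_store" directory="/store" \
                  fstype="ext3" fast_stop="no" \
            op stop interval="0" timeout="120s"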

Thanks,

Dejan

> Here are the logs:
> 
> May 27 11:20:41 debian-lenny-nodo1 Filesystem[28197]: INFO: Running stop for /dev/vg_drbd/lv_store on /store
> May 27 11:20:41 debian-lenny-nodo1 Filesystem[28197]: INFO: Trying to unmount /store
> May 27 11:20:41 debian-lenny-nodo1 lrmd: [2589]: info: RA output: (store-fs:stop:stderr) umount: /store: device is busy#012umount: /store: device is busy
> May 27 11:20:41 debian-lenny-nodo1 Filesystem[28197]: ERROR: Couldn't unmount /store; trying cleanup with SIGTERM
> May 27 11:20:41 debian-lenny-nodo1 Filesystem[28197]: INFO: No processes on /store were signalled
> May 27 11:20:42 debian-lenny-nodo1 lrmd: [2589]: info: RA output: (store-fs:stop:stderr) umount: /store: device is busy#012umount: /store: device is busy
> May 27 11:20:42 debian-lenny-nodo1 Filesystem[28197]: ERROR: Couldn't unmount /store; trying cleanup with SIGTERM
> May 27 11:20:42 debian-lenny-nodo1 Filesystem[28197]: INFO: No processes on /store were signalled
> May 27 11:20:43 debian-lenny-nodo1 lrmd: [2589]: info: RA output: (store-fs:stop:stderr) umount: /store: device is busy
> May 27 11:20:43 debian-lenny-nodo1 lrmd: [2589]: info: RA output: (store-fs:stop:stderr)
> May 27 11:20:43 debian-lenny-nodo1 lrmd: [2589]: info: RA output: (store-fs:stop:stderr) umount: /store: device is busy
> May 27 11:20:43 debian-lenny-nodo1 lrmd: [2589]: info: RA output: (store-fs:stop:stderr)
> May 27 11:20:43 debian-lenny-nodo1 Filesystem[28197]: ERROR: Couldn't unmount /store; trying cleanup with SIGTERM
> May 27 11:20:43 debian-lenny-nodo1 Filesystem[28197]: INFO: No processes on /store were signalled
> May 27 11:20:44 debian-lenny-nodo1 lrmd: [2589]: info: RA output: (store-fs:stop:stderr) umount: /store: device is busy#012umount: /store: device is busy
> May 27 11:20:44 debian-lenny-nodo1 Filesystem[28197]: ERROR: Couldn't unmount /store; trying cleanup with SIGKILL
> May 27 11:20:44 debian-lenny-nodo1 Filesystem[28197]: INFO: No processes on /store were signalled
> May 27 11:20:45 debian-lenny-nodo1 lrmd: [2589]: info: RA output: (store-fs:stop:stderr) umount: /store: device is busy#012umount: /store: device is busy
> May 27 11:20:45 debian-lenny-nodo1 Filesystem[28197]: ERROR: Couldn't unmount /store; trying cleanup with SIGKILL
> May 27 11:20:45 debian-lenny-nodo1 Filesystem[28197]: INFO: No processes on /store were signalled
> May 27 11:20:46 debian-lenny-nodo1 lrmd: [2589]: info: RA output: (store-fs:stop:stderr) umount: /store: device is busy#012umount: /store: device is busy
> May 27 11:20:46 debian-lenny-nodo1 Filesystem[28197]: ERROR: Couldn't unmount /store; trying cleanup with SIGKILL
> May 27 11:20:46 debian-lenny-nodo1 Filesystem[28197]: INFO: No processes on /store were signalled
> May 27 11:20:47 debian-lenny-nodo1 Filesystem[28197]: ERROR: Couldn't unmount /store, giving up!
> May 27 11:20:48 debian-lenny-nodo1 crmd: [2592]: info: process_lrm_event: LRM operation store-fs_stop_0 (call=188, rc=1, cib-update=389, confirmed=true) unknown error
> May 27 11:20:48 debian-lenny-nodo1 crmd: [2592]: WARN: status_from_rc: Action 58 (store-fs_stop_0) on debian-lenny-nodo1 failed (target: 0 vs. rc: 1): Error
> 
> Thanks for your help!
> 
> -- 
> RaSca
> Mia Mamma Usa Linux: Nothing is impossible to understand, if you explain it well!
> ra...@miamammausalinux.org
> http://www.miamammausalinux.org
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems

diff -r 11bd908ea5c5 heartbeat/Filesystem
--- a/heartbeat/Filesystem	Wed May 26 14:48:00 2010 +0200
+++ b/heartbeat/Filesystem	Fri May 28 12:28:55 2010 +0200
@@ -151,6 +151,18 @@
 <content type="string" default="$DFLT_STATUSDIR" />
 </parameter>
 
+<parameter name="fast_stop">
+<longdesc lang="en">
+Normally, we expect no users of the filesystem and the stop
+operation to finish quickly. If you cannot control the filesystem
+users easily and want to prevent the stop action from failing,
+then set this parameter to "no" and add an appropriate timeout
+for the stop operation.
+</longdesc>
+<shortdesc lang="en">fast stop</shortdesc>
+<content type="boolean" default="yes" />
+</parameter>
+
 </parameters>
 
 <actions>
@@ -632,6 +644,51 @@
 	done
 }
 
+signal_processes() {
+	local dir=$1
+	local sig=$2
+	# fuser returns a non-zero return code if none of the
+	# specified files is accessed or in case of a fatal 
+	# error.
+	if [ "X${HOSTOS}" = "XOpenBSD" ];then
+		PIDS=`fstat | grep $dir | awk '{print $3}'`
+		for PID in ${PIDS};do
+			kill -s $sig ${PID}
+			ocf_log info "Sent signal $sig to ${PID}"
+		done
+	else
+		if $FUSER -$sig -m -k $dir ; then
+			ocf_log info "Some processes on $dir were signalled"
+		else
+			ocf_log info "No processes on $dir were signalled"
+		fi
+	fi
+}
+try_umount() {
+	local SUB=$1
+	$UMOUNT $umount_force $SUB
+	list_mounts | grep -q " $SUB " >/dev/null 2>&1 || {
+		ocf_log info "unmounted $SUB successfully"
+		return $OCF_SUCCESS
+	}
+	return $OCF_ERR_GENERIC
+}
+fs_stop() {
+	local SUB=$1 timeout=$2 sig cnt
+	for sig in TERM KILL; do
+		cnt=$((timeout/2)) # try half time with TERM
+		while [ $cnt -gt 0 ]; do
+			try_umount $SUB &&
+				return $OCF_SUCCESS
+			ocf_log err "Couldn't unmount $SUB; trying cleanup with $sig"
+			signal_processes $SUB $sig
+			cnt=$((cnt-1))
+			sleep 1
+		done
+	done
+	return $OCF_ERR_GENERIC
+}
+
 #
 # STOP: Unmount the filesystem
 #
@@ -662,37 +719,17 @@
 		esac
 
 		# Umount all sub-filesystems mounted under $MOUNTPOINT/ too.
+		local timeout
 		for SUB in `list_submounts $MOUNTPOINT` $MOUNTPOINT; do
-			ocf_log info "Trying to unmount $MOUNTPOINT"
-			for sig in SIGTERM SIGTERM SIGTERM SIGKILL SIGKILL SIGKILL; do
-				$UMOUNT $umount_force $SUB
-				if list_mounts | grep -q " $SUB " >/dev/null 2>&1; then
-					rc=$OCF_ERR_GENERIC
-					ocf_log err "Couldn't unmount $SUB; trying cleanup with $sig"
-					# fuser returns a non-zero return code if none of the
-					# specified files is accessed or in case of a fatal 
-					# error.
-					if [ "X${HOSTOS}" = "XOpenBSD" ];then
-						PIDS=`fstat | grep ${SUB} | awk '{print $3}'`
-						for PID in ${PIDS};do
-							kill -s 9 ${PID}
-							ocf_log info "Sent kill -9 to ${PID}"
-						done
-					else
-						if $FUSER -$sig -m -k $SUB ; then
-							ocf_log info "Some processes on $SUB were signalled"
-						else
-							ocf_log info "No processes on $SUB were signalled"
-						fi
-					fi
-					sleep 1
-				else
-					rc=$OCF_SUCCESS
-					ocf_log info "unmounted $SUB successfully"
-					break
-				fi
-			done
-
+			ocf_log info "Trying to unmount $SUB"
+			if ocf_is_true "$FAST_STOP"; then
+				timeout=6
+			else
+				timeout=${OCF_RESKEY_CRM_meta_timeout:="20000"}
+				timeout=$((timeout/1000))
+			fi
+			fs_stop $SUB $timeout
+			rc=$?
 			if [ $rc -ne $OCF_SUCCESS ]; then
 				ocf_log err "Couldn't unmount $SUB, giving up!"
 			fi
@@ -876,6 +913,7 @@
 if [ ! -z "$OCF_RESKEY_options" ]; then
 	options="-o $OCF_RESKEY_options"
 fi
+FAST_STOP=${OCF_RESKEY_fast_stop:="yes"}
 
 OP=$1
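
For anyone reading the patch without applying it, the escalation in
the new fs_stop loop can be tried out with a small simulation. This
is only a sketch: try_umount and signal_processes are stubbed so that
nothing is really unmounted or signalled, ATTEMPTS_NEEDED is a
made-up knob, and the one second sleep of the real RA is left out.

```shell
#!/bin/sh
# Simulation only: nothing is unmounted or signalled here.
# ATTEMPTS_NEEDED is a hypothetical knob: the attempt number on which
# the pretend umount finally succeeds.
ATTEMPTS_NEEDED=${ATTEMPTS_NEEDED:-5}
attempt=0

try_umount() {                  # stub standing in for the RA's try_umount
    attempt=$((attempt + 1))
    [ "$attempt" -ge "$ATTEMPTS_NEEDED" ]
}

signal_processes() {            # stub: only report what would be sent
    echo "attempt $attempt failed, would send SIG$2 to users of $1"
}

fs_stop() {
    sub=$1
    timeout=$2
    for sig in TERM KILL; do
        cnt=$((timeout / 2))    # half of the budget per signal
        while [ "$cnt" -gt 0 ]; do
            try_umount "$sub" && return 0
            signal_processes "$sub" "$sig"
            cnt=$((cnt - 1))
            # the real RA sleeps for 1 second here
        done
    done
    return 1
}
```

Calling "fs_stop /store 6" with the default ATTEMPTS_NEEDED=5 reports
three failed attempts followed by SIGTERM, one followed by SIGKILL,
and then returns 0 on the fifth attempt. With ATTEMPTS_NEEDED larger
than the budget it returns 1, which is the "giving up" case in the
logs above.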
