As always, inline below....
On Thu, Mar 25, 2010 at 10:54 AM, Tim Serong <[email protected]> wrote:
>> > Now for a little potential nastiness... I did some work in this area
>> > a year or two ago, and at the time, we ran into some curious edge cases.
>> > Hopefully things have moved on a little since then in NFS-land (I was
>> > using SLES 10 SP2, from memory), but for reference, have a look at:
>> >
>> > http://marc.info/?l=linux-nfs&m=123175640421702&w=2
>> >
>> > This describes an edge case where (depending on what the clients are
>> > doing), it's possible that running "exportfs -i" to export one directory
>> > will result in an interruption of service to an unrelated exported
>> > directory on the same node.
>>
>> I think you are advocating additional testing, I address that below...
>
> Yes. But, I should probably explicitly state that the additional testing
> I'm advocating is focused on testing NFS in an HA environment, i.e. these
> issues (assuming they still exist) need to be resolved somewhere in the
> NFS server, and are not specific to your RA. It's just that you don't hit
> them until you try to do active/active, rather than active/passive (i.e.
> start/stop entire NFS server).
Actually, reading through that post, the testing I suggested is close,
but not quite. The problem was explicitly caused by write buffers from
the client in the 32K range: small enough that the client could send a
lot of them in a short amount of time, but large enough to be dropped
by the NFS server rather than deferred. That was the crux of the
problem. I am not sure how to force 32K writes, besides...
dd if=/dev/zero of=/path/to/fs0/smallfile bs=32K count=1024
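Another thought: capping the NFS write size on the client with the wsize
mount option should keep every WRITE RPC at or below 32K regardless of the
dd block size. A sketch (server name and mount point are placeholders, and
the server may negotiate the size further down):

  mount -t nfs -o wsize=32768,hard server:/path/to/fs0 /mnt/fs0
  dd if=/dev/zero of=/mnt/fs0/smallfile bs=1M count=256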
>> > There's also a problem whereby you almost certainly can't rely on the
>> > return code from exportfs actually telling you the directory was exported
>> > successfully. The only reason exportfs will fail is if you pass invalid
>> > options, and it's possible that exportfs will return before the export
>> > has actually appeared in /var/lib/nfs/etab (exportfs says "please kernel,
>> > export this when you get a chance, kthxbye").
>>
>> Are you suggesting that we poll the etab file to make sure our export
>> appears before calling the operation a success?
>
> Something like that :) I actually don't know what the best way to do this is.
> Looping a few times, polling the file then sleeping for one second on each
> iteration is probably reasonable. In the normal case it'll succeed instantly
> anyway, with almost zero impact on start time, but it'll catch the (probably
> freakishly unlikely) case where the filesystem can't be exported for some
> reason.
>
> You could also parse the output of "showmount -e", which should tell you
> what the server thinks it's exporting. Which reminds me... There was
> another issue at one point with the NFS server checking the mtime of
> /var/lib/nfs/etab to determine what to export. Thus if "exportfs -i" was
> run more than once per second, the file could be out of sync with what the
> server thought it was exporting. Again, this is something that needs to
> be bashed on a bit, but it's really a Linux kernel NFS server thing.
OK, all I can do is try to detect the case, not fix it, so I will
verify the export succeeded by using showmount -e rather than depending
on the exportfs return code. I will use a timeout of about 5 seconds...
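Roughly along these lines (just a sketch of the idea; the attached patch has
the real code, and the spacing in showmount's output may need a looser match
than a single space):

  RETRIES=0
  while :; do
      showmount -e | grep "^${OCF_RESKEY_directory}[[:space:]]" >/dev/null && break
      RETRIES=`expr ${RETRIES} + 1`
      [ ${RETRIES} -ge 5 ] && return ${OCF_NOT_RUNNING}
      sleep 1
  done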
>> > We ran into these issues because we were doing failover testing while
>> > the system was under heavy load (continuous write of several GB, followed
>> > by reading the same data back for verification), while failing over
>> > multiple NFS exports from node to node. You probably won't ever hit them
>> > unless the system is being severely hammered... But I'd still recommend
>> > further testing along these lines, out of sheer paranoia.
>>
>> I will definitely do some testing. If I understand your statement,
>> then the following scenario will help to determine if this is a
>> problem for me or not.
>>
>> 1. Bring up cluster in active-active mode, both nodes online.
>> 2. Start $ dd if=/dev/zero of=/path/to/fs0/bigfile bs=1GB count=10 on client
>> 3. Fail over resource fs1.
>> 4. Make sure the addition of fs1 to the node handling fs0 does not
>> cause disruption.
>>
>> ... and then ...
>>
>> 1. Bring up cluster in active-active mode, both nodes online.
>> 2. Start $ dd if=/path/to/fs0/bigfile of=/dev/null bs=1GB count=10 on client
>> 3. Fail over resource fs1.
>> 4. Make sure the addition of fs1 to the node handling fs0 does not
>> cause disruption.
>
> Yep, that's the sort of test. I'll see if I can find out anything else useful
> about the tools we were using at the time (not sure if they ever got publicly
> released, unfortunately :-/)
Any more info you can provide will be helpful. I need to get my
testing done soon, as these boxes are going into production this
weekend. I am way behind schedule; you would not believe how long it
took me to build a 30TB array and then sync it via DRBD (4 days for the
Linux RAID build, 10 days for the DRBD sync). Actually, I had to split
it into two volumes, as the DRBD volume limit is a scant 18TB :-).
I have attached another patch; this one includes the success check via
showmount -e, and some tighter regular expressions to avoid matching
/exported/fs/subdirectory when exporting /exported/fs.
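To illustrate the tighter matching: rmtab lines are of the form
client:directory:count, so anchoring the grep on the surrounding colons keeps
an export from also matching its subdirectories (hypothetical entries):

  # /var/lib/nfs/rmtab might contain something like:
  #   192.168.1.10:/exported/fs:0x00000001
  #   192.168.1.11:/exported/fs/subdirectory:0x00000001
  grep "/exported/fs" /var/lib/nfs/rmtab     # matches both entries
  grep ":/exported/fs:" /var/lib/nfs/rmtab   # matches only the exact export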
diff -r 61cf8556ad31 heartbeat/exportfs
--- a/heartbeat/exportfs Wed Mar 24 20:10:42 2010 +0100
+++ b/heartbeat/exportfs Thu Mar 25 11:45:50 2010 -0400
@@ -46,7 +46,7 @@
<content type="string" default="" />
</parameter>
-<parameter name="dir" unique="0" required="1">
+<parameter name="directory" unique="0" required="1">
<longdesc lang="en">
The directory which you wish to export using NFS.
</longdesc>
@@ -56,6 +56,17 @@
<content type="string" default="" />
</parameter>
+<parameter name="fsid" unique="1" required="1">
+<longdesc lang="en">
+The fsid option to pass to exportfs. This should be a unique positive integer; avoid 0 unless you understand its special status.
+This value will override any fsid provided via the options parameter.
+</longdesc>
+<shortdesc lang="en">
+Unique fsid within cluster.
+</shortdesc>
+<content type="string" default="" />
+</parameter>
+
</parameters>
<actions>
@@ -93,23 +104,21 @@
;;
esac
-#fp="$OCF_RESKEY_nfs_shared_infodir"
-
backup_rmtab ()
{
- grep ${OCF_RESKEY_dir} /var/lib/nfs/rmtab > ${OCF_RESKEY_dir}/.rmtab
+ grep :${OCF_RESKEY_directory}: /var/lib/nfs/rmtab > ${OCF_RESKEY_directory}/.rmtab
}
clean_rmtab ()
{
- REMOVE=`echo ${OCF_RESKEY_dir} | sed 's/\//\\\\\//g'`
- sed -i -e /${REMOVE}/d /var/lib/nfs/rmtab
+ REMOVE=`echo ${OCF_RESKEY_directory} | sed 's/\//\\\\\//g'`
+ sed -i -e /:${REMOVE}:/d /var/lib/nfs/rmtab
}
exportfs_monitor ()
{
fn=`/bin/mktemp`
- grep "${OCF_RESKEY_dir}" /var/lib/nfs/etab > $fn 2>&1
+ grep "${OCF_RESKEY_directory}" /var/lib/nfs/etab > $fn 2>&1
rc=$?
#Adapt grep status code to OCF return code
@@ -134,30 +143,19 @@
OPTIONS="${OCF_RESKEY_options}"
OPTPREFIX=','
fi
- #generate fsid if none provided...
- if [ ! `echo ${OPTIONS} | grep fsid` ]; then
- if [ -f ${OCF_RESKEY_dir}/.fsid ]; then
- FSID=`cat ${OCF_RESKEY_dir}/.fsid`
- else
- FSID=$RANDOM
- fi
- echo $FSID > ${OCF_RESKEY_dir}/.fsid
- OPTIONS="${OPTIONS}${OPTPREFIX}fsid=${FSID}"
+ if [ `echo ${OPTIONS} | grep fsid` ]; then
+ #replace fsid provided in options list with one provided in fsid param.
+ OPTIONS=`echo ${OPTIONS} | sed "s/fsid=[0-9]\+/fsid=${OCF_RESKEY_fsid}/g"`
+ else
+ #tack the fsid option onto our options list.
+ OPTIONS="${OPTIONS}${OPTPREFIX}fsid=${OCF_RESKEY_fsid}"
fi
OPTIONS="-o ${OPTIONS}"
fn=`/bin/mktemp`
- exportfs ${OPTIONS} ${OCF_RESKEY_clientspec}:${OCF_RESKEY_dir} > $fn 2>&1
+ exportfs ${OPTIONS} ${OCF_RESKEY_clientspec}:${OCF_RESKEY_directory} > $fn 2>&1
rc=$?
- #restore saved rmtab backup from other server:
- if [ -f ${OCF_RESKEY_dir}/.rmtab ]; then
- cat ${OCF_RESKEY_dir}/.rmtab >> /var/lib/nfs/rmtab
- rm -f ${OCF_RESKEY_dir}/.rmtab
- fi
-
- /bin/sh $0 backup &
-
if [ $rc -ne 0 ]; then
ocf_log debug "Error invoking exportfs: `cat $fn`"
ocf_log err "Failed to export file system"
@@ -165,6 +163,31 @@
fi
rm -f $fn
+ RETRIES=0
+ while [ 1 ]; do
+ showmount -e | grep -P "^${OCF_RESKEY_directory}\s+${OCF_RESKEY_clientspec}$"
+ rc=$?
+ if [ $rc -eq 0 ]; then
+ break
+ fi
+ RETRIES=`expr ${RETRIES} + 1`
+ if [ ${RETRIES} -eq 4 ]; then
+ ocf_log debug "Export not reported by showmount -e"
+ ocf_log err "Export not reported by showmount -e"
+ return ${OCF_NOT_RUNNING}
+ fi
+ sleep 1
+ done
+
+ #restore saved rmtab backup from other server:
+ if [ -f ${OCF_RESKEY_directory}/.rmtab ]; then
+ cat ${OCF_RESKEY_directory}/.rmtab >> /var/lib/nfs/rmtab
+ rm -f ${OCF_RESKEY_directory}/.rmtab
+ fi
+
+ #spawn our background process to backup the rmtab each 2 seconds
+ /bin/sh $0 backup &
+
ocf_log info "File system exported"
return $OCF_SUCCESS
}
@@ -174,12 +197,12 @@
ocf_log info "Un-exporting file system ..."
fn=`/bin/mktemp`
- exportfs -u ${OCF_RESKEY_clientspec}:${OCF_RESKEY_dir} > $fn 2>&1
+ exportfs -u ${OCF_RESKEY_clientspec}:${OCF_RESKEY_directory} > $fn 2>&1
rc=$?
- if [ -f ${OCF_RESKEY_dir}/.exportfs_backup.pid ]; then
- kill `cat ${OCF_RESKEY_dir}/.exportfs_backup.pid`
- rm ${OCF_RESKEY_dir}/.exportfs_backup.pid
+ if [ -f ${OCF_RESKEY_directory}/.exportfs_backup.pid ]; then
+ kill `cat ${OCF_RESKEY_directory}/.exportfs_backup.pid`
+ rm ${OCF_RESKEY_directory}/.exportfs_backup.pid
fi
backup_rmtab
@@ -198,7 +221,7 @@
exportfs_backup ()
{
- echo $$ > ${OCF_RESKEY_dir}/.exportfs_backup.pid
+ echo $$ > ${OCF_RESKEY_directory}/.exportfs_backup.pid
while [ 1 ]; do
backup_rmtab
sleep 2
@@ -207,18 +230,13 @@
exportfs_validate ()
{
- if [ -d $OCF_RESKEY_dir ]; then
+ if [ -d $OCF_RESKEY_directory ]; then
return $OCF_SUCCESS
else
exit $OCF_ERR_ARGS
fi
}
-if [ -n "$OCF_RESKEY_CRM_meta_clone" ]; then
- ocf_log err "THIS RA DOES NOT SUPPORT CLONE MODE!"
- exit $OCF_ERR_CONFIGURED
-fi
-
exportfs_validate
case $__OCF_ACTION in
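For reference, this is roughly how I intend to configure the resource once
the patch is in (a sketch; resource name, path, fsid, and client network are
placeholders for my actual setup):

  crm configure primitive p_export_fs0 ocf:heartbeat:exportfs \
      params directory="/path/to/fs0" fsid="1" \
             clientspec="192.168.1.0/24" options="rw,sync,no_root_squash" \
      op monitor interval="30s" timeout="60s"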