As always, inline below....
On Thu, Mar 25, 2010 at 10:54 AM, Tim Serong <[email protected]> wrote:
>> > Now for a little potential nastiness... I did some work in this area
>> > a year or two ago, and at the time, we ran into some curious edge cases.
>> > Hopefully things have moved on a little since then in NFS-land (I was
>> > using SLES 10 SP2, from memory), but for reference, have a look at:
>> >
>> > http://marc.info/?l=linux-nfs&m=123175640421702&w=2
>> >
>> > This describes an edge case where (depending on what the clients are
>> > doing), it's possible that running "exportfs -i" to export one directory
>> > will result in an interruption of service to an unrelated exported
>> > directory on the same node.
>>
>> I think you are advocating additional testing, I address that below...
>
> Yes. But, I should probably explicitly state that the additional testing
> I'm advocating is focused on testing NFS in an HA environment, i.e. these
> issues (assuming they still exist) need to be resolved somewhere in the
> NFS server, and are not specific to your RA. It's just that you don't hit
> them until you try to do active/active, rather than active/passive (i.e.
> start/stop entire NFS server).
Actually, reading through that post, the testing I suggested is close,
but not quite. The problem was explicitly caused by write buffers from
the client in the 32K range: small enough that the client could send a
lot of them in a short amount of time, but large enough to be dropped
by the NFS server rather than deferred. That was the crux of the
problem. I am not sure how to force 32K writes, besides...
dd if=/dev/zero of=/path/to/fs0/smallfile bs=32K count=1024
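Another thought: capping the NFS write size on the client with the wsize
mount option should keep every WRITE RPC at or below 32K regardless of the
dd block size. A sketch (server name and mount point are placeholders, and
the server may negotiate the size further down):

  mount -t nfs -o wsize=32768,hard server:/path/to/fs0 /mnt/fs0
  dd if=/dev/zero of=/mnt/fs0/smallfile bs=1M count=256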
>> > There's also a problem whereby you almost certainly can't rely on the
>> > return code from exportfs actually telling you the directory was exported
>> > successfully. The only reason exportfs will fail is if you pass invalid
>> > options, and it's possible that exportfs will return before the export
>> > has actually appeared in /var/lib/nfs/etab (exportfs says "please kernel,
>> > export this when you get a chance, kthxbye").
>>
>> Are you suggesting that we poll the etab file to make sure our export
>> appears before calling the operation a success?
>
> Something like that :) I actually don't know what the best way to do this is.
> Looping a few times, polling the file then sleeping for one second on each
> iteration is probably reasonable. In the normal case it'll succeed instantly
> anyway, with almost zero impact on start time, but it'll catch the (probably
> freakishly unlikely) case where the filesystem can't be exported for some
> reason.
>
> You could also parse the output of "showmount -e", which should tell you
> what the server thinks it's exporting. Which reminds me... There was
> another issue at one point with the NFS server checking the mtime of
> /var/lib/nfs/etab to determine what to export. Thus if "exportfs -i" was
> run more than once per second, the file could be out of sync with what the
> server thought it was exporting. Again, this is something that needs to
> be bashed on a bit, but it's really a Linux kernel NFS server thing.
OK, all I can do is try to detect the case, not fix it, so I will
verify the export succeeded by using showmount -e rather than depending
on the exportfs return code. I will use a timeout of about 5 seconds...
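Roughly along these lines (just a sketch of the idea; the attached patch has
the real code, and the spacing in showmount's output may need a looser match
than a single space):

  RETRIES=0
  while :; do
      showmount -e | grep "^${OCF_RESKEY_directory}[[:space:]]" >/dev/null && break
      RETRIES=`expr ${RETRIES} + 1`
      [ ${RETRIES} -ge 5 ] && return ${OCF_NOT_RUNNING}
      sleep 1
  done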
>> > We ran into these issues because we were doing failover testing while
>> > the system was under heavy load (continuous write of several GB, followed
>> > by reading the same data back for verification), while failing over
>> > multiple NFS exports from node to node. You probably won't ever hit them
>> > unless the system is being severely hammered... But I'd still recommend
>> > further testing along these lines, out of sheer paranoia.
>>
>> I will definitely do some testing. If I understand your statement,
>> then the following scenario will help to determine if this is a
>> problem for me or not.
>>
>> 1. Bring up cluster in active-active mode, both nodes online.
>> 2. Start $ dd if=/dev/zero of=/path/to/fs0/bigfile bs=1GB count=10 on client
>> 3. Fail over resource fs1.
>> 4. Make sure the addition of fs1 to the node handling fs0 does not
>> cause disruption.
>>
>> ... and then ...
>>
>> 1. Bring up cluster in active-active mode, both nodes online.
>> 2. Start $ dd if=/path/to/fs0/bigfile of=/dev/null bs=1GB count=10 on client
>> 3. Fail over resource fs1.
>> 4. Make sure the addition of fs1 to the node handling fs0 does not
>> cause disruption.
>
> Yep, that's the sort of test. I'll see if I can find out anything else useful
> about the tools we were using at the time (not sure if they ever got publicly
> released, unfortunately :-/)
Any more info you can provide will be helpful. I need to get my
testing done soon, as these boxes are going into production this
weekend. I am way behind schedule; you would not believe how long it
took me to build a 30TB array and then sync it via DRBD (4 days for the
Linux RAID build, 10 days for the DRBD sync). Actually, I had to split
it into two volumes, as the DRBD volume limit is a scant 18TB :-).
I have attached another patch; this one includes the success check via
showmount -e, and some tighter regular expressions to avoid matching
/exported/fs/subdirectory when exporting /exported/fs.
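To illustrate the tighter matching: rmtab lines are of the form
client:directory:count, so anchoring the grep on the surrounding colons keeps
an export from also matching its subdirectories (hypothetical entries):

  # /var/lib/nfs/rmtab might contain something like:
  #   192.168.1.10:/exported/fs:0x00000001
  #   192.168.1.11:/exported/fs/subdirectory:0x00000001
  grep "/exported/fs" /var/lib/nfs/rmtab     # matches both entries
  grep ":/exported/fs:" /var/lib/nfs/rmtab   # matches only the exact export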
diff -r 61cf8556ad31 heartbeat/exportfs
--- a/heartbeat/exportfs Wed Mar 24 20:10:42 2010 +0100
+++ b/heartbeat/exportfs Thu Mar 25 11:45:50 2010 -0400
@@ -46,7 +46,7 @@
<content type="string" default="" />
</parameter>
-<parameter name="dir" unique="0" required="1">
+<parameter name="directory" unique="0" required="1">
<longdesc lang="en">
The directory which you wish to export using NFS.
</longdesc>
@@ -56,6 +56,17 @@
<content type="string" default="" />
</parameter>
+<parameter name="fsid" unique="1" required="1">
+<longdesc lang="en">
+The fsid option to pass to exportfs. This should be a unique positive integer; avoid 0 unless you understand its special status.
+This value will override any fsid provided via the options parameter.
+</longdesc>
+<shortdesc lang="en">
+Unique fsid within cluster.
+</shortdesc>
+<content type="string" default="" />
+</parameter>
+
</parameters>
<actions>
@@ -93,23 +104,21 @@
;;
esac
-#fp="$OCF_RESKEY_nfs_shared_infodir"
-
backup_rmtab ()
{
- grep ${OCF_RESKEY_dir} /var/lib/nfs/rmtab > ${OCF_RESKEY_dir}/.rmtab
+ grep :${OCF_RESKEY_directory}: /var/lib/nfs/rmtab > ${OCF_RESKEY_directory}/.rmtab
}
clean_rmtab ()
{
- REMOVE=`echo ${OCF_RESKEY_dir} | sed 's/\//\\\\\//g'`
- sed -i -e /${REMOVE}/d /var/lib/nfs/rmtab
+ REMOVE=`echo ${OCF_RESKEY_directory} | sed 's/\//\\\\\//g'`
+ sed -i -e /:${REMOVE}:/d /var/lib/nfs/rmtab
}
exportfs_monitor ()
{
fn=`/bin/mktemp`
- grep "${OCF_RESKEY_dir}" /var/lib/nfs/etab > $fn 2>&1
+ grep "${OCF_RESKEY_directory}" /var/lib/nfs/etab > $fn 2>&1
rc=$?
#Adapt grep status code to OCF return code
@@ -134,30 +143,19 @@
OPTIONS="${OCF_RESKEY_options}"
OPTPREFIX=','
fi
- #generate fsid if none provided...
- if [ ! `echo ${OPTIONS} | grep fsid` ]; then
- if [ -f ${OCF_RESKEY_dir}/.fsid ]; then
- FSID=`cat ${OCF_RESKEY_dir}/.fsid`
- else
- FSID=$RANDOM
- fi
- echo $FSID > ${OCF_RESKEY_dir}/.fsid
- OPTIONS="${OPTIONS}${OPTPREFIX}fsid=${FSID}"
+ if [ `echo ${OPTIONS} | grep fsid` ]; then
+ #replace fsid provided in options list with one provided in fsid param.
+ OPTIONS=`echo ${OPTIONS} | sed "s/fsid=[0-9]\+/fsid=${OCF_RESKEY_fsid}/g"`
+ else
+ #tack the fsid option onto our options list.
+ OPTIONS="${OPTIONS}${OPTPREFIX}fsid=${OCF_RESKEY_fsid}"
fi
OPTIONS="-o ${OPTIONS}"
fn=`/bin/mktemp`
- exportfs ${OPTIONS} ${OCF_RESKEY_clientspec}:${OCF_RESKEY_dir} > $fn 2>&1
+ exportfs ${OPTIONS} ${OCF_RESKEY_clientspec}:${OCF_RESKEY_directory} > $fn 2>&1
rc=$?
- #restore saved rmtab backup from other server:
- if [ -f ${OCF_RESKEY_dir}/.rmtab ]; then
- cat ${OCF_RESKEY_dir}/.rmtab >> /var/lib/nfs/rmtab
- rm -f ${OCF_RESKEY_dir}/.rmtab
- fi
-
- /bin/sh $0 backup &
-
if [ $rc -ne 0 ]; then
ocf_log debug "Error invoking exportfs: `cat $fn`"
ocf_log err "Failed to export file system"
@@ -165,6 +163,31 @@
fi
rm -f $fn
+ RETRIES=0
+ while [ 1 ]; do
+ showmount -e | grep -P "^${OCF_RESKEY_directory}\s+${OCF_RESKEY_clientspec}$"
+ rc=$?
+ if [ $rc -eq 0 ]; then
+ break
+ fi
+ RETRIES=`expr ${RETRIES} + 1`
+ if [ ${RETRIES} -eq 4 ]; then
+ ocf_log debug "Export not reported by showmount -e"
+ ocf_log err "Export not reported by showmount -e"
+ return ${OCF_NOT_RUNNING}
+ fi
+ sleep 1
+ done
+
+ #restore saved rmtab backup from other server:
+ if [ -f ${OCF_RESKEY_directory}/.rmtab ]; then
+ cat ${OCF_RESKEY_directory}/.rmtab >> /var/lib/nfs/rmtab
+ rm -f ${OCF_RESKEY_directory}/.rmtab
+ fi
+
+ #spawn our background process to backup the rmtab each 2 seconds
+ /bin/sh $0 backup &
+
ocf_log info "File system exported"
return $OCF_SUCCESS
}
@@ -174,12 +197,12 @@
ocf_log info "Un-exporting file system ..."
fn=`/bin/mktemp`
- exportfs -u ${OCF_RESKEY_clientspec}:${OCF_RESKEY_dir} > $fn 2>&1
+ exportfs -u ${OCF_RESKEY_clientspec}:${OCF_RESKEY_directory} > $fn 2>&1
rc=$?
- if [ -f ${OCF_RESKEY_dir}/.exportfs_backup.pid ]; then
- kill `cat ${OCF_RESKEY_dir}/.exportfs_backup.pid`
- rm ${OCF_RESKEY_dir}/.exportfs_backup.pid
+ if [ -f ${OCF_RESKEY_directory}/.exportfs_backup.pid ]; then
+ kill `cat ${OCF_RESKEY_directory}/.exportfs_backup.pid`
+ rm ${OCF_RESKEY_directory}/.exportfs_backup.pid
fi
backup_rmtab
@@ -198,7 +221,7 @@
exportfs_backup ()
{
- echo $$ > ${OCF_RESKEY_dir}/.exportfs_backup.pid
+ echo $$ > ${OCF_RESKEY_directory}/.exportfs_backup.pid
while [ 1 ]; do
backup_rmtab
sleep 2
@@ -207,18 +230,13 @@
exportfs_validate ()
{
- if [ -d $OCF_RESKEY_dir ]; then
+ if [ -d $OCF_RESKEY_directory ]; then
return $OCF_SUCCESS
else
exit $OCF_ERR_ARGS
fi
}
-if [ -n "$OCF_RESKEY_CRM_meta_clone" ]; then
- ocf_log err "THIS RA DOES NOT SUPPORT CLONE MODE!"
- exit $OCF_ERR_CONFIGURED
-fi
-
exportfs_validate
case $__OCF_ACTION in
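For reference, this is roughly how I intend to configure the resource once
the patch is in (a sketch; resource name, path, fsid, and client network are
placeholders for my actual setup):

  crm configure primitive p_export_fs0 ocf:heartbeat:exportfs \
      params directory="/path/to/fs0" fsid="1" \
             clientspec="192.168.1.0/24" options="rw,sync,no_root_squash" \
      op monitor interval="30s" timeout="60s"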