Hi,

could you show the configuration you use via crm configure show?

Christoph

On 15.04.2011 12:02, Caspar Smit wrote:
> Hi all,
>
> I'm also testing High Available NFS over TCP, and I'd like to share the
> findings of a whole day of testing regarding NFS over TCP, along with some
> very interesting conclusions!
>
> Note: I sent these findings to Florian Haas of Linbit (maintainer of the
> exportfs RA) and he noted that the exportfs RA is meant to be used in
> active/active setups like Rasca's, not in active/passive setups (which is
> what I am testing at the moment).
>
> First of all I started with a fresh install of every node and rebooted the
> NFS client machine.
>
> While running the first test I noticed the failover actually DID work, so I
> investigated further. After a few more failovers I ended up with a stale
> mount on the client.
>
> These first tests were all done using the migrate command.
>
> I rebooted everything again and started the second batch of tests (now with
> the node standby command). This way of failing over survived many more
> failovers. Only when I used the NFS mount during failover, by writing
> something to it, did I notice that the time it took to survive the failover
> increased considerably.
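For reference, the two failover methods being compared map to crm shell commands along these lines (the resource and node names here are made up for illustration):

```shell
# failover by migrating the resource group to the other node
crm resource migrate nfs-group node2

# failover by putting the active node into standby (stops all its resources),
# then bringing it back online afterwards
crm node standby node1
crm node online node1
```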
>
> I dug deeper and started to monitor the NFS TCP connections using netstat,
> writing down the results:
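As an aside, the state column can also be pulled out of netstat output programmatically; in this sketch a captured sample line stands in for live `netstat -tan` output (addresses as in the tests below):

```shell
# Extract the TCP state of the client's NFS connection.
# A captured sample line stands in for live `netstat -tan` output.
sample='tcp        0      0 192.168.0.30:nfs        192.168.0.10:767        ESTABLISHED'
nfs_state=$(printf '%s\n' "$sample" | awk '/192\.168\.0\.10/ {print $NF}')
echo "NFS connection state: $nfs_state"
```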
>
> node1 = active node
> node2 = passive node
>
> node1 netstat =
> tcp        0      0 *:nfs                   *:*                     LISTEN
> tcp        0      0 192.168.0.30:nfs        192.168.0.10:767        ESTABLISHED
> udp        0      0 *:nfs                   *:*
>
> node2 netstat =
> tcp        0      0 *:nfs                   *:*                     LISTEN
> udp        0      0 *:nfs                   *:*
>
> I did a failover (migrate resource) from node1 ->  node2
>
> node1 netstat =
> tcp        0      0 *:nfs                   *:*                     LISTEN
> tcp        0      0 192.168.0.30:nfs        192.168.0.10:767        ESTABLISHED
> udp        0      0 *:nfs                   *:*
>
> node2 netstat =
> tcp        0      0 *:nfs                   *:*                     LISTEN
> tcp        0      0 192.168.0.30:nfs        192.168.0.10:767        ESTABLISHED
> udp        0      0 *:nfs                   *:*
>
> Having the nfs-kernel-server LSB script run as a clone keeps TCP sessions
> ESTABLISHED on the now-passive node for about 10 minutes after a failover.
> After that the state changes to FIN_WAIT1 and lasts about another 4 minutes.
>
> During the time the session is in ESTABLISHED or FIN_WAIT1 (about 14 minutes
> in total) it is not possible to migrate the resource back, as this results
> in a stale mount.
>
> Then I started testing with node standby failovers and saw the following:
>
> node1 netstat =
> tcp        0      0 *:nfs                   *:*                     LISTEN
> tcp        0      0 192.168.0.30:nfs        192.168.0.10:767        ESTABLISHED
> udp        0      0 *:nfs                   *:*
>
> node2 netstat =
> tcp        0      0 *:nfs                   *:*                     LISTEN
> udp        0      0 *:nfs                   *:*
>
> I did a failover (node standby) from node1 ->  node2
>
> node1 netstat =
> tcp        0      0 *:nfs                   *:*                     LISTEN
> tcp        0      0 192.168.0.30:nfs        192.168.0.10:767        TIME_WAIT
> udp        0      0 *:nfs                   *:*
>
> node2 netstat =
> tcp        0      0 *:nfs                   *:*                     LISTEN
> tcp        0      0 192.168.0.30:nfs        192.168.0.10:767        ESTABLISHED
> udp        0      0 *:nfs                   *:*
>
> The session immediately changes to TIME_WAIT, which lasts a bit shorter
> (around 2 minutes) than the FIN_WAIT1 seen with the migrate command.
>
> It is still not possible to fail back during the TIME_WAIT state, but after
> around 2 minutes the session is restored and doesn't become stale.
>
>
> I concluded that the stopping and starting of nfs-kernel-server (which
> happens only when doing node standby) is the main difference here.
>
> So I started testing with nfs-kernel-server not as a cloned resource but as
> a normal resource (so it gets stopped and started during failover).
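In crm configuration terms this just means dropping the clone and putting the LSB resource into the ordinary failover group; a minimal sketch, with made-up group and resource names:

```
primitive nfs-kernel-server lsb:nfs-kernel-server \
        op monitor interval="10s" timeout="30s"
# part of the failover group, NOT wrapped in a clone:
group nfs-ha share-fs nfs-kernel-server share-ip
```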
>
> After a failover, the TCP state on the passive node sometimes remained
> ESTABLISHED and sometimes became TIME_WAIT.
>
> I noticed that when I didn't use the NFS mount during failover the state
> became TIME_WAIT, and when I did use it the state remained ESTABLISHED.
>
> So it had something to do with nfs-kernel-server not shutting down all
> connections on a stop command. I checked the /etc/init.d/nfs-kernel-server
> LSB script and saw that the stop command used signal 2 (SIGINT) to stop all
> nfsd instances. I noticed that while a session is active, the nfsd instance
> is not stopped by it. So I changed the signal to 9 (SIGKILL), and then all
> nfsd instances were killed on a stop command.
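The difference is easy to reproduce outside NFS: a process that ignores signal 2 (as a busy nfsd thread effectively does) survives SIGINT but not SIGKILL. A small sketch using a throwaway sleep process:

```shell
# Spawn a process that ignores SIGINT, standing in for a busy nfsd thread.
sh -c 'trap "" INT; sleep 30' &
pid=$!
sleep 1

kill -2 "$pid"                   # signal 2 (SIGINT): ignored by the trap
sleep 1
kill -0 "$pid" 2>/dev/null && after_sigint="alive" || after_sigint="dead"

kill -9 "$pid"                   # signal 9 (SIGKILL): cannot be caught or ignored
wait "$pid" 2>/dev/null || true  # reap the child
kill -0 "$pid" 2>/dev/null && after_sigkill="alive" || after_sigkill="dead"

echo "after SIGINT: $after_sigint / after SIGKILL: $after_sigkill"
```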
>
> Conclusions:
>
> - *Using nfs-kernel-server as a cloned resource prevents quick failovers
> (<15 minutes) if you use NFS over TCP*; using it as a normal resource stops
> and starts the nfsd instances that hold the TCP connections open.
> - For this to work in active/passive mode the nfs-kernel-server init script
> needs to be changed: the stop command must use signal 9 (SIGKILL) instead of
> signal 2 (SIGINT) to kill all nfsd instances.
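On Debian-style systems the change amounts to something like the following in the stop) branch of /etc/init.d/nfs-kernel-server (a sketch, not the verbatim script; the exact start-stop-daemon invocation varies by distribution and release):

```shell
# stop) branch of /etc/init.d/nfs-kernel-server -- sketch, not verbatim
# before: nfsd threads serving an active TCP session ignore signal 2
start-stop-daemon --stop --oknodo --quiet --name nfsd --signal 2
# after: signal 9 reliably terminates every nfsd thread on "stop"
start-stop-daemon --stop --oknodo --quiet --name nfsd --signal 9
```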
>
> Kind regards,
>
> Caspar Smit
>
> 2011/4/14 Alessandro Iurlano<[email protected]>
>
>> Thanks a lot, Rasca.
>> Using your configuration I was able to set up the active/active NFS server.
>> I had to use the UDP protocol for NFS to work; with TCP, the NFS clients
>> would occasionally hang. With UDP it seems to work well without any need
>> for rmtab file replication/synchronization.
>>
>> Now I'm trying to go a little further by using the OCFS2 cluster
>> filesystem with a dual-primary DRBD configuration. The goal is to be
>> able to share the same directory from both nodes while still having
>> failover to a single node.
>> With the current configuration, the cluster comes up and every service
>> runs as expected. But when I unplug the network cable of a node, the
>> exportfs processes hang on the remaining active node and I can't see why.
>> Any suggestions?
>>
>> This is my current configuration:
>> http://nopaste.voric.com/paste.php?f=sxub6z
>>
>> Thanks!
>> Alessandro
>>
>> On Mon, Apr 4, 2011 at 11:23 AM, RaSca<[email protected]>  wrote:
>>> On Sat, 02 Apr 2011 19:04:08 CET, Alessandro Iurlano wrote:
>>>>
>>>> On Fri, Apr 1, 2011 at 11:34 AM, RaSca<[email protected]>  wrote:
>>>>>>
>>>>>> Then I tried to find a way to keep just the rmtab file synchronized on
>>>>>> both nodes. I cannot find a way to have pacemaker do this for me. Is
>>>>>> there one?
>>>>>
>>>>> As far as I know, all those operations are handled by the exportfs RA.
>>>>
>>>> I believe this was true till the backup part was removed. See the git
>>>> commit below.
>>>
>>> So, for some reason this is not needed anymore, but I don't think it
>>> should create problems; surely the RA maintainer has done all the
>>> necessary tests.
>>>
>>>> I checked the boot order and indeed I was doing it the wrong way.
>>>> After I fixed it, a couple of tests worked right away, but the client
>>>> hung again when I brought the cluster back to both nodes online.
>>>> Could you post your working configuration?
>>>> Thanks,
>>>> Alessandro
>>>
>>> Here it is. Note that I'm using DRBD instead of shared storage (basically
>>> each drbd is a stand-alone export that can reside independently on a
>>> node):
>>>
>>> node ubuntu-nodo1
>>> node ubuntu-nodo2
>>> primitive drbd0 ocf:linbit:drbd \
>>>         params drbd_resource="r0" \
>>>         op monitor interval="20s" timeout="40s"
>>> primitive drbd1 ocf:linbit:drbd \
>>>         params drbd_resource="r1" \
>>>         op monitor interval="20s" timeout="40s"
>>> primitive nfs-kernel-server lsb:nfs-kernel-server \
>>>         op monitor interval="10s" timeout="30s"
>>> primitive ping ocf:pacemaker:ping \
>>>         params host_list="172.16.0.1" multiplier="100" name="ping" \
>>>         op monitor interval="20s" timeout="60s" \
>>>         op start interval="0" timeout="60s"
>>> primitive portmap lsb:portmap \
>>>         op monitor interval="10s" timeout="30s"
>>> primitive share-a-exportfs ocf:heartbeat:exportfs \
>>>         params directory="/share-a" clientspec="172.16.0.0/24" \
>>>         options="rw,async,no_subtree_check,no_root_squash" fsid="1" \
>>>         op monitor interval="10s" timeout="30s" \
>>>         op start interval="0" timeout="40s" \
>>>         op stop interval="0" timeout="40s"
>>> primitive share-a-fs ocf:heartbeat:Filesystem \
>>>         params device="/dev/drbd0" directory="/share-a" fstype="ext3" \
>>>         options="noatime" fast_stop="no" \
>>>         op monitor interval="20s" timeout="40s" \
>>>         op start interval="0" timeout="60s" \
>>>         op stop interval="0" timeout="60s"
>>> primitive share-a-ip ocf:heartbeat:IPaddr2 \
>>>         params ip="172.16.0.63" nic="eth0" \
>>>         op monitor interval="20s" timeout="40s"
>>> primitive share-b-exportfs ocf:heartbeat:exportfs \
>>>         params directory="/share-b" clientspec="172.16.0.0/24" \
>>>         options="rw,no_root_squash" fsid="2" \
>>>         op monitor interval="10s" timeout="30s" \
>>>         op start interval="0" timeout="40s" \
>>>         op stop interval="0" timeout="40s"
>>> primitive share-b-fs ocf:heartbeat:Filesystem \
>>>         params device="/dev/drbd1" directory="/share-b" fstype="ext3" \
>>>         options="noatime" fast_stop="no" \
>>>         op monitor interval="20s" timeout="40s" \
>>>         op start interval="0" timeout="60s" \
>>>         op stop interval="0" timeout="60s"
>>> primitive share-b-ip ocf:heartbeat:IPaddr2 \
>>>         params ip="172.16.0.64" nic="eth0" \
>>>         op monitor interval="20s" timeout="40s"
>>> primitive statd lsb:statd \
>>>         op monitor interval="10s" timeout="30s"
>>> group nfs portmap statd nfs-kernel-server
>>> group share-a share-a-fs share-a-exportfs share-a-ip
>>> group share-b share-b-fs share-b-exportfs share-b-ip
>>> ms ms_drbd0 drbd0 \
>>>         meta master-max="1" master-node-max="1" clone-max="2" \
>>>         clone-node-max="1" notify="true"
>>> ms ms_drbd1 drbd1 \
>>>         meta master-max="1" master-node-max="1" clone-max="2" \
>>>         clone-node-max="1" notify="true" target-role="Started"
>>> clone nfs_clone nfs \
>>>         meta globally-unique="false"
>>> clone ping_clone ping \
>>>         meta globally-unique="false"
>>> location share-a_on_connected_node share-a \
>>>         rule $id="share-a_on_connected_node-rule" -inf: not_defined ping or ping lte 0
>>> location share-b_on_connected_node share-b \
>>>         rule $id="share-b_on_connected_node-rule" -inf: not_defined ping or ping lte 0
>>> colocation share-a_on_ms_drbd0 inf: share-a ms_drbd0:Master
>>> colocation share-b_on_ms_drbd1 inf: share-b ms_drbd1:Master
>>> order share-a_after_ms_drbd0 inf: ms_drbd0:promote share-a:start
>>> order share-b_after_ms_drbd1 inf: ms_drbd1:promote share-b:start
>>> property $id="cib-bootstrap-options" \
>>>         dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
>>>         cluster-infrastructure="openais" \
>>>         expected-quorum-votes="2" \
>>>         no-quorum-policy="ignore" \
>>>         stonith-enabled="false" \
>>>         last-lrm-refresh="1301915944"
>>>
>>> Note that I've grouped all the nfs-server daemons (portmap, nfs-common and
>>> nfs-kernel-server) in the cloned group nfs_clone.
>>>
>>> --
>>> RaSca
>>> Mia Mamma Usa Linux: Niente รจ impossibile da capire, se lo spieghi bene!
>>> [email protected]
>>> http://www.miamammausalinux.org
>>>
>>>
>> _______________________________________________
>> Linux-HA mailing list
>> [email protected]
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems
>>

