Hi all,
I've got some problems with my setup, and I'm trying to understand whether I'm missing something or whether this is a bug. Here is how to reproduce the error:

node debian-lenny-nodo1
node debian-lenny-nodo2
primitive drbd0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="20s" timeout="40s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s"
primitive nfs-common lsb:nfs-common
primitive nfs-kernel-server lsb:nfs-kernel-server
primitive ping ocf:pacemaker:ping \
        params host_list="192.168.1.1" name="ping" \
        op monitor interval="60s" timeout="60s" \
        op start interval="0" timeout="60s"
primitive portmap lsb:portmap
primitive store-LVM ocf:heartbeat:LVM \
        params volgrpname="vg_drbd" \
        op monitor interval="10s" timeout="30s" \
        op start interval="0" timeout="30s" \
        op stop interval="0" timeout="30s"
primitive store-exportfs ocf:heartbeat:exportfs \
        params directory="/store/share" clientspec="192.168.1.0/24" options="rw,sync,no_subtree_check,no_root_squash" fsid="1" \
        op monitor interval="10s" timeout="30s" \
        op start interval="0" timeout="40s" \
        op stop interval="0" timeout="40s" \
        meta target-role="Started"
primitive store-fs ocf:heartbeat:Filesystem \
        params device="/dev/vg_drbd/lv_store" directory="/store" fstype="ext3" \
        op monitor interval="20s" timeout="40s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s" \
        meta is-managed="true"
primitive store-ip ocf:heartbeat:IPaddr2 \
        params ip="192.168.1.53" nic="bond0" \
        op monitor interval="20s" timeout="40s"
group nfs portmap nfs-common nfs-kernel-server
group store store-ip store-LVM store-fs store-exportfs
ms ms-drbd0 drbd0 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
clone nfs_clone nfs \
        meta globally-unique="false"
clone ping_clone ping \
        meta globally-unique="false"
location cli-prefer-store store \
        rule $id="cli-prefer-rule-store" inf: #uname eq debian-lenny-nodo1
location store_on_connected_node store \
        rule $id="store_on_connected_node-rule" -inf: not_defined ping or ping lte 0
colocation store_on_ms-drbd0 inf: store ms-drbd0:Master
order store_after_ms-drbd0 inf: ms-drbd0:promote store:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.8-2c98138c2f070fcb6ddeab1084154cffbf44ba75" \
        no-quorum-policy="ignore" \
        stonith-enabled="false" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        last-lrm-refresh="1274949951"

Everything comes up smoothly:

Online: [ debian-lenny-nodo1 debian-lenny-nodo2 ]

  Clone Set: ping_clone
      Started: [ debian-lenny-nodo1 debian-lenny-nodo2 ]
  Master/Slave Set: ms-drbd0
      Masters: [ debian-lenny-nodo1 ]
      Slaves: [ debian-lenny-nodo2 ]
  Resource Group: store
      store-ip   (ocf::heartbeat:IPaddr2):       Started debian-lenny-nodo1
      store-LVM  (ocf::heartbeat:LVM):   Started debian-lenny-nodo1
      store-fs   (ocf::heartbeat:Filesystem):    Started debian-lenny-nodo1
      store-exportfs     (ocf::heartbeat:exportfs):      Started debian-lenny-nodo1
  Clone Set: nfs_clone
      Started: [ debian-lenny-nodo2 debian-lenny-nodo1 ]

I mount the share on a network client with default options and start copying files with the cp command.
While the copy is running, I migrate the store group to the second node:

crm resource migrate store debian-lenny-nodo2

The migration goes smoothly: on the client the copy hangs for a minute or two, then restarts.
After that, I copy something else from the client onto the NFS storage, this time with the rsync command.
The copy starts, and after a while I launch the migration command again.
This time the cluster hangs, reporting a failure on the Filesystem resource:
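For reference, these are the commands I use for the move (standard crm shell usage, as far as I know: migrate works by inserting a cli-prefer location constraint, which is the `cli-prefer-store` rule visible in the configuration above, and unmigrate removes it again):

```shell
# Move the "store" group to the second node; this adds a
# cli-prefer-* location constraint to the CIB.
crm resource migrate store debian-lenny-nodo2

# Later, clear that constraint so the cluster is free to place
# the group again (otherwise it stays pinned to nodo2).
crm resource unmigrate store
```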

store-fs   (ocf::heartbeat:Filesystem):    Started debian-lenny-nodo2 (unmanaged) FAILED

The only way to make things work again is to clean up the nfs_clone resource (or restart the nfs-kernel-server daemon) and then clean up the store group. It seems that the filesystem is kept open by the NFS daemon.
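Concretely, these are the recovery steps I run (crm shell commands; the init-script path is the Debian Lenny default):

```shell
# Release the filesystem by restarting the NFS server,
# either via the cluster or directly on the node:
crm resource cleanup nfs_clone
# or: /etc/init.d/nfs-kernel-server restart

# Then clear the failure on the group so Pacemaker manages it again.
crm resource cleanup store
```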

So, what's the difference between a simple cp and an rsync? Why, with rsync, is the Filesystem resource unable to unmount the filesystem? Is there something I'm missing, or could this be a Filesystem resource agent bug?
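In case it helps, this is how I'd check what is holding the mount busy during the failed stop (fuser and lsof are standard tools, not part of my configuration; the comment about kernel nfsd is my guess, not something I've verified):

```shell
# Show which processes have open files under /store.
fuser -vm /store

# List open file descriptors on the mount point in more detail.
lsof /store

# Note: kernel nfsd threads holding the export busy are not ordinary
# processes, which could explain why the resource agent's
# SIGTERM/SIGKILL cleanup reports "No processes on /store were
# signalled" while umount still fails with "device is busy".
```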

Here are the logs:

May 27 11:20:41 debian-lenny-nodo1 Filesystem[28197]: INFO: Running stop for /dev/vg_drbd/lv_store on /store
May 27 11:20:41 debian-lenny-nodo1 Filesystem[28197]: INFO: Trying to unmount /store
May 27 11:20:41 debian-lenny-nodo1 lrmd: [2589]: info: RA output: (store-fs:stop:stderr) umount: /store: device is busy#012umount: /store: device is busy
May 27 11:20:41 debian-lenny-nodo1 Filesystem[28197]: ERROR: Couldn't unmount /store; trying cleanup with SIGTERM
May 27 11:20:41 debian-lenny-nodo1 Filesystem[28197]: INFO: No processes on /store were signalled
May 27 11:20:42 debian-lenny-nodo1 lrmd: [2589]: info: RA output: (store-fs:stop:stderr) umount: /store: device is busy#012umount: /store: device is busy
May 27 11:20:42 debian-lenny-nodo1 Filesystem[28197]: ERROR: Couldn't unmount /store; trying cleanup with SIGTERM
May 27 11:20:42 debian-lenny-nodo1 Filesystem[28197]: INFO: No processes on /store were signalled
May 27 11:20:43 debian-lenny-nodo1 lrmd: [2589]: info: RA output: (store-fs:stop:stderr) umount: /store: device is busy
May 27 11:20:43 debian-lenny-nodo1 lrmd: [2589]: info: RA output: (store-fs:stop:stderr)
May 27 11:20:43 debian-lenny-nodo1 lrmd: [2589]: info: RA output: (store-fs:stop:stderr) umount: /store: device is busy
May 27 11:20:43 debian-lenny-nodo1 lrmd: [2589]: info: RA output: (store-fs:stop:stderr)
May 27 11:20:43 debian-lenny-nodo1 Filesystem[28197]: ERROR: Couldn't unmount /store; trying cleanup with SIGTERM
May 27 11:20:43 debian-lenny-nodo1 Filesystem[28197]: INFO: No processes on /store were signalled
May 27 11:20:44 debian-lenny-nodo1 lrmd: [2589]: info: RA output: (store-fs:stop:stderr) umount: /store: device is busy#012umount: /store: device is busy
May 27 11:20:44 debian-lenny-nodo1 Filesystem[28197]: ERROR: Couldn't unmount /store; trying cleanup with SIGKILL
May 27 11:20:44 debian-lenny-nodo1 Filesystem[28197]: INFO: No processes on /store were signalled
May 27 11:20:45 debian-lenny-nodo1 lrmd: [2589]: info: RA output: (store-fs:stop:stderr) umount: /store: device is busy#012umount: /store: device is busy
May 27 11:20:45 debian-lenny-nodo1 Filesystem[28197]: ERROR: Couldn't unmount /store; trying cleanup with SIGKILL
May 27 11:20:45 debian-lenny-nodo1 Filesystem[28197]: INFO: No processes on /store were signalled
May 27 11:20:46 debian-lenny-nodo1 lrmd: [2589]: info: RA output: (store-fs:stop:stderr) umount: /store: device is busy#012umount: /store: device is busy
May 27 11:20:46 debian-lenny-nodo1 Filesystem[28197]: ERROR: Couldn't unmount /store; trying cleanup with SIGKILL
May 27 11:20:46 debian-lenny-nodo1 Filesystem[28197]: INFO: No processes on /store were signalled
May 27 11:20:47 debian-lenny-nodo1 Filesystem[28197]: ERROR: Couldn't unmount /store, giving up!
May 27 11:20:48 debian-lenny-nodo1 crmd: [2592]: info: process_lrm_event: LRM operation store-fs_stop_0 (call=188, rc=1, cib-update=389, confirmed=true) unknown error
May 27 11:20:48 debian-lenny-nodo1 crmd: [2592]: WARN: status_from_rc: Action 58 (store-fs_stop_0) on debian-lenny-nodo1 failed (target: 0 vs. rc: 1): Error

Thanks for your help!

-- 
RaSca
Mia Mamma Usa Linux: Niente รจ impossibile da capire, se lo spieghi bene!
[email protected]
http://www.miamammausalinux.org
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
