On 03/23/2010 01:18 PM, Greg Woods wrote:
>
>>> On one node, I can get all services to start (and they work fine), but
>>> whenever failover occurs, there are NFS-related handles left open, which
>>> hangs the failover. More specifically, the file system fails
>>> to unmount.
>
> If you are referring to file systems on the server that are made
> available for NFS mounting that hang on unmount (it's not clear from the
> above if your cluster nodes are NFS servers or clients), then you need
> to unexport the file systems first, then you can umount them. I handled
> this by writing my own nfs-exports RA that basically just does an
> "exportfs -u" with the appropriate parameters, and used an "order" line
> in crm shell to make sure that the Filesystem resource is ordered before
> the nfs-exports resource. The nfs-exports resource will export the file
> system on start, and unexport it on stop.
>
> --Greg
>
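For anyone following along, the start/stop behaviour Greg describes could be
sketched roughly like this. It is only a guess at the shape of such an agent,
not Greg's actual RA: the function names, the EXPORTFS override and the
client/options arguments are all made up for illustration, and a real OCF
agent would also need monitor and meta-data actions.

```shell
#!/bin/sh
# Hypothetical sketch of the start/stop logic of an "nfs-exports" style agent.
# EXPORTFS can be overridden for a dry run, e.g. EXPORTFS="echo exportfs".
EXPORTFS="${EXPORTFS:-exportfs}"

# start: export directory $2 to client spec $1 with export options $3
nfs_export_start() {
    $EXPORTFS -o "$3" "$1:$2"
}

# stop: unexport directory $2 for client spec $1, so that the Filesystem
# resource can unmount it afterwards
nfs_export_stop() {
    $EXPORTFS -u "$1:$2"
}
```

With EXPORTFS="echo exportfs" set, calling nfs_export_stop '*' /data only
prints the command it would run. The ordering Greg mentions would then be an
"order" constraint in crm shell along the lines of
"order fs_before_exports inf: fileserver_fs0 fileserver_exports", where
fileserver_exports is a hypothetical name for the new resource.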
This is what I am seeing for NFS-related open files on the cluster node that is
trying to perform the unmount. As you can see, there are no open files under
the shared path (/data). The PIDs referenced are NFS kernel processes.
r...@vanessa:~# lsof | grep /data
r...@vanessa:~# lsof | grep nfs
nfsiod 15479 root cwd DIR     104,3 4096 2 /
nfsiod 15479 root rtd DIR     104,3 4096 2 /
nfsiod 15479 root txt unknown              /proc/15479/exe
nfsd4  15511 root cwd DIR     104,3 4096 2 /
nfsd4  15511 root rtd DIR     104,3 4096 2 /
nfsd4  15511 root txt unknown              /proc/15511/exe
nfsd   15512 root cwd DIR     104,3 4096 2 /
nfsd   15512 root rtd DIR     104,3 4096 2 /
nfsd   15512 root txt unknown              /proc/15512/exe
nfsd   15513 root cwd DIR     104,3 4096 2 /
nfsd   15513 root rtd DIR     104,3 4096 2 /
nfsd   15513 root txt unknown              /proc/15513/exe
nfsd   15514 root cwd DIR     104,3 4096 2 /
nfsd   15514 root rtd DIR     104,3 4096 2 /
nfsd   15514 root txt unknown              /proc/15514/exe
nfsd   15515 root cwd DIR     104,3 4096 2 /
nfsd   15515 root rtd DIR     104,3 4096 2 /
nfsd   15515 root txt unknown              /proc/15515/exe
nfsd   15516 root cwd DIR     104,3 4096 2 /
nfsd   15516 root rtd DIR     104,3 4096 2 /
nfsd   15516 root txt unknown              /proc/15516/exe
nfsd   15517 root cwd DIR     104,3 4096 2 /
nfsd   15517 root rtd DIR     104,3 4096 2 /
nfsd   15517 root txt unknown              /proc/15517/exe
nfsd   15518 root cwd DIR     104,3 4096 2 /
nfsd   15518 root rtd DIR     104,3 4096 2 /
nfsd   15518 root txt unknown              /proc/15518/exe
nfsd   15519 root cwd DIR     104,3 4096 2 /
nfsd   15519 root rtd DIR     104,3 4096 2 /
nfsd   15519 root txt unknown              /proc/15519/exe
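
Since lsof shows nothing under /data, and following Greg's point about
unexporting first, it may also be worth checking whether /data is still listed
as exported at the moment the unmount hangs. A rough sketch of such a check is
below. It assumes the nfs-utils export table lives at /var/lib/nfs/etab with
the export path in the first column; the function name is_exported is made up
for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if directory $1 is listed as an export path in
# the etab-style file $2 (assumed default: /var/lib/nfs/etab).
is_exported() {
    dir="$1"
    etab="${2:-/var/lib/nfs/etab}"
    # An unreadable or missing table is treated as "not exported".
    [ -r "$etab" ] || return 1
    # The first whitespace-separated field of each line is the export path.
    awk -v d="$dir" '$1 == d { found = 1 } END { exit !found }' "$etab"
}
```

For example, "is_exported /data && echo still exported" could be run on the
hung node just before the Filesystem resource tries to unmount.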
Looks like everything but the filesystem resource stopped correctly.
r...@valerie:/# crm_resource -L
Master/Slave Set: master-drbd1
    Masters: [ vanessa ]
    Stopped: [ drbd1:1 ]
Resource Group: fileserver_cluster_group
    fileserver_fs0          (ocf::heartbeat:Filesystem) Started
    fileserver_vip0         (ocf::heartbeat:IPaddr)     Stopped
    fileserver_nfs-common   (lsb:nfs-common)            Stopped
    fileserver_nfs          (lsb:nfs-kernel-server)     Stopped
    fileserver_notify_admin (ocf::heartbeat:MailTo)     Stopped
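
To spot which resources are still running without reading the whole listing, a
small filter like the following could be used. It is only a sketch: it assumes
the output format shown above (resource name, agent in parentheses, then the
state), which may differ between Pacemaker versions, and the function name
started_resources is made up.

```shell
#!/bin/sh
# Hypothetical filter: read "crm_resource -L" style output on stdin and print
# the names of resources whose last column reads "Started".
started_resources() {
    # Only consider lines whose second field is the "(class:provider:type)"
    # part, so the "Master/Slave Set" and "Resource Group" headers are skipped.
    awk '$NF == "Started" && $2 ~ /^\(/ { print $1 }'
}
```

Usage would be "crm_resource -L | started_resources", which for the listing
above should print only fileserver_fs0.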
At this point, if I reboot the hung node with `echo b > /proc/sysrq-trigger`,
the resource group fails over to the good node just fine. Once the node has
rebooted, all is good once again, that is, until I try the failover again.
Note:
Clients seem to handle the node failover quite well, even if the failover
takes a little while because I have to intervene.