On 3/14/12 6:02 AM, emmanuel segura wrote:

I think it's better to make clvmd start at boot:

chkconfig cman on ; chkconfig clvmd on

I've already tried it. It doesn't work. The problem is that my LVM information is on the drbd. If I start up clvmd before drbd, it won't find the logical volumes.

I also don't see why that would make a difference (although this could be part of the confusion): a service is a service. I've tried starting up clvmd inside and outside pacemaker control, with the same problem. Why would starting clvmd at boot make a difference?
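(For reference, the dependency I need can at least be stated explicitly in pacemaker with ordering/colocation constraints rather than boot-time ordering. A minimal crm sketch, assuming hypothetical resource names ms_drbd for the DRBD master/slave resource and clvmd-clone for the clvmd clone:)

```shell
# Hypothetical resource names -- adjust to the actual configuration.
# Start clvmd only after the DRBD resource has been promoted, and keep
# it on the node where the DRBD master runs.
crm configure order drbd-before-clvmd inf: ms_drbd:promote clvmd-clone:start
crm configure colocation clvmd-with-drbd inf: clvmd-clone ms_drbd:Master
```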

On 13 March 2012 23:29, William Seligman <[email protected]>
wrote:

On 3/13/12 5:50 PM, emmanuel segura wrote:

So if you're using cman, why do you use lsb::clvmd?

I think you are very confused

I don't dispute that I may be very confused!

However, from what I can tell, I still need to run clvmd even if
I'm running cman (I'm not using rgmanager). If I just run cman,
gfs2 and any other form of mount fails. If I run cman, then clvmd,
then gfs2, everything behaves normally.
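To be explicit, the manual sequence that behaves normally looks like this (the volume group, logical volume, and mount point names below are hypothetical):

```shell
# Working start order, run by hand as root:
service cman start     # membership, fencing, DLM
service clvmd start    # clustered LVM; activates the clustered VGs
mount -t gfs2 /dev/drbd_vg/shared_lv /mnt/shared   # hypothetical names
```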

Going by these instructions:

<https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial>

the resources he puts under "cluster control" (rgmanager) I have to
put under pacemaker control. Those include drbd, clvmd, and gfs2.

The difference between what I've got, and what's in "Clusters From
Scratch", is in CFS they assign one DRBD volume to a single
filesystem. I create an LVM physical volume on my DRBD resource,
as in the above tutorial, and so I have to start clvmd or the
logical volumes in the DRBD partition won't be recognized.

Is there some way to get logical volumes recognized automatically by
cman without rgmanager that I've missed?

On 13 March 2012 22:42, William Seligman <[email protected]>
wrote:

On 3/13/12 12:29 PM, William Seligman wrote:
I'm not sure if this is a "Linux-HA" question; please direct
me to the appropriate list if it's not.

I'm setting up a two-node cman+pacemaker+gfs2 cluster as
described in "Clusters From Scratch." Fencing is through
forcibly rebooting a node by cutting and restoring its power
via UPS.

My fencing/failover tests have revealed a problem. If I
gracefully turn off one node ("crm node standby"; "service
pacemaker stop"; "shutdown -r now") all the resources
transfer to the other node with no problems. If I cut power
to one node (as would happen if it were fenced), the
lsb::clvmd resource on the remaining node eventually fails.
Since all the other resources depend on clvmd, all the
resources on the remaining node stop and the cluster is left
with nothing running.

I've traced why the lsb::clvmd fails: The monitor/status
command includes "vgdisplay", which hangs indefinitely.
Therefore the monitor will always time-out.
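One way to make the hang observable and bounded, instead of letting the monitor block indefinitely, is coreutils timeout. A sketch, with sleep standing in for the hanging vgdisplay:

```shell
# timeout kills the command and exits 124 when it runs past the limit;
# here "sleep 10" stands in for the vgdisplay call that hangs.
timeout 2 sleep 10
echo "exit status: $?"
# The same idea could bound the status check itself, e.g.:
#   timeout 30 vgdisplay >/dev/null 2>&1 || exit 1
```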

So this isn't a problem with pacemaker, but with clvmd/dlm:
If a node is cut off, the cluster isn't handling it properly.
Has anyone on this list seen this before? Any ideas?
Details:

versions:
Red Hat Enterprise Linux 6.2 (kernel 2.6.32)
cman-3.0.12.1
corosync-1.4.1
pacemaker-1.1.6
lvm2-2.02.87
lvm2-cluster-2.02.87

This may be a Linux-HA question after all!

I ran a few more tests. Here's the output from a typical test of

grep -E "(dlm|gfs2|clvmd|fenc|syslogd)" /var/log/messages

<http://pastebin.com/uqC6bc1b>

It looks like what's happening is that the fence agent (one I
wrote) is not returning the proper error code when a node
crashes. According to this page, if a fencing agent fails, GFS2
will freeze to protect the data:

<http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Global_File_System_2/s1-gfs2hand-allnodes.html>

As a test, I tried to fence my test node via standard means:

stonith_admin -F orestes-corosync.nevis.columbia.edu

These were the log messages, which show that stonith_admin did
its job and CMAN was notified of the fencing:
<http://pastebin.com/jaH820Bv>.

Unfortunately, I still got the gfs2 freeze, so this is not the
complete story.

First things first. I vaguely recall a web page that went over
the STONITH return codes, but I can't locate it again. Is there
any reference to the return codes expected from a fencing
agent, perhaps as a function of the state of the fencing device?
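Absent that reference, the convention I believe applies (hedged: this is the fence-agent exit-code contract as I understand it, not a quote from the spec) is simply 0 for success and non-zero for failure, for every action including status/monitor. A sketch with placeholder power functions standing in for the real UPS calls:

```shell
# Sketch of the exit-code contract: 0 = success, non-zero = failure.
# power_off/power_on are placeholders for the real UPS control calls.
power_off() { echo "cutting power"; }
power_on()  { echo "restoring power"; }

do_fence() {
    case "$1" in
        off)    power_off || return 1 ;;
        on)     power_on  || return 1 ;;
        reboot) power_off && power_on || return 1 ;;
        status|monitor) return 0 ;;   # fencing device itself is reachable
        *)      return 1 ;;           # unknown action must fail loudly
    esac
}

do_fence reboot; echo "reboot -> $?"
do_fence bogus;  echo "bogus  -> $?"
```

The key point for the cluster is the last case: an agent that exits 0 for an action it did not actually complete tells GFS2 it is safe to continue when it is not.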

--
Bill Seligman             | mailto://[email protected]
Nevis Labs, Columbia Univ | http://www.nevis.columbia.edu/~seligman/
PO Box 137                |
Irvington NY 10533  USA   | Phone: (914) 591-2823


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
