On 3/14/12 9:20 AM, emmanuel segura wrote:
> Hello William
> 
> I didn't know you are using drbd, and I don't know what type of configuration
> you are using.
> 
> But it's better if you try to start clvmd with "clvmd -d",
> 
> so that we can see what the problem is.

For what it's worth, here's the output of running clvmd -d on the node that
stays up: <http://pastebin.com/sWjaxAEF>
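
(In case anyone wants to reproduce that, the output was captured roughly like
this, i.e. running the daemon in the foreground and saving stderr; the exact
flags may vary with the lvm2-cluster version:

  clvmd -d 2>&1 | tee /tmp/clvmd-debug.log

rather than starting it through the init script.)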

What's probably important in that big mass of output are the last two lines. Up
to that point, I have both nodes up and running cman + clvmd; cluster.conf is
here: <http://pastebin.com/w5XNYyAX>

At the time of the next-to-the-last line, I cut power to the other node.

At the time of the last line, I run "vgdisplay" on the remaining node, which
hangs forever.
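
In case it's useful, here's roughly what I look at on the surviving node while
vgdisplay is hung (command names assume the stock RHEL6 cman/dlm tools):

  cman_tool nodes            # membership as cman sees it
  dlm_tool ls                # dlm lockspaces, including the clvmd one
  dlm_tool dump | tail -50   # recent dlm_controld activity

which should at least hint at whether dlm is blocked waiting for the dead node
to be fenced.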

After a lot of web searching, I found that I'm not the only one with this
problem. Here's one case that doesn't seem relevant to me, since I don't use
qdisk:
<http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html>.
Here's another with the same problem on the same OS:
<http://bugs.centos.org/view.php?id=5229>, but with no resolution.

Out of curiosity, has anyone on this list made a two-node cman+clvmd cluster
work for them?
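
(If anyone wants to compare setups, this is all I check on the quorum side;
both commands are from the stock cman package:

  cman_tool status   # expected votes, quorum, cluster name
  cman_tool nodes    # which nodes cman currently considers members

and as noted above there's no qdisk here, just the two real nodes.)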

> On 14 March 2012, at 14:02, William Seligman <[email protected]>
> wrote:
> 
>> On 3/14/12 6:02 AM, emmanuel segura wrote:
>>
>>> I think it's better if you make clvmd start at boot:
>>>
>>> chkconfig cman on ; chkconfig clvmd on
>>>
>>
>> I've already tried it. It doesn't work. The problem is that my LVM
>> information is on the drbd. If I start up clvmd before drbd, it won't find
>> the logical volumes.
>>
>> I also don't see why that would make a difference (although this could be
>> part of the confusion): a service is a service. I've tried starting up
>> clvmd inside and outside pacemaker control, with the same problem. Why
>> would starting clvmd at boot make a difference?
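>>
>> (For the record, the ordering I mean under pacemaker is roughly this; the
>> resource names are placeholders, not my exact configuration:
>>
>>   crm configure order o-drbd-before-clvmd inf: ms-drbd:promote clvmd-clone:start
>>   crm configure order o-clvmd-before-gfs2 inf: clvmd-clone:start gfs2-clone:start
>>
>> so clvmd only starts once the DRBD master is up, and gfs2 mounts only after
>> clvmd.)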
>>
>>> On 13 March 2012, at 23:29, William Seligman <[email protected]>
>>> wrote:
>>>
>>>> On 3/13/12 5:50 PM, emmanuel segura wrote:
>>>>
>>>>> So if you are using cman, why do you use lsb::clvmd?
>>>>>
>>>>> I think you are very confused
>>>>>
>>>>
>>>> I don't dispute that I may be very confused!
>>>>
>>>> However, from what I can tell, I still need to run clvmd even if
>>>> I'm running cman (I'm not using rgmanager). If I just run cman,
>>>> gfs2 and any other form of mount fails. If I run cman, then clvmd,
>>>> then gfs2, everything behaves normally.
>>>>
>>>> Going by these instructions:
>>>>
>>>> <https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial>
>>>>
>>>> the resources he puts under "cluster control" (rgmanager) I have to
>>>> put under pacemaker control. Those include drbd, clvmd, and gfs2.
>>>>
>>>> The difference between what I've got and what's in "Clusters From
>>>> Scratch" is that in CFS they assign one DRBD volume to a single
>>>> filesystem. I create an LVM physical volume on my DRBD resource,
>>>> as in the above tutorial, and so I have to start clvmd or the
>>>> logical volumes in the DRBD partition won't be recognized.
>>>>
>>>> Is there some way to get logical volumes recognized automatically by
>>>> cman without rgmanager that I've missed?
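>>>>
>>>> (Concretely, the layout is roughly the following; the device and volume
>>>> names here are only illustrative, not my exact ones:
>>>>
>>>>   pvcreate /dev/drbd0
>>>>   vgcreate -cy cluster_vg /dev/drbd0
>>>>   lvcreate -L 100G -n gfs2_lv cluster_vg
>>>>
>>>> so the volume group is marked clustered, and nothing sees those logical
>>>> volumes until clvmd is running.)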
>>>>
>>>
>>>>> On 13 March 2012, at 22:42, William Seligman <[email protected]>
>>>>> wrote:
>>>>>
>>>>>  On 3/13/12 12:29 PM, William Seligman wrote:
>>>>>>
>>>>>>> I'm not sure if this is a "Linux-HA" question; please direct
>>>>>>> me to the appropriate list if it's not.
>>>>>>>
>>>>>>> I'm setting up a two-node cman+pacemaker+gfs2 cluster as
>>>>>>> described in "Clusters From Scratch." Fencing is through
>>>>>>> forcibly rebooting a node by cutting and restoring its power
>>>>>>> via UPS.
>>>>>>>
>>>>>>> My fencing/failover tests have revealed a problem. If I
>>>>>>> gracefully turn off one node ("crm node standby"; "service
>>>>>>> pacemaker stop"; "shutdown -r now") all the resources
>>>>>>> transfer to the other node with no problems. If I cut power
>>>>>>> to one node (as would happen if it were fenced), the
>>>>>>> lsb::clvmd resource on the remaining node eventually fails.
>>>>>>> Since all the other resources depend on clvmd, all the
>>>>>>> resources on the remaining node stop and the cluster is left
>>>>>>> with nothing running.
>>>>>>>
>>>>>>> I've traced why the lsb::clvmd fails: The monitor/status
>>>>>>> command includes "vgdisplay", which hangs indefinitely.
>>>>>>> Therefore the monitor will always time-out.
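>>>>>>>
>>>>>>> (The resource itself is defined more or less like this; the name and
>>>>>>> the interval/timeout values are just examples, not my exact settings:
>>>>>>>
>>>>>>>   crm configure primitive p-clvmd lsb:clvmd \
>>>>>>>     op monitor interval="30s" timeout="90s"
>>>>>>>
>>>>>>> so once "service clvmd status" hangs in vgdisplay past the timeout, the
>>>>>>> monitor fails and everything depending on it is stopped.)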
>>>>>>>
>>>>>>> So this isn't a problem with pacemaker, but with clvmd/dlm:
>>>>>>> If a node is cut off, the cluster isn't handling it properly.
>>>>>>> Has anyone on this list seen this before? Any ideas?
>>>>>>>
>>>>>>> Details:
>>>>>>>
>>>>>>> versions:
>>>>>>> Redhat Linux 6.2 (kernel 2.6.32)
>>>>>>> cman-3.0.12.1
>>>>>>> corosync-1.4.1
>>>>>>> pacemaker-1.1.6
>>>>>>> lvm2-2.02.87
>>>>>>> lvm2-cluster-2.02.87
>>>>>>>
>>>>>>
>>>>>> This may be a Linux-HA question after all!
>>>>>>
>>>>>> I ran a few more tests. Here's the output from a typical test of
>>>>>>
>>>>>> grep -E "(dlm|gfs2|clvmd|fenc|syslogd)" /var/log/messages
>>>>>>
>>>>>> <http://pastebin.com/uqC6bc1b>
>>>>>>
>>>>>> It looks like what's happening is that the fence agent (one I
>>>>>> wrote) is not returning the proper error code when a node
>>>>>> crashes. According to this page, if a fencing agent fails GFS2
>>>>>> will freeze to protect the data:
>>>>>>
>>>>>> <http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Global_File_System_2/s1-gfs2hand-allnodes.html>
>>>>>>
>>>>>> As a test, I tried to fence my test node via standard means:
>>>>>>
>>>>>> stonith_admin -F orestes-corosync.nevis.columbia.edu
>>>>>>
>>>>>> These were the log messages, which show that stonith_admin did
>>>>>> its job and CMAN was notified of the fencing:
>>>>>> <http://pastebin.com/jaH820Bv>
>>>>>>
>>>>>> Unfortunately, I still got the gfs2 freeze, so this is not the
>>>>>> complete story.
>>>>>>
>>>>>> First things first. I vaguely recall a web page that went over
>>>>>> the STONITH return codes, but I can't locate it again. Is there
>>>>>> any reference to the return codes expected from a fencing
>>>>>> agent, perhaps as a function of the state of the fencing device?
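>>>>>>
>>>>>> (A rough way to check by hand, assuming the agent takes the usual
>>>>>> fence-agent style options; "fence_nevis" is just a placeholder for my
>>>>>> script:
>>>>>>
>>>>>>   fence_nevis -o status -n orestes ; echo "exit code: $?"
>>>>>>
>>>>>> run on the survivor while the other node's UPS outlet is off, to see
>>>>>> what it actually returns in that state.)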

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://[email protected]
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
