On 3/15/12 12:55 PM, emmanuel segura wrote:

> I don't see any error, and the answer to your question is yes.
> 
> Can you show me your /etc/cluster/cluster.conf and your "crm configure show"?
> 
> That way I can try later to see if I can find a fix.

Thanks for taking a look.

My cluster.conf: <http://pastebin.com/w5XNYyAX>
crm configure show: <http://pastebin.com/atVkXjkn>

Before you spend a lot of time on the second file, remember that clvmd will hang
whether or not I'm running pacemaker.
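
In case it helps, here's what I've been running on the surviving node to
look at the cluster state when the hang happens. Nothing exotic, just the
standard cman-stack tools:

cman_tool nodes    # membership as cman sees it
fence_tool ls      # fence domain state: has fencing actually completed?
dlm_tool ls        # dlm lockspaces (clvmd has one); a lockspace stuck in
                   # recovery suggests dlm is still waiting on a fence result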

> On 15 March 2012 17:42, William Seligman <selig...@nevis.columbia.edu>
> wrote:
> 
>> On 3/15/12 12:15 PM, emmanuel segura wrote:
>>
>>> How did you create your volume group?
>>
>> pvcreate /dev/drbd0
>> vgcreate -c y ADMIN /dev/drbd0
>> lvcreate -L 200G -n usr ADMIN # ... and so on
>> # "Nevis_HA" is the cluster name I used in cluster.conf
>> mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr  # ... and so on
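>>
>> (Aside: the -t argument must be ClusterName:FSName, with ClusterName
>> exactly matching the cluster name in cluster.conf. A quick cross-check
>> on a running node:
>>
>> cman_tool status | grep -i "cluster name"
>>
>> should agree with what mkfs.gfs2 was given.)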
>>
>>> Give me the output of the vgs command when the cluster is up.
>>
>> Here it is:
>>
>>    Logging initialised at Thu Mar 15 12:40:39 2012
>>    Set umask from 0022 to 0077
>>    Finding all volume groups
>>    Finding volume group "ROOT"
>>    Finding volume group "ADMIN"
>>  VG    #PV #LV #SN Attr   VSize   VFree
>>  ADMIN   1   5   0 wz--nc   2.61t 765.79g
>>  ROOT    1   2   0 wz--n- 117.16g      0
>>    Wiping internal VG cache
>>
>> I assume the "c" in the ADMIN attributes means that clustering is turned
>> on?
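>>
>> (For reference, the sixth vg_attr character is the clustered bit, so a
>> terser way to confirm it is:
>>
>> vgs --noheadings -o vg_name,vg_attr ADMIN
>>
>> where a "c" in the sixth attribute position means the VG uses cluster
>> locking.)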
>>
>>> On 15 March 2012 17:06, William Seligman
>>> <selig...@nevis.columbia.edu> wrote:
>>>
>>>> On 3/15/12 11:50 AM, emmanuel segura wrote:
>>>>> Yes, William.
>>>>>
>>>>> Now try clvmd -d and see what happens.
>>>>>
>>>>> locking_type = 3 is the LVM cluster locking type.
>>>>
>>>> Since you asked for confirmation, here it is: the output of 'clvmd -d'
>>>> just now: <http://pastebin.com/bne8piEw>. I crashed the other node at
>>>> Mar 15 12:02:35, which is when you'll see the only additional line of
>>>> output.
>>>>
>>>> I don't see any particular difference between this and the previous
>>>> result <http://pastebin.com/sWjaxAEF>, which suggests that I had
>>>> cluster locking enabled before, and still do now.
>>>>
>>>>> On 15 March 2012 16:15, William Seligman
>>>>> <selig...@nevis.columbia.edu> wrote:
>>>>>
>>>>>> On 3/15/12 5:18 AM, emmanuel segura wrote:
>>>>>>
>>>>>>> The first thing I saw in your clvmd log is this:
>>>>>>>
>>>>>>> =============================================
>>>>>>>  WARNING: Locking disabled. Be careful! This could corrupt your 
>>>>>>> metadata.
>>>>>>> =============================================
>>>>>>
>>>>>> I saw that too, and thought the same as you did. I did some checks
>>>>>> (see below), but some web searches suggest that this message is a
>>>>>> normal consequence of clvmd initialization; e.g.,
>>>>>>
>>>>>> <http://markmail.org/message/vmy53pcv52wu7ghx>
>>>>>>
>>>>>>> Use this command:
>>>>>>>
>>>>>>> lvmconf --enable-cluster
>>>>>>>
>>>>>>> And remember: for cman+pacemaker you don't need qdisk.
>>>>>>
>>>>>> Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf:
>>>>>> <http://pastebin.com/841VZRzW> and the output of "lvm dumpconfig":
>>>>>> <http://pastebin.com/rtw8c3Pf>.
>>>>>>
>>>>>> Then I did as you suggested, but with a check to see if anything
>>>>>> changed:
>>>>>>
>>>>>> # cd /etc/lvm/
>>>>>> # cp lvm.conf lvm.conf.cluster
>>>>>> # lvmconf --enable-cluster
>>>>>> # diff lvm.conf lvm.conf.cluster
>>>>>> #
>>>>>>
>>>>>> So the key lines have been there all along:
>>>>>>    locking_type = 3
>>>>>>    fallback_to_local_locking = 0
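>>>>>>
>>>>>> (As one more sanity check, you can ask the tools what they actually
>>>>>> parsed rather than reading the file; I believe dumpconfig accepts
>>>>>> config paths:
>>>>>>
>>>>>> lvm dumpconfig global/locking_type global/fallback_to_local_locking
>>>>>>
>>>>>> which should print the two settings above.)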
>>>>>>
>>>>>>
>>>>>>> On 14 March 2012 23:17, William Seligman
>>>>>>> <selig...@nevis.columbia.edu> wrote:
>>>>>>>
>>>>>>>> On 3/14/12 9:20 AM, emmanuel segura wrote:
>>>>>>>>> Hello William,
>>>>>>>>>
>>>>>>>>> I knew you were using DRBD, but I don't know what type of
>>>>>>>>> configuration you're using.
>>>>>>>>>
>>>>>>>>> But it's better if you try to start clvmd with clvmd -d;
>>>>>>>>> that way we can see what the problem is.
>>>>>>>>
>>>>>>>> For what it's worth, here's the output of running clvmd -d on
>>>>>>>> the node that stays up: <http://pastebin.com/sWjaxAEF>
>>>>>>>>
>>>>>>>> What's probably important in that big mass of output are the
>>>>>>>> last two lines. Up to that point, I have both nodes up and
>>>>>>>> running cman + clvmd; cluster.conf is here:
>>>>>>>> <http://pastebin.com/w5XNYyAX>
>>>>>>>>
>>>>>>>> At the time of the next-to-last line, I cut power to the
>>>>>>>> other node.
>>>>>>>>
>>>>>>>> At the time of the last line, I ran "vgdisplay" on the
>>>>>>>> remaining node; it hangs forever.
>>>>>>>>
>>>>>>>> After a lot of web searching, I found that I'm not the only one
>>>>>>>> with this problem. Here's one case that doesn't seem relevant
>>>>>>>> to me, since I don't use qdisk:
>>>>>>>> <http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html>.
>>>>>>>> Here's one with the same problem with the same OS:
>>>>>>>> <http://bugs.centos.org/view.php?id=5229>, but with no resolution.
>>>>>>>>
>>>>>>>> Out of curiosity, has anyone on this list made a two-node
>>>>>>>> cman+clvmd cluster work for them?
>>>>>>>>
>>>>>>>>> On 14 March 2012 14:02, William Seligman
>>>>>>>>> <selig...@nevis.columbia.edu> wrote:
>>>>>>>>>
>>>>>>>>>> On 3/14/12 6:02 AM, emmanuel segura wrote:
>>>>>>>>>>
>>>>>>>>>>> I think it's better if you make clvmd start at boot:
>>>>>>>>>>>
>>>>>>>>>>> chkconfig cman on ; chkconfig clvmd on
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I've already tried it. It doesn't work. The problem is that
>>>>>>>>>> my LVM information is on the drbd. If I start up clvmd
>>>>>>>>>> before drbd, it won't find the logical volumes.
>>>>>>>>>> 
>>>>>>>>>> I also don't see why that would make a difference (although
>>>>>>>>>> this could be part of the confusion): a service is a
>>>>>>>>>> service. I've tried starting up clvmd inside and outside
>>>>>>>>>> pacemaker control, with the same problem. Why would
>>>>>>>>>> starting clvmd at boot make a difference?
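>>>>>>>>>>
>>>>>>>>>> (For completeness: under pacemaker I enforce the startup order
>>>>>>>>>> with constraints roughly like these; the resource names are
>>>>>>>>>> placeholders, not my actual configuration:
>>>>>>>>>>
>>>>>>>>>> order o_drbd_before_clvmd inf: ms_drbd:promote p_clvmd:start
>>>>>>>>>> order o_clvmd_before_gfs2 inf: p_clvmd:start p_fs_gfs2:start
>>>>>>>>>>
>>>>>>>>>> so clvmd starts only after DRBD is promoted, and gfs2 mounts
>>>>>>>>>> only after clvmd is up.)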
>>>>>>>>>>
>>>>>>>>>>> On 13 March 2012 23:29, William
>>>>>>>>>>> Seligman <selig...@nevis.columbia.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 3/13/12 5:50 PM, emmanuel segura wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> So if you're using cman, why do you use lsb::clvmd?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think you are very confused.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't dispute that I may be very confused!
>>>>>>>>>>>>
>>>>>>>>>>>> However, from what I can tell, I still need to run
>>>>>>>>>>>> clvmd even if I'm running cman (I'm not using
>>>>>>>>>>>> rgmanager). If I just run cman, gfs2 and any other form
>>>>>>>>>>>> of mount fails. If I run cman, then clvmd, then gfs2,
>>>>>>>>>>>> everything behaves normally.
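>>>>>>>>>>>>
>>>>>>>>>>>> (By hand, that sequence is just the following; the mount point
>>>>>>>>>>>> here is hypothetical:
>>>>>>>>>>>>
>>>>>>>>>>>> service cman start
>>>>>>>>>>>> service clvmd start
>>>>>>>>>>>> mount -t gfs2 /dev/ADMIN/usr /mnt/usr
>>>>>>>>>>>>
>>>>>>>>>>>> and in that order everything comes up cleanly.)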
>>>>>>>>>>>>
>>>>>>>>>>>> Going by these instructions:
>>>>>>>>>>>>
>>>>>>>>>>>> <https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial>
>>>>>>>>>>>>
>>>>>>>>>>>> the resources he puts under "cluster control"
>>>>>>>>>>>> (rgmanager) I have to put under pacemaker control.
>>>>>>>>>>>> Those include drbd, clvmd, and gfs2.
>>>>>>>>>>>> 
>>>>>>>>>>>> The difference between what I've got and what's in
>>>>>>>>>>>> "Clusters From Scratch" is that in CFS they assign one DRBD
>>>>>>>>>>>> volume to a single filesystem. I create an LVM physical
>>>>>>>>>>>> volume on my DRBD resource, as in the above tutorial,
>>>>>>>>>>>> and so I have to start clvmd or the logical volumes in
>>>>>>>>>>>> the DRBD partition won't be recognized.
>>>>>>>>>>>>
>>>>>>>>>>>> Is there some way to get logical volumes recognized
>>>>>>>>>>>> automatically by cman without rgmanager that I've missed?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>> On 13 March 2012 22:42, William Seligman
>>>>>>>>>>>>> <selig...@nevis.columbia.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 3/13/12 12:29 PM, William Seligman wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not sure if this is a "Linux-HA" question;
>>>>>>>>>>>>>>> please direct me to the appropriate list if it's
>>>>>>>>>>>>>>> not.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I'm setting up a two-node cman+pacemaker+gfs2
>>>>>>>>>>>>>>> cluster as described in "Clusters From Scratch."
>>>>>>>>>>>>>>> Fencing is through forcibly rebooting a node by
>>>>>>>>>>>>>>> cutting and restoring its power via UPS.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> My fencing/failover tests have revealed a
>>>>>>>>>>>>>>> problem. If I gracefully turn off one node ("crm
>>>>>>>>>>>>>>> node standby"; "service pacemaker stop";
>>>>>>>>>>>>>>> "shutdown -r now") all the resources transfer to
>>>>>>>>>>>>>>> the other node with no problems. If I cut power 
>>>>>>>>>>>>>>> to one node (as would happen if it were fenced),
>>>>>>>>>>>>>>> the lsb::clvmd resource on the remaining node
>>>>>>>>>>>>>>> eventually fails. Since all the other resources
>>>>>>>>>>>>>>> depend on clvmd, all the resources on the
>>>>>>>>>>>>>>> remaining node stop and the cluster is left with
>>>>>>>>>>>>>>> nothing running.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I've traced why the lsb::clvmd fails: The
>>>>>>>>>>>>>>> monitor/status command includes "vgdisplay",
>>>>>>>>>>>>>>> which hangs indefinitely. Therefore the monitor
>>>>>>>>>>>>>>> will always time-out.
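>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (An easy, non-interactive way to confirm the hang,
>>>>>>>>>>>>>>> using timeout from coreutils:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> timeout 30 vgdisplay; echo "exit: $?"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> an exit code of 124 means vgdisplay never returned.)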
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> So this isn't a problem with pacemaker, but with
>>>>>>>>>>>>>>> clvmd/dlm: If a node is cut off, the cluster
>>>>>>>>>>>>>>> isn't handling it properly. Has anyone on this
>>>>>>>>>>>>>>> list seen this before? Any ideas?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Details:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> versions:
>>>>>>>>>>>>>>> Red Hat Enterprise Linux 6.2 (kernel 2.6.32)
>>>>>>>>>>>>>>> cman-3.0.12.1
>>>>>>>>>>>>>>> corosync-1.4.1
>>>>>>>>>>>>>>> pacemaker-1.1.6
>>>>>>>>>>>>>>> lvm2-2.02.87
>>>>>>>>>>>>>>> lvm2-cluster-2.02.87
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This may be a Linux-HA question after all!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I ran a few more tests. Here's the output from a
>>>>>>>>>>>>>> typical test of
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> grep -E "(dlm|gfs2|clvmd|fenc|syslogd)" /var/log/messages
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> <http://pastebin.com/uqC6bc1b>
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It looks like what's happening is that the fence
>>>>>>>>>>>>>> agent (one I wrote) is not returning the proper
>>>>>>>>>>>>>> error code when a node crashes. According to this
>>>>>>>>>>>>>> page, if a fencing agent fails GFS2 will freeze to
>>>>>>>>>>>>>> protect the data:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Global_File_System_2/s1-gfs2hand-allnodes.html>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As a test, I tried to fence my test node via
>>>>>>>>>>>>>> standard means:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> stonith_admin -F \
>>>>>>>>>>>>>> orestes-corosync.nevis.columbia.edu
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> These were the log messages, which show that
>>>>>>>>>>>>>> stonith_admin did its job and CMAN was notified of
>>>>>>>>>>>>>> the fencing: <http://pastebin.com/jaH820Bv>.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Unfortunately, I still got the gfs2 freeze, so this
>>>>>>>>>>>>>> is not the complete story.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> First things first. I vaguely recall a web page
>>>>>>>>>>>>>> that went over the STONITH return codes, but I
>>>>>>>>>>>>>> can't locate it again. Is there any reference to
>>>>>>>>>>>>>> the return codes expected from a fencing agent,
>>>>>>>>>>>>>> perhaps as a function of the state of the fencing
>>>>>>>>>>>>>> device?
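>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (For what it's worth, my current understanding, which may
>>>>>>>>>>>>>> well be wrong, is that an agent reads name=value options on
>>>>>>>>>>>>>> stdin and reports success purely through its exit status:
>>>>>>>>>>>>>> 0 for success, non-zero for failure. A skeleton of what I
>>>>>>>>>>>>>> mean, with the UPS calls left as comments:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #!/bin/bash
>>>>>>>>>>>>>> # read action=..., port=..., etc., one per line, from stdin
>>>>>>>>>>>>>> while IFS='=' read -r key val; do
>>>>>>>>>>>>>>     case "$key" in
>>>>>>>>>>>>>>         action) action="$val" ;;
>>>>>>>>>>>>>>         port)   port="$val" ;;
>>>>>>>>>>>>>>     esac
>>>>>>>>>>>>>> done
>>>>>>>>>>>>>> case "$action" in
>>>>>>>>>>>>>>     off|reboot)
>>>>>>>>>>>>>>         # ...cut (and, for reboot, restore) power to $port
>>>>>>>>>>>>>>         # via the UPS; exit 1 if the UPS call fails...
>>>>>>>>>>>>>>         exit 0 ;;
>>>>>>>>>>>>>>     on|status|monitor)
>>>>>>>>>>>>>>         # ...query or restore as appropriate...
>>>>>>>>>>>>>>         exit 0 ;;
>>>>>>>>>>>>>>     *) exit 1 ;;
>>>>>>>>>>>>>> esac
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> but I'd still like an authoritative reference for the
>>>>>>>>>>>>>> expected codes, especially for status/monitor.)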

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
