OK William, let's try to understand what happens when clvmd hangs.
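For reference, the change goes in the log section of /etc/lvm/lvm.conf. A minimal sketch of what that section should end up looking like (based on stock lvm2 defaults of that era; the surrounding settings in your file may differ):

```
log {
    # Keep sending messages to syslog as well (the default).
    syslog = 1

    # Uncomment this line so debug output also goes to a file:
    file = "/var/log/lvm2.log"

    # Append to the log file instead of overwriting it on restart.
    overwrite = 0

    # 7 is the most detailed debug level; 0 disables debug logging.
    level = 7
}
```

After changing it, reproduce the hang and look at /var/log/lvm2.log on the surviving node.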
Edit /etc/lvm/lvm.conf: in the log section change level = 7 and uncomment the line file = "/var/log/lvm2.log".

On 15 March 2012 20:50, William Seligman <[email protected]> wrote:
> On 3/15/12 12:55 PM, emmanuel segura wrote:
>
> > I don't see any error, and the answer to your question is yes.
> >
> > Can you show me your /etc/cluster/cluster.conf and your crm configure show?
> > With those, later I can try to see if I can find a fix.
>
> Thanks for taking a look.
>
> My cluster.conf: <http://pastebin.com/w5XNYyAX>
> crm configure show: <http://pastebin.com/atVkXjkn>
>
> Before you spend a lot of time on the second file, remember that clvmd
> will hang whether or not I'm running pacemaker.
>
> > On 15 March 2012 17:42, William Seligman <[email protected]> wrote:
> >
> >> On 3/15/12 12:15 PM, emmanuel segura wrote:
> >>
> >>> How did you create your volume group?
> >>
> >> pvcreate /dev/drbd0
> >> vgcreate -c y ADMIN /dev/drbd0
> >> lvcreate -L 200G -n usr ADMIN  # ... and so on
> >> # "Nevis-HA" is the cluster name I used in cluster.conf
> >> mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr  # ... and so on
> >>
> >>> Give me the output of the vgs command when the cluster is up.
> >>
> >> Here it is:
> >>
> >>   Logging initialised at Thu Mar 15 12:40:39 2012
> >>   Set umask from 0022 to 0077
> >>   Finding all volume groups
> >>   Finding volume group "ROOT"
> >>   Finding volume group "ADMIN"
> >>   VG    #PV #LV #SN Attr   VSize   VFree
> >>   ADMIN   1   5   0 wz--nc   2.61t 765.79g
> >>   ROOT    1   2   0 wz--n- 117.16g      0
> >>   Wiping internal VG cache
> >>
> >> I assume the "c" in the ADMIN attributes means that clustering is
> >> turned on?
> >>
> >>> On 15 March 2012 17:06, William Seligman <[email protected]> wrote:
> >>>
> >>>> On 3/15/12 11:50 AM, emmanuel segura wrote:
> >>>>
> >>>>> Yes, William.
> >>>>>
> >>>>> Now try clvmd -d and see what happens.
> >>>>>
> >>>>> locking_type = 3 is the LVM cluster locking type.
> >>>>
> >>>> Since you asked for confirmation, here it is: the output of 'clvmd -d'
> >>>> just now: <http://pastebin.com/bne8piEw>. I crashed the other node at
> >>>> Mar 15 12:02:35, when you see the only additional line of output.
> >>>>
> >>>> I don't see any particular difference between this and the previous
> >>>> result <http://pastebin.com/sWjaxAEF>, which suggests that I had
> >>>> cluster locking enabled before, and still do now.
> >>>>
> >>>>> On 15 March 2012 16:15, William Seligman <[email protected]> wrote:
> >>>>>
> >>>>>> On 3/15/12 5:18 AM, emmanuel segura wrote:
> >>>>>>
> >>>>>>> The first thing I saw in your clvmd log is this:
> >>>>>>>
> >>>>>>> =============================================
> >>>>>>> WARNING: Locking disabled. Be careful! This could corrupt your metadata.
> >>>>>>> =============================================
> >>>>>>
> >>>>>> I saw that too, and thought the same as you did. I did some checks
> >>>>>> (see below), but some web searches suggest that this message is a
> >>>>>> normal consequence of clvmd initialization; e.g.,
> >>>>>>
> >>>>>> <http://markmail.org/message/vmy53pcv52wu7ghx>
> >>>>>>
> >>>>>>> Use this command:
> >>>>>>>
> >>>>>>> lvmconf --enable-cluster
> >>>>>>>
> >>>>>>> And remember: for cman+pacemaker you don't need qdisk.
> >>>>>>
> >>>>>> Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf:
> >>>>>> <http://pastebin.com/841VZRzW> and the output of "lvm dumpconfig":
> >>>>>> <http://pastebin.com/rtw8c3Pf>.
> >>>>>>
> >>>>>> Then I did as you suggested, but with a check to see if anything
> >>>>>> changed:
> >>>>>>
> >>>>>> # cd /etc/lvm/
> >>>>>> # cp lvm.conf lvm.conf.cluster
> >>>>>> # lvmconf --enable-cluster
> >>>>>> # diff lvm.conf lvm.conf.cluster
> >>>>>> #
> >>>>>>
> >>>>>> So the key lines have been there all along:
> >>>>>>
> >>>>>> locking_type = 3
> >>>>>> fallback_to_local_locking = 0
> >>>>>>
> >>>>>>> On 14 March 2012 23:17, William Seligman <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> On 3/14/12 9:20 AM, emmanuel segura wrote:
> >>>>>>>>
> >>>>>>>>> Hello William,
> >>>>>>>>>
> >>>>>>>>> I didn't know you were using drbd, and I don't know what type of
> >>>>>>>>> configuration you're using.
> >>>>>>>>>
> >>>>>>>>> But it's better if you try to start clvm with clvmd -d;
> >>>>>>>>> that way we can see what the problem is.
> >>>>>>>>
> >>>>>>>> For what it's worth, here's the output of running clvmd -d on
> >>>>>>>> the node that stays up: <http://pastebin.com/sWjaxAEF>
> >>>>>>>>
> >>>>>>>> What's probably important in that big mass of output are the
> >>>>>>>> last two lines. Up to that point, I have both nodes up and
> >>>>>>>> running cman + clvmd; cluster.conf is here:
> >>>>>>>> <http://pastebin.com/w5XNYyAX>
> >>>>>>>>
> >>>>>>>> At the time of the next-to-last line, I cut power to the
> >>>>>>>> other node.
> >>>>>>>>
> >>>>>>>> At the time of the last line, I ran "vgdisplay" on the
> >>>>>>>> remaining node, which hangs forever.
> >>>>>>>>
> >>>>>>>> After a lot of web searching, I found that I'm not the only one
> >>>>>>>> with this problem. Here's one case that doesn't seem relevant
> >>>>>>>> to me, since I don't use qdisk:
> >>>>>>>> <http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html>.
> >>>>>>>> Here's one with the same problem on the same OS, but with no
> >>>>>>>> resolution: <http://bugs.centos.org/view.php?id=5229>.
> >>>>>>>>
> >>>>>>>> Out of curiosity, has anyone on this list made a two-node
> >>>>>>>> cman+clvmd cluster work for them?
> >>>>>>>>
> >>>>>>>>> On 14 March 2012 14:02, William Seligman <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> On 3/14/12 6:02 AM, emmanuel segura wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I think it's better if you make clvmd start at boot:
> >>>>>>>>>>>
> >>>>>>>>>>> chkconfig cman on ; chkconfig clvmd on
> >>>>>>>>>>
> >>>>>>>>>> I've already tried it. It doesn't work. The problem is that
> >>>>>>>>>> my LVM information is on the drbd. If I start up clvmd
> >>>>>>>>>> before drbd, it won't find the logical volumes.
> >>>>>>>>>>
> >>>>>>>>>> I also don't see why that would make a difference (although
> >>>>>>>>>> this could be part of the confusion): a service is a
> >>>>>>>>>> service. I've tried starting up clvmd inside and outside
> >>>>>>>>>> pacemaker control, with the same problem. Why would
> >>>>>>>>>> starting clvmd at boot make a difference?
> >>>>>>>>>>
> >>>>>>>>>>> On 13 March 2012 23:29, William Seligman <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> On 3/13/12 5:50 PM, emmanuel segura wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> So if you're using cman, why do you use lsb::clvmd?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I think you are very confused.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I don't dispute that I may be very confused!
> >>>>>>>>>>>>
> >>>>>>>>>>>> However, from what I can tell, I still need to run
> >>>>>>>>>>>> clvmd even if I'm running cman (I'm not using
> >>>>>>>>>>>> rgmanager). If I just run cman, gfs2 and any other form
> >>>>>>>>>>>> of mount fails. If I run cman, then clvmd, then gfs2,
> >>>>>>>>>>>> everything behaves normally.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Going by these instructions:
> >>>>>>>>>>>>
> >>>>>>>>>>>> <https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial>
> >>>>>>>>>>>>
> >>>>>>>>>>>> the resources he puts under "cluster control" (rgmanager)
> >>>>>>>>>>>> I have to put under pacemaker control. Those include
> >>>>>>>>>>>> drbd, clvmd, and gfs2.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The difference between what I've got and what's in
> >>>>>>>>>>>> "Clusters From Scratch" is that in CFS they assign one
> >>>>>>>>>>>> DRBD volume to a single filesystem. I create an LVM
> >>>>>>>>>>>> physical volume on my DRBD resource, as in the above
> >>>>>>>>>>>> tutorial, and so I have to start clvmd or the logical
> >>>>>>>>>>>> volumes in the DRBD partition won't be recognized.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Is there some way to get logical volumes recognized
> >>>>>>>>>>>> automatically by cman without rgmanager that I've missed?
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On 13 March 2012 22:42, William Seligman <[email protected]> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 3/13/12 12:29 PM, William Seligman wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I'm not sure if this is a "Linux-HA" question; please
> >>>>>>>>>>>>>>> direct me to the appropriate list if it's not.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I'm setting up a two-node cman+pacemaker+gfs2
> >>>>>>>>>>>>>>> cluster as described in "Clusters From Scratch."
> >>>>>>>>>>>>>>> Fencing is through forcibly rebooting a node by
> >>>>>>>>>>>>>>> cutting and restoring its power via UPS.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> My fencing/failover tests have revealed a problem.
> >>>>>>>>>>>>>>> If I gracefully turn off one node ("crm node
> >>>>>>>>>>>>>>> standby"; "service pacemaker stop"; "shutdown -r
> >>>>>>>>>>>>>>> now"), all the resources transfer to the other node
> >>>>>>>>>>>>>>> with no problems.
> >>>>>>>>>>>>>>> If I cut power to one node (as would happen if it
> >>>>>>>>>>>>>>> were fenced), the lsb::clvmd resource on the
> >>>>>>>>>>>>>>> remaining node eventually fails. Since all the other
> >>>>>>>>>>>>>>> resources depend on clvmd, all the resources on the
> >>>>>>>>>>>>>>> remaining node stop and the cluster is left with
> >>>>>>>>>>>>>>> nothing running.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I've traced why the lsb::clvmd resource fails: the
> >>>>>>>>>>>>>>> monitor/status command includes "vgdisplay", which
> >>>>>>>>>>>>>>> hangs indefinitely. Therefore the monitor will
> >>>>>>>>>>>>>>> always time out.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> So this isn't a problem with pacemaker, but with
> >>>>>>>>>>>>>>> clvmd/dlm: if a node is cut off, the cluster isn't
> >>>>>>>>>>>>>>> handling it properly. Has anyone on this list seen
> >>>>>>>>>>>>>>> this before? Any ideas?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Details:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> versions:
> >>>>>>>>>>>>>>> Redhat Linux 6.2 (kernel 2.6.32)
> >>>>>>>>>>>>>>> cman-3.0.12.1
> >>>>>>>>>>>>>>> corosync-1.4.1
> >>>>>>>>>>>>>>> pacemaker-1.1.6
> >>>>>>>>>>>>>>> lvm2-2.02.87
> >>>>>>>>>>>>>>> lvm2-cluster-2.02.87
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This may be a Linux-HA question after all!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I ran a few more tests. Here's the output from a
> >>>>>>>>>>>>>> typical test of
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> grep -E "(dlm|gfs2|clvmd|fenc|syslogd)" /var/log/messages
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> <http://pastebin.com/uqC6bc1b>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> It looks like what's happening is that the fence
> >>>>>>>>>>>>>> agent (one I wrote) is not returning the proper
> >>>>>>>>>>>>>> error code when a node crashes.
> >>>>>>>>>>>>>> According to this page, if a fencing agent fails,
> >>>>>>>>>>>>>> GFS2 will freeze to protect the data:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> <http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Global_File_System_2/s1-gfs2hand-allnodes.html>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> As a test, I tried to fence my test node via
> >>>>>>>>>>>>>> standard means:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> stonith_admin -F \
> >>>>>>>>>>>>>>   orestes-corosync.nevis.columbia.edu
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> These were the log messages, which show that
> >>>>>>>>>>>>>> stonith_admin did its job and CMAN was notified of
> >>>>>>>>>>>>>> the fencing: <http://pastebin.com/jaH820Bv>.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Unfortunately, I still got the gfs2 freeze, so this
> >>>>>>>>>>>>>> is not the complete story.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> First things first. I vaguely recall a web page that
> >>>>>>>>>>>>>> went over the STONITH return codes, but I can't
> >>>>>>>>>>>>>> locate it again. Is there any reference for the
> >>>>>>>>>>>>>> return codes expected from a fencing agent, perhaps
> >>>>>>>>>>>>>> as a function of the state of the fencing device?
>
> --
> Bill Seligman             | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://[email protected]
> PO Box 137                |
> Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems

--
esta es mi vida e me la vivo hasta que dios quiera
