Hello William, how did you create your volume group?

Give me the output of the vgs command when the cluster is up.
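For reference, a quick way to check how a volume group was created is the vgs attribute column: a 'c' in the sixth position of Attr means the VG is clustered (created with vgcreate -cy). These are generic commands, not taken from this thread, and "your_vg" is a placeholder:

# vgs -o vg_name,vg_attr,vg_size
# vgchange -cy your_vg     (only if the VG turns out not to be clustered; cman and clvmd should be running)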
On 15 March 2012 17:06, William Seligman <[email protected]> wrote:

> On 3/15/12 11:50 AM, emmanuel segura wrote:
> > yes william
> >
> > Now try clvmd -d and see what happens
> >
> > locking_type = 3 is the LVM cluster locking type
>
> Since you asked for confirmation, here it is: the output of 'clvmd -d' just now: <http://pastebin.com/bne8piEw>. I crashed the other node at Mar 15 12:02:35, when you see the only additional line of output.
>
> I don't see any particular difference between this and the previous result <http://pastebin.com/sWjaxAEF>, which suggests that I had cluster locking enabled before, and still do now.
>
> > On 15 March 2012 16:15, William Seligman <[email protected]> wrote:
> >
> >> On 3/15/12 5:18 AM, emmanuel segura wrote:
> >>
> >>> The first thing I see in your clvmd log is this:
> >>>
> >>> =============================================
> >>> WARNING: Locking disabled. Be careful! This could corrupt your metadata.
> >>> =============================================
> >>
> >> I saw that too, and thought the same as you did. I did some checks (see below), but some web searches suggest that this message is a normal consequence of clvmd initialization; e.g.,
> >>
> >> <http://markmail.org/message/vmy53pcv52wu7ghx>
> >>
> >>> use this command
> >>>
> >>> lvmconf --enable-cluster
> >>>
> >>> and remember, for cman+pacemaker you don't need qdisk
> >>
> >> Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf: <http://pastebin.com/841VZRzW> and the output of "lvm dumpconfig": <http://pastebin.com/rtw8c3Pf>.
> >>
> >> Then I did as you suggested, but with a check to see if anything changed:
> >>
> >> # cd /etc/lvm/
> >> # cp lvm.conf lvm.conf.cluster
> >> # lvmconf --enable-cluster
> >> # diff lvm.conf lvm.conf.cluster
> >> #
> >>
> >> So the key lines have been there all along:
> >> locking_type = 3
> >> fallback_to_local_locking = 0
> >>
> >>> On 14 March 2012 23:17, William Seligman <[email protected]> wrote:
> >>>
> >>>> On 3/14/12 9:20 AM, emmanuel segura wrote:
> >>>>
> >>>>> Hello William
> >>>>>
> >>>>> I didn't know you were using drbd, and I don't know what type of configuration you are using.
> >>>>>
> >>>>> But it's better if you try to start clvmd with clvmd -d
> >>>>>
> >>>>> like that we can see what the problem is
> >>>>
> >>>> For what it's worth, here's the output of running clvmd -d on the node that stays up: <http://pastebin.com/sWjaxAEF>
> >>>>
> >>>> What's probably important in that big mass of output are the last two lines. Up to that point, I have both nodes up and running cman + clvmd; cluster.conf is here: <http://pastebin.com/w5XNYyAX>
> >>>>
> >>>> At the time of the next-to-the-last line, I cut power to the other node.
> >>>>
> >>>> At the time of the last line, I run "vgdisplay" on the remaining node, which hangs forever.
> >>>>
> >>>> After a lot of web searching, I found that I'm not the only one with this problem. Here's one case that doesn't seem relevant to me, since I don't use qdisk: <http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html>. Here's one with the same problem with the same OS: <http://bugs.centos.org/view.php?id=5229>, but with no resolution.
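For anyone following along: the two settings quoted above, locking_type = 3 and fallback_to_local_locking = 0, are what "lvmconf --enable-cluster" is supposed to set, and they live in the global section of /etc/lvm/lvm.conf. A purely illustrative fragment (William's real file is in the pastebin links above):

global {
    # 3 = built-in clustered locking through clvmd; 1 = local file-based locking
    locking_type = 3
    # do not silently fall back to local locking if clvmd is unreachable
    fallback_to_local_locking = 0
}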
> >>>>
> >>>> Out of curiosity, has anyone on this list made a two-node cman+clvmd cluster work for them?
> >>>>
> >>>>> On 14 March 2012 14:02, William Seligman <[email protected]> wrote:
> >>>>>
> >>>>>> On 3/14/12 6:02 AM, emmanuel segura wrote:
> >>>>>>
> >>>>>>> I think it's better to make clvmd start at boot:
> >>>>>>>
> >>>>>>> chkconfig cman on ; chkconfig clvmd on
> >>>>>>
> >>>>>> I've already tried it. It doesn't work. The problem is that my LVM information is on the drbd. If I start up clvmd before drbd, it won't find the logical volumes.
> >>>>>>
> >>>>>> I also don't see why that would make a difference (although this could be part of the confusion): a service is a service. I've tried starting up clvmd inside and outside pacemaker control, with the same problem. Why would starting clvmd at boot make a difference?
> >>>>>>
> >>>>>>> On 13 March 2012 23:29, William Seligman <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> On 3/13/12 5:50 PM, emmanuel segura wrote:
> >>>>>>>>
> >>>>>>>>> So if you're using cman, why do you use lsb::clvmd?
> >>>>>>>>>
> >>>>>>>>> I think you are very confused
> >>>>>>>>
> >>>>>>>> I don't dispute that I may be very confused!
> >>>>>>>>
> >>>>>>>> However, from what I can tell, I still need to run clvmd even if I'm running cman (I'm not using rgmanager). If I just run cman, gfs2 and any other form of mount fails. If I run cman, then clvmd, then gfs2, everything behaves normally.
> >>>>>>>>
> >>>>>>>> Going by these instructions:
> >>>>>>>>
> >>>>>>>> <https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial>
> >>>>>>>>
> >>>>>>>> the resources he puts under "cluster control" (rgmanager) I have to put under pacemaker control. Those include drbd, clvmd, and gfs2.
> >>>>>>>>
> >>>>>>>> The difference between what I've got and what's in "Clusters From Scratch" is that in CFS they assign one DRBD volume to a single filesystem. I create an LVM physical volume on my DRBD resource, as in the above tutorial, and so I have to start clvmd or the logical volumes in the DRBD partition won't be recognized.
> >>>>>>>>
> >>>>>>>> Is there some way to get logical volumes recognized automatically by cman without rgmanager that I've missed?
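Since the recurring point here is that clvmd has to start after DRBD is promoted and before the GFS2 mount, under pacemaker that ordering is usually expressed with clone resources plus order/colocation constraints rather than init-script ordering. A rough "crm configure" sketch; the resource names ms_drbd, cl_clvmd and cl_gfs2 are hypothetical and not taken from William's cluster:

# illustrative constraints only -- the resource names are made up
order o_drbd_before_clvmd inf: ms_drbd:promote cl_clvmd:start
colocation c_clvmd_on_drbd inf: cl_clvmd ms_drbd:Master
order o_clvmd_before_gfs2 inf: cl_clvmd cl_gfs2
colocation c_gfs2_with_clvmd inf: cl_gfs2 cl_clvmd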
> >>>>>>>>>
> >>>>>>>>> On 13 March 2012 22:42, William Seligman <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> On 3/13/12 12:29 PM, William Seligman wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I'm not sure if this is a "Linux-HA" question; please direct me to the appropriate list if it's not.
> >>>>>>>>>>>
> >>>>>>>>>>> I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in "Clusters From Scratch." Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS.
> >>>>>>>>>>>
> >>>>>>>>>>> My fencing/failover tests have revealed a problem. If I gracefully turn off one node ("crm node standby"; "service pacemaker stop"; "shutdown -r now"), all the resources transfer to the other node with no problems.
> >>>>>>>>>>>
> >>>>>>>>>>> If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource on the remaining node eventually fails. Since all the other resources depend on clvmd, all the resources on the remaining node stop and the cluster is left with nothing running.
> >>>>>>>>>>>
> >>>>>>>>>>> I've traced why the lsb::clvmd resource fails: the monitor/status command includes "vgdisplay", which hangs indefinitely. Therefore the monitor will always time out.
> >>>>>>>>>>>
> >>>>>>>>>>> So this isn't a problem with pacemaker, but with clvmd/dlm: if a node is cut off, the cluster isn't handling it properly. Has anyone on this list seen this before? Any ideas?
> >>>>>>>>>>>
> >>>>>>>>>>> Details:
> >>>>>>>>>>>
> >>>>>>>>>>> versions:
> >>>>>>>>>>> Redhat Linux 6.2 (kernel 2.6.32)
> >>>>>>>>>>> cman-3.0.12.1
> >>>>>>>>>>> corosync-1.4.1
> >>>>>>>>>>> pacemaker-1.1.6
> >>>>>>>>>>> lvm2-2.02.87
> >>>>>>>>>>> lvm2-cluster-2.02.87
> >>>>>>>>>>
> >>>>>>>>>> This may be a Linux-HA question after all!
> >>>>>>>>>>
> >>>>>>>>>> I ran a few more tests. Here's the output from a typical test of
> >>>>>>>>>>
> >>>>>>>>>> grep -E "(dlm|gfs2|clvmd|fenc|syslogd)" /var/log/messages
> >>>>>>>>>>
> >>>>>>>>>> <http://pastebin.com/uqC6bc1b>
> >>>>>>>>>>
> >>>>>>>>>> It looks like what's happening is that the fence agent (one I wrote) is not returning the proper error code when a node crashes. According to this page, if a fencing agent fails, GFS2 will freeze to protect the data:
> >>>>>>>>>>
> >>>>>>>>>> <http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Global_File_System_2/s1-gfs2hand-allnodes.html>
> >>>>>>>>>>
> >>>>>>>>>> As a test, I tried to fence my test node via standard means:
> >>>>>>>>>>
> >>>>>>>>>> stonith_admin -F orestes-corosync.nevis.columbia.edu
> >>>>>>>>>>
> >>>>>>>>>> These were the log messages, which show that stonith_admin did its job and CMAN was notified of the fencing: <http://pastebin.com/jaH820Bv>.
> >>>>>>>>>>
> >>>>>>>>>> Unfortunately, I still got the gfs2 freeze, so this is not the complete story.
> >>>>>>>>>>
> >>>>>>>>>> First things first. I vaguely recall a web page that went over the STONITH return codes, but I can't locate it again. Is there any reference to the return codes expected from a fencing agent, perhaps as a function of the state of the fencing device?
>
> --
> Bill Seligman | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://[email protected]
> PO Box 137 |
> Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/

--
esta es mi vida e me la vivo hasta que dios quiera

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
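On the closing question about fence-agent return codes: as far as I know, the cman/fenced convention (the FenceAgentAPI) is that the agent reads key=value options on stdin and exits 0 only when the fence action has been carried out and confirmed; anything else must be a nonzero exit, otherwise DLM/GFS2 will resume as if the node had really been fenced. A rough sketch only, not the agent from this thread; cut_ups_power and ups_confirms_off are hypothetical helpers for a UPS-based fence:

#!/bin/sh
# minimal fence-agent sketch: exit-code contract only
while read opt; do
    case "$opt" in
        action=*)   action=${opt#*=} ;;
        nodename=*) node=${opt#*=} ;;
    esac
done

case "$action" in
    off)
        cut_ups_power "$node"    || exit 1   # command to the UPS failed
        ups_confirms_off "$node" || exit 1   # power-off could not be confirmed
        exit 0                               # 0 only when the node is verifiably down
        ;;
    *)
        exit 1                               # on/reboot/status/monitor omitted in this sketch
        ;;
esac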
