On 3/15/12 6:05 PM, William Seligman wrote:
> On 3/15/12 4:57 PM, emmanuel segura wrote:
>> we can try to understand what happens when clvmd hangs
>> edit /etc/lvm/lvm.conf, change level = 7 in the log section, and uncomment this line:
>> file = "/var/log/lvm2.log"
> Here's the tail end of the file (the original is 1.6M). Because there are no timestamps in the log, it's hard for me to point you to the moment when I crashed the other system. I think (though I'm not sure) that the crash happened after the last occurrence of
> cache/lvmcache.c:1484 Wiping internal VG cache
> Honestly, it looks like a wall of text to me. Does it suggest anything to you?
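(For anyone following along: the change being suggested lives in the log section of /etc/lvm/lvm.conf. A minimal sketch of that section with debugging turned on, assuming the stock RHEL 6 file with everything else left at its defaults:)

    log {
        # write debug output to a file (in addition to whatever goes to syslog)
        file = "/var/log/lvm2.log"

        # 7 is the most verbose (debug) level
        level = 7
    }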
Maybe it would help if I included the link to the pastebin where I put the output: <http://pastebin.com/8pgW3Muw>

>> On 15 March 2012 20:50, William Seligman <[email protected]> wrote:
>>> On 3/15/12 12:55 PM, emmanuel segura wrote:
>>>> I don't see any error, and the answer to your question is yes.
>>>> Can you show me your /etc/cluster/cluster.conf and your crm configure show? That way, later I can try to see if I can find a fix.
>>> Thanks for taking a look.
>>> My cluster.conf: <http://pastebin.com/w5XNYyAX>
>>> crm configure show: <http://pastebin.com/atVkXjkn>
>>> Before you spend a lot of time on the second file, remember that clvmd will hang whether or not I'm running pacemaker.
>>>> On 15 March 2012 17:42, William Seligman <[email protected]> wrote:
>>>>> On 3/15/12 12:15 PM, emmanuel segura wrote:
>>>>>> How did you create your volume group?
>>>>> pvcreate /dev/drbd0
>>>>> vgcreate -c y ADMIN /dev/drbd0
>>>>> lvcreate -L 200G -n usr ADMIN # ... and so on
>>>>> # "Nevis-HA" is the cluster name I used in cluster.conf
>>>>> mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr # ... and so on
>>>>>> Give me the output of the vgs command when the cluster is up.
>>>>> Here it is:
>>>>> Logging initialised at Thu Mar 15 12:40:39 2012
>>>>> Set umask from 0022 to 0077
>>>>> Finding all volume groups
>>>>> Finding volume group "ROOT"
>>>>> Finding volume group "ADMIN"
>>>>> VG    #PV #LV #SN Attr   VSize   VFree
>>>>> ADMIN   1   5   0 wz--nc   2.61t 765.79g
>>>>> ROOT    1   2   0 wz--n- 117.16g       0
>>>>> Wiping internal VG cache
>>>>> I assume the "c" in the ADMIN attributes means that clustering is turned on?
>>>>>> On 15 March 2012 17:06, William Seligman <[email protected]> wrote:
>>>>>>> On 3/15/12 11:50 AM, emmanuel segura wrote:
>>>>>>>> Yes, William.
>>>>>>>> Now try clvmd -d and see what happens.
>>>>>>>> locking_type = 3 is the LVM cluster locking type.
>>>>>>> Since you asked for confirmation, here it is: the output of 'clvmd -d' just now. <http://pastebin.com/bne8piEw>. I crashed the other node at Mar 15 12:02:35, which is when you see the only additional line of output.
>>>>>>> I don't see any particular difference between this and the previous result <http://pastebin.com/sWjaxAEF>, which suggests that I had cluster locking enabled before, and still do now.
>>>>>>>> On 15 March 2012 16:15, William Seligman <[email protected]> wrote:
>>>>>>>>> On 3/15/12 5:18 AM, emmanuel segura wrote:
>>>>>>>>>> The first thing I saw in your clvmd log is this:
>>>>>>>>>> =============================================
>>>>>>>>>> WARNING: Locking disabled. Be careful! This could corrupt your metadata.
>>>>>>>>>> =============================================
>>>>>>>>> I saw that too, and thought the same as you did.
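(A quick way to double-check that cluster locking really is in effect, rather than going by that startup warning: look at the locking type the tools actually load and at the clustered bit on the volume group. A sketch, assuming the stock RHEL 6 lvm2 tools; the "c" is the sixth character of the Attr column:)

    lvm dumpconfig | grep locking_type    # should show locking_type=3
    vgs -o vg_name,vg_attr ADMIN          # Attr ending in "c" means the VG is clustered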
>>>>>>>>> I did some checks (see below), but some web searches suggest that this message is a normal consequence of clvmd initialization; e.g.,
>>>>>>>>> <http://markmail.org/message/vmy53pcv52wu7ghx>
>>>>>>>>>> use this command:
>>>>>>>>>> lvmconf --enable-cluster
>>>>>>>>>> and remember, for cman+pacemaker you don't need qdisk
>>>>>>>>> Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf: <http://pastebin.com/841VZRzW> and the output of "lvm dumpconfig": <http://pastebin.com/rtw8c3Pf>.
>>>>>>>>> Then I did as you suggested, but with a check to see if anything changed:
>>>>>>>>> # cd /etc/lvm/
>>>>>>>>> # cp lvm.conf lvm.conf.cluster
>>>>>>>>> # lvmconf --enable-cluster
>>>>>>>>> # diff lvm.conf lvm.conf.cluster
>>>>>>>>> #
>>>>>>>>> So the key lines have been there all along:
>>>>>>>>> locking_type = 3
>>>>>>>>> fallback_to_local_locking = 0
>>>>>>>>>> On 14 March 2012 23:17, William Seligman <[email protected]> wrote:
>>>>>>>>>>> On 3/14/12 9:20 AM, emmanuel segura wrote:
>>>>>>>>>>>> Hello William
>>>>>>>>>>>> I didn't know you were using drbd, and I don't know what type of configuration you're using.
>>>>>>>>>>>> But it's better if you try to start clvmd with clvmd -d; that way we can see what the problem is.
>>>>>>>>>>> For what it's worth, here's the output of running clvmd -d on the node that stays up: <http://pastebin.com/sWjaxAEF>
>>>>>>>>>>> What's probably important in that big mass of output are the last two lines. Up to that point, I have both nodes up and running cman + clvmd; cluster.conf is here: <http://pastebin.com/w5XNYyAX>
>>>>>>>>>>> At the time of the next-to-last line, I cut power to the other node.
>>>>>>>>>>> At the time of the last line, I ran "vgdisplay" on the remaining node, which hangs forever.
>>>>>>>>>>> After a lot of web searching, I found that I'm not the only one with this problem. Here's one case that doesn't seem relevant to me, since I don't use qdisk: <http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html>. Here's one with the same problem on the same OS: <http://bugs.centos.org/view.php?id=5229>, but with no resolution.
>>>>>>>>>>> Out of curiosity, has anyone on this list made a two-node cman+clvmd cluster work for them?
>>>>>>>>>>>> On 14 March 2012 14:02, William Seligman <[email protected]> wrote:
>>>>>>>>>>>>> On 3/14/12 6:02 AM, emmanuel segura wrote:
>>>>>>>>>>>>>> I think it's better if you make clvmd start at boot:
>>>>>>>>>>>>>> chkconfig cman on ; chkconfig clvmd on
>>>>>>>>>>>>> I've already tried it. It doesn't work. The problem is that my LVM information is on the drbd. If I start up clvmd before drbd, it won't find the logical volumes.
>>>>>>>>>>>>> I also don't see why that would make a difference (although this could be part of the confusion): a service is a service.
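(When vgdisplay hangs on the surviving node right after the other node loses power, the usual things to check on RHEL 6 are whether the fencing operation ever completed and whether the clvmd DLM lockspace is stuck waiting on it. A sketch of where to look, assuming the cman/fenced/dlm tools that ship with the distribution:)

    fence_tool ls    # fence domain state; a fence that never finished shows up in the wait state
    dlm_tool ls      # DLM lockspaces (clvmd uses one); look for lockspaces stuck in recovery
    group_tool ls    # combined fence/dlm/gfs group state

If fencing never succeeds, DLM recovery never finishes, and anything that needs a cluster lock (vgdisplay, clvmd, GFS2) blocks until it does.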
>>>>>>>>>>>>> I've tried starting up clvmd inside and outside pacemaker control, with the same problem. Why would starting clvmd at boot make a difference?
>>>>>>>>>>>>>> On 13 March 2012 23:29, William Seligman <[email protected]> wrote:
>>>>>>>>>>>>>>> On 3/13/12 5:50 PM, emmanuel segura wrote:
>>>>>>>>>>>>>>>> So if you're using cman, why do you use lsb::clvmd?
>>>>>>>>>>>>>>>> I think you are very confused.
>>>>>>>>>>>>>>> I don't dispute that I may be very confused!
>>>>>>>>>>>>>>> However, from what I can tell, I still need to run clvmd even if I'm running cman (I'm not using rgmanager). If I just run cman, gfs2 and any other form of mount fails. If I run cman, then clvmd, then gfs2, everything behaves normally.
>>>>>>>>>>>>>>> Going by these instructions:
>>>>>>>>>>>>>>> <https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial>
>>>>>>>>>>>>>>> the resources he puts under "cluster control" (rgmanager) I have to put under pacemaker control. Those include drbd, clvmd, and gfs2.
>>>>>>>>>>>>>>> The difference between what I've got and what's in "Clusters From Scratch" is that in CFS they assign one DRBD volume to a single filesystem. I create an LVM physical volume on my DRBD resource, as in the above tutorial, and so I have to start clvmd or the logical volumes in the DRBD partition won't be recognized.
>>>>>>>>>>>>>>> Is there some way to get logical volumes recognized automatically by cman without rgmanager that I've missed?
>>>>>>>>>>>>>>>> On 13 March 2012 22:42, William Seligman <[email protected]> wrote:
>>>>>>>>>>>>>>>>> On 3/13/12 12:29 PM, William Seligman wrote:
>>>>>>>>>>>>>>>>>> I'm not sure if this is a "Linux-HA" question; please direct me to the appropriate list if it's not.
>>>>>>>>>>>>>>>>>> I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in "Clusters From Scratch." Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS.
>>>>>>>>>>>>>>>>>> My fencing/failover tests have revealed a problem. If I gracefully turn off one node ("crm node standby"; "service pacemaker stop"; "shutdown -r now"), all the resources transfer to the other node with no problems. If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource on the remaining node eventually fails. Since all the other resources depend on clvmd, all the resources on the remaining node stop and the cluster is left with nothing running.
>>>>>>>>>>>>>>>>>> I've traced why the lsb::clvmd resource fails: the monitor/status command includes "vgdisplay", which hangs indefinitely.
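(For reference, a clvmd primitive with an explicit monitor timeout looks roughly like this in crm syntax; the resource name and the numbers are assumptions rather than values from this cluster, and a longer timeout only changes how long pacemaker waits before declaring the monitor failed, it does not address the hang itself:)

    primitive p_clvmd lsb:clvmd \
            op monitor interval="30s" timeout="90s" \
            op start timeout="90s" \
            op stop timeout="100s"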
>>>>>>>>>>>>>>>>>> Therefore the monitor will always time out.
>>>>>>>>>>>>>>>>>> So this isn't a problem with pacemaker, but with clvmd/dlm: if a node is cut off, the cluster isn't handling it properly. Has anyone on this list seen this before? Any ideas?
>>>>>>>>>>>>>>>>>> Details:
>>>>>>>>>>>>>>>>>> versions:
>>>>>>>>>>>>>>>>>> Redhat Linux 6.2 (kernel 2.6.32)
>>>>>>>>>>>>>>>>>> cman-3.0.12.1
>>>>>>>>>>>>>>>>>> corosync-1.4.1
>>>>>>>>>>>>>>>>>> pacemaker-1.1.6
>>>>>>>>>>>>>>>>>> lvm2-2.02.87
>>>>>>>>>>>>>>>>>> lvm2-cluster-2.02.87
>>>>>>>>>>>>>>>>> This may be a Linux-HA question after all!
>>>>>>>>>>>>>>>>> I ran a few more tests. Here's the output from a typical test of
>>>>>>>>>>>>>>>>> grep -E "(dlm|gfs2|clvmd|fenc|syslogd)" /var/log/messages
>>>>>>>>>>>>>>>>> <http://pastebin.com/uqC6bc1b>
>>>>>>>>>>>>>>>>> It looks like what's happening is that the fence agent (one I wrote) is not returning the proper error code when a node crashes. According to this page, if a fencing agent fails, GFS2 will freeze to protect the data:
>>>>>>>>>>>>>>>>> <http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Global_File_System_2/s1-gfs2hand-allnodes.html>
>>>>>>>>>>>>>>>>> As a test, I tried to fence my test node via standard means:
>>>>>>>>>>>>>>>>> stonith_admin -F orestes-corosync.nevis.columbia.edu
>>>>>>>>>>>>>>>>> These were the log messages, which show that stonith_admin did its job and CMAN was notified of the fencing: <http://pastebin.com/jaH820Bv>.
>>>>>>>>>>>>>>>>> Unfortunately, I still got the gfs2 freeze, so this is not the complete story.
>>>>>>>>>>>>>>>>> First things first. I vaguely recall a web page that went over the STONITH return codes, but I can't locate it again. Is there any reference to the return codes expected from a fencing agent, perhaps as a function of the state of the fencing device?
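(On the return-code question: as far as I know, the convention on the cman/fenced side, the FenceAgentAPI, is that the agent reads its options as key=value lines on stdin and exits 0 only when the node is confirmed down; any non-zero exit counts as a failed fence, which is exactly the situation where GFS2 stays frozen. A rough skeleton, with every name below invented for illustration:)

    #!/bin/bash
    # Hypothetical fence-agent skeleton. fenced passes options on stdin,
    # one key=value per line, e.g.:
    #   agent=fence_nevis_ups
    #   nodename=orestes-corosync.nevis.columbia.edu
    #   action=off

    power_is_off() {
        # Stand-in for a real query against the UPS; replace with the actual check.
        return 1
    }

    action="reboot"
    while read -r line; do
        case "$line" in
            action=*)   action="${line#action=}" ;;
            nodename=*) node="${line#nodename=}" ;;
        esac
    done

    # ... switch the UPS outlet for "$node" here, then verify the node is really off ...
    if power_is_off "$node"; then
        exit 0    # success: fenced reports the fence done and DLM/GFS2 recovery proceeds
    else
        exit 1    # failure: fenced keeps retrying and GFS2 stays frozen
    fi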
--
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://[email protected]
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
