Hello William,

For the lvm hang, you can try setting this in your /etc/lvm/lvm.conf:

ignore_suspended_devices = 1

I suggest it because of what I saw in your lvm log:

===============================================
... and then it hangs. Comparing the two, it looks
like it can't close /dev/drbd0.
===============================================
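The option lives in the devices section of lvm.conf. A minimal sketch of the change (only the one line is new; the rest of your existing file stays as it is):

    # /etc/lvm/lvm.conf
    devices {
        # Don't scan device-mapper devices that are suspended; a scan
        # that touches a suspended device can block indefinitely.
        ignore_suspended_devices = 1
    }

Run "vgdisplay" again after the change; if the hang goes away, lvm was blocking on a suspended device.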
On 15 March 2012 23:50, William Seligman <[email protected]> wrote:

> On 3/15/12 6:07 PM, William Seligman wrote:
> > On 3/15/12 6:05 PM, William Seligman wrote:
> >> On 3/15/12 4:57 PM, emmanuel segura wrote:
> >>
> >>> we can try to understand what happens when clvmd hangs
> >>>
> >>> edit /etc/lvm/lvm.conf, change level = 7 in the log section, and
> >>> uncomment this line:
> >>>
> >>> file = "/var/log/lvm2.log"
> >>
> >> Here's the tail end of the file (the original is 1.6M). Because there are
> >> no times in the log, it's hard for me to point you to the point where I
> >> crashed the other system. I think (though I'm not sure) that the crash
> >> happened after the last occurrence of
> >>
> >> cache/lvmcache.c:1484  Wiping internal VG cache
> >>
> >> Honestly, it looks like a wall of text to me. Does it suggest anything
> >> to you?
> >
> > Maybe it would help if I included the link to the pastebin where I put
> > the output: <http://pastebin.com/8pgW3Muw>
>
> Could the problem be with lvm+drbd?
>
> In lvm2.log, I see this sequence of lines pre-crash:
>
> device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:271  /dev/md0: size is 1027968 sectors
> device/dev-io.c:137  /dev/md0: block size is 1024 bytes
> device/dev-io.c:588  Closed /dev/md0
> device/dev-io.c:271  /dev/md0: size is 1027968 sectors
> device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:137  /dev/md0: block size is 1024 bytes
> device/dev-io.c:588  Closed /dev/md0
> filters/filter-composite.c:31  Using /dev/md0
> device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:137  /dev/md0: block size is 1024 bytes
> label/label.c:186  /dev/md0: No label detected
> device/dev-io.c:588  Closed /dev/md0
> device/dev-io.c:535  Opened /dev/drbd0 RO O_DIRECT
> device/dev-io.c:271  /dev/drbd0: size is 5611549368 sectors
> device/dev-io.c:137  /dev/drbd0: block size is 4096 bytes
> device/dev-io.c:588  Closed /dev/drbd0
> device/dev-io.c:271  /dev/drbd0: size is 5611549368 sectors
> device/dev-io.c:535  Opened /dev/drbd0 RO O_DIRECT
> device/dev-io.c:137  /dev/drbd0: block size is 4096 bytes
> device/dev-io.c:588  Closed /dev/drbd0
>
> I interpret this: look at /dev/md0, get some info, close; look at
> /dev/drbd0, get some info, close.
>
> Post-crash, I see:
>
> device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:271  /dev/md0: size is 1027968 sectors
> device/dev-io.c:137  /dev/md0: block size is 1024 bytes
> device/dev-io.c:588  Closed /dev/md0
> device/dev-io.c:271  /dev/md0: size is 1027968 sectors
> device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:137  /dev/md0: block size is 1024 bytes
> device/dev-io.c:588  Closed /dev/md0
> filters/filter-composite.c:31  Using /dev/md0
> device/dev-io.c:535  Opened /dev/md0 RO O_DIRECT
> device/dev-io.c:137  /dev/md0: block size is 1024 bytes
> label/label.c:186  /dev/md0: No label detected
> device/dev-io.c:588  Closed /dev/md0
> device/dev-io.c:535  Opened /dev/drbd0 RO O_DIRECT
> device/dev-io.c:271  /dev/drbd0: size is 5611549368 sectors
> device/dev-io.c:137  /dev/drbd0: block size is 4096 bytes
>
> ... and then it hangs. Comparing the two, it looks like it can't close
> /dev/drbd0.
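If you want to confirm that theory on the hung node, here are two generic checks (standard device-mapper and procps tools, nothing specific to your setup):

    # Any device-mapper device stuck in the SUSPENDED state?
    dmsetup info | grep -E '^(Name|State)'

    # Which processes are in uninterruptible sleep (state D), and in
    # which kernel function are they waiting?
    ps axo pid,stat,wchan:30,cmd | awk '$2 ~ /D/'

If vgdisplay shows up in state D waiting inside drbd or dm, that points at the device layer rather than clvmd itself.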
> If I look at /proc/drbd when I crash one node, I see this:
>
> # cat /proc/drbd
> version: 8.3.12 (api:88/proto:86-96)
> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
> [email protected], 2012-02-28 18:01:34
>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-----
>     ns:7000064 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1
>     wo:b oos:0
>
> If I look at /proc/drbd when I bring down one node gracefully (crm node
> standby), I get this:
>
> # cat /proc/drbd
> version: 8.3.12 (api:88/proto:86-96)
> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
> [email protected], 2012-02-28 18:01:34
>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
>     ns:7000064 nr:40 dw:40 dr:7036496 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1
>     wo:b oos:0
>
> Could it be that drbd can't respond to certain requests from lvm if the
> state of the peer is DUnknown instead of Outdated?
>
> >>> On 15 March 2012 20:50, William Seligman
> >>> <[email protected]> wrote:
> >>>
> >>>> On 3/15/12 12:55 PM, emmanuel segura wrote:
> >>>>
> >>>>> I don't see any error, and the answer to your question is yes.
> >>>>>
> >>>>> Can you show me your /etc/cluster/cluster.conf and your "crm
> >>>>> configure show"? With those, later I can try to look and see if I
> >>>>> can find some fix.
> >>>>
> >>>> Thanks for taking a look.
> >>>>
> >>>> My cluster.conf: <http://pastebin.com/w5XNYyAX>
> >>>> crm configure show: <http://pastebin.com/atVkXjkn>
> >>>>
> >>>> Before you spend a lot of time on the second file, remember that
> >>>> clvmd will hang whether or not I'm running pacemaker.
> >>>>
> >>>>> On 15 March 2012 17:42, William Seligman
> >>>>> <[email protected]> wrote:
> >>>>>
> >>>>>> On 3/15/12 12:15 PM, emmanuel segura wrote:
> >>>>>>
> >>>>>>> How did you create your volume group?
> >>>>>>
> >>>>>> pvcreate /dev/drbd0
> >>>>>> vgcreate -c y ADMIN /dev/drbd0
> >>>>>> lvcreate -L 200G -n usr ADMIN   # ... and so on
> >>>>>> # "Nevis-HA" is the cluster name I used in cluster.conf
> >>>>>> mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr  # ... and so on
> >>>>>>
> >>>>>>> Give me the output of the vgs command when the cluster is up.
> >>>>>>
> >>>>>> Here it is:
> >>>>>>
> >>>>>>     Logging initialised at Thu Mar 15 12:40:39 2012
> >>>>>>     Set umask from 0022 to 0077
> >>>>>>     Finding all volume groups
> >>>>>>     Finding volume group "ROOT"
> >>>>>>     Finding volume group "ADMIN"
> >>>>>>   VG    #PV #LV #SN Attr   VSize   VFree
> >>>>>>   ADMIN   1   5   0 wz--nc   2.61t 765.79g
> >>>>>>   ROOT    1   2   0 wz--n- 117.16g      0
> >>>>>>     Wiping internal VG cache
> >>>>>>
> >>>>>> I assume the "c" in the ADMIN attributes means that clustering is
> >>>>>> turned on?
> >>>>>>
> >>>>>>> On 15 March 2012 17:06, William Seligman
> >>>>>>> <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> On 3/15/12 11:50 AM, emmanuel segura wrote:
> >>>>>>>>
> >>>>>>>>> Yes, William.
> >>>>>>>>>
> >>>>>>>>> Now try clvmd -d and see what happens.
> >>>>>>>>>
> >>>>>>>>> locking_type = 3 is the lvm cluster lock type.
> >>>>>>>>
> >>>>>>>> Since you asked for confirmation, here it is: the output of
> >>>>>>>> 'clvmd -d' just now. <http://pastebin.com/bne8piEw>. I crashed
> >>>>>>>> the other node at Mar 15 12:02:35, when you see the only
> >>>>>>>> additional line of output.
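About the /proc/drbd output above: when the peer dies without being outdated, drbd keeps it at DUnknown until a fence-peer handler tells it otherwise. A sketch of the usual drbd+pacemaker wiring for that; "r0" is a placeholder for your resource name, and the two handler scripts should already be part of your drbd 8.3 installation:

    # /etc/drbd.d/r0.res -- sketch only; "r0" stands in for your resource
    resource r0 {
        disk {
            fencing resource-only;  # outdate the peer instead of leaving it DUnknown
        }
        handlers {
            fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }

With resource-only fencing, a crashed peer should end up Outdated rather than DUnknown, which is exactly the difference between your two /proc/drbd snapshots.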
> >>>>>>>> I don't see any particular difference between this and the
> >>>>>>>> previous result <http://pastebin.com/sWjaxAEF>, which suggests
> >>>>>>>> that I had cluster locking enabled before, and still do now.
> >>>>>>>>
> >>>>>>>>> On 15 March 2012 16:15, William Seligman
> >>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> On 3/15/12 5:18 AM, emmanuel segura wrote:
> >>>>>>>>>>
> >>>>>>>>>>> The first thing I saw in your clvmd log is this:
> >>>>>>>>>>>
> >>>>>>>>>>> =============================================
> >>>>>>>>>>> WARNING: Locking disabled. Be careful! This could corrupt your metadata.
> >>>>>>>>>>> =============================================
> >>>>>>>>>>
> >>>>>>>>>> I saw that too, and thought the same as you did. I did some
> >>>>>>>>>> checks (see below), but some web searches suggest that this
> >>>>>>>>>> message is a normal consequence of clvmd initialization; e.g.,
> >>>>>>>>>>
> >>>>>>>>>> <http://markmail.org/message/vmy53pcv52wu7ghx>
> >>>>>>>>>>
> >>>>>>>>>>> use this command:
> >>>>>>>>>>>
> >>>>>>>>>>> lvmconf --enable-cluster
> >>>>>>>>>>>
> >>>>>>>>>>> and remember, for cman+pacemaker you don't need qdisk
> >>>>>>>>>>
> >>>>>>>>>> Before I tried your lvmconf suggestion, here was my
> >>>>>>>>>> /etc/lvm/lvm.conf: <http://pastebin.com/841VZRzW> and the
> >>>>>>>>>> output of "lvm dumpconfig": <http://pastebin.com/rtw8c3Pf>.
> >>>>>>>>>>
> >>>>>>>>>> Then I did as you suggested, but with a check to see if
> >>>>>>>>>> anything changed:
> >>>>>>>>>>
> >>>>>>>>>> # cd /etc/lvm/
> >>>>>>>>>> # cp lvm.conf lvm.conf.cluster
> >>>>>>>>>> # lvmconf --enable-cluster
> >>>>>>>>>> # diff lvm.conf lvm.conf.cluster
> >>>>>>>>>> #
> >>>>>>>>>>
> >>>>>>>>>> So the key lines have been there all along:
> >>>>>>>>>>
> >>>>>>>>>> locking_type = 3
> >>>>>>>>>> fallback_to_local_locking = 0
> >>>>>>>>>>
> >>>>>>>>>>> On 14 March 2012 23:17, William Seligman
> >>>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> On 3/14/12 9:20 AM, emmanuel segura wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hello William,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I didn't know you were using drbd, and I don't know what
> >>>>>>>>>>>>> type of configuration you are using.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> But it's better if you start clvm with clvmd -d;
> >>>>>>>>>>>>> that way we can see what the problem is.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For what it's worth, here's the output of running clvmd -d
> >>>>>>>>>>>> on the node that stays up: <http://pastebin.com/sWjaxAEF>
> >>>>>>>>>>>>
> >>>>>>>>>>>> What's probably important in that big mass of output are
> >>>>>>>>>>>> the last two lines. Up to that point, I have both nodes up
> >>>>>>>>>>>> and running cman + clvmd; cluster.conf is here:
> >>>>>>>>>>>> <http://pastebin.com/w5XNYyAX>
> >>>>>>>>>>>>
> >>>>>>>>>>>> At the time of the next-to-the-last line, I cut power to
> >>>>>>>>>>>> the other node.
> >>>>>>>>>>>>
> >>>>>>>>>>>> At the time of the last line, I ran "vgdisplay" on the
> >>>>>>>>>>>> remaining node, which hangs forever.
> >>>>>>>>>>>>
> >>>>>>>>>>>> After a lot of web searching, I found that I'm not the only
> >>>>>>>>>>>> one with this problem. Here's one case that doesn't seem
> >>>>>>>>>>>> relevant to me, since I don't use qdisk:
> >>>>>>>>>>>> <http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html>.
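Two quick checks to go with the diff above, run on both nodes: what locking configuration the running lvm binary actually sees, and whether clvmd holds its DLM lockspace (dlm_tool comes with the cman stack):

    # Effective locking configuration as lvm sees it
    lvm dumpconfig | grep -E 'locking_type|fallback_to_local_locking'

    # The clvmd lockspace should appear here while cman + clvmd are up
    dlm_tool ls

If the lockspace is present on the surviving node but vgdisplay still hangs, the DLM is waiting on fencing to complete before it grants new locks.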
> >>>>>>>>>>>> Here's one with the same problem on the same OS:
> >>>>>>>>>>>> <http://bugs.centos.org/view.php?id=5229>, but with no
> >>>>>>>>>>>> resolution.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Out of curiosity, has anyone on this list made a two-node
> >>>>>>>>>>>> cman+clvmd cluster work for them?
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On 14 March 2012 14:02, William Seligman
> >>>>>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 3/14/12 6:02 AM, emmanuel segura wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I think it's better if you make clvmd start at boot:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> chkconfig cman on ; chkconfig clvmd on
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I've already tried it. It doesn't work. The problem is
> >>>>>>>>>>>>>> that my LVM information is on the drbd. If I start up
> >>>>>>>>>>>>>> clvmd before drbd, it won't find the logical volumes.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I also don't see why that would make a difference
> >>>>>>>>>>>>>> (although this could be part of the confusion): a service
> >>>>>>>>>>>>>> is a service. I've tried starting up clvmd inside and
> >>>>>>>>>>>>>> outside pacemaker control, with the same problem. Why
> >>>>>>>>>>>>>> would starting clvmd at boot make a difference?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 13 March 2012 23:29, William Seligman
> >>>>>>>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 3/13/12 5:50 PM, emmanuel segura wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> So if you're using cman, why do you use lsb::clvmd?
> >>>>>>>>>>>>>>>>> I think you are very confused.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I don't dispute that I may be very confused!
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> However, from what I can tell, I still need to run
> >>>>>>>>>>>>>>>> clvmd even if I'm running cman (I'm not using
> >>>>>>>>>>>>>>>> rgmanager). If I just run cman, gfs2 and any other form
> >>>>>>>>>>>>>>>> of mount fails. If I run cman, then clvmd, then gfs2,
> >>>>>>>>>>>>>>>> everything behaves normally.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Going by these instructions:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> <https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> the resources he puts under "cluster control"
> >>>>>>>>>>>>>>>> (rgmanager) I have to put under pacemaker control.
> >>>>>>>>>>>>>>>> Those include drbd, clvmd, and gfs2.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The difference between what I've got and what's in
> >>>>>>>>>>>>>>>> "Clusters From Scratch" is that in CFS they assign one
> >>>>>>>>>>>>>>>> DRBD volume to a single filesystem. I create an LVM
> >>>>>>>>>>>>>>>> physical volume on my DRBD resource, as in the above
> >>>>>>>>>>>>>>>> tutorial, and so I have to start clvmd or the logical
> >>>>>>>>>>>>>>>> volumes in the DRBD partition won't be recognized.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Is there some way to get logical volumes recognized
> >>>>>>>>>>>>>>>> automatically by cman without rgmanager that I've
> >>>>>>>>>>>>>>>> missed?
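For the drbd-before-clvmd ordering under pacemaker, something like this is the usual shape. This is a crm-shell sketch only, and every resource name in it (ms-drbd, clvmd-clone, gfs2-clone) is a placeholder, since your real names are in the pastebin:

    # crm configure -- sketch; all resource names are placeholders
    order drbd-before-clvmd inf: ms-drbd:promote clvmd-clone:start
    order clvmd-before-gfs2 inf: clvmd-clone:start gfs2-clone:start
    colocation clvmd-with-drbd inf: clvmd-clone ms-drbd:Master
    colocation gfs2-with-clvmd inf: gfs2-clone clvmd-clone

That keeps clvmd from starting until drbd is promoted, which is the ordering problem you described with chkconfig.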
> >>>>>>>>>>>>>>>>> On 13 March 2012 22:42, William Seligman
> >>>>>>>>>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On 3/13/12 12:29 PM, William Seligman wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I'm not sure if this is a "Linux-HA" question;
> >>>>>>>>>>>>>>>>>>> please direct me to the appropriate list if it's
> >>>>>>>>>>>>>>>>>>> not.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I'm setting up a two-node cman+pacemaker+gfs2
> >>>>>>>>>>>>>>>>>>> cluster as described in "Clusters From Scratch."
> >>>>>>>>>>>>>>>>>>> Fencing is through forcibly rebooting a node by
> >>>>>>>>>>>>>>>>>>> cutting and restoring its power via UPS.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> My fencing/failover tests have revealed a problem.
> >>>>>>>>>>>>>>>>>>> If I gracefully turn off one node ("crm node
> >>>>>>>>>>>>>>>>>>> standby"; "service pacemaker stop"; "shutdown -r
> >>>>>>>>>>>>>>>>>>> now"), all the resources transfer to the other
> >>>>>>>>>>>>>>>>>>> node with no problems. If I cut power to one node
> >>>>>>>>>>>>>>>>>>> (as would happen if it were fenced), the
> >>>>>>>>>>>>>>>>>>> lsb::clvmd resource on the remaining node
> >>>>>>>>>>>>>>>>>>> eventually fails. Since all the other resources
> >>>>>>>>>>>>>>>>>>> depend on clvmd, all the resources on the
> >>>>>>>>>>>>>>>>>>> remaining node stop and the cluster is left with
> >>>>>>>>>>>>>>>>>>> nothing running.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I've traced why the lsb::clvmd fails: the
> >>>>>>>>>>>>>>>>>>> monitor/status command includes "vgdisplay", which
> >>>>>>>>>>>>>>>>>>> hangs indefinitely. Therefore the monitor will
> >>>>>>>>>>>>>>>>>>> always time out.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> So this isn't a problem with pacemaker, but with
> >>>>>>>>>>>>>>>>>>> clvmd/dlm: if a node is cut off, the cluster isn't
> >>>>>>>>>>>>>>>>>>> handling it properly. Has anyone on this list seen
> >>>>>>>>>>>>>>>>>>> this before? Any ideas?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Details:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> versions:
> >>>>>>>>>>>>>>>>>>> Redhat Linux 6.2 (kernel 2.6.32)
> >>>>>>>>>>>>>>>>>>> cman-3.0.12.1
> >>>>>>>>>>>>>>>>>>> corosync-1.4.1
> >>>>>>>>>>>>>>>>>>> pacemaker-1.1.6
> >>>>>>>>>>>>>>>>>>> lvm2-2.02.87
> >>>>>>>>>>>>>>>>>>> lvm2-cluster-2.02.87
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> This may be a Linux-HA question after all!
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I ran a few more tests. Here's the output from a
> >>>>>>>>>>>>>>>>>> typical test of
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> grep -E "(dlm|gfs2|clvmd|fenc|syslogd)" /var/log/messages
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> <http://pastebin.com/uqC6bc1b>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> It looks like what's happening is that the fence
> >>>>>>>>>>>>>>>>>> agent (one I wrote) is not returning the proper
> >>>>>>>>>>>>>>>>>> error code when a node crashes.
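On the return codes just mentioned: as I understand the cluster fence-agent convention, the agent reads key=value options on stdin and must exit 0 only when the action is confirmed to have succeeded; for "status", exit 0 conventionally means the port is on and 2 means it is off. A minimal sketch; the ups_* helpers are hypothetical stand-ins for whatever your agent really calls:

    #!/bin/sh
    # Sketch of the fence-agent exit-code contract. The ups_* functions
    # are placeholders for the real UPS commands.
    ups_power_off()     { false; }  # cut power to the node's outlet
    ups_power_on()      { false; }  # restore power
    ups_outlet_is_off() { false; }  # succeed only if the outlet is verifiably off

    action=reboot
    while read -r line; do          # options arrive as key=value on stdin
        case "$line" in
            action=*|option=*) action=${line#*=} ;;
        esac
    done

    case "$action" in
        off)    ups_power_off     || exit 1
                ups_outlet_is_off || exit 1 ;;   # never report unverified success
        on)     ups_power_on      || exit 1 ;;
        reboot) ups_power_off && ups_outlet_is_off || exit 1
                ups_power_on      || exit 1 ;;
        status) if ups_outlet_is_off; then exit 2; else exit 0; fi ;;
        *)      exit 1 ;;
    esac
    exit 0

The critical part for your symptom is the verification step: if the agent exits 0 before the power is really off, dlm and gfs2 believe fencing succeeded when it hasn't, and the survivor freezes.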
> >>>>>>>>>>>>>>>>>> According to this page, if a fencing agent fails,
> >>>>>>>>>>>>>>>>>> GFS2 will freeze to protect the data:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> <http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Global_File_System_2/s1-gfs2hand-allnodes.html>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> As a test, I tried to fence my test node via
> >>>>>>>>>>>>>>>>>> standard means:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> stonith_admin -F orestes-corosync.nevis.columbia.edu
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> These were the log messages, which show that
> >>>>>>>>>>>>>>>>>> stonith_admin did its job and CMAN was notified of
> >>>>>>>>>>>>>>>>>> the fencing: <http://pastebin.com/jaH820Bv>.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Unfortunately, I still got the gfs2 freeze, so this
> >>>>>>>>>>>>>>>>>> is not the complete story.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> First things first. I vaguely recall a web page
> >>>>>>>>>>>>>>>>>> that went over the STONITH return codes, but I
> >>>>>>>>>>>>>>>>>> can't locate it again. Is there any reference to
> >>>>>>>>>>>>>>>>>> the return codes expected from a fencing agent,
> >>>>>>>>>>>>>>>>>> perhaps as a function of the state of the fencing
> >>>>>>>>>>>>>>>>>> device?
>
> --
> Bill Seligman             | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://[email protected]
> PO Box 137                |
> Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/

--
this is my life, and I live it as long as God wills

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
