On 3/16/12 7:02 AM, Andreas Kurz wrote:
> On 03/15/2012 11:50 PM, William Seligman wrote:
>> On 3/15/12 6:07 PM, William Seligman wrote:
>>> On 3/15/12 6:05 PM, William Seligman wrote:
>>>> On 3/15/12 4:57 PM, emmanuel segura wrote:
>>>>
>>>>> We can try to understand what happens when clvm hangs.
>>>>>
>>>>> Edit /etc/lvm/lvm.conf, change level = 7 in the log section, and
>>>>> uncomment this line:
>>>>>
>>>>> file = "/var/log/lvm2.log"
>>>>
>>>> Here's the tail end of the file (the original is 1.6M). Because there are
>>>> no timestamps in the log, it's hard for me to point you to the moment
>>>> when I crashed the other system. I think (though I'm not sure) that the
>>>> crash happened after the last occurrence of
>>>>
>>>> cache/lvmcache.c:1484 Wiping internal VG cache
>>>>
>>>> Honestly, it looks like a wall of text to me. Does it suggest anything
>>>> to you?
>>>
>>> Maybe it would help if I included the link to the pastebin where I put
>>> the output: <http://pastebin.com/8pgW3Muw>
>>
>> Could the problem be with lvm+drbd?
>>
>> In lvm2.log, I see this sequence of lines pre-crash:
>>
>> device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:271 /dev/md0: size is 1027968 sectors
>> device/dev-io.c:137 /dev/md0: block size is 1024 bytes
>> device/dev-io.c:588 Closed /dev/md0
>> device/dev-io.c:271 /dev/md0: size is 1027968 sectors
>> device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:137 /dev/md0: block size is 1024 bytes
>> device/dev-io.c:588 Closed /dev/md0
>> filters/filter-composite.c:31 Using /dev/md0
>> device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:137 /dev/md0: block size is 1024 bytes
>> label/label.c:186 /dev/md0: No label detected
>> device/dev-io.c:588 Closed /dev/md0
>> device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT
>> device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors
>> device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes
>> device/dev-io.c:588 Closed /dev/drbd0
>> device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors
>> device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT
>> device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes
>> device/dev-io.c:588 Closed /dev/drbd0
>>
>> I interpret this as: look at /dev/md0, get some info, close; look at
>> /dev/drbd0, get some info, close.
>>
>> Post-crash, I see:
>>
>> device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:271 /dev/md0: size is 1027968 sectors
>> device/dev-io.c:137 /dev/md0: block size is 1024 bytes
>> device/dev-io.c:588 Closed /dev/md0
>> device/dev-io.c:271 /dev/md0: size is 1027968 sectors
>> device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:137 /dev/md0: block size is 1024 bytes
>> device/dev-io.c:588 Closed /dev/md0
>> filters/filter-composite.c:31 Using /dev/md0
>> device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT
>> device/dev-io.c:137 /dev/md0: block size is 1024 bytes
>> label/label.c:186 /dev/md0: No label detected
>> device/dev-io.c:588 Closed /dev/md0
>> device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT
>> device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors
>> device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes
>>
>> ... and then it hangs. Comparing the two, it looks like it can't close
>> /dev/drbd0.
>>
>> If I look at /proc/drbd when I crash one node, I see this:
>>
>> # cat /proc/drbd
>> version: 8.3.12 (api:88/proto:86-96)
>> GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
>> [email protected], 2012-02-28 18:01:34
>>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-----
>>     ns:7000064 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1
>>     wo:b oos:0
>
> s----- ... DRBD suspended I/O, most likely because of its fencing
> policy. For valid dual-primary setups you have to use the
> "resource-and-stonith" policy and a working "fence-peer" handler. In
> this mode I/O is suspended until fencing of the peer was successful.
> The question is why the peer does _not_ also suspend its I/O, because
> obviously fencing was not successful ...
>
> So with a correct DRBD configuration one of your nodes should already
> have been fenced because of the connection loss between the nodes (on
> the drbd replication link).
>
> You can use e.g. this nice fencing script:
>
> http://goo.gl/O4N8f
This is the output of "drbdadm dump admin": <http://pastebin.com/kTxvHCtx>

So I've got resource-and-stonith. I gather from an earlier thread that
obliterate-peer.sh is more-or-less equivalent in functionality to
stonith_admin_fence_peer.sh:

<http://www.gossamer-threads.com/lists/linuxha/users/78504#78504>

At the moment I'm pursuing the possibility that I'm returning the wrong
return codes from my fencing agent:

<http://www.gossamer-threads.com/lists/linuxha/users/78572>

After that, I'll look at another suggestion about lvm.conf:

<http://www.gossamer-threads.com/lists/linuxha/users/78796#78796>

Then I'll try DRBD 8.4.1. Hopefully one of these is the source of the issue.

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://[email protected]
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
