Re: [DRBD-user] Digest mismatch resulting in split brain after (!) automatic reconnect
On 02/21/2011 02:18 PM, Lars Ellenberg wrote:
>> Feb 16 06:25:04 c02n01 kernel: [3687390.947555] block drbd1: pdsk( UpToDate -> DUnknown )
>
> This should not have happened, either: We must not change the pdsk
> state to DUnknown while keeping the conn state at Connected. That's
> nonsense.
>
>> Feb 16 06:25:04 c02n01 kernel: [3687390.947633] block drbd1: new current UUID 89084B22FE454C03:3C1DADF6B38C1AD7:E7E50184F3F3AC0B:E7E40184F3F3AC0B
>>
>> please let me know if you need any further input from my side.
>
> Only if it is easily reproducible, and if so, how. Sorry, if you wrote
> that somewhere already, I missed it. Just write it again.

I tried to reproduce the problem by rapidly dropping and restoring a
469 MB .sql dump file. Unfortunately, this did not work. I'll retry
with a bigger dump file in the next few days.

One other thing: I downgraded most of our servers to drbd 8.3.7 and
Linux 2.6.32-bpo.5-amd64 (debian bpo). With this setup, I still see
some "Digest integrity check FAILED." messages, but the resync now
works without any problem. I have one production cluster where I still
did not manage to downgrade both nodes.
My current setup is:

  wc01: master with drbd 8.3.10, kernel 2.6.27.57+ipax (self-compiled)
  wc02: slave with drbd 8.3.7, kernel 2.6.32-bpo.5-amd64 (debian bpo)

In this setup, I still see the described issue:

root@wc01 ~ # ssh wc01c cat /proc/drbd ; ssh wc02c cat /proc/drbd

wc01:
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b046982443d4785d90a2c603378f9017b build by r...@k000866c.ipax.at, 2011-02-03 14:58:22
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/DUnknown C r-
    ns:59832072 nr:0 dw:410069420 dr:1031415173 al:3623746 bm:11501 lo:16 pe:0 ua:0 ap:16 ep:1 wo:b oos:11453780
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-
    ns:15565136 nr:8918376 dw:144977376 dr:62777630 al:157147 bm:746 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0

Please note the DUnknown and oos values from drbd0.

wc02:
version: 8.3.7 (api:88/proto:86-91)
srcversion: EE47D8BF18AC166BE219757
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r
    ns:0 nr:31434060 dw:31434060 dr:0 al:0 bm:25 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r
    ns:0 nr:15485480 dw:15485480 dr:0 al:0 bm:24 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

wc02 thinks that everything is fine. I don't know if this is of any
help for you; you can ignore it in case it does not matter. I can
provide the logfiles too.

Thanks,
Raoul
--
DI (FH) Raoul Bhatia M.Sc.      email. r.bha...@ipax.at
Technischer Leiter
IPAX - Aloy Bhatia Hava OG      web.   http://www.ipax.at
Barawitzkagasse 10/2/2/11       email. off...@ipax.at
1190 Wien                       tel.   +43 1 3670030
FN 277995t HG Wien              fax.   +43 1 3670030 15
___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
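The contradictory status wc01 reports (cs:Connected together with a DUnknown peer disk) can be spotted mechanically. A minimal sketch, using a copy of the wc01 output above as sample input; the /tmp path and the check itself are illustrative only, not part of DRBD's tooling:

```shell
# Sample input: the two status lines wc01 printed above.
cat <<'EOF' >/tmp/drbd.status
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/DUnknown C r-
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-
EOF
# Flag resources that claim an established connection (cs:Connected)
# while reporting the peer disk as DUnknown -- the invalid combination
# discussed in this thread.
awk '/cs:Connected/ && /DUnknown/ { print "suspect:", $1, $2, $4 }' /tmp/drbd.status
```

Run against live systems, the same filter could be applied to `cat /proc/drbd` output from each node.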
Re: [DRBD-user] Digest mismatch resulting in split brain after (!) automatic reconnect
Hi,

after a couple of days, I can tell that I do not see the described
problem with drbd 8.3.7 and kernel 2.6.32-bpo.5-amd64 (backports from
squeeze to debian lenny):

root@c02n01 ~ # cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
srcversion: EE47D8BF18AC166BE219757

Taking a closer look, I also do not see the original error message
anymore:

  Digest mismatch, buffer modified by upper layers during write: 0s +4096

Instead, I now see dmesg output like:

[197080.750826] block drbd1: Digest integrity check FAILED.
[197080.750871] block drbd1: error receiving Data, l: 4136!
[197080.750905] block drbd1: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
[197080.750977] block drbd1: asender terminated

However, the devices correctly get back in sync. I'll additionally run
a manual verify later on and will report back.

Lars: were you able to extract the logfiles from my original post?

Cheers,
Raoul
Re: [DRBD-user] Digest mismatch resulting in split brain after (!) automatic reconnect
On Mon, Feb 21, 2011 at 10:02:30AM +0100, Raoul Bhatia [IPAX] wrote:
> hi,
>
> after a couple of days, i can tell that i do not see the described
> problem with drbd 8.3.7 and kernel 2.6.32-bpo.5-amd64 (backports from
> squeeze to debian lenny)
>
> root@c02n01 ~ # cat /proc/drbd
> version: 8.3.7 (api:88/proto:86-91)
> srcversion: EE47D8BF18AC166BE219757
>
> taking a closer look, i also do not see the original error message
> anymore:
>   Digest mismatch, buffer modified by upper layers during write: 0s +4096

We changed the log message; more precisely, we added the ability to
distinguish between detecting a mismatch on the receiving end
(previously possible already) and detecting a mismatch on the sending
end as well (previously not checked).

> instead, i now see dmesg like:
> [197080.750826] block drbd1: Digest integrity check FAILED.
> [197080.750871] block drbd1: error receiving Data, l: 4136!
> [197080.750905] block drbd1: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
> [197080.750977] block drbd1: asender terminated
>
> however, the devices correctly get back in sync. i'll additionally run
> a manual verify later on and will report back.
>
> lars: were you able to extract the logfiles from my original post?

The logs of your original post are completely boring.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
Re: [DRBD-user] Digest mismatch resulting in split brain after (!) automatic reconnect
On Mon, Feb 21, 2011 at 10:24:13AM +0100, Lars Ellenberg wrote:
> On Mon, Feb 21, 2011 at 10:02:30AM +0100, Raoul Bhatia [IPAX] wrote:
>> lars: were you able to extract the logfiles from my original post?
>
> The logs of your original post are completely boring.

No, wait. They are not ;-)

Feb 16 06:25:03 c02n01 kernel: [3687390.120354] block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
Feb 16 06:25:03 c02n01 kernel: [3687390.120362] block drbd1: Began resync as SyncSource (will sync 4 KB [1 bits set]).
Feb 16 06:25:03 c02n01 kernel: [3687390.120797] block drbd1: updated sync UUID 3C1DADF6B38C1AD7:E7E50184F3F3AC0B:E7E40184F3F3AC0B:3CFC3B16AAE1131D
Feb 16 06:25:03 c02n01 kernel: [3687390.131787] block drbd1: Retrying drbd_rs_del_all() later. refcnt=1
Feb 16 06:25:04 c02n01 kernel: [3687390.232237] block drbd1: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
Feb 16 06:25:04 c02n01 kernel: [3687390.232314] block drbd1: updated UUIDs 3C1DADF6B38C1AD7::E7E50184F3F3AC0B:E7E40184F3F3AC0B
Feb 16 06:25:04 c02n01 kernel: [3687390.232434] block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Feb 16 06:25:04 c02n01 kernel: [3687390.274089] block drbd1: bitmap WRITE of 762 pages took 10 jiffies
Feb 16 06:25:04 c02n01 kernel: [3687390.274154] block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Feb 16 06:25:04 c02n01 kernel: [3687390.947353] block drbd1: helper command: /sbin/drbdadm fence-peer minor-1 exit code 1 (0x100)
Feb 16 06:25:04 c02n01 kernel: [3687390.947487] block drbd1: fence-peer helper broken, returned 1

Fix your fence-peer helper, that may be the cause of trouble there.

Feb 16 06:25:04 c02n01 kernel: [3687390.947555] block drbd1: pdsk( UpToDate -> DUnknown )

This should not have happened, either: We must not change the pdsk
state to DUnknown while keeping the conn state at Connected. That's
nonsense.

Feb 16 06:25:04 c02n01 kernel: [3687390.947633] block drbd1: new current UUID 89084B22FE454C03:3C1DADF6B38C1AD7:E7E50184F3F3AC0B:E7E40184F3F3AC0B
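The helper invocation and the exit code DRBD flags as "broken" can be pulled out of a kernel log mechanically. A small sketch, assuming the log lines look exactly as quoted above; the /tmp path is illustrative:

```shell
# Sample input: the two fence-peer lines from the log excerpt above.
cat <<'EOF' >/tmp/kern.excerpt
Feb 16 06:25:04 c02n01 kernel: [3687390.947353] block drbd1: helper command: /sbin/drbdadm fence-peer minor-1 exit code 1 (0x100)
Feb 16 06:25:04 c02n01 kernel: [3687390.947487] block drbd1: fence-peer helper broken, returned 1
EOF
# Extract each helper command together with its exit code, so a helper
# returning an unexpected code (here: 1) stands out.
sed -n 's/.*helper command: \(.*\) exit code \([0-9]*\).*/\1 -> rc=\2/p' /tmp/kern.excerpt
```

On a live node, the same sed filter could be run over the real kernel log instead of the sample file.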
Re: [DRBD-user] Digest mismatch resulting in split brain after (!) automatic reconnect
On Mon, Feb 21, 2011 at 01:21:11PM +0100, Raoul Bhatia [IPAX] wrote:
> hi,
>
> On 02/21/2011 10:36 AM, Lars Ellenberg wrote:
>> Fix your fence-peer helper, that may be the cause of trouble there.
>
> which actually is 'your' fence-peer helper, right? :)

It is. Well, then fix it, anyways. Or maybe it does not need fixing
after all.

> thus, basically coming back to [1] where florian asks:
>> Look at your paste. You have no node where DRBD is Secondary.
>> What do you expect the agent to do?
>
> (i know, i talked about the agent in this email. but the agent and
> crm-fence-peer.sh are closely tied, aren't they?)

Not that much. But I got the impression that you are mixing several
issues in those quoted threads.

> looking at crm-fence-peer.sh's source, i see:
>
>     Secondary|Primary)
>         # WTF? We are supposed to fence the peer,
>         # but the replication link is just fine?
>         echo WARNING "peer is $DRBD_peer, did not place the constraint!"
>         rc=0
>         return
>         ;;
>     esac
>
> so, this should actually be obsoleted by fixing the following bug,
> right?

Possibly.

> on the other hand, what's wrong in trying to disconnect and reconnect
> the resources and see what happens? (e.g. via a tiny constraint that
> is only valid for PT1M?)

Nothing? Everything? I don't know. You tell me what is wrong.

>> Feb 16 06:25:04 c02n01 kernel: [3687390.947555] block drbd1: pdsk( UpToDate -> DUnknown )
>
> This should not have happened, either: We must not change the pdsk
> state to DUnknown while keeping the conn state at Connected. That's
> nonsense.

> please let me know if you need any further input from my side.

Only if it is easily reproducible, and if so, how. Sorry, if you wrote
that somewhere already, I missed it. Just write it again.
[DRBD-user] Digest mismatch resulting in split brain after (!) automatic reconnect
Hi,

debian lenny, pacemaker 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b,
drbd 8.3.10 5c0b046982443d4785d90a2c603378f9017b, ocf ra 1.3 shipped
with the self-compiled drbd debian package, kernel 2.6.27.57+ipax.

Every couple of hours, I encounter a digest mismatch:

  Digest mismatch, buffer modified by upper layers during write: 0s +4096

leading to a disconnect and reconnect (by pacemaker+drbd) and a split
view after the resync, e.g.:

node1:
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b046982443d4785d90a2c603378f9017b build by r...@ipax.at, 2011-02-03 14:58:22
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/DUnknown C r-
    ns:88040564 nr:0 dw:89438380 dr:199396053 al:787279 bm:9 lo:1 pe:0 ua:0 ap:1 ep:1 wo:b oos:343052

node2:
version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b046982443d4785d90a2c603378f9017b build by r...@ipax.at, 2011-02-03 14:58:22
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-
    ns:0 nr:87855316 dw:87855316 dr:0 al:0 bm:9 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0

As you can see, node1 reports ds:UpToDate/DUnknown whereas node2
reports UpToDate/UpToDate. Config and dmesg logs are attached.

For your information:
  Feb 16 06:25:03: devices get out of sync.
  Feb 16 13:34:32: I manually disconnect and reconnect from node01 to
  start the resync.

Looks like a bug to me, doesn't it?

I have a couple of 2-node clusters running this setup. For a test, I
will upgrade one of them to a more recent kernel from squeeze and thus
will downgrade drbd to squeeze's drbd 8.3.7.

Cheers,
Raoul

PS. Some of my previous posts are, quite possibly, related to this:
http://www.gossamer-threads.com/lists/drbd/users/20717#20717
http://www.gossamer-threads.com/lists/drbd/users/20605#20605
+ talks via irc

# /etc/drbd.conf
common {
    protocol C;
    net {
        cram-hmac-alg      sha1;
        shared-secret      "Umau4cui Olohfie7 aivaeH4e";
        data-integrity-alg md5;
    }
    disk {
        on-io-error pass_on;
        fencing     resource-only;
    }
    syncer {
        rate       50M;
        al-extents 257;
        verify-alg sha1;
    }
    startup {
        wfc-timeout          15;
        degr-wfc-timeout     15;
        outdated-wfc-timeout 2;
    }
    handlers {
        pri-on-incon-degr   "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb   "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error      "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        fence-peer          /usr/lib/drbd/crm-fence-peer.sh;
        after-resync-target /usr/lib/drbd/crm-unfence-peer.sh;
    }
}

# resource mail on c02n01: not ignored, not stacked
resource mail {
    on c02n01 {
        device    /dev/drbd2 minor 2;
        disk      /dev/md8;
        address   ipv4 192.168.100.50:7790;
        meta-disk internal;
    }
    on c02n02 {
        device    /dev/drbd2 minor 2;
        disk      /dev/md8;
        address   ipv4 192.168.100.51:7790;
        meta-disk internal;
    }
}

# resource mysql on c02n01: not ignored, not stacked
resource mysql {
    on c02n01 {
        device    /dev/drbd1 minor 1;
        disk      /dev/md7;
        address   ipv4 192.168.100.50:7789;
        meta-disk internal;
    }
    on c02n02 {
        device    /dev/drbd1 minor 1;
        disk      /dev/md7;
        address   ipv4 192.168.100.51:7789;
        meta-disk internal;
    }
}

# resource www on c02n01: not ignored, not stacked
resource www {
    on c02n01 {
        device    /dev/drbd0 minor 0;
        disk      /dev/md6;
        address   ipv4 192.168.100.50:7788;
        meta-disk internal;
    }
    on c02n02 {
        device    /dev/drbd0 minor 0;
        disk      /dev/md6;
        address   ipv4 192.168.100.51:7788;
        meta-disk internal;
    }
}

Feb 16 06:25:03 c02n01 kernel: [3687389.652624] block drbd1: Digest mismatch, buffer modified by upper layers during write: 0s +4096
Feb 16 06:25:03 c02n01 kernel: [3687389.653918] block drbd1: sock was shut down by peer
Feb 16 06:25:03 c02n01 kernel:
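The "buffer modified by upper layers during write" message describes a page whose content changed between the moment the digest (data-integrity-alg md5) was computed and the moment it was checked. A toy illustration of why a concurrent modification breaks the digest; the file path and the single-byte change are made up for the example:

```shell
# Write a 4 KiB "page" of zeros and hash it, as the sender would.
dd if=/dev/zero of=/tmp/page bs=4096 count=1 2>/dev/null
before=$(md5sum /tmp/page | cut -d' ' -f1)
# Simulate an upper layer touching the page while it is in flight.
printf 'X' | dd of=/tmp/page bs=1 count=1 conv=notrunc 2>/dev/null
# Re-hash, as the checker would: the digests no longer match.
after=$(md5sum /tmp/page | cut -d' ' -f1)
[ "$before" = "$after" ] || echo "Digest mismatch: $before vs $after"
```

This is only a model of the race, of course; in DRBD the page is not copied, so any in-flight modification by the filesystem above is visible to the checksum path.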
Re: [DRBD-user] Digest mismatch resulting in split brain after (!) automatic reconnect
On Wed, Feb 16, 2011 at 03:49:34PM +0100, Raoul Bhatia [IPAX] wrote:
> hi,
>
> debian lenny, pacemaker 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b,
> drbd 8.3.10 5c0b046982443d4785d90a2c603378f9017b, ocf ra 1.3 shipped
> with (self-compiled drbd debian package) kernel 2.6.27.57+ipax
>
> every couple of hours, i encounter a digest mismatch:
>   Digest mismatch, buffer modified by upper layers during write: 0s +4096
> leading to a disconnect and reconnect (by pacemaker+drbd) and a split
> view after the resync, e.g.:
>
> node1:
> version: 8.3.10 (api:88/proto:86-96)
> GIT-hash: 5c0b046982443d4785d90a2c603378f9017b build by r...@ipax.at, 2011-02-03 14:58:22
>  0: cs:Connected ro:Primary/Secondary ds:UpToDate/DUnknown C r-
>
> as you can see, node1 reports ds: UpToDate/DUnknown whereas

conn == Connected with pdsk == DUnknown is an invalid state. So yes,
that looks like a bug.

Grep for state changes in your kernel logs, and find the place where it
changes to Connected while not changing pdsk to something != DUnknown.
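Lars's suggestion can be sketched as a grep over the conn()/pdsk() transition lines. The sample input below is assembled from logs quoted elsewhere in this thread; the real kernel log path varies by distribution:

```shell
# Two state-transition lines of the kind DRBD logs (sample data).
cat <<'EOF' >/tmp/drbd.state
block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
block drbd1: pdsk( UpToDate -> DUnknown )
EOF
# List every conn()/pdsk() transition with line numbers; the bug is a
# pdsk change to DUnknown without a matching conn change away from
# Connected, which the second line here exhibits.
grep -nE 'conn\(|pdsk\(' /tmp/drbd.state
```

Reading the numbered transitions in order makes it easy to see which state change happened without its required companion.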