Re: [DRBD-user] Digest mismatch resulting in split brain after (!) automatic reconnect

2011-03-11 Thread Raoul Bhatia [IPAX]
On 02/21/2011 02:18 PM, Lars Ellenberg wrote:
 Feb 16 06:25:04 c02n01 kernel: [3687390.947555] block drbd1: pdsk( UpToDate 
 -> DUnknown )

 This should not have happened, either:
 We must not change the pdsk state to DUnknown while keeping conn state at 
 Connected.
 That's nonsense.

 Feb 16 06:25:04 c02n01 kernel: [3687390.947633] block drbd1: new current 
 UUID 89084B22FE454C03:3C1DADF6B38C1AD7:E7E50184F3F3AC0B:E7E40184F3F3AC0B 

 please let me know if you need any further input from my side.
 
 Only if it is easily reproducible, and if so, how.
 Sorry, if you wrote that somewhere already, I missed it.
 Just write it again.

i tried to reproduce the problem by rapidly dropping and restoring a
469 MB SQL dump file.

unfortunately, this did not work.
i'll retry this with a bigger dumpfile in the next days.
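for reference, a sketch of such a drop-and-reload loop (the database name
"testdb", the dump path, and the client invocation are placeholders, not my
exact commands):

```shell
# repeatedly drop and reload a SQL dump to generate write bursts on the
# DRBD-backed volume. "testdb" and the dump path are placeholders.
reload_dump() {
    # $1 = iterations, $2 = dump file, $3 = mysql client command
    #      (parameterized so the loop can be dry-run with a stub)
    n=$1; dump=$2; client=${3:-mysql}
    i=0
    while [ "$i" -lt "$n" ]; do
        $client -e 'DROP DATABASE IF EXISTS testdb; CREATE DATABASE testdb;'
        $client testdb < "$dump"
        i=$((i + 1))
    done
}
# usage: reload_dump 20 /srv/backup/dump.sql
```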




one other thing: i downgraded most of our servers to drbd 8.3.7 and
linux 2.6.32-bpo.5-amd64 (debian bpo). with this setup, i still see
some "Digest integrity check FAILED." messages, but the resync works
without any problems.




now, i have one production cluster where i still did not manage to
downgrade both nodes. my current setup is:

wc01: master with drbd 8.3.10, kernel 2.6.27.57+ipax (self compiled)
wc02: slave with drbd 8.3.7, kernel 2.6.32-bpo.5-amd64 (debian bpo)


in this setup, i still see the described issue:
 root@wc01 ~ # ssh wc01c cat /proc/drbd ; ssh wc02c cat /proc/drbd 

wc01:
 version: 8.3.10 (api:88/proto:86-96)
 GIT-hash: 5c0b046982443d4785d90a2c603378f9017b build by 
 r...@k000866c.ipax.at, 2011-02-03 14:58:22
  0: cs:Connected ro:Primary/Secondary ds:UpToDate/DUnknown C r-
 ns:59832072 nr:0 dw:410069420 dr:1031415173 al:3623746 bm:11501 lo:16 
 pe:0 ua:0 ap:16 ep:1 wo:b oos:11453780
  1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-
 ns:15565136 nr:8918376 dw:144977376 dr:62777630 al:157147 bm:746 lo:0 
 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
please note the DUnknown/oos values from drbd0

wc02:
 version: 8.3.7 (api:88/proto:86-91)
 srcversion: EE47D8BF18AC166BE219757 
  0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r
 ns:0 nr:31434060 dw:31434060 dr:0 al:0 bm:25 lo:0 pe:0 ua:0 ap:0 ep:1 
 wo:b oos:0
  1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r
 ns:0 nr:15485480 dw:15485480 dr:0 al:0 bm:24 lo:0 pe:0 ua:0 ap:0 ep:1 
 wo:b oos:0
wc02 thinks that everything is fine.



i don't know if this is of any help to you; feel free to ignore it
in case it does not matter.

i can provide the logfiles too.

thanks,
raoul
-- 

DI (FH) Raoul Bhatia M.Sc.  email.  r.bha...@ipax.at
Technischer Leiter

IPAX - Aloy Bhatia Hava OG  web.  http://www.ipax.at
Barawitzkagasse 10/2/2/11   email.off...@ipax.at
1190 Wien   tel.   +43 1 3670030
FN 277995t HG Wien  fax.+43 1 3670030 15

___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] Digest mismatch resulting in split brain after (!) automatic reconnect

2011-02-21 Thread Raoul Bhatia [IPAX]
hi,

after a couple of days, i can tell that i do not see the described
problem with
drbd 8.3.7 and kernel 2.6.32-bpo.5-amd64
(backports from squeeze to debian lenny)

 root@c02n01 ~ # cat /proc/drbd
 version: 8.3.7 (api:88/proto:86-91)
 srcversion: EE47D8BF18AC166BE219757


taking a closer look, i also do not see the original error message
anymore: (Digest mismatch, buffer modified by upper layers during write:
0s +4096)


instead, i now see dmesg like:
 [197080.750826] block drbd1: Digest integrity check FAILED.
 [197080.750871] block drbd1: error receiving Data, l: 4136!
 [197080.750905] block drbd1: peer( Primary -> Unknown ) conn( Connected -> 
 ProtocolError ) pdsk( UpToDate -> DUnknown )
 [197080.750977] block drbd1: asender terminated

however, the devices correctly get back in sync.

i'll additionally run a manual verify later on and will report back.

lars: were you able to extract the logfiles from my original post?

cheers,
raoul


Re: [DRBD-user] Digest mismatch resulting in split brain after (!) automatic reconnect

2011-02-21 Thread Lars Ellenberg
On Mon, Feb 21, 2011 at 10:02:30AM +0100, Raoul Bhatia [IPAX] wrote:
 hi,
 
 after a couple of days, i can tell that i do not see the described
 problem with
 drbd 8.3.7 and kernel 2.6.32-bpo.5-amd64
 (backports from squeeze to debian lenny)
 
  root@c02n01 ~ # cat /proc/drbd
  version: 8.3.7 (api:88/proto:86-91)
  srcversion: EE47D8BF18AC166BE219757
 
 
 taking a closer look, i also do not see the original error message
 anymore: (Digest mismatch, buffer modified by upper layers during write:
 0s +4096)

we changed the log message; more precisely, we added the ability to
distinguish between detecting a mismatch on the receiving end (previously
possible already) and detecting one on the sending end as well
(previously not checked).

 instead, i now see dmesg like:
  [197080.750826] block drbd1: Digest integrity check FAILED.
  [197080.750871] block drbd1: error receiving Data, l: 4136!
  [197080.750905] block drbd1: peer( Primary -> Unknown ) conn( Connected -> 
  ProtocolError ) pdsk( UpToDate -> DUnknown )
  [197080.750977] block drbd1: asender terminated
 
 however, the devices correctly get back in sync.
 
 i'll additionally run a manual verify later on and will report back.
 
 lars: were you able to extract the logfiles from my original post?

The logs of your original post are completely boring.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


Re: [DRBD-user] Digest mismatch resulting in split brain after (!) automatic reconnect

2011-02-21 Thread Lars Ellenberg
On Mon, Feb 21, 2011 at 10:24:13AM +0100, Lars Ellenberg wrote:
 On Mon, Feb 21, 2011 at 10:02:30AM +0100, Raoul Bhatia [IPAX] wrote:
  hi,
  
  after a couple of days, i can tell that i do not see the described
  problem with
  drbd 8.3.7 and kernel 2.6.32-bpo.5-amd64
  (backports from squeeze to debian lenny)
  
   root@c02n01 ~ # cat /proc/drbd
   version: 8.3.7 (api:88/proto:86-91)
   srcversion: EE47D8BF18AC166BE219757
  
  
  taking a closer look, i also do not see the original error message
  anymore: (Digest mismatch, buffer modified by upper layers during write:
  0s +4096)
 
 we changed the log message; more precisely, we added the ability to
 distinguish between detecting a mismatch on the receiving end (previously
 possible already) and detecting one on the sending end as well
 (previously not checked).
 
  instead, i now see dmesg like:
   [197080.750826] block drbd1: Digest integrity check FAILED.
   [197080.750871] block drbd1: error receiving Data, l: 4136!
   [197080.750905] block drbd1: peer( Primary -> Unknown ) conn( Connected 
   -> ProtocolError ) pdsk( UpToDate -> DUnknown )
   [197080.750977] block drbd1: asender terminated
  
  however, the devices correctly get back in sync.
  
  i'll additionally run a manual verify later on and will report back.
  
  lars: were you able to extract the logfiles from my original post?
 
 The logs of your original post are completely boring.

No, wait.
They are not ;-)

Feb 16 06:25:03 c02n01 kernel: [3687390.120354] block drbd1: conn( WFBitMapS -> 
SyncSource ) pdsk( Consistent -> Inconsistent )
Feb 16 06:25:03 c02n01 kernel: [3687390.120362] block drbd1: Began resync as 
SyncSource (will sync 4 KB [1 bits set]).
Feb 16 06:25:03 c02n01 kernel: [3687390.120797] block drbd1: updated sync UUID 
3C1DADF6B38C1AD7:E7E50184F3F3AC0B:E7E40184F3F3AC0B:3CFC3B16AAE1131D
Feb 16 06:25:03 c02n01 kernel: [3687390.131787] block drbd1: Retrying 
drbd_rs_del_all() later. refcnt=1
Feb 16 06:25:04 c02n01 kernel: [3687390.232237] block drbd1: Resync done (total 
1 sec; paused 0 sec; 4 K/sec)
Feb 16 06:25:04 c02n01 kernel: [3687390.232314] block drbd1: updated UUIDs 
3C1DADF6B38C1AD7::E7E50184F3F3AC0B:E7E40184F3F3AC0B
Feb 16 06:25:04 c02n01 kernel: [3687390.232434] block drbd1: conn( SyncSource 
-> Connected ) pdsk( Inconsistent -> UpToDate )
Feb 16 06:25:04 c02n01 kernel: [3687390.274089] block drbd1: bitmap WRITE of 
762 pages took 10 jiffies
Feb 16 06:25:04 c02n01 kernel: [3687390.274154] block drbd1: 0 KB (0 bits) 
marked out-of-sync by on disk bit-map.

Feb 16 06:25:04 c02n01 kernel: [3687390.947353] block drbd1: helper command: 
/sbin/drbdadm fence-peer minor-1 exit code 1 (0x100)
Feb 16 06:25:04 c02n01 kernel: [3687390.947487] block drbd1: fence-peer helper 
broken, returned 1

Fix your fence-peer helper,
that may be the cause of trouble there.
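To see what the helper actually returns, it can be invoked by hand with the
environment DRBD sets for its handlers (a sketch; the resource name and minor
are examples taken from this thread, and DRBD_RESOURCE is the variable
crm-fence-peer.sh relies on):

```shell
# run the fence-peer handler by hand with the environment variables DRBD
# passes to handlers, and print its exit code. the handler path is
# parameterized so a stub can be substituted for testing.
run_fence_peer() {
    # $1 = resource name, $2 = minor, $3 = handler path (default: the real one)
    DRBD_RESOURCE=$1 DRBD_MINOR=$2 "${3:-/usr/lib/drbd/crm-fence-peer.sh}"
    echo "fence-peer exit code: $?"
}
# usage: run_fence_peer mysql 1
```

Exit code 1, as logged above ("fence-peer helper broken, returned 1"), is not
one of the codes DRBD accepts from a fence-peer handler.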

Feb 16 06:25:04 c02n01 kernel: [3687390.947555] block drbd1: pdsk( UpToDate -> 
DUnknown )

This should not have happened, either:
We must not change the pdsk state to DUnknown while keeping conn state at 
Connected.
That's nonsense.

Feb 16 06:25:04 c02n01 kernel: [3687390.947633] block drbd1: new current UUID 
89084B22FE454C03:3C1DADF6B38C1AD7:E7E50184F3F3AC0B:E7E40184F3F3AC0B 



Re: [DRBD-user] Digest mismatch resulting in split brain after (!) automatic reconnect

2011-02-21 Thread Lars Ellenberg
On Mon, Feb 21, 2011 at 01:21:11PM +0100, Raoul Bhatia [IPAX] wrote:
 hi,
 
 
 On 02/21/2011 10:36 AM, Lars Ellenberg wrote:
  Fix your fence-peer helper,
  that may be the cause of trouble there.
 
 which actually is 'your' fence-peer helper, right? :)

Is it.
Well, then fix it, anyways.
Or maybe it does not need fixing after all.

 thus, basically coming back to [1] where florian asks:
  Look at your paste. You have no node where DRBD is Secondary. What do
  you expect the agent to do? 
 
 (i know, i talked about the agent in this email. but the agent and
 crm-fence-peer.sh are closely tied, aren't they?)

Not that much.  But I got the impression that you are mixing several
issues in those quoted threads.

 looking at crm-fence-peer.sh's source, i see:
  Secondary|Primary)
  # WTF? We are supposed to fence the peer,
  # but the replication link is just fine?
  echo "WARNING peer is $DRBD_peer, did not place the constraint!"
  rc=0
  return
  ;;
  esac
 
 so, this should actually be obsoleted by fixing the following bug,
 right?

possibly.

 on the other hand, what's wrong with trying to disconnect and reconnect
 the resources and seeing what happens? (e.g. via a tiny constraint that is
 only valid for PT1M?)

Nothing?
Everything?
I don't know.
You tell me what is wrong.

  Feb 16 06:25:04 c02n01 kernel: [3687390.947555] block drbd1: pdsk( UpToDate 
  -> DUnknown )
  
  This should not have happened, either:
  We must not change the pdsk state to DUnknown while keeping conn state at 
  Connected.
  That's nonsense.
  
  Feb 16 06:25:04 c02n01 kernel: [3687390.947633] block drbd1: new current 
  UUID 89084B22FE454C03:3C1DADF6B38C1AD7:E7E50184F3F3AC0B:E7E40184F3F3AC0B 
 
 please let me know if you need any further input from my side.

Only if it is easily reproducible, and if so, how.
Sorry, if you wrote that somewhere already, I missed it.
Just write it again.



[DRBD-user] Digest mismatch resulting in split brain after (!) automatic reconnect

2011-02-16 Thread Raoul Bhatia [IPAX]
hi,

debian lenny,
pacemaker 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b,
drbd 8.3.10 5c0b046982443d4785d90a2c603378f9017b,
ocf ra 1.3 shipped with the (self-compiled) drbd debian package
kernel 2.6.27.57+ipax


every couple of hours, i encounter a digest mismatch:
 Digest mismatch, buffer modified by upper layers during write: 0s +4096

leading to a disconnect and reconnect (by pacemaker+drbd) and
a split view after the resync, e.g.:

node1:
 version: 8.3.10 (api:88/proto:86-96)
 GIT-hash: 5c0b046982443d4785d90a2c603378f9017b build by r...@ipax.at, 
 2011-02-03 14:58:22
  0: cs:Connected ro:Primary/Secondary ds:UpToDate/DUnknown C r-
 ns:88040564 nr:0 dw:89438380 dr:199396053 al:787279 bm:9 lo:1 pe:0 ua:0 
 ap:1 ep:1 wo:b oos:343052

node2:
 version: 8.3.10 (api:88/proto:86-96)
 GIT-hash: 5c0b046982443d4785d90a2c603378f9017b build by r...@ipax.at, 
 2011-02-03 14:58:22
  0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-
 ns:0 nr:87855316 dw:87855316 dr:0 al:0 bm:9 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d 
 oos:0


as you can see, node1 reports ds: UpToDate/DUnknown whereas
node2 reports UpToDate/UpToDate


config and dmesg logs attached. for your information:

Feb 16 06:25:03: devices get out of sync.
Feb 16 13:34:32: i manually disconnect and reconnect from node01 to
 start resync.


looks like a bug to me, doesn't it?

i have a couple of 2 node clusters running this setup.
for a test, i will upgrade one of them to a more recent kernel from
squeeze and thus downgrade drbd to squeeze's 8.3.7.


cheers,
raoul

ps. some of my previous posts are, quite possibly, related to this:
http://www.gossamer-threads.com/lists/drbd/users/20717#20717
http://www.gossamer-threads.com/lists/drbd/users/20605#20605
+ talks via irc

# /etc/drbd.conf
common {
    protocol C;
    net {
        cram-hmac-alg      sha1;
        shared-secret      "Umau4cui Olohfie7 aivaeH4e";
        data-integrity-alg md5;
    }
    disk {
        on-io-error pass_on;
        fencing     resource-only;
    }
    syncer {
        rate       50M;
        al-extents 257;
        verify-alg sha1;
    }
    startup {
        wfc-timeout          15;
        degr-wfc-timeout     15;
        outdated-wfc-timeout 2;
    }
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
}

# resource mail on c02n01: not ignored, not stacked
resource mail {
    on c02n01 {
        device    /dev/drbd2 minor 2;
        disk      /dev/md8;
        address   ipv4 192.168.100.50:7790;
        meta-disk internal;
    }
    on c02n02 {
        device    /dev/drbd2 minor 2;
        disk      /dev/md8;
        address   ipv4 192.168.100.51:7790;
        meta-disk internal;
    }
}

# resource mysql on c02n01: not ignored, not stacked
resource mysql {
    on c02n01 {
        device    /dev/drbd1 minor 1;
        disk      /dev/md7;
        address   ipv4 192.168.100.50:7789;
        meta-disk internal;
    }
    on c02n02 {
        device    /dev/drbd1 minor 1;
        disk      /dev/md7;
        address   ipv4 192.168.100.51:7789;
        meta-disk internal;
    }
}

# resource www on c02n01: not ignored, not stacked
resource www {
    on c02n01 {
        device    /dev/drbd0 minor 0;
        disk      /dev/md6;
        address   ipv4 192.168.100.50:7788;
        meta-disk internal;
    }
    on c02n02 {
        device    /dev/drbd0 minor 0;
        disk      /dev/md6;
        address   ipv4 192.168.100.51:7788;
        meta-disk internal;
    }
}

Feb 16 06:25:03 c02n01 kernel: [3687389.652624] block drbd1: Digest mismatch, buffer modified by upper layers during write: 0s +4096
Feb 16 06:25:03 c02n01 kernel: [3687389.653918] block drbd1: sock was shut down by peer
Feb 16 06:25:03 c02n01 kernel: 

Re: [DRBD-user] Digest mismatch resulting in split brain after (!) automatic reconnect

2011-02-16 Thread Lars Ellenberg
On Wed, Feb 16, 2011 at 03:49:34PM +0100, Raoul Bhatia [IPAX] wrote:
 hi,
 
 debian lenny,
 pacemaker 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b,
 drbd 8.3.10 5c0b046982443d4785d90a2c603378f9017b,
 ocf ra 1.3 shipped with (self-compiled drbd debian package)
 kernel 2.6.27.57+ipax
 
 
 every couple of hours, i encounter a digest mismatch:
  Digest mismatch, buffer modified by upper layers during write: 0s +4096
 
 leading to a disconnect and reconnect (by pacemaker+drbd) and
 a split view after the resync, e.g.:
 
 node1:
  version: 8.3.10 (api:88/proto:86-96)
  GIT-hash: 5c0b046982443d4785d90a2c603378f9017b build by r...@ipax.at, 
  2011-02-03 14:58:22
   0: cs:Connected ro:Primary/Secondary ds:UpToDate/DUnknown C r-


 as you can see, node1 reports ds: UpToDate/DUnknown whereas

conn == Connected with pdsk == DUnknown is an invalid state.

So yes, that looks like a bug.

Grep for state changes in your kernel logs, and find the place where it
changes to Connected while not changing pdsk to something != DUnknown.
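A sketch of such a grep (the log path is distribution-dependent; on debian
lenny, /var/log/kern.log is a likely place):

```shell
# collect the DRBD state-transition lines for one minor from a kernel log;
# state changes are logged as "peer( ... )", "conn( ... )", "pdsk( ... )",
# "disk( ... )". look for a change to Connected that leaves pdsk at DUnknown.
drbd_state_changes() {
    # $1 = drbd minor number, $2 = kernel log file
    grep -E "block drbd$1: .*(peer|conn|pdsk|disk)\(" "$2"
}
# usage: drbd_state_changes 1 /var/log/kern.log
```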
