Hello,

As a long-time DRBD 8.4 user, I thought I'd give DRBD 9.2 a try.
The setup is a typical active-passive two-node cluster, with an all-flash six-device mdraid 10 as the lower-level storage and 10Gb Ethernet between the nodes. My problem is that DRBD appears to randomly detach the lower-level storage on the secondary node.

(1) Below is the kernel log from one such typical case.
After adding disable-write-same yes; to the resource, the initial sync completed and I thought I was done, but I still eventually find the secondary Diskless, without serious warnings this time (2).
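For completeness, this is roughly how the option is set, in the disk section of the resource (a sketch only; the addresses, node-ids, and overall layout here are placeholders, not my exact config):

```
# /etc/drbd.d/raid10_ssd.res (sketch; addresses/paths are placeholders)
resource raid10_ssd {
    disk {
        disable-write-same yes;   # suppress WRITE SAME towards the backing device
    }
    device    /dev/drbd1;
    disk      /dev/md0;
    meta-disk internal;
    on node1 { node-id 0; address 10.0.0.1:7789; }
    on node2 { node-id 1; address 10.0.0.2:7789; }
}
```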

It appears to happen only on the secondary, regardless of which node that is. The md array appears totally healthy the whole time (also checked with "echo check > /sys/block/md0/md/sync_action", which completed without issues).
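For reference, the md consistency check I ran amounts to this (assuming the array is /dev/md0; mismatch_cnt is only meaningful once the check has finished):

```shell
# Trigger a full read/compare pass over the md array.
MD=/sys/block/md0/md
echo check > "$MD/sync_action"

# sync_action reads "check" while running, "idle" when done;
# /proc/mdstat shows the progress bar.
cat "$MD/sync_action"
cat /proc/mdstat

# After completion: sectors that differed between mirrors.
# 0 means the array is internally consistent.
cat "$MD/mismatch_cnt"
```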

drbdadm attach raid10_ssd
will always work and bring the disk back to UpToDate, until it ends up Diskless again later, apparently regardless of drbd/device activity.

I've tried various kernels and am currently running stock 5.15.102 with the drbd-9.2.2 tarball from LINBIT's site; none of this made any real difference. The hardware also shows no issues when tested individually: pretty standard Dell R420s with H310 Mini SAS controllers. The controllers did make me somewhat suspicious, so I flashed them with stock LSI/Avago IT firmware, but that made no difference either.

Any ideas where to look next?

Best Regards,
Thanos Chatziathanassiou

(1)
---8<---
node1 primary:
drbd raid10_ssd node2: conn( Unconnected -> Connecting )
drbd raid10_ssd node2: Handshake to peer 1 successful: Agreed network protocol version 121
drbd raid10_ssd node2: Feature flags enabled on protocol level: 0x3f TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES RESYNC_DAGTAG
drbd raid10_ssd node2: Peer authenticated using 20 bytes HMAC
drbd raid10_ssd: Preparing cluster-wide state change 247878758 (0->1 499/145)
drbd raid10_ssd/0 drbd1 node2: drbd_sync_handshake:
drbd raid10_ssd/0 drbd1 node2: self 5E9660F0616814EC:5E0F3ECFB8B9124F:94F3C3D0280CC69C:3B9001DA9BF94284 bits:1720597989 flags:120
drbd raid10_ssd/0 drbd1 node2: peer 5E0F3ECFB8B9124E:0000000000000000:94F3C3D0280CC69C:5CBB72884848CF5C bits:1720607970 flags:1024
drbd raid10_ssd/0 drbd1 node2: uuid_compare()=source-use-bitmap by rule=bitmap-self
drbd raid10_ssd: State change 247878758: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC
drbd raid10_ssd: Committing cluster-wide state change 247878758 (128ms)
drbd raid10_ssd node2: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
drbd raid10_ssd/0 drbd1 node2: pdsk( DUnknown -> Inconsistent ) repl( Off -> WFBitMapS )
drbd raid10_ssd/0 drbd1 node2: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 60(1), total 60; compression: 100.0%
drbd raid10_ssd/0 drbd1 node2: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 57(1), total 57; compression: 100.0%
drbd raid10_ssd/0 drbd1 node2: helper command: /sbin/drbdadm before-resync-source
drbd raid10_ssd/0 drbd1 node2: helper command: /sbin/drbdadm before-resync-source exit code 0
drbd raid10_ssd/0 drbd1 node2: repl( WFBitMapS -> SyncSource )
drbd raid10_ssd/0 drbd1 node2: Began resync as SyncSource (will sync 6882431920 KB [1720607980 bits set]).
drbd raid10_ssd/0 drbd1 node2: ASSERTION __dec_rs_pending(peer_device) >= 0 FAILED in got_BlockAck
drbd raid10_ssd/0 drbd1 node2: ASSERTION __dec_rs_pending(peer_device) >= 0 FAILED in got_BlockAck

node2:
drbd raid10_ssd node1: Preparing remote state change 247878758
drbd raid10_ssd/0 drbd1 node1: drbd_sync_handshake:
drbd raid10_ssd/0 drbd1 node1: self 5E0F3ECFB8B9124E:0000000000000000:94F3C3D0280CC69C:5CBB72884848CF5C bits:1720607970 flags:24
drbd raid10_ssd/0 drbd1 node1: peer 5E9660F0616814EC:5E0F3ECFB8B9124F:94F3C3D0280CC69C:3B9001DA9BF94284 bits:1720597989 flags:1120
drbd raid10_ssd/0 drbd1 node1: uuid_compare()=target-use-bitmap by rule=bitmap-peer
drbd raid10_ssd node1: Committing remote state change 247878758 (primary_nodes=1)
drbd raid10_ssd node1: conn( Connecting -> Connected ) peer( Unknown -> Primary )
drbd raid10_ssd/0 drbd1 node1: pdsk( DUnknown -> UpToDate ) repl( Off -> WFBitMapT )
drbd raid10_ssd/0 drbd1 node1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 60(1), total 60; compression: 100.0%
drbd raid10_ssd/0 drbd1 node1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 57(1), total 57; compression: 100.0%
drbd raid10_ssd/0 drbd1 node1: helper command: /sbin/drbdadm before-resync-target
drbd raid10_ssd/0 drbd1 node1: helper command: /sbin/drbdadm before-resync-target exit code 0
drbd raid10_ssd/0 drbd1 node1: repl( WFBitMapT -> SyncTarget )
drbd raid10_ssd/0 drbd1 node1: Began resync as SyncTarget (will sync 6882431920 KB [1720607980 bits set]).
drbd raid10_ssd/0 drbd1: disk( Inconsistent -> Failed )
drbd raid10_ssd/0 drbd1 node1: repl( SyncTarget -> Established )
drbd raid10_ssd/0 drbd1: Local IO failed in drbd_endio_write_sec_final. Detaching...
---8<---

(2)
---8<---
node1:
drbd raid10_ssd/0 drbd1: disk( Attaching -> Negotiating )
drbd raid10_ssd/0 drbd1: attached to current UUID: D425F4E26AD4E468
drbd raid10_ssd/0 drbd1 node2: drbd_sync_handshake:
drbd raid10_ssd/0 drbd1 node2: self D425F4E26AD4E468:0000000000000000:55F0052868F7EB1E:63CE2305E9B88A18 bits:0 flags:0
drbd raid10_ssd/0 drbd1 node2: peer 04699142243419D5:D425F4E26AD4E469:D38998C94E149B92:7BEE5673239AA93E bits:239639 flags:1120
drbd raid10_ssd/0 drbd1 node2: uuid_compare()=target-use-bitmap by rule=bitmap-peer
drbd raid10_ssd/0 drbd1: disk( Negotiating -> Inconsistent )
drbd raid10_ssd/0 drbd1 node2: repl( Established -> WFBitMapT )
drbd raid10_ssd/0 drbd1 node2: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 7602(2), total 7602; compression: 100.0%
drbd raid10_ssd/0 drbd1 node2: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 7611(2), total 7611; compression: 100.0%
drbd raid10_ssd/0 drbd1 node2: helper command: /sbin/drbdadm before-resync-target
drbd raid10_ssd/0 drbd1 node2: helper command: /sbin/drbdadm before-resync-target exit code 0
drbd raid10_ssd/0 drbd1 node2: repl( WFBitMapT -> SyncTarget )
drbd raid10_ssd/0 drbd1 node2: Began resync as SyncTarget (will sync 970840 KB [242710 bits set]).
drbd raid10_ssd/0 drbd1 node2: Resync done (total 57 sec; paused 0 sec; 17352 K/sec)
drbd raid10_ssd/0 drbd1 node2: updated UUIDs 04699142243419D4:0000000000000000:55F0052868F7EB1E:63CE2305E9B88A18
drbd raid10_ssd/0 drbd1: disk( Inconsistent -> UpToDate )
drbd raid10_ssd/0 drbd1 node2: repl( SyncTarget -> Established )
drbd raid10_ssd/0 drbd1 node2: helper command: /sbin/drbdadm after-resync-target drbd raid10_ssd/0 drbd1 node2: helper command: /sbin/drbdadm after-resync-target exit code 0
drbd raid10_ssd/0 drbd1: disk( UpToDate -> Failed )
drbd raid10_ssd/0 drbd1: Local IO failed in drbd_endio_write_sec_final. Detaching...
drbd raid10_ssd/0 drbd1: disk( Failed -> Diskless )
drbd raid10_ssd/0 drbd1: receiver updated UUIDs to exposed data uuid: 59A2664E6A0DAC91

node2 primary:
drbd raid10_ssd node1: Preparing remote state change 171546960
drbd raid10_ssd node1: Committing remote state change 171546960 (primary_nodes=2)
drbd raid10_ssd/0 drbd1 node1: pdsk( Diskless -> Negotiating )
drbd raid10_ssd/0 drbd1 node1: real peer disk state = Inconsistent
drbd raid10_ssd/0 drbd1 node1: drbd_sync_handshake:
drbd raid10_ssd/0 drbd1 node1: self 04699142243419D5:D425F4E26AD4E469:D38998C94E149B92:7BEE5673239AA93E bits:239639 flags:120
drbd raid10_ssd/0 drbd1 node1: peer D425F4E26AD4E468:0000000000000000:55F0052868F7EB1E:63CE2305E9B88A18 bits:3072 flags:1004
drbd raid10_ssd/0 drbd1 node1: uuid_compare()=source-use-bitmap by rule=bitmap-self
drbd raid10_ssd/0 drbd1 node1: pdsk( Negotiating -> Inconsistent ) repl( Established -> WFBitMapS )
drbd raid10_ssd/0 drbd1 node1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 7602(2), total 7602; compression: 100.0%
drbd raid10_ssd/0 drbd1 node1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 7611(2), total 7611; compression: 100.0%
drbd raid10_ssd/0 drbd1 node1: helper command: /sbin/drbdadm before-resync-source
drbd raid10_ssd/0 drbd1 node1: helper command: /sbin/drbdadm before-resync-source exit code 0
drbd raid10_ssd/0 drbd1 node1: repl( WFBitMapS -> SyncSource )
drbd raid10_ssd/0 drbd1 node1: Began resync as SyncSource (will sync 989192 KB [247298 bits set]).
drbd raid10_ssd/0 drbd1 node1: updated UUIDs 04699142243419D5:0000000000000000:D425F4E26AD4E468:D38998C94E149B92
drbd raid10_ssd/0 drbd1 node1: Resync done (total 58 sec; paused 0 sec; 17052 K/sec)
drbd raid10_ssd/0 drbd1 node1: pdsk( Inconsistent -> UpToDate ) repl( SyncSource -> Established )
drbd raid10_ssd/0 drbd1 node1: pdsk( UpToDate -> Failed )
drbd raid10_ssd/0 drbd1: new current UUID: 59A2664E6A0DAC91 weak: FFFFFFFFFFFFFFFD
drbd raid10_ssd/0 drbd1 node1: pdsk( Failed -> Diskless )
---8<---

_______________________________________________
drbd-user mailing list
[email protected]
https://lists.linbit.com/mailman/listinfo/drbd-user
