Hi all,
I've set up a 3-node cluster (config below). Basically, nodes 1
& 2 are connected with protocol C and have resource-and-stonith
fencing. The 1 -> 3 and 2 -> 3 connections are protocol A and
fencing is 'dont-care' (node 3 is not part of the cluster and
would only ever be promoted manually).
When I crash node 2 via 'echo c > /proc/sysrq-trigger',
Pacemaker detects the fault and so does DRBD. DRBD invokes the
fence handler as expected and all is good. However, I wanted to test
breaking just DRBD, so on node 2 I used 'iptables -I INPUT -p
tcp -m tcp --dport 7788:7790 -j DROP' to interrupt DRBD
traffic. When I do this, the fence handler is not invoked.
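(For what it's worth, that single rule only drops packets arriving at node 2's own DRBD ports; it may not be a complete break, since replies from the peer's port 7788 to node 2's ephemeral port would still get through. A more thorough two-way block might look like the sketch below — this is my assumption about a fuller test, not what I ran, and it assumes stock iptables on EL7 and the 7788-7790 port range from the config below.)

```shell
# Sketch: block DRBD traffic in both directions on node 2
# (assumes stock iptables and DRBD ports 7788-7790).

# Incoming packets addressed *to* the local DRBD listening ports:
iptables -I INPUT  -p tcp -m tcp --dport 7788:7790 -j DROP

# Incoming packets *from* a peer's DRBD port to a local ephemeral
# port -- the single --dport rule above does not catch these:
iptables -I INPUT  -p tcp -m tcp --sport 7788:7790 -j DROP

# Outgoing packets to a peer's DRBD ports, for good measure:
iptables -I OUTPUT -p tcp -m tcp --dport 7788:7790 -j DROP
```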
Details below:
==== [root@m3-a02n01 ~]# drbdadm dump
# /etc/drbd.conf
global {
usage-count yes;
}
common {
options {
auto-promote yes;
}
net {
csums-alg md5;
data-integrity-alg md5;
allow-two-primaries no;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
}
disk {
disk-flushes no;
md-flushes no;
}
handlers {
fence-peer /usr/sbin/fence_pacemaker;
}
}
# resource srv01-c7_0 on m3-a02n01.alteeve.com: not
ignored, not stacked
# defined at /etc/drbd.d/srv01-c7_0.res:2
resource srv01-c7_0 {
device /dev/drbd0 minor 0;
on m3-a02n01.alteeve.com {
node-id 0;
disk /dev/node01_vg0/srv01-c7;
}
on m3-a02n02.alteeve.com {
node-id 1;
disk /dev/node02_vg0/srv01-c7;
}
on m3-a02dr01.alteeve.com {
node-id 2;
disk /dev/dr01_vg0/srv01-c7;
}
connection {
host m3-a02n01.alteeve.com
address ipv4 10.41.20.1:7788;
host m3-a02n02.alteeve.com
address ipv4 10.41.20.2:7788;
net {
protocol C;
fencing resource-and-stonith;
}
}
connection {
host m3-a02n01.alteeve.com
address ipv4 10.41.20.1:7789;
host m3-a02dr01.alteeve.com
address ipv4 10.41.20.3:7789;
net {
protocol A;
fencing dont-care;
}
}
connection {
host m3-a02n02.alteeve.com
address ipv4 10.41.20.2:7790;
host m3-a02dr01.alteeve.com
address ipv4 10.41.20.3:7790;
net {
protocol A;
fencing dont-care;
}
}
}
====
DRBD and Pacemaker status before the iptables break;
==== [root@m3-a02n01 ~]# drbdsetup status --verbose
srv01-c7_0 node-id:0 role:Primary suspended:no
volume:0 minor:0 disk:UpToDate quorum:yes blocked:no
m3-a02dr01.alteeve.com node-id:2 connection:Connected
role:Secondary congested:no
volume:0 replication:Established peer-disk:UpToDate
resync-suspended:no
m3-a02n02.alteeve.com node-id:1 connection:Connected
role:Secondary congested:no
volume:0 replication:Established peer-disk:UpToDate
resync-suspended:no
====
==== [root@m3-a02n01 ~]# pcs status
Cluster name: m3-anvil-02
Stack: corosync
Current DC: m3-a02n01.alteeve.com (version
1.1.16-12.el7_4.7-94ff4df) - partition with quorum
Last updated: Sun Feb 11 06:21:21 2018
Last change: Sun Feb 11 02:35:25 2018 by root via
crm_resource on m3-a02n01.alteeve.com
2 nodes configured
7 resources configured
Online: [ m3-a02n01.alteeve.com m3-a02n02.alteeve.com ]
Full list of resources:
virsh_node1 (stonith:fence_virsh): Started
m3-a02n01.alteeve.com
virsh_node2 (stonith:fence_virsh): Started
m3-a02n02.alteeve.com
Clone Set: hypervisor-clone [hypervisor]
Started: [ m3-a02n01.alteeve.com
m3-a02n02.alteeve.com ]
Clone Set: drbd-clone [drbd]
Started: [ m3-a02n01.alteeve.com
m3-a02n02.alteeve.com ]
srv01-c7 (ocf::heartbeat:VirtualDomain): Started
m3-a02n01.alteeve.com
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
====
I then issued the iptables command on node 2. Journald output
(node 1's log first, then node 2's);
====
-- Logs begin at Sat 2018-02-10 17:51:59 GMT. --
Feb 11 06:20:18 m3-a02n01.alteeve.com crmd[2817]:
notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n02.alteeve.com: PingAck did not arrive in
time.
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0: susp-io( no -> fencing)
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n02.alteeve.com: conn( Connected ->
NetworkFailure ) peer( Secondary -> Unknown )
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: pdsk( UpToDate ->
DUnknown ) repl( Established -> Off )
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n02.alteeve.com: ack_receiver terminated
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n02.alteeve.com: Terminating ack_recv thread
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0 m3-a02dr01.alteeve.com: Preparing remote state change
1400759070 (primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFA)
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0 m3-a02dr01.alteeve.com: Committing remote state
change 1400759070
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: pdsk( DUnknown ->
Outdated )
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0: new current UUID: 769A55B47EB143CD weak:
FFFFFFFFFFFFFFFA
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0: susp-io( fencing -> no)
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n02.alteeve.com: Connection closed
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n02.alteeve.com: conn( NetworkFailure ->
Unconnected )
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n02.alteeve.com: Restarting receiver thread
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n02.alteeve.com: conn( Unconnected ->
Connecting )
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n02.alteeve.com: Handshake to peer 1
successful: Agreed network protocol version 112
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n02.alteeve.com: Feature flags enabled on
protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n02.alteeve.com: Starting ack_recv thread (from
drbd_r_srv01-c7 [3336])
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0: Preparing cluster-wide state change 140629015
(0->1 499/145)
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0: State change 140629015: primary_nodes=1,
weak_nodes=FFFFFFFFFFFFFFF8
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0: Committing cluster-wide state change 140629015 (0ms)
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n02.alteeve.com: conn( Connecting ->
Connected ) peer( Unknown -> Secondary )
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: drbd_sync_handshake:
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: self
769A55B47EB143CD:4CF0E17ADD9D1E0F:4161585F99D3837C:361856E4E3DE837C
bits:0 flags:120
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: peer
4CF0E17ADD9D1E0E:0000000000000000:4CF0E17ADD9D1E0E:4161585F99D3837C
bits:0 flags:120
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: uuid_compare()=2 by
rule 70
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: repl( Off ->
WFBitMapS )
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: send bitmap stats
[Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression:
100.0%
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: receive bitmap stats
[Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression:
100.0%
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: helper command:
/sbin/drbdadm before-resync-source
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: helper command:
/sbin/drbdadm before-resync-source exit code 0 (0x0)
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: pdsk( Outdated ->
Inconsistent ) repl( WFBitMapS -> SyncSource )
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: Began resync as
SyncSource (will sync 0 KB [0 bits set]).
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: updated UUIDs
769A55B47EB143CD:0000000000000000:4CF0E17ADD9D1E0E:4161585F99D3837C
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: Resync done (total 1
sec; paused 0 sec; 0 K/sec)
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: pdsk( Inconsistent
-> UpToDate ) repl( SyncSource -> Established )
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n02.alteeve.com: helper command: /sbin/drbdadm
unfence-peer
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n02.alteeve.com: helper command: /sbin/drbdadm
unfence-peer exit code 0 (0x0)
====
-- Logs begin at Sun 2018-02-11 06:18:20 GMT. --
Feb 11 06:20:30 m3-a02n02.alteeve.com sshd[1968]:
pam_unix(sshd:session): session opened for user root by (uid=0)
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n01.alteeve.com: PingAck did not arrive in
time.
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n01.alteeve.com: conn( Connected ->
NetworkFailure ) peer( Primary -> Unknown )
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0: disk( UpToDate -> Consistent )
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: pdsk( UpToDate ->
DUnknown ) repl( Established -> Off )
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n01.alteeve.com: ack_receiver terminated
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n01.alteeve.com: Terminating ack_recv thread
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0: Preparing cluster-wide state change 1400759070
(1->-1 0/0)
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0: State change 1400759070: primary_nodes=1,
weak_nodes=FFFFFFFFFFFFFFFA
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0: Committing cluster-wide state change 1400759070
(1ms)
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0: disk( Consistent -> Outdated )
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n01.alteeve.com: Connection closed
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n01.alteeve.com: conn( NetworkFailure ->
Unconnected )
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n01.alteeve.com: Restarting receiver thread
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n01.alteeve.com: conn( Unconnected ->
Connecting )
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n01.alteeve.com: Handshake to peer 0
successful: Agreed network protocol version 112
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n01.alteeve.com: Feature flags enabled on
protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n01.alteeve.com: Starting ack_recv thread (from
drbd_r_srv01-c7 [1885])
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n01.alteeve.com: Preparing remote state change
140629015 (primary_nodes=0, weak_nodes=0)
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n01.alteeve.com: Committing remote state change
140629015
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0 m3-a02n01.alteeve.com: conn( Connecting ->
Connected ) peer( Unknown -> Primary )
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: drbd_sync_handshake:
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: self
4CF0E17ADD9D1E0E:0000000000000000:4CF0E17ADD9D1E0E:4161585F99D3837C
bits:0 flags:120
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: peer
769A55B47EB143CD:4CF0E17ADD9D1E0F:4161585F99D3837C:361856E4E3DE837C
bits:0 flags:120
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: uuid_compare()=-2 by
rule 50
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: pdsk( DUnknown ->
UpToDate ) repl( Off -> WFBitMapT )
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: receive bitmap stats
[Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression:
100.0%
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: send bitmap stats
[Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression:
100.0%
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: helper command:
/sbin/drbdadm before-resync-target
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: helper command:
/sbin/drbdadm before-resync-target exit code 0 (0x0)
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0: disk( Outdated -> Inconsistent )
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02dr01.alteeve.com: resync-susp( no ->
connection dependency )
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: repl( WFBitMapT ->
SyncTarget )
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: Began resync as
SyncTarget (will sync 0 KB [0 bits set]).
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: Resync done (total 1
sec; paused 0 sec; 0 K/sec)
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: updated UUIDs
769A55B47EB143CC:0000000000000000:4CF0E17ADD9D1E0E:4161585F99D3837C
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0: disk( Inconsistent -> UpToDate )
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02dr01.alteeve.com: resync-susp(
connection dependency -> no )
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: repl( SyncTarget ->
Established )
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: helper command:
/sbin/drbdadm after-resync-target
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd
srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: helper command:
/sbin/drbdadm after-resync-target exit code 0 (0x0)
====
DRBD status on both nodes, after the iptables break;
==== [root@m3-a02n01 ~]# drbdsetup status --verbose
srv01-c7_0 node-id:0 role:Primary suspended:no
volume:0 minor:0 disk:UpToDate quorum:yes blocked:no
m3-a02dr01.alteeve.com node-id:2 connection:Connected
role:Secondary congested:no
volume:0 replication:Established peer-disk:UpToDate
resync-suspended:no
m3-a02n02.alteeve.com node-id:1 connection:Connected
role:Secondary congested:no
volume:0 replication:Established peer-disk:UpToDate
resync-suspended:no
==== [root@m3-a02n02 ~]# drbdsetup status --verbose
srv01-c7_0 node-id:1 role:Secondary suspended:no
volume:0 minor:0 disk:UpToDate quorum:yes blocked:no
m3-a02dr01.alteeve.com node-id:2 connection:Connected
role:Secondary congested:no
volume:0 replication:Established peer-disk:UpToDate
resync-suspended:no
m3-a02n01.alteeve.com node-id:0 connection:Connected
role:Primary congested:no
volume:0 replication:Established peer-disk:UpToDate
resync-suspended:no
====
The cluster still thinks all is well, too.
==== [root@m3-a02n01 ~]# pcs status
Cluster name: m3-anvil-02
Stack: corosync
Current DC: m3-a02n01.alteeve.com (version
1.1.16-12.el7_4.7-94ff4df) - partition with quorum
Last updated: Sun Feb 11 06:33:48 2018
Last change: Sun Feb 11 02:35:25 2018 by root via
crm_resource on m3-a02n01.alteeve.com
2 nodes configured
7 resources configured
Online: [ m3-a02n01.alteeve.com m3-a02n02.alteeve.com ]
Full list of resources:
virsh_node1 (stonith:fence_virsh): Started
m3-a02n01.alteeve.com
virsh_node2 (stonith:fence_virsh): Started
m3-a02n02.alteeve.com
Clone Set: hypervisor-clone [hypervisor]
Started: [ m3-a02n01.alteeve.com
m3-a02n02.alteeve.com ]
Clone Set: drbd-clone [drbd]
Started: [ m3-a02n01.alteeve.com
m3-a02n02.alteeve.com ]
srv01-c7 (ocf::heartbeat:VirtualDomain): Started
m3-a02n01.alteeve.com
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
====
To verify the block is in effect, I confirmed I can't connect to
node 2 on the DRBD port;
==== [root@m3-a02n01 ~]# telnet m3-a02n02.sn 7788
Trying 10.41.20.2...
telnet: connect to address 10.41.20.2: Connection timed
out
====
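(To see which TCP connections DRBD is actually holding open despite the block, something like the following should show the real endpoints — a sketch, assuming iproute2's `ss` is available, as it is on EL7; `-p` needs root to show the owning process.)

```shell
# Sketch: list established TCP sockets touching the DRBD port
# range 7788-7790, with owning process, to see which peer and
# which ports the kernel connection is really using.
ss -tnp | grep -E ':(7788|7789|7790)'
```

If the surviving connection turns out to use a source port outside 7788:7790, that would explain why the INPUT rule alone didn't break it.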
Did it somehow maintain the connection through node 3?
If not, then: a) Why didn't the fence handler get invoked? b)
Why is it still showing connected?
If so, is the connection between nodes 1 and 2 still
protocol C, even though the 1 <-> 3 and 2 <-> 3
connections are protocol A?
Thanks!
--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould