Hi all,

  I've set up a 3-node cluster (config below). Nodes 1 and 2 replicate with protocol C and use 'resource-and-stonith' fencing. The 1 -> 3 and 2 -> 3 connections use protocol A with fencing set to 'dont-care' (node 3 is not part of the Pacemaker cluster and would only ever be promoted manually).

  When I crash node 2 via 'echo c > /proc/sysrq-trigger', Pacemaker detects the fault and so does DRBD. DRBD invokes the fence handler as expected and all is good. However, I also want to test breaking just DRBD, so on node 2 I used 'iptables -I INPUT -p tcp -m tcp --dport 7788:7790 -j DROP' to interrupt DRBD traffic. When I do this, the fence handler is not invoked.
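  (For what it's worth, that rule only drops packets arriving at node 2's listening ports 7788-7790. Replies to a connection that node 2 initiates itself come back with an ephemeral destination port, so they slip past the rule. If I retest, a fuller break in both directions might look like this -- untested sketch:)

```shell
# Untested sketch: sever DRBD traffic to/from ports 7788-7790 in both
# directions on node 2, so neither side can (re)establish a session.
iptables -I INPUT  -p tcp --dport 7788:7790 -j DROP  # inbound to our listeners
iptables -I INPUT  -p tcp --sport 7788:7790 -j DROP  # replies from the peers' listeners
iptables -I OUTPUT -p tcp --dport 7788:7790 -j DROP  # outbound to the peers' listeners
iptables -I OUTPUT -p tcp --sport 7788:7790 -j DROP  # replies from our listeners
```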

  Details below:

==== [root@m3-a02n01 ~]# drbdadm dump
# /etc/drbd.conf
global {
    usage-count yes;
}

common {
    options {
        auto-promote     yes;
    }
    net {
        csums-alg        md5;
        data-integrity-alg md5;
        allow-two-primaries  no;
        after-sb-0pri    discard-zero-changes;
        after-sb-1pri    discard-secondary;
        after-sb-2pri    disconnect;
    }
    disk {
        disk-flushes      no;
        md-flushes        no;
    }
    handlers {
        fence-peer       /usr/sbin/fence_pacemaker;
    }
}

# resource srv01-c7_0 on m3-a02n01.alteeve.com: not ignored, not stacked
# defined at /etc/drbd.d/srv01-c7_0.res:2
resource srv01-c7_0 {
    device               /dev/drbd0 minor 0;
    on m3-a02n01.alteeve.com {
        node-id 0;
        disk             /dev/node01_vg0/srv01-c7;
    }
    on m3-a02n02.alteeve.com {
        node-id 1;
        disk             /dev/node02_vg0/srv01-c7;
    }
    on m3-a02dr01.alteeve.com {
        node-id 2;
        disk             /dev/dr01_vg0/srv01-c7;
    }
    connection {
        host m3-a02n01.alteeve.com         address         ipv4 10.41.20.1:7788;
        host m3-a02n02.alteeve.com         address         ipv4 10.41.20.2:7788;
        net {
            protocol       C;
            fencing      resource-and-stonith;
        }
    }
    connection {
        host m3-a02n01.alteeve.com         address         ipv4 10.41.20.1:7789;
        host m3-a02dr01.alteeve.com         address         ipv4 10.41.20.3:7789;
        net {
            protocol       A;
            fencing      dont-care;
        }
    }
    connection {
        host m3-a02n02.alteeve.com         address         ipv4 10.41.20.2:7790;
        host m3-a02dr01.alteeve.com         address         ipv4 10.41.20.3:7790;
        net {
            protocol       A;
            fencing      dont-care;
        }
    }
}
====

DRBD and Pacemaker status before the iptables break:

==== [root@m3-a02n01 ~]# drbdsetup status --verbose
srv01-c7_0 node-id:0 role:Primary suspended:no
  volume:0 minor:0 disk:UpToDate quorum:yes blocked:no
  m3-a02dr01.alteeve.com node-id:2 connection:Connected role:Secondary congested:no
    volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
  m3-a02n02.alteeve.com node-id:1 connection:Connected role:Secondary congested:no
    volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
====

==== [root@m3-a02n01 ~]# pcs status
Cluster name: m3-anvil-02
Stack: corosync
Current DC: m3-a02n01.alteeve.com (version 1.1.16-12.el7_4.7-94ff4df) - partition with quorum
Last updated: Sun Feb 11 06:21:21 2018
Last change: Sun Feb 11 02:35:25 2018 by root via crm_resource on m3-a02n01.alteeve.com

2 nodes configured
7 resources configured

Online: [ m3-a02n01.alteeve.com m3-a02n02.alteeve.com ]

Full list of resources:

 virsh_node1    (stonith:fence_virsh):    Started m3-a02n01.alteeve.com
 virsh_node2    (stonith:fence_virsh):    Started m3-a02n02.alteeve.com
 Clone Set: hypervisor-clone [hypervisor]
     Started: [ m3-a02n01.alteeve.com m3-a02n02.alteeve.com ]
 Clone Set: drbd-clone [drbd]
     Started: [ m3-a02n01.alteeve.com m3-a02n02.alteeve.com ]
 srv01-c7    (ocf::heartbeat:VirtualDomain):    Started m3-a02n01.alteeve.com

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
====

  I then issue the iptables command on node 2. Journald output from node 1:

====
-- Logs begin at Sat 2018-02-10 17:51:59 GMT. --
Feb 11 06:20:18 m3-a02n01.alteeve.com crmd[2817]:   notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0 m3-a02n02.alteeve.com: PingAck did not arrive in time.
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0: susp-io( no -> fencing)
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0 m3-a02n02.alteeve.com: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0 m3-a02n02.alteeve.com: ack_receiver terminated
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0 m3-a02n02.alteeve.com: Terminating ack_recv thread
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0 m3-a02dr01.alteeve.com: Preparing remote state change 1400759070 (primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFA)
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0 m3-a02dr01.alteeve.com: Committing remote state change 1400759070
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: pdsk( DUnknown -> Outdated )
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0/0 drbd0: new current UUID: 769A55B47EB143CD weak: FFFFFFFFFFFFFFFA
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0: susp-io( fencing -> no)
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0 m3-a02n02.alteeve.com: Connection closed
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0 m3-a02n02.alteeve.com: conn( NetworkFailure -> Unconnected )
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0 m3-a02n02.alteeve.com: Restarting receiver thread
Feb 11 06:28:57 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0 m3-a02n02.alteeve.com: conn( Unconnected -> Connecting )
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0 m3-a02n02.alteeve.com: Handshake to peer 1 successful: Agreed network protocol version 112
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0 m3-a02n02.alteeve.com: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0 m3-a02n02.alteeve.com: Starting ack_recv thread (from drbd_r_srv01-c7 [3336])
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0: Preparing cluster-wide state change 140629015 (0->1 499/145)
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0: State change 140629015: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFF8
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0: Committing cluster-wide state change 140629015 (0ms)
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0 m3-a02n02.alteeve.com: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: drbd_sync_handshake:
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: self 769A55B47EB143CD:4CF0E17ADD9D1E0F:4161585F99D3837C:361856E4E3DE837C bits:0 flags:120
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: peer 4CF0E17ADD9D1E0E:0000000000000000:4CF0E17ADD9D1E0E:4161585F99D3837C bits:0 flags:120
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: uuid_compare()=2 by rule 70
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: repl( Off -> WFBitMapS )
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: helper command: /sbin/drbdadm before-resync-source
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: helper command: /sbin/drbdadm before-resync-source exit code 0 (0x0)
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: pdsk( Outdated -> Inconsistent ) repl( WFBitMapS -> SyncSource )
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: Began resync as SyncSource (will sync 0 KB [0 bits set]).
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: updated UUIDs 769A55B47EB143CD:0000000000000000:4CF0E17ADD9D1E0E:4161585F99D3837C
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n02.alteeve.com: pdsk( Inconsistent -> UpToDate ) repl( SyncSource -> Established )
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0 m3-a02n02.alteeve.com: helper command: /sbin/drbdadm unfence-peer
Feb 11 06:29:18 m3-a02n01.alteeve.com kernel: drbd srv01-c7_0 m3-a02n02.alteeve.com: helper command: /sbin/drbdadm unfence-peer exit code 0 (0x0)
====

Journald output from node 2:

====
-- Logs begin at Sun 2018-02-11 06:18:20 GMT. --
Feb 11 06:20:30 m3-a02n02.alteeve.com sshd[1968]: pam_unix(sshd:session): session opened for user root by (uid=0)
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0 m3-a02n01.alteeve.com: PingAck did not arrive in time.
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0 m3-a02n01.alteeve.com: conn( Connected -> NetworkFailure ) peer( Primary -> Unknown )
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0: disk( UpToDate -> Consistent )
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0 m3-a02n01.alteeve.com: ack_receiver terminated
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0 m3-a02n01.alteeve.com: Terminating ack_recv thread
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0: Preparing cluster-wide state change 1400759070 (1->-1 0/0)
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0: State change 1400759070: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFA
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0: Committing cluster-wide state change 1400759070 (1ms)
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0: disk( Consistent -> Outdated )
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0 m3-a02n01.alteeve.com: Connection closed
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0 m3-a02n01.alteeve.com: conn( NetworkFailure -> Unconnected )
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0 m3-a02n01.alteeve.com: Restarting receiver thread
Feb 11 06:28:57 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0 m3-a02n01.alteeve.com: conn( Unconnected -> Connecting )
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0 m3-a02n01.alteeve.com: Handshake to peer 0 successful: Agreed network protocol version 112
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0 m3-a02n01.alteeve.com: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0 m3-a02n01.alteeve.com: Starting ack_recv thread (from drbd_r_srv01-c7 [1885])
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0 m3-a02n01.alteeve.com: Preparing remote state change 140629015 (primary_nodes=0, weak_nodes=0)
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0 m3-a02n01.alteeve.com: Committing remote state change 140629015
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0 m3-a02n01.alteeve.com: conn( Connecting -> Connected ) peer( Unknown -> Primary )
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: drbd_sync_handshake:
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: self 4CF0E17ADD9D1E0E:0000000000000000:4CF0E17ADD9D1E0E:4161585F99D3837C bits:0 flags:120
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: peer 769A55B47EB143CD:4CF0E17ADD9D1E0F:4161585F99D3837C:361856E4E3DE837C bits:0 flags:120
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: uuid_compare()=-2 by rule 50
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: pdsk( DUnknown -> UpToDate ) repl( Off -> WFBitMapT )
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: helper command: /sbin/drbdadm before-resync-target
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: helper command: /sbin/drbdadm before-resync-target exit code 0 (0x0)
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0: disk( Outdated -> Inconsistent )
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02dr01.alteeve.com: resync-susp( no -> connection dependency )
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: repl( WFBitMapT -> SyncTarget )
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: updated UUIDs 769A55B47EB143CC:0000000000000000:4CF0E17ADD9D1E0E:4161585F99D3837C
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0: disk( Inconsistent -> UpToDate )
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02dr01.alteeve.com: resync-susp( connection dependency -> no )
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: repl( SyncTarget -> Established )
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: helper command: /sbin/drbdadm after-resync-target
Feb 11 06:29:18 m3-a02n02.alteeve.com kernel: drbd srv01-c7_0/0 drbd0 m3-a02n01.alteeve.com: helper command: /sbin/drbdadm after-resync-target exit code 0 (0x0)
====

DRBD status on both nodes, after the iptables break:

==== [root@m3-a02n01 ~]# drbdsetup status --verbose
srv01-c7_0 node-id:0 role:Primary suspended:no
  volume:0 minor:0 disk:UpToDate quorum:yes blocked:no
  m3-a02dr01.alteeve.com node-id:2 connection:Connected role:Secondary congested:no
    volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
  m3-a02n02.alteeve.com node-id:1 connection:Connected role:Secondary congested:no
    volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
==== [root@m3-a02n02 ~]# drbdsetup status --verbose
srv01-c7_0 node-id:1 role:Secondary suspended:no
  volume:0 minor:0 disk:UpToDate quorum:yes blocked:no
  m3-a02dr01.alteeve.com node-id:2 connection:Connected role:Secondary congested:no
    volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
  m3-a02n01.alteeve.com node-id:0 connection:Connected role:Primary congested:no
    volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
====

The cluster still thinks all is well, too.

==== [root@m3-a02n01 ~]# pcs status
Cluster name: m3-anvil-02
Stack: corosync
Current DC: m3-a02n01.alteeve.com (version 1.1.16-12.el7_4.7-94ff4df) - partition with quorum
Last updated: Sun Feb 11 06:33:48 2018
Last change: Sun Feb 11 02:35:25 2018 by root via crm_resource on m3-a02n01.alteeve.com

2 nodes configured
7 resources configured

Online: [ m3-a02n01.alteeve.com m3-a02n02.alteeve.com ]

Full list of resources:

 virsh_node1    (stonith:fence_virsh):    Started m3-a02n01.alteeve.com
 virsh_node2    (stonith:fence_virsh):    Started m3-a02n02.alteeve.com
 Clone Set: hypervisor-clone [hypervisor]
     Started: [ m3-a02n01.alteeve.com m3-a02n02.alteeve.com ]
 Clone Set: drbd-clone [drbd]
     Started: [ m3-a02n01.alteeve.com m3-a02n02.alteeve.com ]
 srv01-c7    (ocf::heartbeat:VirtualDomain):    Started m3-a02n01.alteeve.com

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
====

To verify the break, I confirmed that node 1 can't connect to node 2's DRBD port:

==== [root@m3-a02n01 ~]# telnet m3-a02n02.sn 7788
Trying 10.41.20.2...
telnet: connect to address 10.41.20.2: Connection timed out
====
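  (To narrow it down, I could list the established TCP sessions on each node and see which side actually holds the socket -- a quick check, assuming 'ss' from iproute2 is installed:)

```shell
# List established TCP sessions touching the DRBD ports. If node 2
# re-opened the link itself, its session will show a local ephemeral
# port connected to node 1's 7788, which an INPUT --dport 7788:7790
# rule on node 2 would not match.
ss -tn | grep -E ':(7788|7789|7790)\b'
```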

  Did it somehow maintain the connection through node 3?

  If not, then: a) Why wasn't the fence handler invoked? b) Why does it still show as connected?

  If so, is the connection between nodes 1 and 2 still protocol C, even though the 1 <-> 3 and 2 <-> 3 connections are protocol A?
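  (If it helps, I believe the running per-connection settings can be inspected with drbdsetup; something like this should show which protocol each path is actually using:)

```shell
# Show the running configuration for the resource; each 'connection'
# block includes its net options, including 'protocol'.
drbdsetup show srv01-c7_0
```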

Thanks!

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
