[Bug 1815101] Re: [master] Restarting systemd-networkd breaks keepalived clusters

Rafael David Tinoco Wed, 25 Sep 2019 06:31:39 -0700

Alright,

This problem does not affect only keepalived: it affects any cluster
software that manages aliases on an existing interface, whether or not
that interface is managed by systemd. To confirm, I ran the same test
case on a 3-node pacemaker cluster with 1 virtual IP and a lighttpd
instance in the same resource group:


----

(k)inaddy@kcluster01:~$ crm config show
node 1: kcluster01
node 2: kcluster02
node 3: kcluster03
primitive fence_kcluster01 stonith:fence_virsh \
        params ipaddr=192.168.100.205 plug=kcluster01 action=off login=stonithmgr passwd=xxxx use_sudo=true delay=2 \
        op monitor interval=60s
primitive fence_kcluster02 stonith:fence_virsh \
        params ipaddr=192.168.100.205 plug=kcluster02 action=off login=stonithmgr passwd=xxxx use_sudo=true delay=4 \
        op monitor interval=60s
primitive fence_kcluster03 stonith:fence_virsh \
        params ipaddr=192.168.100.205 plug=kcluster03 action=off login=stonithmgr passwd=xxxx use_sudo=true delay=6 \
        op monitor interval=60s
primitive virtual_ip IPaddr2 \
        params ip=10.0.3.1 nic=eth3 \
        op monitor interval=10s
primitive webserver systemd:lighttpd \
        op monitor interval=10 timeout=60
group webserver_virtual_ip webserver virtual_ip
location l_fence_kcluster01 fence_kcluster01 -inf: kcluster01
location l_fence_kcluster02 fence_kcluster02 -inf: kcluster02
location l_fence_kcluster03 fence_kcluster03 -inf: kcluster03
property cib-bootstrap-options: \
        have-watchdog=true \
        dc-version=2.0.1-9e909a5bdd \
        cluster-infrastructure=corosync \
        cluster-name=debian \
        stonith-enabled=true \
        stonith-action=off \
        no-quorum-policy=stop

----

(k)inaddy@kcluster01:~$ cat /etc/netplan/cluster.yaml 
network:
    version: 2
    renderer: networkd
    ethernets:
        eth1:
            dhcp4: no
            dhcp6: no
            addresses: [10.0.1.2/24]
        eth2:
            dhcp4: no
            dhcp6: no
            addresses: [10.0.2.2/24]
        eth3:
            dhcp4: no
            dhcp6: no
            addresses: [10.0.3.2/24]
        eth4:
            dhcp4: no
            dhcp6: no
            addresses: [10.0.4.2/24]
        eth5:
            dhcp4: no
            dhcp6: no
            addresses: [10.0.5.2/24]

----

And the virtual IP failed right after netplan acted on the
systemd-managed network interface:

(k)inaddy@kcluster03:~$ sudo netplan apply
(k)inaddy@kcluster03:~$ ping 10.0.3.1
PING 10.0.3.1 (10.0.3.1) 56(84) bytes of data.
From 10.0.3.4 icmp_seq=1 Destination Host Unreachable
From 10.0.3.4 icmp_seq=2 Destination Host Unreachable
From 10.0.3.4 icmp_seq=3 Destination Host Unreachable
From 10.0.3.4 icmp_seq=4 Destination Host Unreachable
From 10.0.3.4 icmp_seq=5 Destination Host Unreachable
From 10.0.3.4 icmp_seq=6 Destination Host Unreachable
64 bytes from 10.0.3.1: icmp_seq=7 ttl=64 time=0.088 ms
64 bytes from 10.0.3.1: icmp_seq=8 ttl=64 time=0.076 ms

--- 10.0.3.1 ping statistics ---
8 packets transmitted, 2 received, +6 errors, 75% packet loss, time 7128ms
rtt min/avg/max/mdev = 0.076/0.082/0.088/0.006 ms, pipe 4

This matches the behaviour explained in the bug description.
Pacemaker's virtual_ip monitor operation then noticed that the virtual
IP was gone and restarted it on the same node:

----

(k)inaddy@kcluster01:~$ crm status
Stack: corosync
Current DC: kcluster01 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Wed Sep 25 13:11:05 2019
Last change: Wed Sep 25 12:49:56 2019 by root via cibadmin on kcluster01

3 nodes configured
5 resources configured

Online: [ kcluster01 kcluster02 kcluster03 ]

Full list of resources:

 fence_kcluster01       (stonith:fence_virsh):  Started kcluster02
 fence_kcluster02       (stonith:fence_virsh):  Started kcluster01
 fence_kcluster03       (stonith:fence_virsh):  Started kcluster01
 Resource Group: webserver_virtual_ip
     webserver  (systemd:lighttpd):     Started kcluster03
     virtual_ip (ocf::heartbeat:IPaddr2):       FAILED kcluster03

Failed Resource Actions:
* virtual_ip_monitor_10000 on kcluster03 'not running' (7): call=100, status=complete, exitreason='',
    last-rc-change='Wed Sep 25 13:11:05 2019', queued=0ms, exec=0ms

----

(k)inaddy@kcluster01:~$ crm status
Stack: corosync
Current DC: kcluster01 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Wed Sep 25 13:11:07 2019
Last change: Wed Sep 25 12:49:56 2019 by root via cibadmin on kcluster01

3 nodes configured
5 resources configured

Online: [ kcluster01 kcluster02 kcluster03 ]

Full list of resources:

 fence_kcluster01       (stonith:fence_virsh):  Started kcluster02
 fence_kcluster02       (stonith:fence_virsh):  Started kcluster01
 fence_kcluster03       (stonith:fence_virsh):  Started kcluster01
 Resource Group: webserver_virtual_ip
     webserver  (systemd:lighttpd):     Started kcluster03
     virtual_ip (ocf::heartbeat:IPaddr2):       Started kcluster03

Failed Resource Actions:
* virtual_ip_monitor_10000 on kcluster03 'not running' (7): call=100, status=complete, exitreason='',
    last-rc-change='Wed Sep 25 13:11:05 2019', queued=0ms, exec=0ms

----
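
To watch the address come and go by hand while running "netplan
apply", something like the sketch below can help. It is only an
approximation of the kind of check a monitor does, not what the
IPaddr2 agent actually runs, and the helper name has_addr is made up
for this example. It reads "ip -o -4 addr show" output on stdin and
succeeds only if the given address is listed:

```shell
# has_addr ADDRESS   (hypothetical helper, not part of any package)
# Reads `ip -o -4 addr show` output on stdin and exits 0 only if
# ADDRESS is configured, ignoring the /prefix length.
has_addr() {
    awk -v want="$1" '
        $3 == "inet" {
            split($4, parts, "/")          # "10.0.3.1/24" -> "10.0.3.1"
            if (parts[1] == want) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}

# Usage on a cluster node (interface and address from the config above):
#   ip -o -4 addr show dev eth3 | has_addr 10.0.3.1 && echo present || echo missing
```

Run in a loop on the node holding the resource, this should show the
address briefly missing between the "netplan apply" and the next
monitor interval.

----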

And, if I want, I can query the fail count of that particular resource
(virtual_ip) on that node, to check whether the resource was about to
migrate to another node because pacemaker treated this as a real
failure (and is it one?):

(k)inaddy@kcluster01:~$ sudo crm_failcount --query -r virtual_ip  -N kcluster03
scope=status  name=fail-count-virtual_ip value=5

So this resource has already failed 5 times on that node, and a "netplan
apply" could have caused the resource to migrate, for example.
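
Whether such a fail count leads to a migration depends on the
resource's migration-threshold meta attribute (pacemaker moves the
resource away from a node once the fail count reaches it; the default
is INFINITY, so the resource is never moved for this reason). A rough
sketch for comparing the two, where the helper names and the threshold
of 10 are assumptions for illustration only:

```shell
# failcount_value   (hypothetical helper)
# Reads `crm_failcount --query` output on stdin and prints the number
# after "value=". Prints nothing if the value is not numeric
# (for example INFINITY).
failcount_value() {
    sed -n 's/.*value=\([0-9][0-9]*\).*/\1/p'
}

# reached_threshold COUNT THRESHOLD   (hypothetical helper)
# Succeeds if COUNT is greater than or equal to THRESHOLD.
reached_threshold() {
    [ "$1" -ge "$2" ]
}

# Usage (the threshold of 10 is an assumed value, not from this cluster):
#   count=$(sudo crm_failcount --query -r virtual_ip -N kcluster03 | failcount_value)
#   [ -n "$count" ] && reached_threshold "$count" 10 && echo "virtual_ip would leave kcluster03"
```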

----

For pacemaker, the issue is not *that big* if the cluster is configured
correctly, with a resource monitor, as the cluster will always try to
restart the virtual IP grouped with the managed resource (lighttpd in
my case). Nevertheless, resource migrations and possible downtime could
happen in the event of multiple resource monitor failures.

I'll now check why keepalived can't simply re-establish the virtual IPs
after a failure, like pacemaker does, and whether systemd-networkd
should be changed to leave aliases alone when a specific flag is set,
or whether things are fine the way they are.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1815101

Title:
  [master] Restarting systemd-networkd breaks keepalived clusters

To manage notifications about this bug go to:
https://bugs.launchpad.net/netplan/+bug/1815101/+subscriptions
