Package: ifenslave
Version: 2.13
Severity: normal
Hi,
I have a simple bonding configuration with two physical Ethernet
interfaces, both defined with the `allow-hotplug` option. I use
`allow-hotplug` because the interfaces are on USB. For the same
reason, I use the configuration style that sets the `bond-master`
option in each slave's stanza, rather than `bond-slaves` in the
bonding master's stanza.
This configuration is similar to the one in
examples/two_hotplug_ethernet. I'm attaching a copy of my
/etc/network/interfaces file.
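For readers without the attachment, the `bond-master` style described above looks roughly like the sketch below. This is not the attached file: the interface names, bonding mode, primary slave, MII polling interval and IPv4 address are taken from the command output in this report, while the overall layout is an assumption, and any wireless authentication settings are left out.

```
# Illustrative sketch only, not the attached /etc/network/interfaces.
# Wireless (wpa-*) settings for wlx0013efd01275 are omitted.
allow-hotplug enxb827eb9e4634
iface enxb827eb9e4634 inet manual
    bond-master bond0

allow-hotplug wlx0013efd01275
iface wlx0013efd01275 inet manual
    bond-master bond0

iface bond0 inet static
    address 192.168.66.19/21
    bond-mode active-backup
    bond-miimon 100
    bond-primary enxb827eb9e4634
```

Note that bond0 has no `auto` or `allow-hotplug` line of its own in this sketch; in this style the slave's pre-up script is what brings the master up.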
The problem is that sometimes after a reboot, I find that one of my
interfaces (wlx0013efd01275: an external, wireless USB interface) is
not a member of the bond:
$ ip addr
1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enxb827eb9e4634: mtu 1500 qdisc pfifo_fast master bond0 state UP group default qlen 1000
    link/ether b8:27:eb:9e:46:34 brd ff:ff:ff:ff:ff:ff
3: wlx0013efd01275: mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:13:ef:d0:12:75 brd ff:ff:ff:ff:ff:ff
    inet6 2002:ce3f:e590:2:213:efff:fed0:1275/64 scope global dynamic mngtmpaddr
       valid_lft 86179sec preferred_lft 14179sec
    inet6 2002:ce3f:e590:1:213:efff:fed0:1275/64 scope global dynamic mngtmpaddr
       valid_lft 86179sec preferred_lft 14179sec
    inet6 fe80::213:efff:fed0:1275/64 scope link
       valid_lft forever preferred_lft forever
4: bond0: mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether b8:27:eb:9e:46:34 brd ff:ff:ff:ff:ff:ff
    inet 192.168.66.19/21 brd 192.168.71.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet6 2002:ce3f:e590:1:1::19/64 scope global nodad
       valid_lft forever preferred_lft forever
    inet6 fe80::ba27:ebff:fe9e:4634/64 scope link
       valid_lft forever preferred_lft forever
$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v5.10.0-10-rpi
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: enxb827eb9e4634 (primary_reselect always)
Currently Active Slave: enxb827eb9e4634
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0
Slave Interface: enxb827eb9e4634
MII Status: up
Speed: 100 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b8:27:eb:9e:46:34
Slave queue ID: 0
$ /sbin/ifquery --state
lo=lo
bond0=bond0
wlx0013efd01275=wlx0013efd01275
enxb827eb9e4634=enxb827eb9e4634
In this situation, the following is logged to syslog (via systemd):
sh[299]: Failed to enslave wlx0013efd01275 to bond0.
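A quick way to confirm which interfaces the kernel considers enslaved is to read the bond's slave list from sysfs. The sketch below assumes the bonding driver's `/sys/class/net/<bond>/bonding/slaves` file, which holds a space-separated list of slave names. The `is_slave` helper and the `SYSFS_NET` override are hypothetical names introduced here for illustration and testing.

```shell
#!/bin/sh
# Root of the sysfs network tree. Overridable so the helper can be pointed
# at a fake directory (SYSFS_NET is a name made up for this sketch).
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# is_slave BOND IFACE: succeed if IFACE appears in BOND's slave list.
is_slave() {
    slaves_file="$SYSFS_NET/$1/bonding/slaves"
    # No readable slave list means the bond does not exist (yet).
    [ -r "$slaves_file" ] || return 1
    for s in $(cat "$slaves_file"); do
        if [ "$s" = "$2" ]; then
            return 0
        fi
    done
    return 1
}
```

With the state shown above, `is_slave bond0 enxb827eb9e4634` would be expected to succeed and `is_slave bond0 wlx0013efd01275` to fail.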
I think I understand the cause, and propose a workaround or a bug fix
(depending on how you look at it).
When both devices are present at boot, two ifup@.service units
(one for each interface) are started simultaneously. It seems like
the ifenslave ifupdown scripts are meant to handle this, but in
/etc/network/if-pre-up.d/ifenslave, in the definition of setup_slave_device(),
there is:
# Ensure the master is up or being configured
export IFENSLAVE_ENV_NAME="IFUPDOWN_$IF_BOND_MASTER"
IFUPDOWN_IF_BOND_MASTER="$(printenv "$IFENSLAVE_ENV_NAME")"
unset IFENSLAVE_ENV_NAME
if [ -z "$IFUPDOWN_IF_BOND_MASTER" ] ; then
    ifquery --state "$IF_BOND_MASTER" >/dev/null 2>&1 || ifup "$IF_BOND_MASTER"
fi
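The first three lines of that excerpt perform an indirect environment lookup: they build the variable name `IFUPDOWN_<master>` and read its value with `printenv`, since POSIX sh has no direct syntax for this. A minimal, self-contained sketch of the same idiom follows. The `lookup_master_flag` helper and the value `pre-up` are placeholders chosen for illustration and are not taken from ifupdown.

```shell
#!/bin/sh
# lookup_master_flag MASTER: print the value of the environment variable
# IFUPDOWN_<MASTER>, or nothing if it is unset.
lookup_master_flag() {
    # printenv exits non-zero for an unset variable; ignore that here.
    printenv "IFUPDOWN_$1" || true
}

# Pretend the environment already marks bond0 (placeholder value).
export IFUPDOWN_bond0="pre-up"

# Non-empty: the excerpt's "if [ -z ... ]" test fails and the
# ifquery/ifup step is skipped.
flag_bond0="$(lookup_master_flag bond0)"

# Empty: the excerpt falls through to the ifquery/ifup step.
flag_bond1="$(lookup_master_flag bond1)"

echo "bond0: '${flag_bond0}' bond1: '${flag_bond1}'"
```

Only the empty case reaches the `ifquery --state ... || ifup ...` line, which is where the race discussed in this report takes place.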
After adding a lot of debug output and rebooting many times, I found
that when the problem occurs (at least in the case I captured), both
concurrently running processes reach the `ifquery` line. One process
runs ifquery, gets a non-zero return code, and therefore runs
`ifup "$IF_BOND_MASTER"`. After that ifup has started, the other
process runs ifquery and gets a zero return code. At that point the
bond interface has not come up yet, but the command to bring it up
has started, which I think is why ifquery returns zero. The second
process then presumably goes on to enslave its interface to a bond
that does not exist yet. I can reproduce this behavior of ifquery
with a dummy interface:
iface dummy0 inet manual
    pre-up modprobe dummy
    pre-up sleep 5
    up ip link add dummy0 type dummy
    down ip link del dummy0
$ q() { sudo ifquery --state dummy0; echo " => $?"; }
$ q
=> 1
$ sudo ifup dummy0 & sleep 1; q; ip link show dummy0; ps $!
[1] 7956
dummy0=dummy0
=> 0
Device "dummy0" does not exist.
  PID TTY      STAT   TIME COMMAND
 7956 pts/1    S      0:00 sudo ifup dummy0
This shows that ifquery does return 0, even while ifup is still
working.
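The behavior seen in that transcript can be imitated without touching any real interface. The sketch below is a simulation under one assumption drawn from the transcript: that the state queried by `ifquery --state` is recorded before the pre-up commands finish. `fake_ifup`, `fake_ifquery` and `device_exists` are stand-ins written for this sketch, not ifupdown functions.

```shell
#!/bin/sh
# Simulation only: two marker files stand in for ifupdown's state record
# and for the kernel network device.
workdir=$(mktemp -d)
state="$workdir/state"
device="$workdir/device"

fake_ifup() {
    : > "$state"     # the state is recorded first (assumption, see lead-in)
    sleep 2          # slow pre-up work, like "pre-up sleep 5" above
    : > "$device"    # only now does the device appear
}

fake_ifquery()  { [ -e "$state" ]; }    # stand-in for "ifquery --state"
device_exists() { [ -e "$device" ]; }   # stand-in for "ip link show"

fake_ifup &          # first process: starts bringing the master up
sleep 1              # second process: arrives a little later
if fake_ifquery && ! device_exists; then
    result="state says up, device missing"
else
    result="no race observed"
fi
wait                 # let the background fake_ifup finish
echo "$result"
rm -rf "$workdir"
```

The second check lands inside the window between the state being recorded and the device appearing, which is the same window the second slave script appears to hit at boot.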
From my reading of ifquery(8), this is possibly a bug ("successful
code is returned if all of interfaces given as arguments are up").
Either ifquery is supposed to return failure in this case (meaning
ifquery has a bug, since the interface hasn't finished coming up yet),
or it is valid for it to return suc