Hi,

I recently built a highly available network using an ECMP route over two isolated L2 switches, as follows:

    Router-- ToR switch 1 ---- Linux
      |      192.168.11.1/24     |   eth0: 192.168.11.2/24
      |                          |   eth1: 192.168.12.2/24
      +-- ToR switch 2 ----------+
          192.168.12.1/24

The (default) route has been configured with:

    $ sudo ip route add default \
        nexthop via 192.168.11.1 \
        nexthop via 192.168.12.1

Then I found that Linux chooses the wrong outgoing device for some destination/source address pairs, like this:

    $ ip route get 12.34.56.78 from 192.168.12.2
    12.34.56.78 from 192.168.12.2 via 192.168.11.1 dev eth0 uid 0
    # dev should be "eth1"

As a consequence, programs like SSH or curl do not work for such destinations, because routers drop packets whose source address does not belong to the network they arrive from. Unbound sockets also suffer from this problem. My guess is that Linux chooses a source address first and then picks the outgoing device independently of it.

Although I believe this is a bug in Linux, I found a possibly relevant comment in the function ip_route_output_key_hash_rcu in net/ipv4/route.c:

    /* I removed check for oif == dev_out->oif here.
       It was wrong for two reasons:
       1. ip_dev_find(net, saddr) can return wrong iface, if saddr
          is assigned to multiple interfaces.
       2. Moreover, we are allowed to send packets with saddr
          of another iface. --ANK
     */

Given point 2 of that comment, I wonder whether this behavior is intended. So, my questions are:

1. Is this intended or not?
2. If this is intended, how can I make programs work in this ECMP network?

I have created a simple script to reproduce the problem (attached below). The script creates a dedicated network namespace "testns" and configures an ECMP route in it.

So far, I can reproduce the problem with these Linux versions:

- 4.17-rc5 (Upstream)
- 4.15.0-20-generic (Ubuntu 18.04)
- 4.14.32-coreos (CoreOS)
- 4.13.0-37-generic (Ubuntu 16.04 HWE)
- 4.4.0-116-generic (Ubuntu 16.04)

Note that the problem is not limited to the default route. Any route configured as ECMP can cause the problem.
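
For checking a single destination/source pair by hand, the comparison shown earlier can be sketched as a small shell helper. This is only a sketch: check_route_dev is a hypothetical name (not part of iproute2), and it assumes the one-line output format of "ip -o route get" seen in the example above.

```shell
#!/bin/sh
# check_route_dev EXPECTED_DEV   (hypothetical helper, for illustration only)
# Reads one line of `ip -o route get ...` output on stdin and reports
# whether the device chosen by the kernel matches EXPECTED_DEV.
check_route_dev() {
    expected=$1
    read -r line
    # Extract the token that follows " dev " in the route line.
    actual=$(printf '%s\n' "$line" | sed -n 's/^.* dev \([^ ]*\).*$/\1/p')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $actual"
    else
        echo "WRONG: expected $expected, got $actual"
        return 1
    fi
}
```

It could be used, for instance, as "ip -o route get 12.34.56.78 from 192.168.12.2 | check_route_dev eth1"; the function returns non-zero when the device does not match.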

- ymmt

#!/bin/sh -e

NS=testns
BR1=testbr1
VETH1=testveth1
BR2=testbr2
VETH2=testveth2
LINKS="$VETH1 $VETH2 $BR1 $BR2"
NET1=192.168.11.xx/24
NET2=192.168.12.xx/24
IPNS="ip netns exec $NS ip"

# Remove the test links and namespace, if they exist.
clean() {
    for l in $LINKS; do
        if ip -o link show $l >/dev/null 2>&1; then
            ip link del $l
        fi
    done
    if ip netns list | grep -q $NS; then
        ip netns del $NS
    fi
}
trap clean INT QUIT TERM HUP PIPE 0

# Replace the "xx" placeholder in a network template with a host number.
make_address() {
    local net addr
    net=$1
    addr=$2
    echo $net | sed "s/xx/$addr/"
}

# Strip the prefix length from an address in CIDR notation.
cidr2ip() {
    echo $1 | cut -d / -f 1
}

GW1=$(make_address $NET1 1)
GW2=$(make_address $NET2 1)
ADDR1=$(make_address $NET1 2)
ADDR2=$(make_address $NET2 2)

# Create a bridge plus a veth pair, and move one end into the namespace.
setup_veth() {
    local br veth dest
    br=$1
    veth=$2
    dest=$3
    ip link add $br type bridge
    ip link add $veth type veth peer name ${veth}_
    ip link set $br up
    ip link set $veth master $br up
    ip link set ${veth}_ netns $NS name $dest up
}

setup() {
    ip netns add $NS
    $IPNS link set lo up
    setup_veth $BR1 $VETH1 eth0
    setup_veth $BR2 $VETH2 eth1

    local gw1 gw2
    ip addr add $GW1 dev $BR1
    ip addr add $GW2 dev $BR2
    $IPNS addr add $ADDR1 dev eth0
    $IPNS addr add $ADDR2 dev eth1
    # ECMP default route with one nexthop per bridge.
    $IPNS route add 0.0.0.0/0 nexthop via $(cidr2ip $GW1) nexthop via $(cidr2ip $GW2)
}

# Report when the device chosen for (dest, from) differs from the expected one.
test_route_from() {
    local dest dev from r rdev
    dest=$1
    dev=$2
    from=$3
    r=$($IPNS -o route get $dest from $from)
    rdev=$(echo $r | sed -nr 's/^.*dev (eth[[:digit:]]+).*/\1/p')
    if [ "$dev" != "$rdev" ]; then
        echo "WRONG dev/from pair: ip -o route get $dest from $from:"
        printf "%s\n" "$r"
        return
    fi
}

test_route() {
    test_route_from "$1" eth0 $(cidr2ip $ADDR1)
    test_route_from "$1" eth1 $(cidr2ip $ADDR2)
}

run_tests() {
    test_route 12.34.56.78
    test_route 216.58.200.160
    test_route 216.58.200.161
    test_route 216.58.200.162
    test_route 216.58.200.163
    test_route 216.58.200.164
    test_route 52.85.149.10
    test_route 52.85.149.11
    test_route 52.85.149.12
    test_route 52.85.149.13
    test_route 52.85.149.14
}

# main
setup
run_tests
read -p "Press enter to finish" ret