Hi folks, looking for some advice on MTU / IP behavior for multi-chassis ports.
22.06 introduced the new multichassis port concept, where the same port may be present on multiple chassis. This can be used as a performance optimization for VM live migration, among other things. Migration with sub-0.1ms network downtime is achieved by cloning all traffic directed towards a multichassis port to all its locations, so that every chassis hosting the port receives the relevant traffic at any given moment until the migration is complete.

For logical switches with a localnet port, where traffic usually goes through the localnet port and not through tunnels, this means enforcing tunneling of egress and ingress traffic for a multichassis port. (The rest of the traffic, between other, non-multichassis ports, keeps going through the localnet port.) Tunneling is enforced because traffic sent through the localnet port won't generally be cloned to all port binding locations by the upstream switch.

A problem arises when ingress or egress traffic of a multichassis port, redirected through a tunnel, gets lost because of the Geneve header overhead, or because the interface used for tunneling has a different MTU than the physical bridge backing the localnet port. This happens when:

    tunnel_iface_mtu < localnet_mtu + geneve_overhead

This makes for an unfortunate situation where, for a multichassis port, SOME traffic (e.g. regular ICMP requests) passes through without any problems, while OTHER traffic (e.g. produced with 'ping -s') doesn't. (A test scenario demonstrating this is included below.)

Here are the ideas I came up with for how this could be resolved, or at least mitigated:

a) Pass "oversized" traffic through the localnet port and the rest through tunnels. Apart from the confusing pattern on the wire, where packets belonging to the same TCP session may take two different paths (arguably not a big problem, and something other network hardware should expect), this gets us back to the problem of the upstream switch not delivering packets to all binding chassis.

b) Fragmentation needed.
We could send ICMP Fragmentation Needed errors on attempts to send oversized packets to or from a multichassis port. TCP sessions could then adjust their properties to reflect the newly advertised path MTU. We already have a similar mechanism for router ports.

There are several caveats and limitations with fragmentation needed (b):

- AFAIU this works for some protocols but not others, and it relies on the client network stack to react to the change in the network path. TCP should probably work.
- It may be confusing for users that some L2 paths between ports of the same switch have a reduced MTU while others have the regular one, depending on the type of port (single- or multi-chassis).

c) Implement fragmentation inside the L2 domain. I actually don't know whether that's even compliant with the RFCs. Usually packet fragmentation is implemented at the L2 domain boundary, by a router. In this scenario, a peer port on the same switch would receive fragments of a packet that was sent as a single piece.

I currently lean towards (b), though it's not a universal fix since it requires the cooperation of the underlying network stack. But (a) leaves cloning broken, and (c) is even more invasive, acting on a port's packets (fragmenting them) without the port owner's knowledge.

Perhaps this is all unnecessary and there is a way to make OVN transparently split and reassemble packets as needed, though I doubt it, since OVN doesn't encapsulate tunneled traffic in another application-layer protocol. But if there is a way to achieve transparent reassembly, or there are other alternatives beyond (a)-(c), let me know. Please let me know what you think of (a)-(c) regardless.
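To make the failure condition concrete, here is a minimal sketch of the arithmetic (Python, hypothetical helper names). The 58-byte overhead value is my assumption for an IPv4 underlay carrying OVN's Geneve metadata option, not something taken from the patch; the exact figure depends on the underlay IP version and the tunnel options in use.

```python
# Sketch of the MTU condition from the mail, with an assumed Geneve
# overhead of 58 bytes for an IPv4 underlay (inner Ethernet header 14
# + outer IPv4 20 + UDP 8 + Geneve base header 8 + OVN metadata
# option 8).  Treat this constant as illustrative.
GENEVE_OVERHEAD = 58


def tunnel_can_carry_localnet_mtu(tunnel_iface_mtu, localnet_mtu,
                                  overhead=GENEVE_OVERHEAD):
    """False exactly when the problematic condition holds:
    tunnel_iface_mtu < localnet_mtu + geneve_overhead."""
    return tunnel_iface_mtu >= localnet_mtu + overhead


def max_safe_inner_mtu(tunnel_iface_mtu, overhead=GENEVE_OVERHEAD):
    """Largest inner IP packet that survives redirection through the
    tunnel, i.e. the effective MTU of a multichassis port."""
    return tunnel_iface_mtu - overhead


# With both interfaces at 1500, a multichassis port effectively loses
# 58 bytes of MTU once its traffic is forced through Geneve:
assert not tunnel_can_carry_localnet_mtu(1500, 1500)
assert max_safe_inner_mtu(1500) == 1442
```

Under these assumptions, a 1442-byte inner IP packet is the largest that gets through; subtracting the 20-byte IP header gives 1422 bytes of payload, which is consistent with the 1422/1423 boundary probed by the test in the patch.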
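For reference on option (b), the signal involved is the classic Path MTU Discovery one: an ICMPv4 Destination Unreachable (type 3) with code 4, "fragmentation needed and DF set", carrying the next-hop MTU in the second word (RFC 792 as amended by RFC 1191). A self-contained sketch of what such a message looks like on the wire; the helper names are made up for illustration, and this is not OVN code:

```python
import struct


def ip_checksum(data: bytes) -> int:
    """Standard Internet (ones' complement) checksum over data."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    total = (total >> 16) + (total & 0xFFFF)
    total += total >> 16
    return ~total & 0xFFFF


def frag_needed(next_hop_mtu: int, orig_ip_header_plus_8: bytes) -> bytes:
    """Build the ICMP message body: type 3, code 4, checksum, 2 unused
    bytes, next-hop MTU, then the offending packet's IP header plus the
    first 8 bytes of its payload (per RFC 1191)."""
    hdr = struct.pack("!BBHHH", 3, 4, 0, 0, next_hop_mtu)
    csum = ip_checksum(hdr + orig_ip_header_plus_8)
    return (struct.pack("!BBHHH", 3, 4, csum, 0, next_hop_mtu)
            + orig_ip_header_plus_8)
```

A TCP stack receiving this caches the advertised next-hop MTU for the destination and retransmits with smaller segments, which is exactly the cooperation from the client network stack that (b) depends on.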
Thanks,
Ihar

---
 tests/ovn.at | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 92 insertions(+)

diff --git a/tests/ovn.at b/tests/ovn.at
index c346975e6..4ec1d37c3 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -14966,6 +14966,98 @@ OVN_CLEANUP([hv1],[hv2],[hv3])
 AT_CLEANUP
 ])
+OVN_FOR_EACH_NORTHD([
+AT_SETUP([localnet connectivity with multiple requested-chassis, max mtu])
+AT_KEYWORDS([multi-chassis])
+ovn_start
+
+net_add n1
+for i in 1 2; do
+    sim_add hv$i
+    as hv$i
+    check ovs-vsctl add-br br-phys
+    ovn_attach n1 br-phys 192.168.0.$i
+    check ovs-vsctl set open . external-ids:ovn-bridge-mappings=phys:br-phys
+done
+
+check ovn-nbctl ls-add ls0
+check ovn-nbctl lsp-add ls0 first
+check ovn-nbctl lsp-add ls0 second
+check ovn-nbctl lsp-add ls0 migrator
+check ovn-nbctl lsp-set-addresses first "00:00:00:00:00:01 10.0.0.1"
+check ovn-nbctl lsp-set-addresses second "00:00:00:00:00:02 10.0.0.2"
+check ovn-nbctl lsp-set-addresses migrator "00:00:00:00:00:ff 10.0.0.100"
+
+check ovn-nbctl lsp-add ls0 public
+check ovn-nbctl lsp-set-type public localnet
+check ovn-nbctl lsp-set-addresses public unknown
+check ovn-nbctl lsp-set-options public network_name=phys
+
+check ovn-nbctl lsp-set-options first requested-chassis=hv1 vif-plug-mtu-request=1500
+check ovn-nbctl lsp-set-options second requested-chassis=hv2 vif-plug-mtu-request=1500
+check ovn-nbctl lsp-set-options migrator requested-chassis=hv1,hv2 vif-plug-mtu-request=1500
+
+as hv1 check ovs-vsctl -- add-port br-int first -- \
+    set Interface first external-ids:iface-id=first \
+    options:tx_pcap=hv1/first-tx.pcap \
+    options:rxq_pcap=hv1/first-rx.pcap \
+    ofport-request=1
+as hv2 check ovs-vsctl -- add-port br-int second -- \
+    set Interface second external-ids:iface-id=second \
+    options:tx_pcap=hv2/second-tx.pcap \
+    options:rxq_pcap=hv2/second-rx.pcap \
+    ofport-request=2
+
+# Create Migrator interfaces on both hv1 and hv2
+for hv in hv1 hv2; do
+    as $hv check ovs-vsctl -- add-port br-int migrator -- \
+        set Interface migrator external-ids:iface-id=migrator \
+        options:tx_pcap=$hv/migrator-tx.pcap \
+        options:rxq_pcap=$hv/migrator-rx.pcap \
+        ofport-request=100
+done
+
+send_icmp_packet() {
+    local inport=$1 hv=$2 eth_src=$3 eth_dst=$4 ipv4_src=$5 ipv4_dst=$6 ip_chksum=$7 data=$8
+    shift 8
+
+    local ip_ttl=ff
+    local ip_len=001c
+    local packet=${eth_dst}${eth_src}08004500${ip_len}00004000${ip_ttl}01${ip_chksum}${ipv4_src}${ipv4_dst}${data}
+    as hv$hv ovs-appctl netdev-dummy/receive $inport $packet
+    echo $packet
+}
+
+check ovn-nbctl lsp-set-addresses first "00:00:00:00:00:01 10.0.0.1"
+check ovn-nbctl lsp-set-addresses second "00:00:00:00:00:02 10.0.0.2"
+check ovn-nbctl lsp-set-addresses migrator "00:00:00:00:00:ff 10.0.0.100"
+
+first_mac=000000000001
+second_mac=000000000002
+migrator_mac=0000000000ff
+first_ip=$(ip_to_hex 10 0 0 1)
+second_ip=$(ip_to_hex 10 0 0 2)
+migrator_ip=$(ip_to_hex 10 0 0 100)
+
+OVN_POPULATE_ARP
+
+for len in 1422 1423; do
+    data=$(xxd -l $len -c $len -p < /dev/zero)
+    packet=$(send_icmp_packet migrator 1 $migrator_mac $first_mac $migrator_ip $first_ip 0000 $data)
+    echo $packet > hv1/first.expected
+
+    packet=$(send_icmp_packet migrator 1 $migrator_mac $second_mac $migrator_ip $second_ip 0000 $data)
+    echo $packet > hv2/second.expected
+done
+
+OVN_CHECK_PACKETS_REMOVE_BROADCAST([hv1/first-tx.pcap], [hv1/first.expected])
+OVN_CHECK_PACKETS_REMOVE_BROADCAST([hv2/second-tx.pcap], [hv2/second.expected])
+
+OVN_CLEANUP([hv1],[hv2])
+
+AT_CLEANUP
+])
+
 OVN_FOR_EACH_NORTHD([
 AT_SETUP([options:activation-strategy for logical port])
 AT_KEYWORDS([multi-chassis])
-- 
2.34.1

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev