Hi folks,

I'm looking for some advice on MTU / IP behavior for multi-chassis ports.

OVN 22.06 introduced the multichassis port concept, where the same port
may be present on multiple chassis at once; among other things, this
can be used as a performance optimization for VM live migration.
Migration with sub-0.1ms network downtime is achieved by cloning all
traffic directed towards a multichassis port to all of its locations,
so that every chassis hosting the port receives the relevant traffic at
any given moment until migration is complete.
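
For context, a port becomes multichassis simply by listing more than
one chassis in its requested-chassis option (the port and chassis names
below are just examples, matching the test at the end of this mail):

  # bound to a single chassis:
  ovn-nbctl lsp-set-options migrator requested-chassis=hv1
  # multichassis, present on both hv1 and hv2 for the migration window:
  ovn-nbctl lsp-set-options migrator requested-chassis=hv1,hv2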

For logical switches with a localnet port, where traffic usually goes
through the localnet port rather than through tunnels, this means that
tunneling is enforced for the ingress and egress traffic of a
multichassis port. (The rest of the traffic, between other,
non-multichassis ports, keeps going through the localnet port.)
Tunneling is enforced because traffic sent through the localnet port
generally won't be cloned to all port binding locations by the upstream
switch.
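
If you want to observe the enforcement, one rough way - assuming a
placeholder underlay interface name, here eth0 - is to watch for
Geneve-encapsulated copies of the port's traffic on each hosting
chassis, since Geneve uses UDP port 6081:

  # eth0 stands in for whatever interface carries the tunnel traffic
  tcpdump -nni eth0 udp port 6081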

A problem arises when ingress or egress traffic of a multichassis
port, redirected through a tunnel, gets lost because of the Geneve
header overhead, or because the interface used for tunneling has a
different MTU than the physical bridge backing the localnet port.

This happens when:
  - tunnel_iface_mtu < localnet_mtu + geneve_overhead
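
As a rough sanity check on a hypervisor (a sketch only: it assumes the
commonly cited 58 byte Geneve overhead for an IPv4 underlay, i.e. outer
IP 20 + UDP 8 + Geneve base 8 + OVN metadata option 8 + inner Ethernet
14, and placeholder interface names):

  # placeholders: eth0 carries tunnel traffic, br-phys backs the localnet port
  tunnel_mtu=$(cat /sys/class/net/eth0/mtu)
  localnet_mtu=$(cat /sys/class/net/br-phys/mtu)
  geneve_overhead=58  # IPv4 underlay; larger for IPv6 or extra options
  if [ "$tunnel_mtu" -lt $((localnet_mtu + geneve_overhead)) ]; then
      echo "oversized packets to/from a multichassis port may be dropped"
  fi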

This makes for an unfortunate situation where, for a multichassis
port, SOME traffic (e.g. regular ICMP requests) passes through without
any problems, while OTHER traffic (e.g. packets produced with
'ping -s') doesn't. (A test scenario demonstrating this is included
below.)

Here are the ideas I came up with for how this could be resolved, or
at least mitigated:

a) pass "oversized" traffic through localnet, and the rest through
tunnels. Apart from the confusing pattern on the wire, where packets
belonging to the same TCP session may take two different paths
(arguably not a big problem, and something other network hardware
should be able to cope with), this gets us back to the problem of the
upstream switch not delivering packets to all binding chassis.

b) fragmentation needed. We could send ICMP Fragmentation Needed
errors on attempts to send oversized packets to or from a multichassis
port. TCP sessions could then adjust their path MTU to reflect the new
recommended value. We already have a similar mechanism for router
ports; see the reminder below.
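
For reference, the existing router port mechanism is driven by the
gateway_mtu option; the command below is only a reminder of that knob
(the port name is made up), not a proposal for the exact interface a
multichassis port would use:

  # existing behavior: the router port replies with ICMP Fragmentation
  # Needed (or ICMPv6 Packet Too Big) for oversized packets
  ovn-nbctl set Logical_Router_Port lr0-public options:gateway_mtu=1442
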
There are several caveats and limitations with (b):

- AFAIU this works for some protocols but not others. It also depends
  on the client network stack to honor the change in path MTU. TCP
  should probably work.

- It may be confusing for users that some L2 paths between ports of
  the same switch have a reduced MTU while others have the regular one,
  depending on the type of port (single- or multi-chassis).
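
To illustrate the first caveat from the sender's side (a quick manual
check; the addresses match the test topology below, and 1472 is just
1500 minus the IP and ICMP headers):

  # with DF set, an oversized probe should fail, and the kernel should
  # cache a reduced path MTU once the Fragmentation Needed error arrives
  ping -M do -s 1472 -c 3 10.0.0.1
  ip route get 10.0.0.1  # a cached "mtu" attribute shows up if PMTUD kicked in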

c) implement fragmentation inside the L2 domain. I actually don't know
if that's even compliant with the RFCs. Usually packet fragmentation is
done at an L2 domain boundary, by a router. In this scenario, a peer
port on the same switch would receive fragments of a packet that was
sent as a single piece.

I currently lean towards (b), though it's not a universal fix since it
requires cooperation from the underlying network stack. But (a) leaves
cloning broken, and (c) is even more invasive, taking action on a
port's packets (fragmenting them) without the port owner's knowledge.

Perhaps this is all unnecessary and there's a way to make OVN
transparently split and reassemble packets as needed, though I doubt
it, since OVN doesn't encapsulate tunneled traffic in another
application-layer protocol. But if there's a way to achieve transparent
reassembly, or if there are other alternatives beyond (a)-(c), please
let me know. I'd also like to hear what you think of (a)-(c)
regardless.

Thanks,
Ihar
---
 tests/ovn.at | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 92 insertions(+)

diff --git a/tests/ovn.at b/tests/ovn.at
index c346975e6..4ec1d37c3 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -14966,6 +14966,98 @@ OVN_CLEANUP([hv1],[hv2],[hv3])
 AT_CLEANUP
 ])
 
+OVN_FOR_EACH_NORTHD([
+AT_SETUP([localnet connectivity with multiple requested-chassis, max mtu])
+AT_KEYWORDS([multi-chassis])
+ovn_start
+
+net_add n1
+for i in 1 2; do
+    sim_add hv$i
+    as hv$i
+    check ovs-vsctl add-br br-phys
+    ovn_attach n1 br-phys 192.168.0.$i
+    check ovs-vsctl set open . external-ids:ovn-bridge-mappings=phys:br-phys
+done
+
+check ovn-nbctl ls-add ls0
+check ovn-nbctl lsp-add ls0 first
+check ovn-nbctl lsp-add ls0 second
+check ovn-nbctl lsp-add ls0 migrator
+check ovn-nbctl lsp-set-addresses first "00:00:00:00:00:01 10.0.0.1"
+check ovn-nbctl lsp-set-addresses second "00:00:00:00:00:02 10.0.0.2"
+check ovn-nbctl lsp-set-addresses migrator "00:00:00:00:00:ff 10.0.0.100"
+
+check ovn-nbctl lsp-add ls0 public
+check ovn-nbctl lsp-set-type public localnet
+check ovn-nbctl lsp-set-addresses public unknown
+check ovn-nbctl lsp-set-options public network_name=phys
+
+check ovn-nbctl lsp-set-options first requested-chassis=hv1 vif-plug-mtu-request=1500
+check ovn-nbctl lsp-set-options second requested-chassis=hv2 vif-plug-mtu-request=1500
+check ovn-nbctl lsp-set-options migrator requested-chassis=hv1,hv2 vif-plug-mtu-request=1500
+
+as hv1 check ovs-vsctl -- add-port br-int first -- \
+    set Interface first external-ids:iface-id=first \
+    options:tx_pcap=hv1/first-tx.pcap \
+    options:rxq_pcap=hv1/first-rx.pcap \
+    ofport-request=1
+as hv2 check ovs-vsctl -- add-port br-int second -- \
+    set Interface second external-ids:iface-id=second \
+    options:tx_pcap=hv2/second-tx.pcap \
+    options:rxq_pcap=hv2/second-rx.pcap \
+    ofport-request=2
+
+# Create Migrator interfaces on both hv1 and hv2
+for hv in hv1 hv2; do
+    as $hv check ovs-vsctl -- add-port br-int migrator -- \
+        set Interface migrator external-ids:iface-id=migrator \
+        options:tx_pcap=$hv/migrator-tx.pcap \
+        options:rxq_pcap=$hv/migrator-rx.pcap \
+        ofport-request=100
+done
+
+send_icmp_packet() {
+    local inport=$1 hv=$2 eth_src=$3 eth_dst=$4 ipv4_src=$5 ipv4_dst=$6 ip_chksum=$7 data=$8
+    shift 8
+
+    local ip_ttl=ff
+    local ip_len=001c
+    local packet=${eth_dst}${eth_src}08004500${ip_len}00004000${ip_ttl}01${ip_chksum}${ipv4_src}${ipv4_dst}${data}
+    as hv$hv ovs-appctl netdev-dummy/receive $inport $packet
+    echo $packet
+}
+
+check ovn-nbctl lsp-set-addresses first "00:00:00:00:00:01 10.0.0.1"
+check ovn-nbctl lsp-set-addresses second "00:00:00:00:00:02 10.0.0.2"
+check ovn-nbctl lsp-set-addresses migrator "00:00:00:00:00:ff 10.0.0.100"
+
+first_mac=000000000001
+second_mac=000000000002
+migrator_mac=0000000000ff
+first_ip=$(ip_to_hex 10 0 0 1)
+second_ip=$(ip_to_hex 10 0 0 2)
+migrator_ip=$(ip_to_hex 10 0 0 100)
+
+OVN_POPULATE_ARP
+
+for len in 1422 1423; do
+    data=$(xxd -l $len -c $len -p < /dev/zero)
+    packet=$(send_icmp_packet migrator 1 $migrator_mac $first_mac $migrator_ip $first_ip 0000 $data)
+    echo $packet > hv1/first.expected
+
+    packet=$(send_icmp_packet migrator 1 $migrator_mac $second_mac $migrator_ip $second_ip 0000 $data)
+    echo $packet > hv2/second.expected
+done
+
+OVN_CHECK_PACKETS_REMOVE_BROADCAST([hv1/first-tx.pcap], [hv1/first.expected])
+OVN_CHECK_PACKETS_REMOVE_BROADCAST([hv2/second-tx.pcap], [hv2/second.expected])
+
+OVN_CLEANUP([hv1],[hv2])
+
+AT_CLEANUP
+])
+
 OVN_FOR_EACH_NORTHD([
 AT_SETUP([options:activation-strategy for logical port])
 AT_KEYWORDS([multi-chassis])
-- 
2.34.1
