Initial (incomplete) approach to the idea can be seen here: https://patchwork.ozlabs.org/project/ovn/patch/[email protected]/
There are still lots of missing parts (support for only one of the two
directions; an issue with the local port size check to investigate;
Geneve info leaking into the embedded ICMP error packet; missing IPv6
support, to name a few). Posted to gather thoughts on the local
delivery issue / controller action definition.

Please take a look if you have spare cycles.

Thanks,
Ihar

On Thu, Jul 21, 2022 at 11:22 AM Ihar Hrachyshka <[email protected]> wrote:
>
> On Wed, Jul 20, 2022 at 3:45 PM Numan Siddique <[email protected]> wrote:
> >
> > On Mon, Jul 18, 2022 at 7:44 PM Ihar Hrachyshka <[email protected]> wrote:
> > >
> > > Hi folks,
> > >
> > > looking for some advice on MTU / IP behavior for multi-chassis ports.
> > >
> > > 22.06 introduced the new multichassis port concept, where the same
> > > port may be present on multiple chassis; this can be used as a
> > > performance optimization for VM live migration, among other things.
> > > Migration with sub-0.1ms network downtime is achieved by cloning all
> > > traffic directed towards a multichassis port to all its locations,
> > > so that every chassis hosting it receives the relevant traffic at
> > > any given moment until migration is complete.
> > >
> > > For logical switches with a localnet port, where traffic usually
> > > goes through the localnet port and not through tunnels, this means
> > > enforcing tunneling of egress and ingress traffic for a multichassis
> > > port. (The rest of the traffic, between other, non-multichassis
> > > ports, keeps going through the localnet port.) Tunneling is enforced
> > > because traffic sent through the localnet port won't generally be
> > > cloned to all port binding locations by the upstream switch.
> > >
> > > A problem arises when ingress or egress traffic of a multichassis
> > > port, being redirected through a tunnel, gets lost because of the
> > > Geneve header overhead, or because the interface used for tunneling
> > > has a different MTU than the physical bridge backing the localnet
> > > port.
> > >
> > > This happens when:
> > >
> > >   tunnel_iface_mtu < localnet_mtu + geneve_overhead
> > >
> > > This makes for an unfortunate situation where, for a multichassis
> > > port, SOME traffic (e.g. regular ICMP requests) passes through
> > > without any problems, while OTHER traffic (e.g. produced with
> > > 'ping -s') doesn't. (A test scenario demonstrating this is included
> > > below.)
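As a concrete illustration of the inequality above (my numbers, not
from the thread, assuming Geneve over IPv4 carrying the single OVN
metadata option, i.e. 20 + 8 + 8 + 8 = 44 bytes of encapsulation on
top of the inner Ethernet frame):

    # Sketch with assumed values; actual overhead depends on tunnel
    # type, IP version, and the options carried.
    localnet_mtu=1500                     # inner IP packet limit on the fabric
    tun_iface_mtu=1500                    # MTU of the tunnel-carrying interface
    geneve_overhead=$((20 + 8 + 8 + 8))   # IPv4 + UDP + Geneve + OVN option
    inner_frame=$((14 + localnet_mtu))    # Ethernet header + inner IP packet
    outer_packet=$((inner_frame + geneve_overhead))
    # 1514 + 44 = 1558 > 1500: a full-sized frame no longer fits the tunnel.
    [ "$outer_packet" -gt "$tun_iface_mtu" ] && \
        echo "oversized by $((outer_packet - tun_iface_mtu)) bytes"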
> > > Here are the ideas I came up with for how this could be resolved,
> > > or at least mitigated:
> > >
> > > a) pass "oversized" traffic through localnet, and the rest through
> > > tunnels. Apart from the confusing pattern on the wire, where packets
> > > that belong to the same TCP session may take two different paths
> > > (arguably not a big problem, and something other network hardware
> > > should expect), this gets us back to the problem of the upstream
> > > switch not delivering packets to all binding chassis.
> > >
> > > b) fragmentation needed. We could send ICMP Fragmentation Needed
> > > errors on attempts to send oversized packets to or from a
> > > multichassis port. TCP sessions could then adjust their properties
> > > to reflect the new recommended MTU. We already have a similar
> > > mechanism for router ports. There are several caveats and
> > > limitations with fragmentation needed (b):
> > >
> > > - AFAIU this works for some protocols but not others. It also
> > >   depends on the client network stack reacting to the change in the
> > >   network path. TCP should probably work.
> > >
> > > - It may be confusing for users that some L2 paths between ports of
> > >   the same switch have a reduced MTU while others have the regular
> > >   one, depending on the type of a port (single- or multi-chassis).
> > >
> > > c) implement fragmentation inside the L2 domain. I actually don't
> > > know if that's even compliant with the RFCs. Usually packet
> > > fragmentation is performed at an L2 domain boundary, by a router.
> > > In this scenario, a peer port on the same switch would receive
> > > fragments of a packet that was sent as a single piece.
> > >
> > > I currently lean towards (b), though it's not a universal fix since
> > > it requires the collaboration of the underlying network stack. But
> > > (a) leaves cloning broken, and (c) is even more invasive, taking
> > > action on a port's packets (fragmenting them) without the port
> > > owner's knowledge.
> > >
> > > Perhaps this is all unnecessary and there's a way to make OVN
> > > transparently split and reassemble packets as needed, though I
> > > doubt it, since OVN doesn't encapsulate tunneled traffic into
> > > another application layer. But if there's a way to achieve
> > > transparent reassembly, or if there are other alternatives beyond
> > > (a)-(c), let me know. Please let me know what you think of (a)-(c)
> > > regardless.
> > >
> > > Thanks,
> > > Ihar
> >
> > I lean towards (b) too. I suppose we only need this special handling
> > until the migration is complete and the multi-chassis option is
> > cleared by the CMS.
>
> Yes, in the case of short-term use of the feature (e.g. for live
> migration). If we are talking about persistent multi-chassis ports
> (e.g. to clone traffic), this may be more of a problem. I'd expect
> that in this case, MTUs will be carefully aligned between tunneling
> interfaces and physical networks, so that any packet directed towards
> a physical network would fit through the tunneling interface. (This
> implication is already documented in ovn-nb.xml, so what is discussed
> here is more of an optimization for environments that don't abide by
> the expected MTU configuration.)
>
> > For (a) how would OVN decide if the packet is oversized? Using the
> > check_pkt_larger OVS action?
>
> I believe any of the options (a)-(c) would rely on check_pkt_larger,
> yes. The question is how to determine the baseline. For this, we'll
> need to know:
>
> 1) the type of the tunnel (to determine the overhead compared to
>    direct localnet forwarding);
> 2) the MTU of the interface used for tunneling.
>
> The max size for a packet would then be calculated as follows:
>
>   max_size = tun_iface_mtu - overhead(tun_type)
>
> overhead() is simple to implement because it's just a 1:1 str:int
> mapping. tun_iface_mtu will require some busy work to maintain MTUs
> for the relevant interfaces used for routing between chassis.
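To sketch what that mapping and computation could look like
(illustrative only; the shell form and the overhead values are my
assumptions, in ovn-controller this would presumably be C code next to
the encapsulation handling):

    # overhead(): the 1:1 tunnel-type -> encapsulation-size mapping
    # mentioned above. Values are assumptions for IPv4 underlays.
    overhead() {
        case $1 in
            geneve) echo 44 ;;   # IPv4 20 + UDP 8 + Geneve 8 + OVN option 8
            vxlan)  echo 36 ;;   # IPv4 20 + UDP 8 + VXLAN 8
            *)      echo 44 ;;   # conservative fallback
        esac
    }

    tun_iface_mtu=1500           # would be kept up to date per interface
    max_size=$((tun_iface_mtu - $(overhead geneve)))
    # max_size would then feed the check_pkt_larger() match.
    echo "flag packets larger than $max_size bytes"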
> > I've no idea how (c) can be done.
> >
> > Thanks
> > Numan
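One note on the payload lengths exercised by the test below: by my
arithmetic (same assumed 44-byte Geneve encapsulation as above), 1422
is exactly the largest payload that still fits through a 1500-byte
tunnel interface:

    # eth(14) + IPv4(20) + 1422 bytes of data = 1456-byte inner frame;
    # encapsulated: 1456 + 44 = 1500 = tunnel interface MTU (fits).
    # With 1423 bytes of data the outer packet is 1501 bytes (dropped).
    for data_len in 1422 1423; do
        echo "$data_len -> outer $((14 + 20 + data_len + 44)) bytes"
    done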
> > > ---
> > >  tests/ovn.at | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 92 insertions(+)
> > >
> > > diff --git a/tests/ovn.at b/tests/ovn.at
> > > index c346975e6..4ec1d37c3 100644
> > > --- a/tests/ovn.at
> > > +++ b/tests/ovn.at
> > > @@ -14966,6 +14966,98 @@ OVN_CLEANUP([hv1],[hv2],[hv3])
> > >  AT_CLEANUP
> > >  ])
> > >
> > > +OVN_FOR_EACH_NORTHD([
> > > +AT_SETUP([localnet connectivity with multiple requested-chassis, max mtu])
> > > +AT_KEYWORDS([multi-chassis])
> > > +ovn_start
> > > +
> > > +net_add n1
> > > +for i in 1 2; do
> > > +    sim_add hv$i
> > > +    as hv$i
> > > +    check ovs-vsctl add-br br-phys
> > > +    ovn_attach n1 br-phys 192.168.0.$i
> > > +    check ovs-vsctl set open . external-ids:ovn-bridge-mappings=phys:br-phys
> > > +done
> > > +
> > > +check ovn-nbctl ls-add ls0
> > > +check ovn-nbctl lsp-add ls0 first
> > > +check ovn-nbctl lsp-add ls0 second
> > > +check ovn-nbctl lsp-add ls0 migrator
> > > +check ovn-nbctl lsp-set-addresses first "00:00:00:00:00:01 10.0.0.1"
> > > +check ovn-nbctl lsp-set-addresses second "00:00:00:00:00:02 10.0.0.2"
> > > +check ovn-nbctl lsp-set-addresses migrator "00:00:00:00:00:ff 10.0.0.100"
> > > +
> > > +check ovn-nbctl lsp-add ls0 public
> > > +check ovn-nbctl lsp-set-type public localnet
> > > +check ovn-nbctl lsp-set-addresses public unknown
> > > +check ovn-nbctl lsp-set-options public network_name=phys
> > > +
> > > +check ovn-nbctl lsp-set-options first requested-chassis=hv1 vif-plug-mtu-request=1500
> > > +check ovn-nbctl lsp-set-options second requested-chassis=hv2 vif-plug-mtu-request=1500
> > > +check ovn-nbctl lsp-set-options migrator requested-chassis=hv1,hv2 vif-plug-mtu-request=1500
> > > +
> > > +as hv1 check ovs-vsctl -- add-port br-int first -- \
> > > +    set Interface first external-ids:iface-id=first \
> > > +    options:tx_pcap=hv1/first-tx.pcap \
> > > +    options:rxq_pcap=hv1/first-rx.pcap \
> > > +    ofport-request=1
> > > +as hv2 check ovs-vsctl -- add-port br-int second -- \
> > > +    set Interface second external-ids:iface-id=second \
> > > +    options:tx_pcap=hv2/second-tx.pcap \
> > > +    options:rxq_pcap=hv2/second-rx.pcap \
> > > +    ofport-request=2
> > > +
> > > +# Create Migrator interfaces on both hv1 and hv2
> > > +for hv in hv1 hv2; do
> > > +    as $hv check ovs-vsctl -- add-port br-int migrator -- \
> > > +        set Interface migrator external-ids:iface-id=migrator \
> > > +        options:tx_pcap=$hv/migrator-tx.pcap \
> > > +        options:rxq_pcap=$hv/migrator-rx.pcap \
> > > +        ofport-request=100
> > > +done
> > > +
> > > +send_icmp_packet() {
> > > +    local inport=$1 hv=$2 eth_src=$3 eth_dst=$4 ipv4_src=$5 ipv4_dst=$6 ip_chksum=$7 data=$8
> > > +    shift 8
> > > +
> > > +    local ip_ttl=ff
> > > +    local ip_len=001c
> > > +    local packet=${eth_dst}${eth_src}08004500${ip_len}00004000${ip_ttl}01${ip_chksum}${ipv4_src}${ipv4_dst}${data}
> > > +    as hv$hv ovs-appctl netdev-dummy/receive $inport $packet
> > > +    echo $packet
> > > +}
> > > +
> > > +check ovn-nbctl lsp-set-addresses first "00:00:00:00:00:01 10.0.0.1"
> > > +check ovn-nbctl lsp-set-addresses second "00:00:00:00:00:02 10.0.0.2"
> > > +check ovn-nbctl lsp-set-addresses migrator "00:00:00:00:00:ff 10.0.0.100"
> > > +
> > > +first_mac=000000000001
> > > +second_mac=000000000002
> > > +migrator_mac=0000000000ff
> > > +first_ip=$(ip_to_hex 10 0 0 1)
> > > +second_ip=$(ip_to_hex 10 0 0 2)
> > > +migrator_ip=$(ip_to_hex 10 0 0 100)
> > > +
> > > +OVN_POPULATE_ARP
> > > +
> > > +for len in 1422 1423; do
> > > +    data=$(xxd -l $len -c $len -p < /dev/zero)
> > > +    packet=$(send_icmp_packet migrator 1 $migrator_mac $first_mac $migrator_ip $first_ip 0000 $data)
> > > +    echo $packet > hv1/first.expected
> > > +
> > > +    packet=$(send_icmp_packet migrator 1 $migrator_mac $second_mac $migrator_ip $second_ip 0000 $data)
> > > +    echo $packet > hv2/second.expected
> > > +done
> > > +
> > > +OVN_CHECK_PACKETS_REMOVE_BROADCAST([hv1/first-tx.pcap], [hv1/first.expected])
> > > +OVN_CHECK_PACKETS_REMOVE_BROADCAST([hv2/second-tx.pcap], [hv2/second.expected])
> > > +
> > > +OVN_CLEANUP([hv1],[hv2])
> > > +
> > > +AT_CLEANUP
> > > +])
> > > +
> > >  OVN_FOR_EACH_NORTHD([
> > >  AT_SETUP([options:activation-strategy for logical port])
> > >  AT_KEYWORDS([multi-chassis])
> > > --
> > > 2.34.1
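If you want to reproduce this, the test should be runnable with the
standard autotest keyword filter used by the OVN testsuite (a sketch;
exact make invocation may vary by tree):

    # Run the tests tagged multi-chassis from the testsuite.
    make check TESTSUITEFLAGS='-k multi-chassis'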
