An initial (incomplete) approach to the idea can be seen here:

https://patchwork.ozlabs.org/project/ovn/patch/[email protected]/

There are still lots of missing parts (only one of the two directions
is supported; there's an issue with the local port size check to
investigate; geneve info leaks into the embedded ICMP error packet;
IPv6 support is missing, to name a few). I posted it to gather
thoughts on the local delivery issue / controller action definition.

Please take a look if you have spare cycles.

Thanks,
Ihar

On Thu, Jul 21, 2022 at 11:22 AM Ihar Hrachyshka <[email protected]> wrote:
>
> On Wed, Jul 20, 2022 at 3:45 PM Numan Siddique <[email protected]> wrote:
> >
> > On Mon, Jul 18, 2022 at 7:44 PM Ihar Hrachyshka <[email protected]> wrote:
> > >
> > > Hi folks,
> > >
> > > looking for some advice on MTU / IP behavior for multi-chassis ports.
> > >
> > > OVN 22.06 introduced the new multichassis port concept, where the
> > > same port may be present on multiple chassis. This can be used as a
> > > performance optimization for VM live migration, among other things.
> > > Migration with sub-0.1ms network downtime is achieved by cloning all
> > > traffic directed towards a multichassis port to all of its locations,
> > > making every chassis hosting it receive the relevant traffic at any
> > > given moment until migration is complete.
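> > >
> > > For example, a port becomes multichassis when more than one chassis
> > > is listed in its requested-chassis option (the port and chassis
> > > names here are taken from the test below):
> > >
> > >   ovn-nbctl lsp-set-options migrator requested-chassis=hv1,hv2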
> > >
> > > For logical switches with a localnet port, where traffic usually
> > > goes through the localnet port and not through tunnels, this means
> > > forcing both ingress and egress traffic of a multichassis port
> > > through tunnels. (The rest of the traffic, between other,
> > > non-multichassis ports, keeps going through the localnet port.)
> > > Tunneling is enforced because traffic sent through the localnet port
> > > won't generally be cloned to all port binding locations by the
> > > upstream switch.
> > >
> > > A problem arises when ingress or egress traffic of a multichassis
> > > port, being redirected through a tunnel, gets lost because of the
> > > geneve header overhead or because the interface used for tunneling
> > > has a different MTU from the physical bridge backing the localnet
> > > port.
> > >
> > > This is happening when:
> > >   - tunnel_iface_mtu < localnet_mtu + geneve_overhead
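> > >
> > > For example, assuming the usual geneve overhead of 58 bytes for an
> > > IPv4 underlay (14 eth + 20 IP + 8 UDP + 8 geneve + 8 OVN metadata
> > > option) and both MTUs at 1500, the condition holds (1500 < 1558), so
> > > any inner frame longer than 1500 - 58 = 1442 bytes fits on the
> > > localnet path but is dropped on the tunnel path.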
> > >
> > > This makes for an unfortunate situation where, for a multichassis
> > > port, SOME traffic (e.g. regular ICMP requests) passes through
> > > without any problems, while OTHER traffic (e.g. produced with
> > > 'ping -s') doesn't. (A test scenario demonstrating this is included
> > > below.)
> > >
> > > Here are ideas I came up with on how this could be resolved or at least
> > > mitigated:
> > >
> > > a) pass "oversized" traffic through localnet, and the rest through
> > > tunnels. Apart from the confusing pattern on the wire, where packets
> > > that belong to the same TCP session may take two different paths
> > > (arguably not a big problem, and something other network hardware
> > > should expect), this gets us back to the problem of the upstream
> > > switch not delivering packets to all binding chassis.
> > >
> > > b) fragmentation needed. We could send ICMP Fragmentation Needed
> > > errors on attempts to send oversized packets to or from a
> > > multichassis port. TCP sessions could then adjust their properties
> > > to reflect the new recommended MTU. We already have a similar
> > > mechanism for router ports. (A sketch of what the logical flows
> > > could look like follows the caveats below.)
> > > There are several caveats and limitations with fragmentation needed (b):
> > >
> > > - AFAIU this works for some protocols but not others. It also
> > >   depends on the client network stack reacting to the ICMP error and
> > >   adjusting the path MTU. TCP should probably work.
> > >
> > > - It may be confusing for users that some L2 paths between ports of
> > >   the same switch have a reduced MTU while others have the regular
> > >   one, depending on the type of the port (single- or multi-chassis).
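> > >
> > > For illustration, here is a minimal sketch of the logical flows that
> > > (b) could install, modeled on what ovn-northd already emits for
> > > router ports with options:gateway_mtu set (the 1442 threshold, the
> > > port name, and the register choice are assumptions for this
> > > example):
> > >
> > >   match:  inport == "migrator" && ip4
> > >   action: reg9[1] = check_pkt_larger(1460); next;
> > >           /* 1460 = 1442 + 18 bytes of eth + VLAN header */
> > >
> > >   match:  inport == "migrator" && ip4 && reg9[1] == 1
> > >   action: icmp4_error {
> > >               icmp4.type = 3; /* Destination Unreachable. */
> > >               icmp4.code = 4; /* Frag Needed and DF was Set. */
> > >               icmp4.frag_mtu = 1442;
> > >           };
> > >
> > > (Address swapping, TTL handling, and the exact pipeline stages are
> > > omitted.)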
> > >
> > > c) implement fragmentation inside the L2 domain. I actually don't
> > > know if that's even compliant with the RFCs. Usually packet
> > > fragmentation is implemented on an L2 domain boundary, by a router.
> > > In this scenario, a peer port on the same switch would receive
> > > fragments of a packet that was sent as a single piece.
> > >
> > > I currently lean towards (b), though it's not a universal fix since
> > > it requires cooperation from the client network stack. But (a)
> > > leaves cloning broken, and (c) is even more invasive, taking action
> > > on a port's packets (fragmenting them) without the port owner's
> > > knowledge.
> > >
> > > Perhaps this is all unnecessary and there's a way to make OVN
> > > transparently split and reassemble packets as needed, though I doubt
> > > it, since OVN doesn't encapsulate tunneled traffic in another
> > > application layer that could handle splitting and reassembly. But if
> > > there's a way to achieve transparent reassembly, or there are other
> > > alternatives beyond (a)-(c), let me know. Please let me know what
> > > you think of (a)-(c) regardless.
> > >
> > > Thanks,
> > > Ihar
> >
> > I lean towards (b) too.  I suppose we only need this special handling
> > until the migration is complete and the multi-chassis option is
> > cleared by the CMS.
>
> Yes, in the case of short-term use of the feature (e.g. for live
> migration). If we are talking about persistent multi-chassis ports
> (e.g. to clone traffic), this may be more of a problem. I'd expect
> that in this case MTUs would be carefully aligned between tunneling
> interfaces and physical networks, so that any packet directed towards
> a physical network would fit through the tunneling interface. (This
> requirement is already documented in ovn-nb.xml, so what is discussed
> here is more of an optimization for environments that don't abide by
> the expected MTU configuration.)
>
> >
> > For (a), how would OVN decide if the packet is oversized? Using the
> > check_pkt_larger OVS action?
>
> I believe any of the options (a)-(c) would rely on check_pkt_larger,
> yes. The question is how to determine the baseline. For this, we'll
> need to know:
>
> 1) the type of the tunnel (to determine the overhead compared to the
> direct localnet path)
> 2) the MTU of the interface used for tunneling
>
> The max size for a packet would be calculated as follows:
>
> max_size = tun_iface_mtu - overhead(tun_type)
>
> overhead() is simple to implement because it's just a 1:1 str:int
> mapping. tun_iface_mtu will require some busy work to maintain MTUs
> for the relevant interfaces used for tunneling between chassis.
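>
> For example, a minimal sketch of the mapping, with illustrative
> values for an IPv4 underlay (an IPv6 underlay adds another 20 bytes
> to each):
>
>   overhead("geneve") = 58  # eth 14 + IP 20 + UDP 8 + geneve 8 + TLV 8
>   overhead("vxlan")  = 50  # eth 14 + IP 20 + UDP 8 + vxlan 8
>
> So with a geneve tunnel over a 1500-byte tunnel interface:
>
>   max_size = 1500 - 58 = 1442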
>
> >
> > I've no idea how (c) can be done.
> >
> > Thanks
> > Numan
> >
> > > ---
> > >  tests/ovn.at | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 92 insertions(+)
> > >
> > > diff --git a/tests/ovn.at b/tests/ovn.at
> > > index c346975e6..4ec1d37c3 100644
> > > --- a/tests/ovn.at
> > > +++ b/tests/ovn.at
> > > @@ -14966,6 +14966,98 @@ OVN_CLEANUP([hv1],[hv2],[hv3])
> > >  AT_CLEANUP
> > >  ])
> > >
> > > +OVN_FOR_EACH_NORTHD([
> > > +AT_SETUP([localnet connectivity with multiple requested-chassis, max mtu])
> > > +AT_KEYWORDS([multi-chassis])
> > > +ovn_start
> > > +
> > > +net_add n1
> > > +for i in 1 2; do
> > > +    sim_add hv$i
> > > +    as hv$i
> > > +    check ovs-vsctl add-br br-phys
> > > +    ovn_attach n1 br-phys 192.168.0.$i
> > > +    check ovs-vsctl set open . external-ids:ovn-bridge-mappings=phys:br-phys
> > > +done
> > > +
> > > +check ovn-nbctl ls-add ls0
> > > +check ovn-nbctl lsp-add ls0 first
> > > +check ovn-nbctl lsp-add ls0 second
> > > +check ovn-nbctl lsp-add ls0 migrator
> > > +check ovn-nbctl lsp-set-addresses first "00:00:00:00:00:01 10.0.0.1"
> > > +check ovn-nbctl lsp-set-addresses second "00:00:00:00:00:02 10.0.0.2"
> > > +check ovn-nbctl lsp-set-addresses migrator "00:00:00:00:00:ff 10.0.0.100"
> > > +
> > > +check ovn-nbctl lsp-add ls0 public
> > > +check ovn-nbctl lsp-set-type public localnet
> > > +check ovn-nbctl lsp-set-addresses public unknown
> > > +check ovn-nbctl lsp-set-options public network_name=phys
> > > +
> > > +check ovn-nbctl lsp-set-options first requested-chassis=hv1 vif-plug-mtu-request=1500
> > > +check ovn-nbctl lsp-set-options second requested-chassis=hv2 vif-plug-mtu-request=1500
> > > +check ovn-nbctl lsp-set-options migrator requested-chassis=hv1,hv2 vif-plug-mtu-request=1500
> > > +
> > > +as hv1 check ovs-vsctl -- add-port br-int first -- \
> > > +    set Interface first external-ids:iface-id=first \
> > > +    options:tx_pcap=hv1/first-tx.pcap \
> > > +    options:rxq_pcap=hv1/first-rx.pcap \
> > > +    ofport-request=1
> > > +as hv2 check ovs-vsctl -- add-port br-int second -- \
> > > +    set Interface second external-ids:iface-id=second \
> > > +    options:tx_pcap=hv2/second-tx.pcap \
> > > +    options:rxq_pcap=hv2/second-rx.pcap \
> > > +    ofport-request=2
> > > +
> > > +# Create Migrator interfaces on both hv1 and hv2
> > > +for hv in hv1 hv2; do
> > > +    as $hv check ovs-vsctl -- add-port br-int migrator -- \
> > > +        set Interface migrator external-ids:iface-id=migrator \
> > > +        options:tx_pcap=$hv/migrator-tx.pcap \
> > > +        options:rxq_pcap=$hv/migrator-rx.pcap \
> > > +        ofport-request=100
> > > +done
> > > +
> > > +send_icmp_packet() {
> > > +    local inport=$1 hv=$2 eth_src=$3 eth_dst=$4 ipv4_src=$5 ipv4_dst=$6 ip_chksum=$7 data=$8
> > > +    shift 8
> > > +
> > > +    local ip_ttl=ff
> > > +    # IP total length: 20-byte header plus payload (data is a hex
> > > +    # string, two characters per byte).
> > > +    local ip_len=$(printf "%04x" $((20 + ${#data} / 2)))
> > > +    local packet=${eth_dst}${eth_src}08004500${ip_len}00004000${ip_ttl}01${ip_chksum}${ipv4_src}${ipv4_dst}${data}
> > > +    as hv$hv ovs-appctl netdev-dummy/receive $inport $packet
> > > +    echo $packet
> > > +}
> > > +
> > > +check ovn-nbctl lsp-set-addresses first "00:00:00:00:00:01 10.0.0.1"
> > > +check ovn-nbctl lsp-set-addresses second "00:00:00:00:00:02 10.0.0.2"
> > > +check ovn-nbctl lsp-set-addresses migrator "00:00:00:00:00:ff 10.0.0.100"
> > > +
> > > +first_mac=000000000001
> > > +second_mac=000000000002
> > > +migrator_mac=0000000000ff
> > > +first_ip=$(ip_to_hex 10 0 0 1)
> > > +second_ip=$(ip_to_hex 10 0 0 2)
> > > +migrator_ip=$(ip_to_hex 10 0 0 100)
> > > +
> > > +OVN_POPULATE_ARP
> > > +
> > > +for len in 1422 1423; do
> > > +    data=$(xxd -l $len -c $len -p < /dev/zero)
> > > +    packet=$(send_icmp_packet migrator 1 $migrator_mac $first_mac $migrator_ip $first_ip 0000 $data)
> > > +    echo $packet > hv1/first.expected
> > > +
> > > +    packet=$(send_icmp_packet migrator 1 $migrator_mac $second_mac $migrator_ip $second_ip 0000 $data)
> > > +    echo $packet > hv2/second.expected
> > > +done
> > > +
> > > +OVN_CHECK_PACKETS_REMOVE_BROADCAST([hv1/first-tx.pcap], [hv1/first.expected])
> > > +OVN_CHECK_PACKETS_REMOVE_BROADCAST([hv2/second-tx.pcap], [hv2/second.expected])
> > > +
> > > +OVN_CLEANUP([hv1],[hv2])
> > > +
> > > +AT_CLEANUP
> > > +])
> > > +
> > >  OVN_FOR_EACH_NORTHD([
> > >  AT_SETUP([options:activation-strategy for logical port])
> > >  AT_KEYWORDS([multi-chassis])
> > > --
> > > 2.34.1
> > >
