On Mon, Jul 29, 2024 at 2:17 PM Felix Huettner
<felix.huettner@mail.schwarz> wrote:
>
> Hi everyone,
>
> I have built a very first, ugly prototype of such a feature, which
> brought interesting insights that I want to share.
>
> # Test implementation
>
> I first want to briefly share the setup I implemented this on:
> The test setup consists of 3 OVN nodes: one represents a compute node
> while the two others serve as gateways. The gateways also each have a
> point-to-point interface to an additional machine that represents a
> leaf-spine architecture using network namespaces and static routes.
>
> For the OVN northbound content we have (a rough ovn-nbctl sketch follows
> this list):
> * a normal Neutron project setup with:
>    * LSP for a VM (LSP-VM)
>    * LS for the network (LS-internal)
>    * LR for the router (R1)
>    * LSP to the router (LSP-internal-R1)
>    * LRP to the network (LRP-R1-internal)
> * a NAT rule on R1 representing a floating IP
> * The router R1 has an LRP (LRP-R1-public) with a ha_chassis_group
>   configured to point to both gateways with different priorities
> * There is an integration LR (public) that serves as the integration
>   point of different projects. It replaces the LS normally used for
>   this.
> * The LR public has options:chassis configured to "gtw01,gtw02" (thereby
>   making it an l3gateway)
> * LR public has an LRP (LRP-public-R1)
> * The LRPs LRP-public-R1 and LRP-R1-public are configured as each others
>   peers
> * There is a logical switch (LS-public-for-real)
> * LS-public-for-real has an LSP (physnet) of type localnet and
>   network_name set
> * LR public has an LRP (LRP-public-for-real)
> * LS-public-for-real has an LSP (LSP-public)
> * LSP-public and LRP-public-for-real are connected
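>
> For illustration, here is a rough ovn-nbctl sketch of this topology
> (untested; all MACs, IP addresses and the physnet name are placeholders
> I picked for this example):
>
>   # project side (normally created by Neutron)
>   ovn-nbctl ls-add LS-internal
>   ovn-nbctl lr-add R1
>   ovn-nbctl lrp-add R1 LRP-R1-internal 00:00:00:00:01:01 192.168.0.1/24
>   ovn-nbctl lsp-add LS-internal LSP-internal-R1 \
>       -- lsp-set-type LSP-internal-R1 router \
>       -- lsp-set-options LSP-internal-R1 router-port=LRP-R1-internal
>   ovn-nbctl lsp-add LS-internal LSP-VM \
>       -- lsp-set-addresses LSP-VM "00:00:00:00:01:05 192.168.0.5"
>   ovn-nbctl lr-nat-add R1 dnat_and_snat 198.51.100.10 192.168.0.5
>
>   # uplink of R1, active-passive across the two gateways
>   ovn-nbctl ha-chassis-group-add hagrp1
>   ovn-nbctl ha-chassis-group-add-chassis hagrp1 gtw01 20
>   ovn-nbctl ha-chassis-group-add-chassis hagrp1 gtw02 10
>   ovn-nbctl lrp-add R1 LRP-R1-public 00:00:00:00:02:01 198.51.100.1/24 \
>       peer=LRP-public-R1
>   hagrp=$(ovn-nbctl --bare --columns=_uuid find HA_Chassis_Group name=hagrp1)
>   ovn-nbctl set Logical_Router_Port LRP-R1-public ha_chassis_group="$hagrp"
>
>   # integration router bound to both gateways (see the caveats below)
>   ovn-nbctl lr-add public
>   ovn-nbctl set Logical_Router public options:chassis="gtw01,gtw02"
>   ovn-nbctl lrp-add public LRP-public-R1 00:00:00:00:02:02 198.51.100.2/24 \
>       peer=LRP-R1-public
>
>   # external side
>   ovn-nbctl ls-add LS-public-for-real
>   ovn-nbctl lrp-add public LRP-public-for-real 00:00:00:00:03:01 203.0.113.2/24
>   ovn-nbctl lsp-add LS-public-for-real LSP-public \
>       -- lsp-set-type LSP-public router \
>       -- lsp-set-options LSP-public router-port=LRP-public-for-real \
>       -- lsp-set-addresses LSP-public router
>   ovn-nbctl lsp-add LS-public-for-real physnet \
>       -- lsp-set-type physnet localnet \
>       -- lsp-set-options physnet network_name=external \
>       -- lsp-set-addresses physnet unknown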
>
> This setup contains two things that are currently not possible:
> 1. l3gateways cannot be bound to more than one chassis
> 2. l3gateway LRPs cannot be directly connected to a distributed gateway
>    port
>
> Supporting an l3gateway that runs on more than one chassis can quite
> easily be done. This basically creates an active-active router on
> multiple chassis where each chassis knows nothing about any of the
> other instances (so e.g. no conntrack syncing).
> However this also means that LRP-public-for-real only exists once
> from the OVN perspective while actually residing on multiple nodes. This
> e.g. means that there is only a single static route for the LR pointing
> to an IP behind LRP-public-for-real. This IP must be the same on all
> chassis implementing the LR and it must also have the same MAC address.
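>
> As a concrete (made-up) example, sticking with the addresses from the
> sketch above, that single route might be:
>
>   # 203.0.113.1 must be reachable with the same IP and MAC on the
>   # fabric side of every chassis implementing the public LR
>   ovn-nbctl lr-route-add public 0.0.0.0/0 203.0.113.1 LRP-public-for-real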
>
> Directly connecting an l3gateway LRP to a distributed gateway port gets
> a little uglier. The current implementation relies on the fact that
> the chassis set of the l3gateway must be a superset of the chassis of
> the distributed gateway port. In this case I decided to tunnel to the
> appropriate chassis between the ingress and egress pipeline of the
> l3gateway.
>
> The implementation of this is available here:
> https://github.com/ovn-org/ovn/compare/main...felixhuettner:ovn:test_active_active_routing
>
> It allows the above setup to send ICMP packets between the LSP-VM and
> the external system. The external system can send packets through both
> of the gateway chassis and they will be forwarded appropriately. Reply
> traffic is always sent to the external system from the chassis that is
> currently active for LRP-R1-public.
>
> ## Current Limitations
>
> * An outage of the link between LRP-public-for-real and the nexthop
>   is not handled.
> * It breaks some test cases; I have not investigated them yet.
>
> # Results and Open questions
>
> ## Integrating the project LRs via a Router or Switch
>
> The current default setup, at least for Neutron, is to integrate project
> LRs via a single switch that also hosts a localnet port to the outside
> world.
>
> In the setup above I tried to use a router instead for this purpose.
>
> The differences I see between these setups (a small routing sketch
> follows this list):
> * If an integration switch is used, each project router must hold the
>   full routing table to all other project routers on the switch. With an
>   integration router it is the only one that needs the full routing
>   table, while the project routers only need a default route.
> * A logical switch brings features that are not necessary in this
>   scenario (like multicast/broadcast). With a lot of LSPs these can
>   actually generate flows that are too large.
> * For project-to-project communication there is no difference in the
>   number of datapaths traversed. For project-to-external communication
>   the switch adds an extra datapath.
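>
> To make the first point concrete (addresses invented; with an
> integration switch each project router has a leg on that switch, here
> 172.24.0.0/24):
>
>   # integration switch: every project LR needs a route per peer project
>   ovn-nbctl lr-route-add R1 192.168.1.0/24 172.24.0.3
>   ovn-nbctl lr-route-add R1 192.168.2.0/24 172.24.0.4
>
>   # integration router: only "public" carries the per-project routes,
>   # each project LR keeps a single default route
>   ovn-nbctl lr-route-add public 192.168.0.0/24 198.51.100.1
>   ovn-nbctl lr-route-add R1 0.0.0.0/0 198.51.100.2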
>
> So from a greenfield perspective I see no value in using a logical
> switch between the projects. However, for existing setups and
> integrations a logical switch might make everyone's life easier.
>
> ## Connection from public router to localnet port
>
> The setup above uses a single LRP on the public router for the connection
> to the localnet port. This is quite easy to set up and to extend when new
> chassis are added.
>
> The approach in other examples instead used one LRP per external
> connection. When a new chassis is added, an additional LRP and external
> LS are needed (or multiple of each when there are multiple NICs).
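>
> As a rough sketch of that approach (names and addresses invented),
> adding a new gateway chassis gtw03 would then mean something like:
>
>   ovn-nbctl ls-add LS-external-gtw03
>   ovn-nbctl lsp-add LS-external-gtw03 ln-gtw03 \
>       -- lsp-set-type ln-gtw03 localnet \
>       -- lsp-set-options ln-gtw03 network_name=external \
>       -- lsp-set-addresses ln-gtw03 unknown
>   ovn-nbctl lrp-add public LRP-public-gtw03 00:00:00:00:04:03 203.0.113.13/24
>   ovn-nbctl lrp-set-gateway-chassis LRP-public-gtw03 gtw03 10
>   ovn-nbctl lsp-add LS-external-gtw03 LSP-external-gtw03 \
>       -- lsp-set-type LSP-external-gtw03 router \
>       -- lsp-set-options LSP-external-gtw03 router-port=LRP-public-gtw03 \
>       -- lsp-set-addresses LSP-external-gtw03 router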
>
> The differences I see are:
> * a single external LRP requires the side outside of OVN to behave
>   identically everywhere, meaning it needs the same IP and MAC address.
>   This might not always be possible, depending on the systems there.
> * with multiple LRPs the CMS needs additional information to configure
>   them compared to what is currently needed (e.g. IPs).
> * multiple LRPs currently do not support preferring a node-local route
>   if one is available. This is needed to prevent us from sending packets
>   between gateway chassis for no reason.
> * multiple LRPs would allow us to learn different routes (or routes with
>   different costs) on different chassis. A single LRP means that we have
>   the same routing table on all chassis.
> * having multiple external connections on a single chassis bound to the
>   same public LR is not easily possible with a single LRP. With multiple
>   LRPs this works natively.
>
> I see points in favor of both approaches. Maybe we can find an
> alternative that combines the benefits and avoids the drawbacks?
>
> # Summary
>
> I would like to hear opinions on these topics, as I think they are
> relevant to any potential implementation.
> Maybe we can also spend some time on these in the community meeting later.

Thanks a lot for contributing work on discovering potential paths
forward for interaction between gateway routers and distributed
gateways for the OpenStack use case, Felix, much appreciated!

There is a lot to consider in what you have laid out above, so I'll
leave this comment here for now to let you know that I intend to review
it more thoroughly and respond, to ensure you do not feel ignored.

> Thanks
> Felix
>
>
>
> On Wed, Jul 17, 2024 at 09:35:13AM +0200, Felix Huettner via dev wrote:
> > Hi everyone,
> >
> > thanks a lot for the work that has already been put into this.
> > I just want to share my thoughts below, as I'm looking for a similar
> > thing at the moment.
> >
> > On Thu, Jul 04, 2024 at 11:39:36AM +0200, Frode Nordahl wrote:
> > > > I tend to agree with Ilya here, clarity for the operator as for what
> > > > the system is actually doing becomes even more important when we are
> > > > integrating with external systems (the ToRs). The operator would
> > > > expect to be able to map configuration and status observed on one
> > > > system to configuration and status observed in another system.
> > > >
> > > > Another issue is that I see no way for magically mapping a single
> > > > localnet port into multiple chassis resident LRPs which would be
> > > > required for configurations with multiple NICs that do not use bonds.
> >
> > From my understanding users will need to configure multiple things that
> > are specific to each node anyway. They will need at least a peering IP
> > for each node (or one per NIC per node) and the BGP sessions for each of
> > these. This also means that the users will need to do configuration on
> > each of their nodes directly (at least for BGP).
> >
> > Maybe we can also use this to take care of the detailed LRP
> > configuration. Since this is relevant only for the respective node and
> > not for the outside, we could put that config into the local OVSDB and
> > just leave it for ovn-controller to handle. This might include splitting
> > the LRP up into multiple ones.
> >
> > This way one could have a single LR and LS handling all of the external
> > connections on the northbound side while splitting this up on the actual
> > chassis. This might be easier to understand and to extend (e.g. when
> > adding new nodes). Viewed from the northbound side, this might look
> > something like this:
> >
> > +-----------+ +-----------+
> > |Project1 LS| |Project2 LS|
> > +----+------+ +-----+-----+
> >      |              |
> > +----+------+ +-----+-----+
> > |Project1 LR| |Project2 LR|
> > +----+------+ +-----+-----+
> >      |              |
> > +----+--------------+-----+
> > |"provider network" LS    |
> > +----------+--------------+
> >            |
> > +----------+--------------+
> > |magic bgp router         |
> > +----------+--------------+
> >            |
> > +----------+--------------+
> > |physical network LS      |
> > +-------------------------+
> >
> > where the "provider network" LS is the provider network from the
> > OpenStack side but without a localnet port (as Frode wrote below).
> > The magic BGP router would have one LRP to the provider network side.
> > The magic BGP router would also have one other LRP to the physical
> > network side. Maybe we add something like a replicated_chassis_group to
> > LRPs that contains all chassis the LRP should run on (so active-active,
> > while HA_Chassis_Group is active-passive).
> > I am unsure if the physical network LS is actually helpful, or if it
> > would be easier to just put the network_name directly on the LRP.
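> >
> > To make this a bit more concrete, a rough sketch (all names and
> > addresses are invented, and replicated_chassis_group is purely
> > hypothetical, it does not exist in the current schema):
> >
> >   ovn-nbctl lr-add magic-bgp-lr
> >   # one LRP towards the "provider network" LS
> >   ovn-nbctl lrp-add magic-bgp-lr lrp-bgp-provider 00:00:00:00:05:01 198.51.100.254/24
> >   # one LRP towards the physical network side
> >   ovn-nbctl lrp-add magic-bgp-lr lrp-bgp-physical 00:00:00:00:05:02 203.0.113.254/24
> >   # hypothetical new column listing all chassis the LRP runs on
> >   # (active-active), analogous to ha_chassis_group (active-passive)
> >   ovn-nbctl set Logical_Router_Port lrp-bgp-physical \
> >       replicated_chassis_group="gtw01,gtw02"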
> >
> > From my perspective this gives us a nicer separation between the target
> > architecture the user defines in the northbound and the per-node
> > implementation details that are then in the local OVSDB.
> >
> > > >
> > > > Presumably the goal with your proposal is to find a low-touch way to
> > > > make existing CMSs with overlay-based OVN configuration, such as
> > > > OpenStack, work in the new topology.
> > > >
> > > > We're also interested in minimizing the development effort on the CMS
> > > > side, so the tardy response to this thread is due to us spending a few
> > > > days exploring options.
> > > >
> > > >
> > > > I'll describe one approach that from observation mostly works below:
> > > >
> > > > With the outset of a classic OpenStack OVN configuration:
> > > > * Single provider LS
> > > > * Per project LSs with distributed LRs that have gateway chassis set
> > > > and NAT rules configured
> > > > * Instances scattered across three nodes
> > > >
> > > > We did the following steps to morph it into a configuration with per
> > > > node gateway routers and local entry/exit of traffic to/from
> > > > instances:
> > > > * Apply Numan's overlay provider network patch [0] and set
> > > > other_config:overlay_provider_network=true on provider LS
> > > > * Remove the provider network localnet port
> > > > * Create per chassis/NIC LS with localnet ports
> > > > * Create per chassis GWR and attach it to NIC LS as well as provider 
> > > > network
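> > > >
> > > > Roughly (untested sketch with invented names; the full commands are
> > > > in the script referenced below):
> > > >
> > > >   ovn-nbctl set Logical_Switch provider \
> > > >       other_config:overlay_provider_network=true
> > > >   ovn-nbctl lsp-del provider-localnet
> > > >   # repeated for every chassis, here node1
> > > >   ovn-nbctl ls-add ls-nic-node1
> > > >   ovn-nbctl lsp-add ls-nic-node1 ln-node1 \
> > > >       -- lsp-set-type ln-node1 localnet \
> > > >       -- lsp-set-options ln-node1 network_name=physnet1
> > > >   ovn-nbctl lr-add gwr-node1
> > > >   ovn-nbctl set Logical_Router gwr-node1 options:chassis=node1
> > > >   # plus the usual LRP/LSP pairs attaching gwr-node1 to both
> > > >   # ls-nic-node1 and the provider LS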
> > > >
> > > > We have this handy script to do most of it [1].
> > >
> > > One thing I forgot to mention is that for simplicity the script uses a
> > > lot of IPv4 addresses. In a final solution I would propose we use IPv6
> > > LLAs also for routing between the internal OVN L3 constructs to avoid
> > > this.
> >
> > /me goes back to fixing the v4-over-v6 routing patch :)

I did indeed see your proposals for that, and we are very much
interested in those too! :)

--
Frode Nordahl

> > >
> > > --
> > > Frode Nordahl
> > >
> > > > With that we can have an outside observer sending traffic via external
> > > > GWR IP destined to an instance FIP local to that chassis and have the
> > > > traffic enter/exit locally.
> > > >
> > > > The part that does not work in this setup is correct identification of
> > > > the return path due to the project LR having a single global default
> > > > route, so it only works for a single chassis at a time.
> > > >
> > > >
> > > > Perhaps we could solve this with a per chassis routing table and/or
> > > > option for automatic addition of default route to peer GWR or
> > > > something similar?
> >
> > This should be solvable with my proposal above. The individual project
> > LRs would forward all their traffic to the magic BGP LR.
> >
> > For forwarding from there I see multiple options:
> > 1. expect each project LRP to have a gateway_chassis set and expect that
> >    the magic LR is also running on this chassis. In this case we just
> >    route locally with whatever we get from BGP (or just to the local
> >    nexthop).
> >    This is probably what we want in common cases anyway, but it would
> >    mean that we rely on the nexthop of this node being healthy to
> >    forward traffic.
> > 2. choose the next alive chassis of the magic LR and forward the packets
> >    there. Ideally this is the same chassis as the gateway_chassis of the
> >    project LRP. In case we don't have a gateway_chassis on the project
> >    LRP, or if the BGP session to our nexthop is down, we forward to
> >    whatever other chassis is still available.
> >    This would require some coordination between the ovn-controllers that
> >    can host the magic LR to determine which is up.
> >
> > From my perspective option 2 is what we want, especially to ensure
> > clean failovers.
> >
> > > >
> > > >
> > > > 0: 
> > > > https://patchwork.ozlabs.org/project/ovn/patch/20240606214432.168750-1-numans%2540ovn.org/
> > > > 1: https://pastebin.ubuntu.com/p/RFGpsDggpp/
> > > >
> >
> > Thanks
> > Felix