Re: [ovs-dev] [RFC ovn v3 4/4] northd: Advertise Routes for LB and NAT VIPs.

martin . kalcok Mon, 10 Feb 2025 11:08:16 -0800

On Mon, 2025-02-10 at 18:47 +0100, Dumitru Ceara wrote:
> On 2/7/25 12:55 PM, Dumitru Ceara wrote:
> > On 2/7/25 10:44 AM, [email protected] wrote:
> > > Hi Dumitru, Frode
> > > Thanks for the detailed feedback. I'll add my thoughts inline
> > > below.
> > > 
> > 
> > Hi Martin, Frode,
> > 
> > I had an internal discussion about this patch with Ales so I'm
> > adding
> > him to the thread too.
> > 
> > I'll trim this email a bit and try to keep the relevant context as
> > I
> > think we might need a bit of redesign.
> > 
> > [...]
> > 
> > > > > > > > +/* This function searches for an ovn_port with
> > > > > > > > specific
> > > > > > > > "op_ip"
> > > > > > > > + * IP address in all LS datapaths directly connected
> > > > > > > > to the
> > > > > > > > + * LR datapath "od".
> > > > > > > > + * If no match is found, this function returns NULL.*/
> > > > > > > > +static struct ovn_port *
> > > > > > > > +find_port_in_connected_ls(struct ovn_datapath *od,
> > > > > > > > char
> > > > > > > > *op_ip)
> > > > > > > > +{
> > > > > > > > +    struct in6_addr *ip_addr = xmalloc(sizeof
> > > > > > > > *ip_addr);
> > > > > > > > +    unsigned int plen;
> > > > > > > > +    bool is_ipv4 = true;
> > > > > > > > +
> > > > > > > > +    /* Return immediately if op_ip isn't a host
> > > > > > > > route.*/
> > > > > > > 
> > > > > > > We shouldn't have to check this, it should be the caller
> > > > > > > doing
> > > > > > > it.  We
> > > > > > > shouldn't parse values coming directly from the database
> > > > > > > here. 
> > > > > > > We
> > > > > > > should receive an already correct struct in6_addr *
> > > > > > > instead as
> > > > > > > argument.
> > > 
> > > Hmm, the ultimate source of the "op_ip" string is the NAT's
> > > internal
> > > IP/LB backend IP. I don't see immediate way to have this
> > > information
> > > available in the in6_addr* format. Am I overlooking something?
> > > 
> > > > > > > 
> > 
> > [...]
> > 
> > > > > > > 
> > > > > > > 'struct ovn_datapath' has the following field:
> > > > > > > 
> > > > > > > struct sset router_ips; /* Router port IPs except the
> > > > > > > IPv6
> > > > > > > LLAs. */
> > > > > > > 
> > > > > > > Instead of walking all ports of all connected switches of
> > > > > > > 'od'
> > > > > > > maybe we
> > > > > > > should also store all switch port IPs inside 'struct
> > > > > > > ovn_datapath'.
> > > > > > > Then we'd just have to do a sset_find() for each
> > > > > > > connected
> > > > > > > switch.
> > > 
> > > This would only tell us if LS has LSP with the given IP, right?
> > > Or are
> > > you suggesting this as an optimization step that would tell us if
> > > it's
> > > even worth looking through the LSPs on the given LS? i.e. we
> > > would go
> > > from:
> > >   for each LS -> for each LSP -> compare op_ip with lsp_ip
> > > to:
> > >   for each LS -> if sset_find(host_ips, op_ip) -> for each LSP ->
> > > compare op_ip with lsp_ip
> > > 
> > > If so, I agree this would be spare us a lot of unnecessary
> > > lookups.
> > > Though proposition to alter ovn_datapath structure is bit scary.
> > > 
> > 
> > 'op_ip' here is the internal IP (logical_ip for NAT and backend_ip
> > for
> > LB) and the goal as far as I understand is to advertise the route
> > with a
> > better metric on the chassis where traffic to that internal IP is
> > handled.
> > 
> > That being said, I think we don't need tracked_port to be the
> > actual LSP
> > that owns the internal IP.  Please see below (in the [suggestion]
> > section).
> > 
> > [...]
> > 
> > > > 
> > > > I'm a bit of an early bird, so I'll give my high level view on
> > > > the
> > > > below question, and allow others to look at your responses in
> > > > more
> > > > detail when their day starts before participating in any
> > > > details
> > > > myself.
> > > > 
> > > > First, thanks alot for your thorough review, and sharing of
> > > > insights,
> > > > the incremental processing engine is indeed one of the darker
> > > > arts,
> > > > so
> > > > I think everyone on this thread have things to learn from you.
> > > > 
> > 
> > No problem, sorry again for not looking at this patch in detail
> > earlier.
> > 
> > > > > Finding the tracked_port (especially for load balancers but
> > > > > for
> > > > > NATs
> > > > > too) seems to be the very costly operation here.  However we
> > > > > don't
> > > > > seem
> > > > > to have any tests that cover that.
> > > > 
> > > > MJ is working on a multi-node test which will cover some of
> > > > this, the
> > > > current plan is to add it as a separate patch once ready.
> > > > 
> > > > > This makes me wonder, is it really a requirement for this
> > > > > feature
> > > > > that
> > > > > the tracked port is set for these routes?  Not having to
> > > > > implement
> > > > > it
> > > > > now seems to simplify the logic and would remove the need to
> > > > > trace
> > > > > back
> > > > > from lb/nat parsed_routes to their NB configurations.
> > > > 
> > > > The motivation for us with this is to provide a potential low
> > > > barrier
> > > > entry to consuming native BGP support from OVN for CMSes such
> > > > as
> > > > OpenStack in brown field deployments.
> > > > 
> > > > Some of the constraints we work in are:
> > > > * clouds with hundreds of tenants, each of which have their own
> > > > routers, so configuring dynamic routing directly on all of
> > > > those is
> > > > not an option.
> > > > * l3-only underlay with no VLANs spanned across racks or EVPN
> > > > support
> > > > in networking equipment.
> > > > 
> > > > To meet those constraints, a potential solution is to pin a
> > > > gateway
> > > > router on every chassis, connect it up to a join-ls that spans
> > > > all or
> > > > groups of chassis, and then connect existing logical OVN
> > > > topology to
> > > > the join-ls.
> > > > 
> > > > In our tests, this appears to work quite well, and for it to be
> > > > truly
> > > > useful the tracked_port needs to be populated so that we can
> > > > tell the
> > > > fabric where to send packets to NAT/LB addresses through the
> > > > use of
> > > > the lower metric. Without it, all chassis with reachability to
> > > > the
> > > > join-ls would announce the same prefixes with the same metric,
> > > > leading
> > > > to suboptimal path for traffic (tromboning).
> > > > 
> > 
> > OK, I understand the use case better now, thanks for explaining.
> > 
> > I'll try to summarize how traffic is handled for the different
> > types of
> > routers and public IPs and how I understand routes for those should
> > be
> > advertised:
> > 
> > [suggestion]
> > ============
> > I. For a gateway router GW0 (bound to a chassis, e.g., hv0):
> > 
> > 1. Traffic destined to a load balancer VIP0 is always handled on
> > hv0
> > (DNAT to one of the backend IPs).
> > 
> > In this case GW0 should advertise a route, R0, for VIP0 only on
> > chassis hv0.
> > 
> > 2. Traffic destined to a public NAT IP0 (dnat-or-snat or dnat) is
> > always
> > handled on hv0.
> > 
> > In this case GW0 should advertise a route, R0, for IP0 only on
> > chassis hv0.
> > 
> > II. For a distributed router DR0, in order for NAT/LB to work, the
> > router needs to have at least one distributed gateway port (DGP1,
> > DGP2,
> > DGPn).
> > 
> > 1. Traffic destined to a load balancer VIP0 is handled on all
> > chassis
> > that bind the DGPs.  That is we'll have logical flows that DNAT
> > traffic
> > with destination VIP0 to the backend IPs on all chassis where DGP1,
> > DGP2,.. DGPn are bound.
> > 
> > The code in this patch was trying to find the port where the first
> > backend IP is bound to an LSP and use that as tracked_port but
> > that's
> > anyway incorrect IMO.  That's because backends can be distributed
> > across
> > multiple chassis, not necessarily the ones that bind the DGPs.
> > 
> > In this case, because traffic for VIP0 is handled in the same way
> > on all
> > chassis that bind DGPs (if traffic towards VIP0 lands on a
> > different
> > chassis it will be redirected to one of the DGPs) we should
> > advertise N
> > routes for VIP0, each of them with tracked_port=DGPx for x in
> > [1..n].
> > This will generates routes with a better metric on hypervisors that
> > bind
> > the DGPs.
> > 
> > 2. Traffic destined to a public NAT IP0 (dnat-or-snat or dnat):
> > 
> > 2a. if the NAT rule is "distributed" (that is the NAT rule has
> > "logical_port" and "external_mac" configured) then this traffic is
> > handled on the hypervisor that binds "logical_port", let's say hv1.
> > 
> 
> Hi Frode,
> 
> Following up from our discussion during the community technical
> meeting
> earlier: I was wondering if this is the case you're trying to
> optimize -
> purely distributed dnat_and_snat (floating IPs).  In this case the
> traffic from and to the floating IP is always processed (and NATed)
> on
> the chassis that binds the "logical_port".


Hi Dumitru
this is only case for "distributed NAT", right? When logical_port and
"mac" are set in the NAT record, right? (just double-checking) Yes,
this is what I'd like to go for in the next version.

Martin.

> 
> > In this case the advertised route should have "tracked_port" ==
> > "NAT.logical_port".  This will generate routes with a better metric
> > on
> > hv1 (where logical_port is bound) compared to hv0.
> > 
> 
> If we set "tracked_port" to be the port binding that corresponds to
> the
> NAT rule's "logical_port" (VM for which we define the floating IP in
> OpenStack I guess) don't we achieve the efficient routing you had in
> mind?
> 
> Or is there another case I'm missing?
> 
> Thanks,
> Dumitru
> 
> > 2b. if the NAT rule is not distributed ovn-northd determines which
> > of
> > the DGPs it should use for that NAT rule (see the
> > lrouter_check_nat_entry() function that returns the DGP in the last
> > output argument).  Assuming that for this NAT rule, DGPx was chosen
> > then
> > all traffic for that public NAT IP is handled by the chassis that
> > binds
> > DGPx, that is the chassis where "is_chassis_resident(DGPx) ==
> > true",
> > let's say hv2.
> > 
> > In this case the advertised route should have "tracked_port" ==
> > DGPx.
> > This will generate routes with a better metric on hv2, where DGPx
> > is bound.
> > 
> > Is my understanding correct?  If so, it seems we don't need to
> > lookup
> > the LSPs that own NAT logical_ips or LB backend IPs.
> > 
> > > > > Would that be an acceptable compromise?
> > > > 
> > > > I believe Martin discovered that we have missed planning work
> > > > on
> > > > properly distributing the dnat flows, so they will still be on
> > > > the
> > > > chassis with the active DGP in the above topology. So since we
> > > > need
> > > > followup work on that anyway, I guess pin-pointing the location
> > > > of
> > > > tracked_port to backend also could be delayed.
> > > > 
> > > > But if we at the very least can get tracked_port pointed at the
> > > > active
> > > > DGP, the feature would be useful and consumable in this
> > > > topology,
> > > > with
> > > > truly distributed support perhaps added later.
> > > > 
> > > > With the branching delayed we may or may not get to more of
> > > > this now
> > > > though, I'll let Martin chime in on his thoughts on how much
> > > > time
> > > > would be needed.
> > > > 
> > > > --
> > > > Frode Nordahl
> > > 
> > > Frode summarized it very well and I hear your concerns as well
> > > Dumitru.
> > > However if we manage to remove the need for
> > > "publish_{nat,lb}_route",
> > > as I suggested above, and if we improve the
> > > "find_port_in_connected_ls", do you think that it would reduce
> > > complexity enough?
> > > 
> > 
> > [...]
> > 
> > I'll try to focus here on the performance/scalability aspect of
> > northd.
> > 
> > [incremental processing nodes and dependencies]
> > ===============================================
> > None of the advertised routes for LB and NAT IPs generate logical
> > flows
> > (you tried to handle that with the install_lflows field).  But that
> > also
> > means that when we create their corresponding parsed_routes we
> > shouldn't
> > trigger en_lflow (the node that generates logical flows) runs
> > because
> > there's no work to do for them.
> > 
> > I understand it was easier to do like that because
> > en_advertised_route_sync currently uses the same parsed_route table
> > input as en_lflows.
> > 
> > But it doesn't have to.
> > 
> > We can split the computation a bit if we use the following
> > incremental
> > processing DAG (where vertical lines represent a top->down
> > dependency):
> > 
> >   en_northd
> >      |
> >      +-------------+------------+-------------+----+
> >      |             |            |             |    |
> >    en_routes    en_lr_nat    en_lb_data       |    |
> >      |   |         |            |             |    |
> >      |   |         |            |             |    |
> >      |   |         +------------+-------------+    |
> >      |   |                      |                  |
> >      |   |                      |                  |
> >      |   |                 en_lr_stateful          |
> >      |   |                      |                  |
> >      |   |                      +    +-------------+
> >      |   |                      |    |
> >      |   |               en_dynamic_routes
> >      |   |                      |
> >      |   |                      |
> >  +---+   +---------+------------+
> >  |                 |
> >  |                 |
> > en_lflow    en_advertised_route_sync
> > 
> > In this DAG, the only new node is en_dynamic_routes.  Its input is
> > the
> > data of the "en_lr_stateful" node (the en_northd input can have a
> > no-op
> > handler, it's just for lookups), that is:
> > 
> > struct ed_type_lr_stateful {
> >     struct lr_stateful_table table;
> > 
> >     /* Node's tracked data. */
> >     struct lr_stateful_tracked_data trk_data;
> > };
> > 
> > struct lr_stateful_tracked_data {
> >     /* Created or updated logical router with LB and/or NAT data.
> > */
> >     struct hmapx crupdated; /* Stores 'struct lr_stateful_record'.
> > */
> > };
> > 
> > struct lr_stateful_record {
> > ...
> >     /* UUID of the NB Logical Router. */
> >     struct uuid nbr_uuid;
> > 
> >     /* Unique id of the logical router.  Note : This id is assigned
> >      * by the northd engine node for each logical router. */
> >     size_t lr_index;
> > 
> >     /* This lrnat_rec comes from the en_lrnat engine node data. */
> >     const struct lr_nat_record *lrnat_rec;
> > ...
> >     /* Load Balancer vIPs relevant for this datapath. */
> >     struct ovn_lb_ip_set *lb_ips;
> > }
> > 
> > The (output) data of the en_dynamic_routes nodes should be a
> > separate
> > hmap storing struct parsed_routes, one record for each LB/NAT route
> > we
> > should advertise.
> > 
> > The en_dynamic_routes_run() function can iterate through the input
> > ed_type_lr_stateful.table data and generate these new
> > parsed_routes.  It
> > also has (indirectly) a reference to the struct ovn_datapath for
> > each
> > lr_stateful_record (we can use lr_index to efficiently look it up
> > in
> > en_northd data).  lr_stateful_records already contain the 'struct
> > ovn_nat' entries that are relevant to that specific router.
> > 
> > All we miss is the DGP information in the ovn_nat records.  But we
> > can
> > change the way 'struct ovn_nat' is created and pre-populate a new
> > field,
> > e.g., 'struct ovn_port *dgp' using lrouter_check_nat_entry().
> > 
> > It also enables us to efficiently implement incremental processing
> > (maybe as follow up) because we could have a handler for the
> > en_lr_stateful -> en_dynamic_routes dependency.  This handler will
> > have
> > to use the 'struct lr_stateful_tracked_data trk_data' input to only
> > update (add/remove) advertised routes for routers that have
> > changes.
> > 
> > Another change is that en_advertised_route_sync now has two inputs:
> > - en_routes - routes that also install lflows
> > - en_dynamic_routes - routes that only need to be advertised
> > 
> > So we need to adapt advertised_route_table_sync() to:
> > - build the 'sync_routes' map with entries from both types of
> > inputs
> > - walk the SB.Advertised_Route table and keep only those that are
> > in
> > 'sync_routes'
> > - create SB.Advertised_Route records for the remainder of
> > parsed_routes
> > in 'sync_routes'.
> > 
> > The main advantage is that LB and NAT changes are still handled
> > incrementally from a logical flow perspective, they don't generate
> > a
> > recompute of the en_lflow node (that's very costly in general).
> > 
> > But this works only if the suggestion (see [sugestion]) I had above
> > for
> > advertising NAT/LB routes with tracked_port is correct.
> > 
> > Sorry for the long email, looking forward to your feedback!
> > 
> > Best regards,
> > Dumitru
> > 
> 

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [RFC ovn v3 4/4] northd: Advertise Routes for LB and NAT VIPs.

Reply via email to