On 2/7/25 10:44 AM, [email protected] wrote:
> Hi Dumitru, Frode
> Thanks for the detailed feedback. I'll add my thoughts inline below.
> 

Hi Martin, Frode,

I had an internal discussion about this patch with Ales so I'm adding
him to the thread too.

I'll trim this email a bit and try to keep the relevant context as I
think we might need a bit of redesign.

[...]

>>>>>> +/* This function searches for an ovn_port with specific "op_ip"
>>>>>> + * IP address in all LS datapaths directly connected to the
>>>>>> + * LR datapath "od".
>>>>>> + * If no match is found, this function returns NULL.*/
>>>>>> +static struct ovn_port *
>>>>>> +find_port_in_connected_ls(struct ovn_datapath *od, char *op_ip)
>>>>>> +{
>>>>>> +    struct in6_addr *ip_addr = xmalloc(sizeof *ip_addr);
>>>>>> +    unsigned int plen;
>>>>>> +    bool is_ipv4 = true;
>>>>>> +
>>>>>> +    /* Return immediately if op_ip isn't a host route.*/
>>>>>
>>>>> We shouldn't have to check this, it should be the caller doing it.
>>>>> We shouldn't parse values coming directly from the database here.
>>>>> We should receive an already correct struct in6_addr * instead as
>>>>> argument.
> 
> Hmm, the ultimate source of the "op_ip" string is the NAT's internal
> IP/LB backend IP. I don't see an immediate way to have this information
> available in the in6_addr* format. Am I overlooking something?
> 
>>>>>

[...]

>>>>>
>>>>> 'struct ovn_datapath' has the following field:
>>>>>
>>>>> struct sset router_ips; /* Router port IPs except the IPv6 LLAs. */
>>>>>
>>>>> Instead of walking all ports of all connected switches of 'od'
>>>>> maybe we should also store all switch port IPs inside 'struct
>>>>> ovn_datapath'.  Then we'd just have to do a sset_find() for each
>>>>> connected switch.
> 
> This would only tell us if LS has LSP with the given IP, right? Or are
> you suggesting this as an optimization step that would tell us if it's
> even worth looking through the LSPs on the given LS? i.e. we would go
> from:
>   for each LS -> for each LSP -> compare op_ip with lsp_ip
> to:
>   for each LS -> if sset_find(host_ips, op_ip) -> for each LSP ->
> compare op_ip with lsp_ip
> 
> If so, I agree this would spare us a lot of unnecessary lookups.
> Though the proposition to alter the ovn_datapath structure is a bit scary.
> 

'op_ip' here is the internal IP (logical_ip for NAT and backend_ip for
LB) and the goal as far as I understand is to advertise the route with a
better metric on the chassis where traffic to that internal IP is handled.

That being said, I think we don't need tracked_port to be the actual LSP
that owns the internal IP.  Please see below (in the [suggestion] section).

[...]

>>
>> I'm a bit of an early bird, so I'll give my high level view on the
>> below question, and allow others to look at your responses in more
>> detail when their day starts before participating in any details
>> myself.
>>
>> First, thanks a lot for your thorough review and sharing of insights;
>> the incremental processing engine is indeed one of the darker arts,
>> so I think everyone on this thread has things to learn from you.
>>

No problem, sorry again for not looking at this patch in detail earlier.

>>> Finding the tracked_port (especially for load balancers but for
>>> NATs too) seems to be the very costly operation here.  However we
>>> don't seem to have any tests that cover that.
>>
>> MJ is working on a multi-node test which will cover some of this, the
>> current plan is to add it as a separate patch once ready.
>>
>>> This makes me wonder, is it really a requirement for this feature
>>> that the tracked port is set for these routes?  Not having to
>>> implement it now seems to simplify the logic and would remove the
>>> need to trace back from lb/nat parsed_routes to their NB
>>> configurations.
>>
>> The motivation for us with this is to provide a potential low barrier
>> entry to consuming native BGP support from OVN for CMSes such as
>> OpenStack in brown field deployments.
>>
>> Some of the constraints we work in are:
>> * clouds with hundreds of tenants, each of which has its own
>> routers, so configuring dynamic routing directly on all of those is
>> not an option.
>> * l3-only underlay with no VLANs spanned across racks or EVPN support
>> in networking equipment.
>>
>> To meet those constraints, a potential solution is to pin a gateway
>> router on every chassis, connect it up to a join-ls that spans all or
>> groups of chassis, and then connect existing logical OVN topology to
>> the join-ls.
>>
>> In our tests, this appears to work quite well, and for it to be truly
>> useful the tracked_port needs to be populated so that we can tell the
>> fabric where to send packets to NAT/LB addresses through the use of
>> the lower metric. Without it, all chassis with reachability to the
>> join-ls would announce the same prefixes with the same metric,
>> leading to suboptimal paths for traffic (tromboning).
>>

OK, I understand the use case better now, thanks for explaining.

I'll try to summarize how traffic is handled for the different types of
routers and public IPs and how I understand routes for those should be
advertised:

[suggestion]
============
I. For a gateway router GW0 (bound to a chassis, e.g., hv0):

1. Traffic destined to a load balancer VIP0 is always handled on hv0
(DNAT to one of the backend IPs).

In this case GW0 should advertise a route, R0, for VIP0 only on chassis hv0.

2. Traffic destined to a public NAT IP0 (dnat-or-snat or dnat) is always
handled on hv0.

In this case GW0 should advertise a route, R0, for IP0 only on chassis hv0.

II. For a distributed router DR0, in order for NAT/LB to work, the
router needs to have at least one distributed gateway port (DGP1, DGP2,
DGPn).

1. Traffic destined to a load balancer VIP0 is handled on all chassis
that bind the DGPs.  That is, we'll have logical flows that DNAT traffic
with destination VIP0 to the backend IPs on all chassis where DGP1,
DGP2, ..., DGPn are bound.

The code in this patch was trying to find the port where the first
backend IP is bound to an LSP and use that as tracked_port, but that's
incorrect anyway, IMO.  That's because backends can be distributed
across multiple chassis, not necessarily the ones that bind the DGPs.

In this case, because traffic for VIP0 is handled in the same way on all
chassis that bind DGPs (if traffic towards VIP0 lands on a different
chassis it will be redirected to one of the DGPs), we should advertise n
routes for VIP0, each of them with tracked_port=DGPx for x in [1..n].
This will generate routes with a better metric on hypervisors that bind
the DGPs.

2. Traffic destined to a public NAT IP0 (dnat-or-snat or dnat):

2a. if the NAT rule is "distributed" (that is the NAT rule has
"logical_port" and "external_mac" configured) then this traffic is
handled on the hypervisor that binds "logical_port", let's say hv1.

In this case the advertised route should have "tracked_port" ==
"NAT.logical_port".  This will generate routes with a better metric on
hv1 (where logical_port is bound) compared to hv0.

2b. if the NAT rule is not distributed, ovn-northd determines which of
the DGPs it should use for that NAT rule (see the
lrouter_check_nat_entry() function that returns the DGP in the last
output argument).  Assuming that DGPx was chosen for this NAT rule, all
traffic for that public NAT IP is handled by the chassis that binds
DGPx, that is the chassis where "is_chassis_resident(DGPx) == true",
let's say hv2.

In this case the advertised route should have "tracked_port" == DGPx.
This will generate routes with a better metric on hv2, where DGPx is bound.

Is my understanding correct?  If so, it seems we don't need to look up
the LSPs that own NAT logical_ips or LB backend IPs.
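
To make this a bit more concrete, here's a rough sketch of how
tracked_port could be picked when building these advertised routes.
The helper names (add_advertised_route(), nat_is_distributed(),
nat_external_ip(), nat_get_logical_port()) are only illustrative, and
'dgp' refers to the pre-populated ovn_nat field I suggest further down
in this email:

/* Sketch only; helper names and some field names are illustrative. */
static void
build_lb_vip_advertised_routes(const struct ovn_datapath *od,
                               const char *vip, struct hmap *routes)
{
    if (!od->n_l3dgw_ports) {
        /* Case I.1: gateway router; the route is only advertised on the
         * chassis that binds the router, no extra tracked_port needed. */
        add_advertised_route(routes, od, vip, NULL);
        return;
    }
    /* Case II.1: one route per DGP, so every chassis binding a DGP
     * advertises the VIP with a better metric. */
    for (size_t i = 0; i < od->n_l3dgw_ports; i++) {
        add_advertised_route(routes, od, vip, od->l3dgw_ports[i]);
    }
}

static void
build_nat_advertised_route(const struct ovn_datapath *od,
                           const struct ovn_nat *nat, struct hmap *routes)
{
    if (!od->n_l3dgw_ports) {
        /* Case I.2: gateway router, same as above. */
        add_advertised_route(routes, od, nat_external_ip(nat), NULL);
    } else if (nat_is_distributed(nat)) {
        /* Case II.2a: logical_port + external_mac are set, so traffic
         * is handled where logical_port is bound. */
        add_advertised_route(routes, od, nat_external_ip(nat),
                             nat_get_logical_port(nat));
    } else {
        /* Case II.2b: use the DGP that lrouter_check_nat_entry() picked
         * for this NAT rule (the pre-populated nat->dgp field). */
        add_advertised_route(routes, od, nat_external_ip(nat), nat->dgp);
    }
}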

>>> Would that be an acceptable compromise?
>>
>> I believe Martin discovered that we have missed planning work on
>> properly distributing the dnat flows, so they will still be on the
>> chassis with the active DGP in the above topology. Since we need
>> follow-up work on that anyway, I guess pinpointing the location of
>> tracked_port to the backend could also be delayed.
>>
>> But if we at the very least can get tracked_port pointed at the
>> active DGP, the feature would be useful and consumable in this
>> topology, with truly distributed support perhaps added later.
>>
>> With the branching delayed we may or may not get to more of this now
>> though, I'll let Martin chime in on his thoughts on how much time
>> would be needed.
>>
>> --
>> Frode Nordahl
> 
> Frode summarized it very well, and I hear your concerns as well, Dumitru.
> However, if we manage to remove the need for "publish_{nat,lb}_route",
> as I suggested above, and if we improve "find_port_in_connected_ls",
> do you think that it would reduce complexity enough?
> 

[...]

I'll try to focus here on the performance/scalability aspect of northd.

[incremental processing nodes and dependencies]
===============================================
None of the advertised routes for LB and NAT IPs generate logical flows
(you tried to handle that with the install_lflows field).  But that also
means that when we create their corresponding parsed_routes we shouldn't
trigger a run of en_lflow (the node that generates logical flows),
because there's no work for it to do for them.

I understand it was easier to do it like that because
en_advertised_route_sync currently uses the same parsed_route table
input as en_lflow.

But it doesn't have to.

We can split the computation a bit if we use the following incremental
processing DAG (where vertical lines represent a top->down dependency):

  en_northd
     |
     +-------------+------------+-------------+----+
     |             |            |             |    |
   en_routes    en_lr_nat    en_lb_data       |    |
     |   |         |            |             |    |
     |   |         |            |             |    |
     |   |         +------------+-------------+    |
     |   |                      |                  |
     |   |                      |                  |
     |   |                 en_lr_stateful          |
     |   |                      |                  |
     |   |                      +    +-------------+
     |   |                      |    |
     |   |               en_dynamic_routes
     |   |                      |
     |   |                      |
 +---+   +---------+------------+
 |                 |
 |                 |
en_lflow    en_advertised_route_sync

In this DAG, the only new node is en_dynamic_routes.  Its input is the
data of the "en_lr_stateful" node (the en_northd input can have a no-op
handler, it's just for lookups), that is:

struct ed_type_lr_stateful {
    struct lr_stateful_table table;

    /* Node's tracked data. */
    struct lr_stateful_tracked_data trk_data;
};

struct lr_stateful_tracked_data {
    /* Created or updated logical router with LB and/or NAT data. */
    struct hmapx crupdated; /* Stores 'struct lr_stateful_record'. */
};

struct lr_stateful_record {
...
    /* UUID of the NB Logical Router. */
    struct uuid nbr_uuid;

    /* Unique id of the logical router.  Note : This id is assigned
     * by the northd engine node for each logical router. */
    size_t lr_index;

    /* This lrnat_rec comes from the en_lrnat engine node data. */
    const struct lr_nat_record *lrnat_rec;
...
    /* Load Balancer vIPs relevant for this datapath. */
    struct ovn_lb_ip_set *lb_ips;
}

The (output) data of the en_dynamic_routes node should be a separate
hmap storing struct parsed_route records, one for each LB/NAT route we
should advertise.

The en_dynamic_routes_run() function can iterate through the input
ed_type_lr_stateful.table data and generate these new parsed_routes.  It
also has (indirectly) a reference to the struct ovn_datapath for each
lr_stateful_record (we can use lr_index to efficiently look it up in
en_northd data).  lr_stateful_records already contain the 'struct
ovn_nat' entries that are relevant to that specific router.
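
As a rough sketch of that run function (I'm assuming the usual
engine_get_input_data() helper and the LR_STATEFUL_TABLE_FOR_EACH
iterator from lr-stateful.h; dynamic_routes_clear(), build_nat_routes()
and build_lb_vip_routes() are placeholders, and the exact engine
callback signature should of course follow what inc-proc-eng.h expects):

/* Output data of en_dynamic_routes: advertise-only routes, no lflows. */
struct ed_type_dynamic_routes {
    struct hmap routes; /* Contains struct parsed_route. */
};

static void
en_dynamic_routes_run(struct engine_node *node, void *data_)
{
    struct ed_type_dynamic_routes *data = data_;
    const struct ed_type_lr_stateful *lr_stateful_data =
        engine_get_input_data("lr_stateful", node);

    dynamic_routes_clear(data);

    const struct lr_stateful_record *lr_stateful_rec;
    LR_STATEFUL_TABLE_FOR_EACH (lr_stateful_rec, &lr_stateful_data->table) {
        /* Use lr_index to look up the ovn_datapath in en_northd data,
         * then build one parsed_route per NAT/LB route to advertise. */
        build_nat_routes(data, lr_stateful_rec);
        build_lb_vip_routes(data, lr_stateful_rec);
    }

    engine_set_node_state(node, EN_UPDATED);
}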

All we're missing is the DGP information in the ovn_nat records.  But we
can change the way 'struct ovn_nat' is created and pre-populate a new
field, e.g., 'struct ovn_port *dgp', using lrouter_check_nat_entry().
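
Something along these lines (field name illustrative):

struct ovn_nat {
...
    /* DGP that handles this NAT rule, as selected by
     * lrouter_check_nat_entry(); NULL for gateway routers. */
    const struct ovn_port *dgp;
};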

It also enables us to efficiently implement incremental processing
(maybe as a follow-up) because we could have a handler for the
en_lr_stateful -> en_dynamic_routes dependency.  This handler would have
to use the 'struct lr_stateful_tracked_data trk_data' input to only
update (add/remove) advertised routes for routers that have changed.
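
For reference, the wiring plus a possible handler skeleton (macro and
handler names follow the conventions in inc-proc-northd.c as far as I
remember them; dynamic_routes_update_for_lr() is a placeholder):

/* inc-proc-northd.c (sketch): */
ENGINE_NODE(dynamic_routes, "dynamic_routes");
...
engine_add_input(&en_dynamic_routes, &en_northd, engine_noop_handler);
engine_add_input(&en_dynamic_routes, &en_lr_stateful,
                 dynamic_routes_lr_stateful_handler);

engine_add_input(&en_advertised_route_sync, &en_routes, NULL);
engine_add_input(&en_advertised_route_sync, &en_dynamic_routes, NULL);

static bool
dynamic_routes_lr_stateful_handler(struct engine_node *node, void *data_)
{
    struct ed_type_dynamic_routes *data = data_;
    const struct ed_type_lr_stateful *lr_stateful_data =
        engine_get_input_data("lr_stateful", node);

    struct hmapx_node *hmapx_node;
    HMAPX_FOR_EACH (hmapx_node, &lr_stateful_data->trk_data.crupdated) {
        const struct lr_stateful_record *lr_stateful_rec = hmapx_node->data;
        /* Rebuild only the advertised routes of this logical router. */
        dynamic_routes_update_for_lr(data, lr_stateful_rec);
    }

    engine_set_node_state(node, EN_UPDATED);
    return true;
}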

Another change is that en_advertised_route_sync now has two inputs:
- en_routes - routes that also install lflows
- en_dynamic_routes - routes that only need to be advertised

So we need to adapt advertised_route_table_sync() to:
- build the 'sync_routes' map with entries from both types of inputs
- walk the SB.Advertised_Route table and keep only those that are in
'sync_routes'
- create SB.Advertised_Route records for the remainder of parsed_routes
in 'sync_routes'.
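
For illustration, something like this (argument, member and helper
names are placeholders):

static void
advertised_route_table_sync(/* existing args, plus: */
                            const struct hmap *lflow_routes,
                            const struct hmap *dynamic_routes)
{
    struct hmap sync_routes = HMAP_INITIALIZER(&sync_routes);

    const struct parsed_route *route;
    HMAP_FOR_EACH (route, key_node, lflow_routes) {
        publish_route(&sync_routes, route);
    }
    HMAP_FOR_EACH (route, key_node, dynamic_routes) {
        publish_route(&sync_routes, route);
    }

    /* Walk SB.Advertised_Route: keep rows that match an entry in
     * 'sync_routes' (removing them from the map), delete stale rows,
     * then insert SB rows for whatever remains in 'sync_routes'. */
    ...

    hmap_destroy(&sync_routes);
}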

The main advantage is that LB and NAT changes are still handled
incrementally from a logical flow perspective: they don't trigger a
recompute of the en_lflow node (which is very costly in general).

But this works only if the suggestion (see [suggestion]) I had above for
advertising NAT/LB routes with tracked_port is correct.

Sorry for the long email, looking forward to your feedback!

Best regards,
Dumitru
