On 2/7/25 12:55 PM, Dumitru Ceara wrote:
> On 2/7/25 10:44 AM, [email protected] wrote:
>> Hi Dumitru, Frode
>> Thanks for the detailed feedback. I'll add my thoughts inline below.
>>
>
> Hi Martin, Frode,
>
> I had an internal discussion about this patch with Ales so I'm adding
> him to the thread too.
>
> I'll trim this email a bit and try to keep the relevant context as I
> think we might need a bit of redesign.
>
> [...]
>
>>>>>>> +/* This function searches for an ovn_port with specific "op_ip"
>>>>>>> + * IP address in all LS datapaths directly connected to the
>>>>>>> + * LR datapath "od".
>>>>>>> + * If no match is found, this function returns NULL. */
>>>>>>> +static struct ovn_port *
>>>>>>> +find_port_in_connected_ls(struct ovn_datapath *od, char *op_ip)
>>>>>>> +{
>>>>>>> +    struct in6_addr *ip_addr = xmalloc(sizeof *ip_addr);
>>>>>>> +    unsigned int plen;
>>>>>>> +    bool is_ipv4 = true;
>>>>>>> +
>>>>>>> +    /* Return immediately if op_ip isn't a host route. */
>>>>>>
>>>>>> We shouldn't have to check this, it should be the caller doing it.
>>>>>> We shouldn't parse values coming directly from the database here.
>>>>>> We should receive an already correct struct in6_addr * as an
>>>>>> argument instead.
>>
>> Hmm, the ultimate source of the "op_ip" string is the NAT's internal
>> IP/LB backend IP. I don't see an immediate way to have this information
>> available in in6_addr* format. Am I overlooking something?
>>
>>>>>>
>
> [...]
>
>>>>>>
>>>>>> 'struct ovn_datapath' has the following field:
>>>>>>
>>>>>>     struct sset router_ips; /* Router port IPs except the IPv6 LLAs. */
>>>>>>
>>>>>> Instead of walking all ports of all connected switches of 'od' maybe
>>>>>> we should also store all switch port IPs inside 'struct ovn_datapath'.
>>>>>> Then we'd just have to do a sset_find() for each connected switch.
>>
>> This would only tell us if the LS has an LSP with the given IP, right?
>> Or are you suggesting this as an optimization step that would tell us
>> if it's even worth looking through the LSPs on the given LS? I.e., we
>> would go from:
>>     for each LS -> for each LSP -> compare op_ip with lsp_ip
>> to:
>>     for each LS -> if sset_find(host_ips, op_ip) -> for each LSP ->
>>         compare op_ip with lsp_ip
>>
>> If so, I agree this would spare us a lot of unnecessary lookups. Though
>> the proposition to alter the ovn_datapath structure is a bit scary.
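>>
>> To make sure I read you right, something like this rough sketch is what
>> I'd picture (purely illustrative; 'ls_datapaths' and 'host_ips' are
>> made-up names, not existing fields):
>>
>> for (size_t i = 0; i < n_ls_datapaths; i++) {   /* Each LS on 'od'. */
>>     struct ovn_datapath *ls = ls_datapaths[i];
>>
>>     /* Cheap pre-check: skip switches that can't host 'op_ip'. */
>>     if (!sset_contains(&ls->host_ips, op_ip)) {
>>         continue;
>>     }
>>
>>     /* Only now walk this switch's LSPs and compare them against
>>      * 'op_ip'. */
>>     ...
>> }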
>>
>
> 'op_ip' here is the internal IP (logical_ip for NAT and backend_ip for
> LB) and the goal as far as I understand is to advertise the route with a
> better metric on the chassis where traffic to that internal IP is handled.
>
> That being said, I think we don't need tracked_port to be the actual LSP
> that owns the internal IP. Please see below (in the [suggestion] section).
>
> [...]
>
>>>
>>> I'm a bit of an early bird, so I'll give my high-level view on the
>>> question below, and let others look at your responses in more detail
>>> when their day starts, before digging into any details myself.
>>>
>>> First, thanks a lot for your thorough review and for sharing your
>>> insights; the incremental processing engine is indeed one of the darker
>>> arts, so I think everyone on this thread has things to learn from you.
>>>
>
> No problem, sorry again for not looking at this patch in detail earlier.
>
>>>> Finding the tracked_port (especially for load balancers but for NATs
>>>> too) seems to be the really costly operation here. However, we don't
>>>> seem to have any tests that cover that.
>>>
>>> MJ is working on a multi-node test which will cover some of this; the
>>> current plan is to add it as a separate patch once it's ready.
>>>
>>>> This makes me wonder: is it really a requirement for this feature
>>>> that the tracked port is set for these routes? Not having to implement
>>>> it now would simplify the logic and remove the need to trace back
>>>> from lb/nat parsed_routes to their NB configurations.
>>>
>>> The motivation for us with this is to provide a potential low-barrier
>>> entry to consuming native BGP support from OVN for CMSes such as
>>> OpenStack in brownfield deployments.
>>>
>>> Some of the constraints we work within are:
>>> * clouds with hundreds of tenants, each of which has its own routers,
>>> so configuring dynamic routing directly on all of those is not an
>>> option.
>>> * an l3-only underlay with no VLANs spanned across racks or EVPN
>>> support in the networking equipment.
>>>
>>> To meet those constraints, a potential solution is to pin a gateway
>>> router on every chassis, connect it up to a join-ls that spans all or
>>> groups of chassis, and then connect the existing logical OVN topology
>>> to the join-ls.
>>>
>>> In our tests, this appears to work quite well, and for it to be truly
>>> useful the tracked_port needs to be populated so that we can tell the
>>> fabric where to send packets to NAT/LB addresses through the use of the
>>> lower metric. Without it, all chassis with reachability to the join-ls
>>> would announce the same prefixes with the same metric, leading to
>>> suboptimal paths for traffic (tromboning).
>>>
>
> OK, I understand the use case better now, thanks for explaining.
>
> I'll try to summarize how traffic is handled for the different types of
> routers and public IPs and how I understand routes for those should be
> advertised:
>
> [suggestion]
> ============
> I. For a gateway router GW0 (bound to a chassis, e.g., hv0):
>
> 1. Traffic destined to a load balancer VIP0 is always handled on hv0
> (DNAT to one of the backend IPs).
>
> In this case GW0 should advertise a route, R0, for VIP0 only on chassis hv0.
>
> 2. Traffic destined to a public NAT IP0 (dnat_and_snat or dnat) is
> always handled on hv0.
>
> In this case GW0 should advertise a route, R0, for IP0 only on chassis hv0.
>
> II. For a distributed router DR0, in order for NAT/LB to work, the
> router needs to have at least one distributed gateway port (DGP1, DGP2,
> ..., DGPn).
>
> 1. Traffic destined to a load balancer VIP0 is handled on all chassis
> that bind the DGPs. That is, we'll have logical flows that DNAT traffic
> with destination VIP0 to the backend IPs on all chassis where DGP1,
> DGP2, ..., DGPn are bound.
>
> The code in this patch was trying to find the port where the first
> backend IP is bound to an LSP and use that as tracked_port, but that's
> incorrect anyway IMO, because backends can be distributed across
> multiple chassis, not necessarily the ones that bind the DGPs.
>
> In this case, because traffic for VIP0 is handled in the same way on all
> chassis that bind DGPs (if traffic towards VIP0 lands on a different
> chassis it will be redirected to one of the DGPs), we should advertise n
> routes for VIP0, each of them with tracked_port=DGPx for x in [1..n].
> This will generate routes with a better metric on the hypervisors that
> bind the DGPs.
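>
> Roughly something like this (pseudo-code only, to illustrate the idea;
> add_lb_route() is a made-up helper standing in for the parsed_route
> creation, and 'routes'/'vip' come from the surrounding context):
>
> /* Advertise one host route for the VIP per DGP, each pointing
>  * tracked_port at that DGP. */
> for (size_t i = 0; i < od->n_l3dgw_ports; i++) {
>     struct ovn_port *dgp = od->l3dgw_ports[i];
>
>     add_lb_route(routes, od, vip /* VIP0 */, dgp /* tracked_port */);
> }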
>
> 2. Traffic destined to a public NAT IP0 (dnat_and_snat or dnat):
>
> 2a. If the NAT rule is "distributed" (that is, the NAT rule has
> "logical_port" and "external_mac" configured) then this traffic is
> handled on the hypervisor that binds "logical_port", let's say hv1.
>
Hi Frode,

Following up on our discussion during the community technical meeting
earlier: I was wondering if this is the case you're trying to optimize -
purely distributed dnat_and_snat (floating IPs). In this case the
traffic from and to the floating IP is always processed (and NATed) on
the chassis that binds the "logical_port".
> In this case the advertised route should have "tracked_port" ==
> "NAT.logical_port". This will generate routes with a better metric on
> hv1 (where logical_port is bound) compared to hv0.
>
If we set "tracked_port" to be the port binding that corresponds to the
NAT rule's "logical_port" (the VM for which we define the floating IP in
OpenStack, I guess), don't we achieve the efficient routing you had in
mind? Or is there another case I'm missing?
Thanks,
Dumitru
> 2b. If the NAT rule is not distributed, ovn-northd determines which of
> the DGPs it should use for that NAT rule (see the
> lrouter_check_nat_entry() function that returns the DGP in the last
> output argument). Assuming that DGPx was chosen for this NAT rule, then
> all traffic for that public NAT IP is handled by the chassis that binds
> DGPx, that is the chassis where "is_chassis_resident(DGPx) == true",
> let's say hv2.
>
> In this case the advertised route should have "tracked_port" == DGPx.
> This will generate routes with a better metric on hv2, where DGPx is bound.
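>
> In pseudo-code the selection above boils down to something like this
> (sketch only, exact field accesses may differ; lrouter_check_nat_entry()
> is the existing function that already returns the DGP, and 'nat', 'dgp'
> and 'ls_ports' would come from the surrounding NAT processing loop):
>
> struct ovn_port *tracked_port;
>
> if (nat->nb->logical_port && nat->nb->external_mac) {
>     /* 2a. Distributed NAT: traffic is handled on the chassis where
>      *     "logical_port" resides. */
>     tracked_port = ovn_port_find(ls_ports, nat->nb->logical_port);
> } else {
>     /* 2b. Centralized NAT: traffic is handled on the chassis that binds
>      *     the DGP chosen for this NAT rule. */
>     tracked_port = dgp; /* As returned by lrouter_check_nat_entry(). */
> }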
>
> Is my understanding correct? If so, it seems we don't need to look up
> the LSPs that own NAT logical_ips or LB backend IPs.
>
>>>> Would that be an acceptable compromise?
>>>
>>> I believe Martin discovered that we have missed planning work on
>>> properly distributing the dnat flows, so they will still be on the
>>> chassis with the active DGP in the above topology. Since we need
>>> follow-up work on that anyway, I guess pinpointing the location of
>>> tracked_port down to the backend could also be delayed.
>>>
>>> But if we can at the very least get tracked_port pointed at the active
>>> DGP, the feature would be useful and consumable in this topology, with
>>> truly distributed support perhaps added later.
>>>
>>> With the branching delayed we may or may not get to more of this now,
>>> though; I'll let Martin chime in with his thoughts on how much time
>>> would be needed.
>>>
>>> --
>>> Frode Nordahl
>>
>> Frode summarized it very well and I hear your concerns as well, Dumitru.
>> However, if we manage to remove the need for "publish_{nat,lb}_route",
>> as I suggested above, and if we improve "find_port_in_connected_ls", do
>> you think that would reduce the complexity enough?
>>
>
> [...]
>
> I'll try to focus here on the performance/scalability aspect of northd.
>
> [incremental processing nodes and dependencies]
> ===============================================
> None of the advertised routes for LB and NAT IPs generate logical flows
> (you tried to handle that with the install_lflows field). But that also
> means that when we create their corresponding parsed_routes we shouldn't
> trigger runs of en_lflow (the node that generates logical flows) because
> there's no work to do for them.
>
> I understand it was easier to do it like that because
> en_advertised_route_sync currently uses the same parsed_route table
> input as en_lflow.
>
> But it doesn't have to.
>
> We can split the computation a bit if we use the following incremental
> processing DAG (where vertical lines represent a top->down dependency):
>
>                        en_northd
>                            |
>      +-------------+-------+----+-------------+----------+
>      |             |            |             |          |
>  en_routes     en_lr_nat   en_lb_data         |          |
>      |             |            |             |          |
>      |             +------------+-------------+          |
>      |                          |                        |
>      |                   en_lr_stateful                  |
>      |                          |                        |
>      |                          +------------------------+
>      |                          |
>      |                  en_dynamic_routes
>      |                          |
>      +------+                   |
>      |      |                   |
>      |      +---------+---------+
>      |                |
>  en_lflow en_advertised_route_sync
>
> In this DAG, the only new node is en_dynamic_routes. Its main input is
> the data of the "en_lr_stateful" node (the en_northd input can have a
> no-op handler, it's just there for lookups), that is:
>
> struct ed_type_lr_stateful {
>     struct lr_stateful_table table;
>
>     /* Node's tracked data. */
>     struct lr_stateful_tracked_data trk_data;
> };
>
> struct lr_stateful_tracked_data {
>     /* Created or updated logical router with LB and/or NAT data. */
>     struct hmapx crupdated; /* Stores 'struct lr_stateful_record'. */
> };
>
> struct lr_stateful_record {
>     ...
>     /* UUID of the NB Logical Router. */
>     struct uuid nbr_uuid;
>
>     /* Unique id of the logical router. Note: This id is assigned
>      * by the northd engine node for each logical router. */
>     size_t lr_index;
>
>     /* This lrnat_rec comes from the en_lrnat engine node data. */
>     const struct lr_nat_record *lrnat_rec;
>     ...
>     /* Load Balancer vIPs relevant for this datapath. */
>     struct ovn_lb_ip_set *lb_ips;
> };
>
> The (output) data of the en_dynamic_routes node should be a separate
> hmap storing 'struct parsed_route' entries, one record for each LB/NAT
> route we should advertise.
>
> The en_dynamic_routes_run() function can iterate through the input
> ed_type_lr_stateful.table data and generate these new parsed_routes. It
> also has (indirectly) a reference to the struct ovn_datapath for each
> lr_stateful_record (we can use lr_index to efficiently look it up in
> en_northd data). lr_stateful_records already contain the 'struct
> ovn_nat' entries that are relevant to that specific router.
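>
> In other words, something along these lines (just a sketch; the output
> type 'dynamic_routes_data' and the dynamic_routes_add_for_lr() helper
> are made up, and I'm assuming an iterator like LR_STATEFUL_TABLE_FOR_EACH
> or whatever lr-stateful.h actually provides):
>
> static void
> en_dynamic_routes_run(struct engine_node *node, void *data)
> {
>     struct dynamic_routes_data *dr_data = data;
>     struct ed_type_lr_stateful *lr_stateful_data =
>         engine_get_input_data("lr_stateful", node);
>     const struct lr_stateful_record *lr_stateful_rec;
>
>     LR_STATEFUL_TABLE_FOR_EACH (lr_stateful_rec, &lr_stateful_data->table) {
>         /* Look up the ovn_datapath via lr_index in the en_northd data
>          * and build one parsed_route per NAT external IP / LB VIP. */
>         dynamic_routes_add_for_lr(dr_data, lr_stateful_rec);
>     }
>
>     engine_set_node_state(node, EN_UPDATED);
> }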
>
> All we're missing is the DGP information in the ovn_nat records. But we
> can change the way 'struct ovn_nat' is created and pre-populate a new
> field, e.g., 'struct ovn_port *dgp', using lrouter_check_nat_entry().
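>
> I.e., something like this (sketch, only the new field shown, the rest of
> the struct elided):
>
> struct ovn_nat {
>     const struct nbrec_nat *nb;
>     ...
>     /* The DGP this NAT rule resolves to; filled in when the record is
>      * created, based on lrouter_check_nat_entry(). */
>     struct ovn_port *dgp;
> };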
>
> This also enables us to efficiently implement incremental processing
> (maybe as a follow-up) because we could have a handler for the
> en_lr_stateful -> en_dynamic_routes dependency. This handler would have
> to use the 'struct lr_stateful_tracked_data trk_data' input to only
> update (add/remove) advertised routes for routers that have changed.
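>
> The handler could look roughly like this (sketch; the helper names and
> the output type are made up):
>
> static bool
> dynamic_routes_lr_stateful_handler(struct engine_node *node, void *data)
> {
>     struct dynamic_routes_data *dr_data = data;
>     struct ed_type_lr_stateful *lr_stateful_data =
>         engine_get_input_data("lr_stateful", node);
>     struct hmapx_node *hmapx_node;
>
>     HMAPX_FOR_EACH (hmapx_node, &lr_stateful_data->trk_data.crupdated) {
>         const struct lr_stateful_record *lr_stateful_rec = hmapx_node->data;
>
>         /* Re-create the advertised routes of just this router. */
>         dynamic_routes_remove_for_lr(dr_data, lr_stateful_rec);
>         dynamic_routes_add_for_lr(dr_data, lr_stateful_rec);
>     }
>
>     engine_set_node_state(node, EN_UPDATED);
>     return true;
> }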
>
> Another change is that en_advertised_route_sync now has two inputs:
> - en_routes - routes that also install lflows
> - en_dynamic_routes - routes that only need to be advertised
>
> So we need to adapt advertised_route_table_sync() to:
> - build the 'sync_routes' map with entries from both types of inputs
> - walk the SB.Advertised_Route table and keep only the records that are
>   in 'sync_routes'
> - create SB.Advertised_Route records for the remaining parsed_routes in
>   'sync_routes'
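>
> Roughly (sketch; the member and helper names are approximate, and the
> two input hmaps stand in for the en_routes and en_dynamic_routes data):
>
> struct hmap sync_routes = HMAP_INITIALIZER(&sync_routes);
> const struct parsed_route *route;
>
> /* Routes that also install lflows (en_routes input). */
> HMAP_FOR_EACH (route, key_node, parsed_routes) {
>     add_to_sync_routes(&sync_routes, route);
> }
>
> /* Advertise-only LB/NAT routes (en_dynamic_routes input). */
> HMAP_FOR_EACH (route, key_node, dynamic_routes) {
>     add_to_sync_routes(&sync_routes, route);
> }
>
> /* Then, as today: walk SB.Advertised_Route, keep the rows still present
>  * in 'sync_routes' and create SB rows for whatever remains. */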
>
> The main advantage is that LB and NAT changes are still handled
> incrementally from a logical flow perspective: they don't trigger a
> recompute of the en_lflow node (which is very costly in general).
>
> But this only works if the suggestion I had above (see the [suggestion]
> section) for advertising NAT/LB routes with tracked_port is correct.
>
> Sorry for the long email, looking forward to your feedback!
>
> Best regards,
> Dumitru
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev