On 2/10/25 8:07 PM, [email protected] wrote:
> On Mon, 2025-02-10 at 18:47 +0100, Dumitru Ceara wrote:
>> On 2/7/25 12:55 PM, Dumitru Ceara wrote:
>>> On 2/7/25 10:44 AM, [email protected] wrote:
>>>> Hi Dumitru, Frode
>>>> Thanks for the detailed feedback. I'll add my thoughts inline
>>>> below.
>>>>
>>>
>>> Hi Martin, Frode,
>>>
>>> I had an internal discussion about this patch with Ales so I'm
>>> adding
>>> him to the thread too.
>>>
>>> I'll trim this email a bit and try to keep the relevant context as
>>> I
>>> think we might need a bit of redesign.
>>>
>>> [...]
>>>
>>>>>>>>> +/* This function searches for an ovn_port with specific "op_ip"
>>>>>>>>> + * IP address in all LS datapaths directly connected to the
>>>>>>>>> + * LR datapath "od".
>>>>>>>>> + * If no match is found, this function returns NULL. */
>>>>>>>>> +static struct ovn_port *
>>>>>>>>> +find_port_in_connected_ls(struct ovn_datapath *od, char *op_ip)
>>>>>>>>> +{
>>>>>>>>> +    struct in6_addr *ip_addr = xmalloc(sizeof *ip_addr);
>>>>>>>>> +    unsigned int plen;
>>>>>>>>> +    bool is_ipv4 = true;
>>>>>>>>> +
>>>>>>>>> +    /* Return immediately if op_ip isn't a host route. */
>>>>>>>> We shouldn't have to check this, it should be the caller doing
>>>>>>>> it.  We shouldn't parse values coming directly from the database
>>>>>>>> here.  We should receive an already correct struct in6_addr *
>>>>>>>> as an argument instead.
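>>>>>>>>
>>>>>>>> Roughly something like this (untested sketch, just to illustrate
>>>>>>>> the idea):
>>>>>>>>
>>>>>>>>     /* Caller: parse the NAT/LB IP string once. */
>>>>>>>>     struct in6_addr ip;
>>>>>>>>     if (ip46_parse(op_ip, &ip)) {
>>>>>>>>         op = find_port_in_connected_ls(od, &ip);
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     /* ...and the helper only deals with parsed addresses. */
>>>>>>>>     static struct ovn_port *
>>>>>>>>     find_port_in_connected_ls(struct ovn_datapath *od,
>>>>>>>>                               const struct in6_addr *op_ip);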
>>>>
>>>> Hmm, the ultimate source of the "op_ip" string is the NAT's internal
>>>> IP / LB backend IP.  I don't see an immediate way to have this
>>>> information available in in6_addr * format.  Am I overlooking
>>>> something?
>>>>
>>>>>>>>
>>>
>>> [...]
>>>
>>>>>>>>
>>>>>>>> 'struct ovn_datapath' has the following field:
>>>>>>>>
>>>>>>>>     struct sset router_ips; /* Router port IPs except the IPv6 LLAs. */
>>>>>>>>
>>>>>>>> Instead of walking all ports of all connected switches of 'od'
>>>>>>>> maybe we should also store all switch port IPs inside 'struct
>>>>>>>> ovn_datapath'.  Then we'd just have to do a sset_find() for each
>>>>>>>> connected switch.
>>>>
>>>> This would only tell us if the LS has an LSP with the given IP,
>>>> right?  Or are you suggesting this as an optimization step that
>>>> would tell us if it's even worth looking through the LSPs on the
>>>> given LS?  I.e. we would go from:
>>>>
>>>>     for each LS -> for each LSP -> compare op_ip with lsp_ip
>>>>
>>>> to:
>>>>
>>>>     for each LS -> if sset_find(host_ips, op_ip) -> for each LSP ->
>>>>         compare op_ip with lsp_ip
>>>>
>>>> If so, I agree this would spare us a lot of unnecessary lookups.
>>>> Though the proposition to alter the ovn_datapath structure is a bit
>>>> scary.
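>>>>
>>>> In C that second variant would be roughly (untested sketch; the
>>>> 'ls_peers' / 'lsp_ips' names are just illustrative for the connected
>>>> switches and the proposed per-switch IP set):
>>>>
>>>>     for (size_t i = 0; i < od->n_ls_peers; i++) {       /* each LS */
>>>>         struct ovn_datapath *ls = od->ls_peers[i];
>>>>         if (!sset_contains(&ls->lsp_ips, op_ip)) {
>>>>             continue;          /* No LSP on this switch owns op_ip. */
>>>>         }
>>>>         /* Only now walk the LSPs of this switch to find the port. */
>>>>     }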
>>>>
>>>
>>> 'op_ip' here is the internal IP (logical_ip for NAT and backend_ip
>>> for LB) and the goal, as far as I understand, is to advertise the
>>> route with a better metric on the chassis where traffic to that
>>> internal IP is handled.
>>>
>>> That being said, I think we don't need tracked_port to be the actual
>>> LSP that owns the internal IP.  Please see below (in the [suggestion]
>>> section).
>>>
>>> [...]
>>>
>>>>>
>>>>> I'm a bit of an early bird, so I'll give my high level view on the
>>>>> question below, and allow others to look at your responses in more
>>>>> detail when their day starts before participating in any details
>>>>> myself.
>>>>>
>>>>> First, thanks a lot for your thorough review and sharing of
>>>>> insights, the incremental processing engine is indeed one of the
>>>>> darker arts, so I think everyone on this thread has things to learn
>>>>> from you.
>>>>>
>>>
>>> No problem, sorry again for not looking at this patch in detail
>>> earlier.
>>>
>>>>>> Finding the tracked_port (especially for load balancers but for
>>>>>> NATs too) seems to be the very costly operation here.  However we
>>>>>> don't seem to have any tests that cover that.
>>>>>
>>>>> MJ is working on a multi-node test which will cover some of this;
>>>>> the current plan is to add it as a separate patch once ready.
>>>>>
>>>>>> This makes me wonder, is it really a requirement for this feature
>>>>>> that the tracked port is set for these routes?  Not having to
>>>>>> implement it now seems to simplify the logic and would remove the
>>>>>> need to trace back from lb/nat parsed_routes to their NB
>>>>>> configurations.
>>>>>
>>>>> The motivation for us with this is to provide a potential
>>>>> low-barrier entry to consuming native BGP support from OVN for
>>>>> CMSes such as OpenStack in brownfield deployments.
>>>>>
>>>>> Some of the constraints we work under are:
>>>>> * clouds with hundreds of tenants, each of which has their own
>>>>>   routers, so configuring dynamic routing directly on all of those
>>>>>   is not an option.
>>>>> * an l3-only underlay with no VLANs spanned across racks and no
>>>>>   EVPN support in the networking equipment.
>>>>>
>>>>> To meet those constraints, a potential solution is to pin a gateway
>>>>> router on every chassis, connect it up to a join-ls that spans all
>>>>> or groups of chassis, and then connect the existing logical OVN
>>>>> topology to the join-ls.
>>>>>
>>>>> In our tests, this appears to work quite well, and for it to be
>>>>> truly useful the tracked_port needs to be populated so that we can
>>>>> tell the fabric where to send packets to NAT/LB addresses through
>>>>> the use of the lower metric.  Without it, all chassis with
>>>>> reachability to the join-ls would announce the same prefixes with
>>>>> the same metric, leading to suboptimal paths for traffic
>>>>> (tromboning).
>>>>>
>>>
>>> OK, I understand the use case better now, thanks for explaining.
>>>
>>> I'll try to summarize how traffic is handled for the different types
>>> of routers and public IPs, and how I understand routes for those
>>> should be advertised:
>>>
>>> [suggestion]
>>> ============
>>> I. For a gateway router GW0 (bound to a chassis, e.g., hv0):
>>>
>>> 1. Traffic destined to a load balancer VIP0 is always handled on hv0
>>> (DNAT to one of the backend IPs).
>>>
>>> In this case GW0 should advertise a route, R0, for VIP0 only on
>>> chassis hv0.
>>>
>>> 2. Traffic destined to a public NAT IP0 (dnat_and_snat or dnat) is
>>> always handled on hv0.
>>>
>>> In this case GW0 should advertise a route, R0, for IP0 only on
>>> chassis hv0.
>>>
>>> II. For a distributed router DR0, in order for NAT/LB to work, the
>>> router needs to have at least one distributed gateway port (DGP1,
>>> DGP2, ..., DGPn).
>>>
>>> 1. Traffic destined to a load balancer VIP0 is handled on all chassis
>>> that bind the DGPs.  That is, we'll have logical flows that DNAT
>>> traffic with destination VIP0 to the backend IPs on all chassis where
>>> DGP1, DGP2, ..., DGPn are bound.
>>>
>>> The code in this patch was trying to find the port where the first
>>> backend IP is bound to an LSP and use that as tracked_port, but
>>> that's incorrect anyway IMO.  That's because backends can be
>>> distributed across multiple chassis, not necessarily the ones that
>>> bind the DGPs.
>>>
>>> In this case, because traffic for VIP0 is handled in the same way on
>>> all chassis that bind DGPs (if traffic towards VIP0 lands on a
>>> different chassis it will be redirected to one of the DGPs), we
>>> should advertise N routes for VIP0, each of them with
>>> tracked_port=DGPx for x in [1..n].  This will generate routes with a
>>> better metric on the hypervisors that bind the DGPs.
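>>>
>>> I.e., roughly something like this (sketch; add_advertised_route() is
>>> a made-up helper, l3dgw_ports is the router's DGP array):
>>>
>>>     /* One advertised route per DGP for a LB VIP on a distributed
>>>      * router. */
>>>     for (size_t i = 0; i < od->n_l3dgw_ports; i++) {
>>>         struct ovn_port *dgp = od->l3dgw_ports[i];
>>>         add_advertised_route(routes, od, vip, dgp /* tracked_port */);
>>>     }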
>>>
>>> 2. Traffic destined to a public NAT IP0 (dnat_and_snat or dnat):
>>>
>>> 2a. if the NAT rule is "distributed" (that is, the NAT rule has
>>> "logical_port" and "external_mac" configured) then this traffic is
>>> handled on the hypervisor that binds "logical_port", let's say hv1.
>>
>> Hi Frode,
>>
>> Following up from our discussion during the community technical
>> meeting earlier: I was wondering if this is the case you're trying to
>> optimize - purely distributed dnat_and_snat (floating IPs).  In this
>> case the traffic from and to the floating IP is always processed (and
>> NATed) on the chassis that binds the "logical_port".
>
> Hi Dumitru
Hi Martin,
> this is only the case for "distributed NAT", right?  That is, when
> "logical_port" and "external_mac" are set in the NAT record? (just
> double-checking)  Yes, this is what I'd like to go for in the next
> version.
>
Right, from the ovn-nb man page:

    <column name="external_mac">
      <p>
        A MAC address.
      </p>
      <p>
        This is only used on the gateway port on distributed routers.
        This must be specified in order for the NAT rule to be
        processed in a distributed manner on all chassis.  If this is
        not specified for a NAT rule on a distributed router, then
        this NAT rule will be processed in a centralized manner on
        the gateway port instance on the gateway chassis.
      </p>
Regards,
Dumitru
> Martin.
>
>>
>>> In this case the advertised route should have "tracked_port" ==
>>> "NAT.logical_port".  This will generate routes with a better metric
>>> on hv1 (where logical_port is bound) compared to hv0.
>>>
>>
>> If we set "tracked_port" to be the port binding that corresponds to
>> the
>> NAT rule's "logical_port" (VM for which we define the floating IP in
>> OpenStack I guess) don't we achieve the efficient routing you had in
>> mind?
>>
>> Or is there another case I'm missing?
>>
>> Thanks,
>> Dumitru
>>
>>> 2b. if the NAT rule is not distributed, ovn-northd determines which
>>> of the DGPs it should use for that NAT rule (see the
>>> lrouter_check_nat_entry() function that returns the DGP in the last
>>> output argument).  Assuming that for this NAT rule DGPx was chosen,
>>> then all traffic for that public NAT IP is handled by the chassis
>>> that binds DGPx, that is the chassis where
>>> "is_chassis_resident(DGPx) == true", let's say hv2.
>>>
>>> In this case the advertised route should have "tracked_port" == DGPx.
>>> This will generate routes with a better metric on hv2, where DGPx is
>>> bound.
>>>
>>> Is my understanding correct?  If so, it seems we don't need to look
>>> up the LSPs that own NAT logical_ips or LB backend IPs.
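>>>
>>> In other words, picking tracked_port for a NAT route would boil down
>>> to something like this (rough sketch; the VIF lookup and the 'dgp'
>>> variable are illustrative):
>>>
>>>     const struct ovn_port *tracked_port;
>>>     if (nat->nb->logical_port && nat->nb->external_mac) {
>>>         /* Distributed dnat_and_snat: track the VIF itself. */
>>>         tracked_port = ovn_port_find(ls_ports, nat->nb->logical_port);
>>>     } else {
>>>         /* Centralized NAT: track the DGP that
>>>          * lrouter_check_nat_entry() picked for this rule. */
>>>         tracked_port = dgp;
>>>     }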
>>>
>>>>>> Would that be an acceptable compromise?
>>>>>
>>>>> I believe Martin discovered that we have missed planning work on
>>>>> properly distributing the dnat flows, so they will still be on the
>>>>> chassis with the active DGP in the above topology.  So since we
>>>>> need follow-up work on that anyway, I guess pinpointing the
>>>>> location of tracked_port to the backend could also be delayed.
>>>>>
>>>>> But if we can, at the very least, get tracked_port pointed at the
>>>>> active DGP, the feature would be useful and consumable in this
>>>>> topology, with truly distributed support perhaps added later.
>>>>>
>>>>> With the branching delayed we may or may not get to more of this
>>>>> now though; I'll let Martin chime in with his thoughts on how much
>>>>> time would be needed.
>>>>>
>>>>> --
>>>>> Frode Nordahl
>>>>
>>>> Frode summarized it very well and I hear your concerns as well,
>>>> Dumitru.  However, if we manage to remove the need for
>>>> "publish_{nat,lb}_route", as I suggested above, and if we improve
>>>> find_port_in_connected_ls(), do you think that it would reduce the
>>>> complexity enough?
>>>>
>>>
>>> [...]
>>>
>>> I'll try to focus here on the performance/scalability aspect of
>>> northd.
>>>
>>> [incremental processing nodes and dependencies]
>>> ===============================================
>>> None of the advertised routes for LB and NAT IPs generate logical
>>> flows (you tried to handle that with the install_lflows field).  But
>>> that also means that when we create their corresponding parsed_routes
>>> we shouldn't trigger a run of en_lflow (the node that generates
>>> logical flows) because there's no work to do for them.
>>>
>>> I understand it was easier to do it like that because
>>> en_advertised_route_sync currently uses the same parsed_route table
>>> input as en_lflows.
>>>
>>> But it doesn't have to.
>>>
>>> We can split the computation a bit if we use the following
>>> incremental processing DAG (where vertical lines represent a
>>> top->down dependency):
>>>
>>>                             en_northd
>>>                                 |
>>>     +-------------+-------------+------------+------------+
>>>     |             |             |            |            |
>>> en_routes     en_lr_nat    en_lb_data        |            |
>>>     |             |             |            |            |
>>>     |             +-------------+------------+            |
>>>     |                           |                         |
>>>     |                    en_lr_stateful                   |
>>>     |                           |                         |
>>>     |                           +------------+------------+
>>>     |                                        |
>>>     |                                en_dynamic_routes
>>>     |                                        |
>>>     +-------------------+                    |
>>>     |                   |                    |
>>>     |                   +---------+----------+
>>>     |                             |
>>> en_lflow              en_advertised_route_sync
>>>
>>> In this DAG, the only new node is en_dynamic_routes.  Its input is
>>> the data of the "en_lr_stateful" node (the en_northd input can have a
>>> no-op handler, it's just for lookups), that is:
>>>
>>>     struct ed_type_lr_stateful {
>>>         struct lr_stateful_table table;
>>>
>>>         /* Node's tracked data. */
>>>         struct lr_stateful_tracked_data trk_data;
>>>     };
>>>
>>>     struct lr_stateful_tracked_data {
>>>         /* Created or updated logical router with LB and/or NAT data. */
>>>         struct hmapx crupdated; /* Stores 'struct lr_stateful_record'. */
>>>     };
>>>
>>>     struct lr_stateful_record {
>>>         ...
>>>         /* UUID of the NB Logical Router. */
>>>         struct uuid nbr_uuid;
>>>
>>>         /* Unique id of the logical router.  Note: This id is assigned
>>>          * by the northd engine node for each logical router. */
>>>         size_t lr_index;
>>>
>>>         /* This lrnat_rec comes from the en_lrnat engine node data. */
>>>         const struct lr_nat_record *lrnat_rec;
>>>         ...
>>>         /* Load Balancer vIPs relevant for this datapath. */
>>>         struct ovn_lb_ip_set *lb_ips;
>>>     };
>>>
>>> The (output) data of the en_dynamic_routes node should be a separate
>>> hmap storing struct parsed_routes, one record for each LB/NAT route
>>> we should advertise.
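>>>
>>> E.g. something like this (sketch, the struct name is made up):
>>>
>>>     /* Data of the en_dynamic_routes engine node. */
>>>     struct dynamic_routes_data {
>>>         /* Contains 'struct parsed_route' records, one per LB VIP /
>>>          * NAT external IP that should be advertised. */
>>>         struct hmap routes;
>>>     };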
>>>
>>> The en_dynamic_routes_run() function can iterate through the input
>>> ed_type_lr_stateful.table data and generate these new parsed_routes.
>>> It also has (indirectly) a reference to the struct ovn_datapath for
>>> each lr_stateful_record (we can use lr_index to efficiently look it
>>> up in the en_northd data).  lr_stateful_records already contain the
>>> 'struct ovn_nat' entries that are relevant to that specific router.
>>>
>>> All we're missing is the DGP information in the ovn_nat records.  But
>>> we can change the way 'struct ovn_nat' is created and pre-populate a
>>> new field, e.g., 'struct ovn_port *dgp', using
>>> lrouter_check_nat_entry().
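>>>
>>> Something like this (sketch, only the relevant bits of the struct):
>>>
>>>     struct ovn_nat {
>>>         ...
>>>         /* New: the DGP this NAT rule is handled on, filled in via
>>>          * lrouter_check_nat_entry() when the record is created. */
>>>         struct ovn_port *dgp;
>>>         ...
>>>     };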
>>>
>>> It also enables us to efficiently implement incremental processing
>>> (maybe as a follow-up) because we could have a handler for the
>>> en_lr_stateful -> en_dynamic_routes dependency.  This handler will
>>> have to use the 'struct lr_stateful_tracked_data trk_data' input to
>>> only update (add/remove) advertised routes for routers that have
>>> changes.
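>>>
>>> Roughly (sketch; the update helper is made up, error/recompute
>>> handling omitted):
>>>
>>>     static bool
>>>     dynamic_routes_lr_stateful_handler(struct engine_node *node,
>>>                                        void *data_)
>>>     {
>>>         struct dynamic_routes_data *data = data_;
>>>         const struct ed_type_lr_stateful *lr_stateful =
>>>             engine_get_input_data("lr_stateful", node);
>>>
>>>         struct hmapx_node *hn;
>>>         HMAPX_FOR_EACH (hn, &lr_stateful->trk_data.crupdated) {
>>>             const struct lr_stateful_record *lrsr = hn->data;
>>>             /* Re-create the advertised routes of this router only. */
>>>             dynamic_routes_update_for_lr(data, lrsr);
>>>         }
>>>
>>>         engine_set_node_state(node, EN_UPDATED);
>>>         return true;
>>>     }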
>>>
>>> Another change is that en_advertised_route_sync now has two inputs:
>>> - en_routes - routes that also install lflows
>>> - en_dynamic_routes - routes that only need to be advertised
>>>
>>> So we need to adapt advertised_route_table_sync() to (roughly as
>>> sketched below):
>>> - build the 'sync_routes' map with entries from both types of inputs
>>> - walk the SB.Advertised_Route table and keep only those that are in
>>>   'sync_routes'
>>> - create SB.Advertised_Route records for the remaining parsed_routes
>>>   in 'sync_routes'.
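>>>
>>> Sketch (member and helper names are illustrative):
>>>
>>>     /* Merge both route sources before syncing to the SB. */
>>>     struct hmap sync_routes = HMAP_INITIALIZER(&sync_routes);
>>>
>>>     struct parsed_route *pr;
>>>     HMAP_FOR_EACH (pr, key_node, parsed_routes) {      /* en_routes */
>>>         sync_routes_add(&sync_routes, pr);
>>>     }
>>>     HMAP_FOR_EACH (pr, key_node, dynamic_routes) {
>>>         /* en_dynamic_routes */
>>>         sync_routes_add(&sync_routes, pr);
>>>     }
>>>
>>>     /* Then diff 'sync_routes' against SB.Advertised_Route as before:
>>>      * keep matching records, delete stale ones, create the rest. */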
>>>
>>> The main advantage is that LB and NAT changes are still handled
>>> incrementally from a logical flow perspective: they don't generate a
>>> recompute of the en_lflow node (which is very costly in general).
>>>
>>> But this works only if the suggestion I had above (see [suggestion])
>>> for advertising NAT/LB routes with tracked_port is correct.
>>>
>>> Sorry for the long email, looking forward to your feedback!
>>>
>>> Best regards,
>>> Dumitru
>>>
>>
>