On 29/07/2021 17:16, Dumitru Ceara wrote:
> On 7/29/21 5:42 PM, Mark Gray wrote:
>> On 29/07/2021 14:20, Dumitru Ceara wrote:
>>> On 7/29/21 11:59 AM, Dumitru Ceara wrote:
>>>> Logical_Flows are immutable, that is, when the match or action change
>>>> ovn-northd will actually remove the old logical flow and add a new one.
>>>>
>>>> Commit ecc941c70f16 added flows to flood ARPs to routers for unreachable
>>>> addresses.
>>>>
>>>> In the following topology (2 switches connected to a router with a load
>>>> balancer attached to it):
>>>>
>>>> S1 --- (IP1/MASK1) R (IP2/MASK2) --- S2
>>>>
>>>> LB1 (applied on R) with N VIPs.
>>>>
>>>> The following logical flows were generated:
>>>>
>>>> S1 (in_l2_lkup): if "arp.op == 1 && arp.tpa == { VIP1, .., VIPn, IP1 }"
>>>> then ...
>>>> S2 (in_l2_lkup): if "arp.op == 1 && arp.tpa == { VIP1, .., VIPn, IP2 }"
>>>> then ...
>>>>
>>>> While this seemed a good idea initially because it would reduce the
>>>> total number of logical flows, this approach has a few major downsides
>>>> that severely affect scalability:
>>>>
>>>> 1. When logical datapath grouping is enabled (which is how high scale
>>>> deployments should use OVN) there is no chance that the flows for
>>>> reachable/unreachable router owned IPs are grouped together for all
>>>> datapaths that correspond to the logical switches they are generated
>>>> for.
>>>> 2. Whenever a VIP (VIPn+1) is added to LB1 all these flows will be
>>>> removed and new ones will be added with the only difference being the
>>>> VIPn+1 aded to the flow match. As this is a new logical flow record,
>>>> ovn-controller will have to remove all previously installed openflows
>>>> (for the deleted logical flow) and add new ones. However, for most
>>>> of these openflows, the only change is the cookie value (the first 32
>>>> bits of the logical flow UUID). At scale this unnecessary openflow
>>>> churn will induce significant latency. Moreover, this also creates
>>>> unnecessary jsonrpc traffic from ovn-northd towards OVN_Southbound
>>>> because the old flow needs to be removed and a new one (with a large
>>>> match field) needs to be installed.
>>>>
>>>> To avoid this, we now install individual flows, one per IP, for
>>>> managing ARP/ND traffic from logical switches towards router owned
>>>> IPs that are reachable/unreachable on the logical port connecting
>>>> the switch.
>>>>
>>>> Fixes: ecc941c70f16 ("northd: Flood ARPs to routers for "unreachable"
>>>> addresses.")
>>>> Signed-off-by: Dumitru Ceara <[email protected]>
>>>> ---
>>>> On a deployment based on an ovn-kubernetes cluster with 20 nodes, 2500
>>>> load balancer VIPs, 5000 pods, we:
>>>> - measure number of logical flows in the SB DB, size of the SB DB
>>>> - then create an additional 250 pods (logical ports), bind them, and add
>>>> a load balancer VIP (with a single backend) for each of the additional
>>>> pods, and measure how long it takes for all openflows corresponding to
>>>> each of the pods to be installed in OVS.
>>>>
>>>> Results:
>>>> - without fix:
>>>> ------------
>>>> SB lflows: 67976
>>>> SB size (uncompacted): 46 MiB
>>>> SB size (compacted): 40 MiB
>>>>
>>>> after 250 ports + VIPs added:
>>>> SB lflows: 71479
>>>> SB size (uncompacted): 248 MiB
>>>> SB size (compacted): 42 MiB
>>>> Max time to install all relevant openflows for a single pod: ~6sec
>>>>
>>>> - with fix:
>>>> ---------
>>>> SB lflows: 70399
>>>> SB size (uncompacted): 46 MiB
>>>> SB size (compacted): 40 MiB
>>>>
>>>> after 250 ports + VIPs added:
>>>> SB lflows: 74149
>>>> SB size (uncompacted): 165 MiB
>>>> SB size (compacted): 42 MiB
>>>> Max time to install all relevant openflows for a single pod: ~100msec
>>>
>>>
>>
>> I think you should put these^ (or a summary of these - including maybe
>> the northd degradation) into the commit message.
>>
>
> Sure, I'll do that in the next revision.
If you are going to revise it, in the comments for a lot of the
functions in northd.c, it references "Ingress Table 19". Can you change
this? It is actually table 22 now.
>
>>> CC: Mark Michelson (original author of ecc941c70f16 ("northd: Flood ARPs
>>> to routers for "unreachable" addresses."))
>>>
>>> There's another note I should mention here:
>>>
>>> Since we now call build_lswitch_rport_arp_req_flow_for_unreachable_ip()
>>> for every load balancer VIP (instead of once for a list of VIPs) load on
>>> ovn-northd will be increased as it has to create more logical flows that
>>> will eventually be grouped together on the same datapath group.
>>>
>>> When testing at higher scales (120 nodes setups similar topologies to
>>> the one above but with 45K VIPs) I see ovn-northd poll loop intervals
>>> taking ~3s longer on my machine (from ~10s to ~13s). I do, however,
>>
>> I'm kind of surprised at that degradation because your change doesn't
>> introduce any additional looping. This is purely due to the cost of
>> ovn_lflow_add_with_hint()?
>
> Well, yes, and building of the match. I commented it out and the hit is
> gone.
>
>>
>>> still think that not merging the flow matches is a good thing because it
>>> has a very substantial impact on all ovn-controllers in the cluster,
>>> already visible in our tests with smaller, 20 node clusters.
>>>
>>> A point here is that I think it's fine if
>>> build_lswitch_rport_arp_req_flow_for_unreachable_ip() installs flows at
>>> a priority lower than the one used for reachable IPs, e.g., 79. That
>>> would allow us to avoid the !lrouter_port_ipv*_reachable() condition
>>> when installing these flows.
>>
>> I don't get this point?
>
> In my opinion we don't need to do:
>
> if (port_reachable(op)) {
> add_reachable_flows(prio=80);
> } else {
> add_unreachable_flows(prio=90);
> }
>
> We can probably just do:
>
> if (port_reachable(op)) {
> add_reachable_flows(prio=80);
> }
> add_unreachable_flows(prio=79);
>
> The idea being that a packet is either for a "reachable IP" in which
> case we hit the prio 80 flow, or it's not, in which case we hit the prio
> 79 flow.
Ok if add_unreachable_flows() adds flows for all the IPs then this
should work.
>
> The advantage is we don't have the port dependency for
> add_unreachable_flows() anymore which would allow us to prebuild all
> possible the matches/actions once and just reuse them for all logical
> switches where they need to be added.
>
> Thanks,
> Dumitru
>
>>>
>>> As a follow up optimization we can take the same approach as in
>>> 0fd39c6cf0ba "northd: move build_lb_rules in build_lswitch_flows_for_lb"
>>> and precompute the matches once for all load balancer VIPs, and just use
>>> them when installing the flows on the connected logical switch datapaths.
>>>
>>> Thoughts?
>>>
>>> Thanks,
>>> Dumitru
>>>
>>
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev