On 7/29/21 1:49 PM, Numan Siddique wrote:
On Thu, Jul 29, 2021 at 1:14 PM Dumitru Ceara <[email protected]> wrote:

On 7/29/21 6:53 PM, Numan Siddique wrote:
On Thu, Jul 29, 2021 at 12:31 PM Dumitru Ceara <[email protected]> wrote:

On 7/29/21 5:42 PM, Mark Gray wrote:
On 29/07/2021 14:20, Dumitru Ceara wrote:
On 7/29/21 11:59 AM, Dumitru Ceara wrote:
Logical_Flows are immutable, that is, when the match or action changes,
ovn-northd actually removes the old logical flow and adds a new one.

Commit ecc941c70f16 added flows to flood ARPs to routers for unreachable
addresses.

In the following topology (2 switches connected to a router with a load
balancer attached to it):

S1 --- (IP1/MASK1) R (IP2/MASK2) --- S2

LB1 (applied on R) with N VIPs.

The following logical flows were generated:

S1 (in_l2_lkup): if "arp.op == 1 && arp.tpa == { VIP1, .., VIPn, IP1 }" then ...
S2 (in_l2_lkup): if "arp.op == 1 && arp.tpa == { VIP1, .., VIPn, IP2 }" then ...

While this seemed like a good idea initially because it reduces the
total number of logical flows, this approach has a few major downsides
that severely affect scalability:

1. When logical datapath grouping is enabled (which is how high-scale
    deployments should use OVN) the flows for reachable/unreachable
    router-owned IPs have no chance of being grouped together across the
    datapaths of the logical switches they are generated for, because
    each switch's flow has a different match.
2. Whenever a VIP (VIPn+1) is added to LB1 all these flows will be
    removed and new ones will be added with the only difference being the
    VIPn+1 added to the flow match.  As this is a new logical flow record,
    ovn-controller will have to remove all previously installed openflows
    (for the deleted logical flow) and add new ones.  However, for most
    of these openflows, the only change is the cookie value (the first 32
    bits of the logical flow UUID).  At scale this unnecessary openflow
    churn will induce significant latency.  Moreover, this also creates
    unnecessary jsonrpc traffic from ovn-northd towards OVN_Southbound
    because the old flow needs to be removed and a new one (with a large
    match field) needs to be installed.

To avoid this, we now install individual flows, one per IP, for
managing ARP/ND traffic from logical switches towards router-owned
IPs that are reachable/unreachable on the logical port connecting
the switch to the router.
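
For example, for the topology above, the generated flows are now of the
form (illustrative, one flow per VIP and per router-owned IP):

S1 (in_l2_lkup): if "arp.op == 1 && arp.tpa == VIP1" then ...
...
S1 (in_l2_lkup): if "arp.op == 1 && arp.tpa == VIPn" then ...
S1 (in_l2_lkup): if "arp.op == 1 && arp.tpa == IP1" then ...

(and similarly for S2, with IP2)

Adding VIPn+1 to LB1 now only adds a new logical flow for VIPn+1 instead
of removing and re-adding the large combined flow.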

Fixes: ecc941c70f16 ("northd: Flood ARPs to routers for "unreachable" 
addresses.")
Signed-off-by: Dumitru Ceara <[email protected]>
---
On a deployment based on an ovn-kubernetes cluster with 20 nodes, 2500
load balancer VIPs, 5000 pods, we:
- measure number of logical flows in the SB DB, size of the SB DB
- then create an additional 250 pods (logical ports), bind them, and add
   a load balancer VIP (with a single backend) for each of the additional
   pods, and measure how long it takes for all openflows corresponding to
   each of the pods to be installed in OVS.

Results:
- without fix:
   ------------
   SB lflows:             67976
   SB size (uncompacted): 46 MiB
   SB size (compacted):   40 MiB

   after 250 ports + VIPs added:
   SB lflows:             71479
   SB size (uncompacted): 248 MiB
   SB size (compacted):   42  MiB
   Max time to install all relevant openflows for a single pod: ~6sec

- with fix:
   ---------
   SB lflows:             70399
   SB size (uncompacted): 46 MiB
   SB size (compacted):   40 MiB

   after 250 ports + VIPs added:
   SB lflows:             74149
   SB size (uncompacted): 165 MiB
   SB size (compacted):   42 MiB
   Max time to install all relevant openflows for a single pod: ~100msec



I think you should put these^ (or a summary of these - including maybe
the northd degradation) into the commit message.


Sure, I'll do that in the next revision.

CC: Mark Michelson (original author of ecc941c70f16 ("northd: Flood ARPs
to routers for "unreachable" addresses."))

There's another note I should mention here:

Since we now call build_lswitch_rport_arp_req_flow_for_unreachable_ip()
for every load balancer VIP (instead of once for a list of VIPs) the
load on ovn-northd increases because it has to create more logical
flows, which eventually get grouped together on the same datapath group.

When testing at higher scale (120-node setups with topologies similar to
the one above but with 45K VIPs) I see ovn-northd poll loop intervals
taking ~3s longer on my machine (from ~10s to ~13s).  I do, however,

I'm fine with this tradeoff. As was stated by someone else, this relieves load on many ovn-controllers at the cost of slightly more load on ovn-northd. As you've noted, there are potential additional optimizations we could make in ovn-northd to offset this hit as well.


I'm kind of surprised at that degradation because your change doesn't
introduce any additional looping. This is purely due to the cost of
ovn_lflow_add_with_hint()?

Well, yes, that and building the match.  I commented it out and the hit
is gone.


still think that not merging the flow matches is the right thing to do
because the merged flows have a very substantial impact on all
ovn-controllers in the cluster, already visible in our tests with
smaller, 20-node clusters.

A point here is that I think it's fine if
build_lswitch_rport_arp_req_flow_for_unreachable_ip() installs flows at
a priority lower than the one used for reachable IPs, e.g., 79.  That
would allow us to avoid the !lrouter_port_ipv*_reachable() condition
when installing these flows.

I don't get this point?

In my opinion we don't need to do:

if (port_reachable(op)) {
     add_reachable_flows(prio=80);
} else {
     add_unreachable_flows(prio=90);
}

We can probably just do:

if (port_reachable(op)) {
     add_reachable_flows(prio=80);
}
add_unreachable_flows(prio=79);

The idea being that a packet is either for a "reachable IP", in which
case we hit the prio 80 flow, or it's not, in which case we hit the
prio 79 flow.

The advantage is that we no longer have the port dependency for
add_unreachable_flows(), which would allow us to prebuild all the
possible matches/actions once and just reuse them for all logical
switches where they need to be added.
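
To make that concrete, here's a rough C sketch of what I have in mind
(hypothetical names, not actual northd code; add_unreachable_lflow()
stands in for the northd lflow helpers).  The match/action strings are
built once per VIP and reused for every connected logical switch, so the
resulting logical flows are identical and can share a datapath group:

#include "openvswitch/dynamic-string.h"

/* Hypothetical stand-in for the northd lflow helpers. */
void add_unreachable_lflow(const char *ls_name, int priority,
                           const char *match, const char *actions);

static void
build_unreachable_vip_flows(const char *vip, const char **switch_names,
                            size_t n_switches)
{
    struct ds match = DS_EMPTY_INITIALIZER;

    /* Built once per VIP; no dependency on any router port state. */
    ds_put_format(&match, "arp.op == 1 && arp.tpa == %s", vip);

    for (size_t i = 0; i < n_switches; i++) {
        /* Same match/action for every switch, so the logical flows
         * can collapse into a single datapath-grouped row. */
        add_unreachable_lflow(switch_names[i], 79, ds_cstr(&match),
                              "outport = \"_MC_flood\"; output;");
    }
    ds_destroy(&match);
}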

Thanks,
Dumitru


As a follow-up optimization we can take the same approach as in
0fd39c6cf0ba "northd: move build_lb_rules in build_lswitch_flows_for_lb"
and precompute the matches once for all load balancer VIPs, and just use
them when installing the flows on the connected logical switch datapaths.

Thoughts?

Hi Dumitru,

Hi Numan,


Thanks for addressing this issue.

I looked into the issue and in my opinion we can also solve this
problem using address sets.

This is how I'd expect the lflow generated by ovn-northd to look:

  -  table=22(ls_in_l2_lkup      ), priority=90   , match=(flags[1] ==
0 && arp.op == 1 && arp.tpa == ${<LS_NAME>_V4_UNREACHABLE_IPS})
      action=(outport = "_MC_flood"; output;)

As far as I can tell, the problem with this approach is that it will
not allow the logical flows to take advantage of datapath groups.

You're right.  I missed out on the datapath group angle.

Thanks
Numan


For example, for:

LS1 ------+
          |
LS2 ----- LR
          |
LS3 ------+

If LR has a load balancer with VIPs V1, V2, V3, V4, V5.

And the unreachable VIPs for each LRP are:

LRP1: V1, V2, V3         (AS: LS1_UNREACH_IPS)
LRP2: V1, V2, V3, V4     (AS: LS2_UNREACH_IPS)
LRP3: V1, V2, V3, V4, V5 (AS: LS3_UNREACH_IPS)

Then the lflows would be:

LF1: arp.op == 1 && arp.tpa == $LS1_UNREACH_IPS
LF2: arp.op == 1 && arp.tpa == $LS2_UNREACH_IPS
LF3: arp.op == 1 && arp.tpa == $LS3_UNREACH_IPS

Which cannot be grouped into a single datapath group because they don't
have the same match.

Whereas with the patch I proposed:

LF1: arp.op == 1 && arp.tpa == V1, applied on DPG(LS1, LS2, LS3)
LF2: arp.op == 1 && arp.tpa == V2, applied on DPG(LS1, LS2, LS3)
LF3: arp.op == 1 && arp.tpa == V3, applied on DPG(LS1, LS2, LS3)
LF4: arp.op == 1 && arp.tpa == V4, applied on DPG(LS1, LS2, LS3)
LF5: arp.op == 1 && arp.tpa == V5, applied on DPG(LS1, LS2, LS3)

The problem is that at high scale the address sets will grow large too;
I think that's worse than what happens currently with the single
logical flow:

arp.op == 1 && arp.tpa == {V1, V2, V3, V4, V5}


And ovn-northd would create an address set for each logical switch and
store the unreachable v4 IPs in it (and the same for v6).
It should be straightforward to generate an address set name for each
logical switch.

Presently the function build_lswitch_rport_arp_req_flows() builds an
sset of ips_v4_match_reachable.  This part of the code can be
separated out and used to sync the address sets in the SB DB.
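
Just to illustrate, a rough sketch of such a sync (assuming the usual
IDL-generated setters for the SB Address_Set table; the helper name and
the per-switch naming scheme below are only examples):

#include "lib/ovn-sb-idl.h"
#include "lib/sset.h"
#include "openvswitch/util.h"

/* Sketch: publish the per-logical-switch set of unreachable v4 IPs as
 * an SB address set.  Incremental updates and reuse of an already
 * existing row are omitted for brevity. */
static void
sync_unreachable_ips_address_set(struct ovsdb_idl_txn *ovnsb_txn,
                                 const char *ls_name,
                                 const struct sset *unreachable_ips_v4)
{
    char *as_name = xasprintf("%s_V4_UNREACHABLE_IPS", ls_name);
    const char **addrs = sset_array(unreachable_ips_v4);

    struct sbrec_address_set *as = sbrec_address_set_insert(ovnsb_txn);
    sbrec_address_set_set_name(as, as_name);
    sbrec_address_set_set_addresses(as, addrs,
                                    sset_count(unreachable_ips_v4));

    free(addrs);
    free(as_name);
}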

When an unreachable IP is added to or deleted from the sset,
ovn-controller can incrementally handle the change to the address set,
and this should not result in openflow cookie updates.

IMHO we should make use of address sets here.

Thoughts?  Any concerns you see w.r.t. performance with this approach?

If I'm not wrong, ovn-controller would probably behave similarly if we
use address sets, but I think the cost on the SB contents is larger.

Also, we'd have to reserve those address set names somehow so that users
can't use them, which is hard to do in ovn-northd because we don't
parse ACL match contents.


Thanks
Numan

Thanks,
Dumitru
