On 7/29/21 1:49 PM, Numan Siddique wrote:
On Thu, Jul 29, 2021 at 1:14 PM Dumitru Ceara <[email protected]> wrote:
On 7/29/21 6:53 PM, Numan Siddique wrote:
On Thu, Jul 29, 2021 at 12:31 PM Dumitru Ceara <[email protected]> wrote:
On 7/29/21 5:42 PM, Mark Gray wrote:
On 29/07/2021 14:20, Dumitru Ceara wrote:
On 7/29/21 11:59 AM, Dumitru Ceara wrote:

Logical_Flows are immutable, that is, when the match or action changes
ovn-northd will actually remove the old logical flow and add a new one.

Commit ecc941c70f16 added flows to flood ARPs to routers for
unreachable addresses.  In the following topology (two switches
connected to a router with a load balancer attached to it):

    S1 --- (IP1/MASK1) R (IP2/MASK2) --- S2

    LB1 (applied on R) with N VIPs.

the following logical flows were generated:

    S1 (in_l2_lkup): if "arp.op == 1 && arp.tpa == { VIP1, .., VIPn, IP1 }" then ...
    S2 (in_l2_lkup): if "arp.op == 1 && arp.tpa == { VIP1, .., VIPn, IP2 }" then ...

While this seemed a good idea initially because it would reduce the
total number of logical flows, this approach has a few major downsides
that severely affect scalability:

1. When logical datapath grouping is enabled (which is how high scale
   deployments should use OVN) there is no chance that the flows for
   reachable/unreachable router owned IPs are grouped together for all
   datapaths that correspond to the logical switches they are generated
   for.

2. Whenever a VIP (VIPn+1) is added to LB1 all these flows will be
   removed and new ones will be added, with the only difference being
   the VIPn+1 added to the flow match.  As this is a new logical flow
   record, ovn-controller will have to remove all previously installed
   openflows (for the deleted logical flow) and add new ones.  However,
   for most of these openflows, the only change is the cookie value
   (the first 32 bits of the logical flow UUID).  At scale this
   unnecessary openflow churn will induce significant latency.
Moreover, this also creates unnecessary jsonrpc traffic from ovn-northd
towards OVN_Southbound because the old flow needs to be removed and a
new one (with a large match field) needs to be installed.

To avoid this, we now install individual flows, one per IP, for managing
ARP/ND traffic from logical switches towards router owned IPs that are
reachable/unreachable on the logical port connecting the switch.

Fixes: ecc941c70f16 ("northd: Flood ARPs to routers for "unreachable" addresses.")
Signed-off-by: Dumitru Ceara <[email protected]>
---
On a deployment based on an ovn-kubernetes cluster with 20 nodes, 2500
load balancer VIPs, and 5000 pods, we:
- measure the number of logical flows in the SB DB and the size of the
  SB DB,
- then create an additional 250 pods (logical ports), bind them, add a
  load balancer VIP (with a single backend) for each of the additional
  pods, and measure how long it takes for all openflows corresponding to
  each of the pods to be installed in OVS.

Results:

- without fix:
  ------------
  SB lflows: 67976
  SB size (uncompacted): 46 MiB
  SB size (compacted): 40 MiB

  after 250 ports + VIPs added:
  SB lflows: 71479
  SB size (uncompacted): 248 MiB
  SB size (compacted): 42 MiB

  Max time to install all relevant openflows for a single pod: ~6sec

- with fix:
  ---------
  SB lflows: 70399
  SB size (uncompacted): 46 MiB
  SB size (compacted): 40 MiB

  after 250 ports + VIPs added:
  SB lflows: 74149
  SB size (uncompacted): 165 MiB
  SB size (compacted): 42 MiB

  Max time to install all relevant openflows for a single pod: ~100msec

I think you should put these^ (or a summary of these, including maybe
the northd degradation) into the commit message.

Sure, I'll do that in the next revision.

CC: Mark Michelson (original author of ecc941c70f16 ("northd: Flood
ARPs to routers for "unreachable" addresses."))

There's another note I should mention here:

Since we now call build_lswitch_rport_arp_req_flow_for_unreachable_ip()
for every load balancer VIP (instead of once for a list of VIPs) load on
ovn-northd will be increased, as it has to create more logical flows
that will eventually be grouped together into the same datapath group.
When testing at higher scales (120-node setups with topologies similar
to the one above, but with 45K VIPs) I see ovn-northd poll loop
intervals taking ~3s longer on my machine (from ~10s to ~13s).  I do,
however,
I'm fine with this tradeoff. As was stated by someone else, this is relieving load on many ovn-controllers to instead slightly load ovn-northd more. As you've noted, there are potential additional optimizations we could make in ovn-northd to offset this hit as well.
I'm kind of surprised at that degradation because your change doesn't
introduce any additional looping.  This is purely due to the cost of
ovn_lflow_add_with_hint()?

Well, yes, and the building of the match.  I commented it out and the
hit is gone.

still think that not merging the flow matches is a good thing because
it has a very substantial impact on all ovn-controllers in the cluster,
already visible in our tests with smaller, 20-node clusters.

A point here is that I think it's fine if
build_lswitch_rport_arp_req_flow_for_unreachable_ip() installs flows at
a priority lower than the one used for reachable IPs, e.g., 79.  That
would allow us to avoid the !lrouter_port_ipv*_reachable() condition
when installing these flows.

I don't get this point?

In my opinion we don't need to do:

    if (port_reachable(op)) {
        add_reachable_flows(prio=80);
    } else {
        add_unreachable_flows(prio=90);
    }

We can probably just do:

    if (port_reachable(op)) {
        add_reachable_flows(prio=80);
    }
    add_unreachable_flows(prio=79);

The idea being that a packet is either for a "reachable IP", in which
case we hit the prio 80 flow, or it's not, in which case we hit the
prio 79 flow.  The advantage is that we don't have the port dependency
for add_unreachable_flows() anymore, which would allow us to prebuild
all the possible matches/actions once and just reuse them for all
logical switches where they need to be added.

Thanks,
Dumitru

As a follow-up optimization we can take the same approach as in
0fd39c6cf0ba ("northd: move build_lb_rules in build_lswitch_flows_for_lb")
and precompute the matches once for all load balancer VIPs, and just
use them when installing the flows on the connected logical switch
datapaths.

Thoughts?

Hi Dumitru,

Hi Numan,

Thanks for addressing this issue.  I looked into the issue and in my
opinion we can also solve this problem using address sets.
This is how I'd see an lflow generated by ovn-northd:

    table=22 (ls_in_l2_lkup), priority=90,
      match=(flags[1] == 0 && arp.op == 1 && arp.tpa == ${<LS_NAME>_V4_UNREACHABLE_IPS}),
      action=(outport = "_MC_flood"; output;)

As far as I can tell, the problem with this approach is that it will
not allow the logical flows to take advantage of datapath groups.

You're right.  I missed out on the datapath group angle.

Thanks
Numan

For example, for:

    LS1 ------+
              |
    LS2 ----- LR
              |
    LS3 ------+

if LR has a load balancer with VIPs V1, V2, V3, V4, V5, and the
unreachable VIPs for each LRP are:

    LRP1: V1, V2, V3         (AS: LS1_UNREACH_IPS)
    LRP2: V1, V2, V3, V4     (AS: LS2_UNREACH_IPS)
    LRP3: V1, V2, V3, V4, V5 (AS: LS3_UNREACH_IPS)

then the lflows would be:

    LF1: arp.op == 1 && arp.tpa == $LS1_UNREACH_IPS
    LF2: arp.op == 1 && arp.tpa == $LS2_UNREACH_IPS
    LF3: arp.op == 1 && arp.tpa == $LS3_UNREACH_IPS

These cannot be grouped in a logical datapath group because they don't
have the same match.  Whereas with the patch I proposed:

    LF1: arp.op == 1 && arp.tpa == V1, applied on DPG(LS1, LS2, LS3)
    LF2: arp.op == 1 && arp.tpa == V2, applied on DPG(LS1, LS2, LS3)
    LF3: arp.op == 1 && arp.tpa == V3, applied on DPG(LS1, LS2, LS3)
    LF4: arp.op == 1 && arp.tpa == V4, applied on DPG(LS1, LS2, LS3)
    LF5: arp.op == 1 && arp.tpa == V5, applied on DPG(LS1, LS2, LS3)

The problem is that at high scale the address sets will grow larger
too; worse, I think, than what happens currently with the single
logical flow:

    arp.op == 1 && arp.tpa == {V1, V2, V3, V4, V5}

And ovn-northd would create an address set for each logical switch and
store the unreachable v4 IPs in it (and the same for v6).  It should be
straightforward to generate an address set name for each logical
switch.  Presently the function build_lswitch_rport_arp_req_flows()
builds an sset of ips_v4_match_reachable.  This part of the code can be
separated out and used to sync the address sets in the SB DB.
When an unreachable IP is added to/deleted from the sset,
ovn-controller can incrementally handle the change to the address set,
and this should not result in the openflow cookie updates.

IMHO we should make use of address sets here.  Thoughts?  Any concerns
you see w.r.t. performance with this approach?

If I'm not wrong, ovn-controller would probably behave similarly if we
use address sets, but I think the cost is larger on the SB contents.
Also, we'd have to reserve those address sets somehow so users can't
use them, which is hard to do in ovn-northd because we don't parse ACL
match contents.

Thanks
Numan

Thanks,
Dumitru

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
