On Thu, Jul 29, 2021 at 1:14 PM Dumitru Ceara <[email protected]> wrote:
>
> On 7/29/21 6:53 PM, Numan Siddique wrote:
> > On Thu, Jul 29, 2021 at 12:31 PM Dumitru Ceara <[email protected]> wrote:
> >>
> >> On 7/29/21 5:42 PM, Mark Gray wrote:
> >>> On 29/07/2021 14:20, Dumitru Ceara wrote:
> >>>> On 7/29/21 11:59 AM, Dumitru Ceara wrote:
> >>>>> Logical_Flows are immutable, that is, when the match or action changes,
> >>>>> ovn-northd actually removes the old logical flow and adds a new one.
> >>>>>
> >>>>> Commit ecc941c70f16 added flows to flood ARPs to routers for unreachable
> >>>>> addresses.
> >>>>>
> >>>>> In the following topology (2 switches connected to a router with a load
> >>>>> balancer attached to it):
> >>>>>
> >>>>> S1 --- (IP1/MASK1) R (IP2/MASK2) --- S2
> >>>>>
> >>>>> LB1 (applied on R) with N VIPs.
> >>>>>
> >>>>> The following logical flows were generated:
> >>>>>
> >>>>> S1 (in_l2_lkup): if "arp.op == 1 && arp.tpa == { VIP1, .., VIPn, IP1 }"
> >>>>> then ...
> >>>>> S2 (in_l2_lkup): if "arp.op == 1 && arp.tpa == { VIP1, .., VIPn, IP2 }"
> >>>>> then ...
> >>>>>
> >>>>> While this seemed a good idea initially because it would reduce the
> >>>>> total number of logical flows, this approach has a few major downsides
> >>>>> that severely affect scalability:
> >>>>>
> >>>>> 1. When logical datapath grouping is enabled (which is how high scale
> >>>>> deployments should use OVN) the flows for reachable/unreachable
> >>>>> router owned IPs cannot be grouped together across the datapaths of
> >>>>> the logical switches they are generated for, because each switch's
> >>>>> match contains a different router IP.
> >>>>> 2. Whenever a VIP (VIPn+1) is added to LB1 all these flows will be
> >>>>> removed and new ones will be added with the only difference being the
> >>>>> VIPn+1 added to the flow match. As this is a new logical flow record,
> >>>>> ovn-controller will have to remove all previously installed openflows
> >>>>> (for the deleted logical flow) and add new ones. However, for most
> >>>>> of these openflows, the only change is the cookie value (the first 32
> >>>>> bits of the logical flow UUID). At scale this unnecessary openflow
> >>>>> churn will induce significant latency. Moreover, this also creates
> >>>>> unnecessary jsonrpc traffic from ovn-northd towards OVN_Southbound
> >>>>> because the old flow needs to be removed and a new one (with a large
> >>>>> match field) needs to be installed.
> >>>>>
> >>>>> To avoid this, we now install individual flows, one per IP, for
> >>>>> managing ARP/ND traffic from logical switches towards router owned
> >>>>> IPs that are reachable/unreachable on the logical port connecting
> >>>>> the switch.
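> >>>>>
> >>>>> With this change, for the topology above, the generated flows would
> >>>>> (roughly) look like:
> >>>>>
> >>>>> S1 (in_l2_lkup): if "arp.op == 1 && arp.tpa == IP1" then ...
> >>>>> S1 (in_l2_lkup): if "arp.op == 1 && arp.tpa == VIP1" then ...
> >>>>> ..
> >>>>> S1 (in_l2_lkup): if "arp.op == 1 && arp.tpa == VIPn" then ...
> >>>>> S2 (in_l2_lkup): if "arp.op == 1 && arp.tpa == IP2" then ...
> >>>>> S2 (in_l2_lkup): if "arp.op == 1 && arp.tpa == VIP1" then ...
> >>>>> ..
> >>>>>
> >>>>> This way the identical per-VIP flows on S1 and S2 can share a single
> >>>>> logical flow record via datapath groups, and adding VIPn+1 becomes a
> >>>>> pure insert instead of a delete + re-add of a large flow.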
> >>>>>
> >>>>> Fixes: ecc941c70f16 ("northd: Flood ARPs to routers for "unreachable"
> >>>>> addresses.")
> >>>>> Signed-off-by: Dumitru Ceara <[email protected]>
> >>>>> ---
> >>>>> On a deployment based on an ovn-kubernetes cluster with 20 nodes, 2500
> >>>>> load balancer VIPs, 5000 pods, we:
> >>>>> - measure number of logical flows in the SB DB, size of the SB DB
> >>>>> - then create an additional 250 pods (logical ports), bind them, and add
> >>>>> a load balancer VIP (with a single backend) for each of the additional
> >>>>> pods, and measure how long it takes for all openflows corresponding to
> >>>>> each of the pods to be installed in OVS.
> >>>>>
> >>>>> Results:
> >>>>> - without fix:
> >>>>> ------------
> >>>>> SB lflows: 67976
> >>>>> SB size (uncompacted): 46 MiB
> >>>>> SB size (compacted): 40 MiB
> >>>>>
> >>>>> after 250 ports + VIPs added:
> >>>>> SB lflows: 71479
> >>>>> SB size (uncompacted): 248 MiB
> >>>>> SB size (compacted): 42 MiB
> >>>>> Max time to install all relevant openflows for a single pod: ~6sec
> >>>>>
> >>>>> - with fix:
> >>>>> ---------
> >>>>> SB lflows: 70399
> >>>>> SB size (uncompacted): 46 MiB
> >>>>> SB size (compacted): 40 MiB
> >>>>>
> >>>>> after 250 ports + VIPs added:
> >>>>> SB lflows: 74149
> >>>>> SB size (uncompacted): 165 MiB
> >>>>> SB size (compacted): 42 MiB
> >>>>> Max time to install all relevant openflows for a single pod: ~100msec
> >>>>
> >>>>
> >>>
> >>> I think you should put these^ (or a summary of these - including maybe
> >>> the northd degradation) into the commit message.
> >>>
> >>
> >> Sure, I'll do that in the next revision.
> >>
> >>>> CC: Mark Michelson (original author of ecc941c70f16 ("northd: Flood ARPs
> >>>> to routers for "unreachable" addresses."))
> >>>>
> >>>> There's another note I should mention here:
> >>>>
> >>>> Since we now call build_lswitch_rport_arp_req_flow_for_unreachable_ip()
> >>>> for every load balancer VIP (instead of once for a list of VIPs) load on
> >>>> ovn-northd will be increased as it has to create more logical flows that
> >>>> will eventually be grouped together on the same datapath group.
> >>>>
> >>>> When testing at higher scales (120 node setups with topologies similar
> >>>> to the one above but with 45K VIPs) I see ovn-northd poll loop
> >>>> intervals taking ~3s longer on my machine (from ~10s to ~13s). I do,
> >>>> however,
> >>>
> >>> I'm kind of surprised at that degradation because your change doesn't
> >>> introduce any additional looping. This is purely due to the cost of
> >>> ovn_lflow_add_with_hint()?
> >>
> >> Well, yes, that and building the match. I commented it out and the hit
> >> is gone.
> >>
> >>>
> >>>> still think that not merging the flow matches is a good thing because it
> >>>> has a very substantial impact on all ovn-controllers in the cluster,
> >>>> already visible in our tests with smaller, 20 node clusters.
> >>>>
> >>>> A point here is that I think it's fine if
> >>>> build_lswitch_rport_arp_req_flow_for_unreachable_ip() installs flows at
> >>>> a priority lower than the one used for reachable IPs, e.g., 79. That
> >>>> would allow us to avoid the !lrouter_port_ipv*_reachable() condition
> >>>> when installing these flows.
> >>>
> >>> I don't get this point?
> >>
> >> In my opinion we don't need to do:
> >>
> >>     if (port_reachable(op)) {
> >>         add_reachable_flows(prio=80);
> >>     } else {
> >>         add_unreachable_flows(prio=90);
> >>     }
> >>
> >> We can probably just do:
> >>
> >>     if (port_reachable(op)) {
> >>         add_reachable_flows(prio=80);
> >>     }
> >>     add_unreachable_flows(prio=79);
> >>
> >> The idea being that a packet is either for a "reachable IP" in which
> >> case we hit the prio 80 flow, or it's not, in which case we hit the prio
> >> 79 flow.
> >>
> >> The advantage is that we don't have the port dependency for
> >> add_unreachable_flows() anymore, which would allow us to prebuild all
> >> the possible matches/actions once and just reuse them for all logical
> >> switches where they need to be added.
> >>
> >> Thanks,
> >> Dumitru
> >>
> >>>>
> >>>> As a follow up optimization we can take the same approach as in
> >>>> 0fd39c6cf0ba "northd: move build_lb_rules in build_lswitch_flows_for_lb"
> >>>> and precompute the matches once for all load balancer VIPs, and just use
> >>>> them when installing the flows on the connected logical switch datapaths.
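> >>>>
> >>>> Something along these lines (hand-wavy sketch, not actual code):
> >>>>
> >>>>     /* Build each VIP match once per load balancer. */
> >>>>     for (i = 0; i < n_vips; i++) {
> >>>>         matches[i] = xasprintf("arp.op == 1 && arp.tpa == %s", vips[i]);
> >>>>     }
> >>>>     /* Reuse the prebuilt matches on every connected LS datapath. */
> >>>>     for each connected LS datapath 'od':
> >>>>         for (i = 0; i < n_vips; i++) {
> >>>>             ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, 79,
> >>>>                           matches[i], flood_action);
> >>>>         }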
> >>>>
> >>>> Thoughts?
> >
> > Hi Dumitru,
>
> Hi Numan,
>
> >
> > Thanks for addressing this issue.
> >
> > I looked into the issue and in my opinion we can also solve this
> > problem using address sets.
> >
> > This is how I'd expect an lflow generated by ovn-northd to look:
> >
> > - table=22(ls_in_l2_lkup ), priority=90 , match=(flags[1] ==
> > 0 && arp.op == 1 && arp.tpa == ${<LS_NAME>_V4_UNREACHABLE_IPS})
> > action=(outport = "_MC_flood"; output;)
>
> As far as I can tell the problem with this approach is that this will
> not allow the logical flows to take advantage of datapath groups.
You're right. I missed out on the datapath group angle.
Thanks
Numan
>
> For example, for:
>
> LS1 ------+
> |
> LS2 ----- LR
> |
> LS3 ------+
>
> If LR has a load balancer with VIPs V1, V2, V3, V4, V5.
>
> And the unreachable VIPs for each LRP are:
>
> LRP1: V1, V2, V3 (AS: LS1_UNREACH_IPS)
> LRP2: V1, V2, V3, V4 (AS: LS2_UNREACH_IPS)
> LRP3: V1, V2, V3, V4, V5 (AS: LS3_UNREACH_IPS)
>
> Then the lflows would be:
>
> LF1: arp.op == 1 && arp.tpa == $LS1_UNREACH_IPS
> LF2: arp.op == 1 && arp.tpa == $LS2_UNREACH_IPS
> LF3: arp.op == 1 && arp.tpa == $LS3_UNREACH_IPS
>
> Which cannot be grouped in a logical datapath because they don't have
> the same match.
>
> Whereas with the patch I proposed:
>
> LF1: arp.op == 1 && arp.tpa == V1, applied on DPG(LS1, LS2, LS3)
> LF2: arp.op == 1 && arp.tpa == V2, applied on DPG(LS1, LS2, LS3)
> LF3: arp.op == 1 && arp.tpa == V3, applied on DPG(LS1, LS2, LS3)
> LF4: arp.op == 1 && arp.tpa == V4, applied on DPG(LS1, LS2, LS3)
> LF5: arp.op == 1 && arp.tpa == V5, applied on DPG(LS1, LS2, LS3)
>
> The problem is that at high scale the address sets will grow larger too,
> I think, worse than what happens currently with the single logical flow:
>
> arp.op == 1 && arp.tpa == {V1, V2, V3, V4, V5}
>
> >
> > And ovn-northd would create an address set for each logical switch and
> > store the unreachable v4 IPs in it (and the same for v6).
> > It should be straightforward to generate an address set name for each
> > logical switch.
> >
> > Presently the function build_lswitch_rport_arp_req_flows() builds an
> > sset of reachable IPs (ips_v4_match_reachable). This part of the code
> > can be separated out and used to sync the address sets in the SB DB.
> >
> > When an unreachable ip is added/deleted to/from the sset,
> > ovn-controller can incrementally handle the change to the address set
> > and this should not result in the openflow cookie updates.
> >
> > IMHO we should make use of address sets here.
> >
> > Thoughts? Any concerns you see w.r.t. performance with this approach?
>
> If I'm not wrong, ovn-controller would probably behave similarly if we
> use address sets, but I think the cost is larger on the SB contents.
>
> Also, we'd have to reserve those address sets somehow to not allow users
> to use them, which is something hard to do in ovn-northd because we
> don't parse ACL match contents.
>
> >
> > Thanks
> > Numan
>
> Thanks,
> Dumitru
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev