On 8/31/22 16:59, Han Zhou wrote:
> On Wed, Aug 31, 2022 at 1:38 AM Dumitru Ceara <[email protected]> wrote:
>>
>> On 8/31/22 02:17, Han Zhou wrote:
>>> On Tue, Aug 30, 2022 at 8:18 AM Dumitru Ceara <[email protected]> wrote:
>>>>
>>>> Hi Han,
>>>>
>>>> On 8/24/22 08:40, Han Zhou wrote:
>>>>> The ls_in_pre_stateful priority-120 flow that saves dst IP and port to
>>>>> registers is causing a critical dataplane performance impact for
>>>>> short-lived connections, because it unwildcards megaflows with an
>>>>> exact match on dst IP and L4 ports. Any new connection with a
>>>>> different client-side L4 port will encounter a datapath flow miss and
>>>>> upcall to ovs-vswitchd, which makes typical use cases such as
>>>>> HTTP/1.0 based RESTful API calls suffer big performance degradations.
>>>>>
>>>>> These fields (dst IP and port) were saved to registers to solve a
>>>>> problem in the LB hairpin use case when different VIPs share
>>>>> overlapping backend+port combinations [0]. The change [0] might not
>>>>> have had as wide a performance impact as it does now because at that
>>>>> time one of the match conditions, "REGBIT_CONNTRACK_NAT == 1", was
>>>>> set only for established and natted traffic, while now the impact is
>>>>> more obvious because REGBIT_CONNTRACK_NAT is set for all IP traffic
>>>>> (if any VIP is configured on the LS) since commit [1], after several
>>>>> other indirectly related optimizations and refactors.
>>>>>
>>>>> This patch fixes the problem by modifying the priority-120 flows in
>>>>> ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for
>>>>> any traffic with REGBIT_CONNTRACK_NAT == 1, we now save dst IP and L4
>>>>> port only for traffic matching the LB VIPs, because these are the
>>>>> ones that need to be saved for the hairpin purpose. The existing
>>>>> priority-110 flows will match the rest of the traffic just like
>>>>> before but will not save dst IP and L4 port, so any server->client
>>>>> traffic will not unwildcard megaflows with client-side L4 ports.
>>>>>
>>>>
>>>> Just to be 100% sure, the client->server traffic will still generate up
>>>> to V x H flows where V is "the number of VIP:port tuples" and H is "the
>>>> number of potential (dp_)hash values", right?
>>>
>>> Yes, if all the VIPs and backends are being accessed at the same time.
>>> This is expected. For any given client<->server pair, regardless of the
>>> connections (source port number changes), the packets should match the
>>> same megaflow and be handled by the fastpath (instead of going to
>>> userspace, which is the behavior before this patch).
>>>
>>
>> Thanks for the confirmation!
>>
>>>>
>>>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with
>>>>>     shared backends.")
>>>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")
>>>>>
>>>>> Signed-off-by: Han Zhou <[email protected]>
>>>>> ---
>>>>> v1 -> v2: Add the missing changes for ovn-northd.8.xml which I forgot
>>>>>           to commit before sending v1.
>>>>>
>>>>>  northd/northd.c         | 125 +++++++++++++++++++++++++---------------
>>>>>  northd/ovn-northd.8.xml |  14 ++---
>>>>>  tests/ovn-northd.at     |  87 ++++++++++------------------
>>>>>  tests/ovn.at            |  14 ++---
>>>>>  4 files changed, 118 insertions(+), 122 deletions(-)
>>>>>
>>>>> diff --git a/northd/northd.c b/northd/northd.c
>>>>> index 7e2681865..860641936 100644
>>>>> --- a/northd/northd.c
>>>>> +++ b/northd/northd.c
>>>>> @@ -273,15 +273,15 @@ enum ovn_stage {
>>>>>   * |    | REGBIT_ACL_HINT_{ALLOW_NEW/ALLOW/DROP/BLOCK} |   |                  |
>>>>>   * |    | REGBIT_ACL_LABEL                             | X |                  |
>>>>>   * +----+----------------------------------------------+ X |                  |
>>>>> - * | R1 | ORIG_DIP_IPV4 (>= IN_STATEFUL)               | R |                  |
>>>>> + * | R1 | ORIG_DIP_IPV4 (>= IN_PRE_STATEFUL)           | R |                  |
>>>>>   * +----+----------------------------------------------+ E |                  |
>>>>> - * | R2 | ORIG_TP_DPORT (>= IN_STATEFUL)               | G |                  |
>>>>> + * | R2 | ORIG_TP_DPORT (>= IN_PRE_STATEFUL)           | G |                  |
>>>>>   * +----+----------------------------------------------+ 0 |                  |
>>>>>   * | R3 | ACL LABEL                                    |   |                  |
>>>>>   * +----+----------------------------------------------+---+------------------+
>>>>>   * | R4 | UNUSED                                       |   |                  |
>>>>> - * +----+----------------------------------------------+ X | ORIG_DIP_IPV6    |
>>>>> - * | R5 | UNUSED                                       | X | (>= IN_STATEFUL) |
>>>>> + * +----+----------------------------------------------+ X | ORIG_DIP_IPV6(>= |
>>>>> + * | R5 | UNUSED                                       | X | IN_PRE_STATEFUL) |
>>>>>   * +----+----------------------------------------------+ R |                  |
>>>>>   * | R6 | UNUSED                                       | E |                  |
>>>>>   * +----+----------------------------------------------+ G |                  |
>>>>> @@ -5899,43 +5899,17 @@ build_pre_stateful(struct ovn_datapath *od,
>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 0, "1", "next;");
>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 0, "1", "next;");
>>>>>
>>>>> -    const char *ct_lb_action = features->ct_no_masked_label
>>>>> -                               ? "ct_lb_mark"
>>>>> -                               : "ct_lb";
>>>>> -    const char *lb_protocols[] = {"tcp", "udp", "sctp"};
>>>>> -    struct ds actions = DS_EMPTY_INITIALIZER;
>>>>> -    struct ds match = DS_EMPTY_INITIALIZER;
>>>>> -
>>>>> -    for (size_t i = 0; i < ARRAY_SIZE(lb_protocols); i++) {
>>>>> -        ds_clear(&match);
>>>>> -        ds_clear(&actions);
>>>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip4 && %s",
>>>>> -                      lb_protocols[i]);
>>>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV4 " = ip4.dst; "
>>>>> -                      REG_ORIG_TP_DPORT " = %s.dst; %s;",
>>>>> -                      lb_protocols[i], ct_lb_action);
>>>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>>>> -                      ds_cstr(&match), ds_cstr(&actions));
>>>>> -
>>>>> -        ds_clear(&match);
>>>>> -        ds_clear(&actions);
>>>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip6 && %s",
>>>>> -                      lb_protocols[i]);
>>>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV6 " = ip6.dst; "
>>>>> -                      REG_ORIG_TP_DPORT " = %s.dst; %s;",
>>>>> -                      lb_protocols[i], ct_lb_action);
>>>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>>>> -                      ds_cstr(&match), ds_cstr(&actions));
>>>>> -    }
>>>>> +    /* Note: priority-120 flows are added in build_lb_rules_pre_stateful(). */
>>>>>
>>>>> -    ds_clear(&actions);
>>>>> -    ds_put_format(&actions, "%s;", ct_lb_action);
>>>>> +    const char *ct_lb_action = features->ct_no_masked_label
"ct_lb_mark;" >>>>> + : "ct_lb;"; >>>>> >>>>> ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 110, >>>>> - REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions)); >>>>> + REGBIT_CONNTRACK_NAT" == 1", ct_lb_action); >>>>> >>>>> ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 110, >>>>> - REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions)); >>>>> + REGBIT_CONNTRACK_NAT" == 1", ct_lb_action); >>>>> >>>>> /* If REGBIT_CONNTRACK_DEFRAG is set as 1, then the packets > should >>> be >>>>> * sent to conntrack for tracking and defragmentation. */ >>>>> @@ -5945,8 +5919,6 @@ build_pre_stateful(struct ovn_datapath *od, >>>>> ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 100, >>>>> REGBIT_CONNTRACK_DEFRAG" == 1", "ct_next;"); >>>>> >>>>> - ds_destroy(&actions); >>>>> - ds_destroy(&match); >>>>> } >>>>> >>>>> static void >>>>> @@ -6841,22 +6813,16 @@ build_qos(struct ovn_datapath *od, struct hmap >>> *lflows) { >>>>> } >>>>> >>>>> static void >>>>> -build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool >>> ct_lb_mark, >>>>> - struct ds *match, struct ds *action, >>>>> - const struct shash *meter_groups) >>>>> +build_lb_rules_pre_stateful(struct hmap *lflows, struct ovn_northd_lb >>> *lb, >>>>> + bool ct_lb_mark, struct ds *match, >>>>> + struct ds *action) >>>>> { >>>>> for (size_t i = 0; i < lb->n_vips; i++) { >>>>> struct ovn_lb_vip *lb_vip = &lb->vips[i]; >>>>> - struct ovn_northd_lb_vip *lb_vip_nb = &lb->vips_nb[i]; >>>>> - const char *ip_match = NULL; >>>>> - >>>>> ds_clear(action); >>>>> ds_clear(match); >>>>> + const char *ip_match = NULL; >>>>> >>>>> - /* Make sure that we clear the REGBIT_CONNTRACK_COMMIT flag. >>> Otherwise >>>>> - * the load balanced packet will be committed again in >>>>> - * S_SWITCH_IN_STATEFUL. */ >>>>> - ds_put_format(action, REGBIT_CONNTRACK_COMMIT" = 0; "); >>>>> /* Store the original destination IP to be used when > generating >>>>> * hairpin flows. >>>>> */ >>>>> @@ -6887,6 +6853,67 @@ build_lb_rules(struct hmap *lflows, struct >>> ovn_northd_lb *lb, bool ct_lb_mark, >>>>> ds_put_format(action, REG_ORIG_TP_DPORT " = %"PRIu16"; ", >>>>> lb_vip->vip_port); >>>>> } >>>>> + ds_put_format(action, "%s;", ct_lb_mark ? "ct_lb_mark" : >>> "ct_lb"); >>>>> + >>>>> + ds_put_format(match, "%s.dst == %s", ip_match, >>> lb_vip->vip_str); >>>>> + if (lb_vip->vip_port) { >>>>> + ds_put_format(match, " && %s.dst == %d", proto, >>> lb_vip->vip_port); >>>>> + } >>>>> + >>>>> + struct ovn_lflow *lflow_ref = NULL; >>>>> + uint32_t hash = ovn_logical_flow_hash( >>>>> + ovn_stage_get_table(S_SWITCH_IN_PRE_STATEFUL), >>>>> + ovn_stage_get_pipeline(S_SWITCH_IN_PRE_STATEFUL), > 120, >>>>> + ds_cstr(match), ds_cstr(action)); >>>>> + >>>>> + for (size_t j = 0; j < lb->n_nb_ls; j++) { >>>>> + struct ovn_datapath *od = lb->nb_ls[j]; >>>>> + >>>>> + if (!ovn_dp_group_add_with_reference(lflow_ref, od)) { >>>>> + lflow_ref = ovn_lflow_add_at_with_hash( >>>>> + lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120, >>>>> + ds_cstr(match), ds_cstr(action), >>>>> + NULL, NULL, &lb->nlb->header_, >>>>> + OVS_SOURCE_LOCATOR, hash); >>>>> + } >>>>> + } >>>>> + } >>>> >>>> I see how this can fix the dataplane issue because we're limiting the >>>> number of dp flows but this now induces a measurable penalty on the >>>> control plane side because we essentially double the number of logical >>>> flows that ovn-northd generates for load balancer VIPs >>>> >>>> Using a NB database [2] from a 250 node density heavy test [3] we ran >>>> in-house I see the following: >>>> >>>> a. 
>>>> Using a NB database [2] from a 250-node density-heavy test [3] we ran
>>>> in-house I see the following:
>>>>
>>>> a. before this patch (main):
>>>>    - number of SB lflows: 146233
>>>>    - SB size on disk (compacted): 70MB
>>>>    - northd poll loop intervals: 8525ms
>>>>
>>>> b. with this patch:
>>>>    - number of SB lflows: 163303
>>>>    - SB size on disk (compacted): 76MB
>>>>    - northd poll loop intervals: 9958ms
>>>>
>>>> [2] https://people.redhat.com/~dceara/250-density-heavy/20220830/ovnnb_db.db.gz
>>>> [3] https://github.com/dceara/ovn-heater/blob/main/test-scenarios/ocp-250-density-heavy.yml
>>>>
>>>> This raises some concerns. However, I went ahead and also ran a test
>>>> with Ilya's load balancer optimization series [4] (up for review) and I
>>>> got:
>>>>
>>>> c. with this patch + Ilya's series:
>>>>    - number of SB lflows: 163303          << Didn't change compared to "b"
>>>>    - SB size on disk (compacted): 76MB    << Didn't change compared to "b"
>>>>    - northd poll loop intervals: 6437ms   << 25% better than "a"
>>>>
>>>> Then I ran a test with main + Ilya's series.
>>>>
>>>> d. main + Ilya's series:
>>>>    - number of SB lflows: 146233          << Didn't change compared to "a"
>>>>    - SB size on disk (compacted): 70MB    << Didn't change compared to "a"
>>>>    - northd poll loop intervals: 5413ms   << 35% better than "a"
>>>>
>>>> [4] https://patchwork.ozlabs.org/project/ovn/list/?series=315213&state=*
>>>>
>>>> I guess what I'm trying to say is that if we go ahead with the approach
>>>> you proposed, and especially if we want to backport it to the LTS, we
>>>> should consider accepting/backporting other optimizations too (such as
>>>> Ilya's) that would compensate (and more) for the control plane
>>>> performance hit.
>>>>
>>> Thanks for the scale test!
>>> I do expect some cost in the control plane, and your test result is
>>> aligned with my expectation. It is unpleasant to see a ~10% SB size
>>> increase (in the density-heavy and LB/service-heavy test scenario), but
>>> if it is necessary to fix the dataplane performance bug I think it is
>>> worth it. After all, everything the control plane does is to serve the
>>> dataplane, right? The resulting dataplane performance boost is at least
>>> 10x for short-lived connection scenarios.
>>> It's great to see that Ilya's patches help the LB scale, and I have no
>>> objection to backporting them if that is considered necessary for stable
>>> branches, but I'd consider it independent of the current patch.
>>>
>>
>> They are independent. But the problem is your fix will visibly impact
>> control plane latency (~16% on my setup). So, if we can avoid some of
>> that hit by considering both patchsets together, I'd vote for that.
>>
> Again, no objection to backporting Ilya's patches, but these are the
> criteria in my mind:
>
> 1. This patch improves dataplane performance (to me it is in fact fixing
> a performance bug) at the cost of the control plane. This is not uncommon
> in the OVN project's history - adding a new feature sometimes adds
> control plane cost. When that happens, whether to accept the new feature
> is based on the value of the feature vs. the cost to the control plane -
> a tradeoff has to be made in many cases. In this case, for a 16% control
> plane latency increase we get a 10x dataplane speedup, so I would vote
> for it.
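As a side note for anyone trying to reproduce metrics like the above, this
is roughly how they can be gathered; paths and the stopwatch name vary by
deployment and version, so treat this as a sketch:

    # SB logical flow count ("list" prints one _uuid line per record):
    ovn-sbctl --no-leader-only list Logical_Flow | grep -c '^_uuid'
    # On-disk SB size after an explicit compaction (ctl/db paths are
    # deployment-specific):
    ovn-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb-server/compact OVN_Southbound
    ls -lh /var/lib/ovn/ovnsb_db.db
    # northd loop timings ("ovnnb_db_run" is an assumed stopwatch name;
    # run "stopwatch/show" without arguments to list what is available):
    ovn-appctl -t ovn-northd stopwatch/show ovnnb_db_run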
Hi Han,

I'm not arguing that we don't get a dataplane speedup. I'm sure we will,
and it should be a decent amount. But I've been trying to replicate this
10x boost with an ovn-kubernetes kind deployment and I didn't really
manage to. I did this on a 28-core Intel(R) Xeon(R) CPU E5-2690 v4 @
2.60GHz in our lab.

a. build an ovnkube image (with and without your patch); essentially what
   our ovn-kubernetes CI jobs do:
   https://github.com/ovn-org/ovn/blob/main/.github/workflows/ovn-kubernetes.yml#L39
b. start a kind cluster with the image we built at step "a".
c. edit the ovs-node daemonset definition and increase the resource limits
   to 16 CPUs and 2GB RSS (the kind defaults are very low: 0.2 CPU and
   300MB). This restarts the ovs-daemon pods; wait for those to come back
   up.
d. start an nginx deployment with 50 replicas: this is essentially 50
   backends accepting http requests on port 80.
e. create a service load balancing http requests received on a port, e.g.
   42424, to port 80 on the pods in the nginx deployment above.
f. create a clients deployment with 100 replicas: 100 pods running the
   following bash script, a hack to simulate 40 new connections with
   different ephemeral source ports every 2 seconds:

   i=0
   while true; do
       i=$((i+1))
       [ $((i%40)) -eq 0 ] && sleep 2
       nc -z -v $SERVICE_VIP 42424 &
   done

Comparing the runs with and without your patch applied I see improvement:

1. without this patch applied:
   - 17K, 18K, 27K datapath flows on the 3 nodes of the cluster
   - OVS cpu usage on the 3 nodes: 30-50%

2. with this patch applied:
   - 9K, 9.5K, 9K datapath flows on the 3 nodes of the cluster
   - OVS cpu usage on the 3 nodes: 10-20%

In both cases the whole system is not overloaded; OVS doesn't reach its
CPU limits.

I see a clear improvement but I was expecting to see more than this. It
does look like my test doesn't generate enough unique sources per second,
so I was wondering if you could share your test methodology?

For the OVN code changes, they look good to me. I think we should apply
them to the main branch. For that:

Acked-by: Dumitru Ceara <[email protected]>

For backporting the change to stable branches I'd really like to first
understand the exact scenario in which the megaflow unwildcarding created
such a big performance regression in ovn-kubernetes.

Thanks,
Dumitru
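P.S.: in case it helps others reproduce this, a variant of the client
script that should churn through source ports faster (an untested sketch;
SERVICE_VIP and port 42424 are the same assumptions as in step "f" above):

   #!/bin/bash
   # 8 workers, each firing batches of 100 parallel connection attempts
   # with no sleep; every nc invocation is a new TCP connection and
   # therefore a new ephemeral source port.
   for w in $(seq 8); do
       while true; do
           for i in $(seq 100); do
               nc -z -w 1 "$SERVICE_VIP" 42424 &
           done
           wait    # let this batch finish before starting the next
       done &
   done
   wait

Watching "ovs-appctl dpctl/dump-flows | wc -l" on a node while this runs
should show whether megaflows are still being unwildcarded per source port.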
>
> 2. For backporting Ilya's optimization, I'd evaluate it using the
> criteria for any optimization backporting. Say the patch has a 10-20%
> speed up, and we think that speed up is critical for a branch for some
> reason, and the changes are well contained - then I'd vote for the
> backport.
>
> Does this make sense?
> @Numan Siddique <[email protected]> @Mark Michelson <[email protected]> any
> comments?
>
> Thanks,
> Han
>
>> Given that Ilya's changes seem very contained I don't think this should
>> be a problem. They just need a proper review first.
>>
>>>> I tried to think of an alternative that wouldn't require doubling the
>>>> number of VIP logical flows but couldn't come up with one. Maybe you
>>>> have more ideas?
>>>>
>>> This is the approach with the least impact on and fewest changes to the
>>> existing pipelines that I have thought of so far. The earlier attempt
>>> wouldn't increase the number of flows, but it had to compromise by
>>> either failing to support a corner case (that we are still not sure is
>>> really useful at all) or losing HW offloading capability. I'll still
>>> think about some more approaches, but the NAT/LB/ACL pipelines are very
>>> complex now and I'd avoid big changes in case any corner case gets
>>> broken. I'd spend more time sorting out the different use cases and the
>>> documentation before making more optimizations, but I do think we
>>> should fix this performance bug as soon as we can. (I am also curious
>>> why we haven't heard other OVN users complain about this yet ...)
>>
>> It's a guess, but we probably didn't notice this issue because it should
>> be more visible when there are lots of short-lived connections.
>>
>> Thanks,
>> Dumitru
>>

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
