On Wed, Aug 31, 2022 at 1:38 AM Dumitru Ceara <[email protected]> wrote:
>
> On 8/31/22 02:17, Han Zhou wrote:
> > On Tue, Aug 30, 2022 at 8:18 AM Dumitru Ceara <[email protected]> wrote:
> >>
> >> Hi Han,
> >>
> >> On 8/24/22 08:40, Han Zhou wrote:
> >>> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
> >>> registers is causing a critical dataplane performance impact to
> >>> short-lived connections, because it unwildcards megaflows with exact
> >>> match on dst IP and L4 ports. Any new connection with a different
> >>> client side L4 port will encounter a datapath flow miss and upcall to
> >>> ovs-vswitchd, which makes typical use cases such as HTTP1.0 based
> >>> RESTful API calls suffer big performance degradations.
> >>>
> >>> These fields (dst IP and port) were saved to registers to solve a
> >>> problem in the LB hairpin use case when different VIPs are sharing
> >>> overlapping backend+port [0]. The change [0] might not have had as wide
> >>> a performance impact as it does now, because at that time one of the
> >>> match conditions, "REGBIT_CONNTRACK_NAT == 1", was set only for
> >>> established and natted traffic, while now the impact is more obvious
> >>> because REGBIT_CONNTRACK_NAT is set for all IP traffic (if any VIP is
> >>> configured on the LS) since commit [1], after several other indirectly
> >>> related optimizations and refactors.
> >>>
> >>> This patch fixes the problem by modifying the priority-120 flows in
> >>> ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for any
> >>> traffic with REGBIT_CONNTRACK_NAT == 1, we now save dst IP and L4
> >>> port only for traffic matching the LB VIPs, because these are the ones
> >>> that need to be saved for the hairpin purpose. The existing priority-110
> >>> flows will match the rest of the traffic just like before but will not
> >>> save dst IP and L4 port, so server->client traffic will not
> >>> unwildcard megaflows with client side L4 ports.
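To make the unwildcarding effect described above concrete, here is a rough sketch (illustrative Python with a toy megaflow-cache model and hypothetical addresses; not OVS code): when the megaflow exact-matches the L4 destination port, every server->client reply for a new connection carries a fresh client ephemeral port as its dst port, so every connection misses the cache and upcalls; with the port wildcarded, a single megaflow covers them all.

```python
# Toy model of the OVS megaflow cache (illustrative only, not OVS code).
# IP "10.0.0.2" and the port range are hypothetical.

def simulate(client_ports, unwildcard_port):
    """Count upcalls to userspace for server->client replies."""
    megaflow_cache = set()
    upcalls = 0
    for port in client_ports:
        # Reply packets have dst port == the client's ephemeral port.
        # If the megaflow exact-matches that port, each new connection
        # produces a distinct cache key; otherwise it is wildcarded.
        key = ("10.0.0.2", port if unwildcard_port else "*")
        if key not in megaflow_cache:
            upcalls += 1          # cache miss: packet goes to ovs-vswitchd
            megaflow_cache.add(key)
    return upcalls

ports = range(40000, 41000)       # 1000 short-lived connections
print(simulate(ports, unwildcard_port=True))   # one upcall per connection
print(simulate(ports, unwildcard_port=False))  # a single upcall total
```

The 1000x difference in upcalls is the dataplane cost the patch removes for non-VIP traffic.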
> >>>
> >>
> >> Just to be 100% sure, the client->server traffic will still generate up
> >> to V x H flows where V is "the number of VIP:port tuples" and H is "the
> >> number of potential (dp_)hash values", right?
> >
> > Yes, if all the VIPs and backends are being accessed at the same time. This
> > is expected. For any given pair of client<->server, regardless of the
> > connections (source port number changes), the packets should match the same
> > mega-flow and be handled by fastpath (instead of going to userspace, which
> > is the behavior before this patch).
> >
>
> Thanks for the confirmation!
>
> >>
> >>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with shared
> >>>     backends.")
> >>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")
> >>>
> >>> Signed-off-by: Han Zhou <[email protected]>
> >>> ---
> >>> v1 -> v2: Add the missing changes for ovn-northd.8.xml which I forgot
> >>>           to commit before sending v1.
> >>>
> >>>  northd/northd.c         | 125 +++++++++++++++++++++++++---------------
> >>>  northd/ovn-northd.8.xml |  14 ++---
> >>>  tests/ovn-northd.at     |  87 ++++++++++------------------
> >>>  tests/ovn.at            |  14 ++---
> >>>  4 files changed, 118 insertions(+), 122 deletions(-)
> >>>
> >>> diff --git a/northd/northd.c b/northd/northd.c
> >>> index 7e2681865..860641936 100644
> >>> --- a/northd/northd.c
> >>> +++ b/northd/northd.c
> >>> @@ -273,15 +273,15 @@ enum ovn_stage {
> >>>   * |    | REGBIT_ACL_HINT_{ALLOW_NEW/ALLOW/DROP/BLOCK} |   |          |
> >>>   * |    | REGBIT_ACL_LABEL                             | X |          |
> >>>   * +----+----------------------------------------------+ X |          |
> >>> - * | R1 | ORIG_DIP_IPV4 (>= IN_STATEFUL)               | R |          |
> >>> + * | R1 | ORIG_DIP_IPV4 (>= IN_PRE_STATEFUL)           | R |          |
> >>>   * +----+----------------------------------------------+ E |          |
> >>> - * | R2 | ORIG_TP_DPORT (>= IN_STATEFUL)               | G |          |
> >>> + * | R2 | ORIG_TP_DPORT (>= IN_PRE_STATEFUL)           | G |          |
> >>>   * +----+----------------------------------------------+ 0 |          |
> >>>   * | R3 | ACL LABEL                                    |   |          |
> >>>   * +----+----------------------------------------------+---+------------------+
> >>>   * | R4 | UNUSED                                       |   |                  |
> >>> - * +----+----------------------------------------------+ X | ORIG_DIP_IPV6    |
> >>> - * | R5 | UNUSED                                       | X | (>= IN_STATEFUL) |
> >>> + * +----+----------------------------------------------+ X | ORIG_DIP_IPV6(>= |
> >>> + * | R5 | UNUSED                                       | X | IN_PRE_STATEFUL) |
> >>>   * +----+----------------------------------------------+ R |                  |
> >>>   * | R6 | UNUSED                                       | E |                  |
> >>>   * +----+----------------------------------------------+ G |                  |
> >>> @@ -5899,43 +5899,17 @@ build_pre_stateful(struct ovn_datapath *od,
> >>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 0, "1", "next;");
> >>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 0, "1", "next;");
> >>>
> >>> -    const char *ct_lb_action = features->ct_no_masked_label
> >>> -                               ? "ct_lb_mark"
> >>> -                               : "ct_lb";
> >>> -    const char *lb_protocols[] = {"tcp", "udp", "sctp"};
> >>> -    struct ds actions = DS_EMPTY_INITIALIZER;
> >>> -    struct ds match = DS_EMPTY_INITIALIZER;
> >>> -
> >>> -    for (size_t i = 0; i < ARRAY_SIZE(lb_protocols); i++) {
> >>> -        ds_clear(&match);
> >>> -        ds_clear(&actions);
> >>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip4 && %s",
> >>> -                      lb_protocols[i]);
> >>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV4 " = ip4.dst; "
> >>> -                      REG_ORIG_TP_DPORT " = %s.dst; %s;",
> >>> -                      lb_protocols[i], ct_lb_action);
> >>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> >>> -                      ds_cstr(&match), ds_cstr(&actions));
> >>> -
> >>> -        ds_clear(&match);
> >>> -        ds_clear(&actions);
> >>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip6 && %s",
> >>> -                      lb_protocols[i]);
> >>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV6 " = ip6.dst; "
> >>> -                      REG_ORIG_TP_DPORT " = %s.dst; %s;",
> >>> -                      lb_protocols[i], ct_lb_action);
> >>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> >>> -                      ds_cstr(&match), ds_cstr(&actions));
> >>> -    }
> >>> +    /* Note: priority-120 flows are added in build_lb_rules_pre_stateful(). */
> >>>
> >>> -    ds_clear(&actions);
> >>> -    ds_put_format(&actions, "%s;", ct_lb_action);
> >>> +    const char *ct_lb_action = features->ct_no_masked_label
> >>> +                               ? "ct_lb_mark;"
> >>> +                               : "ct_lb;";
> >>>
> >>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 110,
> >>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
> >>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
> >>>
> >>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 110,
> >>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
> >>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
> >>>
> >>>      /* If REGBIT_CONNTRACK_DEFRAG is set as 1, then the packets should be
> >>>       * sent to conntrack for tracking and defragmentation. */
> >>> @@ -5945,8 +5919,6 @@ build_pre_stateful(struct ovn_datapath *od,
> >>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 100,
> >>>                    REGBIT_CONNTRACK_DEFRAG" == 1", "ct_next;");
> >>>
> >>> -    ds_destroy(&actions);
> >>> -    ds_destroy(&match);
> >>>  }
> >>>
> >>>  static void
> >>> @@ -6841,22 +6813,16 @@ build_qos(struct ovn_datapath *od, struct hmap *lflows) {
> >>>  }
> >>>
> >>>  static void
> >>> -build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
> >>> -               struct ds *match, struct ds *action,
> >>> -               const struct shash *meter_groups)
> >>> +build_lb_rules_pre_stateful(struct hmap *lflows, struct ovn_northd_lb *lb,
> >>> +                            bool ct_lb_mark, struct ds *match,
> >>> +                            struct ds *action)
> >>>  {
> >>>      for (size_t i = 0; i < lb->n_vips; i++) {
> >>>          struct ovn_lb_vip *lb_vip = &lb->vips[i];
> >>> -        struct ovn_northd_lb_vip *lb_vip_nb = &lb->vips_nb[i];
> >>> -        const char *ip_match = NULL;
> >>> -
> >>>          ds_clear(action);
> >>>          ds_clear(match);
> >>> +        const char *ip_match = NULL;
> >>>
> >>> -        /* Make sure that we clear the REGBIT_CONNTRACK_COMMIT flag. Otherwise
> >>> -         * the load balanced packet will be committed again in
> >>> -         * S_SWITCH_IN_STATEFUL. */
> >>> -        ds_put_format(action, REGBIT_CONNTRACK_COMMIT" = 0; ");
> >>>          /* Store the original destination IP to be used when generating
> >>>           * hairpin flows.
> >>>           */
> >>> @@ -6887,6 +6853,67 @@ build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
> >>>              ds_put_format(action, REG_ORIG_TP_DPORT " = %"PRIu16"; ",
> >>>                            lb_vip->vip_port);
> >>>          }
> >>> +        ds_put_format(action, "%s;", ct_lb_mark ? "ct_lb_mark" : "ct_lb");
> >>> +
> >>> +        ds_put_format(match, "%s.dst == %s", ip_match, lb_vip->vip_str);
> >>> +        if (lb_vip->vip_port) {
> >>> +            ds_put_format(match, " && %s.dst == %d", proto, lb_vip->vip_port);
> >>> +        }
> >>> +
> >>> +        struct ovn_lflow *lflow_ref = NULL;
> >>> +        uint32_t hash = ovn_logical_flow_hash(
> >>> +            ovn_stage_get_table(S_SWITCH_IN_PRE_STATEFUL),
> >>> +            ovn_stage_get_pipeline(S_SWITCH_IN_PRE_STATEFUL), 120,
> >>> +            ds_cstr(match), ds_cstr(action));
> >>> +
> >>> +        for (size_t j = 0; j < lb->n_nb_ls; j++) {
> >>> +            struct ovn_datapath *od = lb->nb_ls[j];
> >>> +
> >>> +            if (!ovn_dp_group_add_with_reference(lflow_ref, od)) {
> >>> +                lflow_ref = ovn_lflow_add_at_with_hash(
> >>> +                    lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> >>> +                    ds_cstr(match), ds_cstr(action),
> >>> +                    NULL, NULL, &lb->nlb->header_,
> >>> +                    OVS_SOURCE_LOCATOR, hash);
> >>> +            }
> >>> +        }
> >>> +    }
> >>
> >> I see how this can fix the dataplane issue because we're limiting the
> >> number of dp flows, but this now induces a measurable penalty on the
> >> control plane side because we essentially double the number of logical
> >> flows that ovn-northd generates for load balancer VIPs.
> >>
> >> Using a NB database [2] from a 250 node density heavy test [3] we ran
> >> in-house I see the following:
> >>
> >> a. before this patch (main):
> >>    - number of SB lflows: 146233
> >>    - SB size on disk (compacted): 70MB
> >>    - northd poll loop intervals: 8525ms
> >>
> >> b. with this patch:
> >>    - number of SB lflows: 163303
> >>    - SB size on disk (compacted): 76MB
> >>    - northd poll loop intervals: 9958ms
> >>
> >> [2] https://people.redhat.com/~dceara/250-density-heavy/20220830/ovnnb_db.db.gz
> >> [3] https://github.com/dceara/ovn-heater/blob/main/test-scenarios/ocp-250-density-heavy.yml
> >>
> >> This raises some concerns. However, I went ahead and also ran a test
> >> with Ilya's load balancer optimization series [4] (up for review) and I got:
> >>
> >> c. with this patch + Ilya's series:
> >>    - number of SB lflows: 163303            << Didn't change compared to "b"
> >>    - SB size on disk (compacted): 76MB      << Didn't change compared to "b"
> >>    - northd poll loop intervals: 6437ms     << 25% better than "a"
> >>
> >> Then I ran a test with main + Ilya's series.
> >>
> >> d. main + Ilya's series:
> >>    - number of SB lflows: 146233            << Didn't change compared to "a"
> >>    - SB size on disk (compacted): 70MB      << Didn't change compared to "a"
> >>    - northd poll loop intervals: 5413ms     << 35% better than "a"
> >>
> >> [4] https://patchwork.ozlabs.org/project/ovn/list/?series=315213&state=*
> >>
> >> I guess what I'm trying to say is that if we go ahead with the approach you
> >> proposed, and especially if we want to backport it to the LTS, we should
> >> consider accepting/backporting other optimizations too (such as Ilya's)
> >> that would compensate (and more) for the control plane performance hit.
> >>
> > Thanks for the scale test!
> > I do expect some cost in the control plane, and your test result is aligned
> > with my expectation. It is unpleasant to see a ~10% SB size increase (in the
> > density-heavy and LB/service-heavy test scenario), but if it is necessary
> > to fix the dataplane performance bug I think it is worth it. After all,
> > everything the control plane does is to serve the dataplane, right? The
> > resulting dataplane performance boost is at least 10x for short-lived
> > connection scenarios.
> > It's great to see that Ilya's patches help the LB scale, and I have no
> > objection to backporting them if that is considered necessary for stable
> > branches, but I'd consider it independent of the current patch.
> >
>
> They are independent. But the problem is your fix will visibly impact
> control plane latency (~16% on my setup). So, if we can avoid some of
> that hit by considering both patchsets together, I'd vote for that.
>
Again, no objection to backporting Ilya's patches, but these are the
criteria in my mind:
1. For this patch, it improves dataplane performance (to me it is in fact
fixing a performance bug) at the cost of the control plane. This is not
uncommon in the OVN project history - when adding some new features
sometimes it does add control plane cost. When this happens, whether to
accept the new feature is based on the value of the feature vs. the cost
to the control plane - a tradeoff has to be made in many cases. In this
case, for a 16% control plane latency increase we get a 10x dataplane
speed up, so I would vote for it.

2. For backporting Ilya's optimization, I'd evaluate it using the criteria
for any optimization backporting. Say the patch has a 10-20% speed up, we
think that speed up is critical for a branch for some reason, and the
changes are well contained - then I'd vote for the backport.

Does this make sense? @Numan Siddique <[email protected]> @Mark Michelson
<[email protected]> any comments?

Thanks,
Han

> Given that Ilya's changes seem very contained I don't think this should
> be a problem. They just need a proper review first.
>
> >> I tried to think of an alternative that wouldn't require doubling the
> >> number of VIP logical flows but couldn't come up with one. Maybe you
> >> have more ideas?
> >>
> > This is the approach with the least impact and changes to the existing
> > pipelines that I have thought of so far. The earlier attempt wouldn't
> > increase the number of flows, but it had to compromise by either failing to
> > support a corner case (that we are still not sure is really useful at
> > all) or losing HW offloading capability. I'll still think about some more
> > approaches, but the NAT/LB/ACL pipelines are very complex now and I'd avoid
> > big changes in case any corner case is broken. I'd spend more time sorting
> > out different use cases and the documentation before making more
> > optimizations, but I think we should also fix this performance bug as
> > soon as we can. (I am also curious why we haven't heard other OVN users
> > complain about this yet ...)
>
> It's a guess, but we probably didn't notice this issue because it should
> be more visible when there are lots of short-lived connections.
>
> Thanks,
> Dumitru

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
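The control plane tradeoff discussed in the thread hinges on the datapath-group reuse that the patch's build_lb_rules_pre_stateful() relies on: each VIP's priority-120 lflow is built once and shared by every logical switch that uses the load balancer, rather than duplicated per switch. A toy sketch of that pattern (hypothetical names and register assignments, not ovn-northd's actual data structures):

```python
# Toy model of lflow deduplication via datapath-group references
# (illustrative only; names, VIPs, and registers are hypothetical).

lflows = {}  # (stage, priority, match, action) -> set of datapath names

def lflow_add(datapaths, stage, priority, match, action):
    """Create the logical flow once; later callers only add references."""
    key = (stage, priority, match, action)
    group = lflows.setdefault(key, set())  # reuse existing flow if present
    group.update(datapaths)               # attach datapaths as references

vips = [("10.96.0.1", 443), ("10.96.0.10", 80)]
switches = ["ls1", "ls2", "ls3"]          # switches using the load balancer
for ip, port in vips:
    match = f"ip4.dst == {ip} && tcp.dst == {port}"
    action = f"reg1 = {ip}; reg2[0..15] = {port}; ct_lb_mark;"
    lflow_add(switches, "ls_in_pre_stateful", 120, match, action)

print(len(lflows))  # 2: one flow per VIP, shared across all three switches
```

The SB growth Dumitru measured therefore scales with the number of VIP:port tuples, not with VIPs times switches.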
