On Wed, Sep 7, 2022 at 8:47 AM Dumitru Ceara <[email protected]> wrote: > > On 9/6/22 22:03, Han Zhou wrote: > > On Tue, Sep 6, 2022 at 6:42 AM Dumitru Ceara <[email protected]> wrote: > >> > >> On 8/31/22 16:59, Han Zhou wrote: > >>> On Wed, Aug 31, 2022 at 1:38 AM Dumitru Ceara <[email protected]> wrote: > >>>> > >>>> On 8/31/22 02:17, Han Zhou wrote: > >>>>> On Tue, Aug 30, 2022 at 8:18 AM Dumitru Ceara <[email protected]> > > wrote: > >>>>>> > >>>>>> Hi Han, > >>>>>> > >>>>>> On 8/24/22 08:40, Han Zhou wrote: > >>>>>>> The ls_in_pre_stateful priority-120 flow that saves dst IP and port > > to > >>>>>>> registers is causing a critical dataplane performance impact to > >>>>>>> short-lived connections, because it unwildcards megaflows with exact > >>>>>>> match on dst IP and L4 ports. Any new connection with a different > >>>>>>> client-side L4 port will encounter a datapath flow miss and upcall to > >>>>>>> ovs-vswitchd, which makes typical use cases such as HTTP/1.0 based > >>>>>>> RESTful API calls suffer big performance degradations. > >>>>>>> > >>>>>>> These fields (dst IP and port) were saved to registers to solve a > >>>>>>> problem in the LB hairpin use case when different VIPs share an > >>>>>>> overlapping backend+port [0]. The change [0] might not have had as wide a > >>>>>>> performance impact as it does now because at that time one of the > > match > >>>>>>> conditions, "REGBIT_CONNTRACK_NAT == 1", was set only for established > > and > >>>>>>> natted traffic, while now the impact is more obvious because > >>>>>>> REGBIT_CONNTRACK_NAT is set for all IP traffic (if any VIP is > >>>>>>> configured on the LS) since commit [1], after several other > > indirectly > >>>>>>> related optimizations and refactors. > >>>>>>> > >>>>>>> This patch fixes the problem by modifying the priority-120 flows in > >>>>>>> ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for > >>> any > >>>>>>> traffic with REGBIT_CONNTRACK_NAT == 1, we now save dst IP and > > L4 > >>>>>>> port only for traffic matching the LB VIPs, because these are the > > ones > >>>>>>> that need to be saved for the hairpin purpose. The existing > >>> priority-110 > >>>>>>> flows will match the rest of the traffic just like before but > > would > >>>>>>> not save dst IP and L4 port, so any server->client traffic would not > >>>>>>> unwildcard megaflows with client-side L4 ports. > >>>>>>> > >>>>>> > >>>>>> Just to be 100% sure, the client->server traffic will still generate > > up > >>>>>> to V x H flows where V is "the number of VIP:port tuples" and H is > > "the > >>>>>> number of potential (dp_)hash values", right? > >>>>> > >>>>> Yes, if all the VIPs and backends are being accessed at the same time. > >>> This > >>>>> is expected. For any given client<->server pair, regardless of the > >>>>> connections (source port number changes), the packets should match the > >>> same > >>>>> megaflow and be handled by the fast path (instead of going to userspace, > >>> which > >>>>> is the behavior before this patch). > >>>>> > >>>> > >>>> Thanks for the confirmation! > >>>> > >>>>>> > >>>>>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with > >>> shared > >>>>> backends.") > >>>>>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer > >>> traffic.") > >>>>>>> > >>>>>>> Signed-off-by: Han Zhou <[email protected]> > >>>>>>> --- > >>>>>>> v1 -> v2: Add the missing changes for ovn-northd.8.xml, which I > > forgot > >>>>> to commit > >>>>>>> before sending v1.
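For readers following along, a quick way to see the flows this commit message describes; the switch name "sw0" is a placeholder, and the before/after notes paraphrase the commit message and the diff below — a sketch, not part of the patch:

    # Dump the ls_in_pre_stateful stage of a logical switch pipeline:
    ovn-sbctl lflow-list sw0 | grep ls_in_pre_stateful
    #
    # Before the fix, the priority-120 flows match broadly, roughly:
    #   REGBIT_CONNTRACK_NAT == 1 && ip4 && tcp
    # and save ip4.dst/tcp.dst to registers for all such traffic, so
    # reply (server->client) megaflows get an exact match on the
    # client's ephemeral port -- one megaflow per connection.
    # After the fix, they match per-VIP, roughly:
    #   ip4.dst == <VIP> && tcp.dst == <VIP port>
    # so only VIP-bound packets save the original destination; the rest
    # fall through to the priority-110 ct_lb flow.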
> >>>>>>> > >>>>>>> northd/northd.c | 125 > >>> +++++++++++++++++++++++++--------------- > >>>>>>> northd/ovn-northd.8.xml | 14 ++--- > >>>>>>> tests/ovn-northd.at | 87 ++++++++++------------------ > >>>>>>> tests/ovn.at | 14 ++--- > >>>>>>> 4 files changed, 118 insertions(+), 122 deletions(-) > >>>>>>> > >>>>>>> diff --git a/northd/northd.c b/northd/northd.c > >>>>>>> index 7e2681865..860641936 100644 > >>>>>>> --- a/northd/northd.c > >>>>>>> +++ b/northd/northd.c > >>>>>>> @@ -273,15 +273,15 @@ enum ovn_stage { > >>>>>>> * | | REGBIT_ACL_HINT_{ALLOW_NEW/ALLOW/DROP/BLOCK} | | > >>>>> | > >>>>>>> * | | REGBIT_ACL_LABEL | X | > >>>>> | > >>>>>>> * +----+----------------------------------------------+ X | > >>>>> | > >>>>>>> - * | R1 | ORIG_DIP_IPV4 (>= IN_STATEFUL) | R | > >>>>> | > >>>>>>> + * | R1 | ORIG_DIP_IPV4 (>= IN_PRE_STATEFUL) | R | > >>>>> | > >>>>>>> * +----+----------------------------------------------+ E | > >>>>> | > >>>>>>> - * | R2 | ORIG_TP_DPORT (>= IN_STATEFUL) | G | > >>>>> | > >>>>>>> + * | R2 | ORIG_TP_DPORT (>= IN_PRE_STATEFUL) | G | > >>>>> | > >>>>>>> * +----+----------------------------------------------+ 0 | > >>>>> | > >>>>>>> * | R3 | ACL LABEL | | > >>>>> | > >>>>>>> * > >>>>> > >>> > > +----+----------------------------------------------+---+------------------+ > >>>>>>> * | R4 | UNUSED | | > >>>>> | > >>>>>>> - * +----+----------------------------------------------+ X | > >>>>> ORIG_DIP_IPV6 | > >>>>>>> - * | R5 | UNUSED | X | (>= > >>>>> IN_STATEFUL) | > >>>>>>> + * +----+----------------------------------------------+ X | > >>>>> ORIG_DIP_IPV6(>= | > >>>>>>> + * | R5 | UNUSED | X | > >>>>> IN_PRE_STATEFUL) | > >>>>>>> * +----+----------------------------------------------+ R | > >>>>> | > >>>>>>> * | R6 | UNUSED | E | > >>>>> | > >>>>>>> * +----+----------------------------------------------+ G | > >>>>> | > >>>>>>> @@ -5899,43 +5899,17 @@ build_pre_stateful(struct ovn_datapath *od, > >>>>>>> ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 0, "1", > >>>>> "next;"); > >>>>>>> ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 0, "1", > >>>>> "next;"); > >>>>>>> > >>>>>>> - const char *ct_lb_action = features->ct_no_masked_label > >>>>>>> - ? "ct_lb_mark" > >>>>>>> - : "ct_lb"; > >>>>>>> - const char *lb_protocols[] = {"tcp", "udp", "sctp"}; > >>>>>>> - struct ds actions = DS_EMPTY_INITIALIZER; > >>>>>>> - struct ds match = DS_EMPTY_INITIALIZER; > >>>>>>> - > >>>>>>> - for (size_t i = 0; i < ARRAY_SIZE(lb_protocols); i++) { > >>>>>>> - ds_clear(&match); > >>>>>>> - ds_clear(&actions); > >>>>>>> - ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip4 && > >>> %s", > >>>>>>> - lb_protocols[i]); > >>>>>>> - ds_put_format(&actions, REG_ORIG_DIP_IPV4 " = ip4.dst; " > >>>>>>> - REG_ORIG_TP_DPORT " = %s.dst; %s;", > >>>>>>> - lb_protocols[i], ct_lb_action); > >>>>>>> - ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120, > >>>>>>> - ds_cstr(&match), ds_cstr(&actions)); > >>>>>>> - > >>>>>>> - ds_clear(&match); > >>>>>>> - ds_clear(&actions); > >>>>>>> - ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip6 && > >>> %s", > >>>>>>> - lb_protocols[i]); > >>>>>>> - ds_put_format(&actions, REG_ORIG_DIP_IPV6 " = ip6.dst; " > >>>>>>> - REG_ORIG_TP_DPORT " = %s.dst; %s;", > >>>>>>> - lb_protocols[i], ct_lb_action); > >>>>>>> - ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120, > >>>>>>> - ds_cstr(&match), ds_cstr(&actions)); > >>>>>>> - } > >>>>>>> + /* Note: priority-120 flows are added in > >>>>> build_lb_rules_pre_stateful(). 
*/ > >>>>>>> > >>>>>>> - ds_clear(&actions); > >>>>>>> - ds_put_format(&actions, "%s;", ct_lb_action); > >>>>>>> + const char *ct_lb_action = features->ct_no_masked_label > >>>>>>> + ? "ct_lb_mark;" > >>>>>>> + : "ct_lb;"; > >>>>>>> > >>>>>>> ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 110, > >>>>>>> - REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions)); > >>>>>>> + REGBIT_CONNTRACK_NAT" == 1", ct_lb_action); > >>>>>>> > >>>>>>> ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 110, > >>>>>>> - REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions)); > >>>>>>> + REGBIT_CONNTRACK_NAT" == 1", ct_lb_action); > >>>>>>> > >>>>>>> /* If REGBIT_CONNTRACK_DEFRAG is set as 1, then the packets > >>> should > >>>>> be > >>>>>>> * sent to conntrack for tracking and defragmentation. */ > >>>>>>> @@ -5945,8 +5919,6 @@ build_pre_stateful(struct ovn_datapath *od, > >>>>>>> ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 100, > >>>>>>> REGBIT_CONNTRACK_DEFRAG" == 1", "ct_next;"); > >>>>>>> > >>>>>>> - ds_destroy(&actions); > >>>>>>> - ds_destroy(&match); > >>>>>>> } > >>>>>>> > >>>>>>> static void > >>>>>>> @@ -6841,22 +6813,16 @@ build_qos(struct ovn_datapath *od, struct > > hmap > >>>>> *lflows) { > >>>>>>> } > >>>>>>> > >>>>>>> static void > >>>>>>> -build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool > >>>>> ct_lb_mark, > >>>>>>> - struct ds *match, struct ds *action, > >>>>>>> - const struct shash *meter_groups) > >>>>>>> +build_lb_rules_pre_stateful(struct hmap *lflows, struct > > ovn_northd_lb > >>>>> *lb, > >>>>>>> + bool ct_lb_mark, struct ds *match, > >>>>>>> + struct ds *action) > >>>>>>> { > >>>>>>> for (size_t i = 0; i < lb->n_vips; i++) { > >>>>>>> struct ovn_lb_vip *lb_vip = &lb->vips[i]; > >>>>>>> - struct ovn_northd_lb_vip *lb_vip_nb = &lb->vips_nb[i]; > >>>>>>> - const char *ip_match = NULL; > >>>>>>> - > >>>>>>> ds_clear(action); > >>>>>>> ds_clear(match); > >>>>>>> + const char *ip_match = NULL; > >>>>>>> > >>>>>>> - /* Make sure that we clear the REGBIT_CONNTRACK_COMMIT > > flag. > >>>>> Otherwise > >>>>>>> - * the load balanced packet will be committed again in > >>>>>>> - * S_SWITCH_IN_STATEFUL. */ > >>>>>>> - ds_put_format(action, REGBIT_CONNTRACK_COMMIT" = 0; "); > >>>>>>> /* Store the original destination IP to be used when > >>> generating > >>>>>>> * hairpin flows. > >>>>>>> */ > >>>>>>> @@ -6887,6 +6853,67 @@ build_lb_rules(struct hmap *lflows, struct > >>>>> ovn_northd_lb *lb, bool ct_lb_mark, > >>>>>>> ds_put_format(action, REG_ORIG_TP_DPORT " = %"PRIu16"; > > ", > >>>>>>> lb_vip->vip_port); > >>>>>>> } > >>>>>>> + ds_put_format(action, "%s;", ct_lb_mark ? 
"ct_lb_mark" : > >>>>> "ct_lb"); > >>>>>>> + > >>>>>>> + ds_put_format(match, "%s.dst == %s", ip_match, > >>>>> lb_vip->vip_str); > >>>>>>> + if (lb_vip->vip_port) { > >>>>>>> + ds_put_format(match, " && %s.dst == %d", proto, > >>>>> lb_vip->vip_port); > >>>>>>> + } > >>>>>>> + > >>>>>>> + struct ovn_lflow *lflow_ref = NULL; > >>>>>>> + uint32_t hash = ovn_logical_flow_hash( > >>>>>>> + ovn_stage_get_table(S_SWITCH_IN_PRE_STATEFUL), > >>>>>>> + ovn_stage_get_pipeline(S_SWITCH_IN_PRE_STATEFUL), > >>> 120, > >>>>>>> + ds_cstr(match), ds_cstr(action)); > >>>>>>> + > >>>>>>> + for (size_t j = 0; j < lb->n_nb_ls; j++) { > >>>>>>> + struct ovn_datapath *od = lb->nb_ls[j]; > >>>>>>> + > >>>>>>> + if (!ovn_dp_group_add_with_reference(lflow_ref, od)) { > >>>>>>> + lflow_ref = ovn_lflow_add_at_with_hash( > >>>>>>> + lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120, > >>>>>>> + ds_cstr(match), ds_cstr(action), > >>>>>>> + NULL, NULL, &lb->nlb->header_, > >>>>>>> + OVS_SOURCE_LOCATOR, hash); > >>>>>>> + } > >>>>>>> + } > >>>>>>> + } > >>>>>> > >>>>>> I see how this can fix the dataplane issue because we're limiting the > >>>>>> number of dp flows but this now induces a measurable penalty on the > >>>>>> control plane side because we essentially double the number of > > logical > >>>>>> flows that ovn-northd generates for load balancer VIPs > >>>>>> > >>>>>> Using a NB database [2] from a 250 node density heavy test [3] we ran > >>>>>> in-house I see the following: > >>>>>> > >>>>>> a. before this patch (main): > >>>>>> - number of SB lflows: 146233 > >>>>>> - SB size on disk (compacted): 70MB > >>>>>> - northd poll loop intervals: 8525ms > >>>>>> > >>>>>> b. with this patch: > >>>>>> - number of SB lflows: 163303 > >>>>>> - SB size on disk (compacted): 76MB > >>>>>> - northd poll loop intervals: 9958ms > >>>>>> > >>>>>> [2] > >>>>>> > >>>>> > >>> > > https://people.redhat.com/~dceara/250-density-heavy/20220830/ovnnb_db.db.gz > >>>>>> [3] > >>>>>> > >>>>> > >>> > > https://github.com/dceara/ovn-heater/blob/main/test-scenarios/ocp-250-density-heavy.yml > >>>>>> > >>>>>> This raises some concerns. However, I went ahead and also ran a test > >>>>>> with Ilya's load balancer optimization series [4] (up for review) > > and I > >>>>> got: > >>>>>> > >>>>>> c. with this patch + Ilya's series: > >>>>>> - number of SB lflows: 163303 << Didn't change compared to > > "b" > >>>>>> - SB size on disk (compacted): 76MB << Didn't change compared to > > "b" > >>>>>> - northd poll loop intervals: 6437ms << 25% better than "a" > >>>>>> > >>>>>> Then I ran a test with main + Ilya's series. > >>>>>> > >>>>>> d. main + Ilya's series: > >>>>>> - number of SB lflows: 146233 << Didn't change compared to > > "a" > >>>>>> - SB size on disk (compacted): 70MB << Didn't change compared to > > "a" > >>>>>> - northd poll loop intervals: 5413ms << 35% better than "a" > >>>>>> > >>>>>> [4] > >>> https://patchwork.ozlabs.org/project/ovn/list/?series=315213&state=* > >>>>>> > >>>>>> I guess what I'm trying to say is that if go ahead with the approach > >>> you > >>>>>> proposed and especially if we want to backport it to the LTS we > > should > >>>>>> consider accepting/backporting other optimizations too (such as > > Ilya's) > >>>>>> that would compensate (and more) for the control plane performance > > hit. > >>>>>> > >>>>> Thanks for the scale test! > >>>>> I do expect some cost in the control plane, and your test result is > >>> aligned > >>>>> with my expectation. 
It is unpleasant to see a ~10% SB size increase (in > >>> the > >>>>> density-heavy and LB/service-heavy test scenario), but if it is > >>> necessary > >>>>> to fix the dataplane performance bug I think it is worth it. After > > all, > >>>>> everything the control plane does is to serve the dataplane, right? > > The > >>>>> resulting dataplane performance boost is at least 10x for short-lived > >>>>> connection scenarios. > >>>>> It's great to see that Ilya's patches help the LB scale, and I have no > >>> objection > >>>>> to backporting them if it is considered necessary for stable branches, but > >>> I'd > >>>>> consider it independent of the current patch. > >>>>> > >>>> > >>>> They are independent. But the problem is your fix will visibly impact > >>>> control plane latency (~16% on my setup). So, if we can avoid some of > >>>> that hit by considering both patchsets together, I'd vote for that. > >>>> > >>> Again, no objection to backporting Ilya's patches, but these are the > > criteria > >>> in my mind: > >>> > >>> 1. For this patch, it improves dataplane performance (to me it is in > > fact > >>> fixing a performance bug) at the cost of the control plane. This is not > >>> uncommon in the OVN project history - when adding some new features > >>> sometimes it does add control plane cost. When this happens, whether to > >>> accept the new feature is based on the value of the feature vs. the > > cost > >>> of the control plane - a tradeoff has to be made in many cases. In this > >>> case, for a 16% control plane latency increase we get a 10x dataplane > > speedup, > >>> so I would vote for it. > >> > >> Hi Han, > >> > >> I'm not arguing that we don't get a dataplane speedup. I'm sure we will > >> and it should be a decent amount. But I've been trying to replicate > >> this 10x boost with an ovn-kubernetes kind deployment and I didn't > >> really manage to. > >> > >> I did this on a 28-core Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz in our > >> lab. > >> > >> a. build an ovnkube image (with and without your patch); essentially > >> what our ovn-kubernetes CI jobs do: > >> > >> > > https://github.com/ovn-org/ovn/blob/main/.github/workflows/ovn-kubernetes.yml#L39 > >> > >> b. start a kind cluster with the image we built at step "a". > >> > >> c. edit the ovs-node daemonset definition and increase the resource > >> limits to 16 CPUs and 2GB RSS (the kind defaults are very low: 0.2 CPU > >> and 300MB). This restarts the ovs-daemon pods; wait for those to come > >> back up. > >> > >> d. start an nginx deployment with 50 replicas: this is essentially 50 > >> backends accepting http requests on port 80. > >> > >> e. create a service load balancing http requests received on a port, > >> e.g. 42424, to port 80 on the pods in the nginx deployment above. > >> > >> f. create a clients deployment with 100 replicas: 100 pods running the > >> following bash script, a hack to simulate 40 new connections with > >> different ephemeral source ports every 2 seconds: > >> > >> i=0 > >> while true; do > >> i=$((i+1)) > >> [ $((i%40)) -eq 0 ] && sleep 2 > >> nc -z -v $SERVICE_VIP 42424 & > >> done > >> > >> Comparing the runs with and without your patch applied I see improvement: > >> > >> 1. without this patch applied: > >> - 17K, 18K, 27K datapath flows on the 3 nodes of the cluster > >> - OVS CPU usage on the 3 nodes: 30-50% > >> > >> 2.
with this patch applied: > >> - 9K, 9.5K, 9K datapath flows on the 3 nodes of the cluster > >> - OVS CPU usage on the 3 nodes: 10-20% > >> > >> In both cases the whole system is not overloaded; OVS doesn't reach its > >> CPU limits. > >> > >> I see clear improvement but I was expecting to see more than this. It > >> does look like my test doesn't generate enough unique sources per second, > >> so I was wondering if you could share your test methodology? > >> > > > > Thanks Dumitru for the performance test. I guess if you reduce the number > > of clients while increasing the connection speed per client you would see > > more of a difference. And I am not sure why you saw 10-20% OVS CPU usage with > > the patch applied. If all the 100 clients keep sending packets you should > > have a constant 100xN (N=2 or 3) megaflows instead of 9K (unless there are > > other problems causing flow misses). BTW, I assume you don't have HW-offload > > enabled, right? Could you also try sending to a backend directly instead of > > the VIP and see the difference? > > No HWOL enabled on my platform. I think my setup was still too > overloaded (too many containers and virtual nodes on the same physical > machine). > > > > > Here is a brief test I ran, with the below patch based on the system test > > case "load-balancing". It is basically a single client sending requests to > > one of the LB backends using the "ab" (Apache Benchmark) tool. > > ------------------------------------------------------------------------- > > diff --git a/tests/system-ovn.at b/tests/system-ovn.at > > index 992813614..3adc72da1 100644 > > --- a/tests/system-ovn.at > > +++ b/tests/system-ovn.at > > @@ -1430,10 +1430,12 @@ ovn-nbctl lsp-add bar bar3 \ > > # Config OVN load-balancer with a VIP. > > ovn-nbctl lb-add lb1 30.0.0.1 "172.16.1.2,172.16.1.3,172.16.1.4" > > ovn-nbctl ls-lb-add foo lb1 > > +ovn-nbctl ls-lb-add bar lb1 > > > > # Create another load-balancer with another VIP. > > lb2_uuid=`ovn-nbctl create load_balancer name=lb2 > > vips:30.0.0.3="172.16.1.2,172.16.1.3,172.16.1.4"` > > ovn-nbctl ls-lb-add foo lb2 > > +ovn-nbctl ls-lb-add bar lb2 > > > > # Config OVN load-balancer with another VIP (this time with ports). > > ovn-nbctl set load_balancer $lb2_uuid vips:'"30.0.0.2:8000"'='" > > 172.16.1.2:80,172.16.1.3:80,172.16.1.4:80"' > > @@ -1503,6 +1505,10 @@ > > tcp,orig=(src=192.168.1.2,dst=30.0.0.3,sport=<cleared>,dport=<cleared>),reply=(s > > tcp,orig=(src=192.168.1.2,dst=30.0.0.3,sport=<cleared>,dport=<cleared>),reply=(src=172.16.1.4,dst=192.168.1.2,sport=<cleared>,dport=<cleared>),zone=<cleared>,mark=2,protoinfo=(state=<cleared>) > > ]) > > > > +#check ovn-nbctl --apply-after-lb acl-add foo from-lport 1002 "ip4 && > > ip4.dst == {172.16.1.2,172.16.1.3,172.16.1.4} && tcp.dst == 80" > > allow-related > > +#check ovn-nbctl --apply-after-lb acl-add foo from-lport 1002 "ip4" > > allow-related > > +#check ovn-nbctl --wait=hv sync > > + > > dnl Test load-balancing that includes L4 ports in NAT. > > dnl Each server should have at least one connection. > > OVS_WAIT_FOR_OUTPUT([ > > @@ -1516,6 +1522,15 @@ > > tcp,orig=(src=192.168.1.2,dst=30.0.0.2,sport=<cleared>,dport=<cleared>),reply=(s > > tcp,orig=(src=192.168.1.2,dst=30.0.0.2,sport=<cleared>,dport=<cleared>),reply=(src=172.16.1.4,dst=192.168.1.2,sport=<cleared>,dport=<cleared>),zone=<cleared>,mark=2,protoinfo=(state=<cleared>) > > ]) > > > > +ip netns exec foo1 ab -n 10000 -c 1 -q 172.16.1.2:80/ > > + > > # Configure selection_fields.
> > ovn-nbctl set load_balancer $lb2_uuid > > selection_fields="ip_src,ip_dst,tp_src,tp_dst" > > OVS_WAIT_UNTIL([ > > ----------------------------------------------------------------------- > > > > Tests on 24 cores x Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz > > The test results are: > > > > // Before the change > > Concurrency Level: 1 > > Time taken for tests: 12.913 seconds > > Complete requests: 10000 > > Failed requests: 0 > > Total transferred: 21450000 bytes > > HTML transferred: 19590000 bytes > > Requests per second: 774.41 [#/sec] (mean) > > Time per request: 1.291 [ms] (mean) > > Time per request: 1.291 [ms] (mean, across all concurrent requests) > > Transfer rate: 1622.18 [Kbytes/sec] received > > > > OVS CPU: ~280% > > > > // After the change > > Concurrency Level: 1 > > Time taken for tests: 5.629 seconds > > Complete requests: 10000 > > Failed requests: 0 > > Total transferred: 21450000 bytes > > HTML transferred: 19590000 bytes > > Requests per second: 1776.62 [#/sec] (mean) > > Time per request: 0.563 [ms] (mean) > > Time per request: 0.563 [ms] (mean, across all concurrent requests) > > Transfer rate: 3721.54 [Kbytes/sec] received > > > > OVS CPU: ~0% > > > > We can see that the RPS is 2.5x higher (1776 vs. 774) even with a single > > thread (-c 1). It is not 10x as I guessed earlier, but keep in mind > > that the "ab" test has a complete TCP handshake, then request/response, > > and finally teardown, usually 6 packets per connection in total, so at > > least half of the packets are still going through the fast path even before the > > change. But for the "first" packets of each connection, the speedup should > > be at the 10x level, which we can easily tell with a ping test by looking at the RTT > > difference between the first packet, which goes through the slow path, and > > the later packets, which go through the fast path only. > > I changed the test, on an ovn-kubernetes kind cluster, with netperf > TCP_CRR (connection + request + response): > > https://github.com/jtaleric/k8s-netperf/pull/8 > > > > > And the CPU utilization for OVS is huge before the change, while after the > > change it is ~0. So if we combine more complex scenarios with the system > > overloaded, I believe the difference would be even bigger. > > > > I see a big difference in CPU usage too. It goes up to ~90% CPU (almost > one whole CPU) just for processing CRR between a client and a server, > while with your patch applied the CPU usage for OVS is negligible. > > The number of DP flows also drops drastically, as expected. > > Comparing the number of "transactions" (connect + request + response) > between a single pair of client + server I see on my machine: > > a. without your patch: ~2K TPS > b. with your patch: ~6K TPS > > Probably a coincidence because our platforms and tests are very > different, but I also see a 2-3x boost in transactions per second. > > >> For the OVN code changes, they look good to me. I think we should apply > >> them to the main branch. For that: > >> > >> Acked-by: Dumitru Ceara <[email protected]> > >> > >> For backporting the change to stable branches I'd really like to first > >> understand the exact scenario in which the megaflow unwildcarding > >> created such a big performance regression in ovn-kubernetes. > > > > Thank you for all the reviews and tests. I applied it to main and am waiting > > for your confirmation for backporting. > > For branch-22.09, I think maybe we should backport it now, before it is > > released?
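For anyone reproducing these comparisons, a minimal sketch of how the datapath flow and OVS CPU numbers quoted above can be sampled while a benchmark runs; the commands are standard OVS tooling, but the 2-second interval and the top field position are assumptions about a default Linux setup:

    # Sample the installed datapath (mega)flow count and ovs-vswitchd
    # CPU usage every 2 seconds.
    while true; do
        flows=$(ovs-appctl dpctl/dump-flows | wc -l)
        # %CPU is field 9 in default top batch output (assumption).
        cpu=$(top -b -n 1 | awk '/ovs-vswitchd/ {print $9; exit}')
        echo "$(date +%T) flows=$flows ovs-vswitchd-cpu=${cpu}%"
        sleep 2
    done

Before the fix the flow count climbs with every new client source port; after it, the count should stay near a constant bounded by the number of talking (client, server) pairs.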
> > > > In light of your results and based on the findings above, I think we > should accept the backport even if it might cause a small control plane > degradation, also keeping in mind Ilya's northd patches, which I > think are good candidates for stable branches too. But I'll let Mark > and Numan, from their maintainer perspective, share their thoughts on > this matter. > > Regards, > Dumitru
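A minimal standalone netperf TCP_CRR invocation of the kind k8s-netperf drives, in case someone wants to reproduce the TPS comparison above outside of a kind cluster; the backend address is hypothetical, the flags are standard netperf:

    # On the backend: start the netperf control daemon.
    netserver

    # On the client: run a 30-second connect/request/response test.
    # TCP_CRR opens a fresh TCP connection for every transaction, which
    # is exactly the short-lived-connection pattern discussed above.
    netperf -H 10.244.1.5 -t TCP_CRR -l 30

The transactions-per-second figure netperf reports corresponds to the TPS numbers quoted above.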
Thanks Dumitru for confirming. In today's OVN meeting I got confirmation from Mark that he is in favor of backporting it as well, so I just backported it down to branch-22.03. Han > > > Han > > > >> > >> Thanks, > >> Dumitru > >> > >>> > >>> 2. For backporting Ilya's optimization, I'd evaluate using the criteria > > for > >>> any optimization backporting. Say the patch has a 10-20% speedup, and we > >>> think that speedup is critical for a branch for some reason, and the > >>> changes are well contained, I'd vote for the backport. > >>> > >>> Does this make sense? > >>> @Numan Siddique <[email protected]> @Mark Michelson <[email protected]> > > any > >>> comments? > >>> > >>> Thanks, > >>> Han > >>> > >>>> Given that Ilya's changes seem very contained I don't think this should > >>>> be a problem. They just need a proper review first. > >>>> > >>>>>> I tried to think of an alternative that wouldn't require doubling the > >>>>>> number of VIP logical flows but couldn't come up with one. Maybe you > >>>>>> have more ideas? > >>>>>> > >>>>> Of the approaches I have thought about so far, this is the one with the > >>>>> least impact and fewest changes to the existing pipelines. The earlier attempt wouldn't > >>>>> increase the number of flows but it had to compromise by either > > failing > >>> to > >>>>> support a corner case (which we are still not sure is really > >>> useful at > >>>>> all) or losing HW offloading capability. I'll still think about some > >>> more > >>>>> approaches, but the NAT/LB/ACL pipelines are very complex now and I'd > >>> avoid > >>>>> big changes in case any corner case gets broken. I'd spend more time > >>> sorting > >>>>> out different use cases and the documentation before making more > >>>>> optimizations, but I do think we should also fix this performance > >>> bug as > >>>>> soon as we can. (I am also curious why we haven't heard other OVN > > users > >>>>> complain about this yet ...) > >>>> > >>>> It's a guess, but we probably didn't notice this issue because it > > should > >>>> be more visible when there are lots of short-lived connections. > >>>> > >>>> Thanks, > >>>> Dumitru > >>>> > >>> > >> > >
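As a footnote to Han's ping-test remark above, a minimal sketch using the namespace and backend address from the system test quoted earlier:

    # The first packet of a new flow misses the datapath and upcalls to
    # ovs-vswitchd (slow path); once the megaflow is installed, later
    # packets stay in the kernel fast path.
    ip netns exec foo1 ping -c 5 172.16.1.2
    # Expect the first RTT to be noticeably larger than the rest.
    # Before this patch, the reply megaflow matched the client's
    # ephemeral L4 port, so every new TCP connection paid that same
    # upcall cost; with it, only genuinely new (client, server) pairs do.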
