On 9/6/22 22:03, Han Zhou wrote:
> On Tue, Sep 6, 2022 at 6:42 AM Dumitru Ceara <[email protected]> wrote:
>>
>> On 8/31/22 16:59, Han Zhou wrote:
>>> On Wed, Aug 31, 2022 at 1:38 AM Dumitru Ceara <[email protected]> wrote:
>>>>
>>>> On 8/31/22 02:17, Han Zhou wrote:
>>>>> On Tue, Aug 30, 2022 at 8:18 AM Dumitru Ceara <[email protected]> wrote:
>>>>>>
>>>>>> Hi Han,
>>>>>>
>>>>>> On 8/24/22 08:40, Han Zhou wrote:
>>>>>>> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
>>>>>>> registers is causing a critical dataplane performance impact to
>>>>>>> short-lived connections, because it unwildcards megaflows with exact
>>>>>>> match on dst IP and L4 ports. Any new connection with a different
>>>>>>> client side L4 port will encounter a datapath flow miss and upcall to
>>>>>>> ovs-vswitchd, which makes typical use cases such as HTTP1.0 based
>>>>>>> RESTful API calls suffer big performance degradations.
>>>>>>>
>>>>>>> These fields (dst IP and port) were saved to registers to solve a
>>>>>>> problem of the LB hairpin use case when different VIPs are sharing
>>>>>>> overlapping backend+port [0]. The change [0] might not have had as wide
>>>>>>> a performance impact as it does now because at that time one of the match
>>>>>>> conditions, "REGBIT_CONNTRACK_NAT == 1", was set only for established and
>>>>>>> natted traffic, while now the impact is more obvious because
>>>>>>> REGBIT_CONNTRACK_NAT is set for all IP traffic (if any VIP is
>>>>>>> configured on the LS) since commit [1], after several other indirectly
>>>>>>> related optimizations and refactors.
>>>>>>>
>>>>>>> This patch fixes the problem by modifying the priority-120 flows in
>>>>>>> ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for any
>>>>>>> traffic with REGBIT_CONNTRACK_NAT == 1, we now save dst IP and L4
>>>>>>> port only for traffic matching the LB VIPs, because these are the ones
>>>>>>> that need to be saved for the hairpin purpose. The existing priority-110
>>>>>>> flows will match the rest of the traffic just like before but will not
>>>>>>> save dst IP and L4 port, so any server->client traffic will not
>>>>>>> unwildcard megaflows with client side L4 ports.
>>>>>>>
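>>>>>>> For illustration, the resulting logical flows look roughly like this
>>>>>>> (table and register numbers are from my dev setup and may differ):
>>>>>>>
>>>>>>>   table=6 (ls_in_pre_stateful), priority=120,
>>>>>>>     match=(ip4.dst == 30.0.0.1 && tcp.dst == 8000),
>>>>>>>     action=(reg1 = ip4.dst; reg2[0..15] = tcp.dst; ct_lb_mark;)
>>>>>>>   table=6 (ls_in_pre_stateful), priority=110,
>>>>>>>     match=(reg0[2] == 1), action=(ct_lb_mark;)
>>>>>>>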
>>>>>>
>>>>>> Just to be 100% sure, the client->server traffic will still generate up
>>>>>> to V x H flows where V is "the number of VIP:port tuples" and H is "the
>>>>>> number of potential (dp_)hash values", right?
>>>>>
>>>>> Yes, if all the VIPs and backends are being accessed at the same time. This
>>>>> is expected. For any given pair of client<->server, regardless of the
>>>>> connections (source port number changes), the packets should match the same
>>>>> mega-flow and be handled by the fastpath (instead of going to userspace,
>>>>> which is the behavior before this patch).
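>>>>>
>>>>> To illustrate with hypothetical, abbreviated datapath flows (field names
>>>>> simplified, this is not actual dpctl output): before the patch, the return
>>>>> traffic exact-matches the client's ephemeral port, e.g.
>>>>>
>>>>>   ipv4(dst=192.168.1.2),tcp(dst=50001), actions:...
>>>>>   ipv4(dst=192.168.1.2),tcp(dst=50002), actions:...
>>>>>
>>>>> while after the patch the client-side port is wildcarded, so one megaflow
>>>>> covers all connections between the same pair.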
>>>>>
>>>>
>>>> Thanks for the confirmation!
>>>>
>>>>>>
>>>>>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with shared backends.")
>>>>>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")
>>>>>>>
>>>>>>> Signed-off-by: Han Zhou <[email protected]>
>>>>>>> ---
>>>>>>> v1 -> v2: Add the missing changes for ovn-northd.8.xml which I forgot to commit
>>>>>>> before sending v1.
>>>>>>>
>>>>>>>  northd/northd.c         | 125 +++++++++++++++++++++++++---------------
>>>>>>>  northd/ovn-northd.8.xml |  14 ++---
>>>>>>>  tests/ovn-northd.at     |  87 ++++++++++------------------
>>>>>>>  tests/ovn.at            |  14 ++---
>>>>>>>  4 files changed, 118 insertions(+), 122 deletions(-)
>>>>>>>
>>>>>>> diff --git a/northd/northd.c b/northd/northd.c
>>>>>>> index 7e2681865..860641936 100644
>>>>>>> --- a/northd/northd.c
>>>>>>> +++ b/northd/northd.c
>>>>>>> @@ -273,15 +273,15 @@ enum ovn_stage {
>>>>>>>   * |    | REGBIT_ACL_HINT_{ALLOW_NEW/ALLOW/DROP/BLOCK} |   |                  |
>>>>>>>   * |    | REGBIT_ACL_LABEL                             | X |                  |
>>>>>>>   * +----+----------------------------------------------+ X |                  |
>>>>>>> - * | R1 | ORIG_DIP_IPV4 (>= IN_STATEFUL)               | R |                  |
>>>>>>> + * | R1 | ORIG_DIP_IPV4 (>= IN_PRE_STATEFUL)           | R |                  |
>>>>>>>   * +----+----------------------------------------------+ E |                  |
>>>>>>> - * | R2 | ORIG_TP_DPORT (>= IN_STATEFUL)               | G |                  |
>>>>>>> + * | R2 | ORIG_TP_DPORT (>= IN_PRE_STATEFUL)           | G |                  |
>>>>>>>   * +----+----------------------------------------------+ 0 |                  |
>>>>>>>   * | R3 | ACL LABEL                                    |   |                  |
>>>>>>>   * +----+----------------------------------------------+---+------------------+
>>>>>>>   * | R4 | UNUSED                                       |   |                  |
>>>>>>> - * +----+----------------------------------------------+ X | ORIG_DIP_IPV6    |
>>>>>>> - * | R5 | UNUSED                                       | X | (>= IN_STATEFUL) |
>>>>>>> + * +----+----------------------------------------------+ X | ORIG_DIP_IPV6(>= |
>>>>>>> + * | R5 | UNUSED                                       | X | IN_PRE_STATEFUL) |
>>>>>>>   * +----+----------------------------------------------+ R |                  |
>>>>>>>   * | R6 | UNUSED                                       | E |                  |
>>>>>>>   * +----+----------------------------------------------+ G |                  |
>>>>>>> @@ -5899,43 +5899,17 @@ build_pre_stateful(struct ovn_datapath *od,
>>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 0, "1", "next;");
>>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 0, "1", "next;");
>>>>>>>
>>>>>>> -    const char *ct_lb_action = features->ct_no_masked_label
>>>>>>> -                               ? "ct_lb_mark"
>>>>>>> -                               : "ct_lb";
>>>>>>> -    const char *lb_protocols[] = {"tcp", "udp", "sctp"};
>>>>>>> -    struct ds actions = DS_EMPTY_INITIALIZER;
>>>>>>> -    struct ds match = DS_EMPTY_INITIALIZER;
>>>>>>> -
>>>>>>> -    for (size_t i = 0; i < ARRAY_SIZE(lb_protocols); i++) {
>>>>>>> -        ds_clear(&match);
>>>>>>> -        ds_clear(&actions);
>>>>>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip4 && %s",
>>>>>>> -                      lb_protocols[i]);
>>>>>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV4 " = ip4.dst; "
>>>>>>> -                      REG_ORIG_TP_DPORT " = %s.dst; %s;",
>>>>>>> -                      lb_protocols[i], ct_lb_action);
>>>>>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>>>>>> -                      ds_cstr(&match), ds_cstr(&actions));
>>>>>>> -
>>>>>>> -        ds_clear(&match);
>>>>>>> -        ds_clear(&actions);
>>>>>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip6 && %s",
>>>>>>> -                      lb_protocols[i]);
>>>>>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV6 " = ip6.dst; "
>>>>>>> -                      REG_ORIG_TP_DPORT " = %s.dst; %s;",
>>>>>>> -                      lb_protocols[i], ct_lb_action);
>>>>>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>>>>>> -                      ds_cstr(&match), ds_cstr(&actions));
>>>>>>> -    }
>>>>>>> +    /* Note: priority-120 flows are added in build_lb_rules_pre_stateful(). */
>>>>>>>
>>>>>>> -    ds_clear(&actions);
>>>>>>> -    ds_put_format(&actions, "%s;", ct_lb_action);
>>>>>>> +    const char *ct_lb_action = features->ct_no_masked_label
>>>>>>> +                               ? "ct_lb_mark;"
>>>>>>> +                               : "ct_lb;";
>>>>>>>
>>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 110,
>>>>>>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
>>>>>>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
>>>>>>>
>>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 110,
>>>>>>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
>>>>>>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
>>>>>>>
>>>>>>>      /* If REGBIT_CONNTRACK_DEFRAG is set as 1, then the packets should be
>>>>>>>       * sent to conntrack for tracking and defragmentation. */
>>>>>>> @@ -5945,8 +5919,6 @@ build_pre_stateful(struct ovn_datapath *od,
>>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 100,
>>>>>>>                    REGBIT_CONNTRACK_DEFRAG" == 1", "ct_next;");
>>>>>>>
>>>>>>> -    ds_destroy(&actions);
>>>>>>> -    ds_destroy(&match);
>>>>>>>  }
>>>>>>>
>>>>>>>  static void
>>>>>>> @@ -6841,22 +6813,16 @@ build_qos(struct ovn_datapath *od, struct hmap *lflows) {
>>>>>>>  }
>>>>>>>
>>>>>>>  static void
>>>>>>> -build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
>>>>>>> -               struct ds *match, struct ds *action,
>>>>>>> -               const struct shash *meter_groups)
>>>>>>> +build_lb_rules_pre_stateful(struct hmap *lflows, struct ovn_northd_lb *lb,
>>>>>>> +                            bool ct_lb_mark, struct ds *match,
>>>>>>> +                            struct ds *action)
>>>>>>>  {
>>>>>>>      for (size_t i = 0; i < lb->n_vips; i++) {
>>>>>>>          struct ovn_lb_vip *lb_vip = &lb->vips[i];
>>>>>>> -        struct ovn_northd_lb_vip *lb_vip_nb = &lb->vips_nb[i];
>>>>>>> -        const char *ip_match = NULL;
>>>>>>> -
>>>>>>>          ds_clear(action);
>>>>>>>          ds_clear(match);
>>>>>>> +        const char *ip_match = NULL;
>>>>>>>
>>>>>>> -        /* Make sure that we clear the REGBIT_CONNTRACK_COMMIT flag. Otherwise
>>>>>>> -         * the load balanced packet will be committed again in
>>>>>>> -         * S_SWITCH_IN_STATEFUL. */
>>>>>>> -        ds_put_format(action, REGBIT_CONNTRACK_COMMIT" = 0; ");
>>>>>>>          /* Store the original destination IP to be used when generating
>>>>>>>           * hairpin flows.
>>>>>>>           */
>>>>>>> @@ -6887,6 +6853,67 @@ build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
>>>>>>>              ds_put_format(action, REG_ORIG_TP_DPORT " = %"PRIu16"; ",
>>>>>>>                            lb_vip->vip_port);
>>>>>>>          }
>>>>>>> +        ds_put_format(action, "%s;", ct_lb_mark ? "ct_lb_mark" : "ct_lb");
>>>>>>> +
>>>>>>> +        ds_put_format(match, "%s.dst == %s", ip_match, lb_vip->vip_str);
>>>>>>> +        if (lb_vip->vip_port) {
>>>>>>> +            ds_put_format(match, " && %s.dst == %d", proto, lb_vip->vip_port);
>>>>>>> +        }
>>>>>>> +
>>>>>>> +        struct ovn_lflow *lflow_ref = NULL;
>>>>>>> +        uint32_t hash = ovn_logical_flow_hash(
>>>>>>> +            ovn_stage_get_table(S_SWITCH_IN_PRE_STATEFUL),
>>>>>>> +            ovn_stage_get_pipeline(S_SWITCH_IN_PRE_STATEFUL), 120,
>>>>>>> +            ds_cstr(match), ds_cstr(action));
>>>>>>> +
>>>>>>> +        for (size_t j = 0; j < lb->n_nb_ls; j++) {
>>>>>>> +            struct ovn_datapath *od = lb->nb_ls[j];
>>>>>>> +
>>>>>>> +            if (!ovn_dp_group_add_with_reference(lflow_ref, od)) {
>>>>>>> +                lflow_ref = ovn_lflow_add_at_with_hash(
>>>>>>> +                    lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>>>>>> +                    ds_cstr(match), ds_cstr(action),
>>>>>>> +                    NULL, NULL, &lb->nlb->header_,
>>>>>>> +                    OVS_SOURCE_LOCATOR, hash);
>>>>>>> +            }
>>>>>>> +        }
>>>>>>> +    }
>>>>>>
>>>>>> I see how this can fix the dataplane issue because we're limiting the
>>>>>> number of dp flows, but this now induces a measurable penalty on the
>>>>>> control plane side because we essentially double the number of logical
>>>>>> flows that ovn-northd generates for load balancer VIPs.
>>>>>>
>>>>>> Using a NB database [2] from a 250 node density heavy test [3] we ran
>>>>>> in-house I see the following:
>>>>>>
>>>>>> a. before this patch (main):
>>>>>> - number of SB lflows: 146233
>>>>>> - SB size on disk (compacted): 70MB
>>>>>> - northd poll loop intervals: 8525ms
>>>>>>
>>>>>> b. with this patch:
>>>>>> - number of SB lflows: 163303
>>>>>> - SB size on disk (compacted): 76MB
>>>>>> - northd poll loop intervals: 9958ms
>>>>>>
>>>>>> [2] https://people.redhat.com/~dceara/250-density-heavy/20220830/ovnnb_db.db.gz
>>>>>> [3] https://github.com/dceara/ovn-heater/blob/main/test-scenarios/ocp-250-density-heavy.yml
>>>>>>
>>>>>> This raises some concerns. However, I went ahead and also ran a test
>>>>>> with Ilya's load balancer optimization series [4] (up for review) and I got:
>>>>>>
>>>>>> c. with this patch + Ilya's series:
>>>>>> - number of SB lflows: 163303 << Didn't change compared to "b"
>>>>>> - SB size on disk (compacted): 76MB << Didn't change compared to "b"
>>>>>> - northd poll loop intervals: 6437ms << 25% better than "a"
>>>>>>
>>>>>> Then I ran a test with main + Ilya's series.
>>>>>>
>>>>>> d. main + Ilya's series:
>>>>>> - number of SB lflows: 146233 << Didn't change compared to "a"
>>>>>> - SB size on disk (compacted): 70MB << Didn't change compared to "a"
>>>>>> - northd poll loop intervals: 5413ms << 35% better than "a"
>>>>>>
>>>>>> [4] https://patchwork.ozlabs.org/project/ovn/list/?series=315213&state=*
>>>>>>
>>>>>> I guess what I'm trying to say is that if we go ahead with the approach
>>>>>> you proposed, and especially if we want to backport it to the LTS, we
>>>>>> should consider accepting/backporting other optimizations too (such as
>>>>>> Ilya's) that would compensate (and more) for the control plane
>>>>>> performance hit.
>>>>>>
>>>>> Thanks for the scale test!
>>>>> I do expect some cost in the control plane, and your test result is aligned
>>>>> with my expectation. It is unpleasant to see a ~10% SB size increase (in the
>>>>> density-heavy and LB/service-heavy test scenario), but if it is necessary
>>>>> to fix the dataplane performance bug I think it is worth it. After all,
>>>>> everything the control plane does is to serve the dataplane, right? The
>>>>> resulting dataplane performance boost is at least 10x for short-lived
>>>>> connection scenarios.
>>>>> It's great to see that Ilya's patches help the LB scale, and I have no
>>>>> objection to backporting them to stable branches if that is considered
>>>>> necessary, but I'd consider it independent of the current patch.
>>>>>
>>>>
>>>> They are independent. But the problem is your fix will visibly impact
>>>> control plane latency (~16% on my setup). So, if we can avoid some of
>>>> that hit by considering both patchsets together, I'd vote for that.
>>>>
>>> Again, no objection to backporting Ilya's patches, but these are the
>>> criteria in my mind:
>>>
>>> 1. This patch improves dataplane performance (to me it is in fact
>>> fixing a performance bug) at the cost of the control plane. This is not
>>> uncommon in the OVN project's history - adding a new feature
>>> sometimes adds control plane cost. When this happens, whether to
>>> accept the new feature is based on the value of the feature vs. the cost
>>> to the control plane - a tradeoff has to be made in many cases. In this
>>> case, for a 16% control plane latency increase we get a 10x dataplane
>>> speedup, so I would vote for it.
>>
>> Hi Han,
>>
>> I'm not arguing that we don't get a dataplane speedup. I'm sure we will,
>> and it should be a decent amount. But I've been trying to replicate
>> this 10x boost with an ovn-kubernetes kind deployment and I didn't
>> really manage to.
>>
>> I did this on a 28 core Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz in our
>> lab.
>>
>> a. build an ovnkube image (with and without your patch); essentially
>> what our ovn-kubernetes CI jobs do:
>>
>> https://github.com/ovn-org/ovn/blob/main/.github/workflows/ovn-kubernetes.yml#L39
>>
>> b. start a kind cluster with the image we built at step "a".
>>
>> c. edit the ovs-node daemonset definition and increase the resource
>> limits to 16 CPUs and 2GB RSS (the kind defaults are very low: 0.2 CPU
>> and 300MB). This restarts the ovs-node pods; wait for those to come
>> back up.
>>
>> d. start an nginx deployment with 50 replicas: this is essentially 50
>> backends accepting http requests on port 80.
>>
>> e. create a service load balancing http requests received on a port,
>> e.g. 42424, to port 80 on the pods in the nginx deployment above.
>>
>> f. create a clients deployment with 100 replicas: 100 pods running the
>> following bash script, a hack to simulate 40 new connections with
>> different ephemeral source ports every 2 seconds:
>>
>> i=0
>> while true; do
>> i=$((i+1))
>> [ $((i%40)) -eq 0 ] && sleep 2
>> nc -z -v $SERVICE_VIP 42424 &
>> done
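>>
>> (That's roughly 100 clients x 40 connections / 2s = ~2000 new
>> connections per second across the cluster, if my arithmetic is right.)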
>>
>> Comparing the runs with and without your patch applied I see improvement:
>>
>> 1. without this patch applied:
>> - 17K, 18K, 27K datapath flows on the 3 nodes of the cluster
>> - OVS cpu usage on the 3 nodes: 30-50%
>>
>> 2. with this patch applied:
>> - 9K, 9.5K, 9K datapath flows on the 3 nodes of the cluster
>> - OVS cpu usage on the 3 nodes: 10-20%
>>
>> In both cases the whole system is not overloaded, OVS doesn't reach its
>> CPU limits.
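>>
>> For reference, I counted datapath flows and checked OVS CPU usage on each
>> node roughly like this (nothing fancy):
>>
>>   ovs-appctl dpctl/dump-flows | wc -l
>>   top -b -n 1 | grep ovs-vswitchd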
>>
>> I see a clear improvement but I was expecting more than this. It does
>> look like my test doesn't generate enough unique sources per second,
>> so I was wondering if you could share your test methodology?
>>
>
> Thanks Dumitru for the performance test. I guess if you reduce the number
> of clients while increasing the connection rate per client you would see
> a bigger difference. And I am not sure why you saw 10-20% OVS cpu usage with
> the patch applied. If all the 100 clients keep sending packets you should
> have a constant 100xN (N=2 or 3) megaflows instead of 9K (unless there are
> other problems causing flow misses). BTW, I assume you don't have HW-offload
> enabled, right? Could you also try sending to a backend directly instead of
> the VIP and see the difference?
No HWOL enabled on my platform. I think my setup was still too
overloaded (too many containers and virtual nodes on the same physical
machine).
>
> Here is a brief test I ran, with the patch below based on the system test
> case "load-balancing". It is basically a single client sending requests to
> one of the LB backends using "ab" (the Apache benchmark tool).
> -------------------------------------------------------------------------
> diff --git a/tests/system-ovn.at b/tests/system-ovn.at
> index 992813614..3adc72da1 100644
> --- a/tests/system-ovn.at
> +++ b/tests/system-ovn.at
> @@ -1430,10 +1430,12 @@ ovn-nbctl lsp-add bar bar3 \
>  # Config OVN load-balancer with a VIP.
>  ovn-nbctl lb-add lb1 30.0.0.1 "172.16.1.2,172.16.1.3,172.16.1.4"
>  ovn-nbctl ls-lb-add foo lb1
> +ovn-nbctl ls-lb-add bar lb1
>
>  # Create another load-balancer with another VIP.
>  lb2_uuid=`ovn-nbctl create load_balancer name=lb2 vips:30.0.0.3="172.16.1.2,172.16.1.3,172.16.1.4"`
>  ovn-nbctl ls-lb-add foo lb2
> +ovn-nbctl ls-lb-add bar lb2
>
>  # Config OVN load-balancer with another VIP (this time with ports).
>  ovn-nbctl set load_balancer $lb2_uuid vips:'"30.0.0.2:8000"'='"172.16.1.2:80,172.16.1.3:80,172.16.1.4:80"'
> @@ -1503,6 +1505,10 @@ tcp,orig=(src=192.168.1.2,dst=30.0.0.3,sport=<cleared>,dport=<cleared>),reply=(s
>
>  tcp,orig=(src=192.168.1.2,dst=30.0.0.3,sport=<cleared>,dport=<cleared>),reply=(src=172.16.1.4,dst=192.168.1.2,sport=<cleared>,dport=<cleared>),zone=<cleared>,mark=2,protoinfo=(state=<cleared>)
> ])
>
> +#check ovn-nbctl --apply-after-lb acl-add foo from-lport 1002 "ip4 && ip4.dst == {172.16.1.2,172.16.1.3,172.16.1.4} && tcp.dst == 80" allow-related
> +#check ovn-nbctl --apply-after-lb acl-add foo from-lport 1002 "ip4" allow-related
> +#check ovn-nbctl --wait=hv sync
> +
>  dnl Test load-balancing that includes L4 ports in NAT.
>  dnl Each server should have at least one connection.
>  OVS_WAIT_FOR_OUTPUT([
> @@ -1516,6 +1522,15 @@ tcp,orig=(src=192.168.1.2,dst=30.0.0.2,sport=<cleared>,dport=<cleared>),reply=(s
>
>  tcp,orig=(src=192.168.1.2,dst=30.0.0.2,sport=<cleared>,dport=<cleared>),reply=(src=172.16.1.4,dst=192.168.1.2,sport=<cleared>,dport=<cleared>),zone=<cleared>,mark=2,protoinfo=(state=<cleared>)
> ])
>
> +ip netns exec foo1 ab -n 10000 -c 1 -q 172.16.1.2:80/
> +
>  # Configure selection_fields.
>  ovn-nbctl set load_balancer $lb2_uuid selection_fields="ip_src,ip_dst,tp_src,tp_dst"
>  OVS_WAIT_UNTIL([
> -----------------------------------------------------------------------
>
> Tests were run on 24 cores x Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz.
> The test results are:
>
> // Before the change
> Concurrency Level: 1
> Time taken for tests: 12.913 seconds
> Complete requests: 10000
> Failed requests: 0
> Total transferred: 21450000 bytes
> HTML transferred: 19590000 bytes
> Requests per second: 774.41 [#/sec] (mean)
> Time per request: 1.291 [ms] (mean)
> Time per request: 1.291 [ms] (mean, across all concurrent requests)
> Transfer rate: 1622.18 [Kbytes/sec] received
>
> OVS CPU: ~280%
>
> // After the change
> Concurrency Level: 1
> Time taken for tests: 5.629 seconds
> Complete requests: 10000
> Failed requests: 0
> Total transferred: 21450000 bytes
> HTML transferred: 19590000 bytes
> Requests per second: 1776.62 [#/sec] (mean)
> Time per request: 0.563 [ms] (mean)
> Time per request: 0.563 [ms] (mean, across all concurrent requests)
> Transfer rate: 3721.54 [Kbytes/sec] received
>
> OVS CPU: ~0%
>
> We can see that the RPS is 2.5x higher (1776 vs. 774) even with a single
> thread (-c 1). It is not 10x as I guessed earlier, but keep in mind
> that the "ab" test has a complete TCP handshake, then request/response,
> and finally teardown, usually 6 packets per connection, so at least half
> of the packets still go through the fast path even before the change.
> But for the "first" packets of each connection, the speedup should
> be at the 10x level, which we can easily tell with a ping test by comparing
> the RTT of the first packet that goes through the slow path vs. the later
> packets that go through the fast path only.
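>
> For example (hypothetical numbers), something like:
>
>   ip netns exec foo1 ping -c 3 172.16.1.2
>
> would show the first reply taking ~10x longer than the later ones, since
> only the first packet goes through the slow path.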
I changed the test, on an ovn-kubernetes kind cluster, with netperf
TCP_CRR (connection + request + response):
https://github.com/jtaleric/k8s-netperf/pull/8
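For reference, the TCP_CRR runs boil down to an invocation roughly like
this (flags from memory, may need adjusting):

  netperf -H $SERVER_IP -t TCP_CRR -l 60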
>
> And the CPU utilization for OVS is huge before the change, while after the
> change it is ~0. So if we combine more complex scenarios with an overloaded
> system, I believe the difference would be even bigger.
>
I see a big difference in CPU usage too. It goes up to ~90% CPU (almost
one whole CPU) just for processing CRR between a client and a server,
while with your patch applied the CPU usage for OVS is negligible.
The number of DP flows also drops drastically, as expected.
Comparing the number of "transactions" (connect + request + response)
between a single pair of client + server I see on my machine:
a. without your patch: ~2K TPS
b. with your patch: ~6K TPS
Probably a coincidence because our platforms and tests are very
different, but I also see a 2-3x boost in transactions per second.
>> The OVN code changes look good to me. I think we should apply
>> them to the main branch. For that:
>>
>> Acked-by: Dumitru Ceara <[email protected]>
>>
>> For backporting the change to stable branches I'd really like to first
>> understand the exact scenario in which the megaflow unwildcarding
>> created such a big performance regression in ovn-kubernetes.
>
> Thank you for all the reviews and tests. I applied the patch to main and
> am waiting for your confirmation for the backports.
> For branch-22.09 I think maybe we should backport now, before it is
> released?
>
In light of your results and based on the findings above I think we
should accept the backport even if it might cause a small control plane
degradation. This is also keeping in mind Ilya's northd patches, which I
think are good candidates for stable branches too. But I'll let Mark
and Numan, from their maintainer perspective, share their thoughts on
this matter.
Regards,
Dumitru
> Han
>
>>
>> Thanks,
>> Dumitru
>>
>>>
>>> 2. For backporting Ilya's optimization, I'd evaluate it using the criteria
>>> for any optimization backport. Say the patch gives a 10-20% speedup, we
>>> think that speedup is critical for a branch for some reason, and the
>>> changes are well contained - then I'd vote for the backport.
>>>
>>> Does this make sense?
>>> @Numan Siddique <[email protected]> @Mark Michelson <[email protected]> any
>>> comments?
>>>
>>> Thanks,
>>> Han
>>>
>>>> Given that Ilya's changes seem very contained I don't think this should
>>>> be a problem. They just need a proper review first.
>>>>
>>>>>> I tried to think of an alternative that wouldn't require doubling the
>>>>>> number of VIP logical flows but couldn't come up with one. Maybe you
>>>>>> have more ideas?
>>>>>>
>>>>> This is the approach with the least impact on and changes to the existing
>>>>> pipelines that I have thought of so far. The earlier attempt wouldn't
>>>>> increase the number of flows, but it had to compromise by either failing to
>>>>> support a corner case (that we are still not sure is really useful at
>>>>> all) or losing HW offloading capability. I'll still think about some more
>>>>> approaches, but the NAT/LB/ACL pipelines are very complex now and I'd avoid
>>>>> big changes in case any corner case gets broken. I'd spend more time sorting
>>>>> out the different use cases and the documentation before making more
>>>>> optimizations, but I do think we should fix this performance bug as
>>>>> soon as we can. (I am also curious why we haven't heard other OVN users
>>>>> complain about this yet ...)
>>>>
>>>> It's a guess, but we probably didn't notice this issue because it should
>>>> be more visible when there are lots of short-lived connections.
>>>>
>>>> Thanks,
>>>> Dumitru
>>>>
>>>
>>
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev