On 8/31/22 16:59, Han Zhou wrote:
> On Wed, Aug 31, 2022 at 1:38 AM Dumitru Ceara <[email protected]> wrote:
>>
>> On 8/31/22 02:17, Han Zhou wrote:
>>> On Tue, Aug 30, 2022 at 8:18 AM Dumitru Ceara <[email protected]> wrote:
>>>>
>>>> Hi Han,
>>>>
>>>> On 8/24/22 08:40, Han Zhou wrote:
>>>>> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
>>>>> registers is causing a critical dataplane performance impact to
>>>>> short-lived connections, because it unwildcards megaflows with exact
>>>>> match on dst IP and L4 ports. Any new connections with a different
>>>>> client side L4 port will encounter datapath flow miss and upcall to
>>>>> ovs-vswitchd, which makes typical use cases such as HTTP1.0 based
>>>>> RESTful API calls suffer big performance degradations.
>>>>>
>>>>> These fields (dst IP and port) were saved to registers to solve a
>>>>> problem of LB hairpin use case when different VIPs are sharing
>>>>> overlapping backend+port [0]. The change [0] might not have as wide
>>>>> performance impact as it is now because at that time one of the match
>>>>> condition "REGBIT_CONNTRACK_NAT == 1" was set only for established and
>>>>> natted traffic, while now the impact is more obvious because
>>>>> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
>>>>> configured on the LS) since commit [1], after several other indirectly
>>>>> related optimizations and refactors.
>>>>>
>>>>> This patch fixes the problem by modifying the priority-120 flows in
>>>>> ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for
>>>>> any traffic with the REGBIT_CONNTRACK_NAT == 1, we now save dst IP and
>>>>> L4 port only for traffic matching the LB VIPs, because these are the
>>>>> ones that need to be saved for the hairpin purpose. The existing
>>>>> priority-110 flows will match the rest of the traffic just like before
>>>>> but would not save dst IP and L4 port, so any server->client traffic
>>>>> would not unwildcard megaflows with client side L4 ports.
>>>>>
>>>>
>>>> Just to be 100% sure, the client->server traffic will still generate up
>>>> to V x H flows where V is "the number of VIP:port tuples" and H is "the
>>>> number of potential (dp_)hash values", right?
>>>
>>> Yes, if all the VIPs and backends are being accessed at the same time.
>>> This is expected. For any given pair of client<->server, regardless of
>>> the connections (source port number changes), the packets should match
>>> the same mega-flow and be handled by fastpath (instead of going to
>>> userspace, which is the behavior before this patch).
>>>
>>
>> Thanks for the confirmation!
>>
>>>>
>>>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with
>>>>>     shared backends.")
>>>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer
>>>>>     traffic.")
>>>>>
>>>>> Signed-off-by: Han Zhou <[email protected]>
>>>>> ---
>>>>> v1 -> v2: Add the missing changes for ovn-northd.8.xml which I forgot
>>>>>           to commit before sending v1.
>>>>>
>>>>>  northd/northd.c         | 125 +++++++++++++++++++++++++---------------
>>>>>  northd/ovn-northd.8.xml |  14 ++---
>>>>>  tests/ovn-northd.at     |  87 ++++++++++------------------
>>>>>  tests/ovn.at            |  14 ++---
>>>>>  4 files changed, 118 insertions(+), 122 deletions(-)
>>>>>
>>>>> diff --git a/northd/northd.c b/northd/northd.c
>>>>> index 7e2681865..860641936 100644
>>>>> --- a/northd/northd.c
>>>>> +++ b/northd/northd.c
>>>>> @@ -273,15 +273,15 @@ enum ovn_stage {
>>>>>   * |    | REGBIT_ACL_HINT_{ALLOW_NEW/ALLOW/DROP/BLOCK} |   |                  |
>>>>>   * |    |     REGBIT_ACL_LABEL                         | X |                  |
>>>>>   * +----+----------------------------------------------+ X |                  |
>>>>> - * | R1 |         ORIG_DIP_IPV4 (>= IN_STATEFUL)       | R |                  |
>>>>> + * | R1 |         ORIG_DIP_IPV4 (>= IN_PRE_STATEFUL)   | R |                  |
>>>>>   * +----+----------------------------------------------+ E |                  |
>>>>> - * | R2 |         ORIG_TP_DPORT (>= IN_STATEFUL)       | G |                  |
>>>>> + * | R2 |         ORIG_TP_DPORT (>= IN_PRE_STATEFUL)   | G |                  |
>>>>>   * +----+----------------------------------------------+ 0 |                  |
>>>>>   * | R3 |                  ACL LABEL                   |   |                  |
>>>>>   * +----+----------------------------------------------+---+------------------+
>>>>>   * | R4 |                   UNUSED                     |   |                  |
>>>>> - * +----+----------------------------------------------+ X |   ORIG_DIP_IPV6  |
>>>>> - * | R5 |                   UNUSED                     | X | (>= IN_STATEFUL) |
>>>>> + * +----+----------------------------------------------+ X | ORIG_DIP_IPV6(>= |
>>>>> + * | R5 |                   UNUSED                     | X | IN_PRE_STATEFUL) |
>>>>>   * +----+----------------------------------------------+ R |                  |
>>>>>   * | R6 |                   UNUSED                     | E |                  |
>>>>>   * +----+----------------------------------------------+ G |                  |
>>>>> @@ -5899,43 +5899,17 @@ build_pre_stateful(struct ovn_datapath *od,
>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 0, "1", "next;");
>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 0, "1", "next;");
>>>>>
>>>>> -    const char *ct_lb_action = features->ct_no_masked_label
>>>>> -                               ? "ct_lb_mark"
>>>>> -                               : "ct_lb";
>>>>> -    const char *lb_protocols[] = {"tcp", "udp", "sctp"};
>>>>> -    struct ds actions = DS_EMPTY_INITIALIZER;
>>>>> -    struct ds match = DS_EMPTY_INITIALIZER;
>>>>> -
>>>>> -    for (size_t i = 0; i < ARRAY_SIZE(lb_protocols); i++) {
>>>>> -        ds_clear(&match);
>>>>> -        ds_clear(&actions);
>>>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip4 && %s",
>>>>> -                      lb_protocols[i]);
>>>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV4 " = ip4.dst; "
>>>>> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
>>>>> -                      lb_protocols[i], ct_lb_action);
>>>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>>>> -                      ds_cstr(&match), ds_cstr(&actions));
>>>>> -
>>>>> -        ds_clear(&match);
>>>>> -        ds_clear(&actions);
>>>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip6 && %s",
>>>>> -                      lb_protocols[i]);
>>>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV6 " = ip6.dst; "
>>>>> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
>>>>> -                      lb_protocols[i], ct_lb_action);
>>>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>>>> -                      ds_cstr(&match), ds_cstr(&actions));
>>>>> -    }
>>>>> +    /* Note: priority-120 flows are added in build_lb_rules_pre_stateful(). */
>>>>>
>>>>> -    ds_clear(&actions);
>>>>> -    ds_put_format(&actions, "%s;", ct_lb_action);
>>>>> +    const char *ct_lb_action = features->ct_no_masked_label
>>>>> +                               ? "ct_lb_mark;"
>>>>> +                               : "ct_lb;";
>>>>>
>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 110,
>>>>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
>>>>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
>>>>>
>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 110,
>>>>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
>>>>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
>>>>>
>>>>>      /* If REGBIT_CONNTRACK_DEFRAG is set as 1, then the packets should be
>>>>>       * sent to conntrack for tracking and defragmentation. */
>>>>> @@ -5945,8 +5919,6 @@ build_pre_stateful(struct ovn_datapath *od,
>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 100,
>>>>>                    REGBIT_CONNTRACK_DEFRAG" == 1", "ct_next;");
>>>>>
>>>>> -    ds_destroy(&actions);
>>>>> -    ds_destroy(&match);
>>>>>  }
>>>>>
>>>>>  static void
>>>>> @@ -6841,22 +6813,16 @@ build_qos(struct ovn_datapath *od, struct hmap *lflows) {
>>>>>  }
>>>>>
>>>>>  static void
>>>>> -build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
>>>>> -               struct ds *match, struct ds *action,
>>>>> -               const struct shash *meter_groups)
>>>>> +build_lb_rules_pre_stateful(struct hmap *lflows, struct ovn_northd_lb *lb,
>>>>> +                            bool ct_lb_mark, struct ds *match,
>>>>> +                            struct ds *action)
>>>>>  {
>>>>>      for (size_t i = 0; i < lb->n_vips; i++) {
>>>>>          struct ovn_lb_vip *lb_vip = &lb->vips[i];
>>>>> -        struct ovn_northd_lb_vip *lb_vip_nb = &lb->vips_nb[i];
>>>>> -        const char *ip_match = NULL;
>>>>> -
>>>>>          ds_clear(action);
>>>>>          ds_clear(match);
>>>>> +        const char *ip_match = NULL;
>>>>>
>>>>> -        /* Make sure that we clear the REGBIT_CONNTRACK_COMMIT flag. Otherwise
>>>>> -         * the load balanced packet will be committed again in
>>>>> -         * S_SWITCH_IN_STATEFUL. */
>>>>> -        ds_put_format(action, REGBIT_CONNTRACK_COMMIT" = 0; ");
>>>>>          /* Store the original destination IP to be used when generating
>>>>>           * hairpin flows.
>>>>>           */
>>>>> @@ -6887,6 +6853,67 @@ build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
>>>>>              ds_put_format(action, REG_ORIG_TP_DPORT " = %"PRIu16"; ",
>>>>>                            lb_vip->vip_port);
>>>>>          }
>>>>> +        ds_put_format(action, "%s;", ct_lb_mark ? "ct_lb_mark" : "ct_lb");
>>>>> +
>>>>> +        ds_put_format(match, "%s.dst == %s", ip_match, lb_vip->vip_str);
>>>>> +        if (lb_vip->vip_port) {
>>>>> +            ds_put_format(match, " && %s.dst == %d", proto, lb_vip->vip_port);
>>>>> +        }
>>>>> +
>>>>> +        struct ovn_lflow *lflow_ref = NULL;
>>>>> +        uint32_t hash = ovn_logical_flow_hash(
>>>>> +                ovn_stage_get_table(S_SWITCH_IN_PRE_STATEFUL),
>>>>> +                ovn_stage_get_pipeline(S_SWITCH_IN_PRE_STATEFUL), 120,
>>>>> +                ds_cstr(match), ds_cstr(action));
>>>>> +
>>>>> +        for (size_t j = 0; j < lb->n_nb_ls; j++) {
>>>>> +            struct ovn_datapath *od = lb->nb_ls[j];
>>>>> +
>>>>> +            if (!ovn_dp_group_add_with_reference(lflow_ref, od)) {
>>>>> +                lflow_ref = ovn_lflow_add_at_with_hash(
>>>>> +                        lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>>>> +                        ds_cstr(match), ds_cstr(action),
>>>>> +                        NULL, NULL, &lb->nlb->header_,
>>>>> +                        OVS_SOURCE_LOCATOR, hash);
>>>>> +            }
>>>>> +        }
>>>>> +    }
>>>>
>>>> I see how this can fix the dataplane issue because we're limiting the
>>>> number of dp flows, but this now induces a measurable penalty on the
>>>> control plane side because we essentially double the number of logical
>>>> flows that ovn-northd generates for load balancer VIPs.
>>>>
>>>> Using a NB database [2] from a 250 node density heavy test [3] we ran
>>>> in-house I see the following:
>>>>
>>>> a. before this patch (main):
>>>> - number of SB lflows:         146233
>>>> - SB size on disk (compacted): 70MB
>>>> - northd poll loop intervals:  8525ms
>>>>
>>>> b. with this patch:
>>>> - number of SB lflows:         163303
>>>> - SB size on disk (compacted): 76MB
>>>> - northd poll loop intervals:  9958ms
>>>>
>>>> [2] https://people.redhat.com/~dceara/250-density-heavy/20220830/ovnnb_db.db.gz
>>>> [3] https://github.com/dceara/ovn-heater/blob/main/test-scenarios/ocp-250-density-heavy.yml
>>>>
>>>> This raises some concerns.  However, I went ahead and also ran a test
>>>> with Ilya's load balancer optimization series [4] (up for review) and I got:
>>>>
>>>> c. with this patch + Ilya's series:
>>>> - number of SB lflows:         163303  << Didn't change compared to "b"
>>>> - SB size on disk (compacted): 76MB    << Didn't change compared to "b"
>>>> - northd poll loop intervals:  6437ms  << 25% better than "a"
>>>>
>>>> Then I ran a test with main + Ilya's series.
>>>>
>>>> d. main + Ilya's series:
>>>> - number of SB lflows:         146233  << Didn't change compared to "a"
>>>> - SB size on disk (compacted): 70MB    << Didn't change compared to "a"
>>>> - northd poll loop intervals:  5413ms  << 35% better than "a"
>>>>
>>>> [4] https://patchwork.ozlabs.org/project/ovn/list/?series=315213&state=*
>>>>
>>>> I guess what I'm trying to say is that if we go ahead with the approach
>>>> you proposed, and especially if we want to backport it to the LTS, we should
>>>> consider accepting/backporting other optimizations too (such as Ilya's)
>>>> that would compensate (and more) for the control plane performance hit.
>>>>
>>> Thanks for the scale test!
>>> I do expect some cost in the control plane, and your test result is
>>> aligned with my expectation. It is unpleasant to see a ~10% SB size
>>> increase (in the density-heavy and LB/service-heavy test scenario), but
>>> if it is necessary to fix the dataplane performance bug I think it is
>>> worth it. After all, everything the control plane does is to serve the
>>> dataplane, right? The resulting dataplane performance boost is at least
>>> 10x for short-lived connection scenarios.
>>> It's great to see Ilya's patches help the LB scale, and I have no
>>> objection to backporting them to stable branches if that is considered
>>> necessary, but I'd consider it independent of the current patch.
>>>
>>
>> They are independent.  But the problem is your fix will visibly impact
>> control plane latency (~16% on my setup).  So, if we can avoid some of
>> that hit by considering both patchsets together, I'd vote for that.
>>
> Again, no objection to backporting Ilya's patches, but these are the
> criteria in my mind:
> 
> 1. For this patch, it improves dataplane performance (to me it is in fact
> fixing a performance bug) at the cost of the control plane. This is not
> uncommon in the OVN project history - when adding some new features
> sometimes it does add control plane cost. When this happens, whether to
> accept the new feature is based on the value of the feature vs. the cost
> of the control plane - a tradeoff has to be made in many cases. In this
> case, for a 16% control plane latency increase we get a 10x dataplane
> speed up, so I would vote for it.

Hi Han,

I'm not arguing that we don't get dataplane speed up.  I'm sure we will
and it should be a decent amount.  But I've been trying to replicate
this 10x boost with an ovn-kubernetes kind deployment and I didn't
really manage to.

I did this on a 28 core Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz in our
lab.

a. build an ovnkube image (with and without your patch); essentially
what our ovn-kubernetes CI jobs do:

https://github.com/ovn-org/ovn/blob/main/.github/workflows/ovn-kubernetes.yml#L39

b. start a kind cluster with the image we built at step "a".

c. edit the ovs-node daemonset definition and increase the resource
limits to 16 CPUs and 2GB RSS (the kind defaults are very low: 0.2 CPU
and 300MB).  This restarts the ovs-daemon pods; wait for those to come
back up.

d. start an nginx deployment with 50 replicas: this is essentially 50
backends accepting http requests on port 80.

e. create a service load balancing http requests received on a port,
e.g. 42424, to port 80 on the pods in the nginx deployment above.

f. create a clients deployment with 100 replicas: 100 pods running the
following bash script, a hack to simulate 40 new connections with
different ephemeral source ports every 2 seconds:

i=0
while true; do
  i=$((i+1))
  # Pause for 2 seconds after every 40 probes.
  [ $((i%40)) -eq 0 ] && sleep 2
  # Zero-I/O probe; each one connects from a fresh ephemeral source port.
  nc -z -v "$SERVICE_VIP" 42424 &
done
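Each `nc -z` probe above connects from a fresh ephemeral source port, which is exactly what unwildcards a new megaflow per connection when the datapath flow matches exactly on the client-side L4 port. A minimal local sketch of that port churn (my own illustration against a throwaway localhost listener, not part of the kind setup):

```python
import socket
import threading

def serve(sock):
    # Accept and immediately close each connection, roughly what the
    # server side of an `nc -z` probe sees.
    while True:
        try:
            conn, _ = sock.accept()
            conn.close()
        except OSError:
            break  # listener was closed

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))  # throwaway local listener
srv.listen(128)
threading.Thread(target=serve, args=(srv,), daemon=True).start()

ports = set()
for _ in range(40):
    c = socket.create_connection(srv.getsockname())
    ports.add(c.getsockname()[1])  # client-side ephemeral port
    c.close()
srv.close()

print(f"unique client source ports: {len(ports)}")
```

All 40 short-lived connections come from distinct source ports, so with an exact tp_src match each one would be a datapath flow miss and an upcall.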

Comparing the runs with and without your patch applied I see improvement:

1. without this patch applied:
- 17K, 18K, 27K datapath flows on the 3 nodes of the cluster
- OVS cpu usage on the 3 nodes: 30-50%

2. with this patch applied:
- 9K, 9.5K, 9.K datapath flows on the 3 nodes of the cluster
- OVS cpu usage on the 3 nodes: 10-20%

In both cases the whole system is not overloaded; OVS doesn't reach its
CPU limits.

I see clear improvement but I was expecting to see more than this.  It
does look like my test doesn't generate enough unique sources per second
so I was wondering if you could share your test methodology?
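In case it helps compare notes, one way to push more unique source ports per second than the backgrounded `nc` loop is a small concurrent client. This is only a sketch under my own assumptions: the localhost listener stands in for the service VIP, and the connection count is arbitrary.

```python
import asyncio
import time

async def probe(host, port):
    # One short-lived connection: connect, then close, like `nc -z`.
    reader, writer = await asyncio.open_connection(host, port)
    writer.close()
    await writer.wait_closed()

async def main(n=200):
    # Local stand-in listener; a real run would target the service VIP
    # instead of starting this server.
    server = await asyncio.start_server(lambda r, w: w.close(),
                                        "127.0.0.1", 0, backlog=n)
    host, port = server.sockets[0].getsockname()[:2]

    start = time.monotonic()
    # Open all n connections concurrently; each gets its own ephemeral
    # source port, so each is a potential new megaflow.
    await asyncio.gather(*(probe(host, port) for _ in range(n)))
    elapsed = time.monotonic() - start
    print(f"{n} connections in {elapsed:.2f}s")

    server.close()
    await server.wait_closed()
    return n

completed = asyncio.run(main())
```

Scaling `n` up (and running several of these per client pod) should generate far more unique sources per second than 40 every 2 seconds.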

For the OVN code changes, they look good to me.  I think we should apply
them to the main branch.  For that:

Acked-by: Dumitru Ceara <[email protected]>

For backporting the change to stable branches I'd really like to wait
until we first understand the exact scenario in which the megaflow
unwildcarding created such a big performance regression in ovn-kubernetes.

Thanks,
Dumitru

> 
> 2. For backporting Ilya's optimization, I'd evaluate using the criteria for
> any optimization backporting. Say the patch has 10-20% speed up, and we
> think that speed up is critical for a branch for some reason, and the
> changes are well contained, I'd vote for the backport.
> 
> Does this make sense?
> @Numan Siddique <[email protected]> @Mark Michelson <[email protected]> any
> comments?
> 
> Thanks,
> Han
> 
>> Given that Ilya's changes seem very contained I don't think this should
>> be a problem.  They just need a proper review first.
>>
>>>> I tried to think of an alternative that wouldn't require doubling the
>>>> number of VIP logical flows but couldn't come up with one.  Maybe you
>>>> have more ideas?
>>>>
>>> This is the approach with the least impact and fewest changes to the
>>> existing pipelines that I have thought about so far. The earlier attempt
>>> wouldn't increase the number of flows, but it had to compromise by
>>> either failing to support a corner case (that we are still not sure is
>>> really useful at all) or losing HW offloading capability. I'll still
>>> think about some more approaches, but the NAT/LB/ACL pipelines are very
>>> complex now and I'd avoid big changes in case any corner case is broken.
>>> I'd spend more time sorting out different use cases and the
>>> documentation before making more optimizations, but I do think we should
>>> fix this performance bug as soon as we can. (I am also curious why we
>>> haven't heard other OVN users complain about this yet ...)
>>
>> It's a guess, but we probably didn't notice this issue because it should
>> be more visible when there are lots of short-lived connections.
>>
>> Thanks,
>> Dumitru
>>
> 

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
