On 8/31/22 02:17, Han Zhou wrote:
> On Tue, Aug 30, 2022 at 8:18 AM Dumitru Ceara <[email protected]> wrote:
>>
>> Hi Han,
>>
>> On 8/24/22 08:40, Han Zhou wrote:
>>> The ls_in_pre_stateful priority-120 flow that saves dst IP and port to
>>> registers is causing a critical dataplane performance impact for
>>> short-lived connections, because it unwildcards megaflows with an exact
>>> match on dst IP and L4 ports. Any new connection with a different
>>> client-side L4 port encounters a datapath flow miss and an upcall to
>>> ovs-vswitchd, which makes typical use cases such as HTTP/1.0-based
>>> RESTful API calls suffer big performance degradations.
>>>
>>> These fields (dst IP and port) were saved to registers to solve a
>>> problem in the LB hairpin use case when different VIPs share an
>>> overlapping backend+port [0]. The change [0] might not have had as wide
>>> a performance impact at the time because the match condition
>>> "REGBIT_CONNTRACK_NAT == 1" was set only for established and NATed
>>> traffic, while now the impact is more obvious because
>>> REGBIT_CONNTRACK_NAT has been set for all IP traffic (if any VIP is
>>> configured on the LS) since commit [1], after several other indirectly
>>> related optimizations and refactors.
>>>
>>> This patch fixes the problem by modifying the priority-120 flows in
>>> ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for any
>>> traffic with REGBIT_CONNTRACK_NAT == 1, we now save dst IP and L4 port
>>> only for traffic matching the LB VIPs, because these are the ones that
>>> need to be saved for the hairpin purpose. The existing priority-110
>>> flows will match the rest of the traffic just like before but will not
>>> save dst IP and L4 port, so server->client traffic will not unwildcard
>>> megaflows with client-side L4 ports.
>>>
>>
>> Just to be 100% sure, the client->server traffic will still generate up
>> to V x H flows where V is "the number of VIP:port tuples" and H is "the
>> number of potential (dp_)hash values", right?
> 
> Yes, if all the VIPs and backends are being accessed at the same time. This
> is expected. For any given client<->server pair, regardless of the number of
> connections (source port number changes), the packets should match the same
> megaflow and be handled by the fastpath (instead of going to userspace,
> which was the behavior before this patch).
> 

Thanks for the confirmation!
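
For reference, here is a rough before/after sketch of the ls_in_pre_stateful
priority-120 flows as I read the patch (the VIP 10.0.0.10:80 is purely
illustrative; the REG_*/REGBIT_* names are the northd.c macros, and ct_lb is
used instead of ct_lb_mark when the ct_no_masked_label feature is missing):

  Before (one flow per L4 protocol, for any traffic with the NAT bit set):
    match : REGBIT_CONNTRACK_NAT == 1 && ip4 && tcp
    action: REG_ORIG_DIP_IPV4 = ip4.dst; REG_ORIG_TP_DPORT = tcp.dst; ct_lb_mark;

  After (one flow per VIP[:port], added by build_lb_rules_pre_stateful()):
    match : ip4.dst == 10.0.0.10 && tcp.dst == 80
    action: REG_ORIG_DIP_IPV4 = ip4.dst; REG_ORIG_TP_DPORT = 80; ct_lb_mark;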

>>
>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with shared backends.")
>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")
>>>
>>> Signed-off-by: Han Zhou <[email protected]>
>>> ---
>>> v1 -> v2: Add the missing changes for ovn-northd.8.xml which I forgot to commit
>>>           before sending v1.
>>>
>>>  northd/northd.c         | 125 +++++++++++++++++++++++++---------------
>>>  northd/ovn-northd.8.xml |  14 ++---
>>>  tests/ovn-northd.at     |  87 ++++++++++------------------
>>>  tests/ovn.at            |  14 ++---
>>>  4 files changed, 118 insertions(+), 122 deletions(-)
>>>
>>> diff --git a/northd/northd.c b/northd/northd.c
>>> index 7e2681865..860641936 100644
>>> --- a/northd/northd.c
>>> +++ b/northd/northd.c
>>> @@ -273,15 +273,15 @@ enum ovn_stage {
>>>   * |    | REGBIT_ACL_HINT_{ALLOW_NEW/ALLOW/DROP/BLOCK} |   |                  |
>>>   * |    |     REGBIT_ACL_LABEL                         | X |                  |
>>>   * +----+----------------------------------------------+ X |                  |
>>> - * | R1 |         ORIG_DIP_IPV4 (>= IN_STATEFUL)       | R |                  |
>>> + * | R1 |         ORIG_DIP_IPV4 (>= IN_PRE_STATEFUL)   | R |                  |
>>>   * +----+----------------------------------------------+ E |                  |
>>> - * | R2 |         ORIG_TP_DPORT (>= IN_STATEFUL)       | G |                  |
>>> + * | R2 |         ORIG_TP_DPORT (>= IN_PRE_STATEFUL)   | G |                  |
>>>   * +----+----------------------------------------------+ 0 |                  |
>>>   * | R3 |                  ACL LABEL                   |   |                  |
>>>   * +----+----------------------------------------------+---+------------------+
>>>   * | R4 |                   UNUSED                     |   |                  |
>>> - * +----+----------------------------------------------+ X | ORIG_DIP_IPV6  |
>>> - * | R5 |                   UNUSED                     | X | (>= IN_STATEFUL) |
>>> + * +----+----------------------------------------------+ X | ORIG_DIP_IPV6(>= |
>>> + * | R5 |                   UNUSED                     | X | IN_PRE_STATEFUL) |
>>>   * +----+----------------------------------------------+ R |                  |
>>>   * | R6 |                   UNUSED                     | E |                  |
>>>   * +----+----------------------------------------------+ G |                  |
>>> @@ -5899,43 +5899,17 @@ build_pre_stateful(struct ovn_datapath *od,
>>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 0, "1", "next;");
>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 0, "1", "next;");
>>>
>>> -    const char *ct_lb_action = features->ct_no_masked_label
>>> -                               ? "ct_lb_mark"
>>> -                               : "ct_lb";
>>> -    const char *lb_protocols[] = {"tcp", "udp", "sctp"};
>>> -    struct ds actions = DS_EMPTY_INITIALIZER;
>>> -    struct ds match = DS_EMPTY_INITIALIZER;
>>> -
>>> -    for (size_t i = 0; i < ARRAY_SIZE(lb_protocols); i++) {
>>> -        ds_clear(&match);
>>> -        ds_clear(&actions);
>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip4 && %s",
>>> -                      lb_protocols[i]);
>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV4 " = ip4.dst; "
>>> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
>>> -                      lb_protocols[i], ct_lb_action);
>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>> -                      ds_cstr(&match), ds_cstr(&actions));
>>> -
>>> -        ds_clear(&match);
>>> -        ds_clear(&actions);
>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip6 && %s",
>>> -                      lb_protocols[i]);
>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV6 " = ip6.dst; "
>>> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
>>> -                      lb_protocols[i], ct_lb_action);
>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>> -                      ds_cstr(&match), ds_cstr(&actions));
>>> -    }
>>> +    /* Note: priority-120 flows are added in build_lb_rules_pre_stateful(). */
>>>
>>> -    ds_clear(&actions);
>>> -    ds_put_format(&actions, "%s;", ct_lb_action);
>>> +    const char *ct_lb_action = features->ct_no_masked_label
>>> +                               ? "ct_lb_mark;"
>>> +                               : "ct_lb;";
>>>
>>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 110,
>>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
>>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
>>>
>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 110,
>>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
>>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
>>>
>>>      /* If REGBIT_CONNTRACK_DEFRAG is set as 1, then the packets should be
>>>       * sent to conntrack for tracking and defragmentation. */
>>> @@ -5945,8 +5919,6 @@ build_pre_stateful(struct ovn_datapath *od,
>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 100,
>>>                    REGBIT_CONNTRACK_DEFRAG" == 1", "ct_next;");
>>>
>>> -    ds_destroy(&actions);
>>> -    ds_destroy(&match);
>>>  }
>>>
>>>  static void
>>> @@ -6841,22 +6813,16 @@ build_qos(struct ovn_datapath *od, struct hmap *lflows) {
>>>  }
>>>
>>>  static void
>>> -build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
>>> -               struct ds *match, struct ds *action,
>>> -               const struct shash *meter_groups)
>>> +build_lb_rules_pre_stateful(struct hmap *lflows, struct ovn_northd_lb *lb,
>>> +                            bool ct_lb_mark, struct ds *match,
>>> +                            struct ds *action)
>>>  {
>>>      for (size_t i = 0; i < lb->n_vips; i++) {
>>>          struct ovn_lb_vip *lb_vip = &lb->vips[i];
>>> -        struct ovn_northd_lb_vip *lb_vip_nb = &lb->vips_nb[i];
>>> -        const char *ip_match = NULL;
>>> -
>>>          ds_clear(action);
>>>          ds_clear(match);
>>> +        const char *ip_match = NULL;
>>>
>>> -        /* Make sure that we clear the REGBIT_CONNTRACK_COMMIT flag. Otherwise
>>> -         * the load balanced packet will be committed again in
>>> -         * S_SWITCH_IN_STATEFUL. */
>>> -        ds_put_format(action, REGBIT_CONNTRACK_COMMIT" = 0; ");
>>>          /* Store the original destination IP to be used when generating
>>>           * hairpin flows.
>>>           */
>>> @@ -6887,6 +6853,67 @@ build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
>>>              ds_put_format(action, REG_ORIG_TP_DPORT " = %"PRIu16"; ",
>>>                            lb_vip->vip_port);
>>>          }
>>> +        ds_put_format(action, "%s;", ct_lb_mark ? "ct_lb_mark" : "ct_lb");
>>> +
>>> +        ds_put_format(match, "%s.dst == %s", ip_match, lb_vip->vip_str);
>>> +        if (lb_vip->vip_port) {
>>> +            ds_put_format(match, " && %s.dst == %d", proto, lb_vip->vip_port);
>>> +        }
>>> +
>>> +        struct ovn_lflow *lflow_ref = NULL;
>>> +        uint32_t hash = ovn_logical_flow_hash(
>>> +                ovn_stage_get_table(S_SWITCH_IN_PRE_STATEFUL),
>>> +                ovn_stage_get_pipeline(S_SWITCH_IN_PRE_STATEFUL), 120,
>>> +                ds_cstr(match), ds_cstr(action));
>>> +
>>> +        for (size_t j = 0; j < lb->n_nb_ls; j++) {
>>> +            struct ovn_datapath *od = lb->nb_ls[j];
>>> +
>>> +            if (!ovn_dp_group_add_with_reference(lflow_ref, od)) {
>>> +                lflow_ref = ovn_lflow_add_at_with_hash(
>>> +                        lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>> +                        ds_cstr(match), ds_cstr(action),
>>> +                        NULL, NULL, &lb->nlb->header_,
>>> +                        OVS_SOURCE_LOCATOR, hash);
>>> +            }
>>> +        }
>>> +    }
>>
>> I see how this can fix the dataplane issue because we're limiting the
>> number of dp flows but this now induces a measurable penalty on the
>> control plane side because we essentially double the number of logical
>> flows that ovn-northd generates for load balancer VIPs.
>>
>> Using a NB database [2] from a 250-node density-heavy test [3] we ran
>> in-house, I see the following:
>>
>> a. before this patch (main):
>> - number of SB lflows:         146233
>> - SB size on disk (compacted): 70MB
>> - northd poll loop intervals:  8525ms
>>
>> b. with this patch:
>> - number of SB lflows:         163303
>> - SB size on disk (compacted): 76MB
>> - northd poll loop intervals:  9958ms
>>
>> [2]
>> https://people.redhat.com/~dceara/250-density-heavy/20220830/ovnnb_db.db.gz
>> [3]
>> https://github.com/dceara/ovn-heater/blob/main/test-scenarios/ocp-250-density-heavy.yml
>>
>> This raises some concerns.  However, I went ahead and also ran a test
>> with Ilya's load balancer optimization series [4] (up for review) and I got:
>>
>> c. with this patch + Ilya's series:
>> - number of SB lflows:         163303  << Didn't change compared to "b"
>> - SB size on disk (compacted): 76MB    << Didn't change compared to "b"
>> - northd poll loop intervals:  6437ms  << 25% better than "a"
>>
>> Then I ran a test with main + Ilya's series.
>>
>> d. main + Ilya's series:
>> - number of SB lflows:         146233  << Didn't change compared to "a"
>> - SB size on disk (compacted): 70MB    << Didn't change compared to "a"
>> - northd poll loop intervals:  5413ms  << 35% better than "a"
>>
>> [4] https://patchwork.ozlabs.org/project/ovn/list/?series=315213&state=*
>>
>> I guess what I'm trying to say is that if we go ahead with the approach you
>> proposed, and especially if we want to backport it to the LTS, we should
>> consider accepting/backporting other optimizations too (such as Ilya's)
>> that would compensate (and more) for the control plane performance hit.
>>
> Thanks for the scale test!
> I do expect some cost in the control plane, and your test result is aligned
> with my expectation. It is unpleasant to see ~10% SB size increase (in the
> density-heavy and LB/service-heavy test scenario), but if it is necessary
> to fix the dataplane performance bug I think it is worth it. After all,
> everything the control plane does is to serve the dataplane, right? The
> resulting dataplane performance boost is at least 10x for short-lived
> connection scenarios.
> It's great to see that Ilya's patches help the LB scale, and I have no
> objection to backporting them to stable branches if that is considered
> necessary, but I'd consider them independent of the current patch.
> 

They are independent.  But the problem is your fix will visibly impact
control plane latency (~16% on my setup).  So, if we can avoid some of
that hit by considering both patchsets together, I'd vote for that.

Given that Ilya's changes seem very contained I don't think this should
be a problem.  They just need a proper review first.

>> I tried to think of an alternative that wouldn't require doubling the
>> number of VIP logical flows but couldn't come up with one.  Maybe you
>> have more ideas?
>>
> This is the approach with the least impact and fewest changes to the
> existing pipelines that I have thought of so far. The earlier attempt
> wouldn't increase the number of flows, but it had to compromise by either
> failing to support a corner case (that we are still not sure is really
> useful at all) or losing HW offloading capability. I'll keep thinking about
> more approaches, but the NAT/LB/ACL pipelines are very complex now and I'd
> avoid big changes in case any corner case is broken.
> I'd spend more time sorting out the different use cases and the
> documentation before making more optimizations, but I do think we should
> fix this performance bug as soon as we can. (I am also curious why we
> haven't heard other OVN users complain about this yet ...)

It's a guess, but we probably didn't notice this issue before because it
only really becomes visible when there are lots of short-lived connections.
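
If I'm reading the mechanism right, that also explains the severity: with the
old priority-120 flows the action reads the L4 destination port, so for
server->client replies the resulting megaflow carries an exact match on the
client's ephemeral port, roughly (a simplified sketch, not literal
dpctl/dump-flows output):

  ipv4(dst=<client_ip>), tcp(dst=<client_ephemeral_port>)  -> one megaflow per connection

so workloads with many short-lived connections pay an upcall for every new
connection, while long-lived connections barely notice.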

Thanks,
Dumitru
