Hi Han,

On 8/24/22 08:40, Han Zhou wrote:
> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
> registers is causing a critical dataplane performance impact to
> short-lived connections, because it unwildcards megaflows with exact
> match on dst IP and L4 ports. Any new connections with a different
> client side L4 port will encounter datapath flow miss and upcall to
> ovs-vswitchd, which makes typical use cases such as HTTP1.0 based
> RESTful API calls suffer big performance degradations.
> 
> These fields (dst IP and port) were saved to registers to solve a
> problem of LB hairpin use case when different VIPs are sharing
> overlapping backend+port [0]. The change [0] might not have as wide
> performance impact as it is now because at that time one of the match
> condition "REGBIT_CONNTRACK_NAT == 1" was set only for established and
> natted traffic, while now the impact is more obvious because
> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
> configured on the LS) since commit [1], after several other indirectly
> related optimizations and refactors.
> 
> This patch fixes the problem by modifying the priority-120 flows in
> ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for any
> traffic with the REGBIT_CONNTRACK_NAT == 1, we now save dst IP and L4
> port only for traffic matching the LB VIPs, because these are the ones
> that need to be saved for the hairpin purpose. The existing priority-110
> flows will match the rest of the traffic just like before but won't
> save dst IP and L4 port, so any server->client traffic would not
> unwildcard megaflows with client side L4 ports.
> 

Just to be 100% sure, the client->server traffic will still generate up
to V x H flows where V is "the number of VIP:port tuples" and H is "the
number of potential (dp_)hash values", right?
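
To put rough numbers on that bound: one megaflow per (VIP:port tuple, dp_hash value) combination. A quick back-of-the-envelope, with purely illustrative values (these are not measurements from the test below):

```python
# Upper bound on client->server megaflows installed by the priority-120
# VIP flows: one per (VIP:port tuple, dp_hash value) combination.
# Both values below are illustrative assumptions, not measured numbers.
n_vip_port_tuples = 100   # V: VIP:port tuples configured on the switch
n_hash_values = 256       # H: distinct dp_hash values the datapath may emit

max_megaflows = n_vip_port_tuples * n_hash_values
print(max_megaflows)  # 25600
```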

> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with shared backends.")
> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")
> 
> Signed-off-by: Han Zhou <[email protected]>
> ---
> v1 -> v2: Add the missing changes for ovn-northd.8.xml which I forgot to commit
>           before sending v1.
> 
>  northd/northd.c         | 125 +++++++++++++++++++++++++---------------
>  northd/ovn-northd.8.xml |  14 ++---
>  tests/ovn-northd.at     |  87 ++++++++++------------------
>  tests/ovn.at            |  14 ++---
>  4 files changed, 118 insertions(+), 122 deletions(-)
> 
> diff --git a/northd/northd.c b/northd/northd.c
> index 7e2681865..860641936 100644
> --- a/northd/northd.c
> +++ b/northd/northd.c
> @@ -273,15 +273,15 @@ enum ovn_stage {
>   * |    | REGBIT_ACL_HINT_{ALLOW_NEW/ALLOW/DROP/BLOCK} |   |                  |
>   * |    |     REGBIT_ACL_LABEL                         | X |                  |
>   * +----+----------------------------------------------+ X |                  |
> - * | R1 |         ORIG_DIP_IPV4 (>= IN_STATEFUL)       | R |                  |
> + * | R1 |         ORIG_DIP_IPV4 (>= IN_PRE_STATEFUL)   | R |                  |
>   * +----+----------------------------------------------+ E |                  |
> - * | R2 |         ORIG_TP_DPORT (>= IN_STATEFUL)       | G |                  |
> + * | R2 |         ORIG_TP_DPORT (>= IN_PRE_STATEFUL)   | G |                  |
>   * +----+----------------------------------------------+ 0 |                  |
>   * | R3 |                  ACL LABEL                   |   |                  |
>   * +----+----------------------------------------------+---+------------------+
>   * | R4 |                   UNUSED                     |   |                  |
> - * +----+----------------------------------------------+ X |   ORIG_DIP_IPV6  |
> - * | R5 |                   UNUSED                     | X | (>= IN_STATEFUL) |
> + * +----+----------------------------------------------+ X | ORIG_DIP_IPV6(>= |
> + * | R5 |                   UNUSED                     | X | IN_PRE_STATEFUL) |
>   * +----+----------------------------------------------+ R |                  |
>   * | R6 |                   UNUSED                     | E |                  |
>   * +----+----------------------------------------------+ G |                  |
> @@ -5899,43 +5899,17 @@ build_pre_stateful(struct ovn_datapath *od,
>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 0, "1", "next;");
>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 0, "1", "next;");
>  
> -    const char *ct_lb_action = features->ct_no_masked_label
> -                               ? "ct_lb_mark"
> -                               : "ct_lb";
> -    const char *lb_protocols[] = {"tcp", "udp", "sctp"};
> -    struct ds actions = DS_EMPTY_INITIALIZER;
> -    struct ds match = DS_EMPTY_INITIALIZER;
> -
> -    for (size_t i = 0; i < ARRAY_SIZE(lb_protocols); i++) {
> -        ds_clear(&match);
> -        ds_clear(&actions);
> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip4 && %s",
> -                      lb_protocols[i]);
> -        ds_put_format(&actions, REG_ORIG_DIP_IPV4 " = ip4.dst; "
> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
> -                      lb_protocols[i], ct_lb_action);
> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> -                      ds_cstr(&match), ds_cstr(&actions));
> -
> -        ds_clear(&match);
> -        ds_clear(&actions);
> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip6 && %s",
> -                      lb_protocols[i]);
> -        ds_put_format(&actions, REG_ORIG_DIP_IPV6 " = ip6.dst; "
> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
> -                      lb_protocols[i], ct_lb_action);
> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> -                      ds_cstr(&match), ds_cstr(&actions));
> -    }
> +    /* Note: priority-120 flows are added in build_lb_rules_pre_stateful(). */
>  
> -    ds_clear(&actions);
> -    ds_put_format(&actions, "%s;", ct_lb_action);
> +    const char *ct_lb_action = features->ct_no_masked_label
> +                               ? "ct_lb_mark;"
> +                               : "ct_lb;";
>  
>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 110,
> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
>  
>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 110,
> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
>  
>      /* If REGBIT_CONNTRACK_DEFRAG is set as 1, then the packets should be
>       * sent to conntrack for tracking and defragmentation. */
> @@ -5945,8 +5919,6 @@ build_pre_stateful(struct ovn_datapath *od,
>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 100,
>                    REGBIT_CONNTRACK_DEFRAG" == 1", "ct_next;");
>  
> -    ds_destroy(&actions);
> -    ds_destroy(&match);
>  }
>  
>  static void
> @@ -6841,22 +6813,16 @@ build_qos(struct ovn_datapath *od, struct hmap *lflows) {
>  }
>  
>  static void
> -build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
> -               struct ds *match, struct ds *action,
> -               const struct shash *meter_groups)
> +build_lb_rules_pre_stateful(struct hmap *lflows, struct ovn_northd_lb *lb,
> +                            bool ct_lb_mark, struct ds *match,
> +                            struct ds *action)
>  {
>      for (size_t i = 0; i < lb->n_vips; i++) {
>          struct ovn_lb_vip *lb_vip = &lb->vips[i];
> -        struct ovn_northd_lb_vip *lb_vip_nb = &lb->vips_nb[i];
> -        const char *ip_match = NULL;
> -
>          ds_clear(action);
>          ds_clear(match);
> +        const char *ip_match = NULL;
>  
> -        /* Make sure that we clear the REGBIT_CONNTRACK_COMMIT flag.  Otherwise
> -         * the load balanced packet will be committed again in
> -         * S_SWITCH_IN_STATEFUL. */
> -        ds_put_format(action, REGBIT_CONNTRACK_COMMIT" = 0; ");
>          /* Store the original destination IP to be used when generating
>           * hairpin flows.
>           */
> @@ -6887,6 +6853,67 @@ build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
>              ds_put_format(action, REG_ORIG_TP_DPORT " = %"PRIu16"; ",
>                            lb_vip->vip_port);
>          }
> +        ds_put_format(action, "%s;", ct_lb_mark ? "ct_lb_mark" : "ct_lb");
> +
> +        ds_put_format(match, "%s.dst == %s", ip_match, lb_vip->vip_str);
> +        if (lb_vip->vip_port) {
> +            ds_put_format(match, " && %s.dst == %d", proto, lb_vip->vip_port);
> +        }
> +
> +        struct ovn_lflow *lflow_ref = NULL;
> +        uint32_t hash = ovn_logical_flow_hash(
> +                ovn_stage_get_table(S_SWITCH_IN_PRE_STATEFUL),
> +                ovn_stage_get_pipeline(S_SWITCH_IN_PRE_STATEFUL), 120,
> +                ds_cstr(match), ds_cstr(action));
> +
> +        for (size_t j = 0; j < lb->n_nb_ls; j++) {
> +            struct ovn_datapath *od = lb->nb_ls[j];
> +
> +            if (!ovn_dp_group_add_with_reference(lflow_ref, od)) {
> +                lflow_ref = ovn_lflow_add_at_with_hash(
> +                        lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> +                        ds_cstr(match), ds_cstr(action),
> +                        NULL, NULL, &lb->nlb->header_,
> +                        OVS_SOURCE_LOCATOR, hash);
> +            }
> +        }
> +    }
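
For what it's worth, the hash-plus-reference pattern in the hunk above (one logical flow shared across datapaths via ovn_dp_group_add_with_reference()) boils down to the following dedup scheme. This is a minimal Python sketch with made-up names, not the actual northd data structures; the match/action strings are just examples of what the priority-120 VIP flows look like:

```python
# Sketch of the datapath-group sharing used above: the first datapath
# that needs a given (stage, priority, match, action) tuple creates the
# logical flow; later datapaths only add a reference to the existing
# flow's datapath group.  All names here are illustrative.
lflows = {}  # (stage, priority, match, action) -> set of datapath names

def add_lflow(stage, priority, match, action, datapath):
    key = (stage, priority, match, action)
    if key in lflows:
        # Analogue of ovn_dp_group_add_with_reference(): reuse the flow.
        lflows[key].add(datapath)
    else:
        # Analogue of ovn_lflow_add_at_with_hash(): create it once.
        lflows[key] = {datapath}

# Three logical switches sharing the same LB VIP produce one shared flow.
for ls in ["ls1", "ls2", "ls3"]:
    add_lflow("ls_in_pre_stateful", 120,
              "ip4.dst == 10.0.0.10 && tcp.dst == 80",
              "reg1 = ip4.dst; reg2[0..15] = tcp.dst; ct_lb_mark;", ls)

print(len(lflows))                       # 1: a single shared logical flow
print(len(lflows[next(iter(lflows))]))   # 3: datapath references on it
```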

I see how this can fix the dataplane issue, because we're limiting the
number of dp flows, but it now induces a measurable penalty on the
control plane side: we essentially double the number of logical flows
that ovn-northd generates for load balancer VIPs.
Using a NB database [2] from a 250 node density heavy test [3] we ran
in-house, I see the following:

a. before this patch (main):
- number of SB lflows:         146233
- SB size on disk (compacted): 70MB
- northd poll loop intervals:  8525ms

b. with this patch:
- number of SB lflows:         163303
- SB size on disk (compacted): 76MB
- northd poll loop intervals:  9958ms

[2]
https://people.redhat.com/~dceara/250-density-heavy/20220830/ovnnb_db.db.gz
[3]
https://github.com/dceara/ovn-heater/blob/main/test-scenarios/ocp-250-density-heavy.yml

This raises some concerns.  However, I went ahead and also ran a test
with Ilya's load balancer optimization series [4] (up for review) and I got:

c. with this patch + Ilya's series:
- number of SB lflows:         163303  << Didn't change compared to "b"
- SB size on disk (compacted): 76MB    << Didn't change compared to "b"
- northd poll loop intervals:  6437ms  << 25% better than "a"

Then I ran a test with main + Ilya's series.

d. main + Ilya's series:
- number of SB lflows:         146233  << Didn't change compared to "a"
- SB size on disk (compacted): 70MB    << Didn't change compared to "a"
- northd poll loop intervals:  5413ms  << 35% better than "a"

[4] https://patchwork.ozlabs.org/project/ovn/list/?series=315213&state=*
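
Recomputing the relative deltas from the four measurements above, just to make the comparison explicit:

```python
# Relative control-plane deltas vs. baseline "a", from the numbers above.
runs = {
    "a (main)":               {"lflows": 146233, "poll_ms": 8525},
    "b (patch)":              {"lflows": 163303, "poll_ms": 9958},
    "c (patch + series [4])": {"lflows": 163303, "poll_ms": 6437},
    "d (main + series [4])":  {"lflows": 146233, "poll_ms": 5413},
}

base = runs["a (main)"]
for name, r in runs.items():
    lflow_pct = 100.0 * (r["lflows"] - base["lflows"]) / base["lflows"]
    poll_pct = 100.0 * (r["poll_ms"] - base["poll_ms"]) / base["poll_ms"]
    # b: lflows +11.7%, poll +16.8%; c: poll -24.5%; d: poll -36.5%
    print(f"{name}: lflows {lflow_pct:+.1f}%, poll interval {poll_pct:+.1f}%")
```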

I guess what I'm trying to say is that if we go ahead with the approach
you proposed, and especially if we want to backport it to the LTS, we
should consider accepting/backporting other optimizations too (such as
Ilya's) that would compensate (and more) for the control plane
performance hit.

I tried to think of an alternative that wouldn't require doubling the
number of VIP logical flows but couldn't come up with one.  Maybe you
have more ideas?

Thanks,
Dumitru

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev