On Tue, Aug 30, 2022 at 8:18 AM Dumitru Ceara <[email protected]> wrote:
>
> Hi Han,
>
> On 8/24/22 08:40, Han Zhou wrote:
> > The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
> > registers is causing a critical dataplane performance impact to
> > short-lived connections, because it unwildcards megaflows with exact
> > match on dst IP and L4 ports. Any new connections with a different
> > client side L4 port will encounter datapath flow miss and upcall to
> > ovs-vswitchd, which makes typical use cases such as HTTP1.0 based
> > RESTful API calls suffer significant performance degradation.
> >
> > These fields (dst IP and port) were saved to registers to solve a
> > problem of LB hairpin use case when different VIPs are sharing
> > overlapping backend+port [0]. The change [0] might not have had as
> > wide a performance impact then as it does now, because at that time
> > one of the match conditions, "REGBIT_CONNTRACK_NAT == 1", was set only
> > for established and natted traffic. The impact is more obvious now
> > because REGBIT_CONNTRACK_NAT has been set for all IP traffic (if any
> > VIP is configured on the LS) since commit [1], after several other
> > indirectly related optimizations and refactors.
> >
> > This patch fixes the problem by modifying the priority-120 flows in
> > ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for any
> > traffic with the REGBIT_CONNTRACK_NAT == 1, we now save dst IP and L4
> > port only for traffic matching the LB VIPs, because these are the ones
> > that need to be saved for the hairpin purpose. The existing
> > priority-110 flows will match the rest of the traffic just as before,
> > but will not save dst IP and L4 port, so server->client traffic will
> > not unwildcard megaflows with client side L4 ports.
> >
>
> Just to be 100% sure, the client->server traffic will still generate up
> to V x H flows where V is "the number of VIP:port tuples" and H is "the
> number of potential (dp_)hash values", right?
Yes, if all the VIPs and backends are being accessed at the same time. This
is expected. For any given client<->server pair, regardless of the number of
connections (i.e. changes in the source port number), the packets should
match the same megaflow and be handled by the fast path (instead of going to
userspace, which was the behavior before this patch).
>
> > [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with shared backends.")
> > [1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")
> >
> > Signed-off-by: Han Zhou <[email protected]>
> > ---
> > v1 -> v2: Add the missing changes for ovn-northd.8.xml which I forgot
> > to commit before sending v1.
> >
> > northd/northd.c | 125 +++++++++++++++++++++++++---------------
> > northd/ovn-northd.8.xml | 14 ++---
> > tests/ovn-northd.at | 87 ++++++++++------------------
> > tests/ovn.at | 14 ++---
> > 4 files changed, 118 insertions(+), 122 deletions(-)
> >
> > diff --git a/northd/northd.c b/northd/northd.c
> > index 7e2681865..860641936 100644
> > --- a/northd/northd.c
> > +++ b/northd/northd.c
> > @@ -273,15 +273,15 @@ enum ovn_stage {
> >  * |    | REGBIT_ACL_HINT_{ALLOW_NEW/ALLOW/DROP/BLOCK} |   |                  |
> >  * |    | REGBIT_ACL_LABEL                             | X |                  |
> >  * +----+----------------------------------------------+ X |                  |
> > - * | R1 | ORIG_DIP_IPV4 (>= IN_STATEFUL)               | R |                  |
> > + * | R1 | ORIG_DIP_IPV4 (>= IN_PRE_STATEFUL)           | R |                  |
> >  * +----+----------------------------------------------+ E |                  |
> > - * | R2 | ORIG_TP_DPORT (>= IN_STATEFUL)               | G |                  |
> > + * | R2 | ORIG_TP_DPORT (>= IN_PRE_STATEFUL)           | G |                  |
> >  * +----+----------------------------------------------+ 0 |                  |
> >  * | R3 | ACL LABEL                                    |   |                  |
> >  * +----+----------------------------------------------+---+------------------+
> >  * | R4 | UNUSED                                       |   |                  |
> > - * +----+----------------------------------------------+ X |   ORIG_DIP_IPV6  |
> > - * | R5 | UNUSED                                       | X | (>= IN_STATEFUL) |
> > + * +----+----------------------------------------------+ X | ORIG_DIP_IPV6(>= |
> > + * | R5 | UNUSED                                       | X | IN_PRE_STATEFUL) |
> >  * +----+----------------------------------------------+ R |                  |
> >  * | R6 | UNUSED                                       | E |                  |
> >  * +----+----------------------------------------------+ G |                  |
> > @@ -5899,43 +5899,17 @@ build_pre_stateful(struct ovn_datapath *od,
> >      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 0, "1", "next;");
> >      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 0, "1", "next;");
> >
> > - const char *ct_lb_action = features->ct_no_masked_label
> > - ? "ct_lb_mark"
> > - : "ct_lb";
> > - const char *lb_protocols[] = {"tcp", "udp", "sctp"};
> > - struct ds actions = DS_EMPTY_INITIALIZER;
> > - struct ds match = DS_EMPTY_INITIALIZER;
> > -
> > - for (size_t i = 0; i < ARRAY_SIZE(lb_protocols); i++) {
> > - ds_clear(&match);
> > - ds_clear(&actions);
> > - ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip4 && %s",
> > - lb_protocols[i]);
> > - ds_put_format(&actions, REG_ORIG_DIP_IPV4 " = ip4.dst; "
> > - REG_ORIG_TP_DPORT " = %s.dst; %s;",
> > - lb_protocols[i], ct_lb_action);
> > - ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> > - ds_cstr(&match), ds_cstr(&actions));
> > -
> > - ds_clear(&match);
> > - ds_clear(&actions);
> > - ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip6 && %s",
> > - lb_protocols[i]);
> > - ds_put_format(&actions, REG_ORIG_DIP_IPV6 " = ip6.dst; "
> > - REG_ORIG_TP_DPORT " = %s.dst; %s;",
> > - lb_protocols[i], ct_lb_action);
> > - ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> > - ds_cstr(&match), ds_cstr(&actions));
> > - }
> > +    /* Note: priority-120 flows are added in build_lb_rules_pre_stateful(). */
> >
> > - ds_clear(&actions);
> > - ds_put_format(&actions, "%s;", ct_lb_action);
> > + const char *ct_lb_action = features->ct_no_masked_label
> > + ? "ct_lb_mark;"
> > + : "ct_lb;";
> >
> > ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 110,
> > - REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
> > + REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
> >
> > ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 110,
> > - REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
> > + REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
> >
> >      /* If REGBIT_CONNTRACK_DEFRAG is set as 1, then the packets should be
> >       * sent to conntrack for tracking and defragmentation. */
> > @@ -5945,8 +5919,6 @@ build_pre_stateful(struct ovn_datapath *od,
> > ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 100,
> > REGBIT_CONNTRACK_DEFRAG" == 1", "ct_next;");
> >
> > - ds_destroy(&actions);
> > - ds_destroy(&match);
> > }
> >
> > static void
> > @@ -6841,22 +6813,16 @@ build_qos(struct ovn_datapath *od, struct hmap *lflows) {
> > }
> >
> > static void
> > -build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
> > -               struct ds *match, struct ds *action,
> > -               const struct shash *meter_groups)
> > +build_lb_rules_pre_stateful(struct hmap *lflows, struct ovn_northd_lb *lb,
> > +                            bool ct_lb_mark, struct ds *match,
> > +                            struct ds *action)
> > {
> > for (size_t i = 0; i < lb->n_vips; i++) {
> > struct ovn_lb_vip *lb_vip = &lb->vips[i];
> > - struct ovn_northd_lb_vip *lb_vip_nb = &lb->vips_nb[i];
> > - const char *ip_match = NULL;
> > -
> > ds_clear(action);
> > ds_clear(match);
> > + const char *ip_match = NULL;
> >
> > -        /* Make sure that we clear the REGBIT_CONNTRACK_COMMIT flag. Otherwise
> > -         * the load balanced packet will be committed again in
> > -         * S_SWITCH_IN_STATEFUL. */
> > -        ds_put_format(action, REGBIT_CONNTRACK_COMMIT" = 0; ");
> > /* Store the original destination IP to be used when generating
> > * hairpin flows.
> > */
> > @@ -6887,6 +6853,67 @@ build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
> > ds_put_format(action, REG_ORIG_TP_DPORT " = %"PRIu16"; ",
> > lb_vip->vip_port);
> > }
> > + ds_put_format(action, "%s;", ct_lb_mark ? "ct_lb_mark" :
"ct_lb");
> > +
> > + ds_put_format(match, "%s.dst == %s", ip_match,
lb_vip->vip_str);
> > + if (lb_vip->vip_port) {
> > + ds_put_format(match, " && %s.dst == %d", proto,
lb_vip->vip_port);
> > + }
> > +
> > + struct ovn_lflow *lflow_ref = NULL;
> > + uint32_t hash = ovn_logical_flow_hash(
> > + ovn_stage_get_table(S_SWITCH_IN_PRE_STATEFUL),
> > + ovn_stage_get_pipeline(S_SWITCH_IN_PRE_STATEFUL), 120,
> > + ds_cstr(match), ds_cstr(action));
> > +
> > + for (size_t j = 0; j < lb->n_nb_ls; j++) {
> > + struct ovn_datapath *od = lb->nb_ls[j];
> > +
> > + if (!ovn_dp_group_add_with_reference(lflow_ref, od)) {
> > + lflow_ref = ovn_lflow_add_at_with_hash(
> > + lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> > + ds_cstr(match), ds_cstr(action),
> > + NULL, NULL, &lb->nlb->header_,
> > + OVS_SOURCE_LOCATOR, hash);
> > + }
> > + }
> > + }
>
> I see how this can fix the dataplane issue, because we're limiting the
> number of dp flows, but it now induces a measurable penalty on the
> control plane side: we essentially double the number of logical flows
> that ovn-northd generates for load balancer VIPs.
>
> Using a NB database [2] from a 250 node density heavy test [3] we ran
> in-house I see the following:
>
> a. before this patch (main):
> - number of SB lflows: 146233
> - SB size on disk (compacted): 70MB
> - northd poll loop intervals: 8525ms
>
> b. with this patch:
> - number of SB lflows: 163303
> - SB size on disk (compacted): 76MB
> - northd poll loop intervals: 9958ms
>
> [2] https://people.redhat.com/~dceara/250-density-heavy/20220830/ovnnb_db.db.gz
> [3] https://github.com/dceara/ovn-heater/blob/main/test-scenarios/ocp-250-density-heavy.yml
>
> This raises some concerns. However, I went ahead and also ran a test
> with Ilya's load balancer optimization series [4] (up for review) and I
> got:
>
> c. with this patch + Ilya's series:
> - number of SB lflows: 163303 << Didn't change compared to "b"
> - SB size on disk (compacted): 76MB << Didn't change compared to "b"
> - northd poll loop intervals: 6437ms << 25% better than "a"
>
> Then I ran a test with main + Ilya's series.
>
> d. main + Ilya's series:
> - number of SB lflows: 146233 << Didn't change compared to "a"
> - SB size on disk (compacted): 70MB << Didn't change compared to "a"
> - northd poll loop intervals: 5413ms << 35% better than "a"
>
> [4] https://patchwork.ozlabs.org/project/ovn/list/?series=315213&state=*
>
> I guess what I'm trying to say is that if we go ahead with the approach
> you proposed, and especially if we want to backport it to the LTS, we
> should consider accepting/backporting other optimizations too (such as
> Ilya's) that would compensate (and more) for the control plane
> performance hit.
>
Thanks for the scale test!
I do expect some cost on the control plane, and your test results align
with my expectations. It is unpleasant to see a ~10% SB size increase (in
this density-heavy, LB/service-heavy test scenario), but if that is what it
takes to fix the dataplane performance bug, I think it is worth it. After
all, everything the control plane does is in service of the dataplane,
right? The resulting dataplane performance boost is at least 10x for
short-lived connection scenarios.
It's great to see that Ilya's patches help LB scale, and I have no
objection to backporting them to the stable branches if that is considered
necessary, but I'd consider that independent of the current patch.
> I tried to think of an alternative that wouldn't require doubling the
> number of VIP logical flows but couldn't come up with one. Maybe you
> have more ideas?
>
This is the approach with the least impact and fewest changes to the
existing pipelines that I have thought of so far. The earlier attempt
wouldn't have increased the number of flows, but it had to compromise by
either failing to support a corner case (one we are still not sure is
really useful at all) or losing HW offload capability. I'll keep thinking
about other approaches, but the NAT/LB/ACL pipelines are very complex now
and I'd avoid big changes in case some corner case breaks. I'd rather spend
more time sorting out the different use cases and the documentation before
making more optimizations, but I do think we should fix this performance
bug as soon as we can. (I am also curious why we haven't heard other OVN
users complain about this yet ...)
Thanks,
Han
> Thanks,
> Dumitru
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev