On Sat, May 14, 2022 at 2:10 AM Ilya Maximets <[email protected]> wrote:
>
> On 5/13/22 10:36, Frode Nordahl wrote:
> > On Fri, Mar 11, 2022 at 2:04 PM Liam Young <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> <tl;dr> Commit 355fef6f2 seems to break connectivity in my setup</tl;dr>
> >
> > After OVN started colocating sNAT and dNAT operations in the same CT
> > zone [0] the above mentioned OVS commit appears to also break OVN [1].
> > I have been discussing with Numan and he has found a correlation with
> > the behavior of the open vswitch kernel data path conntrack
> > `skb_nfct_cached` function, i.e. if that is changed to always return
> > 'false' the erratic behavior disappears.
> >
> > So I guess the question then becomes, is this a OVS userspace or OVS
> > kernel problem?
> >
> > We have a reproducer in [3].
> >
> > 0: 
> > https://github.com/ovn-org/ovn/commit/4deac4509abbedd6ffaecf27eed01ddefccea40a
> > 1: https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1967856
> > 2: 
> > https://elixir.bootlin.com/linux/latest/source/net/openvswitch/conntrack.c#L683
> > 3: https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1967856/comments/6
> >
>
> Hrm.  I think, there is a logical bug in implementations of ct()
> datapath action in both datapaths.
>
> The problem appears when the OpenFlow pipeline for the packet contains
> several ct() actions.  NXAST_CT action (i.e. every ct() action) must
> always put the packet into an untracked state.  Tracked state will be
> set only in the 'recirc_table' where the packet is cloned by the ct()
> action for further processing.
>
> If an OF pipeline looks like this:
>
>   actions=ct(),something,something,ct(),something
>
> each action must be entered with the 'untracked' conntrack state in
> the packet metadata.
>
> However, ct() action in the datapath, unlike OpenFlow, doesn't work
> this way.  It modifies the conntrack state for the packet it is processing.
> During the flow translation OVS inserts recirculation right after the
> datapath ct() action.  This way conntrack state affects only processing
> after recirculation.  Sort of.
>
> The issue is that not all OpenFlow ct() actions have recirc_table
> specified.  These actions supposed to change the state of the conntrack
> subsystem, but they still must leave the packet itself in the untracked
> state with no strings attached.  But since datapath ct() actions doesn't
> work that way and since recirculation is not inserted (no recirc_table
> specified), packet after conntrack execution carries the state along to
> all other actions.
> This doesn't impact normal actions, but seems to break subsequent ct()
> actions on the same pipeline.
>
> In general, it looks to me that ct() action in the datapath is
> not supposed to cache any connection data inside the packet's metadata.
> This seems to be the root cause of the problem.  Fields that OF knows
> about should not trigger any issues if carried along with a packet,
> but datapath-specific metadata can not be cleared, because OFproto
> simply doesn't know about it.
>
> But, I guess, not caching the connection and other internal structures
> might significantly impact datapath performance.  Will it?
>
> Looking at the reproducer in [3], it has, for example, the flow with the
> following actions:
>
>   actions:ct(commit,zone=8,mark=0/0x1,nat(src)),
>           set(eth(src=00:00:00:00:00:01,dst=00:00:00:00:00:06)),
>           set(ipv4(src=172.18.2.10,dst=192.168.100.6,ttl=62)),
>           ct(zone=8),recirc(0x4)
>
> So, if the first ct() will change some datapath-specific packet metadata,
> the second ct() may try to use that information.
>
> It looks like during the flow translation we must add ct_clear action
> explicitly after every ct() action, unless it was the last action in
> the list.  This will end up with adding a lot of ct_clear actions
> everywhere.
>
> Another option is the patch below that tries to track if ct_clear
> is required and adds that action before the next ct() action in
> the datapath.
>
> BTW, the test [3] fails with both kernel and userspace datapaths.
>
> The change below should fix the problem, I think.
> It looks like we also have to put ct_clear action before translating
> output to the patch port if we're in 'conntracked' state.
>
> And I do not know how to fix the problem for kernels without ct_clear
> support.  I don't think we can clear metadata that is internal to
> the kernel.  Ideas are welcome.
>
> CC: Aaron, Paolo.
>
> Any thoughts?

Thank you so much for the detailed explanation of what's going on and
for providing a proposed fix.

I took it for a spin and it does indeed appear to fix the OVN hairpin
issue, but it does unfortunately not appear to fix the issue Liam
reported for the ML2/OVS use case [4].

When trying that use case with your patch I see this in the Open vSwitch log:
2022-05-19T08:17:02.668Z|00001|ofproto_dpif_xlate(handler1)|WARN|over
max translation depth 64 on bridge br-int while processing
ct_state=new|trk,ct_ipv6_src=fc00:24:159c:555b:f816:3eff:fe8b:3ad0,ct_ipv6_dst=fc00:24:159c:555b:f816:3eff:fe8f:9302,ct_nw_proto=58,ct_tp_src=128,ct_tp_dst=0,icmp6,in_port=3,vlan_tci=0x0000,dl_src=fa:16:3e:8b:3a:d0,dl_dst=fa:16:3e:8f:93:02,ipv6_src=fc00:24:159c:555b:f816:3eff:fe8b:3ad0,ipv6_dst=fc00:24:159c:555b:f816:3eff:fe8f:9302,ipv6_label=0xac5db,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=128,icmp_code=0
2022-05-19T08:17:02.668Z|00002|ofproto_dpif_upcall(handler1)|WARN|Dropped
2 log messages in last 205 seconds (most recently, 187 seconds ago)
due to excessive rate
2022-05-19T08:17:02.668Z|00003|ofproto_dpif_upcall(handler1)|WARN|Flow:
ct_state=new|trk,ct_ipv6_src=fc00:24:159c:555b:f816:3eff:fe8b:3ad0,ct_ipv6_dst=fc00:24:159c:555b:f816:3eff:fe8f:9302,ct_nw_proto=58,ct_tp_src=128,ct_tp_dst=0,icmp6,in_port=5,vlan_tci=0x0000,dl_src=fa:16:3e:8b:3a:d0,dl_dst=fa:16:3e:8f:93:02,ipv6_src=fc00:24:159c:555b:f816:3eff:fe8b:3ad0,ipv6_dst=fc00:24:159c:555b:f816:3eff:fe8f:9302,ipv6_label=0xac5db,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=128,icmp_code=0

bridge("br-int")
----------------
 0. priority 0, cookie 0x56d6d58c018b51bd
    goto_table:60
60. priority 3, cookie 0x56d6d58c018b51bd
    NORMAL
     >>>> received packet on unknown port 5 <<<<
     >> no input bundle, dropping

Final flow: unchanged
Megaflow: 
recirc_id=0,eth,ipv6,in_port=5,dl_src=fa:16:3e:8b:3a:d0,dl_dst=fa:16:3e:8f:93:02,nw_frag=no
Datapath actions: drop


On the back of your explanation and the log output I made an attempt
at adding this patch on top of yours:
---
diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index 110dab0ec..2955e8e4d 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -7144,6 +7144,7 @@ do_xlate_actions(const struct ofpact *ofpacts,
size_t ofpacts_len,
          * resubmit to the frozen actions.
          */
         case OFPACT_RESUBMIT:
+            ctx->pending_ct_clear = true;
             xlate_ofpact_resubmit(ctx, ofpact_get_RESUBMIT(a), last);
             continue;
         case OFPACT_GOTO_TABLE:
---

And the two patches together do appear to resolve the issue reported
in [4] as well as the OVN hairpin issue [1]. It does however make a
couple of tests fail, so I need to look into if that is expected from
the change or if the approach must be changed.

4: https://bugs.launchpad.net/openvswitch/+bug/1964117

--
Frode Nordahl




> Best regards, Ilya Maximets.
>
> ---
> diff --git a/include/openvswitch/ofp-packet.h 
> b/include/openvswitch/ofp-packet.h
> index 77128d829..26b07e07f 100644
> --- a/include/openvswitch/ofp-packet.h
> +++ b/include/openvswitch/ofp-packet.h
> @@ -133,6 +133,9 @@ struct ofputil_packet_in_private {
>      /* NXCPT_CONNTRACKED. */
>      bool conntracked;
>
> +    /* NXCPT_PENDING_CT_CLEAR. */
> +    bool pending_ct_clear;
> +
>      /* NXCPT_ACTIONS. */
>      struct ofpact *actions;
>      size_t actions_len;
> diff --git a/lib/ofp-packet.c b/lib/ofp-packet.c
> index 9485ddfc9..1df092471 100644
> --- a/lib/ofp-packet.c
> +++ b/lib/ofp-packet.c
> @@ -425,6 +425,7 @@ enum nx_continuation_prop_type {
>      NXCPT_ACTIONS,
>      NXCPT_ACTION_SET,
>      NXCPT_ODP_PORT,
> +    NXCPT_PENDING_CT_CLEAR,
>  };
>
>  /* Only NXT_PACKET_IN2 (not NXT_RESUME) should include NXCPT_USERDATA, so 
> this
> @@ -460,6 +461,10 @@ ofputil_put_packet_in_private(const struct 
> ofputil_packet_in_private *pin,
>          ofpprop_put_flag(msg, NXCPT_CONNTRACKED);
>      }
>
> +    if (pin->pending_ct_clear) {
> +        ofpprop_put_flag(msg, NXCPT_PENDING_CT_CLEAR);
> +    }
> +
>      if (pin->actions_len) {
>          /* Divide 'pin->actions' into groups that begins with an
>           * unroll_xlate action.  For each group, emit a NXCPT_TABLE_ID and
> @@ -863,6 +868,10 @@ ofputil_decode_packet_in_private(const struct ofp_header 
> *oh, bool loose,
>              pin->conntracked = true;
>              break;
>
> +        case NXCPT_PENDING_CT_CLEAR:
> +            pin->pending_ct_clear = true;
> +            break;
> +
>          case NXCPT_TABLE_ID:
>              error = ofpprop_parse_u8(&payload, &table_id);
>              break;
> diff --git a/ofproto/ofproto-dpif-rid.h b/ofproto/ofproto-dpif-rid.h
> index 4df630c62..af6aee475 100644
> --- a/ofproto/ofproto-dpif-rid.h
> +++ b/ofproto/ofproto-dpif-rid.h
> @@ -153,6 +153,7 @@ struct frozen_state {
>      size_t stack_size;
>      mirror_mask_t mirrors;        /* Mirrors already output. */
>      bool conntracked;             /* Conntrack occurred prior to freeze. */
> +    bool pending_ct_clear;        /* ct_clear required before the next ct(). 
> */
>      bool was_mpls;                /* MPLS packet */
>      struct uuid xport_uuid;       /* UUID of 1st port packet received on. */
>
> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> index 8e5d030ac..2826b3d10 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -396,6 +396,10 @@ struct xlate_ctx {
>       * state from the datapath should be honored after thawing. */
>      bool conntracked;
>
> +    /* True if datapath ct_clear action has to be performed before the next
> +     * ct() action. */
> +    bool pending_ct_clear;
> +
>      /* Pointer to an embedded NAT action in a conntrack action, or NULL. */
>      struct ofpact_nat *ct_nat_action;
>
> @@ -3858,6 +3862,7 @@ patch_port_output(struct xlate_ctx *ctx, const struct 
> xport *in_dev,
>      struct flow old_flow = ctx->xin->flow;
>      struct flow_tnl old_flow_tnl_wc = ctx->wc->masks.tunnel;
>      bool old_conntrack = ctx->conntracked;
> +    bool old_ct_clear = ctx->pending_ct_clear;
>      bool old_was_mpls = ctx->was_mpls;
>      ovs_version_t old_version = ctx->xin->tables_version;
>      struct ofpbuf old_stack = ctx->stack;
> @@ -3877,6 +3882,12 @@ patch_port_output(struct xlate_ctx *ctx, const struct 
> xport *in_dev,
>      ctx->wc->masks.tunnel.metadata.tab = flow->tunnel.metadata.tab;
>      memset(flow->regs, 0, sizeof flow->regs);
>      flow->actset_output = OFPP_UNSET;
> +    if (ctx->conntracked) {
> +        /* Patch ports do not carry conntrack information, so any conntrack
> +         * state must be removed from the packet before processing ct actions
> +         * in another bridge. */
> +        ctx->pending_ct_clear = true;
> +    }
>      clear_conntrack(ctx);
>      ctx->xin->trace = xlate_report(ctx, OFT_BRIDGE, "bridge(\"%s\")",
>                                     out_dev->xbridge->name);
> @@ -3945,6 +3956,7 @@ patch_port_output(struct xlate_ctx *ctx, const struct 
> xport *in_dev,
>      /* The out bridge's conntrack execution should have no effect on the
>       * original bridge. */
>      ctx->conntracked = old_conntrack;
> +    ctx->pending_ct_clear = old_ct_clear;
>
>      /* The fact that the out bridge exits (for any reason) does not mean
>       * that the original bridge should exit.  Specifically, if the out
> @@ -4930,6 +4942,7 @@ xlate_controller_action(struct xlate_ctx *ctx, int len,
>          .stack_size = ctx->stack.size,
>          .mirrors = ctx->mirrors,
>          .conntracked = ctx->conntracked,
> +        .pending_ct_clear = ctx->pending_ct_clear,
>          .was_mpls = ctx->was_mpls,
>          .ofpacts = NULL,
>          .ofpacts_len = 0,
> @@ -5005,6 +5018,7 @@ finish_freezing__(struct xlate_ctx *ctx, uint8_t table)
>          .stack_size = ctx->stack.size,
>          .mirrors = ctx->mirrors,
>          .conntracked = ctx->conntracked,
> +        .pending_ct_clear = ctx->pending_ct_clear,
>          .was_mpls = ctx->was_mpls,
>          .xport_uuid = ctx->xin->xport_uuid,
>          .ofpacts = ctx->frozen_actions.data,
> @@ -5835,6 +5849,7 @@ clone_xlate_actions(const struct ofpact *actions, 
> size_t actions_len,
>      struct flow old_base = ctx->base_flow;
>      bool old_was_mpls = ctx->was_mpls;
>      bool old_conntracked = ctx->conntracked;
> +    bool old_ct_clear = ctx->pending_ct_clear;
>
>      /* The actions are not reversible, a datapath clone action is
>       * required to encode the translation. Select the clone action
> @@ -5883,6 +5898,7 @@ dp_clone_done:
>      /* The clone's conntrack execution should have no effect on the original
>       * packet. */
>      ctx->conntracked = old_conntracked;
> +    ctx->pending_ct_clear = old_ct_clear;
>
>      /* Popping MPLS from the clone should have no effect on the original
>       * packet. */
> @@ -6128,6 +6144,18 @@ freeze_unroll_actions(const struct ofpact *a, const 
> struct ofpact *end,
>      }
>  }
>
> +static void
> +compose_ct_clear_action(struct xlate_ctx *ctx)
> +{
> +    clear_conntrack(ctx);
> +    ctx->pending_ct_clear = false;
> +    /* This action originally existed without dpif support. So to preserve
> +     * compatibility, only append it if the dpif supports it. */
> +    if (ctx->xbridge->support.ct_clear) {
> +        nl_msg_put_flag(ctx->odp_actions,  OVS_ACTION_ATTR_CT_CLEAR);
> +    }
> +}
> +
>  static void
>  put_ct_mark(const struct flow *flow, struct ofpbuf *odp_actions,
>              struct flow_wildcards *wc)
> @@ -6336,25 +6364,21 @@ compose_conntrack_action(struct xlate_ctx *ctx, 
> struct ofpact_conntrack *ofc,
>          compose_recirculate_and_fork(ctx, ofc->recirc_table, zone);
>      }
>
> +    /* Datapath will update the conntrack state for the packet, but
> +     * NXAST_CT action must always put the packet into an untracked
> +     * state.  Execution continues in the current table without
> +     * recirculation, hence conntrack state should not be available.
> +     * We'll ask the datapath to clear what OVS_ACTION_ATTR_CT might
> +     * have changed in the packet metadata before the next ct() action. */
> +    ctx->pending_ct_clear = true;
> +
>      ctx->ct_nat_action = NULL;
>
>      /* The ct_* fields are only available in the scope of the 'recirc_table'
>       * call chain. */
> -    flow_clear_conntrack(&ctx->xin->flow);
>      xlate_report(ctx, OFT_DETAIL, "Sets the packet to an untracked state, "
>                   "and clears all the conntrack fields.");
> -    ctx->conntracked = false;
> -}
> -
> -static void
> -compose_ct_clear_action(struct xlate_ctx *ctx)
> -{
>      clear_conntrack(ctx);
> -    /* This action originally existed without dpif support. So to preserve
> -     * compatibility, only append it if the dpif supports it. */
> -    if (ctx->xbridge->support.ct_clear) {
> -        nl_msg_put_flag(ctx->odp_actions,  OVS_ACTION_ATTR_CT_CLEAR);
> -    }
>  }
>
>  /* check_pkt_larger action checks the packet length and stores the
> @@ -6415,6 +6439,7 @@ xlate_check_pkt_larger(struct xlate_ctx *ctx,
>      struct flow old_base = ctx->base_flow;
>      bool old_was_mpls = ctx->was_mpls;
>      bool old_conntracked = ctx->conntracked;
> +    bool old_ct_clear = ctx->pending_ct_clear;
>
>      size_t offset = nl_msg_start_nested(ctx->odp_actions,
>                                          OVS_ACTION_ATTR_CHECK_PKT_LEN);
> @@ -6436,6 +6461,7 @@ xlate_check_pkt_larger(struct xlate_ctx *ctx,
>      ctx->base_flow = old_base;
>      ctx->was_mpls = old_was_mpls;
>      ctx->conntracked = old_conntracked;
> +    ctx->pending_ct_clear = old_ct_clear;
>      ctx->xin->flow = old_flow;
>
>      /* If the flow translation for the IF_GREATER case requires freezing,
> @@ -6466,6 +6492,7 @@ xlate_check_pkt_larger(struct xlate_ctx *ctx,
>      ctx->base_flow = old_base;
>      ctx->was_mpls = old_was_mpls;
>      ctx->conntracked = old_conntracked;
> +    ctx->pending_ct_clear = old_ct_clear;
>      ctx->xin->flow = old_flow;
>      ctx->exit = old_exit;
>  }
> @@ -7316,11 +7343,17 @@ do_xlate_actions(const struct ofpact *ofpacts, size_t 
> ofpacts_len,
>          }
>
>          case OFPACT_CT:
> +            if (ctx->pending_ct_clear) {
> +                /* Previous ct() action might have made some changes but
> +                 * the packet must be untracked.  Clearing the state before
> +                 * executing a new ct() action. */
> +                compose_ct_clear_action(ctx);
> +            }
>              compose_conntrack_action(ctx, ofpact_get_CT(a), last);
>              break;
>
>          case OFPACT_CT_CLEAR:
> -            if (ctx->conntracked) {
> +            if (ctx->conntracked || ctx->pending_ct_clear) {
>                  compose_ct_clear_action(ctx);
>              }
>              break;
> @@ -7710,6 +7743,7 @@ xlate_actions(struct xlate_in *xin, struct xlate_out 
> *xout)
>
>          .was_mpls = false,
>          .conntracked = false,
> +        .pending_ct_clear = false,
>
>          .ct_nat_action = NULL,
>
> @@ -7774,6 +7808,8 @@ xlate_actions(struct xlate_in *xin, struct xlate_out 
> *xout)
>              clear_conntrack(&ctx);
>          }
>
> +        ctx.pending_ct_clear = state->pending_ct_clear;
> +
>          /* Restore pipeline metadata. May change flow's in_port and other
>           * metadata to the values that existed when freezing was triggered. 
> */
>          frozen_metadata_to_flow(&ctx.xbridge->ofproto->up,
> @@ -8113,6 +8149,7 @@ xlate_resume(struct ofproto_dpif *ofproto,
>          .stack_size = pin->stack_size,
>          .mirrors = pin->mirrors,
>          .conntracked = pin->conntracked,
> +        .pending_ct_clear = pin->pending_ct_clear,
>          .xport_uuid = UUID_ZERO,
>
>          /* When there are no actions, xlate_actions() will search the flow
> ---
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to