On Fri, May 19, 2023 at 11:55 AM Eelco Chaudron <[email protected]> wrote:
>
>
>
> On 15 May 2023, at 10:04, Frode Nordahl wrote:
>
> > The tc module combines the use of the `tc_transact` helper
> > function for communication with the in-kernel tc infrastructure
> > with assertions on the reply data by `ofpbuf_at_assert` on the
> > received data prior to further processing.
> >
> > `tc_transact` in turn calls `nl_transact`, which via
> > `nl_transact_multiple__` ultimately calls and handles return
> > value from `recvmsg`.  On error a check for EAGAIN is performed
> > and a consequence of this condition is effectively to provide a
> > non-error (0) result and an empty reply buffer.
> >
> > Before this change, the `tc_transact` and, consumers of it, were
> > unaware of this condition.  The use of assertions on the reply
> > buffer can as such lead to a fatal crash of OVS.
> >
> > To be fair, the behavior of `nl_transact` when handling an EAGAIN
> > return is currently not documented, so this change also adds that
> > documentation.
> >
> > While fixing the problem, it led me to find potential problems
> > with the one-time initialization functions in the netdev-offload-tc
> > module.  Make use of the information now available about an EAGAIN
> > condition to retry one-time initialization, and resort to logging
> > a warning if probing of tc features fails due to temporary
> > situations such as resource depletion.
> >
> > For the record, the symptom of the crash is this in the log:
> > EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= 
> > b->size failed in ofpbuf_at_assert()
> >
> > And an excerpt of the backtrace looks like this:
> > 0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, 
> > offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194
> > tc_replace_flower (id=<optimized out>, flower=<optimized out>) at 
> > ../lib/tc.c:3223
> > 0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, 
> > match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>,
> > ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at 
> > ../lib/netdev-offload-tc.c:2096
> > 0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, 
> > info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, 
> > actions=<optimized out>,
> > match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257
> > parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at 
> > ../lib/dpif-netlink.c:2297
> > try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at 
> > ../lib/dpif-netlink.c:2384
>
> Thanks for this patch. See some comments inline below.

Hello, Eelco,

Thank you for taking the time to review, much appreciated! I'll
respond inline as well.

> Cheers,
>
> Eelco
>
> > Reported-At: https://launchpad.net/bugs/2018500
> > Fixes f98e418fbdb6 (tc: Add tc flower functions)
> > Fixes 407556ac6c90 (netlink-socket: Avoid forcing a reply for final message 
> > in a transaction.)

I see now that I've forgotten a `:` after `Fixes` which makes
patchwork not account for
them, I'll fix that in the next iteration

> > Signed-off-by: Frode Nordahl <[email protected]>
> > ---
> >  lib/dpif-netlink.c      |  4 +-
> >  lib/dpif.c              |  2 +-
> >  lib/netdev-linux.c      | 34 +++++++++-------
> >  lib/netdev-offload-tc.c | 86 +++++++++++++++++++++++++++++++++++------
> >  lib/netlink-socket.c    |  5 +++
> >  lib/tc.c                | 18 +++++++++
> >  6 files changed, 122 insertions(+), 27 deletions(-)
> >
> > diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> > index 60bd39643..75537b2f8 100644
> > --- a/lib/dpif-netlink.c
> > +++ b/lib/dpif-netlink.c
> > @@ -2340,7 +2340,9 @@ parse_flow_put(struct dpif_netlink *dpif, struct 
> > dpif_flow_put *put)
> >              }
> >              netdev_set_hw_info(oor_netdev, HW_INFO_TYPE_OOR, true);
> >          }
> > -        level = (err == ENOSPC || err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
> > +        level = (err == EAGAIN ||
> > +                 err == ENOSPC ||
> > +                 err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
> >          VLOG_RL(&rl, level, "failed to offload flow: %s: %s",
> >                  ovs_strerror(err),
> >                  (oor_netdev ? oor_netdev->name : dev->name));
>
> Do we want a debug for EAGAIN here? I think it could as well be an error, as 
> we failed to offload a flow due to a busy system?

The reasoning is that the flow will be attempted offloaded again, and if the
system is already busy, adding to the load by possibly rampant logging would
not help on the system load?

Do you think this condition is not to be expected with this subsystem?

> > diff --git a/lib/dpif.c b/lib/dpif.c
> > index 3305401fe..2bbb0aac4 100644
> > --- a/lib/dpif.c
> > +++ b/lib/dpif.c
> > @@ -1751,7 +1751,7 @@ flow_message_log_level(int error)
> >       * Kernels that support flow wildcarding will reject these flows as
> >       * duplicates (EEXIST), so lower the log level to debug for these
> >       * types of messages. */
> > -    return (error && error != EEXIST) ? VLL_WARN : VLL_DBG;
> > +    return (error && error != EEXIST && error != EAGAIN) ? VLL_WARN : 
> > VLL_DBG;
>
> Just a bit, but as the above comment explicitly mentions why EEXIST should be 
> DBG level, we might want to add a word on why EAGAIN should also be.

Same reasoning for suppressing EAGAIN as above, and it surely deserves a
comment, thank you for pointing it out!

> In addition same as above, i.e., do we want a debug for EAGAIN here?

The reason for adding it here is that as we discovered the `netlink-socket`
module is currently suppressing these in a rather surprising manner, but there
was probably a reason for the suppression.

Now that we are exposing a new temporary error condition in the `tc` module,
many of the call paths end up in higher level dpif callbacks.  I'm approaching
this with a bit of conservative safety in mind, squelching the logging down to
DBG level to ensure that we don't cause any issues for existing systems or
tests.

> >  }
> >
> >  static bool
> > diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> > index 36620199e..a0294ca00 100644
> > --- a/lib/netdev-linux.c
> > +++ b/lib/netdev-linux.c
> > @@ -5817,8 +5817,9 @@ tc_get_policer_action(uint32_t index, struct 
> > ofputil_meter_stats *stats)
> >
> >      error = tc_transact(&request, &replyp);
> >      if (error) {
> > -        VLOG_ERR_RL(&rl, "Failed to dump police action (index: %u), 
> > err=%d",
> > -                    index, error);
> > +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_ERR,
> > +                "Failed to dump police action (index: %u), err=%d",
> > +                index, error);
> >          return error;
> >      }
> >
> > @@ -5849,8 +5850,9 @@ tc_del_policer_action(uint32_t index, struct 
> > ofputil_meter_stats *stats)
> >
> >      error = tc_transact(&request, &replyp);
> >      if (error) {
> > -        VLOG_ERR_RL(&rl, "Failed to delete police action (index: %u), 
> > err=%d",
> > -                    index, error);
> > +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_ERR,
> > +                "Failed to delete police action (index: %u), err=%d",
> > +                index, error);
> >          return error;
> >      }
>
> The two above will still result in a warning by upper-level callers, so dbg 
> is fine.
>
> > @@ -6114,11 +6116,12 @@ tc_query_class(const struct netdev *netdev,
> >
> >      error = tc_transact(&request, replyp);
> >      if (error) {
> > -        VLOG_WARN_RL(&rl, "query %s class %u:%u (parent %u:%u) failed 
> > (%s)",
> > -                     netdev_get_name(netdev),
> > -                     tc_get_major(handle), tc_get_minor(handle),
> > -                     tc_get_major(parent), tc_get_minor(parent),
> > -                     ovs_strerror(error));
> > +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_WARN,
> > +                "query %s class %u:%u (parent %u:%u) failed (%s)",
> > +                netdev_get_name(netdev),
> > +                tc_get_major(handle), tc_get_minor(handle),
> > +                tc_get_major(parent), tc_get_minor(parent),
> > +                ovs_strerror(error));
> >      }
> >      return error;
> >  }
> > @@ -6140,10 +6143,11 @@ tc_delete_class(const struct netdev *netdev, 
> > unsigned int handle)
> >
> >      error = tc_transact(&request, NULL);
> >      if (error) {
> > -        VLOG_WARN_RL(&rl, "delete %s class %u:%u failed (%s)",
> > -                     netdev_get_name(netdev),
> > -                     tc_get_major(handle), tc_get_minor(handle),
> > -                     ovs_strerror(error));
> > +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_WARN,
> > +                "delete %s class %u:%u failed (%s)",
> > +                netdev_get_name(netdev),
> > +                tc_get_major(handle), tc_get_minor(handle),
> > +                ovs_strerror(error));
> >      }
> >      return error;
> >  }
> > @@ -6262,7 +6266,9 @@ tc_query_qdisc(const struct netdev *netdev_)
> >                  ops = &tc_ops_other;
> >              }
> >          }
> > -    } else if ((!error && !qdisc->size) || error == ENOENT) {
> > +    } else if ((!error && !qdisc->size) ||
> > +               error == ENOENT || error == EAGAIN)
>
> Is the (!error && !qdisc->size) not the same as EAGAIN is now? If so, we 
> could remove the (!error && !qdisc->size).
> Same for a couple of lines above where we do ‘if (!error && qdisc->size)’ 
> which can now be just !error.

Ack.

> > +    {
> >          /* Either it's a built-in qdisc, or (on Linux pre-2.6.35) it's a 
> > qdisc
> >           * set up by some other entity that doesn't have a handle 1:0.  We 
> > will
> >           * assume that it's the system default qdisc. */
> > diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
> > index 4f26dd8cc..4fb9dc239 100644
> > --- a/lib/netdev-offload-tc.c
> > +++ b/lib/netdev-offload-tc.c
> > @@ -2505,10 +2505,10 @@ netdev_tc_flow_get(struct netdev *netdev,
> >
> >      err = tc_get_flower(&id, &flower);
> >      if (err) {
> > -        VLOG_ERR_RL(&error_rl,
> > -                    "flow get failed (dev %s prio %d handle %d): %s",
> > -                    netdev_get_name(netdev), id.prio, id.handle,
> > -                    ovs_strerror(err));
> > +        VLOG_RL(&error_rl, err == EAGAIN ? VLL_DBG : VLL_ERR,
>
> Do we want a debug for EAGAIN here? I think it could as well be an 
> error/warning as the system was busy?

I'd lean towards yes, given short periods of the system being busy and returning
EAGAIN on calls is to be expected, and rampant logging about it when it happens
would not help with the load?

Do you think this condition is not to be expected with this subsystem?

> > +                "flow get failed (dev %s prio %d handle %d): %s",
> > +                netdev_get_name(netdev), id.prio, id.handle,
> > +                ovs_strerror(err));
> >          return err;
> >      }
> >
> > @@ -2571,11 +2571,36 @@ netdev_tc_get_n_flows(struct netdev *netdev, 
> > uint64_t *n_flows)
> >      return 0;
> >  }
> >
> > +/* This macro is for use by one-time initialization functions, where we 
> > have
> > + * one shot per thread/process to perform a pertinent initialization task 
> > that
> > + * may return a temporary error (EAGAIN).
> > + *
> > + * With min/max values of 1/64 we would retry 7 times, spending at the
> > + * most 127 * 1E6 nsec (0.127s) sleeping.
> > + */
> > +#define NETDEV_OFFLOAD_TC_BACKOFF_MIN 1
> > +#define NETDEV_OFFLOAD_TC_BACKOFF_MAX 64
> > +#define NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(BACKOFF, ERROR, CONDITION,    
> >    \
>
> I think it would be nice to define the backoff variable inside the macro as a 
> local variable, as we do not need it outside the macro.

Ack, was worried about the risk of unexpectedly redefining/shadowing a variable
already defined by the consumer of the macro, but I guess the compiler will
be helpful in pointing this out.  Will fix it.

> > +                                             FUNCTION, ...)                
> >    \
> > +    BACKOFF = NETDEV_OFFLOAD_TC_BACKOFF_MIN;                               
> >    \
> > +    do {                                                                   
> >    \
> > +        ERROR = (FUNCTION)(__VA_ARGS__);                                   
> >    \
> > +        if (CONDITION) {                                                   
> >    \
> > +            xnanosleep(BACKOFF * 1E6);                                     
> >    \
> > +            if (BACKOFF < NETDEV_OFFLOAD_TC_BACKOFF_MAX) {                 
> >    \
> > +                BACKOFF <<= 1;                                             
> >    \
> > +            } else {                                                       
> >    \
> > +                break;                                                     
> >    \
> > +            }                                                              
> >    \
> > +        }                                                                  
> >    \
> > +    } while (CONDITION);
> > +
> >  static void
> >  probe_multi_mask_per_prio(int ifindex)
> >  {
> >      struct tc_flower flower;
> >      struct tcf_id id1, id2;
> > +    uint64_t backoff;
> >      int block_id = 0;
> >      int prio = 1;
> >      int error;
> > @@ -2594,8 +2619,13 @@ probe_multi_mask_per_prio(int ifindex)
> >      memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
> >
> >      id1 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> > -    error = tc_replace_flower(&id1, &flower);
> > +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> > +                                         tc_replace_flower, &id1, &flower);
> >      if (error) {
> > +        if (error == EAGAIN) {
> > +            VLOG_WARN("probe tc: unable to probe for multiple mask "
> > +                      "support: %s", ovs_strerror(error));
> > +        }
> >          goto out;
> >      }
> >
> > @@ -2603,10 +2633,15 @@ probe_multi_mask_per_prio(int ifindex)
> >      memset(&flower.mask.src_mac, 0xff, sizeof flower.mask.src_mac);
> >
> >      id2 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> > -    error = tc_replace_flower(&id2, &flower);
> > +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> > +                                         tc_replace_flower, &id2, &flower);
> >      tc_del_flower_filter(&id1);
> >
> >      if (error) {
> > +        if (error == EAGAIN) {
> > +            VLOG_WARN("probe tc: unable to probe for multiple mask "
> > +                      "support: %s", ovs_strerror(error));
> > +        }
> >          goto out;
> >      }
> >
> > @@ -2642,6 +2677,7 @@ probe_ct_state_support(int ifindex)
> >  {
> >      struct tc_flower flower;
> >      uint16_t ct_state;
> > +    uint64_t backoff;
> >      struct tcf_id id;
> >      int error;
> >
> > @@ -2657,8 +2693,13 @@ probe_ct_state_support(int ifindex)
> >          goto out;
> >      }
> >
> > -    error = tc_get_flower(&id, &flower);
> > +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> > +                                         tc_get_flower, &id, &flower);
> >      if (error || flower.mask.ct_state != ct_state) {
> > +        if (error == EAGAIN) {
> > +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> > +                      ovs_strerror(error));
> > +        }
> >          goto out_del;
> >      }
> >
> > @@ -2670,10 +2711,16 @@ probe_ct_state_support(int ifindex)
> >
> >      /* Test for reject, ct_state >= MAX */
> >      ct_state = ~0;
> > -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
> > +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> > +                                         probe_insert_ct_state_rule, 
> > ifindex,
> > +                                         ct_state, &id);
> >      if (!error) {
> >          /* No reject, can't continue probing other flags */
> >          goto out_del;
> > +    } else if (error == EAGAIN) {
> > +        VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> > +                  ovs_strerror(error));
> > +        goto out_del;
> >      }
> >
> >      tc_del_flower_filter(&id);
> > @@ -2682,8 +2729,14 @@ probe_ct_state_support(int ifindex)
> >      memset(&flower, 0, sizeof flower);
> >      ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
> >                 TCA_FLOWER_KEY_CT_FLAGS_INVALID;
> > -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
> > +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> > +                                         probe_insert_ct_state_rule, 
> > ifindex,
> > +                                         ct_state, &id);
> >      if (error) {
> > +        if (error == EAGAIN) {
> > +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> > +                      ovs_strerror(error));
> > +        }
> >          goto out;
> >      }
> >
> > @@ -2695,8 +2748,14 @@ probe_ct_state_support(int ifindex)
> >      ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
> >                 TCA_FLOWER_KEY_CT_FLAGS_ESTABLISHED |
> >                 TCA_FLOWER_KEY_CT_FLAGS_REPLY;
> > -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
> > +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> > +                                         probe_insert_ct_state_rule, 
> > ifindex,
> > +                                         ct_state, &id);
> >      if (error) {
> > +        if (error == EAGAIN) {
> > +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> > +                      ovs_strerror(error));
> > +        }
> >          goto out;
> >      }
> >
> > @@ -2714,6 +2773,7 @@ probe_tc_block_support(int ifindex)
> >  {
> >      struct tc_flower flower;
> >      uint32_t block_id = 1;
> > +    uint64_t backoff;
> >      struct tcf_id id;
> >      int prio = 0;
> >      int error;
> > @@ -2732,13 +2792,17 @@ probe_tc_block_support(int ifindex)
> >      memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
> >
> >      id = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> > -    error = tc_replace_flower(&id, &flower);
> > +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> > +                                         tc_replace_flower, &id, &flower);
> >
> >      tc_add_del_qdisc(ifindex, false, block_id, TC_INGRESS);
> >
> >      if (!error) {
> >          block_support = true;
> >          VLOG_INFO("probe tc: block offload is supported.");
> > +    } else if (error == EAGAIN) {
> > +        VLOG_WARN("probe tc: unable to probe block offload support: %s",
> > +                  ovs_strerror(error));
> >      }
> >  }
> >
> > diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
> > index 80da20d9f..dea060fc3 100644
> > --- a/lib/netlink-socket.c
> > +++ b/lib/netlink-socket.c
> > @@ -1798,6 +1798,11 @@ nl_pool_release(struct nl_sock *sock)
> >   *
> >   *      2. Resending the request causes it to be re-executed, so the 
> > request
> >   *         needs to be idempotent.
> > + *
> > + *      3. In the event that the kernel is too busy to handle the request 
> > to
> > + *         receive the response (i.e. EAGAIN), this function will still 
> > return
> > + *         0.  The caller can check for this condition by checking for a 
> > zero
> > + *         size of the 'replyp' ofpbuf buffer.
> >   */
> >  int
> >  nl_transact(int protocol, const struct ofpbuf *request,
> > diff --git a/lib/tc.c b/lib/tc.c
> > index 5c32c6f97..7df6a10c5 100644
> > --- a/lib/tc.c
> > +++ b/lib/tc.c
> > @@ -237,11 +237,29 @@ static void request_from_tcf_id(struct tcf_id *id, 
> > uint16_t eth_type,
> >      }
> >  }
> >
> > +/* The `tc_transact` function is a wrapper around `nl_transact` with the
> > + * addition of:
> > + *
> > + * 1. the 'request' ofpbuf buffer is freed after `nl_transact` returns,
> > + *    regardless of success or failure.
> > + *
> > + * 2. When a 'replyp' pointer is provided; in the event of the kernel
> > + *    being too busy to process the request for the response, a positive
> > + *    error return will be provided with the value of EAGAIN.
> > + *
> > + * Please acquaint yourself with the documentation of the `nl_transact`
> > + * function in the netlink-sock module before making use of this function.
> > + */
> >  int
> >  tc_transact(struct ofpbuf *request, struct ofpbuf **replyp)
> >  {
> >      int error = nl_transact(NETLINK_ROUTE, request, replyp);
> >      ofpbuf_uninit(request);
> > +
> > +    if (!error && replyp && !(*replyp)->size) {
>
> I think it would be really nice to add a coverage counter for this, so we 
> know it happened as most of it is hidden behind a debug message.

Ack, good point, will add!

-- 
Frode Nordahl

> > +        return EAGAIN;
> > +    }
> > +
> >      return error;
> >  }
> >
> > --
> > 2.39.2
> >
> > _______________________________________________
> > dev mailing list
> > [email protected]
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to