Ilya Maximets <[email protected]> writes:

> This change attempts to track the actual available stack and stop
> early if the flow translation logic is about to overflow it.
>
> Unlike the recursion depth counter, this approach allows to track the
> actual stack usage and bail early even if the recursion depth is not
> reached yet.  This is important because different actions have vastly
> different stack requirements and different systems have different
> amount of stack allocated per thread by default.

This is an interesting approach, for some definitions of interesting :)

As you note a bit below, it really depends on a lot - namely that the
code was built in such a way that getting the frame address works via
the built-in (I'm thinking about the weirdness of alpha architecture and
-fomit-frame-pointer), and that the compiler can support giving this
information.

> Should work with both GCC and Clang, will likely not work on Windows.
> The change should have no effect on platforms / compilers that do not
> support checking the current stack frame address via builtins.
>
> The main thread is not treated fairly.  At least on Linux the main
> thread can grow its stack if the limit is dynamically increased.
> That is not normally true for other threads.  However, this patch
> sticks to initial stack size even for the main thread.  This should
> not be a problem for OVS though, as vast majority of all the packet
> processing is normally done by handlers, revalidators or PMD threads.
>
> Unlike the previous RFC version of this change [1], we're not trying
> to work around the stack limits by recirculating packets through the
> datapath.  Practice shows that such techniques may lead to self-DoS,
> overwhelming OVS with upcalls.  See the ovn-controller self-DoS issue
> fixed recently in OVN:
>   https://mail.openvswitch.org/pipermail/ovs-dev/2025-October/426746.html
> So, it's safer to just drop the packets and only execute actions that
> we can translate safely.  All in all, stack exhaustion usually means
> a loop or otherwise a very inefficient OpenFlow pipeline that should
> be fixed by the users.  Attempts to process the whole thing would only
> mask the problem.
>
> It seems hard to create a unit test for this, as support for measuring
> the actual stack depth as well as the amount of space each stack frame
> takes and the ways to limit the stack largely depend on a platform and
> a compiler.  Can be tested manually with an infinite resubmit case:

Although it would be more work, maybe creating our own stack structure
and iterating, while maintaining a separate agnostic stack would make
sense.  Something like::

 #define XLATE_MAX_OPERATION_DEPTH 64

 struct xlate_operation {
        cur_state;
        ofpacts;
        ofpacts_Len;
        flow;
        port;
        recirc;
 };

 struct xlate_op_stack {
        top_ptr;
        struct xlate_operation frames[XLATE_MAX_OPERATION_DEPTH];
 };

That does require rewriting things like xlate_recursively,
do_xlate_actions, etc. to indicate that they need a new operation, and
set the appropriate state so that a while loop can call the right piece.

BUT - I fully concede it's much more work than this approach, and at
least this approach gives an early bail out *now* rather than major
churn associated with reworking the xlate pipeline.

>   make -j8
>   (ulimit -s 386; make sandbox)
>   ovs-vsctl add-br br0
>   ovs-vsctl add-port br0 p1
>   ovs-ofctl del-flows br0
>   (for i in $(seq 0 64); do
>       j=$(expr $i + 1);
>       echo "table=$i, actions=local,resubmit(,$j),local,resubmit(,$j),local";
>    done;
>    echo "table=65, actions=resubmit(,0)") > ./resubmits.txt
>   ovs-ofctl add-flows br0 ./resubmits.txt
>   ovs-appctl ofproto/trace br0 'in_port=1' > ./trace.txt
>
> [1] https://mail.openvswitch.org/pipermail/ovs-dev/2024-February/412048.html
>
> Signed-off-by: Ilya Maximets <[email protected]>
> ---
>  include/linux/openvswitch.h    |  1 +
>  include/openvswitch/compiler.h | 11 ++++++++
>  lib/odp-execute.c              |  4 +++
>  lib/ovs-thread.c               | 47 ++++++++++++++++++++++++++++++++++
>  lib/ovs-thread.h               |  8 ++++++
>  lib/sat-math.h                 |  5 +---
>  ofproto/ofproto-dpif-xlate.c   | 10 ++++++--
>  vswitchd/ovs-vswitchd.c        |  1 +
>  8 files changed, 81 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h
> index e69fed4b7..8e0b9ad46 100644
> --- a/include/linux/openvswitch.h
> +++ b/include/linux/openvswitch.h
> @@ -475,6 +475,7 @@ enum xlate_error {
>       XLATE_TUNNEL_OUTPUT_NO_ETHERNET,
>       XLATE_TUNNEL_NEIGH_CACHE_MISS,
>       XLATE_TUNNEL_HEADER_BUILD_FAILED,
> +     XLATE_THREAD_STACK_EXHAUSTED,
>       XLATE_MAX,
>  };
>  #endif
> diff --git a/include/openvswitch/compiler.h b/include/openvswitch/compiler.h
> index bd30369a7..352980e19 100644
> --- a/include/openvswitch/compiler.h
> +++ b/include/openvswitch/compiler.h
> @@ -29,6 +29,9 @@
>  #ifndef __has_attribute
>    #define __has_attribute(x) 0
>  #endif
> +#ifndef __has_builtin
> +  #define __has_builtin(x) 0
> +#endif
>  
>  /* To make OVS_NO_RETURN portable across gcc/clang and MSVC, it should be
>   * added at the beginning of the function declaration. */
> @@ -317,5 +320,13 @@
>  #define BUILD_ASSERT_DECL_GCCONLY(EXPR) ((void) 0)
>  #endif
>  
> +/* OVS_FRAME_ADDRESS can be used to get the address of the current stack 
> frame.
> + * Note: Attempts to get address of any frame beside the current one (0) are
> + * dangerous and can lead to crashes according to GCC documentation.  */
> +#if __has_builtin(__builtin_frame_address)
> +#define OVS_FRAME_ADDRESS() ((char *) __builtin_frame_address(0))
> +#else
> +#define OVS_FRAME_ADDRESS() ((char *) 0)
> +#endif
>  
>  #endif /* compiler.h */
> diff --git a/lib/odp-execute.c b/lib/odp-execute.c
> index ecbda8c01..2271e7d72 100644
> --- a/lib/odp-execute.c
> +++ b/lib/odp-execute.c
> @@ -60,6 +60,7 @@ COVERAGE_DEFINE(drop_action_tunnel_routing_failed);
>  COVERAGE_DEFINE(drop_action_tunnel_output_no_ethernet);
>  COVERAGE_DEFINE(drop_action_tunnel_neigh_cache_miss);
>  COVERAGE_DEFINE(drop_action_tunnel_header_build_failed);
> +COVERAGE_DEFINE(drop_action_thread_stack_exhausted);
>  
>  static void
>  dp_update_drop_action_counter(enum xlate_error drop_reason,
> @@ -116,6 +117,9 @@ dp_update_drop_action_counter(enum xlate_error 
> drop_reason,
>     case XLATE_TUNNEL_HEADER_BUILD_FAILED:
>          COVERAGE_ADD(drop_action_tunnel_header_build_failed, delta);
>          break;
> +   case XLATE_THREAD_STACK_EXHAUSTED:
> +        COVERAGE_ADD(drop_action_thread_stack_exhausted, delta);
> +        break;
>     case XLATE_MAX:
>     default:
>          VLOG_ERR_RL(&rl, "Invalid Drop reason type: %d", drop_reason);
> diff --git a/lib/ovs-thread.c b/lib/ovs-thread.c
> index 243791de8..a390ed4f5 100644
> --- a/lib/ovs-thread.c
> +++ b/lib/ovs-thread.c
> @@ -404,6 +404,52 @@ ovsthread_id_init(void)
>      return *ovsthread_id_get() = atomic_count_inc(&next_id);
>  }
>  
> +DEFINE_STATIC_PER_THREAD_DATA(uintptr_t, ovsthread_stack_base, 0);
> +DEFINE_STATIC_PER_THREAD_DATA(size_t, ovsthread_stack_size, 0);
> +
> +void
> +ovsthread_stack_init(void)
> +{
> +    pthread_attr_t attr;
> +    size_t stacksize;
> +    int error;
> +
> +    ovs_assert(*ovsthread_stack_base_get() == 0);
> +    *ovsthread_stack_base_get() = (uintptr_t) OVS_FRAME_ADDRESS();
> +
> +    pthread_attr_init(&attr);

Since we're already in the realm of needing special support and
non-portable stuff, maybe something like pthread_getattr_np() to get the
actual configuration rather than whatever is in the default thread attr?
We can always fall to pthread_attr_init in cases where that call doesn't
exist.

We could also pass a flag that will use getrlimit(RLIMIT_STACK) for the
main thread, and that should be 'close' to accurate.

> +
> +    error = pthread_attr_getstacksize(&attr, &stacksize);
> +    if (error) {
> +        VLOG_ABORT("pthread_attr_getstacksize failed: %s",
> +                   ovs_strerror(error));
> +    }
> +    *ovsthread_stack_size_get() = stacksize;
> +    pthread_attr_destroy(&attr);
> +}
> +
> +bool
> +ovsthread_low_on_stack(void)
> +{
> +    uintptr_t curr = (uintptr_t) OVS_FRAME_ADDRESS();
> +    uintptr_t base = *ovsthread_stack_base_get();
> +    size_t size = *ovsthread_stack_size_get();
> +    size_t used;
> +
> +    if (!curr || !base || !size) {
> +        /* Either not initialized or not supported. */
> +        return false;
> +    }
> +
> +    used = (base > curr) ? base - curr : curr - base;
> +
> +    /* Consider 'low' as less than a 100 KB. */
> +    if (size < used + 100 * 1024) {
> +        return true;
> +    }
> +    return false;
> +}
> +
>  static void *
>  ovsthread_wrapper(void *aux_)
>  {
> @@ -412,6 +458,7 @@ ovsthread_wrapper(void *aux_)
>      unsigned int id;
>  
>      id = ovsthread_id_init();
> +    ovsthread_stack_init();
>  
>      aux = *auxp;
>      free(auxp);
> diff --git a/lib/ovs-thread.h b/lib/ovs-thread.h
> index d37f4ffc4..55c9859b9 100644
> --- a/lib/ovs-thread.h
> +++ b/lib/ovs-thread.h
> @@ -474,6 +474,14 @@ ovsthread_id_self(void)
>      return id;
>  }
>  
> +/* Enables use of ovsthread_low_on_stack().  Must be called by a new thread
> + * early after the thread creation, e.g. called from ovs_thread_create(). */
> +void ovsthread_stack_init(void);
> +
> +/* Returns 'true' if the current thread used up most of its stack space.
> + * Can be used, for example, to check if recursion has to be stopped. */
> +bool ovsthread_low_on_stack(void);
> +
>  /* Simulated global counter.
>   *
>   * Incrementing such a counter is meant to be cheaper than incrementing a
> diff --git a/lib/sat-math.h b/lib/sat-math.h
> index d66872387..34b939277 100644
> --- a/lib/sat-math.h
> +++ b/lib/sat-math.h
> @@ -18,12 +18,9 @@
>  #define SAT_MATH_H 1
>  
>  #include <limits.h>
> +#include "openvswitch/compiler.h"
>  #include "openvswitch/util.h"
>  
> -#ifndef __has_builtin
> -#define __has_builtin(x) 0
> -#endif
> -
>  /* Returns x + y, clamping out-of-range results into the range of the return
>   * type. */
>  static inline unsigned int
> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> index 2c8197fb7..1af2c4118 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -463,6 +463,8 @@ const char *xlate_strerror(enum xlate_error error)
>          return "Tunnel neighbor cache miss";
>      case XLATE_TUNNEL_HEADER_BUILD_FAILED:
>          return "Native tunnel header build failed";
> +    case XLATE_THREAD_STACK_EXHAUSTED:
> +        return "Thread stack exhausted";
>      case XLATE_MAX:
>          break;
>      }
> @@ -4702,6 +4704,9 @@ xlate_resubmit_resource_check(struct xlate_ctx *ctx)
>      } else if (ctx->stack.size >= 65536) {
>          xlate_report_error(ctx, "resubmits yielded over 64 kB of stack");
>          ctx->error = XLATE_STACK_TOO_DEEP;
> +    } else if (ovsthread_low_on_stack()) {
> +        xlate_report_error(ctx, "thread stack exhausted.");
> +        ctx->error = XLATE_THREAD_STACK_EXHAUSTED;

I am a bit confused why we create a new stack exhaustion xlate error
here.  Any reason not to reuse the STACK_TOO_DEEP error case?  Is it
because the specifics of the error report?

>      } else {
>          return true;
>      }
> @@ -4906,8 +4911,9 @@ xlate_group_bucket(struct xlate_ctx *ctx, struct 
> ofputil_bucket *bucket,
>       *
>       * Exception to above is errors which are system limits to protect
>       * translation from running too long or occupy too much space. These 
> errors
> -     * should not be masked. XLATE_RECURSION_TOO_DEEP, 
> XLATE_TOO_MANY_RESUBMITS
> -     * and XLATE_STACK_TOO_DEEP fall in this category. */
> +     * should not be masked. Following errors fall in this category:
> +     * XLATE_RECURSION_TOO_DEEP, XLATE_TOO_MANY_RESUBMITS,
> +     * XLATE_STACK_TOO_DEEP and XLATE_THREAD_STACK_EXHAUSTED. */
>      if (ctx->error == XLATE_TOO_MANY_MPLS_LABELS ||
>          ctx->error == XLATE_UNSUPPORTED_PACKET_TYPE) {
>          /* reset the error and continue processing other buckets */
> diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c
> index 6d90c73b8..99a090c15 100644
> --- a/vswitchd/ovs-vswitchd.c
> +++ b/vswitchd/ovs-vswitchd.c
> @@ -94,6 +94,7 @@ main(int argc, char *argv[])
>      fatal_ignore_sigpipe();
>  
>      daemonize_start(true, hw_rawio_access);
> +    ovsthread_stack_init();
>  
>      if (want_mlockall) {
>  #ifdef HAVE_MLOCKALL

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to