On 23 Feb 2024, at 2:26, Ilya Maximets wrote:
> This change attempts to track the actual available stack and force > recircuation if the flow translation logic is about to overflow it. Thanks Ilya, for taking a stab at this! > Unlike the recursion depth counter, this approach allows to track the > actual stack usage and bail early even if the recursion depth is not > reached yet. This is important because different actions have vastly > different stack requirements and different systems have different > amount of stack allocated per thread by default. > > Should work with both GCC and Clang, will likely not work on Windows. > > The main thread is not treated fairly in this version. At least on > Linux the main thread can grow its stack if the limit is dynamically > increased. That is not normally true for other threads. However > this patch sticks to initial stack size even for the main thread. Was going over the code first not reading the commit message, ignore my comment below ;) > Sending as an RFC for discussion. For v1 this change should likely > be split in about 4 patches: > > 1. Preserve resubmits counter across recirculations. We might need > that as a bug fix even, because it might be technically possible to > bypass the resubmit limit by inducing recirculations in the current > code, e.g. with conntrack. > > 2. Introduction of the stack frame functions. > > 3. Actual recirculation on stack exhaution. > > 4. Enabe tracing over this kind of recirculations. > > Tracing also may need a refactor, it doesn't look particularly clean. > And we may want to track resubmits over packet-ins as well. > > The behavioral change with this approach is that if we have a large > pipeline that eventually exceeds the number of resubmits, part of this > pipeline can be executed in the datapath then packet recirculates and > will be dropped afterwards. In contrast, no actions will be executed > today in such scenario, but OVS would also crash with stack overflow, > so I'm not sure if that counts as a behavioral change. :) > > Another one is tracking resubmits over recirculations. That will > impact, for example, forked pipelines on conntrack, since resubmits > before the fork will be accounted for after the fork. > > Didn't test much, only checked an infinite resubmit case: > > make -j8 > (ulimit -s 386; make sandbox) > ovs-vsctl add-br br0 > ovs-vsctl add-port br0 p1 > ovs-ofclt del-flows br0 > (for i in $(seq 0 64); do > j=$(expr $i + 1); > echo "table=$i, actions=local,resubmit(,$j),local,resubmit(,$j),local"; > done; > echo "table=65, actions=resubmit(,0)") > ./resubmits.txt > ovs-ofclt add-flows br0 ./resubmits.txt > ovs-appctl ofproto/trace br0 'in_port=1' > ./trace.txt > > Depending on compiler flags it manages to get to 4096 resubmit limit > in anywhere from 5 to 16 recirculations. > > Thoughts? I think the approach is good, and the code seems ok, although it may require some cleanup as you've mentioned. The whole xlate code puzzles me every time I look at it (it always seems to work differently than I thought it was ;) CCing Mike for some input, as he has worked on stack optimizations in this area before. Cheers, Eelco > Signed-off-by: Ilya Maximets <i.maxim...@ovn.org> > --- > include/openvswitch/compiler.h | 12 +++++++++ > lib/ovs-thread.c | 47 ++++++++++++++++++++++++++++++++++ > lib/ovs-thread.h | 8 ++++++ > lib/sat-math.h | 5 +--- > ofproto/ofproto-dpif-rid.h | 1 + > ofproto/ofproto-dpif-trace.h | 2 ++ > ofproto/ofproto-dpif-xlate.c | 32 +++++++++++++++++++++++ > vswitchd/ovs-vswitchd.c | 1 + > 8 files changed, 104 insertions(+), 4 deletions(-) > > diff --git a/include/openvswitch/compiler.h b/include/openvswitch/compiler.h > index 878c5c6a7..8da2e6eef 100644 > --- a/include/openvswitch/compiler.h > +++ b/include/openvswitch/compiler.h > @@ -26,6 +26,9 @@ > #ifndef __has_extension > #define __has_extension(x) 0 > #endif > +#ifndef __has_builtin > + #define __has_builtin(x) 0 > +#endif > > /* To make OVS_NO_RETURN portable across gcc/clang and MSVC, it should be > * added at the beginning of the function declaration. */ > @@ -300,5 +303,14 @@ > #define BUILD_ASSERT_DECL_GCCONLY(EXPR) ((void) 0) > #endif > > +/* OVS_FRAME_ADDRESS can be used to get the address of the current stack > frame. > + * Note: Attempts to get address of any frame beside the current one (0) are > + * dangerous and can lead to crashes according to GCC documentation. */ > +#if __has_builtin(__builtin_frame_address) > +#define OVS_FRAME_ADDRESS() ((char *) __builtin_frame_address(0)) > +#else > +#define OVS_FRAME_ADDRESS() ((char *) 0) > +#endif > + > > #endif /* compiler.h */ > diff --git a/lib/ovs-thread.c b/lib/ovs-thread.c > index f80008061..637ce27ec 100644 > --- a/lib/ovs-thread.c > +++ b/lib/ovs-thread.c > @@ -407,6 +407,52 @@ ovsthread_id_init(void) > return *ovsthread_id_get() = atomic_count_inc(&next_id); > } > > +DEFINE_STATIC_PER_THREAD_DATA(uintptr_t, ovsthread_stack_base, 0); > +DEFINE_STATIC_PER_THREAD_DATA(size_t, ovsthread_stack_size, 0); > + > +void > +ovsthread_stack_init(void) > +{ > + pthread_attr_t attr; > + size_t stacksize; > + int error; > + > + ovs_assert(*ovsthread_stack_base_get() == 0); > + *ovsthread_stack_base_get() = (uintptr_t) OVS_FRAME_ADDRESS(); > + > + pthread_attr_init(&attr); > + > + error = pthread_attr_getstacksize(&attr, &stacksize); > + if (error) { > + VLOG_ABORT("pthread_attr_getstacksize failed: %s", > + ovs_strerror(error)); > + } > + *ovsthread_stack_size_get() = stacksize; The man pages discuss the main thread's ability to increase its stack size. However, for OVS, this detail might not be significant since the issue lies with the other threads. > + pthread_attr_destroy(&attr); > +} > + > +bool > +ovsthread_low_on_stack(void) > +{ > + uintptr_t curr = (uintptr_t) OVS_FRAME_ADDRESS(); > + uintptr_t base = *ovsthread_stack_base_get(); > + size_t size = *ovsthread_stack_size_get(); > + size_t used; > + > + if (!curr || !base || !size) { > + /* Either not initialized or not supported. */ > + return false; > + } > + > + used = (base > curr) ? base - curr : curr - base; > + > + /* Consider 'low' as less than a 100 KB. */ > + if (size < used + 100 * 1024) { > + return true; > + } > + return false; > +} > + > static void * > ovsthread_wrapper(void *aux_) > { > @@ -415,6 +461,7 @@ ovsthread_wrapper(void *aux_) > unsigned int id; > > id = ovsthread_id_init(); > + ovsthread_stack_init(); > > aux = *auxp; > free(auxp); > diff --git a/lib/ovs-thread.h b/lib/ovs-thread.h > index aac5e19c9..e81f28970 100644 > --- a/lib/ovs-thread.h > +++ b/lib/ovs-thread.h > @@ -474,6 +474,14 @@ ovsthread_id_self(void) > return id; > } > > +/* Enables use of ovsthread_low_on_stack(). Must be called by a new thread > + * early after the thread creation. */ > +void ovsthread_stack_init(void); Maybe add a note it’s already done by ovs_thread_create(). > +/* Returns 'true' is the current thread used up most of its stack space. > + * Can be used, for example, to check if recursion has to be stopped. */ > +bool ovsthread_low_on_stack(void); > + > /* Simulated global counter. > * > * Incrementing such a counter is meant to be cheaper than incrementing a > diff --git a/lib/sat-math.h b/lib/sat-math.h > index d66872387..34b939277 100644 > --- a/lib/sat-math.h > +++ b/lib/sat-math.h > @@ -18,12 +18,9 @@ > #define SAT_MATH_H 1 > > #include <limits.h> > +#include "openvswitch/compiler.h" > #include "openvswitch/util.h" > > -#ifndef __has_builtin > -#define __has_builtin(x) 0 > -#endif > - > /* Returns x + y, clamping out-of-range results into the range of the return > * type. */ > static inline unsigned int > diff --git a/ofproto/ofproto-dpif-rid.h b/ofproto/ofproto-dpif-rid.h > index 4df630c62..9488a3850 100644 > --- a/ofproto/ofproto-dpif-rid.h > +++ b/ofproto/ofproto-dpif-rid.h > @@ -151,6 +151,7 @@ struct frozen_state { > struct frozen_metadata metadata; /* Flow metadata. */ > uint8_t *stack; /* Stack if any. */ > size_t stack_size; > + size_t resubmits; /* Number of resubmits prior to freeze. */ > mirror_mask_t mirrors; /* Mirrors already output. */ > bool conntracked; /* Conntrack occurred prior to freeze. */ > bool was_mpls; /* MPLS packet */ > diff --git a/ofproto/ofproto-dpif-trace.h b/ofproto/ofproto-dpif-trace.h > index f579a5ca4..1e34cdb95 100644 > --- a/ofproto/ofproto-dpif-trace.h > +++ b/ofproto/ofproto-dpif-trace.h > @@ -51,9 +51,11 @@ enum oftrace_node_type { > > /* Reason why a flow is in a recirculation queue. */ > enum oftrace_recirc_type { > + OFT_RECIRC_NONE = 0, > OFT_RECIRC_CONNTRACK, > OFT_RECIRC_MPLS, > OFT_RECIRC_BOND, > + OFT_RECIRC_STACK_EXHAUSTED, > }; > > /* A node within a trace. */ > diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c > index 89f183182..e5e17b0a6 100644 > --- a/ofproto/ofproto-dpif-xlate.c > +++ b/ofproto/ofproto-dpif-xlate.c > @@ -71,6 +71,7 @@ > COVERAGE_DEFINE(xlate_actions); > COVERAGE_DEFINE(xlate_actions_oversize); > COVERAGE_DEFINE(xlate_actions_too_many_output); > +COVERAGE_DEFINE(xlate_stack_exhausted); > > VLOG_DEFINE_THIS_MODULE(ofproto_dpif_xlate); > > @@ -381,6 +382,7 @@ struct xlate_ctx { > * case at that point. > */ > bool freezing; > + enum oftrace_recirc_type recirc_type; /* Recirculation type for tracing. > */ > bool recirc_update_dp_hash; /* Generated recirculation will be > preceded > * by datapath HASH action to get an > updated > * dp_hash after recirculation. */ > @@ -495,6 +497,7 @@ ctx_cancel_freeze(struct xlate_ctx *ctx) > { > if (ctx->freezing) { > ctx->freezing = false; > + ctx->recirc_type = OFT_RECIRC_NONE; > ctx->recirc_update_dp_hash = false; > ofpbuf_clear(&ctx->frozen_actions); > ctx->frozen_actions.header = NULL; > @@ -4607,6 +4610,12 @@ xlate_resubmit_resource_check(struct xlate_ctx *ctx) > } else if (ctx->stack.size >= 65536) { > xlate_report_error(ctx, "resubmits yielded over 64 kB of stack"); > ctx->error = XLATE_STACK_TOO_DEEP; > + } else if (ovsthread_low_on_stack()) { > + xlate_report(ctx, OFT_WARN, "Thread stack exhausted, " > + "recirculating to avoid overflow."); > + COVERAGE_INC(xlate_stack_exhausted); > + ctx->recirc_type = OFT_RECIRC_STACK_EXHAUSTED; > + ctx_trigger_freeze(ctx); > } else { > return true; > } > @@ -5125,6 +5134,7 @@ xlate_controller_action(struct xlate_ctx *ctx, int len, > .ofproto_uuid = ctx->xbridge->ofproto->uuid, > .stack = ctx->stack.data, > .stack_size = ctx->stack.size, > + .resubmits = ctx->resubmits, > .mirrors = ctx->mirrors, > .conntracked = ctx->conntracked, > .was_mpls = ctx->was_mpls, > @@ -5200,6 +5210,7 @@ finish_freezing__(struct xlate_ctx *ctx, uint8_t table) > .ofproto_uuid = ctx->xbridge->ofproto->uuid, > .stack = ctx->stack.data, > .stack_size = ctx->stack.size, > + .resubmits = ctx->resubmits, > .mirrors = ctx->mirrors, > .conntracked = ctx->conntracked, > .was_mpls = ctx->was_mpls, > @@ -5248,6 +5259,24 @@ finish_freezing__(struct xlate_ctx *ctx, uint8_t table) > act_hash->hash_basis = ctx->dp_hash_basis; > } > nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC, recirc_id); > + > + if (OVS_UNLIKELY(ctx->xin->trace) && recirc_id && > + ctx->recirc_type == OFT_RECIRC_STACK_EXHAUSTED) { > + if (oftrace_add_recirc_node(ctx->xin->recirc_queue, > + ctx->recirc_type, > + &ctx->xin->flow, > + ctx->ct_nat_action, > + ctx->xin->packet, > + recirc_id, > + ctx->xin->flow.ct_zone)) { > + xlate_report(ctx, OFT_DETAIL, > + "Packet is sent for recirculation, " > + "recric_id = 0x%"PRIx32".", recirc_id); > + } else { > + xlate_report(ctx, OFT_DETAIL, "Failed to trace recirculation > " > + "with recirc_id = 0x%"PRIx32".", recirc_id); > + } > + } > } > > /* Undo changes done by freezing. */ > @@ -8062,6 +8091,7 @@ xlate_actions(struct xlate_in *xin, struct xlate_out > *xout) > .mirrors = 0, > > .freezing = false, > + .recirc_type = OFT_RECIRC_NONE, > .recirc_update_dp_hash = false, > .frozen_actions = OFPBUF_STUB_INITIALIZER(frozen_actions_stub), > .pause = NULL, > @@ -8139,6 +8169,8 @@ xlate_actions(struct xlate_in *xin, struct xlate_out > *xout) > clear_conntrack(&ctx); > } > > + ctx.resubmits = state->resubmits; > + > /* Restore pipeline metadata. May change flow's in_port and other > * metadata to the values that existed when freezing was triggered. > */ > frozen_metadata_to_flow(&ctx.xbridge->ofproto->up, > diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c > index 273af9f5d..326e5b9f7 100644 > --- a/vswitchd/ovs-vswitchd.c > +++ b/vswitchd/ovs-vswitchd.c > @@ -93,6 +93,7 @@ main(int argc, char *argv[]) > fatal_ignore_sigpipe(); > > daemonize_start(true, hw_rawio_access); > + ovsthread_stack_init(); > > if (want_mlockall) { > #ifdef HAVE_MLOCKALL > -- > 2.43.0 > > _______________________________________________ > dev mailing list > d...@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev