On Fri, 30 Jan 2026, Patrick Palka wrote:

> On Fri, 30 Jan 2026, Jonathan Wakely wrote:
> 
> > On Fri, 30 Jan 2026 at 15:25, Patrick Palka <[email protected]> wrote:
> > >
> > > Changes in v8:
> > >  - Use char instead of uint8_t
> > >  - Decompose the first byte of _ExecutorFrame::_M_bytes into an
> > >    anonymous union with code-specific members
> > >  - Use [[__fallthrough__]] instead of // fallthrough
> > >
> > > Changes in v7:
> > >  - Make fallback_next and fallback_rep_once_more code paths fall through
> > >    to the next / rep_once_more code paths instead of pushing a frame
> > >    that gets immediately popped.
> > >
> > > Changes in v6:
> > >  - Use an inline namespace to give the _Executor class a different
> > >    mangled name since its implementation is no longer compatible with
> > >    the previous recursive one
> > >  - Explicitly cast _M_subexpr to _StateIdT in _M_handle_subexpr_begin/end
> > >    to self-document that what we're storing in _ExecutorFrame::_M_state_id
> > >    isn't really a state id
> > >  - I experimented with using std::deque instead of std::vector for the
> > >    frames stack.
> > >    Upsides:
> > >      - Reduces peak memory use from 50MB to 33MB for the microbenchmark
> > >        regex_match(string(200000, 'a'), "(a|b|c)*")
> > >        due to less unused capacity (vector's 2x growth factor means that
> > >        50% of the time there will be at least 25% unused capacity)
> > >      - Seems to be as fast as vector
> > >      - Automatically reclaims unused capacity as the stack unwinds
> > >      - No expensive reallocation needed when growing the stack in the
> > >        case of non-trivially-copyable input iterators
> > >     Downsides:
> > >      - deque will allocate/free a chunk of memory whenever crossing a
> > >        chunk boundary (512 elements) in either direction, which could
> > >        degrade performance in rare cases where our stack size hovers
> > >        near a multiple of the chunk boundary.
> > >      - the fixed chunk size means small inputs will use at least
> > >        512*24 bytes=12kB of memory
> > >     Ultimately it seems using std::deque is better for very large
> > >     inputs, and for small inputs std::vector is slightly better.  I'm
> > >     not sure which container to go with, for now I stuck with std::vector.
> > >     (N.B. we seem to be 2x faster than libc++'s std::regex and use 60%
> > >     less memory for the above microbenchmark)
> > >
> > > Changes in v5:
> > >   - Replace char _ExecutorFrame::_M_char with
> > >     uint8_t _ExecutorFrame::_M_bytes[7] to fully account for
> > >     the implicit struct padding and make it char size/signedness
> > >     agnostic.
> > >
> > > Changes in v4:
> > >   - Do not remove the BFS executor.  I convinced myself that it's
> > >     worthwhile to keep it around for backwards compatibility and its
> > >     better time complexity (e.g. for PR78276).
> > >   - However, this means the mangling of _Executor is now the same as
> > >     before (since the __dfs_mode template parameter is no longer gone).
> > >     I believe we need to give this class a different mangling so
> > >     programs that mix old/new std::regex impls continue to work.
> > >
> > > Changes in v3:
> > >
> > >   - Add char _M_char member to _ExecutorFrame (without increasing the
> > >     size of struct thanks to padding)
> > >   - Use it to fix bug in restore_subexpr_end frame code whereby we
> > >     weren't restoring _M_cur_results.matched properly
> > >   - Combine restore_subexpr_begin/end frame codes into one
> > >     restore_cur_results frame code
> > >   - Combine restore_rep_count_val/posn frame codes into single
> > >     restore_rep_count frame_code (this makes the patch 3/4 from v2
> > >     obsolete)
> > >   - Store the frame stack in the data member _ExecutorFrame::_M_frames
> > >     alongside the rest of the executor state instead of using a local.
> > >     Using a local variable is just more code with no real benefit.
> > >
> > > -- >8 --
> > >
> > > libstdc++/regex: Convert DFS executor to use iteration [PR86164]
> > >
> > > This replaces the recursion and implicit system stack usage of the DFS
> > > executor with iteration and an explicit heap-based stack.  After this
> > > patch, we no longer stack overflow on the PR86164 testcases since system
> > > stack usage is no longer linear with respect to input size.  This should
> > > otherwise be a non-functional change.
> > >
> > >         PR libstdc++/86164
> > >
> > > libstdc++-v3/ChangeLog:
> > >
> > >         * include/bits/regex.h (__detail::_Executor): Use inline
> > >         namespace _V2.
> > >         * include/bits/regex_executor.h (__detail::_ExecutorFrame):
> > >         Declare.
> > >         (__detail::_Executor): Use inline namespace _V2.
> > >         (__detail::_Executor::_M_node): Declare.
> > >         (__detail::_Executor::_M_frames): New data member.
> > >         * include/bits/regex_executor.tcc 
> > > (__detail::_ExecutorFrameOpcode):
> > >         New.
> > >         (__detail::_ExecutorFrame): New.
> > >         (__detail::_Executor): Use inline namespace _V2.
> > >         (__detail::_Executor::_M_rep_once_more): Replace recursive
> > >         _M_dfs calls with an _S_opcode_next frame push, and any tail work
> > >         after such calls with an appropriate frame push.
> > >         (__detail::_M_handle_repeat): Likewise.
> > >         (__detail::_M_handle_subexpr_begin): Likewise.
> > >         (__detail::_M_handle_subexpr_end): Likewise.
> > >         (__detail::_M_handle_line_begin_assertion): Likewise.
> > >         (__detail::_M_handle_line_end_assertion): Likewise.
> > >         (__detail::_M_handle_word_boundary): Likewise.
> > >         (__detail::_M_handle_subexpr_lookahead): Likewise.
> > >         (__detail::_M_handle_match): Likewise.
> > >         (__detail::_M_handle_backref): Likewise.
> > >         (__detail::_M_handle_accept): Likewise.
> > >         (__detail::_M_handle_alternative): Likewise.
> > >         (__detail::_M_node): Factored out from _M_dfs.
> > >         (__detail::_M_dfs): Push an initial frame to _M_frames that
> > >         visits the starting node and pass this stack each subroutine.
> > >         Pop the latest _ExecutorFrame from _M_frames and handle
> > >         appropriately according to its _ExecutorFrameOpcode.  Loop until
> > >         _M_frames is empty.
> > >         * include/std/regex: Include <cstdint>.
> > > ---
> > >  libstdc++-v3/include/bits/regex.h            |   2 +
> > >  libstdc++-v3/include/bits/regex_executor.h   |   9 +
> > >  libstdc++-v3/include/bits/regex_executor.tcc | 233 +++++++++++++++----
> > >  3 files changed, 199 insertions(+), 45 deletions(-)
> > >
> > > diff --git a/libstdc++-v3/include/bits/regex.h 
> > > b/libstdc++-v3/include/bits/regex.h
> > > index 68fb6254f362..5bcb9a447ce2 100644
> > > --- a/libstdc++-v3/include/bits/regex.h
> > > +++ b/libstdc++-v3/include/bits/regex.h
> > > @@ -61,8 +61,10 @@ namespace __detail
> > >                       _RegexExecutorPolicy                 __policy,
> > >                       bool                                 __match_mode);
> > >
> > > +_GLIBCXX_BEGIN_INLINE_ABI_NAMESPACE(_V2)
> > >    template<typename, typename, typename, bool>
> > >      class _Executor;
> > > +_GLIBCXX_END_INLINE_ABI_NAMESPACE(_V2)
> > >
> > >    template<typename _Tp>
> > >      struct __is_contiguous_iter : false_type { };
> > > diff --git a/libstdc++-v3/include/bits/regex_executor.h 
> > > b/libstdc++-v3/include/bits/regex_executor.h
> > > index f9857bfc4c1d..94048e1f632c 100644
> > > --- a/libstdc++-v3/include/bits/regex_executor.h
> > > +++ b/libstdc++-v3/include/bits/regex_executor.h
> > > @@ -36,11 +36,15 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > >
> > >  namespace __detail
> > >  {
> > > +_GLIBCXX_BEGIN_INLINE_ABI_NAMESPACE(_V2)
> > >    /**
> > >     * @addtogroup regex-detail
> > >     * @{
> > >     */
> > >
> > > +  template<typename _BiIter, bool _Trivial = 
> > > is_trivially_copyable<_BiIter>::value>
> > > +    struct _ExecutorFrame;
> > > +
> > >    /**
> > >     * @brief Takes a regex and an input string and does the matching.
> > >     *
> > > @@ -142,6 +146,9 @@ namespace __detail
> > >        void
> > >        _M_handle_alternative(_Match_mode, _StateIdT);
> > >
> > > +      void
> > > +      _M_node(_Match_mode, _StateIdT);
> > > +
> > >        void
> > >        _M_dfs(_Match_mode __match_mode, _StateIdT __start);
> > >
> > > @@ -290,6 +297,7 @@ namespace __detail
> > >         };
> > >
> > >      public:
> > > +      _GLIBCXX_STD_C::vector<_ExecutorFrame<_BiIter>>       _M_frames;
> > >        _ResultsVec                                           
> > > _M_cur_results;
> > >        _BiIter                                               _M_current;
> > >        _BiIter                                               _M_begin;
> > > @@ -305,6 +313,7 @@ namespace __detail
> > >      };
> > >
> > >   ///@} regex-detail
> > > +_GLIBCXX_END_INLINE_ABI_NAMESPACE(_V2)
> > >  } // namespace __detail
> > >  _GLIBCXX_END_NAMESPACE_VERSION
> > >  } // namespace std
> > > diff --git a/libstdc++-v3/include/bits/regex_executor.tcc 
> > > b/libstdc++-v3/include/bits/regex_executor.tcc
> > > index 421e585f39d9..36c7e05dcdfb 100644
> > > --- a/libstdc++-v3/include/bits/regex_executor.tcc
> > > +++ b/libstdc++-v3/include/bits/regex_executor.tcc
> > > @@ -36,6 +36,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > >  #pragma GCC diagnostic ignored "-Wc++17-extensions" // if constexpr
> > >  namespace __detail
> > >  {
> > > +_GLIBCXX_BEGIN_INLINE_ABI_NAMESPACE(_V2)
> > >    template<typename _BiIter, typename _Alloc, typename _TraitsT,
> > >            bool __dfs_mode>
> > >      bool _Executor<_BiIter, _Alloc, _TraitsT, __dfs_mode>::
> > > @@ -55,6 +56,85 @@ namespace __detail
> > >        return false;
> > >      }
> > >
> > > +  enum _ExecutorFrameOpcode : unsigned char
> > > +  {
> > > +    _S_fopcode_next,
> > > +    _S_fopcode_fallback_next,
> > > +    _S_fopcode_rep_once_more,
> > > +    _S_fopcode_fallback_rep_once_more,
> > > +    _S_fopcode_posix_alternative,
> > > +    _S_fopcode_merge_sol,
> > > +    _S_fopcode_restore_current,
> > > +    _S_fopcode_restore_cur_results,
> > > +    _S_fopcode_restore_rep_count,
> > > +    _S_fopcode_decrement_rep_count,
> > > +  };
> > > +
> > > +#define _GLIBCXX_EXECUTOR_FRAME_BYTES_LAYOUT \
> > > +  union {                                   \
> > > +    unsigned char _M_byte0;                 \
> > > +    struct { /* rep_count frame */           \
> > > +      unsigned char _M_count : 2;           \
> > > +    };                                      \
> > > +    struct { /* restore_cur_results frame */ \
> > > +      unsigned char _M_end : 1;                     \
> > > +      unsigned char _M_matched : 1;         \
> > > +    };                                      \
> > > +  };                                        \
> > > +  unsigned char _M_bytes[6]
> > 
> > Ah, it didn't occur to me that repacing _M_bytes[7] with this union
> > would need to be spammed out in two places, sorry!
> > 
> > This patch is OK for trunk, thanks!
> > 
> > Maybe as a follow-up we can introduce:
> > 
> > struct _ExecutorFrameBase
> > {
> >   _ExecutorFrameOpcode _M_op;
> >   ... the union ...
> >   unsigned char _M_bytes[6];
> >   _StateIdT _M_state_id;
> > };
> > 
> > for the parts that are common to the primary template and the partial
> > specialization, and then make them both derive from that.
> 
> D'oh, using a base class is much cleaner!  I'd prefer making this change now
> rather than as a follow-up.  Like so?  The second patch is effectively
> unchanged.
> 
> Not only is it cleaner, using a base class enables us to store frames such
> as _S_fopcode_next that don't need the extra _M_val/_M_pos space on a separate
> stack of _ExecutorFrameBase objects which would help save space.  I'll 
> experiment
> with that.

This optimization (essentially array-of-struct -> struct-of-array) has
zero benefit for ECMAScript + DFS mode because the "bare" frames (that
don't need the _M_val/_M_pos space) are always at the top of the stack
and get immediately popped.  So the separate stack of bare frames is
always size 1.  I haven't measured how beneficial it is in other modes
because it seems not worth the complexity if the default and by far most
commont mode doesn't benefit from it.

Reply via email to