On Fri, 30 Jan 2026 at 15:25, Patrick Palka <[email protected]> wrote:
>
> On Thu, 29 Jan 2026, Jonathan Wakely wrote:
>
> > On Wed, 28 Jan 2026 at 11:27 -0500, Patrick Palka wrote:
> > > Changes in v7:
> > > - Make fallback_next and fallback_rep_once_more code paths fall through
> > >   to the next / rep_once_more code paths instead of pushing a frame
> > >   that gets immediately popped.
> > >
> > > Changes in v6:
> > > - Use an inline namespace to give the _Executor class a different
> > >   mangled name since its implementation is no longer compatible with
> > >   the previous recursive one
> > > - Explicitly cast _M_subexpr to _StateIdT in _M_handle_subexpr_begin/end
> > >   to self-document that what we're storing in _ExecutorFrame::_M_state_id
> > >   isn't really a state id
> > > - I experimented with using std::deque instead of std::vector for the
> > >   frames stack.
> > >   Upsides:
> > >     - Reduces peak memory use from 50MB to 33MB for the microbenchmark
> > >       regex_match(string(200000, 'a'), "(a|b|c)*")
> > >       due to less unused capacity (vector's 2x growth factor means that
> > >       50% of the time there will be at least 25% unused capacity)
> > >     - Seems to be as fast as vector
> > >     - Automatically reclaims unused capacity as the stack unwinds
> > >     - No expensive reallocation needed when growing the stack in the
> > >       case of non-trivially-copyable input iterators
> > >    Downsides:
> > >     - deque will allocate/free a chunk of memory whenever crossing a
> > >       chunk boundary (512 elements) in either direction, which could
> > >       degrade performance in rare cases where our stack size hovers
> > >       near a multiple of the chunk boundary.
> > >     - the fixed chunk size means small inputs will use at least
> > >       512*24 bytes=12kB of memory
> > >    Ultimately it seems using std::deque is better for very large
> > >    inputs, and for small inputs std::vector is slightly better.  I'm
> > >    not sure which container to go with, for now I stuck with std::vector.
> > >    (N.B. we seem to be 2x faster than libc++'s std::regex and use 60%
> > >    less memory for the above microbenchmark)
> >
> > Another option would be to stick with vector, but manage the capacity
> > semi-manually. So when emplacing N new frames (where N == 1 or N == 2)
> > check if we have capacity and if not, call
> >
> > _M_frames.reserve(_M_frames.size() * 3 / 2 + N);
> >
> > This would grow by a factor of 1.5 which would result in less overhead
> > when we actually only need a tiny bit more capacity, and is claimed to
> > be a better choice anyway:
> > https://github.com/facebook/folly/blob/main/folly/docs/FBVector.md#memory-handling
> >
> > Maybe put that inside a _M_reserve_frames member so that any function
> > which wants to call _M_frames.emplace_back would call
> > _M_reserve_frames(1) or _M_reserve_frames(2) and inside that function
> > it would do the check to see if it needs to increase the capacity.
> >
> > That can be a follow-up patch though, using vector is fine for this
> > patch.
>
> Will experiment with this as a follow-up
>
> >
> > > Changes in v5:
> > >  - Replace char _ExecutorFrame::_M_char with
> > >    uint8_t _ExecutorFrame::_M_bytes[7] to fully account for
> > >    the implicit struct padding and make it char size/signedness
> > >    agnostic.
> > >
> > > Changes in v4:
> > >  - Do not remove the BFS executor.  I convinced myself that it's
> > >    worthwhile to keep it around for backwards compatibility and its
> > >    better time complexity (e.g. for PR78276).
> > >  - However, this means the mangling of _Executor is now the same as
> > >    before (since the __dfs_mode template parameter is no longer gone).
> > >    I believe we need to give this class a different mangling so
> > >    programs that mix old/new std::regex impls continue to work.
> > >
> > > Changes in v3:
> > >
> > >  - Add char _M_char member to _ExecutorFrame (without increasing the
> > >    size of struct thanks to padding)
> > >  - Use it to fix bug in restore_subexpr_end frame code whereby we
> > >    weren't restoring _M_cur_results.matched properly
> > >  - Combine restore_subexpr_begin/end frame codes into one
> > >    restore_cur_results frame code
> > >  - Combine restore_rep_count_val/posn frame codes into single
> > >    restore_rep_count frame_code (this makes the patch 3/4 from v2
> > >    obsolete)
> > >  - Store the frame stack in the data member _ExecutorFrame::_M_frames
> > >    alongside the rest of the executor state instead of using a local.
> > >    Using a local variable is just more code with no real benefit.
> > >
> > > -- >8 --
> > >
> > > libstdc++/regex: Convert DFS executor to use iteration [PR86164]
> > >
> > > This replaces the recursion and implicit system stack usage of the DFS
> > > executor with iteration and an explicit heap-based stack.  After this
> > > patch, we no longer stack overflow on the PR86164 testcases since system
> > > stack usage is no longer linear with respect to input size.  This should
> > > otherwise be a non-functional change.
> >
> > This is so cool. Much simpler and cleaner than my abandoned attempt to
> > do this - and with the added benefit that it actually works!
> >
> > >     PR libstdc++/86164
> > >
> > > libstdc++-v3/ChangeLog:
> > >
> > >     * include/bits/regex.h (__detail::_Executor): Use inline
> > >     namespace _V2.
> > >     * include/bits/regex_executor.h (__detail::_ExecutorFrame):
> > >     Declare.
> > >     (__detail::_Executor): Use inline namespace _V2.
> > >     (__detail::_Executor::_M_node): Declare.
> > >     (__detail::_Executor::_M_frames): New data member.
> > >     * include/bits/regex_executor.tcc (__detail::_ExecutorFrameOpcode):
> > >     New.
> > >     (__detail::_ExecutorFrame): New.
> > >     (__detail::_Executor): Use inline namespace _V2.
> > >     (__detail::_Executor::_M_rep_once_more): Replace recursive
> > >     _M_dfs calls with an _S_opcode_next frame push, and any tail work
> > >     after such calls with an appropriate frame push.
> > >     (__detail::_M_handle_repeat): Likewise.
> > >     (__detail::_M_handle_subexpr_begin): Likewise.
> > >     (__detail::_M_handle_subexpr_end): Likewise.
> > >     (__detail::_M_handle_line_begin_assertion): Likewise.
> > >     (__detail::_M_handle_line_end_assertion): Likewise.
> > >     (__detail::_M_handle_word_boundary): Likewise.
> > >     (__detail::_M_handle_subexpr_lookahead): Likewise.
> > >     (__detail::_M_handle_match): Likewise.
> > >     (__detail::_M_handle_backref): Likewise.
> > >     (__detail::_M_handle_accept): Likewise.
> > >     (__detail::_M_handle_alternative): Likewise.
> > >     (__detail::_M_node): Factored out from _M_dfs.
> > >     (__detail::_M_dfs): Push an initial frame to _M_frames that
> > >     visits the starting node and pass this stack each subroutine.
> > >     Pop the latest _ExecutorFrame from _M_frames and handle
> > >     appropriately according to its _ExecutorFrameOpcode.  Loop until
> > >     _M_frames is empty.
> > >     * include/std/regex: Include <cstdint>.
> > > ---
> > > libstdc++-v3/include/bits/regex.h            |   2 +
> > > libstdc++-v3/include/bits/regex_executor.h   |   9 +
> > > libstdc++-v3/include/bits/regex_executor.tcc | 218 +++++++++++++++----
> > > libstdc++-v3/include/std/regex               |   1 +
> > > 4 files changed, 185 insertions(+), 45 deletions(-)
> > >
> > > diff --git a/libstdc++-v3/include/bits/regex.h
> > > b/libstdc++-v3/include/bits/regex.h
> > > index 68fb6254f362..5bcb9a447ce2 100644
> > > --- a/libstdc++-v3/include/bits/regex.h
> > > +++ b/libstdc++-v3/include/bits/regex.h
> > > @@ -61,8 +61,10 @@ namespace __detail
> > >                   _RegexExecutorPolicy                 __policy,
> > >                   bool                                 __match_mode);
> > >
> > > +_GLIBCXX_BEGIN_INLINE_ABI_NAMESPACE(_V2)
> > >   template<typename, typename, typename, bool>
> > >     class _Executor;
> > > +_GLIBCXX_END_INLINE_ABI_NAMESPACE(_V2)
> > >
> > >   template<typename _Tp>
> > >     struct __is_contiguous_iter : false_type { };
> > > diff --git a/libstdc++-v3/include/bits/regex_executor.h
> > > b/libstdc++-v3/include/bits/regex_executor.h
> > > index f9857bfc4c1d..94048e1f632c 100644
> > > --- a/libstdc++-v3/include/bits/regex_executor.h
> > > +++ b/libstdc++-v3/include/bits/regex_executor.h
> > > @@ -36,11 +36,15 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > >
> > > namespace __detail
> > > {
> > > +_GLIBCXX_BEGIN_INLINE_ABI_NAMESPACE(_V2)
> > >   /**
> > >    * @addtogroup regex-detail
> > >    * @{
> > >    */
> > >
> > > +  template<typename _BiIter, bool _Trivial =
> > > is_trivially_copyable<_BiIter>::value>
> >
> > I wondered if trivially-copyable is the right property, but I think it
> > is close enough. I think the precise conditions for using this
> > optimization are trivially move constructible, trivially move
> > assignable, and trivially destructible. In practice, I think iterator
> > types that satisfy those requirements but have non-trivial copy
> > construction or non-trivial copy assignment (and so are not
> > trivially-copyable) are unlikely. So using trivially-copyable is not
> > likely to disable the optimization for any realistic types.
>
> Also trivially copyable is already needed for fast std::vector
> resizing so I figured might as well use the same property for this
> optimization.

Ah yes, good point. That definitely makes sense then.

>
> >
> > > +    struct _ExecutorFrame;
> > > +
> > >   /**
> > >    * @brief Takes a regex and an input string and does the matching.
> > >    *
> > > @@ -142,6 +146,9 @@ namespace __detail
> > >       void
> > >       _M_handle_alternative(_Match_mode, _StateIdT);
> > >
> > > +      void
> > > +      _M_node(_Match_mode, _StateIdT);
> > > +
> > >       void
> > >       _M_dfs(_Match_mode __match_mode, _StateIdT __start);
> > >
> > > @@ -290,6 +297,7 @@ namespace __detail
> > >     };
> > >
> > >     public:
> > > +      _GLIBCXX_STD_C::vector<_ExecutorFrame<_BiIter>>       _M_frames;
> > >       _ResultsVec                                           
> > > _M_cur_results;
> > >       _BiIter                                               _M_current;
> > >       _BiIter                                               _M_begin;
> > > @@ -305,6 +313,7 @@ namespace __detail
> > >     };
> > >
> > >  ///@} regex-detail
> > > +_GLIBCXX_END_INLINE_ABI_NAMESPACE(_V2)
> > > } // namespace __detail
> > > _GLIBCXX_END_NAMESPACE_VERSION
> > > } // namespace std
> > > diff --git a/libstdc++-v3/include/bits/regex_executor.tcc
> > > b/libstdc++-v3/include/bits/regex_executor.tcc
> > > index 421e585f39d9..ddfdbb0cb751 100644
> > > --- a/libstdc++-v3/include/bits/regex_executor.tcc
> > > +++ b/libstdc++-v3/include/bits/regex_executor.tcc
> > > @@ -36,6 +36,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > > #pragma GCC diagnostic ignored "-Wc++17-extensions" // if constexpr
> > > namespace __detail
> > > {
> > > +_GLIBCXX_BEGIN_INLINE_ABI_NAMESPACE(_V2)
> > >   template<typename _BiIter, typename _Alloc, typename _TraitsT,
> > >        bool __dfs_mode>
> > >     bool _Executor<_BiIter, _Alloc, _TraitsT, __dfs_mode>::
> > > @@ -55,6 +56,71 @@ namespace __detail
> > >       return false;
> > >     }
> > >
> > > +  enum _ExecutorFrameOpcode : uint8_t
> >
> > Can we use unsigned char to avoid <cstdint> here?
>
> Done
>
> >
> > > +  {
> > > +    _S_fopcode_next,
> > > +    _S_fopcode_fallback_next,
> > > +    _S_fopcode_rep_once_more,
> > > +    _S_fopcode_fallback_rep_once_more,
> > > +    _S_fopcode_posix_alternative,
> > > +    _S_fopcode_merge_sol,
> > > +    _S_fopcode_restore_current,
> > > +    _S_fopcode_restore_cur_results,
> > > +    _S_fopcode_restore_rep_count,
> > > +    _S_fopcode_decrement_rep_count,
> > > +  };
> > > +
> > > +  template<typename _BiIter, bool _Trivial /* =
> > > is_trivially_copyable<_BiIter>::value */>
> > > +    struct _ExecutorFrame
> > > +    {
> > > +      _ExecutorFrame(_ExecutorFrameOpcode __op, _StateIdT __i)
> > > +      : _M_op(__op), _M_state_id(__i)
> > > +      { }
> > > +
> > > +      _ExecutorFrame(_ExecutorFrameOpcode __op, _StateIdT __i, _BiIter 
> > > __p)
> > > +      : _M_op(__op), _M_state_id(__i), _M_pos(__p)
> > > +      { }
> > > +
> > > +      _ExecutorFrame(_ExecutorFrameOpcode __op, _StateIdT __i, long __v)
> > > +      : _M_op(__op), _M_state_id(__i), _M_val(__v)
> > > +      { }
> > > +
> > > +      _ExecutorFrameOpcode _M_op;
> > > +      uint8_t _M_bytes[7];
> > > +      _StateIdT _M_state_id;
> > > +      // _M_pos and _M_val are mutually exclusive, which the optimized
> > > +      // partial specialization below depends on.
> > > +      _BiIter _M_pos = _BiIter();
> > > +      long _M_val = 0;
> > > +    };
> > > +
> > > +  // Space-optimized partial specialization for when the input iterator 
> > > is
> > > +  // trivially copyable.
> > > +  template<typename _BiIter>
> > > +    struct _ExecutorFrame<_BiIter, true>
> > > +    {
> > > +      _ExecutorFrame(_ExecutorFrameOpcode __op, _StateIdT __i)
> > > +      : _M_op(__op), _M_state_id(__i)
> > > +      { }
> > > +
> > > +      _ExecutorFrame(_ExecutorFrameOpcode __op, _StateIdT __i, _BiIter 
> > > __p)
> > > +      : _M_op(__op), _M_state_id(__i), _M_pos(__p)
> > > +      { }
> > > +
> > > +      _ExecutorFrame(_ExecutorFrameOpcode __op, _StateIdT __i, long __v)
> > > +      : _M_op(__op), _M_state_id(__i), _M_val(__v)
> > > +      { }
> > > +
> > > +      _ExecutorFrameOpcode _M_op;
> > > +      uint8_t _M_bytes[7];
> > > +      _StateIdT _M_state_id;
> > > +      union
> > > +      {
> > > +   _BiIter _M_pos;
> > > +   long _M_val;
> > > +      };
> > > +    };
> > > +
> > >   // The _M_main function operates in different modes, DFS mode or BFS 
> > > mode,
> > >   // indicated by template parameter __dfs_mode, and dispatches to one of
> > > the
> > >   // _M_main_dispatch overloads.
> > > @@ -181,19 +247,20 @@ namespace __detail
> > >       auto& __rep_count = _M_rep_count[__i];
> > >       if (__rep_count.second == 0 || __rep_count.first != _M_current)
> > >     {
> > > -     auto __back = __rep_count;
> > > +     _M_frames.emplace_back(_S_fopcode_restore_rep_count,
> > > +                            __i, __rep_count.first);
> > > +     _M_frames.back()._M_bytes[0] = __rep_count.second;
> >
> > __rep_count.second is never greater than 2, right? So always fits in
> > uint8_t.
>
> Yes
>
> >
> > I wondered about replacing _M_bytes[7] with a union like:
> >
> >   union {
> >     unsigned char _M_bytes[7];
> >     struct { // Used by restore_rep_count frame
> >       unsigned char _M_rep_count : 2;
> >     };
> >     struct { // Used by restore_cur_results frame
> >       unsigned char _M_matched : 1;
> >       unsigned char _M_restore_first : 1;
> >     };
> >   };
> >
> > So that we use more precise names than just _M_bytes[0] (and could
> > still extend it later with other alternatives if needed).
> >
> > This might be over-complicating things, but if we *did* want to do it,
> > I think we should decide now because the difference between setting
> > _M_bytes[0] = 0xff and setting _M_restore_first = true would not be
> > compatible if we changed it later (old TUs might expect to find 0xff
> > there and new TUs would only set one bit).
>
> I think slightly more flexible would be:
>
>    union {
>      unsigned char _M_byte0;
>      struct { // Used by restore_rep_count frame
>        unsigned char _M_rep_count : 2;
>      };
>      struct { // Used by restore_cur_results frame
>        unsigned char _M_matched : 1;
>        unsigned char _M_restore_first : 1;
>      };
>    };
>    unsigned char _M_bytes[6];
>
> So that we could decide how to use each byte independently of the other
> (e.g. maybe one byte could be for universal flags that apply to every
> code).

Yes, good thinking, that's better.

Reply via email to