On Thu, 29 Jan 2026 at 15:54 +0000, Jonathan Wakely wrote:
On Wed, 28 Jan 2026 at 11:27 -0500, Patrick Palka wrote:
Changes in v7:
- Make fallback_next and fallback_rep_once_more code paths fall through
 to the next / rep_once_more code paths instead of pushing a frame
 that gets immediately popped.

Changes in v6:
- Use an inline namespace to give the _Executor class a different
 mangled name since its implementation is no longer compatible with
 the previous recursive one
- Explicitly cast _M_subexpr to _StateIdT in _M_handle_subexpr_begin/end
 to self-document that what we're storing in _ExecutorFrame::_M_state_id
 isn't really a state id
- I experimented with using std::deque instead of std::vector for the
 frames stack.
 Upsides:
   - Reduces peak memory use from 50MB to 33MB for the microbenchmark
     regex_match(string(200000, 'a'), "(a|b|c)*")
     due to less unused capacity (vector's 2x growth factor means that
     50% of the time there will be at least 25% unused capacity)
   - Seems to be as fast as vector
   - Automatically reclaims unused capacity as the stack unwinds
   - No expensive reallocation needed when growing the stack in the
     case of non-trivially-copyable input iterators
  Downsides:
   - deque will allocate/free a chunk of memory whenever crossing a
     chunk boundary (512 elements) in either direction, which could
     degrade performance in rare cases where our stack size hovers
     near a multiple of the chunk boundary.
   - the fixed chunk size means small inputs will use at least
     512*24 bytes=12kB of memory
  Ultimately it seems using std::deque is better for very large
  inputs, and for small inputs std::vector is slightly better.  I'm
  not sure which container to go with; for now I stuck with std::vector.
  (N.B. we seem to be 2x faster than libc++'s std::regex and use 60%
  less memory for the above microbenchmark)
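
For reference, that microbenchmark is essentially just the following
(the timing and peak-memory measurement harness is omitted here):

  #include <regex>
  #include <string>

  int main()
  {
    // Match 200000 'a' characters against a repeated alternation,
    // which forces the executor to build a very deep frames stack.
    std::string s(200000, 'a');
    std::regex re("(a|b|c)*");
    return std::regex_match(s, re) ? 0 : 1;
  }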

Another option would be to stick with vector, but manage the capacity
semi-manually. So when emplacing N new frames (where N == 1 or N == 2),
check whether there is enough spare capacity and, if not, call

_M_frames.reserve(_M_frames.size() * 3 / 2 + N);

This would grow by a factor of 1.5, which would result in less overhead
when we actually only need a tiny bit more capacity, and is claimed to
be a better choice anyway:
https://github.com/facebook/folly/blob/main/folly/docs/FBVector.md#memory-handling

Maybe put that inside a _M_reserve_frames member function, so that any
function which wants to call _M_frames.emplace_back would first call
_M_reserve_frames(1) or _M_reserve_frames(2), and that helper would do
the check to see whether the capacity needs to grow.
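
Something along these lines, perhaps (only a sketch; the exact
signature and the assumption that _M_frames is a
std::vector<_ExecutorFrame> member are mine):

  void
  _M_reserve_frames(size_t __n)
  {
    // Only reallocate when the next __n pushes would exceed the
    // current capacity; grow by ~1.5x plus the space needed now.
    if (_M_frames.capacity() - _M_frames.size() < __n)
      _M_frames.reserve(_M_frames.size() * 3 / 2 + __n);
  }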

That can be a follow-up patch though, using vector is fine for this
patch.

This _M_reserve_frames function could also end with something like:

  if ((_M_frames.capacity() - _M_frames.size()) < __n)
    __builtin_unreachable();

which might help the compiler see that the emplace_back calls that
follow it always take the fast path (because they never need to
reallocate), and so allow those emplace_back calls to be inlined.

Of course now you've just moved the cold path into _M_reserve_frames
instead, which calls _M_frames.reserve(...). But maybe one call to
reserve that always has to relocate the contents to new storage and
two calls to emplace_back that never have to relocate is more
inlinable than two calls to emplace_back which might *both* need to
relocate, depending on how much available capacity there is before the
first call.
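
In other words, a call site that pushes two frames would end up looking
roughly like this (the frame constructor arguments are just
placeholders):

  _M_reserve_frames(2);                    // the only call that can reallocate
  _M_frames.emplace_back(/* first frame */);   // never reallocates
  _M_frames.emplace_back(/* second frame */);  // never reallocates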

