On Thu, 29 Jan 2026, Jakub Jelinek wrote:

> On Thu, Jan 29, 2026 at 09:17:46AM -0500, Patrick Palka wrote:
> > The flatten attribute on a function tells the compiler to recursively
> > inline all calls within it.  The DFS entry point _M_dfs, now that it's
> > non-recursive, is a good candidate for this attribute because its
> > implementation spans multiple subroutines that are only called from
> > _M_dfs, and so should obviously be inlined, but the compiler doesn't
> > know this (specifically it doesn't know that the subroutines are used
> > in the same way by every TU).
> > 
> > We could instead annotate each of its subroutines with always_inline,
> > but this forces the function to get inlined even with -O0, which we
> > don't really want here, whereas flatten only has an effect when
> > compiling with optimizations AFAICT.  It also has the benefit of
> > inlining all the repetitive std::vector operations which themselves
> > are quite hot.
> 
> You could also use
> #ifdef __OPTIMIZE__
> [[__gnu__::__always_inline__]]
> #endif

Ah yes.  FWIW adding always_inline to just _M_node seems sufficient to
ensure all the other DFS subroutines get inlined, and yields a 33% run
time reduction for the benchmark (as opposed to 50% via flatten on
_M_dfs) and 10% code size reduction (as opposed to 15%).  But the
compiler still doesn't inline some/all(?) vector::emplace_back calls and
they end up having 30% overhead according to perf.
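
For illustration, here is a minimal sketch of the conditional
always_inline approach on a single hot helper.  The names Walker, node,
dfs and DFS_ALWAYS_INLINE are hypothetical stand-ins, not the actual
_Executor code; only the __OPTIMIZE__ guard and the attribute spelling
come from the suggestion above.

```cpp
#include <vector>

// Hypothetical macro name, not part of libstdc++.  Expands to the
// attribute only when compiling with optimizations, so -O0 builds
// keep the helper as an ordinary out-of-line call.
#ifdef __OPTIMIZE__
# define DFS_ALWAYS_INLINE [[__gnu__::__always_inline__]]
#else
# define DFS_ALWAYS_INLINE
#endif

// Hypothetical stand-in for the executor: a non-recursive DFS that
// drives an explicit stack.
struct Walker
{
  std::vector<int> stack;
  long visited = 0;

  // Hot helper that is only ever called from dfs().
  DFS_ALWAYS_INLINE inline void
  node(int n)
  {
    ++visited;
    if (n > 0)
      stack.emplace_back(n - 1);
  }

  // Entry point: pops nodes until the explicit stack is empty and
  // returns the number of nodes visited.
  long
  dfs(int start)
  {
    stack.emplace_back(start);
    while (!stack.empty())
      {
        int n = stack.back();
        stack.pop_back();
        node(n);
      }
    return visited;
  }
};
```

Note that in this sketch the attribute only covers node itself; whether
the emplace_back calls inside it get inlined is still up to the
compiler's heuristics.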

> For a flatten, is there a guarantee that it calls only functions defined in
> libstdc++-v3 headers and nothing else (say some user class ctors/dtors
> or whatever else supplied by the user)?  Because if user can instantiate it
> with some huge functions (or small functions with very deep call graphs),
> flatten would try to inline all that and could cause huge compile time/memory
> regressions.

The _Executor template is parameterized by the input iterator type
_BiIter, the allocator type _Alloc, and (effectively) the character
type.  In practice _BiIter and _Alloc are usually instantiated with
standard library types (__normal_iterator, char*, std::allocator etc).
They could definitely be instantiated with a user-defined
iterator/allocator type, but I'd expect these kinds of types (and their
operations) to be small.  I'm not sure how much we should care about
pathological iterator/allocator types with huge definitions.
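
To make the concern concrete, here is a rough sketch (hypothetical
names throughout, nothing from <regex>) of a flatten entry point
templated on an iterator type.  Everything reachable from the loop,
including the user's operator++ and operator!=, becomes an inlining
candidate for the flattened function.

```cpp
// Hypothetical user-defined iterator.  Its operations are tiny here,
// but nothing prevents a user type from having large ones.
struct CountingIter
{
  int pos;

  CountingIter&
  operator++()
  { ++pos; return *this; }

  bool
  operator!=(const CountingIter& other) const
  { return pos != other.pos; }
};

// Hypothetical entry point.  With flatten, the compiler tries to
// inline every call in the body, whichever Iter supplies it.
template<typename Iter>
  [[gnu::flatten]] long
  count_steps(Iter first, Iter last)
  {
    long n = 0;
    for (; first != last; ++first)
      ++n;
    return n;
  }
```

With a small type like CountingIter this is harmless; the compile
time/memory cost would only show up if Iter's operations were large or
had deep call graphs.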

But since adding always_inline to _M_node seems to give most of the
speedup, I'm good with just doing that instead.

> 
>       Jakub
> 
> 
