On Thu, 29 Jan 2026, Jakub Jelinek wrote:

> On Thu, Jan 29, 2026 at 09:17:46AM -0500, Patrick Palka wrote:
> > The flatten attribute on a function tells the compiler to recursively
> > inline all calls within it.  The DFS entrypoint _M_dfs, now that it's
> > non-recursive, is a good candidate for this attribute because its
> > implementation spans multiple subroutines that are only called from
> > _M_dfs, and so should obviously be inlined, but the compiler doesn't
> > know this (specifically it doesn't know that the subroutines are used
> > in the same way by every TU).
> >
> > We could instead annotate each of its subroutines with always_inline,
> > but this forces the function to get inlined even with -O0, which we
> > don't really want here, whereas flatten only has an effect when
> > compiling with optimizations AFAICT.  It also has the benefit of
> > inlining all the repetitive std::vector operations which themselves
> > are quite hot.
>
> You could also use
> #ifdef __OPTIMIZE__
> [[__gnu__::__always_inline__]]
> #endif

Ah yes.  FWIW adding always_inline to just _M_node seems sufficient to
ensure all the other DFS subroutines get inlined, and yields a 33% run
time reduction for the benchmark (as opposed to 50% via flatten on
_M_dfs) and 10% code size reduction (as opposed to 15%).  But the
compiler still doesn't inline some/all(?) vector::emplace_back calls and
they end up having 30% overhead according to perf.

> For a flatten, is there a guarantee that it calls only functions defined in
> libstdc++-v3 headers and nothing else (say some user class ctors/dtors
> or whatever else supplied by the user)?  Because if user can instantiate it
> with some huge functions (or small functions with very deep call graphs),
> flatten would try to inline all that and could cause huge compile time/memory
> regressions.

The _Executor template is parameterized by the input iterator type
_BiIter, the allocator type _Alloc, and (effectively) the character
type.  In practice _BiIter and _Alloc are usually instantiated with
standard library types (__normal_iterator, char*, std::allocator etc).
They could definitely be instantiated with a user-defined
iterator/allocator type, but I'd expect these kinds of types (and their
operations) to be small.  I'm not sure how much we should care about
pathological iterator/allocator types with huge definitions.

But since adding always_inline to _M_node seems to give most of the
speedup, I'm good with just doing that instead.

>
> 	Jakub
>
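
For illustration, the two approaches compared above could be sketched
roughly as follows.  This is a minimal stand-alone sketch, not the actual
libstdc++ code: SketchExecutor, node, node_plain, dfs, dfs_flattened and
the SKETCH_ALWAYS_INLINE macro are all made-up names, and the traversal is
a toy that only mimics the shape of a non-recursive DFS driven by an
explicit std::vector stack.

```cpp
#include <vector>

// Hypothetical macro (not a libstdc++ name): request forced inlining only
// when optimizing, so that -O0 builds keep ordinary out-of-line calls.
#ifdef __OPTIMIZE__
# define SKETCH_ALWAYS_INLINE [[__gnu__::__always_inline__]]
#else
# define SKETCH_ALWAYS_INLINE
#endif

// Hypothetical stand-in for the executor; the logic is illustrative only.
struct SketchExecutor
{
  std::vector<int> stack;  // explicit stack replacing recursion
  long visited = 0;

  // Approach 1: annotate the central helper (the analogue of _M_node).
  SKETCH_ALWAYS_INLINE void
  node(int state)
  {
    ++visited;
    if (state > 0)
      stack.push_back(state - 1);  // schedule the successor state
  }

  // Same helper without any annotation, used by the flatten variant.
  void
  node_plain(int state)
  {
    ++visited;
    if (state > 0)
      stack.push_back(state - 1);
  }

  // Entry point relying on the always_inline helper.
  long
  dfs(int start)
  {
    stack.push_back(start);
    while (!stack.empty())
      {
        int s = stack.back();
        stack.pop_back();
        node(s);
      }
    return visited;
  }

  // Approach 2: flatten on the entry point asks the compiler to inline,
  // where possible, every call made in its body, including the
  // std::vector operations and anything they call in turn.
  [[__gnu__::__flatten__]] long
  dfs_flattened(int start)
  {
    stack.push_back(start);
    while (!stack.empty())
      {
        int s = stack.back();
        stack.pop_back();
        node_plain(s);
      }
    return visited;
  }
};
```

Both entry points compute the same result; the only intended difference
is how much inlining is requested.  In the flatten variant anything
reachable from the body is a candidate, which is where the concern about
user-supplied iterator/allocator types comes from, whereas the
always_inline variant only forces the one annotated helper.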
