https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91515
Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #1 from Peter Cordes <peter at cordes dot ca> ---
The real missed optimization is that GCC returns its own incoming arg instead
of returning the copy of it that create() will return in RAX.

This is what blocks tailcall optimization; GCC doesn't "trust" the callee to
return what it's passing as RDI.

See https://stackoverflow.com/a/57597039/224132 for my analysis (the OP asked
the same thing on SO before reporting this, but forgot to link it in the bug
report.)

The RAX return value is rarely used, but it probably should be: RAX is less
likely to have only just been reloaded, so it is more likely than R12 to be
ready early for out-of-order exec. It was either reloaded earlier (while still
somewhere in the callee, if the callee is complex and/or non-leaf) or never
spilled/reloaded at all.

So we're not even gaining a benefit from saving/restoring R12 to hold our
incoming RDI. Thus it's not worth the extra cost (in code-size and
instructions executed), IMO. Trust the callee to return the pointer in RAX.
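For readers without the original test case at hand, here is a minimal sketch of the kind of code this comment is about. Only create() is named in the comment; the struct layout, the name wrapper(), and the noinline attribute (a GCC/Clang extension, used here so a single-file build still emits a real call) are assumptions made for illustration. It relies on the x86-64 System V rule that a struct this large is returned in memory through a hidden pointer passed in RDI, and that the callee hands that same pointer back in RAX.

```c
/* Hypothetical reproduction: everything except the name create() is an
   assumption for illustration, not taken from the bug report. */

/* 24 bytes, so x86-64 System V returns it in memory: the caller passes a
   hidden pointer to the return slot in RDI. */
struct big { long a, b, c; };

/* noinline stands in for create() living in another translation unit. */
__attribute__((noinline))
struct big create(void)
{
    struct big r = { 1, 2, 3 };
    return r;   /* stores through the hidden pointer, returns it in RAX */
}

/* wrapper() receives a hidden pointer in RDI and must return it in RAX.
   It forwards that pointer unchanged to create(), which already returns
   it in RAX, so a plain tail call (jmp create) would be valid.
   The comment above describes GCC instead saving RDI in call-preserved
   R12 around the call and returning its own copy. */
struct big wrapper(void)
{
    return create();
}
```

Compiling this with optimization and `-S` and looking at wrapper() should show whether the compiler emits a call plus a save/restore of a call-preserved register, or a tail call. The exact output depends on the GCC version.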