(Adding gcc ML as this needs a larger community wide expertise and
discussion. All, please also see thread and a few follow ups
https://inbox.sourceware.org/binutils/caonfrcv0v1catcpyxpplglzwkssws+wsfd12cg71rsi8ic0...@mail.gmail.com/)
Thanks for starting the discussion Mehrdad.
Stackless coroutines are one aspect, but as we discuss the general
intent/support in the formats for stacktracing through them, the problem
of stackful coroutines may also be good to keep in the context.
Some comments/curiosities below.
On 1/22/26 9:09 AM, Mehrdad Niknami wrote:
Hi,
(Apologies in advance for the long email. This is also my first time
using the mailing list, so sorry in advance for any missteps.)
Briefly: we have been working on incorporating C++ coroutine frames into
stack traces.
The particular issue we have been trying to solve is that many common
tools that obtain stack traces (e.g., perf) do not appear to have a
mechanism for discovering or traversing C++ coroutine frames.
It was suggested to us that SFrames could perhaps facilitate this across
many tools, and so we're reaching out here to discuss whether/how that
may be achievable, as we are not too familiar with SFrame details.
As C++ coroutines have a very high amount of complexity, I have
attempted to distill down the relevant details of our current
design below, along with a quick primer on how C++ coroutines work from
the user side.
Overall, we're wondering if SFrames (either as they exist now, or via a
practical extension) would be suitable for this "stitching" or
"interleaving" of coroutine stack frames, and if so, what we could plan
for and/or expect on that front.
Thank you.
------------------------------------------------------------------------
A toy coroutine may look like this:
Task<int> factorial(int n) {
printf("n = %d\n", n);
if (n == 0) {
co_return 1;
}
int subresult = co_await factorial(n - 1);
co_return n * subresult;
}
The Task type internally holds a std::coroutine_handle<> object, which
is basically an opaque void* pointing to the various bookkeeping data of
the coroutine.
The Task type is specially marked to signify to the compiler that this
function is a coroutine, and defines various behaviors, such as those of
the keywords co_await and co_return (e.g., by storing the provided
values in heap memory somewhere for later retrieval).
Ordinarily, if we have
int main() {
factorial(10).Resume(); // Run the coroutine
}
and if we then stop the program to obtain a stack trace arbitrarily, we
may obtain a typical CPU stack trace such as this:
printf()
factorial(n=3)
/(Note that I'm including the n= values for illustrative
purposes; they of course would not be apparent solely from a
list of instruction addresses.)/
std::coroutine_handle<>::resume()
/(Note that the std::coroutine_handle<>::resume() <https://
en.cppreference.com/w/cpp/coroutine/coroutine_handle/
resume.html> function uses a compiler intrinsic such as
__builtin_coro_resume() <https://clang.llvm.org/docs/
LanguageExtensions.html#c-coroutines-support-builtins> to resume
the coroutine underneath.)/
Task::Resume()
main()
However, what we /wish/ to obtain in such a situation is the following:
printf()
factorial(n=3)
factorial(n=2) (.resume)
factorial(n=1) (.resume)
std::coroutine_handle<>::resume()
Task<>::Resume()
main()
The information required for observing factorial(2) and
factorial(1) does exist in memory, but it is not part of the ordinary
thread call stack.
In our implementation, it is represented via a singly-linked list:
1. The currently-running coroutine (in this case, factorial(3)), which
is the head of the list, is maintained in a known *thread-local*
variable.
2. Each coroutine also remembers the address of its caller (a.k.a.
waiter, a.k.a. continuation).
Curious, when you say "our implementation", do you mean a coroutine
library or clang implementation ? Either way, IIUC, how the caller is
stored in the Promise object is not ABI/language defined. So,
stacktracing through these dynamically allocated Promise objects on the
heap is not easily doable via a third party unwinding application
routine oblivious of the running application itself.
On that note, just for the sake of a thought experiment: Is (Was) it
possible (for stack tracing purposes) to assign some skeleton for the
"linked list data structure" of these dynamically allocated Promise
object on the heap ? Something like having a Promise object header on
the heap which specifies the size of Promise obj, location of the next
one on heap, and offset to the caller IP within the object.
Typically, the address of the calling coroutine is stored inside a so-
called "Promise" object on the heap associated with the Task, whose
address is obtained via invoking std::coroutine_handle<T>::promise() on
the coroutine handle embedded inside the Task.
Each coroutine saves/restores the handle to its caller at various points
via hooks provided through the language & compiler.
When the compiler asks the coroutine "/Now that you are suspended, what
should I resume next?/" (via complicated machinery <https://
en.cppreference.com/w/cpp/language/
coroutines.html#:~:text=if%20await_suspend%20returns%20a%20coroutine%20handle%20for%20some%20other%20coroutine%2C%20that%20handle%20is%20resumed>), the coroutine responds with the handle of its caller (the continuation).
If the coroutine frames are marked as such in the stacktrace format (as
Jens suggested on the binutils thread reply), the stacktacer knows where
to stitch the heap frames without guessing or doing linear memory scans.
On that note, so looks like there can there be a stacktrace of multiple
interleaved subsets of "normal" and coroutine functions ? I.e.,
normal_B ()
coroutine_caller_B ()
coroutine_init_B ()
normal_server_start ()
normal_server_init ()
coroutine_server_caller ()
std::coroutine_handle::resume()
normal_event_handler ()
main ()
If yes, how is this handled in your approach below...
The missing piece of the puzzle is: how would the interleaving process
figure out *where* to insert the coroutine frames into the ordinary
stack trace?
Conceptually, we need some way to identify Task::Resume() in the stack
trace, and insert everything beneath the
std::coroutine_handle<>::resume() that it is eventually calling downstream.
Right now we accomplish this via some dynamic probing at program
initialization:
1. We block the inlining of Task::Resume()
2. We force the inlining of std::coroutine_handle<>::resume() (which
just reduces to an intrinsic, __builtin_coro_resume())
3. We call a dummy coroutine whose sole job is to return its own
__builtin_return_address(0).
This return address is in the body of Task<>::Resume, and this forces it
to be the unique address in the entire program that
all Task<T> coroutines eventually return to.
Therefore, we perform a linear search for it starting from the leaf
frame on the stack, and we insert all the asynchronous frames after that.