Hi Mark, Thanks for all the feedback and help on this. I'm planning on getting your comments addressed in the coming days, but I have some initial clarifying questions.
On Fri, May 1, 2026 at 9:46 AM Mark Rutland <[email protected]> wrote: > > Hi Dylan, > > Thanks for putting this together. I think this is looking pretty good. > However, there are some things that aren't quite right and need some > work, which I've commented on below. > > More generally, there are a few things that aren't addressed by this > series that we will also need to address. Importantly: > > (1) For correctness, we'll need to address a latent issue with unwinding > across an fgraph return trampoline, where the return address is > transiently unrecoverable. > > Before this series, that doesn't matter for livepatching because the > livepatching code isn't called synchronously within the fgraph > handler, and unwinds which cross an exception boundary are marked as > unreliable. > > After this series, that does matter as we can unwind across an > exception boundary, and might happen to interrupt that transient > window. > > I think we can solve that with some restructuring of that code, > restoring the original address *before* removing that from the > fgraph return stack, and ensuring that the unwinder can find it. If my understanding is correct, the issue arrises in return_to_handler as the return address is recovered: mov x0, sp bl ftrace_return_to_handler // addr = ftrace_return_to_hander(fregs); mov x30, x0 // restore the original return address Because ftrace_return_to_handler pops the return address from the return stack before it can be restored into the LR, it cannot be recovered. Based on this, I believe you are suggesting to restructure this code path such that the return address is removed from the return stack only after it has been restored to LR. But since kernel/trace/fgraph.c is core kernel code, will this end up requiring either (1) a similar restructuring of other arches supporting ftrace, or (2) an arm64-specific implementation of this recovery logic? It looks to me like there is essentially the same recovery pattern on other arches; is there a reason this transient unrecoverability isn't an issue for reliable unwind on other platforms? > > I'm not immediately sure whether kretprobes has a similar issue. > > (2) To make unwinding generally possible, we'll need to annotate some > assembly functions as unwindable. We'll need to do that for string > routines under lib/, and probably some crypto code, but we don't > need to do that for most code in head.S, entry.S, etc. > > The vast majority of relevant assembly functions are leaf functions > (where the return address is never moved out of the LR), so we can > probably get away with a simple annotation for those that avoids the > need for open-coded CFI directives everywhere. Are you suggesting something like a SYM_LEAF_FUNC_(START|END), that wraps CFI directives for leaf functions? > > I've pushed some reliable stacktrace tests to: > > git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git > stacktrace/tests > > That finds the fgraph issue (regardless of this series). When merged > with this series triggers a warning in kunwind_next_frame_record_meta(), > where unwind_next_frame_sframe() calls that erroneously as a fallback. Thanks for the pointer on these tests, they're super useful! I've been able to reproduce the fgraph failure you mentioned. Thanks, Dylan

