On Tue, 20 May 2025 11:35:56 +0200 Ingo Molnar <mi...@kernel.org> wrote:
> > Ah, I think I forgot about that. I believe the exit path can also be a
> > faultable path. All it needs is a hook to do the exit. Is there any
> > "task work" clean up on exit? I need to take a look.
>
> Could you please not rush this facility into v6.16? It barely had any
> design review so far, and I'm still not entirely sure about the
> approach.

Hi Ingo,

Note, there has been a lot of discussion on this approach, although it's
been mostly at conferences and in meetings. At GNU Cauldron in September
2024 (before Plumbers) Josh, Mathieu and I discussed it in quite some
detail. I've since been hosting a monthly meeting with engineers from
Google, Red Hat, EfficiOS, Oracle, Microsoft and others (I can invite you
to it if you would like). There are actually two meetings (one in an
Asia-friendly timezone and another in a Europe-friendly timezone).

The first patches from this went out in October 2024:

  https://lore.kernel.org/all/cover.1730150953.git.jpoim...@kernel.org/

There wasn't much discussion on it, although Peter did reply, and I
believe that we did address all of his concerns.

Josh then changed the approach from what we originally discussed, which
was to just have each tracer attach a task_work to the task it wants a
trace from, thinking that would be good enough. But Mathieu and I found
that it doesn't work: even a perf event cannot handle this, because it
would need to keep track of several tasks that may migrate.

Let me state the goal of this work, as it started from a session at the
2022 Tracing Summit in London. We needed a way to get reliable user space
stack traces without using frame pointers. We discussed various methods,
including using eh_frame, but they all had issues. We concluded that it
would be nice to have an ORC unwinder (technically it's called a stack
walker) for user space.

Then in Nov 2022, I read about "sframes", which was exactly what we were
looking for:

  https://www.phoronix.com/news/GNU-Binutils-SFrame

At FOSDEM 2023, I asked Jose Marchesi (a GCC maintainer) about sframes,
and it just happened to be one of his employees who created it (Indu
Bhagat).

At Kernel Recipes 2023, Brendan Gregg, during his talk, asked if it
wouldn't be great to have user space stack walking without needing to run
everything with frame pointers. I raised my hand and asked if he had
heard about "sframes", which he had not. I explained what it was and he
was very interested. After that talk, Josh Poimboeuf came up to me and
asked if he could implement this, which I agreed to.

One thing that is needed for sframes is that the unwinding has to be done
in a faultable context. The sframe sections are like ORC in that they
hold lookup tables, but they live in the user space address space, and
because they can be large, they can't be locked into memory. This brings
up the deferred unwinding aspect.

At the LSFMMBPF 2023 conference, Indu and I ran a session on sframes to
get a better idea of how to implement this. The perf user space stack
tracing was mentioned, where if frame pointers are not there, perf may
copy thousands of bytes of the user space stack into the perf buffer. If
there's a long system call, it may do this several times, and because the
user space stack does not change while the task is in the kernel, this is
thousands of bytes of duplicate data. I asked Jiri Olsa "why not just
defer the stack trace and only make a single entry", and I believe he
replied "we are thinking about it", but nothing further came of it.

Now, there's a serious push to get sframes upstream, and that will take a
few steps. These are:

1) Create a new user unwind stack call that is expected to be called in
   faultable context. If a tracer (perf, ftrace, BPF or whatever) knows
   it's in a faultable context and wants a trace, it can simply ask for
   it.
   It was also asked to have a user space system call that can get this
   trace, so that a function in a user space application can see what is
   calling it.

2) Create a deferred stack unwinding infrastructure that can be used by
   many clients (perf, ftrace, BPF or whatever) and called in any context
   (interrupt or NMI).

   The unwind stack call needs to be in a faultable context, but it is
   very common to want both a kernel stack trace and a user space stack
   trace, and this can happen in an NMI. So allow the kernel stack trace
   to be taken immediately and delay the user stack trace.

   The accounting for this is not easy, as it has a many-to-many
   relationship. You could have perf, ftrace and BPF all asking for a
   delayed stack trace and they all need to be called. But each could be
   wanting a stack trace from a different set of tasks. Keeping track of
   which tracer gets a callback for which task is where this patch set
   comes in. Or at least just the infrastructure part.

3) Add sframes.

   The final part is to get this working with sframes. This is where a
   distro builds all its applications with sframes enabled and then perf,
   ftrace, BPF or whatever gets access to it through this interface.

There's quite a momentum of work happening today that is being built
expecting us to get to step 3. There's no use in adding sframes to
applications if the kernel can't read them. The only way to read them is
to have this deferred infrastructure.

We've discussed the current design quite a bit, but until there are
actual users starting to build on top of it, all the corner cases may not
come out. That's why I'm suggesting that we just get the basic
infrastructure in this merge window, where it's not fully enabled (there
are no users of it). We can then have several users build on top of it in
the next merge window to see if they find anything that breaks.
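
To make the many-to-many accounting from step 2 concrete, here is a
minimal user space sketch. It is not the code from the patch set: every
name in it (tracer_register(), request_deferred_unwind(),
task_return_to_user(), the fixed-size tables) is hypothetical and only
illustrates the idea. A request from any context just sets a bit for the
(tracer, task) pair, and the single user stack walk is done later, once,
when the task is in a faultable context, with each interested tracer
getting a callback.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_TRACERS 8	/* hypothetical limits, chosen for the sketch */
#define MAX_TASKS   4

/* Callback a tracer supplies to receive the deferred user stack trace. */
typedef void (*unwind_cb)(int tracer_id, int task_id, const char *trace);

static unwind_cb tracer_cb[MAX_TRACERS];
static int nr_tracers;
static unsigned int pending[MAX_TASKS];	/* per task: bitmask of waiting tracers */
int nr_unwinds;				/* number of user stack walks performed */
int cb_count[MAX_TRACERS];		/* callbacks delivered to each tracer */

/* Sample callback: only counts deliveries. */
void count_cb(int tracer_id, int task_id, const char *trace)
{
	(void)task_id;
	(void)trace;
	cb_count[tracer_id]++;
}

/* Register a tracer (think perf, ftrace, BPF); returns its id or -1. */
int tracer_register(unwind_cb cb)
{
	if (!cb || nr_tracers >= MAX_TRACERS)
		return -1;
	tracer_cb[nr_tracers] = cb;
	return nr_tracers++;
}

/*
 * Stand-in for a request made from any context (interrupt or NMI):
 * it only records interest and never touches user memory.
 */
int request_deferred_unwind(int tracer_id, int task_id)
{
	if (tracer_id < 0 || tracer_id >= nr_tracers ||
	    task_id < 0 || task_id >= MAX_TASKS)
		return -1;
	pending[task_id] |= 1u << tracer_id;
	return 0;
}

/*
 * Stand-in for the faultable point on the way back to user space:
 * walk the user stack once and hand the result to every waiting tracer.
 * Returns the number of callbacks delivered.
 */
int task_return_to_user(int task_id)
{
	char trace[64];
	unsigned int mask;
	int i, delivered = 0;

	if (task_id < 0 || task_id >= MAX_TASKS)
		return -1;
	mask = pending[task_id];
	if (!mask)
		return 0;
	pending[task_id] = 0;

	/* Placeholder for the real (faultable) user stack walk. */
	snprintf(trace, sizeof(trace), "user stack of task %d", task_id);
	nr_unwinds++;

	for (i = 0; i < nr_tracers; i++) {
		if (mask & (1u << i)) {
			assert(tracer_cb[i] != NULL);
			tracer_cb[i](i, task_id, trace);
			delivered++;
		}
	}
	return delivered;
}
```

In this sketch, if one tracer asks twice for task 0 during a long system
call and a second tracer asks once, task_return_to_user(0) does one walk
and delivers two callbacks, which is the "single entry" behavior that
deferring is meant to give. It deliberately ignores migration, NMI-safe
updates and task exit.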

As perf has the biggest user space ABI, where the delayed stack trace may
be in a different event buffer than where the kernel stack trace occurred
(due to migration), it's the one I'm most concerned about getting right.
Once it is exposed to user space, it can never change. That's the one I
do want to focus on the most. But it shouldn't delay getting the non user
space visible aspect of the kernel side moving forward. If we find an
issue there, it can always be changed because it doesn't affect user
space.

I've been testing this on the ftrace side and so far, everything works
fine. But the ftrace side has the deferred trace always in the same
instance buffer as where it was triggered, as it doesn't depend on user
space API for that. I've also been running this on perf and it too is
working well. But I don't know the perf interface well enough to make
sure there aren't other corner cases that I may be missing.

For 6.16, I would just like to get the common infrastructure in the
kernel so there's no dependency on different tracers. This patch set adds
the changes but does not enable it yet. This way, perf, ftrace and BPF
can all work to build on top of the changes without needing to share
code.

There are only two patches that touch x86 here. I have another patch
series that removes them and implements this for ftrace, but because no
architecture has it enabled, it's just dead code. But it would also allow
building on top of the infrastructure, where any architecture could
enable it. Kind of like what PREEMPT_RT did.

-- Steve