Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Wed, 3 Feb 2021 22:33:28 +0900
Masami Hiramatsu wrote:

> Ah, that is what I worried about. ftrace and kprobes handlers usually want to
> know "what is the actual status of the system where the probe hits".
>
> If the new kernel_exception_enter() for ftrace/kprobes or any other kernel
> instrumentation does
>
>	__preempt_count_add(KEX_OFFSET + NMI_OFFSET + HARDIRQ_OFFSET);
>
> then we can distinguish the KEX from NMI, and get the original status of the
> context.
> What would you think about?

Oh, that reminds me of the obvious difference between an NMI and a ftrace
handler. A ftrace handler disables neither interrupts nor preemption. Thus,
if you set "in_nmi" for a ftrace handler, and an interrupt (or NMI) comes
in, then any ftrace handlers called by the interrupt / NMI will be ignored,
since they will think they are recursing from NMI context.

-- Steve
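For illustration only, a minimal sketch of the idea Masami floats above: claim
currently unused preempt_count bits for a "kernel exception" (KEX) class, so a
handler can still tell a real NMI from an instrumentation entry. KEX_BITS,
KEX_SHIFT, KEX_OFFSET, in_kex() and kernel_exception_enter()/exit() are
hypothetical names taken from the proposal, not existing kernel API, and the
assumption that free preempt_count bits exist above NMI_SHIFT may not hold:

/*
 * Hypothetical sketch -- none of the KEX_* symbols exist in the kernel;
 * this only illustrates the proposal quoted above.
 */
#define KEX_BITS	4
#define KEX_SHIFT	(NMI_SHIFT + NMI_BITS)	/* assumes these bits are free */
#define KEX_OFFSET	(1UL << KEX_SHIFT)
#define KEX_MASK	(((1UL << KEX_BITS) - 1) << KEX_SHIFT)

#define in_kex()	(preempt_count() & KEX_MASK)

static __always_inline void kernel_exception_enter(void)
{
	/* Looks like an NMI to all the existing sanity checks... */
	__preempt_count_add(KEX_OFFSET + NMI_OFFSET + HARDIRQ_OFFSET);
}

static __always_inline void kernel_exception_exit(void)
{
	__preempt_count_sub(KEX_OFFSET + NMI_OFFSET + HARDIRQ_OFFSET);
}

/* ...yet a handler can still recover the original context: each KEX
 * entry contributed exactly one NMI_OFFSET, so a genuine NMI underneath
 * shows up as a higher NMI count than the KEX count. */
static inline bool probe_hit_in_real_nmi(void)
{
	return ((preempt_count() & NMI_MASK) >> NMI_SHIFT) >
	       ((preempt_count() & KEX_MASK) >> KEX_SHIFT);
}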
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Tue, 2 Feb 2021 16:05:13 -0500
Steven Rostedt wrote:

> On Tue, 2 Feb 2021 19:30:34 +0100
> Peter Zijlstra wrote:
>
> > That does mean that kprobes are then fundamentally running from
> > in_nmi(), which is what started all this.
>
> I just thought about the fact that tracing records the context of the
> function it is called in. If you set "in_nmi()" for all ftrace handlers,
> then all functions will look like they are in an NMI context during tracing.
>
> That is, the preempt count is checked to fill in the flags in the ring
> buffer that denote what context the event (in this case the function) was
> called in.

Ah, that is what I worried about. ftrace and kprobes handlers usually want to
know "what is the actual status of the system where the probe hits".

If the new kernel_exception_enter() for ftrace/kprobes or any other kernel
instrumentation does

	__preempt_count_add(KEX_OFFSET + NMI_OFFSET + HARDIRQ_OFFSET);

then we can distinguish the KEX from NMI, and get the original status of the
context.
What would you think about?

Thank you,

-- Masami Hiramatsu
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Tue, 2 Feb 2021 19:30:34 +0100
Peter Zijlstra wrote:

> That does mean that kprobes are then fundamentally running from
> in_nmi(), which is what started all this.

I just thought about the fact that tracing records the context of the
function it is called in. If you set "in_nmi()" for all ftrace handlers,
then all functions will look like they are in an NMI context during tracing.

That is, the preempt count is checked to fill in the flags in the ring
buffer that denote what context the event (in this case the function) was
called in.

-- Steve
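To make the concern concrete, here is a simplified sketch, modeled loosely on
the kernel's tracing_generic_entry_update() (not the exact code), of how the
ring-buffer context flags are derived from the preempt count. If every ftrace
handler ran with in_nmi() true, every traced event would be stamped as NMI:

/* Simplified sketch of how tracing derives an event's context flags
 * from the preempt count; the TRACE_FLAG_* values come from the real
 * tracing headers, the helper itself is illustrative. */
static unsigned char trace_ctx_flags(void)
{
	unsigned char flags = 0;

	if (in_nmi())		/* true for *every* event if all handlers
				 * force in_nmi() */
		flags |= TRACE_FLAG_NMI;
	if (in_irq())
		flags |= TRACE_FLAG_HARDIRQ;
	if (in_serving_softirq())
		flags |= TRACE_FLAG_SOFTIRQ;

	return flags;
}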
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Tue, Feb 02, 2021 at 11:56:23AM -0500, Steven Rostedt wrote:
> NMIs are special, and they always have been. They shouldn't be doing much
> anyway. If they are, then that's a problem.

There is a fair amount of NMI level code these days, and it seems to be
ever increasing...

> My question wasn't to have them do it, I was simply asking if they do. I
> was assuming that they do not.

Per nmi_enter() we do:

	__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);

> > But it doesn't help with:
> >
> >	spin_lock_irq();	// task context
> >	#DB
> >	spin_lock_irq();	// interrupt context per your above
>
> The statement above said:
>
>  "If #DB and #BP do not change the in_interrupt() context"
>
> Which would make the above be in the same context and the handler would
> not be called for the #DB case.

But then replace the above #DB with __fentry__ and explain how it is
fundamentally different?

And consider the direct-call option you have, which calls into god knows
what code. That doesn't go past any recursion checks IIRC.

> I'm fine with #DB and #BP being "in_nmi()", as they are probably even
> more special than NMIs.

That does mean that kprobes are then fundamentally running from
in_nmi(), which is what started all this.

Sure, the opt-probes and ftrace-probes don't actually have in_nmi() set
today (because they don't trigger an exception), but given that that is
all optional, any kprobe handler had better be in_nmi() clean.
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Tue, 2 Feb 2021 17:45:47 +0100
Peter Zijlstra wrote:

> On Tue, Feb 02, 2021 at 09:52:49AM -0500, Steven Rostedt wrote:
>
> > But from a handler, you could do:
> >
> >	if (in_nmi())
> >		return;
> >	local_irq_save(flags);
> >	/* Now you are safe from being re-entrant. */
>
> But that's an utter crap thing to do. That's like saying I don't care
> about my events, at which point you might as well not bother at all.
>
> And you can still do that, you just get less coverage today than you
> used to. You used to throw things under the bus, now you throw more
> under the bus. If you didn't care, I can't seem to find myself caring
> either.

NMIs are special, and they always have been. They shouldn't be doing much
anyway. If they are, then that's a problem.

But if you want to make the stack tracer work in all contexts, I'm happy
to take patches. I don't have time to work on it today.

> > Whereas there's no equivalent in an NMI handler. That's what makes
> > kprobe/ftrace handlers different than NMI handlers.
>
> I don't see how.
>
> > > Also, given how everything can nest, it had better all be lockless
> > > anyway. You can get your regular function trace interrupted, which can
> > > hit a #DB, which can function trace, which can #BP which can function
> > > trace again which can get #NMI etc.. Many wonderful nestings possible.
> >
> > I would call #DB and #BP handlers very special.
>
> They are, just like NMI is special, which is why they're classed
> together.
>
> > Question: Do #DB and #BP set "in_interrupt()"? Because the function tracer
> > has infrastructure to prevent recursion in the same context.
>
> Sure we _could_ do that, but then we get into the 'fun' problem of
> getting a breakpoint/int3 at random places and calling random code and
> having deadlocks because they take the same lock.

My question wasn't to have them do it, I was simply asking if they do. I
was assuming that they do not.

> There was very little that stopped that from happening.
>
> > That is, a
> > ftrace handler calls something that gets traced, the recursion protection
> > will detect that and prevent the handler from being called again. But the
> > recursion protection is interrupt context aware and lets the handler get
> > called again if the recursion happens from a different context:
>
> > If #DB and #BP do not change the in_interrupt() context, then the above
> > still will protect the ftrace handlers from recursion due to them.
>
> But it doesn't help with:
>
>	spin_lock_irq();	// task context
>	#DB
>	spin_lock_irq();	// interrupt context per your above

The statement above said:

 "If #DB and #BP do not change the in_interrupt() context"

Which would make the above be in the same context and the handler would
not be called for the #DB case.

> All you need to do is put a breakpoint on a piece of code that holds a
> spinlock and a handler that takes the same spinlock.
>
> There was very little stopping that.
>
> > That would require refactoring all the code that's been around since 2008.
>
> Because I couldn't tell why/if any of that was correct at all. #DB/#BP
> don't play by the normal rules. They're _far_ more NMI-like than they're
> IRQ-like due to ignoring IF.

I'm fine with #DB and #BP being "in_nmi()", as they are probably even
more special than NMIs.

-- Steve
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Tue, Feb 02, 2021 at 09:52:49AM -0500, Steven Rostedt wrote:

> But from a handler, you could do:
>
>	if (in_nmi())
>		return;
>	local_irq_save(flags);
>	/* Now you are safe from being re-entrant. */

But that's an utter crap thing to do. That's like saying I don't care
about my events, at which point you might as well not bother at all.

And you can still do that, you just get less coverage today than you
used to. You used to throw things under the bus, now you throw more
under the bus. If you didn't care, I can't seem to find myself caring
either.

> Whereas there's no equivalent in an NMI handler. That's what makes
> kprobe/ftrace handlers different than NMI handlers.

I don't see how.

> > Also, given how everything can nest, it had better all be lockless
> > anyway. You can get your regular function trace interrupted, which can
> > hit a #DB, which can function trace, which can #BP which can function
> > trace again which can get #NMI etc.. Many wonderful nestings possible.
>
> I would call #DB and #BP handlers very special.

They are, just like NMI is special, which is why they're classed
together.

> Question: Do #DB and #BP set "in_interrupt()"? Because the function tracer
> has infrastructure to prevent recursion in the same context.

Sure we _could_ do that, but then we get into the 'fun' problem of
getting a breakpoint/int3 at random places and calling random code and
having deadlocks because they take the same lock.

There was very little that stopped that from happening.

> That is, a
> ftrace handler calls something that gets traced, the recursion protection
> will detect that and prevent the handler from being called again. But the
> recursion protection is interrupt context aware and lets the handler get
> called again if the recursion happens from a different context:

> If #DB and #BP do not change the in_interrupt() context, then the above
> still will protect the ftrace handlers from recursion due to them.

But it doesn't help with:

	spin_lock_irq();	// task context
	#DB
	spin_lock_irq();	// interrupt context per your above

All you need to do is put a breakpoint on a piece of code that holds a
spinlock and a handler that takes the same spinlock.

There was very little stopping that.

> That would require refactoring all the code that's been around since 2008.

Because I couldn't tell why/if any of that was correct at all. #DB/#BP
don't play by the normal rules. They're _far_ more NMI-like than they're
IRQ-like due to ignoring IF.
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Tue, 2 Feb 2021 11:45:41 +0100
Peter Zijlstra wrote:

> > The stack tracer checks the size of the stack, compares it to the
> > largest recorded size, and if it's bigger, it will save the stack. But
> > if this happens on two CPUs at the same time, only one can do the
> > recording at the same time. To synchronize this, a spin lock must be
> > taken. Similar to spin locks in an NMI.
>
> That sounds like something cmpxchg() should be able to do.
>
> Have a per-cpu stack trace buffer and a global max one; when the cpu-local
> one exceeds the previous max, cmpxchg the buffer.
>
> > But the problem here is, the callbacks can also be done from an NMI
> > context, so if we are in NMI, we don't want to take any locks, and
> > simply don't record the stack traces from NMIs.
>
> Which is obviously shit :-) The NMI might have interesting stack usage.

Actually, it only checks task stacks. It doesn't check IRQ stacks if they
are different than the task stack, because to do it properly, it must know
the size of the stack. The tracer currently masks the stack pointer with
THREAD_SIZE to find the top of the stack. As other stacks may not be
THREAD_SIZE, that won't work. It has been on my TODO list (for a long time)
to add an arch specific way to quickly find the top of the stack.

> > The more I think about it, the more I hate the idea that ftrace
> > callbacks and kprobes are considered NMIs. Simply because they are not!
>
> Yet they happen when IRQs are off, so they are ;-)

But from a handler, you could do:

	if (in_nmi())
		return;
	local_irq_save(flags);
	/* Now you are safe from being re-entrant. */

Whereas there's no equivalent in an NMI handler. That's what makes
kprobe/ftrace handlers different than NMI handlers.

> Also, given how everything can nest, it had better all be lockless
> anyway. You can get your regular function trace interrupted, which can
> hit a #DB, which can function trace, which can #BP which can function
> trace again which can get #NMI etc.. Many wonderful nestings possible.

I would call #DB and #BP handlers very special.

Question: Do #DB and #BP set "in_interrupt()"? Because the function tracer
has infrastructure to prevent recursion in the same context. That is, a
ftrace handler calls something that gets traced, the recursion protection
will detect that and prevent the handler from being called again. But the
recursion protection is interrupt context aware and lets the handler get
called again if the recursion happens from a different context:

	func:
	    call ftrace_caller
	ftrace_caller:
	    call ftrace_handler
	    ftrace_handler() {
		if (recursion_test())	<- false
			return;

		some_traced_func() {
			call ftrace_caller
			call ftrace_handler
			ftrace_handler() {
				if (recursion_test())	<- true
					return
			}

		(re-entered from a different context, e.g. an interrupt)
		func
			call ftrace_caller
			call ftrace_handler
			ftrace_handler() {
				if (recursion_test())	<- false
					return;
				/* continue */

If #DB and #BP do not change the in_interrupt() context, then the above
still will protect the ftrace handlers from recursion due to them.

> And god knows what these handlers end up calling.
>
> The only sane approach is treating it all as NMI and having it all
> lockless.

That would require refactoring all the code that's been around since 2008.

Worse yet, lockless is much more complex to get right. So this refactoring
will likely cause more bugs than it solves.

-- Steve
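A condensed sketch of the per-context recursion protection Steve describes,
modeled loosely on ftrace's trace_get_context_bit() and recursion bits (the
names here are simplified, not the exact kernel code): one bit per context
means a handler cannot recurse within a context, yet still runs when it is
re-entered from a different one.

/* Simplified sketch of context-aware recursion protection. */
enum { CTX_TASK, CTX_SOFTIRQ, CTX_HARDIRQ, CTX_NMI };

static __always_inline int trace_get_context_bit(void)
{
	if (in_nmi())
		return CTX_NMI;
	if (in_irq())
		return CTX_HARDIRQ;
	if (in_serving_softirq())
		return CTX_SOFTIRQ;
	return CTX_TASK;
}

/* Returns true if we are already inside a handler in this context;
 * otherwise claims the context's bit. The caller clears the bit when
 * the handler returns. */
static __always_inline bool recursion_test(unsigned long *recursion_bits)
{
	int bit = trace_get_context_bit();

	if (*recursion_bits & (1UL << bit))
		return true;
	*recursion_bits |= 1UL << bit;
	return false;
}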
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Sat, Jan 30, 2021 at 07:44:10AM -0500, Steven Rostedt wrote:
> On Sat, 30 Jan 2021 09:28:32 +0100
> Peter Zijlstra wrote:
>
> > On Fri, Jan 29, 2021 at 04:24:54PM -0500, Steven Rostedt wrote:
> > > Specifically, kprobe and ftrace callbacks may have this:
> > >
> > >	if (in_nmi())
> > >		return;
> > >
> > >	raw_spin_lock_irqsave(, flags);
> > >	[..]
> > >	raw_spin_unlock_irqrestore(, flags);
> > >
> > > Which is totally fine to have,
> >
> > Why? There's a distinct lack of explaining here.
> >
> > Note that we ripped out all such dodgy locking from kretprobes.
>
> Actually, I think you helped explain the distinction. You mention
> "kretprobes" -- do you mean the infrastructure of kretprobes or all its
> users?

kretprobe infra.

> The infrastructure of ftrace and kprobes can work in any context; that
> does not mean that the callbacks must. Again, these are more like
> exceptions. Why have "in_nmi()"? Anything that can be called by an
> NMI should just work, right? That's basically your argument for having
> ftrace and kprobes set "in_nmi".
>
> You can have locking in NMIs if the locking is *only* in NMI handlers,
> right? If that's the case, then so should ftrace and kprobe callbacks.

Which is still dodgy as heck. NMIs _can_ nest. Now mostly that doesn't
happen, and the few sites that do use spinlocks from NMI context are
sure to use them from a specific NMI 'handler' context which typically
doesn't nest.

But I still utterly detest the idea of using spinlocks from NMI. It's
inherently fragile.

> The stack tracer checks the size of the stack, compares it to the
> largest recorded size, and if it's bigger, it will save the stack. But
> if this happens on two CPUs at the same time, only one can do the
> recording at the same time. To synchronize this, a spin lock must be
> taken. Similar to spin locks in an NMI.

That sounds like something cmpxchg() should be able to do.

Have a per-cpu stack trace buffer and a global max one; when the cpu-local
one exceeds the previous max, cmpxchg the buffer.

> But the problem here is, the callbacks can also be done from an NMI
> context, so if we are in NMI, we don't want to take any locks, and
> simply don't record the stack traces from NMIs.

Which is obviously shit :-) The NMI might have interesting stack usage.

> The more I think about it, the more I hate the idea that ftrace
> callbacks and kprobes are considered NMIs. Simply because they are not!

Yet they happen when IRQs are off, so they are ;-)

Also, given how everything can nest, it had better all be lockless
anyway. You can get your regular function trace interrupted, which can
hit a #DB, which can function trace, which can #BP which can function
trace again which can get #NMI etc.. Many wonderful nestings possible.

And god knows what these handlers end up calling.

The only sane approach is treating it all as NMI and having it all
lockless.
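A minimal sketch of the lockless scheme Peter outlines, assuming each CPU owns
a spare snapshot buffer and a successful cmpxchg() transfers ownership. All
names are hypothetical, and same-CPU NMI reentrancy during the fill is ignored
for brevity:

/* Hypothetical sketch of a lockless "largest stack wins" recorder. */
#include <linux/percpu.h>
#include <linux/stacktrace.h>

struct stack_snapshot {
	unsigned long	size;
	unsigned int	nr_entries;
	unsigned long	entries[64];
};

static struct stack_snapshot init_snap;		/* size 0, never read back */
static struct stack_snapshot *max_snap = &init_snap;
/* one spare buffer per CPU, allocated at init (not shown) */
static DEFINE_PER_CPU(struct stack_snapshot *, spare_snap);

static void record_if_largest(unsigned long stack_size)
{
	struct stack_snapshot *old, *mine = this_cpu_read(spare_snap);

	old = READ_ONCE(max_snap);
	if (stack_size <= old->size)
		return;

	/* Fill our own buffer; nobody else can see it yet. */
	mine->size = stack_size;
	mine->nr_entries = stack_trace_save(mine->entries,
					    ARRAY_SIZE(mine->entries), 0);

	/* Publish. On success we hand our buffer over and take the old
	 * max as our new spare; on failure another CPU won the race and
	 * we simply keep our buffer for the next sample. */
	if (cmpxchg(&max_snap, old, mine) == old)
		this_cpu_write(spare_snap, old);
}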
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Sat, 30 Jan 2021 09:28:32 +0100
Peter Zijlstra wrote:

> On Fri, Jan 29, 2021 at 04:24:54PM -0500, Steven Rostedt wrote:
> > Specifically, kprobe and ftrace callbacks may have this:
> >
> >	if (in_nmi())
> >		return;
> >
> >	raw_spin_lock_irqsave(, flags);
> >	[..]
> >	raw_spin_unlock_irqrestore(, flags);
> >
> > Which is totally fine to have,
>
> Why? There's a distinct lack of explaining here.
>
> Note that we ripped out all such dodgy locking from kretprobes.

Actually, I think you helped explain the distinction. You mention
"kretprobes" -- do you mean the infrastructure of kretprobes or all its
users?

The infrastructure of ftrace and kprobes can work in any context; that
does not mean that the callbacks must. Again, these are more like
exceptions. Why have "in_nmi()"? Anything that can be called by an
NMI should just work, right? That's basically your argument for having
ftrace and kprobes set "in_nmi".

You can have locking in NMIs if the locking is *only* in NMI handlers,
right? If that's the case, then so should ftrace and kprobe callbacks.

The stack tracer checks the size of the stack, compares it to the
largest recorded size, and if it's bigger, it will save the stack. But
if this happens on two CPUs at the same time, only one can do the
recording at the same time. To synchronize this, a spin lock must be
taken. Similar to spin locks in an NMI.

But the problem here is, the callbacks can also be done from an NMI
context, so if we are in NMI, we don't want to take any locks, and
simply don't record the stack traces from NMIs.

The more I think about it, the more I hate the idea that ftrace
callbacks and kprobes are considered NMIs. Simply because they are not!

-- Steve
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, 29 Jan 2021 19:08:40 -0800
Alexei Starovoitov wrote:

> On Sat, Jan 30, 2021 at 11:02:49AM +0900, Masami Hiramatsu wrote:
> > On Fri, 29 Jan 2021 18:59:43 +0100
> > Peter Zijlstra wrote:
> >
> > > On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > > > Same things apply to bpf side. We can statically prove safety for
> > > > ftrace and kprobe attaching whereas to deal with NMI situation we
> > > > have to use run-time checks for recursion prevention, etc.
> > >
> > > I have no idea what you're saying. You can attach to functions that are
> > > called with random locks held, you can create kprobes in some very
> > > sensitive places.
> > >
> > > What can you statically prove about that?
> >
> > For the bpf and the kprobe tracer, if a probe hits in the NMI context,
> > it can call the handler with another handler processing events.
> >
> > kprobes is carefully avoiding the deadlock by checking recursion
> > with a per-cpu variable. But if the handler is shared with other events
> > like tracepoints, it needs its own recursion checker too.
> >
> > So, Alexei, maybe you need something like this instead of the in_nmi() check.
> >
> > DEFINE_PER_CPU(bool, under_running_bpf);
> >
> > common_handler()
> > {
> >	if (__this_cpu_read(under_running_bpf))
> >		return;
> >	__this_cpu_write(under_running_bpf, true);
> >	/* execute bpf prog */
> >	__this_cpu_write(under_running_bpf, false);
> > }
> >
> > Does this work for you?
>
> This exact check is already in trace_call_bpf,
> right after if (in_nmi()).
> See bpf_prog_active. It serves a different purpose though.
> Simply removing if (in_nmi()) from trace_call_bpf is a bit scary.
> I need to analyze all code paths first.

OK, if bpf already avoids its own recursion, the other case to consider
is that some resources are shared among bpf_prog and other parts. Since
an asynchronous NMI can occur anywhere, such resource usage can conflict
with bpf_prog.

Kprobes had a similar issue, so I set a dummy kprobe to current_kprobe
for protecting such critical sections. See kprobe_busy_begin()/end() and
where those are used.

Thank you,

-- Masami Hiramatsu
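kprobe_busy_begin()/end() are real helpers (kernel/kprobes.c): they install a
dummy kprobe as this CPU's current kprobe, so any kprobe that fires inside the
bracketed region is rejected as a recursive hit. A hedged sketch of the usage
pattern Masami refers to (the critical section itself is hypothetical):

#include <linux/kprobes.h>

/* Sketch: guard a section that kprobe handlers might also reach. */
static void touch_shared_resource(void)
{
	kprobe_busy_begin();	/* marks a dummy kprobe as running on this CPU */

	/* ... critical section that a kprobe handler must not re-enter,
	 * e.g. manipulating state the handler also touches ... */

	kprobe_busy_end();	/* clears the dummy kprobe again */
}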
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Sat, Jan 30, 2021 at 11:02:49AM +0900, Masami Hiramatsu wrote:
> On Fri, 29 Jan 2021 18:59:43 +0100
> Peter Zijlstra wrote:
>
> > On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > > Same things apply to bpf side. We can statically prove safety for
> > > ftrace and kprobe attaching whereas to deal with NMI situation we
> > > have to use run-time checks for recursion prevention, etc.
> >
> > I have no idea what you're saying. You can attach to functions that are
> > called with random locks held, you can create kprobes in some very
> > sensitive places.
> >
> > What can you statically prove about that?
>
> For the bpf and the kprobe tracer, if a probe hits in the NMI context,
> it can call the handler with another handler processing events.
>
> kprobes is carefully avoiding the deadlock by checking recursion
> with a per-cpu variable. But if the handler is shared with other events
> like tracepoints, it needs its own recursion checker too.
>
> So, Alexei, maybe you need something like this instead of the in_nmi() check.
>
> DEFINE_PER_CPU(bool, under_running_bpf);
>
> common_handler()
> {
>	if (__this_cpu_read(under_running_bpf))
>		return;
>	__this_cpu_write(under_running_bpf, true);
>	/* execute bpf prog */
>	__this_cpu_write(under_running_bpf, false);
> }
>
> Does this work for you?

This exact check is already in trace_call_bpf,
right after if (in_nmi()).
See bpf_prog_active. It serves a different purpose though.
Simply removing if (in_nmi()) from trace_call_bpf is a bit scary.
I need to analyze all code paths first.
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, Jan 29, 2021 at 04:24:54PM -0500, Steven Rostedt wrote:
> Specifically, kprobe and ftrace callbacks may have this:
>
>	if (in_nmi())
>		return;
>
>	raw_spin_lock_irqsave(, flags);
>	[..]
>	raw_spin_unlock_irqrestore(, flags);
>
> Which is totally fine to have,

Why? There's a distinct lack of explaining here.

Note that we ripped out all such dodgy locking from kretprobes.
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, 29 Jan 2021 18:59:43 +0100
Peter Zijlstra wrote:

> On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > Same things apply to bpf side. We can statically prove safety for
> > ftrace and kprobe attaching whereas to deal with NMI situation we
> > have to use run-time checks for recursion prevention, etc.
>
> I have no idea what you're saying. You can attach to functions that are
> called with random locks held, you can create kprobes in some very
> sensitive places.
>
> What can you statically prove about that?

For the bpf and the kprobe tracer, if a probe hits in the NMI context,
it can call the handler with another handler processing events.

kprobes is carefully avoiding the deadlock by checking recursion
with a per-cpu variable. But if the handler is shared with other events
like tracepoints, it needs its own recursion checker too.

So, Alexei, maybe you need something like this instead of the in_nmi() check.

DEFINE_PER_CPU(bool, under_running_bpf);

common_handler()
{
	if (__this_cpu_read(under_running_bpf))
		return;
	__this_cpu_write(under_running_bpf, true);
	/* execute bpf prog */
	__this_cpu_write(under_running_bpf, false);
}

Does this work for you?

Thank you,

-- Masami Hiramatsu
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, 29 Jan 2021 13:05:33 -0800
Alexei Starovoitov wrote:

> On Fri, Jan 29, 2021 at 02:01:03PM -0500, Steven Rostedt wrote:
> > On Fri, 29 Jan 2021 18:59:43 +0100
> > Peter Zijlstra wrote:
> >
> > > On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > > > Same things apply to bpf side. We can statically prove safety for
> > > > ftrace and kprobe attaching whereas to deal with NMI situation we
> > > > have to use run-time checks for recursion prevention, etc.
> > >
> > > I have no idea what you're saying. You can attach to functions that are
> > > called with random locks held, you can create kprobes in some very
> > > sensitive places.
> > >
> > > What can you statically prove about that?
> >
> > I think the main difference is, if you attach a kprobe or ftrace function,
> > you can theoretically analyze the location before you do the attachment.
>
> Exactly.
> When we're writing bpf helpers we need to carefully think about reentrance
> and NMI.
> If the helper looks like:
>
> int nokprobe notrace bpf_something(...)
> {
>	// access variables A and B
> }
>
> the implementation can rely on the fact that even if the helper is reentrant
> the state of A and B will be consistent. Either both got touched or none.
> Only the NMI condition we have to worry about, because A could be modified
> without touching B.
> If we write it as
>
> int nokprobe bpf_something(...) { ... }
>
> that would be another case.
> Here we need to consider the case that a bpf prog can be attached to it via
> fentry nop.
> But no need to worry about instructions split in the middle because of kprobe
> via int3.
> Since we have a big "if (in_nmi()) goto unsupported;" check in the beginning we
> only need to worry about combinations of kprobe at the start of the func,
> kprobe anywhere inside the func via int3, and ftrace at the start.
> Not having to think of NMI helps a ton.
> My earlier this_cpu vs __this_cpu comment is an example of that.
> If in_nmi is filtered early it's one implementation. If nmi has to be handled
> it's a completely different algorithm.
> Now you've broken all this logic by making int3 be marked as 'in_nmi', and
> bpf in a kprobe in the middle of the func is now broken.
> Do people use that? Yeah they do.
> We have to fix it.
> What were your reasons to make int3 in_nmi?
> I've read the commit log, but I don't see in it the actual motivation
> for int3 other than "it looks like NMI to me. Let's make it so".
> The commit logs talk about cpu exceptions. I agree that #DB and #MC do behave
> like NMI.
> But #BP is not really. My understanding is it's used by kprobes and
> text_poke_bp only.
> If the motivation was to close some issue with text_poke_bp then, sure,
> let's make handling of text_poke_bp be treated as nmi.
> But kprobe is not that.
> I'm thinking either of the following solutions would be generic:
> - introduce another state to preempt flags like "kernel exception"

I like this solution. Or, at least there should be a way to tell whether
the probed context is NMI or not.
(BTW, would the NMI have a specific stack area? If so, nmi_context(regs)
can be implemented.)

> - remove kprobe's int3 from in_nmi
> As a bpf-specific alternative we can do:
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 6c0018abe68a..37cc549ad52e 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -96,7 +96,7 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
>  {
>  	unsigned int ret;
>
> -	if (in_nmi()) /* not supported yet */
> +	if (in_nmi() && !kprobe_running()) /* not supported yet */

This doesn't make sense, because kprobe_running() is always true in the
kprobe handler. The problem is that in_nmi() checks whether the current
context is NMI context, but you want to know whether the context where
the kprobe was hit is NMI context or not.

Thank you,

-- Masami Hiramatsu
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, 29 Jan 2021 14:01:03 -0500
Steven Rostedt wrote:

> On Fri, 29 Jan 2021 18:59:43 +0100
> Peter Zijlstra wrote:
>
> > On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > > Same things apply to bpf side. We can statically prove safety for
> > > ftrace and kprobe attaching whereas to deal with NMI situation we
> > > have to use run-time checks for recursion prevention, etc.
> >
> > I have no idea what you're saying. You can attach to functions that are
> > called with random locks held, you can create kprobes in some very
> > sensitive places.
> >
> > What can you statically prove about that?
>
> I think the main difference is, if you attach a kprobe or ftrace function,
> you can theoretically analyze the location before you do the attachment.
>
> Does the NMI context mean "in_nmi()" returns true? Because there are cases
> in ftrace callbacks where that is checked (like the stack tracer). And
> having ftrace return true for "in_nmi()" will break a lot of existing
> utilities.

Specifically, kprobe and ftrace callbacks may have this:

	if (in_nmi())
		return;

	raw_spin_lock_irqsave(, flags);
	[..]
	raw_spin_unlock_irqrestore(, flags);

Which is totally fine to have, but the above only works if "in_nmi()"
returns true only if you are in a real NMI.

The stack tracer code does exactly the above.

-- Steve
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, Jan 29, 2021 at 02:01:03PM -0500, Steven Rostedt wrote:
> On Fri, 29 Jan 2021 18:59:43 +0100
> Peter Zijlstra wrote:
>
> > On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > > Same things apply to bpf side. We can statically prove safety for
> > > ftrace and kprobe attaching whereas to deal with NMI situation we
> > > have to use run-time checks for recursion prevention, etc.
> >
> > I have no idea what you're saying. You can attach to functions that are
> > called with random locks held, you can create kprobes in some very
> > sensitive places.
> >
> > What can you statically prove about that?
>
> I think the main difference is, if you attach a kprobe or ftrace function,
> you can theoretically analyze the location before you do the attachment.

Exactly.
When we're writing bpf helpers we need to carefully think about reentrance
and NMI.
If the helper looks like:

int nokprobe notrace bpf_something(...)
{
	// access variables A and B
}

the implementation can rely on the fact that even if the helper is reentrant
the state of A and B will be consistent. Either both got touched or none.
Only the NMI condition we have to worry about, because A could be modified
without touching B.

If we write it as

int nokprobe bpf_something(...) { ... }

that would be another case.
Here we need to consider the case that a bpf prog can be attached to it via
fentry nop.
But no need to worry about instructions split in the middle because of kprobe
via int3.
Since we have a big "if (in_nmi()) goto unsupported;" check in the beginning we
only need to worry about combinations of kprobe at the start of the func,
kprobe anywhere inside the func via int3, and ftrace at the start.
Not having to think of NMI helps a ton.
My earlier this_cpu vs __this_cpu comment is an example of that.
If in_nmi is filtered early it's one implementation. If nmi has to be handled
it's a completely different algorithm.

Now you've broken all this logic by making int3 be marked as 'in_nmi', and
bpf in a kprobe in the middle of the func is now broken.
Do people use that? Yeah they do. We have to fix it.

What were your reasons to make int3 in_nmi?
I've read the commit log, but I don't see in it the actual motivation
for int3 other than "it looks like NMI to me. Let's make it so".
The commit logs talk about cpu exceptions. I agree that #DB and #MC do behave
like NMI. But #BP is not really. My understanding is it's used by kprobes and
text_poke_bp only.
If the motivation was to close some issue with text_poke_bp then, sure,
let's make handling of text_poke_bp be treated as nmi.
But kprobe is not that.

I'm thinking either of the following solutions would be generic:
- introduce another state to preempt flags like "kernel exception"
- remove kprobe's int3 from in_nmi

As a bpf-specific alternative we can do:

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6c0018abe68a..37cc549ad52e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -96,7 +96,7 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
 {
 	unsigned int ret;
 
-	if (in_nmi()) /* not supported yet */
+	if (in_nmi() && !kprobe_running()) /* not supported yet */

but imo it's less clean than the first two options.
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, 29 Jan 2021 18:59:43 +0100
Peter Zijlstra wrote:

> On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > Same things apply to bpf side. We can statically prove safety for
> > ftrace and kprobe attaching whereas to deal with NMI situation we
> > have to use run-time checks for recursion prevention, etc.
>
> I have no idea what you're saying. You can attach to functions that are
> called with random locks held, you can create kprobes in some very
> sensitive places.
>
> What can you statically prove about that?

I think the main difference is, if you attach a kprobe or ftrace function,
you can theoretically analyze the location before you do the attachment.

Does the NMI context mean "in_nmi()" returns true? Because there are cases
in ftrace callbacks where that is checked (like the stack tracer). And
having ftrace return true for "in_nmi()" will break a lot of existing
utilities.

-- Steve
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> Same things apply to bpf side. We can statically prove safety for
> ftrace and kprobe attaching whereas to deal with NMI situation we
> have to use run-time checks for recursion prevention, etc.

I have no idea what you're saying. You can attach to functions that are
called with random locks held, you can create kprobes in some very
sensitive places.

What can you statically prove about that?
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, Jan 29, 2021 at 8:24 AM Peter Zijlstra wrote:
>
> On Fri, Jan 29, 2021 at 10:59:52AM -0500, Steven Rostedt wrote:
> > On Fri, 29 Jan 2021 22:40:11 +0900
> > Masami Hiramatsu wrote:
> >
> > > > So what, they can all happen with random locks held. Marking them as NMI
> > > > enables a whole bunch of sanity checks that are entirely appropriate.
> > >
> > > How about introducing an idea of Asynchronous NMI (ANMI) and Synchronous
> > > NMI (SNMI)? kprobes and ftrace are synchronously called and can be
> > > controlled (we can expect the context) but ANMI may be caused by
> > > asynchronous hardware events on any context.
> > >
> > > If we can distinguish those 2 NMIs on preempt count, bpf people can easily
> > > avoid the inevitable situation.
> >
> > I don't like the "NMI" in SNMI, because they are not NMIs. They are
> > actually more like kernel exceptions. Even page faults in the kernel are
> > similar to a kprobe breakpoint or ftrace. It can happen anywhere, with any
> > lock held. Perhaps we need a kernel exception context? Which by definition
> > is synchronous.

I like the 'kernel exception' name. SNMI doesn't sound right. There is
nothing 'non maskable' here.

> What problem are you trying to solve? AFAICT all these contexts have the
> same restrictions, why try and muck about with different names for the
> same thing?

From the core kernel perspective the difference between a 'kernel exception'
and a true NMI is huge:
this_cpu vs __this_cpu
static checks vs runtime checks

Same things apply to the bpf side. We can statically prove safety for
ftrace and kprobe attaching, whereas to deal with the NMI situation we
have to use run-time checks for recursion prevention, etc.
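On the this_cpu vs __this_cpu point: this_cpu_*() operations are single,
interrupt- and NMI-safe instructions on x86, while __this_cpu_*() is a plain
read-modify-write that is only valid when the caller can already rule out
such interference -- which is the static-versus-runtime distinction being
drawn here. An illustrative sketch (the counter and helpers are hypothetical):

DEFINE_PER_CPU(int, probe_depth);	/* hypothetical per-cpu counter */

static void handler_nmi_filtered_out(void)
{
	/*
	 * Safe only because in_nmi() was rejected earlier: nothing can
	 * interrupt this read-modify-write on this CPU mid-update.
	 */
	__this_cpu_inc(probe_depth);
}

static void handler_nmi_possible(void)
{
	/*
	 * Must be NMI-safe: on x86 this compiles to a single
	 * "incl %gs:..." so an NMI cannot land mid-update.
	 */
	this_cpu_inc(probe_depth);
}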
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, Jan 29, 2021 at 10:59:52AM -0500, Steven Rostedt wrote:
> On Fri, 29 Jan 2021 22:40:11 +0900
> Masami Hiramatsu wrote:
>
> > > So what, they can all happen with random locks held. Marking them as NMI
> > > enables a whole bunch of sanity checks that are entirely appropriate.
> >
> > How about introducing an idea of Asynchronous NMI (ANMI) and Synchronous
> > NMI (SNMI)? kprobes and ftrace are synchronously called and can be
> > controlled (we can expect the context) but ANMI may be caused by
> > asynchronous hardware events on any context.
> >
> > If we can distinguish those 2 NMIs on preempt count, bpf people can easily
> > avoid the inevitable situation.
>
> I don't like the "NMI" in SNMI, because they are not NMIs. They are
> actually more like kernel exceptions. Even page faults in the kernel are
> similar to a kprobe breakpoint or ftrace. It can happen anywhere, with any
> lock held. Perhaps we need a kernel exception context? Which by definition
> is synchronous.

What problem are you trying to solve? AFAICT all these contexts have the
same restrictions, why try and muck about with different names for the
same thing?
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On 29.01.21 г. 3:34 ч., Alexei Starovoitov wrote:
> On Thu, Jan 28, 2021 at 07:24:14PM +0100, Peter Zijlstra wrote:
>> On Thu, Jan 28, 2021 at 06:45:56PM +0200, Nikolay Borisov wrote:
>>> it would be placed on the __fentry__ (and not endbr64) hence it works.
>>> So perhaps a workaround outside of bpf could essentially detect this
>>> scenario and adjust the probe to be on the __fentry__ and not the
>>> preceding instruction if it's detected to be endbr64?
>>
>> Arguably the fentry handler should also set the nmi context, it can,
>> after all, interrupt pretty much any other context by construction.
>
> But that doesn't make it a real nmi.
> An nmi can actually interrupt anything. Whereas kprobe via int3 cannot
> probe nokprobe and noinstr sections. The exposure is a lot different.
> ftrace is even more contained. It's only at the start of functions.
> That's an even smaller subset of places than kprobes.
> So ftrace < kprobe < nmi.
> Grouping them all into nmi doesn't make sense to me.
> That bpf breaking change came unnoticed mostly because people use
> kprobes at the beginning of functions only, but there are cases
> where kprobes are in the middle too. People just didn't upgrade kernels
> fast enough to notice.

nit: slight correction - I observed this while I was putting kprobes at
the beginning of the function, but __fentry__ wasn't the first thing in
the function's code. The reason why people haven't observed it is because
everyone is running with retpolines enabled, which disables CFI/CET.

> imo the appropriate solution would be to have some distinction between
> ftrace < kprobe_via_int3 < nmi, so that the bpf side can react accordingly.
> That code in trace_call_bpf:
>	if (in_nmi()) /* not supported yet */
> was necessary in the past. Now we have all sorts of protections built in.
> So it's probably ok to just drop it, but I think adding
> called_via_ftrace vs called_via_kprobe_int3 vs called_via_nmi
> is the more appropriate solution long term.
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Thu, Jan 28, 2021 at 07:24:14PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 28, 2021 at 06:45:56PM +0200, Nikolay Borisov wrote:
> > it would be placed on the __fentry__ (and not endbr64) hence it works.
> > So perhaps a workaround outside of bpf could essentially detect this
> > scenario and adjust the probe to be on the __fentry__ and not the
> > preceding instruction if it's detected to be endbr64?
>
> Arguably the fentry handler should also set the nmi context, it can,
> after all, interrupt pretty much any other context by construction.

But that doesn't make it a real nmi.
An nmi can actually interrupt anything. Whereas kprobe via int3 cannot
probe nokprobe and noinstr sections. The exposure is a lot different.
ftrace is even more contained. It's only at the start of functions.
That's an even smaller subset of places than kprobes.
So ftrace < kprobe < nmi.
Grouping them all into nmi doesn't make sense to me.

That bpf breaking change came unnoticed mostly because people use
kprobes at the beginning of functions only, but there are cases
where kprobes are in the middle too. People just didn't upgrade kernels
fast enough to notice.

imo the appropriate solution would be to have some distinction between
ftrace < kprobe_via_int3 < nmi, so that the bpf side can react accordingly.

That code in trace_call_bpf:

	if (in_nmi()) /* not supported yet */

was necessary in the past. Now we have all sorts of protections built in.
So it's probably ok to just drop it, but I think adding
called_via_ftrace vs called_via_kprobe_int3 vs called_via_nmi
is the more appropriate solution long term.
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Thu, Jan 28, 2021 at 06:45:56PM +0200, Nikolay Borisov wrote:
> it would be placed on the __fentry__ (and not endbr64) hence it works.
> So perhaps a workaround outside of bpf could essentially detect this
> scenario and adjust the probe to be on the __fentry__ and not the
> preceding instruction if it's detected to be endbr64?

Arguably the fentry handler should also set the nmi context, it can,
after all, interrupt pretty much any other context by construction.
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Thu, Jan 28, 2021 at 06:45:56PM +0200, Nikolay Borisov wrote:
> On 28.01.21 г. 18:12 ч., Nikolay Borisov wrote:
> > On 28.01.21 г. 5:38 ч., Masami Hiramatsu wrote:
> >> Hi,
> >
> >> Alexei, could you tell me what is the concerning situation for bpf?
> >
> > Another data point, Masami, is that this affects bpf kprobes which are
> > entered via int3; alternatively, if the kprobe is entered via
> > kprobe_ftrace_handler it works as expected. I haven't been able to
> > determine why a particular bpf probe won't use ftrace's infrastructure
> > if it's put at the beginning of the function. An alternative call chain
> > is:
> >
> > => __ftrace_trace_stack
> > => trace_call_bpf
> > => kprobe_perf_func
> > => kprobe_ftrace_handler
> > => 0xc095d0c8
> > => btrfs_validate_metadata_buffer
> > => end_bio_extent_readpage
> > => end_workqueue_fn
> > => btrfs_work_helper
> > => process_one_work
> > => worker_thread
> > => kthread
> > => ret_from_fork
>
> I have a working theory why I'm seeing this. My (broken) kernel was
> compiled with retpolines off and with the gcc that comes with Ubuntu
> (both 9 and 10:
> gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
> gcc-10 (Ubuntu 10.2.0-5ubuntu1~20.04) 10.2.0
> ).
>
> This results in CFI being enabled, so functions look like:
>
> 0x81493890 <+0>:	endbr64
> 0x81493894 <+4>:	callq  0x8104d820 <__fentry__>
>
> i.e. fentry's thunk is not the first instruction of the function, hence
> it's not going through the optimized ftrace handler. Instead it's using
> int3, which is broken as ascertained.
>
> After testing with my testcase I can confirm that with CFI off and
> __fentry__ being the first instruction, bpf starts working. And indeed,
> even with CFI turned on, if I use a probe like:
>
> bpftrace -e 'kprobe:btrfs_sync_file+4 {printf("kprobe: %s\n", kstack());}' &>bpf-output &
>
> it would be placed on the __fentry__ (and not endbr64), hence it works.
> So perhaps a workaround outside of bpf could essentially detect this
> scenario and adjust the probe to be on the __fentry__ and not the
> preceding instruction if it's detected to be endbr64?

For now (and the foreseeable future), CET isn't enabled in the kernel.
So that endbr64 shouldn't be there in the first place.

I can make a proper patch in a bit.

diff --git a/Makefile b/Makefile
index e0af7a4a5598..5ccc4cdf1fb5 100644
--- a/Makefile
+++ b/Makefile
@@ -948,11 +948,8 @@ KBUILD_CFLAGS += $(call cc-option,-Werror=designated-init)
 # change __FILE__ to the relative path from the srctree
 KBUILD_CPPFLAGS += $(call cc-option,-fmacro-prefix-map=$(srctree)/=)
 
-# ensure -fcf-protection is disabled when using retpoline as it is
-# incompatible with -mindirect-branch=thunk-extern
-ifdef CONFIG_RETPOLINE
+# Intel CET isn't enabled in the kernel
 KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
-endif
 
 # include additional Makefiles when needed
 include-y := scripts/Makefile.extrawarn
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On 28.01.21 г. 18:12 ч., Nikolay Borisov wrote:
>
> On 28.01.21 г. 5:38 ч., Masami Hiramatsu wrote:
>> Hi,
>
>> Alexei, could you tell me what is the concerning situation for bpf?
>
> Another data point, Masami, is that this affects bpf kprobes which are
> entered via int3; alternatively, if the kprobe is entered via
> kprobe_ftrace_handler it works as expected. I haven't been able to
> determine why a particular bpf probe won't use ftrace's infrastructure
> if it's put at the beginning of the function. An alternative call chain
> is:
>
> => __ftrace_trace_stack
> => trace_call_bpf
> => kprobe_perf_func
> => kprobe_ftrace_handler
> => 0xc095d0c8
> => btrfs_validate_metadata_buffer
> => end_bio_extent_readpage
> => end_workqueue_fn
> => btrfs_work_helper
> => process_one_work
> => worker_thread
> => kthread
> => ret_from_fork

I have a working theory why I'm seeing this. My (broken) kernel was
compiled with retpolines off and with the gcc that comes with Ubuntu
(both 9 and 10:

gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
gcc-10 (Ubuntu 10.2.0-5ubuntu1~20.04) 10.2.0
).

This results in CFI being enabled, so functions look like:

0x81493890 <+0>:	endbr64
0x81493894 <+4>:	callq  0x8104d820 <__fentry__>

i.e. fentry's thunk is not the first instruction of the function, hence
it's not going through the optimized ftrace handler. Instead it's using
int3, which is broken as ascertained.

After testing with my testcase I can confirm that with CFI off and
__fentry__ being the first instruction, bpf starts working. And indeed,
even with CFI turned on, if I use a probe like:

bpftrace -e 'kprobe:btrfs_sync_file+4 {printf("kprobe: %s\n", kstack());}' &>bpf-output &

it would be placed on the __fentry__ (and not endbr64), hence it works.
So perhaps a workaround outside of bpf could essentially detect this
scenario and adjust the probe to be on the __fentry__ and not the
preceding instruction if it's detected to be endbr64?
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On 28.01.21 г. 5:38 ч., Masami Hiramatsu wrote:
> Hi,
>
> Alexei, could you tell me what is the concerning situation for bpf?

Another data point, Masami, is that this affects bpf kprobes which are
entered via int3; alternatively, if the kprobe is entered via
kprobe_ftrace_handler it works as expected. I haven't been able to
determine why a particular bpf probe won't use ftrace's infrastructure
if it's put at the beginning of the function. An alternative call chain
is:

=> __ftrace_trace_stack
=> trace_call_bpf
=> kprobe_perf_func
=> kprobe_ftrace_handler
=> 0xc095d0c8
=> btrfs_validate_metadata_buffer
=> end_bio_extent_readpage
=> end_workqueue_fn
=> btrfs_work_helper
=> process_one_work
=> worker_thread
=> kthread
=> ret_from_fork

> Thank you,
>
> From c5cd0e5f60ef6494c9e1579ec1b82b7344c41f9a Mon Sep 17 00:00:00 2001
> From: Masami Hiramatsu
> Date: Thu, 28 Jan 2021 12:31:02 +0900
> Subject: [PATCH] tracing: bpf: Remove in_nmi() check from kprobe handler
>
> Since commit 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
> changed the kprobe handler to run in the NMI context, in_nmi() always
> returns true. This means the bpf events on kprobes are always skipped.
>
> Signed-off-by: Masami Hiramatsu
> ---
>  kernel/trace/bpf_trace.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 6c0018abe68a..764400260eb6 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -96,9 +96,6 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
>  {
>  	unsigned int ret;
>
> -	if (in_nmi()) /* not supported yet */
> -		return 1;
> -
>  	cant_sleep();
>
>  	if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On 28.01.21 г. 5:38 ч., Masami Hiramatsu wrote:
> Hi,
>
> Yeah, there is. Nikolay, could you try this tentative patch?

I can confirm that with this patch everything is working. I also applied
the following diff:

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6c0018abe68a..cc5a3a18816d 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -96,8 +96,10 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
 {
 	unsigned int ret;
 
-	if (in_nmi()) /* not supported yet */
+	if (in_nmi()) /* not supported yet */ {
+		trace_dump_stack(0);
 		return 1;
+	}
 
 	cant_sleep();

And I can confirm that the branch is being hit and the following call
trace is produced:

=> __ftrace_trace_stack
=> trace_call_bpf
=> kprobe_perf_func
=> kprobe_int3_handler
=> exc_int3
=> asm_exc_int3
=> btrfs_sync_file
=> do_fsync
=> __x64_sys_fsync
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe

> Of course this just drops the NMI check from the handler, so an alternative
> checker is required. But I'm not sure what the original code concerns.
> As far as I can see, there seems to be no re-entrant block flag, nor locks
> among ebpf programs in runtime.
>
> Alexei, could you tell me what is the concerning situation for bpf?
>
> Thank you,
>
> From c5cd0e5f60ef6494c9e1579ec1b82b7344c41f9a Mon Sep 17 00:00:00 2001
> From: Masami Hiramatsu
> Date: Thu, 28 Jan 2021 12:31:02 +0900
> Subject: [PATCH] tracing: bpf: Remove in_nmi() check from kprobe handler
>
> Since commit 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
> changed the kprobe handler to run in the NMI context, in_nmi() always
> returns true. This means the bpf events on kprobes are always skipped.

FWIW I'd prefer if in addition to the original commit you also mention:

ba1f2b2eaa2a ("x86/entry: Fix NMI vs IRQ state tracking")
b6be002bcd1d ("x86/entry: Move nmi entry/exit into common code")

since they changed the way nmi state is managed in exc_int3 and not in
the original do_int3. The latter no longer contains any references to
nmi-related code.

> Signed-off-by: Masami Hiramatsu
> ---
>  kernel/trace/bpf_trace.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 6c0018abe68a..764400260eb6 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -96,9 +96,6 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
>  {
>  	unsigned int ret;
>
> -	if (in_nmi()) /* not supported yet */
> -		return 1;
> -
>  	cant_sleep();
>
>  	if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
Hi,

On Thu, 28 Jan 2021 10:34:15 +0900
Masami Hiramatsu wrote:

> On Wed, 27 Jan 2021 19:57:56 +0200
> Nikolay Borisov wrote:
>
> >
> > On 27.01.21 г. 17:24 ч., Masami Hiramatsu wrote:
> > > On Thu, 28 Jan 2021 00:13:53 +0900
> > > Masami Hiramatsu wrote:
> > >
> > >> Hi Nikolay,
> > >>
> > >> On Wed, 27 Jan 2021 15:43:29 +0200
> > >> Nikolay Borisov wrote:
> > >>
> > >>> Hello,
> > >>>
> > >>> I'm currently seeing latest Linus' master being somewhat broken w.r.t.
> > >>> kprobes. In particular I have the following test-case:
> > >>>
> > >>> #!/bin/bash
> > >>>
> > >>> mkfs.btrfs -f /dev/vdc &> /dev/null
> > >>> mount /dev/vdc /media/scratch/
> > >>>
> > >>> bpftrace -e 'kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}' &>bpf-output &
> > >>> bpf_trace_pid=$!
> > >>>
> > >>> # force btrfs_sync_file to be called
> > >>> sleep 2
> > >>> xfs_io -f -c "pwrite 0 4m" -c "fsync" /media/scratch/file5
> > >>>
> > >>> kill $bpf_trace_pid
> > >>> sleep 1
> > >>>
> > >>> grep -q kprobe bpf-output
> > >>> retval=$?
> > >>> rm -f bpf-output
> > >>> umount /media/scratch
> > >>>
> > >>> exit $retval
> > >>>
> > >>> It traces btrfs_sync_file which is called when fsync is executed on a
> > >>> btrfs file, however I don't see the stacktrace being printed, i.e. the
> > >>> kprobe doesn't fire at all. The following alternative program:
> > >>>
> > >>> bpftrace -e 'tracepoint:btrfs:btrfs_sync_file {printf("tracepoint: %s\n", kstack());} kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}'
> > >>>
> > >>> only prints the stack from the tracepoint and not from the kprobe, given
> > >>> that the tracepoint is called from the btrfs_sync_file function.
> > >>
> > >> Thank you for reporting!
> > >>
> > >> If you don't mind, could you confirm it with ftrace (tracefs)?
> > >> bpftrace etc. involves too many things. It is better to test it in a
> > >> simpler way.
> > >> I'm not familiar with bpftrace, but I think you can check it with
> > >>
> > >> # cd /sys/kernel/debug/tracing
> > >> # echo p:myevent btrfs_sync_file >> kprobe_events
> > >> # echo stacktrace > events/kprobes/myevent/trigger
> > >> (or echo 1 > options/stacktrace, if the trigger file doesn't exist)
> > >
> > > Of course, you also have to enable the event.
> > > # echo 1 > events/kprobes/myevent/enable
> > >
> > > And check the results
> > >
> > > # cat trace
> > >
> > >> Could you also share your kernel config, so that we can reproduce it?
> >
> > I've attached the config and indeed with the scenario you proposed it
> > seems to work. I see:
> >
> >    xfs_io-20280 [000] d.Z. 9900.748633: myevent: (btrfs_sync_file+0x0/0x580)
> >    xfs_io-20280 [000] d.Z. 9900.748647:
> > => kprobe_trace_func
> > => kprobe_dispatcher
> > => kprobe_int3_handler
> > => exc_int3
> > => asm_exc_int3
> > => btrfs_sync_file
> > => do_fsync
> > => __x64_sys_fsync
> > => do_syscall_64
> > => entry_SYSCALL_64_after_hwframe
>
> Hmm, then there might be a problem in bpftrace or ebpf (need more info).
> At least kprobes itself isn't broken.
> I guess they check "in_nmi()" and skip such an event?

Yeah, there is. Nikolay, could you try this tentative patch?

Of course this just drops the NMI check from the handler, so an alternative
checker is required. But I'm not sure what the original code concerns.
As far as I can see, there seems to be no re-entrant block flag, nor locks
among ebpf programs in runtime.

Alexei, could you tell me what is the concerning situation for bpf?

Thank you,

From c5cd0e5f60ef6494c9e1579ec1b82b7344c41f9a Mon Sep 17 00:00:00 2001
From: Masami Hiramatsu
Date: Thu, 28 Jan 2021 12:31:02 +0900
Subject: [PATCH] tracing: bpf: Remove in_nmi() check from kprobe handler

Since commit 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
changed the kprobe handler to run in the NMI context, in_nmi() always
returns true. This means the bpf events on kprobes are always skipped.

Signed-off-by: Masami Hiramatsu
---
 kernel/trace/bpf_trace.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6c0018abe68a..764400260eb6 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -96,9 +96,6 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
 {
 	unsigned int ret;
 
-	if (in_nmi()) /* not supported yet */
-		return 1;
-
 	cant_sleep();
 
 	if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
-- 
2.25.1

-- Masami Hiramatsu
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Wed, 27 Jan 2021 19:57:56 +0200
Nikolay Borisov wrote:

>
> On 27.01.21 г. 17:24 ч., Masami Hiramatsu wrote:
> > On Thu, 28 Jan 2021 00:13:53 +0900
> > Masami Hiramatsu wrote:
> >
> >> Hi Nikolay,
> >>
> >> On Wed, 27 Jan 2021 15:43:29 +0200
> >> Nikolay Borisov wrote:
> >>
> >>> Hello,
> >>>
> >>> I'm currently seeing latest Linus' master being somewhat broken w.r.t.
> >>> kprobes. In particular I have the following test-case:
> >>>
> >>> #!/bin/bash
> >>>
> >>> mkfs.btrfs -f /dev/vdc &> /dev/null
> >>> mount /dev/vdc /media/scratch/
> >>>
> >>> bpftrace -e 'kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}' &>bpf-output &
> >>> bpf_trace_pid=$!
> >>>
> >>> # force btrfs_sync_file to be called
> >>> sleep 2
> >>> xfs_io -f -c "pwrite 0 4m" -c "fsync" /media/scratch/file5
> >>>
> >>> kill $bpf_trace_pid
> >>> sleep 1
> >>>
> >>> grep -q kprobe bpf-output
> >>> retval=$?
> >>> rm -f bpf-output
> >>> umount /media/scratch
> >>>
> >>> exit $retval
> >>>
> >>> It traces btrfs_sync_file which is called when fsync is executed on a
> >>> btrfs file, however I don't see the stacktrace being printed, i.e. the
> >>> kprobe doesn't fire at all. The following alternative program:
> >>>
> >>> bpftrace -e 'tracepoint:btrfs:btrfs_sync_file {printf("tracepoint: %s\n", kstack());} kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}'
> >>>
> >>> only prints the stack from the tracepoint and not from the kprobe, given
> >>> that the tracepoint is called from the btrfs_sync_file function.
> >>
> >> Thank you for reporting!
> >>
> >> If you don't mind, could you confirm it with ftrace (tracefs)?
> >> bpftrace etc. involves too many things. It is better to test it in a
> >> simpler way.
> >> I'm not familiar with bpftrace, but I think you can check it with
> >>
> >> # cd /sys/kernel/debug/tracing
> >> # echo p:myevent btrfs_sync_file >> kprobe_events
> >> # echo stacktrace > events/kprobes/myevent/trigger
> >> (or echo 1 > options/stacktrace, if the trigger file doesn't exist)
> >
> > Of course, you also have to enable the event.
> > # echo 1 > events/kprobes/myevent/enable
> >
> > And check the results
> >
> > # cat trace
> >
> >> Could you also share your kernel config, so that we can reproduce it?
>
> I've attached the config and indeed with the scenario you proposed it
> seems to work. I see:
>
>    xfs_io-20280 [000] d.Z. 9900.748633: myevent: (btrfs_sync_file+0x0/0x580)
>    xfs_io-20280 [000] d.Z. 9900.748647:
> => kprobe_trace_func
> => kprobe_dispatcher
> => kprobe_int3_handler
> => exc_int3
> => asm_exc_int3
> => btrfs_sync_file
> => do_fsync
> => __x64_sys_fsync
> => do_syscall_64
> => entry_SYSCALL_64_after_hwframe

Hmm, then there might be a problem in bpftrace or ebpf (need more info).
At least kprobes itself isn't broken.
I guess they check "in_nmi()" and skip such an event?

Thank you,

-- Masami Hiramatsu
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On 27.01.21 17:24, Masami Hiramatsu wrote:
> On Thu, 28 Jan 2021 00:13:53 +0900
> Masami Hiramatsu wrote:
> 
>> Hi Nikolay,
>>
>> On Wed, 27 Jan 2021 15:43:29 +0200
>> Nikolay Borisov wrote:
>>
>>> Hello,
>>>
>>> I'm currently seeing latest Linus' master being somewhat broken w.r.t.
>>> kprobes. In particular I have the following test-case:
>>>
>>> #!/bin/bash
>>>
>>> mkfs.btrfs -f /dev/vdc &> /dev/null
>>> mount /dev/vdc /media/scratch/
>>>
>>> bpftrace -e 'kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}' &>bpf-output &
>>> bpf_trace_pid=$!
>>>
>>> # force btrfs_sync_file to be called
>>> sleep 2
>>> xfs_io -f -c "pwrite 0 4m" -c "fsync" /media/scratch/file5
>>>
>>> kill $bpf_trace_pid
>>> sleep 1
>>>
>>> grep -q kprobe bpf-output
>>> retval=$?
>>> rm -f bpf-output
>>> umount /media/scratch
>>>
>>> exit $retval
>>>
>>> It traces btrfs_sync_file, which is called when fsync is executed on a
>>> btrfs file; however, I don't see the stacktrace being printed, i.e. the
>>> kprobe doesn't fire at all. The following alternative program:
>>>
>>> bpftrace -e 'tracepoint:btrfs:btrfs_sync_file {printf("tracepoint: %s\n", kstack());} kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}'
>>>
>>> only prints the stack from the tracepoint and not from the kprobe, given
>>> that the tracepoint is called from the btrfs_sync_file function.
>>
>> Thank you for reporting!
>>
>> If you don't mind, could you confirm it with ftrace (tracefs)?
>> bpftrace etc. involves too many things; it is better to test this in a
>> simpler way. I'm not familiar with bpftrace, but I think you can check
>> it with:
>>
>> # cd /sys/kernel/debug/tracing
>> # echo p:myevent btrfs_sync_file >> kprobe_events
>> # echo stacktrace > events/kprobes/myevent/trigger
>> (or echo 1 > options/stacktrace, if the trigger file doesn't exist)
> 
> Of course, you also have to enable the event:
> 
> # echo 1 > events/kprobes/myevent/enable
> 
> And check the results:
> 
> # cat trace
> 
>> Could you also share your kernel config, so that we can reproduce it?

I've attached the config, and indeed with the scenario you proposed it
seems to work. I see:

    xfs_io-20280 [000] d.Z.  9900.748633: myevent: (btrfs_sync_file+0x0/0x580)
    xfs_io-20280 [000] d.Z.  9900.748647: <stack trace>
 => kprobe_trace_func
 => kprobe_dispatcher
 => kprobe_int3_handler
 => exc_int3
 => asm_exc_int3
 => btrfs_sync_file
 => do_fsync
 => __x64_sys_fsync
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
[Attached .config, truncated: "Automatically generated file; DO NOT
EDIT. Linux/x86 5.11.0-rc5 Kernel Configuration", built with gcc
(Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0; notable options include
CONFIG_LOCALVERSION="-default", CONFIG_NO_HZ_IDLE=y and
CONFIG_PREEMPT_NONE=y.]
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Thu, Jan 28, 2021 at 12:13:53AM +0900, Masami Hiramatsu wrote:

> Peter, would you have any idea?

Without a clear error, I've no clue yet. I'll try and investigate a
little more once we have a little more data.
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Thu, 28 Jan 2021 00:13:53 +0900
Masami Hiramatsu wrote:

> Hi Nikolay,
> 
> On Wed, 27 Jan 2021 15:43:29 +0200
> Nikolay Borisov wrote:
> 
> > Hello,
> > 
> > I'm currently seeing latest Linus' master being somewhat broken w.r.t.
> > kprobes. In particular I have the following test-case:
> > 
> > #!/bin/bash
> > 
> > mkfs.btrfs -f /dev/vdc &> /dev/null
> > mount /dev/vdc /media/scratch/
> > 
> > bpftrace -e 'kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}' &>bpf-output &
> > bpf_trace_pid=$!
> > 
> > # force btrfs_sync_file to be called
> > sleep 2
> > xfs_io -f -c "pwrite 0 4m" -c "fsync" /media/scratch/file5
> > 
> > kill $bpf_trace_pid
> > sleep 1
> > 
> > grep -q kprobe bpf-output
> > retval=$?
> > rm -f bpf-output
> > umount /media/scratch
> > 
> > exit $retval
> > 
> > It traces btrfs_sync_file, which is called when fsync is executed on a
> > btrfs file; however, I don't see the stacktrace being printed, i.e. the
> > kprobe doesn't fire at all. The following alternative program:
> > 
> > bpftrace -e 'tracepoint:btrfs:btrfs_sync_file {printf("tracepoint: %s\n", kstack());} kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}'
> > 
> > only prints the stack from the tracepoint and not from the kprobe, given
> > that the tracepoint is called from the btrfs_sync_file function.
> 
> Thank you for reporting!
> 
> If you don't mind, could you confirm it with ftrace (tracefs)?
> bpftrace etc. involves too many things; it is better to test this in a
> simpler way. I'm not familiar with bpftrace, but I think you can check
> it with:
> 
> # cd /sys/kernel/debug/tracing
> # echo p:myevent btrfs_sync_file >> kprobe_events
> # echo stacktrace > events/kprobes/myevent/trigger
> (or echo 1 > options/stacktrace, if the trigger file doesn't exist)

Of course, you also have to enable the event:

# echo 1 > events/kprobes/myevent/enable

And check the results:

# cat trace

> Could you also share your kernel config, so that we can reproduce it?

Thank you,

> > 
> > I started bisecting this and arrived at the following commit:
> > 
> > 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
> > 
> > FWIW the following series is applied on the kernel I was testing:
> > https://lore.kernel.org/lkml/159870598914.1229682.15230803449082078353.stgit@devnote2/
> > 
> > but it's still broken.
> 
> Peter, would you have any idea?
> 
> Thank you,
> 
> -- 
> Masami Hiramatsu

-- 
Masami Hiramatsu
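[Editor's note: putting the two messages above together, the complete
tracefs-only reproduction is roughly the following sketch; it assumes
debugfs is mounted at /sys/kernel/debug and a btrfs filesystem at
/media/scratch, as in Nikolay's test-case:

#!/bin/bash
cd /sys/kernel/debug/tracing

echo 'p:myevent btrfs_sync_file' >> kprobe_events
echo stacktrace > events/kprobes/myevent/trigger
echo 1 > events/kprobes/myevent/enable

# force btrfs_sync_file to be called
xfs_io -f -c "pwrite 0 4m" -c "fsync" /media/scratch/file5

cat trace

# clean up
echo 0 > events/kprobes/myevent/enable
echo '!stacktrace' > events/kprobes/myevent/trigger
echo '-:myevent' >> kprobe_events

If the kprobe fires, "cat trace" shows a myevent line followed by a
stack trace, as in Nikolay's reply above.]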
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
Hi Nikolay,

On Wed, 27 Jan 2021 15:43:29 +0200
Nikolay Borisov wrote:

> Hello,
> 
> I'm currently seeing latest Linus' master being somewhat broken w.r.t.
> kprobes. In particular I have the following test-case:
> 
> #!/bin/bash
> 
> mkfs.btrfs -f /dev/vdc &> /dev/null
> mount /dev/vdc /media/scratch/
> 
> bpftrace -e 'kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}' &>bpf-output &
> bpf_trace_pid=$!
> 
> # force btrfs_sync_file to be called
> sleep 2
> xfs_io -f -c "pwrite 0 4m" -c "fsync" /media/scratch/file5
> 
> kill $bpf_trace_pid
> sleep 1
> 
> grep -q kprobe bpf-output
> retval=$?
> rm -f bpf-output
> umount /media/scratch
> 
> exit $retval
> 
> It traces btrfs_sync_file, which is called when fsync is executed on a
> btrfs file; however, I don't see the stacktrace being printed, i.e. the
> kprobe doesn't fire at all. The following alternative program:
> 
> bpftrace -e 'tracepoint:btrfs:btrfs_sync_file {printf("tracepoint: %s\n", kstack());} kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}'
> 
> only prints the stack from the tracepoint and not from the kprobe, given
> that the tracepoint is called from the btrfs_sync_file function.

Thank you for reporting!

If you don't mind, could you confirm it with ftrace (tracefs)?
bpftrace etc. involves too many things; it is better to test this in a
simpler way. I'm not familiar with bpftrace, but I think you can check
it with:

# cd /sys/kernel/debug/tracing
# echo p:myevent btrfs_sync_file >> kprobe_events
# echo stacktrace > events/kprobes/myevent/trigger
(or echo 1 > options/stacktrace, if the trigger file doesn't exist)

Could you also share your kernel config, so that we can reproduce it?

> 
> I started bisecting this and arrived at the following commit:
> 
> 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
> 
> FWIW the following series is applied on the kernel I was testing:
> https://lore.kernel.org/lkml/159870598914.1229682.15230803449082078353.stgit@devnote2/
> 
> but it's still broken.

Peter, would you have any idea?

Thank you,

-- 
Masami Hiramatsu
kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
Hello,

I'm currently seeing latest Linus' master being somewhat broken w.r.t.
kprobes. In particular I have the following test-case:

#!/bin/bash

mkfs.btrfs -f /dev/vdc &> /dev/null
mount /dev/vdc /media/scratch/

bpftrace -e 'kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}' &>bpf-output &
bpf_trace_pid=$!

# force btrfs_sync_file to be called
sleep 2
xfs_io -f -c "pwrite 0 4m" -c "fsync" /media/scratch/file5

kill $bpf_trace_pid
sleep 1

grep -q kprobe bpf-output
retval=$?
rm -f bpf-output
umount /media/scratch

exit $retval

It traces btrfs_sync_file, which is called when fsync is executed on a
btrfs file; however, I don't see the stacktrace being printed, i.e. the
kprobe doesn't fire at all. The following alternative program:

bpftrace -e 'tracepoint:btrfs:btrfs_sync_file {printf("tracepoint: %s\n", kstack());} kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}'

only prints the stack from the tracepoint and not from the kprobe, given
that the tracepoint is called from the btrfs_sync_file function.

I started bisecting this and arrived at the following commit:

0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")

FWIW the following series is applied on the kernel I was testing:
https://lore.kernel.org/lkml/159870598914.1229682.15230803449082078353.stgit@devnote2/

but it's still broken.
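[Editor's note: a direct way to observe the context change this report
bisects to is a minimal kprobe module that reports the context its
pre-handler runs in; on kernels containing 0d00449c7a28 it should print
in_nmi()=1 even for a probe hit from plain task context. An untested,
out-of-tree sketch; the file and symbol names are illustrative:

/* kprobe_ctx.c */
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/preempt.h>

static struct kprobe kp = {
	.symbol_name = "btrfs_sync_file",
};

/* runs from the int3 exception, inside nmi_enter()/nmi_exit() */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
	pr_info("kprobe hit: in_nmi()=%d in_interrupt()=%d preempt_count()=%#x\n",
		!!in_nmi(), !!in_interrupt(), preempt_count());
	return 0;
}

static int __init kprobe_ctx_init(void)
{
	kp.pre_handler = handler_pre;
	return register_kprobe(&kp);
}

static void __exit kprobe_ctx_exit(void)
{
	unregister_kprobe(&kp);
}

module_init(kprobe_ctx_init);
module_exit(kprobe_ctx_exit);
MODULE_LICENSE("GPL");

Load it, run the fsync test-case above, and check dmesg for the
reported context.]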