Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Wed, 3 Feb 2021 22:33:28 +0900
Masami Hiramatsu wrote:

> Ah, that is what I worried about. ftrace and kprobes handlers usually want to
> know "what is the actual status of the system where the probe hits".
>
> If the new kernel_exception_enter() for ftrace/kprobes or any other kernel
> instrumentation does
>
>	__preempt_count_add(KEX_OFFSET + NMI_OFFSET + HARDIRQ_OFFSET);
>
> then we can distinguish the KEX from NMI, and get the original status of the
> context.
> What would you think about?

Oh, that reminds me of the obvious difference between an NMI and a ftrace
handler. A ftrace handler disables neither interrupts nor preemption. Thus,
if you set "in_nmi" for a ftrace handler, and an interrupt (or NMI) comes
in, then any ftrace handlers called by the interrupt / NMI will be ignored,
since they will think they are recursing from NMI context.

-- Steve
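For illustration only, a minimal sketch of the idea Masami floats above: claim
currently unused preempt_count bits for a "kernel exception" (KEX) class, so a
handler can still tell a real NMI from an instrumentation entry. KEX_BITS,
KEX_SHIFT, KEX_OFFSET, in_kex() and kernel_exception_enter()/exit() are
hypothetical names taken from the proposal, not existing kernel API, and the
assumption that free preempt_count bits exist above NMI_SHIFT may not hold:

/*
 * Hypothetical sketch -- none of the KEX_* symbols exist in the kernel;
 * this only illustrates the proposal quoted above.
 */
#define KEX_BITS	4
#define KEX_SHIFT	(NMI_SHIFT + NMI_BITS)	/* assumes these bits are free */
#define KEX_OFFSET	(1UL << KEX_SHIFT)
#define KEX_MASK	(((1UL << KEX_BITS) - 1) << KEX_SHIFT)

#define in_kex()	(preempt_count() & KEX_MASK)

static __always_inline void kernel_exception_enter(void)
{
	/* Looks like an NMI to all the existing sanity checks... */
	__preempt_count_add(KEX_OFFSET + NMI_OFFSET + HARDIRQ_OFFSET);
}

static __always_inline void kernel_exception_exit(void)
{
	__preempt_count_sub(KEX_OFFSET + NMI_OFFSET + HARDIRQ_OFFSET);
}

/* ...yet a handler can still recover the original context: each KEX
 * entry contributed exactly one NMI_OFFSET, so a genuine NMI underneath
 * shows up as a higher NMI count than the KEX count. */
static inline bool probe_hit_in_real_nmi(void)
{
	return ((preempt_count() & NMI_MASK) >> NMI_SHIFT) >
	       ((preempt_count() & KEX_MASK) >> KEX_SHIFT);
}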
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Tue, 2 Feb 2021 16:05:13 -0500
Steven Rostedt wrote:

> On Tue, 2 Feb 2021 19:30:34 +0100
> Peter Zijlstra wrote:
>
> > That does mean that kprobes are then fundamentally running from
> > in_nmi(), which is what started all this.
>
> I just thought about the fact that tracing records the context of the
> function it is called in. If you set "in_nmi()" for all ftrace handlers,
> then all functions will look like they are in an NMI context during tracing.
>
> That is, the preempt count is checked to fill in the flags in the ring
> buffer that denote what context the event (in this case the function) was
> called in.

Ah, that is what I worried about. ftrace and kprobes handlers usually want to
know "what is the actual status of the system where the probe hits".

If the new kernel_exception_enter() for ftrace/kprobes or any other kernel
instrumentation does

	__preempt_count_add(KEX_OFFSET + NMI_OFFSET + HARDIRQ_OFFSET);

then we can distinguish the KEX from NMI, and get the original status of the
context.
What would you think about?

Thank you,

-- Masami Hiramatsu
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Tue, 2 Feb 2021 19:30:34 +0100
Peter Zijlstra wrote:

> That does mean that kprobes are then fundamentally running from
> in_nmi(), which is what started all this.

I just thought about the fact that tracing records the context of the
function it is called in. If you set "in_nmi()" for all ftrace handlers,
then all functions will look like they are in an NMI context during tracing.

That is, the preempt count is checked to fill in the flags in the ring
buffer that denote what context the event (in this case the function) was
called in.

-- Steve
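To make the concern concrete, here is a simplified sketch, modeled loosely on
the kernel's tracing_generic_entry_update() (not the exact code), of how the
ring-buffer context flags are derived from the preempt count. If every ftrace
handler ran with in_nmi() true, every traced event would be stamped as NMI:

/* Simplified sketch of how tracing derives an event's context flags
 * from the preempt count; the TRACE_FLAG_* values come from the real
 * tracing headers, the helper itself is illustrative. */
static unsigned char trace_ctx_flags(void)
{
	unsigned char flags = 0;

	if (in_nmi())		/* true for *every* event if all handlers
				 * force in_nmi() */
		flags |= TRACE_FLAG_NMI;
	if (in_irq())
		flags |= TRACE_FLAG_HARDIRQ;
	if (in_serving_softirq())
		flags |= TRACE_FLAG_SOFTIRQ;

	return flags;
}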
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Tue, Feb 02, 2021 at 11:56:23AM -0500, Steven Rostedt wrote:
> NMIs are special, and they always have been. They shouldn't be doing much
> anyway. If they are, then that's a problem.

There is a fair amount of NMI level code these days, and it seems to be
ever increasing...

> My question wasn't to have them do it, I was simply asking if they do. I
> was assuming that they do not.

Per nmi_enter() we do:

	__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);

> > But it doesn't help with:
> >
> >	spin_lock_irq();	// task context
> >	#DB
> >	spin_lock_irq();	// interrupt context per your above
>
> The statement above said:
>
>  "If #DB and #BP do not change the in_interrupt() context"
>
> Which would make the above be in the same context and the handler would
> not be called for the #DB case.

But then replace the above #DB with __fentry__ and explain how it is
fundamentally different?

And consider the direct-call option you have, which calls into god knows
what code. That doesn't go past any recursion checks IIRC.

> I'm fine with #DB and #BP being "in_nmi()", as they are probably even
> more special than NMIs.

That does mean that kprobes are then fundamentally running from
in_nmi(), which is what started all this.

Sure, the opt-probes and ftrace-probes don't actually have in_nmi() set
today (because they don't trigger an exception), but given that that is
all optional, any kprobe handler had better be in_nmi() clean.
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Tue, 2 Feb 2021 17:45:47 +0100
Peter Zijlstra wrote:

> On Tue, Feb 02, 2021 at 09:52:49AM -0500, Steven Rostedt wrote:
>
> > But from a handler, you could do:
> >
> >	if (in_nmi())
> >		return;
> >	local_irq_save(flags);
> >	/* Now you are safe from being re-entrant. */
>
> But that's an utter crap thing to do. That's like saying I don't care
> about my events, at which point you might as well not bother at all.
>
> And you can still do that, you just get less coverage today than you
> used to. You used to throw things under the bus, now you throw more
> under the bus. If you didn't care, I can't seem to find myself caring
> either.

NMIs are special, and they always have been. They shouldn't be doing much
anyway. If they are, then that's a problem.

But if you want to make the stack tracer work in all contexts, I'm happy
to take patches. I don't have time to work on it today.

> > Whereas there's no equivalent in an NMI handler. That's what makes
> > kprobe/ftrace handlers different than NMI handlers.
>
> I don't see how.
>
> > > Also, given how everything can nest, it had better all be lockless
> > > anyway. You can get your regular function trace interrupted, which can
> > > hit a #DB, which can function trace, which can #BP which can function
> > > trace again which can get #NMI etc.. Many wonderful nestings possible.
> >
> > I would call #DB and #BP handlers very special.
>
> They are, just like NMI is special, which is why they're classed
> together.
>
> > Question: Do #DB and #BP set "in_interrupt()"? Because the function tracer
> > has infrastructure to prevent recursion in the same context.
>
> Sure we _could_ do that, but then we get into the 'fun' problem of
> getting a breakpoint/int3 at random places and calling random code and
> having deadlocks because they take the same lock.

My question wasn't to have them do it, I was simply asking if they do. I
was assuming that they do not.

> There was very little that stopped that from happening.
>
> > That is, a
> > ftrace handler calls something that gets traced, the recursion protection
> > will detect that and prevent the handler from being called again. But the
> > recursion protection is interrupt context aware and lets the handler get
> > called again if the recursion happens from a different context:
>
> > If #DB and #BP do not change the in_interrupt() context, then the above
> > still will protect the ftrace handlers from recursion due to them.
>
> But it doesn't help with:
>
>	spin_lock_irq();	// task context
>	#DB
>	spin_lock_irq();	// interrupt context per your above

The statement above said:

 "If #DB and #BP do not change the in_interrupt() context"

Which would make the above be in the same context and the handler would
not be called for the #DB case.

> All you need to do is put a breakpoint on a piece of code that holds a
> spinlock and a handler that takes the same spinlock.
>
> There was very little stopping that.
>
> > That would require refactoring all the code that's been around since 2008.
>
> Because I couldn't tell why/if any of that was correct at all. #DB/#BP
> don't play by the normal rules. They're _far_ more NMI-like than they're
> IRQ-like due to ignoring IF.

I'm fine with #DB and #BP being "in_nmi()", as they are probably even
more special than NMIs.

-- Steve
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Tue, Feb 02, 2021 at 09:52:49AM -0500, Steven Rostedt wrote:

> But from a handler, you could do:
>
>	if (in_nmi())
>		return;
>	local_irq_save(flags);
>	/* Now you are safe from being re-entrant. */

But that's an utter crap thing to do. That's like saying I don't care
about my events, at which point you might as well not bother at all.

And you can still do that, you just get less coverage today than you
used to. You used to throw things under the bus, now you throw more
under the bus. If you didn't care, I can't seem to find myself caring
either.

> Whereas there's no equivalent in an NMI handler. That's what makes
> kprobe/ftrace handlers different than NMI handlers.

I don't see how.

> > Also, given how everything can nest, it had better all be lockless
> > anyway. You can get your regular function trace interrupted, which can
> > hit a #DB, which can function trace, which can #BP which can function
> > trace again which can get #NMI etc.. Many wonderful nestings possible.
>
> I would call #DB and #BP handlers very special.

They are, just like NMI is special, which is why they're classed
together.

> Question: Do #DB and #BP set "in_interrupt()"? Because the function tracer
> has infrastructure to prevent recursion in the same context.

Sure we _could_ do that, but then we get into the 'fun' problem of
getting a breakpoint/int3 at random places and calling random code and
having deadlocks because they take the same lock.

There was very little that stopped that from happening.

> That is, a
> ftrace handler calls something that gets traced, the recursion protection
> will detect that and prevent the handler from being called again. But the
> recursion protection is interrupt context aware and lets the handler get
> called again if the recursion happens from a different context:

> If #DB and #BP do not change the in_interrupt() context, then the above
> still will protect the ftrace handlers from recursion due to them.

But it doesn't help with:

	spin_lock_irq();	// task context
	#DB
	spin_lock_irq();	// interrupt context per your above

All you need to do is put a breakpoint on a piece of code that holds a
spinlock and a handler that takes the same spinlock.

There was very little stopping that.

> That would require refactoring all the code that's been around since 2008.

Because I couldn't tell why/if any of that was correct at all. #DB/#BP
don't play by the normal rules. They're _far_ more NMI-like than they're
IRQ-like due to ignoring IF.
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Tue, 2 Feb 2021 11:45:41 +0100
Peter Zijlstra wrote:

> > The stack tracer checks the size of the stack, compares it to the
> > largest recorded size, and if it's bigger, it will save the stack. But
> > if this happens on two CPUs at the same time, only one can do the
> > recording at the same time. To synchronize this, a spin lock must be
> > taken. Similar to spin locks in an NMI.
>
> That sounds like something cmpxchg() should be able to do.
>
> Have a per-cpu stack trace buffer and a global max one; when the cpu-local
> one exceeds the previous max, cmpxchg the buffer.
>
> > But the problem here is, the callbacks can also be done from an NMI
> > context, so if we are in NMI, we don't want to take any locks, and
> > simply don't record the stack traces from NMIs.
>
> Which is obviously shit :-) The NMI might have interesting stack usage.

Actually, it only checks task stacks. It doesn't check IRQ stacks if they
are different than the task stack, because to do it properly, it must know
the size of the stack. The tracer currently masks the stack pointer with
THREAD_SIZE to find the top of the stack. As other stacks may not be
THREAD_SIZE, that won't work. It has been on my TODO list (for a long time)
to add an arch specific way to quickly find the top of the stack.

> > The more I think about it, the more I hate the idea that ftrace
> > callbacks and kprobes are considered NMIs. Simply because they are not!
>
> Yet they happen when IRQs are off, so they are ;-)

But from a handler, you could do:

	if (in_nmi())
		return;
	local_irq_save(flags);
	/* Now you are safe from being re-entrant. */

Whereas there's no equivalent in an NMI handler. That's what makes
kprobe/ftrace handlers different than NMI handlers.

> Also, given how everything can nest, it had better all be lockless
> anyway. You can get your regular function trace interrupted, which can
> hit a #DB, which can function trace, which can #BP which can function
> trace again which can get #NMI etc.. Many wonderful nestings possible.

I would call #DB and #BP handlers very special.

Question: Do #DB and #BP set "in_interrupt()"? Because the function tracer
has infrastructure to prevent recursion in the same context. That is, a
ftrace handler calls something that gets traced, the recursion protection
will detect that and prevent the handler from being called again. But the
recursion protection is interrupt context aware and lets the handler get
called again if the recursion happens from a different context:

	func:
	    call ftrace_caller
	ftrace_caller:
	    call ftrace_handler
	    ftrace_handler() {
		if (recursion_test())	<- false
			return;

		some_traced_func() {
			call ftrace_caller
			call ftrace_handler
			ftrace_handler() {
				if (recursion_test())	<- true
					return
			}

		(re-entered from a different context, e.g. an interrupt)
		func
			call ftrace_caller
			call ftrace_handler
			ftrace_handler() {
				if (recursion_test())	<- false
					return;
				/* continue */

If #DB and #BP do not change the in_interrupt() context, then the above
still will protect the ftrace handlers from recursion due to them.

> And god knows what these handlers end up calling.
>
> The only sane approach is treating it all as NMI and having it all
> lockless.

That would require refactoring all the code that's been around since 2008.

Worse yet, lockless is much more complex to get right. So this refactoring
will likely cause more bugs than it solves.

-- Steve
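A condensed sketch of the per-context recursion protection Steve describes,
modeled loosely on ftrace's trace_get_context_bit() and recursion bits (the
names here are simplified, not the exact kernel code): one bit per context
means a handler cannot recurse within a context, yet still runs when it is
re-entered from a different one.

/* Simplified sketch of context-aware recursion protection. */
enum { CTX_TASK, CTX_SOFTIRQ, CTX_HARDIRQ, CTX_NMI };

static __always_inline int trace_get_context_bit(void)
{
	if (in_nmi())
		return CTX_NMI;
	if (in_irq())
		return CTX_HARDIRQ;
	if (in_serving_softirq())
		return CTX_SOFTIRQ;
	return CTX_TASK;
}

/* Returns true if we are already inside a handler in this context;
 * otherwise claims the context's bit. The caller clears the bit when
 * the handler returns. */
static __always_inline bool recursion_test(unsigned long *recursion_bits)
{
	int bit = trace_get_context_bit();

	if (*recursion_bits & (1UL << bit))
		return true;
	*recursion_bits |= 1UL << bit;
	return false;
}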
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Sat, Jan 30, 2021 at 07:44:10AM -0500, Steven Rostedt wrote:
> On Sat, 30 Jan 2021 09:28:32 +0100
> Peter Zijlstra wrote:
>
> > On Fri, Jan 29, 2021 at 04:24:54PM -0500, Steven Rostedt wrote:
> > > Specifically, kprobe and ftrace callbacks may have this:
> > >
> > >	if (in_nmi())
> > >		return;
> > >
> > >	raw_spin_lock_irqsave(, flags);
> > >	[..]
> > >	raw_spin_unlock_irqrestore(, flags);
> > >
> > > Which is totally fine to have,
> >
> > Why? There's a distinct lack of explaining here.
> >
> > Note that we ripped out all such dodgy locking from kretprobes.
>
> Actually, I think you helped explain the distinction. You mention
> "kretprobes" -- do you mean the infrastructure of kretprobes or all its
> users?

kretprobe infra.

> The infrastructure of ftrace and kprobes can work in any context; that
> does not mean that the callbacks must. Again, these are more like
> exceptions. Why have "in_nmi()"? Anything that can be called by an
> NMI should just work, right? That's basically your argument for having
> ftrace and kprobes set "in_nmi".
>
> You can have locking in NMIs if the locking is *only* in NMI handlers,
> right? If that's the case, then so should ftrace and kprobe callbacks.

Which is still dodgy as heck. NMIs _can_ nest. Now mostly that doesn't
happen, and the few sites that do use spinlocks from NMI context are
sure to use them from a specific NMI 'handler' context which typically
doesn't nest.

But I still utterly detest the idea of using spinlocks from NMI. It's
inherently fragile.

> The stack tracer checks the size of the stack, compares it to the
> largest recorded size, and if it's bigger, it will save the stack. But
> if this happens on two CPUs at the same time, only one can do the
> recording at the same time. To synchronize this, a spin lock must be
> taken. Similar to spin locks in an NMI.

That sounds like something cmpxchg() should be able to do.

Have a per-cpu stack trace buffer and a global max one; when the cpu-local
one exceeds the previous max, cmpxchg the buffer.

> But the problem here is, the callbacks can also be done from an NMI
> context, so if we are in NMI, we don't want to take any locks, and
> simply don't record the stack traces from NMIs.

Which is obviously shit :-) The NMI might have interesting stack usage.

> The more I think about it, the more I hate the idea that ftrace
> callbacks and kprobes are considered NMIs. Simply because they are not!

Yet they happen when IRQs are off, so they are ;-)

Also, given how everything can nest, it had better all be lockless
anyway. You can get your regular function trace interrupted, which can
hit a #DB, which can function trace, which can #BP which can function
trace again which can get #NMI etc.. Many wonderful nestings possible.

And god knows what these handlers end up calling.

The only sane approach is treating it all as NMI and having it all
lockless.
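A minimal sketch of the lockless scheme Peter outlines, assuming each CPU owns
a spare snapshot buffer and a successful cmpxchg() transfers ownership. All
names are hypothetical, and same-CPU NMI reentrancy during the fill is ignored
for brevity:

/* Hypothetical sketch of a lockless "largest stack wins" recorder. */
#include <linux/percpu.h>
#include <linux/stacktrace.h>

struct stack_snapshot {
	unsigned long	size;
	unsigned int	nr_entries;
	unsigned long	entries[64];
};

static struct stack_snapshot init_snap;		/* size 0, never read back */
static struct stack_snapshot *max_snap = &init_snap;
/* one spare buffer per CPU, allocated at init (not shown) */
static DEFINE_PER_CPU(struct stack_snapshot *, spare_snap);

static void record_if_largest(unsigned long stack_size)
{
	struct stack_snapshot *old, *mine = this_cpu_read(spare_snap);

	old = READ_ONCE(max_snap);
	if (stack_size <= old->size)
		return;

	/* Fill our own buffer; nobody else can see it yet. */
	mine->size = stack_size;
	mine->nr_entries = stack_trace_save(mine->entries,
					    ARRAY_SIZE(mine->entries), 0);

	/* Publish. On success we hand our buffer over and take the old
	 * max as our new spare; on failure another CPU won the race and
	 * we simply keep our buffer for the next sample. */
	if (cmpxchg(&max_snap, old, mine) == old)
		this_cpu_write(spare_snap, old);
}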
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Sat, 30 Jan 2021 09:28:32 +0100
Peter Zijlstra wrote:

> On Fri, Jan 29, 2021 at 04:24:54PM -0500, Steven Rostedt wrote:
> > Specifically, kprobe and ftrace callbacks may have this:
> >
> >	if (in_nmi())
> >		return;
> >
> >	raw_spin_lock_irqsave(, flags);
> >	[..]
> >	raw_spin_unlock_irqrestore(, flags);
> >
> > Which is totally fine to have,
>
> Why? There's a distinct lack of explaining here.
>
> Note that we ripped out all such dodgy locking from kretprobes.

Actually, I think you helped explain the distinction. You mention
"kretprobes" -- do you mean the infrastructure of kretprobes or all its
users?

The infrastructure of ftrace and kprobes can work in any context; that
does not mean that the callbacks must. Again, these are more like
exceptions. Why have "in_nmi()"? Anything that can be called by an
NMI should just work, right? That's basically your argument for having
ftrace and kprobes set "in_nmi".

You can have locking in NMIs if the locking is *only* in NMI handlers,
right? If that's the case, then so should ftrace and kprobe callbacks.

The stack tracer checks the size of the stack, compares it to the
largest recorded size, and if it's bigger, it will save the stack. But
if this happens on two CPUs at the same time, only one can do the
recording at the same time. To synchronize this, a spin lock must be
taken. Similar to spin locks in an NMI.

But the problem here is, the callbacks can also be done from an NMI
context, so if we are in NMI, we don't want to take any locks, and
simply don't record the stack traces from NMIs.

The more I think about it, the more I hate the idea that ftrace
callbacks and kprobes are considered NMIs. Simply because they are not!

-- Steve
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, 29 Jan 2021 19:08:40 -0800
Alexei Starovoitov wrote:

> On Sat, Jan 30, 2021 at 11:02:49AM +0900, Masami Hiramatsu wrote:
> > On Fri, 29 Jan 2021 18:59:43 +0100
> > Peter Zijlstra wrote:
> >
> > > On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > > > Same things apply to bpf side. We can statically prove safety for
> > > > ftrace and kprobe attaching whereas to deal with NMI situation we
> > > > have to use run-time checks for recursion prevention, etc.
> > >
> > > I have no idea what you're saying. You can attach to functions that are
> > > called with random locks held, you can create kprobes in some very
> > > sensitive places.
> > >
> > > What can you statically prove about that?
> >
> > For the bpf and the kprobe tracer, if a probe hits in the NMI context,
> > it can call the handler with another handler processing events.
> >
> > kprobes is carefully avoiding the deadlock by checking recursion
> > with a per-cpu variable. But if the handler is shared with other events
> > like tracepoints, it needs its own recursion checker too.
> >
> > So, Alexei, maybe you need something like this instead of the in_nmi() check.
> >
> > DEFINE_PER_CPU(bool, under_running_bpf);
> >
> > common_handler()
> > {
> >	if (__this_cpu_read(under_running_bpf))
> >		return;
> >	__this_cpu_write(under_running_bpf, true);
> >	/* execute bpf prog */
> >	__this_cpu_write(under_running_bpf, false);
> > }
> >
> > Does this work for you?
>
> This exact check is already in trace_call_bpf,
> right after if (in_nmi()).
> See bpf_prog_active. It serves a different purpose though.
> Simply removing if (in_nmi()) from trace_call_bpf is a bit scary.
> I need to analyze all code paths first.

OK, if bpf already avoids its own recursion, the other case to consider
is that some resources are shared among bpf_prog and other parts. Since
an asynchronous NMI can occur anywhere, such resource usage can conflict
with bpf_prog.

Kprobes had a similar issue, so I set a dummy kprobe to current_kprobe
for protecting such critical sections. See kprobe_busy_begin()/end() and
where those are used.

Thank you,

-- Masami Hiramatsu
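kprobe_busy_begin()/end() are real helpers (kernel/kprobes.c): they install a
dummy kprobe as this CPU's current kprobe, so any kprobe that fires inside the
bracketed region is rejected as a recursive hit. A hedged sketch of the usage
pattern Masami refers to (the critical section itself is hypothetical):

#include <linux/kprobes.h>

/* Sketch: guard a section that kprobe handlers might also reach. */
static void touch_shared_resource(void)
{
	kprobe_busy_begin();	/* marks a dummy kprobe as running on this CPU */

	/* ... critical section that a kprobe handler must not re-enter,
	 * e.g. manipulating state the handler also touches ... */

	kprobe_busy_end();	/* clears the dummy kprobe again */
}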
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Sat, Jan 30, 2021 at 11:02:49AM +0900, Masami Hiramatsu wrote:
> On Fri, 29 Jan 2021 18:59:43 +0100
> Peter Zijlstra wrote:
>
> > On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > > Same things apply to bpf side. We can statically prove safety for
> > > ftrace and kprobe attaching whereas to deal with NMI situation we
> > > have to use run-time checks for recursion prevention, etc.
> >
> > I have no idea what you're saying. You can attach to functions that are
> > called with random locks held, you can create kprobes in some very
> > sensitive places.
> >
> > What can you statically prove about that?
>
> For the bpf and the kprobe tracer, if a probe hits in the NMI context,
> it can call the handler with another handler processing events.
>
> kprobes is carefully avoiding the deadlock by checking recursion
> with a per-cpu variable. But if the handler is shared with other events
> like tracepoints, it needs its own recursion checker too.
>
> So, Alexei, maybe you need something like this instead of the in_nmi() check.
>
> DEFINE_PER_CPU(bool, under_running_bpf);
>
> common_handler()
> {
>	if (__this_cpu_read(under_running_bpf))
>		return;
>	__this_cpu_write(under_running_bpf, true);
>	/* execute bpf prog */
>	__this_cpu_write(under_running_bpf, false);
> }
>
> Does this work for you?

This exact check is already in trace_call_bpf,
right after if (in_nmi()).
See bpf_prog_active. It serves a different purpose though.
Simply removing if (in_nmi()) from trace_call_bpf is a bit scary.
I need to analyze all code paths first.
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, Jan 29, 2021 at 04:24:54PM -0500, Steven Rostedt wrote:
> Specifically, kprobe and ftrace callbacks may have this:
>
>	if (in_nmi())
>		return;
>
>	raw_spin_lock_irqsave(, flags);
>	[..]
>	raw_spin_unlock_irqrestore(, flags);
>
> Which is totally fine to have,

Why? There's a distinct lack of explaining here.

Note that we ripped out all such dodgy locking from kretprobes.
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, 29 Jan 2021 18:59:43 +0100
Peter Zijlstra wrote:

> On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > Same things apply to bpf side. We can statically prove safety for
> > ftrace and kprobe attaching whereas to deal with NMI situation we
> > have to use run-time checks for recursion prevention, etc.
>
> I have no idea what you're saying. You can attach to functions that are
> called with random locks held, you can create kprobes in some very
> sensitive places.
>
> What can you statically prove about that?

For the bpf and the kprobe tracer, if a probe hits in the NMI context,
it can call the handler with another handler processing events.

kprobes is carefully avoiding the deadlock by checking recursion
with a per-cpu variable. But if the handler is shared with other events
like tracepoints, it needs its own recursion checker too.

So, Alexei, maybe you need something like this instead of the in_nmi() check.

DEFINE_PER_CPU(bool, under_running_bpf);

common_handler()
{
	if (__this_cpu_read(under_running_bpf))
		return;
	__this_cpu_write(under_running_bpf, true);
	/* execute bpf prog */
	__this_cpu_write(under_running_bpf, false);
}

Does this work for you?

Thank you,

-- Masami Hiramatsu
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, 29 Jan 2021 13:05:33 -0800
Alexei Starovoitov wrote:

> On Fri, Jan 29, 2021 at 02:01:03PM -0500, Steven Rostedt wrote:
> > On Fri, 29 Jan 2021 18:59:43 +0100
> > Peter Zijlstra wrote:
> >
> > > On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > > > Same things apply to bpf side. We can statically prove safety for
> > > > ftrace and kprobe attaching whereas to deal with NMI situation we
> > > > have to use run-time checks for recursion prevention, etc.
> > >
> > > I have no idea what you're saying. You can attach to functions that are
> > > called with random locks held, you can create kprobes in some very
> > > sensitive places.
> > >
> > > What can you statically prove about that?
> >
> > I think the main difference is, if you attach a kprobe or ftrace function,
> > you can theoretically analyze the location before you do the attachment.
>
> Exactly.
> When we're writing bpf helpers we need to carefully think about reentrance
> and NMI.
> If the helper looks like:
>
> int nokprobe notrace bpf_something(...)
> {
>	// access variables A and B
> }
>
> the implementation can rely on the fact that even if the helper is reentrant
> the state of A and B will be consistent. Either both got touched or none.
> Only the NMI condition we have to worry about, because A could be modified
> without touching B.
> If we write it as
>
> int nokprobe bpf_something(...) { ... }
>
> that would be another case.
> Here we need to consider the case that a bpf prog can be attached to it via
> fentry nop.
> But no need to worry about instructions split in the middle because of kprobe
> via int3.
> Since we have a big "if (in_nmi()) goto unsupported;" check in the beginning we
> only need to worry about combinations of kprobe at the start of the func,
> kprobe anywhere inside the func via int3, and ftrace at the start.
> Not having to think of NMI helps a ton.
> My earlier this_cpu vs __this_cpu comment is an example of that.
> If in_nmi is filtered early it's one implementation. If nmi has to be handled
> it's a completely different algorithm.
> Now you've broken all this logic by making int3 be marked as 'in_nmi', and
> bpf in a kprobe in the middle of the func is now broken.
> Do people use that? Yeah they do.
> We have to fix it.
> What were your reasons to make int3 in_nmi?
> I've read the commit log, but I don't see in it the actual motivation
> for int3 other than "it looks like NMI to me. Let's make it so".
> The commit logs talk about cpu exceptions. I agree that #DB and #MC do behave
> like NMI.
> But #BP is not really. My understanding is it's used by kprobes and
> text_poke_bp only.
> If the motivation was to close some issue with text_poke_bp then, sure,
> let's make handling of text_poke_bp be treated as nmi.
> But kprobe is not that.
> I'm thinking either of the following solutions would be generic:
> - introduce another state to preempt flags like "kernel exception"

I like this solution. Or, at least there should be a way to tell whether
the probed context is NMI or not.
(BTW, would the NMI have a specific stack area? If so, nmi_context(regs)
can be implemented.)

> - remove kprobe's int3 from in_nmi
> As a bpf-specific alternative we can do:
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 6c0018abe68a..37cc549ad52e 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -96,7 +96,7 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
>  {
>  	unsigned int ret;
>
> -	if (in_nmi()) /* not supported yet */
> +	if (in_nmi() && !kprobe_running()) /* not supported yet */

This doesn't make sense, because kprobe_running() is always true in the
kprobe handler. The problem is that in_nmi() checks whether the current
context is NMI context, but you want to know whether the context where
the kprobe was hit is NMI context or not.

Thank you,

-- Masami Hiramatsu
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, 29 Jan 2021 14:01:03 -0500
Steven Rostedt wrote:

> On Fri, 29 Jan 2021 18:59:43 +0100
> Peter Zijlstra wrote:
>
> > On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > > Same things apply to bpf side. We can statically prove safety for
> > > ftrace and kprobe attaching whereas to deal with NMI situation we
> > > have to use run-time checks for recursion prevention, etc.
> >
> > I have no idea what you're saying. You can attach to functions that are
> > called with random locks held, you can create kprobes in some very
> > sensitive places.
> >
> > What can you statically prove about that?
>
> I think the main difference is, if you attach a kprobe or ftrace function,
> you can theoretically analyze the location before you do the attachment.
>
> Does the NMI context mean "in_nmi()" returns true? Because there are cases
> in ftrace callbacks where that is checked (like the stack tracer). And
> having ftrace return true for "in_nmi()" will break a lot of existing
> utilities.

Specifically, kprobe and ftrace callbacks may have this:

	if (in_nmi())
		return;

	raw_spin_lock_irqsave(, flags);
	[..]
	raw_spin_unlock_irqrestore(, flags);

Which is totally fine to have, but the above only works if "in_nmi()"
returns true only if you are in a real NMI.

The stack tracer code does exactly the above.

-- Steve
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, Jan 29, 2021 at 02:01:03PM -0500, Steven Rostedt wrote:
> On Fri, 29 Jan 2021 18:59:43 +0100
> Peter Zijlstra wrote:
>
> > On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > > Same things apply to bpf side. We can statically prove safety for
> > > ftrace and kprobe attaching whereas to deal with NMI situation we
> > > have to use run-time checks for recursion prevention, etc.
> >
> > I have no idea what you're saying. You can attach to functions that are
> > called with random locks held, you can create kprobes in some very
> > sensitive places.
> >
> > What can you statically prove about that?
>
> I think the main difference is, if you attach a kprobe or ftrace function,
> you can theoretically analyze the location before you do the attachment.

Exactly.
When we're writing bpf helpers we need to carefully think about reentrance
and NMI.
If the helper looks like:

int nokprobe notrace bpf_something(...)
{
	// access variables A and B
}

the implementation can rely on the fact that even if the helper is reentrant
the state of A and B will be consistent. Either both got touched or none.
Only the NMI condition we have to worry about, because A could be modified
without touching B.

If we write it as

int nokprobe bpf_something(...) { ... }

that would be another case.
Here we need to consider the case that a bpf prog can be attached to it via
fentry nop.
But no need to worry about instructions split in the middle because of kprobe
via int3.
Since we have a big "if (in_nmi()) goto unsupported;" check in the beginning we
only need to worry about combinations of kprobe at the start of the func,
kprobe anywhere inside the func via int3, and ftrace at the start.
Not having to think of NMI helps a ton.
My earlier this_cpu vs __this_cpu comment is an example of that.
If in_nmi is filtered early it's one implementation. If nmi has to be handled
it's a completely different algorithm.

Now you've broken all this logic by making int3 be marked as 'in_nmi', and
bpf in a kprobe in the middle of the func is now broken.
Do people use that? Yeah they do. We have to fix it.

What were your reasons to make int3 in_nmi?
I've read the commit log, but I don't see in it the actual motivation
for int3 other than "it looks like NMI to me. Let's make it so".
The commit logs talk about cpu exceptions. I agree that #DB and #MC do behave
like NMI. But #BP is not really. My understanding is it's used by kprobes and
text_poke_bp only.
If the motivation was to close some issue with text_poke_bp then, sure,
let's make handling of text_poke_bp be treated as nmi.
But kprobe is not that.

I'm thinking either of the following solutions would be generic:
- introduce another state to preempt flags like "kernel exception"
- remove kprobe's int3 from in_nmi

As a bpf-specific alternative we can do:

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6c0018abe68a..37cc549ad52e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -96,7 +96,7 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
 {
 	unsigned int ret;
 
-	if (in_nmi()) /* not supported yet */
+	if (in_nmi() && !kprobe_running()) /* not supported yet */

but imo it's less clean than the first two options.
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, 29 Jan 2021 18:59:43 +0100
Peter Zijlstra wrote:

> On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > Same things apply to bpf side. We can statically prove safety for
> > ftrace and kprobe attaching whereas to deal with NMI situation we
> > have to use run-time checks for recursion prevention, etc.
>
> I have no idea what you're saying. You can attach to functions that are
> called with random locks held, you can create kprobes in some very
> sensitive places.
>
> What can you statically prove about that?

I think the main difference is, if you attach a kprobe or ftrace function,
you can theoretically analyze the location before you do the attachment.

Does the NMI context mean "in_nmi()" returns true? Because there are cases
in ftrace callbacks where that is checked (like the stack tracer). And
having ftrace return true for "in_nmi()" will break a lot of existing
utilities.

-- Steve
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> Same things apply to bpf side. We can statically prove safety for
> ftrace and kprobe attaching whereas to deal with NMI situation we
> have to use run-time checks for recursion prevention, etc.

I have no idea what you're saying. You can attach to functions that are
called with random locks held, you can create kprobes in some very
sensitive places.

What can you statically prove about that?
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, Jan 29, 2021 at 8:24 AM Peter Zijlstra wrote:
>
> On Fri, Jan 29, 2021 at 10:59:52AM -0500, Steven Rostedt wrote:
> > On Fri, 29 Jan 2021 22:40:11 +0900
> > Masami Hiramatsu wrote:
> >
> > > > So what, they can all happen with random locks held. Marking them as NMI
> > > > enables a whole bunch of sanity checks that are entirely appropriate.
> > >
> > > How about introducing an idea of Asynchronous NMI (ANMI) and Synchronous
> > > NMI (SNMI)? kprobes and ftrace are synchronously called and can be
> > > controlled (we can expect the context) but ANMI may be caused by
> > > asynchronous hardware events on any context.
> > >
> > > If we can distinguish those 2 NMIs on preempt count, bpf people can easily
> > > avoid the inevitable situation.
> >
> > I don't like the "NMI" in SNMI, because they are not NMIs. They are
> > actually more like kernel exceptions. Even page faults in the kernel are
> > similar to a kprobe breakpoint or ftrace. It can happen anywhere, with any
> > lock held. Perhaps we need a kernel exception context? Which by definition
> > is synchronous.

I like the 'kernel exception' name. SNMI doesn't sound right. There is
nothing 'non maskable' here.

> What problem are you trying to solve? AFAICT all these contexts have the
> same restrictions, why try and muck about with different names for the
> same thing?

From the core kernel perspective the difference between a 'kernel exception'
and a true NMI is huge:
this_cpu vs __this_cpu
static checks vs runtime checks

Same things apply to the bpf side. We can statically prove safety for
ftrace and kprobe attaching, whereas to deal with the NMI situation we
have to use run-time checks for recursion prevention, etc.
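On the this_cpu vs __this_cpu point: this_cpu_*() operations are single,
interrupt- and NMI-safe instructions on x86, while __this_cpu_*() is a plain
read-modify-write that is only valid when the caller can already rule out
such interference -- which is the static-versus-runtime distinction being
drawn here. An illustrative sketch (the counter and helpers are hypothetical):

DEFINE_PER_CPU(int, probe_depth);	/* hypothetical per-cpu counter */

static void handler_nmi_filtered_out(void)
{
	/*
	 * Safe only because in_nmi() was rejected earlier: nothing can
	 * interrupt this read-modify-write on this CPU mid-update.
	 */
	__this_cpu_inc(probe_depth);
}

static void handler_nmi_possible(void)
{
	/*
	 * Must be NMI-safe: on x86 this compiles to a single
	 * "incl %gs:..." so an NMI cannot land mid-update.
	 */
	this_cpu_inc(probe_depth);
}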
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Fri, Jan 29, 2021 at 10:59:52AM -0500, Steven Rostedt wrote:
> On Fri, 29 Jan 2021 22:40:11 +0900
> Masami Hiramatsu wrote:
>
> > > So what, they can all happen with random locks held. Marking them as NMI
> > > enables a whole bunch of sanity checks that are entirely appropriate.
> >
> > How about introducing an idea of Asynchronous NMI (ANMI) and Synchronous
> > NMI (SNMI)? kprobes and ftrace are synchronously called and can be
> > controlled (we can expect the context) but ANMI may be caused by
> > asynchronous hardware events on any context.
> >
> > If we can distinguish those 2 NMIs on preempt count, bpf people can easily
> > avoid the inevitable situation.
>
> I don't like the "NMI" in SNMI, because they are not NMIs. They are
> actually more like kernel exceptions. Even page faults in the kernel are
> similar to a kprobe breakpoint or ftrace. It can happen anywhere, with any
> lock held. Perhaps we need a kernel exception context? Which by definition
> is synchronous.

What problem are you trying to solve? AFAICT all these contexts have the
same restrictions, why try and muck about with different names for the
same thing?
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On 29.01.21 г. 3:34 ч., Alexei Starovoitov wrote:
> On Thu, Jan 28, 2021 at 07:24:14PM +0100, Peter Zijlstra wrote:
>> On Thu, Jan 28, 2021 at 06:45:56PM +0200, Nikolay Borisov wrote:
>>> it would be placed on the __fentry__ (and not endbr64) hence it works.
>>> So perhaps a workaround outside of bpf could essentially detect this
>>> scenario and adjust the probe to be on the __fentry__ and not the
>>> preceding instruction if it's detected to be endbr64?
>>
>> Arguably the fentry handler should also set the nmi context, it can,
>> after all, interrupt pretty much any other context by construction.
>
> But that doesn't make it a real nmi.
> An nmi can actually interrupt anything. Whereas kprobe via int3 cannot
> probe nokprobe and noinstr sections. The exposure is a lot different.
> ftrace is even more contained. It's only at the start of functions.
> That's an even smaller subset of places than kprobes.
> So ftrace < kprobe < nmi.
> Grouping them all into nmi doesn't make sense to me.
> That bpf breaking change came unnoticed mostly because people use
> kprobes at the beginning of functions only, but there are cases
> where kprobes are in the middle too. People just didn't upgrade kernels
> fast enough to notice.

nit: slight correction - I observed this while I was putting kprobes at
the beginning of the function, but __fentry__ wasn't the first thing in
the function's code. The reason why people haven't observed it is because
everyone is running with retpolines enabled, which disables CFI/CET.

> imo the appropriate solution would be to have some distinction between
> ftrace < kprobe_via_int3 < nmi, so that the bpf side can react accordingly.
> That code in trace_call_bpf:
>	if (in_nmi()) /* not supported yet */
> was necessary in the past. Now we have all sorts of protections built in.
> So it's probably ok to just drop it, but I think adding
> called_via_ftrace vs called_via_kprobe_int3 vs called_via_nmi
> is the more appropriate solution long term.
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Thu, Jan 28, 2021 at 07:24:14PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 28, 2021 at 06:45:56PM +0200, Nikolay Borisov wrote:
> > it would be placed on the __fentry__ (and not endbr64) hence it works.
> > So perhaps a workaround outside of bpf could essentially detect this
> > scenario and adjust the probe to be on the __fentry__ and not the
> > preceding instruction if it's detected to be endbr64?
>
> Arguably the fentry handler should also set the nmi context, it can,
> after all, interrupt pretty much any other context by construction.

But that doesn't make it a real nmi.
An nmi can actually interrupt anything. Whereas kprobe via int3 cannot
probe nokprobe and noinstr sections. The exposure is a lot different.
ftrace is even more contained. It's only at the start of functions.
That's an even smaller subset of places than kprobes.
So ftrace < kprobe < nmi.
Grouping them all into nmi doesn't make sense to me.

That bpf breaking change came unnoticed mostly because people use
kprobes at the beginning of functions only, but there are cases
where kprobes are in the middle too. People just didn't upgrade kernels
fast enough to notice.

imo the appropriate solution would be to have some distinction between
ftrace < kprobe_via_int3 < nmi, so that the bpf side can react accordingly.

That code in trace_call_bpf:

	if (in_nmi()) /* not supported yet */

was necessary in the past. Now we have all sorts of protections built in.
So it's probably ok to just drop it, but I think adding
called_via_ftrace vs called_via_kprobe_int3 vs called_via_nmi
is the more appropriate solution long term.
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Thu, Jan 28, 2021 at 06:45:56PM +0200, Nikolay Borisov wrote:
> it would be placed on the __fentry__ (and not endbr64) hence it works.
> So perhaps a workaround outside of bpf could essentially detect this
> scenario and adjust the probe to be on the __fentry__ and not the
> preceding instruction if it's detected to be endbr64?

Arguably the fentry handler should also set the nmi context, it can,
after all, interrupt pretty much any other context by construction.
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Thu, Jan 28, 2021 at 06:45:56PM +0200, Nikolay Borisov wrote:
> On 28.01.21 г. 18:12 ч., Nikolay Borisov wrote:
> > On 28.01.21 г. 5:38 ч., Masami Hiramatsu wrote:
> >> Hi,
> >
> >> Alexei, could you tell me what is the concerning situation for bpf?
> >
> > Another data point, Masami, is that this affects bpf kprobes which are
> > entered via int3; alternatively, if the kprobe is entered via
> > kprobe_ftrace_handler it works as expected. I haven't been able to
> > determine why a particular bpf probe won't use ftrace's infrastructure
> > if it's put at the beginning of the function. An alternative call chain
> > is:
> >
> > => __ftrace_trace_stack
> > => trace_call_bpf
> > => kprobe_perf_func
> > => kprobe_ftrace_handler
> > => 0xc095d0c8
> > => btrfs_validate_metadata_buffer
> > => end_bio_extent_readpage
> > => end_workqueue_fn
> > => btrfs_work_helper
> > => process_one_work
> > => worker_thread
> > => kthread
> > => ret_from_fork
>
> I have a working theory why I'm seeing this. My (broken) kernel was
> compiled with retpolines off and with the gcc that comes with Ubuntu
> (both 9 and 10:
> gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
> gcc-10 (Ubuntu 10.2.0-5ubuntu1~20.04) 10.2.0
> ).
>
> This results in CFI being enabled, so functions look like:
>
> 0x81493890 <+0>:	endbr64
> 0x81493894 <+4>:	callq  0x8104d820 <__fentry__>
>
> i.e. fentry's thunk is not the first instruction of the function, hence
> it's not going through the optimized ftrace handler. Instead it's using
> int3, which is broken as ascertained.
>
> After testing with my testcase I can confirm that with CFI off and
> __fentry__ being the first instruction, bpf starts working. And indeed,
> even with CFI turned on, if I use a probe like:
>
> bpftrace -e 'kprobe:btrfs_sync_file+4 {printf("kprobe: %s\n", kstack());}' &>bpf-output &
>
> it would be placed on the __fentry__ (and not endbr64), hence it works.
> So perhaps a workaround outside of bpf could essentially detect this
> scenario and adjust the probe to be on the __fentry__ and not the
> preceding instruction if it's detected to be endbr64?

For now (and the foreseeable future), CET isn't enabled in the kernel.
So that endbr64 shouldn't be there in the first place.

I can make a proper patch in a bit.

diff --git a/Makefile b/Makefile
index e0af7a4a5598..5ccc4cdf1fb5 100644
--- a/Makefile
+++ b/Makefile
@@ -948,11 +948,8 @@ KBUILD_CFLAGS += $(call cc-option,-Werror=designated-init)
 # change __FILE__ to the relative path from the srctree
 KBUILD_CPPFLAGS += $(call cc-option,-fmacro-prefix-map=$(srctree)/=)
 
-# ensure -fcf-protection is disabled when using retpoline as it is
-# incompatible with -mindirect-branch=thunk-extern
-ifdef CONFIG_RETPOLINE
+# Intel CET isn't enabled in the kernel
 KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
-endif
 
 # include additional Makefiles when needed
 include-y := scripts/Makefile.extrawarn
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On 28.01.21 г. 18:12 ч., Nikolay Borisov wrote:
>
> On 28.01.21 г. 5:38 ч., Masami Hiramatsu wrote:
>> Hi,
>
>> Alexei, could you tell me what is the concerning situation for bpf?
>
> Another data point, Masami, is that this affects bpf kprobes which are
> entered via int3; alternatively, if the kprobe is entered via
> kprobe_ftrace_handler it works as expected. I haven't been able to
> determine why a particular bpf probe won't use ftrace's infrastructure
> if it's put at the beginning of the function. An alternative call chain
> is:
>
> => __ftrace_trace_stack
> => trace_call_bpf
> => kprobe_perf_func
> => kprobe_ftrace_handler
> => 0xc095d0c8
> => btrfs_validate_metadata_buffer
> => end_bio_extent_readpage
> => end_workqueue_fn
> => btrfs_work_helper
> => process_one_work
> => worker_thread
> => kthread
> => ret_from_fork

I have a working theory why I'm seeing this. My (broken) kernel was
compiled with retpolines off and with the gcc that comes with Ubuntu
(both 9 and 10:

gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
gcc-10 (Ubuntu 10.2.0-5ubuntu1~20.04) 10.2.0
).

This results in CFI being enabled, so functions look like:

0x81493890 <+0>:	endbr64
0x81493894 <+4>:	callq  0x8104d820 <__fentry__>

i.e. fentry's thunk is not the first instruction of the function, hence
it's not going through the optimized ftrace handler. Instead it's using
int3, which is broken as ascertained.

After testing with my testcase I can confirm that with CFI off and
__fentry__ being the first instruction, bpf starts working. And indeed,
even with CFI turned on, if I use a probe like:

bpftrace -e 'kprobe:btrfs_sync_file+4 {printf("kprobe: %s\n", kstack());}' &>bpf-output &

it would be placed on the __fentry__ (and not endbr64), hence it works.
So perhaps a workaround outside of bpf could essentially detect this
scenario and adjust the probe to be on the __fentry__ and not the
preceding instruction if it's detected to be endbr64?
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On 28.01.21 г. 5:38 ч., Masami Hiramatsu wrote:
> Hi,
>
> Alexei, could you tell me what is the concerning situation for bpf?

Another data point, Masami, is that this affects bpf kprobes which are
entered via int3; alternatively, if the kprobe is entered via
kprobe_ftrace_handler it works as expected. I haven't been able to
determine why a particular bpf probe won't use ftrace's infrastructure
if it's put at the beginning of the function. An alternative call chain
is:

=> __ftrace_trace_stack
=> trace_call_bpf
=> kprobe_perf_func
=> kprobe_ftrace_handler
=> 0xc095d0c8
=> btrfs_validate_metadata_buffer
=> end_bio_extent_readpage
=> end_workqueue_fn
=> btrfs_work_helper
=> process_one_work
=> worker_thread
=> kthread
=> ret_from_fork

> Thank you,
>
> From c5cd0e5f60ef6494c9e1579ec1b82b7344c41f9a Mon Sep 17 00:00:00 2001
> From: Masami Hiramatsu
> Date: Thu, 28 Jan 2021 12:31:02 +0900
> Subject: [PATCH] tracing: bpf: Remove in_nmi() check from kprobe handler
>
> Since commit 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
> changed the kprobe handler to run in the NMI context, in_nmi() always
> returns true. This means the bpf events on kprobes are always skipped.
>
> Signed-off-by: Masami Hiramatsu
> ---
>  kernel/trace/bpf_trace.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 6c0018abe68a..764400260eb6 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -96,9 +96,6 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
>  {
>  	unsigned int ret;
>
> -	if (in_nmi()) /* not supported yet */
> -		return 1;
> -
>  	cant_sleep();
>
>  	if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On 28.01.21 г. 5:38 ч., Masami Hiramatsu wrote:
> Hi,
>
> Yeah, there is. Nikolay, could you try this tentative patch?

I can confirm that with this patch everything is working. I also applied
the following diff:

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6c0018abe68a..cc5a3a18816d 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -96,8 +96,10 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
 {
 	unsigned int ret;
 
-	if (in_nmi()) /* not supported yet */
+	if (in_nmi()) /* not supported yet */ {
+		trace_dump_stack(0);
 		return 1;
+	}
 
 	cant_sleep();

And I can confirm that the branch is being hit and the following call
trace is produced:

=> __ftrace_trace_stack
=> trace_call_bpf
=> kprobe_perf_func
=> kprobe_int3_handler
=> exc_int3
=> asm_exc_int3
=> btrfs_sync_file
=> do_fsync
=> __x64_sys_fsync
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe

> Of course this just drops the NMI check from the handler, so an alternative
> checker is required. But I'm not sure what the original code concerns.
> As far as I can see, there seems to be no re-entrant block flag, nor locks
> among ebpf programs in runtime.
>
> Alexei, could you tell me what is the concerning situation for bpf?
>
> Thank you,
>
> From c5cd0e5f60ef6494c9e1579ec1b82b7344c41f9a Mon Sep 17 00:00:00 2001
> From: Masami Hiramatsu
> Date: Thu, 28 Jan 2021 12:31:02 +0900
> Subject: [PATCH] tracing: bpf: Remove in_nmi() check from kprobe handler
>
> Since commit 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
> changed the kprobe handler to run in the NMI context, in_nmi() always
> returns true. This means the bpf events on kprobes are always skipped.

FWIW I'd prefer if in addition to the original commit you also mention:

ba1f2b2eaa2a ("x86/entry: Fix NMI vs IRQ state tracking")
b6be002bcd1d ("x86/entry: Move nmi entry/exit into common code")

since they changed the way nmi state is managed in exc_int3 and not in
the original do_int3. The latter no longer contains any references to
nmi-related code.

> Signed-off-by: Masami Hiramatsu
> ---
>  kernel/trace/bpf_trace.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 6c0018abe68a..764400260eb6 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -96,9 +96,6 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
>  {
>  	unsigned int ret;
>
> -	if (in_nmi()) /* not supported yet */
> -		return 1;
> -
>  	cant_sleep();
>
>  	if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
Hi,

On Thu, 28 Jan 2021 10:34:15 +0900
Masami Hiramatsu wrote:

> On Wed, 27 Jan 2021 19:57:56 +0200
> Nikolay Borisov wrote:
>
> >
> > On 27.01.21 г. 17:24 ч., Masami Hiramatsu wrote:
> > > On Thu, 28 Jan 2021 00:13:53 +0900
> > > Masami Hiramatsu wrote:
> > >
> > >> Hi Nikolay,
> > >>
> > >> On Wed, 27 Jan 2021 15:43:29 +0200
> > >> Nikolay Borisov wrote:
> > >>
> > >>> Hello,
> > >>>
> > >>> I'm currently seeing latest Linus' master being somewhat broken w.r.t.
> > >>> kprobes. In particular I have the following test-case:
> > >>>
> > >>> #!/bin/bash
> > >>>
> > >>> mkfs.btrfs -f /dev/vdc &> /dev/null
> > >>> mount /dev/vdc /media/scratch/
> > >>>
> > >>> bpftrace -e 'kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}' &>bpf-output &
> > >>> bpf_trace_pid=$!
> > >>>
> > >>> # force btrfs_sync_file to be called
> > >>> sleep 2
> > >>> xfs_io -f -c "pwrite 0 4m" -c "fsync" /media/scratch/file5
> > >>>
> > >>> kill $bpf_trace_pid
> > >>> sleep 1
> > >>>
> > >>> grep -q kprobe bpf-output
> > >>> retval=$?
> > >>> rm -f bpf-output
> > >>> umount /media/scratch
> > >>>
> > >>> exit $retval
> > >>>
> > >>> It traces btrfs_sync_file which is called when fsync is executed on a
> > >>> btrfs file, however I don't see the stacktrace being printed, i.e. the
> > >>> kprobe doesn't fire at all. The following alternative program:
> > >>>
> > >>> bpftrace -e 'tracepoint:btrfs:btrfs_sync_file {printf("tracepoint: %s\n", kstack());} kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}'
> > >>>
> > >>> only prints the stack from the tracepoint and not from the kprobe, given
> > >>> that the tracepoint is called from the btrfs_sync_file function.
> > >>
> > >> Thank you for reporting!
> > >>
> > >> If you don't mind, could you confirm it with ftrace (tracefs)?
> > >> bpftrace etc. involves too many things. It is better to test it in a
> > >> simpler way.
> > >> I'm not familiar with bpftrace, but I think you can check it with
> > >>
> > >> # cd /sys/kernel/debug/tracing
> > >> # echo p:myevent btrfs_sync_file >> kprobe_events
> > >> # echo stacktrace > events/kprobes/myevent/trigger
> > >> (or echo 1 > options/stacktrace, if the trigger file doesn't exist)
> > >
> > > Of course, you also have to enable the event.
> > > # echo 1 > events/kprobes/myevent/enable
> > >
> > > And check the results
> > >
> > > # cat trace
> > >
> > >> Could you also share your kernel config, so that we can reproduce it?
> >
> > I've attached the config and indeed with the scenario you proposed it
> > seems to work. I see:
> >
> >    xfs_io-20280 [000] d.Z. 9900.748633: myevent: (btrfs_sync_file+0x0/0x580)
> >    xfs_io-20280 [000] d.Z. 9900.748647:
> > => kprobe_trace_func
> > => kprobe_dispatcher
> > => kprobe_int3_handler
> > => exc_int3
> > => asm_exc_int3
> > => btrfs_sync_file
> > => do_fsync
> > => __x64_sys_fsync
> > => do_syscall_64
> > => entry_SYSCALL_64_after_hwframe
>
> Hmm, then there might be a problem in bpftrace or ebpf (need more info).
> At least kprobes itself isn't broken.
> I guess they check "in_nmi()" and skip such an event?

Yeah, there is. Nikolay, could you try this tentative patch?

Of course this just drops the NMI check from the handler, so an alternative
checker is required. But I'm not sure what the original code concerns.
As far as I can see, there seems to be no re-entrant block flag, nor locks
among ebpf programs in runtime.

Alexei, could you tell me what is the concerning situation for bpf?

Thank you,

From c5cd0e5f60ef6494c9e1579ec1b82b7344c41f9a Mon Sep 17 00:00:00 2001
From: Masami Hiramatsu
Date: Thu, 28 Jan 2021 12:31:02 +0900
Subject: [PATCH] tracing: bpf: Remove in_nmi() check from kprobe handler

Since commit 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
changed the kprobe handler to run in the NMI context, in_nmi() always
returns true. This means the bpf events on kprobes are always skipped.

Signed-off-by: Masami Hiramatsu
---
 kernel/trace/bpf_trace.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6c0018abe68a..764400260eb6 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -96,9 +96,6 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
 {
 	unsigned int ret;
 
-	if (in_nmi()) /* not supported yet */
-		return 1;
-
 	cant_sleep();
 
 	if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
-- 
2.25.1

-- Masami Hiramatsu
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Wed, 27 Jan 2021 19:57:56 +0200
Nikolay Borisov wrote:

>
> On 27.01.21 г. 17:24 ч., Masami Hiramatsu wrote:
> > On Thu, 28 Jan 2021 00:13:53 +0900
> > Masami Hiramatsu wrote:
> >
> >> Hi Nikolay,
> >>
> >> On Wed, 27 Jan 2021 15:43:29 +0200
> >> Nikolay Borisov wrote:
> >>
> >>> Hello,
> >>>
> >>> I'm currently seeing latest Linus' master being somewhat broken w.r.t.
> >>> kprobes. In particular I have the following test-case:
> >>>
> >>> #!/bin/bash
> >>>
> >>> mkfs.btrfs -f /dev/vdc &> /dev/null
> >>> mount /dev/vdc /media/scratch/
> >>>
> >>> bpftrace -e 'kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}' &>bpf-output &
> >>> bpf_trace_pid=$!
> >>>
> >>> # force btrfs_sync_file to be called
> >>> sleep 2
> >>> xfs_io -f -c "pwrite 0 4m" -c "fsync" /media/scratch/file5
> >>>
> >>> kill $bpf_trace_pid
> >>> sleep 1
> >>>
> >>> grep -q kprobe bpf-output
> >>> retval=$?
> >>> rm -f bpf-output
> >>> umount /media/scratch
> >>>
> >>> exit $retval
> >>>
> >>> It traces btrfs_sync_file which is called when fsync is executed on a
> >>> btrfs file, however I don't see the stacktrace being printed, i.e. the
> >>> kprobe doesn't fire at all. The following alternative program:
> >>>
> >>> bpftrace -e 'tracepoint:btrfs:btrfs_sync_file {printf("tracepoint: %s\n", kstack());} kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}'
> >>>
> >>> only prints the stack from the tracepoint and not from the kprobe, given
> >>> that the tracepoint is called from the btrfs_sync_file function.
> >>
> >> Thank you for reporting!
> >>
> >> If you don't mind, could you confirm it with ftrace (tracefs)?
> >> bpftrace etc. involves too many things. It is better to test it in a
> >> simpler way.
> >> I'm not familiar with bpftrace, but I think you can check it with
> >>
> >> # cd /sys/kernel/debug/tracing
> >> # echo p:myevent btrfs_sync_file >> kprobe_events
> >> # echo stacktrace > events/kprobes/myevent/trigger
> >> (or echo 1 > options/stacktrace, if the trigger file doesn't exist)
> >
> > Of course, you also have to enable the event.
> > # echo 1 > events/kprobes/myevent/enable
> >
> > And check the results
> >
> > # cat trace
> >
> >> Could you also share your kernel config, so that we can reproduce it?
>
> I've attached the config and indeed with the scenario you proposed it
> seems to work. I see:
>
>    xfs_io-20280 [000] d.Z. 9900.748633: myevent: (btrfs_sync_file+0x0/0x580)
>    xfs_io-20280 [000] d.Z. 9900.748647:
> => kprobe_trace_func
> => kprobe_dispatcher
> => kprobe_int3_handler
> => exc_int3
> => asm_exc_int3
> => btrfs_sync_file
> => do_fsync
> => __x64_sys_fsync
> => do_syscall_64
> => entry_SYSCALL_64_after_hwframe

Hmm, then there might be a problem in bpftrace or ebpf (need more info).
At least kprobes itself isn't broken.
I guess they check "in_nmi()" and skip such an event?

Thank you,

-- Masami Hiramatsu
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On 27.01.21 17:24, Masami Hiramatsu wrote:
> On Thu, 28 Jan 2021 00:13:53 +0900
> Masami Hiramatsu wrote:
> 
>> Hi Nikolay,
>>
>> On Wed, 27 Jan 2021 15:43:29 +0200
>> Nikolay Borisov wrote:
>>
>>> Hello,
>>>
>>> I'm currently seeing latest Linus' master being somewhat broken w.r.t.
>>> kprobes. In particular I have the following test-case:
>>>
>>> #!/bin/bash
>>>
>>> mkfs.btrfs -f /dev/vdc &> /dev/null
>>> mount /dev/vdc /media/scratch/
>>>
>>> bpftrace -e 'kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}' &>bpf-output &
>>> bpf_trace_pid=$!
>>>
>>> # force btrfs_sync_file to be called
>>> sleep 2
>>> xfs_io -f -c "pwrite 0 4m" -c "fsync" /media/scratch/file5
>>>
>>> kill $bpf_trace_pid
>>> sleep 1
>>>
>>> grep -q kprobe bpf-output
>>> retval=$?
>>> rm -f bpf-output
>>> umount /media/scratch
>>>
>>> exit $retval
>>>
>>> It traces btrfs_sync_file, which is called when fsync is executed on a
>>> btrfs file; however, I don't see the stacktrace being printed, i.e. the
>>> kprobe doesn't fire at all. The following alternative program:
>>>
>>> bpftrace -e 'tracepoint:btrfs:btrfs_sync_file {printf("tracepoint: %s\n", kstack());} kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}'
>>>
>>> only prints the stack from the tracepoint and not from the kprobe, given
>>> that the tracepoint is called from the btrfs_sync_file function.
>>
>> Thank you for reporting!
>>
>> If you don't mind, could you confirm it with ftrace (tracefs)?
>> bpftrace etc. involves too many things; it is better to test this in a
>> simpler way. I'm not familiar with bpftrace, but I think you can check
>> it with:
>>
>> # cd /sys/kernel/debug/tracing
>> # echo p:myevent btrfs_sync_file >> kprobe_events
>> # echo stacktrace > events/kprobes/myevent/trigger
>> (or echo 1 > options/stacktrace, if the trigger file doesn't exist)
> 
> Of course, you also have to enable the event:
> 
> # echo 1 > events/kprobes/myevent/enable
> 
> And check the results:
> 
> # cat trace
> 
>> Could you also share your kernel config, so that we can reproduce it?

I've attached the config, and indeed with the scenario you proposed it
seems to work. I see:

    xfs_io-20280 [000] d.Z.  9900.748633: myevent: (btrfs_sync_file+0x0/0x580)
    xfs_io-20280 [000] d.Z.  9900.748647: <stack trace>
 => kprobe_trace_func
 => kprobe_dispatcher
 => kprobe_int3_handler
 => exc_int3
 => asm_exc_int3
 => btrfs_sync_file
 => do_fsync
 => __x64_sys_fsync
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
[Attached .config, truncated: "Automatically generated file; DO NOT
EDIT. Linux/x86 5.11.0-rc5 Kernel Configuration", built with gcc
(Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0; notable options include
CONFIG_LOCALVERSION="-default", CONFIG_NO_HZ_IDLE=y and
CONFIG_PREEMPT_NONE=y.]
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Thu, Jan 28, 2021 at 12:13:53AM +0900, Masami Hiramatsu wrote:

> Peter, would you have any idea?

Without a clear error, I've no clue yet. I'll try and investigate a
little more once we have a little more data.
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
On Thu, 28 Jan 2021 00:13:53 +0900
Masami Hiramatsu wrote:

> Hi Nikolay,
> 
> On Wed, 27 Jan 2021 15:43:29 +0200
> Nikolay Borisov wrote:
> 
> > Hello,
> > 
> > I'm currently seeing latest Linus' master being somewhat broken w.r.t.
> > kprobes. In particular I have the following test-case:
> > 
> > #!/bin/bash
> > 
> > mkfs.btrfs -f /dev/vdc &> /dev/null
> > mount /dev/vdc /media/scratch/
> > 
> > bpftrace -e 'kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}' &>bpf-output &
> > bpf_trace_pid=$!
> > 
> > # force btrfs_sync_file to be called
> > sleep 2
> > xfs_io -f -c "pwrite 0 4m" -c "fsync" /media/scratch/file5
> > 
> > kill $bpf_trace_pid
> > sleep 1
> > 
> > grep -q kprobe bpf-output
> > retval=$?
> > rm -f bpf-output
> > umount /media/scratch
> > 
> > exit $retval
> > 
> > It traces btrfs_sync_file, which is called when fsync is executed on a
> > btrfs file; however, I don't see the stacktrace being printed, i.e. the
> > kprobe doesn't fire at all. The following alternative program:
> > 
> > bpftrace -e 'tracepoint:btrfs:btrfs_sync_file {printf("tracepoint: %s\n", kstack());} kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}'
> > 
> > only prints the stack from the tracepoint and not from the kprobe, given
> > that the tracepoint is called from the btrfs_sync_file function.
> 
> Thank you for reporting!
> 
> If you don't mind, could you confirm it with ftrace (tracefs)?
> bpftrace etc. involves too many things; it is better to test this in a
> simpler way. I'm not familiar with bpftrace, but I think you can check
> it with:
> 
> # cd /sys/kernel/debug/tracing
> # echo p:myevent btrfs_sync_file >> kprobe_events
> # echo stacktrace > events/kprobes/myevent/trigger
> (or echo 1 > options/stacktrace, if the trigger file doesn't exist)

Of course, you also have to enable the event:

# echo 1 > events/kprobes/myevent/enable

And check the results:

# cat trace

> Could you also share your kernel config, so that we can reproduce it?

Thank you,

> > 
> > I started bisecting this and arrived at the following commit:
> > 
> > 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
> > 
> > FWIW the following series is applied on the kernel I was testing:
> > https://lore.kernel.org/lkml/159870598914.1229682.15230803449082078353.stgit@devnote2/
> > 
> > but it's still broken.
> 
> Peter, would you have any idea?
> 
> Thank you,
> 
> -- 
> Masami Hiramatsu

-- 
Masami Hiramatsu
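[Editor's note: putting the two messages above together, the complete
tracefs-only reproduction is roughly the following sketch; it assumes
debugfs is mounted at /sys/kernel/debug and a btrfs filesystem at
/media/scratch, as in Nikolay's test-case:

#!/bin/bash
cd /sys/kernel/debug/tracing

echo 'p:myevent btrfs_sync_file' >> kprobe_events
echo stacktrace > events/kprobes/myevent/trigger
echo 1 > events/kprobes/myevent/enable

# force btrfs_sync_file to be called
xfs_io -f -c "pwrite 0 4m" -c "fsync" /media/scratch/file5

cat trace

# clean up
echo 0 > events/kprobes/myevent/enable
echo '!stacktrace' > events/kprobes/myevent/trigger
echo '-:myevent' >> kprobe_events

If the kprobe fires, "cat trace" shows a myevent line followed by a
stack trace, as in Nikolay's reply above.]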
Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
Hi Nikolay,

On Wed, 27 Jan 2021 15:43:29 +0200
Nikolay Borisov wrote:

> Hello,
> 
> I'm currently seeing latest Linus' master being somewhat broken w.r.t.
> kprobes. In particular I have the following test-case:
> 
> #!/bin/bash
> 
> mkfs.btrfs -f /dev/vdc &> /dev/null
> mount /dev/vdc /media/scratch/
> 
> bpftrace -e 'kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}' &>bpf-output &
> bpf_trace_pid=$!
> 
> # force btrfs_sync_file to be called
> sleep 2
> xfs_io -f -c "pwrite 0 4m" -c "fsync" /media/scratch/file5
> 
> kill $bpf_trace_pid
> sleep 1
> 
> grep -q kprobe bpf-output
> retval=$?
> rm -f bpf-output
> umount /media/scratch
> 
> exit $retval
> 
> It traces btrfs_sync_file, which is called when fsync is executed on a
> btrfs file; however, I don't see the stacktrace being printed, i.e. the
> kprobe doesn't fire at all. The following alternative program:
> 
> bpftrace -e 'tracepoint:btrfs:btrfs_sync_file {printf("tracepoint: %s\n", kstack());} kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}'
> 
> only prints the stack from the tracepoint and not from the kprobe, given
> that the tracepoint is called from the btrfs_sync_file function.

Thank you for reporting!

If you don't mind, could you confirm it with ftrace (tracefs)?
bpftrace etc. involves too many things; it is better to test this in a
simpler way. I'm not familiar with bpftrace, but I think you can check
it with:

# cd /sys/kernel/debug/tracing
# echo p:myevent btrfs_sync_file >> kprobe_events
# echo stacktrace > events/kprobes/myevent/trigger
(or echo 1 > options/stacktrace, if the trigger file doesn't exist)

Could you also share your kernel config, so that we can reproduce it?

> 
> I started bisecting this and arrived at the following commit:
> 
> 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
> 
> FWIW the following series is applied on the kernel I was testing:
> https://lore.kernel.org/lkml/159870598914.1229682.15230803449082078353.stgit@devnote2/
> 
> but it's still broken.

Peter, would you have any idea?

Thank you,

-- 
Masami Hiramatsu
kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
Hello,

I'm currently seeing latest Linus' master being somewhat broken w.r.t.
kprobes. In particular I have the following test-case:

#!/bin/bash

mkfs.btrfs -f /dev/vdc &> /dev/null
mount /dev/vdc /media/scratch/

bpftrace -e 'kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}' &>bpf-output &
bpf_trace_pid=$!

# force btrfs_sync_file to be called
sleep 2
xfs_io -f -c "pwrite 0 4m" -c "fsync" /media/scratch/file5

kill $bpf_trace_pid
sleep 1

grep -q kprobe bpf-output
retval=$?
rm -f bpf-output
umount /media/scratch

exit $retval

It traces btrfs_sync_file, which is called when fsync is executed on a
btrfs file; however, I don't see the stacktrace being printed, i.e. the
kprobe doesn't fire at all. The following alternative program:

bpftrace -e 'tracepoint:btrfs:btrfs_sync_file {printf("tracepoint: %s\n", kstack());} kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}'

only prints the stack from the tracepoint and not from the kprobe, given
that the tracepoint is called from the btrfs_sync_file function.

I started bisecting this and arrived at the following commit:

0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")

FWIW the following series is applied on the kernel I was testing:
https://lore.kernel.org/lkml/159870598914.1229682.15230803449082078353.stgit@devnote2/

but it's still broken.
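[Editor's note: a direct way to observe the context change this report
bisects to is a minimal kprobe module that reports the context its
pre-handler runs in; on kernels containing 0d00449c7a28 it should print
in_nmi()=1 even for a probe hit from plain task context. An untested,
out-of-tree sketch; the file and symbol names are illustrative:

/* kprobe_ctx.c */
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/preempt.h>

static struct kprobe kp = {
	.symbol_name = "btrfs_sync_file",
};

/* runs from the int3 exception, inside nmi_enter()/nmi_exit() */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
	pr_info("kprobe hit: in_nmi()=%d in_interrupt()=%d preempt_count()=%#x\n",
		!!in_nmi(), !!in_interrupt(), preempt_count());
	return 0;
}

static int __init kprobe_ctx_init(void)
{
	kp.pre_handler = handler_pre;
	return register_kprobe(&kp);
}

static void __exit kprobe_ctx_exit(void)
{
	unregister_kprobe(&kp);
}

module_init(kprobe_ctx_init);
module_exit(kprobe_ctx_exit);
MODULE_LICENSE("GPL");

Load it, run the fsync test-case above, and check dmesg for the
reported context.]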