Hi,

On Thu, Dec 10, 2020 at 10:25:07AM +0900, Masami Hiramatsu wrote:
> Hi Adam,
> 
> Thank you for reporting and debugging, actually we had investigated this
> issue in Aug. Please read this thread.
> 
> https://lkml.kernel.org/r/8816bdbbc55c4d2397e0b02aad282...@trendmicro.com
> 

Thanks for the link. I've read the entire history of proposed fix - very 
informative :)

> We finally fixed this issue by commit e03b4a084ea6 ("kprobes: Remove NMI
> context check") and commit 645f224e7ba2 ("kprobes: Tell lockdep about
> kprobe nesting"). Yeah, it will be in the v5.10.
> 
> Could you try to test whether these commits can solve the issue?

I've applied these commits on the top of kernel 5.9.7 and verified that it 
works well. Non-optimized KRETPROBES are back to life.

However, there might be another issue which I wanted to brought / discuss - 
problem with optimizer. Until kernel 5.9 KRETPROBE on e.g. 
'ftrace_enable_sysctl' function was correctly optimized without any problems. 
Since 5.9 it can't be optimized anynmore. I didn't see any changes in the 
sources regarding the optimizer, neither function itself.
When I looked at the generated vmlinux binary, I can see that GCC generated 
padding at the end of this function using INT3 opcode:

...
ffffffff8130528b:       41 bd f0 ff ff ff       mov    $0xfffffff0,%r13d
ffffffff81305291:       e9 fe fe ff ff          jmpq   ffffffff81305194 
<ftrace_enable_sysctl+0x114>
ffffffff81305296:       cc                      int3
ffffffff81305297:       cc                      int3
ffffffff81305298:       cc                      int3
ffffffff81305299:       cc                      int3
ffffffff8130529a:       cc                      int3
ffffffff8130529b:       cc                      int3
ffffffff8130529c:       cc                      int3
ffffffff8130529d:       cc                      int3
ffffffff8130529e:       cc                      int3
ffffffff8130529f:       cc                      int3

Such padding didn't exist in this function in generated images for older 
kernels. Nevertheless, such padding is not unusual and it's pretty common.

Optimizer logic fails here:

try_to_optimize_kprobe() -> alloc_aggr_kprobe() -> __prepare_optimized_kprobe()
-> arch_prepare_optimized_kprobe() -> can_optimize():

    /* Decode instructions */
    addr = paddr - offset;
    while (addr < paddr - offset + size) { /* Decode until function end */
        unsigned long recovered_insn;
        if (search_exception_tables(addr))
            /*
             * Since some fixup code will jumps into this function,
             * we can't optimize kprobe in this function.
             */
            return 0;
        recovered_insn = recover_probed_instruction(buf, addr);
        if (!recovered_insn)
            return 0;
        kernel_insn_init(&insn, (void *)recovered_insn, MAX_INSN_SIZE);
        insn_get_length(&insn);
        /* Another subsystem puts a breakpoint */
        if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
            return 0;
        /* Recover address */
        insn.kaddr = (void *)addr;
        insn.next_byte = (void *)(addr + insn.length);
        /* Check any instructions don't jump into target */
        if (insn_is_indirect_jump(&insn) ||
            insn_jump_into_range(&insn, paddr + INT3_INSN_SIZE,
                     DISP32_SIZE))
            return 0;
        addr += insn.length;
    }

One of the check tries to protect from the situation when another subsystem 
puts a breakpoint there as well:

        /* Another subsystem puts a breakpoint */
        if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
            return 0;

However, that's not the case here. INT3_INSN_OPCODE is placed at the end of 
the function as padding (and protect from NOP-padding problems).

I wonder, if optimizer should take this special case into account? Is it worth 
to still optimize such functions when they are padded with INT3?

> If it is OK, we should backport those to stable tree.

Agreed.

Thanks,
Adam

> Thank you,
> 
> On Wed, 9 Dec 2020 22:50:01 +0100
> Adam Zabrocki <p...@pi3.com.pl> wrote:
> 
> > Hello,
> > 
> > Starting from kernel 5.8 all non-optimized kretprobes don't work. Until 5.8,
> > when #DB exception was raised, entry to the NMI was not fully performed. 
> > Among
> > others, the following logic was executed:
> > https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L589
> > 
> >     if (!user_mode(regs)) {
> >         rcu_nmi_enter();
> >         preempt_disable();
> >     }
> > 
> > In some older kernels function ist_enter() was called instead. Inside this
> > function we can see the following logic:
> > https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L91
> > 
> >     if (user_mode(regs)) {
> >         RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> >     } else {
> >         /*
> >          * We might have interrupted pretty much anything.  In
> >          * fact, if we're a machine check, we can even interrupt
> >          * NMI processing.  We don't want in_nmi() to return true,
> >          * but we need to notify RCU.
> >          */
> >         rcu_nmi_enter();
> >     }
> > 
> >     preempt_disable();
> > 
> > As the comment says "We don't want in_nmi() to return true, but we need to
> > notify RCU.". However, since kernel 5.8 the logic of how interrupts are 
> > handled
> > was modified and currently we have this (function "exc_int3"):
> > https://elixir.bootlin.com/linux/v5.8/source/arch/x86/kernel/traps.c#L630
> > 
> >     /*
> >      * idtentry_enter_user() uses static_branch_{,un}likely() and therefore
> >      * can trigger INT3, hence poke_int3_handler() must be done
> >      * before. If the entry came from kernel mode, then use nmi_enter()
> >      * because the INT3 could have been hit in any context including
> >      * NMI.
> >      */
> >     if (user_mode(regs)) {
> >         idtentry_enter_user(regs);
> >         instrumentation_begin();
> >         do_int3_user(regs);
> >         instrumentation_end();
> >         idtentry_exit_user(regs);
> >     } else {
> >         nmi_enter();
> >         instrumentation_begin();
> >         trace_hardirqs_off_finish();
> >         if (!do_int3(regs))
> >             die("int3", regs, 0);
> >         if (regs->flags & X86_EFLAGS_IF)
> >             trace_hardirqs_on_prepare();
> >         instrumentation_end();
> >         nmi_exit();
> >     }
> > 
> > The root of unlucky change comes from this commit:
> > 
> > https://github.com/torvalds/linux/commit/0d00449c7a28a1514595630735df383dec606812#diff-51ce909c2f65ed9cc668bc36cc3c18528541d8a10e84287874cd37a5918abae5
> > 
> > which later was modified by this commit:
> > 
> > https://github.com/torvalds/linux/commit/8edd7e37aed8b9df938a63f0b0259c70569ce3d2
> > 
> > and this is what we currently have in all kernels since 5.8. Essentially,
> > KRETPROBES are not working since these commits. We have the following logic:
> > 
> > asm_exc_int3() -> exc_int3():
> >                     |
> >     ----------------|
> >     |
> >     v
> > ...
> > nmi_enter();
> > ...
> > if (!do_int3(regs))
> >        |
> >   -----|
> >   |
> >   v
> > do_int3() -> kprobe_int3_handler():
> >                     |
> >     ----------------|
> >     |
> >     v
> > ...
> > if (!p->pre_handler || !p->pre_handler(p, regs))
> >                              |
> >     -------------------------|
> >     |
> >     v
> > ...
> > pre_handler_kretprobe():
> > ...
> >     if (unlikely(in_nmi())) {
> >         rp->nmissed++;
> >         return 0;
> >     }
> > 
> > Essentially, exc_int3() calls nmi_enter(), and pre_handler_kretprobe() 
> > before
> > invoking any registered kprobe verifies if it is not in NMI via in_nmi() 
> > call.
> > 
> > This bug was masked by OPTIMIZER which was turning most of the KPROBES to be
> > FTRACE so essentially if someone registered KRETPROBE, buggy code was not
> > invoked (FTRACE code was executed instead). However, the optimizer was 
> > changed
> > and can't optimize as many functions anymore (probably another bug which 
> > might
> > be worth to investigate) and this bug is more visible.
> > 
> > Thanks,
> > Adam
> > 
> > -- 
> > pi3 (pi3ki31ny) - pi3 (at) itsec pl
> > http://pi3.com.pl
> > 
> 
> 
> -- 
> Masami Hiramatsu <mhira...@kernel.org>

-- 
pi3 (pi3ki31ny) - pi3 (at) itsec pl
http://pi3.com.pl

Reply via email to