On Tue, May 26, 2026 at 1:59 PM Jiri Olsa <[email protected]> wrote:
>
> Andrii reported an issue with optimized uprobes [1]: the call
> instruction stores its return address on the stack and can clobber
> the redzone area, where user code may keep temporary data without
> adjusting rsp.
>
> Fix this by placing optimized uprobes on top of a 10-byte nop
> instruction, so we can squeeze in another instruction to escape the
> redzone area before doing the call, like:
>
>   lea -0x80(%rsp), %rsp
>   call tramp
>
> Note the lea instruction is used to adjust the rsp register without
> changing the flags.
>
> We use nop10 and the following transformation to the optimized
> instructions above and back, as suggested by Peterz [2].
>
> Optimize path (int3_update_optimize):
>
>   1) Initial state after set_swbp() installed the uprobe:
>       cc 2e 0f 1f 84 00 00 00 00 00
>
>      From offset 0 this is INT3 followed by the tail of the original
>      10-byte NOP.
>
>      After a previous unoptimization bytes 5..9 may still contain the
>      old call instruction, which remains valid for threads already there.
>
>   2) Rewrite the LEA tail and call displacement:
>       cc [8d 64 24 80 e8 d0 d1 d2 d3]
>
>      From offset 0 this traps on the uprobe INT3.  Bytes 1..9 are not
>      executable entry points while byte 0 is trapped.
>
>   3) Publish the first LEA byte:
>       [48] 8d 64 24 80 e8 d0 d1 d2 d3
>
>      From offset 0 this is:
>         lea -0x80(%rsp), %rsp
>         call <uprobe-trampoline>
>
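
The three optimize states above can be reproduced in user space. This is
only a minimal sketch for following the byte layout, not the kernel code:
the helper names (opt_step1/2/3) and the constants are made up for
illustration, and it assumes the CALL rel32 is taken relative to the end
of the 10-byte sequence (vaddr + 10), as for any x86 near call.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NOP10_SIZE 10   /* size of the probed nop10 (hypothetical name) */
#define LEA_SIZE    5   /* lea -0x80(%rsp), %rsp */
#define INT3     0xcc
#define CALL_OP  0xe8

static const uint8_t nop10_bytes[NOP10_SIZE] = {
	0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00
};
static const uint8_t lea_bytes[LEA_SIZE] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };

/* Step 1: state after the breakpoint install: INT3 + tail of the nop10. */
static void opt_step1(uint8_t *insn)
{
	memcpy(insn, nop10_bytes, NOP10_SIZE);
	insn[0] = INT3;
}

/* Step 2: write LEA tail, CALL opcode and rel32 while byte 0 still traps. */
static void opt_step2(uint8_t *insn, uint64_t vaddr, uint64_t tramp)
{
	/* Displacement is relative to the end of the CALL (vaddr + 10). */
	int64_t disp = (int64_t)(tramp - (vaddr + NOP10_SIZE));
	uint32_t rel;
	int i;

	/* The trampoline must be reachable with a 32-bit displacement. */
	assert(disp >= INT32_MIN && disp <= INT32_MAX);
	rel = (uint32_t)disp;

	memcpy(insn + 1, lea_bytes + 1, LEA_SIZE - 1);
	insn[LEA_SIZE] = CALL_OP;
	for (i = 0; i < 4; i++)		/* little-endian rel32 */
		insn[LEA_SIZE + 1 + i] = (uint8_t)(rel >> (8 * i));
}

/* Step 3: publish the first LEA byte, replacing the INT3. */
static void opt_step3(uint8_t *insn)
{
	insn[0] = lea_bytes[0];
}
```

Calling the three helpers in order on a 10-byte buffer walks it through
states 1), 2) and 3); only byte 0 changes in the last step.
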
> Unoptimize path (int3_update_unoptimize):
>
>   1) Initial optimized state:
>       48 8d 64 24 80 e8 d0 d1 d2 d3
>      Same as 3) above.
>
>   2) Trap new entries before restoring the NOP bytes:
>       [cc] 8d 64 24 80 e8 d0 d1 d2 d3
>
>      From offset 0 this traps. A thread that had already executed the
>      LEA can still reach the intact CALL at offset 5.
>
>   3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
>      and byte 5 as CALL.
>       cc [2e 0f 1f 84] e8 d0 d1 d2 d3
>
>      From offset 0 this still traps. Offset 5 is still the CALL for any
>      thread that was already past the first LEA byte.
>
>   4) Publish the first byte of the original NOP:
>       [66] 2e 0f 1f 84 e8 d0 d1 d2 d3
>
>      From offset 0 this is the restored 10-byte NOP; the CALL opcode and
>      displacement are now only NOP operands.  Offset 5 still decodes as
>      CALL for a thread that was already there.
>
>      There is only a single target uprobe-trampoline for the given nop10
>      instruction address, so the CALL instruction will not be changed across
>      unoptimization/optimization cycles.
>      Therefore, any task that is preempted at the CALL instruction is
>      guaranteed to observe that CALL and not anything else.
>
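
The unoptimize steps can be sketched the same way. Again this is only a
user-space illustration with made-up helper names, not the kernel
implementation; it replays steps 2) to 4) on a byte buffer and asserts
the property described above, namely that bytes 5..9 (the CALL and its
displacement) are never modified.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NOP10_SIZE 10   /* size of the probed nop10 (hypothetical name) */
#define LEA_SIZE    5   /* bytes 0..4: LEA when optimized, NOP head otherwise */
#define CALL_SIZE   5   /* bytes 5..9: e8 + rel32 */
#define INT3     0xcc

/* First five bytes of the original nop10. */
static const uint8_t nop10_head[LEA_SIZE] = { 0x66, 0x2e, 0x0f, 0x1f, 0x84 };

/* Step 2: trap new entries at offset 0. */
static void unopt_step2(uint8_t *insn)
{
	insn[0] = INT3;
}

/* Step 3: restore bytes 1..4 of the NOP, byte 0 still traps. */
static void unopt_step3(uint8_t *insn)
{
	memcpy(insn + 1, nop10_head + 1, LEA_SIZE - 1);
}

/* Step 4: publish the first byte of the original NOP. */
static void unopt_step4(uint8_t *insn)
{
	insn[0] = nop10_head[0];
}

/* Run steps 2..4 and check that the CALL at offset 5 stays intact. */
static void unoptimize(uint8_t *insn)
{
	uint8_t call[CALL_SIZE];

	memcpy(call, insn + LEA_SIZE, CALL_SIZE);

	unopt_step2(insn);
	assert(memcmp(call, insn + LEA_SIZE, CALL_SIZE) == 0);
	unopt_step3(insn);
	assert(memcmp(call, insn + LEA_SIZE, CALL_SIZE) == 0);
	unopt_step4(insn);
	assert(memcmp(call, insn + LEA_SIZE, CALL_SIZE) == 0);
}
```

Starting from the optimized bytes in 1), unoptimize() ends with the
restored NOP prefix bytes followed by the untouched CALL bytes, matching
state 4).
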
> Note as explained in [2] we need to use following nop10:
>        PF1   PF2   ESC   NOPL  MOD   SIB   DISP32
> NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
>
> which means we need to allow the 0x2e prefix, which maps to the
> INAT_PFX_CS attribute, in the is_prefix_bad function.
>
> Also change the error the uprobe syscall returns when it is called
> from outside the uprobe trampoline to -EPROTO, so we are able to
> detect the fixed kernel.
>
> The optimized uprobe performance stays the same:
>
>         uprobe-nop     :    3.129 ± 0.013M/s
>         uprobe-push    :    3.045 ± 0.006M/s
>         uprobe-ret     :    1.095 ± 0.004M/s
>   -->   uprobe-nop10   :    7.170 ± 0.020M/s
>         uretprobe-nop  :    2.143 ± 0.021M/s
>         uretprobe-push :    2.090 ± 0.000M/s
>         uretprobe-ret  :    0.942 ± 0.000M/s
>   -->   uretprobe-nop10:    3.381 ± 0.003M/s
>         usdt-nop       :    3.245 ± 0.004M/s
>   -->   usdt-nop10     :    7.256 ± 0.023M/s
>
> [1] https://lore.kernel.org/bpf/[email protected]/
> [2] https://lore.kernel.org/bpf/[email protected]/#t
> Reported-by: Andrii Nakryiko <[email protected]>
> Closes: https://lore.kernel.org/bpf/[email protected]/
> Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> Assisted-by: Codex:GPT-5.5
> Signed-off-by: Jiri Olsa <[email protected]>
> ---
>  arch/x86/kernel/uprobes.c | 255 ++++++++++++++++++++++++++++----------
>  1 file changed, 190 insertions(+), 65 deletions(-)
>

[...]

> @@ -943,13 +1026,31 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
>         smp_text_poke_sync_each_cpu();
>
>         /*
> -        * Write first byte.
> +        * 3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
> +        *    and byte 5 as CALL:
> +        *    cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> +        */
> +       ctx.expect = EXPECT_SWBP_OPTIMIZED;
> +       err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1,
> +                          LEA_INSN_SIZE - 1, verify_insn,
> +                          true /* is_register */, false /* do_update_ref_ctr */,

tbh, it's quite subtle and non-obvious why is_register should be set
to true the first two times (and especially that is_register and
do_update_ref_ctr are implicitly connected). I'm not sure how to make
it cleaner, but maybe leave a short comment explaining this
register-twice, unregister-once sequence?

> +                          &ctx);
> +       if (err)
> +               return err;
> +
> +       smp_text_poke_sync_each_cpu();

[...]
