On Thu, May 21, 2026 at 5:44 AM Jiri Olsa <[email protected]> wrote:
>
> Andrii reported an issue with optimized uprobes [1] that can clobber
> redzone area with call instruction storing return address on stack
> where user code may keep temporary data without adjusting rsp.
>
> Fixing this by moving the optimized uprobes on top of 10-bytes nop
> instruction, so we can squeeze another instruction to escape the
> redzone area before doing the call, like:
>
>   lea -0x80(%rsp), %rsp
>   call tramp
>
> Note the lea instruction is used to adjust the rsp register without
> changing the flags.
>
> We use nop10 and following transformation to optimized instructions
> above and back as suggested by Peterz [2].
>
> Optimize path (int3_update_optimize):
>
>   1) Initial state after set_swbp() installed the uprobe:
>        cc 2e 0f 1f 84 00 00 00 00 00
>
>      From offset 0 this is INT3 followed by the tail of the original
>      10-byte NOP.
>
>   2) Trap the call slot before rewriting the NOP tail:
>        cc 2e 0f 1f 84 [cc] 00 00 00 00
>
>      From offset 0 this traps on the uprobe INT3. A thread reaching
>      offset 5 traps on the temporary INT3 instead of seeing a partially
>      patched call.
>
>   3) Rewrite the LEA tail and call displacement, keeping both INT3 bytes:
>        cc [8d 64 24 80] cc [d0 d1 d2 d3]
>
>      From offset 0 and offset 5 this still traps. The bytes between
>      them are not executable entry points while both traps are in place.
>
>   4) Restore the call opcode at offset 5:
>        cc 8d 64 24 80 [e8] d0 d1 d2 d3
>
>      From offset 0 this still traps. From offset 5 the instruction is
>      the final CALL to the uprobe trampoline.
>
I'm sorry if I'm slow, but I don't understand why we need that second
cc at offset 5. Isn't the original nop10 processed by the CPU as a
single instruction? So ip will be either at the start of nop10 or at
ip+10, no? If we trap at ip and the int3 handler skips +10 from there
while we are installing lea+call, why do we need cc on byte 5? I.e., I
don't understand how the CPU can end up at ip+5 before we finalize the
lea+call sequence. Can it?

>   5) Publish the first LEA byte:
>        [48] 8d 64 24 80 e8 d0 d1 d2 d3
>
>      From offset 0 this is:
>        lea -0x80(%rsp), %rsp
>        call <uprobe-trampoline>
>
> Unoptimize path (int3_update_unoptimize):
>
>   1) Initial optimized state:
>        48 8d 64 24 80 e8 d0 d1 d2 d3
>      Same as 5) above.
>
>   2) Trap new entries before restoring the NOP bytes:
>        [cc] 8d 64 24 80 e8 d0 d1 d2 d3
>
>      From offset 0 this traps. A thread that had already executed the
>      LEA can still reach the intact CALL at offset 5.
>
>   3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
>      and byte 5 as CALL.
>        cc [2e 0f 1f 84] e8 d0 d1 d2 d3
>
>      From offset 0 this still traps. Offset 5 is still the CALL for any
>      thread that was already past the first LEA byte.
>
>   4) Publish the first byte of the original NOP:
>        [66] 2e 0f 1f 84 e8 d0 d1 d2 d3
>
>      From offset 0 this is the restored 10-byte NOP; the CALL opcode and
>      displacement are now only NOP operands. Offset 5 still decodes as
>      CALL for a thread that was already there.

It's cool that we don't have to do a jmp for the first byte, fancy :)

>
> Note as explained in [2] we need to use following nop10:
>          PF1   PF2   ESC   NOPL  MOD   SIB   DISP32
>   NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
>
> which means we need to allow 0x2e prefix which maps to INAT_PFX_CS
> attribute in is_prefix_bad function.
>
> The optimized uprobe performance stays the same:
>
>         uprobe-nop     :    3.129 ± 0.013M/s
>         uprobe-push    :    3.045 ± 0.006M/s
>         uprobe-ret     :    1.095 ± 0.004M/s
>     --> uprobe-nop10   :    7.170 ± 0.020M/s
>         uretprobe-nop  :    2.143 ± 0.021M/s
>         uretprobe-push :    2.090 ± 0.000M/s
>         uretprobe-ret  :    0.942 ± 0.000M/s
>     --> uretprobe-nop10:    3.381 ± 0.003M/s
>         usdt-nop       :    3.245 ± 0.004M/s
>     --> usdt-nop10     :    7.256 ± 0.023M/s
>
> [1] https://lore.kernel.org/bpf/[email protected]/
> [2] https://lore.kernel.org/bpf/[email protected]/#t
> Reported-by: Andrii Nakryiko <[email protected]>
> Closes: https://lore.kernel.org/bpf/[email protected]/
> Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> Assisted-by: Codex:GPT-5.5
> Signed-off-by: Jiri Olsa <[email protected]>
> ---
>  arch/x86/kernel/uprobes.c | 281 +++++++++++++++++++++++++++++---------
>  1 file changed, 217 insertions(+), 64 deletions(-)
>

[...]
