On Wed, Jun 10, 2026 at 1:18 AM Jiri Olsa <[email protected]> wrote:
>
> On Tue, Jun 09, 2026 at 09:43:15AM -0700, Andrii Nakryiko wrote:
> > On Tue, Jun 9, 2026 at 4:44 AM Jiri Olsa <[email protected]> wrote:
> > >
> > > On Mon, Jun 08, 2026 at 01:46:39PM -0700, Andrii Nakryiko wrote:
> > > > On Tue, May 26, 2026 at 1:59 PM Jiri Olsa <[email protected]> wrote:
> > > > >
> > > > > Andrii reported an issue with optimized uprobes [1] that can clobber
> > > > > redzone area with call instruction storing return address on stack
> > > > > where user code may keep temporary data without adjusting rsp.
> > > > >
> > > > > Fixing this by moving the optimized uprobes on top of 10-bytes nop
> > > > > instruction, so we can squeeze another instruction to escape the
> > > > > redzone area before doing the call, like:
> > > > >
> > > > >   lea -0x80(%rsp), %rsp
> > > > >   call tramp
> > > > >
> > > > > Note the lea instruction is used to adjust the rsp register without
> > > > > changing the flags.
> > > > >
> > > > > We use nop10 and following transformation to optimized instructions
> > > > > above and back as suggested by Peterz [2].
> > > > >
> > > > > Optimize path (int3_update_optimize):
> > > > >
> > > > >   1) Initial state after set_swbp() installed the uprobe:
> > > > >       cc 2e 0f 1f 84 00 00 00 00 00
> > > > >
> > > > >      From offset 0 this is INT3 followed by the tail of the original
> > > > >      10-byte NOP.
> > > > >
> > > > >      After a previous unoptimization bytes 5..9 may still contain the
> > > > >      old call instruction, which remains valid for threads already
> > > > >      there.
> > > > >
> > > > >   2) Rewrite the LEA tail and call displacement:
> > > > >       cc [8d 64 24 80 e8 d0 d1 d2 d3]
> > > > >
> > > > >      From offset 0 this traps on the uprobe INT3. Bytes 1..9 are not
> > > > >      executable entry points while byte 0 is trapped.
> > > > >
> > > > >   3) Publish the first LEA byte:
> > > > >       [48] 8d 64 24 80 e8 d0 d1 d2 d3
> > > > >
> > > > >      From offset 0 this is:
> > > > >        lea -0x80(%rsp), %rsp
> > > > >        call <uprobe-trampoline>
> > > > >
> > > > > Unoptimize path (int3_update_unoptimize):
> > > > >
> > > > >   1) Initial optimized state:
> > > > >       48 8d 64 24 80 e8 d0 d1 d2 d3
> > > > >      Same as 3) above.
> > > > >
> > > > >   2) Trap new entries before restoring the NOP bytes:
> > > > >       [cc] 8d 64 24 80 e8 d0 d1 d2 d3
> > > > >
> > > > >      From offset 0 this traps. A thread that had already executed the
> > > > >      LEA can still reach the intact CALL at offset 5.
> > > > >
> > > > >   3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
> > > > >      and byte 5 as CALL.
> > > > >       cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> > > > >
> > > > >      From offset 0 this still traps. Offset 5 is still the CALL for any
> > > > >      thread that was already past the first LEA byte.
> > > > >
> > > > >   4) Publish the first byte of the original NOP:
> > > > >       [66] 2e 0f 1f 84 e8 d0 d1 d2 d3
> > > > >
> > > > >      From offset 0 this is the restored 10-byte NOP; the CALL opcode and
> > > > >      displacement are now only NOP operands. Offset 5 still decodes as
> > > > >      CALL for a thread that was already there.
> > > > >
> > > > > There is only a single target uprobe-trampoline for the given nop10
> > > > > instruction address, so the CALL instruction will not be changed across
> > > > > unoptimization/optimization cycles.
> > > > > Therefore, any task that is preempted at the CALL instruction is guaranteed
> > > > > to observe that CALL and not anything else.
> > > > >
> > > > > Note as explained in [2] we need to use following nop10:
> > > > >          PF1   PF2   ESC   NOPL  MOD   SIB   DISP32
> > > > >   NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
> > > > >
> > > > > which means we need to allow 0x2e prefix which maps to INAT_PFX_CS
> > > > > attribute in is_prefix_bad function.
> > > > >
> > > > > Also changing the uprobe syscall error when called out of uprobe
> > > > > trampoline to -EPROTO, so we are able to detect the fixed kernel.
> > > > >
> > > > > The optimized uprobe performance stays the same:
> > > > >
> > > > >         uprobe-nop      :    3.129 ± 0.013M/s
> > > > >         uprobe-push     :    3.045 ± 0.006M/s
> > > > >         uprobe-ret      :    1.095 ± 0.004M/s
> > > > >     --> uprobe-nop10    :    7.170 ± 0.020M/s
> > > > >         uretprobe-nop   :    2.143 ± 0.021M/s
> > > > >         uretprobe-push  :    2.090 ± 0.000M/s
> > > > >         uretprobe-ret   :    0.942 ± 0.000M/s
> > > > >     --> uretprobe-nop10 :    3.381 ± 0.003M/s
> > > > >         usdt-nop        :    3.245 ± 0.004M/s
> > > > >     --> usdt-nop10      :    7.256 ± 0.023M/s
> > > > >
> > > > > [1] https://lore.kernel.org/bpf/[email protected]/
> > > > > [2] https://lore.kernel.org/bpf/[email protected]/#t
> > > > > Reported-by: Andrii Nakryiko <[email protected]>
> > > > > Closes: https://lore.kernel.org/bpf/[email protected]/
> > > > > Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> > > > > Assisted-by: Codex:GPT-5.5
> > > > > Signed-off-by: Jiri Olsa <[email protected]>
> > > > > ---
> > > > >  arch/x86/kernel/uprobes.c | 255 ++++++++++++++++++++++++++++----------
> > > > >  1 file changed, 190 insertions(+), 65 deletions(-)
> > > > >
> > > >
> > > > [...]
> > > >
> > > > > @@ -943,13 +1026,31 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > > > >  	smp_text_poke_sync_each_cpu();
> > > > >
> > > > >  	/*
> > > > > -	 * Write first byte.
> > > > > +	 * 3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
> > > > > +	 *    and byte 5 as CALL:
> > > > > +	 *      cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> > > > > +	 */
> > > > > +	ctx.expect = EXPECT_SWBP_OPTIMIZED;
> > > > > +	err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1,
> > > > > +			   LEA_INSN_SIZE - 1, verify_insn,
> > > > > +			   true /* is_register */, false /* do_update_ref_ctr */,
> > > >
> > > > tbh, it's quite subtle and non-obvious why is_register should be set
> > > > to true first two times (and especially that is_register and
> > > > do_update_ref_ctr are implicitly connected), not sure how to make it
> > > > cleaner, but maybe leave a short comment explaining this twice
> > > > register, once unregister sequence?
> > > >
> > > ok, I came up with comment below
> > >
> > > thanks,
> > > jirka
> > >
> > >
> > > ---
> > > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > > index de544516ea70..92449f34c005 100644
> > > --- a/arch/x86/kernel/uprobes.c
> > > +++ b/arch/x86/kernel/uprobes.c
> > > @@ -1011,6 +1011,12 @@ static int int3_update_unoptimize(struct arch_uprobe *auprobe, struct vm_area_st
> > >  	int err;
> > >
> > >  	/*
> > > +	 * Note the first two uprobe_write calls use is_register=true, because they
> > > +	 * are intermediate patching states while the probe is still active.
> >
> > this doesn't really explain why is_register=true is the right one. It
> > actually doesn't matter as long as do_update_ref_ctr=true, isn't that
> > right? So maybe just to avoid a bit of confusion let's pass
> > is_register=false and do_update_ref_ctr=false, and in the comment
> > explain as you said that it's intermediate update and we don't want to
> > update refctr just yet until the very last step?
>
> apart from refctr update there's also different way the concerned
> page is managed, IIUC:
>
> with is_register=true we force to get exclusive anonymous page for
> the update (or pin the existing one)
>
> with is_register=false we try to zap the private anonymous page and
> return the mapping to the original page
>
> there are several comments on this in uprobe_write/__uprobe_write
>
> how about the update below
>
> jirka
>
>
> ---
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index de544516ea70..09f5ff71227c 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -1011,6 +1011,16 @@ static int int3_update_unoptimize(struct arch_uprobe *auprobe, struct vm_area_st
>  	int err;
>
>  	/*
> +	 * Note the first two uprobe_write calls use is_register=true, because they
> +	 * are intermediate patching states while the probe is still active, so
> +	 * we force the exclusive anonymous page for the update.
> +	 * Also we use do_update_ref_ctr=false because refctr was already updated by
> +	 * the initial int3 install.
> +	 *
> +	 * The last uprobe_write to nop10 instruction is called with is_register=false
> +	 * and do_update_ref_ctr=true to trigger the refctr update and to instruct
> +	 * uprobe_write to zap the anonymous page if it now matches the file page.
> +	 *

lgtm!

>  	 * 1) Initial optimized state:
>  	 *      48 8d 64 24 80 e8 d0 d1 d2 d3
>  	 *
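
For reference, the byte states listed in the quoted commit message can be modeled in plain user space. The sketch below is only an illustration under stated assumptions, not the kernel implementation: poke(), entry_is_safe(), call_intact() and the simulate_*() helpers are hypothetical names, the CALL displacement d0 d1 d2 d3 is the placeholder used in the commit message, and the cross-CPU synchronization done between the real writes is not modeled at all. It only checks that after every partial write a thread entering at offset 0 sees INT3, a 10-byte NOP, or the complete LEA+CALL pair, and that the CALL at offset 5 stays intact during unoptimization.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define INSN_SIZE	10	/* size of the nop10 slot */
#define LEA_INSN_SIZE	5	/* lea -0x80(%rsp),%rsp = 48 8d 64 24 80 */

static const uint8_t INT3_BYTE = 0xcc;

/* cs nopw 0x0(%rax,%rax,1): the nop10 form named in the commit message */
static const uint8_t nop10[INSN_SIZE] = {
	0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00
};

/* lea + call; d0..d3 is a placeholder displacement, not a real target */
static const uint8_t lea_call[INSN_SIZE] = {
	0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0xd0, 0xd1, 0xd2, 0xd3
};

/* Hypothetical stand-in for one partial text write. */
static void poke(uint8_t *text, size_t off, const uint8_t *src, size_t len)
{
	assert(off + len <= INSN_SIZE);
	memcpy(text + off, src, len);
}

/* What may a thread entering at offset 0 observe? */
static int entry_is_safe(const uint8_t *text)
{
	if (text[0] == INT3_BYTE)
		return 1;	/* traps, bytes 1..9 are not entry points */
	if (memcmp(text, nop10, LEA_INSN_SIZE) == 0)
		return 1;	/* prefixes+opcode+ModRM of nop10; the rest are operands */
	return memcmp(text, lea_call, INSN_SIZE) == 0;
}

/* A thread already past the LEA must still find the CALL at offset 5. */
static int call_intact(const uint8_t *text)
{
	return memcmp(text + LEA_INSN_SIZE, lea_call + LEA_INSN_SIZE,
		      INSN_SIZE - LEA_INSN_SIZE) == 0;
}

/* Optimize path, steps 1..3; returns 1 if every state was safe. */
static int simulate_optimize(uint8_t *text)
{
	int ok = 1;

	poke(text, 0, &INT3_BYTE, 1);			/* 1) uprobe INT3 installed */
	ok &= entry_is_safe(text);
	poke(text, 1, lea_call + 1, INSN_SIZE - 1);	/* 2) LEA tail + CALL */
	ok &= entry_is_safe(text);
	poke(text, 0, lea_call, 1);			/* 3) publish first LEA byte */
	ok &= entry_is_safe(text);
	return ok;
}

/* Unoptimize path, steps 2..4 (step 1 is the optimized state itself). */
static int simulate_unoptimize(uint8_t *text)
{
	int ok = 1;

	poke(text, 0, &INT3_BYTE, 1);			/* 2) trap new entries */
	ok &= entry_is_safe(text) && call_intact(text);
	poke(text, 1, nop10 + 1, LEA_INSN_SIZE - 1);	/* 3) restore bytes 1..4 */
	ok &= entry_is_safe(text) && call_intact(text);
	poke(text, 0, nop10, 1);			/* 4) publish first NOP byte */
	ok &= entry_is_safe(text) && call_intact(text);
	return ok;
}
```

Starting from a buffer holding nop10, simulate_optimize() followed by simulate_unoptimize() walks through the same states as the commit message and leaves 66 2e 0f 1f 84 e8 d0 d1 d2 d3 behind, i.e. the restored NOP with the old CALL bytes as its operands.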
