On Mon, May 18, 2026 at 3:43 AM Peter Zijlstra <[email protected]> wrote: > > > You seem to have forgotten to Cc LKML and x86 :-( > > On Thu, May 14, 2026 at 03:53:36PM +0200, Jiri Olsa wrote: > > > @@ -1017,17 +1030,32 @@ static int int3_update(struct arch_uprobe *auprobe, > > struct vm_area_struct *vma, > > static int swbp_optimize(struct arch_uprobe *auprobe, struct > > vm_area_struct *vma, > > unsigned long vaddr, unsigned long tramp) > > { > > - u8 call[5]; > > + u8 insn[OPT_INSN_SIZE], *call = &insn[LEA_INSN_SIZE]; > > > > - __text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr, > > + /* > > + * We have nop10 instruction (with first byte overwritten to int3), > > + * changing it to: > > + * lea -0x80(%rsp), %rsp > > + * call tramp > > + */ > > + memcpy(insn, lea_rsp, LEA_INSN_SIZE); > > + __text_gen_insn(call, CALL_INSN_OPCODE, > > + (const void *) (vaddr + LEA_INSN_SIZE), > > (const void *) tramp, CALL_INSN_SIZE); > > - return int3_update(auprobe, vma, vaddr, call, true /* optimize */); > > + return int3_update(auprobe, vma, vaddr, insn, OPT_INSN_SIZE, true /* > > optimize */); > > } > > > > static int swbp_unoptimize(struct arch_uprobe *auprobe, struct > > vm_area_struct *vma, > > unsigned long vaddr) > > { > > - return int3_update(auprobe, vma, vaddr, auprobe->insn, false /* > > optimize */); > > + /* > > + * We have optimized nop10 (lea, call), changing it to 'jmp rel8' to > > + * end of the 10-byte slot instead of restoring the original nop10, > > + * because we could have thread already inside lea instruction. > > Inaccurate, RIP could be on CALL, not inside LEA. Writing NOP10 would > make it inside NOP10 though, and that would cause havoc IF you use the > normal NOP10. > > Thing is, the encoding of NOP{8,9,10} would actually allow you to > preserve the CALL instruction :-) > > That is, observe: > > PF1 PF2 ESC NOPL MOD SIB DISP32 > > NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw > 0x00000000(%rax,%rax,1) > NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0xe8, 0x78, 0x56, 0x34, 0x12 -- cs nopw > 0x12345678(%rax,%rbp,8) > > Specifically the CALL opcode sits in the SIB byte and decodes like: > > e8 := 11 101 000 > > scale = 11 (2^3 = 8) > index = 101 BP > base = 000 AX > > And the displacement is just that, a displacement. > > So you *could* in fact, write back _A_ NOP10, just not the standard > NOP10. > > > + */ > > + u8 jmp[OPT_INSN_SIZE] = { JMP8_INSN_OPCODE, OPT_JMP8_OFFSET }; > > + > > + return int3_update(auprobe, vma, vaddr, jmp, JMP8_INSN_SIZE, false /* > > optimize */); > > } > > Changelog wants significant update to explain this scheme. > > So we have: > > NOP10 -+-> LEA -0x80(%rsp), %rsp, CALL foo -> JMP.d8 +8 > | | > `------------------------------------------' > > And you want to belabour the point of how you ensure re-writing the CALL > instruction isn't a problem (because I'm not convinced). > > Note that the above results in: > > initial: > 0: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw > 0x00000000(%rax,%rax,1) > > optimize-int3: > 1: 0xcc, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- int3 > optimize-tail: > 2: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call > 0x78563412 > optimize-finish: > 3: 0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- lea > -0x80(%rsp),%rsp; call 0x78563412 > > unoptimize-int3: > 4: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call > 0x78563412 > unoptimize-tail: > 5: 0xcc, 0x08, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call > 0x78563412 > unoptimize-finish: > 6: 0xeb, 0x08, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- jmp.d8 +8; > call 0x78563412 > > optimize-int3: > 7: 0xcc, 0x08, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call > 0x78563412 > optimize-tail: > 8: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x78, 0x56, 0x34, 0x12 -- int3; call > 0x12345678 > optimize-finish: > 9: 0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x78, 0x56, 0x34, 0x12 -- int3; call > 0x12345678 > > Note that from step 7 to step 8, you re-write the CALL instruction > without going through INT3. This means it is entirely possible for a > concurrent execution to observe a composite instruction. > > This is NOT sound!
We shouldn't need to change call instruction ever, uprobe trampoline is permanent within the given process and its address won't change. > > However, I think it can be salvaged, if instead of only writing INT3 at > +0, you also write INT3 at +5. The sequence then becomes: > > initial: > 0: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw > 0x00000000(%rax,%rax,1) > > optimize-int3: > 1: 0xcc, 0x2e, 0x0f, 0x1f, 0x84, 0xcc, 0x00, 0x00, 0x00, 0x00 -- int3; int3 > optimize-tail(s): > 2: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xcc, 0x12, 0x34, 0x56, 0x78 -- int3; int3 > optimize-finish-1: > 3: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call > 0x78563412 > optimize-finish-2: > 3: 0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- lea > -0x80(%rsp),%rsp; call 0x78563412 > > unoptimize-int3: > 4: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call > 0x78563412 > unoptimize-tail: > 5: 0xcc, 0x2e, 0x0f, 0x1f, 0x84, 0xe8, 0x12, 0x34, 0x56, 0x78 -- int3; call > 0x78563412 > unoptimize-finish: > 6: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0xe8, 0x12, 0x34, 0x56, 0x78 -- cs nopw > 0x78563412(%rax,%rbp,8); call 0x78563412 > > optimize-int3: > 7: 0xcc, 0x2e, 0x0f, 0x1f, 0x84, 0xcc, 0x12, 0x34, 0x56, 0x78 -- int3; int3 > optimize-tail(s): > 8: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xcc, 0x78, 0x56, 0x34, 0x12 -- int3; int3 > optimize-finish-1: > 9: 0xcc, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x78, 0x56, 0x34, 0x12 -- int3; call > 0x12345678 > optimize-finish-2: > 9: 0x48, 0x8d, 0x64, 0x24, 0x80, 0xe8, 0x78, 0x56, 0x34, 0x12 -- lea > -0x80(%rsp),%rsp; call 0x12345678 > [...]
