On Mon, Apr 21, 2025 at 2:46 PM Jiri Olsa <[email protected]> wrote:
>
> Putting together all the previously added pieces to support optimized
> uprobes on top of the 5-byte nop instruction.
>
> The current uprobe execution goes through the following steps:
>
>   - installs breakpoint instruction over original instruction
>   - exception handler hit and calls related uprobe consumers
>   - and either simulates original instruction or does out of line single step
>     execution of it
>   - returns to user space
>
> The optimized uprobe path does the following:
>
>   - checks the original instruction is a 5-byte nop (plus other checks)
>   - adds (or uses existing) user space trampoline with uprobe syscall
>   - overwrites original instruction (5-byte nop) with call to user space
>     trampoline
>   - the user space trampoline executes uprobe syscall that calls related
>     uprobe consumers
>   - trampoline returns back to next instruction
>
> This approach won't speed up all uprobes as it's limited to using nop5 as
> the original instruction, but we plan to use nop5 as the USDT probe
> instruction (which currently uses a single byte nop) and speed up the USDT
> probes.
>
> The arch_uprobe_optimize function triggers the uprobe optimization and is
> called after the first uprobe hit. I originally had it called on uprobe
> installation, but then it clashed with the elf loader, because the user
> space trampoline was added in a place where the loader might need to put
> elf segments, so I decided to do it after the first uprobe hit, when
> loading is done.
>
> The uprobe is un-optimized in the arch specific set_orig_insn call.
>
> The instruction overwrite is x86 arch specific and needs to go through 3
> updates (on top of the nop5 instruction):
>
>   - write int3 into the 1st byte
>   - write the last 4 bytes of the call instruction
>   - update the call instruction opcode
>
> And cleanup goes through similar reverse stages:
>
>   - overwrite the call opcode with breakpoint (int3)
>   - write the last 4 bytes of the nop5 instruction
>   - write the nop5 first instruction byte
>
> We do not unmap and release the uprobe trampoline when it's no longer
> needed, because there's no easy way to make sure none of the threads is
> still inside the trampoline. But we do not waste memory, because there's
> just a single page for all the uprobe trampoline mappings.
>
> We do waste a frame on the page mapping for every 4GB by keeping the
> uprobe trampoline page mapped, but that seems ok.
>
> We benefit from the fact that set_swbp and set_orig_insn are
> called under mmap_write_lock(mm), so we can use the current instruction
> as the state the uprobe is in - nop5/breakpoint/call trampoline -
> and decide the needed action (optimize/un-optimize) based on that.
>
> Attaching the speed up from the benchs/run_bench_uprobes.sh script:
>
> current:
>     usermode-count :  152.604 ± 0.044M/s
>     syscall-count  :   13.359 ± 0.042M/s
> --> uprobe-nop     :    3.229 ± 0.002M/s
>     uprobe-push    :    3.086 ± 0.004M/s
>     uprobe-ret     :    1.114 ± 0.004M/s
>     uprobe-nop5    :    1.121 ± 0.005M/s
>     uretprobe-nop  :    2.145 ± 0.002M/s
>     uretprobe-push :    2.070 ± 0.001M/s
>     uretprobe-ret  :    0.931 ± 0.001M/s
>     uretprobe-nop5 :    0.957 ± 0.001M/s
>
> after the change:
>     usermode-count :  152.448 ± 0.244M/s
>     syscall-count  :   14.321 ± 0.059M/s
>     uprobe-nop     :    3.148 ± 0.007M/s
>     uprobe-push    :    2.976 ± 0.004M/s
>     uprobe-ret     :    1.068 ± 0.003M/s
> --> uprobe-nop5    :    7.038 ± 0.007M/s
>     uretprobe-nop  :    2.109 ± 0.004M/s
>     uretprobe-push :    2.035 ± 0.001M/s
>     uretprobe-ret  :    0.908 ± 0.001M/s
>     uretprobe-nop5 :    3.377 ± 0.009M/s
>
> I see a bit more speed up on Intel (above) compared to AMD.
> The big nop5 speed up is partly due to emulating nop5 and partly due to
> the optimization.
>
> The key speed up we do this for is the USDT switch from nop to nop5:
>
>   uprobe-nop     :    3.148 ± 0.007M/s
>   uprobe-nop5    :    7.038 ± 0.007M/s
>
> Signed-off-by: Jiri Olsa <[email protected]>
> ---
>  arch/x86/include/asm/uprobes.h |   7 +
>  arch/x86/kernel/uprobes.c      | 281 ++++++++++++++++++++++++++++++++-
>  include/linux/uprobes.h        |   6 +-
>  kernel/events/uprobes.c        |  15 +-
>  4 files changed, 301 insertions(+), 8 deletions(-)
>
just minor nits, LGTM

Acked-by: Andrii Nakryiko <[email protected]>

> +int set_swbp(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> +	     unsigned long vaddr)
> +{
> +	if (should_optimize(auprobe)) {
> +		bool optimized = false;
> +		int err;
> +
> +		/*
> +		 * We could race with another thread that already optimized the
> +		 * probe, so let's not overwrite it with int3 again in this case.
> +		 */
> +		err = is_optimized(vma->vm_mm, vaddr, &optimized);
> +		if (err || optimized)
> +			return err;

IMO, this is a bit too clever, I'd go with plain

	if (err)
		return err;
	if (optimized)
		return 0; /* we are done */

(and mirror set_orig_insn() structure, consistently)

> +	}
> +	return uprobe_write_opcode(vma, vaddr, UPROBE_SWBP_INSN, true);
> +}
> +
> +int set_orig_insn(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> +		  unsigned long vaddr)
> +{
> +	if (test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags)) {
> +		struct mm_struct *mm = vma->vm_mm;
> +		bool optimized = false;
> +		int err;
> +
> +		err = is_optimized(mm, vaddr, &optimized);
> +		if (err)
> +			return err;
> +		if (optimized)
> +			WARN_ON_ONCE(swbp_unoptimize(auprobe, vma, vaddr));
> +	}
> +	return uprobe_write_opcode(vma, vaddr, *(uprobe_opcode_t *)&auprobe->insn, false);
> +}
> +
> +static int __arch_uprobe_optimize(struct mm_struct *mm, unsigned long vaddr)
> +{
> +	struct uprobe_trampoline *tramp;
> +	struct vm_area_struct *vma;
> +	int err = 0;
> +
> +	vma = find_vma(mm, vaddr);
> +	if (!vma)
> +		return -1;

this is EPERM, will be confusing to debug... why not -EINVAL?

> +	tramp = uprobe_trampoline_get(vaddr);
> +	if (!tramp)
> +		return -1;

ditto

> +	err = swbp_optimize(vma, vaddr, tramp->vaddr);
> +	if (WARN_ON_ONCE(err))
> +		uprobe_trampoline_put(tramp);
> +	return err;
> +}
> +

[...]
