On Mon, Apr 21, 2025 at 2:46 PM Jiri Olsa <[email protected]> wrote:
>
> Putting together all the previously added pieces to support optimized
> uprobes on top of the 5-byte nop instruction.
>
> The current uprobe execution goes through the following steps:
>
>   - installs breakpoint instruction over original instruction
>   - exception handler hit and calls related uprobe consumers
>   - and either simulates original instruction or does out of line single step
>     execution of it
>   - returns to user space
>
> The optimized uprobe path does the following:
>
>   - checks the original instruction is a 5-byte nop (plus other checks)
>   - adds (or uses existing) user space trampoline with uprobe syscall
>   - overwrites original instruction (5-byte nop) with call to user space
>     trampoline
>   - the user space trampoline executes uprobe syscall that calls related
>     uprobe consumers
>   - trampoline returns back to next instruction
>
> This approach won't speed up all uprobes as it's limited to using nop5 as
> the original instruction, but we plan to use nop5 as the USDT probe
> instruction (which currently uses a single byte nop) and speed up the USDT
> probes.
>
> The arch_uprobe_optimize function triggers the uprobe optimization and is
> called after the first uprobe hit. I originally had it called on uprobe
> installation, but then it clashed with the elf loader, because the user
> space trampoline was added in a place where the loader might need to put
> elf segments, so I decided to do it after the first uprobe hit, when
> loading is done.
>
> The uprobe is un-optimized in the arch specific set_orig_insn call.
>
> The instruction overwrite is x86 arch specific and needs to go through 3
> updates (on top of the nop5 instruction):
>
>   - write int3 into the 1st byte
>   - write the last 4 bytes of the call instruction
>   - update the call instruction opcode
>
> And cleanup goes through similar reverse stages:
>
>   - overwrite the call opcode with breakpoint (int3)
>   - write the last 4 bytes of the nop5 instruction
>   - write the nop5 first instruction byte
>
> We do not unmap and release the uprobe trampoline when it's no longer
> needed, because there's no easy way to make sure none of the threads is
> still inside the trampoline. But we do not waste memory, because there's
> just a single page for all the uprobe trampoline mappings.
>
> We do waste a frame on the page mapping for every 4GB by keeping the
> uprobe trampoline page mapped, but that seems ok.
>
> We benefit from the fact that set_swbp and set_orig_insn are
> called under mmap_write_lock(mm), so we can use the current instruction
> as the state the uprobe is in - nop5/breakpoint/call trampoline -
> and decide the needed action (optimize/un-optimize) based on that.
>
> Attaching the speed up from the benchs/run_bench_uprobes.sh script:
>
> current:
>     usermode-count :  152.604 ± 0.044M/s
>     syscall-count  :   13.359 ± 0.042M/s
> --> uprobe-nop     :    3.229 ± 0.002M/s
>     uprobe-push    :    3.086 ± 0.004M/s
>     uprobe-ret     :    1.114 ± 0.004M/s
>     uprobe-nop5    :    1.121 ± 0.005M/s
>     uretprobe-nop  :    2.145 ± 0.002M/s
>     uretprobe-push :    2.070 ± 0.001M/s
>     uretprobe-ret  :    0.931 ± 0.001M/s
>     uretprobe-nop5 :    0.957 ± 0.001M/s
>
> after the change:
>     usermode-count :  152.448 ± 0.244M/s
>     syscall-count  :   14.321 ± 0.059M/s
>     uprobe-nop     :    3.148 ± 0.007M/s
>     uprobe-push    :    2.976 ± 0.004M/s
>     uprobe-ret     :    1.068 ± 0.003M/s
> --> uprobe-nop5    :    7.038 ± 0.007M/s
>     uretprobe-nop  :    2.109 ± 0.004M/s
>     uretprobe-push :    2.035 ± 0.001M/s
>     uretprobe-ret  :    0.908 ± 0.001M/s
>     uretprobe-nop5 :    3.377 ± 0.009M/s
>
> I see a bit more speed up on Intel (above) compared to AMD.
> The big nop5 speed up is partly due to emulating nop5 and partly due to
> the optimization.
>
> The key speed up we do this for is the USDT switch from nop to nop5:
>
>   uprobe-nop     :    3.148 ± 0.007M/s
>   uprobe-nop5    :    7.038 ± 0.007M/s
>
> Signed-off-by: Jiri Olsa <[email protected]>
> ---
>  arch/x86/include/asm/uprobes.h |   7 +
>  arch/x86/kernel/uprobes.c      | 281 ++++++++++++++++++++++++++++++++-
>  include/linux/uprobes.h        |   6 +-
>  kernel/events/uprobes.c        |  15 +-
>  4 files changed, 301 insertions(+), 8 deletions(-)
>
just minor nits, LGTM

Acked-by: Andrii Nakryiko <[email protected]>

> +int set_swbp(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> +	     unsigned long vaddr)
> +{
> +	if (should_optimize(auprobe)) {
> +		bool optimized = false;
> +		int err;
> +
> +		/*
> +		 * We could race with another thread that already optimized the
> +		 * probe, so let's not overwrite it with int3 again in this case.
> +		 */
> +		err = is_optimized(vma->vm_mm, vaddr, &optimized);
> +		if (err || optimized)
> +			return err;

IMO, this is a bit too clever, I'd go with plain

	if (err)
		return err;
	if (optimized)
		return 0; /* we are done */

(and mirror set_orig_insn() structure, consistently)

> +	}
> +	return uprobe_write_opcode(vma, vaddr, UPROBE_SWBP_INSN, true);
> +}
> +
> +int set_orig_insn(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> +		  unsigned long vaddr)
> +{
> +	if (test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags)) {
> +		struct mm_struct *mm = vma->vm_mm;
> +		bool optimized = false;
> +		int err;
> +
> +		err = is_optimized(mm, vaddr, &optimized);
> +		if (err)
> +			return err;
> +		if (optimized)
> +			WARN_ON_ONCE(swbp_unoptimize(auprobe, vma, vaddr));
> +	}
> +	return uprobe_write_opcode(vma, vaddr, *(uprobe_opcode_t *)&auprobe->insn, false);
> +}
> +
> +static int __arch_uprobe_optimize(struct mm_struct *mm, unsigned long vaddr)
> +{
> +	struct uprobe_trampoline *tramp;
> +	struct vm_area_struct *vma;
> +	int err = 0;
> +
> +	vma = find_vma(mm, vaddr);
> +	if (!vma)
> +		return -1;

this is EPERM, will be confusing to debug... why not -EINVAL?

> +	tramp = uprobe_trampoline_get(vaddr);
> +	if (!tramp)
> +		return -1;

ditto

> +	err = swbp_optimize(vma, vaddr, tramp->vaddr);
> +	if (WARN_ON_ONCE(err))
> +		uprobe_trampoline_put(tramp);
> +	return err;
> +}
> +

[...]
