Arnaldo Carvalho de Melo <a...@kernel.org> writes:

> On Wed, May 28, 2025 at 11:28:54AM +0200, Toke Høiland-Jørgensen wrote:
>> Mina Almasry <almasrym...@google.com> writes:
>> > On Mon, May 26, 2025 at 5:51 AM Toke Høiland-Jørgensen <t...@redhat.com> 
>> > wrote:
>> >> Back when you posted the first RFC, Jesper and I chatted about ways to
>> >> avoid the ugly "load module and read the output from dmesg" interface to
>> >> the test.
>
>> > I agree the existing interface is ugly.
>
>> >> One idea we came up with was to make the module include only the "inner"
>> >> functions for the benchmark, and expose those to BPF as kfuncs. Then the
>> >> test runner can be a BPF program that runs the tests, collects the data
>> >> and passes it to userspace via maps or a ringbuffer or something. That's
>> >> a nicer and more customisable interface than the printk output. And if
>> >> they're small enough, maybe we could even include the functions into the
>> >> page_pool code itself, instead of in a separate benchmark module?
>
>> >> WDYT of that idea? :)
>
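
A minimal sketch of that kfunc idea, assuming the module exposes one
inner loop per benchmark (the function, set, and id names below are
made up for illustration, not the actual bench_page_pool code):

  // Module side: register the inner benchmark loop as a kfunc that a
  // BPF test-runner program (BPF_PROG_TYPE_SYSCALL) is allowed to call.
  #include <linux/module.h>
  #include <linux/bpf.h>
  #include <linux/btf.h>
  #include <linux/btf_ids.h>

  /* Hypothetical inner loop; returns e.g. average cycles per op. */
  __bpf_kfunc u64 bench_page_pool_fast_path(u32 loops)
  {
          /* ... alloc/recycle pages in a tight loop, time it ... */
          return 0;
  }

  BTF_KFUNCS_START(bench_pp_kfunc_ids)
  BTF_ID_FLAGS(func, bench_page_pool_fast_path)
  BTF_KFUNCS_END(bench_pp_kfunc_ids)

  static const struct btf_kfunc_id_set bench_pp_kfunc_set = {
          .owner = THIS_MODULE,
          .set   = &bench_pp_kfunc_ids,
  };

  static int __init bench_pp_init(void)
  {
          return register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
                                           &bench_pp_kfunc_set);
  }
  module_init(bench_pp_init);

Registering against BPF_PROG_TYPE_SYSCALL would let the runner be
triggered from userspace via BPF_PROG_TEST_RUN, similar to what
bpf_testmod does for its test kfuncs.
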
>> > ...but this sounds like an enormous amount of effort for something
>> > that is a bit ugly but isn't THAT bad. Especially for me: I'm not
>> > enough of an expert to know how to implement what you're referring
>> > to off the top of my head. I'm normally open to spending the time,
>> > but this is not that high on my todo list and I have limited
>> > bandwidth to resolve it :(
>
>> > I also feel that this is something that could be improved post merge.
>
> agreed
>
>> > I think it's very beneficial to have this merged in some form that can
>> > be improved later. Byungchul is making a lot of changes to these mm
>> > things and it would be nice to have an easy way to run the benchmark
>> > in tree and maybe even get automated results from NIPA. If we could
>> > agree on an MVP that is appropriate to merge without too much scope
>> > creep, that would be ideal from my side at least.
>  
>> Right, fair. I guess we can merge it as-is, and then investigate whether
>> we can move it to BPF-based (or maybe 'perf bench' - Cc acme) later :)
>
> tl;dr: I'd advise merging it as-is, then kfunc'ifying parts of it and
> using it from a 'perf bench' suite.
>
> Yeah, the model would be what I did for uprobes, but even then there is
> a selftests-based uprobes benchmark ;-)
>
> The 'perf bench' part, that calls into the skel:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/bench/uprobe.c
>
> The skel:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/bpf_skel/bench_uprobe.bpf.c
>
> While this one is just to generate BPF load to measure the impact on
> uprobes, for your case it would involve using a ring buffer to
> communicate from the skel (BPF/kernel side) to the userspace part,
> similar to what is done in various other BPF-based perf tooling
> available in:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/bpf_skel
>
> Like at this line (BPF skel part):
>
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/bpf_skel/off_cpu.bpf.c?h=perf-tools-next#n253
>
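
Sticking with hypothetical names, the BPF skel side of such a
page_pool benchmark could push its results through a ring buffer
roughly like this (a sketch, assuming the kfunc from above):

  // bench_page_pool.bpf.c (hypothetical): call the kfunc, ship the
  // result to userspace through a BPF ring buffer.
  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>

  struct bench_event {
          __u32 bench_id;
          __u64 cycles_per_op;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_RINGBUF);
          __uint(max_entries, 256 * 1024);
  } events SEC(".maps");

  extern u64 bench_page_pool_fast_path(u32 loops) __ksym;

  SEC("syscall")
  int run_bench(void *ctx)
  {
          struct bench_event *e;

          e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
          if (!e)
                  return 1;

          e->bench_id = 0;
          e->cycles_per_op = bench_page_pool_fast_path(10000000);
          bpf_ringbuf_submit(e, 0);
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";
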
> The simplest part is in the canonical, standalone runqslower tool, also
> hosted in the kernel sources:
>
> BPF skel sending stuff to userspace:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.bpf.c#n99
>
> The userspace part that reads it:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n90
>
> This callback gets called for every event that the BPF skel produces,
> and is driven from this loop:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n162
>
> That handle_event callback was associated via:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n153
>
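
runqslower wires its handle_event up to a perf buffer; with the newer
BPF ring buffer from the sketch above, the userspace side could look
roughly like this (skeleton, map, and type names again hypothetical):

  // bench_page_pool.c (hypothetical): load the skel, trigger the
  // SEC("syscall") prog via BPF_PROG_TEST_RUN, drain the results.
  #include <stdio.h>
  #include <bpf/libbpf.h>
  #include <bpf/bpf.h>
  #include "bench_page_pool.skel.h"

  struct bench_event {
          __u32 bench_id;
          __u64 cycles_per_op;
  };

  static int handle_event(void *ctx, void *data, size_t len)
  {
          const struct bench_event *e = data;

          printf("bench %u: %llu cycles/op\n", e->bench_id,
                 (unsigned long long)e->cycles_per_op);
          return 0;
  }

  int main(void)
  {
          struct bench_page_pool *skel;
          struct ring_buffer *rb;
          LIBBPF_OPTS(bpf_test_run_opts, tattr);

          skel = bench_page_pool__open_and_load();
          if (!skel)
                  return 1;

          rb = ring_buffer__new(bpf_map__fd(skel->maps.events),
                                handle_event, NULL, NULL);
          if (!rb)
                  return 1;

          bpf_prog_test_run_opts(bpf_program__fd(skel->progs.run_bench),
                                 &tattr);
          ring_buffer__poll(rb, 100 /* timeout, ms */);

          ring_buffer__free(rb);
          bench_page_pool__destroy(skel);
          return 0;
  }
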
> There is a dissection I did about this process a long time ago, but
> still relevant, I think:
>
> http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-producers-consumers/#/33
>
> The part explaining the interaction userspace/kernel starts here:
>
> http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-producers-consumers/#/40
>
> (yeah, it's http, but then, it's _old_vger ;-)
>
> Doing it in perf is interesting because perf gets widely packaged, so
> whatever you add to it gets visibility for people using 'perf bench'
> and becomes available in most places. It would add to this collection:
>
> root@number:~# perf bench
> Usage: 
>       perf bench [<common options>] <collection> <benchmark> [<options>]
>
>         # List of all available benchmark collections:
>
>          sched: Scheduler and IPC benchmarks
>        syscall: System call benchmarks
>            mem: Memory access benchmarks
>           numa: NUMA scheduling and MM benchmarks
>          futex: Futex stressing benchmarks
>          epoll: Epoll stressing benchmarks
>      internals: Perf-internals benchmarks
>     breakpoint: Breakpoint benchmarks
>         uprobe: uprobe benchmarks
>            all: All benchmarks
>
> root@number:~#
>
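
Wiring a new collection into that menu is mostly a table entry in
tools/perf/builtin-bench.c; a hypothetical page_pool collection might
look something like this (names made up, following the pattern of the
existing collections):

  // In tools/perf/builtin-bench.c (sketch): one bench fn per
  // sub-benchmark, using the int (*fn)(int argc, const char **argv)
  // signature the other collections use.
  static struct bench page_pool_benchmarks[] = {
          { "fast-path", "page_pool fast path benchmark",  bench_page_pool_fast_path },
          { "all",       "Run all page_pool benchmarks",   NULL },
          { NULL,        NULL,                             NULL }
  };

  // ...plus one line in the collections[] table:
  //      { "page_pool", "page_pool benchmarks", page_pool_benchmarks },
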
> the 'perf bench' that uses BPF skel:
>
> root@number:~# perf bench uprobe baseline
> # Running 'uprobe/baseline' benchmark:
> # Executed 1,000 usleep(1000) calls
>      Total time: 1,050,383 usecs
>
>  1,050.383 usecs/op
> root@number:~# perf trace  --summary perf bench uprobe trace_printk
> # Running 'uprobe/trace_printk' benchmark:
> # Executed 1,000 usleep(1000) calls
>      Total time: 1,053,082 usecs
>
>  1,053.082 usecs/op
>
>  Summary of events:
>
>  uprobe-trace_pr (1247691), 3316 events, 96.9%
>
>    syscall            calls  errors  total       min       avg       max     stddev
>                                      (msec)    (msec)    (msec)    (msec)        (%)
>    --------------- --------  ------ -------- --------- --------- --------- ---------
>    clock_nanosleep     1000      0  1101.236     1.007     1.101    50.939      4.53%
>    close                 98      0    32.979     0.001     0.337    32.821     99.52%
>    perf_event_open        1      0    18.691    18.691    18.691    18.691      0.00%
>    mmap                 209      0     0.567     0.001     0.003     0.007      2.59%
>    bpf                   38      2     0.380     0.000     0.010     0.092     28.38%
>    openat                65      0     0.171     0.001     0.003     0.012      7.14%
>    mprotect              56      0     0.141     0.001     0.003     0.008      6.86%
>    read                  68      0     0.082     0.001     0.001     0.010     11.60%
>    fstat                 65      0     0.056     0.001     0.001     0.003      5.40%
>    brk                   10      0     0.050     0.001     0.005     0.012     24.29%
>    pread64                8      0     0.042     0.001     0.005     0.021     49.29%
> <SNIP other syscalls>
>
> root@number:~#

Cool, thanks for the pointers! I guess we'd need to restructure the
functions to be benchmarked a bit, but that should be doable.

-Toke

