Arnaldo Carvalho de Melo <a...@kernel.org> writes:

> On Wed, May 28, 2025 at 11:28:54AM +0200, Toke Høiland-Jørgensen wrote:
>> Mina Almasry <almasrym...@google.com> writes:
>> > On Mon, May 26, 2025 at 5:51 AM Toke Høiland-Jørgensen <t...@redhat.com>
>> > wrote:
>> >> Back when you posted the first RFC, Jesper and I chatted about ways
>> >> to avoid the ugly "load module and read the output from dmesg"
>> >> interface to the test.
>
>> > I agree the existing interface is ugly.
>
>> >> One idea we came up with was to make the module include only the
>> >> "inner" functions for the benchmark, and expose those to BPF as
>> >> kfuncs. Then the test runner can be a BPF program that runs the
>> >> tests, collects the data and passes it to userspace via maps or a
>> >> ringbuffer or something. That's a nicer and more customisable
>> >> interface than the printk output. And if they're small enough, maybe
>> >> we could even include the functions in the page_pool code itself,
>> >> instead of in a separate benchmark module?
>
>> >> WDYT of that idea? :)
>
>> > ...but this sounds like an enormous amount of effort for something
>> > that is a bit ugly but isn't THAT bad. Especially for me; I'm not
>> > enough of an expert to know how to implement what you're referring
>> > to off the top of my head. I'm normally open to spending time, but
>> > this is not that high on my todo list and I have limited bandwidth
>> > to resolve this :(
>
>> > I also feel that this is something that could be improved post-merge.
>
> agreed
>
>> > I think it's very beneficial to have this merged in some form that
>> > can be improved later. Byungchul is making a lot of changes to these
>> > mm things and it would be nice to have an easy way to run the
>> > benchmark in tree and maybe even get automated results from nipa. If
>> > we could agree on an MVP that is appropriate to merge without too
>> > much scope creep, that would be ideal from my side at least.
>
>> Right, fair. I guess we can merge it as-is, and then investigate
>> whether we can move it to being BPF-based (or maybe 'perf bench' - Cc
>> acme) later :)
>
> tldr; I'd advise merging it as-is, then kfunc'ifying parts of it and
> using it from a 'perf bench' suite.
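>
> For the kfunc part, something along these lines could work; this is
> just a rough, untested sketch, with the function and set names made up
> for illustration, not the actual patch:
>
> 	#include <linux/bpf.h>
> 	#include <linux/btf.h>
> 	#include <linux/btf_ids.h>
> 	#include <linux/module.h>
> 	#include <linux/ktime.h>
>
> 	/* Hypothetical inner loop; the real one would exercise the
> 	 * page_pool alloc/free paths and return the elapsed time. */
> 	__bpf_kfunc u64 bpf_page_pool_bench_run(u32 nr_loops)
> 	{
> 		u64 start = ktime_get_ns();
> 		u32 i;
>
> 		for (i = 0; i < nr_loops; i++)
> 			cpu_relax();	/* stand-in for the benchmark body */
>
> 		return ktime_get_ns() - start;
> 	}
>
> 	BTF_KFUNCS_START(page_pool_bench_ids)
> 	BTF_ID_FLAGS(func, bpf_page_pool_bench_run)
> 	BTF_KFUNCS_END(page_pool_bench_ids)
>
> 	static const struct btf_kfunc_id_set page_pool_bench_set = {
> 		.owner = THIS_MODULE,
> 		.set   = &page_pool_bench_ids,
> 	};
>
> 	static int __init page_pool_bench_init(void)
> 	{
> 		/* make it callable from BPF_PROG_TYPE_SYSCALL programs,
> 		 * which can be triggered via bpf_prog_test_run() */
> 		return register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
> 						 &page_pool_bench_set);
> 	}
> 	module_init(page_pool_bench_init);
>
> 	MODULE_LICENSE("GPL");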
>
> Yeah, the model would be what I did for uprobes, but even then there is
> a selftests-based uprobes benchmark ;-)
>
> The 'perf bench' part, which calls into the skel:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/bench/uprobe.c
>
> The skel:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/bpf_skel/bench_uprobe.bpf.c
>
> While this one just generates BPF load to measure the impact on
> uprobes, your case would involve using a ring buffer to communicate
> from the skel (the BPF/kernel side) to the userspace part (see the
> sketch after the example output below), similar to what is done in
> various other BPF-based perf tooling available in:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/bpf_skel
>
> Like at this line (BPF skel part):
>
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/bpf_skel/off_cpu.bpf.c?h=perf-tools-next#n253
>
> The simplest example is in the canonical, standalone runqslower tool,
> also hosted in the kernel sources:
>
> The BPF skel sending stuff to userspace:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.bpf.c#n99
>
> The userspace part that reads it:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n90
>
> This is a callback that gets called for every event the BPF skel
> produces, from this loop:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n162
>
> That handle_event callback was associated via:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n153
>
> There is a dissection I did of this process a long time ago, but it's
> still relevant, I think:
>
> http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-producers-consumers/#/33
>
> The part explaining the userspace/kernel interaction starts here:
>
> http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-producers-consumers/#/40
>
> (yeah, it's http, but then, it's _old_vger ;-)
>
> Doing it in perf is interesting because perf gets widely packaged, so
> whatever you add to it gets visibility for people using 'perf bench'
> and also becomes available in most places. It would add to this
> collection:
>
> root@number:~# perf bench
> Usage:
> 	perf bench [<common options>] <collection> <benchmark> [<options>]
>
>         # List of all available benchmark collections:
>
>          sched: Scheduler and IPC benchmarks
>        syscall: System call benchmarks
>            mem: Memory access benchmarks
>           numa: NUMA scheduling and MM benchmarks
>          futex: Futex stressing benchmarks
>          epoll: Epoll stressing benchmarks
>      internals: Perf-internals benchmarks
>     breakpoint: Breakpoint benchmarks
>         uprobe: uprobe benchmarks
>            all: All benchmarks
>
> root@number:~#
>
> The 'perf bench' that uses a BPF skel:
>
> root@number:~# perf bench uprobe baseline
> # Running 'uprobe/baseline' benchmark:
> # Executed 1,000 usleep(1000) calls
> Total time: 1,050,383 usecs
>
> 1,050.383 usecs/op
> root@number:~# perf trace --summary perf bench uprobe trace_printk
> # Running 'uprobe/trace_printk' benchmark:
> # Executed 1,000 usleep(1000) calls
> Total time: 1,053,082 usecs
>
> 1,053.082 usecs/op
>
> Summary of events:
>
> uprobe-trace_pr (1247691), 3316 events, 96.9%
>
>    syscall            calls  errors   total       min       avg       max    stddev
>                                      (msec)    (msec)    (msec)    (msec)       (%)
>    ---------------  ------- ------ -------- --------- --------- --------- ---------
>    clock_nanosleep     1000      0 1101.236     1.007     1.101    50.939     4.53%
>    close                 98      0   32.979     0.001     0.337    32.821    99.52%
>    perf_event_open        1      0   18.691    18.691    18.691    18.691     0.00%
>    mmap                 209      0    0.567     0.001     0.003     0.007     2.59%
>    bpf                   38      2    0.380     0.000     0.010     0.092    28.38%
>    openat                65      0    0.171     0.001     0.003     0.012     7.14%
>    mprotect              56      0    0.141     0.001     0.003     0.008     6.86%
>    read                  68      0    0.082     0.001     0.001     0.010    11.60%
>    fstat                 65      0    0.056     0.001     0.001     0.003     5.40%
>    brk                   10      0    0.050     0.001     0.005     0.012    24.29%
>    pread64                8      0    0.042     0.001     0.005     0.021    49.29%
>    <SNIP other syscalls>
>
> root@number:~#
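>
> To give the ring buffer flow mentioned above a concrete shape, here is
> a rough, untested sketch of the two halves, loosely modeled on the
> runqslower pattern but using BPF_MAP_TYPE_RINGBUF instead of its perf
> event array. All the names (the map, the event struct, the kfunc) are
> made up for illustration.
>
> The skel (BPF/kernel side) runs the benchmark kfunc and pushes the
> result to userspace:
>
> 	/* bench.bpf.c */
> 	#include "vmlinux.h"
> 	#include <bpf/bpf_helpers.h>
>
> 	struct bench_event {
> 		__u32 nr_loops;
> 		__u64 duration_ns;
> 	};
>
> 	struct {
> 		__uint(type, BPF_MAP_TYPE_RINGBUF);
> 		__uint(max_entries, 256 * 1024);
> 	} events SEC(".maps");
>
> 	/* the hypothetical kfunc from the earlier sketch */
> 	extern u64 bpf_page_pool_bench_run(u32 nr_loops) __ksym;
>
> 	SEC("syscall")
> 	int run_bench(void *ctx)
> 	{
> 		struct bench_event *e;
>
> 		e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
> 		if (!e)
> 			return 0;
>
> 		e->nr_loops = 10000;
> 		e->duration_ns = bpf_page_pool_bench_run(e->nr_loops);
> 		bpf_ringbuf_submit(e, 0);
> 		return 0;
> 	}
>
> 	char LICENSE[] SEC("license") = "GPL";
>
> And the userspace part triggers it and drains the ring buffer via a
> callback, like runqslower's handle_event:
>
> 	/* bench.c */
> 	#include <stdio.h>
> 	#include <bpf/bpf.h>
> 	#include <bpf/libbpf.h>
> 	#include "bench.skel.h"	/* generated by bpftool gen skeleton */
>
> 	struct bench_event {	/* must match the skel's definition */
> 		__u32 nr_loops;
> 		__u64 duration_ns;
> 	};
>
> 	static int handle_event(void *ctx, void *data, size_t data_sz)
> 	{
> 		const struct bench_event *e = data;
>
> 		printf("%u loops in %llu ns (%.1f ns/loop)\n", e->nr_loops,
> 		       (unsigned long long)e->duration_ns,
> 		       (double)e->duration_ns / e->nr_loops);
> 		return 0;
> 	}
>
> 	int main(void)
> 	{
> 		LIBBPF_OPTS(bpf_test_run_opts, opts);
> 		struct ring_buffer *rb = NULL;
> 		struct bench_bpf *skel;
> 		int err = 1;
>
> 		skel = bench_bpf__open_and_load();
> 		if (!skel)
> 			return 1;
>
> 		rb = ring_buffer__new(bpf_map__fd(skel->maps.events),
> 				      handle_event, NULL, NULL);
> 		if (!rb)
> 			goto out;
>
> 		/* run the SEC("syscall") prog once, then drain its output */
> 		err = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.run_bench),
> 					     &opts);
> 		if (!err)
> 			ring_buffer__poll(rb, 100 /* ms */);
> 	out:
> 		ring_buffer__free(rb);
> 		bench_bpf__destroy(skel);
> 		return err != 0;
> 	}
>
> From there, hanging it off a new 'perf bench' collection would be
> mostly plumbing, following what tools/perf/bench/uprobe.c does.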

Cool, thanks for the pointers! I guess we'd need to restructure the
functions to be benchmarked a bit, but that should be doable.

-Toke