On 22/07/2025 17.01, Matt Fleming wrote:
From: Matt Fleming <mflem...@cloudflare.com>

Add benchmarks for the standard set of operations: lookup, update,
delete. Also, include a benchmark for trie_free() which is known to have
terrible performance for maps with many entries.

Benchmarks operate on tries without gaps in the key range, i.e. each
test begins with a trie with valid keys in the range [0, nr_entries).
This is intended to cause maximum branching when traversing the trie.

All measurements are recorded inside the kernel to remove syscall
overhead.

Most benchmarks run an XDP program to generate stats but free needs to
collect latencies using fentry/fexit on map_free_deferred() because it's
not possible to use fentry directly on lpm_trie.c since commit
c83508da5620 ("bpf: Avoid deadlock caused by nested kprobe and fentry
bpf programs") and there's no way to create/destroy a map from within an
XDP program.

Here is example output from an AMD EPYC 9684X 96-Core machine for each
of the benchmarks using a trie with 10K entries and a 32-bit prefix
length, e.g.

   $ ./bench lpm-trie-$op \
        --prefix_len=32 \
        --producers=1 \
        --nr_entries=10000

   lookup: throughput    7.423 ± 0.023 M ops/s (  7.423M ops/prod), latency  134.710 ns/op
   update: throughput    2.643 ± 0.015 M ops/s (  2.643M ops/prod), latency  378.310 ns/op
   delete: throughput    0.712 ± 0.008 M ops/s (  0.712M ops/prod), latency 1405.152 ns/op
     free: throughput    0.574 ± 0.003 K ops/s (  0.574K ops/prod), latency    1.743 ms/op

Tested-by: Jesper Dangaard Brouer <h...@kernel.org>

I've run a lot more benchmarks.

I used a slightly updated version[1] of Matt's patches that improves the
"delete" operation accounting that Alexei complained about, but I guess
Matt isn't 100% happy with this approach (as I don't see a V4). The
"delete" numbers below have improved compared to the ones above.

Results from[2] with default 10,000 entries:
 lookup 7.598 ± 0.004 M ops/s   131.608 ns/op
 update 3.247 ± 0.029 M ops/s   308.008 ns/op
 delete 1.747 ± 0.053 M ops/s   572.519 ns/op
 free   0.294 ± 0.055 K ops/s   3.521 ms/op

[1] https://github.com/xdp-project/xdp-project/blob/main/areas/bench/patches/bench-lpm-trie-V3-adjusted.patch
[2] https://github.com/xdp-project/xdp-project/blob/main/areas/bench/bench01_lpm-trie.org

I'm mostly interested in the fast-path lookup performance, documented
here[3] with links to plots[4]. When looking at these per-operation
costs, remember that they include the cost of generating a random
number. That said, it is very clear from my data that the LPM trie has
problems with cache-line thrashing as the number of entries increases.

[3] https://github.com/xdp-project/xdp-project/blob/main/areas/bench/bench02_lpm-trie-lookup.org
[4] https://github.com/xdp-project/xdp-project/blob/main/areas/bench/bench02_lpm-trie-lookup.org#plotting-results

I've also documented the "bench" harness a little[5].

[5] https://github.com/xdp-project/xdp-project/blob/main/areas/bench/README.org

--Jesper
