>> > From: Pavan Nikhilesh <[email protected]>
>> >
>> > Add RTE_OPTIMAL_BURST_SIZE to allow platforms to configure the
>> > optimal burst size.
>> >
>> > Set default value to 64 for soc_cn10k and 32 generally.
>> >
>> > Signed-off-by: Pavan Nikhilesh <[email protected]>
>> > ---
>> > This improves performance by 5% on l2fwd, other examples showed
>> > negligible difference on CN10K.
>> >
>>
>> I support the concept of having a recommended mbuf burst size, targeting the
>> majority of generic applications.
>> Making it CPU dependent seems like a good choice.
>>
>> It should be named differently.
>> First of all, "optimal" depends on the use case; if targeting low latency,
>> shorter bursts are better, so "OPTIMAL" should not be part of the name.
>> Second, I would guess that it only targets mbuf bursts, not also bursts of
>> other operations (e.g. hash lookups), so "MBUF" should be part of the name.
>>
>> Suggestion:
>> /* Recommended burst size for generic applications, striking a balance
>> between throughput and latency. */
>> dpdk_conf.set('RTE_MBUF_BURST_SIZE_MAX' (or _DEFAULT), 64)
>>
>> <feature creep>
>> /* Recommended burst size for generic applications targeting low latency. */
>> dpdk_conf.set('RTE_MBUF_BURST_SIZE_MIN', 4)
>> </feature creep>
>>
>> Having these standardized will also allow libraries and drivers to optimize
>> for them, e.g. drivers should support burst sizes all the way down to
>> RTE_MBUF_BURST_SIZE_MIN, and can static_assert() that
>> RTE_MBUF_BURST_SIZE_MIN is not lower than supported by the driver/hardware.
>>
>> <more feature creep>
>> rte_config.h could have "#define RTE_MBUF_BURST_SIZE RTE_MBUF_BURST_SIZE_MAX",
>> for the application developer to change to RTE_MBUF_BURST_SIZE_MIN for low
>> latency applications.
>> This will let the libraries and drivers optimize for the specific burst size
>> used by the application.
>> </more feature creep>
>>
>> <rambling>
>> Intuitively, I would assume that the optimal burst size essentially depends
>> on the CPU's L1D cache size and the application's number of non-mbuf cache
>> lines accessed per burst.
>> Let's say a CPU core has 32 KiB of L1D cache (= 512 cache lines), and each
>> burst touches 4 cache lines per packet:
>> 2 cache lines for the mbuf
>> 1 cache line for the packet data
>> 1 cache line per packet for some table lookup/forwarding entry
>>
>> Then the mbuf burst should be max 512/4 = 128.
>> But local variables also use memory during processing, so using a burst of
>> 64 would leave room for that and some more.
>> </rambling>
>>
>> >  config/arm/meson.build | 1 +
>> >  config/meson.build     | 1 +
>> >  2 files changed, 2 insertions(+)
>> >
>> > diff --git a/config/arm/meson.build b/config/arm/meson.build
>> > index 523b0fc0ed50..fa64c07016b1 100644
>> > --- a/config/arm/meson.build
>> > +++ b/config/arm/meson.build
>> > @@ -481,6 +481,7 @@ soc_cn10k = {
>> >          ['RTE_MAX_LCORE', 24],
>> >          ['RTE_MAX_NUMA_NODES', 1],
>> >          ['RTE_MEMPOOL_ALIGN', 128],
>> > +        ['RTE_OPTIMAL_BURST_SIZE', 64],
>> >      ],
>> >      'part_number': '0xd49',
>> >      'extra_march_features': ['crypto'],
>> > diff --git a/config/meson.build b/config/meson.build
>> > index 0cb074ab95b7..95367ae88e2d 100644
>> > --- a/config/meson.build
>> > +++ b/config/meson.build
>> > @@ -386,6 +386,7 @@ if get_option('mbuf_refcnt_atomic')
>> >      dpdk_conf.set('RTE_MBUF_REFCNT_ATOMIC', true)
>> > endif
>> > dpdk_conf.set10('RTE_IOVA_IN_MBUF', get_option('enable_iova_as_pa'))
>> > +dpdk_conf.set('RTE_OPTIMAL_BURST_SIZE', 32)
>> >
>> > compile_time_cpuflags = []
>> > subdir(arch_subdir)
>> > --
>> > 2.50.1 (Apple Git-155)
>
> I understand the motivation, and it makes sense for a pure embedded system.
> But then again, on an embedded system the application can just set its burst
> size; this config option only impacts the performance of testpmd and the
> examples. And the performance of testpmd is mostly irrelevant; what matters
> is the real application.
>
True, but generally customer engagements start with benchmarking testpmd,
l3fwd, etc. before moving to custom apps. So, having better performance
numbers helps.

> Making it a DPDK config option is a problem for DPDK builds in distros.
> The optimal burst size would be driver dependent etc.

Since we are not modifying the current default burst size (32), it shouldn't
be a problem and can even benefit SoCs.

> Perhaps better off in the existing rx / tx descriptor hints.
> Most of those device configs really need to be relooked at,
> since they were inherited from how old Intel drivers worked.
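
FWIW, the rx/tx hints mentioned here already exist in ethdev: a driver can
advertise a preferred burst size via rte_eth_dev_info's default_rxportconf /
default_txportconf, and the application can query it at init time instead of
relying on a build-time constant. A minimal sketch of that query
(preferred_rx_burst is a made-up helper name, not an existing API):

#include <rte_ethdev.h>

/* Return the driver's preferred rx burst size for the port, falling
 * back to the given default when the driver expresses no preference
 * (burst_size == 0).
 */
static uint16_t
preferred_rx_burst(uint16_t port_id, uint16_t fallback)
{
        struct rte_eth_dev_info dev_info;

        if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
                return fallback;
        return dev_info.default_rxportconf.burst_size != 0 ?
                        dev_info.default_rxportconf.burst_size : fallback;
}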

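For completeness, here is a rough sketch of how the RTE_MBUF_BURST_SIZE /
RTE_MBUF_BURST_SIZE_MIN macros proposed in the review above could be consumed
by a driver and an application. All of these names are only the reviewer's
suggestions (plus a made-up MYDRV_MIN_SUPPORTED_BURST); none of them exist in
DPDK today:

#include <assert.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Driver side: declare the smallest burst the hardware can handle
 * (made-up value) and check the proposed minimum against it at
 * compile time.
 */
#define MYDRV_MIN_SUPPORTED_BURST 4

#ifdef RTE_MBUF_BURST_SIZE_MIN
static_assert(RTE_MBUF_BURST_SIZE_MIN >= MYDRV_MIN_SUPPORTED_BURST,
        "RTE_MBUF_BURST_SIZE_MIN is below this driver's minimum burst");
#endif

/* Application side: size the rx array with the proposed macro rather
 * than a hard-coded 32; fall back while the macro is only a proposal.
 */
#ifndef RTE_MBUF_BURST_SIZE
#define RTE_MBUF_BURST_SIZE 32
#endif

static void
poll_queue(uint16_t port_id, uint16_t queue_id)
{
        struct rte_mbuf *pkts[RTE_MBUF_BURST_SIZE];
        uint16_t i, nb_rx;

        nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts,
                        RTE_MBUF_BURST_SIZE);
        for (i = 0; i < nb_rx; i++)
                rte_pktmbuf_free(pkts[i]); /* stand-in for real work */
}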
