> From: Pavan Nikhilesh <[email protected]>
> 
> Add RTE_OPTIMAL_BURST_SIZE to allow platforms to configure the
> optimal burst size.
> 
> Set default value to 64 for soc_cn10k and 32 generally.
> 
> Signed-off-by: Pavan Nikhilesh <[email protected]>
> ---
> This improves performance by 5% on l2fwd, other examples showed
> negligible difference on CN10K.
>

I support the concept of having a recommended mbuf burst size, targeting the 
majority of generic applications.
Making it CPU dependent seems like a good choice.

It should be named differently.
First of all, "optimal" depends on the use case; if targeting low latency, 
shorter bursts are better, so "OPTIMAL" should not be part of the name.
Second, I would guess that it only targets mbuf bursts, not also bursts of 
other operations (e.g. hash lookups), so "MBUF" should be part of the name.

Suggestion:
/* Recommended burst size for generic applications, striking a balance between 
throughput and latency. */
dpdk_conf.set('RTE_MBUF_BURST_SIZE_MAX' (or _DEFAULT), 64)

<feature creep>
/* Recommended burst size for generic applications targeting low latency. */
dpdk_conf.set('RTE_MBUF_BURST_SIZE_MIN', 4)
</feature creep>

Having these standardized will also allow libraries and drivers to optimize for 
them, e.g. drivers should support bursts sizes all the way down to 
RTE_MBUF_BURST_SIZE_MIN, and can static_assert() that the 
RTE_MBUF_BURST_SIZE_MIN is not lower than supported by the driver/hardware.

<more feature creep>
rte_config.h could have "#define RTE_MBUF_BURST_SIZE RTE_MBUF_BURST_SIZE_MAX", 
for the application developer to change to RTE_MBUF_BURST_SIZE_MIN for low 
latency applications.
This will let the libraries and drivers optimize for the specific burst size 
used by the application.
</more feature creep>

<rambling>
Intuitively, I would assume that the optimal burst size essentially depends on 
the CPU's L1D cache size and the application's number of non-mbuf cache lines 
accessed per burst.
Let's say a CPU core has 32 KiB cache (= 512 cache lines), and each burst 
touches 4 cache lines per packet:
2 cache lines for the mbuf
1 cache line for the packet data
1 cache line per packet for some table lookup/forwarding entry

Then the mbuf burst should be max 512/4 = 128.
But local variables also use memory during processing, so using a burst of 64 
would leave room for that and some more.
</rambling>

>  config/arm/meson.build | 1 +
>  config/meson.build     | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/config/arm/meson.build b/config/arm/meson.build
> index 523b0fc0ed50..fa64c07016b1 100644
> --- a/config/arm/meson.build
> +++ b/config/arm/meson.build
> @@ -481,6 +481,7 @@ soc_cn10k = {
>          ['RTE_MAX_LCORE', 24],
>          ['RTE_MAX_NUMA_NODES', 1],
>          ['RTE_MEMPOOL_ALIGN', 128],
> +        ['RTE_OPTIMAL_BURST_SIZE', 64],
>      ],
>      'part_number': '0xd49',
>      'extra_march_features': ['crypto'],
> diff --git a/config/meson.build b/config/meson.build
> index 0cb074ab95b7..95367ae88e2d 100644
> --- a/config/meson.build
> +++ b/config/meson.build
> @@ -386,6 +386,7 @@ if get_option('mbuf_refcnt_atomic')
>      dpdk_conf.set('RTE_MBUF_REFCNT_ATOMIC', true)
>  endif
>  dpdk_conf.set10('RTE_IOVA_IN_MBUF', get_option('enable_iova_as_pa'))
> +dpdk_conf.set('RTE_OPTIMAL_BURST_SIZE', 32)
> 
>  compile_time_cpuflags = []
>  subdir(arch_subdir)
> --
> 2.50.1 (Apple Git-155)

Reply via email to