[Public]

Hi @Morten Brørup,

We (@P, Thiyagarajan, @Murali Krishna, Bala, and myself) have used dma-perf to
validate performance for payload sizes from 1B to 17B.
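For reference, the sweep can be driven with a dpdk-test-dma-perf CPU-copy case along
the lines below. This is an illustrative sketch only: the keys follow the
dpdk-test-dma-perf guide in the DPDK docs, and the exact buf_size sweep syntax,
lcore and NUMA values are assumptions rather than our exact setup.

```
# Minimal config with a single CPU_MEM_COPY case (this path uses rte_memcpy).
# buf_size is assumed to mean: start,end,step,ADD -> 1B to 17B in 1B steps.
cat > config.ini << 'EOF'
[case1]
type=CPU_MEM_COPY
mem_size=10
buf_size=1,17,1,ADD
src_numa_node=0
dst_numa_node=0
cache_flush=0
test_seconds=2
lcore=3
eal_args=--in-memory --no-pci
EOF

# Run the sweep and collect per-size Mops into a CSV.
./build/app/dpdk-test-dma-perf --config config.ini --result result.csv
```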
Our observations are as follows:

With c_args `-DRTE_MEMCPY_AVX512` enabled on Zen4, we observe around a 25%
performance regression for payload sizes 1B to 15B and 17B.
For 16B, however, we see a 40% improvement in Mops.

Without c_args `-DRTE_MEMCPY_AVX512` on Zen4, we observe a +/-4% variation
from 1B to 17B.
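
For clarity, the two configurations compared above differ only in the extra c_args
passed at build time, roughly as follows (illustrative; build directory names and any
other options we used are not shown):

```
# Baseline build: default rte_memcpy code path.
meson setup build-default
ninja -C build-default

# AVX512 build: enables the RTE_MEMCPY_AVX512 compile-time path in rte_memcpy.
meson setup build-avx512 -Dc_args=-DRTE_MEMCPY_AVX512
ninja -C build-avx512
```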

We are investigating why the variation is more prominent with the AVX512 memcpy.

Note:
1. On Zen4 the load/store data path is 32B wide, so a 64B load/store is split into
two 32B operations, while on Zen5 the load/store path is a full 64B.
2. We tested memif copy on Zen5 with the patch (without -DRTE_MEMCPY_AVX512) for 64B
and 65B payloads; the results match the Zen4 observations shared in the previous
email. An illustrative command sketch of this memif setup follows below.
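
The memif copy runs mentioned in point 2 (and in the quoted mail below) are of the
general shape sketched here. The core lists, vdev arguments and packet size are
illustrative assumptions; memif zero-copy is simply left at its default (disabled),
so the PMD copies packets via rte_memcpy:

```
# Instance 1: memif server, io forwarding.
./build/app/dpdk-testpmd -l 0-2 --no-pci --file-prefix=srv \
    --vdev=net_memif0,role=server,id=0 -- -i --forward-mode=io

# Instance 2: memif client, flowgen traffic at 64B (65B for the second run).
./build/app/dpdk-testpmd -l 3-5 --no-pci --file-prefix=clt \
    --vdev=net_memif0,role=client,id=0 -- -i --forward-mode=flowgen --txpkts=64
```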



> -----Original Message-----
> From: Varghese, Vipin <[email protected]>
> Sent: Wednesday, January 21, 2026 5:19 PM
> To: Morten Brørup <[email protected]>; Stephen Hemminger
> <[email protected]>
> Cc: [email protected]; Bruce Richardson <[email protected]>; Konstantin
> Ananyev <[email protected]>
> Subject: RE: [PATCH v6] eal/x86: optimize memcpy of small sizes
>
> [Public]
>
> Hi @Morten Brørup, please find our observation running testpmd with memif in
> zero-copy mode disabled (rte_memcpy enabled).
>
> 1. DPDK baseline version 25.11; we tested with testpmd in io & flowgen mode.
> 2. Using no cargs for memcpy (rte_mov32) and with the patch, at 64B & 65B we get
>    `15.5Mpps`.
> 3. Using cargs `-DRTE_MEMCPY_AVX512` for memcpy (rte_mov64) and with the patch, at
>    64B & 65B we get `14.8Mpps`.
>
> We will run the dma-perf application for payload sizes of 1, 2, 3, 4, 5, ... etc.
>
> Regards
> Vipin Varghese
