[Public] Hi @Morten Brørup,
We (@P, Thiyagarajan @Murali Krishna, Bala and myself) have used dma-perf to validate the performance for payload sizes from 1 B to 17 B. Following are our observations.

With c_args `-DRTE_MEMCPY_AVX512` enabled on Zen 4, we observe around a 25% performance regression for payload sizes 1 B to 15 B and 17 B, while for 16 B we see a 40% improvement in Mops.

Without c_args `-DRTE_MEMCPY_AVX512` on Zen 4, we observe a +/-4% variation from 1 B to 17 B.

We are investigating why the variation is more prominent with the AVX512 memcpy.

Note:
1. On Zen 4, loads/stores are split into 32 B operations, while on Zen 5 they are 64 B.
2. We tested the memif copy on Zen 5 with the patch (without -DRTE_MEMCPY_AVX512) for 64 B and 65 B payloads; the result is the same as the Zen 4 observation (shared in the previous email).

A minimal standalone sketch for reproducing the small-copy timings is included after the quoted message below.

> -----Original Message-----
> From: Varghese, Vipin <[email protected]>
> Sent: Wednesday, January 21, 2026 5:19 PM
> To: Morten Brørup <[email protected]>; Stephen Hemminger <[email protected]>
> Cc: [email protected]; Bruce Richardson <[email protected]>; Konstantin Ananyev <[email protected]>
> Subject: RE: [PATCH v6] eal/x86: optimize memcpy of small sizes
>
> Caution: This message originated from an External Source. Use proper caution
> when opening attachments, clicking links, or responding.
>
>
> [Public]
>
> Hi @Morten Brørup, please find our observation running testpmd with memif in
> zero-copy mode disabled (rte_memcpy enabled).
>
> 1. DPDK baseline version: 25.11; we tested with testpmd in io & flowgen mode.
> 2. Using no c_args for memcpy (rte_mov32) and with the patch, at 64 B & 65 B we get `15.5Mpps`.
> 3. Using c_args `-DRTE_MEMCPY_AVX512` for memcpy (rte_mov64) and with the patch, at 64 B & 65 B we get `14.8Mpps`.
>
> We will run the dma-perf application for payload sizes of 1, 2, 3, 4, 5, ... etc.
>
> Regards
> Vipin Varghese
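In case it helps to reproduce the 1 B to 17 B numbers outside the dma-perf app, below is a minimal standalone sketch that times rte_memcpy for those payload sizes. This is only an illustration, not the dma-perf tool or the patch under test; the file name, build line, iteration count and buffer handling are our assumptions and should be adjusted to the local setup.

/*
 * memcpy_small.c - minimal sketch to time rte_memcpy for 1 B .. 17 B payloads.
 * Illustration only, not the dma-perf application; build line and iteration
 * count are assumptions, adjust to your environment.
 *
 * Build (example):
 *   gcc -O3 -march=native memcpy_small.c \
 *       $(pkg-config --cflags --libs libdpdk) -o memcpy_small
 * Add -DRTE_MEMCPY_AVX512 to the gcc line to exercise the AVX512 path.
 */
#include <stdint.h>
#include <stdio.h>

#include <rte_common.h>
#include <rte_cycles.h>
#include <rte_memcpy.h>

#define ITERATIONS (10 * 1000 * 1000)

int main(void)
{
	static uint8_t src[64] __rte_aligned(64);
	static uint8_t dst[64] __rte_aligned(64);

	for (size_t len = 1; len <= 17; len++) {
		uint64_t start = rte_rdtsc_precise();

		for (long i = 0; i < ITERATIONS; i++) {
			rte_memcpy(dst, src, len);
			/* compiler barrier so the copies are not optimized away */
			asm volatile("" : : "r" (dst), "r" (src) : "memory");
		}

		uint64_t cycles = rte_rdtsc_precise() - start;
		printf("%2zu B: %.2f cycles/copy\n",
		       len, (double)cycles / ITERATIONS);
	}
	return 0;
}

Running it once per build, with and without `-DRTE_MEMCPY_AVX512`, pinned to one core, should show whether the small-payload difference reproduces outside dma-perf.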

