[Public]

Hi @Morten Brørup, we tried the changes on Zen4 (`AMD EPYC 8534P 64-Core 
Processor`) using `dpdk-test` with the `memcpy_perf_autotest` option. Here are 
the observations:

1. There is a reduction of 1 or 2 cycles, especially for smaller byte sizes, in 
both aligned and unaligned cases.
2. The overall test run for aligned and unaligned cases did not change.
3. Improvements are seen more in the aligned cases than in the unaligned ones.

Some caveats:
1. Zen4 and Zen5 both support AVX-512, but on Zen4 the load-store is 32B at the 
backend of the uarch. This might explain why there is no change for odd sizes 
> 64B.
2. We need to test with virtio or memif in copy mode to see actual results (will 
test and share results separately).
3. In function rte_mov48, since Zen4 uses 32B load/store, we need to recheck 
whether write-combining is causing stalls (which we could speed up by forcing 
the higher address first, then the lower address).
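
To make caveat 3 concrete, here is a rough sketch of the store order we have in 
mind. It is not the rte_mov48 code from the patch: the helper name 
`copy48_high_first` and the 32B + 16B split are assumptions for illustration 
only. The compiler is also free to reorder or merge these fixed-size copies, so 
a real experiment would need intrinsics and a check of the generated assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not DPDK's rte_mov48): copy 48 bytes as one 32B
 * chunk plus one 16B chunk, writing the higher-address chunk first.
 * The buffers must not overlap.
 */
static inline void
copy48_high_first(uint8_t *dst, const uint8_t *src)
{
	uint8_t lo[32]; /* bytes 0..31 of the source */
	uint8_t hi[16]; /* bytes 32..47 of the source */

	assert(dst != NULL && src != NULL);

	/* Load both chunks before storing anything. */
	memcpy(lo, src, sizeof(lo));
	memcpy(hi, src + 32, sizeof(hi));

	/* Store the higher address (bytes 32..47) first, then the lower. */
	memcpy(dst + 32, hi, sizeof(hi));
	memcpy(dst, lo, sizeof(lo));
}
```

The result is the same as a plain 48B copy. Only the order in which the two 
stores are written differs, and that order is what we want to compare on Zen4.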

Note: We need some more time to cross-check caveats 2 and 3 above.
