I performed a micro-benchmark on my dual epyc (zen 2) server and version 1
wins for small values of n.

20 runs:

n       version       min  median    mean     max  stddev  noise%
-----------------------------------------------------------------------
n=1     version1     2.440   2.440   2.450   2.550   0.024    4.5%
n=1     version2     4.260   4.280   4.277   4.290   0.007    0.7%

n=2     version1     2.740   2.750   2.757   2.880   0.029    5.1%
n=2     version2     3.970   3.980   3.980   4.020   0.010    1.3%

n=4     version1     4.580   4.595   4.649   4.910   0.094    7.2%
n=4     version2     5.780   5.815   5.809   5.820   0.013    0.7%

But, micro-benchmarks always make me nervous, so I looked at the actual
instruction cost for my
platform given the version 1 and version 2 code.

If we count cpu cycles using the AMD Zen 2 instruction latency/throughput
tables:  version 1 (loop body)
has a critical path of ~5-6 cycles per iteration.  version 2 (loop body)
has ~3-4 cycles per iteration.

The problem for version 2 is that the call to memcpy is ~24-30 cycles due
to the stub + function call + return
and branch predictor pressure on first call.  This probably results in ~2.5
ns per iteration cost for version 2.

So, no I wouldn't call it an optimization.  But, it will be interesting to
hear other opinions on this.

--bg


On Mon, Mar 9, 2026 at 10:25 AM Ranier Vilela <[email protected]> wrote:

>
>
> Em seg., 9 de mar. de 2026 às 11:47, Bryan Green <[email protected]>
> escreveu:
>
>> I created an example that is a little bit closer to the actual code and
>> changed the compiler from C++ to C.
>>
>> It is interesting the optimization that the compiler has chosen for
>> version 1 versus version 2.  One calls
>> memcpy and one doesn't.  There is a good chance the inlining of memcpy as
>> SSE+scalar per iteration
>> will be faster for syscache scans-- which I believe are usually small
>> (1-4 keys?).
>>
> I doubt the inline version is better.
> Clang is supported too and the code generated is much better with memcpy
> one call outside of the loop.
>
>
>>
>> Probably the only reason to do this patch would be if N is normally large
>> or if this is considered an
>> improvement in code clarity without a detrimental impact on small N
>> syscache scans.
>> I realize you only said "possible small optimization".  It might be
>> worthwhile to benchmark the code for
>> different values of n to determine if there is a tipping point either way?
>>
>  In your opinion, shouldn't this be considered an optimization, even a
> small one?
>
> best regards,
> Ranier Vilela
>

Reply via email to