I don't think your version 1 memcpy is doing what you think it is doing.

On Thu, Mar 12, 2026 at 12:35 PM Ranier Vilela <[email protected]> wrote:
> Hi.
>
> On Mon, Mar 9, 2026 at 14:02, Bryan Green <[email protected]> wrote:
>
>> I performed a micro-benchmark on my dual EPYC (Zen 2) server and version
>> 1 wins for small values of n.
>>
>> 20 runs:
>>
>> n     version    min     median  mean    max     stddev  noise%
>> -----------------------------------------------------------------------
>> n=1   version1   2.440   2.440   2.450   2.550   0.024   4.5%
>> n=1   version2   4.260   4.280   4.277   4.290   0.007   0.7%
>>
>> n=2   version1   2.740   2.750   2.757   2.880   0.029   5.1%
>> n=2   version2   3.970   3.980   3.980   4.020   0.010   1.3%
>>
>> n=4   version1   4.580   4.595   4.649   4.910   0.094   7.2%
>> n=4   version2   5.780   5.815   5.809   5.820   0.013   0.7%
>>
>> But micro-benchmarks always make me nervous, so I looked at the actual
>> instruction cost for my platform given the version 1 and version 2 code.
>>
>> If we count CPU cycles using the AMD Zen 2 instruction latency/throughput
>> tables: version 1 (loop body) has a critical path of ~5-6 cycles per
>> iteration. version 2 (loop body) has ~3-4 cycles per iteration.
>>
>> The problem for version 2 is that the call to memcpy is ~24-30 cycles due
>> to the stub + function call + return and branch predictor pressure on
>> first call. This probably results in ~2.5 ns per iteration cost for
>> version 2.
>>
>> So, no, I wouldn't call it an optimization. But it will be interesting
>> to hear other opinions on this.
>>
> I ran quick and dirty tests with two versions:
> gcc 15.2.0
> gcc -O2 memcpy1.c -o memcpy1
>
> The first test was with 10000000 keys and 10000000 loops:
> version1: one memcpy call
> done in 1873 nanoseconds
>
> version2: inlined memcpy
> did not finish
>
> The second test was with 4 keys and 10000000 loops:
> version1: one memcpy call
> version2: inlined memcpy call
>
> version1: done in 1519 nanoseconds
> version2: done in 104981851 nanoseconds
> (1.44692e-05 times faster)
>
> version1: done in 1979 nanoseconds
> version2: done in 110568901 nanoseconds
> (1.78983e-05 times faster)
>
> version1: done in 1814 nanoseconds
> version2: done in 108555484 nanoseconds
> (1.67103e-05 times faster)
>
> version1: done in 1631 nanoseconds
> version2: done in 109867919 nanoseconds
> (1.48451e-05 times faster)
>
> version1: done in 1269 nanoseconds
> version2: done in 111639106 nanoseconds
> (1.1367e-05 times faster)
>
> Unless I'm doing something wrong, the single memcpy call wins!
> memcpy1.c attached.
>
> best regards,
> Ranier Vilela
>
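
A note on the numbers quoted above: 10000000 loops finishing in 1519 nanoseconds is about 0.00015 ns per loop, far below a single clock cycle on any current CPU, so the copy is very likely not being executed once per iteration. One possible explanation, and this is only a guess since memcpy1.c is not reproduced here, is that at -O2 the compiler sees that the destination is never read and removes the memcpy or hoists it out of the loop. The sketch below shows one common way to keep the copy alive in a timing loop. Everything in it (bench_key, NKEYS, keep_alive, bench_one_memcpy, bench_per_key_loop) is a hypothetical name chosen for illustration and is not taken from memcpy1.c; the empty asm barrier is a GCC/Clang extension.

```c
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime under strict -std= modes */
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical key type and key count, chosen only for this sketch. */
typedef int64_t bench_key;
#define NKEYS 4

/*
 * GCC/Clang extension: an empty asm statement with a "memory" clobber.
 * The compiler must assume memory may be read or written here, so it
 * cannot delete the preceding stores or hoist the copy out of the loop.
 */
static inline void
keep_alive(void *p)
{
    __asm__ volatile("" : : "r"(p) : "memory");
}

/* Difference between two timestamps in nanoseconds. */
static int64_t
elapsed_ns(struct timespec t0, struct timespec t1)
{
    return (int64_t) (t1.tv_sec - t0.tv_sec) * 1000000000LL
        + (t1.tv_nsec - t0.tv_nsec);
}

/* Time `loops` copies of NKEYS keys, one memcpy call per iteration. */
static int64_t
bench_one_memcpy(const bench_key *src, bench_key *dst, long loops)
{
    struct timespec t0, t1;

    assert(loops > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < loops; i++)
    {
        memcpy(dst, src, NKEYS * sizeof(bench_key));
        keep_alive(dst);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ns(t0, t1);
}

/* Time `loops` copies of NKEYS keys, one assignment per key. */
static int64_t
bench_per_key_loop(const bench_key *src, bench_key *dst, long loops)
{
    struct timespec t0, t1;

    assert(loops > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < loops; i++)
    {
        for (int k = 0; k < NKEYS; k++)
            dst[k] = src[k];
        keep_alive(dst);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ns(t0, t1);
}
```

With a barrier like this in place, the reported time should grow roughly in proportion to the loop count for both variants. If a run of 10000000 loops still comes back in a microsecond or two, the copy is probably still being optimized away, and looking at the generated assembly (gcc -O2 -S) is the more reliable check.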
