On Sunday, 10 June 2018 at 22:23:08 UTC, Walter Bright wrote:
> On 6/10/2018 11:16 AM, David Nadlinger wrote:
>> Because of the large amounts of noise, the only conclusion one
>> can draw from this is that memcpyD is the slowest,
>
> Probably because it does a memory allocation.

Of course; that was already pointed out earlier in the thread.

> The CPU makers abandoned optimizing the REP instructions
> decades ago, and just left the clunky implementations there for
> backwards compatibility.

That's not entirely true. Intel started optimising some of the
REP string instructions again on Ivy Bridge and later. There is a
CPUID feature bit to indicate that (ERMS, "Enhanced REP MOVSB/STOSB",
I think); I'm sure the Intel Optimization Manual has further details.
From what I remember, `rep movsb` is
supposed to beat an AVX loop on most recent Intel µarchs if the
destination is aligned and the data is longer than a few cache
lines. I've never measured that myself, though.
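
For anyone who wants to try this, here is a rough, unbenchmarked sketch in C of checking the ERMS bit and dispatching on it. As far as I know the flag lives in CPUID leaf 7, subleaf 0, EBX bit 9. The sketch assumes GCC or Clang (`<cpuid.h>`, `__get_cpuid_count`, extended inline asm). The function names are made up for illustration, and the 256-byte cutoff is an arbitrary placeholder for "a few cache lines", not a measured value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* GCC/Clang helper header; MSVC would need __cpuidex */
#define HAVE_X86_CPUID 1
#else
#define HAVE_X86_CPUID 0
#endif

/* Assumed cutoff (4 x 64-byte cache lines); tune by measurement. */
#define REP_MOVSB_MIN_BYTES 256u

/* ERMS is reported in CPUID.(EAX=7,ECX=0):EBX bit 9. */
int erms_bit_set(unsigned ebx) {
    return (int)((ebx >> 9) & 1u);
}

/* Returns 1 if the CPU advertises ERMS, 0 otherwise (also on non-x86). */
int cpu_has_erms(void) {
#if HAVE_X86_CPUID
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    /* __get_cpuid_count returns 0 when leaf 7 is not supported. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return erms_bit_set(ebx);
#else
    return 0;
#endif
}

/* Forward byte copy using `rep movsb`; falls back to memcpy off x86.
 * Like memcpy, the buffers must not overlap. */
void *copy_rep_movsb(void *dst, const void *src, size_t n) {
    assert(n == 0 || (dst != NULL && src != NULL));
#if HAVE_X86_CPUID
    void *ret = dst;
    /* RDI = destination, RSI = source, RCX = count; the ABI guarantees
     * the direction flag is clear on function entry. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return ret;
#else
    return memcpy(dst, src, n);
#endif
}

/* Hypothetical dispatcher: use `rep movsb` only when ERMS is present
 * and the copy is at least REP_MOVSB_MIN_BYTES long. */
void *copy_bytes(void *dst, const void *src, size_t n) {
    static int has_erms = -1; /* -1 = not yet queried */
    if (has_erms < 0)
        has_erms = cpu_has_erms();
    if (has_erms && n >= REP_MOVSB_MIN_BYTES)
        return copy_rep_movsb(dst, src, n);
    return memcpy(dst, src, n);
}
```

Calling `copy_bytes(dst, src, len)` in place of `memcpy` would be enough for a first comparison. A real benchmark would also have to control for destination alignment, which this sketch ignores.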
— David