On Sunday, 10 June 2018 at 22:23:08 UTC, Walter Bright wrote:
> On 6/10/2018 11:16 AM, David Nadlinger wrote:
>> Because of the large amounts of noise, the only conclusion one
>> can draw from this is that memcpyD is the slowest,
>
> Probably because it does a memory allocation.
>
>> followed by the ASM implementation.
>
> The CPU makers abandoned optimizing the REP instructions
> decades ago, and just left the clunky implementations there for
> backwards compatibility.
>
>> In fact, memcpyC and memcpyNaive produce exactly the same
>> machine code (without bounds checking), as LLVM recognizes the
>> loop and lowers it into a memcpy. memcpyDstdAlg instead gets
>> turned into a vectorized loop, for reasons I didn't
>> investigate any further.
>
> This amply illustrates my other point that looking at the
> assembler generated is crucial to understanding what's
> happening.

On some CPU architectures (for example, Intel Atom), rep movsb is
the fastest memcpy.
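
For anyone who wants to measure this on their own hardware, here is a minimal sketch of a rep movsb copy. It is written in C with GCC/Clang extended asm rather than D, and the name copy_rep_movsb is made up for illustration. It assumes x86 or x86-64 and falls back to plain memcpy everywhere else. It says nothing about which variant wins on a given CPU; that still has to be benchmarked.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the benchmark in this thread):
   copies n bytes from src to dst and returns the original dst. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* rep movsb copies (E/R)CX bytes from [(E/R)SI] to [(E/R)DI].
       All three registers are modified, hence the "+" constraints.
       The direction flag is assumed clear, as the System V and
       Windows ABIs guarantee at function entry. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    /* Assumption: on other compilers or targets, use the library memcpy. */
    memcpy(dst, src, n);
#endif
    return ret;
}
```

It has the same contract as memcpy (no overlapping buffers), so it can be dropped into a benchmark next to the other variants.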