On 6/10/2018 11:16 AM, David Nadlinger wrote:
> Because of the large amounts of noise, the only conclusion one can draw from this is that memcpyD is the slowest,

Probably because it does a memory allocation.

> followed by the ASM implementation.

The CPU makers abandoned optimizing the REP instructions decades ago, and just left the clunky implementations there for backwards compatibility.

> In fact, memcpyC and memcpyNaive produce exactly the same machine code (without bounds checking), as LLVM recognizes the loop and lowers it into a memcpy. memcpyDstdAlg instead gets turned into a vectorized loop, for reasons I didn't investigate any further.

This amply illustrates my other point that looking at the assembler generated is crucial to understanding what's happening.
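
For anyone who wants to check this themselves, here is a minimal sketch of the two variants under discussion. The names copyNaive and copyC are hypothetical stand-ins, not the benchmark's actual source, and whether the optimizer really turns the loop into a memcpy call depends on the compiler version, the flags, and what it can prove about aliasing, so the only reliable check is to read the output.

```d
// Hypothetical stand-ins for the benchmark's memcpyNaive and memcpyC;
// the real benchmark source may differ.
void copyNaive(ubyte[] dst, const(ubyte)[] src)
{
    assert(dst.length == src.length);
    // Plain element-by-element loop. An optimizer may recognize this
    // idiom and replace it with a library call or a vectorized loop.
    foreach (i; 0 .. src.length)
        dst[i] = src[i];
}

void copyC(ubyte[] dst, const(ubyte)[] src)
{
    import core.stdc.string : memcpy;

    assert(dst.length == src.length);
    // Direct call into the C runtime's memcpy.
    memcpy(dst.ptr, src.ptr, src.length);
}

void main()
{
    auto src = new ubyte[64];
    foreach (i, ref x; src)
        x = cast(ubyte) i;

    auto a = new ubyte[64];
    auto b = new ubyte[64];
    copyNaive(a, src);
    copyC(b, src);

    // Both variants must produce an identical copy.
    assert(a == src);
    assert(b == src);
}
```

Assuming the file is saved as copy.d, something like `ldc2 -O3 -release -boundscheck=off -output-s copy.d` should write the assembly to copy.s, where the bodies of the two functions can be compared side by side.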
