On Sunday, 10 June 2018 at 22:23:08 UTC, Walter Bright wrote:
On 6/10/2018 11:16 AM, David Nadlinger wrote:
Because of the large amounts of noise, the only conclusion one can draw from this is that memcpyD is the slowest,

Probably because it does a memory allocation.

Of course; that was already pointed out earlier in the thread.

The CPU makers abandoned optimizing the REP instructions decades ago, and just left the clunky implementations there for backwards compatibility.

That's not entirely true. Intel started optimising some of the REP string instructions again on Ivy Bridge and above. There is a CPUID bit to indicate that (ERMS, Enhanced REP MOVSB/STOSB); I'm sure the Optimization Manual has further details. From what I remember, `rep movsb` is supposed to beat an AVX loop on most recent Intel µarchs if the destination is aligned and the data is longer than a few cache lines. I've never measured that myself, though.
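For anyone who wants to try this, here is a rough sketch in C of checking the feature bit and issuing a plain `rep movsb` copy. It assumes GCC or Clang on x86 (the `<cpuid.h>` helper `__get_cpuid_count` and extended inline asm); the ERMS flag is CPUID leaf 7, subleaf 0, EBX bit 9. The names `has_erms` and `copy_rep_movsb` are made up for illustration, and this is not how memcpyD or any libc implements its copy. On non-x86 targets the sketch just falls back to `memcpy`.

```c
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>            /* GCC/Clang CPUID helpers (assumed toolchain) */
#define SKETCH_X86 1
#else
#define SKETCH_X86 0
#endif

/* Hypothetical helper: returns 1 if CPUID reports ERMS
 * (leaf 7, subleaf 0, EBX bit 9), otherwise 0. */
static int has_erms(void)
{
#if SKETCH_X86
    unsigned int eax, ebx, ecx, edx;
    /* __get_cpuid_count returns 0 if leaf 7 is not supported. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ebx >> 9) & 1u);
#else
    return 0;
#endif
}

/* Hypothetical helper: forward copy of n bytes using `rep movsb`.
 * Relies on the ABI guarantee that the direction flag is clear. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if SKETCH_X86
    void *ret = dst;
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n) /* (E/R)DI, (E/R)SI, (E/R)CX */
                     :
                     : "memory");
    return ret;
#else
    return memcpy(dst, src, n);   /* fallback on non-x86 targets */
#endif
}
```

A benchmark could call `has_erms()` once at startup and only include the `copy_rep_movsb` variant in the comparison when it returns 1. Whether it actually wins over an AVX loop for a given size and alignment would still have to be measured.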

 — David
