Re: Replacing C's memcpy with a D implementation

2018-06-17 Thread David Nadlinger via Digitalmars-d

On Monday, 11 June 2018 at 03:34:59 UTC, Basile B. wrote:
- default linux: 
https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c


To see what is executed when you call memcpy() on a regular 
GNU/Linux distro, you'd want to have a look at glibc instead. For 
example, the AVX2 and AVX512 implementations are in


   sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S

and

  sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S,

as forwarded to by memmove-avx2-unaligned-erms.S and 
memmove-avx512-unaligned-erms.S.


(Pop quiz: Why might a separate "no-vzeroupper" variant be a good 
idea?)


 — David




Re: Replacing C's memcpy with a D implementation

2018-06-17 Thread David Nadlinger via Digitalmars-d

On Monday, 11 June 2018 at 08:02:42 UTC, Walter Bright wrote:

On 6/10/2018 9:44 PM, Patrick Schluter wrote:

See what Agner Fog has to say about it:


Thanks. Agner Fog gets the last word on this topic!


Well, Agner is rarely wrong indeed, but there is a limit to how 
much material a single person can keep up to date.


On newer uarchs, `rep movsb` isn't slower than `rep movsd`, and 
often performs similar to the best SSE2 implementation (using NT 
stores). See "BeeOnRope"'s answer to this StackOverflow question 
for an in-depth discussion about this: 
https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy


AVX2 seems to offer extra performance beyond that, though, if it 
is available (for example if runtime feature detection is used). 
I believe I read a comment by Agner somewhere to that effect as 
well – a search engine will certainly be able to turn up more.


 — David


Re: Replacing C's memcpy with a D implementation

2018-06-11 Thread Mike Franklin via Digitalmars-d

On Monday, 11 June 2018 at 18:34:58 UTC, Johannes Pfau wrote:

I understand that you actually need to reimplement memcpy, as 
in your microcontroller usecase you don't want to have any C 
runtime. So you'll basically have to rewrite the C runtime 
parts D depends on.


However, I think for memcpy and similar functions you're 
probably better off keeping the C interface. This directly 
provides the benefit of compiler intrinsics/optimizations. And 
marking memcpy as nothrow/pure/ system/nogc is simple* either 
way. For the D implementation, the compiler will verify this 
for you, for the C implementation, you have to mark the 
function depending on the C implementation. But that's mostly 
trivial.


My plans go beyond microcontrollers.  Mostly, I'd like to be able 
to use more features of D without having to link in a pre-built 
runtime.  This is especially convenient for cross-compiling 
scenarios.  By replacing the runtime hooks with templates, and 
the software building blocks with D implementations, we'd no 
longer need to obtain a C toolchain and a pre-compiled druntime 
library just  to get a build for our target.  We'd just need a 
cross-compiler like LDC or GDC.  After adding druntime to the 
import path, we'd have everything we need for our target.  Try, 
today, to create an LDC cross-compiler for a Windows Host and an 
ARM target like the Raspberry Pi.  You'll quickly realize what a 
hassle it all is.


Another issue with memcpy being generated by compilers is noone 
knows what it's actually doing without building and looking at 
the assembly.  And the assembly will change based what's being 
built.  Have you tried to look at the GCC implementation.  I 
have, and I never want to again.  If the implementation were 
templated in D, we'd be able to see exactly what code would be 
generated when by simply viewing the D runtime source code.  It'd 
be easier to understand, predict, port, and enhance.


There isn't one single game-changing benefit this endeavor would 
bring, and I guess that's why it's difficult to comprehend, but 
collectively, all the little things it would enable would make it 
quite compelling.


Mike



Re: Replacing C's memcpy with a D implementation

2018-06-11 Thread Walter Bright via Digitalmars-d

https://github.com/dlang/druntime/pull/2213


Re: Replacing C's memcpy with a D implementation

2018-06-11 Thread Walter Bright via Digitalmars-d

On 6/11/2018 11:17 AM, Guillaume Piolat wrote:

I don't know if someone really wrote this code, or if it was all from 
intrinsics.


memcpy is so critical to success it is likely written by Intel itself to ensure 
every drop of perf is wrung out of the CPU.


I was Intel CEO I'd direct the CPU hardware guys to do this and give it away.


Re: Replacing C's memcpy with a D implementation

2018-06-11 Thread Johannes Pfau via Digitalmars-d
Am Mon, 11 Jun 2018 10:54:23 + schrieb Mike Franklin:

> On Monday, 11 June 2018 at 10:38:30 UTC, Mike Franklin wrote:
>> On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:
>>
 I think there might also be optimization opportunities using
 templates, metaprogramming, and type introspection, that are not
 currently possible with the current design.
>>>
>>> Just making it a template doesn't automatically enable any of this.
>>
>> I think it does, because I can then generate specific code based on the
>> type information at compile-time.
> 
> Also, before you do any more nay-saying, you might want to revisit this
> talk https://www.youtube.com/watch?v=endKC3fDxqs which demonstrates
> precisely the kind of benefits that can be achieved with these kinds of
> changes to the compiler/runtime interface.
> 
> Mike

I guess for most D runtime hooks, using templates is a good idea to 
enable inlining and further optimizations.

I understand that you actually need to reimplement memcpy, as in your 
microcontroller usecase you don't want to have any C runtime. So you'll 
basically have to rewrite the C runtime parts D depends on.

However, I think for memcpy and similar functions you're probably better 
off keeping the C interface. This directly provides the benefit of 
compiler intrinsics/optimizations. And marking memcpy as nothrow/pure/
system/nogc is simple* either way. For the D implementation, the compiler 
will verify this for you, for the C implementation, you have to mark the 
function depending on the C implementation. But that's mostly trivial.

On a related note, I agree that the compiler sometimes cheats by ignoring 
attributes, especially when calling TypeInfo related functions, and this 
is a huge problem. Runtime TypeInfo is not descriptive enough to fully 
represent the types and whenever the compiler the casts without properly 
checking first, there's the possibility of a problem.

-- 
Johannes


Re: Replacing C's memcpy with a D implementation

2018-06-11 Thread Guillaume Piolat via Digitalmars-d
BTW the way memcpy is(was?) implemented in the C runtime coming 
from the Inter C++ compiler was really enlightening on the sheer 
difficulty of such a task.


First of all there isn't one loop but many depending on the 
source and destination alignment.


- If both are aligned on 16-byte boundaries, source and 
destination operand would be with MOVAPS/MOVDQA, nothing special
- If only the source or destination was misaligned, the function 
would dispatch to a variant with the core loop loading 16-byte 
aligned and writing 16-byte unaligned, with the PALIGNR 
instruction. However, since PALIGNR can't take a runtime value, 
this variant was _replicated 16 times_.
- I don't remember for both source and destination misaligned but 
you can degenerate this case to the above one.


Each of this loop had complicated loop preludes that do the first 
iteration, and they are so hard to do by hand.


It was also the only piece of assembly I've seen that 
(apparently) successfully used the "prefetch" instructions.


This was just the SSE version, AVX was different.

I don't know if someone really wrote this code, or if it was all 
from intrinsics.


Re: Replacing C's memcpy with a D implementation

2018-06-11 Thread Walter Bright via Digitalmars-d

On 6/11/2018 6:00 AM, Steven Schveighoffer wrote:

No, __doPostblit is necessary -- you are making a copy.

example:

File[] fs = new File[5];

fs[0] = ...; // initialize fs
auto fs2 = fs;
fs.length = 100;

At this point, fs points at a separate block from fs2. If you did not do 
postblit on this, then when one of those arrays is destroyed, the other will 
become invalid, as the File reference count would go to 0.


Yes, you're right. This should probably go as a comment in the code in case it 
comes up again.


Re: Replacing C's memcpy with a D implementation

2018-06-11 Thread Steven Schveighoffer via Digitalmars-d

On 6/11/18 4:00 AM, Walter Bright wrote:

(I notice it is doing __doPostblit(). This looks wrong, D allows data to 
be moved. As far as I can tell with a perfunctory examination, that's 
the only "can throw" bit.)


No, __doPostblit is necessary -- you are making a copy.

example:

File[] fs = new File[5];

fs[0] = ...; // initialize fs
auto fs2 = fs;
fs.length = 100;

At this point, fs points at a separate block from fs2. If you did not do 
postblit on this, then when one of those arrays is destroyed, the other 
will become invalid, as the File reference count would go to 0.


-Steve


Re: Replacing C's memcpy with a D implementation

2018-06-11 Thread Mike Franklin via Digitalmars-d

On Monday, 11 June 2018 at 10:38:30 UTC, Mike Franklin wrote:

On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:

I think there might also be optimization opportunities using 
templates, metaprogramming, and type introspection, that are 
not currently possible with the current design.


Just making it a template doesn't automatically enable any of 
this.


I think it does, because I can then generate specific code 
based on the type information at compile-time.


Also, before you do any more nay-saying, you might want to 
revisit this talk https://www.youtube.com/watch?v=endKC3fDxqs 
which demonstrates precisely the kind of benefits that can be 
achieved with these kinds of changes to the compiler/runtime 
interface.


Mike



Re: Replacing C's memcpy with a D implementation

2018-06-11 Thread Mike Franklin via Digitalmars-d

On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:

I think there might also be optimization opportunities using 
templates, metaprogramming, and type introspection, that are 
not currently possible with the current design.


Just making it a template doesn't automatically enable any of 
this.


I think it does, because I can then generate specific code based 
on the type information at compile-time.


So there are other reasons to pursue a template design for all 
of our runtime hooks, and if our memcpy, memcmp, etc.. are 
also templated it't turtles all the way down.


memcpy and memcmp are already handled specially by modern 
compilers, have been for decades; I seriously doubt there's 
more oil in that well.


Yes, I may fail to achieve any performance benefits, but even if 
I achieve parity with the current implementation it will still be 
a win for me because I will no longer have to depend on the C 
standard library.  And if I have to fork to use D that way, then 
that's what I'm going to do.


Mike




Re: Replacing C's memcpy with a D implementation

2018-06-11 Thread Mike Franklin via Digitalmars-d

On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:

We have no design for this function that doesn't rely on the 
GC, and the GC needs TypeInfo. This function is not usable with 
betterC with or without the TypeInfo argument.


I understand that.  I was using `_d_arraysetlengthT` as an 
example of problems I encountered trying to convert these runtime 
hooks into templates, specifically illustrating how the compiler 
does not enforce the guarantees of safety, nothrow, and pure, 
and, in fact, completely disregards them.


Converting *any* runtime hook, not just `_d_arraysetlengthT`, to 
a template comes with a whole host of problems because of the 
aforementioned issue.  As soon as it's implemented as a template 
it is semantically analyzed in the context of its caller, so 
@safe, nothrow, pure, etc. all need to accommodated for.  That's 
good to have the compiler check for that stuff.  It's unfortunate 
the current implementation does not, and that's the fault in the 
current implementation.


There are other runtime hooks that are good candidates for 
templates, even if `_d_arraysetlengthT` isn't one of them.  
Again, I just chose `_d_arraysetlengthT` as an example, because 
it is the one I had experience with.


Mike


Re: Replacing C's memcpy with a D implementation

2018-06-11 Thread Walter Bright via Digitalmars-d

On 6/11/2018 1:12 AM, Mike Franklin wrote:

On Monday, 11 June 2018 at 08:00:10 UTC, Walter Bright wrote:

Making it a template is not really necessary. The compiler knows if there is 
the possibility of it throwing based on the type, it doesn't need to infer it.


There are other reasons to make it a template, though.  For example, if it were 
a template, it would not rely on `TypeInfo` and could then be used in 
-betterC-like situations.


We have no design for this function that doesn't rely on the GC, and the GC 
needs TypeInfo. This function is not usable with betterC with or without the 
TypeInfo argument.



I think there might also be optimization opportunities using templates, 
metaprogramming, and type introspection, that are not currently possible with 
the current design.


Just making it a template doesn't automatically enable any of this.


So there are other reasons to pursue a template design for all of our runtime 
hooks, and if our memcpy, memcmp, etc.. are also templated it't turtles all the 
way down.


memcpy and memcmp are already handled specially by modern compilers, have been 
for decades; I seriously doubt there's more oil in that well.


Re: Replacing C's memcpy with a D implementation

2018-06-11 Thread Mike Franklin via Digitalmars-d

On Monday, 11 June 2018 at 08:05:14 UTC, Walter Bright wrote:

On 6/10/2018 8:34 PM, Basile B. wrote:
- default win32 OMF: 
https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.C


I think you mean:

https://github.com/DigitalMars/dmc/blob/master/src/CORE32/MEMCPY.ASM


Cool! and it's even boost licensed.  I think I'll try to port 
that next, before I try the SSE and AVX implementations.


Mike


Re: Replacing C's memcpy with a D implementation

2018-06-11 Thread Mike Franklin via Digitalmars-d

On Monday, 11 June 2018 at 08:00:10 UTC, Walter Bright wrote:

Making it a template is not really necessary. The compiler 
knows if there is the possibility of it throwing based on the 
type, it doesn't need to infer it.


There are other reasons to make it a template, though.  For 
example, if it were a template, it would not rely on `TypeInfo` 
and could then be used in -betterC-like situations.  For setting 
the length of an array, there are issues with memory allocation 
that hinder that, but I'm thinking beyond just setting the length 
of an array here; I just used that as an illustrative example.


I think there might also be optimization opportunities using 
templates, metaprogramming, and type introspection, that are not 
currently possible with the current design.


So there are other reasons to pursue a template design for all of 
our runtime hooks, and if our memcpy, memcmp, etc.. are also 
templated it't turtles all the way down.  We have to be careful 
about code bloat with tempates, but I think that can be mitigated.


Mike


Re: Replacing C's memcpy with a D implementation

2018-06-11 Thread Walter Bright via Digitalmars-d

On 6/10/2018 8:34 PM, Basile B. wrote:
- default win32 OMF: 
https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.C


I think you mean:

https://github.com/DigitalMars/dmc/blob/master/src/CORE32/MEMCPY.ASM


Re: Replacing C's memcpy with a D implementation

2018-06-11 Thread Walter Bright via Digitalmars-d

On 6/10/2018 8:43 PM, Mike Franklin wrote:
That only addresses the @safe attribute, and that code is much too complex for 
anyone to audit it and certify it as safe.


Exceptions are also not all handled, so there is no way it can pass as nothrow.

The runtime call needs to be replaced with a template that can take advantage of 
D's compile-time type information instead of `TypeInfo`, but doing so will force 
the implementation through semantic processing, which, in it's current state, 
will fail to compile.  It needs to be refactored, and creating @safe, nothrow, 
pure building blocks will help with that, hence the topic of this thread.


Making it a template is not really necessary. The compiler knows if there is the 
possibility of it throwing based on the type, it doesn't need to infer it.


(I notice it is doing __doPostblit(). This looks wrong, D allows data to be 
moved. As far as I can tell with a perfunctory examination, that's the only "can 
throw" bit.)


Doing without TypeInfo is a worthy goal most of the time, but I don't think it 
is of much value here because it calls the garbage collector anyway which 
requires TypeInfo.


Some obvious refactoring tasks for it would be to:

1. eliminate the gotos
2. use single assignment for variables
3. minimize the scopes of variables
4. use const
5. document the inputs and outputs
6. add test cases for the red lines


Re: Replacing C's memcpy with a D implementation

2018-06-11 Thread Walter Bright via Digitalmars-d

On 6/10/2018 9:44 PM, Patrick Schluter wrote:

See what Agner Fog has to say about it:


Thanks. Agner Fog gets the last word on this topic!


Re: Replacing C's memcpy with a D implementation

2018-06-11 Thread Basile B. via Digitalmars-d

On Monday, 11 June 2018 at 03:34:59 UTC, Basile B. wrote:

On Monday, 11 June 2018 at 01:03:16 UTC, Mike Franklin wrote:

[...]


- default win32 OMF: 
https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.C
- default linux: 
https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c
- not used but interesting: 
https://github.com/esmil/musl/blob/master/src/string/memcpy.c


D slice style memcpy cg:

https://github.com/dlang/dmd/blob/master/src/dmd/backend/cod2.c#L3526


Re: Replacing C's memcpy with a D implementation

2018-06-11 Thread Basile B. via Digitalmars-d

On Monday, 11 June 2018 at 03:34:59 UTC, Basile B. wrote:

On Monday, 11 June 2018 at 01:03:16 UTC, Mike Franklin wrote:

[...]


- default win32 OMF: 
https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.C
- default linux: 
https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c
- not used but interesting: 
https://github.com/esmil/musl/blob/master/src/string/memcpy.c


D slice style memcpy cg:

https://github.com/dlang/dmd/blob/master/src/dmd/backend/cod2.c#L3526


Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Patrick Schluter via Digitalmars-d

On Sunday, 10 June 2018 at 13:45:54 UTC, Mike Franklin wrote:

On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:



memcpyD: 1 ms, 725 μs, and 1 hnsec
memcpyD2: 587 μs and 5 hnsecs
memcpyASM: 119 μs and 5 hnsecs

Still, the ASM version is much faster.



rep movsd is very CPU dependend and needs some precondtions to be 
fast. For relative short memory blocks it sucks on many other CPU 
than the last Intel.


See what Agner Fog has to say about it:

16.10
String instructions (all processors)
String instructions without a repeat prefix are too slow and 
should be replaced by simpler instructions. The same applies to 
LOOP on all processors and to JECXZ
on some processors. REP MOVSD andREP STOSD are quite fast if the 
repeat count is not too small. Always use the largest word size 
possible (DWORDin 32-bit mode, QWORD in 64-bit mode), and make 
sure that both source and destination are aligned by the word 
size. In many cases, however, it is faster to use XMM registers. 
Moving data in XMM registers is faster than REP MOVSD and REP 
STOSD
in most cases, especially on older processors. See page 164 for 
details.
Note that while the REP MOVS instruction writes a word to the 
destination, it reads the next word from the source in the same 
clock cycle. You can have a cache bank conflict if bit 2-4 are 
the same in these two addresses on P2 and P3. In other words, you 
will get a penalty of one clock extra per iteration if ESI
+WORDSIZE-EDI is divisible by 32. The easiest way to avoid cache 
bank conflicts is to align both source and destination by 8. 
Never use MOVSB or MOVSW
in optimized code, not even in 16-bit mode. On many processors, 
REP MOVS and REP STOS can perform fast by moving 16 bytes or an 
entire cache line at a time
. This happens only when certain conditions are met. Depending on 
the processor, the conditions for fast string instructions are, 
typically, that the count must
be high, both source and destination must be aligned, the 
direction must be forward, the distance between source and 
destination must be at least the cache line size, and the memory 
type for both source and destination must be either write-back or 
write-combining (you can normally assume the latter condition is 
met). Under these conditions, the speed is as high as you can 
obtain with vector register moves or even faster on some 
processors.
While the string instructions can be quite convenient, it must be 
emphasized that other solutions are faster in many cases. If the 
above conditions for fast move are not met then there is a lot to 
gain by using other methods. See page 164 for alternatives to REP 
MOVS





Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Mike Franklin via Digitalmars-d

On Monday, 11 June 2018 at 03:31:05 UTC, Walter Bright wrote:

On 6/10/2018 7:49 PM, Mike Franklin wrote:

On Sunday, 10 June 2018 at 15:12:27 UTC, Kagamin wrote:

If the compiler can't get it right then who can?
The compiler implementation is faulty.  It rewrites the 
expressions to an `extern(C)` runtime implementation that is 
not @safe, nothrow, or pure: 
https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442  The backend is not involved.


It's only faulty if there's a bug in it. It's essentially 
@trusted code.


That only addresses the @safe attribute, and that code is much 
too complex for anyone to audit it and certify it as safe.


Exceptions are also not all handled, so there is no way it can 
pass as nothrow.


The runtime call needs to be replaced with a template that can 
take advantage of D's compile-time type information instead of 
`TypeInfo`, but doing so will force the implementation through 
semantic processing, which, in it's current state, will fail to 
compile.  It needs to be refactored, and creating @safe, nothrow, 
pure building blocks will help with that, hence the topic of this 
thread.


Mike


Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Basile B. via Digitalmars-d

On Monday, 11 June 2018 at 01:03:16 UTC, Mike Franklin wrote:
I've modified the test based on the feedback so far, so here's 
what it looks like now:


import std.datetime.stopwatch;
import std.stdio;
import core.stdc.string;
import std.random;
import std.algorithm;

enum length = 4096 * 2;

void init(ref ubyte[] a)
{
a.length = length;

for(int i = 0; i < length; i++)
{
a[i] = uniform!ubyte;
}
}

void verifyResults(ubyte[] a, ubyte[] b)
{
assert(memcmp(a.ptr, b.ptr, length) == 0);
}

void memcpyD(ubyte[] dst, ubyte[] src)
{
dst[] = src[];
}

void memcpyDstdAlg(ubyte[] dst, ubyte[] src)
{
copy(src, dst);
}

void memcpyC(ubyte[] dst, ubyte[] src)
{
memcpy(dst.ptr, src.ptr, length);
}

void memcpyNaive(ubyte[] dst, ubyte[] src)
{
for(int i = 0; i < length; i++)
{
dst[i] = src[i];
}
}

void memcpyASM(ubyte[] dst, ubyte[] src)
{
auto s = src.ptr;
auto d = dst.ptr;
size_t len = length;
asm pure nothrow @nogc
{
mov RSI, s;
mov RDI, d;
cld;
mov RCX, len;
rep;
movsb;
}
}

Duration benchmark(alias f)(ubyte[] dst, ubyte[] src, uint n)
{
Duration result;
auto sw = StopWatch(AutoStart.yes);

sw.reset();
foreach (_; 0 .. n)
{
f(dst, src);
}
result = sw.peek();

return result;
}

void main()
{
ubyte[] src;
ubyte[] dst;

// verify the integrity of the algorithm
init(src);
init(dst);
memcpyD(dst, src);
verifyResults(dst, src);

init(src);
init(dst);
memcpyDstdAlg(dst, src);
verifyResults(dst, src);

init(src);
init(dst);
memcpyC(dst, src);
verifyResults(dst, src);

init(src);
init(dst);
memcpyNaive(dst, src);
verifyResults(dst, src);

init(src);
init(dst);
memcpyASM(dst, src);
verifyResults(dst, src);

// test the performance of the algorithm
enum iterations = 1000;
writeln("memcpyD: ", benchmark!memcpyD(dst, src, 
iterations));
writeln("memcpyDstdAlg: ", benchmark!memcpyDstdAlg(dst, 
src, iterations));
writeln("memcpyC: ", benchmark!memcpyC(dst, src, 
iterations));
writeln("memcpyNaive: ", benchmark!memcpyNaive(dst, src, 
iterations));
writeln("memcpyASM: ", benchmark!memcpyASM(dst, src, 
iterations));

}

The results on my Windows 10 machine (Intel Core i7-6700, 
3.4GHz):

memcpyD: 127 ╬╝s and 3 hnsecs
memcpyDstdAlg: 195 ╬╝s and 9 hnsecs
memcpyC: 126 ╬╝s and 7 hnsecs
memcpyNaive: 17 ms, 974 ╬╝s, and 9 hnsecs
memcpyASM: 122 ╬╝s and 8 hnsecs
(Gotta love how windows displays μ)

The results running on Arch Linux 64-bit in a VirtualBox on the 
same Windows 10 machine:

memcpyD: 409 μs
memcpyDstdAlg: 400 μs
memcpyC: 404 μs and 4 hnsecs
memcpyNaive: 17 ms, 251 μs, and 6 hnsecs
memcpyASM: 162 μs and 8 hnsecs

The results appear more sane now, but it seems the behavior is 
highly platform dependent.  Still the ASM is doing well for my 
hardware.  If I run the test multiple times, I do see a lot of 
noise in the results, but each test seems to be affected 
proportionally, so I'm gaining a little more confidence in the 
benchmark.


I still need to analyze the assembly of C's memcpy (anyone know 
where I can find the source code?),


- default win32 OMF: 
https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.C
- default linux: 
https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c
- not used but interesting: 
https://github.com/esmil/musl/blob/master/src/string/memcpy.c




Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Walter Bright via Digitalmars-d

On 6/10/2018 7:49 PM, Mike Franklin wrote:

On Sunday, 10 June 2018 at 15:12:27 UTC, Kagamin wrote:

If the compiler can't get it right then who can?
The compiler implementation is faulty.  It rewrites the expressions to an 
`extern(C)` runtime implementation that is not @safe, nothrow, or pure: 
https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442  
The backend is not involved.


It's only faulty if there's a bug in it. It's essentially @trusted code.


Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Mike Franklin via Digitalmars-d

On Monday, 11 June 2018 at 02:49:00 UTC, Mike Franklin wrote:

The compiler implementation is faulty.  It rewrites the 
expressions to an `extern(C)` runtime implementation that is 
not @safe, nothrow, or pure:  
https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442  The backend is not involved.


Also, understand that this lowering happens in the IR generation 
stage:  
https://github.com/dlang/dmd/blob/3a79629988efd51d4dda9edb38a6701cd097da89/src/dmd/e2ir.d#L2616 so semantic analysis is not performed on the runtime call.  If it were, it would fail to compile.


Mike




Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Mike Franklin via Digitalmars-d

On Sunday, 10 June 2018 at 15:12:27 UTC, Kagamin wrote:

On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
There are many reasons to do this, one of which is to leverage 
information available at compile-time and in D's type system 
(type sizes, alignment, etc...) in order to optimize the 
implementation of these functions, and allow them to be used 
from @safe code.


In safe code you just use assignment and array ops, backend 
does the rest.


On Sunday, 10 June 2018 at 13:27:04 UTC, Mike Franklin wrote:
But one think I discovered is that while we can set an array's 
length in @safe, nothrow, pure code, it gets lowered to a 
runtime hook that is neither @safe, nothrow, nor pure; the 
compiler is lying to us.


If the compiler can't get it right then who can?



The compiler implementation is faulty.  It rewrites the 
expressions to an `extern(C)` runtime implementation that is not 
@safe, nothrow, or pure:  
https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442  The backend is not involved.


Mike


Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Nick Sabalausky (Abscissa) via Digitalmars-d

On 06/10/2018 08:01 PM, Walter Bright wrote:

On 6/10/2018 4:39 PM, David Nadlinger wrote:
That's not entirely true. Intel started optimising some of the REP 
string instructions again on Ivy Bridge and above. There is a CPUID 
bit to indicate that (ERMS?); I'm sure the Optimization Manual has 
further details. From what I remember, `rep movsb` is supposed to beat 
an AVX loop on most recent Intel µarchs if the destination is aligned 
and the data is longer than a few cache 


The drama of which instruction mix is faster on which CPU never abates!


In many ways, I really miss 80's machine architecture ;) So simple.


Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Mike Franklin via Digitalmars-d
I've modified the test based on the feedback so far, so here's 
what it looks like now:


import std.datetime.stopwatch;
import std.stdio;
import core.stdc.string;
import std.random;
import std.algorithm;

enum length = 4096 * 2;

void init(ref ubyte[] a)
{
a.length = length;

for(int i = 0; i < length; i++)
{
a[i] = uniform!ubyte;
}
}

void verifyResults(ubyte[] a, ubyte[] b)
{
assert(memcmp(a.ptr, b.ptr, length) == 0);
}

void memcpyD(ubyte[] dst, ubyte[] src)
{
dst[] = src[];
}

void memcpyDstdAlg(ubyte[] dst, ubyte[] src)
{
copy(src, dst);
}

void memcpyC(ubyte[] dst, ubyte[] src)
{
memcpy(dst.ptr, src.ptr, length);
}

void memcpyNaive(ubyte[] dst, ubyte[] src)
{
for(int i = 0; i < length; i++)
{
dst[i] = src[i];
}
}

void memcpyASM(ubyte[] dst, ubyte[] src)
{
auto s = src.ptr;
auto d = dst.ptr;
size_t len = length;
asm pure nothrow @nogc
{
mov RSI, s;
mov RDI, d;
cld;
mov RCX, len;
rep;
movsb;
}
}

Duration benchmark(alias f)(ubyte[] dst, ubyte[] src, uint n)
{
Duration result;
auto sw = StopWatch(AutoStart.yes);

sw.reset();
foreach (_; 0 .. n)
{
f(dst, src);
}
result = sw.peek();

return result;
}

void main()
{
ubyte[] src;
ubyte[] dst;

// verify the integrity of the algorithm
init(src);
init(dst);
memcpyD(dst, src);
verifyResults(dst, src);

init(src);
init(dst);
memcpyDstdAlg(dst, src);
verifyResults(dst, src);

init(src);
init(dst);
memcpyC(dst, src);
verifyResults(dst, src);

init(src);
init(dst);
memcpyNaive(dst, src);
verifyResults(dst, src);

init(src);
init(dst);
memcpyASM(dst, src);
verifyResults(dst, src);

// test the performance of the algorithm
enum iterations = 1000;
writeln("memcpyD: ", benchmark!memcpyD(dst, src, iterations));
writeln("memcpyDstdAlg: ", benchmark!memcpyDstdAlg(dst, src, 
iterations));

writeln("memcpyC: ", benchmark!memcpyC(dst, src, iterations));
writeln("memcpyNaive: ", benchmark!memcpyNaive(dst, src, 
iterations));
writeln("memcpyASM: ", benchmark!memcpyASM(dst, src, 
iterations));

}

The results on my Windows 10 machine (Intel Core i7-6700, 3.4GHz):
memcpyD: 127 ╬╝s and 3 hnsecs
memcpyDstdAlg: 195 ╬╝s and 9 hnsecs
memcpyC: 126 ╬╝s and 7 hnsecs
memcpyNaive: 17 ms, 974 ╬╝s, and 9 hnsecs
memcpyASM: 122 ╬╝s and 8 hnsecs
(Gotta love how windows displays μ)

The results running on Arch Linux 64-bit in a VirtualBox on the 
same Windows 10 machine:

memcpyD: 409 μs
memcpyDstdAlg: 400 μs
memcpyC: 404 μs and 4 hnsecs
memcpyNaive: 17 ms, 251 μs, and 6 hnsecs
memcpyASM: 162 μs and 8 hnsecs

The results appear more sane now, but it seems the behavior is 
highly platform dependent.  Still the ASM is doing well for my 
hardware.  If I run the test multiple times, I do see a lot of 
noise in the results, but each test seems to be affected 
proportionally, so I'm gaining a little more confidence in the 
benchmark.


I still need to analyze the assembly of C's memcpy (anyone know 
where I can find the source code?), test on more platforms, and 
test varying sizes, but I'm just collecting some initial data 
right now, to learn how to proceed.


I'd be interested in those with other platforms reporting back 
their results for their hardware, and of course suggestions for 
how to meet or beat C's memcpy with a pure D implementation.


Thanks for all the feedback so far.

Mike



Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Walter Bright via Digitalmars-d

On 6/10/2018 4:39 PM, David Nadlinger wrote:
That's not entirely true. Intel started optimising some of the REP string 
instructions again on Ivy Bridge and above. There is a CPUID bit to indicate 
that (ERMS?); I'm sure the Optimization Manual has further details. From what I 
remember, `rep movsb` is supposed to beat an AVX loop on most recent Intel 
µarchs if the destination is aligned and the data is longer than a few cache 


The drama of which instruction mix is faster on which CPU never abates!


Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread David Nadlinger via Digitalmars-d

On Sunday, 10 June 2018 at 22:23:08 UTC, Walter Bright wrote:

On 6/10/2018 11:16 AM, David Nadlinger wrote:
Because of the large amounts of noise, the only conclusion one 
can draw from this is that memcpyD is the slowest,


Probably because it does a memory allocation.


Of course; that was already pointed out earlier in the thread.

The CPU makers abandoned optimizing the REP instructions 
decades ago, and just left the clunky implementations there for 
backwards compatibility.


That's not entirely true. Intel started optimising some of the 
REP string instructions again on Ivy Bridge and above. There is a 
CPUID bit to indicate that (ERMS?); I'm sure the Optimization 
Manual has further details. From what I remember, `rep movsb` is 
supposed to beat an AVX loop on most recent Intel µarchs if the 
destination is aligned and the data is longer than a few cache 
lines. I've never measured that myself, though.


 — David


Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Temtaime via Digitalmars-d

On Sunday, 10 June 2018 at 22:23:08 UTC, Walter Bright wrote:

On 6/10/2018 11:16 AM, David Nadlinger wrote:
Because of the large amounts of noise, the only conclusion one 
can draw from this is that memcpyD is the slowest,


Probably because it does a memory allocation.



followed by the ASM implementation.


The CPU makers abandoned optimizing the REP instructions 
decades ago, and just left the clunky implementations there for 
backwards compatibility.



In fact, memcpyC and memcpyNaive produce exactly the same 
machine code (without bounds checking), as LLVM recognizes the 
loop and lowers it into a memcpy. memcpyDstdAlg instead gets 
turned into a vectorized loop, for reasons I didn't 
investigate any further.


This amply illustrates my other point that looking at the 
assembler generated is crucial to understanding what's 
happening.


On some cpu architectures(for example intel atoms) rep movsb is 
the fatest memcpy.


Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread solidstate1991 via Digitalmars-d

On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:

void memcpyASM()
{
auto s = src.ptr;
auto d = dst.ptr;
size_t len = length;
asm pure nothrow @nogc
{
mov RSI, s;
mov RDI, d;
cld;
mov RCX, len;
rep;
movsb;
}
}

Protip: Use SSE or AVX for an even faster copying.


Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Walter Bright via Digitalmars-d

On 6/10/2018 11:16 AM, David Nadlinger wrote:
Because of the large amounts of noise, the only conclusion one can draw from 
this is that memcpyD is the slowest,


Probably because it does a memory allocation.



followed by the ASM implementation.


The CPU makers abandoned optimizing the REP instructions decades ago, and just 
left the clunky implementations there for backwards compatibility.



In fact, memcpyC and memcpyNaive produce exactly the same machine code (without 
bounds checking), as LLVM recognizes the loop and lowers it into a memcpy. 
memcpyDstdAlg instead gets turned into a vectorized loop, for reasons I didn't 
investigate any further.


This amply illustrates my other point that looking at the assembler generated is 
crucial to understanding what's happening.


Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Walter Bright via Digitalmars-d

On 6/10/2018 6:45 AM, Mike Franklin wrote:

 void memcpyD()
{
     dst = src.dup;
}


Note that .dup is doing a GC memory allocation.


Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Walter Bright via Digitalmars-d

On 6/10/2018 5:49 AM, Mike Franklin wrote:

[...]


One source of entropy in the results is src and dst being global variables. 
Global variables in D are in TLS, and TLS access can be complex (many 
instructions) and is influenced by the -fPIC switch. Worse, global variable 
access is not optimized in dmd because of aliasing problems.


The solution is to pass src, dst, and length to the copy function as function 
parameters (and make sure function inlining is off).


In light of this, I want to BEAT THE DEAD HORSE once again and assert that if 
the assembler generated by a benchmark is not examined, the results can be 
severely misleading. I've seen this happen again and again. In this case, TLS 
access is likely being benchmarked, not memcpy.


BTW, the relative timing of rep movsb can be highly dependent on which CPU chip 
you're using.


Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread I love Ice Cream via Digitalmars-d
Don't C implementations already do 90% of what you want? I 
thought most compilers know about and optimize these methods 
based on context. I thought they were *special* in the eyes of 
the compiler already. I think you are fighting a battle pitting 
40 years of tweaking against you...


Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread David Nadlinger via Digitalmars-d

On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
I'm not experienced with this kind of programming, so I'm 
doubting these results.  Have I done something wrong?  Am I 
overlooking something?


You've just discovered the fact that one can rarely be careful 
enough with what is benchmarked, and having enough statistics.


For example, check out the following output from running your 
program on macOS 10.12, compiled with LDC 1.8.0:


---
$ ./test
memcpyD: 2 ms, 570 μs, and 9 hnsecs
memcpyDstdAlg: 77 μs and 2 hnsecs
memcpyC: 74 μs and 1 hnsec
memcpyNaive: 76 μs and 4 hnsecs
memcpyASM: 145 μs and 5 hnsecs
$ ./test
memcpyD: 3 ms and 376 μs
memcpyDstdAlg: 76 μs and 9 hnsecs
memcpyC: 104 μs and 4 hnsecs
memcpyNaive: 72 μs and 2 hnsecs
memcpyASM: 181 μs and 8 hnsecs
$ ./test
memcpyD: 2 ms and 565 μs
memcpyDstdAlg: 76 μs and 9 hnsecs
memcpyC: 73 μs and 2 hnsecs
memcpyNaive: 71 μs and 9 hnsecs
memcpyASM: 145 μs and 3 hnsecs
$ ./test
memcpyD: 2 ms, 813 μs, and 8 hnsecs
memcpyDstdAlg: 81 μs and 2 hnsecs
memcpyC: 99 μs and 2 hnsecs
memcpyNaive: 74 μs and 2 hnsecs
memcpyASM: 149 μs and 1 hnsec
$ ./test
memcpyD: 2 ms, 593 μs, and 7 hnsecs
memcpyDstdAlg: 77 μs and 3 hnsecs
memcpyC: 75 μs
memcpyNaive: 77 μs and 2 hnsecs
memcpyASM: 145 μs and 5 hnsecs
---

Because of the large amounts of noise, the only conclusion one 
can draw from this is that memcpyD is the slowest, followed by 
the ASM implementation.


In fact, memcpyC and memcpyNaive produce exactly the same machine 
code (without bounds checking), as LLVM recognizes the loop and 
lowers it into a memcpy. memcpyDstdAlg instead gets turned into a 
vectorized loop, for reasons I didn't investigate any further.


 — David




Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Kagamin via Digitalmars-d

On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
There are many reasons to do this, one of which is to leverage 
information available at compile-time and in D's type system 
(type sizes, alignment, etc...) in order to optimize the 
implementation of these functions, and allow them to be used 
from @safe code.


In safe code you just use assignment and array ops, backend does 
the rest.


On Sunday, 10 June 2018 at 13:27:04 UTC, Mike Franklin wrote:
But one think I discovered is that while we can set an array's 
length in @safe, nothrow, pure code, it gets lowered to a 
runtime hook that is neither @safe, nothrow, nor pure; the 
compiler is lying to us.


If the compiler can't get it right then who can?


Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Seb via Digitalmars-d

On Sunday, 10 June 2018 at 13:45:54 UTC, Mike Franklin wrote:

On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:

arr1[] = arr2[]; // the compiler makes this memcpy, the 
optimzer can further do its magic


void memcpyD()
{
dst = src.dup;
}

void memcpyD2()
{
dst[] = src[];
}

-
memcpyD: 1 ms, 725 μs, and 1 hnsec
memcpyD2: 587 μs and 5 hnsecs
memcpyASM: 119 μs and 5 hnsecs

Still, the ASM version is much faster.

btw, what's the difference between the two.  If you can't tell, 
I'm actually not a very good D programmer.


Mike


I would increase the test data size, s.t. you get a least a few 
seconds.
Otherwise the benchmarking won't tell you much because it's 
subject to too much randomness.


Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread rikki cattermole via Digitalmars-d

On 11/06/2018 1:45 AM, Mike Franklin wrote:

On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:

arr1[] = arr2[]; // the compiler makes this memcpy, the optimzer can 
further do its magic


void memcpyD()
{
     dst = src.dup;


malloc (for slice not static array)


}

void memcpyD2()
{
     dst[] = src[];


memcpy


}

-
memcpyD: 1 ms, 725 μs, and 1 hnsec
memcpyD2: 587 μs and 5 hnsecs
memcpyASM: 119 μs and 5 hnsecs

Still, the ASM version is much faster.

btw, what's the difference between the two.  If you can't tell, I'm 
actually not a very good D programmer.


Mike







Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Mike Franklin via Digitalmars-d

On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:

arr1[] = arr2[]; // the compiler makes this memcpy, the 
optimzer can further do its magic


void memcpyD()
{
dst = src.dup;
}

void memcpyD2()
{
dst[] = src[];
}

-
memcpyD: 1 ms, 725 μs, and 1 hnsec
memcpyD2: 587 μs and 5 hnsecs
memcpyASM: 119 μs and 5 hnsecs

Still, the ASM version is much faster.

btw, what's the difference between the two.  If you can't tell, 
I'm actually not a very good D programmer.


Mike





Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Mike Franklin via Digitalmars-d

On Sunday, 10 June 2018 at 13:17:53 UTC, Guillaume Piolat wrote:

Please make one that guarantee the usage of the corresponding 
backend intrinsic, for example on LLVM.


I tested with ldc and got similar results.  I thought the 
implementation in C forwarded to the backend intrinsic.  I think 
even LDC links with GCC, and the distribution of GCC is tuned for 
a given platform and architecture, so am I not already utilizing 
GCC's backend intrinsic.


Also, I'm not trying to find the holy grail of memcpy with this 
endeavor.  I just want to find something as good or better than 
what the druntime is currently using, but implemented in D.  Even 
if it's not optimal, if it's as good as what druntime is 
currently using, I want to replace those calls in the runtime 
with a D implementation.


Mike




Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Mike Franklin via Digitalmars-d

On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:


And D already has it built in as well for @safe etc:

arr1[] = arr2[]; // the compiler makes this memcpy, the 
optimzer can further do its magic


so be sure to check against that too.


My intent is to use the D implementation in the druntime.  I'm 
trying to replace the runtime hooks with templates, so we don't 
have to rely so much on `TypeInfo` and we can begin to use more 
of D without linking in the runtime.


But one think I discovered is that while we can set an array's 
length in @safe, nothrow, pure code, it gets lowered to a runtime 
hook that is neither @safe, nothrow, nor pure; the compiler is 
lying to us.  If I replace the runtime hook with a template, I 
need to be honest, and that means all that low-level code in 
druntime that the runtime hook depends on needs to be @safe, 
nothrow, and pure.  So I'm starting with the fundamental 
dependencies on the C library, trying to ensure they can be 
replaced with D implementations and use in D idiomatically.  For 
example, memcpy might look like this.


void memcpy(T)(T[] dest, T[] src);

We can extract size, alignment etc.. at compile-time.

So, my goal is not for user-facing code, but for improving the 
druntime implementations.


Does that make sense?

Mike


Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Mike Franklin via Digitalmars-d

On Sunday, 10 June 2018 at 13:05:33 UTC, Nicholas Wilson wrote:

On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
I'm exploring the possibility of implementing some of the 
basic software building blocks (memcpy, memcmp, memmove, 
etc...) that D utilizes from the C library with D 
implementations.  There are many reasons to do this, one of 
which is to leverage information available at compile-time and 
in D's type system (type sizes, alignment, etc...) in order to 
optimize the implementation of these functions, and allow them 
to be used from @safe code.


[...]


what compiler? what flags?


dmd main.d
Arch Linux x86_64

DMD, with no flags.  I'm not compiling C's memcpy, and the 
significant D implementation is memcpyASM which is all inline 
assembly, so I don't see how the compiler flags will make any 
difference, anyway.


Mike


Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Adam D. Ruppe via Digitalmars-d

On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
D utilizes from the C library with D implementations.  There 
are many reasons to do this, one of which is to leverage 
information available at compile-time and in D's type system 
(type sizes, alignment, etc...) in order to optimize the 
implementation of these functions, and allow them to be used 
from @safe code.


So keep in mind that memcpy is really a magical intrinsic anyway 
and optimzers frequently don't actually call a function, but 
rather see the size and replace the instructions inline (like it 
might replace it with just a couple movs instead of something 
fancy).


And D already has it built in as well for @safe etc:

arr1[] = arr2[]; // the compiler makes this memcpy, the optimzer 
can further do its magic


so be sure to check against that too.




Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Guillaume Piolat via Digitalmars-d

On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
I'm not experienced with this kind of programming, so I'm 
doubting these results.  Have I done something wrong?  Am I 
overlooking something?



Hi,

I've spent a lot of time optimizing memcpy. One of the result was 
that on Intel ICC the compiler intrinsics were unbeatable.
Please make one that guarantee the usage of the corresponding 
backend intrinsic, for example on LLVM.


The reasoning is that small areas of memory (statically known) 
will be elided to a few byte moves (inlined), whereas the larger 
one will call the highly optimized C stdlib calls.
If you use ASM instead of IR the optimization barriers and 
register spilling will make you probably less efficient.


Re: Replacing C's memcpy with a D implementation

2018-06-10 Thread Nicholas Wilson via Digitalmars-d

On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
I'm exploring the possibility of implementing some of the basic 
software building blocks (memcpy, memcmp, memmove, etc...) that 
D utilizes from the C library with D implementations.  There 
are many reasons to do this, one of which is to leverage 
information available at compile-time and in D's type system 
(type sizes, alignment, etc...) in order to optimize the 
implementation of these functions, and allow them to be used 
from @safe code.


[...]


what compiler? what flags?


Replacing C's memcpy with a D implementation

2018-06-10 Thread Mike Franklin via Digitalmars-d
I'm exploring the possibility of implementing some of the basic 
software building blocks (memcpy, memcmp, memmove, etc...) that D 
utilizes from the C library with D implementations.  There are 
many reasons to do this, one of which is to leverage information 
available at compile-time and in D's type system (type sizes, 
alignment, etc...) in order to optimize the implementation of 
these functions, and allow them to be used from @safe code.


The prevailing wisdom has been that there is no way to improve on 
C's memcpy implementation given that it has been mirco-optimized 
to death over several decades by many talented members of the 
human race.


So, I threw the following benchmark together to try to get a clue 
about what I was up against, and in a very short time, I beat the 
snot of C's memcpy.  The benefit seems to disappear as the array 
sizes increase, but I believe the vast majority of calls to 
memcpy are probably quite small.


import std.datetime.stopwatch;
import std.stdio;
import core.stdc.string;
import std.random;
import std.algorithm;

enum length = 4096 * 2;
ubyte[length] dst;
ubyte[length] src;
auto rnd = Random(42);
ubyte[] src2;
ubyte[] dst2;

void verifyResults()
{
assert(memcmp(dst.ptr, src.ptr, length) == 0);
assert(memcmp(dst2.ptr, src2.ptr, length) == 0);
}

void randomizeData()
{
for(int i = 0; i < length; i++)
{
src[i] = uniform!ubyte;
dst[i] = uniform!ubyte;
}

src2 = src;
dst2 = dst;
}

void memcpyD()
{
dst = src.dup;
}

void memcpyDstdAlg()
{
copy(src2, dst2);
}

void memcpyC()
{
memcpy(dst.ptr, src.ptr, length);
}

void memcpyNaive()
{
for(int i = 0; i < length; i++)
{
dst[i] = src[i];
}
}

void memcpyASM()
{
auto s = src.ptr;
auto d = dst.ptr;
size_t len = length;
asm pure nothrow @nogc
{
mov RSI, s;
mov RDI, d;
cld;
mov RCX, len;
rep;
movsb;
}
}

void main()
{
// verify the integrity of the algorithm
randomizeData();
memcpyD();
verifyResults();

randomizeData();
memcpyDstdAlg();
verifyResults();

randomizeData();
memcpyC();
verifyResults();

randomizeData();
memcpyNaive();
verifyResults();

randomizeData();
memcpyASM();
verifyResults();

// test the performance of the algorithm
auto r = benchmark!(memcpyD, memcpyDstdAlg, memcpyC, 
memcpyNaive, memcpyASM)(1000);

Duration memcpyDResult = r[0];
Duration memcpyDstdAlgResult = r[1];
Duration memcpyCResult = r[2];
Duration memcpyNaiveResult = r[3];
Duration memcpyASMResult = r[4];

writeln("memcpyD: ", memcpyDResult);
writeln("memcpyDstdAlg: ", memcpyDstdAlgResult);
writeln("memcpyC: ", memcpyCResult);
writeln("memcpyNaive: ", memcpyNaiveResult);
writeln("memcpyASM: ", memcpyASMResult);
}


-- Output 
memcpyD: 1 ms, 772 μs, and 4 hnsecs
memcpyDstdAlg: 531 μs and 8 hnsecs
memcpyC:   371 μs and 3 hnsecs
memcpyNaive:21 ms, 572 μs, and 2 hnsecs
memcpyASM: 119 μs and 6 hnsecs



I'm not experienced with this kind of programming, so I'm 
doubting these results.  Have I done something wrong?  Am I 
overlooking something?


Thanks,
Mike