On 6/10/2018 5:49 AM, Mike Franklin wrote:
[...]
One source of variability in the results is that src and dst are global variables. Global variables in D are placed in TLS by default, and TLS access can be complex (many instructions) and is influenced by the -fPIC switch. Worse, global variable access is not optimized in dmd because of aliasing problems.
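To make the TLS point concrete, here is a minimal sketch showing that a plain module-level variable in D gets one copy per thread, while a __gshared one does not. The names tlsCounter and sharedCounter are made up for illustration and are not from the benchmark being discussed.

```d
import core.thread : Thread;

// Hypothetical names, for illustration only.
int tlsCounter;               // module-level: thread-local (TLS) by default
__gshared int sharedCounter;  // one instance shared by the whole process

void main()
{
    tlsCounter = 1;
    sharedCounter = 1;

    auto t = new Thread({
        // The new thread sees its own freshly initialized copy.
        assert(tlsCounter == 0);
        tlsCounter = 42;      // changes only this thread's copy
        sharedCounter = 42;   // changes the single process-wide instance
    });
    t.start();
    t.join();

    assert(tlsCounter == 1);      // main thread's copy is untouched
    assert(sharedCounter == 42);  // the shared variable was modified
}
```

Every read or write of tlsCounter has to go through whatever TLS lookup the platform uses, which is the overhead that can end up in a timed loop.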
The solution is to pass src, dst, and length to the copy function as function parameters (and make sure function inlining is off).
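A rough sketch of what that might look like, assuming a copy routine that simply forwards to memcpy. The function name copy, the buffer size, and the iteration count are placeholders, not taken from the original benchmark.

```d
import core.stdc.string : memcpy;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

// Hypothetical routine under test. pragma(inline, false) asks the compiler
// not to inline it, so src, dst, and length really arrive as parameters.
pragma(inline, false)
void copy(ubyte* dst, const(ubyte)* src, size_t length)
{
    memcpy(dst, src, length);
}

void main()
{
    enum size_t length = 4096;   // assumed buffer size
    enum iterations = 100_000;   // assumed repetition count

    // Locals rather than module-level variables: no TLS access in the loop.
    auto src = new ubyte[](length);
    auto dst = new ubyte[](length);
    foreach (i, ref b; src)
        b = cast(ubyte) i;

    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. iterations)
        copy(dst.ptr, src.ptr, length);
    sw.stop();

    assert(dst == src);          // sanity check that the copy happened
    writefln("%s copies of %s bytes: %s usecs",
             iterations, length, sw.peek.total!"usecs");
}
```

Even with this arrangement, it is still worth disassembling the object file (for example with objdump -d) to confirm that the timed loop contains the call to copy and nothing else of consequence.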
In light of this, I want to BEAT THE DEAD HORSE once again and assert that if the assembly generated for a benchmark is not examined, the results can be severely misleading. I've seen this happen again and again. In this case, TLS access is likely being benchmarked, not memcpy.
BTW, the relative timing of rep movsb can be highly dependent on which CPU chip you're using.
