On Tue, Jun 5, 2018 at 4:20 PM Alexey Dobriyan <adobri...@gmail.com> wrote:
>
> This is Broadwell Xeon E5-2620 v4.
> Which is somewhat strange indeed because it should be modern enough.

Yeah, odd. Here's the benchmark I used:

    #define SIZE 4068

    int main(int argc, char **argv)
    {
        int i;
        unsigned char buffer[SIZE], *p;

        for (i = 0; i < 1000000; i++)
            asm volatile(
                "1: movq %[zero],(%[mem]); addq %[eight],%[mem]; decl %[count]; jne 1b"
                : [mem] "=r" (p)
                : [zero] "i" (0l), [eight] "i" (8l), "0" (buffer), [count] "r" (SIZE/8));
    }

where you can change that "i" for [zero] and [eight] to be "r" to get
the register version.

I just timed it, because I'm lazy and perf seemed to be overkill.

It might be some very specific loop buffer issue or something. Or
maybe my benchmark above is broken, I didn't really verify that the
end result was any good (I just did an objdump to verify the asm code
superficially).

               Linus
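
One thing that may matter for the "maybe my benchmark is broken" caveat: the asm decrements [count], but [count] is declared as an input-only operand, and there are no "memory" or "cc" clobbers. Without optimization the compiler probably reloads the register on every iteration so the loop behaves, but an optimizing build is allowed to assume the register still holds SIZE/8 after the asm. Below is a minimal sketch of the same store loop with read-write operands and clobbers. The function names zero_imm and zero_reg, the non-x86 memset fallback, and the use of 0LL/8LL instead of 0l/8l are assumptions made for illustration, not part of the original benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIZE 4068

/*
 * Hypothetical helper: store `count` 8-byte zeros starting at buf,
 * with the zero and the stride as immediates (the "i" version).
 * count must be non-zero, otherwise decl would wrap around.
 */
void zero_imm(unsigned char *buf, unsigned int count)
{
	assert(count > 0);
#if defined(__x86_64__)
	unsigned char *p = buf;

	/* "+&r": the asm both reads and writes these registers, and
	 * they must not share a register with any input operand. */
	__asm__ volatile(
		"1: movq %[zero],(%[mem]); addq %[eight],%[mem]; decl %[count]; jne 1b"
		: [mem] "+&r" (p), [count] "+&r" (count)
		: [zero] "i" (0LL), [eight] "i" (8LL)
		: "memory", "cc");
#else
	/* Assumed fallback so the sketch still builds on other targets. */
	memset(buf, 0, (size_t)count * 8);
#endif
}

/* Same loop with the zero and the stride held in registers (the "r" version). */
void zero_reg(unsigned char *buf, unsigned int count)
{
	assert(count > 0);
#if defined(__x86_64__)
	unsigned char *p = buf;

	__asm__ volatile(
		"1: movq %[zero],(%[mem]); addq %[eight],%[mem]; decl %[count]; jne 1b"
		: [mem] "+&r" (p), [count] "+&r" (count)
		: [zero] "r" (0LL), [eight] "r" (8LL)
		: "memory", "cc");
#else
	memset(buf, 0, (size_t)count * 8);
#endif
}
```

Calling zero_imm(buffer, SIZE/8) or zero_reg(buffer, SIZE/8) a million times from main() and timing the two builds should reproduce the comparison, with the caveat that SIZE/8 rounds down, so the last 4 bytes of the 4068-byte buffer are never written.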