http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086
Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED |NEW
   Last reconfirmed|            |2014-02-06
                 CC|            |abel at gcc dot gnu.org,
                   |            |jakub at gcc dot gnu.org,
                   |            |uros at gcc dot gnu.org,
                   |            |vmakarov at gcc dot gnu.org
     Ever confirmed|0           |1

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC right now only handles __restrict on function parameters, so in this case
the aliasing info isn't known.  While the loop is versioned for aliasing at
runtime, that info is only available inside the vectorizer, so e.g. the
scheduler can hardly make use of it.

Pointers to overaligned memory are something you should generally avoid;
__builtin_assume_aligned is what can be used to tell the compiler about the
alignment instead.  Overaligned types often actually hurt the generated code
rather than improve it.  And the way you are calling posix_memalign is IMHO a
strict aliasing violation.

Perhaps GCC could handle posix_memalign specially as a builtin if it is
declared with the right prototype (and optionally some new attribute) and
derive both the aliasing and alignment info from it.  Taking the address of
the pointer in the call isn't really an escape site of any kind; all the call
does is return two values instead of just one.  So it could be folded into
passing the address of some temporary to the call, then loading from that
temporary through some special pass-thru builtin that would tell GCC the
pointer is really malloc-like (not aliasing anything else), and also using
__builtin_assume_aligned.  The GNU memalign is far better than posix_memalign
from this POV.
Anyway, if I rewrite your testcase as:

#include <stdlib.h>
#include <stdio.h>

__attribute__((noinline)) void
foo (double *__restrict__ a, double *__restrict__ b,
     double *__restrict__ c, double *__restrict__ d,
     unsigned long NSIZE)
{
  unsigned long i, j;
  a = __builtin_assume_aligned (a, 32);
  b = __builtin_assume_aligned (b, 32);
  c = __builtin_assume_aligned (c, 32);
  d = __builtin_assume_aligned (d, 32);
  // initialize memory
  for (i = 0; i < NSIZE; i++)
    {
      a[i] = 0;
      b[i] = 0;
      c[i] = 0;
      d[i] = 0;
    }
  // outer loop - repeat short tests
  for (j = 0; j < 10000; j++)
    {
      // inner loop - do the work
      for (i = 0; i < NSIZE; i++)
        {
          a[i] += b[i];
          c[i] += d[i];
        }
      // dummy - prevent loop interchange
      if (a[NSIZE / 2] < 0)
        printf ("%lf\n", a[NSIZE / 2]);
    }
}

int
main (int argc, char *argv[])
{
  unsigned long NSIZE = atol (argv[1]);
  void *a, *b, *c, *d;
  // allocate starting from page boundary
  posix_memalign (&a, 4096, sizeof (double) * NSIZE);
  posix_memalign (&b, 4096, sizeof (double) * NSIZE);
  posix_memalign (&c, 4096, sizeof (double) * NSIZE);
  posix_memalign (&d, 4096, sizeof (double) * NSIZE);
  foo ((double *) a, (double *) b, (double *) c, (double *) d, NSIZE);
  return 0;
}

we don't do versioning for alias and also (as before) assume sufficient
alignment, but still the scheduler doesn't reorder the loads vs. the store,
unless compiled with -O3 -mavx -fschedule-insns.  The reason the second
scheduler doesn't reorder those is that RA allocates the same register.
With -O3 -mavx -fselective-scheduling2 the stores are also changed, but we
end up with a weird:

.L9:
        movq    -136(%rbp), %rdx
        vmovapd (%r9,%rax), %ymm0
        addq    $1, %rdi
        vmovapd (%r10,%rax), %ymm8
        vaddpd  (%rdx,%rax), %ymm0, %ymm0
        movq    -144(%rbp), %rdx
        vaddpd  (%rdx,%rax), %ymm8, %ymm9
        vmovapd %ymm0, (%r9,%rax)
        vmovapd %ymm8, %ymm0
        vmovapd %ymm9, %ymm0
        vmovapd %ymm0, (%r10,%rax)
        addq    $32, %rax
        cmpq    %rdi, -152(%rbp)
        ja      .L9

Why there is the vmovapd %ymm8, %ymm0 is a mystery, and vmovapd %ymm9, %ymm0
could very well be merged with the store into vmovapd %ymm9, (%r10,%rax).