http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086
Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED |NEW
   Last reconfirmed|            |2014-02-06
                 CC|            |abel at gcc dot gnu.org,
                   |            |jakub at gcc dot gnu.org,
                   |            |uros at gcc dot gnu.org,
                   |            |vmakarov at gcc dot gnu.org
     Ever confirmed|0           |1

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC right now only handles __restrict on function parameters, so in this case
the aliasing info isn't known.  While the loop is versioned for aliasing at
runtime, that info is only available inside the vectorizer, so e.g. the
scheduler can hardly make use of it.

Pointers to overaligned memory are something you should generally avoid;
__builtin_assume_aligned is what can be used to tell the compiler about the
alignment instead.  Overaligned types often actually hurt the generated code
rather than improve it.  And the way you are calling posix_memalign is IMHO a
strict aliasing violation.

Perhaps GCC could handle posix_memalign specially as a builtin if it is
declared with the right prototype (and optionally some new attribute) and
derive both the aliasing and alignment info from it.  Taking the address of
the pointer in the call isn't really an escape site of any kind; all the call
does is return two values instead of just one.  So it could be folded into
passing the address of some temporary to the call, then loading from that
temporary through some special pass-thru builtin that would tell GCC the
pointer is really malloc-like (not aliasing anything else), and also using
__builtin_assume_aligned.  The GNU memalign is far better than posix_memalign
from this POV.
Anyway, if I rewrite your testcase as:

#include <stdlib.h>
#include <stdio.h>

__attribute__((noinline)) void
foo (double *__restrict__ a, double *__restrict__ b,
     double *__restrict__ c, double *__restrict__ d,
     unsigned long NSIZE)
{
  unsigned long i, j;
  a = __builtin_assume_aligned (a, 32);
  b = __builtin_assume_aligned (b, 32);
  c = __builtin_assume_aligned (c, 32);
  d = __builtin_assume_aligned (d, 32);
  // initialize memory
  for (i = 0; i < NSIZE; i++)
    {
      a[i] = 0;
      b[i] = 0;
      c[i] = 0;
      d[i] = 0;
    }
  // outer loop - repeat short tests
  for (j = 0; j < 10000; j++)
    {
      // inner loop - do the work
      for (i = 0; i < NSIZE; i++)
        {
          a[i] += b[i];
          c[i] += d[i];
        }
      // dummy - prevent loop interchange
      if (a[NSIZE / 2] < 0)
        printf ("%lf\n", a[NSIZE / 2]);
    }
}

int
main (int argc, char *argv[])
{
  unsigned long NSIZE = atol (argv[1]);
  void *a, *b, *c, *d;
  // allocate starting from page boundary
  posix_memalign (&a, 4096, sizeof (double) * NSIZE);
  posix_memalign (&b, 4096, sizeof (double) * NSIZE);
  posix_memalign (&c, 4096, sizeof (double) * NSIZE);
  posix_memalign (&d, 4096, sizeof (double) * NSIZE);
  foo ((double *) a, (double *) b, (double *) c, (double *) d, NSIZE);
  return 0;
}

we don't do versioning for alias and also (as before) assume sufficient
alignment, but still the scheduler doesn't reorder the loads vs. the store,
unless compiled with -O3 -mavx -fschedule-insns.  The reason the second
scheduler doesn't reorder those is that RA allocates the same register.
With -O3 -mavx -fselective-scheduling2 the stores are also changed, but we
end up with a weird:

.L9:
        movq    -136(%rbp), %rdx
        vmovapd (%r9,%rax), %ymm0
        addq    $1, %rdi
        vmovapd (%r10,%rax), %ymm8
        vaddpd  (%rdx,%rax), %ymm0, %ymm0
        movq    -144(%rbp), %rdx
        vaddpd  (%rdx,%rax), %ymm8, %ymm9
        vmovapd %ymm0, (%r9,%rax)
        vmovapd %ymm8, %ymm0
        vmovapd %ymm9, %ymm0
        vmovapd %ymm0, (%r10,%rax)
        addq    $32, %rax
        cmpq    %rdi, -152(%rbp)
        ja      .L9

Why there is the vmovapd %ymm8, %ymm0 is a mystery, and vmovapd %ymm9, %ymm0
could very well be merged with the store into vmovapd %ymm9, (%r10,%rax).