On Sun, Jul 7, 2019 at 8:40 AM Thomas Womack <t...@womack.net> wrote:
>
> Good morning.
>
> I have some code that looks like
>
> typedef unsigned long long uint64;
> typedef unsigned int uint32;
>
> typedef struct { uint64 x[8]; } __attribute__((aligned(64))) v_t;
>
> inline v_t xor(v_t a, v_t b)
> {
>   v_t Q;
>   for (int i=0; i<8; i++) Q.x[i] = a.x[i] ^ b.x[i];
>   return Q;
> }
>
> void xor_matrix_precomp(v_t* __restrict__ a, v_t* __restrict__ c, v_t* 
> __restrict__ d, int n)
> {
>   uint32 i,j;
>   for (i=0; i<n; i++)
>     {
>       v_t vi = a[i];
>       v_t acc = c[0*256 + (vi.x[0] & 0xff)];
>       for (j=1; j<64; j++)
>         {
>           uint32 w = j>>3, b=j&7;
>           acc = xor(acc, c[j*256 + ((vi.x[w] >> (8*b))&0xff)]);
>         }
>       d[i] = xor(d[i], acc);
>     }
> }
>
> built with
>
> /home/nfsworld/tooling/gcc-9.1-isl16/bin/gcc -O3 -fomit-frame-pointer 
> -march=skylake-avx512 -mprefer-vector-width=512 -S badeg.c
>
>
> and the inner xor_matrix loop is not vectorised at all: it carries ‘acc’ in 
> eight x86-64 registers rather than one ZMM.  What am I missing?
>
> -fopt-info-all-vec is not helping me, it points out that it can’t vectorise 
> the i or j loop but says nothing about the loop in the inlined xor() call.

The loops are fully unrolled, the vectorizer doesn't see them anymore.

Note the vectorizer is confused about the aggregate copy:

      v_t vi = a[i];

which isn't elided.

If you disable the premature unrolling of the xor() look via
-fdisable-tree-cunrolli you get these loops vectorized:

> gcc t.c -O3 -march=skylake-avx512 -mprefer-vector-width=512 -S -fopt-info-vec 
> -fdisable-tree-cunrolli
cc1: note: disable pass tree-cunrolli for functions in the range of
[0, 4294967295]
t.c:9:3: optimized: loop vectorized using 64 byte vectors
t.c:9:3: optimized: loop vectorized using 64 byte vectors

(heuristics are difficult...).

Richard.

> I’m sure I’m missing something obvious; many thanks for your help.
>
> Tom

Reply via email to