https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123163
Bug ID: 123163
Summary: vectorization of pointer arithmetic within struct
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: manu at gcc dot gnu.org
Target Milestone: ---
I am not sure whether this is a missed optimization or a misunderstanding on my
part of how GCC calculates vectorization profitability.
I have a piece of code that performs the operation in "bar()" for values of n
on the order of thousands to hundreds of thousands.
struct list {
    const double * restrict x;
    struct list ** restrict next;
};

void bar(struct list * restrict p, int n) {
    for (int i = 0; i < n; i++) {
        p[i].x--;
    }
}

void foo(const double ** p, int n) {
    for (int i = 0; i < n; i++) {
        p[i]--;
    }
}

static inline void foo_inline(const double ** p, int n) {
    for (int i = 0; i < n; i++) {
        p[i]--;
    }
}

void baz(struct list * restrict p, int n) {
#ifndef N
#define N 16
#endif
    const double * vec[N];
    int blocks = n / N;
    for (int i = 0; i < blocks; i++) {
        for (int k = 0; k < N; k++)
            vec[k] = p[k].x;
        foo_inline(vec, N);
        for (int k = 0; k < N; k++)
            p[k].x = vec[k];
        p += N;
    }
    for (int i = blocks*N; i < n; i++) {
        p[i].x--;
    }
}
gcc -O3 -march=x86-64-v2 -fopt-info-vec-all :
<source>:7:23: missed: couldn't vectorize loop
<source>:8:13: missed: not vectorized: no vectype for stmt: _4 = *_3.x;
scalar_type: const double * restrict
<source>:6:6: note: vectorized 0 loops in function.
<source>:10:1: note: ***** Analysis failed with vector mode V2DI
<source>:10:1: note: ***** Skipping vector mode V16QI, which would repeat the analysis for V2DI
<source>:13:23: optimized: loop vectorized using 16 byte vectors and unroll factor 2
<source>:12:6: note: vectorized 1 loops in function.
<source>:16:1: note: ***** Analysis failed with vector mode V2DI
<source>:16:1: note: ***** Skipping vector mode V16QI, which would repeat the analysis for V2DI
<source>:38:30: missed: couldn't vectorize loop
<source>:39:13: missed: not vectorized: no vectype for stmt: _12 = *_11.x;
scalar_type: const double * restrict
<source>:30:23: missed: couldn't vectorize loop
<source>:32:42: missed: not vectorized: no vectype for stmt: _323 = *p_324.x;
scalar_type: const double * restrict
<source>:24:6: note: vectorized 0 loops in function.
<source>:41:1: note: ***** Analysis failed with vector mode V2DI
<source>:41:1: note: ***** Skipping vector mode V16QI, which would repeat the analysis for V2DI
That is, foo() is vectorized but the rest are not. Surprisingly, foo_inline(),
which has the same loop body as foo(), is not vectorized either!
However, with
gcc -O3 -march=x86-64-v2 -fopt-info-vec-all -DN=32 :
<source>:7:23: missed: couldn't vectorize loop
<source>:8:13: missed: not vectorized: no vectype for stmt: _4 = *_3.x;
scalar_type: const double * restrict
<source>:6:6: note: vectorized 0 loops in function.
<source>:10:1: note: ***** Analysis failed with vector mode V2DI
<source>:10:1: note: ***** Skipping vector mode V16QI, which would repeat the analysis for V2DI
<source>:13:23: optimized: loop vectorized using 16 byte vectors and unroll factor 2
<source>:12:6: note: vectorized 1 loops in function.
<source>:16:1: note: ***** Analysis failed with vector mode V2DI
<source>:16:1: note: ***** Skipping vector mode V16QI, which would repeat the analysis for V2DI
<source>:38:30: missed: couldn't vectorize loop
<source>:39:13: missed: not vectorized: no vectype for stmt: _12 = *_11.x;
scalar_type: const double * restrict
<source>:30:23: missed: couldn't vectorize loop
<source>:30:23: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized
<source>:34:27: optimized: loop vectorized using 16 byte vectors and unroll factor 2
<source>:19:23: optimized: loop vectorized using 16 byte vectors and unroll factor 2
<source>:31:27: optimized: loop vectorized using 16 byte vectors and unroll factor 2
<source>:24:6: note: vectorized 3 loops in function.
<source>:41:1: note: ***** Analysis failed with vector mode V2DI
<source>:41:1: note: ***** Skipping vector mode V16QI, which would repeat the analysis for V2DI
The three inner loops in baz() are vectorized, including the inlined
foo_inline() loop.
Why does the vectorizer not vectorize bar() the same way it vectorizes baz()
with N=32? Note that using #pragma GCC unroll 32 in bar() does not help.
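For reference, the unroll attempt mentioned above might look like the following sketch. The report does not show the exact code that was tried, so the pragma placement (directly before the loop, where GCC expects it) and the name bar_unroll are assumptions made for illustration only.

```c
#include <stddef.h>

struct list {
    const double * restrict x;
    struct list ** restrict next;
};

/* Same loop as bar(), with an unroll request attached to it.
   bar_unroll is a hypothetical name; the pragma placement is assumed,
   since the report only states that the pragma was tried. */
void bar_unroll(struct list * restrict p, int n) {
#pragma GCC unroll 32
    for (int i = 0; i < n; i++) {
        p[i].x--;
    }
}
```

The pragma only requests unrolling of the loop that follows it; the function computes the same result as bar().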
Rewriting bar() as:
void bar(struct list * restrict p, int n) {
    int blocks = n / N;
    for (int i = 0; i < blocks; i++) {
#pragma GCC ivdep
        for (int k = 0; k < N; k++) {
            p[k].x--;
        }
        p += N;
    }
    for (int i = blocks*N; i < n; i++) {
        p[i].x--;
    }
}
does not work either.
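A side note on the blocked variants (baz() and the rewritten bar()), separate from the vectorization question: p has already been advanced by blocks*N elements when the tail loop runs, so indexing p[i] with i starting at blocks*N reaches past the intended elements whenever blocks > 0 and n is not a multiple of N. The sketch below indexes the tail relative to the advanced pointer and checks the result against the plain loop. The names BLOCK, bar_ref, baz_fixed_tail and variants_agree are hypothetical, and foo_inline() is written out as an explicit loop.

```c
#include <assert.h>
#include <stddef.h>

struct list {
    const double * restrict x;
    struct list ** restrict next;
};

#define BLOCK 16  /* hypothetical name; plays the role of N in the report */

/* Reference: the same loop as bar() in the report. */
static void bar_ref(struct list * restrict p, int n) {
    for (int i = 0; i < n; i++)
        p[i].x--;
}

/* Blocked variant modelled on baz(), with the tail indexed from 0. */
static void baz_fixed_tail(struct list * restrict p, int n) {
    const double * vec[BLOCK];
    int blocks = n / BLOCK;
    for (int i = 0; i < blocks; i++) {
        for (int k = 0; k < BLOCK; k++)
            vec[k] = p[k].x;
        for (int k = 0; k < BLOCK; k++)   /* body of foo_inline() */
            vec[k]--;
        for (int k = 0; k < BLOCK; k++)
            p[k].x = vec[k];
        p += BLOCK;
    }
    /* p now points at element blocks*BLOCK, so the tail starts at index 0. */
    for (int i = 0; i < n - blocks * BLOCK; i++)
        p[i].x--;
}

/* Returns 1 if both variants produce identical pointers for n <= 64 elements. */
static int variants_agree(int n) {
    static double storage[65];
    struct list a[64], b[64];
    assert(n >= 0 && n <= 64);
    for (int i = 0; i < n; i++) {
        a[i].x = &storage[i + 1];  /* +1 so that x - 1 stays inside storage */
        b[i].x = &storage[i + 1];
        a[i].next = NULL;
        b[i].next = NULL;
    }
    bar_ref(a, n);
    baz_fixed_tail(b, n);
    for (int i = 0; i < n; i++)
        if (a[i].x != b[i].x)
            return 0;
    return 1;
}
```

Calling variants_agree() with sizes such as 15, 16 and 37 exercises the tail-only, blocks-only and mixed cases. This does not change the loops the vectorizer diagnostics refer to; it only makes the test case well defined for n not divisible by the block size.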