https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122219
Bug ID: 122219
Summary: Missed store sinking when using memcpy with vector_size type in inlined functions
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: pfustc at gcc dot gnu.org
Target Milestone: ---
When using an inlined function to cast data from float32x4_t (16 bytes) to a
32-byte vector_size type, GCC fails to sink the stores out of the loop. The
same logic expressed directly with memcpy is optimized correctly.
Reproducer:
#include <arm_neon.h>
typedef float v256_t __attribute__((vector_size(32)));
inline v256_t cast(float32x4_t fv) {
  v256_t r;
  __builtin_memcpy(&r, &fv, 16);
  return r;
}
float foo(unsigned n) {
  v256_t vec;
  float32x4_t fv = {1.0f, 2.0f, 3.0f, 4.0f};
  for (int i = 0; i < n; i++) {
    fv = vmulq_f32(fv, fv);
    vec = cast(fv);
  }
  return vec[n % 8];
}
float bar(unsigned n) {
  v256_t vec;
  float32x4_t fv = {1.0f, 2.0f, 3.0f, 4.0f};
  for (int i = 0; i < n; i++) {
    fv = vmulq_f32(fv, fv);
    __builtin_memcpy(&vec, &fv, 16);
  }
  return vec[n % 8];
}
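For readers without an AArch64 toolchain, the same source shape can be sketched with GCC generic vectors in place of the NEON types. This is only an illustration: v128_t, cast_generic, foo_generic, bar_generic and check_same_result are hypothetical names, vec is zero-initialized and only lanes 0..3 are read so that the result is well defined, and it has not been verified that this variant shows the same missed optimization on other targets.

```c
#include <assert.h>

/* Some targets emit an ABI diagnostic for 32-byte vectors when wide-vector
   hardware support is not enabled; it is irrelevant for this sketch. */
#pragma GCC diagnostic ignored "-Wpsabi"

typedef float v128_t __attribute__((vector_size(16))); /* stands in for float32x4_t */
typedef float v256_t __attribute__((vector_size(32)));

/* Same shape as cast() in the reproducer: only the low 16 bytes are written. */
static inline v256_t cast_generic(v128_t fv) {
  v256_t r;
  __builtin_memcpy(&r, &fv, 16);
  return r;
}

/* Copy goes through the inlined helper, as in foo(). */
float foo_generic(unsigned n) {
  v256_t vec = {0}; /* assumption: zero-initialized, unlike the reproducer */
  v128_t fv = {1.0f, 2.0f, 3.0f, 4.0f};
  for (unsigned i = 0; i < n; i++) {
    fv = fv * fv; /* element-wise multiply, replaces vmulq_f32 */
    vec = cast_generic(fv);
  }
  return vec[n % 4]; /* only the lanes written by the 16-byte copy */
}

/* Copy is written directly with memcpy, as in bar(). */
float bar_generic(unsigned n) {
  v256_t vec = {0};
  v128_t fv = {1.0f, 2.0f, 3.0f, 4.0f};
  for (unsigned i = 0; i < n; i++) {
    fv = fv * fv;
    __builtin_memcpy(&vec, &fv, 16);
  }
  return vec[n % 4];
}

/* Both forms are expected to compute the same value. */
void check_same_result(unsigned n) {
  assert(foo_generic(n) == bar_generic(n));
}
```

Compiling this with "-O3 -S" and comparing the two loop bodies would be one way to check whether a given target is affected.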
We observed the following difference in the generated assembly when compiling with "-O3" on AArch64.
foo() - BAD (store sinking failed)
.L3:
        add     w1, w1, 1
        fmul    v31.4s, v31.4s, v31.4s
        str     q31, [sp]
        ldp     q29, q30, [x2]
        stp     q29, q30, [sp, 32]
        cmp     w1, w0
        bne     .L3
bar() - GOOD (store sinking succeeded)
.L12:
        add     w1, w1, 1
        fmul    v31.4s, v31.4s, v31.4s
        cmp     w1, w0
        bne     .L12
        str     q31, [sp]
Initial analysis:
Below are some GIMPLE statements generated in "foo()" before the LIM pass:
MEM <uint128_t> [(char * {ref-all})&ret] = _10;
ret.1_14 = ret;
ret ={v} {CLOBBER(eos)};
vec = ret.1_14;
Store sinking in the LIM pass finds a dependence between the first store,
"MEM [&ret] = ...", and the load that follows it, "ret.1_14 = ret". Because of
that dependence, it fails to sink the store out of the loop.
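The store/load/store chain and the result one would hope for can be sketched at source level. This is a hand-written illustration, not compiler output: it uses GCC generic vectors instead of the NEON types, the names v128_t, foo_as_gimple, foo_hand_sunk and check_equivalent are hypothetical, and vec is zero-initialized with only lanes 0..3 read so that both functions are well defined for every n.

```c
#include <assert.h>

/* Some targets emit an ABI diagnostic for 32-byte vectors when wide-vector
   hardware support is not enabled; it is irrelevant for this sketch. */
#pragma GCC diagnostic ignored "-Wpsabi"

typedef float v128_t __attribute__((vector_size(16))); /* stands in for float32x4_t */
typedef float v256_t __attribute__((vector_size(32)));

/* Loop body written to mirror the GIMPLE statements quoted above. */
float foo_as_gimple(unsigned n) {
  v256_t vec = {0};
  v128_t fv = {1.0f, 2.0f, 3.0f, 4.0f};
  for (unsigned i = 0; i < n; i++) {
    fv = fv * fv;
    v256_t ret;                      /* temporary left over from inlining */
    __builtin_memcpy(&ret, &fv, 16); /* MEM <uint128_t> [&ret] = _10      */
    v256_t ret_1 = ret;              /* ret.1_14 = ret                    */
    vec = ret_1;                     /* vec = ret.1_14                    */
  }
  return vec[n % 4];
}

/* Hand-sunk form: the loop only multiplies, and one copy follows the loop. */
float foo_hand_sunk(unsigned n) {
  v256_t vec = {0};
  v128_t fv = {1.0f, 2.0f, 3.0f, 4.0f};
  for (unsigned i = 0; i < n; i++)
    fv = fv * fv;
  if (n > 0) /* the copy is only needed if the loop ran at least once */
    __builtin_memcpy(&vec, &fv, 16);
  return vec[n % 4];
}

/* The two forms should agree on the lanes written by the 16-byte copy. */
void check_equivalent(unsigned n) {
  assert(foo_as_gimple(n) == foo_hand_sunk(n));
}
```

The second function corresponds to the loop shape seen in the bar() assembly, where the only store is placed after the loop.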
How should we fix this?