https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95974
Bug ID: 95974
Summary: AArch64 arm_neon.h stores interfere with gimple
optimisations
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rsandifo at gcc dot gnu.org
Blocks: 95958
Target Milestone: ---
Target: aarch64*-*-*
For:
---------------------------------------
#include <arm_neon.h>
#include <vector>

std::vector<float> a;

void
f (size_t n, float32x4_t v)
{
  for (size_t i = 0; i < n; i += 4)
    vst1q_f32 (&a[i], v);
}
---------------------------------------
we generate code that reloads the data pointer of "a"
on every iteration of the loop:
---------------------------------------
        cbz     x0, .L4
        adrp    x4, .LANCHOR0
        add     x4, x4, :lo12:.LANCHOR0
        mov     x1, 0
        .p2align 3,,7
.L6:
        ldr     x3, [x4]
        lsl     x2, x1, 2
        add     x1, x1, 4
        str     q0, [x3, x2]
        cmp     x0, x1
        bhi     .L6
.L4:
        ret
---------------------------------------
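For comparison, here is a hypothetical variant (a workaround sketch,
not part of the original testcase) that performs the store through a
plain vector pointer instead of the intrinsic.  Because the store is
then an ordinary gimple assignment, alias analysis can see that it
cannot clobber "a"'s data pointer, so one would expect the load to be
hoisted out of the loop:
---------------------------------------
#include <arm_neon.h>
#include <cstddef>
#include <vector>

std::vector<float> a;

/* Hypothetical workaround: store through a float32x4_t pointer so
   that the access is visible to the gimple optimisers.  Unlike
   vst1q_f32, this assumes &a[i] is 16-byte aligned.  */
void
g (std::size_t n, float32x4_t v)
{
  for (std::size_t i = 0; i < n; i += 4)
    *reinterpret_cast<float32x4_t *> (&a[i]) = v;
}
---------------------------------------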
This is really the store equivalent of PR95962.  The problem is
that __builtin_aarch64_st1v4sf is modelled as a general function
that could read from and write to arbitrary memory.  As with
PR95962, one option would be to lower the builtin to gimple
accesses where possible, at least for little-endian targets.
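As a rough sketch (an assumption about what such a lowering could
produce, not the actual implementation), on little-endian targets
vst1q_f32 is semantically just a 16-byte store with only element
(4-byte) alignment guaranteed, which can be expressed as an ordinary
assignment:
---------------------------------------
#include <arm_neon.h>

/* Hypothetical lowered form: a vector type whose alignment is
   reduced to the element alignment that vst1q_f32 guarantees.  */
typedef float32x4_t unaligned_f32x4 __attribute__ ((aligned (4)));

static inline void
vst1q_f32_as_gimple_store (float *p, float32x4_t v)
{
  /* An ordinary store like this is visible to alias analysis,
     unlike the opaque __builtin_aarch64_st1v4sf call.  */
  *(unaligned_f32x4 *) p = v;
}
---------------------------------------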
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958
[Bug 95958] [meta-bug] Inefficient arm_neon.h code for AArch64