https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95967
Bug ID: 95967
Summary: Poor aarch64 vector constructor code when using arm_neon.h
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rsandifo at gcc dot gnu.org
Depends on: 95962
Blocks: 95958
Target Milestone: ---
Target: aarch64*-*-*
We generate poor code for the attached functions:
f1:
        movi v4.4s, 0
        ins v4.s[0], v0.s[0]
        ins v4.s[1], v1.s[0]
        ins v4.s[2], v2.s[0]
        mov v0.16b, v4.16b
        ins v0.s[3], v3.s[0]
        ret
f2:
        dup v0.4s, v0.s[0]
        ins v0.s[1], v1.s[0]
        ins v0.s[2], v2.s[0]
        ins v0.s[3], v3.s[0]
        ret
f3:
        sub sp, sp, #16
        stp s0, s1, [sp]
        stp s2, s3, [sp, 8]
        ldr q0, [sp]
        add sp, sp, 16
        ret
g1:
        movi v0.4s, 0
        ld1 {v0.s}[0], [x0]
        ld1 {v0.s}[1], [x1]
        ld1 {v0.s}[2], [x2]
        ld1 {v0.s}[3], [x3]
        ret
g2:
        ld1r {v0.4s}, [x0]
        ld1 {v0.s}[1], [x1]
        ld1 {v0.s}[2], [x2]
        ld1 {v0.s}[3], [x3]
        ret
g3:
        sub sp, sp, #16
        ldr s0, [x3]
        ldr s3, [x0]
        ldr s2, [x1]
        ldr s1, [x2]
        stp s3, s2, [sp]
        stp s1, s0, [sp, 8]
        ldr q0, [sp]
        add sp, sp, 16
        ret
All three f functions should generate:
        mov v0.s[1], v1.s[0]
        mov v0.s[2], v2.s[0]
        mov v0.s[3], v3.s[0]
        ret
and all three g functions should generate:
        ldr s0, [x0]
        ld1 { v0.s }[1], [x1]
        ld1 { v0.s }[2], [x2]
        ld1 { v0.s }[3], [x3]
        ret
which is what current Clang does.
Getting the right code for f3 and g3 depends on the fix for PR95962.
There's a reasonable chance that fixing PR95962 will be enough on its
own to fix f3 and g3, but I included them just in case it isn't.
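The attachment itself is not reproduced in this message. For readers without
access to it, the following is only a guess at the shape of the six functions,
inferred from the assembly above: the argument types, the variable names and
the exact intrinsics are assumptions, not the attached testcase. The #else
branch is a hypothetical stand-in so that the sketch also builds on
non-AArch64 hosts; it is not the real arm_neon.h.

```c
/* Hypothetical reconstruction; the real attachment may differ.  */
#if defined(__aarch64__)
#include <arm_neon.h>
#else
/* Stand-ins built on GNU vector extensions, NOT the real arm_neon.h.  */
typedef float float32x4_t __attribute__ ((vector_size (16)));
static inline float32x4_t vdupq_n_f32 (float x)
{ return (float32x4_t) { x, x, x, x }; }
static inline float32x4_t vsetq_lane_f32 (float x, float32x4_t v, int lane)
{ v[lane] = x; return v; }
static inline float vgetq_lane_f32 (float32x4_t v, int lane)
{ return v[lane]; }
static inline float32x4_t vld1q_dup_f32 (const float *p)
{ return vdupq_n_f32 (*p); }
static inline float32x4_t vld1q_lane_f32 (const float *p, float32x4_t v, int lane)
{ return vsetq_lane_f32 (*p, v, lane); }
#endif

/* f1: start from zero, then set every lane (the zero is dead).  */
float32x4_t f1 (float a, float b, float c, float d)
{
  float32x4_t v = vdupq_n_f32 (0);
  v = vsetq_lane_f32 (a, v, 0);
  v = vsetq_lane_f32 (b, v, 1);
  v = vsetq_lane_f32 (c, v, 2);
  return vsetq_lane_f32 (d, v, 3);
}

/* f2: broadcast the first element, then overwrite lanes 1 to 3.  */
float32x4_t f2 (float a, float b, float c, float d)
{
  float32x4_t v = vdupq_n_f32 (a);
  v = vsetq_lane_f32 (b, v, 1);
  v = vsetq_lane_f32 (c, v, 2);
  return vsetq_lane_f32 (d, v, 3);
}

/* f3: plain vector initializer.  */
float32x4_t f3 (float a, float b, float c, float d)
{
  float32x4_t v = { a, b, c, d };
  return v;
}

/* g1: as f1, but each element is loaded from memory.  */
float32x4_t g1 (const float *a, const float *b, const float *c, const float *d)
{
  float32x4_t v = vdupq_n_f32 (0);
  v = vld1q_lane_f32 (a, v, 0);
  v = vld1q_lane_f32 (b, v, 1);
  v = vld1q_lane_f32 (c, v, 2);
  return vld1q_lane_f32 (d, v, 3);
}

/* g2: as f2, using a load-and-broadcast for the first element.  */
float32x4_t g2 (const float *a, const float *b, const float *c, const float *d)
{
  float32x4_t v = vld1q_dup_f32 (a);
  v = vld1q_lane_f32 (b, v, 1);
  v = vld1q_lane_f32 (c, v, 2);
  return vld1q_lane_f32 (d, v, 3);
}

/* g3: plain vector initializer from loaded values.  */
float32x4_t g3 (const float *a, const float *b, const float *c, const float *d)
{
  float32x4_t v = { *a, *b, *c, *d };
  return v;
}
```

If the reconstruction is close, all six functions build the same vector
{ a, b, c, d }, which is why a single three-instruction sequence per group
would be expected. Compiling with something like "gcc -O2 -S" on an AArch64
target should show whether the output matches the listings above.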
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958
[Bug 95958] [meta-bug] Inefficient arm_neon.h code for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95962
[Bug 95962] Inefficient code for simple arm_neon.h iota operation