https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124043
Bug ID: 124043
Summary: gcc.target/arm/crypto-*_u32.c missed optimization causes test failure
Product: gcc
Version: 15.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: azoff at gcc dot gnu.org
Target Milestone: ---
When running the tests with arch=armv7e-m+fp/float-abi=hard/fpu=auto, the assembly
generated for foo_lane0 in gcc.target/arm/crypto-vsha1cq_u32.c differs from that of
foo_lane[123], so the check for vdup finds fewer matches than expected (assembly
generated with arm-none-eabi-gcc .../crypto-vsha1pq_u32.c -march=armv7e-m+fp
-mfloat-abi=hard -mfpu=crypto-neon-fp-armv8 -O3 -S -o - -dp):
foo_lane0:
@ args = 0, pretend = 0, frame = 16
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
sub sp, sp, #16 @ 49 [c=4 l=4] *arm_addsi3/11
vst1.64 {d0-d1}, [sp:64] @ 19 [c=4 l=8] *neon_movv4si/1
vld1.32 {d16[], d17[]}, [sp] @ 11 [c=28 l=4] neon_vld1_dupv4si
sha1c.32 q1, q8, q2 @ 12 [c=8 l=4] crypto_sha1c_lb
vmov q0, q1 @ v4si @ 25 [c=4 l=4] *neon_movv4si/0
add sp, sp, #16 @ 53 [c=4 l=4] *arm_addsi3/5
@ sp needed @ 54 [c=8 l=0] force_register_use
bx lr @ 55 [c=8 l=4] *thumb2_return
vs
foo_lane1:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
vmov.32 r3, d0[1] @ 8 [c=12 l=4] neon_vec_extractv4sisi/1
vdup.32 q8, r3 @ 11 [c=16 l=4] neon_vdup_nv4si/0
sha1c.32 q1, q8, q2 @ 12 [c=8 l=4] crypto_sha1c_lb
vmov q0, q1 @ v4si @ 25 [c=4 l=4] *neon_movv4si/0
bx lr @ 28 [c=8 l=4] *thumb2_return
I do not see any reason why foo_lane0 needs to use the stack when the others do not.
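As a rough illustration of why the vdup check comes up short, the sketch below counts vdup.32 matches in the two listings above, in the spirit of a scan-assembler-times check. It is written in Python for convenience only. The regular expression and the helper name count_matches are assumptions made for illustration and are not copied from the test's actual dg-final directive.

```python
import re

# Instruction lines copied from the two listings above (trailing comments trimmed).
FOO_LANE0 = """\
sub sp, sp, #16
vst1.64 {d0-d1}, [sp:64]
vld1.32 {d16[], d17[]}, [sp]
sha1c.32 q1, q8, q2
vmov q0, q1
add sp, sp, #16
bx lr
"""

FOO_LANE1 = """\
vmov.32 r3, d0[1]
vdup.32 q8, r3
sha1c.32 q1, q8, q2
vmov q0, q1
bx lr
"""

# Hypothetical pattern: the real directive in the test file may differ.
VDUP_PATTERN = re.compile(r"vdup\.32\s+q[0-9]+,\s*r[0-9]+")


def count_matches(asm, pattern=VDUP_PATTERN):
    """Return how many times the pattern occurs in an assembly listing."""
    return len(pattern.findall(asm))


if __name__ == "__main__":
    for name, asm in (("foo_lane0", FOO_LANE0), ("foo_lane1", FOO_LANE1)):
        print(name, count_matches(asm))
```

Run on the instruction lines above, the sketch finds no vdup.32 in foo_lane0, which broadcasts the lane with a vld1.32 all-lanes load from the stack instead, and one vdup.32 in foo_lane1.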
I also see a similar situation for:
gcc.target/arm/crypto-vsha1h_u32.c
gcc.target/arm/crypto-vsha1mq_u32.c
gcc.target/arm/crypto-vsha1pq_u32.c
Should the test cases override -march with an architecture that can have a NEON FPU,
or is foo_lane0 expected to produce assembly similar to that of foo_lane1?