https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124043

            Bug ID: 124043
           Summary: gcc.target/arm/crypto-*_u32.c missed optimization
                    cause test failure
           Product: gcc
           Version: 15.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: azoff at gcc dot gnu.org
  Target Milestone: ---

When running the tests with arch=armv7e-m+fp/float-abi=hard/fpu=auto, the
assembly generated for foo_lane0 in gcc.target/arm/crypto-vsha1cq_u32.c differs
from that of foo_lane[123]: it contains no vdup, so the test's vdup check finds
fewer matches than expected and fails. The assembly below was generated with
arm-none-eabi-gcc  .../crypto-vsha1cq_u32.c -march=armv7e-m+fp
-mfloat-abi=hard -mfpu=crypto-neon-fp-armv8 -O3 -S -o - -dp:

foo_lane0:
        @ args = 0, pretend = 0, frame = 16
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        sub     sp, sp, #16     @ 49    [c=4 l=4]  *arm_addsi3/11
        vst1.64 {d0-d1}, [sp:64]        @ 19    [c=4 l=8]  *neon_movv4si/1
        vld1.32 {d16[], d17[]}, [sp]    @ 11    [c=28 l=4]  neon_vld1_dupv4si
        sha1c.32        q1, q8, q2      @ 12    [c=8 l=4]  crypto_sha1c_lb
        vmov    q0, q1  @ v4si  @ 25    [c=4 l=4]  *neon_movv4si/0
        add     sp, sp, #16     @ 53    [c=4 l=4]  *arm_addsi3/5
        @ sp needed     @ 54    [c=8 l=0]  force_register_use
        bx      lr      @ 55    [c=8 l=4]  *thumb2_return

vs

foo_lane1:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        vmov.32 r3, d0[1]       @ 8     [c=12 l=4]  neon_vec_extractv4sisi/1
        vdup.32 q8, r3  @ 11    [c=16 l=4]  neon_vdup_nv4si/0
        sha1c.32        q1, q8, q2      @ 12    [c=8 l=4]  crypto_sha1c_lb
        vmov    q0, q1  @ v4si  @ 25    [c=4 l=4]  *neon_movv4si/0
        bx      lr      @ 28    [c=8 l=4]  *thumb2_return
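
To make the effect on the vdup check concrete, here is a minimal sketch in
portable C. It assumes the check counts textual matches, in the manner of a
scan-assembler-times directive; count_occurrences and the two string constants
are hypothetical names, and the strings are abridged from the listings above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Abridged from the foo_lane0 listing: the lane is broadcast through the
   stack, so no vdup instruction appears.  */
static const char foo_lane0_asm[] =
    "sub sp, sp, #16\n"
    "vst1.64 {d0-d1}, [sp:64]\n"
    "vld1.32 {d16[], d17[]}, [sp]\n"
    "sha1c.32 q1, q8, q2\n";

/* Abridged from the foo_lane1 listing: the lane is broadcast with vdup.  */
static const char foo_lane1_asm[] =
    "vmov.32 r3, d0[1]\n"
    "vdup.32 q8, r3\n"
    "sha1c.32 q1, q8, q2\n";

/* Hypothetical helper: count non-overlapping occurrences of NEEDLE in TEXT.
   The real testsuite scans with regular expressions; plain substring
   matching is an assumption made to keep the sketch short.  */
static size_t
count_occurrences (const char *text, const char *needle)
{
  assert (text != NULL && needle != NULL);

  size_t len = strlen (needle);
  size_t count = 0;

  if (len == 0)
    return 0;

  for (const char *p = strstr (text, needle); p != NULL;
       p = strstr (p + len, needle))
    count++;

  return count;
}
```

With these definitions, count_occurrences (foo_lane0_asm, "vdup.32") yields 0
and the same call on foo_lane1_asm yields 1, so foo_lane0 contributes nothing
to the total.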

I do not see any reason why foo_lane0 needs to use the stack when the others
do not.
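
The two sequences compute the same broadcast value, which a scalar model in
portable C can show. This is only an illustration, not the testsuite source:
vec4_u32 and both function names are hypothetical stand-ins for a q register
and for the instruction sequences in the listings, and lane 0 is assumed to
sit at the lowest address, as on a little-endian target.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar stand-in for a 128-bit q register (four 32-bit lanes).  */
typedef struct
{
  uint32_t lane[4];
} vec4_u32;

/* Models the foo_lane0 sequence: spill the whole vector to a 16-byte stack
   slot (vst1.64), then load element 0 back into every lane
   (vld1.32 {d16[], d17[]}, [sp]).  */
static vec4_u32
dup_lane0_via_stack (vec4_u32 v)
{
  uint32_t slot[4];
  vec4_u32 result;

  memcpy (slot, v.lane, sizeof slot);	/* store to the stack slot */
  for (int i = 0; i < 4; i++)
    result.lane[i] = slot[0];		/* load-and-replicate from [sp] */

  return result;
}

/* Models the foo_lane1 sequence: move lane N to a core register
   (vmov.32 r3, d0[N]), then replicate it (vdup.32 q8, r3).  */
static vec4_u32
dup_lane_via_register (vec4_u32 v, unsigned n)
{
  assert (n < 4);

  uint32_t r3 = v.lane[n];
  vec4_u32 result;

  for (int i = 0; i < 4; i++)
    result.lane[i] = r3;

  return result;
}
```

For any input v, dup_lane0_via_stack (v) and dup_lane_via_register (v, 0)
return identical vectors, so the stack round trip adds nothing beyond what the
register path would do for lane 0.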

I also see a similar situation for:
gcc.target/arm/crypto-vsha1h_u32.c
gcc.target/arm/crypto-vsha1mq_u32.c
gcc.target/arm/crypto-vsha1pq_u32.c


Should the test cases override -march with an architecture that supports a NEON
FPU, or is foo_lane0 expected to produce assembly similar to foo_lane1?
