[Bug target/97875] suboptimal loop vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

Christophe Lyon changed:

           What    |Removed     |Added
        ------------------------------------
           Status  |ASSIGNED    |RESOLVED
         Resolution|---         |FIXED

--- Comment #8 from Christophe Lyon ---
Fixed on trunk.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

--- Comment #7 from CVS Commits ---
The master branch has been updated by Christophe Lyon:

https://gcc.gnu.org/g:25bef68902f42f414f99626cefb2d3df81de7dc8

commit r11-6616-g25bef68902f42f414f99626cefb2d3df81de7dc8
Author: Christophe Lyon
Date:   Tue Jan 12 16:47:27 2021 +

    arm: Add movmisalign patterns for MVE (PR target/97875)

    This patch adds new movmisalign<mode>_mve_load and store patterns for
    MVE to help vectorization. They are very similar to their Neon
    counterparts, but use different iterators and instructions.

    Indeed MVE supports fewer vector modes than Neon, so we use the
    MVE_VLD_ST iterator where Neon uses VQX.

    Since the supported modes are different from the ones valid for
    arithmetic operators, we introduce two new sets of macros:

    ARM_HAVE_NEON_<MODE>_LDST
      true if Neon has vector load/store instructions for <MODE>
    ARM_HAVE_<MODE>_LDST
      true if any vector extension has vector load/store instructions
      for <MODE>

    We move the movmisalign<mode> expander from neon.md to vec-common.md,
    and replace the TARGET_NEON enabler with ARM_HAVE_<MODE>_LDST.

    The patch also updates the mve-vneg.c test to scan for the better code
    generation when loading and storing the vectors involved: it checks
    that no 'orr' instruction is generated to cope with misalignment at
    runtime. This test was chosen among the other MVE tests, but any other
    should be OK. Using a plain vector copy loop (dest[i] = a[i]) is not a
    good test because the compiler chooses to use memcpy.

    For instance we now generate:
    test_vneg_s32x4:
            vldrw.32  q3, [r1]
            vneg.s32  q3, q3
            vstrw.32  q3, [r0]
            bx        lr

    instead of:
    test_vneg_s32x4:
            orr       r3, r1, r0
            lsls      r3, r3, #28
            bne       .L15
            vldrw.32  q3, [r1]
            vneg.s32  q3, q3
            vstrw.32  q3, [r0]
            bx        lr
    .L15:
            push      {r4, r5}
            ldrd      r2, r3, [r1, #8]
            ldrd      r5, r4, [r1]
            rsbs      r2, r2, #0
            rsbs      r5, r5, #0
            rsbs      r4, r4, #0
            rsbs      r3, r3, #0
            strd      r5, r4, [r0]
            pop       {r4, r5}
            strd      r2, r3, [r0, #8]
            bx        lr

    2021-01-12  Christophe Lyon

    PR target/97875
    gcc/
    * config/arm/arm.h (ARM_HAVE_NEON_V8QI_LDST): New macro.
    (ARM_HAVE_NEON_V16QI_LDST, ARM_HAVE_NEON_V4HI_LDST): Likewise.
    (ARM_HAVE_NEON_V8HI_LDST, ARM_HAVE_NEON_V2SI_LDST): Likewise.
    (ARM_HAVE_NEON_V4SI_LDST, ARM_HAVE_NEON_V4HF_LDST): Likewise.
    (ARM_HAVE_NEON_V8HF_LDST, ARM_HAVE_NEON_V4BF_LDST): Likewise.
    (ARM_HAVE_NEON_V8BF_LDST, ARM_HAVE_NEON_V2SF_LDST): Likewise.
    (ARM_HAVE_NEON_V4SF_LDST, ARM_HAVE_NEON_DI_LDST): Likewise.
    (ARM_HAVE_NEON_V2DI_LDST): Likewise.
    (ARM_HAVE_V8QI_LDST, ARM_HAVE_V16QI_LDST): Likewise.
    (ARM_HAVE_V4HI_LDST, ARM_HAVE_V8HI_LDST): Likewise.
    (ARM_HAVE_V2SI_LDST, ARM_HAVE_V4SI_LDST, ARM_HAVE_V4HF_LDST): Likewise.
    (ARM_HAVE_V8HF_LDST, ARM_HAVE_V4BF_LDST, ARM_HAVE_V8BF_LDST): Likewise.
    (ARM_HAVE_V2SF_LDST, ARM_HAVE_V4SF_LDST, ARM_HAVE_DI_LDST): Likewise.
    (ARM_HAVE_V2DI_LDST): Likewise.
    * config/arm/mve.md (*movmisalign<mode>_mve_store): New pattern.
    (*movmisalign<mode>_mve_load): New pattern.
    * config/arm/neon.md (movmisalign<mode>): Move to ...
    * config/arm/vec-common.md: ... here.

    PR target/97875
    gcc/testsuite/
    * gcc.target/arm/simd/mve-vneg.c: Update test.
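For reference, the assembly above corresponds to a function of roughly the following shape: a plain C loop negating four int32 lanes. The function name matches the generated code shown in the commit message; the exact body in gcc.target/arm/simd/mve-vneg.c may differ, so treat this as an illustrative sketch.

```c
#include <stdint.h>

/* Negate four int32_t lanes element-wise.  With
   -march=armv8.1-m.main+mve -O2, the patch above lets this loop be
   vectorized into a single vldrw.32 / vneg.s32 / vstrw.32 sequence,
   with no runtime misalignment check.  */
void test_vneg_s32x4(int32_t *dest, int32_t *a)
{
  for (int i = 0; i < 4; i++)
    dest[i] = -a[i];
}
```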
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

Christophe Lyon changed:

           What    |Removed     |Added
        ------------------------------------
           Status  |WAITING     |ASSIGNED

--- Comment #6 from Christophe Lyon ---
Indeed, enabling movmisalign for MVE greatly helps.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

--- Comment #5 from Christophe Lyon ---
Interestingly, if I make arm_builtin_support_vector_misalignment() behave the
same for MVE and Neon, the generated code (with __restrict__) becomes:

test_vsub_i32:
        @ args = 0, pretend = 0, frame = 16
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        push    {r4, r5, r6, r7, r8, r9, r10, fp} @ 61 [c=8 l=4]  *push_multi
        ldrd    r10, fp, [r1, #8]   @ 75 [c=8 l=4]   *thumb2_ldrd
        ldrd    r6, r7, [r2, #8]    @ 76 [c=8 l=4]   *thumb2_ldrd
        ldr     r4, [r2]            @ 14 [c=12 l=4]  *thumb2_movsi_vfp/5
        ldr     r8, [r1]            @ 9  [c=12 l=4]  *thumb2_movsi_vfp/6
        ldr     r9, [r1, #4]        @ 10 [c=12 l=4]  *thumb2_movsi_vfp/6
        ldr     r5, [r2, #4]        @ 15 [c=12 l=4]  *thumb2_movsi_vfp/5
        vmov    d6, r8, r9  @ v4si  @ 35 [c=4 l=8]   *mve_movv4si/1
        vmov    d7, r10, fp
        vmov    d4, r4, r5  @ v4si  @ 36 [c=4 l=8]   *mve_movv4si/1
        vmov    d5, r6, r7
        sub     sp, sp, #16         @ 62 [c=4 l=4]   *arm_addsi3/11
        mov     r3, sp              @ 37 [c=4 l=2]   *thumb2_movsi_vfp/0
        vsub.i32  q3, q3, q2        @ 18 [c=80 l=4]  mve_vsubqv4si
        vstrw.32  q3, [r3]          @ 34 [c=4 l=4]   *mve_movv4si/7
        ldrd    r4, r1, [sp]        @ 77 [c=8 l=4]   *thumb2_ldrd_base
        ldrd    r2, r3, [sp, #8]    @ 78 [c=8 l=4]   *thumb2_ldrd
        strd    r4, r1, [r0]        @ 79 [c=8 l=4]   *thumb2_strd_base
        strd    r2, r3, [r0, #8]    @ 80 [c=8 l=4]   *thumb2_strd
        add     sp, sp, #16         @ 66 [c=4 l=4]   *arm_addsi3/5
        @ sp needed                 @ 67 [c=8 l=0]   force_register_use
        pop     {r4, r5, r6, r7, r8, r9, r10, fp} @ 68 [c=8 l=4]  *load_multiple_with_writeback
        bx      lr                  @ 69 [c=8 l=4]   *thumb2_return

The Neon version has:
        vld1.32 {q8}, [r1]          @ 8  [c=8 l=4]   *movmisalignv4si_neon_load
        vld1.32 {q9}, [r2]          @ 9  [c=8 l=4]   *movmisalignv4si_neon_load
        vsub.i32  q8, q8, q9        @ 10 [c=80 l=4]  *subv4si3_neon
        vst1.32 {q8}, [r0]          @ 11 [c=8 l=4]   *movmisalignv4si_neon_store
        bx      lr                  @ 21 [c=8 l=4]   *thumb2_return

So it seems MVE needs a movmisalign pattern.
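The source being compiled is presumably something like the following element-wise subtraction (an assumed reconstruction from the function name in the assembly above): the __restrict__ qualifiers tell the vectorizer the arrays cannot overlap, so no overlap check is needed.

```c
#include <stdint.h>

/* Element-wise subtraction of four int32_t lanes.  Without a
   movmisalign pattern, MVE spills through the stack with scalar
   ldrd/strd as shown above; Neon instead emits
   vld1.32 / vsub.i32 / vst1.32.  */
void test_vsub_i32(int32_t *__restrict__ dest,
                   int32_t *__restrict__ a,
                   int32_t *__restrict__ b)
{
  for (int i = 0; i < 4; i++)
    dest[i] = a[i] - b[i];
}
```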
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

--- Comment #4 from Christophe Lyon ---
In both cases (Neon and MVE), DR_TARGET_ALIGNMENT is 8, so the decision to
emit a useless loop tail comes from elsewhere.

And yes, MVE vldrw.32 and vstrw.32 share the same alignment properties.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

Richard Biener changed:

           What    |Removed     |Added
        ------------------------------------
           CC      |            |rsandifo at gcc dot gnu.org

--- Comment #3 from Richard Biener ---
That would then point to DR_TARGET_ALIGNMENT being wrong here. Now, I am not
sure whether we can guarantee to pick the "correct" instruction at RTL
expansion, but surely the vectorizer can elide the runtime alignment check and
emit vector loads/stores appropriately aligned (to the vector element) here.

You mention vldrw.32, but I assume the same applies to vstrw.32.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

--- Comment #2 from Christophe Lyon ---
Checking the Arm v8-M manual, my understanding is that this architecture does
not support unaligned vector loads/stores. However, my understanding is also
that vldrw.32 accepts loading from addresses aligned to 32 bits, which is the
case here since a and b are pointers to int32_t.
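The redundant runtime check being discussed in this thread (the "orr r3, r1, r0 / lsls r3, r3, #28 / bne" prologue in the pre-fix code) amounts to OR-ing the two pointers and testing the low four bits, i.e. branching to a scalar tail unless both are 16-byte aligned. A minimal C sketch of that test, with a hypothetical helper name, is below; since vldrw.32 only requires 32-bit alignment for int32_t data, this check is unnecessary.

```c
#include <stdint.h>

/* Equivalent of the emitted check: OR the two addresses together and
   test whether any of the low 4 bits is set (lsls #28 shifts those
   bits into the flags).  Returns nonzero when both pointers are
   16-byte aligned, i.e. the vector path was taken.  */
static int both_16_byte_aligned(const int32_t *a, const int32_t *b)
{
  return (((uintptr_t)a | (uintptr_t)b) & 0xF) == 0;
}
```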