[Bug target/97875] suboptimal loop vectorization

2021-01-12 Thread clyon at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

Christophe Lyon changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #8 from Christophe Lyon ---
Fixed on trunk

[Bug target/97875] suboptimal loop vectorization

2021-01-12 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

--- Comment #7 from CVS Commits ---
The master branch has been updated by Christophe Lyon:

https://gcc.gnu.org/g:25bef68902f42f414f99626cefb2d3df81de7dc8

commit r11-6616-g25bef68902f42f414f99626cefb2d3df81de7dc8
Author: Christophe Lyon 
Date:   Tue Jan 12 16:47:27 2021 +

arm: Add movmisalign patterns for MVE (PR target/97875)

This patch adds new movmisalign<mode>_mve_load and store patterns for
MVE to help vectorization. They are very similar to their Neon
counterparts, but use different iterators and instructions.

Indeed MVE supports fewer vector modes than Neon, so we use the
MVE_VLD_ST iterator where Neon uses VQX.

Since the supported modes are different from the ones valid for
arithmetic operators, we introduce two new sets of macros:

ARM_HAVE_NEON_<MODE>_LDST
  true if Neon has vector load/store instructions for <MODE>

ARM_HAVE_<MODE>_LDST
  true if any vector extension has vector load/store instructions for
  <MODE>

We move the movmisalign<mode> expander from neon.md to vec-common.md, and
replace the TARGET_NEON enabler with ARM_HAVE_<MODE>_LDST.

The patch also updates the mve-vneg.c test to scan for the better code
generation when loading and storing the vectors involved: it checks
that no 'orr' instruction is generated to cope with misalignment at
runtime.
This test was chosen from among the other MVE tests, but any other should
be OK. Using a plain vector copy loop (dest[i] = a[i]) is not a good
test because the compiler chooses to call memcpy instead.

For instance we now generate:
test_vneg_s32x4:
vldrw.32   q3, [r1]
vneg.s32  q3, q3
vstrw.32   q3, [r0]
bx  lr

instead of:
test_vneg_s32x4:
orr r3, r1, r0
lsls    r3, r3, #28
bne .L15
vldrw.32    q3, [r1]
vneg.s32    q3, q3
vstrw.32    q3, [r0]
bx  lr
.L15:
push    {r4, r5}
ldrd    r2, r3, [r1, #8]
ldrd    r5, r4, [r1]
rsbs    r2, r2, #0
rsbs    r5, r5, #0
rsbs    r4, r4, #0
rsbs    r3, r3, #0
strd    r5, r4, [r0]
pop {r4, r5}
strd    r2, r3, [r0, #8]
bx  lr

2021-01-12  Christophe Lyon  

PR target/97875
gcc/
* config/arm/arm.h (ARM_HAVE_NEON_V8QI_LDST): New macro.
(ARM_HAVE_NEON_V16QI_LDST, ARM_HAVE_NEON_V4HI_LDST): Likewise.
(ARM_HAVE_NEON_V8HI_LDST, ARM_HAVE_NEON_V2SI_LDST): Likewise.
(ARM_HAVE_NEON_V4SI_LDST, ARM_HAVE_NEON_V4HF_LDST): Likewise.
(ARM_HAVE_NEON_V8HF_LDST, ARM_HAVE_NEON_V4BF_LDST): Likewise.
(ARM_HAVE_NEON_V8BF_LDST, ARM_HAVE_NEON_V2SF_LDST): Likewise.
(ARM_HAVE_NEON_V4SF_LDST, ARM_HAVE_NEON_DI_LDST): Likewise.
(ARM_HAVE_NEON_V2DI_LDST): Likewise.
(ARM_HAVE_V8QI_LDST, ARM_HAVE_V16QI_LDST): Likewise.
(ARM_HAVE_V4HI_LDST, ARM_HAVE_V8HI_LDST): Likewise.
(ARM_HAVE_V2SI_LDST, ARM_HAVE_V4SI_LDST, ARM_HAVE_V4HF_LDST):
Likewise.
(ARM_HAVE_V8HF_LDST, ARM_HAVE_V4BF_LDST, ARM_HAVE_V8BF_LDST):
Likewise.
(ARM_HAVE_V2SF_LDST, ARM_HAVE_V4SF_LDST, ARM_HAVE_DI_LDST):
Likewise.
(ARM_HAVE_V2DI_LDST): Likewise.
* config/arm/mve.md (*movmisalign<mode>_mve_store): New pattern.
(*movmisalign<mode>_mve_load): New pattern.
* config/arm/neon.md (movmisalign<mode>): Move to ...
* config/arm/vec-common.md: ... here.

PR target/97875
gcc/testsuite/
* gcc.target/arm/simd/mve-vneg.c: Update test.

[Bug target/97875] suboptimal loop vectorization

2020-12-10 Thread clyon at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

Christophe Lyon changed:

   What|Removed |Added

 Status|WAITING |ASSIGNED

--- Comment #6 from Christophe Lyon ---
Indeed enabling movmisalign for MVE greatly helps.

[Bug target/97875] suboptimal loop vectorization

2020-12-09 Thread clyon at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

--- Comment #5 from Christophe Lyon ---
Interestingly, if I make arm_builtin_support_vector_misalignment() behave the
same for MVE as it does for Neon, the generated code (with __restrict__) becomes:
test_vsub_i32:
@ args = 0, pretend = 0, frame = 16
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
push    {r4, r5, r6, r7, r8, r9, r10, fp}   @ 61[c=8 l=4]  *push_multi
ldrd    r10, fp, [r1, #8]   @ 75[c=8 l=4]  *thumb2_ldrd
ldrd    r6, r7, [r2, #8]    @ 76[c=8 l=4]  *thumb2_ldrd
ldr r4, [r2]    @ 14[c=12 l=4]  *thumb2_movsi_vfp/5
ldr r8, [r1]    @ 9 [c=12 l=4]  *thumb2_movsi_vfp/6
ldr r9, [r1, #4]    @ 10[c=12 l=4]  *thumb2_movsi_vfp/6
ldr r5, [r2, #4]    @ 15[c=12 l=4]  *thumb2_movsi_vfp/5
vmov    d6, r8, r9  @ v4si  @ 35[c=4 l=8]  *mve_movv4si/1
vmov    d7, r10, fp
vmov    d4, r4, r5  @ v4si  @ 36[c=4 l=8]  *mve_movv4si/1
vmov    d5, r6, r7
sub sp, sp, #16 @ 62[c=4 l=4]  *arm_addsi3/11
mov r3, sp  @ 37[c=4 l=2]  *thumb2_movsi_vfp/0
vsub.i32    q3, q3, q2  @ 18[c=80 l=4]  mve_vsubqv4si
vstrw.32    q3, [r3]    @ 34[c=4 l=4]  *mve_movv4si/7
ldrd    r4, r1, [sp]    @ 77[c=8 l=4]  *thumb2_ldrd_base
ldrd    r2, r3, [sp, #8]    @ 78[c=8 l=4]  *thumb2_ldrd
strd    r4, r1, [r0]    @ 79[c=8 l=4]  *thumb2_strd_base
strd    r2, r3, [r0, #8]    @ 80[c=8 l=4]  *thumb2_strd
add sp, sp, #16 @ 66[c=4 l=4]  *arm_addsi3/5
@ sp needed @ 67[c=8 l=0]  force_register_use
pop {r4, r5, r6, r7, r8, r9, r10, fp}   @ 68[c=8 l=4]  *load_multiple_with_writeback
bx  lr  @ 69[c=8 l=4]  *thumb2_return


The Neon version has:
vld1.32 {q8}, [r1]  @ 8 [c=8 l=4]  *movmisalignv4si_neon_load
vld1.32 {q9}, [r2]  @ 9 [c=8 l=4]  *movmisalignv4si_neon_load
vsub.i32    q8, q8, q9  @ 10[c=80 l=4]  *subv4si3_neon
vst1.32 {q8}, [r0]  @ 11[c=8 l=4]  *movmisalignv4si_neon_store
bx  lr  @ 21[c=8 l=4]  *thumb2_return

So it seems MVE needs a movmisalign pattern.

[Bug target/97875] suboptimal loop vectorization

2020-12-09 Thread clyon at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

--- Comment #4 from Christophe Lyon ---

In both cases (Neon and MVE), DR_TARGET_ALIGNMENT is 8, so the decision to emit
a useless loop tail comes from elsewhere.

And yes, MVE vldrw.32 and vstrw.32 share the same alignment properties.

[Bug target/97875] suboptimal loop vectorization

2020-11-18 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

Richard Biener changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #3 from Richard Biener ---
That would then point to DR_TARGET_ALIGNMENT being wrong here.  Now, not sure
whether we can guarantee to pick the "correct" instruction at RTL expansion but
surely the vectorizer can elide the runtime alignment check and emit
appropriately aligned (to vector element) vector loads / stores here.

You mention vldrw.32, but I assume the same applies to vstrw.32.

[Bug target/97875] suboptimal loop vectorization

2020-11-17 Thread clyon at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

--- Comment #2 from Christophe Lyon ---
Checking the Arm v8-M manual, my understanding is that this architecture does
not support unaligned vector loads/stores.

However, vldrw.32 can load from addresses that are only 32-bit aligned, which
is the case here since a and b are pointers to int32_t.