[Bug target/89057] [8/9/10/11 Regression] AArch64 ld3 st4 less optimized

2021-01-04 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89057

--- Comment #10 from CVS Commits ---
The master branch has been updated by Richard Sandiford:

https://gcc.gnu.org/g:b41e6dd50f329b0291457e939d4c0dacd81c82c1

commit r11-6439-gb41e6dd50f329b0291457e939d4c0dacd81c82c1
Author: Richard Sandiford 
Date:   Mon Jan 4 11:59:07 2021 +

aarch64: Improve vcombine codegen [PR89057]

This patch fixes a codegen regression in the handling of things like:

  __temp.val[0]                                                          \
    = vcombine_##funcsuffix (__b.val[0],                                 \
                             vcreate_##funcsuffix (__AARCH64_UINT64_C (0))); \

in the 64-bit vst[234] functions.  The zero was forced into a
register at expand time, and we relied on combine to fuse the
zero and combine back together into a single combinez pattern.
The problem is that the zero could be hoisted before combine
gets a chance to do its thing.

gcc/
	PR target/89057
	* config/aarch64/aarch64-simd.md (aarch64_combine): Accept
	aarch64_simd_reg_or_zero for operand 2.  Use the combinez patterns
	to handle zero operands.

gcc/testsuite/
	PR target/89057
	* gcc.target/aarch64/pr89057.c: New test.
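
As a rough illustration of the operation involved (not part of the commit):
combining a 64-bit vector with a zero vector yields a 128-bit vector whose
low half is the input and whose high half is zero.  The helper name
combine_with_zero and the scalar fallback for non-AArch64 hosts are
assumptions made for this sketch.

```c
#include <stdint.h>
#include <string.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: write a 16-byte result whose low 8 bytes are
   lo[0..7] and whose high 8 bytes are zero.  */
void
combine_with_zero (uint8_t out[16], const uint8_t lo[8])
{
#if defined(__aarch64__)
  /* Same shape as the expression used in the 64-bit vst[234] functions.  */
  uint8x16_t v = vcombine_u8 (vld1_u8 (lo), vcreate_u8 (0));
  vst1q_u8 (out, v);
#else
  /* Scalar model of the same result for other hosts (assumption).  */
  memcpy (out, lo, 8);
  memset (out + 8, 0, 8);
#endif
}
```

On AArch64 the zero operand here is the value that, per the description
above, was forced into a register at expand time.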

2020-12-30 Thread rsandifo at gcc dot gnu.org via Gcc-bugs

rsandifo at gcc dot gnu.org  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org|rsandifo at gcc dot gnu.org

--- Comment #9 from rsandifo at gcc dot gnu.org ---
I think the fix is to make aarch64_combine aware of
aarch64_combinez{,_be} (i.e. the special case of the
second vector input being zero).  Testing a patch.

2020-12-02 Thread abhiraj.garakapati at gmail dot com via Gcc-bugs

--- Comment #8 from Abhiraj Garakapati ---
I also cross-checked this with the latest GCC version, GCC-8.4.0: with the
"aarch64-simd.md" changes mentioned above reverted, it produces the expected
output, the same as GCC-7.3.0.

2020-11-30 Thread abhiraj.garakapati at gmail dot com via Gcc-bugs

Abhiraj Garakapati  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |abhiraj.garakapati at gmail dot com

--- Comment #7 from Abhiraj Garakapati ---
This issue appears during RTL expansion (the test1.cpp.234r.expand dump, i.e.
the GIMPLE-to-RTL conversion) with -O1 enabled.  It is seen at -O1, -O2 and
-O3, but not at -O0.

Each of the three GIMPLE statements below is converted to two move
instructions during the GIMPLE-to-RTL conversion.  This does not happen in
GCC-7.3.0; it is first seen in GCC-8.1.0, due to the patch:
https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=a977dc0c5e069bf198f78ed4767deac369904301
  _68 = __builtin_aarch64_combinev8qi (_67, { 0, 0, 0, 0, 0, 0, 0, 0 });
  _69 = __builtin_aarch64_combinev8qi (_66, { 0, 0, 0, 0, 0, 0, 0, 0 });
  _70 = __builtin_aarch64_combinev8qi (_65, { 0, 0, 0, 0, 0, 0, 0, 0 });

As a workaround, the issue can be avoided by adding
"-fno-move-loop-invariants".

The issue can be fixed in GCC-8.1.0 by reverting the "aarch64-simd.md"
changes from the patch:
https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=a977dc0c5e069bf198f78ed4767deac369904301

I also cross-checked a toolchain newly built with the "aarch64-simd.md"
changes reverted against the above-mentioned test case and got the expected
output, the same as GCC-7.3.0.

With GCC 8.1 and the "aarch64-simd.md" changes reverted, the inner loop is:
.L5:
ld3 {v4.8b-v6.8b}, [x1]
add x1, x1, #0x18
mov v0.8b, v6.8b
mov v1.8b, v5.8b
mov v2.8b, v4.8b
mov v3.16b, v7.16b
st4 {v0.8b-v3.8b}, [x0]
add x0, x0, 32
cmp x3, x0
bhi .L5
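
For reference, a loop of roughly the following shape would be expected to
give an ld3/st4 inner loop like the one above.  This is a sketch inferred from
the assembly, not the test case attached to the bug: the function name
rgb_to_bgrx, its parameters, and the scalar fallback for non-AArch64 hosts are
assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical reproducer: convert n_pixels 3-byte pixels to 4-byte pixels,
   reversing the channel order and filling the fourth channel with `fill'.  */
void
rgb_to_bgrx (uint8_t *dst, const uint8_t *src, size_t n_pixels, uint8_t fill)
{
  size_t i = 0;
#if defined(__aarch64__)
  uint8x8_t x = vdup_n_u8 (fill);	/* Loop-invariant fourth channel.  */
  for (; i + 8 <= n_pixels; i += 8)
    {
      uint8x8x3_t in = vld3_u8 (src + 3 * i);	/* ld3: 24 bytes in.  */
      uint8x8x4_t out;
      out.val[0] = in.val[2];
      out.val[1] = in.val[1];
      out.val[2] = in.val[0];
      out.val[3] = x;
      vst4_u8 (dst + 4 * i, out);		/* st4: 32 bytes out.  */
    }
#endif
  /* Scalar tail; on other hosts this handles every pixel.  */
  for (; i < n_pixels; i++)
    {
      dst[4 * i + 0] = src[3 * i + 2];
      dst[4 * i + 1] = src[3 * i + 1];
      dst[4 * i + 2] = src[3 * i + 0];
      dst[4 * i + 3] = fill;
    }
}
```

The 24-byte and 32-byte strides match the "add x1, x1, #0x18" and
"add x0, x0, 32" instructions in the loop above.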

I also cross-checked with the test case from the same patch:
https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=a977dc0c5e069bf198f78ed4767deac369904301
According to its description, the patch improves code generation for literal
vector construction by expanding and exposing the pattern to RTL optimization
earlier; the previous implementation delayed splitting the pattern until after
reload, which resulted in poor code generation for the following code:

#include "arm_neon.h"
int16x8_t
foo ()
{
  return vcombine_s16 (vdup_n_s16 (0), vdup_n_s16 (8));
}

GCC-8.1.0 -O1 with the "aarch64-simd.md" changes reverted:

foo():
adrp x0, 0 <_Z3foov>
ldr q0, [x0]
ret

So reverting the "aarch64-simd.md" changes does not result in poor code
generation for this test case.
I also cross-checked this with the latest GCC version, GCC-10.2.0.