[PATCH] aarch64: Use type-qualified builtins for vget_low/high intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi,

This patch declares unsigned and polynomial type-qualified builtins for
vget_low_*/vget_high_* Neon intrinsics. Using these builtins removes
the need for many casts in arm_neon.h.
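
For illustration, the change for a single intrinsic looks roughly like the
following sketch. The builtin spellings shown here are assumptions made for
the example only; the real changes are in the attached rb15060.patch:

 /* Illustrative sketch - builtin names are assumed, not taken from the patch. */
 __extension__ extern __inline uint8x8_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vget_low_u8 (uint8x16_t __a)
 {
-  return (uint8x8_t) __builtin_aarch64_get_lowv16qi ((int8x16_t) __a);
+  return __builtin_aarch64_get_lowv16qi_uu (__a);
 }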

Bootstrapped and regression tested on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-11-10  Jonathan Wright  

* config/aarch64/aarch64-builtins.c (TYPES_UNOPP): Define.
* config/aarch64/aarch64-simd-builtins.def: Declare type-
qualified builtins for vget_low/high.
* config/aarch64/arm_neon.h (vget_low_p8): Use type-qualified
builtin and remove casts.
(vget_low_p16): Likewise.
(vget_low_p64): Likewise.
(vget_low_u8): Likewise.
(vget_low_u16): Likewise.
(vget_low_u32): Likewise.
(vget_low_u64): Likewise.
(vget_high_p8): Likewise.
(vget_high_p16): Likewise.
(vget_high_p64): Likewise.
(vget_high_u8): Likewise.
(vget_high_u16): Likewise.
(vget_high_u32): Likewise.
(vget_high_u64): Likewise.
* config/aarch64/iterators.md (VQ_P): New mode iterator.


rb15060.patch
Description: rb15060.patch


[PATCH] aarch64: Use type-qualified builtins for vcombine_* Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi,

This patch declares unsigned and polynomial type-qualified builtins for
vcombine_* Neon intrinsics. Using these builtins removes the need for
many casts in arm_neon.h.

Bootstrapped and regression tested on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-11-10  Jonathan Wright  

* config/aarch64/aarch64-builtins.c (TYPES_COMBINE): Delete.
(TYPES_COMBINEP): Delete.
* config/aarch64/aarch64-simd-builtins.def: Declare type-
qualified builtins for vcombine_* intrinsics.
* config/aarch64/arm_neon.h (vcombine_s8): Remove unnecessary
cast.
(vcombine_s16): Likewise.
(vcombine_s32): Likewise.
(vcombine_f32): Likewise.
(vcombine_u8): Use type-qualified builtin and remove casts.
(vcombine_u16): Likewise.
(vcombine_u32): Likewise.
(vcombine_u64): Likewise.
(vcombine_p8): Likewise.
(vcombine_p16): Likewise.
(vcombine_p64): Likewise.
(vcombine_bf16): Remove unnecessary cast.
* config/aarch64/iterators.md (VDC_I): New mode iterator.
(VDC_P): New mode iterator.


rb15059.patch
Description: rb15059.patch


[PATCH] aarch64: Use type-qualified builtins for LD1/ST1 Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi,

This patch declares unsigned and polynomial type-qualified builtins and
uses them to implement the LD1/ST1 Neon intrinsics. This removes the
need for many casts in arm_neon.h.

The new type-qualified builtins are also lowered to gimple - as the
unqualified builtins are already.
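
As a user-level illustration (not taken from the patch), the effect is that
the unsigned forms should now optimise just like the signed ones:

#include <arm_neon.h>

/* Sketch: the unsigned ld1 builtin is folded to a plain vector load at
   gimple level, as the signed builtin already was, so this should see the
   same optimisation opportunities as the s8 equivalent.  */
uint8x16_t
load_u8 (const uint8_t *p)
{
  return vld1q_u8 (p);
}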

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-11-10  Jonathan Wright  

* config/aarch64/aarch64-builtins.c (TYPES_LOAD1_U): Define.
(TYPES_LOAD1_P): Define.
(TYPES_STORE1_U): Define.
(TYPES_STORE1P): Rename to...
(TYPES_STORE1_P): This.
(get_mem_type_for_load_store): Add unsigned and poly types.
(aarch64_general_gimple_fold_builtin): Add unsigned and poly
type-qualified builtin declarations.
* config/aarch64/aarch64-simd-builtins.def: Declare type-
qualified builtins for LD1/ST1.
* config/aarch64/arm_neon.h (vld1_p8): Use type-qualified
builtin and remove cast.
(vld1_p16): Likewise.
(vld1_u8): Likewise.
(vld1_u16): Likewise.
(vld1_u32): Likewise.
(vld1q_p8): Likewise.
(vld1q_p16): Likewise.
(vld1q_p64): Likewise.
(vld1q_u8): Likewise.
(vld1q_u16): Likewise.
(vld1q_u32): Likewise.
(vld1q_u64): Likewise.
(vst1_p8): Likewise.
(vst1_p16): Likewise.
(vst1_u8): Likewise.
(vst1_u16): Likewise.
(vst1_u32): Likewise.
(vst1q_p8): Likewise.
(vst1q_p16): Likewise.
(vst1q_p64): Likewise.
(vst1q_u8): Likewise.
(vst1q_u16): Likewise.
(vst1q_u32): Likewise.
(vst1q_u64): Likewise.
* config/aarch64/iterators.md (VALLP_NO_DI): New iterator.


rb15058.patch
Description: rb15058.patch


[PATCH] aarch64: Use type-qualified builtins for ADDV Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi,

This patch declares unsigned type-qualified builtins and uses them to
implement the vector reduction Neon intrinsics. This removes the need
for many casts in arm_neon.h.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-11-09  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Declare unsigned
builtins for vector reduction.
* config/aarch64/arm_neon.h (vaddv_u8): Use type-qualified
builtin and remove casts.
(vaddv_u16): Likewise.
(vaddv_u32): Likewise.
(vaddvq_u8): Likewise.
(vaddvq_u16): Likewise.
(vaddvq_u32): Likewise.
(vaddvq_u64): Likewise.


rb15057.patch
Description: rb15057.patch


[PATCH] aarch64: Use type-qualified builtins for ADDP Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi,

This patch declares unsigned type-qualified builtins and uses them to
implement the pairwise addition Neon intrinsics. This removes the need
for many casts in arm_neon.h.

Bootstrapped and regression tested on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-11-09  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Declare unsigned
builtins for pairwise addition.
* config/aarch64/arm_neon.h (vpaddq_u8): Use type-qualified
builtin and remove casts.
(vpaddq_u16): Likewise.
(vpaddq_u32): Likewise.
(vpaddq_u64): Likewise.
(vpadd_u8): Likewise.
(vpadd_u16): Likewise.
(vpadd_u32): Likewise.
(vpaddd_u64): Likewise.


rb15039.patch
Description: rb15039.patch


[PATCH] aarch64: Use type-qualified builtins for [R]SUBHN[2] Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi,

This patch declares unsigned type-qualified builtins and uses them to
implement (rounding) halving-narrowing-subtract Neon intrinsics. This
removes the need for many casts in arm_neon.h.

Bootstrapped and regression tested on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-11-09  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Declare unsigned
builtins for [r]subhn[2].
* config/aarch64/arm_neon.h (vsubhn_s16): Remove unnecessary
cast.
(vsubhn_s32): Likewise.
(vsubhn_s64): Likewise.
(vsubhn_u16): Use type-qualified builtin and remove casts.
(vsubhn_u32): Likewise.
(vsubhn_u64): Likewise.
(vrsubhn_s16): Remove unnecessary cast.
(vrsubhn_s32): Likewise.
(vrsubhn_s64): Likewise.
(vrsubhn_u16): Use type-qualified builtin and remove casts.
(vrsubhn_u32): Likewise.
(vrsubhn_u64): Likewise.
(vrsubhn_high_s16): Remove unnecessary cast.
(vrsubhn_high_s32): Likewise.
(vrsubhn_high_s64): Likewise.
(vrsubhn_high_u16): Use type-qualified builtin and remove
casts.
(vrsubhn_high_u32): Likewise.
(vrsubhn_high_u64): Likewise.
(vsubhn_high_s16): Remove unnecessary cast.
(vsubhn_high_s32): Likewise.
(vsubhn_high_s64): Likewise.
(vsubhn_high_u16): Use type-qualified builtin and remove
casts.
(vsubhn_high_u32): Likewise.
(vsubhn_high_u64): Likewise.


rb15038.patch
Description: rb15038.patch


[PATCH] aarch64: Use type-qualified builtins for [R]ADDHN[2] Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi,

This patch declares unsigned type-qualified builtins and uses them to
implement (rounding) halving-narrowing-add Neon intrinsics. This
removes the need for many casts in arm_neon.h.

Bootstrapped and regression tested on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-11-09  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Declare unsigned
builtins for [r]addhn[2].
* config/aarch64/arm_neon.h (vaddhn_s16): Remove unnecessary
cast.
(vaddhn_s32): Likewise.
(vaddhn_s64): Likewise.
(vaddhn_u16): Use type-qualified builtin and remove casts.
(vaddhn_u32): Likewise.
(vaddhn_u64): Likewise.
(vraddhn_s16): Remove unnecessary cast.
(vraddhn_s32): Likewise.
(vraddhn_s64): Likewise.
(vraddhn_u16): Use type-qualified builtin and remove casts.
(vraddhn_u32): Likewise.
(vraddhn_u64): Likewise.
(vaddhn_high_s16): Remove unnecessary cast.
(vaddhn_high_s32): Likewise.
(vaddhn_high_s64): Likewise.
(vaddhn_high_u16): Use type-qualified builtin and remove
casts.
(vaddhn_high_u32): Likewise.
(vaddhn_high_u64): Likewise.
(vraddhn_high_s16): Remove unnecessary cast.
(vraddhn_high_s32): Likewise.
(vraddhn_high_s64): Likewise.
(vraddhn_high_u16): Use type-qualified builtin and remove
casts.
(vraddhn_high_u32): Likewise.
(vraddhn_high_u64): Likewise.


rb15037.patch
Description: rb15037.patch


[PATCH] aarch64: Use type-qualified builtins for UHSUB Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi,

This patch declares unsigned type-qualified builtins and uses them to
implement halving-subtract Neon intrinsics. This removes the need for
many casts in arm_neon.h.

Bootstrapped and regression tested on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-11-09  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Use BINOPU type
qualifiers in generator macros for uhsub builtins.
* config/aarch64/arm_neon.h (vhsub_s8): Remove unnecessary
cast.
(vhsub_s16): Likewise.
(vhsub_s32): Likewise.
(vhsub_u8): Use type-qualified builtin and remove casts.
(vhsub_u16): Likewise.
(vhsub_u32): Likewise.
(vhsubq_s8): Remove unnecessary cast.
(vhsubq_s16): Likewise.
(vhsubq_s32): Likewise.
(vhsubq_u8): Use type-qualified builtin and remove casts.
(vhsubq_u16): Likewise.
(vhsubq_u32): Likewise.


rb15036.patch
Description: rb15036.patch


[PATCH] aarch64: Use type-qualified builtins for U[R]HADD Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi,

This patch declares unsigned type-qualified builtins and uses them to
implement (rounding) halving-add Neon intrinsics. This removes the
need for many casts in arm_neon.h.

Bootstrapped and regression tested on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-11-09  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Use BINOPU type
qualifiers in generator macros for u[r]hadd builtins.
* config/aarch64/arm_neon.h (vhadd_s8): Remove unnecessary
cast.
(vhadd_s16): Likewise.
(vhadd_s32): Likewise.
(vhadd_u8): Use type-qualified builtin and remove casts.
(vhadd_u16): Likewise.
(vhadd_u32): Likewise.
(vhaddq_s8): Remove unnecessary cast.
(vhaddq_s16): Likewise.
(vhaddq_s32): Likewise.
(vhaddq_u8): Use type-qualified builtin and remove casts.
(vhaddq_u16): Likewise.
(vhaddq_u32): Likewise.
(vrhadd_s8): Remove unnecessary cast.
(vrhadd_s16): Likewise.
(vrhadd_s32): Likewise.
(vrhadd_u8): Use type-qualified builtin and remove casts.
(vrhadd_u16): Likewise.
(vrhadd_u32): Likewise.
(vrhaddq_s8): Remove unnecessary cast.
(vrhaddq_s16): Likewise.
(vrhaddq_s32): Likewise.
(vrhaddq_u8): Use type-qualified builtin and remove casts.
(vrhaddq_u16): Likewise.
(vrhaddq_u32): Likewise.


rb15035.patch
Description: rb15035.patch


[PATCH] aarch64: Use type-qualified builtins for USUB[LW][2] Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi,

This patch declares unsigned type-qualified builtins and uses them to
implement widening-subtract Neon intrinsics. This removes the need
for many casts in arm_neon.h.

Bootstrapped and regression tested on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-11-09  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Use BINOPU type
qualifiers in generator macros for usub[lw][2] builtins.
* config/aarch64/arm_neon.h (vsubl_s8): Remove unnecessary
cast.
(vsubl_s16): Likewise.
(vsubl_s32): Likewise.
(vsubl_u8): Use type-qualified builtin and remove casts.
(vsubl_u16): Likewise.
(vsubl_u32): Likewise.
(vsubl_high_s8): Remove unnecessary cast.
(vsubl_high_s16): Likewise.
(vsubl_high_s32): Likewise.
(vsubl_high_u8): Use type-qualified builtin and remove casts.
(vsubl_high_u16): Likewise.
(vsubl_high_u32): Likewise.
(vsubw_s8): Remove unnecessary casts.
(vsubw_s16): Likewise.
(vsubw_s32): Likewise.
(vsubw_u8): Use type-qualified builtin and remove casts.
(vsubw_u16): Likewise.
(vsubw_u32): Likewise.
(vsubw_high_s8): Remove unnecessary cast.
(vsubw_high_s16): Likewise.
(vsubw_high_s32): Likewise.
(vsubw_high_u8): Use type-qualified builtin and remove casts.
(vsubw_high_u16): Likewise.
(vsubw_high_u32): Likewise.


rb15034.patch
Description: rb15034.patch


[PATCH] aarch64: Use type-qualified builtins for UADD[LW][2] Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi,

This patch declares unsigned type-qualified builtins and uses them to
implement widening-add Neon intrinsics. This removes the need for
many casts in arm_neon.h.

Bootstrapped and regression tested on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-11-09  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Use BINOPU type
qualifiers in generator macros for uadd[lw][2] builtins.
* config/aarch64/arm_neon.h (vaddl_s8): Remove unnecessary
cast.
(vaddl_s16): Likewise.
(vaddl_s32): Likewise.
(vaddl_u8): Use type-qualified builtin and remove casts.
(vaddl_u16): Likewise.
(vaddl_u32): Likewise.
(vaddl_high_s8): Remove unnecessary cast.
(vaddl_high_s16): Likewise.
(vaddl_high_s32): Likewise.
(vaddl_high_u8): Use type-qualified builtin and remove casts.
(vaddl_high_u16): Likewise.
(vaddl_high_u32): Likewise.
(vaddw_s8): Remove unnecessary cast.
(vaddw_s16): Likewise.
(vaddw_s32): Likewise.
(vaddw_u8): Use type-qualified builtin and remove casts.
(vaddw_u16): Likewise.
(vaddw_u32): Likewise.
(vaddw_high_s8): Remove unnecessary cast.
(vaddw_high_s16): Likewise.
(vaddw_high_s32): Likewise.
(vaddw_high_u8): Use type-qualified builtin and remove casts.
(vaddw_high_u16): Likewise.
(vaddw_high_u32): Likewise.


rb15033.patch
Description: rb15033.patch


[PATCH] aarch64: Use type-qualified builtins for [R]SHRN[2] Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi,

This patch declares unsigned type-qualified builtins and uses them for
[R]SHRN[2] Neon intrinsics. This removes the need for casts in
arm_neon.h.

Bootstrapped and regression tested on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-11-08  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Declare type-
qualified builtins for [R]SHRN[2].
* config/aarch64/arm_neon.h (vshrn_n_u16): Use type-qualified
builtin and remove casts.
(vshrn_n_u32): Likewise.
(vshrn_n_u64): Likewise.
(vrshrn_high_n_u16): Likewise.
(vrshrn_high_n_u32): Likewise.
(vrshrn_high_n_u64): Likewise.
(vrshrn_n_u16): Likewise.
(vrshrn_n_u32): Likewise.
(vrshrn_n_u64): Likewise.
(vshrn_high_n_u16): Likewise.
(vshrn_high_n_u32): Likewise.
(vshrn_high_n_u64): Likewise.


rb15032.patch
Description: rb15032.patch


[PATCH] aarch64: Use type-qualified builtins for XTN[2] Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi,

This patch declares unsigned type-qualified builtins and uses them for
XTN[2] Neon intrinsics. This removes the need for casts in arm_neon.h.

Bootstrapped and regression tested on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-11-08  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Declare unsigned
type-qualified builtins for XTN[2].
* config/aarch64/arm_neon.h (vmovn_high_u16): Use type-
qualified builtin and remove casts.
(vmovn_high_u32): Likewise.
(vmovn_high_u64): Likewise.
(vmovn_u16): Likewise.
(vmovn_u32): Likewise.
(vmovn_u64): Likewise.


rb15031.patch
Description: rb15031.patch


[PATCH] aarch64: Use type-qualified builtins for PMUL[L] Neon intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi,

This patch declares poly type-qualified builtins and uses them for
PMUL[L] Neon intrinsics. This removes the need for casts in arm_neon.h.

Bootstrapped and regression tested on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-11-08  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Use poly type
qualifier in builtin generator macros.
* config/aarch64/arm_neon.h (vmul_p8): Use type-qualified
builtin and remove casts.
(vmulq_p8): Likewise.
(vmull_high_p8): Likewise.
(vmull_p8): Likewise.


rb15030.patch
Description: rb15030.patch


[PATCH] aarch64: Use type-qualified builtins for unsigned MLA/MLS intrinsics

2021-11-11 Thread Jonathan Wright via Gcc-patches
Hi,

This patch declares type-qualified builtins and uses them for MLA/MLS
Neon intrinsics that operate on unsigned types. This eliminates lots of
casts in arm_neon.h.

Bootstrapped and regression tested on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-11-08  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Declare type-
qualified builtin generators for unsigned MLA/MLS intrinsics.
* config/aarch64/arm_neon.h (vmla_n_u16): Use type-qualified
builtin.
(vmla_n_u32): Likewise.
(vmla_u8): Likewise.
(vmla_u16): Likewise.
(vmla_u32): Likewise.
(vmlaq_n_u16): Likewise.
(vmlaq_n_u32): Likewise.
(vmlaq_u8): Likewise.
(vmlaq_u16): Likewise.
(vmlaq_u32): Likewise.
(vmls_n_u16): Likewise.
(vmls_n_u32): Likewise.
(vmls_u8): Likewise.
(vmls_u16): Likewise.
(vmls_u32): Likewise.
(vmlsq_n_u16): Likewise.
(vmlsq_n_u32): Likewise.
(vmlsq_u8): Likewise.
(vmlsq_u16): Likewise.
(vmlsq_u32): Likewise.


rb15027.patch
Description: rb15027.patch


Re: [PATCH 4/6 V2] aarch64: Add machine modes for Neon vector-tuple types

2021-11-02 Thread Jonathan Wright via Gcc-patches
Hi,

Each of the comments on the previous version of the patch have been
addressed.

Ok for master?

Thanks,
Jonathan


From: Richard Sandiford 
Sent: 22 October 2021 16:13
To: Jonathan Wright 
Cc: gcc-patches@gcc.gnu.org ; Kyrylo Tkachov 

Subject: Re: [PATCH 4/6] aarch64: Add machine modes for Neon vector-tuple types 
 
Thanks a lot for doing this.

Jonathan Wright  writes:
> @@ -763,9 +839,16 @@ aarch64_lookup_simd_builtin_type (machine_mode mode,
>  return aarch64_simd_builtin_std_type (mode, q);
>  
>    for (i = 0; i < nelts; i++)
> -    if (aarch64_simd_types[i].mode == mode
> - && aarch64_simd_types[i].q == q)
> -  return aarch64_simd_types[i].itype;
> +    {
> +  if (aarch64_simd_types[i].mode == mode
> +   && aarch64_simd_types[i].q == q)
> + return aarch64_simd_types[i].itype;
> +  else if (aarch64_simd_tuple_types[i][0] != NULL_TREE)

Very minor (sorry for not noticing earlier), but: the “else” is
redundant here.

> + for (int j = 0; j < 3; j++)
> +   if (TYPE_MODE (aarch64_simd_tuple_types[i][j]) == mode
> +   && aarch64_simd_types[i].q == q)
> + return aarch64_simd_tuple_types[i][j];
> +    }
>  
>    return NULL_TREE;
>  }
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> 48eddf64e05afe3788abfa05141f6544a9323ea1..0aa185b67ff13d40c87db0449aec312929ff5387
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -6636,162 +6636,165 @@
>  
>  ;; Patterns for vector struct loads and stores.
>  
> -(define_insn "aarch64_simd_ld2"
> -  [(set (match_operand:OI 0 "register_operand" "=w")
> - (unspec:OI [(match_operand:OI 1 "aarch64_simd_struct_operand" "Utv")
> - (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
> -    UNSPEC_LD2))]
> +(define_insn "aarch64_simd_ld2"
> +  [(set (match_operand:VSTRUCT_2Q 0 "register_operand" "=w")
> + (unspec:VSTRUCT_2Q [
> +   (match_operand:VSTRUCT_2Q 1 "aarch64_simd_struct_operand" "Utv")]
> +   UNSPEC_LD2))]
>    "TARGET_SIMD"
>    "ld2\\t{%S0. - %T0.}, %1"
>    [(set_attr "type" "neon_load2_2reg")]
>  )
>  
> -(define_insn "aarch64_simd_ld2r"
> -  [(set (match_operand:OI 0 "register_operand" "=w")
> -   (unspec:OI [(match_operand:BLK 1 "aarch64_simd_struct_operand" "Utv")
> -   (unspec:VALLDIF [(const_int 0)] UNSPEC_VSTRUCTDUMMY) ]
> -  UNSPEC_LD2_DUP))]
> +(define_insn "aarch64_simd_ld2r"
> +  [(set (match_operand:VSTRUCT_2QD 0 "register_operand" "=w")
> + (unspec:VSTRUCT_2QD [
> +   (match_operand:VSTRUCT_2QD 1 "aarch64_simd_struct_operand" "Utv")]
> +  UNSPEC_LD2_DUP))]

Sorry again for missing this, but the ld2rs, ld3rs and ld4rs should
keep their BLKmode arguments, since they only access 2, 3 or 4
scalar memory elements.

> @@ -7515,10 +7605,10 @@
>  )
>  
>  (define_insn_and_split "aarch64_combinev16qi"
> -  [(set (match_operand:OI 0 "register_operand" "=w")
> - (unspec:OI [(match_operand:V16QI 1 "register_operand" "w")
> - (match_operand:V16QI 2 "register_operand" "w")]
> -    UNSPEC_CONCAT))]
> +  [(set (match_operand:V2x16QI 0 "register_operand" "=w")
> + (unspec:V2x16QI [(match_operand:V16QI 1 "register_operand" "w")
> +  (match_operand:V16QI 2 "register_operand" "w")]
> + UNSPEC_CONCAT))]

Just realised that we can now make this a vec_concat, since the
modes are finally self-consistent.

No need to do that though, either way is fine.

Looks good otherwise.

Richard


[PATCH 4/6] aarch64: Add machine modes for Neon vector-tuple types

2021-10-22 Thread Jonathan Wright via Gcc-patches
Hi,

Until now, GCC has used large integer machine modes (OI, CI and XI)
to model Neon vector-tuple types. This is suboptimal for many
reasons, the most notable are:

 1) Large integer modes are opaque and modifying one vector in the
    tuple requires a lot of inefficient set/get gymnastics. The
    result is a lot of superfluous move instructions.
 2) Large integer modes do not map well to types that are tuples of
    64-bit vectors - we need additional zero-padding which again
    results in superfluous move instructions.

This patch adds new machine modes that better model the C-level Neon
vector-tuple types. The approach is somewhat similar to that already
used for SVE vector-tuple types.

All of the AArch64 backend patterns and builtins that manipulate Neon
vector tuples are updated to use the new machine modes. This has the
effect of significantly reducing the amount of boiler-plate code in
the arm_neon.h header.
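
As a user-level illustration (not from the patch itself), a tuple value such
as uint8x16x2_t now maps directly onto the new V2x16QI machine mode instead
of an opaque OImode, so code like the following should no longer need any
set/get gymnastics between the load and the return:

#include <arm_neon.h>

/* Sketch: the loaded pair lives in two consecutive Q registers and is
   returned in place; expecting a single ld1 {v0.16b, v1.16b}, [x0] with no
   extra moves is an expectation based on the description above, not a
   guarantee taken from the patch.  */
uint8x16x2_t
load_pair (const uint8_t *p)
{
  return vld1q_u8_x2 (p);
}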

While this patch increases the quality of code generated in many
instances, there is still room for significant improvement - which
will be attempted in subsequent patches.

Bootstrapped and regression tested on aarch64-none-linux-gnu and
aarch64_be-none-linux-gnu - no issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-08-09  Jonathan Wright  
            Richard Sandiford  

* config/aarch64/aarch64-builtins.c (v2x8qi_UP): Define.
(v2x4hi_UP): Likewise.
(v2x4hf_UP): Likewise.
(v2x4bf_UP): Likewise.
(v2x2si_UP): Likewise.
(v2x2sf_UP): Likewise.
(v2x1di_UP): Likewise.
(v2x1df_UP): Likewise.
(v2x16qi_UP): Likewise.
(v2x8hi_UP): Likewise.
(v2x8hf_UP): Likewise.
(v2x8bf_UP): Likewise.
(v2x4si_UP): Likewise.
(v2x4sf_UP): Likewise.
(v2x2di_UP): Likewise.
(v2x2df_UP): Likewise.
(v3x8qi_UP): Likewise.
(v3x4hi_UP): Likewise.
(v3x4hf_UP): Likewise.
(v3x4bf_UP): Likewise.
(v3x2si_UP): Likewise.
(v3x2sf_UP): Likewise.
(v3x1di_UP): Likewise.
(v3x1df_UP): Likewise.
(v3x16qi_UP): Likewise.
(v3x8hi_UP): Likewise.
(v3x8hf_UP): Likewise.
(v3x8bf_UP): Likewise.
(v3x4si_UP): Likewise.
(v3x4sf_UP): Likewise.
(v3x2di_UP): Likewise.
(v3x2df_UP): Likewise.
(v4x8qi_UP): Likewise.
(v4x4hi_UP): Likewise.
(v4x4hf_UP): Likewise.
(v4x4bf_UP): Likewise.
(v4x2si_UP): Likewise.
(v4x2sf_UP): Likewise.
(v4x1di_UP): Likewise.
(v4x1df_UP): Likewise.
(v4x16qi_UP): Likewise.
(v4x8hi_UP): Likewise.
(v4x8hf_UP): Likewise.
(v4x8bf_UP): Likewise.
(v4x4si_UP): Likewise.
(v4x4sf_UP): Likewise.
(v4x2di_UP): Likewise.
(v4x2df_UP): Likewise.
(TYPES_GETREGP): Delete.
(TYPES_SETREGP): Likewise.
(TYPES_LOADSTRUCT_U): Define.
(TYPES_LOADSTRUCT_P): Likewise.
(TYPES_LOADSTRUCT_LANE_U): Likewise.
(TYPES_LOADSTRUCT_LANE_P): Likewise.
(TYPES_STORE1P): Move for consistency.
(TYPES_STORESTRUCT_U): Define.
(TYPES_STORESTRUCT_P): Likewise.
(TYPES_STORESTRUCT_LANE_U): Likewise.
(TYPES_STORESTRUCT_LANE_P): Likewise.
(aarch64_simd_tuple_types): Define.
(aarch64_lookup_simd_builtin_type): Handle tuple type lookup.
(aarch64_init_simd_builtin_functions): Update frontend lookup
for builtin functions after handling arm_neon.h pragma.
(register_tuple_type): Manually set modes of single-integer
tuple types. Record tuple types.
* config/aarch64/aarch64-modes.def
(ADV_SIMD_D_REG_STRUCT_MODES): Define D-register tuple modes.
(ADV_SIMD_Q_REG_STRUCT_MODES): Define Q-register tuple modes.
(SVE_MODES): Give single-vector modes priority over vector-
tuple modes.
(VECTOR_MODES_WITH_PREFIX): Set partial-vector mode order to
be after all single-vector modes.
* config/aarch64/aarch64-simd-builtins.def: Update builtin
generator macros to reflect modifications to the backend
patterns.
* config/aarch64/aarch64-simd.md (aarch64_simd_ld2):
Use vector-tuple mode iterator and rename to...
(aarch64_simd_ld2): This.
(aarch64_simd_ld2r): Use vector-tuple mode iterator and
rename to...
(aarch64_simd_ld2r): This.
(aarch64_vec_load_lanesoi_lane): Use vector-tuple mode
iterator and rename to...
(aarch64_vec_load_lanes_lane): This.
(vec_load_lanesoi): Use vector-tuple mode iterator and
rename to...
(vec_load_lanes): This.
(aarch64_simd_st2): Use vector-tuple mode iterator and
rename to...
(aarch64_simd_st2): This.
(aarch64_vec_store_lanesoi_lane): Use vector-tuple mode
iterator and rename to...
(aarch64_vec_store_lanes_lane): This.
  

[PATCH 6/6] aarch64: Pass and return Neon vector-tuple types without a parallel

2021-10-22 Thread Jonathan Wright via Gcc-patches
Hi,

Neon vector-tuple types can be passed in registers on function call
and return - there is no need to generate a parallel rtx. This patch
adds cases to detect vector-tuple modes and generates an appropriate
register rtx.

This change greatly improves code generated when passing Neon vector-
tuple types between functions; many new test cases are added to
defend these improvements.
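
As a concrete (illustrative) example of what the new tests defend, a
function that merely forwards a vector-tuple argument should now keep the
value in its argument registers:

#include <arm_neon.h>

/* Sketch: under the PCS, x arrives and is returned in q0-q3, so with this
   patch the ideal code is just 'ret' - no parallel and no stack traffic.  */
uint8x16x4_t
pass_through (uint8x16x4_t x)
{
  return x;
}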

Bootstrapped and regression tested on aarch64-none-linux-gnu and
aarch64_be-none-linux-gnu - no issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-10-07  Jonathan Wright  

* config/aarch64/aarch64.c (aarch64_function_value): Generate
a register rtx for Neon vector-tuple modes.
(aarch64_layout_arg): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vector_structure_intrinsics.c: New code
generation tests.


rb14937.patch
Description: rb14937.patch


[PATCH 5/6] gcc/lower_subreg.c: Prevent decomposition if modes are not tieable

2021-10-22 Thread Jonathan Wright via Gcc-patches
Hi,

Preventing decomposition if modes are not tieable is necessary to
stop AArch64 partial Neon structure modes being treated as packed in
registers.

This is a necessary prerequisite for a future AArch64 PCS change to
maintain good code generation.

Bootstrapped and regression tested on:
* x86_64-pc-linux-gnu - no issues.
* aarch64-none-linux-gnu - two test failures which will be fixed by
  the next patch in this series. 

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-10-14  Jonathan Wright  

* lower-subreg.c (simple_move): Prevent decomposition if
modes are not tieable.


rb14936.patch
Description: rb14936.patch


[PATCH 3/6] gcc/expmed.c: Ensure vector modes are tieable before extraction

2021-10-22 Thread Jonathan Wright via Gcc-patches
Hi,

Extracting a bitfield from a vector can be achieved by casting the
vector to a new type whose elements are the same size as the desired
bitfield, before generating a subreg. However, this is only an
optimization if the original vector can be accessed in the new
machine mode without first being copied - a condition denoted by the
TARGET_MODES_TIEABLE_P hook.

This patch adds a check to make sure that the vector modes are
tieable before attempting to generate a subreg. This is a necessary
prerequisite for a subsequent patch that will introduce new machine
modes for Arm Neon vector-tuple types.

Bootstrapped and regression tested on aarch64-none-linux-gnu and
x86_64-pc-linux-gnu - no issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-10-11  Jonathan Wright  

* expmed.c (extract_bit_field_1): Ensure modes are tieable.


rb14926.patch
Description: rb14926.patch


[PATCH 2/6] gcc/expr.c: Remove historic workaround for broken SIMD subreg

2021-10-22 Thread Jonathan Wright via Gcc-patches
Hi,

A long time ago, using a parallel to take a subreg of a SIMD register
was broken. This temporary fix[1] (from 2003) spilled these registers
to memory and reloaded the appropriate part to obtain the subreg.

The fix initially existed for the benefit of the PowerPC E500 - a
platform for which GCC removed support a number of years ago.
Regardless, a proper mechanism for taking a subreg of a SIMD register
exists now anyway.

This patch removes the workaround thus preventing SIMD registers
being dumped to memory unnecessarily - which sometimes can't be fixed
by later passes.

Bootstrapped and regression tested on aarch64-none-linux-gnu and
x86_64-pc-linux-gnu - no issues.

Ok for master?

Thanks,
Jonathan

[1] https://gcc.gnu.org/pipermail/gcc-patches/2003-April/102099.html

---

gcc/ChangeLog:

2021-10-11  Jonathan Wright  

* expr.c (emit_group_load_1): Remove historic workaround.


rb14923.patch
Description: rb14923.patch


[PATCH 1/6] aarch64: Move Neon vector-tuple type declaration into the compiler

2021-10-22 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch declares the Neon vector-tuple types inside the
compiler instead of in the arm_neon.h header. This is a necessary first
step before adding corresponding machine modes to the AArch64
backend.

The vector-tuple types are implemented using a #pragma. This means
initialization of builtin functions that have vector-tuple types as
arguments or return values has to be delayed until the #pragma is
handled.
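
Concretely, the header ends up containing a single directive along these
lines (the exact pragma string is an assumption here - the real one is in
the attached rb14838.patch), at which point the compiler registers the
tuple types and initialises the builtins that use them:

#pragma GCC aarch64 "arm_neon.h"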

Bootstrapped and regression tested on aarch64-none-linux-gnu - no
issues.

Note that this patch series cannot be merged until the following has
been accepted: 
https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581948.html

Ok for master with this proviso?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-09-10  Jonathan Wright  

* config/aarch64/aarch64-builtins.c (aarch64_init_simd_builtins):
Factor out main loop to...
(aarch64_init_simd_builtin_functions): This new function.
(register_tuple_type): Define.
(aarch64_scalar_builtin_type_p): Define.
(handle_arm_neon_h): Define.
* config/aarch64/aarch64-c.c (aarch64_pragma_aarch64): Handle
pragma for arm_neon.h.
* config/aarch64/aarch64-protos.h (aarch64_advsimd_struct_mode_p):
Declare.
(handle_arm_neon_h): Likewise.
* config/aarch64/aarch64.c (aarch64_advsimd_struct_mode_p):
Remove static modifier.
* config/aarch64/arm_neon.h (target): Remove Neon vector
structure type definitions.


rb14838.patch
Description: rb14838.patch


[PATCH] aarch64: Remove redundant struct type definitions in arm_neon.h

2021-10-21 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch deletes some redundant type definitions in
arm_neon.h. These vector type definitions are an artifact from the initial
commit that added the AArch64 port.

Bootstrapped and regression tested on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-10-15  Jonathan Wright  

* config/aarch64/arm_neon.h (__STRUCTN): Delete function
macro and all invocations.


rb14942.patch
Description: rb14942.patch


[PATCH] aarch64: Fix pointer parameter type in LD1 Neon intrinsics

2021-10-14 Thread Jonathan Wright via Gcc-patches
The pointer parameter to load a vector of signed values should itself
be a signed type. This patch fixes two instances of this unsigned-
signed implicit conversion in arm_neon.h.

Tested relevant intrinsics with -Wpointer-sign and warnings no longer
present.
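
For example (illustrative - the intrinsics actually touched are vld1_s8_x3
and vld1_s32_x3, per the ChangeLog below), this now compiles cleanly with
-Wpointer-sign:

#include <arm_neon.h>

int8x8x3_t
load_three (const int8_t *p)
{
  /* Previously warned: the underlying builtin took an unsigned element
     pointer, so passing a signed pointer was an implicit conversion.  */
  return vld1_s8_x3 (p);
}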

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-10-14  Jonathan Wright  

* config/aarch64/arm_neon.h (vld1_s8_x3): Use signed type for
pointer parameter.
(vld1_s32_x3): Likewise.


rb14933.patch
Description: rb14933.patch


[PATCH] aarch64: Fix type qualifiers for qtbl1 and qtbx1 Neon builtins

2021-09-24 Thread Jonathan Wright via Gcc-patches
Hi,

This patch fixes type qualifiers for the qtbl1 and qtbx1 Neon builtins
and removes the casts from the Neon intrinsic function bodies that
use these builtins.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-09-23  Jonathan Wright  

* config/aarch64/aarch64-builtins.c (TYPES_BINOP_PPU): Define
new type qualifier enum.
(TYPES_TERNOP_SSSU): Likewise.
(TYPES_TERNOP_PPPU): Likewise.
* config/aarch64/aarch64-simd-builtins.def: Define PPU, SSU,
PPPU and SSSU builtin generator macros for qtbl1 and qtbx1
Neon builtins.
* config/aarch64/arm_neon.h (vqtbl1_p8): Use type-qualified
builtin and remove casts.
(vqtbl1_s8): Likewise.
(vqtbl1q_p8): Likewise.
(vqtbl1q_s8): Likewise.
(vqtbx1_s8): Likewise.
(vqtbx1_p8): Likewise.
(vqtbx1q_s8): Likewise.
(vqtbx1q_p8): Likewise.
(vtbl1_p8): Likewise.
(vtbl2_p8): Likewise.
(vtbx2_p8): Likewise.


rb14884.patch
Description: rb14884.patch


[PATCH] aarch64: Fix float <-> int errors in vld4[q]_lane intrinsics

2021-08-18 Thread Jonathan Wright via Gcc-patches
Hi,

A previous commit "aarch64: Remove macros for vld4[q]_lane Neon
intrinsics" introduced some float <-> int type conversion errors.
This patch fixes those errors.

Bootstrapped and regression tested on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-08-18  Jonathan Wright  

* config/aarch64/arm_neon.h (vld4_lane_f32): Use float RTL
pattern.
(vld4q_lane_f64): Use float type cast.



From: Andreas Schwab 
Sent: 18 August 2021 13:09
To: Jonathan Wright via Gcc-patches 
Cc: Jonathan Wright ; Richard Sandiford 

Subject: Re: [PATCH 3/3] aarch64: Remove macros for vld4[q]_lane Neon 
intrinsics 
 
I think this patch breaks bootstrap.

In file included from ../../libcpp/lex.c:756:
/opt/gcc/gcc-20210818/Build/prev-gcc/include/arm_neon.h: In function 
'float32x2x4_t vld4_lane_f32(const float32_t*, float32x2x4_t, int)':
/opt/gcc/gcc-20210818/Build/prev-gcc/include/arm_neon.h:21081:11: error: cannot 
convert 'float*' to 'const int*'
21081 |   (__builtin_aarch64_simd_sf *) __ptr, __o, __c);
  |   ^~~
  |   |
  |   float*
: note:   initializing argument 1 of '__builtin_aarch64_simd_xi 
__builtin_aarch64_ld4_lanev2si(const int*, __builtin_aarch64_simd_xi, int)'
/opt/gcc/gcc-20210818/Build/prev-gcc/include/arm_neon.h: In function 
'float64x2x4_t vld4q_lane_f64(const float64_t*, float64x2x4_t, int)':
/opt/gcc/gcc-20210818/Build/prev-gcc/include/arm_neon.h:21384:9: error: cannot 
convert 'long int*' to 'const double*'
21384 | (__builtin_aarch64_simd_di *) __ptr, __o, __c);
  | ^~~
  | |
  | long int*
: note:   initializing argument 1 of '__builtin_aarch64_simd_xi 
__builtin_aarch64_ld4_lanev2df(const double*, __builtin_aarch64_simd_xi, int)'

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

rb14797.patch
Description: rb14797.patch


[PATCH 3/3] aarch64: Remove macros for vld4[q]_lane Neon intrinsics

2021-08-16 Thread Jonathan Wright via Gcc-patches
Hi,

This patch removes macros for vld4[q]_lane Neon intrinsics. This is a
preparatory step before adding new modes for structures of Advanced
SIMD vectors.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-08-16  Jonathan Wright  

* config/aarch64/arm_neon.h (__LD4_LANE_FUNC): Delete.
(__LD4Q_LANE_FUNC): Likewise.
(vld4_lane_u8): Define without macro.
(vld4_lane_u16): Likewise.
(vld4_lane_u32): Likewise.
(vld4_lane_u64): Likewise.
(vld4_lane_s8): Likewise.
(vld4_lane_s16): Likewise.
(vld4_lane_s32): Likewise.
(vld4_lane_s64): Likewise.
(vld4_lane_f16): Likewise.
(vld4_lane_f32): Likewise.
(vld4_lane_f64): Likewise.
(vld4_lane_p8): Likewise.
(vld4_lane_p16): Likewise.
(vld4_lane_p64): Likewise.
(vld4q_lane_u8): Likewise.
(vld4q_lane_u16): Likewise.
(vld4q_lane_u32): Likewise.
(vld4q_lane_u64): Likewise.
(vld4q_lane_s8): Likewise.
(vld4q_lane_s16): Likewise.
(vld4q_lane_s32): Likewise.
(vld4q_lane_s64): Likewise.
(vld4q_lane_f16): Likewise.
(vld4q_lane_f32): Likewise.
(vld4q_lane_f64): Likewise.
(vld4q_lane_p8): Likewise.
(vld4q_lane_p16): Likewise.
(vld4q_lane_p64): Likewise.
(vld4_lane_bf16): Likewise.
(vld4q_lane_bf16): Likewise.


rb14793.patch
Description: rb14793.patch


[PATCH 2/3] aarch64: Remove macros for vld3[q]_lane Neon intrinsics

2021-08-16 Thread Jonathan Wright via Gcc-patches
Hi,

This patch removes macros for vld3[q]_lane Neon intrinsics. This is a
preparatory step before adding new modes for structures of Advanced
SIMD vectors.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-08-16  Jonathan Wright  

* config/aarch64/arm_neon.h (__LD3_LANE_FUNC): Delete.
(__LD3Q_LANE_FUNC): Delete.
(vld3_lane_u8): Define without macro.
(vld3_lane_u16): Likewise.
(vld3_lane_u32): Likewise.
(vld3_lane_u64): Likewise.
(vld3_lane_s8): Likewise.
(vld3_lane_s16): Likewise.
(vld3_lane_s32): Likewise.
(vld3_lane_s64): Likewise.
(vld3_lane_f16): Likewise.
(vld3_lane_f32): Likewise.
(vld3_lane_f64): Likewise.
(vld3_lane_p8): Likewise.
(vld3_lane_p16): Likewise.
(vld3_lane_p64): Likewise.
(vld3q_lane_u8): Likewise.
(vld3q_lane_u16): Likewise.
(vld3q_lane_u32): Likewise.
(vld3q_lane_u64): Likewise.
(vld3q_lane_s8): Likewise.
(vld3q_lane_s16): Likewise.
(vld3q_lane_s32): Likewise.
(vld3q_lane_s64): Likewise.
(vld3q_lane_f16): Likewise.
(vld3q_lane_f32): Likewise.
(vld3q_lane_f64): Likewise.
(vld3q_lane_p8): Likewise.
(vld3q_lane_p16): Likewise.
(vld3q_lane_p64): Likewise.
(vld3_lane_bf16): Likewise.
(vld3q_lane_bf16): Likewise.


rb14792.patch
Description: rb14792.patch


[PATCH 1/3] aarch64: Remove macros for vld2[q]_lane Neon intrinsics

2021-08-16 Thread Jonathan Wright via Gcc-patches
Hi,

This patch removes macros for vld2[q]_lane Neon intrinsics. This is a
preparatory step before adding new modes for structures of Advanced
SIMD vectors.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-08-12  Jonathan Wright  

* config/aarch64/arm_neon.h (__LD2_LANE_FUNC): Delete.
(__LD2Q_LANE_FUNC): Likewise.
(vld2_lane_u8): Define without macro.
(vld2_lane_u16): Likewise.
(vld2_lane_u32): Likewise.
(vld2_lane_u64): Likewise.
(vld2_lane_s8): Likewise.
(vld2_lane_s16): Likewise.
(vld2_lane_s32): Likewise.
(vld2_lane_s64): Likewise.
(vld2_lane_f16): Likewise.
(vld2_lane_f32): Likewise.
(vld2_lane_f64): Likewise.
(vld2_lane_p8): Likewise.
(vld2_lane_p16): Likewise.
(vld2_lane_p64): Likewise.
(vld2q_lane_u8): Likewise.
(vld2q_lane_u16): Likewise.
(vld2q_lane_u32): Likewise.
(vld2q_lane_u64): Likewise.
(vld2q_lane_s8): Likewise.
(vld2q_lane_s16): Likewise.
(vld2q_lane_s32): Likewise.
(vld2q_lane_s64): Likewise.
(vld2q_lane_f16): Likewise.
(vld2q_lane_f32): Likewise.
(vld2q_lane_f64): Likewise.
(vld2q_lane_p8): Likewise.
(vld2q_lane_p16): Likewise.
(vld2q_lane_p64): Likewise.
(vld2_lane_bf16): Likewise.
(vld2q_lane_bf16): Likewise.


rb14791.patch
Description: rb14791.patch


[PATCH] testsuite: aarch64: Fix invalid SVE tests

2021-08-09 Thread Jonathan Wright via Gcc-patches
Hi,

Some scan-assembler tests for SVE code generation were erroneously
split over multiple lines - meaning they became invalid. This patch
gets the tests working again by putting each test on a single line.

The extract_[1234].c tests are corrected to expect that extracted
32-bit values are moved into 'w' registers rather than 'x' registers.

Ok for master?

Thanks,
Jonathan

---

gcc/testsuite/ChangeLog:

2021-08-06  Jonathan Wright  

* gcc.target/aarch64/sve/dup_lane_1.c: Don't split
scan-assembler tests over multiple lines. Expect 32-bit
result values in 'w' registers.
* gcc.target/aarch64/sve/extract_1.c: Likewise.
* gcc.target/aarch64/sve/extract_2.c: Likewise.
* gcc.target/aarch64/sve/extract_3.c: Likewise.
* gcc.target/aarch64/sve/extract_4.c: Likewise.


rb14768.patch
Description: rb14768.patch


Re: [PATCH] testsuite: aarch64: Fix failing vector structure tests on big-endian

2021-08-09 Thread Jonathan Wright via Gcc-patches
Hi,

I've corrected the quoting and moved everything on to one line.

Ok for master?

Thanks,
Jonathan

---

gcc/testsuite/ChangeLog:

2021-08-04  Jonathan Wright  

* gcc.target/aarch64/vector_structure_intrinsics.c: Restrict
tests to little-endian targets.



From: Richard Sandiford 
Sent: 06 August 2021 13:24
To: Jonathan Wright 
Cc: gcc-patches@gcc.gnu.org ; Christophe Lyon 

Subject: Re: [PATCH] testsuite: aarch64: Fix failing vector structure tests on 
big-endian 
 
Jonathan Wright  writes:
> diff --git a/gcc/testsuite/gcc.target/aarch64/vector_structure_intrinsics.c 
> b/gcc/testsuite/gcc.target/aarch64/vector_structure_intrinsics.c
> index 
> 60c53bc27f8378c78b119576ed19fde0e5743894..a8e31ab85d6fd2a045c8efaf2cbc42b5f40d2411
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/vector_structure_intrinsics.c
> +++ b/gcc/testsuite/gcc.target/aarch64/vector_structure_intrinsics.c
> @@ -197,7 +197,8 @@ TEST_ST1x3 (vst1q, uint64x2x3_t, uint64_t*, u64, x3);
>  TEST_ST1x3 (vst1q, poly64x2x3_t, poly64_t*, p64, x3);
>  TEST_ST1x3 (vst1q, float64x2x3_t, float64_t*, f64, x3);
>  
> -/* { dg-final { scan-assembler-not "mov\\t" } } */
> +/* { dg-final { scan-assembler-not {"mov\\t"} {
> + target { aarch64_little_endian } } ) }  */

I think this needs to stay on one line.  We should also either keep the
original quoting on the regexp or use {mov\t}.  Having both forms
of quote would turn it into a test for the characters:

   "mov\t"

(including quotes and backslash).

Thanks,
Richard


>  
>  /* { dg-final { scan-assembler-times "tbl\\t" 18} }  */
>  /* { dg-final { scan-assembler-times "tbx\\t" 18} }  */


rb14749.patch
Description: rb14749.patch


[PATCH 4/4] aarch64: Use memcpy to copy structures in bfloat vst* intrinsics

2021-08-05 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch uses __builtin_memcpy to copy vector structures
instead of using a union - or constructing a new opaque structure one
vector at a time - in each of the vst[234][q] and vst1[q]_x[234] bfloat
Neon intrinsics in arm_neon.h.
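
The change has roughly the following shape (a sketch only: the set_qreg
builtin name is an assumption, while __builtin_aarch64_simd_oi is the
opaque two-vector type mentioned in the ChangeLog below):

/* Old style: build the opaque structure one vector at a time
   (set_qreg builtin name assumed for illustration).  */
__builtin_aarch64_simd_oi __o;
__o = __builtin_aarch64_set_qregoiv8bf (__o, __val.val[0], 0);
__o = __builtin_aarch64_set_qregoiv8bf (__o, __val.val[1], 1);

/* New style: a single block copy of the whole structure.  */
__builtin_aarch64_simd_oi __o2;
__builtin_memcpy (&__o2, &__val, sizeof (__val));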

It also adds new code generation tests to verify that superfluous move
instructions are not generated for the vst[234]q or vst1q_x[234] bfloat
intrinsics.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-30  Jonathan Wright  

* config/aarch64/arm_neon.h (vst1_bf16_x2): Use
__builtin_memcpy instead of constructing an additional
__builtin_aarch64_simd_oi one vector at a time.
(vst1q_bf16_x2): Likewise.
(vst1_bf16_x3): Use __builtin_memcpy instead of constructing
an additional __builtin_aarch64_simd_ci one vector at a time.
(vst1q_bf16_x3): Likewise.
(vst1_bf16_x4): Use __builtin_memcpy instead of a union.
(vst1q_bf16_x4): Likewise.
(vst2_bf16): Use __builtin_memcpy instead of constructing an
additional __builtin_aarch64_simd_oi one vector at a time.
(vst2q_bf16): Likewise.
(vst3_bf16): Use __builtin_memcpy instead of constructing an
additional __builtin_aarch64_simd_ci mode one vector at a
time.
(vst3q_bf16): Likewise.
(vst4_bf16): Use __builtin_memcpy instead of constructing an
additional __builtin_aarch64_simd_xi one vector at a time.
(vst4q_bf16): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vector_structure_intrinsics.c: Add new
tests.


rb14731.patch
Description: rb14731.patch


[PATCH 3/4] aarch64: Use memcpy to copy structures in vst2[q]_lane intrinsics

2021-08-05 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch uses __builtin_memcpy to copy vector structures
instead of using a union - or constructing a new opaque structure one
vector at a time - in each of the vst2[q]_lane Neon intrinsics in
arm_neon.h.

It also adds new code generation tests to verify that superfluous move
instructions are not generated for the vst2q_lane intrinsics.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-30  Jonathan Wright  

* config/aarch64/arm_neon.h (__ST2_LANE_FUNC): Delete.
(__ST2Q_LANE_FUNC): Delete.
(vst2_lane_f16): Use __builtin_memcpy to copy vector
structure instead of constructing __builtin_aarch64_simd_oi
one vector at a time.
(vst2_lane_f32): Likewise.
(vst2_lane_f64): Likewise.
(vst2_lane_p8): Likewise.
(vst2_lane_p16): Likewise.
(vst2_lane_p64): Likewise.
(vst2_lane_s8): Likewise.
(vst2_lane_s16): Likewise.
(vst2_lane_s32): Likewise.
(vst2_lane_s64): Likewise.
(vst2_lane_u8): Likewise.
(vst2_lane_u16): Likewise.
(vst2_lane_u32): Likewise.
(vst2_lane_u64): Likewise.
(vst2_lane_bf16): Likewise.
(vst2q_lane_f16): Use __builtin_memcpy to copy vector
structure instead of using a union.
(vst2q_lane_f32): Likewise.
(vst2q_lane_f64): Likewise.
(vst2q_lane_p8): Likewise.
(vst2q_lane_p16): Likewise.
(vst2q_lane_p64): Likewise.
(vst2q_lane_s8): Likewise.
(vst2q_lane_s16): Likewise.
(vst2q_lane_s32): Likewise.
(vst2q_lane_s64): Likewise.
(vst2q_lane_u8): Likewise.
(vst2q_lane_u16): Likewise.
(vst2q_lane_u32): Likewise.
(vst2q_lane_u64): Likewise.
(vst2q_lane_bf16): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vector_structure_intrinsics.c: Add new
tests.


rb14730.patch
Description: rb14730.patch


[PATCH 2/4] aarch64: Use memcpy to copy structures in vst3[q]_lane intrinsics

2021-08-05 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch uses __builtin_memcpy to copy vector structures
instead of using a union - or constructing a new opaque structure one
vector at a time - in each of the vst3[q]_lane Neon intrinsics in
arm_neon.h.

It also adds new code generation tests to verify that superfluous move
instructions are not generated for the vst3q_lane intrinsics.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-30  Jonathan Wright  

* config/aarch64/arm_neon.h (__ST3_LANE_FUNC): Delete.
(__ST3Q_LANE_FUNC): Delete.
(vst3_lane_f16): Use __builtin_memcpy to copy vector
structure instead of constructing __builtin_aarch64_simd_ci
one vector at a time.
(vst3_lane_f32): Likewise.
(vst3_lane_f64): Likewise.
(vst3_lane_p8): Likewise.
(vst3_lane_p16): Likewise.
(vst3_lane_p64): Likewise.
(vst3_lane_s8): Likewise.
(vst3_lane_s16): Likewise.
(vst3_lane_s32): Likewise.
(vst3_lane_s64): Likewise.
(vst3_lane_u8): Likewise.
(vst3_lane_u16): Likewise.
(vst3_lane_u32): Likewise.
(vst3_lane_u64): Likewise.
(vst3_lane_bf16): Likewise.
(vst3q_lane_f16): Use __builtin_memcpy to copy vector
structure instead of using a union.
(vst3q_lane_f32): Likewise.
(vst3q_lane_f64): Likewise.
(vst3q_lane_p8): Likewise.
(vst3q_lane_p16): Likewise.
(vst3q_lane_p64): Likewise.
(vst3q_lane_s8): Likewise.
(vst3q_lane_s16): Likewise.
(vst3q_lane_s32): Likewise.
(vst3q_lane_s64): Likewise.
(vst3q_lane_u8): Likewise.
(vst3q_lane_u16): Likewise.
(vst3q_lane_u32): Likewise.
(vst3q_lane_u64): Likewise.
(vst3q_lane_bf16): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vector_structure_intrinsics.c: Add new
tests.


rb14729.patch
Description: rb14729.patch


[PATCH 1/4] aarch64: Use memcpy to copy structures in vst4[q]_lane intrinsics

2021-08-05 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch uses __builtin_memcpy to copy vector structures
instead of using a union - or constructing a new opaque structure one
vector at a time - in each of the vst4[q]_lane Neon intrinsics in
arm_neon.h.

It also adds new code generation tests to verify that superfluous move
instructions are not generated for the vst4q_lane intrinsics.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-29  Jonathan Wright  

* config/aarch64/arm_neon.h (__ST4_LANE_FUNC): Delete.
(__ST4Q_LANE_FUNC): Delete.
(vst4_lane_f16): Use __builtin_memcpy to copy vector
structure instead of constructing __builtin_aarch64_simd_xi
one vector at a time.
(vst4_lane_f32): Likewise.
(vst4_lane_f64): Likewise.
(vst4_lane_p8): Likewise.
(vst4_lane_p16): Likewise.
(vst4_lane_p64): Likewise.
(vst4_lane_s8): Likewise.
(vst4_lane_s16): Likewise.
(vst4_lane_s32): Likewise.
(vst4_lane_s64): Likewise.
(vst4_lane_u8): Likewise.
(vst4_lane_u16): Likewise.
(vst4_lane_u32): Likewise.
(vst4_lane_u64): Likewise.
(vst4_lane_bf16): Likewise.
(vst4q_lane_f16): Use __builtin_memcpy to copy vector
structure instead of using a union.
(vst4q_lane_f32): Likewise.
(vst4q_lane_f64): Likewise.
(vst4q_lane_p8): Likewise.
(vst4q_lane_p16): Likewise.
(vst4q_lane_p64): Likewise.
(vst4q_lane_s8): Likewise.
(vst4q_lane_s16): Likewise.
(vst4q_lane_s32): Likewise.
(vst4q_lane_s64): Likewise.
(vst4q_lane_u8): Likewise.
(vst4q_lane_u16): Likewise.
(vst4q_lane_u32): Likewise.
(vst4q_lane_u64): Likewise.
(vst4q_lane_bf16): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vector_structure_intrinsics.c: Add new
tests.


rb14728.patch
Description: rb14728.patch


[PATCH V2] aarch64: Don't include vec_select high-half in SIMD subtract cost

2021-08-05 Thread Jonathan Wright via Gcc-patches
Hi,

V2 of this change implements the same approach as for the multiply
and add-widen patches.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-28  Jonathan Wright  

* config/aarch64/aarch64.c: Traverse RTL tree to prevent cost
of vec_select high-half from being added into Neon subtract
cost.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vsubX_high_cost.c: New test.



From: Jonathan Wright
Sent: 29 July 2021 10:23
To: gcc-patches@gcc.gnu.org 
Cc: Richard Sandiford ; Kyrylo Tkachov 

Subject: [PATCH] aarch64: Don't include vec_select high-half in SIMD subtract 
cost 
 
Hi,

The Neon subtract-long/subract-widen instructions can select the top
or bottom half of the operand registers. This selection does not
change the cost of the underlying instruction and this should be
reflected by the RTL cost function.

This patch adds RTL tree traversal in the Neon subtract cost function
to match vec_select high-half of its operands. This traversal
prevents the cost of the vec_select from being added into the cost of
the subtract - meaning that these instructions can now be emitted in
the combine pass as they are no longer deemed prohibitively
expensive.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-28  Jonathan Wright  

    * config/aarch64/aarch64.c: Traverse RTL tree to prevent cost
    of vec_select high-half from being added into Neon subtract
    cost.

gcc/testsuite/ChangeLog:

    * gcc.target/aarch64/vsubX_high_cost.c: New test.

rb14711.patch
Description: rb14711.patch


[PATCH V2] aarch64: Don't include vec_select high-half in SIMD add cost

2021-08-04 Thread Jonathan Wright via Gcc-patches
Hi,

V2 of this patch uses the same approach as that just implemented
for the multiply high-half cost patch.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan 

---

gcc/ChangeLog:

2021-07-28  Jonathan Wright  

* config/aarch64/aarch64.c: Traverse RTL tree to prevent cost
of vec_select high-half from being added into Neon add cost.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vaddX_high_cost.c: New test.

From: Jonathan Wright
Sent: 29 July 2021 10:22
To: gcc-patches@gcc.gnu.org 
Cc: Richard Sandiford ; Kyrylo Tkachov 

Subject: [PATCH] aarch64: Don't include vec_select high-half in SIMD add cost 
 
Hi,

The Neon add-long/add-widen instructions can select the top or bottom
half of the operand registers. This selection does not change the
cost of the underlying instruction and this should be reflected by
the RTL cost function.

This patch adds RTL tree traversal in the Neon add cost function to
match vec_select high-half of its operands. This traversal prevents
the cost of the vec_select from being added into the cost of the
subtract - meaning that these instructions can now be emitted in the
combine pass as they are no longer deemed prohibitively expensive.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-28  Jonathan Wright  

    * config/aarch64/aarch64.c: Traverse RTL tree to prevent cost
    of vec_select high-half from being added into Neon add cost.

gcc/testsuite/ChangeLog:

    * gcc.target/aarch64/vaddX_high_cost.c: New test.

rb14710.patch
Description: rb14710.patch


[PATCH V2] aarch64: Don't include vec_select high-half in SIMD multiply cost

2021-08-04 Thread Jonathan Wright via Gcc-patches
Hi,

Changes suggested here and those discussed off-list have been
implemented in V2 of the patch.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-19  Jonathan Wright  

* config/aarch64/aarch64.c (aarch64_strip_extend_vec_half):
Define.
(aarch64_rtx_mult_cost): Traverse RTL tree to prevent cost of
vec_select high-half from being added into Neon multiply
cost.
* rtlanal.c (vec_series_highpart_p): Define.
* rtlanal.h (vec_series_highpart_p): Declare.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vmul_high_cost.c: New test.

From: Richard Sandiford 
Sent: 04 August 2021 10:05
To: Jonathan Wright via Gcc-patches 
Cc: Jonathan Wright 
Subject: Re: [PATCH] aarch64: Don't include vec_select high-half in SIMD 
multiply cost 
 
Jonathan Wright via Gcc-patches  writes:
> Hi,
>
> The Neon multiply/multiply-accumulate/multiply-subtract instructions
> can select the top or bottom half of the operand registers. This
> selection does not change the cost of the underlying instruction and
> this should be reflected by the RTL cost function.
>
> This patch adds RTL tree traversal in the Neon multiply cost function
> to match vec_select high-half of its operands. This traversal
> prevents the cost of the vec_select from being added into the cost of
> the multiply - meaning that these instructions can now be emitted in
> the combine pass as they are no longer deemed prohibitively
> expensive.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
> issues.

Like you say, the instructions can handle both the low and high halves.
Shouldn't we also check for the low part (as a SIGN/ZERO_EXTEND of
a subreg)?

> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/ChangeLog:
>
> 2021-07-19  Jonathan Wright  
>
>    * config/aarch64/aarch64.c (aarch64_vec_select_high_operand_p):
>    Define.
>    (aarch64_rtx_mult_cost): Traverse RTL tree to prevent cost of
>    vec_select high-half from being added into Neon multiply
>    cost.
>    * rtlanal.c (vec_series_highpart_p): Define.
>    * rtlanal.h (vec_series_highpart_p): Declare.
>
> gcc/testsuite/ChangeLog:
>
>    * gcc.target/aarch64/vmul_high_cost.c: New test.
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 
> 5809887997305317c5a81421089db431685e2927..a49672afe785e3517250d324468edacceab5c9d3
>  100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -76,6 +76,7 @@
>  #include "function-abi.h"
>  #include "gimple-pretty-print.h"
>  #include "tree-ssa-loop-niter.h"
> +#include "rtlanal.h"
>  
>  /* This file should be included last.  */
>  #include "target-def.h"
> @@ -11970,6 +11971,19 @@ aarch64_cheap_mult_shift_p (rtx x)
>    return false;
>  }
>  
> +/* Return true iff X is an operand of a select-high-half vector
> +   instruction.  */
> +
> +static bool
> +aarch64_vec_select_high_operand_p (rtx x)
> +{
> +  return ((GET_CODE (x) == ZERO_EXTEND || GET_CODE (x) == SIGN_EXTEND)
> +   && GET_CODE (XEXP (x, 0)) == VEC_SELECT
> +   && vec_series_highpart_p (GET_MODE (XEXP (x, 0)),
> + GET_MODE (XEXP (XEXP (x, 0), 0)),
> + XEXP (XEXP (x, 0), 1)));
> +}
> +
>  /* Helper function for rtx cost calculation.  Calculate the cost of
> a MULT or ASHIFT, which may be part of a compound PLUS/MINUS rtx.
> Return the calculated cost of the expression, recursing manually in to
> @@ -11995,6 +12009,13 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, 
> int outer, bool speed)
>    unsigned int vec_flags = aarch64_classify_vector_mode (mode);
>    if (vec_flags & VEC_ADVSIMD)
>    {
> +   /* The select-operand-high-half versions of the instruction have the
> +  same cost as the three vector version - don't add the costs of the
> +  select into the costs of the multiply.  */
> +   if (aarch64_vec_select_high_operand_p (op0))
> + op0 = XEXP (XEXP (op0, 0), 0);
> +   if (aarch64_vec_select_high_operand_p (op1))
> + op1 = XEXP (XEXP (op1, 0), 0);

For consistency with aarch64_strip_duplicate_vec_elt, I think this
should be something like aarch64_strip_vec_extension, returning
the inner rtx on success and the original one on failure.

Thanks,
Richard

>  /* The by-element versions of the instruction have the same costs as
> the normal 3-vector version.  So don't add the costs of the
> duplicate or subsequent select into the costs of the

[PATCH] testsuite: aarch64: Fix failing vector structure tests on big-endian

2021-08-04 Thread Jonathan Wright via Gcc-patches
Hi,

Recent refactoring of the arm_neon.h header enabled better code
generation for intrinsics that manipulate vector structures. New
tests were also added to verify the benefit of these changes. It now
transpires that the code generation improvements are observed only on
little-endian systems. This patch restricts the code generation tests
to little-endian targets (for now).

Ok for master?

Thanks,
Jonathan

---

gcc/testsuite/ChangeLog:

2021-08-04  Jonathan Wright  

* gcc.target/aarch64/vector_structure_intrinsics.c: Restrict
tests to little-endian targets.



From: Christophe Lyon 
Sent: 03 August 2021 10:42
To: Jonathan Wright 
Cc: gcc-patches@gcc.gnu.org ; Richard Sandiford 

Subject: Re: [PATCH 1/8] aarch64: Use memcpy to copy vector tables in 
vqtbl[234] intrinsics 
 


On Fri, Jul 23, 2021 at 10:22 AM Jonathan Wright via Gcc-patches 
 wrote:
Hi,

This patch uses __builtin_memcpy to copy vector structures instead of
building a new opaque structure one vector at a time in each of the
vqtbl[234] Neon intrinsics in arm_neon.h. This simplifies the header file
and also improves code generation - superfluous move instructions
were emitted for every register extraction/set in this additional
structure.

Add new code generation tests to verify that superfluous move
instructions are no longer generated for the vqtbl[234] intrinsics.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-08  Jonathan Wright  

        * config/aarch64/arm_neon.h (vqtbl2_s8): Use __builtin_memcpy
        instead of constructing __builtin_aarch64_simd_oi one vector
        at a time.
        (vqtbl2_u8): Likewise.
        (vqtbl2_p8): Likewise.
        (vqtbl2q_s8): Likewise.
        (vqtbl2q_u8): Likewise.
        (vqtbl2q_p8): Likewise.
        (vqtbl3_s8): Use __builtin_memcpy instead of constructing
        __builtin_aarch64_simd_ci one vector at a time.
        (vqtbl3_u8): Likewise.
        (vqtbl3_p8): Likewise.
        (vqtbl3q_s8): Likewise.
        (vqtbl3q_u8): Likewise.
        (vqtbl3q_p8): Likewise.
        (vqtbl4_s8): Use __builtin_memcpy instead of constructing
        __builtin_aarch64_simd_xi one vector at a time.
        (vqtbl4_u8): Likewise.
        (vqtbl4_p8): Likewise.
        (vqtbl4q_s8): Likewise.
        (vqtbl4q_u8): Likewise.
        (vqtbl4q_p8): Likewise.

gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/vector_structure_intrinsics.c: New test.

Hi,

This new test fails on aarch64_be:
 FAIL: gcc.target/aarch64/vector_structure_intrinsics.c scan-assembler-not 
mov\\t

Can you check?

Thanks

Christophe


rb14749.patch
Description: rb14749.patch


[PATCH] aarch64: Don't include vec_select high-half in SIMD subtract cost

2021-07-29 Thread Jonathan Wright via Gcc-patches
Hi,

The Neon subtract-long/subtract-widen instructions can select the top
or bottom half of the operand registers. This selection does not
change the cost of the underlying instruction and this should be
reflected by the RTL cost function.

This patch adds RTL tree traversal in the Neon subtract cost function
to match vec_select high-half of its operands. This traversal
prevents the cost of the vec_select from being added into the cost of
the subtract - meaning that these instructions can now be emitted in
the combine pass as they are no longer deemed prohibitively
expensive.
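
For illustration only - the actual test contents are in the attached patch -
the kind of code this helps looks roughly like:

  #include <arm_neon.h>

  /* Hand-written illustration, not the vsubX_high_cost.c test itself: with
     the cost fix, combine should fold the high-half extracts into the
     widening subtract and emit a single ssubl2, rather than pricing the
     vec_selects separately and rejecting the combination.  */
  int32x4_t
  sub_high_halves (int16x8_t a, int16x8_t b)
  {
    return vsubl_s16 (vget_high_s16 (a), vget_high_s16 (b));
  }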

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-28  Jonathan Wright  

* config/aarch64/aarch64.c: Traverse RTL tree to prevent cost
of vec_select high-half from being added into Neon subtract
cost.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vsubX_high_cost.c: New test.


rb14711.patch
Description: rb14711.patch


[PATCH] aarch64: Don't include vec_select high-half in SIMD add cost

2021-07-29 Thread Jonathan Wright via Gcc-patches
Hi,

The Neon add-long/add-widen instructions can select the top or bottom
half of the operand registers. This selection does not change the
cost of the underlying instruction and this should be reflected by
the RTL cost function.

This patch adds RTL tree traversal in the Neon add cost function to
match vec_select high-half of its operands. This traversal prevents
the cost of the vec_select from being added into the cost of the
add - meaning that these instructions can now be emitted in the
combine pass as they are no longer deemed prohibitively expensive.
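
For illustration only (the real tests are in the attached patch), the shape
of code that benefits is roughly:

  #include <arm_neon.h>

  /* Hand-written sketch, not the vaddX_high_cost.c test itself: the
     high-half extract feeding the widening add should now be treated as
     free, so this can combine into a single saddw2.  */
  int32x4_t
  addw_high_half (int32x4_t acc, int16x8_t b)
  {
    return vaddw_s16 (acc, vget_high_s16 (b));
  }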

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-28  Jonathan Wright  

* config/aarch64/aarch64.c: Traverse RTL tree to prevent cost
of vec_select high-half from being added into Neon add cost.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vaddX_high_cost.c: New test.


rb14710.patch
Description: rb14710.patch


[PATCH] aarch64: Don't include vec_select high-half in SIMD multiply cost

2021-07-28 Thread Jonathan Wright via Gcc-patches
Hi,

The Neon multiply/multiply-accumulate/multiply-subtract instructions
can select the top or bottom half of the operand registers. This
selection does not change the cost of the underlying instruction and
this should be reflected by the RTL cost function.

This patch adds RTL tree traversal in the Neon multiply cost function
to match vec_select high-half of its operands. This traversal
prevents the cost of the vec_select from being added into the cost of
the multiply - meaning that these instructions can now be emitted in
the combine pass as they are no longer deemed prohibitively
expensive.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-19  Jonathan Wright  

* config/aarch64/aarch64.c (aarch64_vec_select_high_operand_p):
Define.
(aarch64_rtx_mult_cost): Traverse RTL tree to prevent cost of
vec_select high-half from being added into Neon multiply
cost.
* rtlanal.c (vec_series_highpart_p): Define.
* rtlanal.h (vec_series_highpart_p): Declare.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vmul_high_cost.c: New test.


rb14704.patch
Description: rb14704.patch


Re: [PATCH V2] aarch64: Don't include vec_select in SIMD multiply cost

2021-07-28 Thread Jonathan Wright via Gcc-patches
Hi,

V2 of the patch addresses the initial review comments, factors out
common code (as we discussed off-list) and adds a set of unit tests
to verify the code generation benefit.
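
As a rough illustration of what the new tests check (the exact test file is
in the attached patch):

  #include <arm_neon.h>

  /* Hand-written sketch: both multiplies reference the same lane, and with
     the vec_select no longer costed into the multiply, combine can emit two
     fmul-by-element instructions instead of duplicating the lane into a
     separate register first.  */
  float32x4_t
  mul_by_lane (float32x4_t a, float32x4_t b, float32x4_t vals)
  {
    return vaddq_f32 (vmulq_laneq_f32 (a, vals, 1),
                      vmulq_laneq_f32 (b, vals, 1));
  }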

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-19  Jonathan Wright  

* config/aarch64/aarch64.c (aarch64_strip_duplicate_vec_elt):
Define.
(aarch64_rtx_mult_cost): Traverse RTL tree to prevent
vec_select cost from being added into Neon multiply cost.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vmul_element_cost.c: New test.



From: Richard Sandiford 
Sent: 22 July 2021 18:16
To: Jonathan Wright 
Cc: gcc-patches@gcc.gnu.org ; Kyrylo Tkachov 

Subject: Re: [PATCH] aarch64: Don't include vec_select in SIMD multiply cost 
 
Jonathan Wright  writes:
> Hi,
>
> The Neon multiply/multiply-accumulate/multiply-subtract instructions
> can take various forms - multiplying full vector registers of values
> or multiplying one vector by a single element of another. Regardless
> of the form used, these instructions have the same cost, and this
> should be reflected by the RTL cost function.
>
> This patch adds RTL tree traversal in the Neon multiply cost function
> to match the vec_select used by the lane-referencing forms of the
> instructions already mentioned. This traversal prevents the cost of
> the vec_select from being added into the cost of the multiply -
> meaning that these instructions can now be emitted in the combine
> pass as they are no longer deemed prohibitively expensive.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
> issues.
>
> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/ChangeLog:
>
> 2021-07-19  Jonathan Wright  
>
> * config/aarch64/aarch64.c (aarch64_rtx_mult_cost): Traverse
> RTL tree to prevents vec_select from being added into Neon
> multiply cost.
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 
> f5b25a7f7041645921e6ad85714efda73b993492..b368303b0e699229266e6d008e28179c496bf8cd
>  100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -11985,6 +11985,21 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, 
> int outer, bool speed)
>    op0 = XEXP (op0, 0);
>  else if (GET_CODE (op1) == VEC_DUPLICATE)
>    op1 = XEXP (op1, 0);
> +   /* The same argument applies to the VEC_SELECT when using the lane-
> +  referencing forms of the MUL/MLA/MLS instructions. Without the
> +  traversal here, the combine pass deems these patterns too
> +  expensive and subsequently does not emit the lane-referencing
> +  forms of the instructions. In addition, canonical form is for the
> +  VEC_SELECT to be the second argument of the multiply - thus only
> +  op1 is traversed.  */
> +   if (GET_CODE (op1) == VEC_SELECT
> +   && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
> + op1 = XEXP (op1, 0);
> +   else if ((GET_CODE (op1) == ZERO_EXTEND
> + || GET_CODE (op1) == SIGN_EXTEND)
> +    && GET_CODE (XEXP (op1, 0)) == VEC_SELECT
> +    && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
> + op1 = XEXP (XEXP (op1, 0), 0);

I think this logically belongs in the “GET_CODE (op1) == VEC_DUPLICATE”
if block, since the condition is never true otherwise.  We can probably
skip the GET_MODE_NUNITS tests, but if you'd prefer to keep them, I think
it would be better to add them to the existing VEC_DUPLICATE tests rather
than restrict them to the VEC_SELECT ones.

Also, although this is in Advanced SIMD-specific code, I think it'd be
better to use:

  is_a (GET_MODE (op1))

instead of:

  GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1

Do you have a testcase?

Thanks,
Richard

rb14675.patch
Description: rb14675.patch


Re: [PATCH V2] simplify-rtx: Push sign/zero-extension inside vec_duplicate

2021-07-26 Thread Jonathan Wright via Gcc-patches
Hi,

This updated patch fixes the two-operators-per-row style issue in the 
aarch64-simd.md RTL patterns as well as integrating the simplify-rtx.c
change as suggested.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-19  Jonathan Wright  

* config/aarch64/aarch64-simd.md: Push sign/zero-extension
inside vec_duplicate for all patterns.
* simplify-rtx.c (simplify_context::simplify_unary_operation_1):
Push sign/zero-extension inside vec_duplicate.



From: Richard Sandiford 
Sent: 22 July 2021 18:36
To: Jonathan Wright 
Cc: gcc-patches@gcc.gnu.org ; Kyrylo Tkachov 

Subject: Re: [PATCH] simplify-rtx: Push sign/zero-extension inside 
vec_duplicate 
 
Jonathan Wright  writes:
> Hi,
>
> As a general principle, vec_duplicate should be as close to the root
> of an expression as possible. Where unary operations have
> vec_duplicate as an argument, these operations should be pushed
> inside the vec_duplicate.
>
> This patch modifies unary operation simplification to push
> sign/zero-extension of a scalar inside vec_duplicate.
>
> This patch also updates all RTL patterns in aarch64-simd.md to use
> the new canonical form.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu and
> x86_64-none-linux-gnu - no issues.
>
> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/ChangeLog:
>
> 2021-07-19  Jonathan Wright  
>
> * config/aarch64/aarch64-simd.md: Push sign/zero-extension
> inside vec_duplicate for all patterns.
> * simplify-rtx.c (simplify_context::simplify_unary_operation_1):
> Push sign/zero-extension inside vec_duplicate.
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> 74890989cb3045798bf8d0241467eaaf72238297..99a95a54248041906b9a0ad742d3a0dca9733b35
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -2092,14 +2092,14 @@
>  
>  (define_insn "aarch64_mlal_hi_n_insn"
>    [(set (match_operand: 0 "register_operand" "=w")
> -    (plus:
> -  (mult:
> -  (ANY_EXTEND: (vec_select:
> - (match_operand:VQ_HSI 2 "register_operand" "w")
> - (match_operand:VQ_HSI 3 "vect_par_cnst_hi_half" "")))
> -  (ANY_EXTEND: (vec_duplicate:
> -    (match_operand: 4 "register_operand" ""
> -  (match_operand: 1 "register_operand" "0")))]
> + (plus:
> +   (mult:
> +   (ANY_EXTEND: (vec_select:
> +  (match_operand:VQ_HSI 2 "register_operand" "w")
> +  (match_operand:VQ_HSI 3 "vect_par_cnst_hi_half" "")))
> +  (vec_duplicate: (ANY_EXTEND:
> +  (match_operand: 4 "register_operand" ""
> +   (match_operand: 1 "register_operand" "0")))]

Sorry to nitpick, since this is pre-existing, but I think the pattern
would be easier to read with one operation per line.  I.e.:

    (plus:
  (mult:
    (ANY_EXTEND:
  (vec_select:
    (match_operand:VQ_HSI 2 "register_operand" "w")
    (match_operand:VQ_HSI 3 "vect_par_cnst_hi_half" "")))
    (vec_duplicate:
  (ANY_EXTEND:
    (match_operand: 4 "register_operand" ""
  (match_operand: 1 "register_operand" "0")))]

Same for the other patterns with similar doubling of operators.
(It looks like you've fixed other indentation problems though, thanks.)

> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
> index 
> 2d169d3f9f70c85d396adaed124b6c52aca98f07..f885816412f7576d2535f827562d2b425a6a553b
>  100644
> --- a/gcc/simplify-rtx.c
> +++ b/gcc/simplify-rtx.c
> @@ -903,6 +903,18 @@ simplify_context::simplify_unary_operation_1 (rtx_code 
> code, machine_mode mode,
>    rtx temp, elt, base, step;
>    scalar_int_mode inner, int_mode, op_mode, op0_mode;
>  
> +  /* Extending a VEC_DUPLICATE of a scalar should be canonicalized to a
> + VEC_DUPLICATE of an extended scalar. This is outside of the main switch
> + as we may wish to push all unary operations inside VEC_DUPLICATE. */
> +  if ((code == SIGN_EXTEND || code == ZERO_EXTEND)
> +  && GET_CODE (op) == VEC_DUPLICATE
> +  && GET_MODE_NUNITS (GET_MODE (XEXP (op, 0))).to_constant () == 1)
> +    {
> +  rtx x = simplify_gen_unary (code, GET_MODE_INNER (mode),
> +   XEXP (op, 0), GET_MODE (XEXP (op, 0)));
> +  return gen_vec_duplicate (mode, x);
> +    }
> +
>    switch (code)
>  {
>  case NOT:

This is really an extension of the existing code:

  if (VECTOR_MODE_P (mode)
  && vec_duplicate_p (op, )
  && code != VEC_DUPLICATE)
    {
  /* Try applying the operator to ELT and see if that simplifies.
 We can duplicate the result if so.

 The reason we don't use simplify_gen_unary is that it isn't
 necessarily a win to convert things like:


[PATCH] aarch64: Use memcpy to copy vector tables in vst1[q]_x2 intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi,

This patch uses __builtin_memcpy to copy vector structures instead of
building a new opaque structure one vector at a time in each of the
vst1[q]_x2 Neon intrinsics in arm_neon.h. This simplifies the header
file and also improves code generation - superfluous move
instructions were emitted for every register extraction/set in this
additional structure.

Add new code generation tests to verify that superfluous move
instructions are not generated for the vst1q_x2 intrinsics.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-23  Jonathan Wright  

* config/aarch64/arm_neon.h (vst1_s64_x2): Use
__builtin_memcpy instead of constructing
__builtin_aarch64_simd_oi one vector at a time.
(vst1_u64_x2): Likewise.
(vst1_f64_x2): Likewise.
(vst1_s8_x2): Likewise.
(vst1_p8_x2): Likewise.
(vst1_s16_x2): Likewise.
(vst1_p16_x2): Likewise.
(vst1_s32_x2): Likewise.
(vst1_u8_x2): Likewise.
(vst1_u16_x2): Likewise.
(vst1_u32_x2): Likewise.
(vst1_f16_x2): Likewise.
(vst1_f32_x2): Likewise.
(vst1_p64_x2): Likewise.
(vst1q_s8_x2): Likewise.
(vst1q_p8_x2): Likewise.
(vst1q_s16_x2): Likewise.
(vst1q_p16_x2): Likewise.
(vst1q_s32_x2): Likewise.
(vst1q_s64_x2): Likewise.
(vst1q_u8_x2): Likewise.
(vst1q_u16_x2): Likewise.
(vst1q_u32_x2): Likewise.
(vst1q_u64_x2): Likewise.
(vst1q_f16_x2): Likewise.
(vst1q_f32_x2): Likewise.
(vst1q_f64_x2): Likewise.
(vst1q_p64_x2): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vector_structure_intrinsics.c: Add new
tests.


rb14701.patch
Description: rb14701.patch


[PATCH] aarch64: Use memcpy to copy vector tables in vst1[q]_x3 intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi,

This patch uses __builtin_memcpy to copy vector structures instead of
building a new opaque structure one vector at a time in each of the
vst1[q]_x3 Neon intrinsics in arm_neon.h. This simplifies the header file
and also improves code generation - superfluous move instructions
were emitted for every register extraction/set in this additional
structure.

Add new code generation tests to verify that superfluous move
instructions are not generated for the vst1q_x3 intrinsics.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-23  Jonathan Wright  

* config/aarch64/arm_neon.h (vst1_s64_x3): Use
__builtin_memcpy instead of constructing
__builtin_aarch64_simd_ci one vector at a time.
(vst1_u64_x3): Likewise.
(vst1_f64_x3): Likewise.
(vst1_s8_x3): Likewise.
(vst1_p8_x3): Likewise.
(vst1_s16_x3): Likewise.
(vst1_p16_x3): Likewise.
(vst1_s32_x3): Likewise.
(vst1_u8_x3): Likewise.
(vst1_u16_x3): Likewise.
(vst1_u32_x3): Likewise.
(vst1_f16_x3): Likewise.
(vst1_f32_x3): Likewise.
(vst1_p64_x3): Likewise.
(vst1q_s8_x3): Likewise.
(vst1q_p8_x3): Likewise.
(vst1q_s16_x3): Likewise.
(vst1q_p16_x3): Likewise.
(vst1q_s32_x3): Likewise.
(vst1q_s64_x3): Likewise.
(vst1q_u8_x3): Likewise.
(vst1q_u16_x3): Likewise.
(vst1q_u32_x3): Likewise.
(vst1q_u64_x3): Likewise.
(vst1q_f16_x3): Likewise.
(vst1q_f32_x3): Likewise.
(vst1q_f64_x3): Likewise.
(vst1q_p64_x3): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vector_structure_intrinsics.c: Add new
tests.


rb14700.patch
Description: rb14700.patch


Re: [PATCH 4/8] aarch64: Use memcpy to copy vector tables in vtbx4 intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Same explanation as for patch 3/8:

I haven't added test cases here because these intrinsics don't map to
a single instruction (they're legacy from Armv7) and would trip the
"scan-assembler not mov" that we're using for the other tests.

Thanks,
Jonathan

From: Richard Sandiford 
Sent: 23 July 2021 10:31
To: Kyrylo Tkachov 
Cc: Jonathan Wright ; gcc-patches@gcc.gnu.org 

Subject: Re: [PATCH 4/8] aarch64: Use memcpy to copy vector tables in vtbx4 
intrinsics

Kyrylo Tkachov  writes:
>> -Original Message-
>> From: Jonathan Wright 
>> Sent: 23 July 2021 10:15
>> To: gcc-patches@gcc.gnu.org
>> Cc: Kyrylo Tkachov ; Richard Sandiford
>> 
>> Subject: [PATCH 4/8] aarch64: Use memcpy to copy vector tables in vtbx4
>> intrinsics
>>
>> Hi,
>>
>> This patch uses __builtin_memcpy to copy vector structures instead of
>> building a new opaque structure one vector at a time in each of the
>> vtbx4 Neon intrinsics in arm_neon.h. This simplifies the header file
>> and also improves code generation - superfluous move instructions
>> were emitted for every register extraction/set in this additional
>> structure.
>>
>> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
>> issues.
>>
>> Ok for master?
>
> Ok.

Here too I think we want some testcases…

Thanks,
Richard


[PATCH 8/8] aarch64: Use memcpy to copy vector tables in vst1[q]_x4 intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi,

This patch uses __builtin_memcpy to copy vector structures instead of
using a union in each of the vst1[q]_x4 Neon intrinsics in arm_neon.h.

Add new code generation tests to verify that superfluous move
instructions are not generated for the vst1q_x4 intrinsics.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-21  Jonathan Wright  

* config/aarch64/arm_neon.h (vst1_s8_x4): Use
__builtin_memcpy instead of using a union.
(vst1q_s8_x4): Likewise.
(vst1_s16_x4): Likewise.
(vst1q_s16_x4): Likewise.
(vst1_s32_x4): Likewise.
(vst1q_s32_x4): Likewise.
(vst1_u8_x4): Likewise.
(vst1q_u8_x4): Likewise.
(vst1_u16_x4): Likewise.
(vst1q_u16_x4): Likewise.
(vst1_u32_x4): Likewise.
(vst1q_u32_x4): Likewise.
(vst1_f16_x4): Likewise.
(vst1q_f16_x4): Likewise.
(vst1_f32_x4): Likewise.
(vst1q_f32_x4): Likewise.
(vst1_p8_x4): Likewise.
(vst1q_p8_x4): Likewise.
(vst1_p16_x4): Likewise.
(vst1q_p16_x4): Likewise.
(vst1_s64_x4): Likewise.
(vst1_u64_x4): Likewise.
(vst1_p64_x4): Likewise.
(vst1q_s64_x4): Likewise.
(vst1q_u64_x4): Likewise.
(vst1q_p64_x4): Likewise.
(vst1_f64_x4): Likewise.
(vst1q_f64_x4): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vector_structure_intrinsics.c: Add new
tests.


rb14697.patch
Description: rb14697.patch


[PATCH 7/8] aarch64: Use memcpy to copy vector tables in vst2[q] intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi,

This patch uses __builtin_memcpy to copy vector structures instead of
building a new opaque structure one vector at a time in each of the
vst2[q] Neon intrinsics in arm_neon.h. This simplifies the header file
and also improves code generation - superfluous move instructions
were emitted for every register extraction/set in this additional
structure.
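
As a rough sketch of the change for one intrinsic (a fragment of arm_neon.h;
the builtin names here are from memory and may not match the patch exactly):

  __extension__ extern __inline void
  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
  vst2q_u32 (uint32_t * __a, uint32x4x2_t __val)
  {
    __builtin_aarch64_simd_oi __o;
    /* Previously: one set-register builtin call per vector in __val.
       Now: a single block copy of the whole structure.  */
    __builtin_memcpy (&__o, &__val, sizeof (__val));
    __builtin_aarch64_st2v4si ((__builtin_aarch64_simd_si *) __a, __o);
  }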

Add new code generation tests to verify that superfluous move
instructions are no longer generated for the vst2q intrinsics.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-21  Jonathan Wright  

* config/aarch64/arm_neon.h (vst2_s64): Use __builtin_memcpy
instead of constructing __builtin_aarch64_simd_oi one vector
at a time.
(vst2_u64): Likewise.
(vst2_f64): Likewise.
(vst2_s8): Likewise.
(vst2_p8): Likewise.
(vst2_s16): Likewise.
(vst2_p16): Likewise.
(vst2_s32): Likewise.
(vst2_u8): Likewise.
(vst2_u16): Likewise.
(vst2_u32): Likewise.
(vst2_f16): Likewise.
(vst2_f32): Likewise.
(vst2_p64): Likewise.
(vst2q_s8): Likewise.
(vst2q_p8): Likewise.
(vst2q_s16): Likewise.
(vst2q_p16): Likewise.
(vst2q_s32): Likewise.
(vst2q_s64): Likewise.
(vst2q_u8): Likewise.
(vst2q_u16): Likewise.
(vst2q_u32): Likewise.
(vst2q_u64): Likewise.
(vst2q_f16): Likewise.
(vst2q_f32): Likewise.
(vst2q_f64): Likewise.
(vst2q_p64): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vector_structure_intrinsics.c: Add new
tests.


rb14689.patch
Description: rb14689.patch


Re: [PATCH 3/8] aarch64: Use memcpy to copy vector tables in vtbl[34] intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
I haven't added test cases here because these intrinsics don't map to
a single instruction (they're legacy from Armv7) and would trip the
"scan-assembler not mov" that we're using for the other tests.

Jonathan

From: Richard Sandiford 
Sent: 23 July 2021 10:29
To: Kyrylo Tkachov 
Cc: Jonathan Wright ; gcc-patches@gcc.gnu.org 

Subject: Re: [PATCH 3/8] aarch64: Use memcpy to copy vector tables in vtbl[34] 
intrinsics

Kyrylo Tkachov  writes:
>> -Original Message-
>> From: Jonathan Wright 
>> Sent: 23 July 2021 09:30
>> To: gcc-patches@gcc.gnu.org
>> Cc: Kyrylo Tkachov ; Richard Sandiford
>> 
>> Subject: [PATCH 3/8] aarch64: Use memcpy to copy vector tables in vtbl[34]
>> intrinsics
>>
>> Hi,
>>
>> This patch uses __builtin_memcpy to copy vector structures instead of
>> building a new opaque structure one vector at a time in each of the
>> vtbl[34] Neon intrinsics in arm_neon.h. This simplifies the header file
>> and also improves code generation - superfluous move instructions
>> were emitted for every register extraction/set in this additional
>> structure.
>>
>> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
>> issues.
>>
>> Ok for master?
>
> Ok.

Please add testcases first though. :-)

Thanks,
Richard


[PATCH 6/8] aarch64: Use memcpy to copy vector tables in vst3[q] intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi,

This patch uses __builtin_memcpy to copy vector structures instead of
building a new opaque structure one vector at a time in each of the
vst3[q] Neon intrinsics in arm_neon.h. This simplifies the header file
and also improves code generation - superfluous move instructions
were emitted for every register extraction/set in this additional
structure.

Add new code generation tests to verify that superfluous move
instructions are no longer generated for the vst3q intrinsics.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-21  Jonathan Wright  

* config/aarch64/arm_neon.h (vst3_s64): Use __builtin_memcpy
instead of constructing __builtin_aarch64_simd_ci one vector
at a time.
(vst3_u64): Likewise.
(vst3_f64): Likewise.
(vst3_s8): Likewise.
(vst3_p8): Likewise.
(vst3_s16): Likewise.
(vst3_p16): Likewise.
(vst3_s32): Likewise.
(vst3_u8): Likewise.
(vst3_u16): Likewise.
(vst3_u32): Likewise.
(vst3_f16): Likewise.
(vst3_f32): Likewise.
(vst3_p64): Likewise.
(vst3q_s8): Likewise.
(vst3q_p8): Likewise.
(vst3q_s16): Likewise.
(vst3q_p16): Likewise.
(vst3q_s32): Likewise.
(vst3q_s64): Likewise.
(vst3q_u8): Likewise.
(vst3q_u16): Likewise.
(vst3q_u32): Likewise.
(vst3q_u64): Likewise.
(vst3q_f16): Likewise.
(vst3q_f32): Likewise.
(vst3q_f64): Likewise.
(vst3q_p64): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vector_structure_intrinsics.c: Add new
tests.


rb14688.patch
Description: rb14688.patch


[PATCH 5/8] aarch64: Use memcpy to copy vector tables in vst4[q] intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi,

This patch uses __builtin_memcpy to copy vector structures instead of
building a new opaque structure one vector at a time in each of the
vst4[q] Neon intrinsics in arm_neon.h. This simplifies the header file
and also improves code generation - superfluous move instructions
were emitted for every register extraction/set in this additional
structure.

Add new code generation tests to verify that superfluous move
instructions are no longer generated for the vst4q intrinsics.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-20  Jonathan Wright  

* config/aarch64/arm_neon.h (vst4_s64): Use __builtin_memcpy
instead of constructing __builtin_aarch64_simd_xi one vector
at a time.
(vst4_u64): Likewise.
(vst4_f64): Likewise.
(vst4_s8): Likewise.
(vst4_p8): Likewise.
(vst4_s16): Likewise.
(vst4_p16): Likewise.
(vst4_s32): Likewise.
(vst4_u8): Likewise.
(vst4_u16): Likewise.
(vst4_u32): Likewise.
(vst4_f16): Likewise.
(vst4_f32): Likewise.
(vst4_p64): Likewise.
(vst4q_s8): Likewise.
(vst4q_p8): Likewise.
(vst4q_s16): Likewise.
(vst4q_p16): Likewise.
(vst4q_s32): Likewise.
(vst4q_s64): Likewise.
(vst4q_u8): Likewise.
(vst4q_u16): Likewise.
(vst4q_u32): Likewise.
(vst4q_u64): Likewise.
(vst4q_f16): Likewise.
(vst4q_f32): Likewise.
(vst4q_f64): Likewise.
(vst4q_p64): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vector_structure_intrinsics.c: Add new
tests.


rb14687.patch
Description: rb14687.patch


[PATCH 4/8] aarch64: Use memcpy to copy vector tables in vtbx4 intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi,

This patch uses __builtin_memcpy to copy vector structures instead of
building a new opaque structure one vector at a time in each of the
vtbx4 Neon intrinsics in arm_neon.h. This simplifies the header file
and also improves code generation - superfluous move instructions
were emitted for every register extraction/set in this additional
structure.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-19  Jonathan Wright  

* config/aarch64/arm_neon.h (vtbx4_s8): Use __builtin_memcpy
instead of constructing __builtin_aarch64_simd_oi one vector
at a time.
(vtbx4_u8): Likewise.
(vtbx4_p8): Likewise.


rb14674.patch
Description: rb14674.patch


[PATCH 3/8] aarch64: Use memcpy to copy vector tables in vtbl[34] intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi,

This patch uses __builtin_memcpy to copy vector structures instead of
building a new opaque structure one vector at a time in each of the
vtbl[34] Neon intrinsics in arm_neon.h. This simplifies the header file
and also improves code generation - superfluous move instructions
were emitted for every register extraction/set in this additional
structure.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-08  Jonathan Wright  

* config/aarch64/arm_neon.h (vtbl3_s8): Use __builtin_memcpy
instead of constructing __builtin_aarch64_simd_oi one vector
at a time.
(vtbl3_u8): Likewise.
(vtbl3_p8): Likewise.
(vtbl4_s8): Likewise.
(vtbl4_u8): Likewise.
(vtbl4_p8): Likewise.

rb14673.patch
Description: rb14673.patch


[PATCH 2/8] aarch64: Use memcpy to copy vector tables in vqtbx[234] intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi,

This patch uses __builtin_memcpy to copy vector structures instead of
building a new opaque structure one vector at a time in each of the
vqtbx[234] Neon intrinsics in arm_neon.h. This simplifies the header
file and also improves code generation - superfluous move
instructions were emitted for every register extraction/set in this
additional structure.

Add new code generation tests to verify that superfluous move
instructions are no longer generated for the vqtbx[234] intrinsics.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-08  Jonathan Wright  

* config/aarch64/arm_neon.h (vqtbx2_s8): Use __builtin_memcpy
instead of constructing __builtin_aarch64_simd_oi one vector
at a time.
(vqtbx2_u8): Likewise.
(vqtbx2_p8): Likewise.
(vqtbx2q_s8): Likewise.
(vqtbx2q_u8): Likewise.
(vqtbx2q_p8): Likewise.
(vqtbx3_s8): Use __builtin_memcpy instead of constructing
__builtin_aarch64_simd_ci one vector at a time.
(vqtbx3_u8): Likewise.
(vqtbx3_p8): Likewise.
(vqtbx3q_s8): Likewise.
(vqtbx3q_u8): Likewise.
(vqtbx3q_p8): Likewise.
(vqtbx4_s8): Use __builtin_memcpy instead of constructing
__builtin_aarch64_simd_xi one vector at a time.
(vqtbx4_u8): Likewise.
(vqtbx4_p8): Likewise.
(vqtbx4q_s8): Likewise.
(vqtbx4q_u8): Likewise.
(vqtbx4q_p8): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vector_structure_intrinsics.c: New tests.

rb14640.patch
Description: rb14640.patch


[PATCH 1/8] aarch64: Use memcpy to copy vector tables in vqtbl[234] intrinsics

2021-07-23 Thread Jonathan Wright via Gcc-patches
Hi,

This patch uses __builtin_memcpy to copy vector structures instead of
building a new opaque structure one vector at a time in each of the
vqtbl[234] Neon intrinsics in arm_neon.h. This simplifies the header file
and also improves code generation - superfluous move instructions
were emitted for every register extraction/set in this additional
structure.
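
For illustration, the rewritten form of one intrinsic looks roughly like this
(a fragment of arm_neon.h; the builtin names are illustrative rather than
copied from the patch):

  __extension__ extern __inline uint8x16_t
  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
  vqtbl2q_u8 (uint8x16x2_t __tab, uint8x16_t __idx)
  {
    __builtin_aarch64_simd_oi __o;
    /* Previously two per-vector register-set builtin calls; now one block
       copy, which avoids the superfluous moves mentioned above.  */
    __builtin_memcpy (&__o, &__tab, sizeof (__tab));
    return (uint8x16_t) __builtin_aarch64_qtbl2v16qi (__o, (int8x16_t) __idx);
  }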

Add new code generation tests to verify that superfluous move
instructions are no longer generated for the vqtbl[234] intrinsics.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-08  Jonathan Wright  

* config/aarch64/arm_neon.h (vqtbl2_s8): Use __builtin_memcpy
instead of constructing __builtin_aarch64_simd_oi one vector
at a time.
(vqtbl2_u8): Likewise.
(vqtbl2_p8): Likewise.
(vqtbl2q_s8): Likewise.
(vqtbl2q_u8): Likewise.
(vqtbl2q_p8): Likewise.
(vqtbl3_s8): Use __builtin_memcpy instead of constructing
__builtin_aarch64_simd_ci one vector at a time.
(vqtbl3_u8): Likewise.
(vqtbl3_p8): Likewise.
(vqtbl3q_s8): Likewise.
(vqtbl3q_u8): Likewise.
(vqtbl3q_p8): Likewise.
(vqtbl4_s8): Use __builtin_memcpy instead of constructing
__builtin_aarch64_simd_xi one vector at a time.
(vqtbl4_u8): Likewise.
(vqtbl4_p8): Likewise.
(vqtbl4q_s8): Likewise.
(vqtbl4q_u8): Likewise.
(vqtbl4q_p8): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vector_structure_intrinsics.c: New test.


rb14639.patch
Description: rb14639.patch


[PATCH] simplify-rtx: Push sign/zero-extension inside vec_duplicate

2021-07-20 Thread Jonathan Wright via Gcc-patches
Hi,

As a general principle, vec_duplicate should be as close to the root
of an expression as possible. Where unary operations have
vec_duplicate as an argument, these operations should be pushed
inside the vec_duplicate.

This patch modifies unary operation simplification to push
sign/zero-extension of a scalar inside vec_duplicate.
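
As a source-level illustration (not taken from the patch) of where this form
shows up:

  #include <arm_neon.h>

  /* A widening multiply-accumulate by a scalar: its RTL contains an
     extension and a vec_duplicate of the scalar operand.  The
     canonicalization fixes which of the two sits outermost, so that
     simplify-rtx output and the updated aarch64-simd.md patterns agree on
     a single form.  */
  int32x4_t
  mlal_high_by_scalar (int32x4_t acc, int16x8_t a, int16_t b)
  {
    return vmlal_high_n_s16 (acc, a, b);
  }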

This patch also updates all RTL patterns in aarch64-simd.md to use
the new canonical form.

Regression tested and bootstrapped on aarch64-none-linux-gnu and
x86_64-none-linux-gnu - no issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-19  Jonathan Wright  

* config/aarch64/aarch64-simd.md: Push sign/zero-extension
inside vec_duplicate for all patterns.
* simplify-rtx.c (simplify_context::simplify_unary_operation_1):
Push sign/zero-extension inside vec_duplicate.

rb14677.patch
Description: rb14677.patch


[PATCH] aarch64: Don't include vec_select in SIMD multiply cost

2021-07-20 Thread Jonathan Wright via Gcc-patches
Hi,

The Neon multiply/multiply-accumulate/multiply-subtract instructions
can take various forms - multiplying full vector registers of values
or multiplying one vector by a single element of another. Regardless
of the form used, these instructions have the same cost, and this
should be reflected by the RTL cost function.

This patch adds RTL tree traversal in the Neon multiply cost function
to match the vec_select used by the lane-referencing forms of the
instructions already mentioned. This traversal prevents the cost of
the vec_select from being added into the cost of the multiply -
meaning that these instructions can now be emitted in the combine
pass as they are no longer deemed prohibitively expensive.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-19  Jonathan Wright  

* config/aarch64/aarch64.c (aarch64_rtx_mult_cost): Traverse
RTL tree to prevent vec_select from being added into Neon
multiply cost.


rb14675.patch
Description: rb14675.patch


[PATCH] aarch64: Refactor TBL/TBX RTL patterns

2021-07-19 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch renames the two-source-register TBL/TBX RTL
patterns so that their names better reflect what they do, rather than
confusing them with tbl3 or tbx4 patterns. Also use the correct
"neon_tbl2" type attribute for both patterns.

Rename single-source-register TBL/TBX patterns for consistency.

Bootstrapped and regression tested on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-08  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Use two variant
generators for all TBL/TBX intrinsics and rename to
consistent forms: qtbl[1234] or qtbx[1234].
* config/aarch64/aarch64-simd.md (aarch64_tbl1):
Rename to...
(aarch64_qtbl1): This.
(aarch64_tbx1): Rename to...
(aarch64_qtbx1): This.
(aarch64_tbl2v16qi): Delete.
(aarch64_tbl3): Rename to...
(aarch64_qtbl2): This.
(aarch64_tbx4): Rename to...
(aarch64_qtbx2): This.
* config/aarch64/aarch64.c (aarch64_expand_vec_perm_1): Use
renamed qtbl1 and qtbl2 RTL patterns.
* config/aarch64/arm_neon.h (vqtbl1_p8): Use renamed qtbl1
RTL pattern.
(vqtbl1_s8): Likewise.
(vqtbl1_u8): Likewise.
(vqtbl1q_p8): Likewise.
(vqtbl1q_s8): Likewise.
(vqtbl1q_u8): Likewise.
(vqtbx1_s8): Use renamed qtbx1 RTL pattern.
(vqtbx1_u8): Likewise.
(vqtbx1_p8): Likewise.
(vqtbx1q_s8): Likewise.
(vqtbx1q_u8): Likewise.
(vqtbx1q_p8): Likewise.
(vtbl1_s8): Use renamed qtbl1 RTL pattern.
(vtbl1_u8): Likewise.
(vtbl1_p8): Likewise.
(vtbl2_s8): Likewise
(vtbl2_u8): Likewise.
(vtbl2_p8): Likewise.
(vtbl3_s8): Use renamed qtbl2 RTL pattern.
(vtbl3_u8): Likewise.
(vtbl3_p8): Likewise.
(vtbl4_s8): Likewise.
(vtbl4_u8): Likewise.
(vtbl4_p8): Likewise.
(vtbx2_s8): Use renamed qtbx2 RTL pattern.
(vtbx2_u8): Likewise.
(vtbx2_p8): Likewise.
(vqtbl2_s8): Use renamed qtbl2 RTL pattern.
(vqtbl2_u8): Likewise.
(vqtbl2_p8): Likewise.
(vqtbl2q_s8): Likewise.
(vqtbl2q_u8): Likewise.
(vqtbl2q_p8): Likewise.
(vqtbx2_s8): Use renamed qtbx2 RTL pattern.
(vqtbx2_u8): Likewise.
(vqtbx2_p8): Likewise.
(vqtbx2q_s8): Likewise.
(vqtbx2q_u8): Likewise.
(vqtbx2q_p8): Likewise.
(vtbx4_s8): Likewise.
(vtbx4_u8): Likewise.
(vtbx4_p8): Likewise.


rb14671.patch
Description: rb14671.patch


Re: [PATCH V2] gcc: Add vec_select -> subreg RTL simplification

2021-07-15 Thread Jonathan Wright via Gcc-patches
Ah, yes - those test results should have only been changed for little endian.

I've submitted a patch to the list restoring the original expected results
for big endian.

Thanks,
Jonathan

From: Christophe Lyon 
Sent: 15 July 2021 10:09
To: Richard Sandiford ; Jonathan Wright 
; gcc-patches@gcc.gnu.org ; 
Kyrylo Tkachov 
Subject: Re: [PATCH V2] gcc: Add vec_select -> subreg RTL simplification



On Mon, Jul 12, 2021 at 5:31 PM Richard Sandiford via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
Jonathan Wright <jonathan.wri...@arm.com> writes:
> Hi,
>
> Version 2 of this patch adds more code generation tests to show the
> benefit of this RTL simplification as well as adding a new helper function
> 'rtx_vec_series_p' to reduce code duplication.
>
> Patch tested as version 1 - ok for master?

Sorry for the slow reply.

> Regression tested and bootstrapped on aarch64-none-linux-gnu,
> x86_64-unknown-linux-gnu, arm-none-linux-gnueabihf and
> aarch64_be-none-linux-gnu - no issues.

I've also tested this on powerpc64le-unknown-linux-gnu, no issues again.

> diff --git a/gcc/combine.c b/gcc/combine.c
> index 
> 6476812a21268e28219d1e302ee1c979d528a6ca..0ff6ca87e4432cfeff1cae1dd219ea81ea0b73e4
>  100644
> --- a/gcc/combine.c
> +++ b/gcc/combine.c
> @@ -6276,6 +6276,26 @@ combine_simplify_rtx (rtx x, machine_mode op0_mode, 
> int in_dest,
> - 1,
> 0));
>break;
> +case VEC_SELECT:
> +  {
> + rtx trueop0 = XEXP (x, 0);
> + mode = GET_MODE (trueop0);
> + rtx trueop1 = XEXP (x, 1);
> + int nunits;
> + /* If we select a low-part subreg, return that.  */
> + if (GET_MODE_NUNITS (mode).is_constant ()
> + && targetm.can_change_mode_class (mode, GET_MODE (x), ALL_REGS))
> +   {
> + int offset = BYTES_BIG_ENDIAN ? nunits - XVECLEN (trueop1, 0) : 0;
> +
> + if (rtx_vec_series_p (trueop1, offset))
> +   {
> + rtx new_rtx = lowpart_subreg (GET_MODE (x), trueop0, mode);
> + if (new_rtx != NULL_RTX)
> +   return new_rtx;
> +   }
> +   }
> +  }

Since this occurs three times, I think it would be worth having
a new predicate:

/* Return true if, for all OP of mode OP_MODE:

 (vec_select:RESULT_MODE OP SEL)

   is equivalent to the lowpart RESULT_MODE of OP.  */

bool
vec_series_lowpart_p (machine_mode result_mode, machine_mode op_mode, rtx sel)

containing the GET_MODE_NUNITS (…).is_constant, can_change_mode_class
and rtx_vec_series_p tests.

I think the function belongs in rtlanal.[hc], even though subreg_lowpart_p
is in emit-rtl.c.

> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 
> aef6da9732d45b3586bad5ba57dafa438374ac3c..f12a0bebd3d6dd3381ac8248cd3fa3f519115105
>  100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -1884,15 +1884,16 @@
>  )
>
>  (define_insn "*zero_extend2_aarch64"
> -  [(set (match_operand:GPI 0 "register_operand" "=r,r,w")
> -(zero_extend:GPI (match_operand:SHORT 1 "nonimmediate_operand" 
> "r,m,m")))]
> +  [(set (match_operand:GPI 0 "register_operand" "=r,r,w,r")
> +(zero_extend:GPI (match_operand:SHORT 1 "nonimmediate_operand" 
> "r,m,m,w")))]
>""
>"@
> and\t%0, %1, 
> ldr\t%w0, %1
> -   ldr\t%0, %1"
> -  [(set_attr "type" "logic_imm,load_4,f_loads")
> -   (set_attr "arch" "*,*,fp")]
> +   ldr\t%0, %1
> +   umov\t%w0, %1.[0]"
> +  [(set_attr "type" "logic_imm,load_4,f_loads,neon_to_gp")
> +   (set_attr "arch" "*,*,fp,fp")]

FTR (just to show I thought about it): I don't know whether the umov
can really be considered an fp operation rather than a simd operation,
but since we don't support fp without simd, this is already a distinction
without a difference.  So the pattern is IMO OK as-is.

> diff --git a/gcc/config/arm/vfp.md b/gcc/config/arm/vfp.md
> index 
> 55b6c1ac585a4cae0789c3afc0fccfc05a6d3653..93e963696dad30f29a76025696670f8b31bf2c35
>  100644
> --- a/gcc/config/arm/vfp.md
> +++ b/gcc/config/arm/vfp.md
> @@ -224,7 +224,7 @@
>  ;; problems because small constants get converted into adds.
>  (define_insn "*arm_movsi_vfp"
>[(set (match_operand:SI 0 "nonimmediate_operand" "=rk,r,r,r,rk,m 
> ,*t,r,*t,*t, *Uv")
> -  (match_operand:SI 1 "general_operand" "rk, 
> I,K,j,mi,rk,r,*t,*t,*Uvi,*t"))]
> +  (match_operand:SI 1 "general_operand" "rk, 
> I,K,j,mi,rk,r,t,*t,*Uvi,*t"))]
>"TARGET_ARM && TARGET_HARD_FLOAT
> && (   s_register_operand (operands[0], SImode)
> || s_register_operand (operands[1], SImode))"

I'll assume that an Arm maintainer would have spoken up by now if
they didn't want this for some reason.

> diff --git a/gcc/rtl.c b/gcc/rtl.c
> index 
> aaee882f5ca3e37b59c9829e41d0864070c170eb..3e8b3628b0b76b41889b77bb0019f582ee6f5aaa
>  100644
> --- a/gcc/rtl.c
> +++ b/gcc/rtl.c
> @@ -736,6 +736,19 @@ rtvec_all_equal_p (const_rtvec 

testsuite: aarch64: Fix failing SVE tests on big endian

2021-07-15 Thread Jonathan Wright via Gcc-patches
Hi,

A recent change "gcc: Add vec_select -> subreg RTL simplification"
updated the expected test results for SVE extraction tests. The new
result should only have been changed for little endian. This patch
restores the old expected result for big endian.

Ok for master?

Thanks,
Jonathan

---

gcc/testsuite/ChangeLog:

2021-07-15  Jonathan Wright  

* gcc.target/aarch64/sve/extract_1.c: Split expected results
by big/little endian targets, restoring the old expected
result for big endian.
* gcc.target/aarch64/sve/extract_2.c: Likewise.
* gcc.target/aarch64/sve/extract_3.c: Likewise.
* gcc.target/aarch64/sve/extract_4.c: Likewise.


rb14655.patch
Description: rb14655.patch


[PATCH] aarch64: Use unions for vector tables in vqtbl[234] intrinsics

2021-07-09 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch uses a union instead of constructing a new opaque
vector structure for each of the vqtbl[234] Neon intrinsics in arm_neon.h.
This simplifies the header file and also improves code generation -
superfluous move instructions were emitted for every register
extraction/set in this additional structure.

This change is safe because the C-level vector structure types e.g.
uint8x16x4_t already provide a tie for sequential register allocation
- which is required by the TBL instructions.
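
For illustration, the union-based form of one intrinsic looks roughly like
this (an arm_neon.h fragment; the builtin name is illustrative):

  __extension__ extern __inline uint8x16_t
  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
  vqtbl2q_u8 (uint8x16x2_t __tab, uint8x16_t __idx)
  {
    /* The union ties the C-level pair of vectors to the opaque OI-mode
       value directly, instead of building it one register at a time.  */
    union { uint8x16x2_t __i; __builtin_aarch64_simd_oi __o; } __u = { __tab };
    return (uint8x16_t) __builtin_aarch64_qtbl2v16qi (__u.__o,
                                                      (int8x16_t) __idx);
  }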

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-08  Jonathan Wright  

* config/aarch64/arm_neon.h (vqtbl2_s8): Use union instead of
additional __builtin_aarch64_simd_oi structure.
(vqtbl2_u8): Likewise.
(vqtbl2_p8): Likewise.
(vqtbl2q_s8): Likewise.
(vqtbl2q_u8): Likewise.
(vqtbl2q_p8): Likewise.
(vqtbl3_s8): Use union instead of additional
__builtin_aarch64_simd_ci structure.
(vqtbl3_u8): Likewise.
(vqtbl3_p8): Likewise.
(vqtbl3q_s8): Likewise.
(vqtbl3q_u8): Likewise.
(vqtbl3q_p8): Likewise.
(vqtbl4_s8): Use union instead of additional
__builtin_aarch64_simd_xi structure.
(vqtbl4_u8): Likewise.
(vqtbl4_p8): Likewise.
(vqtbl4q_s8): Likewise.
(vqtbl4q_u8): Likewise.
(vqtbl4q_p8): Likewise.


rb14639.patch
Description: rb14639.patch


[PATCH V2] gcc: Add vec_select -> subreg RTL simplification

2021-07-07 Thread Jonathan Wright via Gcc-patches
Hi,

Version 2 of this patch adds more code generation tests to show the
benefit of this RTL simplification as well as adding a new helper function
'rtx_vec_series_p' to reduce code duplication.
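
For reference, a minimal sketch of what such a helper could look like,
inferred from the call site in the combine.c hunk quoted earlier in this
thread (the real definition lives in rtl.c in the attached patch):

  /* Return true if SEL is a PARALLEL of consecutive CONST_INTs starting
     at START - i.e. the selection indices are in series.  */
  bool
  rtx_vec_series_p (rtx sel, int start)
  {
    for (int i = 0; i < XVECLEN (sel, 0); i++)
      {
        rtx elt = XVECEXP (sel, 0, i);
        if (!CONST_INT_P (elt) || INTVAL (elt) != start + i)
          return false;
      }
    return true;
  }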

Patch tested as version 1 - ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-06-08  Jonathan Wright  

* combine.c (combine_simplify_rtx): Add vec_select -> subreg
simplification.
* config/aarch64/aarch64.md 
(*zero_extend2_aarch64):
Add Neon to general purpose register case for zero-extend
pattern.
* config/arm/vfp.md (*arm_movsi_vfp): Remove "*" from *t -> r
case to prevent some cases opting to go through memory.
* cse.c (fold_rtx): Add vec_select -> subreg simplification.
* rtl.c (rtx_vec_series_p): Define helper function to
determine whether RTX vector-selection indices are in series.
* rtl.h (rtx_vec_series_p): Define.
* simplify-rtx.c (simplify_context::simplify_binary_operation_1):
Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/extract_zero_extend.c: Remove dump scan
for RTL pattern match.
* gcc.target/aarch64/narrow_high_combine.c: Add new tests.
* gcc.target/aarch64/simd/vmulx_laneq_f64_1.c: Update
scan-assembler regex to look for a scalar register instead of
lane 0 of a vector.
* gcc.target/aarch64/simd/vmulx_laneq_f64_1.c: Likewise.
* gcc.target/aarch64/simd/vmulxd_laneq_f64_1.c: Likewise.
* gcc.target/aarch64/simd/vmulxs_lane_f32_1.c: Likewise.
* gcc.target/aarch64/simd/vmulxs_laneq_f32_1.c: Likewise.
* gcc.target/aarch64/simd/vqdmlalh_lane_s16.c: Likewise.
* gcc.target/aarch64/simd/vqdmlals_lane_s32.c: Likewise.
* gcc.target/aarch64/simd/vqdmlslh_lane_s16.c: Likewise.
* gcc.target/aarch64/simd/vqdmlsls_lane_s32.c: Likewise.
* gcc.target/aarch64/simd/vqdmullh_lane_s16.c: Likewise.
* gcc.target/aarch64/simd/vqdmullh_laneq_s16.c: Likewise.
* gcc.target/aarch64/simd/vqdmulls_lane_s32.c: Likewise.
* gcc.target/aarch64/simd/vqdmulls_laneq_s32.c: Likewise.
* gcc.target/aarch64/sve/dup_lane_1.c: Likewise.
* gcc.target/aarch64/sve/live_1.c: Update scan-assembler regex
cases to look for 'b' and 'h' registers instead of 'w'.
* gcc.target/arm/mve/intrinsics/vgetq_lane_f16.c: Extract
lane 1 as the moves for lane 0 now get optimized away.
* gcc.target/arm/mve/intrinsics/vgetq_lane_f32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vgetq_lane_s16.c: Likewise.
* gcc.target/arm/mve/intrinsics/vgetq_lane_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vgetq_lane_s8.c: Likewise.
* gcc.target/arm/mve/intrinsics/vgetq_lane_u16.c: Likewise.
* gcc.target/arm/mve/intrinsics/vgetq_lane_u32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vgetq_lane_u8.c: Likewise.



From: Jonathan Wright
Sent: 02 July 2021 10:53
To: gcc-patches@gcc.gnu.org 
Cc: Richard Sandiford ; Kyrylo Tkachov 

Subject: [PATCH] gcc: Add vec_select -> subreg RTL simplification 
 
Hi,

As subject, this patch adds a new RTL simplification for the case of a
VEC_SELECT selecting the low part of a vector. The simplification
returns a SUBREG.

The primary goal of this patch is to enable better combinations of
Neon RTL patterns - specifically allowing generation of 'write-to-
high-half' narrowing instructions.

Adding this RTL simplification means that the expected results for a
number of tests need to be updated:
* aarch64 Neon: Update the scan-assembler regex for intrinsics tests
  to expect a scalar register instead of lane 0 of a vector.
* aarch64 SVE: Likewise.
* arm MVE: Use lane 1 instead of lane 0 for lane-extraction
  intrinsics tests (as the move instructions get optimized away for
  lane 0.)

Regression tested and bootstrapped on aarch64-none-linux-gnu,
x86_64-unknown-linux-gnu, arm-none-linux-gnueabihf and
aarch64_be-none-linux-gnu - no issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-06-08  Jonathan Wright  

    * combine.c (combine_simplify_rtx): Add vec_select -> subreg
    simplification.
    * config/aarch64/aarch64.md 
(*zero_extend2_aarch64):
    Add Neon to general purpose register case for zero-extend
    pattern.
    * config/arm/vfp.md (*arm_movsi_vfp): Remove "*" from *t -> r
    case to prevent some cases opting to go through memory.
    * cse.c (fold_rtx): Add vec_select -> subreg simplification.
    * simplify-rtx.c (simplify_context::simplify_binary_operation_1):
    Likewise.

gcc/testsuite/ChangeLog:

    * gcc.target/aarch64/extract_zero_extend.c: Remove dump scan
    for RTL pattern match.
    * gcc.target/aarch64/simd/vmulx_laneq_f64_1.c: Update
    scan-assembler regex to look for a scalar register instead of
    lane 0 of a vector.
    * gcc.target/aarch64/simd/vmulx_laneq_f64_1.c: Likewise.

[PATCH] gcc: Add vec_select -> subreg RTL simplification

2021-07-02 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch adds a new RTL simplification for the case of a
VEC_SELECT selecting the low part of a vector. The simplification
returns a SUBREG.

The primary goal of this patch is to enable better combinations of
Neon RTL patterns - specifically allowing generation of 'write-to-
high-half' narrowing instructions.
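
For illustration only (the new narrow_high_combine.c tests are in the
attached patch), the kind of code this helps:

  #include <arm_neon.h>

  /* Hand-written sketch: with the low-half vec_select simplified to a
     subreg, combine should be able to pair the two narrows into
     xtn + xtn2, writing the second result straight into the high half
     instead of assembling it with extra moves.  */
  uint16x8_t
  narrow_pair (uint32x4_t a, uint32x4_t b)
  {
    return vcombine_u16 (vmovn_u32 (a), vmovn_u32 (b));
  }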

Adding this RTL simplification means that the expected results for a
number of tests need to be updated:
* aarch64 Neon: Update the scan-assembler regex for intrinsics tests
  to expect a scalar register instead of lane 0 of a vector.
* aarch64 SVE: Likewise.
* arm MVE: Use lane 1 instead of lane 0 for lane-extraction
  intrinsics tests (as the move instructions get optimized away for
  lane 0.)

Regression tested and bootstrapped on aarch64-none-linux-gnu,
x86_64-unknown-linux-gnu, arm-none-linux-gnueabihf and
aarch64_be-none-linux-gnu - no issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-06-08  Jonathan Wright  

* combine.c (combine_simplify_rtx): Add vec_select -> subreg
simplification.
* config/aarch64/aarch64.md 
(*zero_extend2_aarch64):
Add Neon to general purpose register case for zero-extend
pattern.
* config/arm/vfp.md (*arm_movsi_vfp): Remove "*" from *t -> r
case to prevent some cases opting to go through memory.
* cse.c (fold_rtx): Add vec_select -> subreg simplification.
* simplify-rtx.c (simplify_context::simplify_binary_operation_1):
Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/extract_zero_extend.c: Remove dump scan
for RTL pattern match.
* gcc.target/aarch64/simd/vmulx_laneq_f64_1.c: Update
scan-assembler regex to look for a scalar register instead of
lane 0 of a vector.
* gcc.target/aarch64/simd/vmulx_laneq_f64_1.c: Likewise.
* gcc.target/aarch64/simd/vmulxd_laneq_f64_1.c: Likewise.
* gcc.target/aarch64/simd/vmulxs_lane_f32_1.c: Likewise.
* gcc.target/aarch64/simd/vmulxs_laneq_f32_1.c: Likewise.
* gcc.target/aarch64/simd/vqdmlalh_lane_s16.c: Likewise.
* gcc.target/aarch64/simd/vqdmlals_lane_s32.c: Likewise.
* gcc.target/aarch64/simd/vqdmlslh_lane_s16.c: Likewise.
* gcc.target/aarch64/simd/vqdmlsls_lane_s32.c: Likewise.
* gcc.target/aarch64/simd/vqdmullh_lane_s16.c: Likewise.
* gcc.target/aarch64/simd/vqdmullh_laneq_s16.c: Likewise.
* gcc.target/aarch64/simd/vqdmulls_lane_s32.c: Likewise.
* gcc.target/aarch64/simd/vqdmulls_laneq_s32.c: Likewise.
* gcc.target/aarch64/sve/dup_lane_1.c: Likewise.
* gcc.target/aarch64/sve/live_1.c: Update scan-assembler regex
cases to look for 'b' and 'h' registers instead of 'w'.
* gcc.target/arm/mve/intrinsics/vgetq_lane_f16.c: Extract
lane 1 as the moves for lane 0 now get optimized away.
* gcc.target/arm/mve/intrinsics/vgetq_lane_f32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vgetq_lane_s16.c: Likewise.
* gcc.target/arm/mve/intrinsics/vgetq_lane_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vgetq_lane_s8.c: Likewise.
* gcc.target/arm/mve/intrinsics/vgetq_lane_u16.c: Likewise.
* gcc.target/arm/mve/intrinsics/vgetq_lane_u32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vgetq_lane_u8.c: Likewise.


rb14526.patch
Description: rb14526.patch


[PATCH V2] aarch64: Model zero-high-half semantics of ADDHN/SUBHN instructions

2021-06-16 Thread Jonathan Wright via Gcc-patches
Hi,

Version 2 of this patch adds tests to verify the benefit of this change.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-06-14  Jonathan Wright  

* config/aarch64/aarch64-simd.md (aarch64_hn):
Change to an expander that emits the correct instruction
depending on endianness.
(aarch64_hn_insn_le): Define.
(aarch64_hn_insn_be): Define.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/narrow_zero_high_half.c: Add new tests.

From: Gcc-patches  on 
behalf of Jonathan Wright via Gcc-patches 
Sent: 15 June 2021 11:02
To: gcc-patches@gcc.gnu.org 
Subject: [PATCH] aarch64: Model zero-high-half semantics of ADDHN/SUBHN 
instructions 
 
Hi,

As subject, this patch models the zero-high-half semantics of the
narrowing arithmetic Neon instructions in the
aarch64_hn RTL pattern. Modeling these
semantics allows for better RTL combinations while also removing
some register allocation issues as the compiler now knows that the
operation is totally destructive.
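
For illustration of the idiom the new tests exercise (the actual
narrow_zero_high_half.c additions are in the attached patch):

  #include <arm_neon.h>

  /* Hand-written sketch: because the RTL now records that ADDHN zeroes the
     high half of its destination, the vcombine with zeros should fold
     away, leaving a single addhn instruction.  */
  int16x8_t
  addhn_zero_high (int32x4_t a, int32x4_t b)
  {
    return vcombine_s16 (vaddhn_s32 (a, b), vdup_n_s16 (0));
  }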

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-06-14  Jonathan Wright  

    * config/aarch64/aarch64-simd.md (aarch64_hn):
    Change to an expander that emits the correct instruction
    depending on endianness.
    (aarch64_hn_insn_le): Define.
    (aarch64_hn_insn_be): Define.

rb14566.patch
Description: rb14566.patch


[PATCH V2] aarch64: Model zero-high-half semantics of [SU]QXTN instructions

2021-06-16 Thread Jonathan Wright via Gcc-patches
Hi,

Version 2 of the patch adds tests to verify the benefit of this change.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-06-14  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Split generator
for aarch64_qmovn builtins into scalar and vector
variants.
* config/aarch64/aarch64-simd.md (aarch64_qmovn_insn_le):
Define.
(aarch64_qmovn_insn_be): Define.
(aarch64_qmovn): Split into scalar and vector
variants. Change vector variant to an expander that emits the
correct instruction depending on endianness.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/narrow_zero_high_half.c: Add new tests.


From: Gcc-patches  on 
behalf of Jonathan Wright via Gcc-patches 
Sent: 15 June 2021 10:59
To: gcc-patches@gcc.gnu.org 
Subject: [PATCH] aarch64: Model zero-high-half semantics of [SU]QXTN 
instructions 
 
Hi,

As subject, this patch first splits the aarch64_qmovn
pattern into separate scalar and vector variants. It then further splits
the vector RTL  pattern into big/little endian variants that model the
zero-high-half semantics of the underlying instruction. Modeling
these semantics allows for better RTL combinations while also
removing some register allocation issues as the compiler now knows
that the operation is totally destructive.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-06-14  Jonathan Wright  

    * config/aarch64/aarch64-simd-builtins.def: Split generator
    for aarch64_qmovn builtins into scalar and vector
    variants.
    * config/aarch64/aarch64-simd.md (aarch64_qmovn_insn_le):
    Define.
    (aarch64_qmovn_insn_be): Define.
    (aarch64_qmovn): Split into scalar and vector
    variants. Change vector variant to an expander that emits the
    correct instruction depending on endianness.

rb14565.patch
Description: rb14565.patch


[PATCH V2] aarch64: Model zero-high-half semantics of SQXTUN instruction in RTL

2021-06-16 Thread Jonathan Wright via Gcc-patches
Hi,

Version 2 of the patch adds tests to verify the benefit of this change.

Ok for master?

Thanks,
Jonathan

---
gcc/ChangeLog:

2021-06-14  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Split generator
for aarch64_sqmovun builtins into scalar and vector variants.
* config/aarch64/aarch64-simd.md (aarch64_sqmovun):
Split into scalar and vector variants. Change vector variant
to an expander that emits the correct instruction depending
on endianness.
(aarch64_sqmovun_insn_le): Define.
(aarch64_sqmovun_insn_be): Define.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/narrow_zero_high_half.c: Add new tests.

From: Gcc-patches  on 
behalf of Jonathan Wright via Gcc-patches 
Sent: 15 June 2021 10:52
To: gcc-patches@gcc.gnu.org 
Subject: [PATCH] aarch64: Model zero-high-half semantics of SQXTUN instruction 
in RTL 
 
Hi,

As subject, this patch first splits the aarch64_sqmovun pattern
into separate scalar and vector variants. It then further splits the vector
pattern into big/little endian variants that model the zero-high-half
semantics of the underlying instruction. Modeling these semantics
allows for better RTL combinations while also removing some register
allocation issues as the compiler now knows that the operation is
totally destructive.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-06-14  Jonathan Wright  

    * config/aarch64/aarch64-simd-builtins.def: Split generator
    for aarch64_sqmovun builtins into scalar and vector variants.
    * config/aarch64/aarch64-simd.md (aarch64_sqmovun):
    Split into scalar and vector variants. Change vector variant
    to an expander that emits the correct instruction depending
    on endianness.
    (aarch64_sqmovun_insn_le): Define.
    (aarch64_sqmovun_insn_be): Define.

rb14564.patch
Description: rb14564.patch


[PATCH V2] aarch64: Model zero-high-half semantics of XTN instruction in RTL

2021-06-16 Thread Jonathan Wright via Gcc-patches
Hi,

Version 2 of this patch adds tests to verify the benefit of this change.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-06-11  Jonathan Wright  

* config/aarch64/aarch64-simd.md (aarch64_xtn_insn_le):
Define - modeling zero-high-half semantics.
(aarch64_xtn): Change to an expander that emits the
appropriate instruction depending on endianness.
(aarch64_xtn_insn_be): Define - modeling zero-high-half
semantics.
(aarch64_xtn2_le): Rename to...
(aarch64_xtn2_insn_le): This.
(aarch64_xtn2_be): Rename to...
(aarch64_xtn2_insn_be): This.
(vec_pack_trunc_): Emit truncation instruction instead
of aarch64_xtn.
* config/aarch64/iterators.md (Vnarrowd): Add Vnarrowd mode
attribute iterator.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/narrow_zero_high_half.c: Add new tests.


From: Gcc-patches  on 
behalf of Jonathan Wright via Gcc-patches 
Sent: 15 June 2021 10:45
To: gcc-patches@gcc.gnu.org 
Subject: [PATCH] aarch64: Model zero-high-half semantics of XTN instruction in 
RTL 
 
Hi,

Modeling the zero-high-half semantics of the XTN narrowing
instruction in RTL indicates to the compiler that this is a totally
destructive operation. This enables more RTL simplifications and also
prevents some register allocation issues.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-06-11  Jonathan Wright  

    * config/aarch64/aarch64-simd.md (aarch64_xtn_insn_le):
    Define - modeling zero-high-half semantics.
    (aarch64_xtn): Change to an expander that emits the
    appropriate instruction depending on endianness.
    (aarch64_xtn_insn_be): Define - modeling zero-high-half
    semantics.
    (aarch64_xtn2_le): Rename to...
    (aarch64_xtn2_insn_le): This.
    (aarch64_xtn2_be): Rename to...
    (aarch64_xtn2_insn_be): This.
    (vec_pack_trunc_): Emit truncation instruction instead
    of aarch64_xtn.
    * config/aarch64/iterators.md (Vnarrowd): Add Vnarrowd mode
    attribute iterator.

rb14563.patch
Description: rb14563.patch


[PATCH] testsuite: aarch64: Add zero-high-half tests for narrowing shifts

2021-06-16 Thread Jonathan Wright via Gcc-patches
Hi,

This patch adds tests to verify that Neon narrowing-shift instructions
clear the top half of the result vector. It is sufficient to show that a
subsequent combine with a zero-vector is optimized away - leaving
just the narrowing-shift instruction.
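
A minimal sketch of the shape of such a test (the function and operand choices
here are illustrative, not the actual test file):

/* { dg-do compile } */
/* { dg-options "-O2" } */

#include <arm_neon.h>

uint8x16_t
shrn_zero_high (uint16x8_t a)
{
  return vcombine_u8 (vshrn_n_u16 (a, 4), vdup_n_u8 (0));
}

/* Only the narrowing shift should survive.  */
/* { dg-final { scan-assembler-times {\tshrn\t} 1 } } */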

Ok for master?

Thanks,
Jonathan

---

gcc/testsuite/ChangeLog:

2021-06-15  Jonathan Wright  

* gcc.target/aarch64/narrow_zero_high_half.c: New test.


rb14569.patch
Description: rb14569.patch


[PATCH] aarch64: Model zero-high-half semantics of ADDHN/SUBHN instructions

2021-06-15 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch models the zero-high-half semantics of the
narrowing arithmetic Neon instructions in the
aarch64_hn RTL pattern. Modeling these
semantics allows for better RTL combinations while also removing
some register allocation issues as the compiler now knows that the
operation is totally destructive.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-06-14  Jonathan Wright  

* config/aarch64/aarch64-simd.md (aarch64_hn):
Change to an expander that emits the correct instruction
depending on endianness.
(aarch64_hn_insn_le): Define.
(aarch64_hn_insn_be): Define.


rb14566.patch
Description: rb14566.patch


[PATCH] aarch64: Model zero-high-half semantics of [SU]QXTN instructions

2021-06-15 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch first splits the aarch64_qmovn
pattern into separate scalar and vector variants. It then further splits
the vector RTL pattern into big/little endian variants that model the
zero-high-half semantics of the underlying instruction. Modeling
these semantics allows for better RTL combinations while also
removing some register allocation issues as the compiler now knows
that the operation is totally destructive.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-06-14  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Split generator
for aarch64_qmovn builtins into scalar and vector
variants.
* config/aarch64/aarch64-simd.md (aarch64_qmovn_insn_le):
Define.
(aarch64_qmovn_insn_be): Define.
(aarch64_qmovn): Split into scalar and vector
variants. Change vector variant to an expander that emits the
correct instruction depending on endianness.


rb14565.patch
Description: rb14565.patch


[PATCH] aarch64: Model zero-high-half semantics of SQXTUN instruction in RTL

2021-06-15 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch first splits the aarch64_sqmovun pattern
into separate scalar and vector variants. It then further splits the vector
pattern into big/little endian variants that model the zero-high-half
semantics of the underlying instruction. Modeling these semantics
allows for better RTL combinations while also removing some register
allocation issues as the compiler now knows that the operation is
totally destructive.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-06-14  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Split generator
for aarch64_sqmovun builtins into scalar and vector variants.
* config/aarch64/aarch64-simd.md (aarch64_sqmovun):
Split into scalar and vector variants. Change vector variant
to an expander that emits the correct instruction depending
on endianness.
(aarch64_sqmovun_insn_le): Define.
(aarch64_sqmovun_insn_be): Define.


rb14564.patch
Description: rb14564.patch


[PATCH] aarch64: Model zero-high-half semantics of XTN instruction in RTL

2021-06-15 Thread Jonathan Wright via Gcc-patches
Hi,

Modeling the zero-high-half semantics of the XTN narrowing
instruction in RTL indicates to the compiler that this is a totally
destructive operation. This enables more RTL simplifications and also
prevents some register allocation issues.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-06-11  Jonathan Wright  

* config/aarch64/aarch64-simd.md (aarch64_xtn_insn_le):
Define - modeling zero-high-half semantics.
(aarch64_xtn): Change to an expander that emits the
appropriate instruction depending on endianness.
(aarch64_xtn_insn_be): Define - modeling zero-high-half
semantics.
(aarch64_xtn2_le): Rename to...
(aarch64_xtn2_insn_le): This.
(aarch64_xtn2_be): Rename to...
(aarch64_xtn2_insn_be): This.
(vec_pack_trunc_): Emit truncation instruction instead
of aarch64_xtn.
* config/aarch64/iterators.md (Vnarrowd): Add Vnarrowd mode
attribute iterator.


rb14563.patch
Description: rb14563.patch


[PATCH] aarch64: Use correct type attributes for RTL generating XTN(2)

2021-05-19 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch corrects the type attribute in RTL patterns that
generate XTN/XTN2 instructions to be "neon_move_narrow_q".

This makes a material difference because these instructions can be
executed on both SIMD pipes in the Cortex-A57 core model, whereas the
"neon_shift_imm_narrow_q" attribute (in use until now) would suggest
to the scheduler that they could only execute on one of the two
pipes.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-05-18  Jonathan Wright  

* config/aarch64/aarch64-simd.md: Use "neon_move_narrow_q"
type attribute in patterns generating XTN(2).


rb14492.patch
Description: rb14492.patch


[PATCH] aarch64: Use an expander for quad-word vec_pack_trunc pattern

2021-05-19 Thread Jonathan Wright via Gcc-patches
Hi,

The existing vec_pack_trunc RTL pattern emits an opaque two-
instruction assembly code sequence that prevents proper instruction
scheduling. This commit changes the pattern to an expander that emits
individual xtn and xtn2 instructions.

This commit also consolidates the duplicate truncation patterns.
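
As a hypothetical example (mine, not part of the patch), a simple narrowing
loop like the one below is typically vectorized through vec_pack_trunc; with
the expander, the xtn and xtn2 instructions become separate insns that the
scheduler can move independently:

void
narrow (short *restrict dst, const int *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = (short) src[i];
}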

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-05-17  Jonathan Wright  

* config/aarch64/aarch64-simd.md (aarch64_simd_vec_pack_trunc_):
Remove as duplicate of...
(aarch64_xtn): This.
(aarch64_xtn2_le): Move position in file.
(aarch64_xtn2_be): Move position in file.
(aarch64_xtn2): Move position in file.
(vec_pack_trunc_): Define as an expander.


rb14480.patch
Description: rb14480.patch


[PATCH 5/5] testsuite: aarch64: Add tests for high-half narrowing instructions

2021-05-18 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch adds tests to confirm that a *2 (write to high-half)
Neon instruction is generated from vcombine* of a narrowing intrinsic
sequence.
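
As an illustration (my own example, not the test file itself), a pair of
narrowing operations fed into vcombine should now assemble to the narrowing
instruction followed by its write-to-high-half *2 form, with no extra moves:

#include <arm_neon.h>

int16x8_t
addhn_pair (int32x4_t a, int32x4_t b, int32x4_t c, int32x4_t d)
{
  /* Expected: addhn followed by addhn2.  */
  return vcombine_s16 (vaddhn_s32 (a, b), vaddhn_s32 (c, d));
}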

Ok for master?

Thanks,
Jonathan

---

gcc/testsuite/ChangeLog:

2021-05-14  Jonathan Wright  

* gcc.target/aarch64/narrow_high_combine.c: New test.

rb14483.patch
Description: rb14483.patch


[PATCH 4/5] aarch64: Refactor aarch64_qshrn_n RTL pattern

2021-05-18 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch splits the aarch64_qshrn_n
pattern into separate scalar and vector variants. It further splits the vector
pattern into big/little endian variants that model the zero-high-half
semantics of the underlying instruction - allowing for more combinations
with the write-to-high-half variant
(aarch64_qshrn2_n.) This improvement will be
confirmed by a new test in gcc.target/aarch64/narrow_high_combine.c 
(patch 5/5 in this series.)
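
For example (an illustrative case of my own, not the actual test), combine
should now be able to turn the following into sqshrn followed by sqshrn2:

#include <arm_neon.h>

int16x8_t
qshrn_pair (int32x4_t a, int32x4_t b)
{
  return vcombine_s16 (vqshrn_n_s32 (a, 3), vqshrn_n_s32 (b, 3));
}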

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-05-14  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Split builtin
generation for aarch64_qshrn_n pattern into
separate scalar and vector generators.
* config/aarch64/aarch64-simd.md
(aarch64_qshrn_n): Define as an expander and
split into...
(aarch64_qshrn_n_insn_le): This and...
(aarch64_qshrn_n_insn_be): This.
* config/aarch64/iterators.md: Define SD_HSDI iterator.


rb14490.patch
Description: rb14490.patch


[PATCH 3/5] aarch64: Relax aarch64_sqxtun2 RTL pattern

2021-05-18 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch uses UNSPEC_SQXTUN instead of UNSPEC_SQXTUN2
in the aarch64_sqxtun2 patterns. This allows for more
aggressive combinations and ultimately better code generation - which will
be confirmed by a new set of tests in
gcc.target/aarch64/narrow_high_combine.c (patch 5/5 in this series.)

The now redundant UNSPEC_SQXTUN2 is removed.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-05-14  Jonathan Wright  

* config/aarch64/aarch64-simd.md: Use UNSPEC_SQXTUN instead
of UNSPEC_SQXTUN2.
* config/aarch64/iterators.md: Remove UNSPEC_SQXTUN2.


rb14481.patch
Description: rb14481.patch


[PATCH 2/5] aarch64: Relax aarch64_qshrn2_n RTL pattern

2021-05-18 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch implements saturating right-shift and narrow high
Neon intrinsic RTL patterns using a vec_concat of a register_operand
and a VQSHRN_N unspec - instead of just a VQSHRN_N unspec. This
more relaxed pattern allows for more aggressive combinations and
ultimately better code generation - which will be confirmed by a new
set of tests in gcc.target/aarch64/narrow_high_combine.c (patch 5/5 in
this series.)

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-03-04  Jonathan Wright  

* config/aarch64/aarch64-simd.md (aarch64_qshrn2_n):
Implement as an expand emitting a big/little endian
instruction pattern.
(aarch64_qshrn2_n_insn_le): Define.
(aarch64_qshrn2_n_insn_be): Define.


rb14251.patch
Description: rb14251.patch


[PATCH 1/5] aarch64: Relax aarch64_hn2 RTL pattern

2021-05-18 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch implements v[r]addhn2 and v[r]subhn2 Neon intrinsic
RTL patterns using a vec_concat of a register_operand and an ADDSUBHN
unspec - instead of just an ADDSUBHN2 unspec. This more relaxed pattern
allows for more aggressive combinations and ultimately better code
generation - which will be confirmed by a new set of tests in
gcc.target/aarch64/narrow_high_combine.c (patch 5/5 in this series).

This patch also removes the now redundant [R]ADDHN2 and [R]SUBHN2
unspecs and their iterator.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-03-03  Jonathan Wright  

* config/aarch64/aarch64-simd.md (aarch64_hn2):
Implement as an expand emitting a big/little endian
instruction pattern.
(aarch64_hn2_insn_le): Define.
(aarch64_hn2_insn_be): Define.
* config/aarch64/iterators.md: Remove UNSPEC_[R]ADDHN2 and
UNSPEC_[R]SUBHN2 unspecs and ADDSUBHN2 iterator.


rb14250.patch
Description: rb14250.patch


Re: [PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq intrinsics

2021-05-04 Thread Jonathan Wright via Gcc-patches
Hi Richard,

I think you may be referencing an older checkout as we refactored this
pattern in a previous change to:

(define_insn "mul_lane3"
 [(set (match_operand:VMUL 0 "register_operand" "=w")
   (mult:VMUL
   (vec_duplicate:VMUL
 (vec_select:
   (match_operand:VMUL 2 "register_operand" "")
   (parallel [(match_operand:SI 3 "immediate_operand" "i")])))
   (match_operand:VMUL 1 "register_operand" "w")))]
  "TARGET_SIMD"
  {
operands[3] = aarch64_endian_lane_rtx (mode, INTVAL (operands[3]));
return "mul\\t%0., %1., %2.[%3]";
  }
  [(set_attr "type" "neon_mul__scalar")]
)

which doesn't help us with the 'laneq' intrinsics as the machine mode for
operands 0 and 1 (of the laneq intrinsics) is narrower than the machine
mode for operand 2.

Thanks,
Jonathan
​

From: Richard Sandiford 
Sent: 30 April 2021 19:18
To: Jonathan Wright 
Cc: gcc-patches@gcc.gnu.org 
Subject: Re: [PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq 
intrinsics

Richard Sandiford via Gcc-patches  writes:
> Jonathan Wright  writes:
>> diff --git a/gcc/config/aarch64/aarch64-simd.md 
>> b/gcc/config/aarch64/aarch64-simd.md
>> index 
>> bdee49f74f4725409d33af733bb55be290b3f0e7..234762960bd6df057394f753072ef65a6628a43d
>>  100644
>> --- a/gcc/config/aarch64/aarch64-simd.md
>> +++ b/gcc/config/aarch64/aarch64-simd.md
>> @@ -734,6 +734,22 @@
>>[(set_attr "type" "neon_mul__scalar")]
>>  )
>>
>> +(define_insn "mul_laneq3"
>> +  [(set (match_operand:VDQSF 0 "register_operand" "=w")
>> +(mult:VDQSF
>> +  (vec_duplicate:VDQSF
>> +(vec_select:
>> +  (match_operand:V4SF 2 "register_operand" "w")
>> +  (parallel [(match_operand:SI 3 "immediate_operand" "i")])))
>> +  (match_operand:VDQSF 1 "register_operand" "w")))]
>> +  "TARGET_SIMD"
>> +  {
>> +operands[3] = aarch64_endian_lane_rtx (V4SFmode, INTVAL (operands[3]));
>> +return "fmul\\t%0., %1., %2.[%3]";
>> +  }
>> +  [(set_attr "type" "neon_fp_mul_s_scalar")]
>> +)
>> +

Oops, sorry, I just realised that this pattern does already exist as:

(define_insn "*aarch64_mul3_elt"
 [(set (match_operand:VMUL 0 "register_operand" "=w")
(mult:VMUL
  (vec_duplicate:VMUL
  (vec_select:
(match_operand:VMUL 1 "register_operand" "")
(parallel [(match_operand:SI 2 "immediate_operand")])))
  (match_operand:VMUL 3 "register_operand" "w")))]
  "TARGET_SIMD"
  {
operands[2] = aarch64_endian_lane_rtx (mode, INTVAL (operands[2]));
return "mul\\t%0., %3., %1.[%2]";
  }
  [(set_attr "type" "neon_mul__scalar")]
)

Thanks,
Richard


Re: [PATCH 14/20] testsuite: aarch64: Add fusion tests for FP vml[as] intrinsics

2021-04-30 Thread Jonathan Wright via Gcc-patches
Updated the patch to implement suggestions - restricting these tests to run on
only aarch64 targets.

Tested and all new tests pass on aarch64-none-linux-gnu.

Ok for master?

Thanks,
Jonathan

From: Richard Sandiford 
Sent: 28 April 2021 16:46
To: Jonathan Wright via Gcc-patches 
Cc: Jonathan Wright 
Subject: Re: [PATCH 14/20] testsuite: aarch64: Add fusion tests for FP vml[as] 
intrinsics

Jonathan Wright via Gcc-patches  writes:
> Hi,
>
> As subject, this patch adds compilation tests to make sure that the output
> of vmla/vmls floating-point Neon intrinsics (fmul, fadd/fsub) is not fused
> into fmla/fmls instructions.
>
> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/testsuite/ChangeLog:
>
> 2021-02-16  Jonathan Wright  
>
>* gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused.c:
>New test.
>* gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused_A64.c:
>New test.
>* gcc.target/aarch64/advsimd-intrinsics/vmls_float_not_fused.c:
>New test.
>* gcc.target/aarch64/advsimd-intrinsics/vmls_float_not_fused_A64.c:
>New test.
>
> diff --git 
> a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused.c 
> b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused.c
> new file mode 100644
> index 
> ..402c4ef414558767c7d7ddc21817093a80d2a06d
> --- /dev/null
> +++ 
> b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused.c
> @@ -0,0 +1,42 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3" } */

Could you test this on an arm*-*-* target too?  I'd expect the
dg-finals to fail there, since the syntax is vmul.f32 etc. instead.
Alternatively, we could just skip this for arm*-*-*, like you do
with the by-lane tests.

> +
> +
> +#include 
> +
> +float32x2_t foo_f32 (float32x2_t a, float32x2_t b, float32x2_t c)
> +{
> +  return vmla_f32 (a, b, c);
> +}
> +
> +float32x4_t fooq_f32 (float32x4_t a, float32x4_t b, float32x4_t c)
> +{
> +  return vmlaq_f32 (a, b, c);
> +}
> +
> +float32x2_t foo_n_f32 (float32x2_t a, float32x2_t b, float32_t c)
> +{
> +  return vmla_n_f32 (a, b, c);
> +}
> +
> +float32x4_t fooq_n_f32 (float32x4_t a, float32x4_t b, float32_t c)
> +{
> +  return vmlaq_n_f32 (a, b, c);
> +}
> +
> +float32x2_t foo_lane_f32 (float32x2_t a,
> +   float32x2_t b,
> +   float32x2_t v)
> +{
> +  return vmla_lane_f32 (a, b, v, 0);
> +}
> +
> +float32x4_t fooq_lane_f32 (float32x4_t a,
> +float32x4_t b,
> +float32x2_t v)
> +{
> +  return vmlaq_lane_f32 (a, b, v, 0);
> +}
> +
> +/* { dg-final { scan-assembler-times {fmul} 6} }  */
> +/* { dg-final { scan-assembler-times {fadd} 6} }  */

It'd be safer to match {\tfmul\t} etc. instead.  Matching bare words
runs the risk of picking up things like directory names that happen
to contain “fmul” as a substring.

Thanks,
Richard

> diff --git 
> a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused_A64.c
>  
> b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused_A64.c
> new file mode 100644
> index 
> ..08a9590e2572fa78c8360f09c8353a0d23678ec1
> --- /dev/null
> +++ 
> b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused_A64.c
> @@ -0,0 +1,33 @@
> +/* { dg-skip-if "" { arm*-*-* } } */
> +/* { dg-do compile } */
> +/* { dg-options "-O3" } */
> +
> +
> +#include 
> +
> +float64x1_t foo_f64 (float64x1_t a, float64x1_t b, float64x1_t c)
> +{
> +  return vmla_f64 (a, b, c);
> +}
> +
> +float64x2_t fooq_f64 (float64x2_t a, float64x2_t b, float64x2_t c)
> +{
> +  return vmlaq_f64 (a, b, c);
> +}
> +
> +float32x2_t foo_laneq_f32 (float32x2_t a,
> +float32x2_t b,
> +float32x4_t v)
> +{
> +  return vmla_laneq_f32 (a, b, v, 0);
> +}
> +
> +float32x4_t fooq_laneq_f32 (float32x4_t a,
> + float32x4_t b,
> + float32x4_t v)
> +{
> +  return vmlaq_laneq_f32 (a, b, v, 0);
> +}
> +
> +/* { dg-final { scan-assembler-times {fmul} 4} }  */
> +/* { dg-final { scan-assembler-times {fadd} 4} }  */
> diff --git 
> a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vmls_float_not_fused.c 
> b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vmls_float_not_fused.c
> new file mode 100644
> index 
> ..0846b7cf5d2c332175235c15bbe534b2558960ef
> --- 

Re: [PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq intrinsics

2021-04-30 Thread Jonathan Wright via Gcc-patches
Updated the patch to be more consistent with the others in the series.

Tested and bootstrapped on aarch64-none-linux-gnu - no issues.

Ok for master?

Thanks,
Jonathan

From: Gcc-patches  on behalf of Jonathan 
Wright via Gcc-patches 
Sent: 28 April 2021 15:42
To: gcc-patches@gcc.gnu.org 
Subject: [PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq 
intrinsics

Hi,

As subject, this patch rewrites the floating-point vml[as][q]_laneq Neon
intrinsics to use RTL builtins rather than relying on the GCC vector
extensions. Using RTL builtins allows control over the emission of
fmla/fmls instructions (which we don't want here.)

With this commit, the code generated by these intrinsics changes from
a fused multiply-add/subtract instruction to an fmul followed by an
fadd/fsub instruction. If the programmer really wants fmla/fmls
instructions, they can use the vfm[as] intrinsics.
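
To make the intent concrete, here is a hypothetical example (not part of the
patch) contrasting the two intrinsic families:

#include <arm_neon.h>

float32x4_t
mla_unfused (float32x4_t acc, float32x4_t a, float32x4_t v)
{
  return vmlaq_laneq_f32 (acc, a, v, 0);   /* fmul + fadd */
}

float32x4_t
mla_fused (float32x4_t acc, float32x4_t a, float32x4_t v)
{
  return vfmaq_laneq_f32 (acc, a, v, 0);   /* fmla */
}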

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-02-17  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Add
float_ml[as][q]_laneq builtin generator macros.
* config/aarch64/aarch64-simd.md (mul_laneq3): Define.
(aarch64_float_mla_laneq): Define.
(aarch64_float_mls_laneq): Define.
* config/aarch64/arm_neon.h (vmla_laneq_f32): Use RTL builtin
instead of GCC vector extensions.
(vmlaq_laneq_f32): Likewise.
(vmls_laneq_f32): Likewise.
(vmlsq_laneq_f32): Likewise.


rb14213.patch
Description: rb14213.patch


Re: [PATCH 12/20] aarch64: Use RTL builtins for FP ml[as][q]_lane intrinsics

2021-04-30 Thread Jonathan Wright via Gcc-patches
Patch updated as per suggestion (similar to patch 10/20.)

Tested and bootstrapped on aarch64-none-linux-gnu - no issues.

Ok for master?

Thanks,
Jonathan

From: Richard Sandiford 
Sent: 28 April 2021 16:37
To: Jonathan Wright via Gcc-patches 
Cc: Jonathan Wright 
Subject: Re: [PATCH 12/20] aarch64: Use RTL builtins for FP ml[as][q]_lane 
intrinsics

Jonathan Wright via Gcc-patches  writes:
> Hi,
>
> As subject, this patch rewrites the floating-point vml[as][q]_lane Neon
> intrinsics to use RTL builtins rather than relying on the GCC vector
> extensions. Using RTL builtins allows control over the emission of
> fmla/fmls instructions (which we don't want here.)
>
> With this commit, the code generated by these intrinsics changes from
> a fused multiply-add/subtract instruction to an fmul followed by an
> fadd/fsub instruction. If the programmer really wants fmla/fmls
> instructions, they can use the vfm[as] intrinsics.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
> issues.
>
> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/ChangeLog:
>
> 2021-02-16  Jonathan Wright  
>
>* config/aarch64/aarch64-simd-builtins.def: Add
>float_ml[as]_lane builtin generator macros.
>* config/aarch64/aarch64-simd.md (mul_lane3): Define.
>(aarch64_float_mla_lane): Define.
>(aarch64_float_mls_lane): Define.
>* config/aarch64/arm_neon.h (vmla_lane_f32): Use RTL builtin
>instead of GCC vector extensions.
>(vmlaq_lane_f32): Likewise.
>(vmls_lane_f32): Likewise.
>(vmlsq_lane_f32): Likewise.
>
> diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def 
> b/gcc/config/aarch64/aarch64-simd-builtins.def
> index 
> 55a5682baeb13041053ef9e6eaa831182ea8b10c..b702493e1351478272bb7d26991a5673943d61ec
>  100644
> --- a/gcc/config/aarch64/aarch64-simd-builtins.def
> +++ b/gcc/config/aarch64/aarch64-simd-builtins.def
> @@ -668,6 +668,8 @@
>BUILTIN_VDQF_DF (TERNOP, float_mls, 0, FP)
>BUILTIN_VDQSF (TERNOP, float_mla_n, 0, FP)
>BUILTIN_VDQSF (TERNOP, float_mls_n, 0, FP)
> +  BUILTIN_VDQSF (QUADOP_LANE, float_mla_lane, 0, FP)
> +  BUILTIN_VDQSF (QUADOP_LANE, float_mls_lane, 0, FP)
>
>/* Implemented by aarch64_simd_bsl.  */
>BUILTIN_VDQQH (BSL_P, simd_bsl, 0, NONE)
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> 95363d7b5ad11f775aa03f24bbcb0b66d20abb7c..abc8b1708b86bcee2e5082cc4659a197c5821985
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -2625,6 +2625,22 @@
>[(set_attr "type" "neon_fp_mul_")]
>  )
>
> +(define_insn "mul_lane3"
> +  [(set (match_operand:VDQSF 0 "register_operand" "=w")
> + (mult:VDQSF
> +   (vec_duplicate:VDQSF
> + (vec_select:
> +   (match_operand:V2SF 2 "register_operand" "w")
> +   (parallel [(match_operand:SI 3 "immediate_operand" "i")])))
> +   (match_operand:VDQSF 1 "register_operand" "w")))]
> +  "TARGET_SIMD"
> +  {
> +operands[3] = aarch64_endian_lane_rtx (V2SFmode, INTVAL (operands[3]));
> +return "fmul\\t%0., %1., %2.[%3]";
> +  }
> +  [(set_attr "type" "neon_fp_mul_s_scalar")]
> +)
> +

Similarly to the 10/20 patch (IIRC), we can instead reuse:

(define_insn "*aarch64_mul3_elt"
 [(set (match_operand:VMUL 0 "register_operand" "=w")
(mult:VMUL
  (vec_duplicate:VMUL
  (vec_select:
(match_operand:VMUL 1 "register_operand" "")
(parallel [(match_operand:SI 2 "immediate_operand")])))
  (match_operand:VMUL 3 "register_operand" "w")))]
  "TARGET_SIMD"
  {
operands[2] = aarch64_endian_lane_rtx (mode, INTVAL (operands[2]));
return "mul\\t%0., %3., %1.[%2]";
  }
  [(set_attr "type" "neon_mul__scalar")]
)

Thanks,
Richard

>  (define_expand "div3"
>   [(set (match_operand:VHSDF 0 "register_operand")
> (div:VHSDF (match_operand:VHSDF 1 "register_operand")
> @@ -2728,6 +2744,46 @@
>}
>  )
>
> +(define_expand "aarch64_float_mla_lane"
> +  [(set (match_operand:VDQSF 0 "register_operand")
> + (plus:VDQSF
> +   (mult:VDQSF
> + (vec_duplicate:VDQSF
> +   (vec_select:
> + (match_operand:V2SF 3 "register_operand")
> + (parallel [(match_operand:SI 4 "immediate_operand")])))
> + (ma

Re: [PATCH 10/20] aarch64: Use RTL builtins for FP ml[as]_n intrinsics

2021-04-30 Thread Jonathan Wright via Gcc-patches
Patch updated as per your suggestion.

Tested and bootstrapped on aarch64-none-linux-gnu - no issues.

Ok for master?

Thanks,
Jonathan

From: Richard Sandiford 
Sent: 28 April 2021 16:11
To: Jonathan Wright via Gcc-patches 
Cc: Jonathan Wright 
Subject: Re: [PATCH 10/20] aarch64: Use RTL builtins for FP ml[as]_n intrinsics

Jonathan Wright via Gcc-patches  writes:
> Hi,
>
> As subject, this patch rewrites the floating-point vml[as][q]_n Neon
> intrinsics to use RTL builtins rather than inline assembly code, allowing
> for better scheduling and optimization.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
> issues.
>
> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/ChangeLog:
>
> 2021-01-18  Jonathan Wright  
>
>* config/aarch64/aarch64-simd-builtins.def: Add
>float_ml[as]_n builtin generator macros.
>* config/aarch64/aarch64-simd.md (mul_n3): Define.
>(aarch64_float_mla_n): Define.
>(aarch64_float_mls_n): Define.
>* config/aarch64/arm_neon.h (vmla_n_f32): Use RTL builtin
>instead of inline asm.
>(vmlaq_n_f32): Likewise.
>(vmls_n_f32): Likewise.
>(vmlsq_n_f32): Likewise.
>
> diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def 
> b/gcc/config/aarch64/aarch64-simd-builtins.def
> index 
> 0f44ed84ff9d08d808b1b2dfe528db5208b134f5..547509474c23daf6882ed2f8407ddb5caf1d1b91
>  100644
> --- a/gcc/config/aarch64/aarch64-simd-builtins.def
> +++ b/gcc/config/aarch64/aarch64-simd-builtins.def
> @@ -664,6 +664,9 @@
>BUILTIN_VHSDF (TERNOP, fnma, 4, FP)
>VAR1 (TERNOP, fnma, 4, FP, hf)
>
> +  BUILTIN_VDQSF (TERNOP, float_mla_n, 0, FP)
> +  BUILTIN_VDQSF (TERNOP, float_mls_n, 0, FP)
> +
>/* Implemented by aarch64_simd_bsl.  */
>BUILTIN_VDQQH (BSL_P, simd_bsl, 0, NONE)
>VAR2 (BSL_P, simd_bsl,0, NONE, di, v2di)
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> 5f701dd2775290156634ef8c6feccecd359e9ec9..d016970a2c278405b270a0ac745221e69f0f625e
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -2614,6 +2614,17 @@
>[(set_attr "type" "neon_fp_mul_")]
>  )
>
> +(define_insn "mul_n3"
> + [(set (match_operand:VHSDF 0 "register_operand" "=w")
> + (mult:VHSDF
> +   (vec_duplicate:VHSDF
> + (match_operand: 2 "register_operand" "w"))
> +   (match_operand:VHSDF 1 "register_operand" "w")))]
> + "TARGET_SIMD"
> + "fmul\\t%0., %1., %2.[0]"

This functionality should already be provided by:

(define_insn "*aarch64_mul3_elt_from_dup"
 [(set (match_operand:VMUL 0 "register_operand" "=w")
(mult:VMUL
  (vec_duplicate:VMUL
(match_operand: 1 "register_operand" ""))
  (match_operand:VMUL 2 "register_operand" "w")))]
  "TARGET_SIMD"
  "mul\t%0., %2., %1.[0]";
  [(set_attr "type" "neon_mul__scalar")]
)

so I think we should instead rename that to mul_n3 and reorder
its operands.

Thanks,
Richard

> +  [(set_attr "type" "neon_fp_mul_")]
> +)
> +
>  (define_expand "div3"
>   [(set (match_operand:VHSDF 0 "register_operand")
> (div:VHSDF (match_operand:VHSDF 1 "register_operand")
> @@ -2651,6 +2662,40 @@
>[(set_attr "type" "neon_fp_abs_")]
>  )
>
> +(define_expand "aarch64_float_mla_n"
> +  [(set (match_operand:VDQSF 0 "register_operand")
> + (plus:VDQSF
> +   (mult:VDQSF
> + (vec_duplicate:VDQSF
> +   (match_operand: 3 "register_operand"))
> + (match_operand:VDQSF 2 "register_operand"))
> +   (match_operand:VDQSF 1 "register_operand")))]
> +  "TARGET_SIMD"
> +  {
> +rtx scratch = gen_reg_rtx (mode);
> +emit_insn (gen_mul_n3 (scratch, operands[2], operands[3]));
> +emit_insn (gen_add3 (operands[0], operands[1], scratch));
> +DONE;
> +  }
> +)
> +
> +(define_expand "aarch64_float_mls_n"
> +  [(set (match_operand:VDQSF 0 "register_operand")
> + (minus:VDQSF
> +   (match_operand:VDQSF 1 "register_operand")
> +   (mult:VDQSF
> + (vec_duplicate:VDQSF
> +   (match_operand: 3 "register_operand"))
> + (match_operand:VDQSF 2 "register_operand"]
> +  "TARGET_SIMD"
> +  {
> +rtx scratch = gen_reg_rtx (mode);
> +emit_insn (gen_mul

Re: [PATCH 1/20] aarch64: Use RTL builtin for vmull[_high]_p8 intrinsics

2021-04-30 Thread Jonathan Wright via Gcc-patches
Thanks for the review, I've updated the patch as per option 1.

Tested and bootstrapped on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Jonathan

From: Richard Sandiford 
Sent: 28 April 2021 15:11
To: Jonathan Wright via Gcc-patches 
Cc: Jonathan Wright 
Subject: Re: [PATCH 1/20] aarch64: Use RTL builtin for vmull[_high]_p8 
intrinsics

Jonathan Wright via Gcc-patches  writes:
> Hi,
>
> As subject, this patch rewrites the vmull[_high]_p8 Neon intrinsics to use RTL
> builtins rather than inline assembly code, allowing for better scheduling and
> optimization.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu and
> aarch64_be-none-elf - no issues.

Thanks for doing this.  Mostly LGTM, but one comment about the patterns:

> […]
> +(define_insn "aarch64_pmull_hiv16qi_insn"
> +  [(set (match_operand:V8HI 0 "register_operand" "=w")
> + (unspec:V8HI
> +   [(vec_select:V8QI
> +  (match_operand:V16QI 1 "register_operand" "w")
> +  (match_operand:V16QI 3 "vect_par_cnst_hi_half" ""))
> +(vec_select:V8QI
> +  (match_operand:V16QI 2 "register_operand" "w")
> +  (match_dup 3))]
> +   UNSPEC_PMULL2))]
> + "TARGET_SIMD"
> + "pmull2\\t%0.8h, %1.16b, %2.16b"
> +  [(set_attr "type" "neon_mul_b_long")]
> +)

As things stands, UNSPEC_PMULL2 has the vec_select “built in”:

(define_insn "aarch64_crypto_pmullv2di"
 [(set (match_operand:TI 0 "register_operand" "=w")
   (unspec:TI [(match_operand:V2DI 1 "register_operand" "w")
   (match_operand:V2DI 2 "register_operand" "w")]
  UNSPEC_PMULL2))]
  "TARGET_SIMD && TARGET_AES"
  "pmull2\\t%0.1q, %1.2d, %2.2d"
  [(set_attr "type" "crypto_pmull")]
)

So I think it would be more consistent to do one of the following:

(1) Keep the vec_selects in the new pattern, but use UNSPEC_PMULL
for the operation instead of UNSPEC_PMULL2.
(2) Remove the vec_selects and keep the UNSPEC_PMULL2.

(1) in principle allows more combination opportunities than (2),
although I don't know how likely it is to help in practice.

Thanks,
Richard


rb14128.patch
Description: rb14128.patch


[PATCH 20/20] aarch64: Remove unspecs from [su]qmovn RTL pattern

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi,

Saturating truncation can be expressed using the RTL expressions
ss_truncate and us_truncate. This patch changes the implementation
of the vqmovn_* Neon intrinsics to use these RTL expressions rather
than a pair of unspecs. The redundant unspecs are removed along with
their code iterator.
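
A small example of the semantics being modeled (illustrative only, not a test
from the patch) - each lane is saturated to the narrower signed range before
truncation, which is exactly what ss_truncate expresses:

#include <arm_neon.h>
#include <stdio.h>

int
main (void)
{
  int32x4_t a = { 100000, -100000, 7, -7 };
  int16x4_t r = vqmovn_s32 (a);
  printf ("%d %d %d %d\n", r[0], r[1], r[2], r[3]);   /* 32767 -32768 7 -7 */
  return 0;
}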

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-04-12  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Modify comment to
make consistent with updated RTL pattern.
* config/aarch64/aarch64-simd.md (aarch64_qmovn):
Implement using ss_truncate and us_truncate rather than
unspecs.
* config/aarch64/iterators.md: Remove redundant unspecs and
iterator: UNSPEC_[SU]QXTN and SUQMOVN respectively.


rb14376.patch
Description: rb14376.patch


[PATCH 19/20] aarch64: Update attributes of arm_acle.h intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch updates the attributes of all intrinsics defined in
arm_acle.h to be consistent with the attributes of the intrinsics defined
in arm_neon.h. Specifically, this means updating the attributes from:
  __extension__ static __inline 
  __attribute__ ((__always_inline__))
to:
  __extension__ extern __inline 
  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
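
For example, a definition in the new style looks like this (the function name
and body are placeholders of mine, not taken from the header):

#include <stdint.h>

__extension__ extern __inline uint32_t
__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
__example_acle_fn (uint32_t __a)
{
  return __a + 1;
}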

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-03-18  Jonathan Wright  

* config/aarch64/arm_acle.h (__attribute__): Make intrinsic
attributes consistent with those defined in arm_neon.h.


rb14296.patch
Description: rb14296.patch


[PATCH 18/20] aarch64: Update attributes of arm_fp16.h intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch updates the attributes of all intrinsics defined in
arm_fp16.h to be consistent with the attributes of the intrinsics defined
in arm_neon.h. Specifically, this means updating the attributes from:
  __extension__ static __inline 
  __attribute__ ((__always_inline__))
to:
  __extension__ extern __inline 
  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-03-18  Jonathan Wright  

* config/aarch64/arm_fp16.h (__attribute__): Make intrinsic
attributes consistent with those defined in arm_neon.h.


rb14295.patch
Description: rb14295.patch


[PATCH 17/20] aarch64: Relax aarch64_qshrnn2_n RTL pattern

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch implements the saturating right-shift and narrow
high Neon intrinsic RTL patterns using a vec_concat of a register_operand
and a VQSHRN_N unspec - instead of just a VQSHRN2_N unspec. This
more relaxed pattern allows for more aggressive combinations and
ultimately better code generation.

Regression tested and bootstrapped on aarch64-none-linux-gnu and
aarch64_be-none-elf - no issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-03-04  Jonathan Wright  

* config/aarch64/aarch64-simd.md (aarch64_qshrn2_n):
Implement as an expand emitting a big/little endian
instruction pattern.
(aarch64_qshrn2_n_insn_le): Define.
(aarch64_qshrn2_n_insn_be): Define.
* config/aarch64/iterators.md: Add VQSHRN2_N iterator and
constituent unspecs.


rb14251.patch
Description: rb14251.patch


[PATCH 16/20] aarch64: Relax aarch64_hn2 RTL pattern

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch implements the v[r]addhn2 and v[r]subhn2 Neon
intrinsic RTL patterns using a vec_concat of a register_operand and an
ADDSUBHN unspec - instead of just an ADDSUBHN2 unspec. This more
relaxed pattern allows for more aggressive combinations and ultimately
better code generation.

Regression tested and bootstrapped on aarch64-none-linux-gnu and
aarch64_be-none-elf - no issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-03-03  Jonathan Wright  

* config/aarch64/aarch64-simd.md (aarch64_hn2):
Implement as an expand emitting a big/little endian
instruction pattern.
(aarch64_hn2_insn_le): Define.
(aarch64_hn2_insn_be): Define.


rb14250.patch
Description: rb14250.patch


[PATCH 15/20] aarch64: Use RTL builtins for vcvtx intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch rewrites the vcvtx Neon intrinsics to use RTL builtins
rather than inline assembly code, allowing for better scheduling and
optimization.
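
For reference, a minimal use of the intrinsic (illustrative, not a test from
the patch) - FCVTXN converts double to float with round-to-odd, which avoids
double-rounding problems if the result is rounded again later:

#include <arm_neon.h>

float32x2_t
cvtx (float64x2_t a)
{
  return vcvtx_f32_f64 (a);
}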

Regression tested and bootstrapped on aarch64-none-linux-gnu and
aarch64_be-none-elf - no issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-02-18  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Add
float_trunc_rodd builtin generator macros.
* config/aarch64/aarch64-simd.md (aarch64_float_trunc_rodd_df):
Define.
(aarch64_float_trunc_rodd_lo_v2sf): Define.
(aarch64_float_trunc_rodd_hi_v4sf_le): Define.
(aarch64_float_trunc_rodd_hi_v4sf_be): Define.
(aarch64_float_trunc_rodd_hi_v4sf): Define.
* config/aarch64/arm_neon.h (vcvtx_f32_f64): Use RTL builtin
instead of inline asm.
(vcvtx_high_f32_f64): Likewise.
(vcvtxd_f32_f64): Likewise.
* config/aarch64/iterators.md: Add FCVTXN unspec.


rb14222.patch
Description: rb14222.patch


[PATCH 14/20] testsuite: aarch64: Add fusion tests for FP vml[as] intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch adds compilation tests to make sure that the output
of vmla/vmls floating-point Neon intrinsics (fmul, fadd/fsub) is not fused
into fmla/fmls instructions.

Ok for master?

Thanks,
Jonathan

---

gcc/testsuite/ChangeLog:

2021-02-16  Jonathan Wright  

* gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused.c:
New test.
* gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused_A64.c:
New test.
* gcc.target/aarch64/advsimd-intrinsics/vmls_float_not_fused.c:
New test.
* gcc.target/aarch64/advsimd-intrinsics/vmls_float_not_fused_A64.c:
New test.


rb14202.patch
Description: rb14202.patch


[PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch rewrites the floating-point vml[as][q]_laneq Neon
intrinsics to use RTL builtins rather than relying on the GCC vector
extensions. Using RTL builtins allows control over the emission of
fmla/fmls instructions (which we don't want here.)

With this commit, the code generated by these intrinsics changes from
a fused multiply-add/subtract instruction to an fmul followed by an
fadd/fsub instruction. If the programmer really wants fmla/fmls
instructions, they can use the vfm[as] intrinsics.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-02-17  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Add
float_ml[as][q]_laneq builtin generator macros.
* config/aarch64/aarch64-simd.md (mul_laneq3): Define.
(aarch64_float_mla_laneq): Define.
(aarch64_float_mls_laneq): Define.
* config/aarch64/arm_neon.h (vmla_laneq_f32): Use RTL builtin
instead of GCC vector extensions.
(vmlaq_laneq_f32): Likewise.
(vmls_laneq_f32): Likewise.
(vmlsq_laneq_f32): Likewise.


rb14213.patch
Description: rb14213.patch


[PATCH 12/20] aarch64: Use RTL builtins for FP ml[as][q]_lane intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch rewrites the floating-point vml[as][q]_lane Neon
intrinsics to use RTL builtins rather than relying on the GCC vector
extensions. Using RTL builtins allows control over the emission of
fmla/fmls instructions (which we don't want here.)

With this commit, the code generated by these intrinsics changes from
a fused multiply-add/subtract instruction to an fmul followed by an
fadd/fsub instruction. If the programmer really wants fmla/fmls
instructions, they can use the vfm[as] intrinsics.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-02-16  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Add
float_ml[as]_lane builtin generator macros.
* config/aarch64/aarch64-simd.md (mul_lane3): Define.
(aarch64_float_mla_lane): Define.
(aarch64_float_mls_lane): Define.
* config/aarch64/arm_neon.h (vmla_lane_f32): Use RTL builtin
instead of GCC vector extensions.
(vmlaq_lane_f32): Likewise.
(vmls_lane_f32): Likewise.
(vmlsq_lane_f32): Likewise.


rb14212.patch
Description: rb14212.patch


[PATCH 11/20] aarch64: Use RTL builtins for FP ml[as] intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch rewrites the floating-point vml[as][q] Neon intrinsics
to use RTL builtins rather than relying on the GCC vector extensions.
Using RTL builtins allows control over the emission of fmla/fmls
instructions (which we don't want here.)

With this commit, the code generated by these intrinsics changes from
a fused multiply-add/subtract instruction to an fmul followed by an
fadd/fsub instruction. If the programmer really wants fmla/fmls
instructions, they can use the vfm[as] intrinsics.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-02-16  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Add float_ml[as]
builtin generator macros.
* config/aarch64/aarch64-simd.md (aarch64_float_mla):
Define.
(aarch64_float_mls): Define.
* config/aarch64/arm_neon.h (vmla_f32): Use RTL builtin
instead of relying on GCC vector extensions.
(vmla_f64): Likewise.
(vmlaq_f32): Likewise.
(vmlaq_f64): Likewise.
(vmls_f32): Likewise.
(vmls_f64): Likewise.
(vmlsq_f32): Likewise.
(vmlsq_f64): Likewise.
* config/aarch64/iterators.md: Define VDQF_DF mode iterator.


rb14211.patch
Description: rb14211.patch


[PATCH 10/20] aarch64: Use RTL builtins for FP ml[as]_n intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch rewrites the floating-point vml[as][q]_n Neon
intrinsics to use RTL builtins rather than inline assembly code, allowing
for better scheduling and optimization.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-01-18  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Add
float_ml[as]_n builtin generator macros.
* config/aarch64/aarch64-simd.md (mul_n3): Define.
(aarch64_float_mla_n): Define.
(aarch64_float_mls_n): Define.
* config/aarch64/arm_neon.h (vmla_n_f32): Use RTL builtin
instead of inline asm.
(vmlaq_n_f32): Likewise.
(vmls_n_f32): Likewise.
(vmlsq_n_f32): Likewise.


rb14042.patch
Description: rb14042.patch


[PATCH 9/20] aarch64: Use RTL builtins for v[q]tbx intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch rewrites the v[q]tbx Neon intrinsics to use RTL
builtins rather than inline assembly code, allowing for better scheduling
and optimization.
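
As a quick reminder of the semantics (illustrative example of mine, not a test
from the patch): each byte of the index vector selects a byte of the table,
and for the tbx form an out-of-range index leaves the corresponding byte of
the destination operand unchanged (tbl would return zero instead):

#include <arm_neon.h>

uint8x16_t
lookup (uint8x16_t fallback, uint8x16_t tab, uint8x16_t idx)
{
  return vqtbx1q_u8 (fallback, tab, idx);
}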

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-02-12  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Add tbx1 builtin
generator macros.
* config/aarch64/aarch64-simd.md (aarch64_tbx1):
Define.
* config/aarch64/arm_neon.h (vqtbx1_s8): USE RTL builtin
instead of inline asm.
(vqtbx1_u8): Likewise.
(vqtbx1_p8): Likewise.
(vqtbx1q_s8): Likewise.
(vqtbx1q_u8): Likewise.
(vqtbx1q_p8): Likewise.
(vtbx2_s8): Likewise.
(vtbx2_u8): Likewise.
(vtbx2_p8): Likewise.


rb14188.patch
Description: rb14188.patch


[PATCH 8/20] aarch64: Use RTL builtins for v[q]tbl intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch rewrites the v[q]tbl Neon intrinsics to use RTL
builtins rather than inline assembly code, allowing for better scheduling
and optimization.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-02-12  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Add tbl1 builtin
generator macros.
* config/aarch64/arm_neon.h (vqtbl1_p8): Use RTL builtin
instead of inline asm.
(vqtbl1_s8): Likewise.
(vqtbl1_u8): Likewise.
(vqtbl1q_p8): Likewise.
(vqtbl1q_s8): Likewise.
(vqtbl1q_u8): Likewise.
(vtbl1_s8): Likewise.
(vtbl1_u8): Likewise.
(vtbl1_p8): Likewise.
(vtbl2_s8): Likewise.
(vtbl2_u8): Likewise.
(vtbl2_p8): Likewise.

rb14154.patch
Description: rb14154.patch


[PATCH 7/20] aarch64: Use RTL builtins for polynomial vsri[q]_n intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch rewrites the vsri[q]_n_p* Neon intrinsics to use RTL
builtins rather than inline assembly code, allowing for better scheduling
and optimization.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-02-10  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Add polynomial
ssri_n builtin generator macro.
* config/aarch64/arm_neon.h (vsri_n_p8): Use RTL builtin
instead of inline asm.
(vsri_n_p16): Likewise.
(vsri_n_p64): Likewise.
(vsriq_n_p8): Likewise.
(vsriq_n_p16): Likewise.
(vsriq_n_p64): Likewise.


rb14147.patch
Description: rb14147.patch

