from:"Liu, Hongtao via Gcc\-patches"

RE: [PATCH] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions.

2023-08-09 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Xi Ruoyao 
> Sent: Thursday, August 10, 2023 9:48 AM
> To: Liu, Hongtao ; gcc-patches@gcc.gnu.org
> Cc: richard.guent...@gmail.com; ubiz...@gmail.com; hubi...@ucw.cz
> Subject: Re: [PATCH] Support -m[no-]gather -m[no-]scatter to enable/disable
> vectorization for all gather/scatter instructions.
> 
> On Thu, 2023-08-10 at 09:11 +0800, liuhongt via Gcc-patches wrote:
> > Currently we have 3 different independent tunes for gather
> > "use_gather,use_gather_2parts,use_gather_4parts",
> > similar for scatter, there're
> > "use_scatter,use_scatter_2parts,use_scatter_4parts"
> >
> > The patch support 2 standardizing options to enable/disable
> > vectorization for all gather/scatter instructions. The options is
> > interpreted by driver to 3 tunes.
> >
> > bootstrapped and regtested on x86_64-pc-linux-gnu.
> > Ok for trunk?
> 
> And should we set -mno-gather as the default for GDS affected processors?
> We'll likely apply the ucode update for them, and then the gathering
> instructions will be much slower.
Assume you're talking about 
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/gather-data-sampling.html
Yes, there will be an separate patch for microarchitecture tuning.
> 
> > gcc/ChangeLog:
> >
> > * config/i386/i386.h (DRIVER_SELF_SPECS): Add
> > GATHER_SCATTER_DRIVER_SELF_SPECS.
> > (GATHER_SCATTER_DRIVER_SELF_SPECS): New macro.
> > * config/i386/i386.opt (mgather): New option.
> > (mscatter): Ditto.
> > ---
> >  gcc/config/i386/i386.h   | 12 +++-
> >  gcc/config/i386/i386.opt |  8 
> >  2 files changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h index
> > ef342fcee9b..d9ac2c29bde 100644
> > --- a/gcc/config/i386/i386.h
> > +++ b/gcc/config/i386/i386.h
> > @@ -565,7 +565,17 @@ extern GTY(()) tree x86_mfence;
> >  # define SUBTARGET_DRIVER_SELF_SPECS ""
> >  #endif
> >
> > -#define DRIVER_SELF_SPECS SUBTARGET_DRIVER_SELF_SPECS
> > +#ifndef GATHER_SCATTER_DRIVER_SELF_SPECS # define
> > +GATHER_SCATTER_DRIVER_SELF_SPECS \
> > +  "%{mno-gather:-mtune-
> > ctrl=^use_gather_2parts,^use_gather_4parts,^use_gather} \
> > +   %{mgather:-mtune-
> > ctrl=use_gather_2parts,use_gather_4parts,use_gather} \
> > +   %{mno-scatter:-mtune-
> > ctrl=^use_scatter_2parts,^use_scatter_4parts,^use_scatter} \
> > +   %{mscatter:-mtune-
> > ctrl=use_scatter_2parts,use_scatter_4parts,use_scatter}"
> > +#endif
> > +
> > +#define DRIVER_SELF_SPECS \
> > +  SUBTARGET_DRIVER_SELF_SPECS " " \
> > +  GATHER_SCATTER_DRIVER_SELF_SPECS
> >
> >  /* -march=native handling only makes sense with compiler running on
> >     an x86 or x86_64 chip.  If changing this condition, also change
> > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt index
> > ddb7f110aa2..99948644a8d 100644
> > --- a/gcc/config/i386/i386.opt
> > +++ b/gcc/config/i386/i386.opt
> > @@ -424,6 +424,14 @@ mdaz-ftz
> >  Target
> >  Set the FTZ and DAZ Flags.
> >
> > +mgather
> > +Target
> > +Enable vectorization for gather instruction.
> > +
> > +mscatter
> > +Target
> > +Enable vectorization for scatter instruction.
> > +
> >  mpreferred-stack-boundary=
> >  Target RejectNegative Joined UInteger
> > Var(ix86_preferred_stack_boundary_arg)
> >  Attempt to keep stack aligned to this power of 2.
> 
> --
> Xi Ruoyao 
> School of Aerospace Science and Technology, Xidian University

RE: [PATCH V2] [X86] Workaround possible CPUID bug in Sandy Bridge.

2023-08-09 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Uros Bizjak 
> Sent: Wednesday, August 9, 2023 2:33 PM
> To: Liu, Hongtao 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH V2] [X86] Workaround possible CPUID bug in Sandy
> Bridge.
> 
> On Wed, Aug 9, 2023 at 3:48 AM liuhongt  wrote:
> >
> > > Please rather do it in a more self-descriptive way, as proposed in
> > > the attached patch. You won't need a comment then.
> > >
> >
> > Adjusted in V2 patch.
> >
> > Don't access leaf 7 subleaf 1 unless subleaf 0 says it is supported
> > via EAX.
> >
> > Intel documentation says invalid subleaves return 0. We had been
> > relying on that behavior instead of checking the max sublef number.
> >
> > It appears that some Sandy Bridge CPUs return at least the subleaf 0
> > EDX value for subleaf 1. Best guess is that this is a bug in a
> > microcode patch since all of the bits we're seeing set in EDX were
> > introduced after Sandy Bridge was originally released.
> >
> > This is causing avxvnniint16 to be incorrectly enabled with
> > -march=native on these CPUs.
> >
> > gcc/ChangeLog:
> >
> > * common/config/i386/cpuinfo.h (get_available_features): Check
> > EAX for valid subleaf before use CPUID.
> > ---
> >  gcc/common/config/i386/cpuinfo.h | 82
> > +---
> >  1 file changed, 43 insertions(+), 39 deletions(-)
> >
> > diff --git a/gcc/common/config/i386/cpuinfo.h
> > b/gcc/common/config/i386/cpuinfo.h
> > index 30ef0d334ca..9fa4dec2a7e 100644
> > --- a/gcc/common/config/i386/cpuinfo.h
> > +++ b/gcc/common/config/i386/cpuinfo.h
> > @@ -663,6 +663,7 @@ get_available_features (struct __processor_model
> *cpu_model,
> >unsigned int max_cpuid_level = cpu_model2->__cpu_max_level;
> >unsigned int eax, ebx;
> >unsigned int ext_level;
> > +  unsigned int subleaf_level;
> 
> Oh, I failed this in my previous review. This variable should be named
> max_subleaf_level, as it represents the maximum supported ECX value.
I've committed previous patch ,but not backport yet.
Guess I can just commit another patch to change the name?
For backport, I'll merge the change together with just 1 commit.
> 
> Uros.
> 
> >
> >/* Get XCR_XFEATURE_ENABLED_MASK register with xgetbv.  */
> >  #define XCR_XFEATURE_ENABLED_MASK  0x0
> > @@ -762,7 +763,7 @@ get_available_features (struct __processor_model
> *cpu_model,
> >/* Get Advanced Features at level 7 (eax = 7, ecx = 0/1). */
> >if (max_cpuid_level >= 7)
> >  {
> > -  __cpuid_count (7, 0, eax, ebx, ecx, edx);
> > +  __cpuid_count (7, 0, subleaf_level, ebx, ecx, edx);
> >if (ebx & bit_BMI)
> > set_feature (FEATURE_BMI);
> >if (ebx & bit_SGX)
> > @@ -874,45 +875,48 @@ get_available_features (struct
> __processor_model *cpu_model,
> > set_feature (FEATURE_AVX512FP16);
> > }
> >
> > -  __cpuid_count (7, 1, eax, ebx, ecx, edx);
> > -  if (eax & bit_HRESET)
> > -   set_feature (FEATURE_HRESET);
> > -  if (eax & bit_CMPCCXADD)
> > -   set_feature(FEATURE_CMPCCXADD);
> > -  if (edx & bit_PREFETCHI)
> > -   set_feature (FEATURE_PREFETCHI);
> > -  if (eax & bit_RAOINT)
> > -   set_feature (FEATURE_RAOINT);
> > -  if (avx_usable)
> > -   {
> > - if (eax & bit_AVXVNNI)
> > -   set_feature (FEATURE_AVXVNNI);
> > - if (eax & bit_AVXIFMA)
> > -   set_feature (FEATURE_AVXIFMA);
> > - if (edx & bit_AVXVNNIINT8)
> > -   set_feature (FEATURE_AVXVNNIINT8);
> > - if (edx & bit_AVXNECONVERT)
> > -   set_feature (FEATURE_AVXNECONVERT);
> > - if (edx & bit_AVXVNNIINT16)
> > -   set_feature (FEATURE_AVXVNNIINT16);
> > - if (eax & bit_SM3)
> > -   set_feature (FEATURE_SM3);
> > - if (eax & bit_SHA512)
> > -   set_feature (FEATURE_SHA512);
> > - if (eax & bit_SM4)
> > -   set_feature (FEATURE_SM4);
> > -   }
> > -  if (avx512_usable)
> > -   {
> > - if (eax & bit_AVX512BF16)
> > -   set_feature (FEATURE_AVX512BF16);
> > -   }
> > -  if (amx_usable)
> > +  if (subleaf_level >= 1)
> > {
> > - if (eax & bit_AMX_FP16)
> > -   set_feature (FEATURE_AMX_FP16);
> > - if (edx & bit_AMX_COMPLEX)
> > -   set_feature (FEATURE_AMX_COMPLEX);
> > + __cpuid_count (7, 1, eax, ebx, ecx, edx);
> > + if (eax & bit_HRESET)
> > +   set_feature (FEATURE_HRESET);
> > + if (eax & bit_CMPCCXADD)
> > +   set_feature(FEATURE_CMPCCXADD);
> > + if (edx & bit_PREFETCHI)
> > +   set_feature (FEATURE_PREFETCHI);
> > + if (eax & bit_RAOINT)
> > +   set_feature (FEATURE_RAOINT);
> > + if (avx_usable)
> > +   {
> > + if (eax & bit_AVXVNNI)
> > +   set_feature (FEATURE_AVXVNNI);
> > + if (eax & bit_AVXIFMA)
> > +   set_feature (FEATURE_AVXIFMA);
> > +

RE: [PATCH] x86: fold two of vec_dupv2df's alternatives

2023-08-01 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Jan Beulich 
> Sent: Tuesday, August 1, 2023 1:49 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Liu, Hongtao ; Kirill Yukhin
> 
> Subject: [PATCH] x86: fold two of vec_dupv2df's alternatives
> 
> By using Yvm in the source, both can be expressed in one.
> 
> gcc/
> 
>   * sse.md (vec_dupv2df): Fold the middle two of the
>   alternatives.
Ok, thanks.
> 
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -13784,21 +13784,20 @@
> (set_attr "mode" "DF,DF,V1DF,V1DF,V1DF,V2DF,V1DF,V1DF,V1DF")])
> 
>  (define_insn "vec_dupv2df"
> -  [(set (match_operand:V2DF 0 "register_operand" "=x,x,v,v")
> +  [(set (match_operand:V2DF 0 "register_operand" "=x,v,v")
>   (vec_duplicate:V2DF
> -   (match_operand:DF 1 "nonimmediate_operand" "0,xm,vm,vm")))]
> +   (match_operand:DF 1 "nonimmediate_operand" "0,Yvm,vm")))]
>"TARGET_SSE2"
>"@
> unpcklpd\t%0, %0
> %vmovddup\t{%1, %0|%0, %1}
> -   vmovddup\t{%1, %0|%0, %1}
> vbroadcastsd\t{%1, }%g0{|, %1}"
> -  [(set_attr "isa" "noavx,sse3,avx512vl,*")
> -   (set_attr "type" "sselog1,ssemov,ssemov,ssemov")
> -   (set_attr "prefix" "orig,maybe_vex,evex,evex")
> -   (set_attr "mode" "V2DF,DF,DF,V8DF")
> +  [(set_attr "isa" "noavx,sse3,*")
> +   (set_attr "type" "sselog1,ssemov,ssemov")
> +   (set_attr "prefix" "orig,maybe_evex,evex")
> +   (set_attr "mode" "V2DF,DF,V8DF")
> (set (attr "enabled")
> - (cond [(eq_attr "alternative" "3")
> + (cond [(eq_attr "alternative" "2")
>(symbol_ref "TARGET_AVX512F && !TARGET_AVX512VL
> && !TARGET_PREFER_AVX256")
>  (match_test "")

RE: [PATCH] Replace invariant ternlog operands

2023-07-26 Thread Liu, Hongtao via Gcc-patches




> -Original Message-
> From: Yan Simonaytes 
> Sent: Wednesday, July 26, 2023 2:11 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Liu, Hongtao ; Uros Bizjak ;
> Yan Simonaytes 
> Subject: [PATCH] Replace invariant ternlog operands
> 
> Sometimes GCC generates ternlog with three operands, but some of them are
> invariant.
> For example:
> 
> vpternlogq$252, %zmm2, %zmm1, %zmm0
> 
> In this case zmm1 register isnt used by ternlog.
> So should replace zmm1 with zmm0 or zmm2:
> 
> vpternlogq$252, %zmm0, %zmm1, %zmm0
> 
> When the third operand of ternlog is memory and both others are invariant
> should add load instruction from this memory to register and replace the first
> and the second operands to this register.
> So insted of
> 
> vpternlogq$85, (%rdi), %zmm1, %zmm0
> 
> Should emit
> 
> vmovdqa64 (%rdi), %zmm0
> vpternlogq$85, %zmm0, %zmm0, %zmm0
> 
> gcc/ChangeLog:
> 
> * config/i386/i386.cc (ternlog_invariant_operand_mask): New helper
>   function for replacing invariant operands.
> (reduce_ternlog_operands): Likewise.
> * config/i386/i386-protos.h (ternlog_invariant_operand_mask):
> Prototype here.
> (reduce_ternlog_operands): Likewise.
> * config/i386/sse.md:
> 
> gcc/testsuite/ChangeLog:
> 
> * gcc.target/i386/reduce-ternlog-operands-1.c: New test.
> * gcc.target/i386/reduce-ternlog-operands-2.c: New test.
> ---
>  gcc/config/i386/i386-protos.h |  2 +
>  gcc/config/i386/i386.cc   | 45 +++
>  gcc/config/i386/sse.md| 43 ++
>  .../i386/reduce-ternlog-operands-1.c  | 20 +
>  .../i386/reduce-ternlog-operands-2.c  | 11 +
>  5 files changed, 121 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/reduce-ternlog-operands-
> 1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/reduce-ternlog-operands-
> 2.c
> 
> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> index 27fe73ca65c..49398ef9936 100644
> --- a/gcc/config/i386/i386-protos.h
> +++ b/gcc/config/i386/i386-protos.h
> @@ -57,6 +57,8 @@ extern int standard_80387_constant_p (rtx);  extern
> const char *standard_80387_constant_opcode (rtx);  extern rtx
> standard_80387_constant_rtx (int);  extern int standard_sse_constant_p (rtx,
> machine_mode);
> +extern int ternlog_invariant_operand_mask (rtx *operands); extern void
> +reduce_ternlog_operands (rtx *operands);
>  extern const char *standard_sse_constant_opcode (rtx_insn *, rtx *);  extern
> bool ix86_standard_x87sse_constant_load_p (const rtx_insn *, rtx);  extern
> bool ix86_pre_reload_split (void); diff --git a/gcc/config/i386/i386.cc
> b/gcc/config/i386/i386.cc index f0d6167e667..140de478571 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -5070,6 +5070,51 @@ ix86_check_no_addr_space (rtx insn)
>  }
>return true;
>  }
> +
> +/* Return mask of invariant operands:
> +   bit number 0 1 2
> +   operand number 1 2 3.  */
> +
> +int
> +ternlog_invariant_operand_mask (rtx *operands) {
> +  int mask = 0;
> +  int imm8 = XINT (operands[4], 0);
> +
> +  if (((imm8 >> 4) & 0xF) == (imm8 & 0xF))
> +mask |= 1;
> +  if (((imm8 >> 2) & 0x33) == (imm8 & 0x33))
> +mask |= (1 << 1);
> +  if (((imm8 >> 1) & 0x55) == (imm8 & 0x55))
> +mask |= (1 << 2);
> +
> +  return mask;
> +}
> +
> +/* Replace one of the unused operators with the one used.  */
> +
> +void
> +reduce_ternlog_operands (rtx *operands) {
> +  int mask = ternlog_invariant_operand_mask (operands);
> +
> +  if (mask & 1) /* the first operand is invariant.  */
> +operands[1] = operands[2];
> +
> +  if (mask & 2) /* the second operand is invariant.  */
> +operands[2] = operands[1];
> +
> +  if (mask & 4)  /* the third operand is invariant.  */
> +   operands[3] = operands[1];
> +  else if (!MEM_P (operands[3]))
> +{
> +  if (mask & 1) /* the first operand is invariant.  */
> + operands[1] = operands[3];
> +  if (mask & 2) /* the second operands is invariant.  */
> + operands[2] = operands[3];
> +}
> +}
> +
> 
> 
> 
>  /* Initialize the table of extra 80387 mathematical constants.  */
> 
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index
> a2099373123..f88d82b315c 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -12625,6 +12625,49 @@
> (symbol_ref " == 64 || TARGET_AVX512VL")
> (const_string "*")))])
> 
> +;; If the first and the second operands of ternlog are invariant and ;;
> +the third operand is memory ;; then we should add load third operand
> +from memory to register and ;; replace first and second operands with
> +this register (define_split
> +  [(set (match_operand:V 0 "register_operand")
> + (unspec:V
> +   [(match_operand:V 1 "register_operand")
> +(match_operand:V 2 "register_operand")
> +(match_operand:V 3

RE: [PATCH] Initial Granite Rapids D Support

2023-07-11 Thread Liu, Hongtao via Gcc-patches




> -Original Message-
> From: Mo, Zewei 
> Sent: Wednesday, July 12, 2023 1:56 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Liu, Hongtao ; ubiz...@gmail.com
> Subject: [PATCH] Initial Granite Rapids D Support
> 
> Hi all,
> 
> This patch is to add initial support for Granite Rapids D for GCC.
> 
> The link of related information is listed below:
> https://www.intel.com/content/www/us/en/develop/download/intel-
> architecture-instruction-set-extensions-programming-reference.html
> 
> Also, the patch of removing AMX-COMPLEX from Granite Rapids will be
> backported to GCC13.
> 
> This has been tested on x86_64-pc-linux-gnu. Is this ok for trunk? Thank you.
Ok.
> 
> Sincerely,
> Zewei Mo
> 
> gcc/ChangeLog:
> 
>   * common/config/i386/cpuinfo.h
>   (get_intel_cpu): Handle Granite Rapids D.
>   * common/config/i386/i386-common.cc:
>   (processor_alias_table): Add graniterapids-d.
>   * common/config/i386/i386-cpuinfo.h
>   (enum processor_subtypes): Add INTEL_COREI7_GRANITERAPIDS_D.
>   * config.gcc: Add -march=graniterapids-d.
>   * config/i386/driver-i386.cc (host_detect_local_cpu):
>   Handle graniterapids-d.
>   * gcc/config/i386/i386.h: (PTA_GRANITERAPIDS_D): New.
>   * doc/extend.texi: Add graniterapids-d.
>   * doc/invoke.texi: Ditto.
> 
> gcc/testsuite/ChangeLog:
> 
>   * g++.target/i386/mv16.C: Add graniterapids-d.
>   * gcc.target/i386/funcspec-56.inc: Handle new march.
> ---
>  gcc/common/config/i386/cpuinfo.h  |  9 -
>  gcc/common/config/i386/i386-common.cc |  2 ++
>  gcc/common/config/i386/i386-cpuinfo.h |  1 +
>  gcc/config.gcc|  2 +-
>  gcc/config/i386/driver-i386.cc|  3 +++
>  gcc/config/i386/i386.h|  4 +++-
>  gcc/doc/extend.texi   |  3 +++
>  gcc/doc/invoke.texi   | 11 +++
>  gcc/testsuite/g++.target/i386/mv16.C  |  6 ++
>  gcc/testsuite/gcc.target/i386/funcspec-56.inc |  1 +
>  10 files changed, 39 insertions(+), 3 deletions(-)
> 
> diff --git a/gcc/common/config/i386/cpuinfo.h
> b/gcc/common/config/i386/cpuinfo.h
> index ae48bc17771..7c2565c1d93 100644
> --- a/gcc/common/config/i386/cpuinfo.h
> +++ b/gcc/common/config/i386/cpuinfo.h
> @@ -565,7 +565,6 @@ get_intel_cpu (struct __processor_model
> *cpu_model,
>cpu_model->__cpu_type = INTEL_SIERRAFOREST;
>break;
>  case 0xad:
> -case 0xae:
>/* Granite Rapids.  */
>cpu = "graniterapids";
>CHECK___builtin_cpu_is ("corei7"); @@ -573,6 +572,14 @@
> get_intel_cpu (struct __processor_model *cpu_model,
>cpu_model->__cpu_type = INTEL_COREI7;
>cpu_model->__cpu_subtype = INTEL_COREI7_GRANITERAPIDS;
>break;
> +case 0xae:
> +  /* Granite Rapids D.  */
> +  cpu = "graniterapids-d";
> +  CHECK___builtin_cpu_is ("corei7");
> +  CHECK___builtin_cpu_is ("graniterapids-d");
> +  cpu_model->__cpu_type = INTEL_COREI7;
> +  cpu_model->__cpu_subtype = INTEL_COREI7_GRANITERAPIDS_D;
> +  break;
>  case 0xb6:
>/* Grand Ridge.  */
>cpu = "grandridge";
> diff --git a/gcc/common/config/i386/i386-common.cc
> b/gcc/common/config/i386/i386-common.cc
> index bf126f14073..8cea3669239 100644
> --- a/gcc/common/config/i386/i386-common.cc
> +++ b/gcc/common/config/i386/i386-common.cc
> @@ -2094,6 +2094,8 @@ const pta processor_alias_table[] =
>  M_CPU_SUBTYPE (INTEL_COREI7_ALDERLAKE), P_PROC_AVX2},
>{"graniterapids", PROCESSOR_GRANITERAPIDS, CPU_HASWELL,
> PTA_GRANITERAPIDS,
>  M_CPU_SUBTYPE (INTEL_COREI7_GRANITERAPIDS), P_PROC_AVX512F},
> +  {"graniterapids-d", PROCESSOR_GRANITERAPIDS, CPU_HASWELL,
> PTA_GRANITERAPIDS_D,
> +M_CPU_SUBTYPE (INTEL_COREI7_GRANITERAPIDS_D),
> P_PROC_AVX512F},
>{"bonnell", PROCESSOR_BONNELL, CPU_ATOM, PTA_BONNELL,
>  M_CPU_TYPE (INTEL_BONNELL), P_PROC_SSSE3},
>{"atom", PROCESSOR_BONNELL, CPU_ATOM, PTA_BONNELL, diff --git
> a/gcc/common/config/i386/i386-cpuinfo.h b/gcc/common/config/i386/i386-
> cpuinfo.h
> index 2dafbb25a49..254dfec70e5 100644
> --- a/gcc/common/config/i386/i386-cpuinfo.h
> +++ b/gcc/common/config/i386/i386-cpuinfo.h
> @@ -98,6 +98,7 @@ enum processor_subtypes
>ZHAOXIN_FAM7H_LUJIAZUI,
>AMDFAM19H_ZNVER4,
>INTEL_COREI7_GRANITERAPIDS,
> +  INTEL_COREI7_GRANITERAPIDS_D,
>CPU_SUBTYPE_MAX
>  };
> 
> diff --git a/gcc/config.gcc b/gcc/config.gcc index d88071773c9..1446eb2b3ca
> 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -682,7 +682,7 @@ silvermont knl knm skylake-avx512 cannonlake
> icelake-client icelake-server \  skylake goldmont goldmont-plus tremont
> cascadelake tigerlake cooperlake \  sapphirerapids alderlake rocketlake
> eden-x2 nano nano-1000 nano-2000 nano-3000 \
>  nano-x2 eden-x4 nano-x4 lujiazui x86-64 x86-64-v2 x86-64-v3 x86-64-v4 \ -
> sierraforest graniterapids grandridge native"
> +sierraforest graniterapids

RE: [PATCH] x86: improve fast bfloat->float conversion

2023-07-11 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Jan Beulich 
> Sent: Tuesday, July 11, 2023 3:50 PM
> To: Liu, Hongtao 
> Cc: Kirill Yukhin ; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] x86: improve fast bfloat->float conversion
> 
> On 11.07.2023 08:45, Liu, Hongtao wrote:
> >> -Original Message-
> >> From: Jan Beulich 
> >> Sent: Tuesday, July 11, 2023 2:08 PM
> >>
> >> There's nothing AVX512BW-ish in here, so no reason to use Yw as the
> >> constraints for the AVX alternative. Furthermore by using the 512-bit
> >> form of VPSSLD (in a new alternative) all 32 registers can be used
> >> directly by the insn without AVX512VL needing to be enabled.
> > Yes, the instruction vpslld doesn't need AVX512BW, the patch LGTM.
> 
> Thanks.
> 
> >> ---
> >> The corresponding expander, "extendbfsf2", looks to have been dead
> >> since its introduction in a1ecc5600464 ("Fix incorrect
> >> _mm_cvtsbh_ss"): The builtin references the insn (extendbfsf2_1), not
> >> the expander. Can't the expander be deleted and the name of the insn
> >> then pruned of the _1 suffix? If so, that further raises the question
> >> of the significance of the "!HONOR_NANS (BFmode)" that the expander
> >> has, but the insn doesn't have. Which may instead suggest the builtin
> >> was meant to reference the expander. Yet then I can't see what would
> >> the builtin would expand to when HONOR_NANS
> >> (BFmode) it true.
> >
> > Quote from what Jakub said in [1].
> > ---
> > This is not correct.
> > While using such code for _mm_cvtsbh_ss is fine if it is documented
> > not to raise exceptions and turn a sNaN into a qNaN, it is not fine
> > for HONOR_NANS (i.e. when -ffast-math is not on), because a __bf16 ->
> > float conversion on sNaN should raise invalid exception and turn it into a
> qNaN.
> > We could have extendbfsf2 expander that would FAIL; if HONOR_NANS
> and
> > emit extendbfsf2_1 otherwise.
> > ---
> > [1]
> > https://gcc.gnu.org/pipermail/gcc-patches/2022-November/607108.html
> 
> I'm not sure I understand: It sounds like what Jakub said matches my
> observation, yet then it seems unlikely that the issue wasn't fixed in over 
> half
> a year.
> 
> Also having the expander FAIL when HONOR_NANS (matching what I was
> thinking) still doesn't clarify to me what then would happen to uses of the
> builtin. Is there any (common code) fallback for such a case? I didn't think
> there would be, in which case wouldn't this result in an internal compiler
> error?
For __bf16 -> float or target specific builtins, it should be ok since __bf16 
is just an extension type.
 but extendbfsf2 is a standard pattern name which is also used to expand c++23 
std::bfloat16_t -> float conversion which is assumed to raise exceptions for 
sNAN.
Since vpslld won't raise any exception, we need to add HONOR_NANS in the 
extendbfsf2 pattern.
It's my understanding, for std:bfloat16_t support, it's mentioned in [2].

https://gcc.gnu.org/pipermail/gcc-patches/2022-September/601865.html
> 
> Jan

RE: [PATCH] x86: improve fast bfloat->float conversion

2023-07-11 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Jan Beulich 
> Sent: Tuesday, July 11, 2023 2:08 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Liu, Hongtao ; Kirill Yukhin
> 
> Subject: [PATCH] x86: improve fast bfloat->float conversion
> 
> There's nothing AVX512BW-ish in here, so no reason to use Yw as the
> constraints for the AVX alternative. Furthermore by using the 512-bit form of
> VPSSLD (in a new alternative) all 32 registers can be used directly by the 
> insn
> without AVX512VL needing to be enabled.
Yes, the instruction vpslld doesn't need AVX512BW, the patch LGTM.
> 
> Also adjust the originally last alternative's "prefix" attribute to 
> maybe_evex.
> 
> gcc/
> 
>   * config/i386/i386.md (extendbfsf2_1): Add new AVX512F
>   alternative. Adjust original last alternative's "prefix"
>   attribute to maybe_evex.
> ---
> The corresponding expander, "extendbfsf2", looks to have been dead since
> its introduction in a1ecc5600464 ("Fix incorrect _mm_cvtsbh_ss"): The builtin
> references the insn (extendbfsf2_1), not the expander. Can't the expander
> be deleted and the name of the insn then pruned of the _1 suffix? If so, that
> further raises the question of the significance of the "!HONOR_NANS
> (BFmode)" that the expander has, but the insn doesn't have. Which may
> instead suggest the builtin was meant to reference the expander. Yet then I
> can't see what would the builtin would expand to when HONOR_NANS
> (BFmode) it true.

Quote from what Jakub said in [1].
---
This is not correct.
While using such code for _mm_cvtsbh_ss is fine if it is documented not to
raise exceptions and turn a sNaN into a qNaN, it is not fine for HONOR_NANS
(i.e. when -ffast-math is not on), because a __bf16 -> float conversion
on sNaN should raise invalid exception and turn it into a qNaN.
We could have extendbfsf2 expander that would FAIL; if HONOR_NANS and
emit extendbfsf2_1 otherwise. 
---
[1] https://gcc.gnu.org/pipermail/gcc-patches/2022-November/607108.html
> 
> I further wonder whether the nearby "extendhfdf2" expander is really
> needed. It doesn't look to specify anything that the corresponding insn
> doesn't also specify.
> 
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -5181,21 +5181,27 @@
>  ;; Don't use float_extend since psrlld doesn't raise  ;; exceptions and turn 
> a
> sNaN into a qNaN.
>  (define_insn "extendbfsf2_1"
> -  [(set (match_operand:SF 0 "register_operand"   "=x,Yw")
> +  [(set (match_operand:SF 0 "register_operand"   "=x,Yv,v")
>   (unspec:SF
> -   [(match_operand:BF 1 "register_operand" " 0,Yw")]
> +   [(match_operand:BF 1 "register_operand" " 0,Yv,v")]
> UNSPEC_CVTBFSF))]
>   "TARGET_SSE2"
>   "@
>pslld\t{$16, %0|%0, 16}
> -  vpslld\t{$16, %1, %0|%0, %1, 16}"
> -  [(set_attr "isa" "noavx,avx")
> +  vpslld\t{$16, %1, %0|%0, %1, 16}
> +  vpslld\t{$16, %g1, %g0|%g0, %g1, 16}"
> +  [(set_attr "isa" "noavx,avx,*")
> (set_attr "type" "sseishft1")
> (set_attr "length_immediate" "1")
> -   (set_attr "prefix_data16" "1,*")
> -   (set_attr "prefix" "orig,vex")
> -   (set_attr "mode" "TI")
> -   (set_attr "memory" "none")])
> +   (set_attr "prefix_data16" "1,*,*")
> +   (set_attr "prefix" "orig,maybe_evex,evex")
> +   (set_attr "mode" "TI,TI,XI")
> +   (set_attr "memory" "none")
> +   (set (attr "enabled")
> + (if_then_else (eq_attr "alternative" "2")
> +   (symbol_ref "TARGET_AVX512F && !TARGET_AVX512VL
> + && !TARGET_PREFER_AVX256")
> +   (const_string "*")))])
> 
>  (define_expand "extendxf2"
>[(set (match_operand:XF 0 "nonimmediate_operand")

RE: [PATCH v3] x86: make better use of VBROADCASTSS / VPBROADCASTD

2023-07-11 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Jan Beulich 
> Sent: Tuesday, July 11, 2023 2:04 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Kirill Yukhin ; Liu, Hongtao
> 
> Subject: [PATCH v3] x86: make better use of VBROADCASTSS /
> VPBROADCASTD
> 
> ... in vec_dupv4sf / *vec_dupv4si. The respective broadcast insns are never
> longer (yet sometimes shorter) than the corresponding VSHUFPS / VPSHUFD,
> due to the immediate operand of the shuffle insns balancing the
> (uniform) need for VEX3 in the broadcast ones. When EVEX encoding is
> respective the broadcast insns are always shorter.
> 
> Add new alternatives to cover the AVX2 and AVX512 cases as appropriate.
> 
> While touching this anyway, switch to consistently using "sseshuf1" in the
> "type" attributes for all shuffle forms.
> 
> gcc/
> 
>   * config/i386/sse.md (vec_dupv4sf): Make first alternative use
>   vbroadcastss for AVX2. New AVX512F alternative.
>   (*vec_dupv4si): New AVX2 and AVX512F alternatives using
>   vpbroadcastd. Replace sselog1 by sseshuf1 in "type" attribute.
> 
> gcc/testsuite/
> 
>   * gcc.target/i386/avx2-dupv4sf.c: New test.
>   * gcc.target/i386/avx2-dupv4si.c: Likewise.
>   * gcc.target/i386/avx512f-dupv4sf.c: Likewise.
>   * gcc.target/i386/avx512f-dupv4si.c: Likewise.
> ---
> Note that unlike originally intended, "prefix_extra" isn't dropped:
> "length_vex" uses it to determine whether 2-byte VEX encoding is possible
> (which it isn't for VBROADCASTSS / VPBROADCASTD). "length"
> itself specifically does not use it for VEX/EVEX encoded insns.
> 
> Especially with the added "enabled" attribute I didn't really see how to
> (further) fold alternatives 0 and 1. Instead *vec_dupv4si might benefit from
> using sse2_noavx2 instead of sse2 for alternative 2, except that there is no
> sse2_noavx2, only sse2_noavx.
> 
> I'm working from the assumption that the isa attributes to the original 1st 
> and
> 2nd alternatives don't need further restricting (to sse2_noavx2 or
> avx_noavx2 as applicable), as the new earlier alternatives cover all operand
> forms already when at least AVX2 is enabled.
Yes, the patch LGTM.
> ---
> v3: Testcases for new alternatives. "type" and "prefix_extra"
> adjustments.
> v2: Correct operand constraints. Respect -mprefer-vector-width=. Fold
> two alternatives of vec_dupv4sf.
> 
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -25969,41 +25969,64 @@
>   (const_int 1)))])
> 
>  (define_insn "vec_dupv4sf"
> -  [(set (match_operand:V4SF 0 "register_operand" "=v,v,x")
> +  [(set (match_operand:V4SF 0 "register_operand" "=v,v,v,x")
>   (vec_duplicate:V4SF
> -   (match_operand:SF 1 "nonimmediate_operand" "Yv,m,0")))]
> +   (match_operand:SF 1 "nonimmediate_operand" "Yv,v,m,0")))]
>"TARGET_SSE"
>"@
> -   vshufps\t{$0, %1, %1, %0|%0, %1, %1, 0}
> +   * return TARGET_AVX2 ? \"vbroadcastss\t{%1, %0|%0, %1}\" :
> \"vshufps\t{$0, %d1, %0|%0, %d1, 0}\";
> +   vbroadcastss\t{%1, %g0|%g0, %1}
> vbroadcastss\t{%1, %0|%0, %1}
> shufps\t{$0, %0, %0|%0, %0, 0}"
> -  [(set_attr "isa" "avx,avx,noavx")
> -   (set_attr "type" "sseshuf1,ssemov,sseshuf1")
> -   (set_attr "length_immediate" "1,0,1")
> -   (set_attr "prefix_extra" "0,1,*")
> -   (set_attr "prefix" "maybe_evex,maybe_evex,orig")
> -   (set_attr "mode" "V4SF")])
> +  [(set_attr "isa" "avx,*,avx,noavx")
> +   (set (attr "type")
> + (cond [(and (eq_attr "alternative" "0")
> + (match_test "!TARGET_AVX2"))
> +  (const_string "sseshuf1")
> +(eq_attr "alternative" "3")
> +  (const_string "sseshuf1")
> +   ]
> +   (const_string "ssemov")))
> +   (set (attr "length_immediate")
> + (if_then_else (eq_attr "type" "sseshuf1")
> +   (const_string "1")
> +   (const_string "0")))
> +   (set_attr "prefix_extra" "0,1,1,*")
> +   (set_attr "prefix" "maybe_evex,evex,maybe_evex,orig")
> +   (set_attr "mode" "V4SF,V16SF,V4SF,V4SF")
> +   (set (attr "enabled")
> + (if_then_else (eq_attr "alternative" "1")
> +   (symbol_ref "TARGET_AVX512F && !TARGET_AVX512VL
> +&& !TARGET_PREFER_AVX256")
> +   (const_string "*")))])
> 
>  (define_insn "*vec_dupv4si"
> -  [(set (match_operand:V4SI 0 "register_operand" "=v,v,x")
> +  [(set (match_operand:V4SI 0 "register_operand" "=v,v,v,v,x")
>   (vec_duplicate:V4SI
> -   (match_operand:SI 1 "nonimmediate_operand" "Yv,m,0")))]
> +   (match_operand:SI 1 "nonimmediate_operand" "Yvm,v,Yv,m,0")))]
>"TARGET_SSE"
>"@
> +   vpbroadcastd\t{%1, %0|%0, %1}
> +   vpbroadcastd\t{%1, %g0|%g0, %1}
> %vpshufd\t{$0, %1, %0|%0, %1, 0}
> vbroadcastss\t{%1, %0|%0, %1}
> shufps\t{$0, %0, %0|%0, %0, 0}"
> -  [(set_attr "isa" "sse2,avx,noavx")
> -   (set_attr "type" "sselog1,ssemov,sselog1")
> -   (set_attr "length_immediate" "1,0,1")
> -   (set_attr "prefix_extra" "0,1,*")
> -   (set_attr

RE: [PATCH] Initial Granite Rapids D Support

2023-07-06 Thread Liu, Hongtao via Gcc-patches




> -Original Message-
> From: Mo, Zewei 
> Sent: Thursday, July 6, 2023 2:37 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Liu, Hongtao ; ubiz...@gmail.com
> Subject: [PATCH] Initial Granite Rapids D Support
> 
> Hi all,
> 
> This patch is to add initial support for Granite Rapids D for GCC.
> The link of related information is listed below:
> https://www.intel.com/content/www/us/en/develop/download/intel-
> architecture-instruction-set-extensions-programming-reference.html
> 
> Also, the patch of removing AMX-COMPLEX from Granite Rapids will be
> backported to GCC13.
Ok.
> 
> This has been tested on x86_64-pc-linux-gnu. Is this ok for trunk? Thank you.
> 
> Sincerely,
> Zewei Mo
> 
> gcc/ChangeLog:
> 
>   * common/config/i386/cpuinfo.h
>   (get_intel_cpu): Handle Granite Rapids D.
>   * common/config/i386/i386-common.cc:
>   (processor_names): Add graniterapids-d.
>   (processor_alias_table): Ditto.
>   * common/config/i386/i386-cpuinfo.h
>   (enum processor_subtypes): Add INTEL_GRANITERAPIDS_D.
>   * config.gcc: Add -march=graniterapids-d.
>   * config/i386/driver-i386.cc (host_detect_local_cpu):
>   Handle graniterapids-d.
>   * config/i386/i386-c.cc (ix86_target_macros_internal):
>   Ditto.
>   * config/i386/i386-options.cc (m_GRANITERAPIDSD): New.
>   (processor_cost_table): Add graniterapids-d.
>   * config/i386/i386.h (enum processor_type):
>   Add PROCESSOR_GRANITERAPIDS_D.
>   * doc/extend.texi: Add graniterapids-d.
>   * doc/invoke.texi: Ditto.
> 
> gcc/testsuite/ChangeLog:
> 
>   * g++.target/i386/mv16.C: Add graniterapids-d.
>   * gcc.target/i386/funcspec-56.inc: Handle new march.
> ---
>  gcc/common/config/i386/cpuinfo.h  |  9 -
>  gcc/common/config/i386/i386-common.cc |  3 +++
>  gcc/common/config/i386/i386-cpuinfo.h |  1 +
>  gcc/config.gcc|  2 +-
>  gcc/config/i386/driver-i386.cc|  3 +++
>  gcc/config/i386/i386-c.cc |  7 +++
>  gcc/config/i386/i386-options.cc   |  4 +++-
>  gcc/config/i386/i386.h|  5 -
>  gcc/doc/extend.texi   |  3 +++
>  gcc/doc/invoke.texi   | 11 +++
>  gcc/testsuite/g++.target/i386/mv16.C  |  6 ++
>  gcc/testsuite/gcc.target/i386/funcspec-56.inc |  1 +
>  12 files changed, 51 insertions(+), 4 deletions(-)
> 
> diff --git a/gcc/common/config/i386/cpuinfo.h
> b/gcc/common/config/i386/cpuinfo.h
> index ae48bc17771..7c2565c1d93 100644
> --- a/gcc/common/config/i386/cpuinfo.h
> +++ b/gcc/common/config/i386/cpuinfo.h
> @@ -565,7 +565,6 @@ get_intel_cpu (struct __processor_model
> *cpu_model,
>cpu_model->__cpu_type = INTEL_SIERRAFOREST;
>break;
>  case 0xad:
> -case 0xae:
>/* Granite Rapids.  */
>cpu = "graniterapids";
>CHECK___builtin_cpu_is ("corei7"); @@ -573,6 +572,14 @@
> get_intel_cpu (struct __processor_model *cpu_model,
>cpu_model->__cpu_type = INTEL_COREI7;
>cpu_model->__cpu_subtype = INTEL_COREI7_GRANITERAPIDS;
>break;
> +case 0xae:
> +  /* Granite Rapids D.  */
> +  cpu = "graniterapids-d";
> +  CHECK___builtin_cpu_is ("corei7");
> +  CHECK___builtin_cpu_is ("graniterapids-d");
> +  cpu_model->__cpu_type = INTEL_COREI7;
> +  cpu_model->__cpu_subtype = INTEL_COREI7_GRANITERAPIDS_D;
> +  break;
>  case 0xb6:
>/* Grand Ridge.  */
>cpu = "grandridge";
> diff --git a/gcc/common/config/i386/i386-common.cc
> b/gcc/common/config/i386/i386-common.cc
> index bf126f14073..5a337c5b8be 100644
> --- a/gcc/common/config/i386/i386-common.cc
> +++ b/gcc/common/config/i386/i386-common.cc
> @@ -1971,6 +1971,7 @@ const char *const processor_names[] =
>"alderlake",
>"rocketlake",
>"graniterapids",
> +  "graniterapids-d",
>"intel",
>"lujiazui",
>"geode",
> @@ -2094,6 +2095,8 @@ const pta processor_alias_table[] =
>  M_CPU_SUBTYPE (INTEL_COREI7_ALDERLAKE), P_PROC_AVX2},
>{"graniterapids", PROCESSOR_GRANITERAPIDS, CPU_HASWELL,
> PTA_GRANITERAPIDS,
>  M_CPU_SUBTYPE (INTEL_COREI7_GRANITERAPIDS), P_PROC_AVX512F},
> +  {"graniterapids-d", PROCESSOR_GRANITERAPIDS_D, CPU_HASWELL,
> PTA_GRANITERAPIDS_D,
> +M_CPU_SUBTYPE (INTEL_COREI7_GRANITERAPIDS_D),
> P_PROC_AVX512F},
>{"bonnell", PROCESSOR_BONNELL, CPU_ATOM, PTA_BONNELL,
>  M_CPU_TYPE (INTEL_BONNELL), P_PROC_SSSE3},
>{"atom", PROCESSOR_BONNELL, CPU_ATOM, PTA_BONNELL, diff --git
> a/gcc/common/config/i386/i386-cpuinfo.h b/gcc/common/config/i386/i386-
> cpuinfo.h
> index 2dafbb25a49..254dfec70e5 100644
> --- a/gcc/common/config/i386/i386-cpuinfo.h
> +++ b/gcc/common/config/i386/i386-cpuinfo.h
> @@ -98,6 +98,7 @@ enum processor_subtypes
>ZHAOXIN_FAM7H_LUJIAZUI,
>AMDFAM19H_ZNVER4,
>INTEL_COREI7_GRANITERAPIDS,
> +  INTEL_COREI7_GRANITERAPIDS_D,
>CPU_SUBTYPE_MAX
>  };

RE: [PATCH v3] x86: make VPTERNLOG* usable on less than 512-bit operands with just AVX512F

2023-07-04 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Jan Beulich 
> Sent: Tuesday, July 4, 2023 11:30 PM
> To: Hongtao Liu 
> Cc: gcc-patches@gcc.gnu.org; Kirill Yukhin ; Liu,
> Hongtao 
> Subject: Re: [PATCH v3] x86: make VPTERNLOG* usable on less than 512-bit
> operands with just AVX512F
> 
> On 27.06.2023 07:11, Hongtao Liu wrote:
> > On Tue, Jun 20, 2023 at 5:34 PM Hongtao Liu  wrote:
> >>
> >> On Tue, Jun 20, 2023 at 5:03 PM Jan Beulich  wrote:
> >>>
> >>> On 20.06.2023 10:33, Hongtao Liu wrote:
>  On Tue, Jun 20, 2023 at 3:07 PM Jan Beulich via Gcc-patches
>   wrote:
> >
> > I guess the underlying pattern, going along the lines of what
> > one_cmpl2 uses, can be
> applied
> > elsewhere as well.
>  That should be guarded with !TARGET_PREFER_AVX256, let's handle
>  that in a separate patch.
> >>>
> >>> Sure, and as indicated there are more places where similar things
> >>> could be done.
> >>>
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/avx512f-copysign.c
> > @@ -0,0 +1,32 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-mavx512f -mno-avx512vl -O2" } */
>  Please explicitly add -mprefer-vector-width=512, our tester will
>  also test unix{-m32 \-march=cascadelake,\ -march=cascadelake} which
>  set the
>  - mprefer-vector-width=256, -mprefer-vector-width=512 in dg-options
>  can overwrite that.
> >>>
> >>> Oh, I see. Will do. And I expect I then also need to adjust the
> >>> newly added avx512f-dupv2di.c from the earlier patch. I guess I
> >>> could commit that option addition there as obvious?
> >> Still need to send out the patch, and commit as an obvious fix.
> >>>
>  Others LGTM.
> >>>
> >>> May I take this as "okay with that change", or should I submit v4?
> >> Okay. no need for a v4 version.
> >>>
> > avx512f-copysign.c failed for -m32, we need to add -mfpmath=sse to dg-
> options.
> 
> Oh, of course. I will take care of this, but it may take me a couple of days, 
> as I
> just came back from a week of vacation. One question though:
> Elsewhere such tests are simply suppressed for 32-bit. Personally I'd prefer
> going that route, but if you think adding -mfpmath=sse is indeed better, I'll
> follow your request.
Either is ok.
> 
> Jan

RE: [PATCH v2] x86: make better use of VBROADCASTSS / VPBROADCASTD

2023-06-24 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Jan Beulich 
> Sent: Wednesday, June 21, 2023 8:40 PM
> To: Hongtao Liu 
> Cc: gcc-patches@gcc.gnu.org; Kirill Yukhin ; Liu,
> Hongtao 
> Subject: Re: [PATCH v2] x86: make better use of VBROADCASTSS /
> VPBROADCASTD
> 
> On 21.06.2023 09:44, Jan Beulich wrote:
> > On 21.06.2023 09:37, Hongtao Liu wrote:
> >> On Wed, Jun 21, 2023 at 2:06 PM Jan Beulich via Gcc-patches
> >>  wrote:
> >>>
> >>> Isn't prefix_extra use bogus here? What extra prefix does
> >>> vbroadcastss
> >> According to comments, yes, no extra prefix is needed.
> >>
> >> ;; There are also additional prefixes in 3DNOW, SSSE3.
> >> ;; ssemuladd,sse4arg default to 0f24/0f25 and DREX byte, ;;
> >> sseiadd1,ssecvt1 to 0f7a with no DREX byte.
> >> ;; 3DNOW has 0f0f prefix, SSSE3 and SSE4_{1,2} 0f38/0f3a.
> >
> > Right, that's what triggered my question. I guess dropping these
> > "prefix_extra" really wants to be a separate patch (or maybe even
> > multiple, but it's hard to see how to split), dealing with all of the
> > instances which likely have accumulated simply via copy-and-paste.
> 
> Or wait - I'm altering those lines anyway, so I could as well drop them right
> away (and slightly shrink patch size), if that's okay with you. Of course I
> should then not forget to also mention this in the changelog entry.
> 
Yes.
> Jan

RE: [PATCH v2] x86: make VPTERNLOG* usable on less than 512-bit operands with just AVX512F

2023-06-18 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Jan Beulich 
> Sent: Friday, June 16, 2023 2:22 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Kirill Yukhin ; Liu, Hongtao
> 
> Subject: [PATCH v2] x86: make VPTERNLOG* usable on less than 512-bit
> operands with just AVX512F
> 
> There's no reason to constrain this to AVX512VL, unless instructed so by -
> mprefer-vector-width=, as the wider operation is unusable for more narrow
> operands only when the possible memory source is a non-broadcast one.
> This way even the scalar copysign3 can benefit from the operation
> being a single-insn one (leaving aside moves which the compiler decides to
> insert for unclear reasons, and leaving aside the fact that
> bcst_mem_operand() is too restrictive for broadcast to be embedded right
> into VPTERNLOG*).
> 
> Along with this also request value duplication in ix86_expand_copysign()'s
> call to ix86_build_signbit_mask(), eliminating excess space allocation
> in .rodata.*, filled with zeros which are never read.
> 
> gcc/
> 
>   * config/i386/i386-expand.cc (ix86_expand_copysign): Request
>   value duplication by ix86_build_signbit_mask() when AVX512F and
>   not HFmode.
>   * config/i386/sse.md (*_vternlog_all): Convert to
>   2-alternative form. Adjust "mode" attribute. Add "enabled"
>   attribute.
>   (*_vpternlog_1): Also permit when
> TARGET_AVX512F
>   && !TARGET_PREFER_AVX256.
>   (*_vpternlog_2): Likewise.
>   (*_vpternlog_3): Likewise.
> ---
> I guess the underlying pattern, going along the lines of what
> one_cmpl2 uses, can be applied
> elsewhere as well.
> 
> HFmode could use embedded broadcast too for copysign and alike, but that
> would need to be V2HF -> V8HF (for which I don't think there are any existing
> patterns).
> ---
> v2: Respect -mprefer-vector-width=.
> 
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -2266,7 +2266,7 @@ ix86_expand_copysign (rtx operands[])
>else
>  dest = NULL_RTX;
>op1 = lowpart_subreg (vmode, force_reg (mode, operands[2]), mode);
> -  mask = ix86_build_signbit_mask (vmode, 0, 0);
> +  mask = ix86_build_signbit_mask (vmode, TARGET_AVX512F && mode !=
> + HFmode, 0);
> 
>if (CONST_DOUBLE_P (operands[1]))
>  {
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -12597,11 +12597,11 @@
> (set_attr "mode" "")])
> 
>  (define_insn "*_vternlog_all"
> -  [(set (match_operand:V 0 "register_operand" "=v")
> +  [(set (match_operand:V 0 "register_operand" "=v,v")
>   (unspec:V
> -   [(match_operand:V 1 "register_operand" "0")
> -(match_operand:V 2 "register_operand" "v")
> -(match_operand:V 3 "bcst_vector_operand" "vmBr")
> +   [(match_operand:V 1 "register_operand" "0,0")
> +(match_operand:V 2 "register_operand" "v,v")
> +(match_operand:V 3 "bcst_vector_operand" "vBr,m")
>  (match_operand:SI 4 "const_0_to_255_operand")]
> UNSPEC_VTERNLOG))]
>"TARGET_AVX512F
Change condition to  == 64 || TARGET_AVX512VL || (TARGET_AVX512F && 
!TARGET_PREFER_AVX256)
Also please add a testcase for case TARGET_AVX512F && !TARGET_PREFER_AVX256.
> @@ -12609,10 +12609,22 @@
> it's not real AVX512FP16 instruction.  */
>&& (GET_MODE_SIZE (GET_MODE_INNER (mode)) >= 4
>   || GET_CODE (operands[3]) != VEC_DUPLICATE)"
> -  "vpternlog\t{%4, %3, %2, %0|%0, %2, %3, %4}"
> +{
> +  if (TARGET_AVX512VL)
> +return "vpternlog\t{%4, %3, %2, %0|%0, %2, %3, %4}";
> +  else
> +return "vpternlog\t{%4, %g3, %g2, %g0|%g0, %g2, %g3,
> +%4}"; }
>[(set_attr "type" "sselog")
> (set_attr "prefix" "evex")
> -   (set_attr "mode" "")])
> +   (set (attr "mode")
> +(if_then_else (match_test "TARGET_AVX512VL")
> +   (const_string "")
> +   (const_string "XI")))
> +   (set (attr "enabled")
> + (if_then_else (eq_attr "alternative" "1")
> +   (symbol_ref " == 64 || TARGET_AVX512VL")
> +   (const_string "*")))])
> 
>  ;; There must be lots of other combinations like  ;; @@ -12641,7 +12653,8
> @@
> (any_logic2:V
>   (match_operand:V 3 "regmem_or_bitnot_regmem_operand")
>   (match_operand:V 4 "regmem_or_bitnot_regmem_operand"]
> -  "( == 64 || TARGET_AVX512VL)
> +  "( == 64 || TARGET_AVX512VL
> +|| (TARGET_AVX512F && !TARGET_PREFER_AVX256))
> && ix86_pre_reload_split ()
> && (rtx_equal_p (STRIP_UNARY (operands[1]),
>   STRIP_UNARY (operands[4]))
> @@ -12725,7 +12738,8 @@
> (match_operand:V 2 "regmem_or_bitnot_regmem_operand"))
>   (match_operand:V 3 "regmem_or_bitnot_regmem_operand"))
> (match_operand:V 4 "regmem_or_bitnot_regmem_operand")))]
> -  "( == 64 || TARGET_AVX512VL)
> +  "( == 64 || TARGET_AVX512VL
> +|| (TARGET_AVX512F && !TARGET_PREFER_AVX256))
> && ix86_pre_reload_split ()
> && (rtx_equal_p (STRIP_UNARY (operands[1]),
>   STRIP_UNARY (operands[4]))
> @@ -12808,7

RE: [PATCH v2] x86: correct and improve "*vec_dupv2di"

2023-06-18 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Jan Beulich 
> Sent: Friday, June 16, 2023 2:20 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Liu, Hongtao ; Kirill Yukhin
> 
> Subject: [PATCH v2] x86: correct and improve "*vec_dupv2di"
> 
> The input constraint for the %vmovddup alternative was wrong, as the upper
> 16 XMM registers require AVX512VL to be used with this insn. To
> compensate, introduce a new alternative permitting all 32 registers, by
> broadcasting to the full 512 bits in that case if AVX512VL is not available.
> 
> gcc/
> 
>   * config/i386/sse.md (vec_dupv2di): Correct %vmovddup input
>   constraint. Add new AVX512F alternative.
Could you add a testcase for that.
Ok with the testcase.
> ---
> Strictly speaking the new alternative could be enabled from AVX2 onwards,
> but vmovddup can frequently be a shorter encoding (VEX2 vs VEX3).
> 
> It was suggested that the previously flawed %vmovddup alternative could
> use "xm" as source constraint. But then its destination would better also use
> "x", I think?
> ---
> v2: Use "* return ..." form. Set "mode" to XI for new alternative
> without AVX512VL.
> 
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -26033,19 +26033,35 @@
>  (symbol_ref "true")))])
> 
>  (define_insn "*vec_dupv2di"
> -  [(set (match_operand:V2DI 0 "register_operand" "=x,v,v,x")
> +  [(set (match_operand:V2DI 0 "register_operand" "=x,v,v,v,x")
>   (vec_duplicate:V2DI
> -   (match_operand:DI 1 "nonimmediate_operand" " 0,Yv,vm,0")))]
> +   (match_operand:DI 1 "nonimmediate_operand" "
> 0,Yv,vm,Yvm,0")))]
>"TARGET_SSE"
>"@
> punpcklqdq\t%0, %0
> vpunpcklqdq\t{%d1, %0|%0, %d1}
> +   * return TARGET_AVX512VL ? \"vpbroadcastq\t{%1, %0|%0, %1}\" :
> + \"vpbroadcastq\t{%1, %g0|%g0, %1}\";
> %vmovddup\t{%1, %0|%0, %1}
> movlhps\t%0, %0"
> -  [(set_attr "isa" "sse2_noavx,avx,sse3,noavx")
> -   (set_attr "type" "sselog1,sselog1,sselog1,ssemov")
> -   (set_attr "prefix" "orig,maybe_evex,maybe_vex,orig")
> -   (set_attr "mode" "TI,TI,DF,V4SF")])
> +  [(set_attr "isa" "sse2_noavx,avx,avx512f,sse3,noavx")
> +   (set_attr "type" "sselog1,sselog1,ssemov,sselog1,ssemov")
> +   (set_attr "prefix" "orig,maybe_evex,evex,maybe_vex,orig")
> +   (set (attr "mode")
> + (cond [(and (eq_attr "alternative" "2")
> + (match_test "!TARGET_AVX512VL"))
> +  (const_string "XI")
> +(eq_attr "alternative" "3")
> +  (const_string "DF")
> +(eq_attr "alternative" "4")
> +  (const_string "V4SF")
> +   ]
> +   (const_string "TI")))
> +   (set (attr "enabled")
> + (if_then_else
> +   (eq_attr "alternative" "2")
> +   (symbol_ref "TARGET_AVX512VL
> +|| (TARGET_AVX512F && !TARGET_PREFER_AVX256)")
> +   (const_string "*")))])
> 
>  (define_insn "avx2_vbroadcasti128_"
>[(set (match_operand:VI_256 0 "register_operand" "=x,v,v")

RE: [PATCH 3/3] AVX512 fully masked vectorization

2023-06-14 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Richard Biener 
> Sent: Wednesday, June 14, 2023 10:30 PM
> To: Andrew Stubbs 
> Cc: gcc-patches@gcc.gnu.org; richard.sandif...@arm.com; Jan Hubicka
> ; Liu, Hongtao ;
> kirill.yuk...@gmail.com
> Subject: Re: [PATCH 3/3] AVX512 fully masked vectorization
> 
> 
> 
> > Am 14.06.2023 um 16:27 schrieb Andrew Stubbs
> :
> >
> > On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
> >> This implemens fully masked vectorization or a masked epilog for
> >> AVX512 style masks which single themselves out by representing each
> >> lane with a single bit and by using integer modes for the mask (both
> >> is much like GCN).
> >> AVX512 is also special in that it doesn't have any instruction to
> >> compute the mask from a scalar IV like SVE has with while_ult.
> >> Instead the masks are produced by vector compares and the loop
> >> control retains the scalar IV (mainly to avoid dependences on mask
> >> generation, a suitable mask test instruction is available).
> >
> > This is also sounds like GCN. We currently use WHILE_ULT in the middle end
> which expands to a vector compare against a vector of stepped values. This
> requires an additional instruction to prepare the comparison vector
> (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns
> the DImode bitmask, so it works reasonably well.
> >
> >> Like RVV code generation prefers a decrementing IV though IVOPTs
> >> messes things up in some cases removing that IV to eliminate it with
> >> an incrementing one used for address generation.
> >> One of the motivating testcases is from PR108410 which in turn is
> >> extracted from x264 where large size vectorization shows issues with
> >> small trip loops.  Execution time there improves compared to classic
> >> AVX512 with AVX2 epilogues for the cases of less than 32 iterations.
> >> size   scalar 128 256 512512e512f
> >> 19.42   11.329.35   11.17   15.13   16.89
> >> 25.726.536.666.667.628.56
> >> 34.495.105.105.745.085.73
> >> 44.104.334.295.213.794.25
> >> 63.783.853.864.762.542.85
> >> 83.641.893.764.501.922.16
> >>123.562.213.754.261.261.42
> >>163.360.831.064.160.951.07
> >>203.391.421.334.070.750.85
> >>243.230.661.724.220.620.70
> >>283.181.092.044.200.540.61
> >>323.160.470.410.410.470.53
> >>343.160.670.610.560.440.50
> >>383.190.950.950.820.400.45
> >>423.090.581.211.130.360.40
> >> 'size' specifies the number of actual iterations, 512e is for a
> >> masked epilog and 512f for the fully masked loop.  From
> >> 4 scalar iterations on the AVX512 masked epilog code is clearly the
> >> winner, the fully masked variant is clearly worse and it's size
> >> benefit is also tiny.
> >
> > Let me check I understand correctly. In the fully masked case, there is a
> single loop in which a new mask is generated at the start of each iteration. 
> In
> the masked epilogue case, the main loop uses no masking whatsoever, thus
> avoiding the need for generating a mask, carrying the mask, inserting
> vec_merge operations, etc, and then the epilogue looks much like the fully
> masked case, but unlike smaller mode epilogues there is no loop because the
> eplogue vector size is the same. Is that right?
> 
> Yes.
What about vectorizer and unroll, when vector size is the same, unroll factor 
is N, but there're at most N - 1 iterations for epilogue loop, will there still 
a loop? 
> > This scheme seems like it might also benefit GCN, in so much as it 
> > simplifies
> the hot code path.
> >
> > GCN does not actually have smaller vector sizes, so there's no analogue to
> AVX2 (we pretend we have some smaller sizes, but that's because the
> middle end can't do masking everywhere yet, and it helps make some vector
> constants smaller, perhaps).
> >
> >> This patch does not enable using fully masked loops or masked
> >> epilogues by default.  More work on cost modeling and vectorization
> >> kind selection on x86_64 is necessary for this.
> >> Implementation wise this introduces
> LOOP_VINFO_PARTIAL_VECTORS_STYLE
> >> which could be exploited further to unify some of the flags we have
> >> right now but there didn't seem to be many easy things to merge, so
> >> I'm leaving this for followups.
> >> Mask requirements as registered by vect_record_loop_mask are kept in
> >> their original form and recorded in a hash_set now instead of being
> >> processed to a vector of rgroup_controls.  Instead that's now left to
> >> the final analysis phase which tries forming the rgroup_controls
> >> vector using while_ult and if that fails now tries AVX512 style which
>

RE: [PATCH] i386: Fix incorrect intrinsic signature for AVX512 s{lli|rai|rli}

2023-05-25 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Hu, Lin1 
> Sent: Thursday, May 25, 2023 3:52 PM
> To: Hongtao Liu 
> Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao ;
> ubiz...@gmail.com
> Subject: RE: [PATCH] i386: Fix incorrect intrinsic signature for AVX512
> s{lli|rai|rli}
> 
> OK, I update the change log and modify a part of format. The attached file is
> the new version.
LGTM.
> 
> -Original Message-
> From: Hongtao Liu 
> Sent: Thursday, May 25, 2023 11:40 AM
> To: Hu, Lin1 
> Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao ;
> ubiz...@gmail.com
> Subject: Re: [PATCH] i386: Fix incorrect intrinsic signature for AVX512
> s{lli|rai|rli}
> 
> On Thu, May 25, 2023 at 10:55 AM Hu, Lin1 via Gcc-patches
>  wrote:
> >
> > Hi all,
> >
> > This patch aims to fix incorrect intrinsic signature for
> _mm{512|256|}_s{lli|rai|rli}_epi*. And it has been tested on x86_64-pc-
> linux-gnu. OK for trunk?
> >
> > BRs,
> > Lin
> >
> > gcc/ChangeLog:
> >
> > PR target/109173
> > PR target/109174
> > * config/i386/avx512bwintrin.h (_mm512_srli_epi16): Change type
> from
> > int to const int.
> int to unsigned int or const int to const unsigned int.
> Others LGTM.
> > (_mm512_mask_srli_epi16): Ditto.
> > (_mm512_slli_epi16): Ditto.
> > (_mm512_mask_slli_epi16): Ditto.
> > (_mm512_maskz_slli_epi16): Ditto.
> > (_mm512_srai_epi16): Ditto.
> > (_mm512_mask_srai_epi16): Ditto.
> > (_mm512_maskz_srai_epi16): Ditto.
> > * config/i386/avx512vlintrin.h (_mm256_mask_srli_epi32): Ditto.
> > (_mm256_maskz_srli_epi32): Ditto.
> > (_mm_mask_srli_epi32): Ditto.
> > (_mm_maskz_srli_epi32): Ditto.
> > (_mm256_mask_srli_epi64): Ditto.
> > (_mm256_maskz_srli_epi64): Ditto.
> > (_mm_mask_srli_epi64): Ditto.
> > (_mm_maskz_srli_epi64): Ditto.
> > (_mm256_mask_srai_epi32): Ditto.
> > (_mm256_maskz_srai_epi32): Ditto.
> > (_mm_mask_srai_epi32): Ditto.
> > (_mm_maskz_srai_epi32): Ditto.
> > (_mm256_srai_epi64): Ditto.
> > (_mm256_mask_srai_epi64): Ditto.
> > (_mm256_maskz_srai_epi64): Ditto.
> > (_mm_srai_epi64): Ditto.
> > (_mm_mask_srai_epi64): Ditto.
> > (_mm_maskz_srai_epi64): Ditto.
> > (_mm_mask_slli_epi32): Ditto.
> > (_mm_maskz_slli_epi32): Ditto.
> > (_mm_mask_slli_epi64): Ditto.
> > (_mm_maskz_slli_epi64): Ditto.
> > (_mm256_mask_slli_epi32): Ditto.
> > (_mm256_maskz_slli_epi32): Ditto.
> > (_mm256_mask_slli_epi64): Ditto.
> > (_mm256_maskz_slli_epi64): Ditto.
> > (_mm_mask_srai_epi16): Ditto.
> > (_mm_maskz_srai_epi16): Ditto.
> > (_mm256_srai_epi16): Ditto.
> > (_mm256_mask_srai_epi16): Ditto.
> > (_mm_mask_slli_epi16): Ditto.
> > (_mm_maskz_slli_epi16): Ditto.
> > (_mm256_mask_slli_epi16): Ditto.
> > (_mm256_maskz_slli_epi16): Ditto.
> >
> > gcc/testsuite/ChangeLog:
> >
> > PR target/109173
> > PR target/109174
> > * gcc.target/i386/pr109173-1.c: New test.
> > * gcc.target/i386/pr109174-1.c: Ditto.
> > ---
> >  gcc/config/i386/avx512bwintrin.h   |  32 +++---
> >  gcc/config/i386/avx512fintrin.h|  58 +++
> >  gcc/config/i386/avx512vlbwintrin.h |  36 ---
> >  gcc/config/i386/avx512vlintrin.h   | 112 +++--
> >  gcc/testsuite/gcc.target/i386/pr109173-1.c |  57 +++
> >  gcc/testsuite/gcc.target/i386/pr109174-1.c |  45 +
> >  6 files changed, 236 insertions(+), 104 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr109173-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr109174-1.c
> >
> > diff --git a/gcc/config/i386/avx512bwintrin.h
> b/gcc/config/i386/avx512bwintrin.h
> > index 89790f7917b..791d4e35f32 100644
> > --- a/gcc/config/i386/avx512bwintrin.h
> > +++ b/gcc/config/i386/avx512bwintrin.h
> > @@ -2880,7 +2880,7 @@ _mm512_maskz_dbsad_epu8 (__mmask32 __U,
> __m512i __A, __m512i __B,
> >
> >  extern __inline __m512i
> >  __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> > -_mm512_srli_epi16 (__m512i __A, const int __imm)
> > +_mm512_srli_epi16 (__m512i __A, const unsigned int __imm)
> >  {
> >return (__m512i) __builtin_ia32_psrlwi512_mask ((__v32hi) __A, __imm,
> >   (__v32hi)
> > @@ -2891,7 +2891,7 @@ _mm512_srli_epi16 (__m512i __A, const int
> __imm)
> >  extern __inline __m512i
> >  __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> >  _mm512_mask_srli_epi16 (__m512i __W, __mmask32 __U, __m512i __A,
> > -   const int __imm)
> > +   const unsigned int __imm)
> >  {
> >return (__m512i) __builtin_ia32_psrlwi512_mask ((__v32hi) __A, __imm,
> >   (__v32hi) __W,
> > @@ -2910,7

RE: [PATCH 1/2] Use NO_REGS in cost calculation when the preferred register class are not known yet.

2023-04-22 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Vladimir Makarov 
> Sent: Saturday, April 22, 2023 3:26 AM
> To: Liu, Hongtao ; gcc-patches@gcc.gnu.org
> Cc: crazy...@gmail.com; hjl.to...@gmail.com
> Subject: Re: [PATCH 1/2] Use NO_REGS in cost calculation when the
> preferred register class are not known yet.
> 
> 
> On 4/19/23 20:46, liuhongt via Gcc-patches wrote:
> > 1547  /* If this insn loads a parameter from its stack slot, then it
> > 1548 represents a savings, rather than a cost, if the parameter is
> > 1549 stored in memory.  Record this fact.
> > 1550
> > 1551 Similarly if we're loading other constants from memory (constant
> > 1552 pool, TOC references, small data areas, etc) and this is the only
> > 1553 assignment to the destination pseudo.
> >
> > At that time, preferred regclass is unknown, and GENERAL_REGS is used
> > to record memory move cost, but it's not accurate especially for large
> > vector modes, i.e. 512-bit vector in x86 which would most probably
> > allocate with SSE_REGS instead of GENERAL_REGS. Using GENERAL_REGS
> > here will overestimate the cost of this load and make RA propagate the
> > memeory operand into many consume instructions which causes worse
> performance.
> 
> For this case GENERAL_REGS was used in GCC practically all the time. You can
> check this in the old regclass.c file (existing until IRA introduction).
> 
> But I guess it is ok to use NO_REGS for this to promote more usage of
> registers instead of equiv memory and as a lot of code was changed since
> then (the old versions of GCC even did not support vector regs).
> 
> Although it would be nice to do some benchmarking (SPEC is preferable) for
> such kind of changes.
Thanks, I've run SPEC2017 on x86 ICX, no big performance change, a little bit 
code size improvement as expected(codesize of 1 load + multi ops should be 
smaller than multi ciscy ops).  
> 
> On the other hand, I expect that any performance regression (if any) will be
> reported anyway.
> 
> The patch is ok for me.  You can commit it into the trunk.
> 
> Thank you for addressing this issue.
> 
> > Fortunately, NO_REGS is used to record the best scenario, so the patch
> > uses NO_REGS instead of GENERAL_REGS here, it could help RA in
> PR108707.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} and
> > aarch64-linux-gnu.
> > Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> > PR rtl-optimization/108707
> > * ira-costs.cc (scan_one_insn): Use NO_REGS instead of
> > GENERAL_REGS when preferred reg_class is not known.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/pr108707.c: New test.

RE: [PATCH] i386: Share AES xmm intrin with VAES

2023-04-18 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Jiang, Haochen 
> Sent: Wednesday, April 19, 2023 10:41 AM
> To: Hongtao Liu 
> Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao ;
> ubiz...@gmail.com
> Subject: RE: [PATCH] i386: Share AES xmm intrin with VAES
> 
> > > a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index
> > > 33e281901cf..e7d565a8389 100644
> > > --- a/gcc/config/i386/sse.md
> > > +++ b/gcc/config/i386/sse.md
> > > @@ -25107,67 +25107,71 @@
> > >
> > > 
> > > ;;
> > > ;;
> > >
> > >  (define_insn "aesenc"
> > > -  [(set (match_operand:V2DI 0 "register_operand" "=x,x")
> > > -   (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "0,x")
> > > -  (match_operand:V2DI 2 "vector_operand" "xBm,xm")]
> > > +  [(set (match_operand:V2DI 0 "register_operand" "=x,x,v")
> > > +   (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "0,x,v")
> > > +  (match_operand:V2DI 2 "vector_operand"
> > > + "xBm,xm,vm")]
> > >   UNSPEC_AESENC))]
> > > -  "TARGET_AES"
> > > +  "TARGET_AES || (TARGET_VAES && TARGET_AVX512VL)"
> > >"@
> > > aesenc\t{%2, %0|%0, %2}
> > > +   vaesenc\t{%2, %1, %0|%0, %1, %2}
> > > vaesenc\t{%2, %1, %0|%0, %1, %2}"
> > > -  [(set_attr "isa" "noavx,avx")
> > > +  [(set_attr "isa" "noavx,aes,avx512vl")
> > Shouldn't it be vaes_avx512vl and then remove " || (TARGET_VAES &&
> > TARGET_AVX512VL)" from condition.
> 
> Since VAES should not imply AES, we need that "|| (TARGET_VAES &&
> TARGET_AVX512VL)"
> 
> And there is no need to add vaes_avx512vl since the last alternative will only
> be hit when there is no aes. When there is no aes, the pattern will need vaes
> and avx512vl both or we could not use this pattern. avx512vl here is just 
> like a
> placeholder.
Ok, I see, then LGTM.
> 
> BRs,
> Haochen
> 
> > Similar for below patterns.
> > Others LGTM.
> > > (set_attr "type" "sselog1")
> > > (set_attr "prefix_extra" "1")
> > > -   (set_attr "prefix" "orig,vex")
> > > -   (set_attr "btver2_decode" "double,double")
> > > +   (set_attr "prefix" "orig,vex,evex")
> > > +   (set_attr "btver2_decode" "double,double,double")
> > > (set_attr "mode" "TI")])
> > >
> > >  (define_insn "aesenclast"
> > > -  [(set (match_operand:V2DI 0 "register_operand" "=x,x")
> > > -   (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "0,x")
> > > -  (match_operand:V2DI 2 "vector_operand" "xBm,xm")]
> > > +  [(set (match_operand:V2DI 0 "register_operand" "=x,x,v")
> > > +   (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "0,x,v")
> > > +  (match_operand:V2DI 2 "vector_operand"
> > > + "xBm,xm,vm")]
> > >   UNSPEC_AESENCLAST))]
> > > -  "TARGET_AES"
> > > +  "TARGET_AES || (TARGET_VAES && TARGET_AVX512VL)"
> > >"@
> > > aesenclast\t{%2, %0|%0, %2}
> > > +   vaesenclast\t{%2, %1, %0|%0, %1, %2}
> > > vaesenclast\t{%2, %1, %0|%0, %1, %2}"
> > > -  [(set_attr "isa" "noavx,avx")
> > > +  [(set_attr "isa" "noavx,aes,avx512vl")
> > > (set_attr "type" "sselog1")
> > > (set_attr "prefix_extra" "1")
> > > -   (set_attr "prefix" "orig,vex")
> > > -   (set_attr "btver2_decode" "double,double")
> > > +   (set_attr "prefix" "orig,vex,evex")
> > > +   (set_attr "btver2_decode" "double,double,double")
> > > (set_attr "mode" "TI")])
> > >
> > >  (define_insn "aesdec"
> > > -  [(set (match_operand:V2DI 0 "register_operand" "=x,x")
> > > -   (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "0,x")
> > > -  (match_operand:V2DI 2 "vector_operand" "xBm,xm")]
> > > +  [(set (match_operand:V2DI 0 "register_operand" "=x,x,v")
> > > +   (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "0,x,v")
> > > +  (match_operand:V2DI 2 "vector_operand"
> > > + "xBm,xm,vm")]
> > >   UNSPEC_AESDEC))]
> > > -  "TARGET_AES"
> > > +  "TARGET_AES || (TARGET_VAES && TARGET_AVX512VL)"
> > >"@
> > > aesdec\t{%2, %0|%0, %2}
> > > +   vaesdec\t{%2, %1, %0|%0, %1, %2}
> > > vaesdec\t{%2, %1, %0|%0, %1, %2}"
> > > -  [(set_attr "isa" "noavx,avx")
> > > +  [(set_attr "isa" "noavx,aes,avx512vl")
> > > (set_attr "type" "sselog1")
> > > (set_attr "prefix_extra" "1")
> > > -   (set_attr "prefix" "orig,vex")
> > > -   (set_attr "btver2_decode" "double,double")
> > > +   (set_attr "prefix" "orig,vex,evex")
> > > +   (set_attr "btver2_decode" "double,double,double")
> > > (set_attr "mode" "TI")])
> > >
> > >  (define_insn "aesdeclast"
> > > -  [(set (match_operand:V2DI 0 "register_operand" "=x,x")
> > > -   (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "0,x")
> > > -  (match_operand:V2DI 2 "vector_operand" "xBm,xm")]
> > > +  [(set (match_operand:V2DI 0 "register_operand" "=x,x,v")
> > > +   (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "0,x,v")
> > > +

RE: [PATCH] Re-arrange sections of i386 cpuid

2023-04-18 Thread Liu, Hongtao via Gcc-patches




> -Original Message-
> From: Mo, Zewei 
> Sent: Wednesday, April 19, 2023 10:03 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Liu, Hongtao ; ubiz...@gmail.com
> Subject: [PATCH] Re-arrange sections of i386 cpuid
> 
> Re-order i386 cpuid based on the order of CPUID.
> 
> gcc/ChangeLog:
> 
> * config/i386/cpuid.h: Open a new section for Extended Features
>   Leaf (%eax == 7, %ecx == 0) and Extended Features Sub-leaf (%eax
> == 7,
>   %ecx == 1).
Ok.
> ---
>  gcc/config/i386/cpuid.h | 35 +++
>  1 file changed, 19 insertions(+), 16 deletions(-)
> 
> diff --git a/gcc/config/i386/cpuid.h b/gcc/config/i386/cpuid.h index
> be162dd8c78..971781c2b91 100644
> --- a/gcc/config/i386/cpuid.h
> +++ b/gcc/config/i386/cpuid.h
> @@ -24,15 +24,6 @@
>  #ifndef _CPUID_H_INCLUDED
>  #define _CPUID_H_INCLUDED
> 
> -/* %eax */
> -#define bit_RAOINT   (1 << 3)
> -#define bit_AVXVNNI  (1 << 4)
> -#define bit_AVX512BF16   (1 << 5)
> -#define bit_CMPCCXADD(1 << 7)
> -#define bit_AMX_FP16 (1 << 21)
> -#define bit_HRESET   (1 << 22)
> -#define bit_AVXIFMA  (1 << 23)
> -
>  /* %ecx */
>  #define bit_SSE3 (1 << 0)
>  #define bit_PCLMUL   (1 << 1)
> @@ -52,10 +43,7 @@
>  #define bit_RDRND(1 << 30)
> 
>  /* %edx */
> -#define bit_AVXVNNIINT8 (1 << 4)
> -#define bit_AVXNECONVERT (1 << 5)
>  #define bit_CMPXCHG8B(1 << 8)
> -#define bit_PREFETCHI(1 << 14)
>  #define bit_CMOV (1 << 15)
>  #define bit_MMX  (1 << 23)
>  #define bit_FXSAVE   (1 << 24)
> @@ -84,7 +72,7 @@
>  #define bit_CLZERO   (1 << 0)
>  #define bit_WBNOINVD (1 << 9)
> 
> -/* Extended Features (%eax == 7) */
> +/* Extended Features Leaf (%eax == 7, %ecx == 0) */
>  /* %ebx */
>  #define bit_FSGSBASE (1 << 0)
>  #define bit_SGX (1 << 2)
> @@ -132,9 +120,9 @@
>  #define bit_AVX5124VNNIW (1 << 2)
>  #define bit_AVX5124FMAPS (1 << 3)
>  #define bit_AVX512VP2INTERSECT   (1 << 8)
> -#define bit_AVX512FP16   (1 << 23)
> -#define bit_IBT  (1 << 20)
> -#define bit_UINTR (1 << 5)
> +#define bit_AVX512FP16   (1 << 23)
> +#define bit_IBT (1 << 20)
> +#define bit_UINTR   (1 << 5)
>  #define bit_PCONFIG  (1 << 18)
>  #define bit_SERIALIZE(1 << 14)
>  #define bit_TSXLDTRK(1 << 16)
> @@ -142,6 +130,21 @@
>  #define bit_AMX_TILE(1 << 24)
>  #define bit_AMX_INT8(1 << 25)
> 
> +/* Extended Features Sub-leaf (%eax == 7, %ecx == 1) */
> +/* %eax */
> +#define bit_RAOINT  (1 << 3)
> +#define bit_AVXVNNI (1 << 4)
> +#define bit_AVX512BF16  (1 << 5)
> +#define bit_CMPCCXADD   (1 << 7)
> +#define bit_AMX_FP16(1 << 21)
> +#define bit_HRESET  (1 << 22)
> +#define bit_AVXIFMA (1 << 23)
> +
> +/* %edx */
> +#define bit_AVXVNNIINT8 (1 << 4)
> +#define bit_AVXNECONVERT (1 << 5)
> +#define bit_PREFETCHI (1 << 14)
> +
>  /* Extended State Enumeration Sub-leaf (%eax == 0xd, %ecx == 1) */
>  #define bit_XSAVEOPT (1 << 0)
>  #define bit_XSAVEC   (1 << 1)
> --
> 2.31.1

RE: [PATCH] Check hard_regno_mode_ok before setting lowest memory move cost for the mode with different reg classes.

2023-04-05 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Vladimir Makarov 
> Sent: Wednesday, April 5, 2023 8:59 PM
> To: Jeff Law ; Liu, Hongtao
> ; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] Check hard_regno_mode_ok before setting lowest
> memory move cost for the mode with different reg classes.
> 
> 
> On 4/4/23 21:29, Jeff Law wrote:
> >
> >
> > On 4/3/23 23:13, liuhongt via Gcc-patches wrote:
> >> There's a potential performance issue when backend returns some
> >> unreasonable value for the mode which can be never be allocate with
> >> reg class.
> >>
> >> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> >> Ok for trunk(or GCC14 stage1)?
> >>
> >> gcc/ChangeLog:
> >>
> >> PR rtl-optimization/109351
> >> * ira.cc (setup_class_subset_and_memory_move_costs): Check
> >> hard_regno_mode_ok before setting lowest memory move cost for
> >> the mode with different reg classes.
> > Not a regression *and* changing register allocation.  This seems like
> > it should defer to gcc-14.
> >
> Yes, I am agree.  It should wait for gcc-14, especially when we are close to 
> the
> release. Also the testing x86-64 is not enough for such changes (although I
> tried ppc64le and did not find any problem).
> 
> Cost related patches for RA frequently result in new testsuite failures on
> some targets.  Even if the change seems obvious and expected to improve
> the generated code.
> 
> Target dependent code sometimes defines correctly the costs only for some
> possible cases and making less dependent from this pitfall is good.  So I 
> think
> the patch moves us to the right direction.
> 
> The patch is ok for me to commit it to the trunk after the gcc-13 release and 
> if
> arm64 testing shows no GCC testsuite regression.
Bootstrapped and regtested on aarch64-unknown-linux-gnu.
Waiting for GCC14.
> 
> Thank you for working on this issue.
>

RE: [PATCH] i386: Fix up -Wuninitialized warnings in avx512erintrin.h [PR105593]

2023-01-31 Thread Liu, Hongtao via Gcc-patches




> -Original Message-
> From: Jakub Jelinek 
> Sent: Tuesday, January 31, 2023 4:09 PM
> To: Liu, Hongtao ; Uros Bizjak 
> Cc: gcc-patches@gcc.gnu.org
> Subject: [PATCH] i386: Fix up -Wuninitialized warnings in avx512erintrin.h
> [PR105593]
> 
> Hi!
> 
> As reported in the PR, there are some -Wuninitialized warnings in
> avx512erintrin.h.  One can see that by compiling sse-23.c testcase with -
> Wuninitialized (or when actually using those intrinsics).
> Those 6 spots use an uninitialized variable and pass it as one of the argument
> to a builtin with constant mask -1, because there is no unmasked builtin.  It 
> is
> true that expansion of those builtins into RTL will see mask is all ones and
> ignore the unneeded argument, but -Wuninitialized is diagnosed on GIMPLE
> and on GIMPLE these builtins are just builtin calls.
> avx512fintrin.h and other headers use in these cases the _mm*_undefined_*
> () intrinsics, like:
>   return (__m512i) __builtin_ia32_psrav8di_mask ((__v8di) __X,
>  (__v8di) __Y,
>  (__v8di)
>  _mm512_undefined_epi32 (),
>  (__mmask8) -1); etc. and the 
> following patch does
> the same for avx512erintrin.h.
> With the recent changes in C++ FE and the _mm*_undefined_* intrinsics, we
> don't emit -Wuninitialized warnings for those (previously we didn't just in C
> due to self-initialization).  Of course we could also just self-initialize 
> these
> uninitialized vars and add the #pragma GCC diagnostic dances around it, but
> using the intrinsics is consistent with the rest and IMHO cleaner.
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
Ok, thanks.
> 
> 2023-01-31  Jakub Jelinek  
> 
>   PR c++/105593
>   * config/i386/avx512erintrin.h (_mm512_exp2a23_round_pd,
>   _mm512_exp2a23_round_ps, _mm512_rcp28_round_pd,
> _mm512_rcp28_round_ps,
>   _mm512_rsqrt28_round_pd, _mm512_rsqrt28_round_ps): Use
>   _mm512_undefined_pd () or _mm512_undefined_ps () instead of
> using
>   uninitialized automatic variable __W.
> 
>   * gcc.target/i386/sse-23.c: Add -Wuninitialized to dg-options.
> 
> --- gcc/config/i386/avx512erintrin.h.jj   2023-01-16 11:52:15.944736113
> +0100
> +++ gcc/config/i386/avx512erintrin.h  2023-01-30 20:53:08.057769691
> +0100
> @@ -51,9 +51,8 @@ extern __inline __m512d  __attribute__
> ((__gnu_inline__, __always_inline__, __artificial__))
> _mm512_exp2a23_round_pd (__m512d __A, int __R)  {
> -  __m512d __W;
>return (__m512d) __builtin_ia32_exp2pd_mask ((__v8df) __A,
> -(__v8df) __W,
> +(__v8df) _mm512_undefined_pd
> (),
>  (__mmask8) -1, __R);
>  }
> 
> @@ -79,9 +78,8 @@ extern __inline __m512  __attribute__
> ((__gnu_inline__, __always_inline__, __artificial__))
> _mm512_exp2a23_round_ps (__m512 __A, int __R)  {
> -  __m512 __W;
>return (__m512) __builtin_ia32_exp2ps_mask ((__v16sf) __A,
> -   (__v16sf) __W,
> +   (__v16sf) _mm512_undefined_ps
> (),
> (__mmask16) -1, __R);
>  }
> 
> @@ -107,9 +105,8 @@ extern __inline __m512d  __attribute__
> ((__gnu_inline__, __always_inline__, __artificial__))
> _mm512_rcp28_round_pd (__m512d __A, int __R)  {
> -  __m512d __W;
>return (__m512d) __builtin_ia32_rcp28pd_mask ((__v8df) __A,
> - (__v8df) __W,
> + (__v8df)
> _mm512_undefined_pd (),
>   (__mmask8) -1, __R);
>  }
> 
> @@ -135,9 +132,8 @@ extern __inline __m512  __attribute__
> ((__gnu_inline__, __always_inline__, __artificial__))
> _mm512_rcp28_round_ps (__m512 __A, int __R)  {
> -  __m512 __W;
>return (__m512) __builtin_ia32_rcp28ps_mask ((__v16sf) __A,
> -(__v16sf) __W,
> +(__v16sf) _mm512_undefined_ps
> (),
>  (__mmask16) -1, __R);
>  }
> 
> @@ -229,9 +225,8 @@ extern __inline __m512d  __attribute__
> ((__gnu_inline__, __always_inline__, __artificial__))
> _mm512_rsqrt28_round_pd (__m512d __A, int __R)  {
> -  __m512d __W;
>return (__m512d) __builtin_ia32_rsqrt28pd_mask ((__v8df) __A,
> -   (__v8df) __W,
> +   (__v8df)
> _mm512_undefined_pd (),
> (__mmask8) -1, __R);
>  }
> 
> @@ -257,9 +252,8 @@ extern __inline __m512  __attribute__
> ((__gnu_inline__, __always_inline__, __artificial__))
> _mm512_rsqrt28_round_ps (__m512 __A, int __R)  {
> -  __m512 __W;
>

RE: [PATCH 2/4] Initial Emeraldrapids Support

2023-01-03 Thread Liu, Hongtao via Gcc-patches

There are actually only two patches, not four, and the subject *Patch 2/4* 
should be a typo.

> -Original Message-
> From: Hu, Lin1 
> Sent: Tuesday, January 3, 2023 4:37 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Liu, Hongtao ; ubiz...@gmail.com
> Subject: [PATCH 2/4] Initial Emeraldrapids Support
> 
> gcc/ChangeLog:
> 
>   * common/config/i386/cpuinfo.h (get_intel_cpu): Handle
> Emeraldrapids.
>   * common/config/i386/i386-common.cc: Add Emeraldrapids.
> ---
>  gcc/common/config/i386/cpuinfo.h  | 2 ++
>  gcc/common/config/i386/i386-common.cc | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/gcc/common/config/i386/cpuinfo.h
> b/gcc/common/config/i386/cpuinfo.h
> index bde231c07ee..3729b0f14a5 100644
> --- a/gcc/common/config/i386/cpuinfo.h
> +++ b/gcc/common/config/i386/cpuinfo.h
> @@ -551,6 +551,8 @@ get_intel_cpu (struct __processor_model *cpu_model,
>break;
>  case 0x8f:
>/* Sapphire Rapids.  */
> +case 0xcf:
> +  /* Emerald Rapids.  */
>cpu = "sapphirerapids";
>CHECK___builtin_cpu_is ("corei7");
>CHECK___builtin_cpu_is ("sapphirerapids"); diff --git
> a/gcc/common/config/i386/i386-common.cc b/gcc/common/config/i386/i386-
> common.cc
> index 7751265aff4..026926d8b41 100644
> --- a/gcc/common/config/i386/i386-common.cc
> +++ b/gcc/common/config/i386/i386-common.cc
> @@ -2465,6 +2465,8 @@ const pta processor_alias_table[] =
>  M_CPU_SUBTYPE (INTEL_COREI7_COOPERLAKE), P_PROC_AVX512F},
>{"sapphirerapids", PROCESSOR_SAPPHIRERAPIDS, CPU_HASWELL,
> PTA_SAPPHIRERAPIDS,
>  M_CPU_SUBTYPE (INTEL_COREI7_SAPPHIRERAPIDS), P_PROC_AVX512F},
> +  {"emeraldrapids", PROCESSOR_SAPPHIRERAPIDS, CPU_HASWELL,
> PTA_SAPPHIRERAPIDS,
> +M_CPU_SUBTYPE (INTEL_COREI7_SAPPHIRERAPIDS), P_PROC_AVX512F},
>{"alderlake", PROCESSOR_ALDERLAKE, CPU_HASWELL, PTA_ALDERLAKE,
>  M_CPU_SUBTYPE (INTEL_COREI7_ALDERLAKE), P_PROC_AVX2},
>{"raptorlake", PROCESSOR_ALDERLAKE, CPU_HASWELL, PTA_ALDERLAKE,
> --
> 2.18.2

RE: [PATCH] [x86] x86: Don't add crtfastmath.o for -shared and add a new option -mdaz-ftz to enable FTZ and DAZ flags in MXCSR.

2022-12-14 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Richard Biener 
> Sent: Wednesday, December 14, 2022 4:23 PM
> To: Jakub Jelinek 
> Cc: Liu, Hongtao ; gcc-patches@gcc.gnu.org;
> crazy...@gmail.com; hjl.to...@gmail.com; ubiz...@gmail.com
> Subject: Re: [PATCH] [x86] x86: Don't add crtfastmath.o for -shared and add a
> new option -mdaz-ftz to enable FTZ and DAZ flags in MXCSR.
> 
> On Wed, Dec 14, 2022 at 9:16 AM Jakub Jelinek  wrote:
> >
> > On Wed, Dec 14, 2022 at 09:08:02AM +0100, Richard Biener via Gcc-patches
> wrote:
> > > On Wed, Dec 14, 2022 at 3:21 AM liuhongt via Gcc-patches
> > >  wrote:
> > > >
> > > > Don't add crtfastmath.o for -shared to avoid changing the MXCSR
> > > > register when loading a shared library.  crtfastmath.o will be
> > > > used only when building executables.
> > > >
> > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > > Ok for trunk?
> > >
> > > You reject negative -mdaz-ftz but wouldn't that be useful with
> > > -Ofast -mno-daz-ftz since there's otherwise no way to avoid that?
> >
> > Agreed.
> > I even wonder if the best wouldn't be to make the option effectively
> > three state, default, no and yes, where if the option isn't specified
> > at all, then crtfastmath.o* is linked as is now except for -shared, if
> > it is -mno-daz-ftz, then it is never linked in regardless of other
> > options and if it is -mdaz-ftz, then it is linked even for -shared.
> 
> Possibly.  I'd also suggest to split the changed -shared handling to a 
> separate
> patch since people may want to backport this and it should be applicable to
> all other targets with similar handling.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522#c26
So patch in the upper link is ok for trunk?
I'll change -mdaz-ftz part as a separate patch.
> 
> > > > --- a/gcc/config/i386/i386.opt
> > > > +++ b/gcc/config/i386/i386.opt
> > > > @@ -420,6 +420,10 @@ mpc80
> > > >  Target RejectNegative
> > > >  Set 80387 floating-point precision to 80-bit.
> > > >
> > > > +mdaz-ftz
> > > > +Target RejectNegative
> > > > +Set the FTZ and DAZ Flags.
> >
> > As the option is only used in the driver, shouldn't it be marked
> > Driver and not Target?  It doesn't need to be saved/restored on every
> > cfun switch etc.
> >
> > > > +@item -mdaz-ftz
> > > > +@opindex mdaz-ftz
> > > > +
> > > > +the flush-to-zero (FTZ) and denormals-are-zero (DAZ) flags in the
> > > > +MXCSR register
> >
> > Shouldn't description start with capital letter?
> >
> > > > +are used to control floating-point calculations.SSE and AVX
> > > > +instructions including scalar and vector instructions could
> > > > +benefit from enabling the FTZ and DAZ flags when @option{-mdaz-ftz}
> is specified.
> > >
> > > Maybe say that the MXCSR register is set at program start to achieve
> > > this when the flag is specified at _link_ time and say this switch
> > > is ignored when -shared is specified?
> >
> > Jakub
> >

RE: [PATCH] i386: Only enable small loop unrolling in backend [PR 107602]

2022-11-20 Thread Liu, Hongtao via Gcc-patches




> -Original Message-
> From: Wang, Hongyu 
> Sent: Saturday, November 19, 2022 2:26 PM
> To: gcc-patches@gcc.gnu.org
> Cc: richard.guent...@gmail.com; ubiz...@gmail.com; Liu, Hongtao
> 
> Subject: [PATCH] i386: Only enable small loop unrolling in backend [PR 107602]
> 
> Hi,
> 
> Followed by the discussion in pr107602, -munroll-only-small-loops Does not
PR107692?
> turns on/off -funroll-loops, and current check in pass_rtl_unroll_loops::gate
> would cause -funroll-loops do not take effect. Revert the change about
> targetm.loop_unroll_adjust and apply the backend option change to strictly
> follow the rule that -funroll-loops takes full control of loop unrolling, and
> munroll-only-small-loops just change its behavior to unroll small size loops.
> 
> Bootstrapped and regtested on x86-64-pc-linux-gnu.
> 
> Ok for trunk?
> 
> gcc/ChangeLog:
> 
>   PR target/107602
>   * common/config/i386/i386-common.cc (ix86_optimization_table):
>   Enable loop unroll O2, disable -fweb and -frename-registers
>   by default.
>   * config/i386/i386-options.cc
>   (ix86_override_options_after_change):
>   Disable small loop unroll when funroll-loops enabled, reset
>   cunroll_grow_size when it is not explicitly enabled.
>   (ix86_option_override_internal): Call
>   ix86_override_options_after_change instead of calling
>   ix86_recompute_optlev_based_flags and ix86_default_align
>   separately.
>   * config/i386/i386.cc (ix86_loop_unroll_adjust): Adjust unroll
>   factor if -munroll-only-small-loops enabled.
>   * loop-init.cc (pass_rtl_unroll_loops::gate): Do not enable
>   loop unrolling for -O2-speed.
>   (pass_rtl_unroll_loops::execute): Rmove
>   targetm.loop_unroll_adjust check.
> 
> gcc/testsuite/ChangeLog:
> 
>   PR target/107602
>   * gcc.target/i386/pr86270.c: Add -fno-unroll-loops.
>   * gcc.target/i386/pr93002.c: Likewise.
> ---
>  gcc/common/config/i386/i386-common.cc   |  8 ++
>  gcc/config/i386/i386-options.cc | 34 ++---
>  gcc/config/i386/i386.cc | 18 -
>  gcc/loop-init.cc| 11 +++-
>  gcc/testsuite/gcc.target/i386/pr86270.c |  2 +-
> gcc/testsuite/gcc.target/i386/pr93002.c |  2 +-
>  6 files changed, 49 insertions(+), 26 deletions(-)
> 
> diff --git a/gcc/common/config/i386/i386-common.cc
> b/gcc/common/config/i386/i386-common.cc
> index 6ce2a588adc..660a977b68b 100644
> --- a/gcc/common/config/i386/i386-common.cc
> +++ b/gcc/common/config/i386/i386-common.cc
> @@ -1808,7 +1808,15 @@ static const struct default_options
> ix86_option_optimization_table[] =
>  /* The STC algorithm produces the smallest code at -Os, for x86.  */
>  { OPT_LEVELS_2_PLUS, OPT_freorder_blocks_algorithm_, NULL,
>REORDER_BLOCKS_ALGORITHM_STC },
> +
> +/* Turn on -funroll-loops with -munroll-only-small-loops to enable small
> +   loop unrolling at -O2.  */
> +{ OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_funroll_loops, NULL, 1 },
>  { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_munroll_only_small_loops, NULL,
> 1 },
> +/* Turns off -frename-registers and -fweb which are enabled by
> +   funroll-loops.  */
> +{ OPT_LEVELS_ALL, OPT_frename_registers, NULL, 0 },
> +{ OPT_LEVELS_ALL, OPT_fweb, NULL, 0 },
>  /* Turn off -fschedule-insns by default.  It tends to make the
> problem with not enough registers even worse.  */
>  { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 }, diff --git
> a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc index
> e5c77f3a84d..bc1d36e36a8 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -1838,8 +1838,37 @@ ix86_recompute_optlev_based_flags (struct
> gcc_options *opts,  void  ix86_override_options_after_change (void)  {
> +  /* Default align_* from the processor table.  */
>ix86_default_align (_options);
> +
>ix86_recompute_optlev_based_flags (_options, _options_set);
> +
> +  /* Disable unrolling small loops when there's explicit
> + -f{,no}unroll-loop.  */
> +  if ((OPTION_SET_P (flag_unroll_loops))
> + || (OPTION_SET_P (flag_unroll_all_loops)
> +  && flag_unroll_all_loops))
> +{
> +  if (!OPTION_SET_P (ix86_unroll_only_small_loops))
> + ix86_unroll_only_small_loops = 0;
> +  /* Re-enable -frename-registers and -fweb if funroll-loops
> +  enabled.  */
> +  if (!OPTION_SET_P (flag_web))
> + flag_web = flag_unroll_loops;
> +  if (!OPTION_SET_P (flag_rename_registers))
> + flag_rename_registers = flag_unroll_loops;
> +  /* -fcunroll-grow-size default follws -f[no]-unroll-loops.  */
> +  if (!OPTION_SET_P (flag_cunroll_grow_size))
> + flag_cunroll_grow_size = flag_unroll_loops
> +  || flag_peel_loops
> +  || optimize >= 3;
> +}
> +  else
> +{
> +  if (!OPTION_SET_P (flag_cunroll_grow_size))
> +

RE: [PATCH 4/6] Support Intel AVX-NE-CONVERT

2022-10-30 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Kong, Lingling 
> Sent: Friday, October 28, 2022 4:57 PM
> To: Hongtao Liu 
> Cc: Liu, Hongtao ; gcc-patches@gcc.gnu.org; Jiang,
> Haochen 
> Subject: RE: [PATCH 4/6] Support Intel AVX-NE-CONVERT
> 
> Hi,
> 
> Because we  switch intrinsics for avx512bf16 to the new type __bf16. Now we
> could use m128/256bh for vector bf16 type instead of m128/256bf16.
> And unified builtin for avx512bf16/avxneconvert.
Ok.
> 
> Thanks,
> Lingling
> 
> > -Original Message-
> > From: Hongtao Liu 
> > Sent: Tuesday, October 25, 2022 1:23 PM
> > To: Kong, Lingling 
> > Cc: Liu, Hongtao ; gcc-patches@gcc.gnu.org;
> > Jiang, Haochen 
> > Subject: Re: [PATCH 4/6] Support Intel AVX-NE-CONVERT
> >
> > On Mon, Oct 24, 2022 at 2:20 PM Kong, Lingling
> > 
> > wrote:
> > >
> > > > From: Gcc-patches
> > > > 
> > > > On Behalf Of Hongtao Liu via Gcc-patches
> > > > Sent: Monday, October 17, 2022 1:47 PM
> > > > To: Jiang, Haochen 
> > > > Cc: Liu, Hongtao ; gcc-patches@gcc.gnu.org
> > > > Subject: Re: [PATCH 4/6] Support Intel AVX-NE-CONVERT
> > > >
> > > > On Fri, Oct 14, 2022 at 3:58 PM Haochen Jiang via Gcc-patches
> > > >  wrote:
> > > > >
> > > > > From: Kong Lingling 
> > > > > +(define_insn "vbcstne2ps_"
> > > > > +  [(set (match_operand:VF1_128_256 0 "register_operand" "=x")
> > > > > +(vec_duplicate:VF1_128_256
> > > > > + (unspec:SF
> > > > > +  [(match_operand:HI 1 "memory_operand" "m")]
> > > > > +  VBCSTNE)))]
> > > > > +  "TARGET_AVXNECONVERT"
> > > > > +  "vbcstne2ps\t{%1, %0|%0, %1}"
> > > > > +  [(set_attr "prefix" "vex")
> > > > > +  (set_attr "mode" "")])
> > > > Since jakub has support bf16 software emulation, can we rewrite it
> > > > with general rtl ir without unspec?
> > > > Like (float_extend:SF (match_operand:BF "memory_operand" "m")
> > > > > +
> > > > > +(define_int_iterator VCVTNEBF16
> > > > > +  [UNSPEC_VCVTNEEBF16SF
> > > > > +   UNSPEC_VCVTNEOBF16SF])
> > > > > +
> > > > > +(define_int_attr vcvtnebf16type
> > > > > +  [(UNSPEC_VCVTNEEBF16SF "ebf16")
> > > > > +   (UNSPEC_VCVTNEOBF16SF "obf16")]) (define_insn
> > > > > +"vcvtne2ps_"
> > > > > +  [(set (match_operand:VF1_128_256 0 "register_operand" "=x")
> > > > > +(unspec:VF1_128_256
> > > > > +  [(match_operand: 1 "memory_operand" "m")]
> > > > > + VCVTNEBF16))]
> > > > > +  "TARGET_AVXNECONVERT"
> > > > > +  "vcvtne2ps\t{%1, %0|%0, %1}"
> > > > > +  [(set_attr "prefix" "vex")
> > > > > +   (set_attr "mode" "")])
> > > > Similar for this one and all those patterns below.
> > >
> > > That's great! Thanks for the review!
> > > Now rewrite it without unspec and use float_extend for new define_insn.
> > Ok.
> > >
> > > Thanks
> > > Lingling
> > >
> > >
> >
> >
> > --
> > BR,
> > Hongtao

RE: [PATCH] MAINTAINERS: Add myself for write after approval

2022-10-12 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Cui, Lili 
> Sent: Wednesday, October 12, 2022 3:50 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Liu, Hongtao 
> Subject: [PATCH] MAINTAINERS: Add myself for write after approval
> 
> Hi,
> 
> I want to add myself in MAINTANINER for write after approval.
> 
> OK for master?
Obvious fixes can be committed without prior 
approval(https://gcc.gnu.org/gitwrite.html).
This can be considered as an obvious fix(But you still need to send the patch 
out like this).
> 
> ChangeLog:
>   * MAINTAINERS (Write After Approval): Add myself.
> 
> ---
>  MAINTAINERS | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 11fa8bc6dbd..e4e7349a6d9 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -377,6 +377,7 @@ Andrea Corallo
>   
>  Christian Cornelssen 
>  Ludovic Courtès  
>  Lawrence Crowl   
> +Lili Cui 
>  Ian Dall 
>  David Daney
>   
>  Robin Dapp   
> --
> 2.17.1

RE: [PATCH] Remove AVX512_VP2INTERSECT from PTA_SAPPHIRERAPIDS

2022-10-11 Thread Liu, Hongtao via Gcc-patches




> -Original Message-
> From: Cui, Lili 
> Sent: Wednesday, October 12, 2022 11:00 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Liu, Hongtao ; ubiz...@gmail.com; Lu, Hongjiu
> 
> Subject: [PATCH] Remove AVX512_VP2INTERSECT from PTA_SAPPHIRERAPIDS
> 
> Hi Hontao,
> 
> This patch is to remove AVX512_VP2INTERSECT from PTA_SAPPHIRERAPIDS.
> The new intel ISE removes AVX512_VP2INTERSECT from SAPPHIRERAPIDS,
> AVX512_VP2INTERSECT is only supportted in Tigerlake.
> 
> Hi Uros,
> 
> This patch is to remove AVX512_VP2INTERSECT from PTA_SAPPHIRERAPIDS.
> The new intel ISE removes AVX512_VP2INTERSECT from SAPPHIRERAPIDS,
> AVX512_VP2INTERSECT is only supportted in Tigerlake.
> 
> Bootstrap is ok, and no regressions for i386/x86-64 testsuite.
> 
> OK for master?
Yes, thanks.
> 
> 
> gcc/ChangeLog:
> 
>   * config/i386/driver-i386.cc (host_detect_local_cpu):
>   Move sapphirerapids out of AVX512_VP2INTERSECT.
>   * config/i386/i386.h: Remove AVX512_VP2INTERSECT from
> PTA_SAPPHIRERAPIDS
>   * doc/invoke.texi: Remove AVX512_VP2INTERSECT from
> SAPPHIRERAPIDS
> ---
>  gcc/config/i386/driver-i386.cc | 13 +
>  gcc/config/i386/i386.h |  7 +++
>  gcc/doc/invoke.texi|  8 
>  3 files changed, 12 insertions(+), 16 deletions(-)
> 
> diff --git a/gcc/config/i386/driver-i386.cc b/gcc/config/i386/driver-i386.cc 
> index
> 3c702fdca33..ef567045c67 100644
> --- a/gcc/config/i386/driver-i386.cc
> +++ b/gcc/config/i386/driver-i386.cc
> @@ -589,15 +589,12 @@ const char *host_detect_local_cpu (int argc, const
> char **argv)
> /* This is unknown family 0x6 CPU.  */
> if (has_feature (FEATURE_AVX))
>   {
> +   /* Assume Tiger Lake */
> if (has_feature (FEATURE_AVX512VP2INTERSECT))
> - {
> -   if (has_feature (FEATURE_TSXLDTRK))
> - /* Assume Sapphire Rapids.  */
> - cpu = "sapphirerapids";
> -   else
> - /* Assume Tiger Lake */
> - cpu = "tigerlake";
> - }
> + cpu = "tigerlake";
> +   /* Assume Sapphire Rapids.  */
> +   else if (has_feature (FEATURE_TSXLDTRK))
> + cpu = "sapphirerapids";
> /* Assume Cooper Lake */
> else if (has_feature (FEATURE_AVX512BF16))
>   cpu = "cooperlake";
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h index
> 900a3bc3673..372a2cff8fe 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -2326,10 +2326,9 @@ constexpr wide_int_bitmask PTA_ICELAKE_SERVER
> = PTA_ICELAKE_CLIENT  constexpr wide_int_bitmask PTA_TIGERLAKE =
> PTA_ICELAKE_CLIENT | PTA_MOVDIRI
>| PTA_MOVDIR64B | PTA_CLWB | PTA_AVX512VP2INTERSECT | PTA_KL |
> PTA_WIDEKL;  constexpr wide_int_bitmask PTA_SAPPHIRERAPIDS =
> PTA_ICELAKE_SERVER | PTA_MOVDIRI
> -  | PTA_MOVDIR64B | PTA_AVX512VP2INTERSECT | PTA_ENQCMD |
> PTA_CLDEMOTE
> -  | PTA_PTWRITE | PTA_WAITPKG | PTA_SERIALIZE | PTA_TSXLDTRK |
> PTA_AMX_TILE
> -  | PTA_AMX_INT8 | PTA_AMX_BF16 | PTA_UINTR | PTA_AVXVNNI |
> PTA_AVX512FP16
> -  | PTA_AVX512BF16;
> +  | PTA_MOVDIR64B | PTA_ENQCMD | PTA_CLDEMOTE | PTA_PTWRITE |
> + PTA_WAITPKG  | PTA_SERIALIZE | PTA_TSXLDTRK | PTA_AMX_TILE |
> + PTA_AMX_INT8 | PTA_AMX_BF16  | PTA_UINTR | PTA_AVXVNNI |
> + PTA_AVX512FP16 | PTA_AVX512BF16;
>  constexpr wide_int_bitmask PTA_KNL = PTA_BROADWELL | PTA_AVX512PF
>| PTA_AVX512ER | PTA_AVX512F | PTA_AVX512CD | PTA_PREFETCHWT1;
> constexpr wide_int_bitmask PTA_BONNELL = PTA_CORE2 | PTA_MOVBE; diff --
> git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index
> 271c8bb8468..a9ecc4426a4 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -32057,11 +32057,11 @@ Intel sapphirerapids CPU with 64-bit extensions,
> MOVBE, MMX, SSE, SSE2, SSE3,  SSSE3, SSE4.1, SSE4.2, POPCNT, CX16, SAHF,
> FXSR, AVX, XSAVE, PCLMUL, FSGSBASE,  RDRND, F16C, AVX2, BMI, BMI2, LZCNT,
> FMA, MOVBE, HLE, RDSEED, ADCX, PREFETCHW,  AES, CLFLUSHOPT, XSAVEC,
> XSAVES, SGX, AVX512F, AVX512VL, AVX512BW, AVX512DQ, -AVX512CD, PKU,
> AVX512VBMI, AVX512IFMA, SHA, AVX512VNNI, GFNI, VAES, AVX512VBMI2
> +AVX512CD, PKU, AVX512VBMI, AVX512IFMA, SHA, AVX512VNNI, GFNI, VAES,
> +AVX512VBMI2,
>  VPCLMULQDQ, AVX512BITALG, RDPID, AVX512VPOPCNTDQ, PCONFIG,
> WBNOINVD, CLWB, -MOVDIRI, MOVDIR64B, AVX512VP2INTERSECT, ENQCMD,
> CLDEMOTE, PTWRITE, WAITPKG, -SERIALIZE, TSXLDTRK, UINTR, AMX-BF16,
> AMX-TILE, AMX-INT8, AVX-VNNI, AVX512FP16 -and AVX512BF16 instruction set
> support.
> +MOVDIRI, MOVDIR64B, ENQCMD, CLDEMOTE, PTWRITE, WAITPKG, SERIALIZE,
> +TSXLDTRK, UINTR, AMX-BF16, AMX-TILE, AMX-INT8, AVX-VNNI, AVX512FP16
> and
> +AVX512BF16 instruction set support.
> 
>  @item alderlake
>  Intel Alderlake CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3,
> SSSE3,
> --
> 2.17.1
> 
> Thanks,
> Lili.
> Thanks

RE: [PATCH] [x86] Add define_insn_and_split to support general version of "kxnor".

2022-10-11 Thread Liu, Hongtao via Gcc-patches




> -Original Message-
> From: Jakub Jelinek 
> Sent: Tuesday, October 11, 2022 9:59 PM
> To: Liu, Hongtao 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] [x86] Add define_insn_and_split to support general
> version of "kxnor".
> 
> On Tue, Oct 11, 2022 at 04:03:16PM +0800, liuhongt via Gcc-patches wrote:
> > gcc/ChangeLog:
> >
> > * config/i386/i386.md (*notxor_1): New post_reload
> > define_insn_and_split.
> > (*notxorqi_1): Ditto.
> 
> > --- a/gcc/config/i386/i386.md
> > +++ b/gcc/config/i386/i386.md
> > @@ -10826,6 +10826,39 @@ (define_insn "*_1"
> > (set_attr "type" "alu, alu, msklog")
> > (set_attr "mode" "")])
> >
> > +(define_insn_and_split "*notxor_1"
> > +  [(set (match_operand:SWI248 0 "nonimmediate_operand" "=rm,r,?k")
> > +   (not:SWI248
> > + (xor:SWI248
> > +   (match_operand:SWI248 1 "nonimmediate_operand" "%0,0,k")
> > +   (match_operand:SWI248 2 "" "r,,k"
> > +   (clobber (reg:CC FLAGS_REG))]
> > +  "ix86_binary_operator_ok (XOR, mode, operands)"
> > +  "#"
> > +  "&& reload_completed"
> > +  [(parallel
> > +[(set (match_dup 0)
> > + (xor:SWI248 (match_dup 1) (match_dup 2)))
> > + (clobber (reg:CC FLAGS_REG))])
> > +   (set (match_dup 0)
> > +   (not:SWI248 (match_dup 1)))]
> > +{
> > +  if (MASK_REGNO_P (REGNO (operands[0])))
> 
> This causes --enable-checking=yes,rtl,extra regression on
> gcc.dg/store_merging_13.c test on x86_64-linux:
> .../gcc/testsuite/gcc.dg/store_merging_13.c: In function 'f13':
> .../gcc/testsuite/gcc.dg/store_merging_13.c:189:1: internal compiler error: 
> RTL
> check: expected code 'reg', have 'mem' in rhs_regno, at rtl.h:1932 0x7b0c8f
> rtl_check_failed_code1(rtx_def const*, rtx_code, char const*, int, char 
> const*)
> ../../gcc/rtl.cc:916
> 0x8e74be rhs_regno
> ../../gcc/rtl.h:1932
> 0x9785fd rhs_regno
> ./genrtl.h:120
> 0x9785fd gen_split_260(rtx_insn*, rtx_def**)
> ../../gcc/config/i386/i386.md:10846
> 0x23596dc split_insns(rtx_def*, rtx_insn*)
> ../../gcc/config/i386/i386.md:16392
> 0xfccd5a try_split(rtx_def*, rtx_insn*, int)
> ../../gcc/emit-rtl.cc:3799
> 0x132e9d8 split_insn
> ../../gcc/recog.cc:3384
> 0x13359d5 split_all_insns()
> ../../gcc/recog.cc:3488
> 0x1335ae8 execute
> ../../gcc/recog.cc:4412
> Please submit a full bug report, with preprocessed source (by using -freport-
> bug).
> Please include the complete backtrace with any bug report.
> See  for instructions.
> 
> Fixed thusly, tested on x86_64-linux, committed to trunk as obvious.
Thanks.
> 
> 2022-10-11  Jakub Jelinek  
> 
>   PR target/107185
>   * config/i386/i386.md (*notxor_1): Use MASK_REG_P (x)
> instead of
>   MASK_REGNO_P (REGNO (x)).
> 
> --- gcc/config/i386/i386.md.jj2022-10-11 12:10:42.188891134 +0200
> +++ gcc/config/i386/i386.md   2022-10-11 15:47:45.531449089 +0200
> @@ -10843,7 +10843,7 @@ (define_insn_and_split "*notxor_1"
> (set (match_dup 0)
>   (not:SWI248 (match_dup 0)))]
>  {
> -  if (MASK_REGNO_P (REGNO (operands[0])))
> +  if (MASK_REG_P (operands[0]))
>  {
>emit_insn (gen_kxnor (operands[0], operands[1], operands[2]));
>DONE;
> 
> 
>   Jakub

RE: [PATCH] testsuite: Fix up avx256-unaligned-store-3.c test.

2022-09-25 Thread Liu, Hongtao via Gcc-patches




> -Original Message-
> From: Hu, Lin1 
> Sent: Monday, September 26, 2022 1:20 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Liu, Hongtao ; ubiz...@gmail.com
> Subject: [PATCH] testsuite: Fix up avx256-unaligned-store-3.c test.
> 
> Hi all,
> 
> This patch aims to fix a problem that avx256-unaligned-store-3.c test reports
> two unexpected fails under "-march=cascadelake".
> 
> Regtested on x86_64-pc-linux-gnu. Ok for trunk?
Ok.
> 
> BRs,
> Lin
> 
> gcc/testsuite/ChangeLog:
> 
>   PR target/94962
>   * gcc.target/i386/avx256-unaligned-store-3.c: Add -mno-avx512f
> ---
>  gcc/testsuite/gcc.target/i386/avx256-unaligned-store-3.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-3.c
> b/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-3.c
> index f909099bcb1..67635fb9e66 100644
> --- a/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-3.c
> +++ b/gcc/testsuite/gcc.target/i386/avx256-unaligned-store-3.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O3 -dp -mavx -mavx256-split-unaligned-store -
> mtune=generic -fno-common" } */
> +/* { dg-options "-O3 -dp -mavx -mavx256-split-unaligned-store -
> mtune=generic -fno-common -mno-avx512f" } */
> 
>  #define N 1024
> 
> --
> 2.18.2

RE: [PATCH] i386: Add syscall to enable AMX for latest kernels

2022-09-22 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Jiang, Haochen 
> Sent: Thursday, September 22, 2022 2:23 PM
> To: Uros Bizjak 
> Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao 
> Subject: RE: [PATCH] i386: Add syscall to enable AMX for latest kernels
> 
> Hi all,
> 
> I would like to backport this patch to GCC 12 release branch as machines with
> the version of default GCC is 12.x (which is always using newer kernels), if 
> the
> patch is not backported, the amx tests will always fail.
> 
> Ok for backport?
Ok.
> 
> BRs,
> Haochen
> 
> > -Original Message-
> > From: Uros Bizjak 
> > Sent: Tuesday, June 21, 2022 10:53 PM
> > To: Jiang, Haochen 
> > Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao 
> > Subject: Re: [PATCH] i386: Add syscall to enable AMX for latest
> > kernels
> >
> > On Tue, Jun 21, 2022 at 9:41 AM Jiang, Haochen
> > 
> > wrote:
> > >
> > > > -Original Message-
> > > > From: Uros Bizjak 
> > > > Sent: Tuesday, June 21, 2022 3:06 PM
> > > > To: Jiang, Haochen 
> > > > Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao 
> > > > Subject: Re: [PATCH] i386: Add syscall to enable AMX for latest
> > > > kernels
> > > >
> > > > On Tue, Jun 21, 2022 at 4:23 AM Jiang, Haochen
> > > > 
> > > > wrote:
> > > > >
> > > > > > -Original Message-
> > > > > > From: Uros Bizjak 
> > > > > > Sent: Monday, June 20, 2022 10:54 PM
> > > > > > To: Jiang, Haochen 
> > > > > > Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao
> > > > > > 
> > > > > > Subject: Re: [PATCH] i386: Add syscall to enable AMX for
> > > > > > latest kernels
> > > > > >
> > > > > > On Mon, Jun 20, 2022 at 10:04 AM Haochen Jiang
> > > > > > 
> > > > > > wrote:
> > > > > > >
> > > > > > > From: "Jiang, Haochen" 
> > > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > We need syscall to enable AMX for kernels>=5.4. It is
> > > > > > > missing in current amx tests, which will cause test fail.
> > > > > >
> > > > > > So this new code is only valid for linux & co?
> > > > >
> > > > > Thanks for reminding me for that, I only test on linux since the
> > > > > header file is
> > > > only in linux.
> > > > >
> > > > > Just updated a patch wrapping with a macro not to change the
> > > > > behavior on
> > > > windows.
> > > >
> > > > I think you want __linux__ there, not __unix__.
> > >
> > > Fixed with __linux__.
> >
> > OK.
> >
> > Thanks,
> > Uros.
> >
> > >
> > > Thx,
> > > Haochen
> > >
> > > >
> > > > Uros.
> > > >
> > > > >
> > > > > Regtested on x86_64-pc-linux-gnu.
> > > > >
> > > > > Thx,
> > > > > Haochen
> > > > > >
> > > > > > Uros.
> > > > > >
> > > > > > >
> > > > > > > This patch aims to add them to fix this bug.
> > > > > > >
> > > > > > > BRs,
> > > > > > > Haochen
> > > > > > >
> > > > > > > gcc/testsuite/ChangeLog:
> > > > > > >
> > > > > > > * gcc.target/i386/amx-check.h (request_perm_xtile_data):
> > > > > > > New function to check if AMX is usable and enable AMX.
> > > > > > > (main): Run test if AMX is usable.
> > > > > > > ---
> > > > > > >  gcc/testsuite/gcc.target/i386/amx-check.h | 24
> > > > > > > +++
> > > > > > >  1 file changed, 24 insertions(+)
> > > > > > >
> > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/amx-check.h
> > > > > > > b/gcc/testsuite/gcc.target/i386/amx-check.h
> > > > > > > index 434b0e59703..92ed8669304 100644
> > > > > > > --- a/gcc/testsuite/gcc.target/i386/amx-check.h
> > > > > > > +++ b/gcc/testsuite/gcc.target/i386/amx-check.h
> > > > > > > @@ -4,11 +4,22 @@
> > > > > > >  #include 
> > > > > > >  #include 
> > > > > > >  #include 
> > > > > > > +#include 
> > > > > > > +#include 
> > > > > > >  #ifdef DEBUG
> > > > > > >  #include 
> > > > > > >  #endif
> > > > > > >  #include "cpuid.h"
> > > > > > >
> > > > > > > +#define XFEATURE_XTILECFG  17
> > > > > > > +#define XFEATURE_XTILEDATA 18
> > > > > > > +#define XFEATURE_MASK_XTILECFG (1 << XFEATURE_XTILECFG)
> > > > > > > +#define XFEATURE_MASK_XTILEDATA(1 <<
> XFEATURE_XTILEDATA)
> > > > > > > +#define XFEATURE_MASK_XTILE(XFEATURE_MASK_XTILECFG |
> > > > > > XFEATURE_MASK_XTILEDATA)
> > > > > > > +
> > > > > > > +#define ARCH_GET_XCOMP_PERM0x1022
> > > > > > > +#define ARCH_REQ_XCOMP_PERM0x1023
> > > > > > > +
> > > > > > >  /* TODO: The tmm emulation is temporary for current
> > > > > > > AMX implementation with no tmm regclass, should
> > > > > > > be changed in the future. */ @@ -44,6 +55,18 @@ typedef
> > > > > > > struct __tile
> > > > > > >  /* Stride (colum width in byte) used for tileload/store */
> > > > > > > #define _STRIDE 64
> > > > > > >
> > > > > > > +/* We need syscall to use amx functions */ int
> > > > > > > +request_perm_xtile_data() {
> > > > > > > +  unsigned long bitmask;
> > > > > > > +
> > > > > > > +  if (syscall (SYS_arch_prctl, ARCH_REQ_XCOMP_PERM,
> > > > > > XFEATURE_XTILEDATA) ||
> > > > > > > +  syscall (SYS_arch_prctl, ARCH_GET_XCOMP_PERM, ))
> > > > > > > +return 0;
> > > > > > > +
> > > > > > > +  return (bitmask &

RE: [PATCH] i386: Fixed vec_init_dup_v16bf [PR106887]

2022-09-16 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Kong, Lingling 
> Sent: Friday, September 16, 2022 3:40 PM
> To: Hongtao Liu 
> Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao 
> Subject: RE: [PATCH] i386: Fixed vec_init_dup_v16bf [PR106887]
> 
> Hi,
> 
> > >   machine_mode hvmode = (mode == V16HImode ? V8HImode
> > >  : mode == V16HFmode ? V8HFmode
> > > +: mode == V16BFmode ? V8BFmode
> > Can it be written as switch case?
> Sure, I fixed it in new patch. Thanks again for take a look.
> OK for master ?
+ switch (mode)
+   {
+ case V16HImode:
+   hvmode = V8HImode;
+   break;
+ case V16HFmode:
+   hvmode = V8HFmode;
+   break;
+ case V16BFmode:
+   hvmode = V8BFmode;
+   break;
+ case V32QImode:
+   hvmode = V16QImode;
+   break;
+ default:
+   gcc_unreachable ();
+   } > 

For the format, case aligns with {?
Others LGTM.

> Thanks,
> Lingling
> 
> > -Original Message-
> > From: Hongtao Liu 
> > Sent: Thursday, September 15, 2022 11:46 AM
> > To: Kong, Lingling 
> > Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao 
> > Subject: Re: [PATCH] i386: Fixed vec_init_dup_v16bf [PR106887]
> >
> > On Thu, Sep 15, 2022 at 11:36 AM Kong, Lingling via Gcc-patches  > patc...@gcc.gnu.org> wrote:
> > >
> > > Hi
> > >
> > > The patch is to fix vec_init_dup_v16bf, add correct handle for v16bf
> > > mode in
> > ix86_expand_vector_init_duplicate.
> > > Add testcase with sse2 without avx2.
> > >
> > > OK for master?
> > >
> > > gcc/ChangeLog:
> > >
> > > PR target/106887
> > > * config/i386/i386-expand.cc (ix86_expand_vector_init_duplicate):
> > > Fixed V16BF mode case.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > PR target/106887
> > > * gcc.target/i386/vect-bfloat16-2c.c: New test.
> > > ---
> > >  gcc/config/i386/i386-expand.cc|  1 +
> > >  .../gcc.target/i386/vect-bfloat16-2c.c| 76 +++
> > >  2 files changed, 77 insertions(+)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/vect-bfloat16-2c.c
> > >
> > > diff --git a/gcc/config/i386/i386-expand.cc
> > > b/gcc/config/i386/i386-expand.cc index d7b49c99dc8..9451c561489
> > > 100644
> > > --- a/gcc/config/i386/i386-expand.cc
> > > +++ b/gcc/config/i386/i386-expand.cc
> > > @@ -15111,6 +15111,7 @@ ix86_expand_vector_init_duplicate (bool
> > mmx_ok, machine_mode mode,
> > > {
> > >   machine_mode hvmode = (mode == V16HImode ? V8HImode
> > >  : mode == V16HFmode ? V8HFmode
> > > +: mode == V16BFmode ? V8BFmode
> > Can it be written as switch case?
> > >  : V16QImode);
> > >   rtx x = gen_reg_rtx (hvmode);
> > >
> > > diff --git a/gcc/testsuite/gcc.target/i386/vect-bfloat16-2c.c
> > > b/gcc/testsuite/gcc.target/i386/vect-bfloat16-2c.c
> > > new file mode 100644
> > > index 000..bead94e46a1
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/vect-bfloat16-2c.c
> > > @@ -0,0 +1,76 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-mf16c -msse2 -mno-avx2 -O2" } */
> > > +
> > > +typedef __bf16 v8bf __attribute__ ((__vector_size__ (16))); typedef
> > > +__bf16 v16bf __attribute__ ((__vector_size__ (32)));
> > > +
> > > +#define VEC_EXTRACT(V,S,IDX)   \
> > > +  S\
> > > +  __attribute__((noipa))   \
> > > +  vec_extract_##V##_##IDX (V v)\
> > > +  {\
> > > +return v[IDX]; \
> > > +  }
> > > +
> > > +#define VEC_SET(V,S,IDX)   \
> > > +  V\
> > > +  __attribute__((noipa))   \
> > > +  vec_set_##V##_##IDX (V v, S s)   \
> > > +  {\
> > > +v[IDX] = s;\
> > > +return v;  \
> > > +  }
> > > +
> > > +v8bf
> > > +vec_init_v8bf (__bf16 a1, __bf16 a2, __bf16 a3, __bf16 a4,
> > > +  __bf16 a5,  __bf16 a6, __bf16 a7, __bf16 a8) {
> > > +return __extension__ (v8bf) {a1, a2, a3, a4, a5, a6, a7, a8}; }
> > > +
> > > +v16bf
> > > +vec_init_v16bf (__bf16 a1, __bf16 a2, __bf16 a3, __bf16 a4,
> > > +  __bf16 a5,  __bf16 a6, __bf16 a7, __bf16 a8,
> > > +  __bf16 a9,  __bf16 a10, __bf16 a11, __bf16 a12,
> > > +  __bf16 a13,  __bf16 a14, __bf16 a15, __bf16 a16) {
> > > +return __extension__ (v16bf) {a1, a2, a3, a4, a5, a6, a7, a8,
> > > + a9, a10, a11, a12, a13, a14, a15,
> > > +a16}; }
> > > +
> > > +v8bf
> > > +vec_init_dup_v8bf (__bf16 a1)
> > > +{
> >

RE: [PATCH] i386: Extend cvtps2pd to memory

2022-06-30 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Uros Bizjak 
> Sent: Thursday, June 30, 2022 4:53 PM
> To: Jiang, Haochen 
> Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao 
> Subject: Re: [PATCH] i386: Extend cvtps2pd to memory
> 
> On Thu, Jun 30, 2022 at 10:45 AM Uros Bizjak  wrote:
> >
> > On Thu, Jun 30, 2022 at 9:41 AM Uros Bizjak  wrote:
> > >
> > > On Thu, Jun 30, 2022 at 9:24 AM Jiang, Haochen 
> wrote:
> > > >
> > > > > -Original Message-
> > > > > From: Uros Bizjak 
> > > > > Sent: Thursday, June 30, 2022 2:20 PM
> > > > > To: Jiang, Haochen 
> > > > > Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao
> > > > > 
> > > > > Subject: Re: [PATCH] i386: Extend cvtps2pd to memory
> > > > >
> > > > > On Thu, Jun 30, 2022 at 7:59 AM Haochen Jiang
> > > > > 
> > > > > wrote:
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > This patch aims to fix the cvtps2pd insn, which should also
> > > > > > work on memory operand but currently does not. After this fix,
> > > > > > when loop == 2, it will eliminate movq instruction.
> > > > > >
> > > > > > Regtested on x86_64-pc-linux-gnu. Ok for trunk?
> > > > > >
> > > > > > BRs,
> > > > > > Haochen
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > > PR target/43618
> > > > > > * config/i386/sse.md (extendv2sfv2df2): New define_expand.
> > > > > > (sse2_cvtps2pd_load): Rename extendvsdfv2df2.
> >
> > Rename FROM ...
> >
> > Please also mention change to sse2_cvtps2pd.
> >
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >
> > > > > > PR target/43618
> > > > > > * gcc.target/i386/pr43618-1.c: New test.
> > > > >
> > > > > This patch could be as simple as:
> > > > >
> > > > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > > > > index 8cd0f617bf3..c331445cb2d 100644
> > > > > --- a/gcc/config/i386/sse.md
> > > > > +++ b/gcc/config/i386/sse.md
> > > > > @@ -9195,7 +9195,7 @@
> > > > > (define_insn "extendv2sfv2df2"
> > > > >   [(set (match_operand:V2DF 0 "register_operand" "=v")
> > > > >(float_extend:V2DF
> > > > > - (match_operand:V2SF 1 "register_operand" "v")))]
> > > > > + (match_operand:V2SF 1 "nonimmediate_operand" "vm")))]
> > > > >   "TARGET_MMX_WITH_SSE"
> > > > >   "%vcvtps2pd\t{%1, %0|%0, %1}"
> > > > >   [(set_attr "type" "ssecvt")
> > > >
> > > > We also tested on this version, it is ok.
> > > >
> > > > The reason why the patch looks like this is because in the
> > > > previous insn sse2_cvtps2pd, the constraint vm and
> > > > vector_operand actually does not match the actual instruction.
> > > > Memory operand is V2SF, not V4SF.
> > > >
> > > > Therefore, we changed the constraint in that insn. Then it caused 
> > > > another
> issue.
> > > > For memory operand, it seems that we cannot generate those mask
> instructions.
> > > > So I change the pattern to how extendv2hfv2df2 works.
> > >
> > > If you want to change the memory access in sse2_cvtps2pd,
> > > then please see how e.g. v2hiv2di is handled in sse.md. In
> > > addition to two instructions, you will need one
> > > define_insn_and_split with a pre-reload splitter.
> >
> > Oh, nowadays combine does vec_select from a paradoxical subreg on its own.
> >
> > +(define_expand "extendv2sfv2df2"
> > +  [(set (match_operand:V2DF 0 "register_operand")
> > +(float_extend:V2DF
> > +  (match_operand:V2SF 1 "nonimmediate_operand")))]
> > +  "TARGET_MMX_WITH_SSE"
> > +{
> > +  if (!MEM_P (operands[1]))
> > +{
> >
> > You will need force reg here:
> >
> > rtx op1 = force_reg (V2SFmode, operands[1]);
> > +  operands[1] = lowpart_subreg (V4SFmode, op1, V2SFmode);
> > +  emit_insn (gen_sse2_cvtps2pd (operands[0], operands[1]));
> > +  DONE;
> > +}
> > +})
> >
> >
> > -(define_insn "extendv2sfv2df2"
> > +(define_insn "sse2_cvtps2pd_load"
> >
> > Please name this insn "*sse2_cvtps2pd_1". Please note the
> > star at the beginning, You don't have to make the name public.
> >
> > OK with the above changes.
> 
> Forgot to mention:
> 
> 
> - (match_operand:V2SF 1 "register_operand" "v")))]
> -  "TARGET_MMX_WITH_SSE"
> -  "%vcvtps2pd\t{%1, %0|%0, %1}"
> + (match_operand:V2SF 1 "memory_operand" "m")))]
> + "TARGET_MMX_WITH_SSE && "
> +  "%vcvtps2pd\t{%1, %0|%0 and2>, %q1}"
>[(set_attr "type" "ssecvt")
> 
> The new insn does not need to be limited to TARGET_MMX_WITH_SSE, so we
> can use TARGET_SSE2 here.
> 
> Which opens the question if the expander could also be TARGET_SSE2 only.
> There are no MMX registers involved in any of the patterns anymore.
Yes.
> 
> Uros.
> >
> > Thanks,
> > Uros,

RE: [PATCH] i386: Add AVX512BW to AVX512F in MASK_ISA2

2022-06-29 Thread Liu, Hongtao via Gcc-patches




> -Original Message-
> From: Jiang, Haochen 
> Sent: Thursday, June 30, 2022 9:51 AM
> To: gcc-patches@gcc.gnu.org
> Cc: ubiz...@gmail.com; Liu, Hongtao 
> Subject: [PATCH] i386: Add AVX512BW to AVX512F in MASK_ISA2
> 
> Hi all,
> 
> I just found in MASK_ISA2_UNSET part, since AVX512BW is based on AVX512F,
> we should add OPTION_MASK_ISA2_AVX512BW_UNSET to AVX512F for
> maintainence convenience and logic correctness, or we will need to add all
> future ISAs based on AVX512BW in both AVX512F and AVX512BW. This will be
> easily forgot and might cause confusion.
> 
> Also remove the redundant ones in this change.
> 
> Regtested on x86_64-pc-linux-gnu. Ok for trunk?
LGTM.
> 
> BRs,
> Haochen
> 
> gcc/ChangeLog:
> 
>   * common/config/i386/i386-common.cc
> (OPTION_MASK_ISA2_AVX512F_UNSET):
>   Add OPTION_MASK_ISA2_AVX512BW_UNSET, remove
>   OPTION_MASK_ISA2_AVX512BF16_UNSET and
>   OPTION_MASK_ISA2_AVX512FP16_UNSET.
> ---
>  gcc/common/config/i386/i386-common.cc | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/gcc/common/config/i386/i386-common.cc
> b/gcc/common/config/i386/i386-common.cc
> index cb878163492..c0c2ad74d87 100644
> --- a/gcc/common/config/i386/i386-common.cc
> +++ b/gcc/common/config/i386/i386-common.cc
> @@ -315,11 +315,10 @@ along with GCC; see the file COPYING3.  If not see
> | OPTION_MASK_ISA_SSE_UNSET)
> 
>  #define OPTION_MASK_ISA2_AVX512F_UNSET \
> -  (OPTION_MASK_ISA2_AVX512BF16_UNSET \
> +  (OPTION_MASK_ISA2_AVX512BW_UNSET \
> | OPTION_MASK_ISA2_AVX5124FMAPS_UNSET \
> | OPTION_MASK_ISA2_AVX5124VNNIW_UNSET \
> -   | OPTION_MASK_ISA2_AVX512VP2INTERSECT_UNSET \
> -   | OPTION_MASK_ISA2_AVX512FP16_UNSET)
> +   | OPTION_MASK_ISA2_AVX512VP2INTERSECT_UNSET)
>  #define OPTION_MASK_ISA2_GENERAL_REGS_ONLY_UNSET \
>OPTION_MASK_ISA2_SSE_UNSET
>  #define OPTION_MASK_ISA2_AVX_UNSET OPTION_MASK_ISA2_AVX2_UNSET
> --
> 2.18.1

RE: [PATCH] Add a bit dislike for separate mem alternative when op is REG_P.

2022-05-29 Thread Liu, Hongtao via Gcc-patches




> -Original Message-
> From: Alexander Monakov 
> Sent: Friday, May 27, 2022 5:39 PM
> To: Liu, Hongtao 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] Add a bit dislike for separate mem alternative when op is
> REG_P.
> 
> On Wed, 25 May 2022, liuhongt via Gcc-patches wrote:
> 
> > Rigt now, mem_cost for separate mem alternative is 1 * frequency which
> > is pretty small and caused the unnecessary SSE spill in the PR, I've
> > tried to rework backend cost model, but RA still not happy with
> > that(regress somewhere else). I think the root cause of this is cost for 
> > separate
> 'm'
> > alternative cost is too small, especially considering that the mov
> > cost of gpr are 2(default for REGISTER_MOVE_COST). So this patch
> > increase mem_cost to 2*frequency, also increase 1 for reg_class cost when m
> alternative.
> 
> In the PR, the spill happens in the initial basic block of the function, i.e.
> the one with the highest frequency.
> 
> Also as noted in the PR, swapping the 'unlikely' branch to 'likely' avoids 
> the spill,
> even though it does not affect the frequency of the initial basic block, and
> makes the block with the use more rarely executed.

The spill is mainly decided by 3 insns related to r92

283(insn 3 61 4 2 (set (reg/v:SF 92 [ x ])
284(reg:SF 102)) "test3.c":7:1 142 {*movsf_internal}
285 (expr_list:REG_DEAD (reg:SF 102)

288(insn 9 4 12 2 (set (reg:SI 89 [ _11 ])
289(subreg:SI (reg/v:SF 92 [ x ]) 0)) "test3.c":3:36 81 
{*movsi_internal}
290 (nil))

And
382(insn 28 27 29 5 (set (reg:DF 98)
383(float_extend:DF (reg/v:SF 92 [ x ]))) "test3.c":11:13 163 
{*extendsfdf2}
384 (expr_list:REG_DEAD (reg/v:SF 92 [ x ])
385(nil)))
386(insn 29 28 30 5 (s

The frequency the for INSN 3 and INSN 9 is not affected, but frequency of INSN 
28 drop from 805 -> 89 after swapping "unlikely" and "likely".
Because of that, GPR cost decreases a lot, finally make the RA choose GPR 
instead of MEM.

GENERAL_REGS:2356,2356 
SSE_REGS:6000,6000
MEM:4089,4089

Dump of 301.ira:
67  a4(r92,l0) costs: AREG:2356,2356 DREG:2356,2356 CREG:2356,2356 
BREG:2356,2356 SIREG:2356,2356 DIREG:2356,2356 AD_REGS:2356,2356 
CLOBBERED_REGS:2356,2356 Q_REGS:2356,2356 NON_Q_REGS:2356,2356 
TLS_GOTBASE_REGS:2356,2356 GENERAL_REGS:2356,2356 SSE_FIRST_REG:6000,6000 
NO_REX_SSE_REGS:6000,6000 SSE_REGS:6000,6000 \
   MMX_REGS:19534,19534 INT_SSE_REGS:19534,19534 ALL_REGS:214534,214534 
MEM:4089,4089

And although there's no spill, there's an extra VMOVD in the later BB which 
looks suboptimal(Guess we can stand with that since it's cold.)

24vmovd   %eax, %xmm2
25vcvtss2sd   %xmm2, %xmm2, %xmm1
26vmulsd  %xmm0, %xmm1, %xmm0
27vcvtsd2ss   %xmm0, %xmm0, %xmm0
> 
> Do you have a root cause analysis that explains the above?
> 
> Alexander

RE: [PATCH] Optimize vpermtiw/b to vpunpcklqdq for certain cases.

2022-05-13 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Uros Bizjak 
> Sent: Friday, May 13, 2022 4:15 PM
> To: Liu, Hongtao 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] Optimize vpermtiw/b to vpunpcklqdq for certain cases.
> 
> On Fri, May 13, 2022 at 9:16 AM liuhongt  wrote:
> >
> > Assembly Optimization like:
> > -   vmovq   %xmm0, %xmm2
> > -   vmovdqa .LC0(%rip), %xmm0
> > vmovq   %xmm1, %xmm1
> > -   vpermi2w%xmm1, %xmm2, %xmm0
> > +   vmovq   %xmm0, %xmm0
> > +   vpunpcklqdq %xmm1, %xmm0, %xmm0
> >
> > ...
> >
> > -.LC0:
> > -   .value  0
> > -   .value  1
> > -   .value  2
> > -   .value  3
> > -   .value  8
> > -   .value  9
> > -   .value  10
> > -   .value  11
> >
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> > PR target/105033
> > * config/i386/sse.md (*vec_concatv4si): Extend to ..
> > (*vec_concat): .. V16QI and V8HImode.
> > (*vec_concatv16qi_permt2): New pre_reload define_insn_and_split.
> > (*vec_concatv8hi_permt2): Ditto.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/pr105033.c: New test.
> > ---
> >  gcc/config/i386/sse.md   | 62 ++--
> >  gcc/testsuite/gcc.target/i386/pr105033.c | 27 +++
> >  2 files changed, 84 insertions(+), 5 deletions(-)  create mode 100644
> > gcc/testsuite/gcc.target/i386/pr105033.c
> >
> > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index
> > a63df0d0b1f..2e417e47d20 100644
> > --- a/gcc/config/i386/sse.md
> > +++ b/gcc/config/i386/sse.md
> > @@ -19600,11 +19600,11 @@ (define_insn "*vec_concatv2si"
> > (set_attr "type" "sselog,ssemov,sselog,ssemov,mmxcvt,mmxmov")
> > (set_attr "mode" "TI,TI,V4SF,SF,DI,DI")])
> >
> > -(define_insn "*vec_concatv4si"
> > -  [(set (match_operand:V4SI 0 "register_operand"   "=x,v,x,x,v")
> > -   (vec_concat:V4SI
> > - (match_operand:V2SI 1 "register_operand" " 0,v,0,0,v")
> > - (match_operand:V2SI 2 "nonimmediate_operand" " x,v,x,m,m")))]
> > +(define_insn "*vec_concat"
> > +  [(set (match_operand:VI124_128 0 "register_operand"   "=x,v,x,x,v")
> > +   (vec_concat:VI124_128
> > + (match_operand: 1 "register_operand" " 
> > 0,v,0,0,v")
> > + (match_operand: 2 "nonimmediate_operand" "
> > +x,v,x,m,m")))]
> >"TARGET_SSE"
> >"@
> > punpcklqdq\t{%2, %0|%0, %2}
> > @@ -19617,6 +19617,58 @@ (define_insn "*vec_concatv4si"
> > (set_attr "prefix" "orig,maybe_evex,orig,orig,maybe_evex")
> > (set_attr "mode" "TI,TI,V4SF,V2SF,V2SF")])
> >
> > +(define_insn_and_split "*vec_concatv16qi_permt2"
> > +  [(set (match_operand:V16QI 0 "register_operand")
> > +   (unspec:V16QI
> > + [(const_vector:V16QI [(const_int 0) (const_int 1)
> > +   (const_int 2) (const_int 3)
> > +   (const_int 4) (const_int 5)
> > +   (const_int 6) (const_int 7)
> > +   (const_int 16) (const_int 17)
> > +   (const_int 18) (const_int 19)
> > +   (const_int 20) (const_int 21)
> > +   (const_int 22) (const_int 23)])
> > +  (match_operand:V16QI 1 "register_operand")
> > +  (match_operand:V16QI 2 "nonimmediate_operand")]
> > + UNSPEC_VPERMT2))]
> > +  "TARGET_AVX512VL && TARGET_AVX512VBMI"
> 
> You need "&& ix86_pre_reload_split ()" here, because a pseudo can be
> generated via force_reg.
> 
will change.
> > +  "#"
> > +  "&& 1"
> > +  [(set (match_dup 0)
> > +   (vec_concat:V16QI (match_dup 1) (match_dup 2)))] {
> > +  operands[1] = lowpart_subreg (V8QImode,
> > +   force_reg (V16QImode, operands[1]),
> > +   V16QImode);
> > +  if (!MEM_P (operands[2]))
> > +operands[2] = force_reg (V16QImode, operands[2]);
> 
> Are you sure there are no subregs possible in operand[2]? To stay on the safe
> side, use force_reg unconditionally, it will also force subregs to reg, 
> avoiding
> failure with the following lowpart_subreg.
When it's MEM, not need to force_reg.
> 
> Uros.
> 
> > +  operands[2] = lowpart_subreg (V8QImode, operands[2], V16QImode);
> > +})
> > +
> > +(define_insn_and_split "*vec_concatv8hi_permt2"
> > +  [(set (match_operand:V8HI 0 "register_operand")
> > +   (unspec:V8HI
> > + [(const_vector:V8HI [(const_int 0) (const_int 1)
> > +   (const_int 2) (const_int 3)
> > +   (const_int 8) (const_int 9)
> > +   (const_int 10) (const_int 11)])
> > +  (match_operand:V8HI 1 "register_operand")
> > +  (match_operand:V8HI 2 "nonimmediate_operand")]
> > + UNSPEC_VPERMT2))]
> > +  "TARGET_AVX512VL && TARGET_AVX512BW"
> > +  "#"
> > +  "&& 1"
> > +

RE: [PATCH] docs: Document new param x86-stlf-window-ninsns.

2022-04-06 Thread Liu, Hongtao via Gcc-patches



> -Original Message-
> From: Martin Liška 
> Sent: Wednesday, April 6, 2022 3:35 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Liu, Hongtao 
> Subject: [PATCH] docs: Document new param x86-stlf-window-ninsns.
> 
> Hi.
> 
> The patch documents the newly added parameter. One question I have is if it's
> fine listing it under 'i386 and x86_64 targets'?
Yes, thanks.
> 
> Cheers,
> Martin
> 
> gcc/ChangeLog:
> 
>   * doc/invoke.texi: Document it.
> ---
>   gcc/doc/invoke.texi | 8 
>   1 file changed, 8 insertions(+)
> 
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index
> 3936aef69d0..1a51759e6e4 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -15247,6 +15247,14 @@ loop.  The default value is four.
> 
>   @end table
> 
> +The following choices of @var{name} are available on i386 and x86_64 targets:
> +
> +@table @gcctabopt
> +@item x86-stlf-window-ninsns
> +Instructions number above which STFL stall penalty can be compensated.
> +
> +@end table
> +
>   @end table
> 
>   @node Instrumentation Options
> --
> 2.35.1

RE: [PATCH v3] AVX512FP16: Fix wrong code for _mm_mask_f[c]madd.*sch [PR 104978]

2022-03-21 Thread Liu, Hongtao via Gcc-patches




> -Original Message-
> From: Wang, Hongyu 
> Sent: Tuesday, March 22, 2022 11:28 AM
> To: Liu, Hongtao 
> Cc: gcc-patches@gcc.gnu.org
> Subject: [PATCH v3] AVX512FP16: Fix wrong code for _mm_mask_f[c]madd.*sch
> [PR 104978]
> 
> Hi, here is the patch with force_reg before lowpart_subreg.
> 
> Bootstraped/regtested on x86_64-pc-linux-gnu{-m32,} and sde.
> 
> Ok for master?
> 
> For complex scalar intrinsic like _mm_mask_fcmadd_sch, the mask should be
> and by 1 to ensure the mask is bind to lowest byte.
> Use masked vmovss to perform same operation which omits higher bits of mask.
> 
> gcc/ChangeLog:
> 
>   PR target/104978
>   * config/i386/sse.md
>   (avx512fp16_fmaddcsh_v8hf_mask1   Use avx512f_movsf_mask instead of vmovaps or vblend, and
>   force_reg before lowpart_subreg.
>   (avx512fp16_fcmaddcsh_v8hf_mask1 
> gcc/testsuite/ChangeLog:
> 
>   PR target/104978
>   * gcc.target/i386/avx512fp16-vfcmaddcsh-1a.c: Adjust asm scan.
>   * gcc.target/i386/avx512fp16-vfmaddcsh-1a.c: Ditto.
>   * gcc.target/i386/avx512fp16-vfcmaddcsh-1c.c: Removed.
>   * gcc.target/i386/avx512fp16-vfmaddcsh-1c.c: Ditto.
>   * gcc.target/i386/pr104978.c: New test.
> 
> V3
> ---
>  gcc/config/i386/sse.md| 62 ++-
>  .../i386/avx512fp16-vfcmaddcsh-1a.c   |  4 +-
>  .../i386/avx512fp16-vfcmaddcsh-1c.c   | 13 
>  .../gcc.target/i386/avx512fp16-vfmaddcsh-1a.c |  4 +-
>   .../gcc.target/i386/avx512fp16-vfmaddcsh-1c.c | 13 
>  gcc/testsuite/gcc.target/i386/pr104978.c  | 18 ++
>  6 files changed, 42 insertions(+), 72 deletions(-)  delete mode 100644
> gcc/testsuite/gcc.target/i386/avx512fp16-vfcmaddcsh-1c.c
>  delete mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-vfmaddcsh-1c.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr104978.c
> 
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index
> 21bf3c55c95..6f7af2f21d6 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -6576,7 +6576,7 @@ (define_expand
> "avx512fp16_fmaddcsh_v8hf_mask1"
> (match_operand:QI 4 "register_operand")]
>"TARGET_AVX512FP16 && "
>  {
> -  rtx op0, op1;
> +  rtx op0, op1, dest;
> 
>if ()
>  emit_insn (gen_avx512fp16_fmaddcsh_v8hf_mask
> ( @@ -6586,26 +6586,15 @@ (define_expand
> "avx512fp16_fmaddcsh_v8hf_mask1"
>  emit_insn (gen_avx512fp16_fmaddcsh_v8hf_mask (operands[0],
>operands[1], operands[2], operands[3], operands[4]));
> 
> -  if (TARGET_AVX512VL)
> -  {
> -op0 = lowpart_subreg (V4SFmode, operands[0], V8HFmode);
> -op1 = lowpart_subreg (V4SFmode, operands[1], V8HFmode);
> -emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, operands[4]));
> -  }
> -  else
> -  {
> -rtx mask, tmp, vec_mask;
> -mask = lowpart_subreg (SImode, operands[4], QImode),
> -tmp = gen_reg_rtx (SImode);
> -emit_insn (gen_ashlsi3 (tmp, mask, GEN_INT (31)));
> -vec_mask = gen_reg_rtx (V4SImode);
> -emit_insn (gen_rtx_SET (vec_mask, CONST0_RTX (V4SImode)));
> -emit_insn (gen_vec_setv4si_0 (vec_mask, vec_mask, tmp));
> -vec_mask = lowpart_subreg (V4SFmode, vec_mask, V4SImode);
> -op0 = lowpart_subreg (V4SFmode, operands[0], V8HFmode);
> -op1 = lowpart_subreg (V4SFmode, operands[1], V8HFmode);
> -emit_insn (gen_sse4_1_blendvps (op0, op1, op0, vec_mask));
> -  }
> +  op0 = lowpart_subreg (V4SFmode, force_reg (V8HFmode, operands[0]),
> + V8HFmode);
> +  if (!MEM_P (operands[1]))
> +operands[1] = force_reg (V8HFmode, operands[1]);
> +  op1 = lowpart_subreg (V4SFmode, operands[1], V8HFmode);
> +  dest = gen_reg_rtx (V4SFmode);
> +  emit_insn (gen_avx512f_movsf_mask (dest, op1, op0, op1,
> +operands[4]));
> +  emit_move_insn (operands[0], lowpart_subreg (V8HFmode, dest,
> +V4SFmode));
>DONE;
>  })
> 
> @@ -6631,7 +6620,7 @@ (define_expand
> "avx512fp16_fcmaddcsh_v8hf_mask1"
> (match_operand:QI 4 "register_operand")]
>"TARGET_AVX512FP16 && "
>  {
> -  rtx op0, op1;
> +  rtx op0, op1, dest;
> 
>if ()
>  emit_insn (gen_avx512fp16_fcmaddcsh_v8hf_mask
> ( @@ -6641,26 +6630,15 @@ (define_expand
> "avx512fp16_fcmaddcsh_v8hf_mask1"
>  emit_insn (gen_avx512fp16_fcmaddcsh_v8hf_mask (operands[0],
>operands[1], operands[2], operands[3], operands[4]));
> 
> -  if (TARGET_AVX512VL)
> -  {
> -op0 = lowpart_subreg (V4SFmode, operands[0], V8HFmode);
> -op1 = lowpart_subreg (V4SFmode, operands[1], V8HFmode);
> -emit_insn (gen_avx512vl_loadv4sf_mask (op0, op0, op1, operands[4]));
> -  }
> -  else
> -  {
> -rtx mask, tmp, vec_mask;
> -mask = lowpart_subreg (SImode, operands[4], QImode),
> -tmp = gen_reg_rtx (SImode);
> -emit_insn (gen_ashlsi3 (tmp, mask, GEN_INT (31)));
> -vec_mask = gen_reg_rtx (V4SImode);
> -emit_insn (gen_rtx_SET (vec_mask, CONST0_RTX (V4SImode)));
> -emit_insn (gen_vec_setv4si_0 (vec_mask,

RE: [PATCH] x86: Update model value for Alderlake and Rocketlake

2022-01-03 Thread Liu, Hongtao via Gcc-patches




> -Original Message-
> From: Cui, Lili 
> Sent: Tuesday, January 4, 2022 1:20 PM
> To: gcc-patches@gcc.gnu.org
> Cc: ubiz...@gmail.com; Liu, Hongtao ;
> hjl.to...@gmail.com
> Subject: [PATCH] x86: Update model value for Alderlake and Rocketlake
> 
> Hi Uros,
> 
> This patch is to update model value for Alderlake and Rocketlake.
Just note the update is according to latest 
https://www.intel.com/content/dam/develop/public/us/en/documents/325462-sdm-vol-1-2abcd-3abcd.pdf
> 
> Bootstrap is ok, and no regressions for i386/x86-64 testsuite.
> 
> OK for master?
> 
> gcc/ChangeLog
> 
>   * common/config/i386/cpuinfo.h (get_intel_cpu): Add new model
> values
>   to Alderlake and Rocketlake.
> ---
>  gcc/common/config/i386/cpuinfo.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/gcc/common/config/i386/cpuinfo.h
> b/gcc/common/config/i386/cpuinfo.h
> index 2d8ea201ab5..61b1a0f291c 100644
> --- a/gcc/common/config/i386/cpuinfo.h
> +++ b/gcc/common/config/i386/cpuinfo.h
> @@ -415,6 +415,7 @@ get_intel_cpu (struct __processor_model
> *cpu_model,
>cpu_model->__cpu_subtype = INTEL_COREI7_SKYLAKE;
>break;
>  case 0xa7:
> +case 0xa8:
>/* Rocket Lake.  */
>cpu = "rocketlake";
>CHECK___builtin_cpu_is ("corei7"); @@ -487,6 +488,7 @@ get_intel_cpu
> (struct __processor_model *cpu_model,
>break;
>  case 0x97:
>  case 0x9a:
> +case 0xbf:
>/* Alder Lake.  */
>cpu = "alderlake";
>CHECK___builtin_cpu_is ("corei7");
> --
> 2.17.1
> 
> Thanks,
> Lili.

RE: [PATCH] i386: vcvtph2ps and vcvtps2ph should be used to convert _Float16 to SFmode with -mf16c [PR 102811]

2021-11-23 Thread Liu, Hongtao via Gcc-patches




>-Original Message-
>From: Kong, Lingling 
>Sent: Wednesday, November 24, 2021 2:25 PM
>To: Liu, Hongtao ; gcc-patches@gcc.gnu.org
>Cc: Kong, Lingling 
>Subject: RE: [PATCH] i386: vcvtph2ps and vcvtps2ph should be used to convert
>_Float16 to SFmode with -mf16c [PR 102811]
>
>Hi,
>
>vcvtph2ps and vcvtps2ph should be used to convert _Float16 to SFmode with
>-mf16c. So added define_insn extendhfsf2 and truncsfhf2 for target_f16c.
>And cleared before conversion, updated  movhi_internal and
>ix86_can_change_mode_class.
>
>OK for master?
>
>gcc/ChangeLog:
>
>   PR target/102811
>   * config/i386/i386.c (ix86_can_change_mode_class): SSE2 can load
>16bit data
>   to sse register via pinsrw.
>   * config/i386/i386.md (extendhfsf2): Add extenndhfsf2 for f16c.
>   (extendhfdf2): Split extendhf2 into separate extendhfsf2,
>extendhfdf2.
>   extendhfdf only for target_avx512fp16.
>   (*extendhf2):rename extendhf2.
>   (truncsfhf2): Likewise.
>   (truncdfhf2): Likewise.
>   (*trunc2): Likewise.
>
>gcc/testsuite/ChangeLog:
>
>   PR target/102811
>   * gcc.target/i386/pr90773-21.c: Optimized movhi_internal,
>   optimize vmovd + movw to vpextrw.
>   * gcc.target/i386/pr90773-23.c: Ditto.
>   * gcc.target/i386/avx512vl-vcvtps2ph-pr102811.c: New test.
>---
> gcc/config/i386/i386.c|  5 +-
> gcc/config/i386/i386.md   | 74 +--
> .../i386/avx512vl-vcvtps2ph-pr102811.c| 11 +++
> gcc/testsuite/gcc.target/i386/pr90773-21.c|  2 +-
> gcc/testsuite/gcc.target/i386/pr90773-23.c|  2 +-
> 5 files changed, 83 insertions(+), 11 deletions(-)  create mode 100644
>gcc/testsuite/gcc.target/i386/avx512vl-vcvtps2ph-pr102811.c
>
>diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index
>e94efdf39fb..4b813533961 100644
>--- a/gcc/config/i386/i386.c
>+++ b/gcc/config/i386/i386.c
>@@ -19485,9 +19485,8 @@ ix86_can_change_mode_class (machine_mode
>from, machine_mode to,
>disallow a change to these modes, reload will assume it's ok to
>drop the subreg from (subreg:SI (reg:HI 100) 0).  This affects
>the vec_dupv4hi pattern.
>-   NB: AVX512FP16 supports vmovw which can load 16bit data to sse
>-   register.  */
>-  int mov_size = MAYBE_SSE_CLASS_P (regclass) && TARGET_AVX512FP16 ?
>2 : 4;
>+   NB: SSE2 can load 16bit data to sse register via pinsrw.  */
>+  int mov_size = MAYBE_SSE_CLASS_P (regclass) && TARGET_SSE2 ? 2 :
>+4;
>   if (GET_MODE_SIZE (from) < mov_size)
>   return false;
> }
>diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index
>6eb9de81921..6ee264f1151 100644
>--- a/gcc/config/i386/i386.md
>+++ b/gcc/config/i386/i386.md
>@@ -2525,6 +2525,16 @@
> case TYPE_SSEMOV:
>   return ix86_output_ssemov (insn, operands);
>
>+case TYPE_SSELOG:
>+  if (SSE_REG_P (operands[0]))
>+  return MEM_P (operands[1])
>+? "pinsrw\t{$0, %1, %0|%0, %1, 0}"
>+: "pinsrw\t{$0, %k1, %0|%0, %k1, 0}";
>+  else
>+  return MEM_P (operands[1])
>+? "pextrw\t{$0, %1, %0|%0, %1, 0}"
>+: "pextrw\t{$0, %1, %k0|%k0, %k1, 0}";
>+
> case TYPE_MSKLOG:
>   if (operands[1] == const0_rtx)
>   return "kxorw\t%0, %0, %0";
>@@ -2540,13 +2550,17 @@
> }
> }
>   [(set (attr "isa")
>-  (cond [(eq_attr "alternative" "9,10,11,12,13")
>-(const_string "avx512fp16")
>+  (cond [(eq_attr "alternative" "9,10,11,12")
>+(const_string "sse2")
>+ (eq_attr "alternative" "13")
>+(const_string "sse4")
>  ]
>  (const_string "*")))
>(set (attr "type")
>  (cond [(eq_attr "alternative" "9,10,11,12,13")
>-(const_string "ssemov")
>+(if_then_else (match_test "TARGET_AVX512FP16")
>+  (const_string "ssemov")
>+  (const_string "sselog"))
>   (eq_attr "alternative" "4,5,6,7")
> (const_string "mskmov")
>   (eq_attr "alternative" "8")
>@@ -4574,8 +4588,32 @@
>   emit_move_insn (operands[0], CONST0_RTX (V2DFmode));
> })
>
>-(define_insn "extendhf2"
>-  [(set (match_operand:MODEF 0 "nonimm_ssenomem_operand" "=v")
>+(define_expand "extendhfsf2"
>+  [(set (match_operand:SF 0 "register_operand")
>+  (float_extend:SF
>+(match_operand:HF 1 "nonimmediate_operand")))]
>+  "TARGET_AVX512FP16 || TARGET_F16C || TARGET_AVX512VL"
>+{
>+  if (!TARGET_AVX512FP16)
>+{
>+  rtx res = gen_reg_rtx (V4SFmode);
>+  rtx tmp = force_reg (V8HFmode, CONST0_RTX (V8HFmode));
>+
>+  ix86_expand_vector_set (false, tmp, operands[1], 0);
>+  emit_insn (gen_vcvtph2ps (res, gen_lowpart (V8HImode, tmp)));
>+  emit_move_insn (operands[0], gen_lowpart (SFmode, res));
>+  DONE;
>+}
>+})
>+
>+(define_expand "extendhfdf2"
>+  [(set (match_operand:DF 0 "register_operand")
>+  (float_extend:DF
>+(match_operand:HF 1

RE: [PATCH] AVX512FP16: Support cond_op for HFmode

2021-09-23 Thread Liu, Hongtao via Gcc-patches




>-Original Message-
>From: Wang, Hongyu 
>Sent: Thursday, September 23, 2021 5:16 PM
>To: Liu, Hongtao 
>Cc: gcc-patches@gcc.gnu.org
>Subject: [PATCH] AVX512FP16: Support cond_op for HFmode
>
>Hi,
>
>This patch extend the expanders for cond_op to support vector HF modes.
>bootstraped and regtested on x86_64-pc-linux-gnu{-m32,}.
Do runtime tests passe on sde{-m32,}?
>Ok for master?
>
>gcc/ChangeLog:
>
>   * config/i386/sse.md (cond_): Extend to support
>   vector HFmodes.
>   (cond_mul): Likewise.
>   (cond_div): Likewise.
>   (cond_): Likewise.
>   (cond_fma): Likewise.
>   (cond_fms): Likewise.
>   (cond_fnma): Likewise.
>   (cond_fnms): Likewise.
>
>gcc/testsuite/ChangeLog:
>
>   * gcc.target/i386/cond_op_addsubmuldiv__Float16-1.c: New test.
>   * gcc.target/i386/cond_op_addsubmuldiv__Float16-2.c: Ditto.
>   * gcc.target/i386/cond_op_fma__Float16-1.c: Ditto.
>   * gcc.target/i386/cond_op_fma__Float16-2.c: Ditto.
>   * gcc.target/i386/cond_op_maxmin__Float16-1.c: Ditto.
>   * gcc.target/i386/cond_op_maxmin__Float16-2.c: Ditto.
>---
> gcc/config/i386/sse.md| 112 +-
> .../i386/cond_op_addsubmuldiv__Float16-1.c|   9 ++
> .../i386/cond_op_addsubmuldiv__Float16-2.c|   7 ++
> .../gcc.target/i386/cond_op_fma__Float16-1.c  |  20 
> .../gcc.target/i386/cond_op_fma__Float16-2.c  |   7 ++
> .../i386/cond_op_maxmin__Float16-1.c  |   8 ++
> .../i386/cond_op_maxmin__Float16-2.c  |   6 +
> 7 files changed, 113 insertions(+), 56 deletions(-)  create mode 100644
>gcc/testsuite/gcc.target/i386/cond_op_addsubmuldiv__Float16-1.c
> create mode 100644
>gcc/testsuite/gcc.target/i386/cond_op_addsubmuldiv__Float16-2.c
> create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_fma__Float16-1.c
> create mode 100644 gcc/testsuite/gcc.target/i386/cond_op_fma__Float16-2.c
> create mode 100644
>gcc/testsuite/gcc.target/i386/cond_op_maxmin__Float16-1.c
> create mode 100644
>gcc/testsuite/gcc.target/i386/cond_op_maxmin__Float16-2.c
>
>diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index
>1ca95984afc..c2eeb7b1517 100644
>--- a/gcc/config/i386/sse.md
>+++ b/gcc/config/i386/sse.md
>@@ -2118,12 +2118,12 @@
>   [(set_attr "isa" "noavx,noavx,avx,avx")])
>
> (define_expand "cond_"
>-  [(set (match_operand:VF 0 "register_operand")
>-  (vec_merge:VF
>-(plusminus:VF
>-  (match_operand:VF 2 "vector_operand")
>-  (match_operand:VF 3 "vector_operand"))
>-(match_operand:VF 4 "nonimm_or_0_operand")
>+  [(set (match_operand:VFH 0 "register_operand")
>+  (vec_merge:VFH
>+(plusminus:VFH
>+  (match_operand:VFH 2 "vector_operand")
>+  (match_operand:VFH 3 "vector_operand"))
>+(match_operand:VFH 4 "nonimm_or_0_operand")
> (match_operand: 1 "register_operand")))]
>   " == 64 || TARGET_AVX512VL"
> {
>@@ -2207,12 +2207,12 @@
>(set_attr "mode" "")])
>
> (define_expand "cond_mul"
>-  [(set (match_operand:VF 0 "register_operand")
>-  (vec_merge:VF
>-(mult:VF
>-  (match_operand:VF 2 "vector_operand")
>-  (match_operand:VF 3 "vector_operand"))
>-(match_operand:VF 4 "nonimm_or_0_operand")
>+  [(set (match_operand:VFH 0 "register_operand")
>+  (vec_merge:VFH
>+(mult:VFH
>+  (match_operand:VFH 2 "vector_operand")
>+  (match_operand:VFH 3 "vector_operand"))
>+(match_operand:VFH 4 "nonimm_or_0_operand")
> (match_operand: 1 "register_operand")))]
>   " == 64 || TARGET_AVX512VL"
> {
>@@ -2322,12 +2322,12 @@
> })
>
> (define_expand "cond_div"
>-  [(set (match_operand:VF 0 "register_operand")
>-  (vec_merge:VF
>-(div:VF
>-  (match_operand:VF 2 "register_operand")
>-  (match_operand:VF 3 "vector_operand"))
>-(match_operand:VF 4 "nonimm_or_0_operand")
>+  [(set (match_operand:VFH 0 "register_operand")
>+  (vec_merge:VFH
>+(div:VFH
>+  (match_operand:VFH 2 "register_operand")
>+  (match_operand:VFH 3 "vector_operand"))
>+(match_operand:VFH 4 "nonimm_or_0_operand")
> (match_operand: 1 "register_operand")))]
>   " == 64 || TARGET_AVX512VL"
> {
>@@ -2660,12 +2660,12 @@
>(set_attr "mode" "HF")])
>
> (define_expand "cond_"
>-  [(set (match_operand:VF 0 "register_operand")
>-  (vec_merge:VF
>-(smaxmin:VF
>-  (match_operand:VF 2 "vector_operand")
>-  (match_operand:VF 3 "vector_operand"))
>-(match_operand:VF 4 "nonimm_or_0_operand")
>+  [(set (match_operand:VFH 0 "register_operand")
>+  (vec_merge:VFH
>+(smaxmin:VFH
>+  (match_operand:VFH 2 "vector_operand")
>+  (match_operand:VFH 3 "vector_operand"))
>+(match_operand:VFH 4 "nonimm_or_0_operand")
> (match_operand: 1 "register_operand")))]
>   " == 64 || TARGET_AVX512VL"
> {
>@@ -4785,13 +4785,13 @@
>(set_attr "mode" "")])
>
> (define_expand "cond_fma"
>-

RE: [PATCH] Support logic shift left/right for avx512 mask type.

2021-07-21 Thread Liu, Hongtao via Gcc-patches



>-Original Message-
>From: Uros Bizjak 
>Sent: Wednesday, July 21, 2021 4:23 PM
>To: Hongtao Liu 
>Cc: Liu, Hongtao ; gcc-patches@gcc.gnu.org; H. J. Lu
>; Richard Biener 
>Subject: Re: [PATCH] Support logic shift left/right for avx512 mask type.
>
>On Wed, Jul 21, 2021 at 5:05 AM Hongtao Liu  wrote:
>>
>> On Tue, Jul 20, 2021 at 9:41 PM Uros Bizjak  wrote:
>> >
>> > On Tue, Jul 20, 2021 at 2:33 PM liuhongt  wrote:
>> > >
>> > > Hi:
>> > >   As mention in
>> > > https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575420.html
>> > >
>> > > cut start-
>> > > > note for the lowpart we can just view-convert away the excess
>> > > > bits, fully re-using the mask.  We generate surprisingly "good" code:
>> > > >
>> > > > kmovb   %k1, %edi
>> > > > shrb$4, %dil
>> > > > kmovb   %edi, %k2
>> > > >
>> > > > besides the lack of using kshiftrb.  I guess we're just lacking
>> > > > a mask register alternative for
>> > > Yes, we can do it similar as kor/kand/kxor.
>> > > ---cut end
>> > >
>> > >   Bootstrap and regtested on x86_64-linux-gnu{-m32,}.
>> > >   Ok for trunk?
>> > >
>> > > gcc/ChangeLog:
>> > >
>> > > * config/i386/constraints.md (Wb): New constraint.
>> > > (Ww): Ditto.
>> > > * config/i386/i386.md (*ashlhi3_1): Extend to avx512 mask
>> > > shift.
>> > > (*ashlqi3_1): Ditto.
>> > > (*3_1): Ditto.
>> > > (*3_1): Ditto.
>> > > * config/i386/sse.md (k): New define_split after
>> > > it to convert generic shift pattern to mask shift ones.
>> > >
>> > > gcc/testsuite/ChangeLog:
>> > >
>> > > * gcc.target/i386/mask-shift.c: New test.
>
>
>+(define_insn "*lshr3_1"
>+  [(set (match_operand:SWI12 0 "nonimmediate_operand" "=m, ?k")
>+(lshiftrt:SWI12
>+  (match_operand:SWI12 1 "nonimmediate_operand" "0, k")
>+  (match_operand:QI 2 "nonmemory_operand" "c, ")))
>+   (clobber (reg:CC FLAGS_REG))]
>+  "ix86_binary_operator_ok (LSHIFTRT, mode, operands)"
>
>Also split this one to QImode and HImode to avoid conditions in isa attribute.
>
>OK with this change.
>

Thanks for the review, here's the patch I'm check in.

>Thanks,
>Uros.


V3-0001-Support-logic-shift-left-right-for-avx512-mask-type.patch
Description: V3-0001-Support-logic-shift-left-right-for-avx512-mask-type.patch

RE: [PATCH] Canonicalize (vec_duplicate (not A)) to (not (vec_duplicate A)).

2021-06-03 Thread Liu, Hongtao via Gcc-patches




>-Original Message-
>From: Segher Boessenkool 
>Sent: Friday, June 4, 2021 4:00 AM
>To: Liu, Hongtao 
>Cc: Richard Biener ; GCC Patches patc...@gcc.gnu.org>
>Subject: Re: [PATCH] Canonicalize (vec_duplicate (not A)) to (not
>(vec_duplicate A)).
>
>On Thu, Jun 03, 2021 at 11:03:43AM +, Liu, Hongtao wrote:
>> >A very typical example is how UMIN is optimised:
>> >
>> >   case UMIN:
>> >  if (trueop1 == CONST0_RTX (mode) && ! side_effects_p (op0))
>> >return op1;
>> >  if (rtx_equal_p (trueop0, trueop1) && ! side_effects_p (op0))
>> >return op0;
>> >  tem = simplify_associative_operation (code, mode, op0, op1);
>> >  if (tem)
>> >return tem;
>> >  break;
>> >
>> >(the stuff using "tem").
>> >
>> >Hongtao, can we do something similar here?  Does that work well?
>> >Please try it out :-)
>>
>> In simplify_rtx, no simplication occurs, there is just the difference
>> between  (vec_duplicate (not REG)) and (not (vec_duplicate (REG)). So here
>tem will only be 0.
>
>simplify-rtx is used by combine.  When you do and+not+splat for example my
>suggestion should kick in.  Try it out, don't just dismiss it?
>
Forgive my obtuseness, do you mean try the following changes, if so then there 
will be no "kick in", 
temp will be 0, there's no simplification here since it's just the difference 
between  (vec_duplicate (not REG))
 and (not (vec_duplicate (REG)). Or maybe you mean something else?

@@ -1708,6 +1708,17 @@ simplify_context::simplify_unary_operation_1 (rtx_code 
code, machine_mode mode,
 #endif
   break;

+  /* Canonicalize (vec_duplicate (not A)) to (not (vec_duplicate A)).  */
+case VEC_DUPLICATE:
+  if (GET_CODE (op) == NOT)
+   {
+ rtx vec_dup = gen_rtx_VEC_DUPLICATE (mode, XEXP (op, 0));
+ temp = simplify_unary_operation (NOT, mode, vec_dup, GET_MODE (op));
+ if (temp)
+   return temp;
+   }
+  break;
+
>> Basically we don't know it's a simplication until combine successfully
>> split the
>> 3->2 instructions (not + broadcast + and to andnot + broadcast), but
>> 3->it's pretty awkward
>> to do this in combine.
>
>But you need to do this *before* it is split.  That is the whole point.
>
>> Consider andnot is existed for many backends, I think a canonicalization is
>needed here.
>
>Please do note that that is not as easy as yoou may think: you need to make
>sure nothing ever creates non-canonical code.
>
>> Maybe we can add insn canonicalization for transforming (and
>> (vect_duplicate (not A)) B) to (and (not (duplicate (not A)) B) instead of
>(vec_duplicate (not A)) to (not (vec_duplicate A))?
>
>I don't understand what this means?
I mean let's give a last shot for andnot in case AND like below

@ -3702,6 +3702,16 @@ simplify_context::simplify_binary_operation_1 (rtx_code 
code,
   tem = simplify_associative_operation (code, mode, op0, op1);
   if (tem)
return tem;
+
+  if (GET_CODE (op0) == VEC_DUPLICATE
+ && GET_CODE (XEXP (op0, 0)) == NOT)
+   {
+ rtx vec_dup = gen_rtx_VEC_DUPLICATE (GET_MODE (op0),
+  XEXP (XEXP (op0, 0), 0));
+ return simplify_gen_binary (AND, mode,
+ gen_rtx_NOT (mode, vec_dup),
+ op1);
+   }
   break;
>
>
>Segher

RE: [PATCH] [i386] Fix ICE of insn does not satisfy its constraints.

2021-06-03 Thread Liu, Hongtao via Gcc-patches



>-Original Message-
>From: Jakub Jelinek 
>Sent: Thursday, June 3, 2021 9:49 PM
>To: Liu, Hongtao 
>Cc: gcc-patches@gcc.gnu.org
>Subject: Re: [PATCH] [i386] Fix ICE of insn does not satisfy its constraints.
>
>On Thu, Jun 03, 2021 at 05:07:26PM +0800, liuhongt via Gcc-patches wrote:
>> @@ -18163,10 +18163,10 @@ (define_expand "v16qiv16si2"
>>"TARGET_AVX512F")
>>
>>  (define_insn "avx2_v8qiv8si2"
>> -  [(set (match_operand:V8SI 0 "register_operand" "=v")
>> +  [(set (match_operand:V8SI 0 "register_operand" "=Yv")
>>  (any_extend:V8SI
>>(vec_select:V8QI
>> -(match_operand:V16QI 1 "register_operand" "v")
>> +(match_operand:V16QI 1 "register_operand" "Yv")
>>  (parallel [(const_int 0) (const_int 1)
>> (const_int 2) (const_int 3)
>> (const_int 4) (const_int 5)
>
>Why do you need this change (and similarly other v -> Yv changes)?
>I mean, ix86_hard_regno_mode_ok for TARGET_AVX512F
>&& !TARGET_AVX512VL should return false for the 16-byte and 32-byte vector
>modes.
>
>The reason to use Yv is typically where the match_operand has 64-byte vector
>mode or scalar mode, yet it needs an AVX512VL instruction.
>
>The changes to use Yw look ok, that is for the cases where the insn requires
>both AVX512VL and AVX512BW, while ix86_hard_regno_mode_ok ensures
>the xmm16+ regs won't be used for the 16/32-byte vectors when AVX512VL is
>not on, it doesn't ensure that AVX512BW will be enabled.
Thanks for the review.
Yes, you're right, AVX512VL parts are already guaranteed by 
ix86_hard_regno_mode_ok.

Here is updated patch.
>
>   Jakub



0001-i386-Fix-ICE-of-insn-does-not-satisfy-its-constraint_v2.patch
Description: 0001-i386-Fix-ICE-of-insn-does-not-satisfy-its-constraint_v2.patch

RE: [PATCH] Canonicalize (vec_duplicate (not A)) to (not (vec_duplicate A)).

2021-06-03 Thread Liu, Hongtao via Gcc-patches




>-Original Message-
>From: Segher Boessenkool 
>Sent: Thursday, June 3, 2021 4:46 AM
>To: Richard Biener 
>Cc: Liu, Hongtao ; GCC Patches patc...@gcc.gnu.org>
>Subject: Re: [PATCH] Canonicalize (vec_duplicate (not A)) to (not
>(vec_duplicate A)).
>
>Hi!
>
>On Wed, Jun 02, 2021 at 09:07:35AM +0200, Richard Biener wrote:
>> On Wed, Jun 2, 2021 at 7:41 AM liuhongt via Gcc-patches
>>  wrote:
>> > For i386, it will enable below opt
>> >
>> > from
>> > notl%edi
>> > vpbroadcastd%edi, %xmm0
>> > vpand   %xmm1, %xmm0, %xmm0
>> > to
>> > vpbroadcastd%edi, %xmm0
>> > vpandn   %xmm1, %xmm0, %xmm0
>>
>> There will be cases where (vec_duplicate (not A)) is better than (not
>> (vec_duplicate A)), so I'm not sure it is a good idea to forcefully
>> canonicalize unary operations.
>
>It is two unaries in sequence, where the order does not matter either.
>As in all such cases you either have to handle both cases everywhere, or have
>a canonical order.
>
>> I suppose the
>> simplification happens inside combine
>
>combine uses simplify-rtx for most cases (it is part of combine, but used in
>quite a few other places these days).
>
>> - doesn't combine
>> already have code to try variants of an expression and isn't this a
>> good candidate that can be added there, avoiding the canonicalization?
>
>As I mentioned, this is done in simplify-rtx in cases that do not have a
>canonical representation.  This is critical because it prevents loops.
>
>A very typical example is how UMIN is optimised:
>
>   case UMIN:
>  if (trueop1 == CONST0_RTX (mode) && ! side_effects_p (op0))
>   return op1;
>  if (rtx_equal_p (trueop0, trueop1) && ! side_effects_p (op0))
>   return op0;
>  tem = simplify_associative_operation (code, mode, op0, op1);
>  if (tem)
>   return tem;
>  break;
>
>(the stuff using "tem").
>
>Hongtao, can we do something similar here?  Does that work well?  Please try
>it out :-)

In simplify_rtx, no simplication occurs, there is just the difference between
 (vec_duplicate (not REG)) and (not (vec_duplicate (REG)). So here tem will 
only be 0.
Basically we don't know it's a simplication until combine successfully split the
3->2 instructions (not + broadcast + and to andnot + broadcast), but it's 
pretty awkward
to do this in combine.

Consider andnot is existed for many backends, I think a canonicalization is 
needed here.
Maybe we can add insn canonicalization for transforming (and (vect_duplicate 
(not A)) B) to 
(and (not (duplicate (not A)) B) instead of (vec_duplicate (not A)) to (not 
(vec_duplicate A))?

>
>
>Segher

RE: gcc-wwwdocs branch master updated. 88e29096c36837553fc841bd1fa5df6caa776b44

2020-11-05 Thread Liu, Hongtao via Gcc-patches




>-Original Message-
>From: Liu, Hongtao
>Sent: Friday, November 6, 2020 9:22 AM
>To: Gerald Pfeifer ; Hongtao Liu ;
>hongtao Liu 
>Cc: gcc-patches@gcc.gnu.org
>Subject: RE: gcc-wwwdocs branch master updated.
>88e29096c36837553fc841bd1fa5df6caa776b44
>
>
>
>>-Original Message-
>>From: Gerald Pfeifer 
>>Sent: Friday, November 6, 2020 5:57 AM
>>To: Hongtao Liu ; hongtao Liu
>>
>>Cc: gcc-patches@gcc.gnu.org
>>Subject: Re: gcc-wwwdocs branch master updated.
>>88e29096c36837553fc841bd1fa5df6caa776b44
>>
>>On Thu, 29 Oct 2020, hongtao Liu via Gcc-cvs-wwwdocs wrote:
>>> The branch, master has been updated
>>>via  88e29096c36837553fc841bd1fa5df6caa776b44 (commit)
>>>   from  053c956f6e9c71efac5be01f8a8ba79f15d87f4b (commit)
>>
>>>GCC now supports the Intel CPU named Alderlake through
>>>  -march=alderlake.
>>> -The switch enables the CLDEMOTE PTWRITE WAITPKG SERIALIZE ISA
>>extensions.
>>> +The switch enables the CLDEMOTE PTWRITE WAITPKG SERIALIZE
>>KEYLOCKER
>>> +ISA extensions.
>>
>>I did not see this posted on gcc-patches.  Should this list of
>>extensions be separated by commas?
>>

I realize you're talking about the patch for gcc-wwwdocs.
No, I didn't send out a patch, sorry for that, will do it in further commit.
  
>>(I can make that change if you agree.)
>>
>
>Yes, thanks for that.
>Patch for adding -march=alderlake  https://gcc.gnu.org/pipermail/gcc-
>patches/2020-July/549699.html
>Patch for Keylocker  https://gcc.gnu.org/pipermail/gcc-patches/2020-
>October/556026.html
>
>>Also, I did not see you in gcc/MAINTAINERS, or did miss it?
>>Since evidently you have write after approval access, please add
>>yourself there.
>>
>
>Will do.
>
>>Gerald

RE: gcc-wwwdocs branch master updated. 88e29096c36837553fc841bd1fa5df6caa776b44

2020-11-05 Thread Liu, Hongtao via Gcc-patches




>-Original Message-
>From: Gerald Pfeifer 
>Sent: Friday, November 6, 2020 5:57 AM
>To: Hongtao Liu ; hongtao Liu
>
>Cc: gcc-patches@gcc.gnu.org
>Subject: Re: gcc-wwwdocs branch master updated.
>88e29096c36837553fc841bd1fa5df6caa776b44
>
>On Thu, 29 Oct 2020, hongtao Liu via Gcc-cvs-wwwdocs wrote:
>> The branch, master has been updated
>>via  88e29096c36837553fc841bd1fa5df6caa776b44 (commit)
>>   from  053c956f6e9c71efac5be01f8a8ba79f15d87f4b (commit)
>
>>GCC now supports the Intel CPU named Alderlake through
>>  -march=alderlake.
>> -The switch enables the CLDEMOTE PTWRITE WAITPKG SERIALIZE ISA
>extensions.
>> +The switch enables the CLDEMOTE PTWRITE WAITPKG SERIALIZE
>KEYLOCKER
>> +ISA extensions.
>
>I did not see this posted on gcc-patches.  Should this list of extensions be
>separated by commas?
>
>(I can make that change if you agree.)
>

Yes, thanks for that.
Patch for adding -march=alderlake  
https://gcc.gnu.org/pipermail/gcc-patches/2020-July/549699.html
Patch for Keylocker  
https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556026.html

>Also, I did not see you in gcc/MAINTAINERS, or did miss it?
>Since evidently you have write after approval access, please add yourself
>there.
>

Will do.

>Gerald

45 matches

Mail list logo