RE: [PATCH] x86: Update model values for Raptorlake.

2023-08-14 Thread Cui, Lili via Gcc-patches
Sorry, I should have built the patch while backporting, and thanks for your
report and suggestions.
I'll backport another patch to fix the problems after the bootstraps finish,
probably in a couple of hours.

Thank you!
Lili.

> -Original Message-
> From: Jonathan Wakely 
> Sent: Monday, August 14, 2023 10:26 PM
> To: Cui, Lili 
> Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao 
> Subject: Re: [PATCH] x86: Update model values for Raptorlake.
> 
> On 14/08/23 15:19 +0100, Jonathan Wakely wrote:
> >On 14/08/23 04:37 +, Pan Li via Gcc-patches wrote:
> >>Committed as obvious, and backported to GCC13.
> >
> >Did you try building it on gcc-13?
> >
> >case 0x97:
> >case 0x9a:
> >case 0xbf:
> >  /* Alder Lake.  */
> >case 0xb7:
> >case 0xba:
> >case 0xbf:
> >  /* Raptor Lake.  */
> >
> >
> >This fails:
> >
> >In file included from /home/test/src/gcc-13/gcc/config/i386/driver-
> i386.cc:31:
> >/home/test/src/gcc-13/gcc/common/config/i386/cpuinfo.h: In function ‘const
> char* get_intel_cpu(__processor_model*, __processor_model2*, unsigned
> int*)’:
> >/home/test/src/gcc-13/gcc/common/config/i386/cpuinfo.h:543:5: error:
> duplicate case value
> >  543 | case 0xbf:
> >  | ^~~~
> >/home/test/src/gcc-13/gcc/common/config/i386/cpuinfo.h:539:5: note:
> previously used here
> >  539 | case 0xbf:
> >  | ^~~~
> >
> >Please fix or revert.
> 
> 
> The backported patch is not the same as the trunk one: it adds two new cases,
> not one. But one of them duplicates a case you already added in January
> 2022, in 4bd5297f665fd3ba5691297c016809f3501e7fba
> 
> No matter how obvious a patch is, if it touches code (not just comments or
> docs), please don't commit it without building it at least once.
> 
> Also, backports should typically say so in the git commit message; e.g.
> using git gcc-backport (or git cherry-pick -x) will automatically add:
> 
> (cherry picked from commit 003016a40844701c48851020df672b70f3446bdb)
> 
> to the commit message.
> 
> 
> 
> 
> 
> >>Lili.
> >>
> >>
> >>Update model values for Raptorlake according to SDM.
> >>
> >>gcc/ChangeLog
> >>
> >>* common/config/i386/cpuinfo.h (get_intel_cpu): Add model value
> 0xba
> >>to Raptorlake.
> >>---
> >>gcc/common/config/i386/cpuinfo.h | 1 +
> >>1 file changed, 1 insertion(+)
> >>
> >>diff --git a/gcc/common/config/i386/cpuinfo.h
> >>b/gcc/common/config/i386/cpuinfo.h
> >>index ae48bc17771..dd7f0f6abfd 100644
> >>--- a/gcc/common/config/i386/cpuinfo.h
> >>+++ b/gcc/common/config/i386/cpuinfo.h
> >>@@ -537,6 +537,7 @@ get_intel_cpu (struct __processor_model
> *cpu_model,
> >>case 0x9a:
> >>  /* Alder Lake.  */
> >>case 0xb7:
> >>+case 0xba:
> >>case 0xbf:
> >>  /* Raptor Lake.  */
> >>case 0xaa:



RE: Bootstrap fail on GCC 13 (was: Re: [PATCH] x86: Update model values for Alderlake, Rocketlake and Raptorlake.)

2023-08-14 Thread Cui, Lili via Gcc-patches
Sorry, I should have built the patch while backporting.
I'll backport another patch to fix the problems after the bootstraps finish,
probably in a couple of hours.

Thank you!
Lili.

> -Original Message-
> From: Tobias Burnus 
> Sent: Monday, August 14, 2023 5:34 PM
> To: gcc-patches@gcc.gnu.org; Cui, Lili 
> Subject: Bootstrap fail on GCC 13 (was: Re: [PATCH] x86: Update model values
> for Alderlake, Rocketlake and Raptorlake.)
> 
> Hi,
> 
> your GCC 13 commit
> https://gcc.gnu.org/r13-7720-g0fa76e35a5f9e1 x86: Update model values for
> Raptorlake.
> 
> causes a build fail:
> 
> gcc/common/config/i386/cpuinfo.h: In function ‘const char*
> get_intel_cpu(__processor_model*, __processor_model2*, unsigned int*)’:
> gcc/common/config/i386/cpuinfo.h:543:5: error: duplicate case value
>543 | case 0xbf:
>| ^~~~
> gcc/common/config/i386/cpuinfo.h:539:5: note: previously used here
>539 | case 0xbf:
>| ^~~~
> 
> Your patch did:
> 
>   case 0x97:
>   case 0x9a:
>   case 0xbf:   <<<<<< Existing case value
> /* Alder Lake.  */
>   case 0xb7:
> +case 0xba:
> +case 0xbf:   <<<<<< Newly added same case value
> /* Raptor Lake.  */
> 
> 
> Tobias
> 
> On 29.06.23 05:06, Cui, Lili via Gcc-patches wrote:
> > I will commit this patch directly, as it can be considered obvious.
> >
> > Thanks,
> > Lili.
> >
> >> -Original Message-
> >> From: Gcc-patches
> >>  On Behalf Of
> >> Cui, Lili via Gcc-patches
> >> Sent: Wednesday, June 28, 2023 6:52 PM
> >> To: gcc-patches@gcc.gnu.org
> >> Cc: Liu, Hongtao 
> >> Subject: [PATCH] x86: Update model values for Alderlake, Rocketlake
> >> and Raptorlake.
> >>
> >> Hi Hongtao,
> >>
> >> This patch is to update model values for Alderlake, Rocketlake and
> >> Raptorlake according to SDM.
> >>
> >> Ok for trunk?
> >>
> >> Thanks.
> >> Lili.
> >>
> >> Update model values for Alderlake, Rocketlake and Raptorlake
> >> according to SDM.
> >>
> >> gcc/ChangeLog
> >>
> >>  * common/config/i386/cpuinfo.h (get_intel_cpu): Remove model
> >> value 0xa8
> >>  from Rocketlake, move model value 0xbf from Alderlake to
> >> Raptorlake.
> >> ---
> >>   gcc/common/config/i386/cpuinfo.h | 3 +--
> >>   1 file changed, 1 insertion(+), 2 deletions(-)
> >>
> >> diff --git a/gcc/common/config/i386/cpuinfo.h
> >> b/gcc/common/config/i386/cpuinfo.h
> >> index 61559ed9de2..ae48bc17771 100644
> >> --- a/gcc/common/config/i386/cpuinfo.h
> >> +++ b/gcc/common/config/i386/cpuinfo.h
> >> @@ -463,7 +463,6 @@ get_intel_cpu (struct __processor_model
> >> *cpu_model,
> >> cpu_model->__cpu_subtype = INTEL_COREI7_SKYLAKE;
> >> break;
> >>   case 0xa7:
> >> -case 0xa8:
> >> /* Rocket Lake.  */
> >> cpu = "rocketlake";
> >> CHECK___builtin_cpu_is ("corei7"); @@ -536,9 +535,9 @@
> >> get_intel_cpu (struct __processor_model *cpu_model,
> >> break;
> >>   case 0x97:
> >>   case 0x9a:
> >> -case 0xbf:
> >> /* Alder Lake.  */
> >>   case 0xb7:
> >> +case 0xbf:
> >> /* Raptor Lake.  */
> >>   case 0xaa:
> >>   case 0xac:
> >> --
> >> 2.25.1
> -
> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201,
> 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer:
> Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München;
> Registergericht München, HRB 106955


[PATCH] x86: Update model values for Raptorlake.

2023-08-13 Thread Cui, Lili via Gcc-patches
Committed as obvious, and backported to GCC13.

Lili.


Update model values for Raptorlake according to the SDM.

gcc/ChangeLog

* common/config/i386/cpuinfo.h (get_intel_cpu): Add model value 0xba
to Raptorlake.
---
 gcc/common/config/i386/cpuinfo.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/common/config/i386/cpuinfo.h b/gcc/common/config/i386/cpuinfo.h
index ae48bc17771..dd7f0f6abfd 100644
--- a/gcc/common/config/i386/cpuinfo.h
+++ b/gcc/common/config/i386/cpuinfo.h
@@ -537,6 +537,7 @@ get_intel_cpu (struct __processor_model *cpu_model,
 case 0x9a:
   /* Alder Lake.  */
 case 0xb7:
+case 0xba:
 case 0xbf:
   /* Raptor Lake.  */
 case 0xaa:
-- 
2.25.1



RE: [PATCH] x86: Enable ENQCMD and UINTR for march=sierraforest.

2023-07-04 Thread Cui, Lili via Gcc-patches


> -Original Message-
> From: Hongtao Liu 
> Sent: Tuesday, July 4, 2023 4:27 PM
> To: Cui, Lili 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] x86: Enable ENQCMD and UINTR for march=sierraforest.
> 
> On Tue, Jul 4, 2023 at 4:15 PM Cui, Lili  wrote:
> >
> > From: Lili Cui 
> >
> > Hi Maintainer,
> >
> > This patch is to enable ENQCMD and UINTR for march=sierraforest
> according to Intel ISE.
> >
> > Bootstrapped and regtested. Ok for trunk? And I will backport this patch to
> GCC13.
> Ok.

Committed and backported to GCC13, thanks.

Regards,
Lili.

> >
> > Thanks,
> > Lili.
> >
> > Enable ENQCMD and UINTR for march=sierraforest according to Intel ISE
> > https://cdrdv2.intel.com/v1/dl/getContent/671368
> >
> > gcc/ChangeLog
> >
> > * config/i386/i386.h: Add PTA_ENQCMD and PTA_UINTR to
> PTA_SIERRAFOREST.
> > * doc/invoke.texi: Update new isa to march=sierraforest and
> grandridge.
> > ---
> >  gcc/config/i386/i386.h | 2 +-
> >  gcc/doc/invoke.texi    |  7 ++++---
> >  2 files changed, 5 insertions(+), 4 deletions(-)
> >
> > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h index
> > 5ac9c78d3ba..84ebafdf2dc 100644
> > --- a/gcc/config/i386/i386.h
> > +++ b/gcc/config/i386/i386.h
> > @@ -2341,7 +2341,7 @@ constexpr wide_int_bitmask PTA_ALDERLAKE =
> PTA_TREMONT | PTA_ADX | PTA_AVX
> >| PTA_PCONFIG | PTA_PKU | PTA_VAES | PTA_VPCLMULQDQ |
> PTA_SERIALIZE
> >| PTA_HRESET | PTA_KL | PTA_WIDEKL | PTA_AVXVNNI;  constexpr
> > wide_int_bitmask PTA_SIERRAFOREST = PTA_ALDERLAKE | PTA_AVXIFMA
> > -  | PTA_AVXVNNIINT8 | PTA_AVXNECONVERT | PTA_CMPCCXADD;
> > +  | PTA_AVXVNNIINT8 | PTA_AVXNECONVERT | PTA_CMPCCXADD |
> PTA_ENQCMD |
> > + PTA_UINTR;
> >  constexpr wide_int_bitmask PTA_GRANITERAPIDS = PTA_SAPPHIRERAPIDS
> | PTA_AMX_FP16
> >| PTA_PREFETCHI | PTA_AMX_COMPLEX;
> >  constexpr wide_int_bitmask PTA_GRANDRIDGE = PTA_SIERRAFOREST |
> > PTA_RAOINT; diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 26bcbe26c6c..dc385c1a3d8 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -32559,7 +32559,8 @@ SSSE3, SSE4.1, SSE4.2, POPCNT, AES,
> PREFETCHW,
> > PCLMUL, RDRND, XSAVE, XSAVEC,  XSAVES, XSAVEOPT, FSGSBASE, PTWRITE,
> > RDPID, SGX, GFNI-SSE, CLWB, MOVDIRI,  MOVDIR64B, CLDEMOTE,
> WAITPKG,
> > ADCX, AVX, AVX2, BMI, BMI2, F16C, FMA, LZCNT,  PCONFIG, PKU, VAES,
> > VPCLMULQDQ, SERIALIZE, HRESET, KL, WIDEKL, AVX-VNNI, -AVXIFMA,
> AVXVNNIINT8, AVXNECONVERT and CMPCCXADD instruction set support.
> > +AVXIFMA, AVXVNNIINT8, AVXNECONVERT, CMPCCXADD, ENQCMD and
> UINTR
> > +instruction set support.
> >
> >  @item grandridge
> >  Intel Grand Ridge CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2,
> > SSE3, @@ -32567,8 +32568,8 @@ SSSE3, SSE4.1, SSE4.2, POPCNT, AES,
> > PREFETCHW, PCLMUL, RDRND, XSAVE, XSAVEC,  XSAVES, XSAVEOPT,
> FSGSBASE,
> > PTWRITE, RDPID, SGX, GFNI-SSE, CLWB, MOVDIRI,  MOVDIR64B,
> CLDEMOTE,
> > WAITPKG, ADCX, AVX, AVX2, BMI, BMI2, F16C, FMA, LZCNT,  PCONFIG, PKU,
> > VAES, VPCLMULQDQ, SERIALIZE, HRESET, KL, WIDEKL, AVX-VNNI, -AVXIFMA,
> > AVXVNNIINT8, AVXNECONVERT, CMPCCXADD and RAOINT instruction set -
> support.
> > +AVXIFMA, AVXVNNIINT8, AVXNECONVERT, CMPCCXADD, ENQCMD, UINTR
> and
> > +RAOINT instruction set support.
> >
> >  @item knl
> >  Intel Knight's Landing CPU with 64-bit extensions, MOVBE, MMX, SSE,
> > SSE2, SSE3,
> > --
> > 2.25.1
> >
> 
> 
> --
> BR,
> Hongtao


[PATCH] x86: Enable ENQCMD and UINTR for march=sierraforest.

2023-07-04 Thread Cui, Lili via Gcc-patches
From: Lili Cui 

Hi Maintainer,

This patch enables ENQCMD and UINTR for march=sierraforest according to the
Intel ISE.

Bootstrapped and regtested. Ok for trunk? And I will backport this patch to 
GCC13.

Thanks,
Lili.

Enable ENQCMD and UINTR for march=sierraforest according to Intel ISE
https://cdrdv2.intel.com/v1/dl/getContent/671368

gcc/ChangeLog

* config/i386/i386.h: Add PTA_ENQCMD and PTA_UINTR to PTA_SIERRAFOREST.
* doc/invoke.texi: Add the new ISAs to march=sierraforest and grandridge.
---
 gcc/config/i386/i386.h | 2 +-
 gcc/doc/invoke.texi    |  7 ++++---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 5ac9c78d3ba..84ebafdf2dc 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2341,7 +2341,7 @@ constexpr wide_int_bitmask PTA_ALDERLAKE = PTA_TREMONT | 
PTA_ADX | PTA_AVX
   | PTA_PCONFIG | PTA_PKU | PTA_VAES | PTA_VPCLMULQDQ | PTA_SERIALIZE
   | PTA_HRESET | PTA_KL | PTA_WIDEKL | PTA_AVXVNNI;
 constexpr wide_int_bitmask PTA_SIERRAFOREST = PTA_ALDERLAKE | PTA_AVXIFMA
-  | PTA_AVXVNNIINT8 | PTA_AVXNECONVERT | PTA_CMPCCXADD;
+  | PTA_AVXVNNIINT8 | PTA_AVXNECONVERT | PTA_CMPCCXADD | PTA_ENQCMD | 
PTA_UINTR;
 constexpr wide_int_bitmask PTA_GRANITERAPIDS = PTA_SAPPHIRERAPIDS | 
PTA_AMX_FP16
   | PTA_PREFETCHI | PTA_AMX_COMPLEX;
 constexpr wide_int_bitmask PTA_GRANDRIDGE = PTA_SIERRAFOREST | PTA_RAOINT;
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 26bcbe26c6c..dc385c1a3d8 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -32559,7 +32559,8 @@ SSSE3, SSE4.1, SSE4.2, POPCNT, AES, PREFETCHW, PCLMUL, 
RDRND, XSAVE, XSAVEC,
 XSAVES, XSAVEOPT, FSGSBASE, PTWRITE, RDPID, SGX, GFNI-SSE, CLWB, MOVDIRI,
 MOVDIR64B, CLDEMOTE, WAITPKG, ADCX, AVX, AVX2, BMI, BMI2, F16C, FMA, LZCNT,
 PCONFIG, PKU, VAES, VPCLMULQDQ, SERIALIZE, HRESET, KL, WIDEKL, AVX-VNNI,
-AVXIFMA, AVXVNNIINT8, AVXNECONVERT and CMPCCXADD instruction set support.
+AVXIFMA, AVXVNNIINT8, AVXNECONVERT, CMPCCXADD, ENQCMD and UINTR instruction set
+support.
 
 @item grandridge
 Intel Grand Ridge CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3,
@@ -32567,8 +32568,8 @@ SSSE3, SSE4.1, SSE4.2, POPCNT, AES, PREFETCHW, PCLMUL, 
RDRND, XSAVE, XSAVEC,
 XSAVES, XSAVEOPT, FSGSBASE, PTWRITE, RDPID, SGX, GFNI-SSE, CLWB, MOVDIRI,
 MOVDIR64B, CLDEMOTE, WAITPKG, ADCX, AVX, AVX2, BMI, BMI2, F16C, FMA, LZCNT,
 PCONFIG, PKU, VAES, VPCLMULQDQ, SERIALIZE, HRESET, KL, WIDEKL, AVX-VNNI,
-AVXIFMA, AVXVNNIINT8, AVXNECONVERT, CMPCCXADD and RAOINT instruction set
-support.
+AVXIFMA, AVXVNNIINT8, AVXNECONVERT, CMPCCXADD, ENQCMD, UINTR and RAOINT
+instruction set support.
 
 @item knl
 Intel Knight's Landing CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3,
-- 
2.25.1



RE: [PATCH] PR gcc/110148: Avoid adding loop-carried ops to long chains

2023-06-29 Thread Cui, Lili via Gcc-patches


> -Original Message-
> From: Richard Biener 
> Sent: Thursday, June 29, 2023 2:42 PM
> To: Cui, Lili 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] PR gcc/110148:Avoid adding loop-carried ops to long
> chains
> 
> On Thu, Jun 29, 2023 at 3:49 AM Cui, Lili  wrote:
> >
> > From: Lili Cui 
> >
> > Hi Maintainer
> >
> > This patch is to fix TSVC242 regression related to loop-carried ops.
> >
> > Bootstrapped and regtested. Ok for trunk?
> 
> OK.
> 
Committed, thanks Richard.

Regards,
Lili.

> Thanks,
> Richard.
> 
> > Regards
> > Lili.
> >
> > Avoid adding loop-carried ops to long chains, otherwise the whole
> > chain will have dependencies across the loop iteration. Just keep
> > loop-carried ops in a separate chain.
> >E.g.
> >x_1 = phi(x_0, x_2)
> >y_1 = phi(y_0, y_2)
> >
> >a + b + c + d + e + x1 + y1
> >
> >SSA1 = a + b;
> >SSA2 = c + d;
> >SSA3 = SSA1 + e;
> >SSA4 = SSA3 + SSA2;
> >SSA5 = x1 + y1;
> >SSA6 = SSA4 + SSA5;
> >
> > With the patch applied, these test cases improved by 32%~100%.
> >
> > S242:
> > for (int i = 1; i < LEN_1D; ++i) {
> > a[i] = a[i - 1] + s1 + s2 + b[i] + c[i] + d[i];}
> >
> > Case 1:
> > for (int i = 1; i < LEN_1D; ++i) {
> > a[i] = a[i - 1] + s1 + s2 + b[i] + c[i] + d[i] + e[i];}
> >
> > Case 2:
> > for (int i = 1; i < LEN_1D; ++i) {
> > a[i] = a[i - 1] + b[i - 1] + s1 + s2 + b[i] + c[i] + d[i] + e[i];}
> >
> > The value is the execution time
> > A: original version
> > B: with FMA patch g:e5405f065bace0685cb3b8878d1dfc7a6e7ef409(base
> on
> > A)
> > C: with current patch(base on B)
> >
> >         A       B       C       B/A          C/A
> > s242    2.859   5.152   2.859   1.802028681  1
> > case 1  5.489   5.488   3.511   0.999818     0.64
> > case 2  7.216   7.499   4.885   1.039218     0.68
> >
> > gcc/ChangeLog:
> > PR tree-optimization/110148
> > * tree-ssa-reassoc.cc (rewrite_expr_tree_parallel): Handle 
> > loop-carried
> > ops in this function.
> > ---
> >  gcc/tree-ssa-reassoc.cc | 236
> > 
> >  1 file changed, 167 insertions(+), 69 deletions(-)
> >
> > diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc index
> > 96c88ec003e..f5da385e0b2 100644
> > --- a/gcc/tree-ssa-reassoc.cc
> > +++ b/gcc/tree-ssa-reassoc.cc
> > @@ -5471,37 +5471,62 @@ get_reassociation_width (int ops_num, enum
> tree_code opc,
> >return width;
> >  }
> >
> > +#define SPECIAL_BIASED_END_STMT 0 /* It is the end stmt of all ops.  */
> > +#define BIASED_END_STMT 1 /* It is the end stmt of normal or biased ops.  */
> > +#define NORMAL_END_STMT 2 /* It is the end stmt of normal ops.  */
> > +
> >  /* Rewrite statements with dependency chain with regard the chance to
> generate
> > FMA.
> > For the chain with FMA: Try to keep fma opportunity as much as possible.
> > For the chain without FMA: Putting the computation in rank order and
> trying
> > to allow operations to be executed in parallel.
> > E.g.
> > -   e + f + g + a * b + c * d;
> > +   e + f + a * b + c * d;
> >
> > -   ssa1 = e + f;
> > -   ssa2 = g + a * b;
> > -   ssa3 = ssa1 + c * d;
> > -   ssa4 = ssa2 + ssa3;
> > +   ssa1 = e + a * b;
> > +   ssa2 = f + c * d;
> > +   ssa3 = ssa1 + ssa2;
> >
> > This reassociation approach preserves the chance of fma generation as
> much
> > -   as possible.  */
> > +   as possible.
> > +
> > +   Another thing is to avoid adding loop-carried ops to long chains,
> otherwise
> > +   the whole chain will have dependencies across the loop iteration. Just
> keep
> > +   loop-carried ops in a separate chain.
> > +   E.g.
> > +   x_1 = phi(x_0, x_2)
> > +   y_1 = phi(y_0, y_2)
> > +
> > +   a + b + c + d + e + x1 + y1
> > +
> > +   SSA1 = a + b;
> > +   SSA2 = c + d;
> > +   SSA3 = SSA1 + e;
> > +   SSA4 = SSA3 + SSA2;
> > +   SSA5 = x1 + y1;
> > +   SSA6 = SSA4 + SSA5;
> > + */
> >  static void
> >  rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
> > -const vec )
> > +   const vec )
> >  {
> >enum tree_code opcode = gimple_assign_rhs_code (stmt);
> >int op_num = ops.length ();
> > +  int op_normal_num = op_num;
> >gcc_assert (op_num > 0);
> >int stmt_num = op_num - 1;
> >gimple **stmts = XALLOCAVEC (gimple *, stmt_num);
> > -  int op_index = op_num - 1;
> > -  int width_count = width;
> >int i = 0, j = 0;
> >tree tmp_op[2], op1;
> >operand_entry *oe;
> >gimple *stmt1 = NULL;
> >tree last_rhs1 = gimple_assign_rhs1 (stmt);
> > +  int last_rhs1_stmt_index = 0, last_rhs2_stmt_index = 0;
> > +  int width_active = 0, width_count = 0;
> > +  bool has_biased = false, ops_changed = false;
> > +  auto_vec<operand_entry *> ops_normal;
> > +  auto_vec<operand_entry *> ops_biased;
> > +  vec<operand_entry *> *ops1;
> >
> >/* We start expression rewriting from the top statements.
> >   So, in this loop we create a full list of statements @@ -5510,83
> > +5535,155 @@ 

RE: [PATCH] x86: Update model values for Alderlake, Rocketlake and Raptorlake.

2023-06-28 Thread Cui, Lili via Gcc-patches
I will commit this patch directly, as it can be considered obvious.

Thanks,
Lili.

> -Original Message-
> From: Gcc-patches  On
> Behalf Of Cui, Lili via Gcc-patches
> Sent: Wednesday, June 28, 2023 6:52 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Liu, Hongtao 
> Subject: [PATCH] x86: Update model values for Alderlake, Rocketlake and
> Raptorlake.
> 
> Hi Hongtao,
> 
> This patch is to update model values for Alderlake, Rocketlake and
> Raptorlake according to SDM.
> 
> Ok for trunk?
> 
> Thanks.
> Lili.
> 
> Update model values for Alderlake, Rocketlake and Raptorlake according to
> SDM.
> 
> gcc/ChangeLog
> 
>   * common/config/i386/cpuinfo.h (get_intel_cpu): Remove model
> value 0xa8
>   from Rocketlake, move model value 0xbf from Alderlake to
> Raptorlake.
> ---
>  gcc/common/config/i386/cpuinfo.h | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/gcc/common/config/i386/cpuinfo.h
> b/gcc/common/config/i386/cpuinfo.h
> index 61559ed9de2..ae48bc17771 100644
> --- a/gcc/common/config/i386/cpuinfo.h
> +++ b/gcc/common/config/i386/cpuinfo.h
> @@ -463,7 +463,6 @@ get_intel_cpu (struct __processor_model
> *cpu_model,
>cpu_model->__cpu_subtype = INTEL_COREI7_SKYLAKE;
>break;
>  case 0xa7:
> -case 0xa8:
>/* Rocket Lake.  */
>cpu = "rocketlake";
>CHECK___builtin_cpu_is ("corei7"); @@ -536,9 +535,9 @@ get_intel_cpu
> (struct __processor_model *cpu_model,
>break;
>  case 0x97:
>  case 0x9a:
> -case 0xbf:
>/* Alder Lake.  */
>  case 0xb7:
> +case 0xbf:
>/* Raptor Lake.  */
>  case 0xaa:
>  case 0xac:
> --
> 2.25.1



[PATCH] PR gcc/110148: Avoid adding loop-carried ops to long chains

2023-06-28 Thread Cui, Lili via Gcc-patches
From: Lili Cui 

Hi Maintainer

This patch fixes a TSVC s242 regression related to loop-carried ops.

Bootstrapped and regtested. Ok for trunk?

Regards
Lili.

Avoid adding loop-carried ops to long chains; otherwise the whole chain will
have dependencies across the loop iteration. Just keep loop-carried ops in a
separate chain.
   E.g.
   x_1 = phi(x_0, x_2)
   y_1 = phi(y_0, y_2)

   a + b + c + d + e + x1 + y1

   SSA1 = a + b;
   SSA2 = c + d;
   SSA3 = SSA1 + e;
   SSA4 = SSA3 + SSA2;
   SSA5 = x1 + y1;
   SSA6 = SSA4 + SSA5;

With the patch applied, these test cases improved by 32% to 100%.

S242:
for (int i = 1; i < LEN_1D; ++i) {
    a[i] = a[i - 1] + s1 + s2 + b[i] + c[i] + d[i];
}

Case 1:
for (int i = 1; i < LEN_1D; ++i) {
    a[i] = a[i - 1] + s1 + s2 + b[i] + c[i] + d[i] + e[i];
}

Case 2:
for (int i = 1; i < LEN_1D; ++i) {
    a[i] = a[i - 1] + b[i - 1] + s1 + s2 + b[i] + c[i] + d[i] + e[i];
}

The values are execution times (lower is better).
A: original version
B: with FMA patch g:e5405f065bace0685cb3b8878d1dfc7a6e7ef409 (based on A)
C: with the current patch (based on B)

        A       B       C       B/A          C/A
s242    2.859   5.152   2.859   1.802028681  1
case 1  5.489   5.488   3.511   0.999818     0.64
case 2  7.216   7.499   4.885   1.039218     0.68

gcc/ChangeLog:
PR tree-optimization/110148
* tree-ssa-reassoc.cc (rewrite_expr_tree_parallel): Handle loop-carried
ops in this function.
---
 gcc/tree-ssa-reassoc.cc | 236 
 1 file changed, 167 insertions(+), 69 deletions(-)

diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index 96c88ec003e..f5da385e0b2 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -5471,37 +5471,62 @@ get_reassociation_width (int ops_num, enum tree_code 
opc,
   return width;
 }
 
+#define SPECIAL_BIASED_END_STMT 0 /* It is the end stmt of all ops.  */
+#define BIASED_END_STMT 1 /* It is the end stmt of normal or biased ops.  */
+#define NORMAL_END_STMT 2 /* It is the end stmt of normal ops.  */
+
 /* Rewrite statements with dependency chain with regard the chance to generate
FMA.
For the chain with FMA: Try to keep fma opportunity as much as possible.
For the chain without FMA: Putting the computation in rank order and trying
to allow operations to be executed in parallel.
E.g.
-   e + f + g + a * b + c * d;
+   e + f + a * b + c * d;
 
-   ssa1 = e + f;
-   ssa2 = g + a * b;
-   ssa3 = ssa1 + c * d;
-   ssa4 = ssa2 + ssa3;
+   ssa1 = e + a * b;
+   ssa2 = f + c * d;
+   ssa3 = ssa1 + ssa2;
 
This reassociation approach preserves the chance of fma generation as much
-   as possible.  */
+   as possible.
+
+   Another thing is to avoid adding loop-carried ops to long chains, otherwise
+   the whole chain will have dependencies across the loop iteration. Just keep
+   loop-carried ops in a separate chain.
+   E.g.
+   x_1 = phi(x_0, x_2)
+   y_1 = phi(y_0, y_2)
+
+   a + b + c + d + e + x1 + y1
+
+   SSA1 = a + b;
+   SSA2 = c + d;
+   SSA3 = SSA1 + e;
+   SSA4 = SSA3 + SSA2;
+   SSA5 = x1 + y1;
+   SSA6 = SSA4 + SSA5;
+ */
 static void
 rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
-const vec )
+   const vec )
 {
   enum tree_code opcode = gimple_assign_rhs_code (stmt);
   int op_num = ops.length ();
+  int op_normal_num = op_num;
   gcc_assert (op_num > 0);
   int stmt_num = op_num - 1;
   gimple **stmts = XALLOCAVEC (gimple *, stmt_num);
-  int op_index = op_num - 1;
-  int width_count = width;
   int i = 0, j = 0;
   tree tmp_op[2], op1;
   operand_entry *oe;
   gimple *stmt1 = NULL;
   tree last_rhs1 = gimple_assign_rhs1 (stmt);
+  int last_rhs1_stmt_index = 0, last_rhs2_stmt_index = 0; 
+  int width_active = 0, width_count = 0;
+  bool has_biased = false, ops_changed = false;
+  auto_vec<operand_entry *> ops_normal;
+  auto_vec<operand_entry *> ops_biased;
+  vec<operand_entry *> *ops1;
 
   /* We start expression rewriting from the top statements.
  So, in this loop we create a full list of statements
@@ -5510,83 +5535,155 @@ rewrite_expr_tree_parallel (gassign *stmt, int width, 
bool has_fma,
   for (i = stmt_num - 2; i >= 0; i--)
 stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
 
-  /* Width should not be larger than op_num / 2, since we can not create
+  /* Avoid adding loop-carried ops to long chains, first filter out the
+ loop-carried.  But we need to make sure that the length of the remainder
+ is not less than 4, which is the smallest ops length we can break the
+ dependency.  */
+  FOR_EACH_VEC_ELT (ops, i, oe)
+{
+  if (TREE_CODE (oe->op) == SSA_NAME
+ && bitmap_bit_p (biased_names, SSA_NAME_VERSION (oe->op))
+ && op_normal_num > 4)
+   {
+ ops_biased.safe_push (oe);
+ has_biased = true;
+ op_normal_num --;
+   }
+  else
+   ops_normal.safe_push (oe);
+}
+
+  /* Width should not be larger than ops length 

[PATCH] x86: Update model values for Alderlake, Rocketlake and Raptorlake.

2023-06-28 Thread Cui, Lili via Gcc-patches
Hi Hongtao,

This patch updates model values for Alderlake, Rocketlake and Raptorlake
according to the SDM.

Ok for trunk?

Thanks.
Lili.

Update model values for Alderlake, Rocketlake and Raptorlake according to the SDM.

gcc/ChangeLog

* common/config/i386/cpuinfo.h (get_intel_cpu): Remove model value 0xa8
from Rocketlake, move model value 0xbf from Alderlake to Raptorlake.
---
 gcc/common/config/i386/cpuinfo.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/gcc/common/config/i386/cpuinfo.h b/gcc/common/config/i386/cpuinfo.h
index 61559ed9de2..ae48bc17771 100644
--- a/gcc/common/config/i386/cpuinfo.h
+++ b/gcc/common/config/i386/cpuinfo.h
@@ -463,7 +463,6 @@ get_intel_cpu (struct __processor_model *cpu_model,
   cpu_model->__cpu_subtype = INTEL_COREI7_SKYLAKE;
   break;
 case 0xa7:
-case 0xa8:
   /* Rocket Lake.  */
   cpu = "rocketlake";
   CHECK___builtin_cpu_is ("corei7");
@@ -536,9 +535,9 @@ get_intel_cpu (struct __processor_model *cpu_model,
   break;
 case 0x97:
 case 0x9a:
-case 0xbf:
   /* Alder Lake.  */
 case 0xb7:
+case 0xbf:
   /* Raptor Lake.  */
 case 0xaa:
 case 0xac:
-- 
2.25.1



RE: [PATCH] Handle FMA friendly in reassoc pass

2023-06-07 Thread Cui, Lili via Gcc-patches
Hi Di,

The compile options I use are: "-march=native -Ofast -funroll-loops -flto"
I re-ran 503, 507, and 527 on two neoverse-n1 machines and found that one
machine fluctuated greatly; its score was only 70% of the other machine's. I
also couldn't reproduce the gain on the stable machine. As for the 527
regression, I can't reproduce it either, and the data seems stable.

Regards,
Lili.

> -Original Message-
> From: Di Zhao OS 
> Sent: Wednesday, June 7, 2023 11:48 AM
> To: Cui, Lili ; gcc-patches@gcc.gnu.org
> Cc: richard.guent...@gmail.com; li...@linux.ibm.com
> Subject: RE: [PATCH] Handle FMA friendly in reassoc pass
> 
> Hello Lili Cui,
> 
> Since I'm also trying to improve this lately, I've tested your patch on 
> several
> aarch64 machines we have, including neoverse-n1 and ampere1
> architectures. However, I haven't reproduced the 6.00% improvement of
> 503.bwaves_r single copy run you mentioned. Could you share more
> information about the aarch64 CPU and compile options you tested? The
> option I'm using is "-Ofast", with or without "--param avoid-fma-max-
> bits=512".
> 
> Additionally, we found some spec2017 cases with regressions, including 4%
> on 527.cam4_r (neoverse-n1).
> 
> > -Original Message-
> > From: Gcc-patches On Behalf Of Cui, Lili via Gcc-patches
> > Sent: Thursday, May 25, 2023 7:30 AM
> > To: gcc-patches@gcc.gnu.org
> > Cc: richard.guent...@gmail.com; li...@linux.ibm.com; Lili Cui
> > 
> > Subject: [PATCH] Handle FMA friendly in reassoc pass
> >
> > From: Lili Cui 
> >
> > Make some changes in the reassoc pass to make it more friendly to the fma
> > pass later. Using FMA instead of mult + add reduces register pressure and
> > instructions retired.
> >
> > There are mainly two changes
> > 1. Put no-mult ops and mult ops alternately at the end of the queue,
> > which is conducive to generating more fma and reducing the loss of FMA
> > when breaking the chain.
> > 2. Rewrite the rewrite_expr_tree_parallel function to try to build
> > parallel chains according to the given correlation width, keeping the
> > FMA chance as much as possible.
> >
> > With the patch applied
> >
> > On ICX:
> > 507.cactuBSSN_r: Improved by 1.70% for multi-copy.
> > 503.bwaves_r   : Improved by 0.60% for single copy.
> > 507.cactuBSSN_r: Improved by 1.10% for single copy.
> > 519.lbm_r  : Improved by 2.21% for single copy.
> > no measurable changes for other benchmarks.
> >
> > On aarch64
> > 507.cactuBSSN_r: Improved by 1.7% for multi-copy.
> > 503.bwaves_r   : Improved by 6.00% for single-copy.
> > no measurable changes for other benchmarks.
> >
> > TEST1:
> >
> > float
> > foo (float a, float b, float c, float d, float *e) {
> >return  *e  + a * b + c * d ;
> > }
> >
> > For "-Ofast -mfpmath=sse -mfma" GCC generates:
> > vmulss  %xmm3, %xmm2, %xmm2
> > vfmadd132ss %xmm1, %xmm2, %xmm0
> > vaddss  (%rdi), %xmm0, %xmm0
> > ret
> >
> > With this patch GCC generates:
> > vfmadd213ss   (%rdi), %xmm1, %xmm0
> > vfmadd231ss   %xmm2, %xmm3, %xmm0
> > ret
> >
> > TEST2:
> >
> > for (int i = 0; i < N; i++)
> > {
> >   a[i] += b[i]* c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i]
> > * l[i]
> > + m[i]* o[i] + p[i];
> > }
> >
> > For "-Ofast -mfpmath=sse -mfma"  GCC generates:
> > vmovapd e(%rax), %ymm4
> > vmulpd  d(%rax), %ymm4, %ymm3
> > addq$32, %rax
> > vmovapd c-32(%rax), %ymm5
> > vmovapd j-32(%rax), %ymm6
> > vmulpd  h-32(%rax), %ymm6, %ymm2
> > vmovapd a-32(%rax), %ymm6
> > vaddpd  p-32(%rax), %ymm6, %ymm0
> > vmovapd g-32(%rax), %ymm7
> > vfmadd231pd b-32(%rax), %ymm5, %ymm3
> > vmovapd o-32(%rax), %ymm4
> > vmulpd  m-32(%rax), %ymm4, %ymm1
> > vmovapd l-32(%rax), %ymm5
> > vfmadd231pd f-32(%rax), %ymm7, %ymm2
> > vfmadd231pd k-32(%rax), %ymm5, %ymm1
> > vaddpd  %ymm3, %ymm0, %ymm0
> > vaddpd  %ymm2, %ymm0, %ymm0
> > vaddpd  %ymm1, %ymm0, %ymm0
> > vmovapd %ymm0, a-32(%rax)
> > cmpq$8192, %rax
> > jne .L4
> > vzeroupper
> > ret
> >
> > with this patch applied GCC breaks the chain with width = 2 and
> > generates 6
> > fma:
> >
> > vmovapd a(%rax), %ymm2
> > vmovapd c(%rax), %ymm0
> > addq$32, %rax
> > vmovapd e-32(%rax), %ymm1
> > vmovapd p-32(%rax), %ymm5
> > vmovapd g-32(%rax), %ymm3
> > vmovapd j-32(%rax), %ymm6
> > vmovapd l-32(%rax), %ymm4
> > vmovapd o-32(%rax), %ymm7
> > vfmadd132pd b-32(%rax), %ymm2, %ymm0
> > vfmadd132pd d-32(%rax), %ymm5, %ymm1
> > vfmadd231pd f-32(%rax), %ymm3, %ymm0
> > vfmadd231pd h-32(%rax), %ymm6, %ymm1
> > vfmadd231pd k-32(%rax), %ymm4, %ymm0
> > vfmadd231pd m-32(%rax), %ymm7, %ymm1
> > vaddpd  %ymm1, %ymm0, %ymm0
> > vmovapd %ymm0, a-32(%rax)
> > cmpq$8192, %rax
> > jne .L2
> > vzeroupper
> >   

RE: [PATCH] Fix ICE in rewrite_expr_tree_parallel

2023-05-31 Thread Cui, Lili via Gcc-patches
Committed, thanks Richard.

Lili.

> -Original Message-
> From: Richard Biener 
> Sent: Wednesday, May 31, 2023 3:22 PM
> To: Cui, Lili 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] Fix ICE in rewrite_expr_tree_parallel
> 
> On Wed, May 31, 2023 at 3:35 AM Cui, Lili  wrote:
> >
> > Hi,
> >
> > This patch is to fix ICE in rewrite_expr_tree_parallel.
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110038
> >
> > Bootstrapped and regtested. Ok for trunk?
> 
> OK.
> 
> > Regards
> > Lili.
> >
> > 1. Limit the value of tree-reassoc-width to IntegerRange(0, 256).
> > 2. Add width limit in rewrite_expr_tree_parallel.
> >
> > gcc/ChangeLog:
> >
> > PR tree-optimization/110038
> > * params.opt: Add a limit on tree-reassoc-width.
> > * tree-ssa-reassoc.cc
> > (rewrite_expr_tree_parallel): Add width limit.
> >
> > gcc/testsuite/ChangeLog:
> >
> > PR tree-optimization/110038
> > * gcc.dg/pr110038.c: New test.
> > ---
> >  gcc/params.opt  |  2 +-
> >  gcc/testsuite/gcc.dg/pr110038.c | 10 ++
> >  gcc/tree-ssa-reassoc.cc |  3 +++
> >  3 files changed, 14 insertions(+), 1 deletion(-)  create mode 100644
> > gcc/testsuite/gcc.dg/pr110038.c
> >
> > diff --git a/gcc/params.opt b/gcc/params.opt index
> > 66f1c99036a..70cfb495e3a 100644
> > --- a/gcc/params.opt
> > +++ b/gcc/params.opt
> > @@ -1091,7 +1091,7 @@ Common Joined UInteger Var(param_tracer_min_branch_ratio) Init(10) IntegerRange(
> >  Stop reverse growth if the reverse probability of best edge is less than this threshold (in percent).
> >
> >  -param=tree-reassoc-width=
> > -Common Joined UInteger Var(param_tree_reassoc_width) Param Optimization
> > +Common Joined UInteger Var(param_tree_reassoc_width) IntegerRange(0, 256) Param Optimization
> >  Set the maximum number of instructions executed in parallel in reassociated tree.  If 0, use the target dependent heuristic.
> >
> >  -param=tsan-distinguish-volatile=
> > diff --git a/gcc/testsuite/gcc.dg/pr110038.c b/gcc/testsuite/gcc.dg/pr110038.c
> > new file mode 100644
> > index 000..0f578b182ca
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/pr110038.c
> > @@ -0,0 +1,10 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O --param=tree-reassoc-width=256" } */
> > +
> > +unsigned a, b;
> > +
> > +void
> > +foo (unsigned c)
> > +{
> > +  a += b + c + 1;
> > +}
> > diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
> > index ad2f528ff07..f8055d59d57 100644
> > --- a/gcc/tree-ssa-reassoc.cc
> > +++ b/gcc/tree-ssa-reassoc.cc
> > @@ -5510,6 +5510,9 @@ rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
> >for (i = stmt_num - 2; i >= 0; i--)
> >  stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
> >
> > +   /* Width should not be larger than op_num/2.  */
> > +   width = width <= op_num / 2 ? width : op_num / 2;
> > +
> >/* Build parallel dependency chain according to width.  */
> >for (i = 0; i < width; i++)
> >  {
> > --
> > 2.25.1
> >


[PATCH] Fix ICE in rewrite_expr_tree_parallel

2023-05-30 Thread Cui, Lili via Gcc-patches
Hi,

This patch is to fix ICE in rewrite_expr_tree_parallel.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110038

Bootstrapped and regtested. Ok for trunk?

Regards
Lili.

1. Limit the value of tree-reassoc-width to IntegerRange(0, 256).
2. Add width limit in rewrite_expr_tree_parallel.

gcc/ChangeLog:

PR tree-optimization/110038
* params.opt: Add a limit on tree-reassoc-width.
* tree-ssa-reassoc.cc
(rewrite_expr_tree_parallel): Add width limit.

gcc/testsuite/ChangeLog:

PR tree-optimization/110038
* gcc.dg/pr110038.c: New test.
---
 gcc/params.opt  |  2 +-
 gcc/testsuite/gcc.dg/pr110038.c | 10 ++
 gcc/tree-ssa-reassoc.cc |  3 +++
 3 files changed, 14 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr110038.c

diff --git a/gcc/params.opt b/gcc/params.opt
index 66f1c99036a..70cfb495e3a 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1091,7 +1091,7 @@ Common Joined UInteger Var(param_tracer_min_branch_ratio) Init(10) IntegerRange(
 Stop reverse growth if the reverse probability of best edge is less than this threshold (in percent).
 
 -param=tree-reassoc-width=
-Common Joined UInteger Var(param_tree_reassoc_width) Param Optimization
+Common Joined UInteger Var(param_tree_reassoc_width) IntegerRange(0, 256) Param Optimization
 Set the maximum number of instructions executed in parallel in reassociated tree.  If 0, use the target dependent heuristic.
 
 -param=tsan-distinguish-volatile=
diff --git a/gcc/testsuite/gcc.dg/pr110038.c b/gcc/testsuite/gcc.dg/pr110038.c
new file mode 100644
index 000..0f578b182ca
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110038.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+/* { dg-options "-O --param=tree-reassoc-width=256" } */
+
+unsigned a, b;
+
+void
+foo (unsigned c)
+{
+  a += b + c + 1;
+}
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index ad2f528ff07..f8055d59d57 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -5510,6 +5510,9 @@ rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
   for (i = stmt_num - 2; i >= 0; i--)
 stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
 
+   /* Width should not be larger than op_num/2.  */
+   width = width <= op_num / 2 ? width : op_num / 2;
+
   /* Build parallel dependency chain according to width.  */
   for (i = 0; i < width; i++)
 {
-- 
2.25.1



RE: [PATCH] Handle FMA friendly in reassoc pass

2023-05-29 Thread Cui, Lili via Gcc-patches
I will rebase and commit this patch, thanks!

Lili.


> -Original Message-
> From: Cui, Lili 
> Sent: Thursday, May 25, 2023 7:30 AM
> To: gcc-patches@gcc.gnu.org
> Cc: richard.guent...@gmail.com; li...@linux.ibm.com; Cui, Lili
> 
> Subject: [PATCH] Handle FMA friendly in reassoc pass
> 
> From: Lili Cui 
> 
> Make some changes in reassoc pass to make it more friendly to fma pass
> later.
> Using FMA instead of mult + add reduces register pressure and instructions
> retired.
> 
> There are mainly two changes
> 1. Put no-mult ops and mult ops alternately at the end of the queue, which is
> conducive to generating more fma and reducing the loss of FMA when
> breaking the chain.
> 2. Rewrite the rewrite_expr_tree_parallel function to try to build parallel
> chains according to the given correlation width, keeping the FMA chance as
> much as possible.
> 
> With the patch applied
> 
> On ICX:
> 507.cactuBSSN_r: Improved by 1.7% for multi-copy.
> 503.bwaves_r   : Improved by 0.60% for single copy.
> 507.cactuBSSN_r: Improved by 1.10% for single copy.
> 519.lbm_r  : Improved by 2.21% for single copy.
> no measurable changes for other benchmarks.
> 
> On aarch64
> 507.cactuBSSN_r: Improved by 1.7% for multi-copy.
> 503.bwaves_r   : Improved by 6.00% for single-copy.
> no measurable changes for other benchmarks.
> 
> TEST1:
> 
> float
> foo (float a, float b, float c, float d, float *e)
> {
>return  *e  + a * b + c * d ;
> }
> 
> For "-Ofast -mfpmath=sse -mfma" GCC generates:
> vmulss  %xmm3, %xmm2, %xmm2
> vfmadd132ss %xmm1, %xmm2, %xmm0
> vaddss  (%rdi), %xmm0, %xmm0
> ret
> 
> With this patch GCC generates:
> vfmadd213ss   (%rdi), %xmm1, %xmm0
> vfmadd231ss   %xmm2, %xmm3, %xmm0
> ret
> 
> TEST2:
> 
> for (int i = 0; i < N; i++)
> {
> >   a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i] * o[i] + p[i];
> > }
> 
> For "-Ofast -mfpmath=sse -mfma"  GCC generates:
>   vmovapd e(%rax), %ymm4
>   vmulpd  d(%rax), %ymm4, %ymm3
>   addq    $32, %rax
>   vmovapd c-32(%rax), %ymm5
>   vmovapd j-32(%rax), %ymm6
>   vmulpd  h-32(%rax), %ymm6, %ymm2
>   vmovapd a-32(%rax), %ymm6
>   vaddpd  p-32(%rax), %ymm6, %ymm0
>   vmovapd g-32(%rax), %ymm7
>   vfmadd231pd b-32(%rax), %ymm5, %ymm3
>   vmovapd o-32(%rax), %ymm4
>   vmulpd  m-32(%rax), %ymm4, %ymm1
>   vmovapd l-32(%rax), %ymm5
>   vfmadd231pd f-32(%rax), %ymm7, %ymm2
>   vfmadd231pd k-32(%rax), %ymm5, %ymm1
>   vaddpd  %ymm3, %ymm0, %ymm0
>   vaddpd  %ymm2, %ymm0, %ymm0
>   vaddpd  %ymm1, %ymm0, %ymm0
>   vmovapd %ymm0, a-32(%rax)
>   cmpq    $8192, %rax
>   jne .L4
>   vzeroupper
>   ret
> 
> with this patch applied GCC breaks the chain with width = 2 and generates 6
> fma:
> 
>   vmovapd a(%rax), %ymm2
>   vmovapd c(%rax), %ymm0
>   addq    $32, %rax
>   vmovapd e-32(%rax), %ymm1
>   vmovapd p-32(%rax), %ymm5
>   vmovapd g-32(%rax), %ymm3
>   vmovapd j-32(%rax), %ymm6
>   vmovapd l-32(%rax), %ymm4
>   vmovapd o-32(%rax), %ymm7
>   vfmadd132pd b-32(%rax), %ymm2, %ymm0
>   vfmadd132pd d-32(%rax), %ymm5, %ymm1
>   vfmadd231pd f-32(%rax), %ymm3, %ymm0
>   vfmadd231pd h-32(%rax), %ymm6, %ymm1
>   vfmadd231pd k-32(%rax), %ymm4, %ymm0
>   vfmadd231pd m-32(%rax), %ymm7, %ymm1
>   vaddpd  %ymm1, %ymm0, %ymm0
>   vmovapd %ymm0, a-32(%rax)
>   cmpq    $8192, %rax
>   jne .L2
>   vzeroupper
>   ret
> 
> gcc/ChangeLog:
> 
>   PR gcc/98350
>   * tree-ssa-reassoc.cc
>   (rewrite_expr_tree_parallel): Rewrite this function.
>   (rank_ops_for_fma): New.
>   (reassociate_bb): Handle new function.
> 
> gcc/testsuite/ChangeLog:
> 
>   PR gcc/98350
>   * gcc.dg/pr98350-1.c: New test.
>   * gcc.dg/pr98350-2.c: Ditto.
> ---
>  gcc/testsuite/gcc.dg/pr98350-1.c |  31 
>  gcc/testsuite/gcc.dg/pr98350-2.c |  11 ++
>  gcc/tree-ssa-reassoc.cc  | 256 +--
>  3 files changed, 215 insertions(+), 83 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c
> 
> diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98350-1.c
> new file mode 100644
> index 000..6bcf78a19ab
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr98350-1.c
> @@ -0,0 +1,31 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast  -fdump-tree-widening_mul" } */
> +
> +/* Test that the compiler properly optimizes multiply and add
> +   to generate more FMA instructions.  */
> +#define N 1024
> +double a[N]; double b[N]; double c[N]; double d[N]; double e[N]; double f[N];
> +double g[N]; double h[N]; double j[N]; double k[N]; double l[N]; double m[N];
> +double o[N]; double p[N];
> +
> +
> +void
> +foo (void)
> +{
> +  for (int i = 0; i < N; i++)
> +  {

RE: [PATCH] PR gcc/98350:Handle FMA friendly in reassoc pass

2023-05-24 Thread Cui, Lili via Gcc-patches
> > +rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
> > +    const vec<operand_entry *> &ops)
> >  {
> >enum tree_code opcode = gimple_assign_rhs_code (stmt);
> >int op_num = ops.length ();
> > @@ -5483,10 +5494,11 @@ rewrite_expr_tree_parallel (gassign *stmt, int width,
> >int stmt_num = op_num - 1;
> >gimple **stmts = XALLOCAVEC (gimple *, stmt_num);
> >int op_index = op_num - 1;
> > -  int stmt_index = 0;
> > -  int ready_stmts_end = 0;
> > -  int i = 0;
> > -  gimple *stmt1 = NULL, *stmt2 = NULL;
> > +  int width_count = width;
> > +  int i = 0, j = 0;
> > +  tree tmp_op[2], op1;
> > +  operand_entry *oe;
> > +  gimple *stmt1 = NULL;
> >tree last_rhs1 = gimple_assign_rhs1 (stmt);
> >
> >/* We start expression rewriting from the top statements.
> > @@ -5496,91 +5508,84 @@ rewrite_expr_tree_parallel (gassign *stmt, int width,
> >for (i = stmt_num - 2; i >= 0; i--)
> >  stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
> >
> > -  for (i = 0; i < stmt_num; i++)
> > +  /* Build parallel dependency chain according to width.  */
> > +  for (i = 0; i < width; i++)
> >  {
> > -  tree op1, op2;
> > -
> > -  /* Determine whether we should use results of
> > -already handled statements or not.  */
> > -  if (ready_stmts_end == 0
> > - && (i - stmt_index >= width || op_index < 1))
> > -   ready_stmts_end = i;
> > -
> > -  /* Now we choose operands for the next statement.  Non zero
> > -value in ready_stmts_end means here that we should use
> > -the result of already generated statements as new operand.  */
> > -  if (ready_stmts_end > 0)
> > -   {
> > - op1 = gimple_assign_lhs (stmts[stmt_index++]);
> > - if (ready_stmts_end > stmt_index)
> > -   op2 = gimple_assign_lhs (stmts[stmt_index++]);
> > - else if (op_index >= 0)
> > -   {
> > - operand_entry *oe = ops[op_index--];
> > - stmt2 = oe->stmt_to_insert;
> > - op2 = oe->op;
> > -   }
> > - else
> > -   {
> > - gcc_assert (stmt_index < i);
> > - op2 = gimple_assign_lhs (stmts[stmt_index++]);
> > -   }
> > +  /*   */
> 
> empty comment?

Added it, thanks.

> 
> > +  if (op_index > 1 && !has_fma)
> > +   swap_ops_for_binary_stmt (ops, op_index - 2);
> >
> > - if (stmt_index >= ready_stmts_end)
> > -   ready_stmts_end = 0;
> > -   }
> > -  else
> > +  for (j = 0; j < 2; j++)
> > {
> > - if (op_index > 1)
> > -   swap_ops_for_binary_stmt (ops, op_index - 2);
> > - operand_entry *oe2 = ops[op_index--];
> > - operand_entry *oe1 = ops[op_index--];
> > - op2 = oe2->op;
> > - stmt2 = oe2->stmt_to_insert;
> > - op1 = oe1->op;
> > - stmt1 = oe1->stmt_to_insert;
> > + gcc_assert (op_index >= 0);
> > + oe = ops[op_index--];
> > + tmp_op[j] = oe->op;
> > + /* If the stmt that defines operand has to be inserted, insert it
> > +before the use.  */
> > + stmt1 = oe->stmt_to_insert;
> > + if (stmt1)
> > +   insert_stmt_before_use (stmts[i], stmt1);
> > + stmt1 = NULL;
> > }
> > -
> > -  /* If we emit the last statement then we should put
> > -operands into the last statement.  It will also
> > -break the loop.  */
> > -  if (op_index < 0 && stmt_index == i)
> > -   i = stmt_num - 1;
> > +  stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1), tmp_op[1], tmp_op[0], opcode);
> > +  gimple_set_visited (stmts[i], true);
> >
> >if (dump_file && (dump_flags & TDF_DETAILS))
> > {
> > - fprintf (dump_file, "Transforming ");
> > + fprintf (dump_file, " into ");
> >   print_gimple_stmt (dump_file, stmts[i], 0);
> > }
> > +}
> >
> > -  /* If the stmt that defines operand has to be inserted, insert it
> > -before the use.  */
> > -  if (stmt1)
> > -   insert_stmt_before_use (stmts[i], stmt1);
> > -  if (stmt2)
> > -   insert_stmt_before_use (stmts[i], stmt2);
> > -  stmt1 = stmt2 = NULL;
> > -
> > -  /* We keep original statement only for the last one.  All
> > -others are recreated.  */
> > -  if (i == stmt_num - 1)
> > +  for (i = width; i < stmt_num; i++)
> > +{
> > +  /* We keep original statement only for the last one.  All others are
> > +recreated.  */
> > +  if ( op_index < 0)
> > {
> > - gimple_assign_set_rhs1 (stmts[i], op1);
> > - gimple_assign_set_rhs2 (stmts[i], op2);
> > - update_stmt (stmts[i]);
> > + if (width_count == 2)
> > +   {
> > +
> > + /* We keep original statement only for the last one.  All
> > +others are recreated.  */
> > + 

[PATCH] Handle FMA friendly in reassoc pass

2023-05-24 Thread Cui, Lili via Gcc-patches
From: Lili Cui 

Make some changes in reassoc pass to make it more friendly to fma pass later.
Using FMA instead of mult + add reduces register pressure and instructions
retired.

There are mainly two changes
1. Put no-mult ops and mult ops alternately at the end of the queue, which is
conducive to generating more fma and reducing the loss of FMA when breaking
the chain.
2. Rewrite the rewrite_expr_tree_parallel function to try to build parallel
chains according to the given correlation width, keeping the FMA chance as
much as possible.

With the patch applied

On ICX:
507.cactuBSSN_r: Improved by 1.7% for multi-copy.
503.bwaves_r   : Improved by 0.60% for single copy.
507.cactuBSSN_r: Improved by 1.10% for single copy.
519.lbm_r  : Improved by 2.21% for single copy.
no measurable changes for other benchmarks.

On aarch64
507.cactuBSSN_r: Improved by 1.7% for multi-copy.
503.bwaves_r   : Improved by 6.00% for single-copy.
no measurable changes for other benchmarks.

TEST1:

float
foo (float a, float b, float c, float d, float *e)
{
   return  *e  + a * b + c * d ;
}

For "-Ofast -mfpmath=sse -mfma" GCC generates:
vmulss  %xmm3, %xmm2, %xmm2
vfmadd132ss %xmm1, %xmm2, %xmm0
vaddss  (%rdi), %xmm0, %xmm0
ret

With this patch GCC generates:
vfmadd213ss   (%rdi), %xmm1, %xmm0
vfmadd231ss   %xmm2, %xmm3, %xmm0
ret

TEST2:

for (int i = 0; i < N; i++)
{
  a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i] * o[i] + p[i];
}

For "-Ofast -mfpmath=sse -mfma"  GCC generates:
vmovapd e(%rax), %ymm4
vmulpd  d(%rax), %ymm4, %ymm3
addq    $32, %rax
vmovapd c-32(%rax), %ymm5
vmovapd j-32(%rax), %ymm6
vmulpd  h-32(%rax), %ymm6, %ymm2
vmovapd a-32(%rax), %ymm6
vaddpd  p-32(%rax), %ymm6, %ymm0
vmovapd g-32(%rax), %ymm7
vfmadd231pd b-32(%rax), %ymm5, %ymm3
vmovapd o-32(%rax), %ymm4
vmulpd  m-32(%rax), %ymm4, %ymm1
vmovapd l-32(%rax), %ymm5
vfmadd231pd f-32(%rax), %ymm7, %ymm2
vfmadd231pd k-32(%rax), %ymm5, %ymm1
vaddpd  %ymm3, %ymm0, %ymm0
vaddpd  %ymm2, %ymm0, %ymm0
vaddpd  %ymm1, %ymm0, %ymm0
vmovapd %ymm0, a-32(%rax)
cmpq    $8192, %rax
jne .L4
vzeroupper
ret

with this patch applied GCC breaks the chain with width = 2 and generates 6 fma:

vmovapd a(%rax), %ymm2
vmovapd c(%rax), %ymm0
addq    $32, %rax
vmovapd e-32(%rax), %ymm1
vmovapd p-32(%rax), %ymm5
vmovapd g-32(%rax), %ymm3
vmovapd j-32(%rax), %ymm6
vmovapd l-32(%rax), %ymm4
vmovapd o-32(%rax), %ymm7
vfmadd132pd b-32(%rax), %ymm2, %ymm0
vfmadd132pd d-32(%rax), %ymm5, %ymm1
vfmadd231pd f-32(%rax), %ymm3, %ymm0
vfmadd231pd h-32(%rax), %ymm6, %ymm1
vfmadd231pd k-32(%rax), %ymm4, %ymm0
vfmadd231pd m-32(%rax), %ymm7, %ymm1
vaddpd  %ymm1, %ymm0, %ymm0
vmovapd %ymm0, a-32(%rax)
cmpq    $8192, %rax
jne .L2
vzeroupper
ret

gcc/ChangeLog:

PR gcc/98350
* tree-ssa-reassoc.cc
(rewrite_expr_tree_parallel): Rewrite this function.
(rank_ops_for_fma): New.
(reassociate_bb): Handle new function.

gcc/testsuite/ChangeLog:

PR gcc/98350
* gcc.dg/pr98350-1.c: New test.
* gcc.dg/pr98350-2.c: Ditto.
---
 gcc/testsuite/gcc.dg/pr98350-1.c |  31 
 gcc/testsuite/gcc.dg/pr98350-2.c |  11 ++
 gcc/tree-ssa-reassoc.cc  | 256 +--
 3 files changed, 215 insertions(+), 83 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
 create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c

diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98350-1.c
new file mode 100644
index 000..6bcf78a19ab
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr98350-1.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast  -fdump-tree-widening_mul" } */
+
+/* Test that the compiler properly optimizes multiply and add 
+   to generate more FMA instructions.  */
+#define N 1024
+double a[N];
+double b[N];
+double c[N];
+double d[N];
+double e[N];
+double f[N];
+double g[N];
+double h[N];
+double j[N];
+double k[N];
+double l[N];
+double m[N];
+double o[N];
+double p[N];
+
+
+void
+foo (void)
+{
+  for (int i = 0; i < N; i++)
+  {
+    a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i] * o[i] + p[i];
+  }
+}
+/* { dg-final { scan-tree-dump-times { = \.FMA \(} 6 "widening_mul" } } */
diff --git a/gcc/testsuite/gcc.dg/pr98350-2.c b/gcc/testsuite/gcc.dg/pr98350-2.c
new file mode 100644
index 000..333d34f026a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr98350-2.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-widening_mul" } */
+
+/* 

RE: [PATCH] PR gcc/98350:Handle FMA friendly in reassoc pass

2023-05-18 Thread Cui, Lili via Gcc-patches
Attach CPU2017 3 run results:

On ICX: 
507.cactuBSSN_r: Improved by 1.7% for multi-copy.
503.bwaves_r   : Improved by 0.60% for single copy.
507.cactuBSSN_r: Improved by 1.10% for single copy.
519.lbm_r      : Improved by 2.21% for single copy.
no measurable changes for other benchmarks.

On aarch64 
507.cactuBSSN_r: Improved by 1.7% for multi-copy.
503.bwaves_r : Improved by 6.00% for single-copy.
no measurable changes for other benchmarks.

> -Original Message-
> From: Cui, Lili 
> Sent: Wednesday, May 17, 2023 9:02 PM
> To: gcc-patches@gcc.gnu.org
> Cc: richard.guent...@gmail.com; Cui, Lili 
> Subject: [PATCH] PR gcc/98350:Handle FMA friendly in reassoc pass
> 
> From: Lili Cui 
> 
> Make some changes in reassoc pass to make it more friendly to fma pass
> later.
> Using FMA instead of mult + add reduces register pressure and instructions
> retired.
> 
> There are mainly two changes
> 1. Put no-mult ops and mult ops alternately at the end of the queue, which is
> conducive to generating more fma and reducing the loss of FMA when
> breaking the chain.
> 2. Rewrite the rewrite_expr_tree_parallel function to try to build parallel
> chains according to the given correlation width, keeping the FMA chance as
> much as possible.
> 
> TEST1:
> 
> float
> foo (float a, float b, float c, float d, float *e)
> {
>return  *e  + a * b + c * d ;
> }
> 
> For "-Ofast -mfpmath=sse -mfma" GCC generates:
> vmulss  %xmm3, %xmm2, %xmm2
> vfmadd132ss %xmm1, %xmm2, %xmm0
> vaddss  (%rdi), %xmm0, %xmm0
> ret
> 
> With this patch GCC generates:
> vfmadd213ss   (%rdi), %xmm1, %xmm0
> vfmadd231ss   %xmm2, %xmm3, %xmm0
> ret
> 
> TEST2:
> 
> for (int i = 0; i < N; i++)
> {
> >   a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i] * o[i] + p[i];
> > }
> 
> For "-Ofast -mfpmath=sse -mfma"  GCC generates:
>   vmovapd e(%rax), %ymm4
>   vmulpd  d(%rax), %ymm4, %ymm3
>   addq    $32, %rax
>   vmovapd c-32(%rax), %ymm5
>   vmovapd j-32(%rax), %ymm6
>   vmulpd  h-32(%rax), %ymm6, %ymm2
>   vmovapd a-32(%rax), %ymm6
>   vaddpd  p-32(%rax), %ymm6, %ymm0
>   vmovapd g-32(%rax), %ymm7
>   vfmadd231pd b-32(%rax), %ymm5, %ymm3
>   vmovapd o-32(%rax), %ymm4
>   vmulpd  m-32(%rax), %ymm4, %ymm1
>   vmovapd l-32(%rax), %ymm5
>   vfmadd231pd f-32(%rax), %ymm7, %ymm2
>   vfmadd231pd k-32(%rax), %ymm5, %ymm1
>   vaddpd  %ymm3, %ymm0, %ymm0
>   vaddpd  %ymm2, %ymm0, %ymm0
>   vaddpd  %ymm1, %ymm0, %ymm0
>   vmovapd %ymm0, a-32(%rax)
>   cmpq    $8192, %rax
>   jne .L4
>   vzeroupper
>   ret
> 
> with this patch applied GCC breaks the chain with width = 2 and generates 6
> fma:
> 
>   vmovapd a(%rax), %ymm2
>   vmovapd c(%rax), %ymm0
>   addq    $32, %rax
>   vmovapd e-32(%rax), %ymm1
>   vmovapd p-32(%rax), %ymm5
>   vmovapd g-32(%rax), %ymm3
>   vmovapd j-32(%rax), %ymm6
>   vmovapd l-32(%rax), %ymm4
>   vmovapd o-32(%rax), %ymm7
>   vfmadd132pd b-32(%rax), %ymm2, %ymm0
>   vfmadd132pd d-32(%rax), %ymm5, %ymm1
>   vfmadd231pd f-32(%rax), %ymm3, %ymm0
>   vfmadd231pd h-32(%rax), %ymm6, %ymm1
>   vfmadd231pd k-32(%rax), %ymm4, %ymm0
>   vfmadd231pd m-32(%rax), %ymm7, %ymm1
>   vaddpd  %ymm1, %ymm0, %ymm0
>   vmovapd %ymm0, a-32(%rax)
>   cmpq    $8192, %rax
>   jne .L2
>   vzeroupper
>   ret
> 
> gcc/ChangeLog:
> 
>   PR gcc/98350
>   * tree-ssa-reassoc.cc
>   (rewrite_expr_tree_parallel): Rewrite this function.
>   (rank_ops_for_fma): New.
>   (reassociate_bb): Handle new function.
> 
> gcc/testsuite/ChangeLog:
> 
>   PR gcc/98350
>   * gcc.dg/pr98350-1.c: New test.
>   * gcc.dg/pr98350-2.c: Ditto.
> ---
>  gcc/testsuite/gcc.dg/pr98350-1.c |  31 
>  gcc/testsuite/gcc.dg/pr98350-2.c |  11 ++
>  gcc/tree-ssa-reassoc.cc  | 256 +--
>  3 files changed, 215 insertions(+), 83 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c
> 
> diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98350-1.c
> new file mode 100644
> index 000..185511c5e0a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr98350-1.c
> @@ -0,0 +1,31 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mfpmath=sse -mfma -Wno-attributes " } */
> +
> +/* Test that the compiler properly optimizes multiply and add
> +   to generate more FMA instructions.  */
> +#define N 1024
> +double a[N]; double b[N]; double c[N]; double d[N]; double e[N]; double f[N];
> +double g[N]; double h[N]; double j[N]; double k[N]; double l[N]; double m[N];
> +double o[N]; double p[N];
> +
> +
> +void
> +foo (void)
> +{
> +  for (int i = 0; i < N; i++)
> +  {
> +a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + 

RE: [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass

2023-05-17 Thread Cui, Lili via Gcc-patches
> I think to make a difference you need to hit the number of parallel fadd/fmul
> the pipeline can perform.  I don't think issue width is ever a problem for
> chains w/o fma and throughput of fma vs fadd + fmul should be similar.
> 

Yes, for the x86 backend, fadd, fmul and fma have the same throughput (TP), 
meaning they should have the same width.
The current implementation is reasonable: /* reassoc int, fp, vec_int, vec_fp.  */

> That said, I think iff then we should try to improve
> rewrite_expr_tree_parallel rather than adding a new function.  For example
> for the case with equal rank operands we can try to sort adds first.  I can't
> convince myself that rewrite_expr_tree_parallel honors ranks properly
> quickly.
> 

I rewrote this patch; there are mainly two changes:
1. I made some changes to rewrite_expr_tree_parallel_for_fma and used it 
instead of rewrite_expr_tree_parallel. The following example shows that the 
sequence generated by this patch is better.
2. Put no-mult ops and mult ops alternately at the end of the queue, which is 
conducive to generating more FMAs and reducing the loss of FMA when breaking the 
chain.
  
With these two changes, GCC can break the chain with width = 2 and generate 6 
FMAs for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350 without any params.

--
Source code: g + h + j + s + m + n+a+b +e  (https://godbolt.org/z/G8sb86n84)
Compile options: -Ofast -mfpmath=sse -mfma
Width = 3 was chosen for reassociation
-
Old rewrite_expr_tree_parallel generates:
  _6 = g_8(D) + h_9(D);   --> parallel 0
  _3 = s_11(D) + m_12(D);  --> parallel 1
  _5 = _3 + j_10(D);
  _2 = n_13(D) + a_14(D);   --> parallel 2
  _1 = b_15(D) + e_16(D);  --> parallel 3. This is not necessary, and it is 
not FMA-friendly.
  _4 = _1 + _2;
  _7 = _4 + _5;
  _17 = _6 + _7;  
  return _17;

When the width = 3,  we need 5 cycles here.
----- first end -----
Rewrite the old rewrite_expr_tree_parallel (3 sets in parallel) generates:

  _3 = s_11(D) + m_12(D);  --> parallel 0
  _5 = _3 + j_10(D);
  _2 = n_13(D) + a_14(D);   --> parallel 1
  _1 = b_15(D) + e_16(D);   --> parallel 2
  _4 = _1 + _2;
  _6 = _4 + _5;
  _7 = _6 + h_9(D);
  _17 = _7 + g_8(D); 
  return _17;

When the width = 3, we need 5 cycles here.
----- second end -----
Use rewrite_expr_tree_parallel_for_fma instead of rewrite_expr_tree_parallel 
generates:

  _3 = s_11(D) + m_12(D);
  _6 = _3 + g_8(D);
  _2 = n_13(D) + a_14(D);
  _5 = _2 + h_9(D);
  _1 = b_15(D) + e_16(D);
  _4 = _1 + j_10(D);
  _7 = _4 + _5;
  _17 = _7 + _6;
  return _17;

When the width = 3, we need 4 cycles here.
----- third end -----

Thanks,
Lili.



[PATCH] PR gcc/98350:Handle FMA friendly in reassoc pass

2023-05-17 Thread Cui, Lili via Gcc-patches
From: Lili Cui 

Make some changes in reassoc pass to make it more friendly to fma pass later.
Using FMA instead of mult + add reduces register pressure and instructions
retired.

There are mainly two changes
1. Put no-mult ops and mult ops alternately at the end of the queue, which is
conducive to generating more fma and reducing the loss of FMA when breaking
the chain.
2. Rewrite the rewrite_expr_tree_parallel function to try to build parallel
chains according to the given correlation width, keeping the FMA chance as
much as possible.

TEST1:

float
foo (float a, float b, float c, float d, float *e)
{
   return  *e  + a * b + c * d ;
}

For "-Ofast -mfpmath=sse -mfma" GCC generates:
vmulss  %xmm3, %xmm2, %xmm2
vfmadd132ss %xmm1, %xmm2, %xmm0
vaddss  (%rdi), %xmm0, %xmm0
ret

With this patch GCC generates:
vfmadd213ss   (%rdi), %xmm1, %xmm0
vfmadd231ss   %xmm2, %xmm3, %xmm0
ret

TEST2:

for (int i = 0; i < N; i++)
{
  a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i] * o[i] + p[i];
}

For "-Ofast -mfpmath=sse -mfma"  GCC generates:
vmovapd e(%rax), %ymm4
vmulpd  d(%rax), %ymm4, %ymm3
addq    $32, %rax
vmovapd c-32(%rax), %ymm5
vmovapd j-32(%rax), %ymm6
vmulpd  h-32(%rax), %ymm6, %ymm2
vmovapd a-32(%rax), %ymm6
vaddpd  p-32(%rax), %ymm6, %ymm0
vmovapd g-32(%rax), %ymm7
vfmadd231pd b-32(%rax), %ymm5, %ymm3
vmovapd o-32(%rax), %ymm4
vmulpd  m-32(%rax), %ymm4, %ymm1
vmovapd l-32(%rax), %ymm5
vfmadd231pd f-32(%rax), %ymm7, %ymm2
vfmadd231pd k-32(%rax), %ymm5, %ymm1
vaddpd  %ymm3, %ymm0, %ymm0
vaddpd  %ymm2, %ymm0, %ymm0
vaddpd  %ymm1, %ymm0, %ymm0
vmovapd %ymm0, a-32(%rax)
cmpq    $8192, %rax
jne .L4
vzeroupper
ret

with this patch applied GCC breaks the chain with width = 2 and generates 6 fma:

vmovapd a(%rax), %ymm2
vmovapd c(%rax), %ymm0
addq    $32, %rax
vmovapd e-32(%rax), %ymm1
vmovapd p-32(%rax), %ymm5
vmovapd g-32(%rax), %ymm3
vmovapd j-32(%rax), %ymm6
vmovapd l-32(%rax), %ymm4
vmovapd o-32(%rax), %ymm7
vfmadd132pd b-32(%rax), %ymm2, %ymm0
vfmadd132pd d-32(%rax), %ymm5, %ymm1
vfmadd231pd f-32(%rax), %ymm3, %ymm0
vfmadd231pd h-32(%rax), %ymm6, %ymm1
vfmadd231pd k-32(%rax), %ymm4, %ymm0
vfmadd231pd m-32(%rax), %ymm7, %ymm1
vaddpd  %ymm1, %ymm0, %ymm0
vmovapd %ymm0, a-32(%rax)
cmpq    $8192, %rax
jne .L2
vzeroupper
ret

gcc/ChangeLog:

PR gcc/98350
* tree-ssa-reassoc.cc
(rewrite_expr_tree_parallel): Rewrite this function.
(rank_ops_for_fma): New.
(reassociate_bb): Handle new function.

gcc/testsuite/ChangeLog:

PR gcc/98350
* gcc.dg/pr98350-1.c: New test.
* gcc.dg/pr98350-2.c: Ditto.
---
 gcc/testsuite/gcc.dg/pr98350-1.c |  31 
 gcc/testsuite/gcc.dg/pr98350-2.c |  11 ++
 gcc/tree-ssa-reassoc.cc  | 256 +--
 3 files changed, 215 insertions(+), 83 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
 create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c

diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98350-1.c
new file mode 100644
index 000..185511c5e0a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr98350-1.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mfpmath=sse -mfma -Wno-attributes " } */
+
+/* Test that the compiler properly optimizes multiply and add 
+   to generate more FMA instructions.  */
+#define N 1024
+double a[N];
+double b[N];
+double c[N];
+double d[N];
+double e[N];
+double f[N];
+double g[N];
+double h[N];
+double j[N];
+double k[N];
+double l[N];
+double m[N];
+double o[N];
+double p[N];
+
+
+void
+foo (void)
+{
+  for (int i = 0; i < N; i++)
+  {
+    a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i] * o[i] + p[i];
+  }
+}
+/* { dg-final { scan-assembler-times "vfm" 6  } } */
diff --git a/gcc/testsuite/gcc.dg/pr98350-2.c b/gcc/testsuite/gcc.dg/pr98350-2.c
new file mode 100644
index 000..b35d88aead9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr98350-2.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mfpmath=sse -mfma -Wno-attributes " } */
+
+/* Test that the compiler rearrange the ops to generate more FMA.  */
+
+float
+foo1 (float a, float b, float c, float d, float *e)
+{
+   return   *e + a * b + c * d ;
+}
+/* { dg-final { scan-assembler-times "vfm" 2  } } */
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index 067a3f07f7e..52c8aab6033 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -54,6 +54,7 @@ along with GCC; see the file COPYING3.  If not 

RE: [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass

2023-05-12 Thread Cui, Lili via Gcc-patches
> ISTR there were no sufficient comments in the code explaining why
> rewrite_expr_tree_parallel_for_fma is better by design.  In fact ...
> 
> >
> > >
> > > >   if (!reassoc_insert_powi_p
> > > > - && ops.length () > 3
> > > > + && len > 3
> > > > + && (!keep_fma_chain
> > > > + || (keep_fma_chain
> > > > + && len > param_reassoc_max_chain_length_with_fma))
> > >
> > > in the case len < param_reassoc_max_chain_length_with_fma we have
> > > the chain re-sorted but fall through to non-parallel rewrite.  I
> > > wonder if we do not want to instead adjust the reassociation width?
> > > I'd say it depends on the number of mult cases in the chain (sth the re-
> sorting could have computed).
> > > Why do we have two completely independent --params here?  Can you
> > > give an example --param value combination that makes "sense" and
> > > show how it is beneficial?
> >
> > For this small case https://godbolt.org/z/Pxczrre8P a * b + c * d + e
> > * f  + j
> >
> > GCC trunk: ops_num = 4, targetm.sched.reassociation_width is 4 (scalar fp
> cost is 4). Calculated: Width = 2. we can get 2 FMAs.
> > --
> >   _1 = a_6(D) * b_7(D);
> >   _2 = c_8(D) * d_9(D);
> >   _5 = _1 + _2;
> >   _4 = e_10(D) * f_11(D);
> >   _3 = _4 + j_12(D);
> >   _13 = _3 + _5;
> > 
> >   _2 = c_8(D) * d_9(D);
> >   _5 = .FMA (a_6(D), b_7(D), _2);
> >   _3 = .FMA (e_10(D), f_11(D), j_12(D));
> >   _13 = _3 + _5;
> > 
> > New patch: If just rearrange ops and fall through to parallel rewrite to
> break the chain with width = 2.
> >
> > -
> >   _1 = a_6(D) * b_7(D);
> >   _2 = j + _1;  --> put j first.
> >   _3 = c_8(D) * d_9(D);
> >   _4 = e_10(D) * f_11(D);
> >   _5 = _3 + _4;   --> break the chain with width = 2; we lose an FMA here.
> >   _13 = _2 + _5;
> >
> > ---
> >   _3 = c_8(D) * d_9(D);
> >   _2 = .FMA (a_6(D), b_7(D), j);
> >   _5 = .FMA (e_10(D), f_11(D), _3);
> >   _13 = _2 + _5;
> > 
> > Sometimes breaking the chain loses an FMA (breaking the chain requires
> > putting two mult-ops together, which loses one FMA), so we can only get
> > 2 FMAs here; if we want to get 3 FMAs, we need to keep the chain and not
> > break it. So I added a param to control the chain length,
> > "param_reassoc_max_chain_length_with_fma = 4" (for the small case in
> > Bugzilla 98350, we need to keep the chain to generate 6 FMAs).
> > ---
> >   _1 = a_6(D) * b_7(D);
> >   _2 = c_8(D) * d_9(D);
> >   _4 = e_10(D) * f_11(D);
> >   _15 = _4 + j_12(D);
> >   _16 = _15 + _2;
> >   _13 = _16 + _1;
> > ---
> >   _15 = .FMA (e_10(D), f_11(D), j_12(D));
> >   _16 = .FMA (c_8(D), d_9(D), _15);
> >   _13 = .FMA (a_6(D), b_7(D), _16);
> > ---
> > In some case we want to break the chain with width, we can set
> "param_reassoc_max_chain_length_with_fma = 2", it will rearrange ops and
> break the chain with width.
> 
> ... it sounds like the problem could be fully addressed by sorting the chain
> with reassoc-width in mind?
> Wouldn't it be preferable if rewrite_expr_tree_parallel would get a vector of
> mul and a vector of non-mul ops so it can pick from the optimal candidate?
> 
> That said, I think rewrite_expr_tree_parallel_for_fma at least needs more
> comments.
> 
Sorry for not writing the notes clearly enough; I'll add more.
I have two places that need to be clarified.

1. For some cases we need to keep the chain to generate more FMAs, because
breaking the chain loses FMA opportunities.
   For example, g + a * b + c * d + e * f:
   keeping the chain yields 3 FMAs, while breaking it yields only 2. It's hard
to say which one is better, so we provide a param for users to customize it.
   
2. When the chain has FMAs and we need to break the chain with a given width,
for example l + a * b + c * d + e * f + g * h + j * k (we already put the
non-mult op first):
rewrite_expr_tree_parallel:
when width = 2, it breaks the chain like this. It actually breaks the chain
into 3 sub-chains: it ignores the width and adds all ops two by two, so it
loses FMA opportunities.

ssa1 = l + a * b;
ssa2 = c * d + e * f;
ssa3 = g * h + j * k;
ssa4 = ssa1 + ssa2;
ssa5 = ssa4 + ssa3;

rewrite_expr_tree_parallel_for_fma:
when width = 2, we break the chain into two sub-chains like this.

ssa1 = l + a * b; 
ssa2 = c * d + e * f;
ssa3 = ssa1 + g * h;
ssa4 = ssa2 + j * k;
ssa5 = ssa3 + ssa4;

I think it's okay to either remove or keep rewrite_expr_tree_parallel_for_fma;
more FMAs are generated only in some special cases.
I'm not sure whether the new method is better than the old one. I created a
small

[PATCH1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass

2023-05-11 Thread Cui, Lili via Gcc-patches
From: Lili Cui 

Add a param for chains with FMA in the reassoc pass to make them more friendly
to the later FMA pass. First, detect whether the chain is able to generate
more than 2 FMAs; if so and param_reassoc_max_chain_length_with_fma is
enabled, rearrange the ops so that they can be combined into more FMAs. When
the chain length exceeds param_reassoc_max_chain_length_with_fma, build
parallel chains according to the given association width and try to keep FMA
opportunities as much as possible.

TEST1:

float
foo (float a, float b, float c, float d, float *e)
{
   return  *e  + a * b + c * d ;
}

For -Ofast -march=icelake-server  GCC generates:
vmulss  %xmm3, %xmm2, %xmm2
vfmadd132ss %xmm1, %xmm2, %xmm0
vaddss  (%rdi), %xmm0, %xmm0
ret

with "--param=reassoc-max-chain-length-with-fma=3" GCC generates:
vfmadd213ss   (%rdi), %xmm1, %xmm0
vfmadd231ss   %xmm2, %xmm3, %xmm0
ret

gcc/ChangeLog:

PR gcc/98350
* params.opt (reassoc-max-chain-length-with-fma): New param.
* tree-ssa-reassoc.cc
(rewrite_expr_tree_parallel_for_fma): New.
(rank_ops_for_fma): Ditto.
(reassociate_bb): Handle new function.

gcc/testsuite/ChangeLog:

PR gcc/98350
* gcc.dg/pr98350-1.c: New test.
* gcc.dg/pr98350-2.c: Ditto.
---
 gcc/params.opt   |   4 +
 gcc/testsuite/gcc.dg/pr98350-1.c |  31 +
 gcc/testsuite/gcc.dg/pr98350-2.c |  17 +++
 gcc/tree-ssa-reassoc.cc  | 226 ---
 4 files changed, 262 insertions(+), 16 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
 create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c

diff --git a/gcc/params.opt b/gcc/params.opt
index 823cdb2ff85..f7c719afe64 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1182,4 +1182,8 @@ The maximum factor which the loop vectorizer applies to 
the cost of statements i
 Common Joined UInteger Var(param_vect_induction_float) Init(1) IntegerRange(0, 
1) Param Optimization
 Enable loop vectorization of floating point inductions.
 
+-param=reassoc-max-chain-length-with-fma=
+Common Joined UInteger Var(param_reassoc_max_chain_length_with_fma) Init(1) 
IntegerRange(1, 65536) Param Optimization
+The maximum chain length with fma considered in reassociation pass.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98350-1.c
new file mode 100644
index 000..265e0e57a49
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr98350-1.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mfpmath=sse -mfma 
--param=reassoc-max-chain-length-with-fma=8 -Wno-attributes " } */
+
+/* Test that the compiler properly optimizes multiply and add 
+   to generate more FMA instructions.  */
+#define N 1024
+double a[N];
+double b[N];
+double c[N];
+double d[N];
+double e[N];
+double f[N];
+double g[N];
+double h[N];
+double j[N];
+double k[N];
+double l[N];
+double m[N];
+double o[N];
+double p[N];
+
+
+void
+foo (void)
+{
+  for (int i = 0; i < N; i++)
+  {
+a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * 
l[i] + m[i]* o[i] + p[i];
+  }
+}
+/* { dg-final { scan-assembler-times "vfm" 6  } } */
diff --git a/gcc/testsuite/gcc.dg/pr98350-2.c b/gcc/testsuite/gcc.dg/pr98350-2.c
new file mode 100644
index 000..246025d43b8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr98350-2.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mfpmath=sse -mfma 
--param=reassoc-max-chain-length-with-fma=6 -Wno-attributes " } */
+
+/* Test that the compiler properly build parallel chains according to given
+   association width and try to keep FMA opportunity as much as possible.  */
+#define N 33
+double a[N];
+
+void
+foo (void)
+{
+  a[32] = a[0] *a[1] + a[2] * a[3] + a[4] * a[5] + a[6] * a[7] + a[8] * a[9]
++ a[10] * a[11] + a[12] * a[13] + a[14] * a[15] + a[16] * a[17]
++ a[18] * a[19] + a[20] * a[21] + a[22] * a[23] + a[24] + a[25]
++ a[26] + a[27] + a[28] + a[29] + a[30] + a[31];
+}
+/* { dg-final { scan-assembler-times "vfm" 12  } } */
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index 067a3f07f7e..f8c70ccadab 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -54,6 +54,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-reassoc.h"
 #include "tree-ssa-math-opts.h"
 #include "gimple-range.h"
+#include "internal-fn.h"
 
 /*  This is a simple global reassociation pass.  It is, in part, based
 on the LLVM pass of the same name (They do some things more/less
@@ -5468,6 +5469,114 @@ get_reassociation_width (int ops_num, enum tree_code 
opc,
   return width;
 }
 
+/* Rewrite statements with dependency chain with regard to the chance to
+   generate FMA. When the dependency chain length exceeds
+   param_reassoc_max_chain_length_with_fma, build parallel chains according to
+   given association width and 

RE: [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass

2023-05-11 Thread Cui, Lili via Gcc-patches
> -Original Message-
> From: Richard Biener 
> Sent: Thursday, May 11, 2023 6:53 PM
> To: Cui, Lili 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length of
> the chain with FMA in reassoc pass

Hi Richard,
Thanks for helping to review the patch.

> 
> As you are not changing the number of ops you should be able to use
> quick_push here and below.  You should be able to do
> 
>  ops->splice (ops_mult);
>  ops->splice (ops_others);
> 
> as well.
> 
Done.

> > + /* When enabling param_reassoc_max_chain_length_with_fma
> to
> > +keep the chain with fma, rank_ops_for_fma will detect 
> > if
> > +the chain has fmas and if so it will rearrange the 
> > ops.  */
> > + if (param_reassoc_max_chain_length_with_fma > 1
> > + && direct_internal_fn_supported_p (IFN_FMA,
> > +TREE_TYPE (lhs),
> > +opt_type)
> > + && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
> > +   {
> > + keep_fma_chain = rank_ops_for_fma();
> > +   }
> > +
> > + int len = ops.length ();
> >   /* Only rewrite the expression tree to parallel in the
> >  last reassoc pass to avoid useless work back-and-forth
> >  with initial linearization.  */
> 
> we are doing the parallel rewrite only in the last reassoc pass, i think it 
> makes
> sense to do the same for reassoc-for-fma.

I rearranged the order of ops in reassoc1 without breaking the chain, and it
generated more vectorization during the vector pass (seen in benchmark 503).
So I rewrite the SSA tree and keep the chain with the function
"rewrite_expr_tree" in reassoc1, and break the chain with
"rewrite_expr_tree_parallel_for_fma" in reassoc2.

> 
> Why do the existing expr rewrites not work after re-sorting the ops?

For the case https://godbolt.org/z/3x9PWE9Kb, we put "j" first.

j + l * m + a * b + c * d + e * f + g * h;

GCC trunk: width = 2, ops_num = 6; the old function "rewrite_expr_tree_parallel"
generates 3 FMAs.
---
  _1 = l_10(D) * m_11(D);
  _3 = a_13(D) * b_14(D);
  _4 = j_12(D) + _3;> Here is one FMA.
  _5 = c_15(D) * d_16(D);
  _8 = _1 + _5;> Here is one FMA and lost one.
  _7 = e_17(D) * f_18(D);
  _9 = g_19(D) * h_20(D);
  _2 = _7 + _9;   > Here is one FMA and lost one.
  _6 = _2 + _4;
  _21 = _6 + _8;
  # VUSE <.MEM_22(D)>
  return _21;
--
width = 2, ops_num = 6; the new function "rewrite_expr_tree_parallel_for_fma"
generates 4 FMAs.
--
_1 = a_10(D) * b_11(D);
  _3 = c_13(D) * d_14(D);
  _5 = e_15(D) * f_16(D);
  _7 = g_17(D) * h_18(D);
  _4 = _5 + _7;   > Here is one FMA and lost one.
  _8 = _4 + _1;   > Here is one FMA.
  _9 = l_19(D) * m_20(D);
  _2 = _9 + j_12(D);> Here is one FMA.
  _6 = _2 + _3;> Here is one FMA.
  _21 = _8 + _6; 
  return _21;



> 
> >   if (!reassoc_insert_powi_p
> > - && ops.length () > 3
> > + && len > 3
> > + && (!keep_fma_chain
> > + || (keep_fma_chain
> > + && len >
> > + param_reassoc_max_chain_length_with_fma))
> 
> in the case len < param_reassoc_max_chain_length_with_fma we have the
> chain re-sorted but fall through to non-parallel rewrite.  I wonder if we do
> not want to instead adjust the reassociation width?  I'd say it depends on the
> number of mult cases in the chain (sth the re-sorting could have computed).
> Why do we have two completely independent --params here?  Can you give
> an example --param value combination that makes "sense" and show how it
> is beneficial?

For this small case https://godbolt.org/z/Pxczrre8P
a * b + c * d + e * f  + j

GCC trunk: ops_num = 4, targetm.sched.reassociation_width is 4 (scalar fp cost
is 4). Calculated: width = 2. We can get 2 FMAs.
--
  _1 = a_6(D) * b_7(D);
  _2 = c_8(D) * d_9(D);
  _5 = _1 + _2;
  _4 = e_10(D) * f_11(D);
  _3 = _4 + j_12(D);
  _13 = _3 + _5;

  _2 = c_8(D) * d_9(D);
  _5 = .FMA (a_6(D), b_7(D), _2);
  _3 = .FMA (e_10(D), f_11(D), j_12(D));
  _13 = _3 + _5;

New patch: it just rearranges the ops and falls through to the parallel
rewrite to break the chain with width = 2.


[PATCH 2/2] Add a tune option to control the length of the chain with FMA

2023-05-11 Thread Cui, Lili via Gcc-patches
From: Lili Cui 

Set the length of the chain with FMA to 5 for icelake_cost.

With this patch applied,
SPR multi-copy: 508.namd_r increased by 3%
ICX multi-copy: 508.namd_r increased by 3.5%,
507.cactuBSSN_r increased by 3.7%

Using FMA instead of mult + add reduces register pressure and the number of
instructions retired.

gcc/ChangeLog:

* config/i386/i386-options.cc (ix86_option_override_internal):
Set param_reassoc_max_chain_length_with_fma.
* config/i386/i386.h (struct processor_costs): Add a new tune parameter.
* config/i386/x86-tune-costs.h (struct processor_costs): Set
reassoc_max_chain_length_with_fma to 5 for icelake.

gcc/testsuite/ChangeLog:

* gcc.target/i386/fma-chain.c: New test.
---
 gcc/config/i386/i386-options.cc   |  2 ++
 gcc/config/i386/i386.h|  3 ++
 gcc/config/i386/x86-tune-costs.h  | 35 +++
 gcc/testsuite/gcc.target/i386/fma-chain.c | 11 +++
 4 files changed, 51 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/fma-chain.c

diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
index 2cb0bddcd35..67d35d89d91 100644
--- a/gcc/config/i386/i386-options.cc
+++ b/gcc/config/i386/i386-options.cc
@@ -2684,6 +2684,8 @@ ix86_option_override_internal (bool main_args_p,
   ix86_tune_cost->l1_cache_size);
   SET_OPTION_IF_UNSET (opts, opts_set, param_l2_cache_size,
   ix86_tune_cost->l2_cache_size);
+  SET_OPTION_IF_UNSET (opts, opts_set, param_reassoc_max_chain_length_with_fma,
+  ix86_tune_cost->reassoc_max_chain_length_with_fma);
 
   /* 64B is the accepted value for these for all x86.  */
   SET_OPTION_IF_UNSET (&global_options, &global_options_set,
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index c7439f89bdf..c7fa7312a67 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -206,6 +206,9 @@ struct processor_costs {
   to number of instructions executed in
   parallel.  See also
   ix86_reassociation_width.  */
+  const int reassoc_max_chain_length_with_fma;
+   /* Specify max reassociation chain length with
+  FMA.  */
   struct stringop_algs *memcpy, *memset;
   const int cond_taken_branch_cost;/* Cost of taken branch for vectorizer
  cost model.  */
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 4f7a67ca5c5..1f57a5ee2a7 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -127,6 +127,7 @@ struct processor_costs ix86_size_cost = {/* costs for 
tuning for size */
   COSTS_N_BYTES (2),   /* cost of SQRTSS instruction.  */
   COSTS_N_BYTES (2),   /* cost of SQRTSD instruction.  */
   1, 1, 1, 1,  /* reassoc int, fp, vec_int, vec_fp.  */
+  1,   /* Reassoc max FMA chain length.  */
   ix86_size_memcpy,
   ix86_size_memset,
   COSTS_N_BYTES (1),   /* cond_taken_branch_cost.  */
@@ -238,6 +239,7 @@ struct processor_costs i386_cost = {/* 386 specific 
costs */
   COSTS_N_INSNS (122), /* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (122), /* cost of SQRTSD instruction.  */
   1, 1, 1, 1,  /* reassoc int, fp, vec_int, vec_fp.  */
+  1,   /* Reassoc max FMA chain length.  */
   i386_memcpy,
   i386_memset,
   COSTS_N_INSNS (3),   /* cond_taken_branch_cost.  */
@@ -350,6 +352,7 @@ struct processor_costs i486_cost = {/* 486 specific 
costs */
   COSTS_N_INSNS (83),  /* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (83),  /* cost of SQRTSD instruction.  */
   1, 1, 1, 1,  /* reassoc int, fp, vec_int, vec_fp.  */
+  1,   /* Reassoc max FMA chain length.  */
   i486_memcpy,
   i486_memset,
   COSTS_N_INSNS (3),   /* cond_taken_branch_cost.  */
@@ -460,6 +463,7 @@ struct processor_costs pentium_cost = {
   COSTS_N_INSNS (70),  /* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (70),  /* cost of SQRTSD instruction.  */
   1, 1, 1, 1,  /* reassoc int, fp, vec_int, vec_fp.  */
+  1,   /* Reassoc max FMA chain length.  */
   pentium_memcpy,
   pentium_memset,
   COSTS_N_INSNS (3),   /* cond_taken_branch_cost.  */
@@ -563,6 +567,7 @@ struct processor_costs lakemont_cost = {
   COSTS_N_INSNS (31),  /* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (63),  /* cost of SQRTSD instruction.  */
   1, 1, 1, 1,  /* reassoc int, fp, vec_int, 

[PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass

2023-05-11 Thread Cui, Lili via Gcc-patches
From: Lili Cui 

Hi,

Those two patches each add a param to control the length of the chain with
FMA in reassoc pass and a tuning option in the backend.

Bootstrapped and regtested. Ok for trunk?

Regards
Lili.

Add a param for chains with FMA in the reassoc pass to make them more friendly
to the later FMA pass. First, detect whether the chain is able to generate
more than 2 FMAs; if so and param_reassoc_max_chain_length_with_fma is
enabled, rearrange the ops so that they can be combined into more FMAs. When
the chain length exceeds param_reassoc_max_chain_length_with_fma, build
parallel chains according to the given association width and try to keep FMA
opportunities as much as possible.

TEST1:

float
foo (float a, float b, float c, float d, float *e)
{
   return  *e  + a * b + c * d ;
}

For -Ofast -march=icelake-server  GCC generates:
vmulss  %xmm3, %xmm2, %xmm2
vfmadd132ss %xmm1, %xmm2, %xmm0
vaddss  (%rdi), %xmm0, %xmm0
ret

with "--param=reassoc-max-chain-length-with-fma=3" GCC generates:
vfmadd213ss   (%rdi), %xmm1, %xmm0
vfmadd231ss   %xmm2, %xmm3, %xmm0
ret

gcc/ChangeLog:

PR gcc/98350
* params.opt (reassoc-max-chain-length-with-fma): New param.
* tree-ssa-reassoc.cc
(rewrite_expr_tree_parallel_for_fma): New.
(rank_ops_for_fma): Ditto.
(reassociate_bb): Handle new function.

gcc/testsuite/ChangeLog:

PR gcc/98350
* gcc.dg/pr98350-1.c: New test.
* gcc.dg/pr98350-2.c: Ditto.
---
 gcc/params.opt   |   4 +
 gcc/testsuite/gcc.dg/pr98350-1.c |  31 +
 gcc/testsuite/gcc.dg/pr98350-2.c |  17 +++
 gcc/tree-ssa-reassoc.cc  | 228 ---
 4 files changed, 264 insertions(+), 16 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
 create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c

diff --git a/gcc/params.opt b/gcc/params.opt
index 823cdb2ff85..f7c719afe64 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1182,4 +1182,8 @@ The maximum factor which the loop vectorizer applies to 
the cost of statements i
 Common Joined UInteger Var(param_vect_induction_float) Init(1) IntegerRange(0, 
1) Param Optimization
 Enable loop vectorization of floating point inductions.
 
+-param=reassoc-max-chain-length-with-fma=
+Common Joined UInteger Var(param_reassoc_max_chain_length_with_fma) Init(1) 
IntegerRange(1, 65536) Param Optimization
+The maximum chain length with fma considered in reassociation pass.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98350-1.c
new file mode 100644
index 000..32ecce13a2d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr98350-1.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mfpmath=sse -mfma 
--param=reassoc-max-chain-length-with-fma=7 -Wno-attributes " } */
+
+/* Test that the compiler properly optimizes multiply and add 
+   to generate more FMA instructions.  */
+#define N 1024
+double a[N];
+double b[N];
+double c[N];
+double d[N];
+double e[N];
+double f[N];
+double g[N];
+double h[N];
+double j[N];
+double k[N];
+double l[N];
+double m[N];
+double o[N];
+double p[N];
+
+
+void
+foo (void)
+{
+  for (int i = 0; i < N; i++)
+  {
+a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * 
l[i] + m[i]* o[i] + p[i];
+  }
+}
+/* { dg-final { scan-assembler-times "vfm" 6  } } */
diff --git a/gcc/testsuite/gcc.dg/pr98350-2.c b/gcc/testsuite/gcc.dg/pr98350-2.c
new file mode 100644
index 000..246025d43b8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr98350-2.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mfpmath=sse -mfma 
--param=reassoc-max-chain-length-with-fma=6 -Wno-attributes " } */
+
+/* Test that the compiler properly build parallel chains according to given
+   association width and try to keep FMA opportunity as much as possible.  */
+#define N 33
+double a[N];
+
+void
+foo (void)
+{
+  a[32] = a[0] *a[1] + a[2] * a[3] + a[4] * a[5] + a[6] * a[7] + a[8] * a[9]
++ a[10] * a[11] + a[12] * a[13] + a[14] * a[15] + a[16] * a[17]
++ a[18] * a[19] + a[20] * a[21] + a[22] * a[23] + a[24] + a[25]
++ a[26] + a[27] + a[28] + a[29] + a[30] + a[31];
+}
+/* { dg-final { scan-assembler-times "vfm" 12  } } */
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index 067a3f07f7e..6d2e158c4f5 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -54,6 +54,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-reassoc.h"
 #include "tree-ssa-math-opts.h"
 #include "gimple-range.h"
+#include "internal-fn.h"
 
 /*  This is a simple global reassociation pass.  It is, in part, based
 on the LLVM pass of the same name (They do some things more/less
@@ -5468,6 +5469,114 @@ get_reassociation_width (int ops_num, enum tree_code 
opc,
   return width;
 }
 
+/* Rewrite statements with dependency chain with 

[PATCH] x86: Enable 256 move by pieces for ALDERLAKE and AVX2.

2022-11-11 Thread Cui,Lili via Gcc-patches
From: Lili Cui 

Hi Hongtao,

This patch enables 256-bit move by pieces for ALDERLAKE and AVX2.
Bootstrap is OK, and there are no regressions in the i386/x86-64 testsuite.

OK for master?


gcc/ChangeLog:

* config/i386/x86-tune.def
(X86_TUNE_AVX256_MOVE_BY_PIECES): Add alderlake and avx2.
(X86_TUNE_AVX256_STORE_BY_PIECES): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pieces-memset-50.c: New test.
---
 gcc/config/i386/x86-tune.def |  4 ++--
 gcc/testsuite/gcc.target/i386/pieces-memset-50.c | 12 
 2 files changed, 14 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-50.c

diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 58e29e7806a..cd66f335113 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -536,12 +536,12 @@ DEF_TUNE (X86_TUNE_AVX256_OPTIMAL, "avx256_optimal", 
m_CORE_AVX512)
 /* X86_TUNE_AVX256_MOVE_BY_PIECES: Optimize move_by_pieces with 256-bit
AVX instructions.  */
 DEF_TUNE (X86_TUNE_AVX256_MOVE_BY_PIECES, "avx256_move_by_pieces",
- m_CORE_AVX512)
+ m_ALDERLAKE | m_CORE_AVX2)
 
 /* X86_TUNE_AVX256_STORE_BY_PIECES: Optimize store_by_pieces with 256-bit
AVX instructions.  */
 DEF_TUNE (X86_TUNE_AVX256_STORE_BY_PIECES, "avx256_store_by_pieces",
- m_CORE_AVX512)
+ m_ALDERLAKE | m_CORE_AVX2)
 
 /* X86_TUNE_AVX512_MOVE_BY_PIECES: Optimize move_by_pieces with 512-bit
AVX instructions.  */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-50.c 
b/gcc/testsuite/gcc.target/i386/pieces-memset-50.c
new file mode 100644
index 000..c09e7c3649c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-50.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=alderlake" } */
+
+extern char *dst;
+
+void
+foo (int x)
+{
+  __builtin_memset (dst, x, 64);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%ymm" 2 } } */
-- 
2.17.1

Thanks,
Lili.


[PATCH] Remove AVX512_VP2INTERSECT from PTA_SAPPHIRERAPIDS

2022-11-07 Thread Cui,Lili via Gcc-patches
Hi Hongtao,

   I backported this patch to the gcc-12 release branch.

gcc/ChangeLog:

* config/i386/driver-i386.cc (host_detect_local_cpu):
Move sapphirerapids out of AVX512_VP2INTERSECT.
* config/i386/i386.h: Remove AVX512_VP2INTERSECT from PTA_SAPPHIRERAPIDS.
* doc/invoke.texi: Remove AVX512_VP2INTERSECT from SAPPHIRERAPIDS.
---
 gcc/config/i386/driver-i386.cc | 13 +
 gcc/config/i386/i386.h |  7 +++
 gcc/doc/invoke.texi|  8 
 3 files changed, 12 insertions(+), 16 deletions(-)

diff --git a/gcc/config/i386/driver-i386.cc b/gcc/config/i386/driver-i386.cc
index 9e0ae0b2baa..fcf23fd921d 100644
--- a/gcc/config/i386/driver-i386.cc
+++ b/gcc/config/i386/driver-i386.cc
@@ -574,15 +574,12 @@ const char *host_detect_local_cpu (int argc, const char 
**argv)
  /* This is unknown family 0x6 CPU.  */
  if (has_feature (FEATURE_AVX))
{
+ /* Assume Tiger Lake */
  if (has_feature (FEATURE_AVX512VP2INTERSECT))
-   {
- if (has_feature (FEATURE_TSXLDTRK))
-   /* Assume Sapphire Rapids.  */
-   cpu = "sapphirerapids";
- else
-   /* Assume Tiger Lake */
-   cpu = "tigerlake";
-   }
+   cpu = "tigerlake";
+ /* Assume Sapphire Rapids.  */
+ else if (has_feature (FEATURE_TSXLDTRK))
+   cpu = "sapphirerapids";
  /* Assume Cooper Lake */
  else if (has_feature (FEATURE_AVX512BF16))
cpu = "cooperlake";
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 363082ba47b..a61c32b8957 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2328,10 +2328,9 @@ constexpr wide_int_bitmask PTA_ICELAKE_SERVER = 
PTA_ICELAKE_CLIENT
 constexpr wide_int_bitmask PTA_TIGERLAKE = PTA_ICELAKE_CLIENT | PTA_MOVDIRI
   | PTA_MOVDIR64B | PTA_CLWB | PTA_AVX512VP2INTERSECT | PTA_KL | PTA_WIDEKL;
 constexpr wide_int_bitmask PTA_SAPPHIRERAPIDS = PTA_ICELAKE_SERVER | 
PTA_MOVDIRI
-  | PTA_MOVDIR64B | PTA_AVX512VP2INTERSECT | PTA_ENQCMD | PTA_CLDEMOTE
-  | PTA_PTWRITE | PTA_WAITPKG | PTA_SERIALIZE | PTA_TSXLDTRK | PTA_AMX_TILE
-  | PTA_AMX_INT8 | PTA_AMX_BF16 | PTA_UINTR | PTA_AVXVNNI | PTA_AVX512FP16
-  | PTA_AVX512BF16;
+  | PTA_MOVDIR64B | PTA_ENQCMD | PTA_CLDEMOTE | PTA_PTWRITE | PTA_WAITPKG
+  | PTA_SERIALIZE | PTA_TSXLDTRK | PTA_AMX_TILE | PTA_AMX_INT8 | PTA_AMX_BF16
+  | PTA_UINTR | PTA_AVXVNNI | PTA_AVX512FP16 | PTA_AVX512BF16;
 constexpr wide_int_bitmask PTA_KNL = PTA_BROADWELL | PTA_AVX512PF
   | PTA_AVX512ER | PTA_AVX512F | PTA_AVX512CD | PTA_PREFETCHWT1;
 constexpr wide_int_bitmask PTA_BONNELL = PTA_CORE2 | PTA_MOVBE;
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 3749e06f13e..cee057a70bf 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -31541,11 +31541,11 @@ Intel sapphirerapids CPU with 64-bit extensions, 
MOVBE, MMX, SSE, SSE2, SSE3,
 SSSE3, SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE, PCLMUL, FSGSBASE,
 RDRND, F16C, AVX2, BMI, BMI2, LZCNT, FMA, MOVBE, HLE, RDSEED, ADCX, PREFETCHW,
 AES, CLFLUSHOPT, XSAVEC, XSAVES, SGX, AVX512F, AVX512VL, AVX512BW, AVX512DQ,
-AVX512CD, PKU, AVX512VBMI, AVX512IFMA, SHA, AVX512VNNI, GFNI, VAES, AVX512VBMI2
+AVX512CD, PKU, AVX512VBMI, AVX512IFMA, SHA, AVX512VNNI, GFNI, VAES, 
AVX512VBMI2,
 VPCLMULQDQ, AVX512BITALG, RDPID, AVX512VPOPCNTDQ, PCONFIG, WBNOINVD, CLWB,
-MOVDIRI, MOVDIR64B, AVX512VP2INTERSECT, ENQCMD, CLDEMOTE, PTWRITE, WAITPKG,
-SERIALIZE, TSXLDTRK, UINTR, AMX-BF16, AMX-TILE, AMX-INT8, AVX-VNNI, AVX512FP16
-and AVX512BF16 instruction set support.
+MOVDIRI, MOVDIR64B, ENQCMD, CLDEMOTE, PTWRITE, WAITPKG, SERIALIZE, TSXLDTRK,
+UINTR, AMX-BF16, AMX-TILE, AMX-INT8, AVX-VNNI, AVX512FP16 and AVX512BF16
+instruction set support.
 
 @item alderlake
 Intel Alderlake CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3,
-- 
2.17.1

Thanks,
Lili.


RE: [PATCH] ix86: Suggest unroll factor for loop vectorization

2022-11-02 Thread Cui, Lili via Gcc-patches
> > > +@item x86-vect-unroll-min-ldst-threshold
> > > +The vectorizer will check with target information to determine
> > > +whether unroll it. This parameter is used to limit the mininum of
> > > +loads and stores in the main loop.
> > >
> > > It's odd to "limit" the minimum number of something.  I think this
> > > warrants clarification that for some (unknow to me ;)) reason we
> > > think that when we have many loads and (or?) stores it is beneficial
> > > to unroll to get even more loads and stores in a single iteration.
> > > Btw, does the parameter limit the number of loads and stores _after_
> unrolling or before?
> > >
> > When the number of loads/stores exceeds the threshold, the loads/stores
> are more likely to conflict with loop itself in the L1 cache(Assuming that
> address of loads are scattered).
> > Unroll + software scheduling will make 2 or 4 address contiguous
> loads/stores closer together, it will reduce cache miss rate.
> 
> Ah, nice.  Can we express the default as a function of L1 data cache size, L1
> cache line size and more importantly, the size of the vector memory access?
> 
> Btw, I was looking into making a more meaningful cost modeling for loop
> distribution.  Similar reasoning might apply there - try to _reduce_ the
> number of memory streams so L1 cache utilization allows re-use of a cache
> line in the next [next N] iteration[s]?  OTOH given L1D is quite large I'd 
> expect
> the loops affected to be either quite huge or bottlenecked by load/store
> bandwith (there are 1024 L1D cache lines in zen2 for
> example) - what's the effective L1D load you are keying off?.
> Btw, how does L1D allocation on stores play a role here?
> 
Hi Richard,
To answer your question, I rechecked 549 and found that the improvement comes
from load reduction. It has a 3-level loop nest, and 8 scalar loads in the
inner loop are loop invariants (due to high register pressure, these loop
invariants all spill to the stack).
After unrolling the inner loop, those scalar parts are not doubled, so
unrolling reduces load instructions and L1/L2/L3 accesses. The inner loop
accesses 8 different three-dimensional arrays, each sized like
"a[128][480][128]". Although these 3-level arrays are very large, this doesn't
support the theory I stated before; sorry for that. I need to hold this patch
to see if we can do something about this scenario.

Thanks,
Lili.




RE: Ping^3 [PATCH V2] Add attribute hot judgement for INLINE_HINT_known_hot hint.

2022-10-30 Thread Cui, Lili via Gcc-patches
> 
> On 10/20/22 19:52, Cui, Lili via Gcc-patches wrote:
> > Hi Honza,
> >
> > Gentle ping
> > https://gcc.gnu.org/pipermail/gcc-patches/2022-September/601934.html
> >
> > gcc/ChangeLog
> >
> >* ipa-inline-analysis.cc (do_estimate_edge_time): Add function attribute
> >judgement for INLINE_HINT_known_hot hint.
> >
> > gcc/testsuite/ChangeLog:
> >
> >* gcc.dg/ipa/inlinehint-6.c: New test.
> > ---
> >   gcc/ipa-inline-analysis.cc  | 13 ---
> >   gcc/testsuite/gcc.dg/ipa/inlinehint-6.c | 47
> +
> >   2 files changed, 56 insertions(+), 4 deletions(-)
> >   create mode 100644 gcc/testsuite/gcc.dg/ipa/inlinehint-6.c
> >
> > diff --git a/gcc/ipa-inline-analysis.cc b/gcc/ipa-inline-analysis.cc
> > index 1ca685d1b0e..7bd29c36590 100644
> > --- a/gcc/ipa-inline-analysis.cc
> > +++ b/gcc/ipa-inline-analysis.cc
> > @@ -48,6 +48,7 @@ along with GCC; see the file COPYING3.  If not see
> >   #include "ipa-utils.h"
> >   #include "cfgexpand.h"
> >   #include "gimplify.h"
> > +#include "attribs.h"
> >
> >   /* Cached node/edge growths.  */
> >   fast_call_summary
> > *edge_growth_cache = NULL; @@ -249,15 +250,19 @@
> do_estimate_edge_time (struct cgraph_edge *edge, sreal
> *ret_nonspec_time)
> > hints = estimates.hints;
> >   }
> >
> > -  /* When we have profile feedback, we can quite safely identify hot
> > - edges and for those we disable size limits.  Don't do that when
> > - probability that caller will call the callee is low however, since it
> > +  /* When we have profile feedback or function attribute, we can quite
> safely
> > + identify hot edges and for those we disable size limits.  Don't do 
> > that
> > + when probability that caller will call the callee is low
> > + however, since it
> >may hurt optimization of the caller's hot path.  */
> > -  if (edge->count.ipa ().initialized_p () && edge->maybe_hot_p ()
> > +  if ((edge->count.ipa ().initialized_p () && edge->maybe_hot_p ()
> > && (edge->count.ipa () * 2
> >   > (edge->caller->inlined_to
> >  ? edge->caller->inlined_to->count.ipa ()
> >  : edge->caller->count.ipa (
> > +  || (lookup_attribute ("hot", DECL_ATTRIBUTES (edge->caller->decl))
> > + != NULL
> > +&& lookup_attribute ("hot", DECL_ATTRIBUTES (edge->callee->decl))
> > + != NULL))
> >   hints |= INLINE_HINT_known_hot;
> 
> Is the theory here that if the user has marked the caller and callee as hot,
> then we're going to assume an edge between them is hot too?  That's not
> necessarily true, it could be they're both hot, but via other call chains.  
> But it's
> probably a reasonable heuristic in practice.
> 
Yes,  thanks Jeff.

Lili.
> 
> OK
> 
> 
> jeff
> 



RE: [PATCH] ix86: Suggest unroll factor for loop vectorization

2022-10-26 Thread Cui, Lili via Gcc-patches
Hi Richard,

> +@item x86-vect-unroll-min-ldst-threshold
> +The vectorizer will check with target information to determine whether
> +unroll it. This parameter is used to limit the mininum of loads and
> +stores in the main loop.
> 
> It's odd to "limit" the minimum number of something.  I think this warrants
> clarification that for some (unknow to me ;)) reason we think that when we
> have many loads and (or?) stores it is beneficial to unroll to get even more
> loads and stores in a single iteration.  Btw, does the parameter limit the
> number of loads and stores _after_ unrolling or before?
> 
When the number of loads/stores exceeds the threshold, the loads/stores are
more likely to conflict with the loop itself in the L1 cache (assuming the
addresses of the loads are scattered).
Unrolling plus software scheduling brings 2 or 4 address-contiguous
loads/stores closer together, which reduces the cache miss rate.

> +@item x86-vect-unroll-max-loop-size
> +The vectorizer will check with target information to determine whether
> +unroll it. This threshold is used to limit the max size of loop body after
> unrolling.
> +The default value is 200.
> 
> it should probably say not "size" but "number of instructions".  Note that 200
> is quite large given we are talking about vector instructions here which have
> larger encodings than scalar instructions.  Optimistically assuming
> 4 byte encoding (quite optimistic give we're looking at loops with many
> loads/stores) that would be an 800 byte loop body which would be 25 cache
> lines.
> ISTR that at least the loop discovery is limited to a lot smaller cases
> (but we are likely not targeting that).  The limit probably still works
> to fit the loop body in the u-op caches though.
> 
Agree with you, it should be "x86-vect-unroll-max-loop-insns". Thanks for the
reminder about larger encodings. I checked the Skylake uop cache, which can
hold 1.5K uops; 200 * 2 (at 1~3 uops/instruction) = 400 uops, so I think 200
still works.

> That said, the heuristic made me think "what the heck".  Can we explain in u-
> arch terms why the unrolling is beneficial instead of just defering to SPEC
> CPU 2017 fotonik?
> 
Regarding the benefits, as I explained in the first answer: I checked the 5
hottest functions in 549, and they all benefit from it; it improves the cache
hit ratio.

Thanks,
Lili.

> > On Mon, Oct 24, 2022 at 10:46 AM Cui,Lili via Gcc-patches
> >  wrote:
> > >
> > > Hi Hongtao,
> > >
> > > This patch introduces function finish_cost and
> > > determine_suggested_unroll_factor for x86 backend, to make it be
> > > able to suggest the unroll factor for a given loop being vectorized.
> > > Referring to aarch64, RS6000 backends and basing on the analysis on
> > > SPEC2017 performance evaluation results.
> > >
> > > Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > >
> > > OK for trunk?
> > >
> > >
> > >
> > > With this patch, SPEC2017 performance evaluation results on
> > > ICX/CLX/ADL/Znver3 are listed below:
> > >
> > > For single copy:
> > >   - ICX: 549.fotonik3d_r +6.2%, the others are neutral
> > >   - CLX: 549.fotonik3d_r +1.9%, the others are neutral
> > >   - ADL: 549.fotonik3d_r +4.5%, the others are neutral
> > >   - Znver3: 549.fotonik3d_r +4.8%, the others are neutral
> > >
> > > For multi-copy:
> > >   - ADL: 549.fotonik3d_r +2.7%, the others are neutral
> > >
> > > gcc/ChangeLog:
> > >
> > > * config/i386/i386.cc (class ix86_vector_costs): Add new members
> > >  m_nstmts, m_nloads m_nstores and
> determine_suggested_unroll_factor.
> > > (ix86_vector_costs::add_stmt_cost): Update for m_nstores,
> m_nloads
> > > and m_nstores.
> > > (ix86_vector_costs::determine_suggested_unroll_factor): New
> function.
> > > (ix86_vector_costs::finish_cost): Ditto.
> > > * config/i386/i386.opt:(x86-vect-unroll-limit): New parameter.
> > > (x86-vect-unroll-min-ldst-threshold): Likewise.
> > > (x86-vect-unroll-max-loop-size): Likewise.
> > > * doc/invoke.texi: Document new parameter.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > * gcc.target/i386/cond_op_maxmin_b-1.c: Add -fno-unroll-loops.
> > > * gcc.target/i386/cond_op_maxmin_ub-1.c: Ditto.
> > > * gcc.target/i386/vect-alignment-peeling-1.c: Ditto.
> > > * gcc.target/i386/vect-alignment-peeling-2.c: Ditto.
> > > * gcc.

[PATCH] ix86: Suggest unroll factor for loop vectorization

2022-10-23 Thread Cui,Lili via Gcc-patches
Hi Hongtao,

This patch introduces the functions finish_cost and
determine_suggested_unroll_factor for the x86 backend, making it
able to suggest the unroll factor for a given loop being vectorized.
It refers to the aarch64 and RS6000 backends and is based on analysis
of SPEC2017 performance evaluation results.

Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.

OK for trunk?



With this patch, SPEC2017 performance evaluation results on
ICX/CLX/ADL/Znver3 are listed below:

For single copy:
  - ICX: 549.fotonik3d_r +6.2%, the others are neutral
  - CLX: 549.fotonik3d_r +1.9%, the others are neutral
  - ADL: 549.fotonik3d_r +4.5%, the others are neutral
  - Znver3: 549.fotonik3d_r +4.8%, the others are neutral

For multi-copy:
  - ADL: 549.fotonik3d_r +2.7%, the others are neutral

gcc/ChangeLog:

* config/i386/i386.cc (class ix86_vector_costs): Add new members
 m_nstmts, m_nloads, m_nstores and determine_suggested_unroll_factor.
(ix86_vector_costs::add_stmt_cost): Update for m_nstmts, m_nloads
and m_nstores.
(ix86_vector_costs::determine_suggested_unroll_factor): New function.
(ix86_vector_costs::finish_cost): Ditto.
* config/i386/i386.opt (x86-vect-unroll-limit): New parameter.
(x86-vect-unroll-min-ldst-threshold): Likewise.
(x86-vect-unroll-max-loop-size): Likewise.
* doc/invoke.texi: Document new parameters.

gcc/testsuite/ChangeLog:

* gcc.target/i386/cond_op_maxmin_b-1.c: Add -fno-unroll-loops.
* gcc.target/i386/cond_op_maxmin_ub-1.c: Ditto.
* gcc.target/i386/vect-alignment-peeling-1.c: Ditto.
* gcc.target/i386/vect-alignment-peeling-2.c: Ditto.
* gcc.target/i386/vect-reduc-1.c: Ditto.
---
 gcc/config/i386/i386.cc   | 106 ++
 gcc/config/i386/i386.opt  |  15 +++
 gcc/doc/invoke.texi   |  17 +++
 .../gcc.target/i386/cond_op_maxmin_b-1.c  |   2 +-
 .../gcc.target/i386/cond_op_maxmin_ub-1.c |   2 +-
 .../i386/vect-alignment-peeling-1.c   |   2 +-
 .../i386/vect-alignment-peeling-2.c   |   2 +-
 gcc/testsuite/gcc.target/i386/vect-reduc-1.c  |   2 +-
 8 files changed, 143 insertions(+), 5 deletions(-)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index aeea26ef4be..a939354e55e 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23336,6 +23336,17 @@ class ix86_vector_costs : public vector_costs
  stmt_vec_info stmt_info, slp_tree node,
  tree vectype, int misalign,
  vect_cost_model_location where) override;
+
+  unsigned int determine_suggested_unroll_factor (loop_vec_info);
+
+  void finish_cost (const vector_costs *) override;
+
+  /* Total number of vectorized stmts (loop only).  */
+  unsigned m_nstmts = 0;
+  /* Total number of loads (loop only).  */
+  unsigned m_nloads = 0;
+  /* Total number of stores (loop only).  */
+  unsigned m_nstores = 0;
 };
 
 /* Implement targetm.vectorize.create_costs.  */
@@ -23579,6 +23590,19 @@ ix86_vector_costs::add_stmt_cost (int count, 
vect_cost_for_stmt kind,
retval = (retval * 17) / 10;
 }
 
+  if (!m_costing_for_scalar
+  && is_a<loop_vec_info> (m_vinfo)
+  && where == vect_body)
+{
+  m_nstmts += count;
+  if (kind == scalar_load || kind == vector_load
+ || kind == unaligned_load || kind == vector_gather_load)
+   m_nloads += count;
+  else if (kind == scalar_store || kind == vector_store
+  || kind == unaligned_store || kind == vector_scatter_store)
+   m_nstores += count;
+}
+
   m_costs[where] += retval;
 
   return retval;
@@ -23850,6 +23874,88 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop 
*loop)
   return nunroll;
 }
 
+unsigned int
+ix86_vector_costs::determine_suggested_unroll_factor (loop_vec_info loop_vinfo)
+{
+  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+
+  /* Don't unroll if it's specified explicitly not to be unrolled.  */
+  if (loop->unroll == 1
+  || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops)
+  || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops))
+return 1;
+
+  /* Don't unroll if there is no vectorized stmt.  */
+  if (m_nstmts == 0)
+return 1;
+
+  /* Don't unroll if vector size is zmm, since zmm throughput is lower
+     than other sizes.  */
+  if (GET_MODE_SIZE (loop_vinfo->vector_mode) == 64)
+return 1;
+
+  /* Calc the total number of loads and stores in the loop body.  */
+  unsigned int nstmts_ldst = m_nloads + m_nstores;
+
+  /* Don't unroll if the loop body size is bigger than the threshold; the
+     threshold is a heuristic value inspired by param_max_unrolled_insns.  */
+  unsigned int uf = m_nstmts < (unsigned int)x86_vect_unroll_max_loop_size
+   ? ((unsigned int)x86_vect_unroll_max_loop_size / m_nstmts)
+   : 1;
+  uf = MIN ((unsigned 

Ping^3 [PATCH V2] Add attribute hot judgement for INLINE_HINT_known_hot hint.

2022-10-20 Thread Cui, Lili via Gcc-patches
Hi Honza,

Gentle ping  
https://gcc.gnu.org/pipermail/gcc-patches/2022-September/601934.html

gcc/ChangeLog

  * ipa-inline-analysis.cc (do_estimate_edge_time): Add function attribute
  judgement for INLINE_HINT_known_hot hint.

gcc/testsuite/ChangeLog:

  * gcc.dg/ipa/inlinehint-6.c: New test.
---
 gcc/ipa-inline-analysis.cc  | 13 ---
 gcc/testsuite/gcc.dg/ipa/inlinehint-6.c | 47 +
 2 files changed, 56 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/ipa/inlinehint-6.c

diff --git a/gcc/ipa-inline-analysis.cc b/gcc/ipa-inline-analysis.cc
index 1ca685d1b0e..7bd29c36590 100644
--- a/gcc/ipa-inline-analysis.cc
+++ b/gcc/ipa-inline-analysis.cc
@@ -48,6 +48,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "ipa-utils.h"
 #include "cfgexpand.h"
 #include "gimplify.h"
+#include "attribs.h"
 
 /* Cached node/edge growths.  */
 fast_call_summary<edge_growth_cache_entry *, va_heap> *edge_growth_cache = NULL;
@@ -249,15 +250,19 @@ do_estimate_edge_time (struct cgraph_edge *edge, sreal 
*ret_nonspec_time)
   hints = estimates.hints;
 }
 
-  /* When we have profile feedback, we can quite safely identify hot
- edges and for those we disable size limits.  Don't do that when
- probability that caller will call the callee is low however, since it
+  /* When we have profile feedback or function attribute, we can quite safely
+ identify hot edges and for those we disable size limits.  Don't do that
+ when probability that caller will call the callee is low however, since it
  may hurt optimization of the caller's hot path.  */
-  if (edge->count.ipa ().initialized_p () && edge->maybe_hot_p ()
+  if ((edge->count.ipa ().initialized_p () && edge->maybe_hot_p ()
   && (edge->count.ipa () * 2
  > (edge->caller->inlined_to
 ? edge->caller->inlined_to->count.ipa ()
 : edge->caller->count.ipa (
+  || (lookup_attribute ("hot", DECL_ATTRIBUTES (edge->caller->decl))
+ != NULL
+&& lookup_attribute ("hot", DECL_ATTRIBUTES (edge->callee->decl))
+ != NULL))
 hints |= INLINE_HINT_known_hot;
 
   gcc_checking_assert (size >= 0);
diff --git a/gcc/testsuite/gcc.dg/ipa/inlinehint-6.c 
b/gcc/testsuite/gcc.dg/ipa/inlinehint-6.c
new file mode 100644
index 000..1f3be641c6d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/ipa/inlinehint-6.c
@@ -0,0 +1,47 @@
+/* { dg-options "-O3 -c -fdump-ipa-inline-details -fno-early-inlining 
-fno-ipa-cp"  } */
+/* { dg-add-options bind_pic_locally } */
+
+#define size_t long long int
+
+struct A
+{
+  size_t f1, f2, f3, f4;
+};
+struct C
+{
+  struct A a;
+  size_t b;
+};
+struct C x;
+
+__attribute__((hot)) struct C callee (struct A *a, struct C *c)
+{
+  c->a=(*a);
+
+  if((c->b + 7) & 17)
+   {
+  c->a.f1 = c->a.f2 + c->a.f1;
+  c->a.f2 = c->a.f3 - c->a.f2;
+  c->a.f3 = c->a.f2 + c->a.f3;
+  c->a.f4 = c->a.f2 - c->a.f4;
+  c->b = c->a.f2;
+
+}
+  return *c;
+}
+
+__attribute__((hot)) struct C caller (size_t d, size_t e, size_t f, size_t g, 
struct C *c)
+{
+  struct A a;
+  a.f1 = 1 + d;
+  a.f2 = e;
+  a.f3 = 12 + f;
+  a.f4 = 68 + g;
+  if (c->b > 0)
+return callee (&a, c);
+  else
+return *c;
+}
+
+/* { dg-final { scan-ipa-dump "known_hot"  "inline"  } } */
+
-- 
2.17.1

Thanks,
Lili.


RE: Ping^2 [PATCH] Add attribute hot judgement for INLINE_HINT_known_hot hint.

2022-10-14 Thread Cui, Lili via Gcc-patches
 Hi Honza,

 Gentle ping  
https://gcc.gnu.org/pipermail/gcc-patches/2022-September/601934.html
 
 Thanks,
 Lili.

> -Original Message-
> From: Cui, Lili 
> Sent: Saturday, October 8, 2022 8:33 AM
> To: Cui, Lili ; Jan Hubicka 
> Cc: Lu, Hongjiu ; Liu, Hongtao
> ; gcc-patches@gcc.gnu.org
> Subject: Ping^1 [PATCH] Add attribute hot judgement for
> INLINE_HINT_known_hot hint.
> 
> Hi Honza,
> 
> Gentle ping  https://gcc.gnu.org/pipermail/gcc-patches/2022-
> September/601934.html
> 
> Thanks,
> Lili.
> 
> > -Original Message-----
> > From: Gcc-patches 
> > On Behalf Of Cui, Lili via Gcc-patches
> > Sent: Wednesday, September 21, 2022 5:22 PM
> > To: Jan Hubicka 
> > Cc: Lu, Hongjiu ; Liu, Hongtao
> > ; gcc-patches@gcc.gnu.org
> > Subject: RE: [PATCH] Add attribute hot judgement for
> > INLINE_HINT_known_hot hint.
> >
> > > Thank you.  Can you please also add a testcase that tests for this.
> > > So you modify imagemagick marking attribute hot on the specific inline?
> >
> > Thanks Honza. Added the testcase. I didn't modify source code of
> > 538.imagic_r, the original source code has attribute like:
> >
> > #define magick_hot_spot  __attribute__((__hot__)) static Cache
> > *SetPixelCacheNexusPixels( ... ) magick_hot_spot;
> >
> > > I will try to also look again at your earlier patch - I had very
> > > busy summer and unfortunately lost track on this one.
> > >
> > NP, I guessed you were busy during that time, my earlier patch was
> > partially duplicated with function "Elimination_by_inlining_prob",
> > except "parameter points to caller local memory" part, maybe we can
> > find a suitable place to add local memory part  to the IPA.
> >
> > > Honza
> >
> > gcc/ChangeLog
> >
> >   * ipa-inline-analysis.cc (do_estimate_edge_time): Add function attribute
> >   judgement for INLINE_HINT_known_hot hint.
> >
> > gcc/testsuite/ChangeLog:
> >
> >   * gcc.dg/ipa/inlinehint-6.c: New test.
> > ---
> >  gcc/ipa-inline-analysis.cc  | 13 ---
> >  gcc/testsuite/gcc.dg/ipa/inlinehint-6.c | 47
> > +
> >  2 files changed, 56 insertions(+), 4 deletions(-)  create mode 100644
> > gcc/testsuite/gcc.dg/ipa/inlinehint-6.c
> >
> > diff --git a/gcc/ipa-inline-analysis.cc b/gcc/ipa-inline-analysis.cc
> > index
> > 1ca685d1b0e..7bd29c36590 100644
> > --- a/gcc/ipa-inline-analysis.cc
> > +++ b/gcc/ipa-inline-analysis.cc
> > @@ -48,6 +48,7 @@ along with GCC; see the file COPYING3.  If not see
> > #include "ipa-utils.h"
> >  #include "cfgexpand.h"
> >  #include "gimplify.h"
> > +#include "attribs.h"
> >
> >  /* Cached node/edge growths.  */
> >  fast_call_summary
> > *edge_growth_cache = NULL; @@ -249,15 +250,19 @@
> do_estimate_edge_time
> > (struct cgraph_edge *edge, sreal *ret_nonspec_time)
> >hints = estimates.hints;
> >  }
> >
> > -  /* When we have profile feedback, we can quite safely identify hot
> > - edges and for those we disable size limits.  Don't do that when
> > - probability that caller will call the callee is low however, since it
> > +  /* When we have profile feedback or function attribute, we can
> > + quite
> > safely
> > + identify hot edges and for those we disable size limits.  Don't do 
> > that
> > + when probability that caller will call the callee is low
> > + however, since it
> >   may hurt optimization of the caller's hot path.  */
> > -  if (edge->count.ipa ().initialized_p () && edge->maybe_hot_p ()
> > +  if ((edge->count.ipa ().initialized_p () && edge->maybe_hot_p ()
> >&& (edge->count.ipa () * 2
> >   > (edge->caller->inlined_to
> >  ? edge->caller->inlined_to->count.ipa ()
> >  : edge->caller->count.ipa (
> > +  || (lookup_attribute ("hot", DECL_ATTRIBUTES (edge->caller->decl))
> > + != NULL
> > +&& lookup_attribute ("hot", DECL_ATTRIBUTES (edge->callee->decl))
> > + != NULL))
> >  hints |= INLINE_HINT_known_hot;
> >
> >gcc_checking_assert (size >= 0);
> > diff --git a/gcc/testsuite/gcc.dg/ipa/inlinehint-6.c
> > b/gcc/testsuite/gcc.dg/ipa/inlinehint-6.c
> > new file mode 100644
> > index 000..1f3be641c6d
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/ipa/inlinehint-6.c
> >

[PATCH] MAINTAINERS: Add myself for write after approval

2022-10-12 Thread Cui,Lili via Gcc-patches
Hi,

I want to add myself to MAINTAINERS for write after approval.

OK for master?

ChangeLog:
* MAINTAINERS (Write After Approval): Add myself.

---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 11fa8bc6dbd..e4e7349a6d9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -377,6 +377,7 @@ Andrea Corallo  

 Christian Cornelssen   
 Ludovic Courtès
 Lawrence Crowl 
+Lili Cui   
 Ian Dall   
 David Daney
 Robin Dapp 
-- 
2.17.1



[PATCH] Remove AVX512_VP2INTERSECT from PTA_SAPPHIRERAPIDS

2022-10-11 Thread Cui,Lili via Gcc-patches
Hi Hongtao,

This patch removes AVX512_VP2INTERSECT from PTA_SAPPHIRERAPIDS.
The new Intel ISE removes AVX512_VP2INTERSECT from SAPPHIRERAPIDS;
AVX512_VP2INTERSECT is only supported on Tigerlake.

Hi Uros,

This patch removes AVX512_VP2INTERSECT from PTA_SAPPHIRERAPIDS.
The new Intel ISE removes AVX512_VP2INTERSECT from SAPPHIRERAPIDS;
AVX512_VP2INTERSECT is only supported on Tigerlake.

Bootstrap is ok, and no regressions for i386/x86-64 testsuite.

OK for master?


gcc/ChangeLog:

* config/i386/driver-i386.cc (host_detect_local_cpu):
Move sapphirerapids out of AVX512_VP2INTERSECT.
* config/i386/i386.h: Remove AVX512_VP2INTERSECT from PTA_SAPPHIRERAPIDS.
* doc/invoke.texi: Remove AVX512_VP2INTERSECT from SAPPHIRERAPIDS.
---
 gcc/config/i386/driver-i386.cc | 13 +
 gcc/config/i386/i386.h |  7 +++
 gcc/doc/invoke.texi|  8 
 3 files changed, 12 insertions(+), 16 deletions(-)

diff --git a/gcc/config/i386/driver-i386.cc b/gcc/config/i386/driver-i386.cc
index 3c702fdca33..ef567045c67 100644
--- a/gcc/config/i386/driver-i386.cc
+++ b/gcc/config/i386/driver-i386.cc
@@ -589,15 +589,12 @@ const char *host_detect_local_cpu (int argc, const char 
**argv)
  /* This is unknown family 0x6 CPU.  */
  if (has_feature (FEATURE_AVX))
{
+ /* Assume Tiger Lake */
  if (has_feature (FEATURE_AVX512VP2INTERSECT))
-   {
- if (has_feature (FEATURE_TSXLDTRK))
-   /* Assume Sapphire Rapids.  */
-   cpu = "sapphirerapids";
- else
-   /* Assume Tiger Lake */
-   cpu = "tigerlake";
-   }
+   cpu = "tigerlake";
+ /* Assume Sapphire Rapids.  */
+ else if (has_feature (FEATURE_TSXLDTRK))
+   cpu = "sapphirerapids";
  /* Assume Cooper Lake */
  else if (has_feature (FEATURE_AVX512BF16))
cpu = "cooperlake";
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 900a3bc3673..372a2cff8fe 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2326,10 +2326,9 @@ constexpr wide_int_bitmask PTA_ICELAKE_SERVER = 
PTA_ICELAKE_CLIENT
 constexpr wide_int_bitmask PTA_TIGERLAKE = PTA_ICELAKE_CLIENT | PTA_MOVDIRI
   | PTA_MOVDIR64B | PTA_CLWB | PTA_AVX512VP2INTERSECT | PTA_KL | PTA_WIDEKL;
 constexpr wide_int_bitmask PTA_SAPPHIRERAPIDS = PTA_ICELAKE_SERVER | 
PTA_MOVDIRI
-  | PTA_MOVDIR64B | PTA_AVX512VP2INTERSECT | PTA_ENQCMD | PTA_CLDEMOTE
-  | PTA_PTWRITE | PTA_WAITPKG | PTA_SERIALIZE | PTA_TSXLDTRK | PTA_AMX_TILE
-  | PTA_AMX_INT8 | PTA_AMX_BF16 | PTA_UINTR | PTA_AVXVNNI | PTA_AVX512FP16
-  | PTA_AVX512BF16;
+  | PTA_MOVDIR64B | PTA_ENQCMD | PTA_CLDEMOTE | PTA_PTWRITE | PTA_WAITPKG
+  | PTA_SERIALIZE | PTA_TSXLDTRK | PTA_AMX_TILE | PTA_AMX_INT8 | PTA_AMX_BF16
+  | PTA_UINTR | PTA_AVXVNNI | PTA_AVX512FP16 | PTA_AVX512BF16;
 constexpr wide_int_bitmask PTA_KNL = PTA_BROADWELL | PTA_AVX512PF
   | PTA_AVX512ER | PTA_AVX512F | PTA_AVX512CD | PTA_PREFETCHWT1;
 constexpr wide_int_bitmask PTA_BONNELL = PTA_CORE2 | PTA_MOVBE;
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 271c8bb8468..a9ecc4426a4 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -32057,11 +32057,11 @@ Intel sapphirerapids CPU with 64-bit extensions, 
MOVBE, MMX, SSE, SSE2, SSE3,
 SSSE3, SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE, PCLMUL, FSGSBASE,
 RDRND, F16C, AVX2, BMI, BMI2, LZCNT, FMA, MOVBE, HLE, RDSEED, ADCX, PREFETCHW,
 AES, CLFLUSHOPT, XSAVEC, XSAVES, SGX, AVX512F, AVX512VL, AVX512BW, AVX512DQ,
-AVX512CD, PKU, AVX512VBMI, AVX512IFMA, SHA, AVX512VNNI, GFNI, VAES, AVX512VBMI2
+AVX512CD, PKU, AVX512VBMI, AVX512IFMA, SHA, AVX512VNNI, GFNI, VAES, 
AVX512VBMI2,
 VPCLMULQDQ, AVX512BITALG, RDPID, AVX512VPOPCNTDQ, PCONFIG, WBNOINVD, CLWB,
-MOVDIRI, MOVDIR64B, AVX512VP2INTERSECT, ENQCMD, CLDEMOTE, PTWRITE, WAITPKG,
-SERIALIZE, TSXLDTRK, UINTR, AMX-BF16, AMX-TILE, AMX-INT8, AVX-VNNI, AVX512FP16
-and AVX512BF16 instruction set support.
+MOVDIRI, MOVDIR64B, ENQCMD, CLDEMOTE, PTWRITE, WAITPKG, SERIALIZE, TSXLDTRK,
+UINTR, AMX-BF16, AMX-TILE, AMX-INT8, AVX-VNNI, AVX512FP16 and AVX512BF16
+instruction set support.
 
 @item alderlake
 Intel Alderlake CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3,
-- 
2.17.1

Thanks,
Lili.


Ping^1 [PATCH] Add attribute hot judgement for INLINE_HINT_known_hot hint.

2022-10-07 Thread Cui, Lili via Gcc-patches
Hi Honza,

Gentle ping  
https://gcc.gnu.org/pipermail/gcc-patches/2022-September/601934.html

Thanks,
Lili.

> -Original Message-
> From: Gcc-patches  On
> Behalf Of Cui, Lili via Gcc-patches
> Sent: Wednesday, September 21, 2022 5:22 PM
> To: Jan Hubicka 
> Cc: Lu, Hongjiu ; Liu, Hongtao
> ; gcc-patches@gcc.gnu.org
> Subject: RE: [PATCH] Add attribute hot judgement for
> INLINE_HINT_known_hot hint.
> 
> > Thank you.  Can you please also add a testcase that tests for this.
> > So you modify imagemagick marking attribute hot on the specific inline?
> 
> Thanks Honza. Added the testcase. I didn't modify source code of
> 538.imagic_r, the original source code has attribute like:
> 
> #define magick_hot_spot  __attribute__((__hot__)) static Cache
> *SetPixelCacheNexusPixels( ... ) magick_hot_spot;
> 
> > I will try to also look again at your earlier patch - I had very busy
> > summer and unfortunately lost track on this one.
> >
> NP, I guessed you were busy during that time, my earlier patch was partially
> duplicated with function "Elimination_by_inlining_prob", except "parameter
> points to caller local memory" part, maybe we can find a suitable place to
> add local memory part  to the IPA.
> 
> > Honza
> 
> gcc/ChangeLog
> 
>   * ipa-inline-analysis.cc (do_estimate_edge_time): Add function attribute
>   judgement for INLINE_HINT_known_hot hint.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/ipa/inlinehint-6.c: New test.
> ---
>  gcc/ipa-inline-analysis.cc  | 13 ---
>  gcc/testsuite/gcc.dg/ipa/inlinehint-6.c | 47 +
>  2 files changed, 56 insertions(+), 4 deletions(-)  create mode 100644
> gcc/testsuite/gcc.dg/ipa/inlinehint-6.c
> 
> diff --git a/gcc/ipa-inline-analysis.cc b/gcc/ipa-inline-analysis.cc index
> 1ca685d1b0e..7bd29c36590 100644
> --- a/gcc/ipa-inline-analysis.cc
> +++ b/gcc/ipa-inline-analysis.cc
> @@ -48,6 +48,7 @@ along with GCC; see the file COPYING3.  If not see
> #include "ipa-utils.h"
>  #include "cfgexpand.h"
>  #include "gimplify.h"
> +#include "attribs.h"
> 
>  /* Cached node/edge growths.  */
>  fast_call_summary
> *edge_growth_cache = NULL; @@ -249,15 +250,19 @@
> do_estimate_edge_time (struct cgraph_edge *edge, sreal *ret_nonspec_time)
>hints = estimates.hints;
>  }
> 
> -  /* When we have profile feedback, we can quite safely identify hot
> - edges and for those we disable size limits.  Don't do that when
> - probability that caller will call the callee is low however, since it
> +  /* When we have profile feedback or function attribute, we can quite
> safely
> + identify hot edges and for those we disable size limits.  Don't do that
> + when probability that caller will call the callee is low however,
> + since it
>   may hurt optimization of the caller's hot path.  */
> -  if (edge->count.ipa ().initialized_p () && edge->maybe_hot_p ()
> +  if ((edge->count.ipa ().initialized_p () && edge->maybe_hot_p ()
>&& (edge->count.ipa () * 2
> > (edge->caller->inlined_to
>? edge->caller->inlined_to->count.ipa ()
>: edge->caller->count.ipa (
> +  || (lookup_attribute ("hot", DECL_ATTRIBUTES (edge->caller->decl))
> +   != NULL
> +  && lookup_attribute ("hot", DECL_ATTRIBUTES (edge->callee->decl))
> +   != NULL))
>  hints |= INLINE_HINT_known_hot;
> 
>gcc_checking_assert (size >= 0);
> diff --git a/gcc/testsuite/gcc.dg/ipa/inlinehint-6.c
> b/gcc/testsuite/gcc.dg/ipa/inlinehint-6.c
> new file mode 100644
> index 000..1f3be641c6d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/ipa/inlinehint-6.c
> @@ -0,0 +1,47 @@
> +/* { dg-options "-O3 -c -fdump-ipa-inline-details -fno-early-inlining
> +-fno-ipa-cp"  } */
> +/* { dg-add-options bind_pic_locally } */
> +
> +#define size_t long long int
> +
> +struct A
> +{
> +  size_t f1, f2, f3, f4;
> +};
> +struct C
> +{
> +  struct A a;
> +  size_t b;
> +};
> +struct C x;
> +
> +__attribute__((hot)) struct C callee (struct A *a, struct C *c) {
> +  c->a=(*a);
> +
> +  if((c->b + 7) & 17)
> +   {
> +  c->a.f1 = c->a.f2 + c->a.f1;
> +  c->a.f2 = c->a.f3 - c->a.f2;
> +  c->a.f3 = c->a.f2 + c->a.f3;
> +  c->a.f4 = c->a.f2 - c->a.f4;
> +  c->b = c->a.f2;
> +
> +}
> +  return *c;
> +}
> +
> +__attribute__((hot)) struct C caller (size_t d, size_t e, size_t f,
> +size_t g, struct C *c) {
> +  struct A a;
> +  a.f1 = 1 + d;
> +  a.f2 = e;
> +  a.f3 = 12 + f;
> +  a.f4 = 68 + g;
> +  if (c->b > 0)
> +return callee (&a, c);
> +  else
> +return *c;
> +}
> +
> +/* { dg-final { scan-ipa-dump "known_hot"  "inline"  } } */
> +
> --
> 2.17.1
> 
> Thanks,
> Lili.



RE: [PATCH] Add attribute hot judgement for INLINE_HINT_known_hot hint.

2022-09-21 Thread Cui, Lili via Gcc-patches
> Thank you.  Can you please also add a testcase that tests for this.
> So you modify imagemagick marking attribute hot on the specific inline?

Thanks Honza. Added the testcase. I didn't modify the source code of
538.imagic_r; the original source already has attributes like:

#define magick_hot_spot  __attribute__((__hot__))
static Cache *SetPixelCacheNexusPixels( ... ) magick_hot_spot;

> I will try to also look again at your earlier patch - I had very busy summer 
> and
> unfortunately lost track on this one.
>
NP, I guessed you were busy during that time. My earlier patch partially
duplicated the function "eliminated_by_inlining_prob", except for the
"parameter points to caller local memory" part; maybe we can find a suitable
place to add the local-memory part to the IPA predicates.

> Honza

gcc/ChangeLog

  * ipa-inline-analysis.cc (do_estimate_edge_time): Add function attribute
  judgement for INLINE_HINT_known_hot hint.

gcc/testsuite/ChangeLog:

  * gcc.dg/ipa/inlinehint-6.c: New test.
---
 gcc/ipa-inline-analysis.cc  | 13 ---
 gcc/testsuite/gcc.dg/ipa/inlinehint-6.c | 47 +
 2 files changed, 56 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/ipa/inlinehint-6.c

diff --git a/gcc/ipa-inline-analysis.cc b/gcc/ipa-inline-analysis.cc
index 1ca685d1b0e..7bd29c36590 100644
--- a/gcc/ipa-inline-analysis.cc
+++ b/gcc/ipa-inline-analysis.cc
@@ -48,6 +48,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "ipa-utils.h"
 #include "cfgexpand.h"
 #include "gimplify.h"
+#include "attribs.h"
 
 /* Cached node/edge growths.  */
 fast_call_summary<edge_growth_cache_entry *, va_heap> *edge_growth_cache = NULL;
@@ -249,15 +250,19 @@ do_estimate_edge_time (struct cgraph_edge *edge, sreal 
*ret_nonspec_time)
   hints = estimates.hints;
 }
 
-  /* When we have profile feedback, we can quite safely identify hot
- edges and for those we disable size limits.  Don't do that when
- probability that caller will call the callee is low however, since it
+  /* When we have profile feedback or function attribute, we can quite safely
+ identify hot edges and for those we disable size limits.  Don't do that
+ when probability that caller will call the callee is low however, since it
  may hurt optimization of the caller's hot path.  */
-  if (edge->count.ipa ().initialized_p () && edge->maybe_hot_p ()
+  if ((edge->count.ipa ().initialized_p () && edge->maybe_hot_p ()
   && (edge->count.ipa () * 2
  > (edge->caller->inlined_to
 ? edge->caller->inlined_to->count.ipa ()
 : edge->caller->count.ipa (
+  || (lookup_attribute ("hot", DECL_ATTRIBUTES (edge->caller->decl))
+ != NULL
+&& lookup_attribute ("hot", DECL_ATTRIBUTES (edge->callee->decl))
+ != NULL))
 hints |= INLINE_HINT_known_hot;
 
   gcc_checking_assert (size >= 0);
diff --git a/gcc/testsuite/gcc.dg/ipa/inlinehint-6.c 
b/gcc/testsuite/gcc.dg/ipa/inlinehint-6.c
new file mode 100644
index 000..1f3be641c6d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/ipa/inlinehint-6.c
@@ -0,0 +1,47 @@
+/* { dg-options "-O3 -c -fdump-ipa-inline-details -fno-early-inlining 
-fno-ipa-cp"  } */
+/* { dg-add-options bind_pic_locally } */
+
+#define size_t long long int
+
+struct A
+{
+  size_t f1, f2, f3, f4;
+};
+struct C
+{
+  struct A a;
+  size_t b;
+};
+struct C x;
+
+__attribute__((hot)) struct C callee (struct A *a, struct C *c)
+{
+  c->a=(*a);
+
+  if((c->b + 7) & 17)
+   {
+  c->a.f1 = c->a.f2 + c->a.f1;
+  c->a.f2 = c->a.f3 - c->a.f2;
+  c->a.f3 = c->a.f2 + c->a.f3;
+  c->a.f4 = c->a.f2 - c->a.f4;
+  c->b = c->a.f2;
+
+}
+  return *c;
+}
+
+__attribute__((hot)) struct C caller (size_t d, size_t e, size_t f, size_t g, 
struct C *c)
+{
+  struct A a;
+  a.f1 = 1 + d;
+  a.f2 = e;
+  a.f3 = 12 + f;
+  a.f4 = 68 + g;
+  if (c->b > 0)
+return callee (&a, c);
+  else
+return *c;
+}
+
+/* { dg-final { scan-ipa-dump "known_hot"  "inline"  } } */
+
-- 
2.17.1

Thanks,
Lili.



0001-Add-attribute-hot-judgement-for-INLINE_HINT_known_ho.patch
Description: 0001-Add-attribute-hot-judgement-for-INLINE_HINT_known_ho.patch


[PATCH] Add attribute hot judgement for INLINE_HINT_known_hot hint.

2022-09-20 Thread Cui,Lili via Gcc-patches
Hi Honza,

This patch adds an attribute hot judgement for the INLINE_HINT_known_hot hint.

We set up the INLINE_HINT_known_hot hint only when we have profile feedback;
now add a function-attribute judgement for it: when both caller and callee
have __attribute__((hot)), we will also set up the INLINE_HINT_known_hot hint
for them.

With this patch applied
                                 Ratio   Codesize
ADL    Multi-copy: 538.imagic_r  16.7%   1.6%
SPR    Multi-copy: 538.imagic_r  15%     1.7%
ICX    Multi-copy: 538.imagic_r  15.2%   1.7%
CLX    Multi-copy: 538.imagic_r  12.7%   1.7%
Znver3 Multi-copy: 538.imagic_r  10.6%   1.5%

Bootstrap and regtest pending on x86_64-unknown-linux-gnu.
OK for trunk?

Thanks,
Lili.

gcc/ChangeLog

  * ipa-inline-analysis.cc (do_estimate_edge_time): Add function attribute
  judgement for INLINE_HINT_known_hot hint.
---
 gcc/ipa-inline-analysis.cc | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/gcc/ipa-inline-analysis.cc b/gcc/ipa-inline-analysis.cc
index 1ca685d1b0e..7bd29c36590 100644
--- a/gcc/ipa-inline-analysis.cc
+++ b/gcc/ipa-inline-analysis.cc
@@ -48,6 +48,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "ipa-utils.h"
 #include "cfgexpand.h"
 #include "gimplify.h"
+#include "attribs.h"
 
 /* Cached node/edge growths.  */
 fast_call_summary<edge_growth_cache_entry *, va_heap> *edge_growth_cache = NULL;
@@ -249,15 +250,19 @@ do_estimate_edge_time (struct cgraph_edge *edge, sreal 
*ret_nonspec_time)
   hints = estimates.hints;
 }
 
-  /* When we have profile feedback, we can quite safely identify hot
- edges and for those we disable size limits.  Don't do that when
- probability that caller will call the callee is low however, since it
+  /* When we have profile feedback or function attribute, we can quite safely
+ identify hot edges and for those we disable size limits.  Don't do that
+ when probability that caller will call the callee is low however, since it
  may hurt optimization of the caller's hot path.  */
-  if (edge->count.ipa ().initialized_p () && edge->maybe_hot_p ()
+  if ((edge->count.ipa ().initialized_p () && edge->maybe_hot_p ()
   && (edge->count.ipa () * 2
  > (edge->caller->inlined_to
 ? edge->caller->inlined_to->count.ipa ()
 : edge->caller->count.ipa (
+  || (lookup_attribute ("hot", DECL_ATTRIBUTES (edge->caller->decl))
+ != NULL
+&& lookup_attribute ("hot", DECL_ATTRIBUTES (edge->callee->decl))
+ != NULL))
 hints |= INLINE_HINT_known_hot;
 
   gcc_checking_assert (size >= 0);
-- 
2.17.1



RE: [PATCH] Add a heuristic for eliminate redundant load and store in inline pass.

2022-07-18 Thread Cui, Lili via Gcc-patches
Hi Honza,
Gentle ping  https://gcc.gnu.org/pipermail/gcc-patches/2022-July/597891.html

Thanks,
Lili.

> -Original Message-
> From: Gcc-patches  On
> Behalf Of Cui, Lili via Gcc-patches
> Sent: Sunday, July 10, 2022 10:05 PM
> To: Jan Hubicka 
> Cc: Lu, Hongjiu ; Liu, Hongtao
> ; gcc-patches@gcc.gnu.org
> Subject: RE: [PATCH] Add a heuristic for eliminate redundant load and store
> in inline pass.
> 
> 
> > -Original Message-
> > From: Jan Hubicka 
> > This is an interesting idea.  Basically we want to guess if inlining will
> > make SRA and/or store->load propagation possible.  I think the
> > solution using INLINE_HINT may be a bit too trigger-happy, since it is
> > very common that this happens and with -O3 the hints are taken quite
> > seriously.
> >
> > We already have a mechanism to predict this situation by simply
> > expecting that stores to addresses pointed to by function parameters
> > will be eliminated by 50%.  See eliminated_by_inlining_prob.
> >
> > I was thinking that we may combine it with the knowledge that the
> > parameter points to caller-local memory (which is done by llvm's
> > heuristics), which can be added to IPA predicates.
> >
> > The idea of checking that the actual store in question is paired with
> > a load at the caller side is a bit harder: one needs to invent a
> > representation for such conditions.  So I wonder how much extra help
> > we need for the critical inlining to happen at imagemagick?
> 
> Hi Honza,
> 
> Really appreciate the feedback. I found that eliminated_by_inlining_prob
> does eliminate the stmt 50% of the time, but the gap is still big.
> SRA cannot split the callee's parameter because "Do not decompose non-BLKmode
> parameters in a way that would create a BLKmode parameter. Especially for
> pass-by-reference (hence, pointer type parameters), it's not worth it."
> 
> Critical inline function information
> 
> Caller:GetVirtualPixelsFromNexus
> size:541
> time:  484.08
> e->freq: 0.83
> 
> Callee:SetPixelCacheNexusPixels
> nonspec time: 46.60
> time : 36.18
> size:87
> 
> 
> Since the callee's insn count (87) is bigger than inline_insns_auto (30)
> and there is no hint, inlining depends on "big_speedup_p (e)".  484.08
> (caller_time) * 0.15 (param_inline_min_speedup == 15) = 72.61, which means
> the callee's time should be at least 72.61, but the callee's time is 46.60,
> so we would need to lower param_inline_min_speedup to 3 or 4. I checked the
> history (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83665): you tried
> changing it to 8, but that increases the gzip code size by 2.5KB, so I want
> to add a heuristic hint for it instead.
> 
> Thanks,
> Lili.
> >
> > Honza



RE: [PATCH] Add a heuristic for eliminate redundant load and store in inline pass.

2022-07-10 Thread Cui, Lili via Gcc-patches


> -Original Message-
> From: Jan Hubicka 
> This is an interesting idea.  Basically we want to guess if inlining will
> make SRA and/or store->load propagation possible.  I think the
> solution using INLINE_HINT may be a bit too trigger-happy, since it is very
> common that this happens and with -O3 the hints are taken quite seriously.
> 
> We already have a mechanism to predict this situation by simply expecting
> that stores to addresses pointed to by function parameters will be
> eliminated by 50%.  See eliminated_by_inlining_prob.
> 
> I was thinking that we may combine it with the knowledge that the parameter
> points to caller-local memory (which is done by llvm's
> heuristics), which can be added to IPA predicates.
> 
> The idea of checking that the actual store in question is paired with a
> load at the caller side is a bit harder: one needs to invent a
> representation for such conditions.  So I wonder how much extra help we
> need for the critical inlining to happen at imagemagick?

Hi Honza,

Really appreciate the feedback. I found that eliminated_by_inlining_prob 
does eliminate the stmt 50% of the time, but the gap is still big. 
SRA cannot split the callee's parameter because "Do not decompose non-BLKmode 
parameters in a way that would create a BLKmode parameter. Especially for 
pass-by-reference (hence, pointer type parameters), it's not worth it."

Critical inline function information

Caller:GetVirtualPixelsFromNexus
size:541
time:  484.08
e->freq: 0.83

Callee:SetPixelCacheNexusPixels
nonspec time: 46.60
time : 36.18
size:87


Since the callee's insn count (87) is bigger than inline_insns_auto (30) and 
there is no hint, inlining depends on "big_speedup_p (e)".  484.08 
(caller_time) * 0.15 (param_inline_min_speedup == 15) = 72.61, which means the 
callee's time should be at least 72.61, but the callee's time is 46.60, so we 
would need to lower param_inline_min_speedup to 3 or 4. I checked the 
history (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83665): you tried 
changing it to 8, but that increases the gzip code size by 2.5KB, so I want to 
add a heuristic hint for it instead.
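The arithmetic above can be sketched as follows (a simplification, not GCC's actual big_speedup_p implementation; the helper name is made up and the numbers are the ones quoted in this message):

```c
/* Simplified sketch of the big_speedup_p reasoning: inlining is only
   considered a "big speedup" when the time attributed to the callee is
   at least param_inline_min_speedup percent of the caller's time.  */
static int
big_speedup_sketch (double caller_time, double callee_time, int min_speedup)
{
  /* Minimum saving the inliner demands before relaxing size limits.  */
  double required_saving = caller_time * min_speedup / 100.0;
  return callee_time >= required_saving;
}
```

With caller_time = 484.08 and callee_time = 46.60, the default threshold of 15% requires a saving of 72.61 and fails, while a threshold of 4% requires only 19.36 and passes, which is the trade-off discussed above.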

Thanks,
Lili.
> 
> Honza



[PATCH] Add a heuristic for eliminate redundant load and store in inline pass.

2022-07-06 Thread Cui,Lili via Gcc-patches
From: Lili 


Hi Hubicka,

This patch adds a heuristic inline hint to eliminate redundant loads and 
stores.

Bootstrap and regtest pending on x86_64-unknown-linux-gnu.
OK for trunk?

Thanks,
Lili.

Add an INLINE_HINT_eliminate_load_and_store hint to the inline pass.
We accumulate the number of redundant load and store insns that can be
eliminated by the three cases below; when that count is greater than the
threshold, we enable the hint. With the hint, inlining_insns_auto
will enlarge the bound.

1. Caller's store is the same as callee's load
2. Caller's load is the same as callee's load
3. Callee's load is the same as caller's local memory access
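A minimal, hypothetical C example of case 1 (the names are made up for illustration): the caller stores to local memory that the callee immediately reloads, so inlining exposes the store/load pair to forwarding and elimination:

```c
struct pix { long a, b; };

/* Callee: its loads match the caller's just-executed stores (case 1).  */
static long
sum_pix (const struct pix *p)
{
  return p->a + p->b;
}

long
store_load_caller (long x, long y)
{
  struct pix tmp;
  tmp.a = x;              /* caller's store ...              */
  tmp.b = y;              /* ... and another                 */
  return sum_pix (&tmp);  /* folds to x + y once inlined     */
}
```

Without inlining, the stores to `tmp` and the loads in `sum_pix` cross a call boundary and cannot be removed; after inlining, store-to-load forwarding reduces the body to a single addition.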

With the patch applied:

Icelake server: 538.imagick_r gets a 14.10% improvement for multicopy and a
38.90% improvement for single copy, with no measurable changes for other
benchmarks.

Cascadelake: 538.imagick_r gets a 12.5% improvement for multicopy, with code
size increased by 0.2% and no measurable changes for other benchmarks.

Znver3 server: 538.imagick_r gets a 14.20% improvement for multicopy, with code
size increased by 0.3% and no measurable changes for other benchmarks.

CPU2017 single copy performance data for Icelake server
BenchMarks   Score   Build time  Code size
500.perlbench_r  1.50%   -0.20%  0.00%
502.gcc_r0.10%   -0.10%  0.00%
505.mcf_r0.00%   1.70%   0.00%
520.omnetpp_r-0.60%  -0.30%  0.00%
523.xalancbmk_r  0.60%   0.00%   0.00%
525.x264_r   0.00%   -0.20%  0.00%
531.deepsjeng_r  0.40%   -1.10%  -0.10%
541.leela_r  0.00%   0.00%   0.00%
548.exchange2_r  0.00%   -0.90%  0.00%
557.xz_r 0.00%   0.00%   0.00%
503.bwaves_r 0.00%   1.40%   0.00%
507.cactuBSSN_r  0.00%   1.00%   0.00%
508.namd_r   0.00%   0.30%   0.00%
510.parest_r 0.00%   -0.40%  0.00%
511.povray_r 0.70%   -0.60%  0.00%
519.lbm_r0.00%   0.00%   0.00%
521.wrf_r0.00%   0.60%   0.00%
526.blender_r0.00%   0.00%   0.00%
527.cam4_r   -0.30%  -0.50%  0.00%
538.imagick_r38.90%  0.50%   0.20%
544.nab_r0.00%   1.10%   0.00%
549.fotonik3d_r  0.00%   0.90%   0.00%
554.roms_r   2.30%   -0.10%  0.00%
Geomean-int  0.00%   -0.30%  0.00%
Geomean-fp   3.80%   0.30%   0.00%

gcc/ChangeLog:

* ipa-fnsummary.cc (ipa_dump_hints): Add print for hint 
"eliminate_load_and_store".
* ipa-fnsummary.h (enum ipa_hints_vals): Add 
INLINE_HINT_eliminate_load_and_store.
* ipa-inline-analysis.cc (do_estimate_edge_time): Add judgment for 
INLINE_HINT_eliminate_load_and_store.
* ipa-inline.cc (want_inline_small_function_p): Add 
"INLINE_HINT_eliminate_load_and_store" for hints flag.
* ipa-modref-tree.h (struct modref_access_node): Move function contains 
to public.
(struct modref_tree): Add new functions "same" and 
"local_vector_memory_accesse".
* ipa-modref.cc (eliminate_load_and_store): New.
(ipa_merge_modref_summary_after_inlining): Change the input value of 
useful_p.
* ipa-modref.h (eliminate_load_and_store): New.
* opts.cc: Add param "min_inline_hint_eliminate_loads_num".
* params.opt: Ditto.

gcc/testsuite/ChangeLog:

* gcc.dg/ipa/inlinehint-6.c: New test.
---
 gcc/ipa-fnsummary.cc|   5 ++
 gcc/ipa-fnsummary.h |   4 +-
 gcc/ipa-inline-analysis.cc  |   7 ++
 gcc/ipa-inline.cc   |   3 +-
 gcc/ipa-modref-tree.h   | 109 +++-
 gcc/ipa-modref.cc   |  46 +-
 gcc/ipa-modref.h|   1 +
 gcc/opts.cc |   1 +
 gcc/params.opt  |   4 +
 gcc/testsuite/gcc.dg/ipa/inlinehint-6.c |  54 
 10 files changed, 229 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/ipa/inlinehint-6.c

diff --git a/gcc/ipa-fnsummary.cc b/gcc/ipa-fnsummary.cc
index e2a86680a21..0a962f62490 100644
--- a/gcc/ipa-fnsummary.cc
+++ b/gcc/ipa-fnsummary.cc
@@ -150,6 +150,11 @@ ipa_dump_hints (FILE *f, ipa_hints hints)
   hints &= ~INLINE_HINT_builtin_constant_p;
   fprintf (f, " builtin_constant_p");
 }
+  if (hints & INLINE_HINT_eliminate_load_and_store)
+{
+  hints &= ~INLINE_HINT_eliminate_load_and_store;
+  fprintf (f, " eliminate_load_and_store");
+}
   gcc_assert (!hints);
 }
 
diff --git a/gcc/ipa-fnsummary.h b/gcc/ipa-fnsummary.h
index 941fea6de0d..5f589e5ea0d 100644
--- a/gcc/ipa-fnsummary.h
+++ b/gcc/ipa-fnsummary.h
@@ -52,7 +52,9 @@ enum ipa_hints_vals {
   INLINE_HINT_known_hot = 128,
   /* There is builtin_constant_p dependent on parameter which is usually
  a strong hint to inline.  */
-  INLINE_HINT_builtin_constant_p = 256
+  INLINE_HINT_builtin_constant_p = 256,
+  /* Inlining can 

[PATCH] testsuite: Add -mtune=generic to dg-options for two testcases.

2022-06-10 Thread Cui,Lili via Gcc-patches
This patch changes the dg-options for two testcases.

Use -mtune=generic to constrain these two testcases, because a GCC configured
to default to -mtune=cascadelake or znver3 vectorizes them.

regtested on x86_64-linux-gnu{-m32,}. Ok for trunk?

Thanks,
Lili.

Use -mtune=generic to constrain these two test cases, because a GCC configured
to default to -mtune=cascadelake or znver3 vectorizes them.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c: Add
-mtune=generic to dg-options.
* gcc.target/i386/pr84101.c: Likewise.
---
 .../gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c | 2 +-
 gcc/testsuite/gcc.target/i386/pr84101.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c 
b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
index 7637cdb4a97..d060135d877 100644
--- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
+/* { dg-additional-options "-msse -mtune=generic -fdump-tree-slp2-details" } */
 
 struct S { unsigned long a, b; } s;
 
diff --git a/gcc/testsuite/gcc.target/i386/pr84101.c 
b/gcc/testsuite/gcc.target/i386/pr84101.c
index cf144894f9b..2c5a97308ca 100644
--- a/gcc/testsuite/gcc.target/i386/pr84101.c
+++ b/gcc/testsuite/gcc.target/i386/pr84101.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O3 -fdump-tree-slp2-details" } */
+/* { dg-options "-O3 -mtune=generic -fdump-tree-slp2-details" } */
 
 typedef struct uint64_pair uint64_pair_t ;
 struct uint64_pair
-- 
2.17.1



RE: [PATCH] Update {skylake,icelake,alderlake}_cost to add a bit preference to vector store.

2022-06-07 Thread Cui, Lili via Gcc-patches
> -Original Message-
> From: Hongtao Liu 
> Sent: Monday, June 6, 2022 1:25 PM
> To: H.J. Lu 
> Cc: Cui, Lili ; Liu, Hongtao ; GCC
> Patches 
> Subject: Re: [PATCH] Update {skylake,icelake,alderlake}_cost to add a bit
> preference to vector store.
> >
> > Should we add some tests to verify improvements?
> We can take pr99881.c as a unit test.
> 
> Ok for the trunk.
> >
> > --
> > H.J.
> 
Hi Hongtao,

1. I added test case pr105493.c for 525.x264_r. For 538.imagick_r we have 
pr99881.c.
2. I changed the dg-final check in pr105638.c due to code generation changes.

Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. Ok for trunk?

Thanks,
Lili.

> 
> --
> BR,
> Hongtao


0001-Update-skylake-icelake-alderlake-_cost-to-add-a-bit-.patch


[PATCH] Update {skylake, icelake, alderlake}_cost to add a bit preference to vector store.

2022-05-31 Thread Cui,Lili via Gcc-patches
This patch updates {skylake,icelake,alderlake}_cost to add a bit of 
preference to vector stores.
Since the integer vector construction cost has changed, we need to adjust the 
load and store costs for Intel processors.

With the patch applied
538.imagick_r: gets ~6% improvement on ADL for multicopy.
525.x264_r:    gets ~2% improvement on ADL and ICX for multicopy.
With no measurable changes for other benchmarks.

Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. Ok for trunk?

Thanks,
Lili.

gcc/ChangeLog

PR target/105493
* config/i386/x86-tune-costs.h (skylake_cost): Raise the gpr load cost
from 4 to 6 and gpr store cost from 6 to 8. Change SSE loads and
unaligned loads cost from {6, 6, 6, 10, 20} to {8, 8, 8, 8, 16}.
(icelake_cost): Ditto.
(alderlake_cost): Raise the gpr store cost from 6 to 8 and SSE loads,
stores and unaligned stores cost from {6, 6, 6, 10, 15} to
{8, 8, 8, 10, 15}.

gcc/testsuite/

PR target/105493
* gcc.target/i386/pr91446.c: Adjust to expect vectorization.
* gcc.target/i386/pr99881.c: XFAIL.
---
 gcc/config/i386/x86-tune-costs.h| 26 -
 gcc/testsuite/gcc.target/i386/pr91446.c |  2 +-
 gcc/testsuite/gcc.target/i386/pr99881.c |  2 +-
 3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index ea34a939c68..6c9066c84cc 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1897,15 +1897,15 @@ struct processor_costs skylake_cost = {
   8,   /* "large" insn */
   17,  /* MOVE_RATIO */
   17,  /* CLEAR_RATIO */
-  {4, 4, 4},   /* cost of loading integer registers
+  {6, 6, 6},   /* cost of loading integer registers
   in QImode, HImode and SImode.
   Relative to reg-reg move (2).  */
-  {6, 6, 6},   /* cost of storing integer registers */
-  {6, 6, 6, 10, 20},   /* cost of loading SSE register
+  {8, 8, 8},   /* cost of storing integer registers */
+  {8, 8, 8, 8, 16},/* cost of loading SSE register
   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {8, 8, 8, 8, 16},/* cost of storing SSE register
   in 32bit, 64bit, 128bit, 256bit and 512bit */
-  {6, 6, 6, 10, 20},   /* cost of unaligned loads.  */
+  {8, 8, 8, 8, 16},/* cost of unaligned loads.  */
   {8, 8, 8, 8, 16},/* cost of unaligned stores.  */
   2, 2, 4, /* cost of moving XMM,YMM,ZMM register */
   6,   /* cost of moving SSE register to integer.  */
@@ -2023,15 +2023,15 @@ struct processor_costs icelake_cost = {
   8,   /* "large" insn */
   17,  /* MOVE_RATIO */
   17,  /* CLEAR_RATIO */
-  {4, 4, 4},   /* cost of loading integer registers
+  {6, 6, 6},   /* cost of loading integer registers
   in QImode, HImode and SImode.
   Relative to reg-reg move (2).  */
-  {6, 6, 6},   /* cost of storing integer registers */
-  {6, 6, 6, 10, 20},   /* cost of loading SSE register
+  {8, 8, 8},   /* cost of storing integer registers */
+  {8, 8, 8, 8, 16},/* cost of loading SSE register
   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {8, 8, 8, 8, 16},/* cost of storing SSE register
   in 32bit, 64bit, 128bit, 256bit and 512bit */
-  {6, 6, 6, 10, 20},   /* cost of unaligned loads.  */
+  {8, 8, 8, 8, 16},/* cost of unaligned loads.  */
   {8, 8, 8, 8, 16},/* cost of unaligned stores.  */
   2, 2, 4, /* cost of moving XMM,YMM,ZMM register */
   6,   /* cost of moving SSE register to integer.  */
@@ -2146,13 +2146,13 @@ struct processor_costs alderlake_cost = {
   {6, 6, 6},   /* cost of loading integer registers
   in QImode, HImode and SImode.
   Relative to reg-reg move (2).  */
-  {6, 6, 6},   /* cost of storing integer registers */
-  {6, 6, 6, 10, 15},   /* cost of loading SSE register
+  {8, 8, 8},  

[PATCH] x86: Correct march=sapphirerapids to base on icelake server

2022-03-17 Thread Cui,Lili via Gcc-patches
Hi Hongtao,

This patch corrects -march=sapphirerapids to be based on Icelake server,
and updates sapphirerapids in the documentation.

OK for master and backport to GCC 11?


gcc/Changelog:

PR target/104963
* config/i386/i386.h (PTA_SAPPHIRERAPIDS): Change it to be based on ICX.
* doc/invoke.texi: Update documents for Intel sapphirerapids.

gcc/testsuite/ChangeLog

PR target/104963
* gcc.target/i386/pr104963.c: New test case.
---
 gcc/config/i386/i386.h   |  5 +++--
 gcc/doc/invoke.texi  | 11 ++-
 gcc/testsuite/gcc.target/i386/pr104963.c | 12 
 3 files changed, 21 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr104963.c

diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 37b523cea4f..b92955177fe 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2323,10 +2323,11 @@ constexpr wide_int_bitmask PTA_ICELAKE_SERVER = PTA_ICELAKE_CLIENT
   | PTA_PCONFIG | PTA_WBNOINVD | PTA_CLWB;
 constexpr wide_int_bitmask PTA_TIGERLAKE = PTA_ICELAKE_CLIENT | PTA_MOVDIRI
   | PTA_MOVDIR64B | PTA_CLWB | PTA_AVX512VP2INTERSECT | PTA_KL | PTA_WIDEKL;
-constexpr wide_int_bitmask PTA_SAPPHIRERAPIDS = PTA_COOPERLAKE | PTA_MOVDIRI
+constexpr wide_int_bitmask PTA_SAPPHIRERAPIDS = PTA_ICELAKE_SERVER | PTA_MOVDIRI
   | PTA_MOVDIR64B | PTA_AVX512VP2INTERSECT | PTA_ENQCMD | PTA_CLDEMOTE
   | PTA_PTWRITE | PTA_WAITPKG | PTA_SERIALIZE | PTA_TSXLDTRK | PTA_AMX_TILE
-  | PTA_AMX_INT8 | PTA_AMX_BF16 | PTA_UINTR | PTA_AVXVNNI | PTA_AVX512FP16;
+  | PTA_AMX_INT8 | PTA_AMX_BF16 | PTA_UINTR | PTA_AVXVNNI | PTA_AVX512FP16
+  | PTA_AVX512BF16;
 constexpr wide_int_bitmask PTA_KNL = PTA_BROADWELL | PTA_AVX512PF
   | PTA_AVX512ER | PTA_AVX512F | PTA_AVX512CD | PTA_PREFETCHWT1;
 constexpr wide_int_bitmask PTA_BONNELL = PTA_CORE2 | PTA_MOVBE;
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index d65979bba3f..59baa5e5747 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -31288,11 +31288,12 @@ AVX512VP2INTERSECT and KEYLOCKER instruction set support.
 Intel sapphirerapids CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3,
 SSSE3, SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE, PCLMUL, FSGSBASE,
 RDRND, F16C, AVX2, BMI, BMI2, LZCNT, FMA, MOVBE, HLE, RDSEED, ADCX, PREFETCHW,
-AES, CLFLUSHOPT, XSAVEC, XSAVES, SGX, AVX512F, CLWB, AVX512VL, AVX512BW,
-AVX512DQ, AVX512CD, AVX512VNNI, AVX512BF16 MOVDIRI, MOVDIR64B,
-AVX512VP2INTERSECT, ENQCMD, CLDEMOTE, PTWRITE, WAITPKG, SERIALIZE, TSXLDTRK,
-UINTR, AMX-BF16, AMX-TILE, AMX-INT8, AVX-VNNI and AVX512FP16 instruction set
-support.
+AES, CLFLUSHOPT, XSAVEC, XSAVES, SGX, AVX512F, AVX512VL, AVX512BW, AVX512DQ,
+AVX512CD, PKU, AVX512VBMI, AVX512IFMA, SHA, AVX512VNNI, GFNI, VAES, AVX512VBMI2
+VPCLMULQDQ, AVX512BITALG, RDPID, AVX512VPOPCNTDQ, PCONFIG, WBNOINVD, CLWB,
+MOVDIRI, MOVDIR64B, AVX512VP2INTERSECT, ENQCMD, CLDEMOTE, PTWRITE, WAITPKG,
+SERIALIZE, TSXLDTRK, UINTR, AMX-BF16, AMX-TILE, AMX-INT8, AVX-VNNI and
+AVX512FP16 instruction set support.
 
 @item alderlake
 Intel Alderlake CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3,
diff --git a/gcc/testsuite/gcc.target/i386/pr104963.c 
b/gcc/testsuite/gcc.target/i386/pr104963.c
new file mode 100644
index 000..19000671ebf
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr104963.c
@@ -0,0 +1,12 @@
+/* PR target/104963 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=sapphirerapids" } */
+
+#include <immintrin.h>
+
+__m512i
+foo (__m512i a, __m512i b)
+{
+return _mm512_permutexvar_epi8(a, b);
+}
+
-- 
2.17.1

Thanks.


[PATCH] x86: Update Intel architectures ISA support in documentation.

2022-02-21 Thread Cui,Lili via Gcc-patches
Hi Uros,

This patch updates the Intel architectures' ISA support in the documentation.
The ISAs listed for Intel architectures in the documentation are inconsistent
with the actual support, so update them all.

OK for master?


gcc/Changelog:

  * doc/invoke.texi: Update documents for Intel architectures.
---
 gcc/doc/invoke.texi | 185 +++-
 1 file changed, 98 insertions(+), 87 deletions(-)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 635c5f79278..60472a21255 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -31086,66 +31086,69 @@ instruction set is used, so the code runs on all i686 family chips.
 When used with @option{-mtune}, it has the same meaning as @samp{generic}.
 
 @item pentium2
-Intel Pentium II CPU, based on Pentium Pro core with MMX instruction set
-support.
+Intel Pentium II CPU, based on Pentium Pro core with MMX and FXSR instruction
+set support.
 
 @item pentium3
 @itemx pentium3m
-Intel Pentium III CPU, based on Pentium Pro core with MMX and SSE instruction
-set support.
+Intel Pentium III CPU, based on Pentium Pro core with MMX, FXSR and SSE
+instruction set support.
 
 @item pentium-m
 Intel Pentium M; low-power version of Intel Pentium III CPU
-with MMX, SSE and SSE2 instruction set support.  Used by Centrino notebooks.
+with MMX, SSE, SSE2 and FXSR instruction set support.  Used by Centrino
+notebooks.
 
 @item pentium4
 @itemx pentium4m
-Intel Pentium 4 CPU with MMX, SSE and SSE2 instruction set support.
+Intel Pentium 4 CPU with MMX, SSE, SSE2 and FXSR instruction set support.
 
 @item prescott
-Improved version of Intel Pentium 4 CPU with MMX, SSE, SSE2 and SSE3 instruction
-set support.
+Improved version of Intel Pentium 4 CPU with MMX, SSE, SSE2, SSE3 and FXSR
+instruction set support.
 
 @item nocona
 Improved version of Intel Pentium 4 CPU with 64-bit extensions, MMX, SSE,
-SSE2 and SSE3 instruction set support.
+SSE2, SSE3 and FXSR instruction set support.
 
 @item core2
-Intel Core 2 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3
-instruction set support.
+Intel Core 2 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3, CX16,
+SAHF and FXSR instruction set support.
 
 @item nehalem
 Intel Nehalem CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3,
-SSE4.1, SSE4.2 and POPCNT instruction set support.
+SSE4.1, SSE4.2, POPCNT, CX16, SAHF and FXSR instruction set support.
 
 @item westmere
 Intel Westmere CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3,
-SSE4.1, SSE4.2, POPCNT, AES and PCLMUL instruction set support.
+SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR and PCLMUL instruction set support.
 
 @item sandybridge
 Intel Sandy Bridge CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3,
-SSE4.1, SSE4.2, POPCNT, AVX, AES and PCLMUL instruction set support.
+SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE and PCLMUL instruction set
+support.
 
 @item ivybridge
 Intel Ivy Bridge CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3,
-SSE4.1, SSE4.2, POPCNT, AVX, AES, PCLMUL, FSGSBASE, RDRND and F16C
-instruction set support.
+SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE, PCLMUL, FSGSBASE, RDRND
+and F16C instruction set support.
 
 @item haswell
 Intel Haswell CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3,
-SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA,
-BMI, BMI2 and F16C instruction set support.
+SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE, PCLMUL, FSGSBASE, RDRND,
+F16C, AVX2, BMI, BMI2, LZCNT, FMA, MOVBE and HLE instruction set support.
 
 @item broadwell
 Intel Broadwell CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3,
-SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA, BMI, BMI2,
-F16C, RDSEED ADCX and PREFETCHW instruction set support.
+SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE, PCLMUL, FSGSBASE, RDRND,
+F16C, AVX2, BMI, BMI2, LZCNT, FMA, MOVBE, HLE, RDSEED, ADCX and PREFETCHW
+instruction set support.
 
 @item skylake
 Intel Skylake CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3,
-SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA,
-BMI, BMI2, F16C, RDSEED, ADCX, PREFETCHW, CLFLUSHOPT, XSAVEC and XSAVES
-instruction set support.
+SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE, PCLMUL, FSGSBASE, RDRND,
+F16C, AVX2, BMI, BMI2, LZCNT, FMA, MOVBE, HLE, RDSEED, ADCX, PREFETCHW, AES,
+CLFLUSHOPT, XSAVEC, XSAVES and SGX instruction set support.
 
 @item bonnell
 Intel Bonnell CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3 and SSSE3
@@ -31153,113 +31156,121 @@ instruction set support.
 
 @item silvermont
 Intel Silvermont CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3,
-SSE4.1, SSE4.2, POPCNT, AES, PREFETCHW, PCLMUL and RDRND instruction set support.
+SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, PCLMUL, PREFETCHW and RDRND
+instruction set support.
 
 @item goldmont
 Intel Goldmont CPU with 64-bit extensions, 

[PATCH] x86: Update model value for Alderlake and Rocketlake

2022-01-03 Thread Cui,Lili via Gcc-patches
Hi Uros,

This patch is to update model value for Alderlake and Rocketlake.

Bootstrap is ok, and no regressions for i386/x86-64 testsuite.

OK for master?

gcc/ChangeLog

* common/config/i386/cpuinfo.h (get_intel_cpu): Add new model values
to Alderlake and Rocketlake.
---
 gcc/common/config/i386/cpuinfo.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/gcc/common/config/i386/cpuinfo.h b/gcc/common/config/i386/cpuinfo.h
index 2d8ea201ab5..61b1a0f291c 100644
--- a/gcc/common/config/i386/cpuinfo.h
+++ b/gcc/common/config/i386/cpuinfo.h
@@ -415,6 +415,7 @@ get_intel_cpu (struct __processor_model *cpu_model,
   cpu_model->__cpu_subtype = INTEL_COREI7_SKYLAKE;
   break;
 case 0xa7:
+case 0xa8:
   /* Rocket Lake.  */
   cpu = "rocketlake";
   CHECK___builtin_cpu_is ("corei7");
@@ -487,6 +488,7 @@ get_intel_cpu (struct __processor_model *cpu_model,
   break;
 case 0x97:
 case 0x9a:
+case 0xbf:
   /* Alder Lake.  */
   cpu = "alderlake";
   CHECK___builtin_cpu_is ("corei7");
-- 
2.17.1
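These model values feed GCC's runtime CPU detection, so the effect is observable through the documented `__builtin_cpu_is` builtin (whose argument must be a string literal); a minimal sketch, with a made-up wrapper name:

```c
/* Returns 1 when the host CPU is detected as Alder Lake via the model
   table extended by this patch, 0 otherwise.  __builtin_cpu_init
   refreshes the cached CPUID data before the query; GCC's runtime
   normally does this automatically before main.  */
int
host_is_alderlake (void)
{
  __builtin_cpu_init ();
  return __builtin_cpu_is ("alderlake") != 0;
}
```

On a machine whose model number was missing from the table (such as 0xbf before this patch), `__builtin_cpu_is ("alderlake")` would return 0 and function multi-versioning would fall back to a generic variant.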

Thanks,
Lili.


[PATCH] x86: Update -mtune=tremont

2021-12-08 Thread Cui,Lili via Gcc-patches
Hi Uros,

This patch updates -mtune for Tremont.

Bootstrap is ok, and no regressions for i386/x86-64 testsuite.

OK for master?


Silvermont has special handling in the add_stmt_cost function because it has an
in-order SIMD pipeline. Tremont's SIMD pipeline is out of order, so remove
Tremont from this special handling.

gcc/ChangeLog

* config/i386/i386.c (ix86_vector_costs::add_stmt_cost): Remove Tremont.
---
 gcc/config/i386/i386.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index f1e41fd55f9..9f4ed34ffd5 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -23144,8 +23144,7 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
  for Silvermont as it has out of order integer pipeline and can execute
  2 scalar instruction per tick, but has in order SIMD pipeline.  */
   if ((TARGET_CPU_P (SILVERMONT) || TARGET_CPU_P (GOLDMONT)
-   || TARGET_CPU_P (GOLDMONT_PLUS) || TARGET_CPU_P (TREMONT)
-   || TARGET_CPU_P (INTEL))
+   || TARGET_CPU_P (GOLDMONT_PLUS) || TARGET_CPU_P (INTEL))
   && stmt_info && stmt_info->stmt)
 {
   tree lhs_op = gimple_get_lhs (stmt_info->stmt);
-- 
2.17.1

Thanks,
Lili.


[PATCH] x86: Update -mtune=alderlake

2021-11-10 Thread Cui,Lili via Gcc-patches
Hi Uros,

This patch updates -mtune for Alder Lake.

Bootstrap is ok, and no regressions for i386/x86-64 testsuite.

OK for master?

Update mtune for Alder Lake. Alder Lake Intel Hybrid Technology does not
support Intel® AVX-512; ISA features such as Intel® AVX, AVX-VNNI, Intel® AVX2,
and UMONITOR/UMWAIT/TPAUSE are supported.

gcc/ChangeLog

* config/i386/i386-options.c (m_CORE_AVX2): Remove Alderlake
from m_CORE_AVX2.
(processor_cost_table): Use alderlake_cost for Alderlake.
* config/i386/i386.c (ix86_sched_init_global): Handle Alderlake.
* config/i386/x86-tune-costs.h (struct processor_costs): Add alderlake
cost.
* config/i386/x86-tune-sched.c (ix86_issue_rate): Change Alderlake
issue rate to 4.
(ix86_adjust_cost): Handle Alderlake.
* config/i386/x86-tune.def (X86_TUNE_SCHEDULE): Enable for Alderlake.
(X86_TUNE_PARTIAL_REG_DEPENDENCY): Likewise.
(X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY): Likewise.
(X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY): Likewise.
(X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise.
(X86_TUNE_MEMORY_MISMATCH_STALL): Likewise.
(X86_TUNE_USE_LEAVE): Likewise.
(X86_TUNE_PUSH_MEMORY): Likewise.
(X86_TUNE_USE_INCDEC): Likewise.
(X86_TUNE_INTEGER_DFMODE_MOVES): Likewise.
(X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES): Likewise.
(X86_TUNE_USE_SAHF): Likewise.
(X86_TUNE_USE_BT): Likewise.
(X86_TUNE_AVOID_FALSE_DEP_FOR_BMI): Likewise.
(X86_TUNE_ONE_IF_CONV_INSN): Likewise.
(X86_TUNE_AVOID_MFENCE): Likewise.
(X86_TUNE_USE_SIMODE_FIOP): Likewise.
(X86_TUNE_EXT_80387_CONSTANTS): Likewise.
(X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL): Likewise.
(X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL): Likewise.
(X86_TUNE_SSE_TYPELESS_STORES): Likewise.
(X86_TUNE_SSE_LOAD0_BY_PXOR): Likewise.
(X86_TUNE_AVOID_4BYTE_PREFIXES): Likewise.
(X86_TUNE_USE_GATHER): Disable for Alderlake.
(X86_TUNE_AVX256_MOVE_BY_PIECES): Likewise.
(X86_TUNE_AVX256_STORE_BY_PIECES): Likewise.
---
 gcc/config/i386/i386-options.c   |   4 +-
 gcc/config/i386/i386.c   |   1 +
 gcc/config/i386/x86-tune-costs.h | 120 +++
 gcc/config/i386/x86-tune-sched.c |   2 +
 gcc/config/i386/x86-tune.def |  58 +++
 5 files changed, 155 insertions(+), 30 deletions(-)

diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index e7a3bd4aaea..a8cc0664f11 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -131,7 +131,7 @@ along with GCC; see the file COPYING3.  If not see
   | m_ICELAKE_CLIENT | m_ICELAKE_SERVER | m_CASCADELAKE \
   | m_TIGERLAKE | m_COOPERLAKE | m_SAPPHIRERAPIDS \
   | m_ROCKETLAKE)
-#define m_CORE_AVX2 (m_HASWELL | m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512)
+#define m_CORE_AVX2 (m_HASWELL | m_SKYLAKE | m_CORE_AVX512)
 #define m_CORE_ALL (m_CORE2 | m_NEHALEM  | m_SANDYBRIDGE | m_CORE_AVX2)
 #define m_GOLDMONT (HOST_WIDE_INT_1U<<PROCESSOR_GOLDMONT)
[...]
+  6, 6,/* cost of SSE->integer and integer->SSE moves */
+  6, 6,/* mask->integer and integer->mask moves */
+  {6, 6, 6},   /* cost of loading mask register
+  in QImode, 

RE: [PATCH 3/4] [PATCH 3/4] x86: Properly handle USE_VECTOR_FP_CONVERTS/USE_VECTOR_CONVERTS

2021-09-16 Thread Cui, Lili via Gcc-patches

> -Original Message-
> From: Uros Bizjak 
> Sent: Thursday, September 16, 2021 2:28 PM
> To: Cui, Lili 
> Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao ; H. J. Lu
> 
> Subject: Re: [PATCH 3/4] [PATCH 3/4] x86: Properly handle
> USE_VECTOR_FP_CONVERTS/USE_VECTOR_CONVERTS
> 
> On Wed, Sep 15, 2021 at 10:10 AM  wrote:
> >
> > From: "H.J. Lu" 
> >
> > Check TARGET_USE_VECTOR_FP_CONVERTS or
> TARGET_USE_VECTOR_CONVERTS when
> > handling avx_partial_xmm_update attribute.  Don't convert AVX partial
> > XMM register update if vector packed SSE conversion should be used.
> >
> > gcc/
> >
> > PR target/101900
> > * config/i386/i386-features.c (remove_partial_avx_dependency):
> > Check TARGET_USE_VECTOR_FP_CONVERTS and
> TARGET_USE_VECTOR_CONVERTS
> > before generating vxorps.
> >
> > gcc/
> >
> > PR target/101900
> > * testsuite/gcc.target/i386/pr101900-1.c: New test.
> > * testsuite/gcc.target/i386/pr101900-2.c: Likewise.
> > * testsuite/gcc.target/i386/pr101900-3.c: Likewise.
> > ---
> >  gcc/config/i386/i386-features.c| 21 ++---
> >  gcc/testsuite/gcc.target/i386/pr101900-1.c | 18 ++
> > gcc/testsuite/gcc.target/i386/pr101900-2.c | 18 ++
> > gcc/testsuite/gcc.target/i386/pr101900-3.c | 19 +++
> >  4 files changed, 73 insertions(+), 3 deletions(-)  create mode 100644
> > gcc/testsuite/gcc.target/i386/pr101900-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-3.c
> >
> > diff --git a/gcc/config/i386/i386-features.c
> > b/gcc/config/i386/i386-features.c index 5a99ea7c046..ae5ea02a002
> > 100644
> > --- a/gcc/config/i386/i386-features.c
> > +++ b/gcc/config/i386/i386-features.c
> > @@ -2210,15 +2210,30 @@ remove_partial_avx_dependency (void)
> >   != AVX_PARTIAL_XMM_UPDATE_TRUE)
> > continue;
> >
> > - if (!v4sf_const0)
> > -   v4sf_const0 = gen_reg_rtx (V4SFmode);
> > -
> >   /* Convert PARTIAL_XMM_UPDATE_TRUE insns, DF -> SF, SF -> DF,
> >  SI -> SF, SI -> DF, DI -> SF, DI -> DF, to vec_dup and
> >  vec_merge with subreg.  */
> >   rtx src = SET_SRC (set);
> >   rtx dest = SET_DEST (set);
> >   machine_mode dest_mode = GET_MODE (dest);
> > + machine_mode src_mode;
> > +
> > + if (TARGET_USE_VECTOR_FP_CONVERTS)
> > +   {
> > + src_mode = GET_MODE (XEXP (src, 0));
> > + if (src_mode == E_SFmode || src_mode == E_DFmode)
> > +   continue;
> > +   }
> > +
> > + if (TARGET_USE_VECTOR_CONVERTS)
> > +   {
> > + src_mode = GET_MODE (XEXP (src, 0));
> > + if (src_mode == E_SImode || src_mode == E_DImode)
> > +   continue;
> > +   }
> > +
> > + if (!v4sf_const0)
> > +   v4sf_const0 = gen_reg_rtx (V4SFmode);
> 
> Please move the initialization of src_mode to the top of the new hunk, like:
> 
> machine_mode src_mode = GET_MODE (XEXP (src, 0));
> switch (src_mode) {
>   case E_SFmode:
>   case E_DFmode:
> if (TARGET_USE_VECTOR_FP_CONVERTS)
>   continue;
> break;
>   case E_SImode:
>   case E_DImode:
> if (TARGET_USE_VECTOR_CONVERTS)
>   continue;
> break;
>   default:
> break;
> }
> 
> or something like the above.

Done, thanks for the good advice. I also rebased patch 4/4, since it is based
on patch 3/4.

Changed it to:

+ machine_mode src_mode = GET_MODE (XEXP (src, 0));
+
+ switch (src_mode)
+   {
+   case E_SFmode:
+   case E_DFmode:
+ if (TARGET_USE_VECTOR_FP_CONVERTS)
+   continue;
+ break;
+   case E_SImode:
+   case E_DImode:
+ if (TARGET_USE_VECTOR_CONVERTS)
+   continue;
+ break;
+   default:
+ break;
+   }
+ if (!v4sf_const0)
+   v4sf_const0 = gen_reg_rtx (V4SFmode);

Thanks,
Lili.

> 
> Uros.
> 
> >
> >   rtx zero;
> >   machine_mode dest_vecmode;
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101900-1.c b/gcc/testsuite/gcc.target/i386/pr101900-1.c
> > new file mode 100644
> > index 000..0a45f8e340a
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101900-1.c
> > @@ -0,0 +1,18 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_fp_converts" } */
> > +
> > +extern float f;
> > +extern double d;
> > +extern int i;
> > +
> > +void
> > +foo (void)
> > +{
> > +  d = f;
> > +  f = i;
> > +}
> > +
> > +/* { dg-final { scan-assembler "vcvtps2pd" } } */
> > +/* { dg-final { scan-assembler "vcvtsi2ssl" } } */
> > +/* { dg-final { scan-assembler-not "vcvtss2sd" } } */
> > +/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
> > diff 

RE: [PATCH 4/4] [PATCH 4/4] x86: Add TARGET_SSE_PARTIAL_REG_[FP_]CONVERTS_DEPENDENCY

2021-09-15 Thread Cui, Lili via Gcc-patches


> -Original Message-
> From: H.J. Lu 
> Sent: Wednesday, September 15, 2021 10:14 PM
> To: Cui, Lili 
> Cc: Uros Bizjak ; GCC Patches <gcc-patches@gcc.gnu.org>; Liu, Hongtao 
> Subject: Re: [PATCH 4/4] [PATCH 4/4] x86: Add
> TARGET_SSE_PARTIAL_REG_[FP_]CONVERTS_DEPENDENCY
> 
> There is no need to add [PATCH N/4] in the first line of the git commit
> message.  "git format-patch" or "git send-email" will add them automatically.
> 
Thanks for the reminder; I didn't notice it before.

> On Wed, Sep 15, 2021 at 1:10 AM  wrote:
> >
> > From: "H.J. Lu" 
> >
> > 1. Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with
> > TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY in SSE FP to FP splitters.
> > 2. Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with
> > TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY in SSE INT to FP splitters.
> > 3. Also check TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY and
> > TARGET_SSE_PARTIAL_REG_DEPENDENCY when handling avx_partial_xmm_update
> > attribute.  Don't convert AVX partial XMM register update if there is
> > no partial SSE register dependency for SSE conversion.
> >
> > gcc/
> >
> > * config/i386/i386-features.c (remove_partial_avx_dependency):
> > Also check TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY and
> > TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY before generating vxorps.
> > * config/i386/i386.h (TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY):
> > New.
> > (TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise.
> > * config/i386/i386.md (SSE FP to FP splitters): Replace
> > TARGET_SSE_PARTIAL_REG_DEPENDENCY with
> > TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY.
> > (SSE INT to FP splitter): Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY
> > with TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY.
> > * config/i386/x86-tune.def
> > (X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY): New.
> > (X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise.
> >
> > gcc/testsuite/
> >
> > * gcc.target/i386/avx-covert-1.c: New file.
> > * gcc.target/i386/avx-fp-covert-1.c: Likewise.
> > * gcc.target/i386/avx-int-covert-1.c: Likewise.
> > * gcc.target/i386/sse-covert-1.c: Likewise.
> > * gcc.target/i386/sse-fp-covert-1.c: Likewise.
> > * gcc.target/i386/sse-int-covert-1.c: Likewise.
> > ---
> >  gcc/config/i386/i386-features.c   |  6 --
> >  gcc/config/i386/i386.h|  4 
> >  gcc/config/i386/i386.md   |  9 ++---
> >  gcc/config/i386/x86-tune.def  | 15 +++
> >  gcc/testsuite/gcc.target/i386/avx-covert-1.c  | 19 +++
> >  .../gcc.target/i386/avx-fp-covert-1.c | 15 +++
> >  .../gcc.target/i386/avx-int-covert-1.c| 14 ++
> >  gcc/testsuite/gcc.target/i386/sse-covert-1.c  | 19 +++
> >  .../gcc.target/i386/sse-fp-covert-1.c | 15 +++
> >  .../gcc.target/i386/sse-int-covert-1.c| 14 ++
> >  10 files changed, 125 insertions(+), 5 deletions(-)  create mode
> > 100644 gcc/testsuite/gcc.target/i386/avx-covert-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/avx-int-covert-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/sse-covert-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/sse-int-covert-1.c
> >
> > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> > index ae5ea02a002..91bfa06d4bf 100644
> > --- a/gcc/config/i386/i386-features.c
> > +++ b/gcc/config/i386/i386-features.c
> > @@ -2218,14 +2218,16 @@ remove_partial_avx_dependency (void)
> >   machine_mode dest_mode = GET_MODE (dest);
> >   machine_mode src_mode;
> >
> > - if (TARGET_USE_VECTOR_FP_CONVERTS)
> > + if (TARGET_USE_VECTOR_FP_CONVERTS
> > + || !TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY)
> > {
> >   src_mode = GET_MODE (XEXP (src, 0));
> >   if (src_mode == E_SFmode || src_mode == E_DFmode)
> > continue;
> > }
> >
> > - if (TARGET_USE_VECTOR_CONVERTS)
> > + if (TARGET_USE_VECTOR_CONVERTS
> > + || !TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY)
> > {
> >   src_mode = GET_MODE (XEXP (src, 0));
> >   if (src_mode == E_SImode || src_mode == E_DImode)
> > continue;
> > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > index e76bb55c080..ec60b89753e 100644
> > --- a/gcc/config/i386/i386.h
> > +++ b/gcc/config/i386/i386.h
> > @@ -334,6 +334,10 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
> > ix86_tune_features[X86_TUNE_PARTIAL_REG_DEPENDENCY]
> >  #define 

[PATCH] Synchronize Rocket Lake's processor_names and processor_cost_table with processor_type

2021-04-24 Thread Cui, Lili via Gcc-patches
Hi Uros,

This patch synchronizes Rocket Lake's processor_names and
processor_cost_table with processor_type.

Bootstrap is ok, and no regressions for i386/x86-64 testsuite.

OK for master?

[PATCH] Synchronize Rocket Lake's processor_names and
 processor_cost_table with processor_type

gcc/ChangeLog

* common/config/i386/i386-common.c (processor_names):
Sync processor_names with processor_type.
* config/i386/i386-options.c (processor_cost_table):
Sync processor_cost_table with processor_type.
---
 gcc/common/config/i386/i386-common.c | 2 +-
 gcc/config/i386/i386-options.c   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/common/config/i386/i386-common.c b/gcc/common/config/i386/i386-common.c
index 1e6c1590ac4..6a7b5c8312f 100644
--- a/gcc/common/config/i386/i386-common.c
+++ b/gcc/common/config/i386/i386-common.c
@@ -1743,13 +1743,13 @@ const char *const processor_names[] =
   "skylake-avx512",
   "cannonlake",
   "icelake-client",
-  "rocketlake",
   "icelake-server",
   "cascadelake",
   "tigerlake",
   "cooperlake",
   "sapphirerapids",
   "alderlake",
+  "rocketlake",
   "intel",
   "geode",
   "k6",
diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index 7e59ccd988d..eafa3d4f715 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -726,12 +726,12 @@ static const struct processor_costs *processor_cost_table[] =
   _cost,
   _cost,
   _cost,
-  _cost,
   _cost,
   _cost,
   _cost,
   _cost,
   _cost,
+  _cost,
   _cost,
   _cost,
   _cost,
-- 
2.17.1

Thanks,
Lili.


0001-Synchronize-Rocket-Lake-s-processor_names-and-proces.patch


[PATCH wwwdoc] Mention Rocketlake [GCC11]

2021-04-12 Thread Cui, Lili via Gcc-patches

Updated wwwdocs for Rocketlake [GCC11], thanks.

 [PATCH] Mention Rocketlake
---
 htdocs/gcc-11/changes.html | 4 
 1 file changed, 4 insertions(+)

diff --git a/htdocs/gcc-11/changes.html b/htdocs/gcc-11/changes.html
index a7fa4e1b..38725abc 100644
--- a/htdocs/gcc-11/changes.html
+++ b/htdocs/gcc-11/changes.html
@@ -634,6 +634,10 @@ a work-in-progress.
 The switch enables the CLDEMOTE, PTWRITE, WAITPKG, SERIALIZE, KEYLOCKER,
 AVX-VNNI, and HRESET ISA extensions.
   
+  <li>GCC now supports the Intel CPU named Rocketlake through
+    <code>-march=rocketlake</code>.
+    Rocket Lake is based on Icelake client, minus SGX.
+  </li>
 
 
-- 
2.17.1
Thanks,
Lili.


0001-Mention-Rocketlake.patch


[PATCH] Add rocketlake to gcc.

2021-04-11 Thread Cui, Lili via Gcc-patches
Hi Uros,

This patch adds Rocket Lake to GCC.
Rocket Lake is based on Ice Lake client, minus SGX.

For detailed information, please refer to 
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

Bootstrap is ok, and no regressions for i386/x86-64 testsuite.

OK for master?

 [PATCH] Add rocketlake to gcc.

gcc/
* common/config/i386/cpuinfo.h
(get_intel_cpu): Handle rocketlake.
* common/config/i386/i386-common.c
(processor_names): Add rocketlake.
(processor_alias_table): Add rocketlake.
* common/config/i386/i386-cpuinfo.h
(processor_subtypes): Add INTEL_COREI7_ROCKETLAKE.
* config.gcc: Add -march=rocketlake.
* config/i386/i386-c.c
(ix86_target_macros_internal): Handle rocketlake.
* config/i386/i386-options.c
(m_ROCKETLAKE)  : Define.
(processor_cost_table): Add rocketlake cost.
* config/i386/i386.h
(ix86_size_cost) : Define TARGET_ROCKETLAKE.
(processor_type) : Add PROCESSOR_ROCKETLAKE.
(PTA_ROCKETLAKE): Ditto.
* doc/extend.texi: Add rocketlake.
* doc/invoke.texi: Add rocketlake.

gcc/testsuite/
* gcc.target/i386/funcspec-56.inc: Handle new march.
* g++.target/i386/mv16.C: Handle new march.
---
 gcc/common/config/i386/cpuinfo.h  | 10 --
 gcc/common/config/i386/i386-common.c  |  4 
 gcc/common/config/i386/i386-cpuinfo.h |  1 +
 gcc/config.gcc|  2 +-
 gcc/config/i386/i386-c.c  |  7 +++
 gcc/config/i386/i386-options.c|  5 -
 gcc/config/i386/i386.h|  3 +++
 gcc/doc/extend.texi   |  3 +++
 gcc/doc/invoke.texi   |  8 
 gcc/testsuite/g++.target/i386/mv16.C  |  6 ++
 gcc/testsuite/gcc.target/i386/funcspec-56.inc |  1 +
 11 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/gcc/common/config/i386/cpuinfo.h b/gcc/common/config/i386/cpuinfo.h
index c1ee7a1f8b8..458f41de776 100644
--- a/gcc/common/config/i386/cpuinfo.h
+++ b/gcc/common/config/i386/cpuinfo.h
@@ -404,14 +404,20 @@ get_intel_cpu (struct __processor_model *cpu_model,
 case 0xa5:
 case 0xa6:
   /* Comet Lake.  */
-case 0xa7:
-  /* Rocket Lake.  */
   cpu = "skylake";
   CHECK___builtin_cpu_is ("corei7");
   CHECK___builtin_cpu_is ("skylake");
   cpu_model->__cpu_type = INTEL_COREI7;
   cpu_model->__cpu_subtype = INTEL_COREI7_SKYLAKE;
   break;
+case 0xa7:
+  /* Rocket Lake.  */
+  cpu = "rocketlake";
+  CHECK___builtin_cpu_is ("corei7");
+  CHECK___builtin_cpu_is ("rocketlake");
+  cpu_model->__cpu_type = INTEL_COREI7;
+  cpu_model->__cpu_subtype = INTEL_COREI7_ROCKETLAKE;
+  break;
 case 0x55:
   CHECK___builtin_cpu_is ("corei7");
   cpu_model->__cpu_type = INTEL_COREI7;
diff --git a/gcc/common/config/i386/i386-common.c b/gcc/common/config/i386/i386-common.c
index b89183b830e..1e6c1590ac4 100644
--- a/gcc/common/config/i386/i386-common.c
+++ b/gcc/common/config/i386/i386-common.c
@@ -1743,6 +1743,7 @@ const char *const processor_names[] =
   "skylake-avx512",
   "cannonlake",
   "icelake-client",
+  "rocketlake",
   "icelake-server",
   "cascadelake",
   "tigerlake",
@@ -1845,6 +1846,9 @@ const pta processor_alias_table[] =
   {"icelake-client", PROCESSOR_ICELAKE_CLIENT, CPU_HASWELL,
 PTA_ICELAKE_CLIENT,
 M_CPU_SUBTYPE (INTEL_COREI7_ICELAKE_CLIENT), P_PROC_AVX512F},
+  {"rocketlake", PROCESSOR_ROCKETLAKE, CPU_HASWELL,
+PTA_ROCKETLAKE,
+M_CPU_SUBTYPE (INTEL_COREI7_ROCKETLAKE), P_PROC_AVX512F},
   {"icelake-server", PROCESSOR_ICELAKE_SERVER, CPU_HASWELL,
 PTA_ICELAKE_SERVER,
 M_CPU_SUBTYPE (INTEL_COREI7_ICELAKE_SERVER), P_PROC_AVX512F},
diff --git a/gcc/common/config/i386/i386-cpuinfo.h b/gcc/common/config/i386/i386-cpuinfo.h
index 869115c4b6a..e68dd656046 100644
--- a/gcc/common/config/i386/i386-cpuinfo.h
+++ b/gcc/common/config/i386/i386-cpuinfo.h
@@ -88,6 +88,7 @@ enum processor_subtypes
   INTEL_COREI7_SAPPHIRERAPIDS,
   INTEL_COREI7_ALDERLAKE,
   AMDFAM19H_ZNVER3,
+  INTEL_COREI7_ROCKETLAKE,
   CPU_SUBTYPE_MAX
 };
 
diff --git a/gcc/config.gcc b/gcc/config.gcc
index 997a9f61a5c..357b0bed067 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -677,7 +677,7 @@ opteron-sse3 nocona core2 corei7 corei7-avx core-avx-i core-avx2 atom \
 slm nehalem westmere sandybridge ivybridge haswell broadwell bonnell \
 silvermont knl knm skylake-avx512 cannonlake icelake-client icelake-server \
 skylake goldmont goldmont-plus tremont cascadelake tigerlake cooperlake \
-sapphirerapids alderlake eden-x2 nano nano-1000 nano-2000 nano-3000 \
+sapphirerapids alderlake rocketlake eden-x2 nano nano-1000 nano-2000 nano-3000 \
 nano-x2 eden-x4 nano-x4 x86-64 x86-64-v2 x86-64-v3 x86-64-v4 

[PATCH] Change march=alderlake ISA list and add m_ALDERLAKE to m_CORE_AVX2

2021-04-11 Thread Cui, Lili via Gcc-patches
Hi Uros,

This patch changes the Alder Lake ISA list and adds m_ALDERLAKE to m_CORE_AVX2.
Alder Lake Intel Hybrid Technology is based on Tremont plus
ADCX/AVX/AVX2/BMI/BMI2/F16C/FMA/LZCNT/PCONFIG/PKU/VAES/VPCLMULQDQ/SERIALIZE/HRESET/KL/WIDEKL/AVX-VNNI.
For detailed information, please refer to 
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

Bootstrap is ok, and no regressions for i386/x86-64 testsuite.

OK for master and backport to GCC 10?

 [PATCH] Change march=alderlake ISA list and add m_ALDERLAKE to
 m_CORE_AVX2

Alder Lake Intel Hybrid Technology will not support Intel(r) AVX-512. ISA
features such as Intel(r) AVX, AVX-VNNI, Intel(r) AVX2, and 
UMONITOR/UMWAIT/TPAUSE
are supported.

gcc/
* config/i386/i386.h
(PTA_ALDERLAKE): Change alderlake ISA list.
* config/i386/i386-options.c
(m_CORE_AVX2): Add m_ALDERLAKE.
* common/config/i386/cpuinfo.h
(get_intel_cpu): Add alderlake model.
* doc/invoke.texi: Change alderlake ISA list.
---
 gcc/common/config/i386/cpuinfo.h | 1 +
 gcc/config/i386/i386-options.c   | 2 +-
 gcc/config/i386/i386.h   | 7 ---
 gcc/doc/invoke.texi  | 9 +
 4 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/gcc/common/config/i386/cpuinfo.h b/gcc/common/config/i386/cpuinfo.h
index dbce022620a..c1ee7a1f8b8 100644
--- a/gcc/common/config/i386/cpuinfo.h
+++ b/gcc/common/config/i386/cpuinfo.h
@@ -476,6 +476,7 @@ get_intel_cpu (struct __processor_model *cpu_model,
   cpu_model->__cpu_subtype = INTEL_COREI7_TIGERLAKE;
   break;
  case 0x97:
+case 0x9a:
   /* Alder Lake.  */
   cpu = "alderlake";
   CHECK___builtin_cpu_is ("corei7");
diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index a8d06735d79..02e9c97d174 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -129,7 +129,7 @@ along with GCC; see the file COPYING3.  If not see
 #define m_CORE_AVX512 (m_SKYLAKE_AVX512 | m_CANNONLAKE \
   | m_ICELAKE_CLIENT | m_ICELAKE_SERVER | m_CASCADELAKE \
   | m_TIGERLAKE | m_COOPERLAKE | m_SAPPHIRERAPIDS)
-#define m_CORE_AVX2 (m_HASWELL | m_SKYLAKE | m_CORE_AVX512)
+#define m_CORE_AVX2 (m_HASWELL | m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512)
 #define m_CORE_ALL (m_CORE2 | m_NEHALEM  | m_SANDYBRIDGE | m_CORE_AVX2)
 #define m_GOLDMONT (HOST_WIDE_INT_1U<

0001-Change-march-alderlake-ISA-list-and-add-m_ALDERLAKE-.patch


RE: Enable MOVDIRI, MOVDIR64B, CLDEMOTE and WAITPKG for march=tremont

2020-11-13 Thread Cui, Lili via Gcc-patches
Hi Uros,

This patch corrects the previous patch: PREFETCHW should be in both
march=broadwell and march=silvermont, but the previous patch moved PREFETCHW
from march=broadwell to march=silvermont. Sorry for that.

Bootstrap is ok, and no regressions for i386/x86-64 testsuite.

OK for master?


[PATCH] Put PREFETCHW back to march=broadwell

PREFETCHW should be in both march=broadwell and march=silvermont.
The previous patch moved PREFETCHW from march=broadwell to march=silvermont.

gcc/ChangeLog:

* config/i386/i386.h: Add PREFETCHW to march=broadwell.
* doc/invoke.texi: Put PREFETCHW back in the relevant arches.
---
 gcc/config/i386/i386.h |  3 ++-
 gcc/doc/invoke.texi| 50 +++---
 2 files changed, 29 insertions(+), 24 deletions(-)

diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 3be7551d6c3..b8ae16e2865 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2518,7 +2518,8 @@ const wide_int_bitmask PTA_IVYBRIDGE = PTA_SANDYBRIDGE | PTA_FSGSBASE
   | PTA_RDRND | PTA_F16C;
 const wide_int_bitmask PTA_HASWELL = PTA_IVYBRIDGE | PTA_AVX2 | PTA_BMI
   | PTA_BMI2 | PTA_LZCNT | PTA_FMA | PTA_MOVBE | PTA_HLE;
-const wide_int_bitmask PTA_BROADWELL = PTA_HASWELL | PTA_ADX | PTA_RDSEED;
+const wide_int_bitmask PTA_BROADWELL = PTA_HASWELL | PTA_ADX | PTA_RDSEED
+  | PTA_PRFCHW;
 const wide_int_bitmask PTA_SKYLAKE = PTA_BROADWELL | PTA_AES | PTA_CLFLUSHOPT
   | PTA_XSAVEC | PTA_XSAVES | PTA_SGX;
 const wide_int_bitmask PTA_SKYLAKE_AVX512 = PTA_SKYLAKE | PTA_AVX512F
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 69bf1fa89dd..3c292593030 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -29560,13 +29560,13 @@ BMI, BMI2 and F16C instruction set support.
 @item broadwell
 Intel Broadwell CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3,
SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA, BMI, BMI2,
-F16C, RDSEED and ADCX instruction set support.
+F16C, RDSEED, ADCX and PREFETCHW instruction set support.
 
 @item skylake
 Intel Skylake CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3,
 SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA,
-BMI, BMI2, F16C, RDSEED, ADCX, CLFLUSHOPT, XSAVEC and XSAVES instruction set
-support.
+BMI, BMI2, F16C, RDSEED, ADCX, PREFETCHW, CLFLUSHOPT, XSAVEC and XSAVES
+instruction set support.
 
 @item bonnell
 Intel Bonnell CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3 and SSSE3
@@ -29595,32 +29595,33 @@ MOVDIR64B, CLDEMOTE and WAITPKG instruction set support.
 @item knl
 Intel Knight's Landing CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3,
 SSSE3, SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA,
-BMI, BMI2, F16C, RDSEED, ADCX, PREFETCHWT1, AVX512F, AVX512PF, AVX512ER and
-AVX512CD instruction set support.
+BMI, BMI2, F16C, RDSEED, ADCX, PREFETCHW, PREFETCHWT1, AVX512F, AVX512PF,
+AVX512ER and AVX512CD instruction set support.
 
 @item knm
 Intel Knights Mill CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3,
 SSSE3, SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA,
-BMI, BMI2, F16C, RDSEED, ADCX, PREFETCHWT1, AVX512F, AVX512PF, AVX512ER, AVX512CD,
-AVX5124VNNIW, AVX5124FMAPS and AVX512VPOPCNTDQ instruction set support.
+BMI, BMI2, F16C, RDSEED, ADCX, PREFETCHW, PREFETCHWT1, AVX512F, AVX512PF,
+AVX512ER, AVX512CD, AVX5124VNNIW, AVX5124FMAPS and AVX512VPOPCNTDQ instruction
+set support.
 
 @item skylake-avx512
 Intel Skylake Server CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3,
SSSE3, SSE4.1, SSE4.2, POPCNT, PKU, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA,
-BMI, BMI2, F16C, RDSEED, ADCX, CLFLUSHOPT, XSAVEC, XSAVES, AVX512F,
+BMI, BMI2, F16C, RDSEED, ADCX, PREFETCHW, CLFLUSHOPT, XSAVEC, XSAVES, AVX512F,
 CLWB, AVX512VL, AVX512BW, AVX512DQ and AVX512CD instruction set support.
 
 @item cannonlake
 Intel Cannonlake Server CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2,
 SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT, PKU, AVX, AVX2, AES, PCLMUL, FSGSBASE,
-RDRND, FMA, BMI, BMI2, F16C, RDSEED, ADCX, CLFLUSHOPT, XSAVEC,
+RDRND, FMA, BMI, BMI2, F16C, RDSEED, ADCX, PREFETCHW, CLFLUSHOPT, XSAVEC,
 XSAVES, AVX512F, AVX512VL, AVX512BW, AVX512DQ, AVX512CD, AVX512VBMI,
 AVX512IFMA, SHA and UMIP instruction set support.
 
 @item icelake-client
 Intel Icelake Client CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2,
 SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT, PKU, AVX, AVX2, AES, PCLMUL, FSGSBASE,
-RDRND, FMA, BMI, BMI2, F16C, RDSEED, ADCX, CLFLUSHOPT, XSAVEC,
+RDRND, FMA, BMI, BMI2, F16C, RDSEED, ADCX, PREFETCHW, CLFLUSHOPT, XSAVEC,
 XSAVES, AVX512F, AVX512VL, AVX512BW, AVX512DQ, AVX512CD, AVX512VBMI,
 AVX512IFMA, SHA, CLWB, UMIP, RDPID, GFNI, AVX512VBMI2, AVX512VPOPCNTDQ,
 AVX512BITALG, AVX512VNNI, VPCLMULQDQ, VAES instruction set support.
@@ -29628,7 +29629,7 @@ AVX512BITALG, AVX512VNNI, VPCLMULQDQ, VAES instruction set support.
 @item icelake-server
 Intel Icelake 

Enable MOVDIRI, MOVDIR64B, CLDEMOTE and WAITPKG for march=tremont

2020-11-09 Thread Cui, Lili via Gcc-patches
Hi Uros,

This patch corrects some instruction sets for
march=tremont/broadwell/silvermont/knl.

Bootstrap is ok, and no regressions for i386/x86-64 testsuite.

OK for master?

[PATCH] Enable MOVDIRI, MOVDIR64B, CLDEMOTE and WAITPKG for
 march=tremont

1. Enable MOVDIRI, MOVDIR64B, CLDEMOTE and WAITPKG for march=tremont
2. Move PREFETCHW from march=broadwell to march=silvermont.
3. Add PREFETCHWT1 to march=knl

gcc/ChangeLog:

PR target/97685
* config/i386/i386.h:
(PTA_BROADWELL): Delete PTA_PRFCHW.
(PTA_SILVERMONT): Add PTA_PRFCHW.
(PTA_KNL): Add PTA_PREFETCHWT1.
(PTA_TREMONT): Add PTA_MOVDIRI, PTA_MOVDIR64B, PTA_CLDEMOTE and PTA_WAITPKG.
* doc/invoke.texi: Delete PREFETCHW for broadwell, skylake, knl, knm,
skylake-avx512, cannonlake, icelake-client, icelake-server, cascadelake,
cooperlake, tigerlake and sapphirerapids.
Add PREFETCHW for silvermont, goldmont, goldmont-plus and tremont.
Add XSAVEC and XSAVES for goldmont, goldmont-plus and tremont.
Add MOVDIRI, MOVDIR64B, CLDEMOTE and WAITPKG for tremont.
Add KEYLOCKER and HREST for alderlake.
Add AMX-BF16, AMX-TILE, AMX-INT8 and UINTR for sapphirerapids.
Add KEYLOCKER for tigerlake.
---
 gcc/config/i386/i386.h | 10 +++
 gcc/doc/invoke.texi| 59 +-
 2 files changed, 35 insertions(+), 34 deletions(-)

diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index d0c157a9970..5e01fe6b841 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2515,8 +2515,7 @@ const wide_int_bitmask PTA_IVYBRIDGE = PTA_SANDYBRIDGE | PTA_FSGSBASE
   | PTA_RDRND | PTA_F16C;
 const wide_int_bitmask PTA_HASWELL = PTA_IVYBRIDGE | PTA_AVX2 | PTA_BMI
   | PTA_BMI2 | PTA_LZCNT | PTA_FMA | PTA_MOVBE | PTA_HLE;
-const wide_int_bitmask PTA_BROADWELL = PTA_HASWELL | PTA_ADX | PTA_PRFCHW
-  | PTA_RDSEED;
+const wide_int_bitmask PTA_BROADWELL = PTA_HASWELL | PTA_ADX | PTA_RDSEED;
 const wide_int_bitmask PTA_SKYLAKE = PTA_BROADWELL | PTA_AES | PTA_CLFLUSHOPT
   | PTA_XSAVEC | PTA_XSAVES | PTA_SGX;
 const wide_int_bitmask PTA_SKYLAKE_AVX512 = PTA_SKYLAKE | PTA_AVX512F
@@ -2541,16 +2540,17 @@ const wide_int_bitmask PTA_SAPPHIRERAPIDS = PTA_COOPERLAKE | PTA_MOVDIRI
 const wide_int_bitmask PTA_ALDERLAKE = PTA_SKYLAKE | PTA_CLDEMOTE | PTA_PTWRITE
   | PTA_WAITPKG | PTA_SERIALIZE | PTA_HRESET | PTA_KL | PTA_WIDEKL;
 const wide_int_bitmask PTA_KNL = PTA_BROADWELL | PTA_AVX512PF | PTA_AVX512ER
-  | PTA_AVX512F | PTA_AVX512CD;
+  | PTA_AVX512F | PTA_AVX512CD | PTA_PREFETCHWT1;
 const wide_int_bitmask PTA_BONNELL = PTA_CORE2 | PTA_MOVBE;
-const wide_int_bitmask PTA_SILVERMONT = PTA_WESTMERE | PTA_MOVBE | PTA_RDRND;
+const wide_int_bitmask PTA_SILVERMONT = PTA_WESTMERE | PTA_MOVBE | PTA_RDRND
+  | PTA_PRFCHW;
const wide_int_bitmask PTA_GOLDMONT = PTA_SILVERMONT | PTA_AES | PTA_SHA | PTA_XSAVE
   | PTA_RDSEED | PTA_XSAVEC | PTA_XSAVES | PTA_CLFLUSHOPT | PTA_XSAVEOPT
   | PTA_FSGSBASE;
 const wide_int_bitmask PTA_GOLDMONT_PLUS = PTA_GOLDMONT | PTA_RDPID
   | PTA_SGX | PTA_PTWRITE;
 const wide_int_bitmask PTA_TREMONT = PTA_GOLDMONT_PLUS | PTA_CLWB
-  | PTA_GFNI;
+  | PTA_GFNI | PTA_MOVDIRI | PTA_MOVDIR64B | PTA_CLDEMOTE | PTA_WAITPKG;
 const wide_int_bitmask PTA_KNM = PTA_KNL | PTA_AVX5124VNNIW
   | PTA_AVX5124FMAPS | PTA_AVX512VPOPCNTDQ;
 
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index d2a188d7c75..d01beb248e1 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -29528,14 +29528,14 @@ BMI, BMI2 and F16C instruction set support.
 
 @item broadwell
 Intel Broadwell CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3,
-SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA,
-BMI, BMI2, F16C, RDSEED, ADCX and PREFETCHW instruction set support.
+SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA, BMI, BMI2,
+F16C, RDSEED and ADCX instruction set support.
 
 @item skylake
 Intel Skylake CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3,
 SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA,
-BMI, BMI2, F16C, RDSEED, ADCX, PREFETCHW, CLFLUSHOPT, XSAVEC and
-XSAVES instruction set support.
+BMI, BMI2, F16C, RDSEED, ADCX, CLFLUSHOPT, XSAVEC and XSAVES instruction set
+support.
 
 @item bonnell
 Intel Bonnell CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3 and SSSE3
@@ -29543,52 +29543,53 @@ instruction set support.
 
 @item silvermont
Intel Silvermont CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3,
-SSE4.1, SSE4.2, POPCNT, AES, PCLMUL and RDRND instruction set support.
+SSE4.1, SSE4.2, POPCNT, AES, PREFETCHW, PCLMUL and RDRND instruction set support.
 
 @item goldmont
 Intel Goldmont CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3,
-SSE4.1, SSE4.2, POPCNT, AES, PCLMUL, RDRND, XSAVE, XSAVEOPT and FSGSBASE
-instruction set support.
+SSE4.1, SSE4.2, POPCNT, AES, PREFETCHW, PCLMUL, 

Initial Sapphire Rapids and Alder Lake support from ISA r40

2020-07-09 Thread Cui, Lili via Gcc-patches
Hi:
This patch adds Sapphire Rapids and Alder Lake to GCC.
Sapphire Rapids is based on Cooper Lake and plus ISA 
MOVDIRI/MOVDIR64B/AVX512VP2INTERSECT/ENQCMD/CLDEMOTE/PTWRITE/WAITPKG/SERIALIZE/TSXLDTRK.
Alder Lake is based on Skylake and plus ISA CLDEMOTE/PTWRITE/WAITPK/SERIALIZE.

For detailed information, please refer to 
https://software.intel.com/content/dam/develop/public/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf

Bootstrap is ok, and no regressions for i386/x86-64 testsuite.

OK for master?

gcc/ChangeLog
* common/config/i386/cpuinfo.h
(get_intel_cpu): Handle sapphirerapids.
* common/config/i386/i386-common.c
(processor_names): Add sapphirerapids and alderlake.
(processor_alias_table): Add sapphirerapids and alderlake.
* common/config/i386/i386-cpuinfo.h
(processor_subtypes): Add INTEL_COREI7_SAPPHIRERAPIDS and
INTEL_COREI7_ALDERLAKE.
* config.gcc: Add -march=sapphirerapids and alderlake.
* config/i386/driver-i386.c
(host_detect_local_cpu): Handle sapphirerapids and alderlake.
* config/i386/i386-c.c
(ix86_target_macros_internal): Handle sapphirerapids and alderlake.
* config/i386/i386-options.c
(m_SAPPHIRERAPIDS) : Define.
(m_ALDERLAKE): Ditto.
(m_CORE_AVX512) : Add m_SAPPHIRERAPIDS.
(processor_cost_table): Add sapphirerapids and alderlake.
(ix86_option_override_internal) Handle PTA_WAITPKG, PTA_ENQCMD,
PTA_CLDEMOTE, PTA_SERIALIZE, PTA_TSXLDTRK.
* config/i386/i386.h
(ix86_size_cost) : Define SAPPHIRERAPIDS and ALDERLAKE.
(processor_type) : Add PROCESSOR_SAPPHIRERAPIDS and
PROCESSOR_ALDERLAKE.
(PTA_ENQCMD): New.
(PTA_CLDEMOTE): Ditto.
(PTA_SERIALIZE): Ditto.
(PTA_TSXLDTRK): New.
(PTA_SAPPHIRERAPIDS): Ditto.
(PTA_ALDERLAKE): Ditto.
(processor_type) : Add PROCESSOR_SAPPHIRERAPIDS and
PROCESSOR_ALDERLAKE.
* doc/extend.texi: Add sapphirerapids and alderlake.
* doc/invoke.texi: Add sapphirerapids and alderlake.

gcc/testsuite/ChangeLog
* gcc.target/i386/funcspec-56.inc: Handle new march.
* g++.target/i386/mv16.C: Handle new march.

Thanks,
Lili.



0001-Initial-Sapphire-Rapids-and-Alder-Lake-support-from-.patch


[PATCH] fix bitmask conflict between PTA_AVX512VP2INTERSECT and PTA_WAITPKG

2020-06-04 Thread Cui, Lili via Gcc-patches
Hi Uros,

This patch fixes a bitmask conflict between PTA_AVX512VP2INTERSECT and
PTA_WAITPKG in gcc/config/i386/i386.h.

Bootstrap is ok, make-check ok for i386 target. Ok for trunk?


gcc/ChangeLog:
* config/i386/i386.h (PTA_WAITPKG): Change bitmask value.
---
 gcc/config/i386/i386.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 48a5735d4e7..5bae257b435 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2444,9 +2444,9 @@ const wide_int_bitmask PTA_RDPID (0, HOST_WIDE_INT_1U << 6);
 const wide_int_bitmask PTA_PCONFIG (0, HOST_WIDE_INT_1U << 7);
 const wide_int_bitmask PTA_WBNOINVD (0, HOST_WIDE_INT_1U << 8);
 const wide_int_bitmask PTA_AVX512VP2INTERSECT (0, HOST_WIDE_INT_1U << 9);
-const wide_int_bitmask PTA_WAITPKG (0, HOST_WIDE_INT_1U << 9);
 const wide_int_bitmask PTA_PTWRITE (0, HOST_WIDE_INT_1U << 10);
 const wide_int_bitmask PTA_AVX512BF16 (0, HOST_WIDE_INT_1U << 11);
+const wide_int_bitmask PTA_WAITPKG (0, HOST_WIDE_INT_1U << 12);
 const wide_int_bitmask PTA_MOVDIRI(0, HOST_WIDE_INT_1U << 13);
 const wide_int_bitmask PTA_MOVDIR64B(0, HOST_WIDE_INT_1U << 14);
 


Thanks,
Lili.


0001-Fix-bitmask-conflict-between-PTA_AVX512VP2INTERSECT-.patch