Re: [PATCH] AArch64: Add fma_reassoc_width [PR107413]

2022-11-23 Thread Richard Sandiford via Gcc-patches
Wilco Dijkstra  writes:
> Hi Richard,
>
>>> A smart reassociation pass could form more FMAs while also increasing
>>> parallelism, but the way it currently works always results in fewer FMAs.
>>
>> Yeah, as Richard said, that seems the right long-term fix.
>> It would also avoid the hack of treating PLUS_EXPR as a signal
>> of an FMA, which has the drawback of assuming (for 2-FMA cores)
>> that plain addition never benefits from reassociation in its own right.
>
> True, but it's hard to separate them.  You will have a mix of FADDs and FMAs
> to reassociate (since an FMA still counts as an add), and the ratio between
> them, as well as the number of operations, may affect the best reassociation
> width.
>
>> Still, I guess the hackiness is pre-existing and the patch is removing
>> the hackiness for some cores, so from that point of view it's a strict
>> improvement over the status quo.  And it's too late in the GCC 13
>> cycle to do FMA reassociation properly.  So I'm OK with the patch
>> in principle, but could you post an update with more commentary?
>
> Sure, here is an update with longer comment in aarch64_reassociation_width:
>
>
> Add a reassociation width for FMAs in per-CPU tuning structures. Keep the
> existing setting for cores with 2 FMA pipes, and use 4 for cores with 4
> FMA pipes.  This improves SPECFP2017 on Neoverse V1 by ~1.5%.
>
> Passes regress/bootstrap, OK for commit?
>
> gcc/ChangeLog:
> PR 107413
> * config/aarch64/aarch64.cc (struct tune_params): Add
> fma_reassoc_width to all CPU tuning structures.
> (aarch64_reassociation_width): Use fma_reassoc_width.
> * config/aarch64/aarch64-protos.h (struct tune_params): Add
> fma_reassoc_width.

OK, thanks.

Richard

> ---
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index 238820581c5ee7617f8eed1df2cf5418b1127e19..4be93c93c26e091f878bc8e4cf06e90888405fb2 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -540,6 +540,7 @@ struct tune_params
>const char *loop_align;
>int int_reassoc_width;
>int fp_reassoc_width;
> +  int fma_reassoc_width;
>int vec_reassoc_width;
>int min_div_recip_mul_sf;
>int min_div_recip_mul_df;
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index c91df6f5006c257690aafb75398933d628a970e1..15d478c77ceb2d6c52a70b6ffd8fdadcfa8deba0 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -1346,6 +1346,7 @@ static const struct tune_params generic_tunings =
>"8", /* loop_align.  */
>2,   /* int_reassoc_width.  */
>4,   /* fp_reassoc_width.  */
> +  1,   /* fma_reassoc_width.  */
>1,   /* vec_reassoc_width.  */
>2,   /* min_div_recip_mul_sf.  */
>2,   /* min_div_recip_mul_df.  */
> @@ -1382,6 +1383,7 @@ static const struct tune_params cortexa35_tunings =
>"8", /* loop_align.  */
>2,   /* int_reassoc_width.  */
>4,   /* fp_reassoc_width.  */
> +  1,   /* fma_reassoc_width.  */
>1,   /* vec_reassoc_width.  */
>2,   /* min_div_recip_mul_sf.  */
>2,   /* min_div_recip_mul_df.  */
> @@ -1415,6 +1417,7 @@ static const struct tune_params cortexa53_tunings =
>"8", /* loop_align.  */
>2,   /* int_reassoc_width.  */
>4,   /* fp_reassoc_width.  */
> +  1,   /* fma_reassoc_width.  */
>1,   /* vec_reassoc_width.  */
>2,   /* min_div_recip_mul_sf.  */
>2,   /* min_div_recip_mul_df.  */
> @@ -1448,6 +1451,7 @@ static const struct tune_params cortexa57_tunings =
>"8", /* loop_align.  */
>2,   /* int_reassoc_width.  */
>4,   /* fp_reassoc_width.  */
> +  1,   /* fma_reassoc_width.  */
>1,   /* vec_reassoc_width.  */
>2,   /* min_div_recip_mul_sf.  */
>2,   /* min_div_recip_mul_df.  */
> @@ -1481,6 +1485,7 @@ static const struct tune_params cortexa72_tunings =
>"8", /* loop_align.  */
>2,   /* int_reassoc_width.  */
>4,   /* fp_reassoc_width.  */
> +  1,   /* fma_reassoc_width.  */
>1,   /* vec_reassoc_width.  */
>2,   /* min_div_recip_mul_sf.  */
>2,   /* min_div_recip_mul_df.  */
> @@ -1514,6 +1519,7 @@ static const struct tune_params cortexa73_tunings =
>"8", /* loop_align.  */
>2,   /* int_reassoc_width.  */
>4,   /* fp_reassoc_width.  */
> +  1,   /* fma_reassoc_width.  */
>1,   /* vec_reassoc_width.  */
>2,   /* min_div_recip_mul_sf.  */
>2,   /* min_div_recip_mul_df.  */
> @@ -1548,6 +1554,7 @@ static const struct tune_params exynosm1_tunings =
>"4", /* loop_align.  */
>2,   /* int_reassoc_width.  */
>4,   /* fp_reassoc_width.  */
> +  1,   /* fma_reassoc_width.  */
>1,   /* vec_reassoc_width.  */
>2,   /* min_div_recip_mul_sf.  */
>2,   /* min_div_recip_mul_df.  */
> @@ -1580,6 +1587,7 @@ static const struct tune_params thunderxt88_tunings =
>"8", /* loop_align.  */
>2,   /* int_reassoc_width.  */
>4,   /* fp_reassoc_width.  */
> 

Re: [PATCH] AArch64: Add fma_reassoc_width [PR107413]

2022-11-23 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

>> A smart reassociation pass could form more FMAs while also increasing
>> parallelism, but the way it currently works always results in fewer FMAs.
>
> Yeah, as Richard said, that seems the right long-term fix.
> It would also avoid the hack of treating PLUS_EXPR as a signal
> of an FMA, which has the drawback of assuming (for 2-FMA cores)
> that plain addition never benefits from reassociation in its own right.

True, but it's hard to separate them.  You will have a mix of FADDs and FMAs
to reassociate (since an FMA still counts as an add), and the ratio between
them, as well as the number of operations, may affect the best reassociation
width.

> Still, I guess the hackiness is pre-existing and the patch is removing
> the hackiness for some cores, so from that point of view it's a strict
> improvement over the status quo.  And it's too late in the GCC 13
> cycle to do FMA reassociation properly.  So I'm OK with the patch
> in principle, but could you post an update with more commentary?

Sure, here is an update with longer comment in aarch64_reassociation_width:


Add a reassociation width for FMAs in per-CPU tuning structures. Keep the
existing setting for cores with 2 FMA pipes, and use 4 for cores with 4
FMA pipes.  This improves SPECFP2017 on Neoverse V1 by ~1.5%.

Passes regress/bootstrap, OK for commit?

gcc/ChangeLog:
PR 107413
* config/aarch64/aarch64.cc (struct tune_params): Add
fma_reassoc_width to all CPU tuning structures.
(aarch64_reassociation_width): Use fma_reassoc_width.
* config/aarch64/aarch64-protos.h (struct tune_params): Add
fma_reassoc_width.

---
diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 238820581c5ee7617f8eed1df2cf5418b1127e19..4be93c93c26e091f878bc8e4cf06e90888405fb2 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -540,6 +540,7 @@ struct tune_params
   const char *loop_align;
   int int_reassoc_width;
   int fp_reassoc_width;
+  int fma_reassoc_width;
   int vec_reassoc_width;
   int min_div_recip_mul_sf;
   int min_div_recip_mul_df;
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index c91df6f5006c257690aafb75398933d628a970e1..15d478c77ceb2d6c52a70b6ffd8fdadcfa8deba0 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -1346,6 +1346,7 @@ static const struct tune_params generic_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1382,6 +1383,7 @@ static const struct tune_params cortexa35_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1415,6 +1417,7 @@ static const struct tune_params cortexa53_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1448,6 +1451,7 @@ static const struct tune_params cortexa57_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1481,6 +1485,7 @@ static const struct tune_params cortexa72_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1514,6 +1519,7 @@ static const struct tune_params cortexa73_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1548,6 +1554,7 @@ static const struct tune_params exynosm1_tunings =
   "4", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1580,6 +1587,7 @@ static const struct tune_params thunderxt88_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1612,6 +1620,7 @@ static const struct tune_params thunderx_tunings =
   "8", /* loop_align.  */
   2,   /* 

Re: [PATCH] AArch64: Add fma_reassoc_width [PR107413]

2022-11-22 Thread Richard Sandiford via Gcc-patches
Wilco Dijkstra  writes:
> Hi Richard,
>
>> I guess an obvious question is: if 1 (rather than 2) was the right value
>> for cores with 2 FMA pipes, why is 4 the right value for cores with 4 FMA
>> pipes?  It would be good to clarify how, conceptually, the core property
>> should map to the fma_reassoc_width value.
>
> 1 turns off reassociation so that FMAs get properly formed.  After
> reassociation, far fewer FMAs get formed, so we end up with more FLOPs,
> which means slower execution.  It's a significant slowdown on cores that
> are not wide, have only 1 or 2 FP pipes, and may have high FP latencies.
> So we turn it off by default on all older cores.
>
>> It sounds from the existing comment like the main motivation for returning 1
>> was to encourage more FMAs to be formed, rather than to prevent FMAs from
>> being reassociated.  Is that no longer an issue?  Or is the point that,
>> with more FMA pipes, lower FMA formation is a price worth paying for
>> the better parallelism we get when FMAs can be formed?
>
> Exactly.  A wide CPU can deal with the extra instructions, so the loss
> from fewer FMAs ends up smaller than the speedup from the extra
> parallelism.  Having more FMAs would be even faster, of course.

Thanks.  It would be good to put this in a comment somewhere, perhaps above
the fma_reassoc_width field.  It isn't obvious from the patch as posted,
and changing the existing comment drops the previous hint about what
was going on.
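For example, a comment roughly along these lines above the field might work. This is an illustrative fragment only (hypothetical wording, not the committed text):

```c
/* Sketch of a possible comment placement (hypothetical wording): */
struct tune_params_sketch
{
  /* ... */
  int fp_reassoc_width;
  /* Reassociation width for FP additions, which may later be fused into
     FMAs.  A width of 1 keeps each chain serial so that the maximum
     number of FMAs is formed; cores with several FMA pipes gain more
     from the extra parallelism of a wider setting than they lose in
     unfused FMAs.  */
  int fma_reassoc_width;
  int vec_reassoc_width;
  /* ... */
};
```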

>
>> Does this code ever see opc == FMA?
>
> No, that's the problem: reassociation ignores the fact that we actually
> want FMAs.

Yeah, but I was wondering if later code would sometimes query this
hook for existing FMAs, even if that code wasn't the focus of the patch.
Once we add the distinction between FMAs and other ops, it seemed natural
to test for existing FMAs.

But of course, FMA is an rtl code rather than a tree code (oops), so that
was never going to happen.

> A smart reassociation pass could form more FMAs while also increasing
> parallelism, but the way it currently works always results in fewer FMAs.

Yeah, as Richard said, that seems the right long-term fix.
It would also avoid the hack of treating PLUS_EXPR as a signal
of an FMA, which has the drawback of assuming (for 2-FMA cores)
that plain addition never benefits from reassociation in its own right.

Still, I guess the hackiness is pre-existing and the patch is removing
the hackiness for some cores, so from that point of view it's a strict
improvement over the status quo.  And it's too late in the GCC 13
cycle to do FMA reassociation properly.  So I'm OK with the patch
in principle, but could you post an update with more commentary?

Thanks,
Richard


Re: [PATCH] AArch64: Add fma_reassoc_width [PR107413]

2022-11-22 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> I guess an obvious question is: if 1 (rather than 2) was the right value
> for cores with 2 FMA pipes, why is 4 the right value for cores with 4 FMA
> pipes?  It would be good to clarify how, conceptually, the core property
> should map to the fma_reassoc_width value.

1 turns off reassociation so that FMAs get properly formed.  After
reassociation, far fewer FMAs get formed, so we end up with more FLOPs,
which means slower execution.  It's a significant slowdown on cores that
are not wide, have only 1 or 2 FP pipes, and may have high FP latencies.
So we turn it off by default on all older cores.

> It sounds from the existing comment like the main motivation for returning 1
> was to encourage more FMAs to be formed, rather than to prevent FMAs from
> being reassociated.  Is that no longer an issue?  Or is the point that,
> with more FMA pipes, lower FMA formation is a price worth paying for
> the better parallelism we get when FMAs can be formed?

Exactly.  A wide CPU can deal with the extra instructions, so the loss
from fewer FMAs ends up smaller than the speedup from the extra
parallelism.  Having more FMAs would be even faster, of course.

> Does this code ever see opc == FMA?

No, that's the problem: reassociation ignores the fact that we actually
want FMAs.  A smart reassociation pass could form more FMAs while also
increasing parallelism, but the way it currently works always results in
fewer FMAs.

Cheers,
Wilco

Re: [PATCH] AArch64: Add fma_reassoc_width [PR107413]

2022-11-22 Thread Richard Biener via Gcc-patches
On Tue, Nov 22, 2022 at 8:59 AM Richard Sandiford via Gcc-patches
 wrote:
>
> Wilco Dijkstra  writes:
> > Add a reassociation width for FMAs in per-CPU tuning structures. Keep the
> > existing setting for cores with 2 FMA pipes, and use 4 for cores with 4
> > FMA pipes.  This improves SPECFP2017 on Neoverse V1 by ~1.5%.
> >
> > Passes regress/bootstrap, OK for commit?
> >
> > gcc/
> > PR 107413
> > * config/aarch64/aarch64.cc (struct tune_params): Add
> > fma_reassoc_width to all CPU tuning structures.
> > * config/aarch64/aarch64-protos.h (struct tune_params): Add
> > fma_reassoc_width.
> >
> > ---
> >
> > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > index a73bfa20acb9b92ae0475794c3f11c67d22feb97..71365a446007c26b906b61ca8b2a68ee06c83037 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -540,6 +540,7 @@ struct tune_params
> >const char *loop_align;
> >int int_reassoc_width;
> >int fp_reassoc_width;
> > +  int fma_reassoc_width;
> >int vec_reassoc_width;
> >int min_div_recip_mul_sf;
> >int min_div_recip_mul_df;
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 798363bcc449c414de5bbb4f26b8e1c64a0cf71a..643162cdecd6a8fe5587164cb2d0d62b709a491d 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -1346,6 +1346,7 @@ static const struct tune_params generic_tunings =
> >"8", /* loop_align.  */
> >2,   /* int_reassoc_width.  */
> >4,   /* fp_reassoc_width.  */
> > +  1,   /* fma_reassoc_width.  */
> >1,   /* vec_reassoc_width.  */
> >2,   /* min_div_recip_mul_sf.  */
> >2,   /* min_div_recip_mul_df.  */
> > @@ -1382,6 +1383,7 @@ static const struct tune_params cortexa35_tunings =
> >"8", /* loop_align.  */
> >2,   /* int_reassoc_width.  */
> >4,   /* fp_reassoc_width.  */
> > +  1,   /* fma_reassoc_width.  */
> >1,   /* vec_reassoc_width.  */
> >2,   /* min_div_recip_mul_sf.  */
> >2,   /* min_div_recip_mul_df.  */
> > @@ -1415,6 +1417,7 @@ static const struct tune_params cortexa53_tunings =
> >"8", /* loop_align.  */
> >2,   /* int_reassoc_width.  */
> >4,   /* fp_reassoc_width.  */
> > +  1,   /* fma_reassoc_width.  */
> >1,   /* vec_reassoc_width.  */
> >2,   /* min_div_recip_mul_sf.  */
> >2,   /* min_div_recip_mul_df.  */
> > @@ -1448,6 +1451,7 @@ static const struct tune_params cortexa57_tunings =
> >"8", /* loop_align.  */
> >2,   /* int_reassoc_width.  */
> >4,   /* fp_reassoc_width.  */
> > +  1,   /* fma_reassoc_width.  */
> >1,   /* vec_reassoc_width.  */
> >2,   /* min_div_recip_mul_sf.  */
> >2,   /* min_div_recip_mul_df.  */
> > @@ -1481,6 +1485,7 @@ static const struct tune_params cortexa72_tunings =
> >"8", /* loop_align.  */
> >2,   /* int_reassoc_width.  */
> >4,   /* fp_reassoc_width.  */
> > +  1,   /* fma_reassoc_width.  */
> >1,   /* vec_reassoc_width.  */
> >2,   /* min_div_recip_mul_sf.  */
> >2,   /* min_div_recip_mul_df.  */
> > @@ -1514,6 +1519,7 @@ static const struct tune_params cortexa73_tunings =
> >"8", /* loop_align.  */
> >2,   /* int_reassoc_width.  */
> >4,   /* fp_reassoc_width.  */
> > +  1,   /* fma_reassoc_width.  */
> >1,   /* vec_reassoc_width.  */
> >2,   /* min_div_recip_mul_sf.  */
> >2,   /* min_div_recip_mul_df.  */
> > @@ -1548,6 +1554,7 @@ static const struct tune_params exynosm1_tunings =
> >"4", /* loop_align.  */
> >2,   /* int_reassoc_width.  */
> >4,   /* fp_reassoc_width.  */
> > +  1,   /* fma_reassoc_width.  */
> >1,   /* vec_reassoc_width.  */
> >2,   /* min_div_recip_mul_sf.  */
> >2,   /* min_div_recip_mul_df.  */
> > @@ -1580,6 +1587,7 @@ static const struct tune_params thunderxt88_tunings =
> >"8", /* loop_align.  */
> >2,   /* int_reassoc_width.  */
> >4,   /* fp_reassoc_width.  */
> > +  1,   /* fma_reassoc_width.  */
> >1,   /* vec_reassoc_width.  */
> >2,   /* min_div_recip_mul_sf.  */
> >2,   /* min_div_recip_mul_df.  */
> > @@ -1612,6 +1620,7 @@ static const struct tune_params thunderx_tunings =
> >"8", /* loop_align.  */
> >2,   /* int_reassoc_width.  */
> >4,   /* fp_reassoc_width.  */
> > +  1,   /* fma_reassoc_width.  */
> >1,   /* vec_reassoc_width.  */
> >2,   /* min_div_recip_mul_sf.  */
> >2,   /* min_div_recip_mul_df.  */
> > @@ -1646,6 +1655,7 @@ static const struct tune_params tsv110_tunings =
> >"8",  /* loop_align.  */
> >2,/* int_reassoc_width.  */
> >4,/* fp_reassoc_width.  */
> > +  1,   /* fma_reassoc_width.  */
> >1,/* vec_reassoc_width.  */
> >2,/* min_div_recip_mul_sf.  */
> >2,/* min_div_recip_mul_df.  */
> > @@ -1678,6 +1688,7 @@ static const struct tune_params xgene1_tunings =
> >"16",/* 

Re: [PATCH] AArch64: Add fma_reassoc_width [PR107413]

2022-11-21 Thread Richard Sandiford via Gcc-patches
Wilco Dijkstra  writes:
> Add a reassociation width for FMAs in per-CPU tuning structures. Keep the
> existing setting for cores with 2 FMA pipes, and use 4 for cores with 4
> FMA pipes.  This improves SPECFP2017 on Neoverse V1 by ~1.5%.
>
> Passes regress/bootstrap, OK for commit?
>
> gcc/
> PR 107413
> * config/aarch64/aarch64.cc (struct tune_params): Add
> fma_reassoc_width to all CPU tuning structures.
> * config/aarch64/aarch64-protos.h (struct tune_params): Add
> fma_reassoc_width.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index a73bfa20acb9b92ae0475794c3f11c67d22feb97..71365a446007c26b906b61ca8b2a68ee06c83037 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -540,6 +540,7 @@ struct tune_params
>const char *loop_align;
>int int_reassoc_width;
>int fp_reassoc_width;
> +  int fma_reassoc_width;
>int vec_reassoc_width;
>int min_div_recip_mul_sf;
>int min_div_recip_mul_df;
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 798363bcc449c414de5bbb4f26b8e1c64a0cf71a..643162cdecd6a8fe5587164cb2d0d62b709a491d 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -1346,6 +1346,7 @@ static const struct tune_params generic_tunings =
>"8", /* loop_align.  */
>2,   /* int_reassoc_width.  */
>4,   /* fp_reassoc_width.  */
> +  1,   /* fma_reassoc_width.  */
>1,   /* vec_reassoc_width.  */
>2,   /* min_div_recip_mul_sf.  */
>2,   /* min_div_recip_mul_df.  */
> @@ -1382,6 +1383,7 @@ static const struct tune_params cortexa35_tunings =
>"8", /* loop_align.  */
>2,   /* int_reassoc_width.  */
>4,   /* fp_reassoc_width.  */
> +  1,   /* fma_reassoc_width.  */
>1,   /* vec_reassoc_width.  */
>2,   /* min_div_recip_mul_sf.  */
>2,   /* min_div_recip_mul_df.  */
> @@ -1415,6 +1417,7 @@ static const struct tune_params cortexa53_tunings =
>"8", /* loop_align.  */
>2,   /* int_reassoc_width.  */
>4,   /* fp_reassoc_width.  */
> +  1,   /* fma_reassoc_width.  */
>1,   /* vec_reassoc_width.  */
>2,   /* min_div_recip_mul_sf.  */
>2,   /* min_div_recip_mul_df.  */
> @@ -1448,6 +1451,7 @@ static const struct tune_params cortexa57_tunings =
>"8", /* loop_align.  */
>2,   /* int_reassoc_width.  */
>4,   /* fp_reassoc_width.  */
> +  1,   /* fma_reassoc_width.  */
>1,   /* vec_reassoc_width.  */
>2,   /* min_div_recip_mul_sf.  */
>2,   /* min_div_recip_mul_df.  */
> @@ -1481,6 +1485,7 @@ static const struct tune_params cortexa72_tunings =
>"8", /* loop_align.  */
>2,   /* int_reassoc_width.  */
>4,   /* fp_reassoc_width.  */
> +  1,   /* fma_reassoc_width.  */
>1,   /* vec_reassoc_width.  */
>2,   /* min_div_recip_mul_sf.  */
>2,   /* min_div_recip_mul_df.  */
> @@ -1514,6 +1519,7 @@ static const struct tune_params cortexa73_tunings =
>"8", /* loop_align.  */
>2,   /* int_reassoc_width.  */
>4,   /* fp_reassoc_width.  */
> +  1,   /* fma_reassoc_width.  */
>1,   /* vec_reassoc_width.  */
>2,   /* min_div_recip_mul_sf.  */
>2,   /* min_div_recip_mul_df.  */
> @@ -1548,6 +1554,7 @@ static const struct tune_params exynosm1_tunings =
>"4", /* loop_align.  */
>2,   /* int_reassoc_width.  */
>4,   /* fp_reassoc_width.  */
> +  1,   /* fma_reassoc_width.  */
>1,   /* vec_reassoc_width.  */
>2,   /* min_div_recip_mul_sf.  */
>2,   /* min_div_recip_mul_df.  */
> @@ -1580,6 +1587,7 @@ static const struct tune_params thunderxt88_tunings =
>"8", /* loop_align.  */
>2,   /* int_reassoc_width.  */
>4,   /* fp_reassoc_width.  */
> +  1,   /* fma_reassoc_width.  */
>1,   /* vec_reassoc_width.  */
>2,   /* min_div_recip_mul_sf.  */
>2,   /* min_div_recip_mul_df.  */
> @@ -1612,6 +1620,7 @@ static const struct tune_params thunderx_tunings =
>"8", /* loop_align.  */
>2,   /* int_reassoc_width.  */
>4,   /* fp_reassoc_width.  */
> +  1,   /* fma_reassoc_width.  */
>1,   /* vec_reassoc_width.  */
>2,   /* min_div_recip_mul_sf.  */
>2,   /* min_div_recip_mul_df.  */
> @@ -1646,6 +1655,7 @@ static const struct tune_params tsv110_tunings =
>"8",  /* loop_align.  */
>2,/* int_reassoc_width.  */
>4,/* fp_reassoc_width.  */
> +  1,   /* fma_reassoc_width.  */
>1,/* vec_reassoc_width.  */
>2,/* min_div_recip_mul_sf.  */
>2,/* min_div_recip_mul_df.  */
> @@ -1678,6 +1688,7 @@ static const struct tune_params xgene1_tunings =
>"16",/* loop_align.  */
>2,   /* int_reassoc_width.  */
>4,   /* fp_reassoc_width.  */
> +  1,   /* fma_reassoc_width.  */
>1,   /* vec_reassoc_width.  */
>2,   /* min_div_recip_mul_sf.  */
>2,   /* min_div_recip_mul_df.  */
> @@ -1710,6 +1721,7 @@ static const struct tune_params emag_tunings =
>

[PATCH] AArch64: Add fma_reassoc_width [PR107413]

2022-11-09 Thread Wilco Dijkstra via Gcc-patches
Add a reassociation width for FMAs in per-CPU tuning structures. Keep the
existing setting for cores with 2 FMA pipes, and use 4 for cores with 4
FMA pipes.  This improves SPECFP2017 on Neoverse V1 by ~1.5%.

Passes regress/bootstrap, OK for commit?

gcc/
PR 107413
* config/aarch64/aarch64.cc (struct tune_params): Add
fma_reassoc_width to all CPU tuning structures.
* config/aarch64/aarch64-protos.h (struct tune_params): Add
fma_reassoc_width.

---

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index a73bfa20acb9b92ae0475794c3f11c67d22feb97..71365a446007c26b906b61ca8b2a68ee06c83037 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -540,6 +540,7 @@ struct tune_params
   const char *loop_align;
   int int_reassoc_width;
   int fp_reassoc_width;
+  int fma_reassoc_width;
   int vec_reassoc_width;
   int min_div_recip_mul_sf;
   int min_div_recip_mul_df;
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 798363bcc449c414de5bbb4f26b8e1c64a0cf71a..643162cdecd6a8fe5587164cb2d0d62b709a491d 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -1346,6 +1346,7 @@ static const struct tune_params generic_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1382,6 +1383,7 @@ static const struct tune_params cortexa35_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1415,6 +1417,7 @@ static const struct tune_params cortexa53_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1448,6 +1451,7 @@ static const struct tune_params cortexa57_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1481,6 +1485,7 @@ static const struct tune_params cortexa72_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1514,6 +1519,7 @@ static const struct tune_params cortexa73_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1548,6 +1554,7 @@ static const struct tune_params exynosm1_tunings =
   "4", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1580,6 +1587,7 @@ static const struct tune_params thunderxt88_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1612,6 +1620,7 @@ static const struct tune_params thunderx_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1646,6 +1655,7 @@ static const struct tune_params tsv110_tunings =
   "8",  /* loop_align.  */
   2,/* int_reassoc_width.  */
   4,/* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,/* vec_reassoc_width.  */
   2,/* min_div_recip_mul_sf.  */
   2,/* min_div_recip_mul_df.  */
@@ -1678,6 +1688,7 @@ static const struct tune_params xgene1_tunings =
   "16",/* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1710,6 +1721,7 @@ static const struct tune_params emag_tunings =
   "16",/* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1743,6 +1755,7 @@ static const