Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

2023-08-03 Thread Hao Liu OS via Gcc-patches
Gentle ping. Is it OK for master?

I'm afraid the ICE may cause trouble and hope it can be fixed ASAP.

Thanks,
Hao


From: Hao Liu OS 
Sent: Wednesday, August 2, 2023 11:45
To: Richard Sandiford
Cc: Richard Biener; GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by 
multiplying count [PR110625]

Hi Richard,

Update the patch with a simple case (see the case and comments below).  It shows
that a live stmt may not have a reduction def, which introduces the ICE.

Is it OK for trunk?


Fix the assertion failure on an empty reduction def in info_for_reduction.
Even if a stmt is live, it may still have an empty reduction def.  Check the
reduction def instead of the live info before calling info_for_reduction.

gcc/ChangeLog:

PR target/110625
* config/aarch64/aarch64.cc (aarch64_force_single_cycle): check
STMT_VINFO_REDUC_DEF to avoid failures in info_for_reduction.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/pr110625_3.c: New testcase.
---
 gcc/config/aarch64/aarch64.cc |  2 +-
 gcc/testsuite/gcc.target/aarch64/pr110625_3.c | 34 +++
 2 files changed, 35 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_3.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index d4d76025545..5b8d8fa8e2d 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16776,7 +16776,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, 
stmt_vec_info stmt_info,
 static bool
 aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info)
 {
-  if (!STMT_VINFO_LIVE_P (stmt_info))
+  if (!STMT_VINFO_REDUC_DEF (stmt_info))
 return false;

   auto reduc_info = info_for_reduction (vinfo, stmt_info);
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_3.c 
b/gcc/testsuite/gcc.target/aarch64/pr110625_3.c
new file mode 100644
index 000..35a50290cb0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625_3.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mcpu=neoverse-n2" } */
+
+/* Avoid ICE on empty reduction def in single_defuse_cycle.
+
+   E.g.
+  [local count: 858993456]:
+ # sum_18 = PHI 
+ sum.0_5 = (unsigned int) sum_18;
+ _6 = _4 + sum.0_5; <-- it is "live" but doesn't have reduction def
+ sum_15 = (int) _6;
+ ...
+ if (ivtmp_29 != 0)
+   goto ; [75.00%]
+ else
+   goto ; [25.00%]
+
+  [local count: 644245086]:
+ goto ; [100.00%]
+
+  [local count: 214748368]:
+ # _31 = PHI <_6(3)>
+ _8 = _31 >> 1;
+*/
+
+int
+f (unsigned int *tmp)
+{
+  int sum = 0;
+  for (int i = 0; i < 4; i++)
+    sum += tmp[i];
+
+  return (unsigned int) sum >> 1;
+}
--
2.34.1


From: Hao Liu OS 
Sent: Tuesday, August 1, 2023 17:43
To: Richard Sandiford
Cc: Richard Biener; GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by 
multiplying count [PR110625]

Hi Richard,

This is a quick fix for the several ICEs.  It seems that even when
STMT_VINFO_LIVE_P is true, some reduction stmts still don't have a REDUC_DEF.
So I changed the check to STMT_VINFO_REDUC_DEF.

Is it OK for trunk?

---
Fix the ICEs on an empty reduction def.  Even when STMT_VINFO_LIVE_P is true,
some reduction stmts still don't have a definition.

gcc/ChangeLog:

PR target/110625
* config/aarch64/aarch64.cc (aarch64_force_single_cycle): check
STMT_VINFO_REDUC_DEF to avoid failures in info_for_reduction
---
 gcc/config/aarch64/aarch64.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index d4d76025545..5b8d8fa8e2d 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16776,7 +16776,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, 
stmt_vec_info stmt_info,
 static bool
 aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info)
 {
-  if (!STMT_VINFO_LIVE_P (stmt_info))
+  if (!STMT_VINFO_REDUC_DEF (stmt_info))
 return false;

   auto reduc_info = info_for_reduction (vinfo, stmt_info);
--
2.40.0



From: Richard Sandiford 
Sent: Monday, July 31, 2023 17:11
To: Hao Liu OS
Cc: Richard Biener; GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by 
multiplying count [PR110625]

Hao Liu OS  writes:
>> Which test case do you see this for?  The two tests in the patch still
>> seem to report correct latencies for me if I make the change above.
>
> Not the newly added tests.  It is still the existing case causing the 
> previous ICE (i.e. assertion problem): gcc.target/aarch64/sve/cost_model_13.c.
>
> It's not that the test case itself failed, but the vect dump message says the
> "reduction latency" is 0:
>
> Before the change:
> cost_model_13.c:7:21: note:  

Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

2023-08-01 Thread Hao Liu OS via Gcc-patches
Hi Richard,

Update the patch with a simple case (see the case and comments below).  It shows
that a live stmt may not have a reduction def, which introduces the ICE.

Is it OK for trunk?


Fix the assertion failure on an empty reduction def in info_for_reduction.
Even if a stmt is live, it may still have an empty reduction def.  Check the
reduction def instead of the live info before calling info_for_reduction.

gcc/ChangeLog:

PR target/110625
* config/aarch64/aarch64.cc (aarch64_force_single_cycle): check
STMT_VINFO_REDUC_DEF to avoid failures in info_for_reduction.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/pr110625_3.c: New testcase.
---
 gcc/config/aarch64/aarch64.cc |  2 +-
 gcc/testsuite/gcc.target/aarch64/pr110625_3.c | 34 +++
 2 files changed, 35 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_3.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index d4d76025545..5b8d8fa8e2d 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16776,7 +16776,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, 
stmt_vec_info stmt_info,
 static bool
 aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info)
 {
-  if (!STMT_VINFO_LIVE_P (stmt_info))
+  if (!STMT_VINFO_REDUC_DEF (stmt_info))
 return false;

   auto reduc_info = info_for_reduction (vinfo, stmt_info);
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_3.c 
b/gcc/testsuite/gcc.target/aarch64/pr110625_3.c
new file mode 100644
index 000..35a50290cb0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625_3.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mcpu=neoverse-n2" } */
+
+/* Avoid ICE on empty reduction def in single_defuse_cycle.
+
+   E.g.
+  [local count: 858993456]:
+ # sum_18 = PHI 
+ sum.0_5 = (unsigned int) sum_18;
+ _6 = _4 + sum.0_5; <-- it is "live" but doesn't have reduction def
+ sum_15 = (int) _6;
+ ...
+ if (ivtmp_29 != 0)
+   goto ; [75.00%]
+ else
+   goto ; [25.00%]
+
+  [local count: 644245086]:
+ goto ; [100.00%]
+
+  [local count: 214748368]:
+ # _31 = PHI <_6(3)>
+ _8 = _31 >> 1;
+*/
+
+int
+f (unsigned int *tmp)
+{
+  int sum = 0;
+  for (int i = 0; i < 4; i++)
+    sum += tmp[i];
+
+  return (unsigned int) sum >> 1;
+}
--
2.34.1


From: Hao Liu OS 
Sent: Tuesday, August 1, 2023 17:43
To: Richard Sandiford
Cc: Richard Biener; GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by 
multiplying count [PR110625]

Hi Richard,

This is a quick fix for the several ICEs.  It seems that even when
STMT_VINFO_LIVE_P is true, some reduction stmts still don't have a REDUC_DEF.
So I changed the check to STMT_VINFO_REDUC_DEF.

Is it OK for trunk?

---
Fix the ICEs on an empty reduction def.  Even when STMT_VINFO_LIVE_P is true,
some reduction stmts still don't have a definition.

gcc/ChangeLog:

PR target/110625
* config/aarch64/aarch64.cc (aarch64_force_single_cycle): check
STMT_VINFO_REDUC_DEF to avoid failures in info_for_reduction
---
 gcc/config/aarch64/aarch64.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index d4d76025545..5b8d8fa8e2d 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16776,7 +16776,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, 
stmt_vec_info stmt_info,
 static bool
 aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info)
 {
-  if (!STMT_VINFO_LIVE_P (stmt_info))
+  if (!STMT_VINFO_REDUC_DEF (stmt_info))
 return false;

   auto reduc_info = info_for_reduction (vinfo, stmt_info);
--
2.40.0



From: Richard Sandiford 
Sent: Monday, July 31, 2023 17:11
To: Hao Liu OS
Cc: Richard Biener; GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by 
multiplying count [PR110625]

Hao Liu OS  writes:
>> Which test case do you see this for?  The two tests in the patch still
>> seem to report correct latencies for me if I make the change above.
>
> Not the newly added tests.  It is still the existing case causing the 
> previous ICE (i.e. assertion problem): gcc.target/aarch64/sve/cost_model_13.c.
>
> It's not that the test case itself failed, but the vect dump message says the
> "reduction latency" is 0:
>
> Before the change:
> cost_model_13.c:7:21: note:  Original vector body cost = 6
> cost_model_13.c:7:21: note:  Scalar issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 1
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.00
> 

Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

2023-08-01 Thread Hao Liu OS via Gcc-patches
Hi Richard,

This is a quick fix for the several ICEs.  It seems that even when
STMT_VINFO_LIVE_P is true, some reduction stmts still don't have a REDUC_DEF.
So I changed the check to STMT_VINFO_REDUC_DEF.

Is it OK for trunk?

---
Fix the ICEs on an empty reduction def.  Even when STMT_VINFO_LIVE_P is true,
some reduction stmts still don't have a definition.

gcc/ChangeLog:

PR target/110625
* config/aarch64/aarch64.cc (aarch64_force_single_cycle): check
STMT_VINFO_REDUC_DEF to avoid failures in info_for_reduction
---
 gcc/config/aarch64/aarch64.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index d4d76025545..5b8d8fa8e2d 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16776,7 +16776,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, 
stmt_vec_info stmt_info,
 static bool
 aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info)
 {
-  if (!STMT_VINFO_LIVE_P (stmt_info))
+  if (!STMT_VINFO_REDUC_DEF (stmt_info))
 return false;

   auto reduc_info = info_for_reduction (vinfo, stmt_info);
--
2.40.0



From: Richard Sandiford 
Sent: Monday, July 31, 2023 17:11
To: Hao Liu OS
Cc: Richard Biener; GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by 
multiplying count [PR110625]

Hao Liu OS  writes:
>> Which test case do you see this for?  The two tests in the patch still
>> seem to report correct latencies for me if I make the change above.
>
> Not the newly added tests.  It is still the existing case causing the 
> previous ICE (i.e. assertion problem): gcc.target/aarch64/sve/cost_model_13.c.
>
> It's not that the test case itself failed, but the vect dump message says the
> "reduction latency" is 0:
>
> Before the change:
> cost_model_13.c:7:21: note:  Original vector body cost = 6
> cost_model_13.c:7:21: note:  Scalar issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 1
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.00
> cost_model_13.c:7:21: note:    estimated cycles per vector iteration (for VF 8) = 8.00
> cost_model_13.c:7:21: note:  Vector issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 2
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 2.00
>
> After the change:
> cost_model_13.c:7:21: note:  Original vector body cost = 6
> cost_model_13.c:7:21: note:  Scalar issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 0  <--- seems not consistent with above result
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.00
> cost_model_13.c:7:21: note:    estimated cycles per vector iteration (for VF 8) = 8.00
> cost_model_13.c:7:21: note:  Vector issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 0  <--- seems not consistent with above result
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.00  <--- seems not consistent with above result
>
> BTW, this should be caused by the reduction stmt not being live; "live" indicates
> whether the stmt is part of a computation whose result is used outside the
> loop (tree-vectorizer.h:1204):
>   :
>   # res_18 = PHI 
>   # i_20 = PHI 
>   _1 = (long unsigned int) i_20;
>   _2 = _1 * 2;
>   _3 = x_14(D) + _2;
>   _4 = *_3;
>   _5 = (unsigned short) _4;
>   res.0_6 = (unsigned short) res_18;
>   _7 = _5 + res.0_6; <-- This is not live, may be 
> caused by the below type cast stmt.
>   res_15 = (short int) _7;
>   i_16 = i_20 + 1;
>   if (n_11(D) > i_16)
> goto ;
>   else
> goto ;
>
>   :
>   goto ;

Ah, I see, thanks.  My concern was: if requiring !STMT_VINFO_LIVE_P stmts
can cause "normal" reductions to have a latency of 0, could the same thing
happen for single-cycle reductions?  But I suppose the answer is "no".
Introducing a cast like the above would cause reduc_chain_length > 1,
and so:

  if (ncopies > 1
  && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
  && reduc_chain_length == 1
  && loop_vinfo->suggested_unroll_factor == 1)
single_defuse_cycle = true;

wouldn't trigger.  Which makes the single-cycle thing a bit 

Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

2023-07-31 Thread Hao Liu OS via Gcc-patches
Sure, the helper makes the code simpler.  I'll test the new patch and push if 
there is no other issue.

Thanks,
Hao


From: Richard Sandiford 
Sent: Monday, July 31, 2023 17:11
To: Hao Liu OS
Cc: Richard Biener; GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by 
multiplying count [PR110625]

Hao Liu OS  writes:
>> Which test case do you see this for?  The two tests in the patch still
>> seem to report correct latencies for me if I make the change above.
>
> Not the newly added tests.  It is still the existing case causing the 
> previous ICE (i.e. assertion problem): gcc.target/aarch64/sve/cost_model_13.c.
>
> It's not that the test case itself failed, but the vect dump message says the
> "reduction latency" is 0:
>
> Before the change:
> cost_model_13.c:7:21: note:  Original vector body cost = 6
> cost_model_13.c:7:21: note:  Scalar issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 1
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.00
> cost_model_13.c:7:21: note:    estimated cycles per vector iteration (for VF 8) = 8.00
> cost_model_13.c:7:21: note:  Vector issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 2
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 2.00
>
> After the change:
> cost_model_13.c:7:21: note:  Original vector body cost = 6
> cost_model_13.c:7:21: note:  Scalar issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 0  <--- seems not consistent with above result
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.00
> cost_model_13.c:7:21: note:    estimated cycles per vector iteration (for VF 8) = 8.00
> cost_model_13.c:7:21: note:  Vector issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 0  <--- seems not consistent with above result
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.00  <--- seems not consistent with above result
>
> BTW, this should be caused by the reduction stmt not being live; "live" indicates
> whether the stmt is part of a computation whose result is used outside the
> loop (tree-vectorizer.h:1204):
>   :
>   # res_18 = PHI 
>   # i_20 = PHI 
>   _1 = (long unsigned int) i_20;
>   _2 = _1 * 2;
>   _3 = x_14(D) + _2;
>   _4 = *_3;
>   _5 = (unsigned short) _4;
>   res.0_6 = (unsigned short) res_18;
>   _7 = _5 + res.0_6; <-- This is not live, may be 
> caused by the below type cast stmt.
>   res_15 = (short int) _7;
>   i_16 = i_20 + 1;
>   if (n_11(D) > i_16)
> goto ;
>   else
> goto ;
>
>   :
>   goto ;

Ah, I see, thanks.  My concern was: if requiring !STMT_VINFO_LIVE_P stmts
can cause "normal" reductions to have a latency of 0, could the same thing
happen for single-cycle reductions?  But I suppose the answer is "no".
Introducing a cast like the above would cause reduc_chain_length > 1,
and so:

  if (ncopies > 1
  && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
  && reduc_chain_length == 1
  && loop_vinfo->suggested_unroll_factor == 1)
single_defuse_cycle = true;

wouldn't trigger.  Which makes the single-cycle thing a bit hit-and-miss...
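(An editorial aside, reconstructed rather than taken from the original mail: a
reduction whose update is a single stmt keeps reduc_chain_length == 1 and can
be forced single-cycle, while integer promotion of a narrow accumulator inserts
casts and defeats it.  A sketch with assumed variable names:)

  long
  sketch (short *p1, short *p2, short *x, int n)
  {
    long  res1 = 0;
    short res2 = 0;
    for (int i = 0; i < n; i++)
      {
        res1 += (long) p1[i] << p2[i];  /* single update stmt: chain length 1,
                                           single-cycle can trigger, as in
                                           pr110625_2.c elsewhere in the thread */
        res2 += x[i];                   /* promoted to int, cast back to short:
                                           chain length > 1, not forced */
      }
    return res1 + res2;
  }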

So yeah, I agree the patch is safe after all.

Please split the check out into a helper though, to avoid the awkward
formatting:

/* Return true if STMT_INFO is part of a reduction that has the form:

  r = r op ...;
  r = r op ...;

   with the single accumulator being read and written multiple times.  */
static bool
aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info)
{
  if (!STMT_VINFO_LIVE_P (stmt_info))
    return false;

  auto reduc_info = info_for_reduction (vinfo, stmt_info);
  return STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info);
}

OK with that change, thanks.

Richard


Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

2023-07-30 Thread Hao Liu OS via Gcc-patches
opies.  */
> ops->reduction_latency = MAX (ops->reduction_latency, base * count);
>   else
> ops->reduction_latency = MAX (ops->reduction_latency, base);
>
> Thanks,
> Hao
>
> 
> From: Richard Sandiford 
> Sent: Wednesday, July 26, 2023 17:14
> To: Richard Biener
> Cc: Hao Liu OS; GCC-patches@gcc.gnu.org
> Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by 
> multiplying count [PR110625]
>
> Richard Biener  writes:
>> On Wed, Jul 26, 2023 at 4:02 AM Hao Liu OS via Gcc-patches
>>  wrote:
>>>
>>> > When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that we're 
>>> > not papering over an issue elsewhere.
>>>
>>> Yes, I also wonder if this is an issue in vectorizable_reduction.  Below is 
>>> the gimple of "gcc.target/aarch64/sve/cost_model_13.c":
>>>
>>>   :
>>>   # res_18 = PHI 
>>>   # i_20 = PHI 
>>>   _1 = (long unsigned int) i_20;
>>>   _2 = _1 * 2;
>>>   _3 = x_14(D) + _2;
>>>   _4 = *_3;
>>>   _5 = (unsigned short) _4;
>>>   res.0_6 = (unsigned short) res_18;
>>>   _7 = _5 + res.0_6; <-- The current stmt_info
>>>   res_15 = (short int) _7;
>>>   i_16 = i_20 + 1;
>>>   if (n_11(D) > i_16)
>>> goto ;
>>>   else
>>> goto ;
>>>
>>>   :
>>>   goto ;
>>>
>>> It looks like STMT_VINFO_REDUC_DEF should be "res_18 = PHI <res_15(7), 0(6)>"?
>>> The status here is:
>>>   STMT_VINFO_REDUC_IDX (stmt_info): 1
>>>   STMT_VINFO_REDUC_TYPE (stmt_info): TREE_CODE_REDUCTION
>>>   STMT_VINFO_REDUC_VECTYPE (stmt_info): 0x0
>>
>> Not all stmts in the SSA cycle forming the reduction have
>> STMT_VINFO_REDUC_DEF set,
>> only the last (latch def) and live stmts have at the moment.
>
> Ah, thanks.  In that case, Hao, I think we can avoid the ICE by changing:
>
>   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
>   && vect_is_reduction (stmt_info))
>
> to:
>
>   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
>   && STMT_VINFO_LIVE_P (stmt_info)
>   && vect_is_reduction (stmt_info))
>
> instead of using a null check.
>
> I see that vectorizable_reduction calculates a reduc_chain_length.
> Would it be OK to store that in the stmt_vec_info?  I suppose the
> AArch64 code should be multiplying by that as well.  (It would be a
> separate patch from this one though.)
>
> Richard
>
>
>>
>> Richard.
>>
>>> Thanks,
>>> Hao
>>>
>>> 
>>> From: Richard Sandiford 
>>> Sent: Tuesday, July 25, 2023 17:44
>>> To: Hao Liu OS
>>> Cc: GCC-patches@gcc.gnu.org
>>> Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by 
>>> multiplying count [PR110625]
>>>
>>> Hao Liu OS  writes:
>>> > Hi,
>>> >
>>> > Thanks for the suggestion.  I tested it and found a gcc_assert failure:
>>> > gcc.target/aarch64/sve/cost_model_13.c (internal compiler error: in 
>>> > info_for_reduction, at tree-vect-loop.cc:5473)
>>> >
>>> > It is caused by empty STMT_VINFO_REDUC_DEF.
>>>
>>> When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that
>>> we're not papering over an issue elsewhere.
>>>
>>> Thanks,
>>> Richard
>>>
>>>   So, I added an extra check before checking single_defuse_cycle. The 
>>> updated patch is below.  Is it OK for trunk?
>>> >
>>> > ---
>>> >
>>> > The new costs should only count reduction latency by multiplying count for
>>> > single_defuse_cycle.  For other situations, this will increase the 
>>> > reduction
>>> > latency a lot and miss vectorization opportunities.
>>> >
>>> > Tested on aarch64-linux-gnu.
>>> >
>>> > gcc/ChangeLog:
>>> >
>>> >   PR target/110625
>>> >   * config/aarch64/aarch64.cc (count_ops): Only '* count' for
>>> >   single_defuse_cycle while counting reduction_latency.
>>> >
>>> > gcc/testsuite/ChangeLog:
>>> >
>>> >   * gcc.target/aarch64/pr110625_1.c: New testcase.
>>> >   * gcc.target/aarch64/pr110625_2.c: New testcase.
>>> > ---
>>&

Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

2023-07-28 Thread Hao Liu OS via Gcc-patches
 I think we can avoid the ICE by changing:
>
>   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
>   && vect_is_reduction (stmt_info))
>
> to:
>
>   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
>   && STMT_VINFO_LIVE_P (stmt_info)
>   && vect_is_reduction (stmt_info))

I tried this and it indeed avoids the ICE.  But it seems the reduction_latency
calculation is also skipped; after such a modification, the reduction_latency is
0 for this case.  Previously, it was 1 and 2 for scalar and vector respectively.

IMHO, to keep it consistent with the previous result, should we move the
STMT_VINFO_LIVE_P check below and inside the if?  Such as:

  /* Calculate the minimum cycles per iteration imposed by a reduction
 operation.  */
  if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
  && vect_is_reduction (stmt_info))
{
  unsigned int base
= aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
  if (STMT_VINFO_LIVE_P (stmt_info) && STMT_VINFO_FORCE_SINGLE_CYCLE (
info_for_reduction (m_vinfo, stmt_info)))
/* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
   and then accumulate that, but at the moment the loop-carried
   dependency includes all copies.  */
ops->reduction_latency = MAX (ops->reduction_latency, base * count);
  else
ops->reduction_latency = MAX (ops->reduction_latency, base);

Thanks,
Hao


From: Richard Sandiford 
Sent: Wednesday, July 26, 2023 17:14
To: Richard Biener
Cc: Hao Liu OS; GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by 
multiplying count [PR110625]

Richard Biener  writes:
> On Wed, Jul 26, 2023 at 4:02 AM Hao Liu OS via Gcc-patches
>  wrote:
>>
>> > When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that we're 
>> > not papering over an issue elsewhere.
>>
>> Yes, I also wonder if this is an issue in vectorizable_reduction.  Below is 
>> the gimple of "gcc.target/aarch64/sve/cost_model_13.c":
>>
>>   :
>>   # res_18 = PHI 
>>   # i_20 = PHI 
>>   _1 = (long unsigned int) i_20;
>>   _2 = _1 * 2;
>>   _3 = x_14(D) + _2;
>>   _4 = *_3;
>>   _5 = (unsigned short) _4;
>>   res.0_6 = (unsigned short) res_18;
>>   _7 = _5 + res.0_6; <-- The current stmt_info
>>   res_15 = (short int) _7;
>>   i_16 = i_20 + 1;
>>   if (n_11(D) > i_16)
>> goto ;
>>   else
>> goto ;
>>
>>   :
>>   goto ;
>>
>> It looks like STMT_VINFO_REDUC_DEF should be "res_18 = PHI <res_15(7), 0(6)>"?
>> The status here is:
>>   STMT_VINFO_REDUC_IDX (stmt_info): 1
>>   STMT_VINFO_REDUC_TYPE (stmt_info): TREE_CODE_REDUCTION
>>   STMT_VINFO_REDUC_VECTYPE (stmt_info): 0x0
>
> Not all stmts in the SSA cycle forming the reduction have
> STMT_VINFO_REDUC_DEF set,
> only the last (latch def) and live stmts have at the moment.

Ah, thanks.  In that case, Hao, I think we can avoid the ICE by changing:

  if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
  && vect_is_reduction (stmt_info))

to:

  if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
  && STMT_VINFO_LIVE_P (stmt_info)
  && vect_is_reduction (stmt_info))

instead of using a null check.

I see that vectorizable_reduction calculates a reduc_chain_length.
Would it be OK to store that in the stmt_vec_info?  I suppose the
AArch64 code should be multiplying by that as well.  (It would be a
separate patch from this one though.)

Richard


>
> Richard.
>
>> Thanks,
>> Hao
>>
>> 
>> From: Richard Sandiford 
>> Sent: Tuesday, July 25, 2023 17:44
>> To: Hao Liu OS
>> Cc: GCC-patches@gcc.gnu.org
>> Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by 
>> multiplying count [PR110625]
>>
>> Hao Liu OS  writes:
>> > Hi,
>> >
>> > Thanks for the suggestion.  I tested it and found a gcc_assert failure:
>> > gcc.target/aarch64/sve/cost_model_13.c (internal compiler error: in 
>> > info_for_reduction, at tree-vect-loop.cc:5473)
>> >
>> > It is caused by empty STMT_VINFO_REDUC_DEF.
>>
>> When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that
>> we're not papering over an issue elsewhere.
>>
>> Thanks,
>> Richard
>>
>>   So, I added an extra check before checking single_defuse_cycle. The 
>> 

Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

2023-07-26 Thread Hao Liu OS via Gcc-patches
> Ah, thanks.  In that case, Hao, I think we can avoid the ICE by changing:
>
>   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
>   && vect_is_reduction (stmt_info))
>
> to:
>
>   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
>   && STMT_VINFO_LIVE_P (stmt_info)
>   && vect_is_reduction (stmt_info))

I tried this and it indeed avoids the ICE.  But it seems the reduction_latency
calculation is also skipped; after such a modification, the reduction_latency is
0 for this case.  Previously, it was 1 and 2 for scalar and vector respectively.

IMHO, to keep it consistent with the previous result, should we move the
STMT_VINFO_LIVE_P check below and inside the if?  Such as:

  /* Calculate the minimum cycles per iteration imposed by a reduction
 operation.  */
  if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
  && vect_is_reduction (stmt_info))
{
  unsigned int base
= aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
  if (STMT_VINFO_LIVE_P (stmt_info) && STMT_VINFO_FORCE_SINGLE_CYCLE (
info_for_reduction (m_vinfo, stmt_info)))
/* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
   and then accumulate that, but at the moment the loop-carried
   dependency includes all copies.  */
ops->reduction_latency = MAX (ops->reduction_latency, base * count);
  else
ops->reduction_latency = MAX (ops->reduction_latency, base);

Thanks,
Hao


From: Richard Sandiford 
Sent: Wednesday, July 26, 2023 17:14
To: Richard Biener
Cc: Hao Liu OS; GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by 
multiplying count [PR110625]

Richard Biener  writes:
> On Wed, Jul 26, 2023 at 4:02 AM Hao Liu OS via Gcc-patches
>  wrote:
>>
>> > When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that we're 
>> > not papering over an issue elsewhere.
>>
>> Yes, I also wonder if this is an issue in vectorizable_reduction.  Below is 
>> the gimple of "gcc.target/aarch64/sve/cost_model_13.c":
>>
>>   :
>>   # res_18 = PHI 
>>   # i_20 = PHI 
>>   _1 = (long unsigned int) i_20;
>>   _2 = _1 * 2;
>>   _3 = x_14(D) + _2;
>>   _4 = *_3;
>>   _5 = (unsigned short) _4;
>>   res.0_6 = (unsigned short) res_18;
>>   _7 = _5 + res.0_6; <-- The current stmt_info
>>   res_15 = (short int) _7;
>>   i_16 = i_20 + 1;
>>   if (n_11(D) > i_16)
>> goto ;
>>   else
>> goto ;
>>
>>   :
>>   goto ;
>>
>> It looks like that STMT_VINFO_REDUC_DEF should be "res_18 = PHI > 0(6)>"?
>> The status here is:
>>   STMT_VINFO_REDUC_IDX (stmt_info): 1
>>   STMT_VINFO_REDUC_TYPE (stmt_info): TREE_CODE_REDUCTION
>>   STMT_VINFO_REDUC_VECTYPE (stmt_info): 0x0
>
> Not all stmts in the SSA cycle forming the reduction have
> STMT_VINFO_REDUC_DEF set,
> only the last (latch def) and live stmts have at the moment.

Ah, thanks.  In that case, Hao, I think we can avoid the ICE by changing:

  if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
  && vect_is_reduction (stmt_info))

to:

  if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
  && STMT_VINFO_LIVE_P (stmt_info)
  && vect_is_reduction (stmt_info))

instead of using a null check.

I see that vectorizable_reduction calculates a reduc_chain_length.
Would it be OK to store that in the stmt_vec_info?  I suppose the
AArch64 code should be multiplying by that as well.  (It would be a
separate patch from this one though.)

Richard


>
> Richard.
>
>> Thanks,
>> Hao
>>
>> 
>> From: Richard Sandiford 
>> Sent: Tuesday, July 25, 2023 17:44
>> To: Hao Liu OS
>> Cc: GCC-patches@gcc.gnu.org
>> Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by 
>> multiplying count [PR110625]
>>
>> Hao Liu OS  writes:
>> > Hi,
>> >
>> > Thanks for the suggestion.  I tested it and found a gcc_assert failure:
>> > gcc.target/aarch64/sve/cost_model_13.c (internal compiler error: in 
>> > info_for_reduction, at tree-vect-loop.cc:5473)
>> >
>> > It is caused by empty STMT_VINFO_REDUC_DEF.
>>
>> When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that
>> we're not papering over an issue elsewhere.
>>
>> Thanks,
>> Richard
>>
>>   So, I added an extra check b

Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

2023-07-25 Thread Hao Liu OS via Gcc-patches
> When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that we're not 
> papering over an issue elsewhere.

Yes, I also wonder if this is an issue in vectorizable_reduction.  Below is the 
gimple of "gcc.target/aarch64/sve/cost_model_13.c":

  :
  # res_18 = PHI 
  # i_20 = PHI 
  _1 = (long unsigned int) i_20;
  _2 = _1 * 2;
  _3 = x_14(D) + _2;
  _4 = *_3;
  _5 = (unsigned short) _4;
  res.0_6 = (unsigned short) res_18;
  _7 = _5 + res.0_6; <-- The current stmt_info
  res_15 = (short int) _7;
  i_16 = i_20 + 1;
  if (n_11(D) > i_16)
goto ;
  else
goto ;

  :
  goto ;

It looks like STMT_VINFO_REDUC_DEF should be "res_18 = PHI <res_15(7), 0(6)>"?
The status here is:
  STMT_VINFO_REDUC_IDX (stmt_info): 1
  STMT_VINFO_REDUC_TYPE (stmt_info): TREE_CODE_REDUCTION
  STMT_VINFO_REDUC_VECTYPE (stmt_info): 0x0
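
For reference, a minimal C sketch of the loop shape, reconstructed from the
GIMPLE above (the exact source of cost_model_13.c may differ):

  short
  f (short *x, int n)
  {
    short res = 0;
    for (int i = 0; i < n; i++)
      res += x[i];  /* the promoted add becomes _7, with casts around it */
    return res;
  }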

Thanks,
Hao


From: Richard Sandiford 
Sent: Tuesday, July 25, 2023 17:44
To: Hao Liu OS
Cc: GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by 
multiplying count [PR110625]

Hao Liu OS  writes:
> Hi,
>
> Thanks for the suggestion.  I tested it and found a gcc_assert failure:
> gcc.target/aarch64/sve/cost_model_13.c (internal compiler error: in 
> info_for_reduction, at tree-vect-loop.cc:5473)
>
> It is caused by empty STMT_VINFO_REDUC_DEF.

When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that
we're not papering over an issue elsewhere.

Thanks,
Richard

  So, I added an extra check before checking single_defuse_cycle. The updated 
patch is below.  Is it OK for trunk?
>
> ---
>
> The new costs should only count reduction latency by multiplying count for
> single_defuse_cycle.  For other situations, this will increase the reduction
> latency a lot and miss vectorization opportunities.
>
> Tested on aarch64-linux-gnu.
>
> gcc/ChangeLog:
>
>   PR target/110625
>   * config/aarch64/aarch64.cc (count_ops): Only '* count' for
>   single_defuse_cycle while counting reduction_latency.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/pr110625_1.c: New testcase.
>   * gcc.target/aarch64/pr110625_2.c: New testcase.
> ---
>  gcc/config/aarch64/aarch64.cc | 13 --
>  gcc/testsuite/gcc.target/aarch64/pr110625_1.c | 46 +++
>  gcc/testsuite/gcc.target/aarch64/pr110625_2.c | 14 ++
>  3 files changed, 69 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_1.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_2.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 560e5431636..478a4e00110 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16788,10 +16788,15 @@ aarch64_vector_costs::count_ops (unsigned int 
> count, vect_cost_for_stmt kind,
>  {
>unsigned int base
>   = aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
> -
> -  /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
> -  that's not yet the case.  */
> -  ops->reduction_latency = MAX (ops->reduction_latency, base * count);
> +  if (STMT_VINFO_REDUC_DEF (stmt_info)
> +   && STMT_VINFO_FORCE_SINGLE_CYCLE (
> + info_for_reduction (m_vinfo, stmt_info)))
> + /* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
> +and then accumulate that, but at the moment the loop-carried
> +dependency includes all copies.  */
> + ops->reduction_latency = MAX (ops->reduction_latency, base * count);
> +  else
> + ops->reduction_latency = MAX (ops->reduction_latency, base);
>  }
>
>/* Assume that multiply-adds will become a single operation.  */
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_1.c 
> b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
> new file mode 100644
> index 000..0965cac33a0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
> @@ -0,0 +1,46 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details 
> -fno-tree-slp-vectorize" } */
> +/* { dg-final { scan-tree-dump-not "reduction latency = 8" "vect" } } */
> +
> +/* Do not increase the vector body cost due to the incorrect reduction 
> latency
> +Original vector body cost = 51
> +Scalar issue estimate:
> +  ...
> +  reduction latency = 2
> +  estimated min cycles per iteration = 2.00
> +  estimated cycles per vector iteration (for VF 2) = 4.00
> +Vector issue estimate:
> +  ...
> +  reduction latency = 8  <-- Too large
> +  estimated min cycles per iteration = 8.00
> +Increasing body cost to 102 because scalar code would issue more quickly
> +  ...
> +missed:  cost model: the vector iteration cost = 102 divided by the 
> scalar iteration cost = 44 is greater or equal to the vectorization factor = 
> 2.
> 

Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

2023-07-25 Thread Hao Liu OS via Gcc-patches
Hi,

Thanks for the suggestion.  I tested it and found a gcc_assert failure:
gcc.target/aarch64/sve/cost_model_13.c (internal compiler error: in 
info_for_reduction, at tree-vect-loop.cc:5473)

It is caused by empty STMT_VINFO_REDUC_DEF.  So, I added an extra check before 
checking single_defuse_cycle. The updated patch is below.  Is it OK for trunk?

---

The new costs should only count reduction latency by multiplying count for
single_defuse_cycle.  For other situations, this will increase the reduction
latency a lot and miss vectorization opportunities.
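
To illustrate why single_defuse_cycle still needs the multiplication (a sketch,
not part of the original submission): with count = 4 copies feeding one
accumulator, each copy waits for the previous one, so the loop-carried critical
path really is base * count:

  acc = acc + v0;  /* copy 1 */
  acc = acc + v1;  /* copy 2 depends on copy 1 */
  acc = acc + v2;
  acc = acc + v3;  /* critical path ~ 4 * base latency */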

Tested on aarch64-linux-gnu.

gcc/ChangeLog:

PR target/110625
* config/aarch64/aarch64.cc (count_ops): Only '* count' for
single_defuse_cycle while counting reduction_latency.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/pr110625_1.c: New testcase.
* gcc.target/aarch64/pr110625_2.c: New testcase.
---
 gcc/config/aarch64/aarch64.cc | 13 --
 gcc/testsuite/gcc.target/aarch64/pr110625_1.c | 46 +++
 gcc/testsuite/gcc.target/aarch64/pr110625_2.c | 14 ++
 3 files changed, 69 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_2.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 560e5431636..478a4e00110 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16788,10 +16788,15 @@ aarch64_vector_costs::count_ops (unsigned int count, 
vect_cost_for_stmt kind,
 {
   unsigned int base
= aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
-
-  /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
-that's not yet the case.  */
-  ops->reduction_latency = MAX (ops->reduction_latency, base * count);
+  if (STMT_VINFO_REDUC_DEF (stmt_info)
+ && STMT_VINFO_FORCE_SINGLE_CYCLE (
+   info_for_reduction (m_vinfo, stmt_info)))
+   /* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
+  and then accumulate that, but at the moment the loop-carried
+  dependency includes all copies.  */
+   ops->reduction_latency = MAX (ops->reduction_latency, base * count);
+  else
+   ops->reduction_latency = MAX (ops->reduction_latency, base);
 }
 
   /* Assume that multiply-adds will become a single operation.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_1.c 
b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
new file mode 100644
index 000..0965cac33a0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
@@ -0,0 +1,46 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details 
-fno-tree-slp-vectorize" } */
+/* { dg-final { scan-tree-dump-not "reduction latency = 8" "vect" } } */
+
+/* Do not increase the vector body cost due to the incorrect reduction latency
+Original vector body cost = 51
+Scalar issue estimate:
+  ...
+  reduction latency = 2
+  estimated min cycles per iteration = 2.00
+  estimated cycles per vector iteration (for VF 2) = 4.00
+Vector issue estimate:
+  ...
+  reduction latency = 8  <-- Too large
+  estimated min cycles per iteration = 8.00
+Increasing body cost to 102 because scalar code would issue more quickly
+  ...
+missed:  cost model: the vector iteration cost = 102 divided by the scalar 
iteration cost = 44 is greater or equal to the vectorization factor = 2.
+missed:  not vectorized: vectorization not profitable.  */
+
+typedef struct
+{
+  unsigned short m1, m2, m3, m4;
+} the_struct_t;
+typedef struct
+{
+  double m1, m2, m3, m4, m5;
+} the_struct2_t;
+
+double
+bar (the_struct2_t *);
+
+double
+foo (double *k, unsigned int n, the_struct_t *the_struct)
+{
+  unsigned int u;
+  the_struct2_t result;
+  for (u = 0; u < n; u++, k--)
+    {
+      result.m1 += (*k) * the_struct[u].m1;
+      result.m2 += (*k) * the_struct[u].m2;
+      result.m3 += (*k) * the_struct[u].m3;
+      result.m4 += (*k) * the_struct[u].m4;
+    }
+  return bar (&result);
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_2.c 
b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
new file mode 100644
index 000..7a84aa8355e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details 
-fno-tree-slp-vectorize" } */
+/* { dg-final { scan-tree-dump "reduction latency = 8" "vect" } } */
+
+/* The reduction latency should be multiplied by the count for
+   single_defuse_cycle.  */
+
+long
+f (long res, short *ptr1, short *ptr2, int n)
+{
+  for (int i = 0; i < n; ++i)
+    res += (long) ptr1[i] << ptr2[i];
+  return res;
+}
-- 
2.34.1



From: Richard Sandiford 
Sent: Monday, July 24, 2023 19:10
To: Hao Liu OS
Cc: 

Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

2023-07-23 Thread Hao Liu OS via Gcc-patches
Hi Richard,

Gentle ping.  Is it ok for trunk?

Or will you have a patch covering such a fix?

Thanks,
-Hao


From: Hao Liu OS 
Sent: Wednesday, July 19, 2023 12:33
To: GCC-patches@gcc.gnu.org
Cc: richard.sandif...@arm.com
Subject: [PATCH] AArch64: Do not increase the vect reduction latency by 
multiplying count [PR110625]

This only affects the new costs in the aarch64 backend.  Currently, the
reduction latency of the vector body is too large, as it is multiplied by the
stmt count.  As the scalar reduction latency is small, the new cost model may
think "scalar code would issue more quickly" and increase the vector body cost
a lot, which will miss vectorization opportunities.

Tested by bootstrapping on aarch64-linux-gnu.

gcc/ChangeLog:

PR target/110625
* config/aarch64/aarch64.cc (count_ops): Remove the '* count'
for reduction_latency.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/pr110625.c: New testcase.
---
 gcc/config/aarch64/aarch64.cc   |  5 +--
 gcc/testsuite/gcc.target/aarch64/pr110625.c | 46 +
 2 files changed, 47 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 560e5431636..27afa64b7d5 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16788,10 +16788,7 @@ aarch64_vector_costs::count_ops (unsigned int count, 
vect_cost_for_stmt kind,
 {
   unsigned int base
= aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
-
-  /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
-that's not yet the case.  */
-  ops->reduction_latency = MAX (ops->reduction_latency, base * count);
+  ops->reduction_latency = MAX (ops->reduction_latency, base);
 }

   /* Assume that multiply-adds will become a single operation.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625.c 
b/gcc/testsuite/gcc.target/aarch64/pr110625.c
new file mode 100644
index 000..0965cac33a0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625.c
@@ -0,0 +1,46 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details 
-fno-tree-slp-vectorize" } */
+/* { dg-final { scan-tree-dump-not "reduction latency = 8" "vect" } } */
+
+/* Do not increase the vector body cost due to the incorrect reduction latency
+Original vector body cost = 51
+Scalar issue estimate:
+  ...
+  reduction latency = 2
+  estimated min cycles per iteration = 2.00
+  estimated cycles per vector iteration (for VF 2) = 4.00
+Vector issue estimate:
+  ...
+  reduction latency = 8  <-- Too large
+  estimated min cycles per iteration = 8.00
+Increasing body cost to 102 because scalar code would issue more quickly
+  ...
+missed:  cost model: the vector iteration cost = 102 divided by the scalar 
iteration cost = 44 is greater or equal to the vectorization factor = 2.
+missed:  not vectorized: vectorization not profitable.  */
+
+typedef struct
+{
+  unsigned short m1, m2, m3, m4;
+} the_struct_t;
+typedef struct
+{
+  double m1, m2, m3, m4, m5;
+} the_struct2_t;
+
+double
+bar (the_struct2_t *);
+
+double
+foo (double *k, unsigned int n, the_struct_t *the_struct)
+{
+  unsigned int u;
+  the_struct2_t result;
+  for (u = 0; u < n; u++, k--)
+    {
+      result.m1 += (*k) * the_struct[u].m1;
+      result.m2 += (*k) * the_struct[u].m2;
+      result.m3 += (*k) * the_struct[u].m3;
+      result.m4 += (*k) * the_struct[u].m4;
+    }
+  return bar (&result);
+}
--
2.34.1


[PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

2023-07-18 Thread Hao Liu OS via Gcc-patches
This only affects the new costs in the aarch64 backend.  Currently, the
reduction latency of the vector body is too large, as it is multiplied by the
stmt count.  As the scalar reduction latency is small, the new cost model may
think "scalar code would issue more quickly" and increase the vector body cost
a lot, which will miss vectorization opportunities.
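
As a worked example (using the numbers from the pr110625.c comment below): with
a per-copy reduction latency of 2 and count = 4, the old code reports a vector
reduction latency of 2 * 4 = 8 against a scalar latency of 2, so the body cost
is doubled from 51 to 102 and the loop is deemed unprofitable to vectorize.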

Tested by bootstrapping on aarch64-linux-gnu.

gcc/ChangeLog:

PR target/110625
* config/aarch64/aarch64.cc (count_ops): Remove the '* count'
for reduction_latency.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/pr110625.c: New testcase.
---
 gcc/config/aarch64/aarch64.cc   |  5 +--
 gcc/testsuite/gcc.target/aarch64/pr110625.c | 46 +
 2 files changed, 47 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 560e5431636..27afa64b7d5 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16788,10 +16788,7 @@ aarch64_vector_costs::count_ops (unsigned int count, 
vect_cost_for_stmt kind,
 {
   unsigned int base
= aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
-
-  /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
-that's not yet the case.  */
-  ops->reduction_latency = MAX (ops->reduction_latency, base * count);
+  ops->reduction_latency = MAX (ops->reduction_latency, base);
 }
 
   /* Assume that multiply-adds will become a single operation.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625.c 
b/gcc/testsuite/gcc.target/aarch64/pr110625.c
new file mode 100644
index 000..0965cac33a0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625.c
@@ -0,0 +1,46 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details 
-fno-tree-slp-vectorize" } */
+/* { dg-final { scan-tree-dump-not "reduction latency = 8" "vect" } } */
+
+/* Do not increase the vector body cost due to the incorrect reduction latency
+Original vector body cost = 51
+Scalar issue estimate:
+  ...
+  reduction latency = 2
+  estimated min cycles per iteration = 2.00
+  estimated cycles per vector iteration (for VF 2) = 4.00
+Vector issue estimate:
+  ...
+  reduction latency = 8  <-- Too large
+  estimated min cycles per iteration = 8.00
+Increasing body cost to 102 because scalar code would issue more quickly
+  ...
+missed:  cost model: the vector iteration cost = 102 divided by the scalar 
iteration cost = 44 is greater or equal to the vectorization factor = 2.
+missed:  not vectorized: vectorization not profitable.  */
+
+typedef struct
+{
+  unsigned short m1, m2, m3, m4;
+} the_struct_t;
+typedef struct
+{
+  double m1, m2, m3, m4, m5;
+} the_struct2_t;
+
+double
+bar (the_struct2_t *);
+
+double
+foo (double *k, unsigned int n, the_struct_t *the_struct)
+{
+  unsigned int u;
+  the_struct2_t result;
+  for (u = 0; u < n; u++, k--)
+    {
+      result.m1 += (*k) * the_struct[u].m1;
+      result.m2 += (*k) * the_struct[u].m2;
+      result.m3 += (*k) * the_struct[u].m3;
+      result.m4 += (*k) * the_struct[u].m4;
+    }
+  return bar (&result);
+}
-- 
2.34.1


Re: [PATCH] Vect: use a small step to calculate induction for the unrolled loop (PR tree-optimization/110449)

2023-07-06 Thread Hao Liu OS via Gcc-patches
Hi Jeff,

Thanks for your help.

Actually I have write access as I was added to the "contributor list". Anyway, 
that's very kind of you to help commit the patch.

Thanks,
-Hao

From: Jeff Law 
Sent: Friday, July 7, 2023 0:06
To: Richard Biener; Hao Liu OS
Cc: GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] Vect: use a small step to calculate induction for the 
unrolled loop (PR tree-optimization/110449)



On 7/6/23 06:44, Richard Biener via Gcc-patches wrote:
> On Wed, Jul 5, 2023 at 8:44 AM Hao Liu OS via Gcc-patches
>  wrote:
>>
>> Hi,
>>
>> If a loop is unrolled n times during vectorization, two steps are used to
>> calculate the induction variable:
>>- The small step for the unrolled ith-copy: vec_1 = vec_iv + (VF/n * Step)
>>- The large step for the whole loop: vec_loop = vec_iv + (VF * Step)
>>
>> This patch calculates an extra vec_n to replace vec_loop:
>>vec_n = vec_prev + (VF/n * S) = vec_iv + (VF/n * S) * n = vec_loop.
>>
>> So that we can save the large step register and related operations.
>
> OK.  It would be nice to avoid the dead stmts created earlier though.
>
> Thanks,
> Richard.
>
>> gcc/ChangeLog:
>>
>>  PR tree-optimization/110449
>>  * tree-vect-loop.cc (vectorizable_induction): use vec_n to replace
>>  vec_loop for the unrolled loop.
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * gcc.target/aarch64/pr110449.c: New testcase.
I didn't see Hao Liu in the MAINTAINERS file, so he probably doesn't have
write access.  Therefore I went ahead and pushed this for Hao.

jeff


[PATCH] Vect: select small VF for epilog of unrolled loop (PR tree-optimization/110474)

2023-07-05 Thread Hao Liu OS via Gcc-patches
Hi,

If a loop is unrolled during vectorization (i.e. suggested_unroll_factor > 1),
the VFs of both the main and epilog loops are enlarged.  The epilog vect loop
is intended for loops with small iteration counts, so a large VF may hurt
performance.

This patch unscales the main loop VF by suggested_unroll_factor while selecting
the epilog loop VF, so that it will be the same as for a vectorized loop without
unrolling (i.e. suggested_unroll_factor = 1).
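
A worked example with the numbers from the testcase below: the unrolled main
loop has VF 32 under suggested_unroll_factor = 2, so the epilog VF is now
checked against 32 / 2 = 16 instead of 32.  A VF-16 epilog is therefore
rejected and the VF-8 (V8QI) one is chosen, restoring the N >= 8 entry
threshold of the non-unrolled case.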

gcc/ChangeLog:

PR tree-optimization/110474
* tree-vect-loop.cc (vect_analyze_loop_2): unscale the VF by suggested
unroll factor while selecting the epilog vect loop VF.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/pr110474.c: New testcase.
---
 gcc/testsuite/gcc.target/aarch64/pr110474.c | 37 +
 gcc/tree-vect-loop.cc   | 16 +
 2 files changed, 47 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110474.c

diff --git a/gcc/testsuite/gcc.target/aarch64/pr110474.c 
b/gcc/testsuite/gcc.target/aarch64/pr110474.c
new file mode 100644
index 000..e548416162a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110474.c
@@ -0,0 +1,37 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mtune=neoverse-n2 -mcpu=neoverse-n1 
-fdump-tree-vect-details --param aarch64-vect-unroll-limit=2" } */
+/* { dg-final { scan-tree-dump "Choosing vector mode V8HI"  "vect" } } */
+/* { dg-final { scan-tree-dump "Choosing epilogue vector mode V8QI"  "vect" } 
} */
+
+/* Do not increase the vector factor of the epilog vectorized loop
+   for a loop with suggested_unroll_factor > 1.
+
+   before (suggested_unroll_factor=1):
+     if N >= 16:
+       main vect loop
+     if N >= 8:
+       epilog vect loop
+     scalar code
+
+   before (suggested_unroll_factor=2):
+     if N >= 32:
+       main vect loop
+     if N >= 16:  // May fail to execute vectorized code (e.g. N is 8)
+       epilog vect loop
+     scalar code
+
+   after  (suggested_unroll_factor=2):
+     if N >= 32:
+       main vect loop
+     if N >= 8:  // The same VF as suggested_unroll_factor=1
+       epilog vect loop
+     scalar code  */
+
+int
+foo (short *A, char *B, int N)
+{
+  int sum = 0;
+  for (int i = 0; i < N; ++i)
+    sum += A[i] * B[i];
+  return sum;
+}
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 3b46c58a8d8..4d9abd035ea 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -3021,12 +3021,16 @@ start_over:
  to be able to handle fewer than VF scalars, or needs to have a lower VF
  than the main loop.  */
   if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
-  && !LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
-  && maybe_ge (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
-  LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo)))
-return opt_result::failure_at (vect_location,
-  "Vectorization factor too high for"
-  " epilogue loop.\n");
+  && !LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+{
+  poly_uint64 unscaled_vf
+   = exact_div (LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo),
+orig_loop_vinfo->suggested_unroll_factor);
+  if (maybe_ge (LOOP_VINFO_VECT_FACTOR (loop_vinfo), unscaled_vf))
+   return opt_result::failure_at (vect_location,
+  "Vectorization factor too high for"
+  " epilogue loop.\n");
+}
 
   /* Decide whether this loop_vinfo should use partial vectors or peeling,
  assuming that the loop will be used as a main loop.  We will redo
-- 
2.34.1


[PATCH] Vect: use a small step to calculate induction for the unrolled loop (PR tree-optimization/110449)

2023-07-05 Thread Hao Liu OS via Gcc-patches
Hi,

If a loop is unrolled n times during vectorization, two steps are used to
calculate the induction variable:
  - The small step for the unrolled ith-copy: vec_1 = vec_iv + (VF/n * Step)
  - The large step for the whole loop: vec_loop = vec_iv + (VF * Step)

This patch calculates an extra vec_n to replace vec_loop:
  vec_n = vec_prev + (VF/n * S) = vec_iv + (VF/n * S) * n = vec_loop.

So that we can save the large step register and related operations.
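
Worked through with concrete numbers (assuming VF = 8 and n = 2 copies with
step S):

  vec_1 = vec_iv + 4 * S   /* small step for copy 1 (VF/n = 4) */
  vec_2 = vec_1  + 4 * S   /* = vec_iv + 8 * S = vec_loop */

so vec_2 can feed the loop PHI directly, and the separate VF * S step vector
(and the register holding it) is no longer needed.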

gcc/ChangeLog:

PR tree-optimization/110449
* tree-vect-loop.cc (vectorizable_induction): use vec_n to replace
vec_loop for the unrolled loop.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/pr110449.c: New testcase.
---
 gcc/testsuite/gcc.target/aarch64/pr110449.c | 40 +
 gcc/tree-vect-loop.cc   | 21 +--
 2 files changed, 58 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110449.c

diff --git a/gcc/testsuite/gcc.target/aarch64/pr110449.c 
b/gcc/testsuite/gcc.target/aarch64/pr110449.c
new file mode 100644
index 000..bb3b6dcfe08
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110449.c
@@ -0,0 +1,40 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mcpu=neoverse-n2 --param aarch64-vect-unroll-limit=2" 
} */
+/* { dg-final { scan-assembler-not "8.0e\\+0" } } */
+
+/* Calculate the vectorized induction with a smaller step for an unrolled loop.
+
+   before (suggested_unroll_factor=2):
+     fmov    s30, 8.0e+0
+     fmov    s31, 4.0e+0
+     dup     v27.4s, v30.s[0]
+     dup     v28.4s, v31.s[0]
+   .L6:
+     mov     v30.16b, v31.16b
+     fadd    v31.4s, v31.4s, v27.4s
+     fadd    v29.4s, v30.4s, v28.4s
+     stp     q30, q29, [x0]
+     add     x0, x0, 32
+     cmp     x1, x0
+     bne     .L6
+
+   after:
+     fmov    s31, 4.0e+0
+     dup     v29.4s, v31.s[0]
+   .L6:
+     fadd    v30.4s, v31.4s, v29.4s
+     stp     q31, q30, [x0]
+     add     x0, x0, 32
+     fadd    v31.4s, v29.4s, v30.4s
+     cmp     x0, x1
+     bne     .L6  */
+
+void
+foo2 (float *arr, float freq, float step)
+{
+  for (int i = 0; i < 1024; i++)
+    {
+      arr[i] = freq;
+      freq += step;
+    }
+}
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 3b46c58a8d8..706ecbffd0c 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -10114,7 +10114,7 @@ vectorizable_induction (loop_vec_info loop_vinfo,
   new_vec, step_vectype, NULL);
 
   vec_def = induc_def;
-  for (i = 1; i < ncopies; i++)
+  for (i = 1; i < ncopies + 1; i++)
{
  /* vec_i = vec_prev + vec_step  */
  gimple_seq stmts = NULL;
@@ -10124,8 +10124,23 @@ vectorizable_induction (loop_vec_info loop_vinfo,
  vec_def = gimple_convert (&stmts, vectype, vec_def);
  
  gsi_insert_seq_before (&si, stmts, GSI_SAME_STMT);
- new_stmt = SSA_NAME_DEF_STMT (vec_def);
- STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
+ if (i < ncopies)
+   {
+ new_stmt = SSA_NAME_DEF_STMT (vec_def);
+ STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
+   }
+ else
+   {
+ /* vec_1 = vec_iv + (VF/n * S)
+vec_2 = vec_1 + (VF/n * S)
+...
+vec_n = vec_prev + (VF/n * S) = vec_iv + VF * S = vec_loop
+
+vec_n is used as vec_loop to save the large step register and
+related operations.  */
+ add_phi_arg (induction_phi, vec_def, loop_latch_edge (iv_loop),
+  UNKNOWN_LOCATION);
+   }
}
 }
 
-- 
2.34.1

[PATCH] Vect: avoid using uninitialized variable (PR tree-optimization/110531)

2023-07-04 Thread Hao Liu OS via Gcc-patches
slp_done_for_suggested_uf is used in vect_analyze_loop_2 without
initialization, which is undefined behavior.  Initialize it to false
according to the discussion.
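
A minimal sketch of the bug class (hypothetical names, for illustration only):

  #include <stdbool.h>

  bool analyze (bool *done);  /* may return early without writing *done */

  int
  f (void)
  {
    bool done;            /* indeterminate until written */
    analyze (&done);
    return done ? 1 : 0;  /* UB if analyze () never stored to it */
  }

Initializing the flag to false makes every path well-defined.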

gcc/ChangeLog:
PR tree-optimization/110531
* tree-vect-loop.cc (vect_analyze_loop_1): Initialize
slp_done_for_suggested_uf to false.
---
 gcc/tree-vect-loop.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index f39a1ecb306..e504645f1df 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -,7 +,7 @@ vect_analyze_loop_1 (class loop *loop, vec_info_shared *shared,
   machine_mode vector_mode = vector_modes[mode_i];
   loop_vinfo->vector_mode = vector_mode;
   unsigned int suggested_unroll_factor = 1;
-  bool slp_done_for_suggested_uf;
+  bool slp_done_for_suggested_uf = false;
 
   /* Run the main analysis.  */
   opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal,
-- 
2.34.1

RE: Add libcody

2020-12-21 Thread Hao Liu OS via Gcc-patches
Hi Nathan,

This patch causes a build failure on CentOS.

More information: https://gcc.gnu.org/bugzilla//show_bug.cgi?id=98318#c3 

Thanks,
-Hao

> -----Original Message-----
> From: Gcc-patches  On Behalf Of Nathan
> Sidwell
> Sent: Tuesday, December 15, 2020 11:46 PM
> To: GCC Patches 
> Subject: Add libcody
> 
> In order to separate compiler from build system, C++ Modules, as
> implemented in GCC introduces a communication channel between those
> two entities.  This is implemented by libcody.  It is anticipated that other
> implementations will also implement this protocol, or use libcody to provide
> it.
> 
>  * Makefile.def: Add libcody.
>  * Makefile.in: Regenerated.
>  * configure: Regenerated.
>  gcc/
>  * Makefile.in (CODYINC, CODYLIB, CODYLIB_H): New. Use them.
>  libcody/
>  * CMakeLists.txt: New.
>  * CODING.md: New.
>  * CONTRIB.md: New.
>  * LICENSE: New.
>  * LICENSE.gcc: New.
>  * Makefile.in: New.
>  * Makesub.in: New.
>  * README.md: New.
>  * buffer.cc: New.
>  * build-aux/config.guess: New.
>  * build-aux/config.sub: New.
>  * build-aux/install-sh: New.
>  * client.cc: New.
>  * cmake/libcody-config-ix.cmake: New.
>  * cody.hh: New.
>  * config.h.in: New.
>  * config.m4: New.
>  * configure: New.
>  * configure.ac: New.
>  * dox.cfg.in: New.
>  * fatal.cc: New.
>  * gdbinit.in: New.
>  * internal.hh: New.
>  * netclient.cc: New.
>  * netserver.cc: New.
>  * packet.cc: New.
>  * resolver.cc: New.
>  * server.cc: New.
>  * tests/01-serialize/connect.cc: New.
>  * tests/01-serialize/decoder.cc: New.
>  * tests/01-serialize/encoder.cc: New.
>  * tests/02-comms/client-1.cc: New.
>  * tests/02-comms/pivot-1.cc: New.
>  * tests/02-comms/server-1.cc: New.
>  * tests/Makesub.in: New.
>  * tests/jouster: New.
> 
> pushing to trunk
> --
> Nathan Sidwell


Re: [PATCH] extend cselim to check non-trapping for more references (PR tree-optimization/89430)

2020-06-04 Thread Hao Liu OS via Gcc-patches
Hi Jakub,

I've updated the incorrect ChangeLog.

gcc/:

PR tree-optimization/89430
* tree-ssa-phiopt.c
(struct name_to_bb): Rename to ref_to_bb; add a new field exp;
remove ssa_name_ver, store, offset fields.
(struct ssa_names_hasher): Rename to refs_hasher; update functions.
(class nontrapping_dom_walker): Rename m_seen_ssa_names to m_seen_refs.
(nontrapping_dom_walker::add_or_mark_expr): Extend to support ARRAY_REFs
and COMPONENT_REFs.

 Thanks a lot.
-Hao


From: Jakub Jelinek 
Sent: Thursday, June 4, 2020 12:55
To: Hao Liu OS
Cc: Richard Biener; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] extend cselim to check non-trapping for more references (PR tree-optimization/89430)

On Thu, Jun 04, 2020 at 04:47:43AM +, Hao Liu OS wrote:
> The patch is refactored a little according to the last comment. Do you have 
> more comments? If no, I will commit it later.
>
> Tested on X86_64 and AArch64.
>
> gcc/:
>
> PR tree-optimization/89430
> * tree-ssa-phiopt.c (cond_store_replacement): Extend non-trap checking
> to support ARRAY_REFs and COMPONENT_REFs.  Support a special case: if
> there is a dominating load of local variable without address escape,
> a store is not trapped (as local stack is always writable).
> The logic is also simplified to ignore other loads, as they don't
> help to check if a store is trapped (may be read-only).

The ChangeLog entry is certainly incorrect, it doesn't mention all the
classes and methods you've actually changed, but mentions a routine you
haven't changed at all.  And it describes the intent of the changes rather
than the details on what actually changed.  This struct got renamed and this
and this member has been added, etc.

Jakub



Re: [PATCH] extend cselim to check non-trapping for more references (PR tree-optimization/89430)

2020-06-03 Thread Hao Liu OS via Gcc-patches
Hi All,

The patch is refactored a little according to the last comment. Do you have 
more comments? If no, I will commit it later.

Tested on X86_64 and AArch64.

gcc/:

PR tree-optimization/89430
* tree-ssa-phiopt.c (cond_store_replacement): Extend non-trap checking
to support ARRAY_REFs and COMPONENT_REFs.  Support a special case: if
there is a dominating load of local variable without address escape,
a store is not trapped (as local stack is always writable).
The logic is also simplified to ignore other loads, as they don't
help to check if a store is trapped (may be read-only).

gcc/testsuite/:

PR tree-optimization/89430
* gcc.dg/tree-ssa/pr89430-1.c: Remove xfail.
* gcc.dg/tree-ssa/pr89430-2.c: Remove xfail.
* gcc.dg/tree-ssa/pr89430-5.c: Remove xfail.
* gcc.dg/tree-ssa/pr89430-6.c: Remove xfail.
* gcc.dg/tree-ssa/pr89430-7-comp-ref.c: New test.
* gcc.dg/tree-ssa/pr89430-8-mem-ref-size.c: New test.
* gcc.dg/tree-ssa/ssa-pre-17.c: Add -fno-tree-cselim.

---
 gcc/testsuite/gcc.dg/tree-ssa/pr89430-1.c |   2 +-
 gcc/testsuite/gcc.dg/tree-ssa/pr89430-2.c |   2 +-
 gcc/testsuite/gcc.dg/tree-ssa/pr89430-5.c |   2 +-
 gcc/testsuite/gcc.dg/tree-ssa/pr89430-6.c |   2 +-
 .../gcc.dg/tree-ssa/pr89430-7-comp-ref.c  |  17 +++
 .../gcc.dg/tree-ssa/pr89430-8-mem-ref-size.c  |  15 ++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-17.c|   2 +-
 gcc/tree-ssa-phiopt.c | 129 ++
 8 files changed, 107 insertions(+), 64 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr89430-7-comp-ref.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr89430-8-mem-ref-size.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr89430-1.c b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-1.c
index ce242ba569b..8ee1850ac63 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr89430-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-1.c
@@ -9,4 +9,4 @@ unsigned test(unsigned k, unsigned b) {
 return a[0]+a[1];
 }

-/* { dg-final { scan-tree-dump "Conditional store replacement" "cselim" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump "Conditional store replacement" "cselim" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr89430-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-2.c
index 90ae36bfce2..9b96875ac7a 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr89430-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-2.c
@@ -11,4 +11,4 @@ unsigned test(unsigned k, unsigned b) {
 return a[0]+a[1];
 }

-/* { dg-final { scan-tree-dump "Conditional store replacement" "cselim" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump "Conditional store replacement" "cselim" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr89430-5.c b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-5.c
index c633cbe947d..b2d04119381 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr89430-5.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-5.c
@@ -13,4 +13,4 @@ int test(int b, int k) {
 return a.data[0] + a.data[1];
 }

-/* { dg-final { scan-tree-dump "Conditional store replacement" "cselim" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump "Conditional store replacement" "cselim" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr89430-6.c b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-6.c
index 7cad563128d..8d3c4f7cc6a 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr89430-6.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-6.c
@@ -16,4 +16,4 @@ int test(int b, int k) {
 return a.data[0].x + a.data[1].x;
 }

-/* { dg-final { scan-tree-dump "Conditional store replacement" "cselim" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump "Conditional store replacement" "cselim" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr89430-7-comp-ref.c b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-7-comp-ref.c
new file mode 100644
index 000..c35a2afc70b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-7-comp-ref.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-cselim-details" } */
+
+typedef union {
+  int i;
+  float f;
+} U;
+
+int foo(U *u, int b, int i)
+{
+  u->i = 0;
+  if (b)
+u->i = i;
+  return u->i;
+}
+
+/* { dg-final { scan-tree-dump "Conditional store replacement" "cselim" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr89430-8-mem-ref-size.c b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-8-mem-ref-size.c
new file mode 100644
index 000..f9e66aefb13
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-8-mem-ref-size.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-cselim-details" } */
+
+int *t;
+
+int f1 (int tt)
+{
+  int *t1 = t;
+  *t1 = -5;
+  if (*t1 < tt)
+*((unsigned *) t1) = 5;
+  return *t1;
+}
+
+/* { dg-final { scan-tree-dump "Conditional store replacement" "cselim" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-17.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-17.c
index 

RE: [PATCH] extend cselim to check non-trapping for more references (PR tree-optimization/89430)

2020-05-29 Thread Hao Liu OS via Gcc-patches
Hi Richard,

Thanks for your comments.  It's a good idea to simplify the code and remove
get_inner_reference.  I've updated the patch accordingly.  I also simplified
the code to ignore other loads, which cannot help to check whether a store
can trap.
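
For example (an illustrative sketch, not from the testsuite): a load from
read-only memory succeeds, yet a store to the same location still traps, so
a dominating load alone must not mark a location as store-safe.

  /* A successful load does not prove a store is safe.  */
  const char *
  ro_example (void)
  {
    static const char s[] = "abc";  /* typically placed in read-only memory */
    char c = s[0];                  /* a dominating load is fine ...  */
    (void) c;
    /* ((char *) s)[0] = 'x';  ... but a store here would trap at run time.  */
    return s;
  }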

About tests:
1.  All previously XFAILed tests for pr89430 (gcc.dg/tree-ssa/pr89430-*) now
pass with this patch.
2.  ssa-pre-17.c fails because cselim now optimizes the conditional store, so
"-fno-tree-cselim" is added.  That case is also added as a new test for
pr89430.
3.  Other test cases in the gcc testsuite (including the regression test for
pr94734) are not affected by this patch, according to "make check".
4.  Some other benchmarks were also tested for correctness and performance.
The performance regression mentioned in pr89430 is fixed.
 
Review, please.

gcc/ChangeLog:

PR tree-optimization/89430
* tree-ssa-phiopt.c (cond_store_replacement): Extend non-trap checking to
support ARRAY_REFs and COMPONENT_REFs.  Support a special case: if there is
a dominating load of local variable without address escape, a store is not
trapped (as local stack is always writable).  Other loads are ignored for
simplicity, as they don't help to check if a store can be trapped.

gcc/testsuite/ChangeLog:

PR tree-optimization/89430
* gcc.dg/tree-ssa/pr89430-1.c: Remove xfail.
* gcc.dg/tree-ssa/pr89430-2.c: Remove xfail.
* gcc.dg/tree-ssa/pr89430-5.c: Remove xfail.
* gcc.dg/tree-ssa/pr89430-6.c: Remove xfail.
* gcc.dg/tree-ssa/pr89430-7-comp-ref.c: New test.
* gcc.dg/tree-ssa/ssa-pre-17.c: Add -fno-tree-cselim.

---
 gcc/testsuite/gcc.dg/tree-ssa/pr89430-1.c |   2 +-
 gcc/testsuite/gcc.dg/tree-ssa/pr89430-2.c |   2 +-
 gcc/testsuite/gcc.dg/tree-ssa/pr89430-5.c |   2 +-
 gcc/testsuite/gcc.dg/tree-ssa/pr89430-6.c |   2 +-
 .../gcc.dg/tree-ssa/pr89430-7-comp-ref.c  |  17 +++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-17.c|   2 +-
 gcc/tree-ssa-phiopt.c | 140 +-
 7 files changed, 91 insertions(+), 76 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr89430-7-comp-ref.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr89430-1.c b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-1.c
index ce242ba569b..8ee1850ac63 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr89430-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-1.c
@@ -9,4 +9,4 @@ unsigned test(unsigned k, unsigned b) {
 return a[0]+a[1];
 }
 
-/* { dg-final { scan-tree-dump "Conditional store replacement" "cselim" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump "Conditional store replacement" "cselim" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr89430-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-2.c
index 90ae36bfce2..9b96875ac7a 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr89430-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-2.c
@@ -11,4 +11,4 @@ unsigned test(unsigned k, unsigned b) {
 return a[0]+a[1];
 }
 
-/* { dg-final { scan-tree-dump "Conditional store replacement" "cselim" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump "Conditional store replacement" "cselim" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr89430-5.c b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-5.c
index c633cbe947d..b2d04119381 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr89430-5.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-5.c
@@ -13,4 +13,4 @@ int test(int b, int k) {
 return a.data[0] + a.data[1];
 }
 
-/* { dg-final { scan-tree-dump "Conditional store replacement" "cselim" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump "Conditional store replacement" "cselim" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr89430-6.c b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-6.c
index 7cad563128d..8d3c4f7cc6a 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr89430-6.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-6.c
@@ -16,4 +16,4 @@ int test(int b, int k) {
 return a.data[0].x + a.data[1].x;
 }
 
-/* { dg-final { scan-tree-dump "Conditional store replacement" "cselim" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump "Conditional store replacement" "cselim" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr89430-7-comp-ref.c b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-7-comp-ref.c
new file mode 100644
index 000..c35a2afc70b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr89430-7-comp-ref.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-cselim-details" } */
+
+typedef union {
+  int i;
+  float f;
+} U;
+
+int foo(U *u, int b, int i)
+{
+  u->i = 0;
+  if (b)
+u->i = i;
+  return u->i;
+}
+
+/* { dg-final { scan-tree-dump "Conditional store replacement" "cselim" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-17.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-17.c
index 09313716598..a06f339f0bb 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-17.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-17.c
@@ -1,5 +1,5 @@
 /* { dg-do 

[PATCH] extend cselim to check non-trapping for more references (PR tree-optimization/89430)

2020-05-26 Thread Hao Liu OS via Gcc-patches
Hi all,

Previously, the fix for PR89430 was reverted by PR94734 due to a bug.  The
root cause is a missing non-trapping check against the dominating LOAD/STORE.

This patch extends the cselim non-trapping check to support ARRAY_REF and
COMPONENT_REF (previously only MEM_REF was supported), using
get_inner_reference and hashing, following the comments from Jakub.

To support the cases in PR89430: if there is a dominating LOAD of a local
variable whose address does not escape, the STORE is not trapping (as the
local stack is always writable) and can be optimized.

Review, please.


diff --git a/gcc/tree-ssa-phiopt.c b/gcc/tree-ssa-phiopt.c
index b1e0dce93d8..3733780a0bc 100644
--- a/gcc/tree-ssa-phiopt.c
+++ b/gcc/tree-ssa-phiopt.c
@@ -1986,26 +1986,31 @@ abs_replacement (basic_block cond_bb, basic_block middle_bb,

??? We currently are very conservative and assume that a load might
trap even if a store doesn't (write-only memory).  This probably is
-   overly conservative.  */
+   overly conservative.

-/* A hash-table of SSA_NAMEs, and in which basic block an MEM_REF
-   through it was seen, which would constitute a no-trap region for
-   same accesses.  */
-struct name_to_bb
+   We currently support a special case: for !TREE_ADDRESSABLE automatic
+   variables, we can ignore whether an access is a load or a store, because
+   the local stack is always writable.  */
+
+/* A hash-table of references (MEM_REF/ARRAY_REF/COMPONENT_REF), and in which
+   basic block an *_REF through it was seen, which would constitute a
+   no-trap region for same accesses.  */
+struct ref_to_bb
 {
-  unsigned int ssa_name_ver;
+  tree base;
+  poly_int64 bitsize, bitpos;
+  tree offset;
   unsigned int phase;
-  bool store;
-  HOST_WIDE_INT offset, size;
+  bool writable;
   basic_block bb;
 };

 /* Hashtable helpers.  */

-struct ssa_names_hasher : free_ptr_hash <name_to_bb>
+struct refs_hasher : free_ptr_hash <ref_to_bb>
 {
-  static inline hashval_t hash (const name_to_bb *);
-  static inline bool equal (const name_to_bb *, const name_to_bb *);
+  static inline hashval_t hash (const ref_to_bb *);
+  static inline bool equal (const ref_to_bb *, const ref_to_bb *);
 };

 /* Used for quick clearing of the hash-table when we see calls.
@@ -2015,28 +2020,44 @@ static unsigned int nt_call_phase;
 /* The hash function.  */

 inline hashval_t
-ssa_names_hasher::hash (const name_to_bb *n)
+refs_hasher::hash (const ref_to_bb *n)
 {
-  return n->ssa_name_ver ^ (((hashval_t) n->store) << 31)
- ^ (n->offset << 6) ^ (n->size << 3);
+  inchash::hash hstate;
+  inchash::add_expr (n->base, hstate);
+  hstate.add_poly_int (n->bitsize);
+  hstate.add_poly_int (n->bitpos);
+  hstate.add_int (n->writable);
+  if (n->offset != NULL_TREE)
+    {
+      inchash::add_expr (n->offset, hstate);
+    }
+  return hstate.end ();
 }

 /* The equality function of *P1 and *P2.  */

 inline bool
-ssa_names_hasher::equal (const name_to_bb *n1, const name_to_bb *n2)
+refs_hasher::equal (const ref_to_bb *n1, const ref_to_bb *n2)
 {
-  return n1->ssa_name_ver == n2->ssa_name_ver
- && n1->store == n2->store
- && n1->offset == n2->offset
- && n1->size == n2->size;
+  if (operand_equal_p (n1->base, n2->base, 0)
+      && known_eq (n1->bitsize, n2->bitsize)
+      && known_eq (n1->bitpos, n2->bitpos) && n1->writable == n2->writable)
+    {
+      /* Should not call operand_equal_p with NULL_TREE.  */
+      if (n1->offset == NULL_TREE || n2->offset == NULL_TREE)
+        return n1->offset == n2->offset;
+      else
+        return operand_equal_p (n1->offset, n2->offset, 0);
+    }
+  return false;
 }

 class nontrapping_dom_walker : public dom_walker
 {
 public:
   nontrapping_dom_walker (cdi_direction direction, hash_set<tree> *ps)
-: dom_walker (direction), m_nontrapping (ps), m_seen_ssa_names (128) {}
+: dom_walker (direction), m_nontrapping (ps), m_seen_refs (256)
+  {}

   virtual edge before_dom_children (basic_block);
   virtual void after_dom_children (basic_block);
@@ -2053,7 +2074,7 @@ private:
   hash_set<tree> *m_nontrapping;

   /* The hash table for remembering what we've seen.  */
-  hash_table<ssa_names_hasher> m_seen_ssa_names;
+  hash_table<refs_hasher> m_seen_refs;
 };

 /* Called by walk_dominator_tree, when entering the block BB.  */
@@ -2102,65 +2123,76 @@ nontrapping_dom_walker::after_dom_children (basic_block bb)
 }

 /* We see the expression EXP in basic block BB.  If it's an interesting
-   expression (an MEM_REF through an SSA_NAME) possibly insert the
-   expression into the set NONTRAP or the hash table of seen expressions.
-   STORE is true if this expression is on the LHS, otherwise it's on
-   the RHS.  */
+   expression of:
+     1) MEM_REF
+     2) ARRAY_REF
+     3) COMPONENT_REF
+   possibly insert the expression into the set NONTRAP or the hash table
+   of seen expressions.  STORE is true if this expression is on the LHS,
+