[PATCH 5/5] RISC-V: tree-optimization/65518 - extend fix to SLP

2024-05-24 Thread Richard Biener
This extends the PR65518 workaround to also apply for single-lane SLP.
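
For reference, a hand-written sketch (not taken from the PR) of the
kind of loop the check punts on - a single-element interleaving access
whose group size exceeds the number of vector lanes, so a contiguous
vectorized load would leave most lanes unused:

int a[1024], b[128];

void
f (void)
{
  /* Stride-8 single-element access: assuming 4-lane int vectors, the
     group size (8) exceeds TYPE_VECTOR_SUBPARTS, so most of each
     vector load would be dead - the PR65518 situation.  */
  for (int i = 0; i < 128; ++i)
    b[i] = a[i * 8];
}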

* tree-vect-stmts.cc (get_group_load_store_type): For SLP also
check for the PR65518 single-element interleaving case as done in
vect_grouped_load_supported.
---
 gcc/tree-vect-stmts.cc | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 1c30a0388ca..a01099d3456 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2154,6 +2154,23 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
}
  overrun_p = true;
}
+
+ /* If this is single-element interleaving with an element
+distance that leaves unused vector loads around punt - we
+at least create very sub-optimal code in that case (and
+blow up memory, see PR65518).  */
+ if (loop_vinfo
+ && *memory_access_type == VMAT_CONTIGUOUS
+ && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
+ && single_element_p
+ && maybe_gt (group_size, TYPE_VECTOR_SUBPARTS (vectype)))
+   {
+ if (dump_enabled_p ())
+   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+"single-element interleaving not supported "
+"for not adjacent vector loads\n");
+ return false;
+   }
}
 }
   else
-- 
2.35.3


[PATCH 4/5] Allow optimized SLP reduction epilog with single-lane reductions

2024-05-24 Thread Richard Biener
This extends optimized reduction epilog handling to cover the
trivial single-lane SLP reduction case.
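
For intuition, a hand-written C analogue (using generic vector
extensions; the real epilog emits whole-vector shifts or a direct
internal function such as .REDUC_PLUS) of the shift-based scheme now
also used for single-lane SLP reductions:

typedef int v4si __attribute__ ((vector_size (16)));

int
reduce (v4si v)               /* v = {s0, s1, s2, s3} */
{
  /* log2 (nunits) shift-and-add steps, modeled here as permutes.  */
  v4si t = __builtin_shufflevector (v, v, 2, 3, 2, 3); /* down 2 lanes */
  v += t;
  t = __builtin_shufflevector (v, v, 1, 1, 1, 1);      /* down 1 lane */
  v += t;
  return v[0];                /* s0 + s1 + s2 + s3 */
}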

* tree-vect-loop.cc (vect_create_epilog_for_reduction): Allow
direct opcode and shift reduction also for SLP reductions
with a single lane.
---
 gcc/tree-vect-loop.cc | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 83c0544b6aa..31abfe047a4 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -6500,7 +6500,7 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
   /* 2.3 Create the reduction code, using one of the three schemes described
  above. In SLP we simply need to extract all the elements from the 
  vector (without reducing them), so we use scalar shifts.  */
-  else if (reduc_fn != IFN_LAST && !slp_reduc)
+  else if (reduc_fn != IFN_LAST && (!slp_reduc || group_size == 1))
 {
   tree tmp;
   tree vec_elem_type;
@@ -6670,7 +6670,7 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
   gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);
   reduc_inputs[0] = new_temp;
 
-  if (reduce_with_shift && !slp_reduc)
+  if (reduce_with_shift && (!slp_reduc || group_size == 1))
{
  int element_bitsize = tree_to_uhwi (bitsize);
  /* Enforced by vectorizable_reduction, which disallows SLP reductions
-- 
2.35.3



[PATCH 3/5] Reduce single-lane SLP testresult noise

2024-05-24 Thread Richard Biener
The following avoids dumping 'vectorizing stmts using SLP' for
single-lane instances since that causes extra testsuite fallout.

* tree-vect-slp.cc (vect_schedule_slp): Gate dumping
'vectorizing stmts using SLP' on > 1 lanes.
---
 gcc/tree-vect-slp.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 73cc69d85ce..ebb71c209eb 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -10098,7 +10098,8 @@ vect_schedule_slp (vec_info *vinfo, const vec<slp_instance> &slp_instances)
   if (!SLP_INSTANCE_ROOT_STMTS (instance).is_empty ())
vectorize_slp_instance_root_stmt (node, instance);
 
-  if (dump_enabled_p ())
+  /* ???  Reduce some testsuite noise because of "more SLP".  */
+  if (SLP_TREE_LANES (node) > 1 && dump_enabled_p ())
dump_printf_loc (MSG_NOTE, vect_location,
  "vectorizing stmts using SLP.\n");
 }
-- 
2.35.3



[PATCH 2/5] Avoid bogus SLP outer loop vectorization

2024-05-24 Thread Richard Biener
This fixes the check for multiple types, which I think goes wrong
because of bogus pointer IV increments when there are multiple
copies of vector stmts in the inner loop.
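
A hand-written sketch of the shape now rejected (whether the inner
loop actually needs more than one vector stmt copy depends on the
target vector size and element type, so this is only illustrative):

double a[16][64];

void
f (double x)
{
  for (int i = 0; i < 64; ++i)      /* outer loop, vectorized */
    for (int j = 0; j < 16; ++j)    /* inner loop */
      a[j][i] = a[j][i] * x;
}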

* tree-vect-stmts.cc (vectorizable_load): Avoid outer loop
SLP vectorization with multi-copy vector stmts in the inner
loop.
(vectorizable_store): Likewise.
---
 gcc/tree-vect-stmts.cc | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 4219ad832db..1c30a0388ca 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -8196,7 +8196,9 @@ vectorizable_store (vec_info *vinfo,
   gcc_assert (ncopies >= 1);
 
   /* FORNOW.  This restriction should be relaxed.  */
-  if (loop && nested_in_vect_loop_p (loop, stmt_info) && ncopies > 1)
+  if (loop
+  && nested_in_vect_loop_p (loop, stmt_info)
+  && (ncopies > 1 || (slp && SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) > 1)))
 {
   if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -9939,7 +9941,8 @@ vectorizable_load (vec_info *vinfo,
   gcc_assert (ncopies >= 1);
 
   /* FORNOW. This restriction should be relaxed.  */
-  if (nested_in_vect_loop && ncopies > 1)
+  if (nested_in_vect_loop
+  && (ncopies > 1 || (slp && SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) > 1)))
 {
   if (dump_enabled_p ())
 dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-- 
2.35.3



[PATCH 1/5] Do single-lane SLP discovery for reductions

2024-05-24 Thread Richard Biener
This is the second merge proposed from the SLP vectorizer branch.
I have again managed without adding and using --param vect-single-lane-slp
but instead this provides always enabled functionality.

This makes us use SLP reductions (a group of reductions) for the
case where the group size is one.  This basically means we try
to use SLP for all reductions.
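
For illustration (hand-written examples, not from the patch), a
reduction group of size two versus the single-lane case that now also
goes through SLP discovery:

/* Group of two reductions - already an SLP reduction before.  */
void
f (const int *a, int n, int *out)
{
  int s0 = 0, s1 = 0;
  for (int i = 0; i < 2 * n; i += 2)
    {
      s0 += a[i];
      s1 += a[i + 1];
    }
  out[0] = s0;
  out[1] = s1;
}

/* Single reduction - with this change discovered as a single-lane
   SLP reduction instead of taking the non-SLP path.  */
int
g (const int *a, int n)
{
  int s = 0;
  for (int i = 0; i < n; ++i)
    s += a[i];
  return s;
}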

I've kept the series close to how the changes are on the branch,
but in the end I'll squash it; having separate commits for review
may help identify common issues we will run into.  In
particular we lack full SLP support for several reduction kinds
and the branch has more enabling patches than in this series.
For example 4/5 makes sure we use shifts and direct opcode
reductions in the reduction epilog for SLP reductions, but doesn't
try to cover the general case; it enables this only for the
single-element group case to avoid regressions in the
gcc.dg/vect/reduc-{mul,or}_[12].c testcases.

Bootstrapped and tested on x86_64-unknown-linux-gnu, I've also
successfully built SPEC CPU 2017.  This posting should trigger
arm & riscv pre-checkin CI.

There's one ICE in gcc.target/i386/pr51235.c I discovered late
that I will investigate and address after the weekend.

This change should be more straightforward than the previous one;
still, comments are of course welcome.  After it is pushed I will follow up
with changes to enable single-lane SLP reductions for various
COND_EXPR reductions as well as double-reduction support and
in-order reduction support (also all restricted to single-lane
for the moment).

Thanks,
Richard.

--

The following performs single-lane SLP discovery for reductions.
This exposes a latent issue with reduction SLP in outer loop
vectorization and makes gcc.dg/vect/vect-outer-4[fgkl].c FAIL
execution.

* tree-vect-slp.cc (vect_build_slp_tree_2): Only multi-lane
discoveries are reduction chains and need special backedge
treatment.
(vect_analyze_slp): Fall back to single-lane SLP discovery
for reductions.  Make sure to try single-lane SLP reduction
for all reductions as fallback.
---
 gcc/tree-vect-slp.cc | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 54 insertions(+), 17 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index c7ed520b629..73cc69d85ce 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -1907,7 +1907,8 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
/* Reduction chain backedge defs are filled manually.
   ???  Need a better way to identify a SLP reduction chain PHI.
   Or a better overall way to SLP match those.  */
-   if (all_same && def_type == vect_reduction_def)
+   if (stmts.length () > 1
+   && all_same && def_type == vect_reduction_def)
  skip_args[loop_latch_edge (loop)->dest_idx] = true;
  }
else if (def_type != vect_internal_def)
@@ -3905,9 +3906,10 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
  }
 
   /* Find SLP sequences starting from groups of reductions.  */
-  if (loop_vinfo->reductions.length () > 1)
+  if (loop_vinfo->reductions.length () > 0)
{
- /* Collect reduction statements.  */
+ /* Collect reduction statements we can combine into
+a SLP reduction.  */
  vec<stmt_vec_info> scalar_stmts;
  scalar_stmts.create (loop_vinfo->reductions.length ());
  for (auto next_info : loop_vinfo->reductions)
@@ -3920,25 +3922,60 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
 reduction path.  In that case we'd have to reverse
 engineer that conversion stmt following the chain using
 reduc_idx and from the PHI using reduc_def.  */
- && STMT_VINFO_DEF_TYPE (next_info) == vect_reduction_def
- /* Do not discover SLP reductions for lane-reducing ops, that
-will fail later.  */
- && (!(g = dyn_cast <gassign *> (STMT_VINFO_STMT (next_info)))
+ && STMT_VINFO_DEF_TYPE (next_info) == vect_reduction_def)
+   {
+ /* Do not discover SLP reductions combining lane-reducing
+ops, that will fail later.  */
+ if (!(g = dyn_cast <gassign *> (STMT_VINFO_STMT (next_info)))
  || (gimple_assign_rhs_code (g) != DOT_PROD_EXPR
  && gimple_assign_rhs_code (g) != WIDEN_SUM_EXPR
- && gimple_assign_rhs_code (g) != SAD_EXPR)))
-   scalar_stmts.quick_push (next_info);
+ && gimple_assign_rhs_code (g) != SAD_EXPR))
+   scalar_stmts.quick_push (next_info);
+ else
+   {
+ /* Do SLP discovery for single-lane reductions.  */
+ vec<stmt_vec_info> stmts;
+ vec<stmt_vec_info> roots = vNULL;
+  

Re: [PATCH] vect: Fix access size alignment assumption [PR115192]

2024-05-24 Thread Richard Biener
On Fri, May 24, 2024 at 2:35 PM Richard Sandiford wrote:
>
> create_intersect_range_checks checks whether two access ranges
> a and b are alias-free using something equivalent to:
>
>   end_a <= start_b || end_b <= start_a
>
> It has two ways of doing this: a "vanilla" way that calculates
> the exact exclusive end pointers, and another way that uses the
> last inclusive aligned pointers (and changes the comparisons
> accordingly).  The comment for the latter is:
>
>   /* Calculate the minimum alignment shared by all four pointers,
>  then arrange for this alignment to be subtracted from the
>  exclusive maximum values to get inclusive maximum values.
>  This "- min_align" is cumulative with a "+ access_size"
>  in the calculation of the maximum values.  In the best
>  (and common) case, the two cancel each other out, leaving
>  us with an inclusive bound based only on seg_len.  In the
>  worst case we're simply adding a smaller number than before.
>
> The problem is that the associated code implicitly assumed that the
> access size was a multiple of the pointer alignment, and so the
> alignment could be carried over to the exclusive end pointer.
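
To make the failure mode concrete (numbers invented for illustration):
if both segments start 16-byte aligned and seg_len is a multiple of 16
but access_size is 4, then the exclusive end start + seg_len + 4 is
only 4-byte aligned.  Subtracting min_align = 16 to form the inclusive
maximum moves the bound below addresses the access still touches, so
the check can wrongly conclude "no alias".  Clamping min_align by
known_alignment (access_size) = 4 keeps the inclusive bound correct.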
>
> The testcase started failing after g:9fa5b473b5b8e289b6542
> because that commit improved the alignment information for
> the accesses.
>
> Tested on aarch64-linux-gnu & x86_64-linux-gnu.  OK for trunk
> and backports?

OK.

Thanks,
Richard.

> Richard
>
>
> gcc/
> PR tree-optimization/115192
> * tree-data-ref.cc (create_intersect_range_checks): Take the
> alignment of the access sizes into account.
>
> gcc/testsuite/
> PR tree-optimization/115192
> * gcc.dg/vect/pr115192.c: New test.
> ---
>  gcc/testsuite/gcc.dg/vect/pr115192.c | 28 ++++++++++++++++++++++++++++
>  gcc/tree-data-ref.cc                 |  5 ++++-
>  2 files changed, 32 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr115192.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c 
> b/gcc/testsuite/gcc.dg/vect/pr115192.c
> new file mode 100644
> index 000..923d377c1bb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/pr115192.c
> @@ -0,0 +1,28 @@
> +#include "tree-vect.h"
> +
> +int data[4 * 16 * 16] __attribute__((aligned(16)));
> +
> +__attribute__((noipa)) void
> +foo (__SIZE_TYPE__ n)
> +{
> +  for (__SIZE_TYPE__ i = 1; i < n; ++i)
> +{
> +  data[i * n * 4] = data[(i - 1) * n * 4] + 1;
> +  data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2;
> +}
> +}
> +
> +int
> +main ()
> +{
> +  check_vect ();
> +
> +  data[0] = 10;
> +  data[1] = 20;
> +
> +  foo (3);
> +
> +  if (data[24] != 12 || data[25] != 24)
> +__builtin_abort ();
> +  return 0;
> +}
> diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc
> index db15ddb43de..7c4049faf34 100644
> --- a/gcc/tree-data-ref.cc
> +++ b/gcc/tree-data-ref.cc
> @@ -73,6 +73,7 @@ along with GCC; see the file COPYING3.  If not see
>
>  */
>
> +#define INCLUDE_ALGORITHM
>  #include "config.h"
>  #include "system.h"
>  #include "coretypes.h"
> @@ -2640,7 +2641,9 @@ create_intersect_range_checks (class loop *loop, tree 
> *cond_expr,
>  Because the maximum values are inclusive, there is an alias
>  if the maximum value of one segment is equal to the minimum
>  value of the other.  */
> -  min_align = MIN (dr_a.align, dr_b.align);
> +  min_align = std::min (dr_a.align, dr_b.align);
> +  min_align = std::min (min_align, known_alignment (dr_a.access_size));
> +  min_align = std::min (min_align, known_alignment (dr_b.access_size));
>cmp_code = LT_EXPR;
>  }
>
> --
> 2.25.1
>


Re: [PATCH] tree-ssa-pre.c/1071140(ICE in find_or_generate_expression, at tree-ssa-pre.c:2780): Return NULL_TREE if no equal.

2024-05-24 Thread Richard Biener
On Fri, May 24, 2024 at 1:49 PM Jiawei  wrote:
>
> An ICE bug reported in
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071140.
> https://godbolt.org/z/WE9aGYvoo
>
> Return NULL_TREE when TREE_CODE (op) is not SSA_NAME.

The assert is on purpose.  Can you open a GCC bug for this please?  It looks
like we have unfolded POLY_INT_CST [16, 16] /[ex] 16 here.

It seems that

/* We can't always put a size in units of the element alignment
   here as the element alignment may be not visible.  See
   PR43783.  Simply drop the element size for constant
   sizes.  */
if (TREE_CODE (genop3) == INTEGER_CST
&& TREE_CODE (TYPE_SIZE_UNIT (elmt_type)) == INTEGER_CST
&& wi::eq_p (wi::to_offset (TYPE_SIZE_UNIT (elmt_type)),
 (wi::to_offset (genop3)
  * vn_ref_op_align_unit (currop
  genop3 = NULL_TREE;

fails to match the POLY_INT case - the unit alignment is 16 here.  One
possibility would be to match the EXACT_DIV_EXPR case and the
INTEGER_CST divisor to vn_ref_op_align_unit and the other half
separately.  But maybe this can be written in a "proper" way?

The EXACT_DIV_EXPR is built by copy_reference_ops_from_ref,
I suppose SVE could be similarly affected.
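
A rough, untested sketch of that suggestion (for context, POLY_INT_CST
[16, 16] stands for 16 + 16 * x with x the runtime vector-length
factor, so the exact division by the 16-byte unit alignment would give
[1, 1] if it were folded; all names below are from the quoted snippet):

	    /* Hypothetical additional match: the unfolded
	       SIZE /[ex] ALIGN form that copy_reference_ops_from_ref
	       builds for poly-int sizes.  */
	    else if (TREE_CODE (genop3) == EXACT_DIV_EXPR
		     && TREE_CODE (TREE_OPERAND (genop3, 1)) == INTEGER_CST
		     && operand_equal_p (TREE_OPERAND (genop3, 0),
					 TYPE_SIZE_UNIT (elmt_type))
		     && tree_to_uhwi (TREE_OPERAND (genop3, 1))
			== vn_ref_op_align_unit (currop))
	      genop3 = NULL_TREE;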

Richard.

> gcc/ChangeLog:
>
> * tree-ssa-pre.cc (find_or_generate_expression): Remove assert.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/riscv/rvv/vsetvl/pr1071140.c: New test.
>
> ---
>  .../gcc.target/riscv/rvv/vsetvl/pr1071140.c   | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  gcc/tree-ssa-pre.cc                           |  4 +++-
>  2 files changed, 55 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr1071140.c
>
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr1071140.c 
> b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr1071140.c
> new file mode 100644
> index 000..4f0815e099f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr1071140.c
> @@ -0,0 +1,52 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mrvv-vector-bits=scalable -march=rv64gcv -mabi=lp64d -O3 
> -w" } */
> +
> +#include <riscv_vector.h>
> +
> +static inline __attribute__(()) int vaddq_f32();
> +static inline __attribute__(()) int vload_tillz_f32(int nlane) {
> +  vint32m1_t __trans_tmp_9;
> +  {
> +int __trans_tmp_0 = nlane;
> +{
> +  vint64m1_t __trans_tmp_1;
> +  vint64m1_t __trans_tmp_2;
> +  vint64m1_t __trans_tmp_3;
> +  vint64m1_t __trans_tmp_4;
> +  if (__trans_tmp_0 == 1) {
> +{
> +  __trans_tmp_3 =
> +  __riscv_vslideup_vx_i64m1(__trans_tmp_1, __trans_tmp_2, 1, 2);
> +}
> +__trans_tmp_4 = __trans_tmp_2;
> +  }
> +  __trans_tmp_4 = __trans_tmp_3;
> +  __trans_tmp_9 = __riscv_vreinterpret_v_i64m1_i32m1(__trans_tmp_3);
> +}
> +  }
> +  return vaddq_f32(__trans_tmp_9); /* { dg-error {RVV type 'vint32m1_t' 
> cannot be passed to an unprototyped function} } */
> +}
> +
> +char CFLOAT_add_args[3];
> +const int *CFLOAT_add_steps;
> +const int CFLOAT_steps;
> +
> +__attribute__(()) void CFLOAT_add() {
> +  char *b_src0 = &CFLOAT_add_args[0], *b_src1 = &CFLOAT_add_args[1],
> +   *b_dst = &CFLOAT_add_args[2];
> +  const float *src1 = (float *)b_src1;
> +  float *dst = (float *)b_dst;
> +  const int ssrc1 = CFLOAT_add_steps[1] / sizeof(float);
> +  const int sdst = CFLOAT_add_steps[2] / sizeof(float);
> +  const int hstep = 4 / 2;
> +  vfloat32m1x2_t a;
> +  int len = 255;
> +  for (; len > 0; len -= hstep, src1 += 4, dst += 4) {
> +int b = vload_tillz_f32(len);
> +int r = vaddq_f32(a.__val[0], b); /* { dg-error {RVV type 
> '__rvv_float32m1_t' cannot be passed to an unprototyped function} } */
> +  }
> +  for (; len > 0; --len, b_src0 += CFLOAT_steps,
> +  b_src1 += CFLOAT_add_steps[1], b_dst += 
> CFLOAT_add_steps[2])
> +;
> +}
> +
> diff --git a/gcc/tree-ssa-pre.cc b/gcc/tree-ssa-pre.cc
> index 75217f5cde1..e3d9c47f96b 100644
> --- a/gcc/tree-ssa-pre.cc
> +++ b/gcc/tree-ssa-pre.cc
> @@ -2777,7 +2777,9 @@ find_or_generate_expression (basic_block block, tree op, gimple_seq *stmts)
>if (is_gimple_min_invariant (op))
>  return op;
>
> -  gcc_assert (TREE_CODE (op) == SSA_NAME);
> +  if (TREE_CODE (op) != SSA_NAME)
> +return NULL_TREE;
> +
>vn_ssa_aux_t info = VN_INFO (op);
>unsigned int lookfor = info->value_id;
>if (value_id_constant_p (lookfor))
> --
> 2.25.1
>


Re: [RFC/PATCH] Replace {FLOAT, {, LONG_}DOUBLE}_TYPE_SIZE with new hook

2024-05-24 Thread Richard Biener
On Fri, May 24, 2024 at 12:20 PM Kewen.Lin  wrote:
>
> Hi Joseph and Richi,
>
> on 2024/5/13 21:18, Joseph Myers wrote:
> > On Mon, 13 May 2024, Kewen.Lin wrote:
> >
> >>> In fact replacing all of X_TYPE_SIZE with a single hook might be 
> >>> worthwhile
> >>> though this removes the "convenient" defaulting, requiring each target to
> >>> enumerate all standard C ABI type modes.  But that might be also a good 
> >>> thing.
> >>>
> >>
> >> I guess the main value by extending from floating point types to all is to
> >> unify them?  (Assuming that excepting for floating types the others would
> >> not have multiple possible representations like what we faces on 128bit 
> >> fp).
> >
> > For integer types, giving the number of bits makes sense as an interface -
> > there isn't an issue with different modes.
> >
> > So I think it's appropriate for floating and integer types to have
> > separate hooks - with the one for floating types returning a mode, and the
> > one for integer types returning a number of bits.  (And also keep the
> > existing separate hook for _FloatN / _FloatNx modes.)
> >
> > That may also make for more convenient defaults (whether a target has long
> > double wider than double is largely independent of what sizes it uses for
> > integer types).
> >
>
> Following your suggestion and comments, I made this patch
> for mode_for_floating_type first.  Considering this touches
> a fair amount of FE- and port-specific code, I think I have to split
> it into a patch series.  Before making that, I'd like to
> ensure this meets what you expected, and also seek
> suggestions on how to organize the sub-patches.  There seem
> to be two ways:
>   1) split this into pieces according to FEs and ports, and
>  squash all of them and commit one patch.
>   2) extract all hook implementation as 1st series (split
>  as ports);
>  extract the hook enablement as 2nd part (split as
>  generic and FEs);
>  the remaining is to remove useless macros (split it
>  as generic and ports);
>
> The 1) is straightforward, while the 2) is fine-grained and
> easy to isolate, but I'm not sure it's worth doing.
>
> btw, the attached patch is bootstrapped and regtested on
> powerpc64-linux-gnu and powerpc64le-linux-gnu with all
> languages on, cross cc1 built well for affected ports.

Looks reasonable to me - I'd split language changes out but
keep target and middle-end together.  The middle-end parts
look good to me - I'm always a bit nervous when using
size and precision interchangeably, esp. for FP, but it seems
this has been done before.

I hope Joseph will eye that part as well.

Thanks for doing this,
Richard.

> BR,
> Kewen
>
> -


[gcc r15-816] Fix gcc.dg/vect/vect-gather-4.c for cascadelake

2024-05-24 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:85e2ce10f76aee93e43aab6558cf8e39cec911e4

commit r15-816-g85e2ce10f76aee93e43aab6558cf8e39cec911e4
Author: Richard Biener 
Date:   Fri May 24 13:15:38 2024 +0200

Fix gcc.dg/vect/vect-gather-4.c for cascadelake

There's not really a good way to test what the testcase wants to
test; the following exchanges one dump scan for another (imperfect)
one.

* gcc.dg/vect/vect-gather-4.c: Scan for not vectorizing using
SLP.

Diff:
---
 gcc/testsuite/gcc.dg/vect/vect-gather-4.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/vect-gather-4.c 
b/gcc/testsuite/gcc.dg/vect/vect-gather-4.c
index 1ce63e69199..d18094d6982 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-gather-4.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-gather-4.c
@@ -45,4 +45,4 @@ f3 (int *restrict y, int *restrict x, int *restrict indices)
 }
 }
 
-/* { dg-final { scan-tree-dump-not "Loop contains only SLP stmts" vect } } */
+/* { dg-final { scan-tree-dump-not "vectorizing stmts using SLP" vect } } */


[PATCH] Fix gcc.dg/vect/vect-gather-4.c for cascadelake

2024-05-24 Thread Richard Biener
There's not really a good way to test what the testcase wants to
test; the following exchanges one dump scan for another (imperfect)
one.

Pushed.

* gcc.dg/vect/vect-gather-4.c: Scan for not vectorizing using
SLP.
---
 gcc/testsuite/gcc.dg/vect/vect-gather-4.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/vect-gather-4.c 
b/gcc/testsuite/gcc.dg/vect/vect-gather-4.c
index 1ce63e69199..d18094d6982 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-gather-4.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-gather-4.c
@@ -45,4 +45,4 @@ f3 (int *restrict y, int *restrict x, int *restrict indices)
 }
 }
 
-/* { dg-final { scan-tree-dump-not "Loop contains only SLP stmts" vect } } */
+/* { dg-final { scan-tree-dump-not "vectorizing stmts using SLP" vect } } */
-- 
2.35.3


Re: [PATCH v2] MATCH: Look through VIEW_CONVERT when folding VEC_PERM_EXPRs.

2024-05-24 Thread Richard Biener
On Fri, 24 May 2024, Manolis Tsamis wrote:

> The match.pd patterns to merge two vector permutes into one fail when a
> potentially no-op view convert expression is between the two permutes.
> This change lifts this restriction.

OK.

> gcc/ChangeLog:
> 
>   * match.pd: Allow no-op view_convert between permutes.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/fold-perm-2.c: New test.
> 
> Signed-off-by: Manolis Tsamis 
> ---
> 
> Changes in v2:
> Use TYPE_SIZE (TREE_TYPE (TREE_TYPE (@))) instead of element_precision (@).
> 
>  gcc/match.pd                       | 14 ++++++++------
>  gcc/testsuite/gcc.dg/fold-perm-2.c | 16 ++++++++++++++++
>  2 files changed, 24 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/fold-perm-2.c
> 
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 07e743ae464..1f91b9857c8 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -10039,19 +10039,21 @@ and,
>   d = VEC_PERM_EXPR <a, b, NEW_VCST>;  */
>  
>  (simplify
> - (vec_perm (vec_perm@0 @1 @2 VECTOR_CST@3) @0 VECTOR_CST@4)
> + (vec_perm (view_convert?@0 (vec_perm@1 @2 @3 VECTOR_CST@4)) @0 VECTOR_CST@5)
>   (if (TYPE_VECTOR_SUBPARTS (type).is_constant ())
>(with
> {
>   machine_mode result_mode = TYPE_MODE (type);
> - machine_mode op_mode = TYPE_MODE (TREE_TYPE (@1));
> + machine_mode op_mode = TYPE_MODE (TREE_TYPE (@2));
>   int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
>   vec_perm_builder builder0;
>   vec_perm_builder builder1;
>   vec_perm_builder builder2 (nelts, nelts, 1);
> }
> -   (if (tree_to_vec_perm_builder (&builder0, @3)
> - && tree_to_vec_perm_builder (&builder1, @4))
> +   (if (tree_to_vec_perm_builder (&builder0, @4)
> + && tree_to_vec_perm_builder (&builder1, @5)
> + && TYPE_SIZE (TREE_TYPE (TREE_TYPE (@0)))
> +== TYPE_SIZE (TREE_TYPE (TREE_TYPE (@1))))
>  (with
>   {
> vec_perm_indices sel0 (builder0, 2, nelts);
> @@ -10073,10 +10075,10 @@ and,
>  ? (!can_vec_perm_const_p (result_mode, op_mode, sel0, false)
> || !can_vec_perm_const_p (result_mode, op_mode, sel1, false))
>  : !can_vec_perm_const_p (result_mode, op_mode, sel1, false)))
> -  op0 = vec_perm_indices_to_tree (TREE_TYPE (@4), sel2);
> +  op0 = vec_perm_indices_to_tree (TREE_TYPE (@5), sel2);
>   }
>   (if (op0)
> -  (vec_perm @1 @2 { op0; })))
> +  (view_convert (vec_perm @2 @3 { op0; }
>  
>  /* Merge
>   c = VEC_PERM_EXPR ;
> diff --git a/gcc/testsuite/gcc.dg/fold-perm-2.c 
> b/gcc/testsuite/gcc.dg/fold-perm-2.c
> new file mode 100644
> index 000..1a4ab4065de
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/fold-perm-2.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O -fdump-tree-fre1" } */
> +
> +typedef int veci __attribute__ ((vector_size (4 * sizeof (int;
> +typedef unsigned int vecu __attribute__ ((vector_size (4 * sizeof (unsigned 
> int;
> +
> +void fun (veci *a, veci *b, veci *c)
> +{
> +  veci r1 = __builtin_shufflevector (*a, *b, 0, 5, 2, 7);
> +  vecu r2 = __builtin_convertvector (r1, vecu);
> +  vecu r3 = __builtin_shufflevector (r2, r2, 2, 3, 1, 0);
> +  *c = __builtin_convertvector (r3, veci);
> +}
> +
> +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 7, 5, 0 }" "fre1" } } */
> +/* { dg-final { scan-tree-dump-times "VEC_PERM_EXPR" 1 "fre1" } } */
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


[gcc r15-815] tree-optimization/115144 - improve sinking destination choice

2024-05-24 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:5b9b3bae33cae7fca2e3c3e3028be6b8bee9b698

commit r15-815-g5b9b3bae33cae7fca2e3c3e3028be6b8bee9b698
Author: Richard Biener 
Date:   Wed May 22 09:16:51 2024 +0200

tree-optimization/115144 - improve sinking destination choice

When sinking code closer to its uses we already try to minimize the
distance we move by inserting at the start of the basic-block.  The
following makes sure to sink closest to the control dependence
check of the region we want to sink to as well as make sure to
ignore control dependences that are only guarding exceptional code.
This restores somewhat the old profile check but without requiring
nearly even probabilities.  The patch also makes sure to not give
up completely when the best sink location is one we do not want to
sink to but possibly then choose the next best one.

PR tree-optimization/115144
* tree-ssa-sink.cc (do_not_sink): New function, split out
from ...
(select_best_block): Here.  First pick valid block to
sink to.  From that search for the best valid block,
avoiding sinking across conditions to exceptional code.
(sink_code_in_bb): When updating vuses of stores in
paths we do not sink a store to make sure we didn't
pick a dominating sink location.

* gcc.dg/tree-ssa/ssa-sink-22.c: New testcase.

Diff:
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c |  14 ++++++++++++++
 gcc/tree-ssa-sink.cc                        | 106 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------
 2 files changed, 86 insertions(+), 34 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
new file mode 100644
index 000..e35626d4070
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink1-details" } */
+
+extern void abort (void);
+
+int foo (int x, int y, int f)
+{
+  int tem = x / y;
+  if (f)
+abort ();
+  return tem;
+}
+
+/* { dg-final { scan-tree-dump-not "Sinking" "sink1" } } */
diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
index 2188b7523c7..b0fe871cf1e 100644
--- a/gcc/tree-ssa-sink.cc
+++ b/gcc/tree-ssa-sink.cc
@@ -172,6 +172,39 @@ nearest_common_dominator_of_uses (def_operand_p def_p, bool *debug_stmts)
   return commondom;
 }
 
+/* Return whether sinking STMT from EARLY_BB to BEST_BB should be avoided.  */
+
+static bool
+do_not_sink (gimple *stmt, basic_block early_bb, basic_block best_bb)
+{
+  /* Placing a statement before a setjmp-like function would be invalid
+ (it cannot be reevaluated when execution follows an abnormal edge).
+ If we selected a block with abnormal predecessors, just punt.  */
+  if (bb_has_abnormal_pred (best_bb))
+return true;
+
+  /* If the latch block is empty, don't make it non-empty by sinking
+ something into it.  */
+  if (best_bb == early_bb->loop_father->latch
+  && empty_block_p (best_bb))
+return true;
+
+  /* Avoid turning an unconditional read into a conditional one when we
+ still might want to perform vectorization.  */
+  if (best_bb->loop_father == early_bb->loop_father
+  && loop_outer (best_bb->loop_father)
+  && !best_bb->loop_father->inner
+  && gimple_vuse (stmt)
+  && !gimple_vdef (stmt)
+  && flag_tree_loop_vectorize
+  && !(cfun->curr_properties & PROP_loop_opts_done)
+  && dominated_by_p (CDI_DOMINATORS, best_bb->loop_father->latch, early_bb)
+  && !dominated_by_p (CDI_DOMINATORS, best_bb->loop_father->latch, best_bb))
+return true;
+
+  return false;
+}
+
 /* Given EARLY_BB and LATE_BB, two blocks in a path through the dominator
tree, return the best basic block between them (inclusive) to place
statements.
@@ -185,54 +218,57 @@ select_best_block (basic_block early_bb,
   basic_block late_bb,
   gimple *stmt)
 {
+  /* First pick a block we do not disqualify.  */
+  while (late_bb != early_bb
+&& do_not_sink (stmt, early_bb, late_bb))
+late_bb = get_immediate_dominator (CDI_DOMINATORS, late_bb);
+
   basic_block best_bb = late_bb;
   basic_block temp_bb = late_bb;
-
   while (temp_bb != early_bb)
 {
   /* Walk up the dominator tree, hopefully we'll find a shallower
 loop nest.  */
   temp_bb = get_immediate_dominator (CDI_DOMINATORS, temp_bb);
 
+  /* Do not consider blocks we do not want to sink to.  */
+  if (temp_bb != early_bb && do_not_sink (stmt, early_bb, temp_bb))
+   ;
+
   /* If we've moved into a lower loop nest, then that becomes
 our best block.  */
-  if (bb_loop_depth (temp_bb) < bb_loop_depth (best_bb))
+  else if (bb_loop_depth (temp_bb) < bb_loop_depth (b

[PATCH][v2] tree-optimization/115144 - improve sinking destination choice

2024-05-24 Thread Richard Biener
When sinking code closer to its uses we already try to minimize the
distance we move by inserting at the start of the basic-block.  The
following makes sure to sink closest to the control dependence
check of the region we want to sink to as well as make sure to
ignore control dependences that are only guarding exceptional code.
This restores somewhat the old profile check but without requiring
nearly even probabilities.  The patch also makes sure to not give
up completely when the best sink location is one we do not want to
sink to but possibly then choose the next best one.

This addresses fallout observed in building libgo.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

PR tree-optimization/115144
* tree-ssa-sink.cc (do_not_sink): New function, split out
from ...
(select_best_block): Here.  First pick valid block to
sink to.  From that search for the best valid block,
avoiding sinking across conditions to exceptional code.
(sink_code_in_bb): When updating vuses of stores in
paths we do not sink a store to make sure we didn't
pick a dominating sink location.

* gcc.dg/tree-ssa/ssa-sink-22.c: New testcase.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c |  14 ++++++++++++++
 gcc/tree-ssa-sink.cc                        | 106 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------
 2 files changed, 86 insertions(+), 34 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
new file mode 100644
index 000..e35626d4070
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink1-details" } */
+
+extern void abort (void);
+
+int foo (int x, int y, int f)
+{
+  int tem = x / y;
+  if (f)
+abort ();
+  return tem;
+}
+
+/* { dg-final { scan-tree-dump-not "Sinking" "sink1" } } */
diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
index 2188b7523c7..b0fe871cf1e 100644
--- a/gcc/tree-ssa-sink.cc
+++ b/gcc/tree-ssa-sink.cc
@@ -172,6 +172,39 @@ nearest_common_dominator_of_uses (def_operand_p def_p, bool *debug_stmts)
   return commondom;
 }
 
+/* Return whether sinking STMT from EARLY_BB to BEST_BB should be avoided.  */
+
+static bool
+do_not_sink (gimple *stmt, basic_block early_bb, basic_block best_bb)
+{
+  /* Placing a statement before a setjmp-like function would be invalid
+ (it cannot be reevaluated when execution follows an abnormal edge).
+ If we selected a block with abnormal predecessors, just punt.  */
+  if (bb_has_abnormal_pred (best_bb))
+return true;
+
+  /* If the latch block is empty, don't make it non-empty by sinking
+ something into it.  */
+  if (best_bb == early_bb->loop_father->latch
+  && empty_block_p (best_bb))
+return true;
+
+  /* Avoid turning an unconditional read into a conditional one when we
+ still might want to perform vectorization.  */
+  if (best_bb->loop_father == early_bb->loop_father
+  && loop_outer (best_bb->loop_father)
+  && !best_bb->loop_father->inner
+  && gimple_vuse (stmt)
+  && !gimple_vdef (stmt)
+  && flag_tree_loop_vectorize
+  && !(cfun->curr_properties & PROP_loop_opts_done)
+  && dominated_by_p (CDI_DOMINATORS, best_bb->loop_father->latch, early_bb)
+  && !dominated_by_p (CDI_DOMINATORS, best_bb->loop_father->latch, best_bb))
+return true;
+
+  return false;
+}
+
 /* Given EARLY_BB and LATE_BB, two blocks in a path through the dominator
tree, return the best basic block between them (inclusive) to place
statements.
@@ -185,54 +218,57 @@ select_best_block (basic_block early_bb,
   basic_block late_bb,
   gimple *stmt)
 {
+  /* First pick a block we do not disqualify.  */
+  while (late_bb != early_bb
+&& do_not_sink (stmt, early_bb, late_bb))
+late_bb = get_immediate_dominator (CDI_DOMINATORS, late_bb);
+
   basic_block best_bb = late_bb;
   basic_block temp_bb = late_bb;
-
   while (temp_bb != early_bb)
 {
   /* Walk up the dominator tree, hopefully we'll find a shallower
 loop nest.  */
   temp_bb = get_immediate_dominator (CDI_DOMINATORS, temp_bb);
 
+  /* Do not consider blocks we do not want to sink to.  */
+  if (temp_bb != early_bb && do_not_sink (stmt, early_bb, temp_bb))
+   ;
+
   /* If we've moved into a lower loop nest, then that becomes
 our best block.  */
-  if (bb_loop_depth (temp_bb) < bb_loop_depth (best_bb))
+  else if (bb_loop_depth (temp_bb) < bb_loop_depth (best_bb))
best_bb = temp_bb;
-}
 
-  /* Placing a statement before a setjmp-like function would be invalid
- (it cannot be reevaluated when execution follows an abnormal edge).
- If we selected a block with abnormal predecessors, just punt.  */
-  if (bb_has_abnormal_pred (best_bb))
-return early_bb;
-
-  /* If we found 

Re: [PATCH] MATCH: Look through VIEW_CONVERT when folding VEC_PERM_EXPRs.

2024-05-24 Thread Richard Biener
On Fri, 24 May 2024, Manolis Tsamis wrote:

> On Fri, May 24, 2024 at 10:46 AM Richard Biener  wrote:
> >
> > On Fri, 24 May 2024, Manolis Tsamis wrote:
> >
> > > On Fri, May 24, 2024 at 9:31 AM Richard Biener  wrote:
> > > >
> > > > On Wed, 22 May 2024, Manolis Tsamis wrote:
> > > >
> > > > > The match.pd patterns to merge two vector permutes into one fail when 
> > > > > a
> > > > > potentially no-op view convert expression is between the two
> > > > > permutes.
> > > > > This change lifts this restriction.
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >   * match.pd: Allow no-op view_convert between permutes.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > >   * gcc.dg/fold-perm-2.c: New test.
> > > > >
> > > > > Signed-off-by: Manolis Tsamis 
> > > > > ---
> > > > >
> > > > >  gcc/match.pd   | 14 ++++++++------
> > > > >  gcc/testsuite/gcc.dg/fold-perm-2.c | 16 ++++++++++++++++
> > > > >  2 files changed, 24 insertions(+), 6 deletions(-)
> > > > >  create mode 100644 gcc/testsuite/gcc.dg/fold-perm-2.c
> > > > >
> > > > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > > > index 07e743ae464..cbb3c5d86e0 100644
> > > > > --- a/gcc/match.pd
> > > > > +++ b/gcc/match.pd
> > > > > @@ -10039,19 +10039,21 @@ and,
> > > > >   d = VEC_PERM_EXPR <a, b, NEW_VCST>;  */
> > > > >
> > > > >  (simplify
> > > > > - (vec_perm (vec_perm@0 @1 @2 VECTOR_CST@3) @0 VECTOR_CST@4)
> > > > > + (vec_perm (view_convert?@0 (vec_perm@1 @2 @3 VECTOR_CST@4)) @0 
> > > > > VECTOR_CST@5)
> > > > >   (if (TYPE_VECTOR_SUBPARTS (type).is_constant ())
> > > > >(with
> > > > > {
> > > > >   machine_mode result_mode = TYPE_MODE (type);
> > > > > - machine_mode op_mode = TYPE_MODE (TREE_TYPE (@1));
> > > > > + machine_mode op_mode = TYPE_MODE (TREE_TYPE (@2));
> > > > >   int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > > > >   vec_perm_builder builder0;
> > > > >   vec_perm_builder builder1;
> > > > >   vec_perm_builder builder2 (nelts, nelts, 1);
> > > > > }
> > > > > -   (if (tree_to_vec_perm_builder (&builder0, @3)
> > > > > - && tree_to_vec_perm_builder (&builder1, @4))
> > > > > +   (if (tree_to_vec_perm_builder (&builder0, @4)
> > > > > + && tree_to_vec_perm_builder (&builder1, @5)
> > > > > + && element_precision (TREE_TYPE (@0))
> > > > > +== element_precision (TREE_TYPE (@1)))
> > > >
> > > > I think you want to check TYPE_SIZE (TREE_TYPE (@0/@1)) for equality
> > > > instead.
> > > >
> > >
> > > I think TYPE_SIZE is not enough as we need the vector elements to have
> > > the same size, not just the vector as a whole.
> >
> > Err, yes - you want to check the element sizes of course.
> >
> From what I understand, checking the element size should be enough.
> Otherwise we can check both TYPE_SIZE and element_precision to be
> equal.
> So OK to commit with just element_precision?

Please just check the element size.  I'm always worried when
using TYPE_PRECISION on FP types and for shuffles it's really
only about size.

> BTW I also noticed from these testcases that there is a gcc 13 -> 14
> regression with weird XORs being introduced:
> 
> typedef int veci __attribute__ ((vector_size (4 * sizeof (int;
> void fun (veci *a, veci *b, veci *c) {
>   *c = __builtin_shufflevector (*a, *b, 0, 5, 2, 7);
> }
> 
> gcc 13.3:
>   adrp x3, .LC0
>   ldr q0, [x0]
>   ldr q1, [x1]
>   ldr q2, [x3, #:lo12:.LC0]
>   tbl v0.16b, {v0.16b - v1.16b}, v2.16b
>   str q0, [x2]
> 
> gcc 14.1:
>   ldr q30, [x1]
>   adrp x3, .LC0
>   ldr q31, [x0]
>   ldr q29, [x3, #:lo12:.LC0]
>   eor v31.16b, v31.16b, v30.16b
>   eor v30.16b, v31.16b, v30.16b
>   eor v31.16b, v31.16b, v30.16b
>   tbl v30.16b, {v30.16b - v31.16b}, v29.16b
>   str q30, [x2]

You'd need to bisect that but I'd guess we got some extra
match patterns triggering?

> Manolis
> 
> > > For example, when using the TYPE_SIZE check instead the following
> > > testcase miscompiles
> > >

Re: [PATCH] MATCH: Look through VIEW_CONVERT when folding VEC_PERM_EXPRs.

2024-05-24 Thread Richard Biener
On Fri, 24 May 2024, Manolis Tsamis wrote:

> On Fri, May 24, 2024 at 9:31 AM Richard Biener  wrote:
> >
> > On Wed, 22 May 2024, Manolis Tsamis wrote:
> >
> > > The match.pd patterns to merge two vector permutes into one fail when a
> > > potentially no-op view convert expression is between the two permutes.
> > > This change lifts this restriction.
> > >
> > > gcc/ChangeLog:
> > >
> > >   * match.pd: Allow no-op view_convert between permutes.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >   * gcc.dg/fold-perm-2.c: New test.
> > >
> > > Signed-off-by: Manolis Tsamis 
> > > ---
> > >
> > >  gcc/match.pd   | 14 ++++++++------
> > >  gcc/testsuite/gcc.dg/fold-perm-2.c | 16 ++++++++++++++++
> > >  2 files changed, 24 insertions(+), 6 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.dg/fold-perm-2.c
> > >
> > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > index 07e743ae464..cbb3c5d86e0 100644
> > > --- a/gcc/match.pd
> > > +++ b/gcc/match.pd
> > > @@ -10039,19 +10039,21 @@ and,
> > >   d = VEC_PERM_EXPR <a, b, NEW_VCST>;  */
> > >
> > >  (simplify
> > > - (vec_perm (vec_perm@0 @1 @2 VECTOR_CST@3) @0 VECTOR_CST@4)
> > > + (vec_perm (view_convert?@0 (vec_perm@1 @2 @3 VECTOR_CST@4)) @0 
> > > VECTOR_CST@5)
> > >   (if (TYPE_VECTOR_SUBPARTS (type).is_constant ())
> > >(with
> > > {
> > >   machine_mode result_mode = TYPE_MODE (type);
> > > - machine_mode op_mode = TYPE_MODE (TREE_TYPE (@1));
> > > + machine_mode op_mode = TYPE_MODE (TREE_TYPE (@2));
> > >   int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > >   vec_perm_builder builder0;
> > >   vec_perm_builder builder1;
> > >   vec_perm_builder builder2 (nelts, nelts, 1);
> > > }
> > > -   (if (tree_to_vec_perm_builder (&builder0, @3)
> > > - && tree_to_vec_perm_builder (&builder1, @4))
> > > +   (if (tree_to_vec_perm_builder (&builder0, @4)
> > > + && tree_to_vec_perm_builder (&builder1, @5)
> > > + && element_precision (TREE_TYPE (@0))
> > > +== element_precision (TREE_TYPE (@1)))
> >
> > I think you want to check TYPE_SIZE (TREE_TYPE (@0/@1)) for equality
> > instead.
> >
> 
> I think TYPE_SIZE is not enough as we need the vector elements to have
> the same size, not just the vector as a whole.

Err, yes - you want to check the element sizes of course.

> For example, when using the TYPE_SIZE check instead the following
> testcase miscompiles
> 
> typedef int veci __attribute__ ((vector_size (4 * sizeof (int;
> typedef double vecd __attribute__ ((vector_size (2 * sizeof (double;
> 
> void fun (veci *a, veci *b, veci *c)
> {
>   char data[16];
>   veci r1 = __builtin_shufflevector (*a, *b, 0, 5, 2, 7);
>   vecd r2;
>   __builtin_memcpy(data, , sizeof(veci));
>   __builtin_memcpy(, data, sizeof(vecd));
>   vecd r3 = __builtin_shufflevector (r2, r2, 1, 0);
>   __builtin_memcpy(data, , sizeof(vecd));
>   __builtin_memcpy(c, data, sizeof(veci));
> }
> 
> To:
> 
> ldr q31, [x0]
> rev64   v31.4s, v31.4s
> str q31, [x2]
> ret
> 
> > Otherwise OK.
> >
> > Thanks,
> > Richard.
> >
> > >  (with
> > >   {
> > > vec_perm_indices sel0 (builder0, 2, nelts);
> > > @@ -10073,10 +10075,10 @@ and,
> > >  ? (!can_vec_perm_const_p (result_mode, op_mode, sel0, false)
> > > || !can_vec_perm_const_p (result_mode, op_mode, sel1, 
> > > false))
> > >  : !can_vec_perm_const_p (result_mode, op_mode, sel1, false)))
> > > -  op0 = vec_perm_indices_to_tree (TREE_TYPE (@4), sel2);
> > > +  op0 = vec_perm_indices_to_tree (TREE_TYPE (@5), sel2);
> > >   }
> > >   (if (op0)
> > > -  (vec_perm @1 @2 { op0; })))
> > > +  (view_convert (vec_perm @2 @3 { op0; }
> > >
> > >  /* Merge
> > >   c = VEC_PERM_EXPR ;
> > > diff --git a/gcc/testsuite/gcc.dg/fold-perm-2.c 
> > > b/gcc/testsuite/gcc.dg/fold-perm-2.c
> > > new file mode 100644
> > > index 000..1a4ab4065de
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/fold-perm-2.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O -fdump-tree-fre1" } */
> > > +
> > > +typedef int ve

Re: [PATCH v2] Match: Support __builtin_add_overflow branch form for unsigned SAT_ADD

2024-05-24 Thread Richard Biener
On Fri, May 24, 2024 at 8:56 AM Richard Biener
 wrote:
>
> On Fri, May 24, 2024 at 8:37 AM Li, Pan2  wrote:
> >
> > Thanks Jeff and Richard for suggestion and reviewing.
> >
> > I had another try in phiopt to do the convert from PHI to
> > stmt = cond ? a : b.  It can perform that convert successfully,
> > and then widen-mul is able to recognize the .SAT_ADD.
> >
> > For now, to limit the risk, the above convert from PHI to
> > stmt = cond ? a : b is only performed when the pattern matches and
> > the backend implements the usadd standard name.  Unfortunately, I am
> > stuck in the case that when the lhs is not matched, we need to clean
> > up the previously created stmts, or we will have an ICE for a
> > missing definition.
> >
> > sat_add.c: In function ‘sat_add_u_3_uint8_t’:
> > sat_add.c:69:1: error: missing definition
> >69 | SAT_ADD_U_3(uint8_t);
> >   | ^~~
> > for SSA_NAME: _6 in statement:
> > # VUSE <.MEM_14(D)>
> > return _6;
> > during GIMPLE pass: phiopt
> > dump file: sat_add.c.046t.phiopt1
> > sat_add.c:69:1: internal compiler error: verify_ssa failed
> > 0x1db41ba verify_ssa(bool, bool
> > /home/pli/gcc/555/riscv-gnu-toolchain/gcc/__RISCV_BUILD__/../gcc/tree-ssa.cc:1203
> > 0x18e3075 execute_function_todo
> > 
> > /home/pli/gcc/555/riscv-gnu-toolchain/gcc/__RISCV_BUILD__/../gcc/passes.cc:2096
> > 0x18e1c52 do_per_function
> > 
> > /home/pli/gcc/555/riscv-gnu-toolchain/gcc/__RISCV_BUILD__/../gcc/passes.cc:1688
> > 0x18e3222 execute_todo
> >
> > I bet the reason is that we created new stmts like stmt_cond and
> > stmt_val but we don't insert them.  Thus, there will be orphan nodes
> > somewhere and we need something like a rollback to recover the
> > gimple up to a point.  I tried various release_* calls and the like
> > but none seem to work.
> >
> > So is there any suggestion on how to take care of such a gimple
> > rollback, or another solution for this?  Below is the function that
> > performs the convert from PHI to stmt = cond ? a : b for reference,
> > thanks a lot.
> >
> > Pan
> >
> > diff --git a/gcc/tree-ssa-phiopt.cc b/gcc/tree-ssa-phiopt.cc
> > index 918cf50b589..7982b65bac4 100644
> > --- a/gcc/tree-ssa-phiopt.cc
> > +++ b/gcc/tree-ssa-phiopt.cc
> > @@ -486,6 +486,88 @@ phiopt_early_allow (gimple_seq , gimple_match_op 
> > )
> >  }
> >  }
> >
> > +extern bool gimple_unsigned_integer_sat_add (tree, tree*, tree (*)(tree));
> > +
> > +/* Try to match the phi expr to the gimple cond. Return true if we can
> > +   perform the convert or return false.  There will be some restrictions
> > +   or such kind of conversion, aka:
> > +
> > +   1. Only selected pattern will try this convert.
> > +   2. The generated gassign matched the selected IFN pattern.
> > +   3. The backend has implement the standard name.
> > +
> > +   From:
> > +   <bb 2> :
> > + _1 = x_3(D) + y_4(D);
> > + if (_1 >= x_3(D))
> > +   goto <bb 3>; [INV]
> > + else
> > +   goto <bb 4>; [INV]
> > +
> > +   <bb 3> :
> > +
> > +   <bb 4> :
> > + # _2 = PHI <255(2), _1(3)>
> > +
> > +   To:
> > +   <bb 2> :
> > + _1 = x_3(D) + y_4(D);
> > + phi_cond_6 = _1 >= x_3(D);
> > + _2 = phi_cond_6 ? _1 : 255; */
> > +
> > +static bool
> > +match_phi_to_gimple_cond (basic_block cond_bb, gphi *phi, tree arg0, tree arg1)
>
> You should do this in widen-mult and/or ISEL and if necessary for 
> vectorization
> in tree-if-conv.cc, though eventually what if-convert creates might be
> good enough
> to match during pattern recognition.
>
> > +{
> > +  gcond *cond = as_a <gcond *> (*gsi_last_bb (cond_bb));
> > +
> > +  if (!cond)
> > +return false;
> > +
> > +  enum tree_code code = gimple_cond_code (cond);
> > +  tree phi_result = gimple_phi_result (phi);
> > +  tree cond_tree = make_temp_ssa_name (boolean_type_node, NULL, 
> > "phi_cond");
> > +  tree cmp_tree = build2 (code, boolean_type_node, gimple_cond_lhs (cond),
> > + gimple_cond_rhs (cond));
> > +  tree rhs = build3 (COND_EXPR, TREE_TYPE (phi_result), cond_tree, arg0, 
> > arg1);
>
> phiopt directly uses cmp_tree, so you could do that as well and avoid 
> stmt_cond.
>
> > +
> > +  gassign *stmt_cond = gimple_build_assign (cond_tree, cmp_tree);
> > +  gassign *stmt_val = gimple_

Re: [PATCH v2] Match: Support __builtin_add_overflow branch form for unsigned SAT_ADD

2024-05-24 Thread Richard Biener
You do not insert the actual .SAT_ADD here and that's
the definition that's missing.  You probably shouldn't need to add the
cond-stmt?

> +}
> +  else
> +{
> +  // Clean up the stmts created, but none of the below works well.
> +  // gsi = gsi_for_stmt (stmt_val);
> +  // gsi_remove (, true);
> +  // release_defs (stmt_val);
> +  // ggc_free (stmt_val);
> +
> +  // gsi = gsi_for_stmt (stmt_cond);
> +  // gsi_remove (, true);
> +  // release_defs (stmt_cond);
> +  // ggc_free (stmt_cond);
> +
> +  // release_defs (stmt_cond);
> +  // release_defs (stmt_val);
> +  release_ssa_name (cond_tree);

As you don't insert the stmts you should be able to simply
only release the SSA names and ggc_free the stmt.  You can also
look at maybe_fold_comparisons_from_match_pd in gimple-fold.cc
for more "ugly" ways to do this.

Building a helper in one place to match a PHI def as COND_EXPR
might be nice.  As said, avoiding all the mess by providing native
support from genmatch would be even better, but I'm not asking you
to do that.

Richard.

> +}
> +
> +  return matched_p;
> +}
> +
> /* gimple_simplify_phiopt is like gimple_simplify but designed for PHIOPT.
> Return NULL if nothing can be simplified or the resulting simplified value
> with parts pushed if EARLY_P was true. Also rejects non allowed tree code
> @@ -826,6 +908,9 @@ match_simplify_replacement (basic_block cond_bb, 
> basic_block middle_bb,
>   So, given the condition COND, and the two PHI arguments, match and 
> simplify
>   can happen on (COND) ? arg0 : arg1. */
>
> +  if (match_phi_to_gimple_cond (cond_bb, phi, arg0, arg1))
> +    return true;
> +
>stmt = last_nondebug_stmt (cond_bb);
>
>/* We need to know which is the true edge and which is the false
>
>
> -Original Message-
> From: Jeff Law 
> Sent: Thursday, May 23, 2024 10:59 PM
> To: Richard Biener ; Li, Pan2 
> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com; 
> tamar.christ...@arm.com; pins...@gmail.com
> Subject: Re: [PATCH v2] Match: Support __builtin_add_overflow branch form for 
> unsigned SAT_ADD
>
>
>
> On 5/23/24 6:14 AM, Richard Biener wrote:
> > On Thu, May 23, 2024 at 1:08 PM Li, Pan2  wrote:
> >>
> >> I have a try to convert the PHI from Part-A to Part-B, aka PHI to _2 = 
> >> phi_cond ? _1 : 255.
> >> And then we can do the matching on COND_EXPR in the underlying widen-mul 
> >> pass.
> >>
> >> Unfortunately, meet some ICE when verify_gimple_phi in sccopy1 pass =>
> >> sat_add.c:66:1: internal compiler error: tree check: expected class 
> >> ‘type’, have ‘exceptional’ (error_mark) in useless_type_conversion_p, at 
> >> gimple-expr.cc:86
> >
> > Likely you have released _2, more comments below on your previous mail.
> You can be sure by calling debug_tree () on the SSA_NAME node in
> question.  If it reports "in-free-list", then that's definitive that the
> SSA_NAME was released back to the SSA_NAME manager.  If that SSA_NAME is
> still in the IL, then that's very bad.
>
> jeff
>


Re: Question about optimizing function pointers for direct function calls

2024-05-24 Thread Richard Biener via Gcc
On Fri, May 24, 2024 at 5:53 AM Hanke Zhang via Gcc  wrote:
>
> Hi,
> I got a question about optimizing function pointers for direct
> function calls in C.
>
> Consider the following scenario: one of the fields of a structure is a
> function pointer, and all its assignments come from the same function.
> Can all its uses be replaced by direct calls to this function? So the
> later passes can do more optimizations.
>
> Here is the example:
>
> int add(int a, int b) { return a + b; }
> int sub(int a, int b) { return a - b; }
>
> struct Foo {
> int (*add)(int, int);
> };
> int main()
> {
> struct Foo *foo = malloc(sizeof(struct Foo) * 5);
>
> for (int i = 0; i < 5; i++) {
> foo[i].add = add;
> }
>
> int sum = 0;
> for (int i = 0; i < 5; i++) {
> sum += foo[i].add(1, 2);
> }
>
> return 0;
> }
>
> Can I replace the code above to the code below?
>
> int add(int a, int b) { return a + b; }
> int sub(int a, int b) { return a - b; }
>
> struct Foo {
> int (*add)(int, int);
> };
> int main()
> {
> struct Foo *foo = malloc(sizeof(struct Foo) * 5);
>
> for (int i = 0; i < 5; i++) {
> foo[i].add = add;
> }
>
> int sum = 0;
> for (int i = 0; i < 5; i++) {
> sum += add(1,2);
> }
>
> return 0;
> }
>
> My idea is to determine whether the assignment of the field is the
> same function, and if so, perform the replacement.

If it's as simple as above then sure, even CSE should do it.  If you
can otherwise prove that the memory location with the function pointer
always has the same value you are obviously fine.  If you just
do not see any other store via 'add's FIELD_DECL then no, that
isn't good enough.  Every void * store whose destination you do not
know might go to that slot.
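
A hand-written illustration of that escape (helper names invented):
nothing ever stores through the 'add' FIELD_DECL, yet the slot
changes:

void
store_fn (void *slot, void *fn)
{
  *(void **) slot = fn;                   /* opaque void * store */
}

void
g (struct Foo *foo)
{
  store_fn (&foo[0].add, (void *) sub);   /* rewrites the slot */
}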

> Of course this is not a reasonable optimization as-is; I just want to
> know whether it is safe to do so, and if I want to do it in the IPA
> stage, is it possible?

For the more general case you can do what we do for speculative
devirtualization - replace the code with

  sum += foo[i].add == add ? add (1,2) : foo[i].add(1,2);

that way we can inline the direct call and hopefully the branch will be
well predicted.
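
Spelled out on the loop from the question (hand-written, not compiler
output):

int sum = 0;
for (int i = 0; i < 5; i++)
  {
    int (*fp) (int, int) = foo[i].add;
    if (fp == add)
      sum += add (1, 2);   /* direct call: inlinable, folds to 3 */
    else
      sum += fp (1, 2);    /* fallback indirect call */
  }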

In some SPEC there is IIRC the situation where such speculative
devirtualization candidates can be found solely based on function
signature.  With LTO/IPA you'd basically collect candidate targets
for each indirect call (possibly address-taken function definitions
with correct signature) and if there's just a single one you can
choose that as speculative devirt target.

Speculative devirt as we have now of course works with profile
data to identify the most probable candidate.

Richard.

>
> Thanks
> Hanke Zhang


Re: [PATCH] MATCH: Look through VIEW_CONVERT when folding VEC_PERM_EXPRs.

2024-05-24 Thread Richard Biener
On Wed, 22 May 2024, Manolis Tsamis wrote:

> The match.pd patterns to merge two vector permutes into one fail when a
> potentially no-op view convert expression is between the two permutes.
> This change lifts this restriction.
> 
> gcc/ChangeLog:
> 
>   * match.pd: Allow no-op view_convert between permutes.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/fold-perm-2.c: New test.
> 
> Signed-off-by: Manolis Tsamis 
> ---
> 
>  gcc/match.pd   | 14 ++++++++------
>  gcc/testsuite/gcc.dg/fold-perm-2.c | 16 ++++++++++++++++
>  2 files changed, 24 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/fold-perm-2.c
> 
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 07e743ae464..cbb3c5d86e0 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -10039,19 +10039,21 @@ and,
>   d = VEC_PERM_EXPR <a, b, NEW_VCST>;  */
>  
>  (simplify
> - (vec_perm (vec_perm@0 @1 @2 VECTOR_CST@3) @0 VECTOR_CST@4)
> + (vec_perm (view_convert?@0 (vec_perm@1 @2 @3 VECTOR_CST@4)) @0 VECTOR_CST@5)
>   (if (TYPE_VECTOR_SUBPARTS (type).is_constant ())
>(with
> {
>   machine_mode result_mode = TYPE_MODE (type);
> - machine_mode op_mode = TYPE_MODE (TREE_TYPE (@1));
> + machine_mode op_mode = TYPE_MODE (TREE_TYPE (@2));
>   int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
>   vec_perm_builder builder0;
>   vec_perm_builder builder1;
>   vec_perm_builder builder2 (nelts, nelts, 1);
> }
> -   (if (tree_to_vec_perm_builder (&builder0, @3)
> - && tree_to_vec_perm_builder (&builder1, @4))
> +   (if (tree_to_vec_perm_builder (&builder0, @4)
> + && tree_to_vec_perm_builder (&builder1, @5)
> + && element_precision (TREE_TYPE (@0))
> +== element_precision (TREE_TYPE (@1)))

I think you want to check TYPE_SIZE (TREE_TYPE (@0/@1)) for equality
instead.

Otherwise OK.

Thanks,
Richard.

>  (with
>   {
> vec_perm_indices sel0 (builder0, 2, nelts);
> @@ -10073,10 +10075,10 @@ and,
>  ? (!can_vec_perm_const_p (result_mode, op_mode, sel0, false)
> || !can_vec_perm_const_p (result_mode, op_mode, sel1, false))
>  : !can_vec_perm_const_p (result_mode, op_mode, sel1, false)))
> -  op0 = vec_perm_indices_to_tree (TREE_TYPE (@4), sel2);
> +  op0 = vec_perm_indices_to_tree (TREE_TYPE (@5), sel2);
>   }
>   (if (op0)
> -  (vec_perm @1 @2 { op0; })))
> +  (view_convert (vec_perm @2 @3 { op0; }
>  
>  /* Merge
>   c = VEC_PERM_EXPR <a, b, VCST0>;
> diff --git a/gcc/testsuite/gcc.dg/fold-perm-2.c 
> b/gcc/testsuite/gcc.dg/fold-perm-2.c
> new file mode 100644
> index 000..1a4ab4065de
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/fold-perm-2.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O -fdump-tree-fre1" } */
> +
> +typedef int veci __attribute__ ((vector_size (4 * sizeof (int;
> +typedef unsigned int vecu __attribute__ ((vector_size (4 * sizeof (unsigned 
> int;
> +
> +void fun (veci *a, veci *b, veci *c)
> +{
> +  veci r1 = __builtin_shufflevector (*a, *b, 0, 5, 2, 7);
> +  vecu r2 = __builtin_convertvector (r1, vecu);
> +  vecu r3 = __builtin_shufflevector (r2, r2, 2, 3, 1, 0);
> +  *c = __builtin_convertvector (r3, veci);
> +}
> +
> +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 7, 5, 0 }" "fre1" } } */
> +/* { dg-final { scan-tree-dump-times "VEC_PERM_EXPR" 1 "fre1" } } */
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [PATCH] [RFC] Target-independent store forwarding avoidance. [PR48696] Target-independent store forwarding avoidance.

2024-05-24 Thread Richard Biener
imization
> +Maximum number of instruction distance that a small store forwarded to a 
> larger load may stall.
> +
>  -param=switch-conversion-max-branch-ratio=
>  Common Joined UInteger Var(param_switch_conversion_branch_ratio) Init(8) 
> IntegerRange(1, 65536) Param Optimization
>  The maximum ratio between array size and switch branches for a switch 
> conversion to take place.
> diff --git a/gcc/passes.def b/gcc/passes.def
> index 1cbbd413097..1e608774707 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -462,6 +462,7 @@ along with GCC; see the file COPYING3.  If not see
>NEXT_PASS (pass_lower_subreg);
>NEXT_PASS (pass_df_initialize_opt);
>NEXT_PASS (pass_cse);
> +  NEXT_PASS (pass_rtl_avoid_store_forwarding);
>NEXT_PASS (pass_rtl_fwprop);
>NEXT_PASS (pass_rtl_cprop);
>NEXT_PASS (pass_rtl_pre);
> diff --git a/gcc/testsuite/gcc.dg/avoid-store-forwarding-1.c 
> b/gcc/testsuite/gcc.dg/avoid-store-forwarding-1.c
> new file mode 100644
> index 000..0775aee898b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/avoid-store-forwarding-1.c
> @@ -0,0 +1,46 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-rtl-avoid_store_forwarding" } */
> +
> +typedef union {
> +char arr_8[8];
> +long long_value;
> +} DataUnion;
> +
> +long ssll_1 (DataUnion *data, char x)
> +{
> +  data->arr_8[0] = x;
> +  return data->long_value;
> +}
> +
> +long ssll_2 (DataUnion *data, char x)
> +{
> +  data->arr_8[1] = x;
> +  return data->long_value;
> +}
> +
> +long ssll_3 (DataUnion *data, char x)
> +{
> +  data->arr_8[7] = x;
> +  return data->long_value;
> +}
> +
> +long ssll_4 (DataUnion **data, char x)
> +{
> +  (*data)->arr_8[0] = x;
> +  return (*data)->long_value;
> +}
> +
> +long ssll_5 (DataUnion **data, char x)
> +{
> +  (*data)->arr_8[1] = x;
> +  return (*data)->long_value;
> +}
> +
> +long ssll_6 (DataUnion **data, char x)
> +{
> +  (*data)->arr_8[7] = x;
> +  return (*data)->long_value;
> +}
> +
> +/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 6 "avoid_store_forwarding" } } */
> +/* { dg-final { scan-rtl-dump-times "Store forwarding avoided" 6 "avoid_store_forwarding" } } */
> diff --git a/gcc/testsuite/gcc.dg/avoid-store-forwarding-2.c 
> b/gcc/testsuite/gcc.dg/avoid-store-forwarding-2.c
> new file mode 100644
> index 000..cd81aa248fe
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/avoid-store-forwarding-2.c
> @@ -0,0 +1,39 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-rtl-avoid_store_forwarding" } */
> +
> +typedef union {
> +char arr_8[8];
> +int long_value;
> +} DataUnion1;
> +
> +long no_ssll_1 (DataUnion1 *data, char x)
> +{
> +  data->arr_8[4] = x;
> +  return data->long_value;
> +}
> +
> +long no_ssll_2 (DataUnion1 *data, char x)
> +{
> +  data->arr_8[5] = x;
> +  return data->long_value;
> +}
> +
> +typedef union {
> +char arr_8[8];
> +short long_value[4];
> +} DataUnion2;
> +
> +long no_ssll_3 (DataUnion2 *data, char x)
> +{
> +  data->arr_8[4] = x;
> +  return data->long_value[1];
> +}
> +
> +long no_ssll_4 (DataUnion2 *data, char x)
> +{
> +  data->arr_8[0] = x;
> +  return data->long_value[1];
> +}
> +
> +/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 0 "avoid_store_forwarding" } } */
> +/* { dg-final { scan-rtl-dump-times "Store forwarding avoided" 0 "avoid_store_forwarding" } } */
> diff --git a/gcc/testsuite/gcc.dg/avoid-store-forwarding-3.c 
> b/gcc/testsuite/gcc.dg/avoid-store-forwarding-3.c
> new file mode 100644
> index 000..3175f882c86
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/avoid-store-forwarding-3.c
> @@ -0,0 +1,31 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-rtl-avoid_store_forwarding" } */
> +
> +typedef union {
> +char arr_8[8];
> +long long_value;
> +} DataUnion;
> +
> +long ssll_multi_1 (DataUnion **data, char x)
> +{
> +  (*data)->arr_8[0] = x;
> +  (*data)->arr_8[2] = x;
> +  return (*data)->long_value;
> +}
> +
> +long ssll_multi_2 (DataUnion **data, char x)
> +{
> +  (*data)->arr_8[0] = x;
> +  (*data)->arr_8[1] = 11;
> +  return (*data)->long_value;
> +}
> +
> +long ssll_multi_3 (DataUnion **data, char x, short y)
> +{
> +  (*data)->arr_8[1] = x;
> +  __builtin_memcpy((*data)->arr_8 + 4, &y, sizeof(short));
> +  return (*data)->long_value;
> +}
> +
> +/* { dg-final { scan-rtl-dump-times "Store forwardings detected" 3 "avoid_store_forwarding" } } */
> +/* { dg-final { scan-rtl-dump-times "Store forwardings avoided" 3 "avoid_store_forwarding" } } */
> diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> index 29267589eeb..49957ba3373 100644
> --- a/gcc/tree-pass.h
> +++ b/gcc/tree-pass.h
> @@ -570,6 +570,7 @@ extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context 
> *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt);
> +extern rtl_opt_pass *make_pass_rtl_avoid_store_forwarding (gcc::context 
> *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt);
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [PATCH] Use simple_dce_from_worklist in phiprop

2024-05-24 Thread Richard Biener
On Thu, May 23, 2024 at 10:55 PM Andrew Pinski  wrote:
>
> I noticed that phiprop leaves around phi nodes which
> define an ssa name which is unused. This just adds a
> bitmap to mark those ssa names and then calls
> simple_dce_from_worklist at the very end to remove
> those phi nodes and all of their dependencies, if there
> were any. This might allow us to optimize something earlier
> due to the removal of the phi which was taking the address
> of the variables.
>
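> To illustrate the transform this cleans up after (an assumed example,
> not taken from the patch) -- phiprop turns a load through a PHI of
> addresses into a PHI of the loaded values:
>
>   int g1, g2;
>   int foo (int c)
>   {
>     int *p = c ? &g1 : &g2;  /* PHI of addresses */
>     return *p;               /* rewritten into a PHI of loads */
>   }
>
> After the rewrite the PHI defining 'p' may have no uses left; that is
> the phi node the new bitmap tracks for simple_dce_from_worklist.
>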
> Bootstrapped and tested on x86_64-linux-gnu with no regressions.

OK

> gcc/ChangeLog:
>
> * tree-ssa-phiprop.cc (phiprop_insert_phi): Add
> dce_ssa_names argument. Add the phi's result to it.
> (propagate_with_phi): Add dce_ssa_names argument.
> Update call to phiprop_insert_phi.
> (pass_phiprop::execute): Update call to propagate_with_phi.
> Call simple_dce_from_worklist if there was a change.
>
> Signed-off-by: Andrew Pinski 
> ---
>  gcc/tree-ssa-phiprop.cc | 28 ++--
>  1 file changed, 18 insertions(+), 10 deletions(-)
>
> diff --git a/gcc/tree-ssa-phiprop.cc b/gcc/tree-ssa-phiprop.cc
> index 041521ef106..2a1cdae46d2 100644
> --- a/gcc/tree-ssa-phiprop.cc
> +++ b/gcc/tree-ssa-phiprop.cc
> @@ -34,6 +34,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "stor-layout.h"
>  #include "tree-ssa-loop.h"
>  #include "tree-cfg.h"
> +#include "tree-ssa-dce.h"
>
>  /* This pass propagates indirect loads through the PHI node for its
> address to make the load source possibly non-addressable and to
> @@ -132,12 +133,15 @@ phivn_valid_p (struct phiprop_d *phivn, tree name, 
> basic_block bb)
>
>  static tree
>  phiprop_insert_phi (basic_block bb, gphi *phi, gimple *use_stmt,
> -   struct phiprop_d *phivn, size_t n)
> +   struct phiprop_d *phivn, size_t n,
> +   bitmap dce_ssa_names)
>  {
>tree res;
>gphi *new_phi = NULL;
>edge_iterator ei;
>edge e;
> +  tree phi_result = PHI_RESULT (phi);
> +  bitmap_set_bit (dce_ssa_names, SSA_NAME_VERSION (phi_result));
>
>gcc_assert (is_gimple_assign (use_stmt)
>   && gimple_assign_rhs_code (use_stmt) == MEM_REF);
> @@ -276,7 +280,7 @@ chk_uses (tree, tree *idx, void *data)
>
>  static bool
>  propagate_with_phi (basic_block bb, gphi *phi, struct phiprop_d *phivn,
> -   size_t n)
> +   size_t n, bitmap dce_ssa_names)
>  {
>tree ptr = PHI_RESULT (phi);
>gimple *use_stmt;
> @@ -420,9 +424,10 @@ propagate_with_phi (basic_block bb, gphi *phi, struct 
> phiprop_d *phivn,
> goto next;
> }
>
> - phiprop_insert_phi (bb, phi, use_stmt, phivn, n);
> + phiprop_insert_phi (bb, phi, use_stmt, phivn, n, dce_ssa_names);
>
> - /* Remove old stmt.  The phi is taken care of by DCE.  */
> + /* Remove old stmt.  The phi and maybe all of its dependencies
> +will be removed later via simple_dce_from_worklist.  */
>   gsi = gsi_for_stmt (use_stmt);
>   /* Unlinking the VDEF here is fine as we are sure that we process
>  stmts in execution order due to aggregate copies having VDEFs
> @@ -442,16 +447,15 @@ propagate_with_phi (basic_block bb, gphi *phi, struct 
> phiprop_d *phivn,
>  is the first load transformation.  */
>else if (!phi_inserted)
> {
> - res = phiprop_insert_phi (bb, phi, use_stmt, phivn, n);
> + res = phiprop_insert_phi (bb, phi, use_stmt, phivn, n, 
> dce_ssa_names);
>   type = TREE_TYPE (res);
>
>   /* Remember the value we created for *ptr.  */
>   phivn[SSA_NAME_VERSION (ptr)].value = res;
>   phivn[SSA_NAME_VERSION (ptr)].vuse = vuse;
>
> - /* Remove old stmt.  The phi is taken care of by DCE, if we
> -want to delete it here we also have to delete all intermediate
> -copies.  */
> + /* Remove old stmt.  The phi and maybe all of its dependencies
> +will be removed later via simple_dce_from_worklist.  */
>   gsi = gsi_for_stmt (use_stmt);
>   gsi_remove (&gsi, true);
>
> @@ -514,6 +518,7 @@ pass_phiprop::execute (function *fun)
>gphi_iterator gsi;
>unsigned i;
>size_t n;
> +  auto_bitmap dce_ssa_names;
>
>calculate_dominance_info (CDI_DOMINATORS);
>
> @@ -531,11 +536,14 @@ pass_phiprop::execute (function *fun)
>if (bb_has_abnormal_pred (bb))
> continue;
>for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> -   did_something |= propagate_with_phi (bb, gsi.phi (), phivn, n);
> +   did_something |= propagate_with_phi (bb, gsi.phi (), phivn, n, 
> dce_ssa_names);
>  }
>
>if (did_something)
> -gsi_commit_edge_inserts ();
> +{
> +  gsi_commit_edge_inserts ();
> +  simple_dce_from_worklist (dce_ssa_names);
> +}
>
>free (phivn);
>
> --
> 2.43.0
>


Re: [C PATCH]: allow aliasing of compatible types derived from enumeral types [PR115157]

2024-05-23 Thread Richard Biener
On Thu, 23 May 2024, Ian Lance Taylor wrote:

> On Thu, May 23, 2024 at 2:48 PM Martin Uecker  wrote:
> >
> > Am Donnerstag, dem 23.05.2024 um 14:30 -0700 schrieb Ian Lance Taylor:
> > > On Thu, May 23, 2024 at 2:00 PM Joseph Myers  wrote:
> > > >
> > > > On Tue, 21 May 2024, Martin Uecker wrote:
> > > > >
> > > > > C: allow aliasing of compatible types derived from enumeral types 
> > > > > [PR115157]
> > > > >
> > > > > Aliasing of enumeral types with the underlying integer is now 
> > > > > allowed
> > > > > by setting the aliasing set to zero.  But this does not allow 
> > > > > aliasing
> > > > > of derived types which are compatible as required by ISO C.  
> > > > > Instead,
> > > > > initially set structural equality.  Then set TYPE_CANONICAL and 
> > > > > update
> > > > > pointers and main variants when the type is completed (as done for
> > > > > structures and unions in C23).
> > > > >
> > > > > PR 115157
> > > > >
> > > > > gcc/c/
> > > > > * c-decl.cc (shadow_tag_warned, parse_xref_tag, start_enum,
> > > > > finish_enum): Set SET_TYPE_STRUCTURAL_EQUALITY / 
> > > > > TYPE_CANONICAL.
> > > > > * c-objc-common.cc (get_alias_set): Remove special case.
> > > > > (get_aka_type): Add special case.
> > > > >
> > > > > gcc/
> > > > > * godump.cc (go_output_typedef): Use TYPE_MAIN_VARIANT 
> > > > > instead
> > > > > of TYPE_CANONICAL.
> > > > >
> > > > > gcc/testsuite/
> > > > > * gcc.dg/enum-alias-1.c: New test.
> > > > > * gcc.dg/enum-alias-2.c: New test.
> > > > > * gcc.dg/enum-alias-3.c: New test.
> > > >
> > > > OK, in the absence of objections on middle-end or Go grounds within the
> > > > next week.
> > >
> > > The godump.cc patch is
> > >
> > >&& (TYPE_CANONICAL (TREE_TYPE (decl)) == NULL_TREE
> > >   || !container->decls_seen.contains
> > > -   (TYPE_CANONICAL (TREE_TYPE (decl)))
> > > +   (TYPE_MAIN_VARIANT (TREE_TYPE (decl)))
> > >  {
> > >
> > > What is the problem you are seeing?
> >
> > Test failures in godump-1.c
> >
> > >
> > > This patch isn't right:
> > >
> > > 1) The code is saying if "X == NULL_TREE || !already_seen(X)".  This
> > > patch is changing the latter X but not the former.  They should be
> > > consistent.
> >
> > Maybe the X == NULL_TREE can be removed if we
> > add TYPE_MAIN_VARIANTs instead?
> 
> If TYPE_MAIN_VARIANT is never NULL_TREE, then I agree that the
> NULL_TREE test can be removed.

TYPE_MAIN_VARIANT is indeed never NULL_TREE.
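
So presumably the check can simply become (a sketch of the suggested
cleanup, not tested here):

   && !container->decls_seen.contains
 (TYPE_MAIN_VARIANT (TREE_TYPE (decl)))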

Richard.

Re: [PATCH] RISC-V: Avoid splitting store dataref groups during SLP discovery

2024-05-23 Thread Richard Biener
On Thu, 23 May 2024, Richard Biener wrote:

> The following avoids splitting store dataref groups during SLP
> discovery but instead forces (eventually single-lane) consecutive
> lane SLP discovery for all lanes of the group, creating VEC_PERM
> SLP nodes merging them so the store will always cover the whole group.
> 
> With this for example
> 
> int x[1024], y[1024], z[1024], w[1024];
> void foo (void)
> {
>   for (int i = 0; i < 256; i++)
> {
>   x[4*i+0] = y[2*i+0];
>   x[4*i+1] = y[2*i+1];
>   x[4*i+2] = z[i];
>   x[4*i+3] = w[i];
> }
> }
> 
> which was previously using hybrid SLP can now be fully SLPed and
> SSE code generated looks better (but of course you never know,
> I didn't actually benchmark).  We of course need a VF of four here.
> 
> .L2:
> movdqa  z(%rax), %xmm0
> movdqa  w(%rax), %xmm4
> movdqa  y(%rax,%rax), %xmm2
> movdqa  y+16(%rax,%rax), %xmm1
> movdqa  %xmm0, %xmm3
> punpckhdq   %xmm4, %xmm0
> punpckldq   %xmm4, %xmm3
> movdqa  %xmm2, %xmm4
> shufps  $238, %xmm3, %xmm2
> movaps  %xmm2, x+16(,%rax,4)
> movdqa  %xmm1, %xmm2
> shufps  $68, %xmm3, %xmm4
> shufps  $68, %xmm0, %xmm2
> movaps  %xmm4, x(,%rax,4)
> shufps  $238, %xmm0, %xmm1
> movaps  %xmm2, x+32(,%rax,4)
> movaps  %xmm1, x+48(,%rax,4)
> addq$16, %rax
> cmpq$1024, %rax
> jne .L2
> 
> The extra permute nodes merging distinct branches of the SLP
> tree might be unexpected for some code, esp. since
> SLP_TREE_REPRESENTATIVE cannot be meaningfully set and we
> cannot populate SLP_TREE_SCALAR_STMTS or SLP_TREE_SCALAR_OPS
> consistently as we can have a mix of both.
> 
> The patch keeps the sub-trees formed from consecutive lanes but that's
> in principle not necessary if we for example have an even/odd
> split which now would result in N single-lane sub-trees.  That's
> left for future improvements.
> 
> The interesting part is how VLA vector ISAs handle merging of
> two vectors that's not trivial even/odd merging.  The strathegy
> of how to build the permute tree might need adjustments for that
> (in the end splitting each branch to single lanes and then doing
> even/odd merging would be the brute-force fallback).  Not sure
> how much we can or should rely on the SLP optimize pass to handle
> this.
> 
> The gcc.dg/vect/slp-12a.c case is interesting as we currently split
> the 8 store group into lanes 0-5 which we SLP with an unroll factor
> of two (on x86-64 with SSE) and the remaining two lanes are using
> interleaving vectorization with a final unroll factor of four.  Thus
> we're using hybrid SLP within a single store group.  After the change
> we discover the same 0-5 lane SLP part as well as two single-lane
> parts feeding the full store group.  But that results in a load
> permutation that isn't supported (I have WIP patches to rectify that).
> So we end up cancelling SLP and vectorizing the whole loop with
> interleaving which is IMO good and results in better code.
> 
> This is similar for gcc.target/i386/pr52252-atom.c where interleaving
> generates much better code than hybrid SLP.  I'm unsure how to update
> the testcase though.
> 
> gcc.dg/vect/slp-21.c runs into similar situations.  Note that when
> we discard an instance while analyzing SLP operations we currently
> force the full loop to have no SLP because hybrid detection is
> broken.  It's probably not worth fixing this at this moment.
> 
> For gcc.dg/vect/pr97428.c we are not splitting the 16 store group
> into two but merge the two 8 lane loads into one before doing the
> store and thus have only a single SLP instance.  A similar situation
> happens in gcc.dg/vect/slp-11c.c but the branches feeding the
> single SLP store only have a single lane.  Likewise for
> gcc.dg/vect/vect-complex-5.c and gcc.dg/vect/vect-gather-2.c.
> 
> gcc.dg/vect/slp-cond-1.c has an additional SLP vectorization
> with a SLP store group of size two but two single-lane branches.
> 
> (merged with the testsuite changes, re-posted because the RISC-V
> CI ran on a tree w/o a fix, hopefully fixing all the reported
> ICEs)

This worked out so I pushed the change.  The gcc.dg/vect/pr97428.c
test is FAILing on RISC-V (it still gets 0 SLP), because of missed
load permutations.  I hope the followup reorg for the load side will
fix this.  It also FAILs gcc.target/riscv/rvv/autovec/struct/struct_vect-4.c
which does excessive assembly scanning on many functions - I'll leave
this for target maintainers to update - there's one or two functions
which we now expect to SLP.

Richard.

>   * tree-vect-slp.cc (vec

[gcc r15-812] Avoid splitting store dataref groups during SLP discovery

2024-05-23 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:c71886f2ca2e46ce1449c7064d6f1b447d02fcba

commit r15-812-gc71886f2ca2e46ce1449c7064d6f1b447d02fcba
Author: Richard Biener 
Date:   Fri Sep 29 13:13:16 2023 +0200

Avoid splitting store dataref groups during SLP discovery

The following avoids splitting store dataref groups during SLP
discovery but instead forces (eventually single-lane) consecutive
lane SLP discovery for all lanes of the group, creating VEC_PERM
SLP nodes merging them so the store will always cover the whole group.

With this for example

int x[1024], y[1024], z[1024], w[1024];
void foo (void)
{
  for (int i = 0; i < 256; i++)
{
  x[4*i+0] = y[2*i+0];
  x[4*i+1] = y[2*i+1];
  x[4*i+2] = z[i];
  x[4*i+3] = w[i];
}
}

which was previously using hybrid SLP can now be fully SLPed and
SSE code generated looks better (but of course you never know,
I didn't actually benchmark).  We of course need a VF of four here.

.L2:
movdqa  z(%rax), %xmm0
movdqa  w(%rax), %xmm4
movdqa  y(%rax,%rax), %xmm2
movdqa  y+16(%rax,%rax), %xmm1
movdqa  %xmm0, %xmm3
punpckhdq   %xmm4, %xmm0
punpckldq   %xmm4, %xmm3
movdqa  %xmm2, %xmm4
shufps  $238, %xmm3, %xmm2
movaps  %xmm2, x+16(,%rax,4)
movdqa  %xmm1, %xmm2
shufps  $68, %xmm3, %xmm4
shufps  $68, %xmm0, %xmm2
movaps  %xmm4, x(,%rax,4)
shufps  $238, %xmm0, %xmm1
movaps  %xmm2, x+32(,%rax,4)
movaps  %xmm1, x+48(,%rax,4)
addq$16, %rax
cmpq$1024, %rax
jne .L2

The extra permute nodes merging distinct branches of the SLP
tree might be unexpected for some code, esp. since
SLP_TREE_REPRESENTATIVE cannot be meaningfully set and we
cannot populate SLP_TREE_SCALAR_STMTS or SLP_TREE_SCALAR_OPS
consistently as we can have a mix of both.

The patch keeps the sub-trees formed from consecutive lanes but that's
in principle not necessary if we for example have an even/odd
split which now would result in N single-lane sub-trees.  That's
left for future improvements.

The interesting part is how VLA vector ISAs handle merging of
two vectors that's not trivial even/odd merging.  The strategy
of how to build the permute tree might need adjustments for that
(in the end splitting each branch to single lanes and then doing
even/odd merging would be the brute-force fallback).  Not sure
how much we can or should rely on the SLP optimize pass to handle
this.

The gcc.dg/vect/slp-12a.c case is interesting as we currently split
the 8 store group into lanes 0-5 which we SLP with an unroll factor
of two (on x86-64 with SSE) and the remaining two lanes are using
interleaving vectorization with a final unroll factor of four.  Thus
we're using hybrid SLP within a single store group.  After the change
we discover the same 0-5 lane SLP part as well as two single-lane
parts feeding the full store group.  But that results in a load
permutation that isn't supported (I have WIP patches to rectify that).
So we end up cancelling SLP and vectorizing the whole loop with
interleaving which is IMO good and results in better code.

This is similar for gcc.target/i386/pr52252-atom.c where interleaving
generates much better code than hybrid SLP.  I'm unsure how to update
the testcase though.

gcc.dg/vect/slp-21.c runs into similar situations.  Note that when
we discard an instance while analyzing SLP operations we currently
force the full loop to have no SLP because hybrid detection is
broken.  It's probably not worth fixing this at this moment.

For gcc.dg/vect/pr97428.c we are not splitting the 16 store group
into two but merge the two 8 lane loads into one before doing the
store and thus have only a single SLP instance.  A similar situation
happens in gcc.dg/vect/slp-11c.c but the branches feeding the
single SLP store only have a single lane.  Likewise for
gcc.dg/vect/vect-complex-5.c and gcc.dg/vect/vect-gather-2.c.

gcc.dg/vect/slp-cond-1.c has an additional SLP vectorization
with a SLP store group of size two but two single-lane branches.

* tree-vect-slp.cc (vect_build_slp_instance): Do not split
store dataref groups on loop SLP discovery failure but create
a single SLP instance for the stores but branch to SLP sub-trees
and merge with a series of VEC_PERM nodes.

* gcc.dg/vect/pr97428.c: Expect a single store SLP group.
* gcc.dg/vect/slp-11c.c: Likewise, if !vect_load_lanes.
* gcc.dg/vect/vect-complex-5.c: Likewise.
* gcc.dg/vect/slp-12a.c: Do not expect SLP.

[PATCH] RISC-V: Avoid splitting store dataref groups during SLP discovery

2024-05-23 Thread Richard Biener
The following avoids splitting store dataref groups during SLP
discovery but instead forces (eventually single-lane) consecutive
lane SLP discovery for all lanes of the group, creating VEC_PERM
SLP nodes merging them so the store will always cover the whole group.

With this for example

int x[1024], y[1024], z[1024], w[1024];
void foo (void)
{
  for (int i = 0; i < 256; i++)
{
  x[4*i+0] = y[2*i+0];
  x[4*i+1] = y[2*i+1];
  x[4*i+2] = z[i];
  x[4*i+3] = w[i];
}
}

which was previously using hybrid SLP can now be fully SLPed and
SSE code generated looks better (but of course you never know,
I didn't actually benchmark).  We of course need a VF of four here.

.L2:
movdqa  z(%rax), %xmm0
movdqa  w(%rax), %xmm4
movdqa  y(%rax,%rax), %xmm2
movdqa  y+16(%rax,%rax), %xmm1
movdqa  %xmm0, %xmm3
punpckhdq   %xmm4, %xmm0
punpckldq   %xmm4, %xmm3
movdqa  %xmm2, %xmm4
shufps  $238, %xmm3, %xmm2
movaps  %xmm2, x+16(,%rax,4)
movdqa  %xmm1, %xmm2
shufps  $68, %xmm3, %xmm4
shufps  $68, %xmm0, %xmm2
movaps  %xmm4, x(,%rax,4)
shufps  $238, %xmm0, %xmm1
movaps  %xmm2, x+32(,%rax,4)
movaps  %xmm1, x+48(,%rax,4)
addq$16, %rax
cmpq$1024, %rax
jne .L2

The extra permute nodes merging distinct branches of the SLP
tree might be unexpected for some code, esp. since
SLP_TREE_REPRESENTATIVE cannot be meaningfully set and we
cannot populate SLP_TREE_SCALAR_STMTS or SLP_TREE_SCALAR_OPS
consistently as we can have a mix of both.

The patch keeps the sub-trees formed from consecutive lanes but that's
in principle not necessary if we for example have an even/odd
split which now would result in N single-lane sub-trees.  That's
left for future improvements.

The interesting part is how VLA vector ISAs handle merging of
two vectors that's not trivial even/odd merging.  The strategy
of how to build the permute tree might need adjustments for that
(in the end splitting each branch to single lanes and then doing
even/odd merging would be the brute-force fallback).  Not sure
how much we can or should rely on the SLP optimize pass to handle
this.

The gcc.dg/vect/slp-12a.c case is interesting as we currently split
the 8 store group into lanes 0-5 which we SLP with an unroll factor
of two (on x86-64 with SSE) and the remaining two lanes are using
interleaving vectorization with a final unroll factor of four.  Thus
we're using hybrid SLP within a single store group.  After the change
we discover the same 0-5 lane SLP part as well as two single-lane
parts feeding the full store group.  But that results in a load
permutation that isn't supported (I have WIP patches to rectify that).
So we end up cancelling SLP and vectorizing the whole loop with
interleaving which is IMO good and results in better code.

This is similar for gcc.target/i386/pr52252-atom.c where interleaving
generates much better code than hybrid SLP.  I'm unsure how to update
the testcase though.

gcc.dg/vect/slp-21.c runs into similar situations.  Note that when
we discard an instance while analyzing SLP operations we currently
force the full loop to have no SLP because hybrid detection is
broken.  It's probably not worth fixing this at this moment.

For gcc.dg/vect/pr97428.c we are not splitting the 16 store group
into two but merge the two 8 lane loads into one before doing the
store and thus have only a single SLP instance.  A similar situation
happens in gcc.dg/vect/slp-11c.c but the branches feeding the
single SLP store only have a single lane.  Likewise for
gcc.dg/vect/vect-complex-5.c and gcc.dg/vect/vect-gather-2.c.

gcc.dg/vect/slp-cond-1.c has an additional SLP vectorization
with a SLP store group of size two but two single-lane branches.

(merged with the testsuite changes, re-posted because the RISC-V
CI ran on a tree w/o a fix, hopefully fixing all the reported
ICEs)

* tree-vect-slp.cc (vect_build_slp_instance): Do not split
store dataref groups on loop SLP discovery failure but create
a single SLP instance for the stores but branch to SLP sub-trees
and merge with a series of VEC_PERM nodes.

* gcc.dg/vect/pr97428.c: Expect a single store SLP group.
* gcc.dg/vect/slp-11c.c: Likewise, if !vect_load_lanes.
* gcc.dg/vect/vect-complex-5.c: Likewise.
* gcc.dg/vect/slp-12a.c: Do not expect SLP.
* gcc.dg/vect/slp-21.c: Remove not important scanning for SLP.
* gcc.dg/vect/slp-cond-1.c: Expect one more SLP if !vect_load_lanes.
* gcc.dg/vect/vect-gather-2.c: Expect SLP to be used.
* gcc.target/i386/pr52252-atom.c: XFAIL test for palignr.
---
 gcc/testsuite/gcc.dg/vect/pr97428.c  |   2 +-
 gcc/testsuite/gcc.dg/vect/slp-11c.c  |   6 +-
 gcc/testsuite/gcc.dg/vect/slp-12a.c  |   6 +-
 gcc/testsuite/gcc.dg/vect/slp-21.c   |  18 +-
 

[gcc r15-793] tree-optimization/115197 - fix ICE w/ constant in LC PHI and loop distribution

2024-05-23 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:2b2476d4d18c92b8aba3567ebccd2100c2f7c258

commit r15-793-g2b2476d4d18c92b8aba3567ebccd2100c2f7c258
Author: Richard Biener 
Date:   Thu May 23 14:36:39 2024 +0200

tree-optimization/115197 - fix ICE w/ constant in LC PHI and loop 
distribution

Forgot a check for an SSA name before trying to replace a PHI arg with
its current definition.
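
(For illustration -- not taken from the PR -- a loop-closed PHI such as
# e_1 = PHI <0(2)> has a constant argument, and a constant has no SSA
definition for get_current_def to look up, hence the ICE.)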

PR tree-optimization/115197
* tree-loop-distribution.cc (copy_loop_before): Constant PHI
args remain the same.

* gcc.dg/pr115197.c: New testcase.

Diff:
---
 gcc/testsuite/gcc.dg/pr115197.c | 14 ++
 gcc/tree-loop-distribution.cc   |  7 +--
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/pr115197.c b/gcc/testsuite/gcc.dg/pr115197.c
new file mode 100644
index 000..00d674b3bd9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr115197.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O1 -fno-tree-scev-cprop -ftree-pre 
-ftree-loop-distribute-patterns" } */
+
+int a, b[2], c, d, e, f[2];
+int main() {
+  while (a)
+if (d) {
+  if (e)
+return 0;
+  for (; c; c++)
+f[c] = 0 < (b[c] = ~(f[c + 1] < a));
+}
+  return 0;
+}
diff --git a/gcc/tree-loop-distribution.cc b/gcc/tree-loop-distribution.cc
index 668dc420449..4d1ed234fcb 100644
--- a/gcc/tree-loop-distribution.cc
+++ b/gcc/tree-loop-distribution.cc
@@ -977,8 +977,11 @@ copy_loop_before (class loop *loop, bool 
redirect_lc_phi_defs)
  if (virtual_operand_p (gimple_phi_result (phi)))
continue;
  use_operand_p use_p = PHI_ARG_DEF_PTR_FROM_EDGE (phi, exit);
- tree new_def = get_current_def (USE_FROM_PTR (use_p));
- SET_USE (use_p, new_def);
+ if (TREE_CODE (USE_FROM_PTR (use_p)) == SSA_NAME)
+   {
+ tree new_def = get_current_def (USE_FROM_PTR (use_p));
+ SET_USE (use_p, new_def);
+   }
}
 }


[PATCH] tree-optimization/115197 - fix ICE w/ constant in LC PHI and loop distribution

2024-05-23 Thread Richard Biener
Forgot a check for an SSA name before trying to replace a PHI arg with
its current definition.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

PR tree-optimization/115197
* tree-loop-distribution.cc (copy_loop_before): Constant PHI
args remain the same.

* gcc.dg/pr115197.c: New testcase.
---
 gcc/testsuite/gcc.dg/pr115197.c | 14 ++
 gcc/tree-loop-distribution.cc   |  7 +--
 2 files changed, 19 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr115197.c

diff --git a/gcc/testsuite/gcc.dg/pr115197.c b/gcc/testsuite/gcc.dg/pr115197.c
new file mode 100644
index 000..00d674b3bd9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr115197.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O1 -fno-tree-scev-cprop -ftree-pre 
-ftree-loop-distribute-patterns" } */
+
+int a, b[2], c, d, e, f[2];
+int main() {
+  while (a)
+if (d) {
+  if (e)
+return 0;
+  for (; c; c++)
+f[c] = 0 < (b[c] = ~(f[c + 1] < a));
+}
+  return 0;
+}
diff --git a/gcc/tree-loop-distribution.cc b/gcc/tree-loop-distribution.cc
index 668dc420449..4d1ed234fcb 100644
--- a/gcc/tree-loop-distribution.cc
+++ b/gcc/tree-loop-distribution.cc
@@ -977,8 +977,11 @@ copy_loop_before (class loop *loop, bool 
redirect_lc_phi_defs)
  if (virtual_operand_p (gimple_phi_result (phi)))
continue;
  use_operand_p use_p = PHI_ARG_DEF_PTR_FROM_EDGE (phi, exit);
- tree new_def = get_current_def (USE_FROM_PTR (use_p));
- SET_USE (use_p, new_def);
+ if (TREE_CODE (USE_FROM_PTR (use_p)) == SSA_NAME)
+   {
+ tree new_def = get_current_def (USE_FROM_PTR (use_p));
+ SET_USE (use_p, new_def);
+   }
}
 }
 
-- 
2.35.3


[gcc r15-792] tree-optimization/115199 - fix PTA constraint processing for &ANYTHING LHS

2024-05-23 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:f0a02467bbc35a478eb82f5a8a7e8870827b51fc

commit r15-792-gf0a02467bbc35a478eb82f5a8a7e8870827b51fc
Author: Richard Biener 
Date:   Thu May 23 13:33:15 2024 +0200

tree-optimization/115199 - fix PTA constraint processing for &ANYTHING LHS

When processing a &ANYTHING = X constraint we treat it as *ANYTHING = X
during constraint processing but then end up recording it as
&ANYTHING = X anyway, breaking constraint graph building.  This is
because we only update the local copy of the LHS and not the constraint
itself.
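
The bug pattern in miniature (a hypothetical sketch, not the GCC code):

  struct ce { int type; };

  void process (struct ce *t)
  {
    struct ce lhs = *t;  /* local working copy */
    lhs.type = 1;        /* only the copy is updated ...  */
    /* ... t->type still holds the old value; the fix below also
       writes through t.  */
  }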

PR tree-optimization/115199
* tree-ssa-structalias.cc (process_constraint): Also
record &ANYTHING = X as *ANYTHING = X in the end.

* gcc.dg/torture/pr115199.c: New testcase.

Diff:
---
 gcc/testsuite/gcc.dg/torture/pr115199.c | 24 
 gcc/tree-ssa-structalias.cc |  2 +-
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/torture/pr115199.c 
b/gcc/testsuite/gcc.dg/torture/pr115199.c
new file mode 100644
index 000..981a7330b32
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr115199.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+
+struct b {
+  char *volatile c;
+};
+struct b * __attribute__((noipa))
+d()
+{
+  char *e;
+  struct b *b = __builtin_malloc(sizeof(b));
+  void *f = __builtin_malloc(1);
+
+  e = __builtin_memcpy(f, "z", 1);
+  b->c = e;
+  return b;
+}
+
+int main()
+{
+  struct b b = *d();
+  if (b.c[0] != 'z')
+__builtin_abort();
+  return 0;
+}
diff --git a/gcc/tree-ssa-structalias.cc b/gcc/tree-ssa-structalias.cc
index 0e9423a78ec..a39b36c146e 100644
--- a/gcc/tree-ssa-structalias.cc
+++ b/gcc/tree-ssa-structalias.cc
@@ -3104,7 +3104,7 @@ process_constraint (constraint_t t)
  it here by turning it into *ANYTHING.  */
   if (lhs.type == ADDRESSOF
   && lhs.var == anything_id)
-lhs.type = DEREF;
+t->lhs.type = lhs.type = DEREF;
 
   /* ADDRESSOF on the lhs is invalid.  */
   gcc_assert (lhs.type != ADDRESSOF);


[gcc r15-791] tree-optimization/115138 - ptr-vs-ptr and FUNCTION_DECLs

2024-05-23 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:61f5b3c59ed20438d7d9918d7a83d29a21097d4e

commit r15-791-g61f5b3c59ed20438d7d9918d7a83d29a21097d4e
Author: Richard Biener 
Date:   Thu May 23 11:26:14 2024 +0200

tree-optimization/115138 - ptr-vs-ptr and FUNCTION_DECLs

I failed to realize we do not represent FUNCTION_DECLs or LABEL_DECLs
in vars explicitly and thus have to compare pt.vars_contains_nonlocal.

PR tree-optimization/115138
* tree-ssa-alias.cc (ptrs_compare_unequal): Make sure
pt.vars_contains_nonlocal differs since we do not represent
FUNCTION_DECLs or LABEL_DECLs in vars explicitly.

* gcc.dg/torture/pr115138.c: New testcase.

Diff:
---
 gcc/testsuite/gcc.dg/torture/pr115138.c | 28 
 gcc/tree-ssa-alias.cc   |  6 ++
 2 files changed, 34 insertions(+)

diff --git a/gcc/testsuite/gcc.dg/torture/pr115138.c 
b/gcc/testsuite/gcc.dg/torture/pr115138.c
new file mode 100644
index 000..6becaecbaff
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr115138.c
@@ -0,0 +1,28 @@
+/* { dg-do run } */
+
+int foo (int) {}
+int bar (int) {}
+
+typedef int (*pred)(int);
+
+int x, y;
+pred A () { if (x) return foo; else return bar; }
+pred B () { if (y) return foo; else return bar; }
+int __attribute__((noipa)) baz()
+{
+  pred a = A();
+  pred b = B();
+  if (a != b)
+return 42;
+  return 0;
+}
+
+int main()
+{
+  if (baz () != 0)
+__builtin_abort ();
+  y = 1;
+  if (baz () != 42)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-ssa-alias.cc b/gcc/tree-ssa-alias.cc
index d64d6d02f4a..1a91d63a31e 100644
--- a/gcc/tree-ssa-alias.cc
+++ b/gcc/tree-ssa-alias.cc
@@ -501,6 +501,12 @@ ptrs_compare_unequal (tree ptr1, tree ptr2)
  || pi2->pt.vars_contains_interposable)
return false;
  if ((!pi1->pt.null || !pi2->pt.null)
+ /* ???  We do not represent FUNCTION_DECL and LABEL_DECL
+in pt.vars but only set pt.vars_contains_nonlocal.  This
+makes compares involving those and other nonlocals
+imprecise.  */
+ && (!pi1->pt.vars_contains_nonlocal
+ || !pi2->pt.vars_contains_nonlocal)
  && (!pt_solution_includes_const_pool (&pi1->pt)
  || !pt_solution_includes_const_pool (&pi2->pt)))
return !pt_solutions_intersect (&pi1->pt, &pi2->pt);


Re: [PATCH v2] Match: Support __builtin_add_overflow branch form for unsigned SAT_ADD

2024-05-23 Thread Richard Biener
On Thu, May 23, 2024 at 1:08 PM Li, Pan2  wrote:
>
> I have a try to convert the PHI from Part-A to Part-B, aka PHI to _2 = 
> phi_cond ? _1 : 255.
> And then we can do the matching on COND_EXPR in the underlying widen-mul pass.
>
> Unfortunately, meet some ICE when verify_gimple_phi in sccopy1 pass =>
> sat_add.c:66:1: internal compiler error: tree check: expected class ‘type’, 
> have ‘exceptional’ (error_mark) in useless_type_conversion_p, at 
> gimple-expr.cc:86

Likely you have released _2, more comments below on your previous mail.

> will go on to see if this works or not.
>
> Part-A:
> uint8_t sat_add_u_1_uint8_t (uint8_t x, uint8_t y)
> {
>   unsigned char _1;
>   uint8_t _2;
>
>   <bb 2> :
>   _1 = x_3(D) + y_4(D);
>   if (_1 >= x_3(D))
> goto <bb 3>; [INV]
>   else
> goto <bb 4>; [INV]
>
>   <bb 3> :
>
>   <bb 4> :
>   # _2 = PHI <255(2), _1(3)>
>   return _2;
>
> }
>
> Part-B:
> uint8_t sat_add_u_1_uint8_t (uint8_t x, uint8_t y)
> {
>   unsigned char _1;
>   _Bool phi_cond_6;
>
>   <bb 2> :
>   _1 = x_3(D) + y_4(D);
>   phi_cond_6 = _1 >= x_3(D);
>   _2 = phi_cond_6 ? _1 : 255;
>   return _2;
>
> }
>
> -Original Message-
> From: Li, Pan2
> Sent: Thursday, May 23, 2024 12:17 PM
> To: Richard Biener 
> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com; 
> tamar.christ...@arm.com; pins...@gmail.com
> Subject: RE: [PATCH v2] Match: Support __builtin_add_overflow branch form for 
> unsigned SAT_ADD
>
> Thanks Richard for reviewing.
>
> > I'm not convinced we should match this during early if-conversion, should 
> > we?
> > The middle-end doesn't really know .SAT_ADD but some handling of
> > .ADD_OVERFLOW is present.
>
> I tried to do the branch (aka cond) match in widen-mult pass similar as 
> previous branchless form.
> Unfortunately, the branch will be converted to PHI when widen-mult, thus try 
> to bypass the PHI handling
> and convert the branch form to the branchless form in v2.
>
> > But please add a comment before the new pattern, esp. since it's
> > non-obvious that this is an improvent.
>
> Sure thing.
>
> > I suspect you rely on this form being recognized as .SAT_ADD later but
> > what prevents us from breaking this?  Why not convert it to .SAT_ADD
> > immediately?  If this is because the ISEL pass (or the widen-mult pass)
> > cannot handle PHIs then I would suggest to split out enough parts of
> > tree-ssa-phiopt.cc to be able to query match.pd for COND_EXPRs.
>
> Yes, this is sort of redundant, we can also convert it to .SAT_ADD 
> immediately in match.pd before widen-mult.
>
> Sorry I may get confused here, for branch form like below, what transform 
> should we perform in phiopt?
> The gimple_simplify_phiopt mostly leverage the simplify in match.pd but we 
> may hit the simplify in the
> other early pass.
>
> Or we can leverage branch version of unsigned_integer_sat_add gimple match in 
> phiopt and generate the gimple call .SAT_ADD
> In phiopt (mostly like what we do in widen-mult).
> Not sure if my understanding is correct or not, thanks again for help.

The trick for widen-mult (or ISEL) would be to try to match the PHI
nodes in a similar way as to
gimple_simplify_phiopt calls op.resimplify.  The difficulty resides in
that the (match ...) generated
code gets the entry to the stmt root.  Either we'd teach genmatch to
recognize a PHI def
as a COND or we make (match ..) (additionally?) generate entry points
taking a gimple_match_op
so the toplevel COND trick works.  Note it's already a bit awkward
because we build a GENERIC
form of the condition and that's now invalid in the IL for a GIMPLE
COND_EXPR but still present
because of that phiopt trick.  There isn't a SSA def for the condition
in the IL (it's only part
of a GIMPLE_COND and that one doesn't define "CC flags").

That means possibly special-casing (match (..) (cond (cmp ...) ..)) in
genmatch to handle
PHI defs might be the easiest "trick" here.

Not sure what you did for the IL you quoted above.

Richard.

> #define SAT_ADD_U_1(T) \
> T sat_add_u_1_##T(T x, T y) \
> { \
>   return (T)(x + y) >= x ? (x + y) : -1; \
> }
>
> SAT_ADD_U_1(uint8_t);
>
> Pan
>
> -Original Message-
> From: Richard Biener 
> Sent: Wednesday, May 22, 2024 9:14 PM
> To: Li, Pan2 
> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com; 
> tamar.christ...@arm.com; pins...@gmail.com
> Subject: Re: [PATCH v2] Match: Support __builtin_add_overflow branch form for 
> unsigned SAT_ADD
>
> On Wed, May 22, 2024 at 3:17 AM  wrote:
> >
> > From: Pan Li 
> >
> > This patch would like to support the __builtin_add

Re: [PATCH] .gitattributes: disable crlf translation

2024-05-23 Thread Richard Biener
On Thu, May 23, 2024 at 5:50 AM Peter Damianov  wrote:
>
> By default, git has the "autocrlf" """feature""" enabled. This causes the 
> files
> to have CRLF line endings when checked out on windows, which in the case of
> configure, causes confusing errors like:
>
> ./gcc/configure: line 14: $'\r': command not found
> ./gcc/configure: line 29: syntax error near unexpected token `newline'
> '/gcc/configure: line 29: ` ;;
>
> when it is invoked.
>
> Any files damaged in this way can be fixed with:
> $ git config core.autocrlf false
> $ git reset
> $ git checkout .
>
> But, it's better to simply avoid this problem in the first place.
> This behavior is never helpful or desired for gcc.

For files added/edited on Windows does this then also strip the \r
(upon which action?)?  Otherwise I think this looks good but I'm not
a git expert.
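
(As an editorial aside, untested here: "git ls-files --eol" reports the
line endings git has recorded for each tracked file, which is a quick
way to check whether any \r made it into the index or worktree.)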

Richard.

> Signed-off-by: Peter Damianov 
> ---
>  .gitattributes | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/.gitattributes b/.gitattributes
> index e75bfc595bf..1e116987c98 100644
> --- a/.gitattributes
> +++ b/.gitattributes
> @@ -8,3 +8,6 @@ ChangeLog 
> whitespace=indent-with-non-tab,space-before-tab,trailing-space
>  # Use together with git config diff.md.xfuncname '^\(define.*$'
>  # which is run by contrib/gcc-git-customization.sh too.
>  *.md diff=md
> +
> +# Disable lf -> crlf translation on windows.
> +* -crlf
> --
> 2.39.2
>


Re: [V2 PATCH] Don't reduce estimated unrolled size for innermost loop at cunrolli.

2024-05-23 Thread Richard Biener
On Wed, May 22, 2024 at 7:07 AM liuhongt  wrote:
>
> >> Hard to find a default value satisfying all testcases.
> >> some require loop unroll with 7 insns increment, some don't want loop
> >> unroll w/ 5 insn increment.
> >> The original 2/3 reduction happened to meet all those testcases(or the
> >> testcases are constructed based on the old 2/3).
> >> Can we define the parameter as the size of the loop, below the size we
> >> still do the reduction, so the small loop can be unrolled?
>
> >Yeah, that's also a sensible possibility.  Does it work to have a parameter
> >for the unrolled body size?  Thus, amend the existing
> >--param max-completely-peeled-insns with a --param
> >max-completely-peeled-insns-nogrowth?
>
> Update V2:
> It's still hard to find a default value for loop body size. So I move the
> 2 / 3 reduction from estimated_unrolled_size to try_unroll_loop_completely.
> For the check of body size shrink, 2 / 3 reduction is added, so small loops
> can still be unrolled.
> For the check of comparison between body size and 
> param_max_completely_peeled_insns,
> 2 / 3 is conditionally added for loop->inner || !cunrolli.
> Then the patch avoid gcc testsuite regression, and also prevent big inner loop
> completely unrolled at cunrolli.
>
> --
>
> For the innermost loop, complete unrolling will most likely not be able
> to reduce the body size to 2/3 of the original. The current 2/3 reduction
> makes some of the larger loops completely unrolled during cunrolli, which
> then prevents them from being vectorized. It also increases register
> pressure. The patch moves the 2/3 reduction from estimated_unrolled_size
> to try_unroll_loop_completely.
>
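> For illustration (made-up numbers): a 16-insn inner loop body unrolled
> 16 times estimates to 256 insns; the old unconditional 2/3 reduction
> turns that into roughly 170, which passes the default
> --param max-completely-peeled-insns=200 and is therefore fully unrolled
> at cunrolli, while the unreduced 256-insn estimate would reject it.
>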
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> PR tree-optimization/112325
> * tree-ssa-loop-ivcanon.cc (estimated_unrolled_size): Move the
> 2 / 3 loop body size reduction to ..
> (try_unroll_loop_completely): .. here, add it for the check of
> body size shrink, and the check of comparison against
> param_max_completely_peeled_insns when
> (!cunrolli || loop->inner).
> (canonicalize_loop_induction_variables): Add new parameter
> cunrolli and pass down.
> (tree_unroll_loops_completely_1): Ditto.
> (tree_unroll_loops_completely): Ditto.
> (canonicalize_induction_variables): Handle new parameter.
> (pass_complete_unrolli::execute): Ditto.
> (pass_complete_unroll::execute): Ditto.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/tree-ssa/pr112325.c: New test.
> * gcc.dg/vect/pr69783.c: Add extra option --param
> max-completely-peeled-insns=300.
> ---
>  gcc/testsuite/gcc.dg/tree-ssa/pr112325.c | 57 
>  gcc/testsuite/gcc.dg/vect/pr69783.c  |  2 +-
>  gcc/tree-ssa-loop-ivcanon.cc | 45 ++-
>  3 files changed, 83 insertions(+), 21 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
> new file mode 100644
> index 000..14208b3e7f8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
> @@ -0,0 +1,57 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-cunrolli-details" } */
> +
> +typedef unsigned short ggml_fp16_t;
> +static float table_f32_f16[1 << 16];
> +
> +inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
> +unsigned short s;
> +__builtin_memcpy(&s, &f, sizeof(unsigned short));
> +return table_f32_f16[s];
> +}
> +
> +typedef struct {
> +ggml_fp16_t d;
> +ggml_fp16_t m;
> +unsigned char qh[4];
> +unsigned char qs[32 / 2];
> +} block_q5_1;
> +
> +typedef struct {
> +float d;
> +float s;
> +char qs[32];
> +} block_q8_1;
> +
> +void ggml_vec_dot_q5_1_q8_1(const int n, float * restrict s, const void * 
> restrict vx, const void * restrict vy) {
> +const int qk = 32;
> +const int nb = n / qk;
> +
> +const block_q5_1 * restrict x = vx;
> +const block_q8_1 * restrict y = vy;
> +
> +float sumf = 0.0;
> +
> +for (int i = 0; i < nb; i++) {
> +unsigned qh;
> +__builtin_memcpy(&qh, x[i].qh, sizeof(qh));
> +
> +int sumi = 0;
> +
> +for (int j = 0; j < qk/2; ++j) {
> +const unsigned char xh_0 = ((qh >> (j + 0)) << 4) & 0x10;
> +const unsigned char xh_1 = ((qh >> (j + 12)) ) & 0x10;
> +
> +const int x0 = (x[i].qs[j] & 0xF) | xh_0;
> +const int x1 = (x[i].qs[j] >> 4) | xh_1;
> +
> +sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
> +}
> +
> +sumf += (ggml_lookup_fp16_to_fp32(x[i].d)*y[i].d)*sumi + 
> ggml_lookup_fp16_to_fp32(x[i].m)*y[i].s;
> +}
> +
> +*s = sumf;
> +}
> +
> +/* { dg-final { scan-tree-dump {(?n)Not unrolling loop [1-9] \(--param 
> max-completely-peel-times limit 

Re: [PATCH v4] Match: Add overloaded types_match to avoid code dup [NFC]

2024-05-23 Thread Richard Biener
On Thu, May 23, 2024 at 2:24 AM  wrote:
>
> From: Pan Li 
>
> There are various match patterns for SAT related cases, and there is
> some duplicated code to check that the dest, op_0 and op_1 have the
> same tree type, i.e. a ternary tree type match.  Thus, add an
> overloaded types_match function to do this and avoid match code
> duplication.
>
> The below test suites are passed for this patch:
> * The rv64gcv fully regression test.
> * The x86 bootstrap test.
> * The x86 regression test.
>
> gcc/ChangeLog:
>
> * generic-match-head.cc (types_match): Add overloaded types_match
> for 3 types.
> * gimple-match-head.cc (types_match): Ditto.
> * match.pd: Leverage overloaded types_match.
>
> Signed-off-by: Pan Li 
> ---
>  gcc/generic-match-head.cc | 14 ++
>  gcc/gimple-match-head.cc  | 14 ++
>  gcc/match.pd  | 30 ++
>  3 files changed, 38 insertions(+), 20 deletions(-)
>
> diff --git a/gcc/generic-match-head.cc b/gcc/generic-match-head.cc
> index 0d3f648fe8d..8d8ecfaeb1d 100644
> --- a/gcc/generic-match-head.cc
> +++ b/gcc/generic-match-head.cc
> @@ -59,6 +59,20 @@ types_match (tree t1, tree t2)
>return TYPE_MAIN_VARIANT (t1) == TYPE_MAIN_VARIANT (t2);
>  }
>
> +/* Routine to determine if the types T1, T2 and T3 are effectively
> +   the same for GENERIC.  If T1, T2 or T3 is not a type, the test
> +   applies to their TREE_TYPE.  */
> +
> +static inline bool
> +types_match (tree t1, tree t2, tree t3)
> +{
> +  t1 = TYPE_P (t1) ? t1 : TREE_TYPE (t1);
> +  t2 = TYPE_P (t2) ? t2 : TREE_TYPE (t2);
> +  t3 = TYPE_P (t3) ? t3 : TREE_TYPE (t3);

the above three lines are redundant.
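
i.e. the whole body can then be just (a sketch of the requested change):

  static inline bool
  types_match (tree t1, tree t2, tree t3)
  {
    return types_match (t1, t2) && types_match (t2, t3);
  }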

> +  return types_match (t1, t2) && types_match (t2, t3);
> +}
> +
>  /* Return if T has a single use.  For GENERIC, we assume this is
> always true.  */
>
> diff --git a/gcc/gimple-match-head.cc b/gcc/gimple-match-head.cc
> index 5f8a1a1ad8e..2b7f746ab13 100644
> --- a/gcc/gimple-match-head.cc
> +++ b/gcc/gimple-match-head.cc
> @@ -79,6 +79,20 @@ types_match (tree t1, tree t2)
>return types_compatible_p (t1, t2);
>  }
>
> +/* Routine to determine if the types T1, T2 and T3 are effectively
> +   the same for GIMPLE.  If T1, T2 or T3 is not a type, the test
> +   applies to their TREE_TYPE.  */
> +
> +static inline bool
> +types_match (tree t1, tree t2, tree t3)
> +{
> +  t1 = TYPE_P (t1) ? t1 : TREE_TYPE (t1);
> +  t2 = TYPE_P (t2) ? t2 : TREE_TYPE (t2);
> +  t3 = TYPE_P (t3) ? t3 : TREE_TYPE (t3);

likewise.

OK with those removed.

Richard.

> +  return types_match (t1, t2) && types_match (t2, t3);
> +}
> +
>  /* Return if T has a single use.  For GIMPLE, we also allow any
> non-SSA_NAME (ie constants) and zero uses to cope with uses
> that aren't linked up yet.  */
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 35e3d82b131..7081d76d56a 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -3048,38 +3048,28 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  /* Unsigned Saturation Add */
>  (match (usadd_left_part_1 @0 @1)
>   (plus:c @0 @1)
> - (if (INTEGRAL_TYPE_P (type)
> -  && TYPE_UNSIGNED (TREE_TYPE (@0))
> -  && types_match (type, TREE_TYPE (@0))
> -  && types_match (type, TREE_TYPE (@1)
> + (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> +  && types_match (type, @0, @1
>
>  (match (usadd_left_part_2 @0 @1)
>   (realpart (IFN_ADD_OVERFLOW:c @0 @1))
> - (if (INTEGRAL_TYPE_P (type)
> -  && TYPE_UNSIGNED (TREE_TYPE (@0))
> -  && types_match (type, TREE_TYPE (@0))
> -  && types_match (type, TREE_TYPE (@1)
> + (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> +  && types_match (type, @0, @1
>
>  (match (usadd_right_part_1 @0 @1)
>   (negate (convert (lt (plus:c @0 @1) @0)))
> - (if (INTEGRAL_TYPE_P (type)
> -  && TYPE_UNSIGNED (TREE_TYPE (@0))
> -  && types_match (type, TREE_TYPE (@0))
> -  && types_match (type, TREE_TYPE (@1)
> + (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> +  && types_match (type, @0, @1
>
>  (match (usadd_right_part_1 @0 @1)
>   (negate (convert (gt @0 (plus:c @0 @1
> - (if (INTEGRAL_TYPE_P (type)
> -  && TYPE_UNSIGNED (TREE_TYPE (@0))
> -  && types_match (type, TREE_TYPE (@0))
> -  && types_match (type, TREE_TYPE (@1)
> + (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> +  && types_match (type, @0, @1
>
>  (match (usadd_right_part_2 @0 @1)
>   (negate (convert (ne (imagpart (IFN_ADD_OVERFLOW:c @0 @1)) integer_zerop)))
> - (if (INTEGRAL_TYPE_P (type)
> -  && TYPE_UNSIGNED (TREE_TYPE (@0))
> -  && types_match (type, TREE_TYPE (@0))
> -  && types_match (type, TREE_TYPE (@1)
> + (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> +  && types_match (type, @0, @1
>
>  /* We cannot merge or overload usadd_left_part_1 and usadd_left_part_2
> because the sub part of left_part_2 cannot work with right_part_1.
> --
> 2.34.1
>


Re: [RFC][PATCH] PR tree-optimization/109071 - -Warray-bounds false positive warnings due to code duplication from jump threading

2024-05-23 Thread Richard Biener
On Wed, May 22, 2024 at 8:53 PM Qing Zhao  wrote:
>
>
>
> > On May 22, 2024, at 03:38, Richard Biener  
> > wrote:
> >
> > On Tue, May 21, 2024 at 11:36 PM David Malcolm  wrote:
> >>
> >> On Tue, 2024-05-21 at 15:13 +, Qing Zhao wrote:
> >>> Thanks for the comments and suggestions.
> >>>
> >>>> On May 15, 2024, at 10:00, David Malcolm 
> >>>> wrote:
> >>>>
> >>>> On Tue, 2024-05-14 at 15:08 +0200, Richard Biener wrote:
> >>>>> On Mon, 13 May 2024, Qing Zhao wrote:
> >>>>>
> >>>>>> -Warray-bounds is an important option to enable linux kernal to
> >>>>>> keep
> >>>>>> the array out-of-bound errors out of the source tree.
> >>>>>>
> >>>>>> However, due to the false positive warnings reported in
> >>>>>> PR109071
> >>>>>> (-Warray-bounds false positive warnings due to code duplication
> >>>>>> from
> >>>>>> jump threading), -Warray-bounds=1 cannot be added on by
> >>>>>> default.
> >>>>>>
> >>>>>> Although it's impossible to elinimate all the false positive
> >>>>>> warnings
> >>>>>> from -Warray-bounds=1 (See PR104355 Misleading -Warray-bounds
> >>>>>> documentation says "always out of bounds"), we should minimize
> >>>>>> the
> >>>>>> false positive warnings in -Warray-bounds=1.
> >>>>>>
> >>>>>> The root reason for the false positive warnings reported in
> >>>>>> PR109071 is:
> >>>>>>
> >>>>>> When the thread jump optimization tries to reduce the # of
> >>>>>> branches
> >>>>>> inside the routine, sometimes it needs to duplicate the code
> >>>>>> and
> >>>>>> split into two conditional pathes. for example:
> >>>>>>
> >>>>>> The original code:
> >>>>>>
> >>>>>> void sparx5_set (int * ptr, struct nums * sg, int index)
> >>>>>> {
> >>>>>>  if (index >= 4)
> >>>>>>warn ();
> >>>>>>  *ptr = 0;
> >>>>>>  *val = sg->vals[index];
> >>>>>>  if (index >= 4)
> >>>>>>warn ();
> >>>>>>  *ptr = *val;
> >>>>>>
> >>>>>>  return;
> >>>>>> }
> >>>>>>
> >>>>>> With the thread jump, the above becomes:
> >>>>>>
> >>>>>> void sparx5_set (int * ptr, struct nums * sg, int index)
> >>>>>> {
> >>>>>>  if (index >= 4)
> >>>>>>{
> >>>>>>  warn ();
> >>>>>>  *ptr = 0; // Code duplication since "warn" does return;
> >>>>>>  *val = sg->vals[index];   // same for this line.
> >>>>>>// In this path, since it's under the
> >>>>>>// condition "index >= 4", the compiler
> >>>>>>// knows the value of "index" is larger
> >>>>>>// than 4, therefore the out-of-bound
> >>>>>>// warning.
> >>>>>>  warn ();
> >>>>>>}
> >>>>>>  else
> >>>>>>{
> >>>>>>  *ptr = 0;
> >>>>>>  *val = sg->vals[index];
> >>>>>>}
> >>>>>>  *ptr = *val;
> >>>>>>  return;
> >>>>>> }
> >>>>>>
> >>>>>> We can see, after the thread jump optimization, the # of
> >>>>>> branches
> >>>>>> inside
> >>>>>> the routine "sparx5_set" is reduced from 2 to 1, however,  due
> >>>>>> to
> >>>>>> the
> >>>>>> code duplication (which is needed for the correctness of the
> >>>>>>

[PATCH] tree-optimization/115199 - fix PTA constraint processing for &ANYTHING LHS

2024-05-23 Thread Richard Biener
When processing a &ANYTHING = X constraint we treat it as *ANYTHING = X
during constraint processing but then end up recording it as
&ANYTHING = X anyway, breaking constraint graph building.  This is
because we only update the local copy of the LHS and not the constraint
itself.

Bootstrap and regtest running on x86_64-unknown-linux-gnu.

PR tree-optimization/115199
* tree-ssa-structalias.cc (process_constraint): Also
record &ANYTHING = X as *ANYTHING = X in the end.

* gcc.dg/torture/pr115199.c: New testcase.
---
 gcc/testsuite/gcc.dg/torture/pr115199.c | 24 
 gcc/tree-ssa-structalias.cc |  2 +-
 2 files changed, 25 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/torture/pr115199.c

diff --git a/gcc/testsuite/gcc.dg/torture/pr115199.c 
b/gcc/testsuite/gcc.dg/torture/pr115199.c
new file mode 100644
index 000..981a7330b32
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr115199.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+
+struct b {
+  char *volatile c;
+};
+struct b * __attribute__((noipa))
+d()
+{
+  char *e;
+  struct b *b = __builtin_malloc(sizeof(b));
+  void *f = __builtin_malloc(1);
+
+  e = __builtin_memcpy(f, "z", 1);
+  b->c = e;
+  return b;
+}
+
+int main()
+{
+  struct b b = *d();
+  if (b.c[0] != 'z')
+__builtin_abort();
+  return 0;
+}
diff --git a/gcc/tree-ssa-structalias.cc b/gcc/tree-ssa-structalias.cc
index 0e9423a78ec..a39b36c146e 100644
--- a/gcc/tree-ssa-structalias.cc
+++ b/gcc/tree-ssa-structalias.cc
@@ -3104,7 +3104,7 @@ process_constraint (constraint_t t)
  it here by turning it into *ANYTHING.  */
   if (lhs.type == ADDRESSOF
   && lhs.var == anything_id)
-lhs.type = DEREF;
+t->lhs.type = lhs.type = DEREF;
 
   /* ADDRESSOF on the lhs is invalid.  */
   gcc_assert (lhs.type != ADDRESSOF);
-- 
2.35.3


[PATCH] tree-optimization/115138 - ptr-vs-ptr and FUNCTION_DECLs

2024-05-23 Thread Richard Biener
I failed to realize we do not represent FUNCTION_DECLs or LABEL_DECLs
in vars explicitly and thus have to compare pt.vars_contains_nonlocal.

Bootstrapped and tested with bootstrap-O3 and D to verify the
comparison fail is fixed.  I'm now doing a regular bootstrap and
regtest with the volatile fix and will push afterwards.

PR tree-optimization/115138
* tree-ssa-alias.cc (ptrs_compare_unequal): Make sure
pt.vars_contains_nonlocal differs since we do not represent
FUNCTION_DECLs or LABEL_DECLs in vars explicitly.

* gcc.dg/torture/pr115138.c: New testcase.
---
 gcc/testsuite/gcc.dg/torture/pr115138.c | 28 +
 gcc/tree-ssa-alias.cc   |  6 ++
 2 files changed, 34 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/torture/pr115138.c

diff --git a/gcc/testsuite/gcc.dg/torture/pr115138.c 
b/gcc/testsuite/gcc.dg/torture/pr115138.c
new file mode 100644
index 000..6becaecbaff
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr115138.c
@@ -0,0 +1,28 @@
+/* { dg-do run } */
+
+int foo (int) {}
+int bar (int) {}
+
+typedef int (*pred)(int);
+
+int x, y;
+pred A () { if (x) return foo; else return bar; }
+pred B () { if (y) return foo; else return bar; }
+int __attribute__((noipa)) baz()
+{
+  pred a = A();
+  pred b = B();
+  if (a != b)
+return 42;
+  return 0;
+}
+
+int main()
+{
+  if (baz () != 0)
+__builtin_abort ();
+  y = 1;
+  if (baz () != 42)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-ssa-alias.cc b/gcc/tree-ssa-alias.cc
index d64d6d02f4a..1a91d63a31e 100644
--- a/gcc/tree-ssa-alias.cc
+++ b/gcc/tree-ssa-alias.cc
@@ -501,6 +501,12 @@ ptrs_compare_unequal (tree ptr1, tree ptr2)
  || pi2->pt.vars_contains_interposable)
return false;
  if ((!pi1->pt.null || !pi2->pt.null)
+ /* ???  We do not represent FUNCTION_DECL and LABEL_DECL
+in pt.vars but only set pt.vars_contains_nonlocal.  This
+makes compares involving those and other nonlocals
+imprecise.  */
+ && (!pi1->pt.vars_contains_nonlocal
+ || !pi2->pt.vars_contains_nonlocal)
  && (!pt_solution_includes_const_pool (&pi1->pt)
  || !pt_solution_includes_const_pool (&pi2->pt)))
return !pt_solutions_intersect (&pi1->pt, &pi2->pt);
-- 
2.35.3


Re: [x86_64 PATCH] Correct insn_cost of movabsq.

2024-05-22 Thread Richard Biener



> On 22.05.2024 at 17:30, Uros Bizjak wrote:
> 
> On Wed, May 22, 2024 at 5:15 PM Roger Sayle  
> wrote:
>> 
>> This single line patch fixes a strange quirk/glitch in i386's rtx_costs,
>> which considers an instruction loading a 64-bit constant to be significantly
>> cheaper than loading a 32-bit (or smaller) constant.
>> 
>> Consider the two functions:
>> unsigned long long foo() { return 0x0123456789abcdefULL; }
>> unsigned int bar() { return 10; }
>> 
>> and the corresponding lines from combine's dump file:
>>  insn_cost 1 for #: r98:DI=0x123456789abcdef
>>  insn_cost 4 for #: ax:SI=0xa
>> 
>> The same issue can be seen in -dP assembler output.
>>  movabsq $81985529216486895, %rax# 5  [c=1 l=10]  *movdi_internal/4
>> 
>> The problem is that pattern_costs interpretation of rtx_costs contains
>> "return cost > 0 ? cost : COSTS_N_INSNS (1)" where a zero value (for
>> example a register or small immediate constant) is considered special,
>> and equivalent to a single instruction, but all other values are treated
>> as verbatim.

A zero cost is interpreted as „not implemented“ and assigned a cost of 1,
assuming a COSTS_N_INSNS basis.
IMO a bit bogus but I didn’t dare to argue further with Segher.

Richard 


>>  Hence to make x86_64's 10-byte long movabsq instruction
>> slightly more expensive than a simple constant, rtx_costs needs to
>> return COSTS_N_INSNS(1)+1 and not 1.  With this change, the insn_cost
>> of movabsq is the intended value 5:
>>  insn_cost 5 for #: r98:DI=0x123456789abcdef
>> and
>>  movabsq $81985529216486895, %rax# 5  [c=5 l=10]  *movdi_internal/4
>> 
>> 
>> [I'd originally tried fixing this by adding a ix86_insn_cost target
>> hook, but the testsuite is very sensitive to the costing of insns].
>> 
>> 
>> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
>> and make -k check, both with and without --target_board=unix{-m32}
>> with no new failures.  Ok for mainline?
>> 
>> 
>> 2024-05-22  Roger Sayle  
>> 
>> gcc/ChangeLog
>>* config/i386/i386.cc (ix86_rtx_costs) :
>>A CONST_INT that isn't x86_64_immediate_operand requires an extra
>>(expensive) movabsq insn to load, so return COSTS_N_INSNS (1) + 1.
> 
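
For reference, a minimal sketch of the cost interpretation being discussed
(illustrative only; the real logic is pattern_cost in gcc/rtlanal.cc, and
COSTS_N_INSNS comes from gcc/rtl.h):

#define COSTS_N_INSNS(N) ((N) * 4)   /* as defined in gcc/rtl.h */

/* A non-positive rtx cost is treated as "no estimate" and bumped to
   one instruction, so an rtx_costs hook returning 1 makes the 10-byte
   movabsq look nearly free, while COSTS_N_INSNS (1) + 1 yields the
   intended insn_cost of 5.  */
static int
pattern_cost_sketch (int rtx_cost_result)
{
  return rtx_cost_result > 0 ? rtx_cost_result : COSTS_N_INSNS (1);
}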

Re: [PATCH 3/4] Avoid splitting store dataref groups during SLP discovery

2024-05-22 Thread Richard Biener
On Tue, 21 May 2024, Richard Sandiford wrote:

> Richard Biener  writes:
> > The following avoids splitting store dataref groups during SLP
> > discovery but instead forces (eventually single-lane) consecutive
> > lane SLP discovery for all lanes of the group, creating VEC_PERM
> > SLP nodes merging them so the store will always cover the whole group.
> >
> > With this for example
> >
> > int x[1024], y[1024], z[1024], w[1024];
> > void foo (void)
> > {
> >   for (int i = 0; i < 256; i++)
> > {
> >   x[4*i+0] = y[2*i+0];
> >   x[4*i+1] = y[2*i+1];
> >   x[4*i+2] = z[i];
> >   x[4*i+3] = w[i];
> > }
> > }
> >
> > which was previously using hybrid SLP can now be fully SLPed and
> 
> Nice!
> 
> > SSE code generated looks better (but of course you never know,
> > I didn't actually benchmark).  We of course need a VF of four here.
> >
> > .L2:
> > movdqa  z(%rax), %xmm0
> > movdqa  w(%rax), %xmm4
> > movdqa  y(%rax,%rax), %xmm2
> > movdqa  y+16(%rax,%rax), %xmm1
> > movdqa  %xmm0, %xmm3
> > punpckhdq   %xmm4, %xmm0
> > punpckldq   %xmm4, %xmm3
> > movdqa  %xmm2, %xmm4
> > shufps  $238, %xmm3, %xmm2
> > movaps  %xmm2, x+16(,%rax,4)
> > movdqa  %xmm1, %xmm2
> > shufps  $68, %xmm3, %xmm4
> > shufps  $68, %xmm0, %xmm2
> > movaps  %xmm4, x(,%rax,4)
> > shufps  $238, %xmm0, %xmm1
> > movaps  %xmm2, x+32(,%rax,4)
> > movaps  %xmm1, x+48(,%rax,4)
> > addq$16, %rax
> > cmpq$1024, %rax
> > jne .L2
> >
> > The extra permute nodes merging distinct branches of the SLP
> > tree might be unexpected for some code, esp. since
> > SLP_TREE_REPRESENTATIVE cannot be meaningfully set and we
> > cannot populate SLP_TREE_SCALAR_STMTS or SLP_TREE_SCALAR_OPS
> > consistently as we can have a mix of both.
> >
> > The patch keeps the sub-trees formed from consecutive lanes but that's
> > in principle not necessary if we for example have an even/odd
> > split which now would result in N single-lane sub-trees.  That's
> > left for future improvements.
> >
> > The interesting part is how VLA vector ISAs handle merging of
> > two vectors that's not trivial even/odd merging.  The strategy
> > of how to build the permute tree might need adjustments for that
> > (in the end splitting each branch to single lanes and then doing
> > even/odd merging would be the brute-force fallback).  Not sure
> > how much we can or should rely on the SLP optimize pass to handle
> > this.
> 
> Yeah, I think we'll have to play it by ear.  It might involve tweaking
> the order in which we "reduce" the VEC_PERM_EXPRs.  E.g. in the above
> example, my guess is that it would be better to reduce the z/w part
> first and then permute that with y, whereas it looks like the patch
> always goes left-to-right.

The patch reduces the two inputs with the least number of lanes
recursively.  And within that from left-to-right.  That should keep
us in the bound of two input vectors for one output vector.  It
should also resemble classical interleaving when we have N single
lanes.
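
For the example above that would merge z and w (one lane each) first and
only then combine the result with the two-lane y part, schematically:

  zw  = VEC_PERM <z, w>      /* the two least-lane inputs merged first */
  out = VEC_PERM <y01, zw>   /* then combined with the wider input */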

> The patch LGTM FWIW.

I've sent out a v2 for the CIs and pushed the bugfix parts of the
series.  I hope to see that riscv isn't left with 100s of FAILs
because of the change and, if that looks green, push and polish up
what I have for the load side.

> I suppose this does further hard-code the assumption that the vector
> type is uniquely determined by the element type (and so we can safely
> assume that everything has the same vector type as the first split node).
> But that's pretty much pervasive, and not easy to solve until we're
> serious about putting some infrastructre in place for it.  It just
> caught me out when reading vector code for the first time in a while :)
>
> (E.g. in the above example, the y vector could eventually be double the
> z & w vectors.)

Yeah, you might have noticed the RFC patch series I sent out last
year where I tried to get rid of this constraint.  I stopped implementing
when I figured it should work but doing all-SLP first really is
important.

Richard.
 
> Thanks,
> Richard
> 
> > * tree-vect-slp.cc (vect_build_slp_instance): Do not split
> > store dataref groups on loop SLP discovery failure but create
> > a single SLP instance for the stores but branch to SLP sub-trees
> > and merge with a series of VEC_PERM nodes.
> > ---
> >  gcc/

Re: [PATCH v1 1/2] Match: Support branch form for unsigned SAT_ADD

2024-05-22 Thread Richard Biener
On Mon, May 20, 2024 at 1:50 PM Tamar Christina  wrote:
>
> Hi Pan,
>
> > -Original Message-
> > From: pan2...@intel.com 
> > Sent: Monday, May 20, 2024 12:01 PM
> > To: gcc-patches@gcc.gnu.org
> > Cc: juzhe.zh...@rivai.ai; kito.ch...@gmail.com; Tamar Christina
> > ; richard.guent...@gmail.com; Pan Li
> > 
> > Subject: [PATCH v1 1/2] Match: Support branch form for unsigned SAT_ADD
> >
> > From: Pan Li 
> >
> > This patch would like to support the branch form for unsigned
> > SAT_ADD.  For example as below:
> >
> > uint64_t
> > sat_add (uint64_t x, uint64_t y)
> > {
> >   return (uint64_t) (x + y) >= x ? (x + y) : -1;
> > }
> >
> > Different to the branchless version,  we leverage the simplify to
> > convert the branch version of SAT_ADD into branchless if and only
> > if the backend supports the IFN_SAT_ADD.  Thus,  the backend has
> > the ability to choose branch or branchless implementation of .SAT_ADD.
> > For example,  some target can take care of branches code more optimally.
> >
> > When the target implement the IFN_SAT_ADD for unsigned and before this
> > patch:
> > uint64_t sat_add_u_1_uint64_t (uint64_t x, uint64_t y)
> > {
> >   long unsigned int _1;
> >   uint64_t _2;
> >   __complex__ long unsigned int _6;
> >   long unsigned int _7;
> >
> > ;;   basic block 2, loop depth 0
> > ;;pred:   ENTRY
> >   _6 = .ADD_OVERFLOW (x_3(D), y_4(D));
> >   _1 = REALPART_EXPR <_6>;
> >   _7 = IMAGPART_EXPR <_6>;
> >   if (_7 == 0)
> > goto ; [65.00%]
> >   else
> > goto ; [35.00%]
> > ;;succ:   4
> > ;;3
> >
> > ;;   basic block 3, loop depth 0
> > ;;pred:   2
> > ;;succ:   4
> >
> > ;;   basic block 4, loop depth 0
> > ;;pred:   3
> > ;;2
> >   # _2 = PHI <18446744073709551615(3), _1(2)>
> >   return _2;
> > ;;succ:   EXIT
> >
> > }
> >
> > After this patch:
> > uint64_t sat_add (uint64_t x, uint64_t y)
> > {
> >   long unsigned int _9;
> >
> > ;;   basic block 2, loop depth 0
> > ;;pred:   ENTRY
> >   _9 = .SAT_ADD (x_3(D), y_4(D)); [tail call]
> >   return _9;
> > ;;succ:   EXIT
> > }
> >
> > The below test suites are passed for this patch:
> > * The x86 bootstrap test.
> > * The x86 fully regression test.
> > * The riscv fully regression test.
> >
> > gcc/ChangeLog:
> >
> >   * match.pd: Add new simplify to convert branch SAT_ADD into
> >   branchless,  if and only if backend implement the IFN.
> >
> > Signed-off-by: Pan Li 
> > ---
> >  gcc/match.pd | 18 ++
> >  1 file changed, 18 insertions(+)
> >
> > diff --git a/gcc/match.pd b/gcc/match.pd
> > index 0f9c34fa897..0547b57b3a3 100644
> > --- a/gcc/match.pd
> > +++ b/gcc/match.pd
> > @@ -3094,6 +3094,24 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> >  (match (unsigned_integer_sat_add @0 @1)
> >   (bit_ior:c (usadd_left_part_2 @0 @1) (usadd_right_part_2 @0 @1)))
> >
> > +#if GIMPLE
> > +
> > +/* Simplify the branch version of SAT_ADD into branchless if and only if
> > +   the backend has supported the IFN_SAT_ADD.  Thus, the backend has the
> > +   ability to choose branch or branchless implementation of .SAT_ADD.  */

This comment or part of the description above should say this simplifies

   (x + y) >= x ? (x + y) : -1

as

  (x + y) | (-(typeof(x))((x + y) < x))
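
In plain C the two forms are (a minimal sketch for illustration):

#include <stdint.h>

uint64_t sat_add_branch (uint64_t x, uint64_t y)
{
  return (x + y) >= x ? (x + y) : -1;        /* branch form */
}

uint64_t sat_add_branchless (uint64_t x, uint64_t y)
{
  uint64_t sum = x + y;
  return sum | (-(uint64_t)(sum < x));       /* branchless form */
}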

> > +(simplify
> > + (cond (ge (plus:c@2 @0 @1) @0) @2 integer_minus_onep)
> > +  (if (direct_internal_fn_supported_p (IFN_SAT_ADD, type,
> > OPTIMIZE_FOR_BOTH))
> > +   (bit_ior @2 (negate (convert (lt @2 @0))
> > +
> > +(simplify
> > + (cond (le @0 (plus:c@2 @0 @1)) @2 integer_minus_onep)
> > +  (if (direct_internal_fn_supported_p (IFN_SAT_ADD, type,
> > OPTIMIZE_FOR_BOTH))
> > +   (bit_ior @2 (negate (convert (lt @2 @0))

and this should probably be (gt @2 @0)?

This misses INTEGER_TYPE_P constraints and it's supposed to be only
for TYPE_UNSIGNED?

> > +
> > +#endif
>
> Thanks, this looks good to me!
>
> I'll leave it up to Richard to approve,
> Richard: The reason for the direct_internal_fn_supported_p is because some
> targets said that they currently handle the branch version better due to
> the lack of some types.  At the time I reasoned it was just a target
> expansion bug but didn't hear anything.
>
> To be honest, it feels to me like we should do this unconditionally, and
> just have the targets that get a faster branch version handle it during
> expand, since the patch series provides a canonicalized version now.

I'm not sure this is a good canonical form.

__imag .ADD_OVERFLOW (x, y) ? __real .ADD_OVERFLOW (x, y) : -1

would be better IMO.  It can be branch-less by using a COND_EXPR.
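
Roughly, in GIMPLE-like pseudocode (a sketch, reusing the SSA names from
the dump above):

  _6 = .ADD_OVERFLOW (x, y);
  _1 = REALPART_EXPR <_6>;
  _7 = IMAGPART_EXPR <_6>;
  _8 = _7 != 0 ? 18446744073709551615 : _1;  /* COND_EXPR, no control flow */
  return _8;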

> This means we can also better support targets that have the vector optab but 
> not the scalar one
> as the above check would fail for these targets.
>
> What do you think?
>
> Thanks,
> Tamar
>
> > +
> >  /* x >  y  &&  x != XXX_MIN  -->  x > y
> > x >  y  &&  x == XXX_MIN  -->  false . */
> >  (for eqne (eq ne)
> > --
> > 2.34.1
>


Re: [PATCH v1 1/2] Match: Support __builtin_add_overflow for branchless unsigned SAT_ADD

2024-05-22 Thread Richard Biener
On Sun, May 19, 2024 at 8:37 AM  wrote:
>
> From: Pan Li 
>
> This patch would like to support the branchless form for unsigned
> SAT_ADD when leverage __builtin_add_overflow.  For example as below:
>
> uint64_t sat_add_u(uint64_t x, uint64_t y)
> {
>   uint64_t ret;
>   uint64_t overflow = __builtin_add_overflow (x, y, &ret);
>
>   return (uint64_t)(-overflow) | ret;
> }
>
> Before this patch:
>
> uint64_t sat_add_u (uint64_t x, uint64_t y)
> {
>   long unsigned int _1;
>   long unsigned int _2;
>   long unsigned int _3;
>   __complex__ long unsigned int _6;
>   uint64_t _8;
>
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   _6 = .ADD_OVERFLOW (x_4(D), y_5(D));
>   _1 = REALPART_EXPR <_6>;
>   _2 = IMAGPART_EXPR <_6>;
>   _3 = -_2;
>   _8 = _1 | _3;
>   return _8;
> ;;succ:   EXIT
>
> }
>
> After this patch:
>
> uint64_t sat_add_u (uint64_t x, uint64_t y)
> {
>   uint64_t _8;
>
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   _8 = .SAT_ADD (x_4(D), y_5(D)); [tail call]
>   return _8;
> ;;succ:   EXIT
>
> }
>
> The below tests suite are passed for this patch.
> * The rv64gcv fully regression test.
> * The x86 bootstrap test.
> * The x86 fully regression test.
>
> gcc/ChangeLog:
>
> * match.pd: Add SAT_ADD right part 2 for __builtin_add_overflow.
>
> Signed-off-by: Pan Li 
> ---
>  gcc/match.pd | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index b291e34bbe4..5328e846aff 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -3064,6 +3064,10 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>   (negate (convert (ne (imagpart (IFN_ADD_OVERFLOW:c @0 @1)) integer_zerop)))
>   (if (TYPE_UNSIGNED (type) && integer_types_ternary_match (type, @0, @1
>
> +(match (usadd_right_part_2 @0 @1)
> + (negate (imagpart (IFN_ADD_OVERFLOW:c @0 @1)))
> + (if (TYPE_UNSIGNED (type) && integer_types_ternary_match (type, @0, @1
> +

Can you merge this with the patch that makes use of the
usadd_right_part_2 match?
It's difficult to review on its own.

>  /* We cannot merge or overload usadd_left_part_1 and usadd_left_part_2
> because the sub part of left_part_2 cannot work with right_part_1.
> For example, left_part_2 pattern focus one .ADD_OVERFLOW but the
> --
> 2.34.1
>


Re: [PATCH v2] Match: Support __builtin_add_overflow branch form for unsigned SAT_ADD

2024-05-22 Thread Richard Biener
On Wed, May 22, 2024 at 3:17 AM  wrote:
>
> From: Pan Li 
>
> This patch would like to support the __builtin_add_overflow branch form for
> unsigned SAT_ADD.  For example as below:
>
> uint64_t
> sat_add (uint64_t x, uint64_t y)
> {
>   uint64_t ret;
>   return __builtin_add_overflow (x, y, &ret) ? -1 : ret;
> }
>
> Different to the branchless version,  we leverage the simplify to
> convert the branch version of SAT_ADD into branchless if and only
> if the backend has supported the IFN_SAT_ADD.  Thus,  the backend has
> the ability to choose branch or branchless implementation of .SAT_ADD.
> For example,  some target can take care of branches code more optimally.
>
> When the target implement the IFN_SAT_ADD for unsigned and before this
> patch:
>
> uint64_t sat_add (uint64_t x, uint64_t y)
> {
>   long unsigned int _1;
>   long unsigned int _2;
>   uint64_t _3;
>   __complex__ long unsigned int _6;
>
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   _6 = .ADD_OVERFLOW (x_4(D), y_5(D));
>   _2 = IMAGPART_EXPR <_6>;
>   if (_2 != 0)
> goto ; [35.00%]
>   else
> goto ; [65.00%]
> ;;succ:   4
> ;;3
>
> ;;   basic block 3, loop depth 0
> ;;pred:   2
>   _1 = REALPART_EXPR <_6>;
> ;;succ:   4
>
> ;;   basic block 4, loop depth 0
> ;;pred:   3
> ;;2
>   # _3 = PHI <_1(3), 18446744073709551615(2)>
>   return _3;
> ;;succ:   EXIT
> }
>
> After this patch:
> uint64_t sat_add (uint64_t x, uint64_t y)
> {
>   long unsigned int _12;
>
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   _12 = .SAT_ADD (x_4(D), y_5(D)); [tail call]
>   return _12;
> ;;succ:   EXIT
> }
>
> The below test suites are passed for this patch:
> * The x86 bootstrap test.
> * The x86 fully regression test.
> * The riscv fully regression test.

I'm not convinced we should match this during early if-conversion, should we?
The middle-end doesn't really know .SAT_ADD but some handling of
.ADD_OVERFLOW is present.

But please add a comment before the new pattern, esp. since it's
non-obvious that this is an improvement.

I suspect you rely on this form being recognized as .SAT_ADD later but
what prevents us from breaking this?  Why not convert it to .SAT_ADD
immediately?  If this is because the ISEL pass (or the widen-mult pass)
cannot handle PHIs then I would suggest to split out enough parts of
tree-ssa-phiopt.cc to be able to query match.pd for COND_EXPRs.

> gcc/ChangeLog:
>
> * match.pd: Add new simplify to convert branch SAT_ADD into
> branchless,  if and only if backend implement the IFN.
>
> Signed-off-by: Pan Li 
> ---
>  gcc/match.pd | 11 +++
>  1 file changed, 11 insertions(+)
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index cff67c84498..2dc77a46e67 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -3080,6 +3080,17 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  (match (unsigned_integer_sat_add @0 @1)
>   (bit_ior:c (usadd_left_part_2 @0 @1) (usadd_right_part_2 @0 @1)))
>
> +#if GIMPLE
> +
> +(simplify
> + (cond (ne (imagpart (IFN_ADD_OVERFLOW@2 @0 @1)) integer_zerop)
> +  integer_minus_onep (realpart @2))
> + (if (ternary_integer_types_match_p (type, @0, @1) && TYPE_UNSIGNED (type)
> +  && direct_internal_fn_supported_p (IFN_SAT_ADD, type, 
> OPTIMIZE_FOR_BOTH))
> +  (bit_ior (plus@3 @0 @1) (negate (convert (lt @3 @0))
> +
> +#endif
> +
>  /* x >  y  &&  x != XXX_MIN  -->  x > y
> x >  y  &&  x == XXX_MIN  -->  false . */
>  (for eqne (eq ne)
> --
> 2.34.1
>


Re: [PATCH v2] Match: Extract integer_types_ternary_match helper to avoid code dup [NFC]

2024-05-22 Thread Richard Biener
On Mon, May 20, 2024 at 1:00 PM  wrote:
>
> From: Pan Li 
>
> There are several match patterns for the SAT-related cases, and there will
> be some duplicated code to check that the dest, op_0 and op_1 have the
> same tree type, aka ternary tree-type matches.  Thus, extract one helper
> function to do this and avoid match code duplication.

I think it's more useful to add an overload to types_match with three
arguments and then use

 (if (INTEGRAL_TYPE_P (type)
  && types_match (type, TREE_TYPE (@0), TREE_TYPE (@1))
...
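
i.e. something like the following sketch, assuming it sits next to the
existing two-argument overloads in generic-match-head.cc and
gimple-match-head.cc (illustrative only):

/* Return true if T1, T2 and T3 are effectively the same type.  */

static inline bool
types_match (tree t1, tree t2, tree t3)
{
  return types_match (t1, t2) && types_match (t1, t3);
}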

Richard.

> The below test suites are passed for this patch:
> * The rv64gcv fully regression test.
> * The x86 bootstrap test.
> * The x86 regression test.
>
> gcc/ChangeLog:
>
> * generic-match-head.cc (integer_types_ternary_match): New helper
> function to check ternary tree type matches or not.
> * gimple-match-head.cc (integer_types_ternary_match): Ditto but
> for match.
> * match.pd: Leverage above helper function to avoid code dup.
>
> Signed-off-by: Pan Li 
> ---
>  gcc/generic-match-head.cc | 17 +
>  gcc/gimple-match-head.cc  | 17 +
>  gcc/match.pd  | 25 +
>  3 files changed, 39 insertions(+), 20 deletions(-)
>
> diff --git a/gcc/generic-match-head.cc b/gcc/generic-match-head.cc
> index 0d3f648fe8d..cdd48c7a5cc 100644
> --- a/gcc/generic-match-head.cc
> +++ b/gcc/generic-match-head.cc
> @@ -59,6 +59,23 @@ types_match (tree t1, tree t2)
>return TYPE_MAIN_VARIANT (t1) == TYPE_MAIN_VARIANT (t2);
>  }
>
> +/* Routine to determine if the types T1,  T2 and T3 are effectively
> +   the same integer type for GENERIC.  If T1,  T2 or T3 is not a type,
> +   the test applies to their TREE_TYPE.  */
> +
> +static inline bool
> +integer_types_ternary_match (tree t1, tree t2, tree t3)
> +{
> +  t1 = TYPE_P (t1) ? t1 : TREE_TYPE (t1);
> +  t2 = TYPE_P (t2) ? t2 : TREE_TYPE (t2);
> +  t3 = TYPE_P (t3) ? t3 : TREE_TYPE (t3);
> +
> +  if (!INTEGRAL_TYPE_P (t1) || !INTEGRAL_TYPE_P (t2) || !INTEGRAL_TYPE_P 
> (t3))
> +return false;
> +
> +  return types_match (t1, t2) && types_match (t1, t3);
> +}
> +
>  /* Return if T has a single use.  For GENERIC, we assume this is
> always true.  */
>
> diff --git a/gcc/gimple-match-head.cc b/gcc/gimple-match-head.cc
> index 5f8a1a1ad8e..91f2e56b8ef 100644
> --- a/gcc/gimple-match-head.cc
> +++ b/gcc/gimple-match-head.cc
> @@ -79,6 +79,23 @@ types_match (tree t1, tree t2)
>return types_compatible_p (t1, t2);
>  }
>
> +/* Routine to determine if the types T1,  T2 and T3 are effectively
> +   the same integer type for GIMPLE.  If T1,  T2 or T3 is not a type,
> +   the test applies to their TREE_TYPE.  */
> +
> +static inline bool
> +integer_types_ternary_match (tree t1, tree t2, tree t3)
> +{
> +  t1 = TYPE_P (t1) ? t1 : TREE_TYPE (t1);
> +  t2 = TYPE_P (t2) ? t2 : TREE_TYPE (t2);
> +  t3 = TYPE_P (t3) ? t3 : TREE_TYPE (t3);
> +
> +  if (!INTEGRAL_TYPE_P (t1) || !INTEGRAL_TYPE_P (t2) || !INTEGRAL_TYPE_P 
> (t3))
> +return false;
> +
> +  return types_match (t1, t2) && types_match (t1, t3);
> +}
> +
>  /* Return if T has a single use.  For GIMPLE, we also allow any
> non-SSA_NAME (ie constants) and zero uses to cope with uses
> that aren't linked up yet.  */
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 0f9c34fa897..401b52e7573 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -3046,38 +3046,23 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  /* Unsigned Saturation Add */
>  (match (usadd_left_part_1 @0 @1)
>   (plus:c @0 @1)
> - (if (INTEGRAL_TYPE_P (type)
> -  && TYPE_UNSIGNED (TREE_TYPE (@0))
> -  && types_match (type, TREE_TYPE (@0))
> -  && types_match (type, TREE_TYPE (@1)
> + (if (integer_types_ternary_match (type, @0, @1) && TYPE_UNSIGNED (type
>
>  (match (usadd_left_part_2 @0 @1)
>   (realpart (IFN_ADD_OVERFLOW:c @0 @1))
> - (if (INTEGRAL_TYPE_P (type)
> -  && TYPE_UNSIGNED (TREE_TYPE (@0))
> -  && types_match (type, TREE_TYPE (@0))
> -  && types_match (type, TREE_TYPE (@1)
> + (if (integer_types_ternary_match (type, @0, @1) && TYPE_UNSIGNED (type
>
>  (match (usadd_right_part_1 @0 @1)
>   (negate (convert (lt (plus:c @0 @1) @0)))
> - (if (INTEGRAL_TYPE_P (type)
> -  && TYPE_UNSIGNED (TREE_TYPE (@0))
> -  && types_match (type, TREE_TYPE (@0))
> -  && types_match (type, TREE_TYPE (@1)
> + (if (integer_types_ternary_match (type, @0, @1) && TYPE_UNSIGNED (type
>
>  (match (usadd_right_part_1 @0 @1)
>   (negate (convert (gt @0 (plus:c @0 @1
> - (if (INTEGRAL_TYPE_P (type)
> -  && TYPE_UNSIGNED (TREE_TYPE (@0))
> -  && types_match (type, TREE_TYPE (@0))
> -  && types_match (type, TREE_TYPE (@1)
> + (if (integer_types_ternary_match (type, @0, @1) && TYPE_UNSIGNED (type
>
>  (match (usadd_right_part_2 @0 @1)
>   (negate (convert (ne (imagpart (IFN_ADD_OVERFLOW:c @0 @1)) integer_zerop)))
> - (if (INTEGRAL_TYPE_P (type)
> -  && TYPE_UNSIGNED (TREE_TYPE (@0))

[PATCH 2/2][v2] RISC-V: Testsuite updates

2024-05-22 Thread Richard Biener
The gcc.dg/vect/slp-12a.c case is interesting as we currently split
the 8 store group into lanes 0-5 which we SLP with an unroll factor
of two (on x86-64 with SSE) and the remaining two lanes are using
interleaving vectorization with a final unroll factor of four.  Thus
we're using hybrid SLP within a single store group.  After the change
we discover the same 0-5 lane SLP part as well as two single-lane
parts feeding the full store group.  But that results in a load
permutation that isn't supported (I have WIP patches to rectify that).
So we end up cancelling SLP and vectorizing the whole loop with
interleaving which is IMO good and results in better code.

This is similar for gcc.target/i386/pr52252-atom.c where interleaving
generates much better code than hybrid SLP.  I'm unsure how to update
the testcase though.

gcc.dg/vect/slp-21.c runs into similar situations.  Note that when we
discard an instance while analyzing SLP operations we currently force
the full loop to have no SLP because hybrid detection is broken.  It's
probably not worth fixing this at this moment.

For gcc.dg/vect/pr97428.c we are not splitting the 16 store group
into two but merge the two 8 lane loads into one before doing the
store and thus have only a single SLP instance.  A similar situation
happens in gcc.dg/vect/slp-11c.c but the branches feeding the
single SLP store only have a single lane.  Likewise for
gcc.dg/vect/vect-complex-5.c and gcc.dg/vect/vect-gather-2.c.

gcc.dg/vect/slp-cond-1.c has an additional SLP vectorization
with a SLP store group of size two but two single-lane branches.

gcc.target/i386/pr98928.c ICEs in SLP permute optimization
because we don't expect a constant and internal branch to be
merged with a permute node in
vect_optimize_slp_pass::change_vec_perm_layout:4859 (the only
permutes merging two SLP nodes are two-operator nodes right now).
This still requires fixing.

* gcc.dg/vect/pr97428.c: Expect a single store SLP group.
* gcc.dg/vect/slp-11c.c: Likewise, if !vect_load_lanes.
* gcc.dg/vect/vect-complex-5.c: Likewise.
* gcc.dg/vect/slp-12a.c: Do not expect SLP.
* gcc.dg/vect/slp-21.c: Remove not important scanning for SLP.
* gcc.dg/vect/slp-cond-1.c: Expect one more SLP if !vect_load_lanes.
* gcc.dg/vect/vect-gather-2.c: Expect SLP to be used.
* gcc.target/i386/pr52252-atom.c: XFAIL test for palignr.
---
 gcc/testsuite/gcc.dg/vect/pr97428.c  |  2 +-
 gcc/testsuite/gcc.dg/vect/slp-11c.c  |  6 --
 gcc/testsuite/gcc.dg/vect/slp-12a.c  |  6 +-
 gcc/testsuite/gcc.dg/vect/slp-21.c   | 18 +++---
 gcc/testsuite/gcc.dg/vect/slp-cond-1.c   |  3 ++-
 gcc/testsuite/gcc.dg/vect/vect-complex-5.c   |  2 +-
 gcc/testsuite/gcc.dg/vect/vect-gather-2.c|  1 -
 gcc/testsuite/gcc.target/i386/pr52252-atom.c |  3 ++-
 8 files changed, 18 insertions(+), 23 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr97428.c 
b/gcc/testsuite/gcc.dg/vect/pr97428.c
index 60dd984cfd3..3cc9976c00c 100644
--- a/gcc/testsuite/gcc.dg/vect/pr97428.c
+++ b/gcc/testsuite/gcc.dg/vect/pr97428.c
@@ -44,5 +44,5 @@ void foo_i2(dcmlx4_t dst[], const dcmlx_t src[], int n)
 /* { dg-final { scan-tree-dump "Detected interleaving store of size 16" "vect" 
} } */
 /* We're not able to peel & apply re-aligning to make accesses well-aligned 
for !vect_hw_misalign,
but we could by peeling the stores for alignment and applying re-aligning 
loads.  */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { 
xfail { ! vect_hw_misalign } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
xfail { ! vect_hw_misalign } } } } */
 /* { dg-final { scan-tree-dump-not "gap of 6 elements" "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-11c.c 
b/gcc/testsuite/gcc.dg/vect/slp-11c.c
index 0f680cd4e60..2e70fca39ba 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-11c.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-11c.c
@@ -13,7 +13,8 @@ main1 ()
   unsigned int in[N*8] = 
{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63};
   float out[N*8];
 
-  /* Different operations - not SLPable.  */
+  /* Different operations - we SLP the store and split the group to two
+ single-lane branches.  */
   for (i = 0; i < N*4; i++)
 {
   out[i*2] = ((float) in[i*2] * 2 + 6) ;
@@ -44,4 +45,5 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { 
{ vect_uintfloat_cvt && vect_strided2 } && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { 
! { { vect_uintfloat_cvt && vect_strided2 } && vect_int_mult } } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0  "vect"  
} } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { 

[PATCH 1/2][v2] Avoid splitting store dataref groups during SLP discovery

2024-05-22 Thread Richard Biener
The following avoids splitting store dataref groups during SLP
discovery but instead forces (eventually single-lane) consecutive
lane SLP discovery for all lanes of the group, creating VEC_PERM
SLP nodes merging them so the store will always cover the whole group.

With this for example

int x[1024], y[1024], z[1024], w[1024];
void foo (void)
{
  for (int i = 0; i < 256; i++)
{
  x[4*i+0] = y[2*i+0];
  x[4*i+1] = y[2*i+1];
  x[4*i+2] = z[i];
  x[4*i+3] = w[i];
}
}

which was previously using hybrid SLP can now be fully SLPed and
SSE code generated looks better (but of course you never know,
I didn't actually benchmark).  We of course need a VF of four here.

.L2:
movdqa  z(%rax), %xmm0
movdqa  w(%rax), %xmm4
movdqa  y(%rax,%rax), %xmm2
movdqa  y+16(%rax,%rax), %xmm1
movdqa  %xmm0, %xmm3
punpckhdq   %xmm4, %xmm0
punpckldq   %xmm4, %xmm3
movdqa  %xmm2, %xmm4
shufps  $238, %xmm3, %xmm2
movaps  %xmm2, x+16(,%rax,4)
movdqa  %xmm1, %xmm2
shufps  $68, %xmm3, %xmm4
shufps  $68, %xmm0, %xmm2
movaps  %xmm4, x(,%rax,4)
shufps  $238, %xmm0, %xmm1
movaps  %xmm2, x+32(,%rax,4)
movaps  %xmm1, x+48(,%rax,4)
addq$16, %rax
cmpq$1024, %rax
jne .L2

The extra permute nodes merging distinct branches of the SLP
tree might be unexpected for some code, esp. since
SLP_TREE_REPRESENTATIVE cannot be meaningfully set and we
cannot populate SLP_TREE_SCALAR_STMTS or SLP_TREE_SCALAR_OPS
consistently as we can have a mix of both.

The patch keeps the sub-trees formed from consecutive lanes but that's
in principle not necessary if we for example have an even/odd
split which now would result in N single-lane sub-trees.  That's
left for future improvements.
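
For instance an even/odd split like the following (a hypothetical variant
of the example above) would then result in two single-lane sub-trees
merged back by an interleaving VEC_PERM:

int x[1024], y[1024], z[1024];
void bar (void)
{
  for (int i = 0; i < 512; i++)
    {
      x[2*i+0] = y[i];   /* even lanes - single-lane sub-tree */
      x[2*i+1] = z[i];   /* odd lanes - single-lane sub-tree */
    }
}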

The interesting part is how VLA vector ISAs handle merging of
two vectors that's not trivial even/odd merging.  The strategy
of how to build the permute tree might need adjustments for that
(in the end splitting each branch to single lanes and then doing
even/odd merging would be the brute-force fallback).  Not sure
how much we can or should rely on the SLP optimize pass to handle
this.

* tree-vect-slp.cc (vect_build_slp_instance): Do not split
store dataref groups on loop SLP discovery failure but create
a single SLP instance for the stores but branch to SLP sub-trees
and merge with a series of VEC_PERM nodes.
---
 gcc/tree-vect-slp.cc | 247 ++-
 1 file changed, 221 insertions(+), 26 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 3f8209b43a7..1fbc7a672a7 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3468,12 +3468,7 @@ vect_build_slp_instance (vec_info *vinfo,
  return true;
}
 }
-  else
-{
-  /* Failed to SLP.  */
-  /* Free the allocated memory.  */
-  scalar_stmts.release ();
-}
+  /* Failed to SLP.  */
 
   stmt_vec_info stmt_info = stmt_info_;
   /* Try to break the group up into pieces.  */
@@ -3491,6 +3486,9 @@ vect_build_slp_instance (vec_info *vinfo,
   if (is_a <bb_vec_info> (vinfo)
  && (i > 1 && i < group_size))
{
+ /* Free the allocated memory.  */
+ scalar_stmts.release ();
+
  tree scalar_type
= TREE_TYPE (DR_REF (STMT_VINFO_DATA_REF (stmt_info)));
  tree vectype = get_vectype_for_scalar_type (vinfo, scalar_type,
@@ -3535,38 +3533,235 @@ vect_build_slp_instance (vec_info *vinfo,
}
}
 
-  /* For loop vectorization split into arbitrary pieces of size > 1.  */
-  if (is_a <loop_vec_info> (vinfo)
- && (i > 1 && i < group_size)
- && !vect_slp_prefer_store_lanes_p (vinfo, stmt_info, group_size, i))
+  /* For loop vectorization split the RHS into arbitrary pieces of
+size >= 1.  */
+  else if (is_a <loop_vec_info> (vinfo)
+  && (i > 0 && i < group_size)
+  && !vect_slp_prefer_store_lanes_p (vinfo,
+ stmt_info, group_size, i))
{
- unsigned group1_size = i;
-
  if (dump_enabled_p ())
dump_printf_loc (MSG_NOTE, vect_location,
 "Splitting SLP group at stmt %u\n", i);
 
- stmt_vec_info rest = vect_split_slp_store_group (stmt_info,
-  group1_size);
- /* Loop vectorization cannot handle gaps in stores, make sure
-the split group appears as strided.  */
- STMT_VINFO_STRIDED_P (rest) = 1;
- DR_GROUP_GAP (rest) = 0;
- STMT_VINFO_STRIDED_P (stmt_info) = 1;
- DR_GROUP_GAP (stmt_info) = 0;
+ /* Analyze the stored values and pinch them together with
+a permute node so we can preserve the whole store group.  */
+ auto_vec<slp_tree> rhs_nodes;
+
+ /* Calculate the unrolling factor based 

[gcc r15-773] Fix mixed input kind permute optimization

2024-05-22 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:3507ab1b018a68500e49fa9f1de7caa0f1b53dda

commit r15-773-g3507ab1b018a68500e49fa9f1de7caa0f1b53dda
Author: Richard Biener 
Date:   Tue May 21 19:15:33 2024 +0200

Fix mixed input kind permute optimization

When change_vec_perm_layout runs into a permute combining two
nodes where one is invariant and one internal the partition of
one input can be -1 but the other might not be.  The following
supports this case by simply ignoring inputs with input partition -1.

I'm not sure this is correct but it avoids ICEing when accessing
that partitions layout for gcc.target/i386/pr98928.c with the
change to avoid splitting store dataref groups during SLP discovery.

* tree-vect-slp.cc (change_vec_perm_layout): Ignore an
input partition of -1.

Diff:
---
 gcc/tree-vect-slp.cc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 43f2c153bf0..3f8209b43a7 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -4640,6 +4640,8 @@ change_vec_perm_layout (slp_tree node, lane_permutation_t &perm,
{
  slp_tree in_node = SLP_TREE_CHILDREN (node)[entry.first];
  unsigned int in_partition_i = m_vertices[in_node->vertex].partition;
+ if (in_partition_i == -1u)
+   continue;
  this_in_layout_i = m_partitions[in_partition_i].layout;
}
   if (this_in_layout_i > 0)


[gcc r15-772] Avoid SLP_REPRESENTATIVE access for VEC_PERM in SLP scheduling

2024-05-22 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:31e9bae0ea5e5413abfa3ca9050e66cc6760553e

commit r15-772-g31e9bae0ea5e5413abfa3ca9050e66cc6760553e
Author: Richard Biener 
Date:   Fri May 17 15:23:38 2024 +0200

Avoid SLP_REPRESENTATIVE access for VEC_PERM in SLP scheduling

SLP permute nodes can end up without a SLP_REPRESENTATIVE now,
the following avoids touching it in this case in vect_schedule_slp_node.

* tree-vect-slp.cc (vect_schedule_slp_node): Avoid looking
at SLP_REPRESENTATIVE for VEC_PERM nodes.

Diff:
---
 gcc/tree-vect-slp.cc | 28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index f34ed54a70b..43f2c153bf0 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -9301,13 +9301,8 @@ vect_schedule_slp_node (vec_info *vinfo,
   gcc_assert (SLP_TREE_NUMBER_OF_VEC_STMTS (node) != 0);
   SLP_TREE_VEC_DEFS (node).create (SLP_TREE_NUMBER_OF_VEC_STMTS (node));
 
-  if (dump_enabled_p ())
-dump_printf_loc (MSG_NOTE, vect_location,
-"-->vectorizing SLP node starting from: %G",
-stmt_info->stmt);
-
-  if (STMT_VINFO_DATA_REF (stmt_info)
-  && SLP_TREE_CODE (node) != VEC_PERM_EXPR)
+  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
+  && STMT_VINFO_DATA_REF (stmt_info))
 {
   /* Vectorized loads go before the first scalar load to make it
 ready early, vectorized stores go before the last scalar
@@ -9319,10 +9314,10 @@ vect_schedule_slp_node (vec_info *vinfo,
last_stmt_info = vect_find_last_scalar_stmt_in_slp (node);
   si = gsi_for_stmt (last_stmt_info->stmt);
 }
-  else if ((STMT_VINFO_TYPE (stmt_info) == cycle_phi_info_type
-   || STMT_VINFO_TYPE (stmt_info) == induc_vec_info_type
-   || STMT_VINFO_TYPE (stmt_info) == phi_info_type)
-  && SLP_TREE_CODE (node) != VEC_PERM_EXPR)
+  else if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
+  && (STMT_VINFO_TYPE (stmt_info) == cycle_phi_info_type
+  || STMT_VINFO_TYPE (stmt_info) == induc_vec_info_type
+  || STMT_VINFO_TYPE (stmt_info) == phi_info_type))
 {
   /* For PHI node vectorization we do not use the insertion iterator.  */
   si = gsi_none ();
@@ -9456,6 +9451,9 @@ vect_schedule_slp_node (vec_info *vinfo,
   /* Handle purely internal nodes.  */
   if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
 {
+  if (dump_enabled_p ())
+   dump_printf_loc (MSG_NOTE, vect_location,
+"-->vectorizing SLP permutation node\n");
   /* ???  the transform kind is stored to STMT_VINFO_TYPE which might
 be shared with different SLP nodes (but usually it's the same
 operation apart from the case the stmt is only there for denoting
@@ -9474,7 +9472,13 @@ vect_schedule_slp_node (vec_info *vinfo,
  }
 }
   else
-    vect_transform_stmt (vinfo, stmt_info, &si, node, instance);
+{
+  if (dump_enabled_p ())
+   dump_printf_loc (MSG_NOTE, vect_location,
+"-->vectorizing SLP node starting from: %G",
+stmt_info->stmt);
+      vect_transform_stmt (vinfo, stmt_info, &si, node, instance);
+}
 }
 
 /* Replace scalar calls from SLP node NODE with setting of their lhs to zero.


[gcc r15-771] Avoid requiring VEC_PERM representatives

2024-05-22 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:0c7792f707368d0225a9a457895b847ef660c270

commit r15-771-g0c7792f707368d0225a9a457895b847ef660c270
Author: Richard Biener 
Date:   Fri May 17 14:26:38 2024 +0200

Avoid requiring VEC_PERM representatives

The following plugs one hole where we require a VEC_PERM node
representative unnecessarily.  This is for vect_check_store_rhs
which looks at the RHS and checks whether a constant can be
native encoded.  The fix is to guard that with vect_constant_def
additionally and making vect_is_simple_use forgiving for a missing
SLP_TREE_REPRESENTATIVE when the child is a VEC_PERM node,
initializing the scalar def to error_mark_node.

* tree-vect-stmts.cc (vect_check_store_rhs): Look at *rhs
only when it's a vect_constant_def.
(vect_is_simple_use): When we have no representative for
an internal node, fill in *op with error_mark_node.

Diff:
---
 gcc/tree-vect-stmts.cc | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 672959501bb..4219ad832db 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2553,7 +2553,8 @@ vect_check_store_rhs (vec_info *vinfo, stmt_vec_info 
stmt_info,
 
   /* In the case this is a store from a constant make sure
  native_encode_expr can handle it.  */
-  if (CONSTANT_CLASS_P (*rhs) && native_encode_expr (*rhs, NULL, 64) == 0)
+  if (rhs_dt == vect_constant_def
+  && CONSTANT_CLASS_P (*rhs) && native_encode_expr (*rhs, NULL, 64) == 0)
 {
   if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -14002,8 +14003,26 @@ vect_is_simple_use (vec_info *vinfo, stmt_vec_info 
stmt, slp_tree slp_node,
   *vectype = SLP_TREE_VECTYPE (child);
   if (SLP_TREE_DEF_TYPE (child) == vect_internal_def)
{
- *op = gimple_get_lhs (SLP_TREE_REPRESENTATIVE (child)->stmt);
- return vect_is_simple_use (*op, vinfo, dt, def_stmt_info_out);
+ /* ???  VEC_PERM nodes might be intermediate and their lane value
+have no representative (nor do we build a VEC_PERM stmt for
+the actual operation).  Note for two-operator nodes we set
+a representative but leave scalar stmts empty as we'd only
+have one for a subset of lanes.  Ideally no caller would
+require *op for internal defs.  */
+ if (SLP_TREE_REPRESENTATIVE (child))
+   {
+ *op = gimple_get_lhs (SLP_TREE_REPRESENTATIVE (child)->stmt);
+ return vect_is_simple_use (*op, vinfo, dt, def_stmt_info_out);
+   }
+ else
+   {
+ gcc_assert (SLP_TREE_CODE (child) == VEC_PERM_EXPR);
+ *op = error_mark_node;
+ *dt = vect_internal_def;
+ if (def_stmt_info_out)
+   *def_stmt_info_out = NULL;
+ return true;
+   }
}
   else
{


Re: [PATCH 4/4] Testsuite updates

2024-05-22 Thread Richard Biener
On Tue, 21 May 2024, Richard Biener wrote:

> The gcc.dg/vect/slp-12a.c case is interesting as we currently split
> the 8 store group into lanes 0-5 which we SLP with an unroll factor
> of two (on x86-64 with SSE) and the remaining two lanes are using
> interleaving vectorization with a final unroll factor of four.  Thus
> we're using hybrid SLP within a single store group.  After the change
> we discover the same 0-5 lane SLP part as well as two single-lane
> parts feeding the full store group.  But that results in a load
> permutation that isn't supported (I have WIP patches to rectify that).
> So we end up cancelling SLP and vectorizing the whole loop with
> interleaving which is IMO good and results in better code.
> 
> This is similar for gcc.target/i386/pr52252-atom.c where interleaving
> generates much better code than hybrid SLP.  I'm unsure how to update
> the testcase though.
> 
> gcc.dg/vect/slp-21.c runs into similar situations.  Note that when we
> discard an instance while analyzing SLP operations we currently force
> the full loop to have no SLP because hybrid detection is broken.  It's
> probably not worth fixing this at this moment.
> 
> For gcc.dg/vect/pr97428.c we are not splitting the 16 store group
> into two but merge the two 8 lane loads into one before doing the
> store and thus have only a single SLP instance.  A similar situation
> happens in gcc.dg/vect/slp-11c.c but the branches feeding the
> single SLP store only have a single lane.  Likewise for
> gcc.dg/vect/vect-complex-5.c and gcc.dg/vect/vect-gather-2.c.
> 
> gcc.dg/vect/slp-cond-1.c has an additional SLP vectorization
> with a SLP store group of size two but two single-lane branches.
> 
> gcc.target/i386/pr98928.c ICEs in SLP permute optimization
> because we don't expect a constant and internal branch to be
> merged with a permute node in
> vect_optimize_slp_pass::change_vec_perm_layout:4859 (the only
> permutes merging two SLP nodes are two-operator nodes right now).
> This still requires fixing.
> 
> The whole series has been bootstrapped and tested on 
> x86_64-unknown-linux-gnu with the gcc.target/i386/pr98928.c FAIL
> unfixed.
> 
> Comments welcome (and hello ARM CI), RISC-V and other arch
> testing appreciated.  Unless there are comments to the contrary
> I plan to push patch 1 and 2 tomorrow.

RISC-V CI didn't trigger (not sure what magic is required).  Both
ARM and AARCH64 show that the "Vectorizing stmts using SLP" scans are a bit
fragile because we sometimes cancel SLP because we want to use
load/store-lanes.

I have locally scrapped the SLP scanning for gcc.dg/vect/slp-21.c where
it doesn't really matter (and once we are finished with all-SLP it won't
matter anywhere).  I've conditionalized the outcome based on
vect_load_lanes for gcc.dg/vect/slp-11c.c and
gcc.dg/vect/slp-cond-1.c

On AARCH64 additionally gcc.target/aarch64/sve/mask_struct_store_4.c
ICEs, I have a fix for that.

gcc.target/aarch64/pr99873_2.c FAILs because with a single
SLP store group merged from two two-lane load groups we cancel
the SLP and want to use load/store-lanes.  I'll leave this
FAILing or shall I XFAIL it?

Thanks,
Richard.

> Thanks,
> Richard.
> 
>   * gcc.dg/vect/pr97428.c: Expect a single store SLP group.
>   * gcc.dg/vect/slp-11c.c: Likewise.
>   * gcc.dg/vect/vect-complex-5.c: Likewise.
>   * gcc.dg/vect/slp-12a.c: Do not expect SLP.
>   * gcc.dg/vect/slp-21.c: Likewise.
>   * gcc.dg/vect/slp-cond-1.c: Expect one more SLP.
>   * gcc.dg/vect/vect-gather-2.c: Expect SLP to be used.
>   * gcc.target/i386/pr52252-atom.c: XFAIL test for palignr.
> ---
>  gcc/testsuite/gcc.dg/vect/pr97428.c  |  2 +-
>  gcc/testsuite/gcc.dg/vect/slp-11c.c  |  5 +++--
>  gcc/testsuite/gcc.dg/vect/slp-12a.c  |  6 +-
>  gcc/testsuite/gcc.dg/vect/slp-21.c   | 19 +--
>  gcc/testsuite/gcc.dg/vect/slp-cond-1.c   |  2 +-
>  gcc/testsuite/gcc.dg/vect/vect-complex-5.c   |  2 +-
>  gcc/testsuite/gcc.dg/vect/vect-gather-2.c|  1 -
>  gcc/testsuite/gcc.target/i386/pr52252-atom.c |  3 ++-
>  8 files changed, 18 insertions(+), 22 deletions(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/pr97428.c 
> b/gcc/testsuite/gcc.dg/vect/pr97428.c
> index 60dd984cfd3..3cc9976c00c 100644
> --- a/gcc/testsuite/gcc.dg/vect/pr97428.c
> +++ b/gcc/testsuite/gcc.dg/vect/pr97428.c
> @@ -44,5 +44,5 @@ void foo_i2(dcmlx4_t dst[], const dcmlx_t src[], int n)
>  /* { dg-final { scan-tree-dump "Detected interleaving store of size 16" 
> "vect" } } */
>  /* We're not able to peel & apply re-aligning to make accesses well-aligned 
> for !vect_hw_misalign,
> but we could by peeling the stores for alignment and applying re-aligning 
> loads.  */

[PATCH] tree-optimization/115144 - improve sinking destination choice

2024-05-22 Thread Richard Biener
When sinking code closer to its uses we already try to minimize the
distance we move by inserting at the start of the basic-block.  The
following makes sure to sink closest to the control dependence
check of the region we want to sink to as well as make sure to
ignore control dependences that are only guarding exceptional code.
This restores somewhat the old profile check but without requiring
nearly even probabilities.  The patch also makes sure to not give
up completely when the best sink location is one we do not want to
sink to but possibly then choose the next best one.

Re-bootstrap and regtest running on x86_64-unknown-linux-gnu after
a minor fix.

PR tree-optimization/115144
* tree-ssa-sink.cc (do_not_sink): New function, split out
from ...
(select_best_block): Here.  First pick valid block to
sink to.  From that search for the best valid block,
avoiding sinking across conditions to exceptional code.

* gcc.dg/tree-ssa/ssa-sink-22.c: New testcase.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c |  14 +++
 gcc/tree-ssa-sink.cc| 101 +---
 2 files changed, 82 insertions(+), 33 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
new file mode 100644
index 000..e35626d4070
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink1-details" } */
+
+extern void abort (void);
+
+int foo (int x, int y, int f)
+{
+  int tem = x / y;
+  if (f)
+abort ();
+  return tem;
+}
+
+/* { dg-final { scan-tree-dump-not "Sinking" "sink1" } } */
diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
index 2188b7523c7..a06b43e61af 100644
--- a/gcc/tree-ssa-sink.cc
+++ b/gcc/tree-ssa-sink.cc
@@ -172,6 +172,38 @@ nearest_common_dominator_of_uses (def_operand_p def_p, 
bool *debug_stmts)
   return commondom;
 }
 
+/* Return whether sinking STMT from EARLY_BB to BEST_BB should be avoided.  */
+
+static bool
+do_not_sink (gimple *stmt, basic_block early_bb, basic_block best_bb)
+{
+  /* Placing a statement before a setjmp-like function would be invalid
+ (it cannot be reevaluated when execution follows an abnormal edge).
+ If we selected a block with abnormal predecessors, just punt.  */
+  if (bb_has_abnormal_pred (best_bb))
+return true;
+
+  /* If the latch block is empty, don't make it non-empty by sinking
+ something into it.  */
+  if (best_bb == early_bb->loop_father->latch
+  && empty_block_p (best_bb))
+return true;
+
+  /* Avoid turning an unconditional read into a conditional one when we
+ still might want to perform vectorization.  */
+  if (best_bb->loop_father == early_bb->loop_father
+  && loop_outer (best_bb->loop_father)
+  && !best_bb->loop_father->inner
+  && gimple_vuse (stmt)
+  && flag_tree_loop_vectorize
+  && !(cfun->curr_properties & PROP_loop_opts_done)
+  && dominated_by_p (CDI_DOMINATORS, best_bb->loop_father->latch, early_bb)
+  && !dominated_by_p (CDI_DOMINATORS, best_bb->loop_father->latch, 
best_bb))
+return true;
+
+  return false;
+}
+
 /* Given EARLY_BB and LATE_BB, two blocks in a path through the dominator
tree, return the best basic block between them (inclusive) to place
statements.
@@ -185,54 +217,57 @@ select_best_block (basic_block early_bb,
   basic_block late_bb,
   gimple *stmt)
 {
+  /* First pick a block we do not disqualify.  */
+  while (late_bb != early_bb
+&& do_not_sink (stmt, early_bb, late_bb))
+late_bb = get_immediate_dominator (CDI_DOMINATORS, late_bb);
+
   basic_block best_bb = late_bb;
   basic_block temp_bb = late_bb;
-
   while (temp_bb != early_bb)
 {
   /* Walk up the dominator tree, hopefully we'll find a shallower
 loop nest.  */
   temp_bb = get_immediate_dominator (CDI_DOMINATORS, temp_bb);
 
+  /* Do not consider blocks we do not want to sink to.  */
+  if (temp_bb != early_bb && do_not_sink (stmt, early_bb, temp_bb))
+   ;
+
   /* If we've moved into a lower loop nest, then that becomes
 our best block.  */
-  if (bb_loop_depth (temp_bb) < bb_loop_depth (best_bb))
+  else if (bb_loop_depth (temp_bb) < bb_loop_depth (best_bb))
best_bb = temp_bb;
-}
 
-  /* Placing a statement before a setjmp-like function would be invalid
- (it cannot be reevaluated when execution follows an abnormal edge).
- If we selected a block with abnormal predecessors, just punt.  */
-  if (bb_has_abnormal_pred (best_bb))
-return early_bb;
-
-  /* If we found a shallower loop nest, then we always consider that
- a win.  This will always give us the most control dependent block
- within that loop nest.  */
-  if (bb_loop_depth (best_bb) < bb_loop_depth (early_bb))
-

Re: [PATCH] Fix mixed input kind permute optimization

2024-05-22 Thread Richard Biener
On Wed, 22 May 2024, Richard Sandiford wrote:

> Richard Sandiford  writes:
> > Richard Biener  writes:
> >> When change_vec_perm_layout runs into a permute combining two
> >> nodes where one is invariant and one internal the partition of
> >> one input can be -1 but the other might not be.  The following
> >> supports this case by simply ignoring inputs with input partiton -1.
> >>
> >> I'm not sure this is correct but it avoids ICEing when accessing
> >> that partitions layout for gcc.target/i386/pr98928.c with the
> >> change to avoid splitting store dataref groups during SLP discovery.
> >>
> >> Bootstrap and regtest running on x86_64-unknown-linux-gnu (ontop of
> >> the SLP series).  The change can't break anything that's already
> >> broken but I'm not sure this does the right thing - the testcase
> >> has an uniform constant.  I'll try to come up with a better runtime
> >> testcase tomorrow.  Hints as to where to correctly fix such case
> >> appreciated.
> >
> > Famous last words, but yeah, it looks correct to me.  I think the
> > routine in principle should have a free choice of which layout to
> > choose for invariants (as long as it's consistent for all queries
> > about the same node).  So it should just be a question of whether
> > keeping the original layout is more likely to give a valid
> > permutation, or whether going with out_layout_i would be better.
> > I don't have a strong intuition either way.
> 
> BTW, I should have said that using a different layout from 0
> would require compensating code in the materialize function.
> So this is definitely the simplest and most direct fix.

Yeah, I guess we can improve on that later.  I'm going to push the
change after lunch together with the other two fixes - the ARM CI
discovered its share of testsuite fallout for the actual change
I'm going to look at.

Richard.


[gcc r15-768] web/115183 - fix typo in C++ docs

2024-05-22 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:424f8a01df9b311250e416759ad61c00bba4af48

commit r15-768-g424f8a01df9b311250e416759ad61c00bba4af48
Author: Richard Biener 
Date:   Wed May 22 10:19:08 2024 +0200

web/115183 - fix typo in C++ docs

The following fixes a reported typo.

* doc/invoke.texi (C++ Modules): Fix typo.

Diff:
---
 gcc/doc/invoke.texi | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 218901c0b20..0625a5ede6f 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -37646,7 +37646,7 @@ not get debugging information for routines in the 
precompiled header.
 @cindex speed of compilation
 
 Modules are a C++20 language feature.  As the name suggests, they
-provides a modular compilation system, intending to provide both
+provide a modular compilation system, intending to provide both
 faster builds and better library isolation.  The ``Merging Modules''
 paper @uref{https://wg21.link/p1103}, provides the easiest to read set
 of changes to the standard, although it does not capture later


[PATCH] web/115183 - fix typo in C++ docs

2024-05-22 Thread Richard Biener
The following fixes a reported typo.

Pushed.

* doc/invoke.texi (C++ Modules): Fix typo.
---
 gcc/doc/invoke.texi | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 218901c0b20..0625a5ede6f 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -37646,7 +37646,7 @@ not get debugging information for routines in the 
precompiled header.
 @cindex speed of compilation
 
 Modules are a C++20 language feature.  As the name suggests, they
-provides a modular compilation system, intending to provide both
+provide a modular compilation system, intending to provide both
 faster builds and better library isolation.  The ``Merging Modules''
 paper @uref{https://wg21.link/p1103}, provides the easiest to read set
 of changes to the standard, although it does not capture later
-- 
2.35.3


gcc-wwwdocs branch master updated. 9d10f6fccee3a68102173f28cf312ed266b7d95d

2024-05-22 Thread Richard Biener via Gcc-cvs-wwwdocs
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "gcc-wwwdocs".

The branch, master has been updated
   via  9d10f6fccee3a68102173f28cf312ed266b7d95d (commit)
  from  9e79c5e411eb3236b481c6093fad4dc5ae5141c5 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -
commit 9d10f6fccee3a68102173f28cf312ed266b7d95d
Author: Richard Biener 
Date:   Wed May 22 10:04:32 2024 +0200

web/115183 - Remove duplicate links to GCCGO online docs.

diff --git a/htdocs/onlinedocs/13.1.0/index.html b/htdocs/onlinedocs/13.1.0/index.html
index 2abc06ac..08d312ba 100644
--- a/htdocs/onlinedocs/13.1.0/index.html
+++ b/htdocs/onlinedocs/13.1.0/index.html
@@ -62,12 +62,6 @@
 <a href="https://gcc.gnu.org/onlinedocs/gcc-13.1.0/libstdc++-api-gfdl.xml.gz">XML
GFDL</a> or <a href="https://gcc.gnu.org/onlinedocs/gcc-13.1.0/libstdc++-api-html.tar.gz">an
 HTML tarball</a>)
-   <a href="https://gcc.gnu.org/onlinedocs/gcc-13.1.0/gccgo/">GCCGO 13.1
Manual</a> (<a href="https://gcc.gnu.org/onlinedocs/gcc-13.1.0/gccgo.pdf">also in
-   PDF</a> or <a href="https://gcc.gnu.org/onlinedocs/gcc-13.1.0/gccgo.ps.gz">PostScript</a> or
<a href="https://gcc.gnu.org/onlinedocs/gcc-13.1.0/gccgo-html.tar.gz">an
-   HTML tarball</a>)
 <a href="https://gcc.gnu.org/onlinedocs/gcc-13.1.0/gccgo/">GCCGO 13.1
Manual</a> (<a href="https://gcc.gnu.org/onlinedocs/gcc-13.1.0/gccgo.pdf">also in
PDF</a> or <a href="https://gcc.gnu.org/onlinedocs/gcc-13.2.0/libstdc++-api-gfdl.xml.gz">XML
GFDL</a> or <a href="https://gcc.gnu.org/onlinedocs/gcc-13.2.0/libstdc++-api-html.tar.gz">an
 HTML tarball</a>)
-   <a href="https://gcc.gnu.org/onlinedocs/gcc-13.2.0/gccgo/">GCCGO 13.2
Manual</a> (<a href="https://gcc.gnu.org/onlinedocs/gcc-13.2.0/gccgo.pdf">also in
-   PDF</a> or <a href="https://gcc.gnu.org/onlinedocs/gcc-13.2.0/gccgo.ps.gz">PostScript</a> or
<a href="https://gcc.gnu.org/onlinedocs/gcc-13.2.0/gccgo-html.tar.gz">an
-   HTML tarball</a>)
 <a href="https://gcc.gnu.org/onlinedocs/gcc-13.2.0/gccgo/">GCCGO 13.2
Manual</a> (<a href="https://gcc.gnu.org/onlinedocs/gcc-13.2.0/gccgo.pdf">also in
PDF</a> or <a href="https://gcc.gnu.org/onlinedocs/gcc-13.3.0/libstdc++-api-gfdl.xml.gz">XML
GFDL</a> or <a href="https://gcc.gnu.org/onlinedocs/gcc-13.3.0/libstdc++-api-html.tar.gz">an
 HTML tarball</a>)
-   <a href="https://gcc.gnu.org/onlinedocs/gcc-13.3.0/gccgo/">GCCGO 13.3
Manual</a> (<a href="https://gcc.gnu.org/onlinedocs/gcc-13.3.0/gccgo.pdf">also in
-   PDF</a> or <a href="https://gcc.gnu.org/onlinedocs/gcc-13.3.0/gccgo.ps.gz">PostScript</a> or
<a href="https://gcc.gnu.org/onlinedocs/gcc-13.3.0/gccgo-html.tar.gz">an
-   HTML tarball</a>)
 <a href="https://gcc.gnu.org/onlinedocs/gcc-13.3.0/gccgo/">GCCGO 13.3
Manual</a> (<a href="https://gcc.gnu.org/onlinedocs/gcc-13.3.0/gccgo.pdf">also in
PDF</a> or <a href="https://gcc.gnu.org/onlinedocs/gcc-14.1.0/libstdc++-api-gfdl.xml.gz">XML
GFDL</a> or <a href="https://gcc.gnu.org/onlinedocs/gcc-14.1.0/libstdc++-api-html.tar.gz">an
 HTML tarball</a>)
-   <a href="https://gcc.gnu.org/onlinedocs/gcc-14.1.0/gccgo/">GCCGO 14.1
Manual</a> (<a href="https://gcc.gnu.org/onlinedocs/gcc-14.1.0/gccgo.pdf">also in
-   PDF</a> or <a href="https://gcc.gnu.org/onlinedocs/gcc-14.1.0/gccgo.ps.gz">PostScript</a> or
<a href="https://gcc.gnu.org/onlinedocs/gcc-14.1.0/gccgo-html.tar.gz">an
-   HTML tarball</a>)
 <a href="https://gcc.gnu.org/onlinedocs/gcc-14.1.0/gccgo/">GCCGO 14.1
Manual</a> (<a href="https://gcc.gnu.org/onlinedocs/gcc-14.1.0/gccgo.pdf">also in
PDF</a> or

Re: [PATCH] Avoid ICE in except.cc on targets that don't support exceptions.

2024-05-22 Thread Richard Biener
On Wed, May 22, 2024 at 9:21 AM Roger Sayle  wrote:
>
>
> A number of testcases currently fail on nvptx with the ICE:
>
> during RTL pass: final
> openmp-simd-2.c: In function 'foo':
> openmp-simd-2.c:28:1: internal compiler error: in get_personality_function,
> at expr.cc:14037
>28 | }
>   | ^
> 0x98a38f get_personality_function(tree_node*)
> /home/roger/GCC/nvptx-none/gcc/gcc/expr.cc:14037
> 0x969d3b output_function_exception_table(int)
> /home/roger/GCC/nvptx-none/gcc/gcc/except.cc:3226
> 0x9b760d rest_of_handle_final
> /home/roger/GCC/nvptx-none/gcc/gcc/final.cc:4252
>
> The simple oversight in output_function_exception_table is that it calls
> get_personality_function (immediately) before checking the target's
> except_unwind_info hook (which on nvptx always returns UI_NONE).
> The (perhaps obvious) fix is to move the assignments of fname and
> personality after the tests that they are needed, and before their
> first use.
>
> This patch has been tested on nvptx-none hosted on x86_64-pc-linux-gnu
> with no new failures in the testsuite, and ~220 fewer FAILs.
> Ok for mainline?

OK.

Richard.

> 2024-05-22  Roger Sayle  
>
> gcc/ChangeLog
> * except.cc (output_function_exception_table): Move call to
> get_personality_function after targetm_common.except_unwind_info
> check, to avoid ICE on targets that don't support exceptions.
>
>
> Thanks in advance,
> Roger
> --
>
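
For reference, a minimal sketch of the reordering described above --
structure only, with all surrounding logic elided (the real code is
output_function_exception_table in gcc/except.cc):

void
output_function_exception_table (int section)
{
  /* Check the unwind-info hook first; on targets like nvptx it
     returns UI_NONE and we must bail out before touching any
     personality data.  */
  if (targetm_common.except_unwind_info (&global_options) == UI_NONE)
    return;

  /* Only now look up the personality routine -- doing this before
     the check above is what ICEd in get_personality_function.  */
  rtx personality = get_personality_function (current_function_decl);
  /* ... emit the table ...  */
}
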


Re: [PATCH] Don't simplify NAN/INF or out-of-range constant for FIX/UNSIGNED_FIX.

2024-05-22 Thread Richard Biener
On Wed, May 22, 2024 at 3:58 AM liuhongt  wrote:
>
> According to IEEE standard, for conversions from floating point to
> integer. When a NaN or infinite operand cannot be represented in the
> destination format and this cannot otherwise be indicated, the invalid
> operation exception shall be signaled. When a numeric operand would
> convert to an integer outside the range of the destination format, the
> invalid operation exception shall be signaled if this situation cannot
> otherwise be indicated.
>
> The patch prevents simplification of the conversion from floating point
> to integer for NAN/INF/out-of-range constants when flag_trapping_math.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
> Ok for trunk?

OK if there are no further comments today.

Thanks,
Richard.

> gcc/ChangeLog:
>
> PR rtl-optimization/100927
> PR rtl-optimization/115161
> PR rtl-optimization/115115
> * simplify-rtx.cc (simplify_const_unary_operation): Prevent
> simplification of FIX/UNSIGNED_FIX for NAN/INF/out-of-range
> constant when flag_trapping_math.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr100927.c: New test.
> ---
>  gcc/simplify-rtx.cc  | 23 
>  gcc/testsuite/gcc.target/i386/pr100927.c | 27 
>  2 files changed, 46 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr100927.c
>
> diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
> index 53f54d1d392..b7a770dad60 100644
> --- a/gcc/simplify-rtx.cc
> +++ b/gcc/simplify-rtx.cc
> @@ -2256,14 +2256,25 @@ simplify_const_unary_operation (enum rtx_code code, 
> machine_mode mode,
>switch (code)
> {
> case FIX:
> + /* According to IEEE standard, for conversions from floating point 
> to
> +integer. When a NaN or infinite operand cannot be represented in
> +the destination format and this cannot otherwise be indicated, 
> the
> +invalid operation exception shall be signaled. When a numeric
> +operand would convert to an integer outside the range of the
> +destination format, the invalid operation exception shall be
> +signaled if this situation cannot otherwise be indicated.  */
>   if (REAL_VALUE_ISNAN (*x))
> -   return const0_rtx;
> +   return flag_trapping_math ? NULL_RTX : const0_rtx;
> +
> + if (REAL_VALUE_ISINF (*x) && flag_trapping_math)
> +   return NULL_RTX;
>
>   /* Test against the signed upper bound.  */
>   wmax = wi::max_value (width, SIGNED);
>   real_from_integer (, VOIDmode, wmax, SIGNED);
>   if (real_less (, x))
> -   return immed_wide_int_const (wmax, mode);
> +   return (flag_trapping_math
> +   ? NULL_RTX : immed_wide_int_const (wmax, mode));
>
>   /* Test against the signed lower bound.  */
>   wmin = wi::min_value (width, SIGNED);
> @@ -2276,13 +2287,17 @@ simplify_const_unary_operation (enum rtx_code code, 
> machine_mode mode,
>
> case UNSIGNED_FIX:
>   if (REAL_VALUE_ISNAN (*x) || REAL_VALUE_NEGATIVE (*x))
> -   return const0_rtx;
> +   return flag_trapping_math ? NULL_RTX : const0_rtx;
> +
> + if (REAL_VALUE_ISINF (*x) && flag_trapping_math)
> +   return NULL_RTX;
>
>   /* Test against the unsigned upper bound.  */
>   wmax = wi::max_value (width, UNSIGNED);
>   real_from_integer (, VOIDmode, wmax, UNSIGNED);
>   if (real_less (, x))
> -   return immed_wide_int_const (wmax, mode);
> +   return (flag_trapping_math
> +   ? NULL_RTX : immed_wide_int_const (wmax, mode));
>
>   return immed_wide_int_const (real_to_integer (x, , width),
>mode);
> diff --git a/gcc/testsuite/gcc.target/i386/pr100927.c 
> b/gcc/testsuite/gcc.target/i386/pr100927.c
> new file mode 100644
> index 000..b137396c30f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100927.c
> @@ -0,0 +1,27 @@
> +/* { dg-do compile } */
> +/* { dg-options "-msse2 -O2 -ftrapping-math" } */
> +/* { dg-final { scan-assembler-times "cvttps2dq" 3 } }  */
> +
> +#include 
> +
> +__m128i foo_ofr() {
> +  const __m128i iv = _mm_set_epi32(0x4f00, 0x4f00, 0x4f00, 
> 0x4f00);
> +  const __m128  fv = _mm_castsi128_ps(iv);
> +  const __m128i riv = _mm_cvttps_epi32(fv);
> +  return riv;
> +}
> +
> +__m128i foo_nan() {
> +  const __m128i iv = _mm_set_epi32(0xff81, 0xff81, 0xff81, 
> 0xff81);
> +  const __m128  fv = _mm_castsi128_ps(iv);
> +  const __m128i riv = _mm_cvttps_epi32(fv);
> +  return riv;
> +}
> +
> +__m128i foo_inf() {
> +  const __m128i iv = _mm_set_epi32(0xff80, 0xff80, 0xff80, 
> 0xff80);
> +  const __m128  fv = _mm_castsi128_ps(iv);
> +  const __m128i riv = _mm_cvttps_epi32(fv);
> +  return riv;
> +}

Re: [RFC][PATCH] PR tree-optimization/109071 - -Warray-bounds false positive warnings due to code duplication from jump threading

2024-05-22 Thread Richard Biener
On Tue, May 21, 2024 at 11:36 PM David Malcolm  wrote:
>
> On Tue, 2024-05-21 at 15:13 +, Qing Zhao wrote:
> > Thanks for the comments and suggestions.
> >
> > > On May 15, 2024, at 10:00, David Malcolm 
> > > wrote:
> > >
> > > On Tue, 2024-05-14 at 15:08 +0200, Richard Biener wrote:
> > > > On Mon, 13 May 2024, Qing Zhao wrote:
> > > >
> > > > > -Warray-bounds is an important option to enable the linux kernel to
> > > > > keep the array out-of-bound errors out of the source tree.
> > > > >
> > > > > However, due to the false positive warnings reported in PR109071
> > > > > (-Warray-bounds false positive warnings due to code duplication from
> > > > > jump threading), -Warray-bounds=1 cannot be added on by default.
> > > > >
> > > > > Although it's impossible to eliminate all the false positive warnings
> > > > > from -Warray-bounds=1 (see PR104355, Misleading -Warray-bounds
> > > > > documentation says "always out of bounds"), we should minimize the
> > > > > false positive warnings in -Warray-bounds=1.
> > > > >
> > > > > The root reason for the false positive warnings reported in
> > > > > PR109071 is:
> > > > >
> > > > > When the thread jump optimization tries to reduce the # of branches
> > > > > inside the routine, sometimes it needs to duplicate the code and
> > > > > split it into two conditional paths, for example:
> > > > >
> > > > > The original code:
> > > > >
> > > > > void sparx5_set (int * ptr, struct nums * sg, int index)
> > > > > {
> > > > >   if (index >= 4)
> > > > > warn ();
> > > > >   *ptr = 0;
> > > > >   *val = sg->vals[index];
> > > > >   if (index >= 4)
> > > > > warn ();
> > > > >   *ptr = *val;
> > > > >
> > > > >   return;
> > > > > }
> > > > >
> > > > > With the thread jump, the above becomes:
> > > > >
> > > > > void sparx5_set (int * ptr, struct nums * sg, int index)
> > > > > {
> > > > >   if (index >= 4)
> > > > >     {
> > > > >       warn ();
> > > > >       *ptr = 0;               // Code duplication since "warn" does return;
> > > > >       *val = sg->vals[index]; // same for this line.
> > > > >                               // In this path, since it's under the
> > > > >                               // condition "index >= 4", the compiler knows
> > > > >                               // the value of "index" is larger than 4,
> > > > >                               // therefore the out-of-bound warning.
> > > > >       warn ();
> > > > >     }
> > > > >   else
> > > > >     {
> > > > >       *ptr = 0;
> > > > >       *val = sg->vals[index];
> > > > >     }
> > > > >   *ptr = *val;
> > > > >   return;
> > > > > }
> > > > >
> > > > > We can see, after the thread jump optimization, the # of branches
> > > > > inside the routine "sparx5_set" is reduced from 2 to 1; however, due
> > > > > to the code duplication (which is needed for the correctness of the
> > > > > code), we got a false positive out-of-bound warning.
> > > > >
> > > > > In order to eliminate such false positive out-of-bound warning,
> > > > >
> > > > > A. Add one more flag for GIMPLE: is_splitted.
> > > > > B. During the thread jump optimization, when the basic blocks are
> > > > >    duplicated, mark all the STMTs inside the original and duplicated
> > > > >    basic blocks as "is_splitted".
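
A compressed sketch of the proposal (is_splitted is the flag name from
the mail; the accessors below are made up for illustration):

  /* In jump threading, when duplicating a block:  */
  gimple_set_is_splitted (copy_stmt, true);

  /* In the -Warray-bounds diagnostics:  */
  if (gimple_is_splitted (stmt))
    /* The access only exists on a duplicated path; suppress the
       warning or demote it to a "maybe" form.  */
    return;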

Re: [PATCH] ubsan: Use right address space for MEM_REF created for bool/enum sanitization [PR115172]

2024-05-22 Thread Richard Biener
On Tue, 21 May 2024, Jakub Jelinek wrote:

> Hi!
> 
> The following testcase is miscompiled, because -fsanitize=bool,enum
> creates a MEM_REF without propagating the address space qualifiers,
> so what should be normally loaded using say %gs:/%fs: segment prefix
> isn't.  Together with asan it then causes that load to be sanitized.
> 
> Fixed thusly, bootstrapped/regtested on x86_64-linux and i686-linux,
> ok for trunk and release branches?

OK.

Thanks,
Richard.

> 2024-05-21  Jakub Jelinek  
> 
>   PR sanitizer/115172
>   * ubsan.cc (instrument_bool_enum_load): If rhs is not in generic
>   address space, use qualified version of utype with the right
>   address space.  Formatting fix.
> 
>   * gcc.dg/asan/pr115172.c: New test.
> 
> --- gcc/ubsan.cc.jj   2024-03-22 09:23:37.695296775 +0100
> +++ gcc/ubsan.cc  2024-05-21 12:10:24.261454107 +0200
> @@ -1776,13 +1776,17 @@ instrument_bool_enum_load (gimple_stmt_i
>|| TREE_CODE (gimple_assign_lhs (stmt)) != SSA_NAME)
>  return;
>  
> +  addr_space_t as = TYPE_ADDR_SPACE (TREE_TYPE (rhs));
> +  if (as != TYPE_ADDR_SPACE (utype))
> +utype = build_qualified_type (utype, TYPE_QUALS (utype)
> +  | ENCODE_QUAL_ADDR_SPACE (as));
>bool ends_bb = stmt_ends_bb_p (stmt);
>location_t loc = gimple_location (stmt);
>tree lhs = gimple_assign_lhs (stmt);
>tree ptype = build_pointer_type (TREE_TYPE (rhs));
>tree atype = reference_alias_ptr_type (rhs);
>gimple *g = gimple_build_assign (make_ssa_name (ptype),
> -   build_fold_addr_expr (rhs));
> +build_fold_addr_expr (rhs));
>gimple_set_location (g, loc);
>gsi_insert_before (gsi, g, GSI_SAME_STMT);
>tree mem = build2 (MEM_REF, utype, gimple_assign_lhs (g),
> --- gcc/testsuite/gcc.dg/asan/pr115172.c.jj   2024-05-21 17:28:18.302815400 
> +0200
> +++ gcc/testsuite/gcc.dg/asan/pr115172.c  2024-05-21 22:50:43.272753785 
> +0200
> @@ -0,0 +1,20 @@
> +/* PR sanitizer/115172 */
> +/* { dg-do compile { target i?86-*-* x86_64-*-* } } */
> +/* { dg-options "-O2 -fsanitize=address,bool -ffat-lto-objects 
> -fdump-tree-asan1" } */
> +/* { dg-final { scan-tree-dump-not "\.ASAN_CHECK " "asan1" } } */
> +
> +#ifdef __x86_64__
> +#define SEG __seg_gs
> +#else
> +#define SEG __seg_fs
> +#endif
> +
> +extern struct S { _Bool b; } s;
> +void bar (void);
> +
> +void
> +foo (void)
> +{
> +  if (*(volatile _Bool SEG *) (__UINTPTR_TYPE__) )
> +bar ();
> +}
> 
>   Jakub
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [PATCH] strlen: Fix up !si->full_string_p handling in count_nonzero_bytes_addr [PR115152]

2024-05-22 Thread Richard Biener
On Tue, 21 May 2024, Jakub Jelinek wrote:

> Hi!
> 
> The following testcase is miscompiled because
> strlen_pass::count_nonzero_bytes_addr doesn't correctly handle
> the !si->full_string_p case.
> If si->full_string_p, it correctly computes minlen and maxlen as
> minimum and maximum length of the '\0' terminated string and
> clears *nulterm (i.e. makes sure !full_string_p in the ultimate
> caller) if minlen is equal to or larger than nbytes and so
> '\0' isn't guaranteed to be among those bytes.
> But in the !si->full_string_p case, all we know is that there
> are [minlen,maxlen] non-zero bytes followed by unknown bytes,
> so effectively the maxlen is infinite (but caller cares about only
> the first nbytes bytes) and furthermore, we never know if there is
> any '\0' char among those, so *nulterm needs to be always cleared.
> 
> Fixed thusly, bootstrapped/regtested on x86_64-linux and i686-linux,
> ok for trunk and affected release branches?

OK.

Richard.

> 2024-05-21  Jakub Jelinek  
> 
>   PR tree-optimization/115152
>   * tree-ssa-strlen.cc (strlen_pass::count_nonzero_bytes_addr): If
>   !si->full_string_p, clear *nulterm and set maxlen to nbytes.
> 
>   * gcc.dg/pr115152.c: New test.
> 
> --- gcc/tree-ssa-strlen.cc.jj 2024-04-29 11:00:45.0 +0200
> +++ gcc/tree-ssa-strlen.cc2024-05-21 13:43:31.031208000 +0200
> @@ -4829,7 +4829,7 @@ strlen_pass::count_nonzero_bytes_addr (t
>if (maxlen + 1 < nbytes)
>   return false;
>  
> -  if (nbytes <= minlen)
> +  if (nbytes <= minlen || !si->full_string_p)
>   *nulterm = false;
>  
>if (nbytes < minlen)
> @@ -4839,6 +4839,9 @@ strlen_pass::count_nonzero_bytes_addr (t
>   maxlen = nbytes;
>   }
>  
> +  if (!si->full_string_p)
> + maxlen = nbytes;
> +
>if (minlen < lenrange[0])
>   lenrange[0] = minlen;
>if (lenrange[1] < maxlen)
> --- gcc/testsuite/gcc.dg/pr115152.c.jj2024-05-21 13:46:02.793214348 
> +0200
> +++ gcc/testsuite/gcc.dg/pr115152.c   2024-05-21 12:49:38.791626073 +0200
> @@ -0,0 +1,17 @@
> +/* PR tree-optimization/115152 */
> +/* { dg-do run } */
> +/* { dg-options "-O3 -fno-tree-fre -fno-tree-dominator-opts 
> -fno-tree-loop-im" } */
> +
> +int a, b, c, d;
> +signed char e[1] = { 1 };
> +
> +int
> +main ()
> +{
> +  for (a = 0; a < 3; a++)
> +for (b = 0; b < 2; b++)
> +  c = e[0] = e[0] ^ d;
> +  if (!c)
> +__builtin_abort ();
> +  return 0;
> +}
> 
>   Jakub
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
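
For illustration, a standalone sketch of the !si->full_string_p
situation fixed above (names and sizes made up):

char buf[8], dst[8];

void
g (void)
{
  buf[0] = 'a';
  buf[1] = 'b';
  /* The strlen pass knows buf starts with at least two non-zero
     bytes but not whether or where it is NUL-terminated.  For the
     copy below (nbytes == 4), minlen is 2 but bytes 2 and 3 are
     unknown: maxlen must be raised to nbytes and *nulterm cleared,
     since a '\0' may or may not be among the copied bytes.  */
  __builtin_memcpy (dst, buf, 4);
}
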


[PATCH] Fix mixed input kind permute optimization

2024-05-21 Thread Richard Biener
When change_vec_perm_layout runs into a permute combining two
nodes where one is invariant and one internal, the partition of
one input can be -1 but the other might not be.  The following
supports this case by simply ignoring inputs with input partition -1.

I'm not sure this is correct but it avoids ICEing when accessing
that partition's layout for gcc.target/i386/pr98928.c with the
change to avoid splitting store dataref groups during SLP discovery.

Bootstrap and regtest running on x86_64-unknown-linux-gnu (ontop of
the SLP series).  The change can't break anything that's already
broken but I'm not sure this does the right thing - the testcase
has a uniform constant.  I'll try to come up with a better runtime
testcase tomorrow.  Hints as to where to correctly fix such case
appreciated.

* tree-vect-slp.cc (change_vec_perm_layout): Ignore an
input partition of -1.
---
 gcc/tree-vect-slp.cc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 873748b0a72..f6ec1a81c96 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -4828,6 +4828,8 @@ change_vec_perm_layout (slp_tree node, lane_permutation_t 
,
{
  slp_tree in_node = SLP_TREE_CHILDREN (node)[entry.first];
  unsigned int in_partition_i = m_vertices[in_node->vertex].partition;
+ if (in_partition_i == -1u)
+   continue;
  this_in_layout_i = m_partitions[in_partition_i].layout;
}
   if (this_in_layout_i > 0)
-- 
2.35.3


Re: [committed] PATCH for Re: Stepping down as maintainer for ARC and Epiphany

2024-05-21 Thread Richard Biener via Gcc
On Tue, May 21, 2024 at 6:21 PM Jeff Law  wrote:
>
>
>
> On 5/21/24 8:02 AM, Paul Koning wrote:
> >
> >
> >> On May 21, 2024, at 9:57 AM, Jeff Law  wrote:
> >>
> >>
> >>
> >> On 5/21/24 12:05 AM, Richard Biener via Gcc wrote:
> >>> On Mon, May 20, 2024 at 4:45 PM Gerald Pfeifer  wrote:
> >>>>
> >>>> On Wed, 5 Jul 2023, Joern Rennecke wrote:
> >>>>> I haven't worked with these targets in years and can't really do
> >>>>> sensible maintenance or reviews of patches for them. I am currently
> >>>>> working on optimizations for other ports like RISC-V.
> >>>>
> >>>> I noticed MAINTAINERS was not updated, so pushed the patch below.
> >>> That leaves the epiphany port unmaintained.  Should we automatically add 
> >>> such
> >>> ports to the list of obsoleted ports?
> >> Given that epiphany has randomly failed tests for the last 3+ years due to 
> >> bugs in its patterns, yes, it really needs to be deprecated.
> >>
> >> I tried to fix the worst of the offenders in epiphany.md a few years back 
> >> and gave up.  Essentially seemingly innocent changes in the RTL will cause 
> >> reload to occasionally not see a path to get constraints satisfied.  So a 
> >> test which passes today, will flip to failing tomorrow while some other 
> >> set of tests will go the other way.
> >
> > Does LRA make that issue go away, or does it not help?
> LRA didn't trivially work on epiphany.  I didn't care enough about the
> port to try and make it LRA compatible.

In that case LRA will make the issue go away (the port, that is ...).

Richard.

>
> jeff
>


Re: [PATCH v5 1/5] Improve must tail in RTL backend

2024-05-21 Thread Richard Biener
On Tue, May 21, 2024 at 3:35 PM Andi Kleen  wrote:
>
> > I can't see how this triggers on the IL above, the loop should have
> > ignored both the return and the clobber and when recursing to
> > the predecessor stop before the above check when running into the
> > call?
>
> Yes, I tracked that down later. The problem was that there
> were multiple successors to the BB due to exception handling,
> which makes the find_tail_calls walker give up.
>
> Putting the new pass after ehcleanup fixed that, but there
> are still cases when ehcleanup cannot get rid of them and
> then it gives up. musttail checking at expand time still
> works, but can only give a vague error message.
>
> >
> > > In a optimized build this passes, but with -O0 it always fails
> > > when the pass is placed before pass_optimizations_g. I assume
> > > it's some problem with mem ssa form.
> > >
> > > Any ideas how to fix that? Otherwise I can restrict musttail to non
> > > structs.
> >
> > I wonder how this works when optimizing?
>
> It just doesn't. You need optimization to do tail calls with
> structs. The only alternative would be to detect the situation
> and pull in some extra passes.
>
> Also even with optimization it only works for structs that
> fit into registers. This could be maybe fixed, but is out of scope
> for this patch kit.

I see.  I do wonder how we should deal with the inherent
dependence on optimization for [[musttail]] to work then?  "Solve"
the problem with good documentation?  Offer a -fignore-musttail
option to allow a -O0 build to at least succeed?  But then [[musttail]]
would rather be [[shouldtail]] and can no longer be for correctness?

How does clang solve this?

Richard.

>
> -Andi
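
For comparison, a minimal sketch of the clang-side answer (clang's
spelling; whether GCC adopts the same rules is part of the question
above): the attribute is only accepted where a tail call can be
guaranteed independent of the optimization level, e.g. caller and
callee must have matching signatures, and anything else is a hard
error -- even at -O0:

int
fact (int n, int acc)
{
  if (n <= 1)
    return acc;
  /* clang: guaranteed tail call or a compile-time error.  */
  __attribute__((musttail)) return fact (n - 1, n * acc);
}
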


[PATCH 4/4] Testsuite updates

2024-05-21 Thread Richard Biener
The gcc.dg/vect/slp-12a.c case is interesting as we currently split
the 8 store group into lanes 0-5 which we SLP with an unroll factor
of two (on x86-64 with SSE) and the remaining two lanes are using
interleaving vectorization with a final unroll factor of four.  Thus
we're using hybrid SLP within a single store group.  After the change
we discover the same 0-5 lane SLP part as well as two single-lane
parts feeding the full store group.  But that results in a load
permutation that isn't supported (I have WIP patchs to rectify that).
So we end up cancelling SLP and vectorizing the whole loop with
interleaving which is IMO good and results in better code.

This is similar for gcc.target/i386/pr52252-atom.c where interleaving
generates much better code than hybrid SLP.  I'm unsure how to update
the testcase though.

gcc.dg/vect/slp-21.c runs into similar situations.  Note that when
we discard an instance while analyzing SLP operations we currently
force the full loop to have no SLP because hybrid detection is
broken.  It's probably not worth fixing this at this moment.

For gcc.dg/vect/pr97428.c we do not split the 16 store group
into two but merge the two 8 lane loads into one before doing the
store and thus have only a single SLP instance.  A similar situation
happens in gcc.dg/vect/slp-11c.c but the branches feeding the
single SLP store only have a single lane.  Likewise for
gcc.dg/vect/vect-complex-5.c and gcc.dg/vect/vect-gather-2.c.

gcc.dg/vect/slp-cond-1.c has an additional SLP vectorization
with a SLP store group of size two but two single-lane branches.

gcc.target/i386/pr98928.c ICEs in SLP permute optimization
because we don't expect a constant and internal branch to be
merged with a permute node in
vect_optimize_slp_pass::change_vec_perm_layout:4859 (the only
permutes merging two SLP nodes are two-operator nodes right now).
This still requires fixing.

The whole series has been bootstrapped and tested on 
x86_64-unknown-linux-gnu with the gcc.target/i386/pr98928.c FAIL
unfixed.

Comments welcome (and hello ARM CI), RISC-V and other arch
testing appreciated.  Unless there are comments to the contrary
I plan to push patch 1 and 2 tomorrow.

Thanks,
Richard.

* gcc.dg/vect/pr97428.c: Expect a single store SLP group.
* gcc.dg/vect/slp-11c.c: Likewise.
* gcc.dg/vect/vect-complex-5.c: Likewise.
* gcc.dg/vect/slp-12a.c: Do not expect SLP.
* gcc.dg/vect/slp-21.c: Likewise.
* gcc.dg/vect/slp-cond-1.c: Expect one more SLP.
* gcc.dg/vect/vect-gather-2.c: Expect SLP to be used.
* gcc.target/i386/pr52252-atom.c: XFAIL test for palignr.
---
 gcc/testsuite/gcc.dg/vect/pr97428.c  |  2 +-
 gcc/testsuite/gcc.dg/vect/slp-11c.c  |  5 +++--
 gcc/testsuite/gcc.dg/vect/slp-12a.c  |  6 +-
 gcc/testsuite/gcc.dg/vect/slp-21.c   | 19 +--
 gcc/testsuite/gcc.dg/vect/slp-cond-1.c   |  2 +-
 gcc/testsuite/gcc.dg/vect/vect-complex-5.c   |  2 +-
 gcc/testsuite/gcc.dg/vect/vect-gather-2.c|  1 -
 gcc/testsuite/gcc.target/i386/pr52252-atom.c |  3 ++-
 8 files changed, 18 insertions(+), 22 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr97428.c 
b/gcc/testsuite/gcc.dg/vect/pr97428.c
index 60dd984cfd3..3cc9976c00c 100644
--- a/gcc/testsuite/gcc.dg/vect/pr97428.c
+++ b/gcc/testsuite/gcc.dg/vect/pr97428.c
@@ -44,5 +44,5 @@ void foo_i2(dcmlx4_t dst[], const dcmlx_t src[], int n)
 /* { dg-final { scan-tree-dump "Detected interleaving store of size 16" "vect" 
} } */
 /* We're not able to peel & apply re-aligning to make accesses well-aligned 
for !vect_hw_misalign,
but we could by peeling the stores for alignment and applying re-aligning 
loads.  */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { 
xfail { ! vect_hw_misalign } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
xfail { ! vect_hw_misalign } } } } */
 /* { dg-final { scan-tree-dump-not "gap of 6 elements" "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-11c.c 
b/gcc/testsuite/gcc.dg/vect/slp-11c.c
index 0f680cd4e60..169b0d10eec 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-11c.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-11c.c
@@ -13,7 +13,8 @@ main1 ()
   unsigned int in[N*8] = 
{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63};
   float out[N*8];
 
-  /* Different operations - not SLPable.  */
+  /* Different operations - we SLP the store and split the group to two
+ single-lane branches.  */
   for (i = 0; i < N*4; i++)
 {
   out[i*2] = ((float) in[i*2] * 2 + 6) ;
@@ -44,4 +45,4 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { 
{ vect_uintfloat_cvt && vect_strided2 } && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { 

[PATCH 3/4] Avoid splitting store dataref groups during SLP discovery

2024-05-21 Thread Richard Biener
The following avoids splitting store dataref groups during SLP
discovery but instead forces (eventually single-lane) consecutive
lane SLP discovery for all lanes of the group, creating VEC_PERM
SLP nodes merging them so the store will always cover the whole group.

With this for example

int x[1024], y[1024], z[1024], w[1024];
void foo (void)
{
  for (int i = 0; i < 256; i++)
{
  x[4*i+0] = y[2*i+0];
  x[4*i+1] = y[2*i+1];
  x[4*i+2] = z[i];
  x[4*i+3] = w[i];
}
}

which was previously using hybrid SLP can now be fully SLPed and
SSE code generated looks better (but of course you never know,
I didn't actually benchmark).  We of course need a VF of four here.

.L2:
movdqa  z(%rax), %xmm0
movdqa  w(%rax), %xmm4
movdqa  y(%rax,%rax), %xmm2
movdqa  y+16(%rax,%rax), %xmm1
movdqa  %xmm0, %xmm3
punpckhdq   %xmm4, %xmm0
punpckldq   %xmm4, %xmm3
movdqa  %xmm2, %xmm4
shufps  $238, %xmm3, %xmm2
movaps  %xmm2, x+16(,%rax,4)
movdqa  %xmm1, %xmm2
shufps  $68, %xmm3, %xmm4
shufps  $68, %xmm0, %xmm2
movaps  %xmm4, x(,%rax,4)
shufps  $238, %xmm0, %xmm1
movaps  %xmm2, x+32(,%rax,4)
movaps  %xmm1, x+48(,%rax,4)
addq$16, %rax
cmpq$1024, %rax
jne .L2

The extra permute nodes merging distinct branches of the SLP
tree might be unexpected for some code, esp. since
SLP_TREE_REPRESENTATIVE cannot be meaningfully set and we
cannot populate SLP_TREE_SCALAR_STMTS or SLP_TREE_SCALAR_OPS
consistently as we can have a mix of both.

The patch keeps the sub-trees formed from consecutive lanes but that's
in principle not necessary if we for example have an even/odd
split which now would result in N single-lane sub-trees.  That's
left for future improvements.

The interesting part is how VLA vector ISAs handle merging of
two vectors that's not trivial even/odd merging.  The strategy
of how to build the permute tree might need adjustments for that
(in the end splitting each branch to single lanes and then doing
even/odd merging would be the brute-force fallback).  Not sure
how much we can or should rely on the SLP optimize pass to handle
this.
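
A sketch of that brute-force fallback (lane labels illustrative):

  {a}  {b}  {c}  {d}      four single-lane sub-trees
    \  /      \  /
   {a,b}     {c,d}        pairwise VEC_PERM (even/odd) merges
       \      /
     {a,b,c,d}            feeds the single store group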

* tree-vect-slp.cc (vect_build_slp_instance): Do not split
store dataref groups on loop SLP discovery failure but create
a single SLP instance for the stores but branch to SLP sub-trees
and merge with a series of VEC_PERM nodes.
---
 gcc/tree-vect-slp.cc | 240 ++-
 1 file changed, 214 insertions(+), 26 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 43f2c153bf0..873748b0a72 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3468,12 +3468,7 @@ vect_build_slp_instance (vec_info *vinfo,
  return true;
}
 }
-  else
-{
-  /* Failed to SLP.  */
-  /* Free the allocated memory.  */
-  scalar_stmts.release ();
-}
+  /* Failed to SLP.  */
 
   stmt_vec_info stmt_info = stmt_info_;
   /* Try to break the group up into pieces.  */
@@ -3491,6 +3486,9 @@ vect_build_slp_instance (vec_info *vinfo,
   if (is_a  (vinfo)
  && (i > 1 && i < group_size))
{
+ /* Free the allocated memory.  */
+ scalar_stmts.release ();
+
  tree scalar_type
= TREE_TYPE (DR_REF (STMT_VINFO_DATA_REF (stmt_info)));
  tree vectype = get_vectype_for_scalar_type (vinfo, scalar_type,
@@ -3535,38 +3533,228 @@ vect_build_slp_instance (vec_info *vinfo,
}
}
 
-  /* For loop vectorization split into arbitrary pieces of size > 1.  */
-  if (is_a  (vinfo)
- && (i > 1 && i < group_size)
- && !vect_slp_prefer_store_lanes_p (vinfo, stmt_info, group_size, i))
+  /* For loop vectorization split the RHS into arbitrary pieces of
+size >= 1.  */
+  else if (is_a  (vinfo)
+  && (i > 0 && i < group_size)
+  && !vect_slp_prefer_store_lanes_p (vinfo,
+ stmt_info, group_size, i))
{
- unsigned group1_size = i;
-
  if (dump_enabled_p ())
dump_printf_loc (MSG_NOTE, vect_location,
 "Splitting SLP group at stmt %u\n", i);
 
- stmt_vec_info rest = vect_split_slp_store_group (stmt_info,
-  group1_size);
- /* Loop vectorization cannot handle gaps in stores, make sure
-the split group appears as strided.  */
- STMT_VINFO_STRIDED_P (rest) = 1;
- DR_GROUP_GAP (rest) = 0;
- STMT_VINFO_STRIDED_P (stmt_info) = 1;
- DR_GROUP_GAP (stmt_info) = 0;
+ /* Analyze the stored values and pinch them together with
+a permute node so we can preserve the whole store group.  */
+ auto_vec rhs_nodes;
+
+ /* Calculate the unrolling factor based 

[PATCH 2/4] Avoid SLP_REPRESENTATIVE access for VEC_PERM in SLP scheduling

2024-05-21 Thread Richard Biener
SLP permute nodes can end up without an SLP_REPRESENTATIVE now;
the following avoids touching it in this case in vect_schedule_slp_node.

* tree-vect-slp.cc (vect_schedule_slp_node): Avoid looking
at SLP_REPRESENTATIVE for VEC_PERM nodes.
---
 gcc/tree-vect-slp.cc | 28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index f34ed54a70b..43f2c153bf0 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -9301,13 +9301,8 @@ vect_schedule_slp_node (vec_info *vinfo,
   gcc_assert (SLP_TREE_NUMBER_OF_VEC_STMTS (node) != 0);
   SLP_TREE_VEC_DEFS (node).create (SLP_TREE_NUMBER_OF_VEC_STMTS (node));
 
-  if (dump_enabled_p ())
-dump_printf_loc (MSG_NOTE, vect_location,
-"-->vectorizing SLP node starting from: %G",
-stmt_info->stmt);
-
-  if (STMT_VINFO_DATA_REF (stmt_info)
-  && SLP_TREE_CODE (node) != VEC_PERM_EXPR)
+  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
+  && STMT_VINFO_DATA_REF (stmt_info))
 {
   /* Vectorized loads go before the first scalar load to make it
 ready early, vectorized stores go before the last scalar
@@ -9319,10 +9314,10 @@ vect_schedule_slp_node (vec_info *vinfo,
last_stmt_info = vect_find_last_scalar_stmt_in_slp (node);
   si = gsi_for_stmt (last_stmt_info->stmt);
 }
-  else if ((STMT_VINFO_TYPE (stmt_info) == cycle_phi_info_type
-   || STMT_VINFO_TYPE (stmt_info) == induc_vec_info_type
-   || STMT_VINFO_TYPE (stmt_info) == phi_info_type)
-  && SLP_TREE_CODE (node) != VEC_PERM_EXPR)
+  else if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
+  && (STMT_VINFO_TYPE (stmt_info) == cycle_phi_info_type
+  || STMT_VINFO_TYPE (stmt_info) == induc_vec_info_type
+  || STMT_VINFO_TYPE (stmt_info) == phi_info_type))
 {
   /* For PHI node vectorization we do not use the insertion iterator.  */
   si = gsi_none ();
@@ -9456,6 +9451,9 @@ vect_schedule_slp_node (vec_info *vinfo,
   /* Handle purely internal nodes.  */
   if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
 {
+  if (dump_enabled_p ())
+   dump_printf_loc (MSG_NOTE, vect_location,
+"-->vectorizing SLP permutation node\n");
   /* ???  the transform kind is stored to STMT_VINFO_TYPE which might
 be shared with different SLP nodes (but usually it's the same
 operation apart from the case the stmt is only there for denoting
@@ -9474,7 +9472,13 @@ vect_schedule_slp_node (vec_info *vinfo,
  }
 }
   else
-vect_transform_stmt (vinfo, stmt_info, , node, instance);
+{
+  if (dump_enabled_p ())
+   dump_printf_loc (MSG_NOTE, vect_location,
+"-->vectorizing SLP node starting from: %G",
+stmt_info->stmt);
+  vect_transform_stmt (vinfo, stmt_info, , node, instance);
+}
 }
 
 /* Replace scalar calls from SLP node NODE with setting of their lhs to zero.
-- 
2.35.3



[PATCH 1/4] Avoid requiring VEC_PERM represenatives

2024-05-21 Thread Richard Biener
The following plugs one hole where we require a VEC_PERM node
representative unnecessarily.  This is for vect_check_store_rhs
which looks at the RHS and checks whether a constant can be
native encoded.  The fix is to guard that with vect_constant_def
additionally and to make vect_is_simple_use forgiving for a missing
SLP_TREE_REPRESENTATIVE when the child is a VEC_PERM node,
initializing the scalar def to error_mark_node.

* tree-vect-stmts.cc (vect_check_store_rhs): Look at *rhs
only when it's a vect_constant_def.
(vect_is_simple_use): When we have no representative for
an internal node, fill in *op with error_mark_node.
---
 gcc/tree-vect-stmts.cc | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 672959501bb..4219ad832db 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2553,7 +2553,8 @@ vect_check_store_rhs (vec_info *vinfo, stmt_vec_info 
stmt_info,
 
   /* In the case this is a store from a constant make sure
  native_encode_expr can handle it.  */
-  if (CONSTANT_CLASS_P (*rhs) && native_encode_expr (*rhs, NULL, 64) == 0)
+  if (rhs_dt == vect_constant_def
+  && CONSTANT_CLASS_P (*rhs) && native_encode_expr (*rhs, NULL, 64) == 0)
 {
   if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -14002,8 +14003,26 @@ vect_is_simple_use (vec_info *vinfo, stmt_vec_info 
stmt, slp_tree slp_node,
   *vectype = SLP_TREE_VECTYPE (child);
   if (SLP_TREE_DEF_TYPE (child) == vect_internal_def)
{
- *op = gimple_get_lhs (SLP_TREE_REPRESENTATIVE (child)->stmt);
- return vect_is_simple_use (*op, vinfo, dt, def_stmt_info_out);
+ /* ???  VEC_PERM nodes might be intermediate and their lane value
+have no representative (nor do we build a VEC_PERM stmt for
+the actual operation).  Note for two-operator nodes we set
+a representative but leave scalar stmts empty as we'd only
+have one for a subset of lanes.  Ideally no caller would
+require *op for internal defs.  */
+ if (SLP_TREE_REPRESENTATIVE (child))
+   {
+ *op = gimple_get_lhs (SLP_TREE_REPRESENTATIVE (child)->stmt);
+ return vect_is_simple_use (*op, vinfo, dt, def_stmt_info_out);
+   }
+ else
+   {
+ gcc_assert (SLP_TREE_CODE (child) == VEC_PERM_EXPR);
+ *op = error_mark_node;
+ *dt = vect_internal_def;
+ if (def_stmt_info_out)
+   *def_stmt_info_out = NULL;
+ return true;
+   }
}
   else
{
-- 
2.35.3



[gcc r15-753] tree-optimization/115137 - more ptr-vs-ptr compare fixes

2024-05-21 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:85f7828679edc3ae7488594145756cd53787650e

commit r15-753-g85f7828679edc3ae7488594145756cd53787650e
Author: Richard Biener 
Date:   Tue May 21 10:12:40 2024 +0200

tree-optimization/115137 - more ptr-vs-ptr compare fixes

The following fixes the omission that NONLOCAL also includes const-pool entries.

PR tree-optimization/115137
* tree-ssa-structalias.cc (pt_solution_includes_const_pool): 
NONLOCAL
also includes constant pool entries.

* gcc.dg/torture/pr115137.c: New testcase.

Diff:
---
 gcc/testsuite/gcc.dg/torture/pr115137.c | 34 +
 gcc/tree-ssa-structalias.cc |  1 +
 2 files changed, 35 insertions(+)

diff --git a/gcc/testsuite/gcc.dg/torture/pr115137.c 
b/gcc/testsuite/gcc.dg/torture/pr115137.c
new file mode 100644
index 000..9cd8ff93633
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr115137.c
@@ -0,0 +1,34 @@
+/* { dg-do run } */
+
+struct a {
+  int b;
+} c;
+
+int d;
+long e;
+
+static void f(char *g, char *h, struct a *l) {
+  char a[1024];
+  int j = 0;
+
+  if (d)
+h = a;
+
+  for (; g < h; g++)
+if (__builtin_iscntrl(*g))
+  ++j;
+
+  while (l->b < j)
+;
+}
+
+int main() {
+  static const struct {
+char *input;
+  } k[] = {{"somepage.html"}, {""}};
+
+  for (unsigned int i = 0; i < 1; ++i) {
+e = __builtin_strlen(k[i].input);
+f(k[i].input, k[i].input + e, );
+  }
+}
diff --git a/gcc/tree-ssa-structalias.cc b/gcc/tree-ssa-structalias.cc
index 61fb3610a17..0e9423a78ec 100644
--- a/gcc/tree-ssa-structalias.cc
+++ b/gcc/tree-ssa-structalias.cc
@@ -7087,6 +7087,7 @@ bool
 pt_solution_includes_const_pool (struct pt_solution *pt)
 {
   return (pt->const_pool
+ || pt->nonlocal
  || (pt->escaped && (!cfun || cfun->gimple_df->escaped.const_pool))
  || (pt->ipa_escaped && ipa_escaped_pt.const_pool));
 }


[PATCH] tree-optimization/115137 - more ptr-vs-ptr compare fixes

2024-05-21 Thread Richard Biener
The following fixes the omission that NONLOCAL also includes const-pool entries.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

PR tree-optimization/115137
* tree-ssa-structalias.cc (pt_solution_includes_const_pool): NONLOCAL
also includes constant pool entries.

* gcc.dg/torture/pr115137.c: New testcase.
---
 gcc/testsuite/gcc.dg/torture/pr115137.c | 34 +
 gcc/tree-ssa-structalias.cc |  1 +
 2 files changed, 35 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/torture/pr115137.c

diff --git a/gcc/testsuite/gcc.dg/torture/pr115137.c 
b/gcc/testsuite/gcc.dg/torture/pr115137.c
new file mode 100644
index 000..9cd8ff93633
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr115137.c
@@ -0,0 +1,34 @@
+/* { dg-do run } */
+
+struct a {
+  int b;
+} c;
+
+int d;
+long e;
+
+static void f(char *g, char *h, struct a *l) {
+  char a[1024];
+  int j = 0;
+
+  if (d)
+h = a;
+
+  for (; g < h; g++)
+if (__builtin_iscntrl(*g))
+  ++j;
+
+  while (l->b < j)
+;
+}
+
+int main() {
+  static const struct {
+char *input;
+  } k[] = {{"somepage.html"}, {""}};
+
+  for (unsigned int i = 0; i < 1; ++i) {
+e = __builtin_strlen(k[i].input);
+f(k[i].input, k[i].input + e, );
+  }
+}
diff --git a/gcc/tree-ssa-structalias.cc b/gcc/tree-ssa-structalias.cc
index 61fb3610a17..0e9423a78ec 100644
--- a/gcc/tree-ssa-structalias.cc
+++ b/gcc/tree-ssa-structalias.cc
@@ -7087,6 +7087,7 @@ bool
 pt_solution_includes_const_pool (struct pt_solution *pt)
 {
   return (pt->const_pool
+ || pt->nonlocal
  || (pt->escaped && (!cfun || cfun->gimple_df->escaped.const_pool))
  || (pt->ipa_escaped && ipa_escaped_pt.const_pool));
 }
-- 
2.35.3


Re: [PATCH] Cache the set of EH_RETURN_DATA_REGNOs

2024-05-21 Thread Richard Biener
On Tue, May 21, 2024 at 11:03 AM Richard Sandiford
 wrote:
>
> While reviewing Andrew's fix for PR114843, it seemed like it would
> be convenient to have a HARD_REG_SET of EH_RETURN_DATA_REGNOs.
> This patch adds one and uses it to simplify a couple of use sites.
>
> Tested on aarch64-linux-gnu & x86_64-linux-gnu.  OK to install?

OK

> Richard
>
>
> gcc/
> * hard-reg-set.h (target_hard_regs::x_eh_return_data_regs): New field.
> (eh_return_data_regs): New macro.
> * reginfo.cc (init_reg_sets_1): Initialize x_eh_return_data_regs.
> * df-scan.cc (df_get_exit_block_use_set): Use it.
> * ira-lives.cc (process_out_of_region_eh_regs): Likewise.
> ---
>  gcc/df-scan.cc |  8 +---
>  gcc/hard-reg-set.h |  5 +
>  gcc/ira-lives.cc   | 10 ++
>  gcc/reginfo.cc | 10 ++
>  4 files changed, 18 insertions(+), 15 deletions(-)
>
> diff --git a/gcc/df-scan.cc b/gcc/df-scan.cc
> index 1bade2cd71e..c8ab3c09cee 100644
> --- a/gcc/df-scan.cc
> +++ b/gcc/df-scan.cc
> @@ -3702,13 +3702,7 @@ df_get_exit_block_use_set (bitmap exit_block_uses)
>
>/* Mark the registers that will contain data for the handler.  */
>if (reload_completed && crtl->calls_eh_return)
> -for (i = 0; ; ++i)
> -  {
> -   unsigned regno = EH_RETURN_DATA_REGNO (i);
> -   if (regno == INVALID_REGNUM)
> - break;
> -   bitmap_set_bit (exit_block_uses, regno);
> -  }
> +IOR_REG_SET_HRS (exit_block_uses, eh_return_data_regs);
>
>  #ifdef EH_RETURN_STACKADJ_RTX
>if ((!targetm.have_epilogue () || ! epilogue_completed)
> diff --git a/gcc/hard-reg-set.h b/gcc/hard-reg-set.h
> index 8c1d1512ca2..340eb425c10 100644
> --- a/gcc/hard-reg-set.h
> +++ b/gcc/hard-reg-set.h
> @@ -421,6 +421,9 @@ struct target_hard_regs {
>   with the local stack frame are safe, but scant others.  */
>HARD_REG_SET x_regs_invalidated_by_call;
>
> +  /* The set of registers that are used by EH_RETURN_DATA_REGNO.  */
> +  HARD_REG_SET x_eh_return_data_regs;
> +
>/* Table of register numbers in the order in which to try to use them.  */
>int x_reg_alloc_order[FIRST_PSEUDO_REGISTER];
>
> @@ -485,6 +488,8 @@ extern struct target_hard_regs *this_target_hard_regs;
>  #define call_used_or_fixed_regs \
>(regs_invalidated_by_call | fixed_reg_set)
>  #endif
> +#define eh_return_data_regs \
> +  (this_target_hard_regs->x_eh_return_data_regs)
>  #define reg_alloc_order \
>(this_target_hard_regs->x_reg_alloc_order)
>  #define inv_reg_alloc_order \
> diff --git a/gcc/ira-lives.cc b/gcc/ira-lives.cc
> index e07d3dc3e89..958eabb9708 100644
> --- a/gcc/ira-lives.cc
> +++ b/gcc/ira-lives.cc
> @@ -1260,14 +1260,8 @@ process_out_of_region_eh_regs (basic_block bb)
>for (int n = ALLOCNO_NUM_OBJECTS (a) - 1; n >= 0; n--)
> {
>   ira_object_t obj = ALLOCNO_OBJECT (a, n);
> - for (int k = 0; ; k++)
> -   {
> - unsigned int regno = EH_RETURN_DATA_REGNO (k);
> - if (regno == INVALID_REGNUM)
> -   break;
> - SET_HARD_REG_BIT (OBJECT_CONFLICT_HARD_REGS (obj), regno);
> - SET_HARD_REG_BIT (OBJECT_TOTAL_CONFLICT_HARD_REGS (obj), regno);
> -   }
> + OBJECT_CONFLICT_HARD_REGS (obj) |= eh_return_data_regs;
> + OBJECT_TOTAL_CONFLICT_HARD_REGS (obj) |= eh_return_data_regs;
> }
>  }
>  }
> diff --git a/gcc/reginfo.cc b/gcc/reginfo.cc
> index a0baeb90e12..73121365c47 100644
> --- a/gcc/reginfo.cc
> +++ b/gcc/reginfo.cc
> @@ -420,6 +420,16 @@ init_reg_sets_1 (void)
> }
>  }
>
> +  /* Recalculate eh_return_data_regs.  */
> +  CLEAR_HARD_REG_SET (eh_return_data_regs);
> +  for (i = 0; ; ++i)
> +{
> +  unsigned int regno = EH_RETURN_DATA_REGNO (i);
> +  if (regno == INVALID_REGNUM)
> +   break;
> +  SET_HARD_REG_BIT (eh_return_data_regs, regno);
> +}
> +
>memset (have_regs_of_mode, 0, sizeof (have_regs_of_mode));
>memset (contains_reg_of_mode, 0, sizeof (contains_reg_of_mode));
>for (m = 0; m < (unsigned int) MAX_MACHINE_MODE; m++)
> --
> 2.25.1
>


[gcc r15-750] tree-optimization/115149 - VOP live and missing PHIs

2024-05-21 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:ec9b8bafe20755d13ab9a1b834b5da79ae972c0e

commit r15-750-gec9b8bafe20755d13ab9a1b834b5da79ae972c0e
Author: Richard Biener 
Date:   Tue May 21 09:48:04 2024 +0200

tree-optimization/115149 - VOP live and missing PHIs

The following fixes a bug in vop-live get_live_in which was using
NULL to indicate the first processed edge but at the same time
using it for the case where the live-in virtual operand cannot be computed.
The following fixes this, avoiding sinking a load to a place where
we'd have to insert virtual PHIs to make the virtual operand SSA
web OK.

PR tree-optimization/115149
* tree-ssa-live.cc (virtual_operand_live::get_live_in):
Explicitly track the first processed edge.

* gcc.dg/pr115149.c: New testcase.

Diff:
---
 gcc/testsuite/gcc.dg/pr115149.c | 16 
 gcc/tree-ssa-live.cc|  8 ++--
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/pr115149.c b/gcc/testsuite/gcc.dg/pr115149.c
new file mode 100644
index 000..9f6bc97dbe6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr115149.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fno-inline -fno-tree-vrp -fno-ipa-sra -fno-tree-dce 
-fno-tree-ch" } */
+
+int a, c, e, f, g, h[1], i;
+static int j(int b) { return 0; }
+static void k(int d) {}
+int main()
+{
+  if (h[0])
+while (1) {
+   k(f && j(i && (h[g] = e)));
+   while (a)
+ c ^= 1;
+}
+  return 0;
+}
diff --git a/gcc/tree-ssa-live.cc b/gcc/tree-ssa-live.cc
index e6ae551a457..60dfc05dcd9 100644
--- a/gcc/tree-ssa-live.cc
+++ b/gcc/tree-ssa-live.cc
@@ -1675,14 +1675,18 @@ virtual_operand_live::get_live_in (basic_block bb)
   edge_iterator ei;
   edge e;
   tree livein = NULL_TREE;
+  bool first = true;
   FOR_EACH_EDGE (e, ei, bb->preds)
 if (e->flags & EDGE_DFS_BACK)
   /* We can ignore backedges since if there's a def there it would
 have forced a PHI in the source because it also acts as use
 downstream.  */
   continue;
-else if (!livein)
-  livein = get_live_out (e->src);
+else if (first)
+  {
+   livein = get_live_out (e->src);
+   first = false;
+  }
 else if (get_live_out (e->src) != livein)
   /* When there's no virtual use downstream this indicates a point
 where we'd insert a PHI merging the different live virtual


[PATCH] tree-optimization/115149 - VOP live and missing PHIs

2024-05-21 Thread Richard Biener
The following fixes a bug in vop-live get_live_in which was using
NULL to indicate the first processed edge but at the same time
using it for the case where the live-in virtual operand cannot be computed.
The following fixes this, avoiding sinking a load to a place where
we'd have to insert virtual PHIs to make the virtual operand SSA
web OK.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

PR tree-optimization/115149
* tree-ssa-live.cc (virtual_operand_live::get_live_in):
Explicitly track the first processed edge.

* gcc.dg/pr115149.c: New testcase.
---
 gcc/testsuite/gcc.dg/pr115149.c | 16 
 gcc/tree-ssa-live.cc|  8 ++--
 2 files changed, 22 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr115149.c

diff --git a/gcc/testsuite/gcc.dg/pr115149.c b/gcc/testsuite/gcc.dg/pr115149.c
new file mode 100644
index 000..9f6bc97dbe6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr115149.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fno-inline -fno-tree-vrp -fno-ipa-sra -fno-tree-dce 
-fno-tree-ch" } */
+
+int a, c, e, f, g, h[1], i;
+static int j(int b) { return 0; }
+static void k(int d) {}
+int main()
+{
+  if (h[0])
+while (1) {
+   k(f && j(i && (h[g] = e)));
+   while (a)
+ c ^= 1;
+}
+  return 0;
+}
diff --git a/gcc/tree-ssa-live.cc b/gcc/tree-ssa-live.cc
index e6ae551a457..60dfc05dcd9 100644
--- a/gcc/tree-ssa-live.cc
+++ b/gcc/tree-ssa-live.cc
@@ -1675,14 +1675,18 @@ virtual_operand_live::get_live_in (basic_block bb)
   edge_iterator ei;
   edge e;
   tree livein = NULL_TREE;
+  bool first = true;
   FOR_EACH_EDGE (e, ei, bb->preds)
 if (e->flags & EDGE_DFS_BACK)
   /* We can ignore backedges since if there's a def there it would
 have forced a PHI in the source because it also acts as use
 downstream.  */
   continue;
-else if (!livein)
-  livein = get_live_out (e->src);
+else if (first)
+  {
+   livein = get_live_out (e->src);
+   first = false;
+  }
 else if (get_live_out (e->src) != livein)
   /* When there's no virtual use downstream this indicates a point
 where we'd insert a PHI merging the different live virtual
-- 
2.35.3


Re: [PATCH v5 1/5] Improve must tail in RTL backend

2024-05-21 Thread Richard Biener
On Mon, May 20, 2024 at 6:53 AM Andi Kleen  wrote:
>
> On Tue, May 14, 2024 at 04:15:08PM +0200, Richard Biener wrote:
> > On Sun, May 5, 2024 at 8:16 PM Andi Kleen  wrote:
> > >
> > > - Give error messages for all causes of non sibling call generation
> > > - Don't override choices of other non sibling call checks with
> > > must tail. This causes ICEs. The must tail attribute now only
> > > overrides flag_optimize_sibling_calls locally.
> > > - Error out when tree-tailcall failed to mark a must-tail call
> > > a sibcall. In this case it doesn't know the true reason and only gives
> > > a vague message (this could be improved, but it's already useful without
> > > that) tree-tailcall usually fails without optimization, so must
> > > adjust the existing must-tail plugin test to specify -O2.
> > >
> > > PR83324
> > >
> > > gcc/ChangeLog:
> > >
> > > * calls.cc (expand_call): Fix mustcall implementation.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > * gcc.dg/plugin/must-tail-call-1.c: Adjust.
> > > ---
> > >  gcc/calls.cc  | 30 ---
> > >  .../gcc.dg/plugin/must-tail-call-1.c  |  1 +
> > >  2 files changed, 21 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/gcc/calls.cc b/gcc/calls.cc
> > > index 21d78f9779fe..a6b8ee44cc29 100644
> > > --- a/gcc/calls.cc
> > > +++ b/gcc/calls.cc
> > > @@ -2650,7 +2650,9 @@ expand_call (tree exp, rtx target, int ignore)
> > >/* The type of the function being called.  */
> > >tree fntype;
> > >bool try_tail_call = CALL_EXPR_TAILCALL (exp);
> > > -  bool must_tail_call = CALL_EXPR_MUST_TAIL_CALL (exp);
> > > +  /* tree-tailcall decided not to do tail calls. Error for the musttail 
> > > case.  */
> > > +  if (!try_tail_call)
> > > +  maybe_complain_about_tail_call (exp, "other reasons");
> > >int pass;
> > >
> > >/* Register in which non-BLKmode value will be returned,
> > > @@ -3022,10 +3024,22 @@ expand_call (tree exp, rtx target, int ignore)
> > >   pushed these optimizations into -O2.  Don't try if we're already
> > >   expanding a call, as that means we're an argument.  Don't try if
> > >   there's cleanups, as we know there's code to follow the call.  */
> > > -  if (currently_expanding_call++ != 0
> > > -  || (!flag_optimize_sibling_calls && !CALL_FROM_THUNK_P (exp))
> > > -  || args_size.var
> > > -  || dbg_cnt (tail_call) == false)
> > > +  if (currently_expanding_call++ != 0)
> > > +{
> > > +  maybe_complain_about_tail_call (exp, "inside another call");
> > > +  try_tail_call = 0;
> > > +}
> > > +  if (!flag_optimize_sibling_calls
> > > +   && !CALL_FROM_THUNK_P (exp)
> > > +   && !CALL_EXPR_MUST_TAIL_CALL (exp))
> > > +try_tail_call = 0;
> > > +  if (args_size.var)
> >
> > If we are both inside another call and run into this we give two errors,
> > but I guess that's OK ...
> >
> > > +{
> > > +  /* ??? correct message?  */
> > > +  maybe_complain_about_tail_call (exp, "stack space needed");
> >
> > args_size.var != NULL_TREE means the argument size is not constant.
> > I'm quite sure this is an overly conservative check.
> >
> > > +  try_tail_call = 0;
> > > +}
> > > +  if (dbg_cnt (tail_call) == false)
> > >  try_tail_call = 0;
> > >
> > >/* Workaround buggy C/C++ wrappers around Fortran routines with
> > > @@ -3046,15 +3060,11 @@ expand_call (tree exp, rtx target, int ignore)
> > > if (MEM_P (*iter))
> > >   {
> > > try_tail_call = 0;
> > > +   maybe_complain_about_tail_call (exp, "hidden string 
> > > length argument");
> >
> > "hidden string length argument passed on stack"
> >
> > from what I read the code.
> >
> > > break;
> > >   }
> > > }
> > >
> > > -  /* If the user has marked the function as requiring tail-call
> > > - optimization, attempt it.  */
> > > -  if (must_tail_call)
> > > -try_tail_call = 1;
> > > -
> > >/*  Rest of purposes for tail call optimiz

Re: [PATCH] Don't reduce estimated unrolled size for innermost loop.

2024-05-21 Thread Richard Biener
On Tue, May 21, 2024 at 4:35 AM Hongtao Liu  wrote:
>
> On Wed, May 15, 2024 at 5:24 PM Richard Biener
>  wrote:
> >
> > On Wed, May 15, 2024 at 4:15 AM Hongtao Liu  wrote:
> > >
> > > On Mon, May 13, 2024 at 3:40 PM Richard Biener
> > >  wrote:
> > > >
> > > > On Mon, May 13, 2024 at 4:29 AM liuhongt  wrote:
> > > > >
> > > > > As testcase in the PR, O3 cunrolli may prevent vectorization for the
> > > > > innermost loop and increase register pressure.
> > > > > The patch removes the 1/3 reduction of unr_insn for innermost loop 
> > > > > for UL_ALL.
> > > > > ul != UL_ALL is needed since some small loop complete unrolling at O2
> > > > > relies
> > > > > on the reduction.
> > > > >
> > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > > > No big impact for SPEC2017.
> > > > > Ok for trunk?
> > > >
> > > > This removes the 1/3 reduction when unrolling a loop nest (the case I 
> > > > was
> > > > concerned about).  Unrolling of a nest is by iterating in
> > > > tree_unroll_loops_completely
> > > > so the to be unrolled loop appears innermost.  So I think you need a new
> > > > parameter on tree_unroll_loops_completely_1 indicating whether we're in 
> > > > the
> > > > first iteration (or whether to assume inner most loops will "simplify").
> > > yes, it would be better.
> > > >
> > > > Few comments below
> > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > > PR tree-optimization/112325
> > > > > * tree-ssa-loop-ivcanon.cc (estimated_unrolled_size): Add 2
> > > > > new parameters: loop and ul, and remove unr_insns reduction
> > > > > for innermost loop.
> > > > > (try_unroll_loop_completely): Pass loop and ul to
> > > > > estimated_unrolled_size.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > > * gcc.dg/tree-ssa/pr112325.c: New test.
> > > > > * gcc.dg/vect/pr69783.c: Add extra option --param
> > > > > max-completely-peeled-insns=300.
> > > > > ---
> > > > >  gcc/testsuite/gcc.dg/tree-ssa/pr112325.c | 57 
> > > > > 
> > > > >  gcc/testsuite/gcc.dg/vect/pr69783.c  |  2 +-
> > > > >  gcc/tree-ssa-loop-ivcanon.cc | 16 +--
> > > > >  3 files changed, 71 insertions(+), 4 deletions(-)
> > > > >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
> > > > >
> > > > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c 
> > > > > b/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
> > > > > new file mode 100644
> > > > > index 000..14208b3e7f8
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
> > > > > @@ -0,0 +1,57 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-options "-O2 -fdump-tree-cunrolli-details" } */
> > > > > +
> > > > > +typedef unsigned short ggml_fp16_t;
> > > > > +static float table_f32_f16[1 << 16];
> > > > > +
> > > > > +inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
> > > > > +unsigned short s;
> > > > > +__builtin_memcpy(, , sizeof(unsigned short));
> > > > > +return table_f32_f16[s];
> > > > > +}
> > > > > +
> > > > > +typedef struct {
> > > > > +ggml_fp16_t d;
> > > > > +ggml_fp16_t m;
> > > > > +unsigned char qh[4];
> > > > > +unsigned char qs[32 / 2];
> > > > > +} block_q5_1;
> > > > > +
> > > > > +typedef struct {
> > > > > +float d;
> > > > > +float s;
> > > > > +char qs[32];
> > > > > +} block_q8_1;
> > > > > +
> > > > > +void ggml_vec_dot_q5_1_q8_1(const int n, float * restrict s, const 
> > > > > void * restrict vx, const void * restrict vy) {
> > > > > +const int qk = 32;
> > > > > +const int nb = n / qk;
> > > > > +
> > > >
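
For reference, a worked instance of the 1/3 reduction under
discussion (numbers invented; the real computation is
estimated_unrolled_size in tree-ssa-loop-ivcanon.cc):

  loop body of 30 insns, complete unroll factor 8
    raw estimate:       30 * 8 = 240 insns
    after the 2/3 cut:  240 * 2 / 3 = 160 insns

so an unrolled loop that does not actually simplify can still slip
under --param max-completely-peeled-insns, which is what the patch
stops for the innermost loop at -O3 cunrolli.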

Re: [PATCH] match: Disable `(type)zero_one_valuep*CST` for 1bit signed types [PR115154]

2024-05-21 Thread Richard Biener
On Tue, May 21, 2024 at 12:02 AM Andrew Pinski  wrote:
>
> The problem here is that the pattern added in r13-1162-g9991d84d2a8435
> assumes it is well defined to multiply a zero_one_valued_p value by the
> truncated, converted integer constant.  That is well defined for all
> types except signed 1-bit types, where `a * -1` is produced, which is
> undefined.  So disable this pattern for 1-bit signed types.
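A small worked example of the undefinedness (the value analysis here is
an editorial sketch, not taken from the mail):

/* In a 1-bit signed type the representable values are only -1 and 0.
   With t zero_one_valued_p (t in {0, 1}) and the constant 5:
     (signed:1) (t * 5)               -- original, well defined
     ((signed:1) t) * ((signed:1) 5)  -- after the pattern
   (signed:1) 5 truncates to -1, so the result is ((signed:1) t) * -1.
   For t == 1, (signed:1) t is -1, and -1 * -1 == 1 is not representable
   in the 1-bit type: signed overflow, i.e. undefined behavior.  */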
>
> Note the pattern added in r14-3432-gddd64a6ec3b38e is able to work
> around the undefinedness except when `-fsanitize=undefined` is turned
> on; this is why I added a testcase for that.
>
> OK for trunk and gcc-14 and gcc-13 branches? Bootstrapped and tested on 
> x86_64-linux-gnu with no regressions.

OK for trunk and branches.  Please wait until after 13.3.

Richard.

> PR tree-optimization/115154
>
> gcc/ChangeLog:
>
> * match.pd (convert (mult zero_one_valued_p@1 INTEGER_CST@2)): Disable
> for 1bit signed types.
>
> gcc/testsuite/ChangeLog:
>
> * c-c++-common/ubsan/signed1bitfield-1.c: New test.
> * gcc.c-torture/execute/signed1bitfield-1.c: New test.
>
> Signed-off-by: Andrew Pinski 
> ---
>  gcc/match.pd  |  6 +++--
>  .../c-c++-common/ubsan/signed1bitfield-1.c| 25 +++
>  .../gcc.c-torture/execute/signed1bitfield-1.c | 23 +
>  3 files changed, 52 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/c-c++-common/ubsan/signed1bitfield-1.c
>  create mode 100644 gcc/testsuite/gcc.c-torture/execute/signed1bitfield-1.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 0f9c34fa897..35e3d82b131 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -2395,12 +2395,14 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>(mult (convert @0) @1)))
>
>  /* Narrow integer multiplication by a zero_one_valued_p operand.
> -   Multiplication by [0,1] is guaranteed not to overflow.  */
> +   Multiplication by [0,1] is guaranteed not to overflow except for
> +   1bit signed types.  */
>  (simplify
>   (convert (mult@0 zero_one_valued_p@1 INTEGER_CST@2))
>   (if (INTEGRAL_TYPE_P (type)
>&& INTEGRAL_TYPE_P (TREE_TYPE (@0))
> -  && TYPE_PRECISION (type) < TYPE_PRECISION (TREE_TYPE (@0)))
> +  && TYPE_PRECISION (type) < TYPE_PRECISION (TREE_TYPE (@0))
> +  && (TYPE_UNSIGNED (type) || TYPE_PRECISION (type) > 1))
>(mult (convert @1) (convert @2
>
>  /* (X << C) != 0 can be simplified to X, when C is zero_one_valued_p.
> diff --git a/gcc/testsuite/c-c++-common/ubsan/signed1bitfield-1.c 
> b/gcc/testsuite/c-c++-common/ubsan/signed1bitfield-1.c
> new file mode 100644
> index 000..2ba8cf4dab0
> --- /dev/null
> +++ b/gcc/testsuite/c-c++-common/ubsan/signed1bitfield-1.c
> @@ -0,0 +1,25 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -fsanitize=undefined" } */
> +
> +/* PR tree-optimization/115154 */
> +/* This was being miscompiled with -fsanitize=undefined due to
> +   `(signed:1)(t*5)` being transformed into `-((signed:1)t)` which
> +   is undefined. */
> +
> +struct s {
> +  signed b : 1;
> +} f;
> +int i = 55;
> +__attribute__((noinline))
> +void check(int a)
> +{
> +if (!a)
> +__builtin_abort();
> +}
> +int main() {
> +int t = i != 5;
> +t = t*5;
> +f.b = t;
> +int tt = f.b;
> +check(f.b);
> +}
> diff --git a/gcc/testsuite/gcc.c-torture/execute/signed1bitfield-1.c 
> b/gcc/testsuite/gcc.c-torture/execute/signed1bitfield-1.c
> new file mode 100644
> index 000..ab888ca3a04
> --- /dev/null
> +++ b/gcc/testsuite/gcc.c-torture/execute/signed1bitfield-1.c
> @@ -0,0 +1,23 @@
> +/* PR tree-optimization/115154 */
> +/* This was being miscompiled to `(signed:1)(t*5)`
> +   being transformed into `-((signed:1)t)` which is undefined.
> +   Note there is a pattern which removes the negative in some cases
> +   which works around the issue.  */
> +
> +struct {
> +  signed b : 1;
> +} f;
> +int i = 55;
> +__attribute__((noinline))
> +void check(int a)
> +{
> +if (!a)
> +__builtin_abort();
> +}
> +int main() {
> +int t = i != 5;
> +t = t*5;
> +f.b = t;
> +int tt = f.b;
> +check(f.b);
> +}
> --
> 2.43.0
>


Re: [committed] PATCH for Re: Stepping down as maintainer for ARC and Epiphany

2024-05-21 Thread Richard Biener
On Mon, May 20, 2024 at 4:45 PM Gerald Pfeifer  wrote:
>
> On Wed, 5 Jul 2023, Joern Rennecke wrote:
> > I haven't worked with these targets in years and can't really do
> > sensible maintenance or reviews of patches for them. I am currently
> > working on optimizations for other ports like RISC-V.
>
> I noticed MAINTAINERS was not updated, so pushed the patch below.

That leaves the epiphany port unmaintained.  Should we automatically add such
ports to the list of obsoleted ports?

Richard.

> Gerald
>
>
> commit f94598ffaf5affbc9421ff230502357b07c55d9c
> Author: Gerald Pfeifer 
> Date:   Mon May 20 16:43:05 2024 +0200
>
> MAINTAINERS: Update Joern Rennecke's status
>
> This is per his mail to g...@gcc.gnu.org on 7 Jul 2023.
>
> ChangeLog:
> * MAINTAINERS: Move Joern Rennecke from arc and epiphany maintainer
> to Write After Approval.
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 8e0add6bef8..e2870eef2ef 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -56,7 +56,6 @@ aarch64 port  Kyrylo Tkachov  
> 
>  alpha port Richard Henderson   
>  amdgcn portJulian Brown
>  amdgcn portAndrew Stubbs   
> -arc port   Joern Rennecke  
>  arc port   Claudiu Zissulescu  
>  arm port   Nick Clifton
>  arm port   Richard Earnshaw
> @@ -68,7 +67,6 @@ c6x port  Bernd Schmidt   
> 
>  cris port  Hans-Peter Nilsson  
>  c-sky port Xianmiao Qu 
>  c-sky port Yunhai Shang
> -epiphany port  Joern Rennecke  
>  fr30 port  Nick Clifton
>  frv port   Nick Clifton
>  frv port   Alexandre Oliva 
> @@ -634,6 +632,7 @@ Joe Ramsay  
> 
>  Rolf Rasmussen 
>  Fritz Reese
>  Volker Reichelt
> 
> +Joern Rennecke 
>  Bernhard Reutner-Fischer   
>  Tom Rix
>  Thomas Rodgers 


Re: [PATCH] PHIOPT: Don't transform minmax if middle bb contains a phi [PR115143]

2024-05-21 Thread Richard Biener
On Mon, May 20, 2024 at 11:37 PM Andrew Pinski (QUIC)
 wrote:
>
> > -Original Message-
> > From: Richard Biener 
> > Sent: Sunday, May 19, 2024 11:55 AM
> > To: Andrew Pinski (QUIC) 
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH] PHIOPT: Don't transform minmax if
> > middle bb contains a phi [PR115143]
> >
> >
> >
> > > Am 19.05.2024 um 01:12 schrieb Andrew Pinski
> > :
> > >
> > > The problem here is even if last_and_only_stmt returns a statement,
> > > the bb might still contain a phi node which defines a ssa name which
> > > is used in that statement, so we need to add a check to make sure
> > > that the phi nodes are empty for the middle bbs in both the
> > > `CMP?MINMAX:MINMAX` and the `CMP?MINMAX:B` cases.
> >
> > Is that single-arg PHIs or do we have an extra edge into the middle
> > BB?  I think that might be unexpected, at least costing-wise.  Maybe
> > also to some of the replacement code we have?
>
> It is only a single-arg PHI since we already reject multiple edges into
> the middle BBs for these cases.
> It was EVRP that produced the single-arg PHI in the original testcase,
> from folding a conditional to false; EVRP does not do simple name
> propagation in this case, and there is no pass in between EVRP and
> phiopt that cleans up a single-arg PHI.
> I added the GIMPLE-based testcases basically to avoid depending on what
> previous passes might produce.
>
> >
> > > OK for trunk and backport to all open branches since
> > > r14-3827-g30e6ee074588ba was backported?
> > > Bootstrapped and tested on x86_64-linux-gnu with no regressions.
> > >
> >
> > Ok
>
> Does this include the GCC 13 branch or should I wait until after the GCC 
> 13.3.0 release?

Please wait until after the release.

Thanks,
Richard.

> Thanks,
> Andrew Pinski
>
> >
> > Richard
> >
> > >PR tree-optimization/115143
> > >
> > > gcc/ChangeLog:
> > >
> > >* tree-ssa-phiopt.cc (minmax_replacement): Check for empty
> > >phi nodes for middle bbs for the case where middle bb is not empty.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >* gcc.c-torture/compile/pr115143-1.c: New test.
> > >* gcc.c-torture/compile/pr115143-2.c: New test.
> > >* gcc.c-torture/compile/pr115143-3.c: New test.
> > >
> > > Signed-off-by: Andrew Pinski 
> > > ---
> > > .../gcc.c-torture/compile/pr115143-1.c| 21 +
> > > .../gcc.c-torture/compile/pr115143-2.c| 30 +++
> > > .../gcc.c-torture/compile/pr115143-3.c| 29 ++
> > > gcc/tree-ssa-phiopt.cc| 12 
> > > 4 files changed, 92 insertions(+)
> > > create mode 100644 gcc/testsuite/gcc.c-torture/compile/pr115143-1.c
> > > create mode 100644 gcc/testsuite/gcc.c-torture/compile/pr115143-2.c
> > > create mode 100644 gcc/testsuite/gcc.c-torture/compile/pr115143-3.c
> > >
> > > diff --git a/gcc/testsuite/gcc.c-torture/compile/pr115143-1.c
> > > b/gcc/testsuite/gcc.c-torture/compile/pr115143-1.c
> > > new file mode 100644
> > > index 000..5cb119ea432
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.c-torture/compile/pr115143-1.c
> > > @@ -0,0 +1,21 @@
> > > +/* PR tree-optimization/115143 */
> > > +/* This used to ICE.
> > > +   minmax part of phiopt would transform
> > > +   `a!=0?min(a, b) : 0` into `min(a,b)`
> > > +   which was correct except b was defined by a phi in the
> > inner
> > > +   bb which was not handled. */
> > > +short a, d;
> > > +char b;
> > > +long c;
> > > +unsigned long e, f;
> > > +void g(unsigned long h) {
> > > +  if (c ? e : b)
> > > +if (e)
> > > +  if (d) {
> > > +a = f ? ({
> > > +  unsigned long i = d ? f : 0, j = e ? h : 0;
> > > +  i < j ? i : j;
> > > +}) : 0;
> > > +  }
> > > +}
> > > +
> > > diff --git a/gcc/testsuite/gcc.c-torture/compile/pr115143-2.c
> > > b/gcc/testsuite/gcc.c-torture/compile/pr115143-2.c
> > > new file mode 100644
> > > index 000..05c3bbe9738
> > > --- /dev/null
> > > +++ b/gcc/

Re: [PATCH] PHIOPT: Don't transform minmax if middle bb contains a phi [PR115143]

2024-05-19 Thread Richard Biener



> Am 19.05.2024 um 01:12 schrieb Andrew Pinski :
> 
> The problem here is even if last_and_only_stmt returns a statement,
> the bb might still contain a phi node which defines a ssa name
> which is used in that statement so we need to add a check to make sure
> that the phi nodes are empty for the middle bbs in both the
> `CMP?MINMAX:MINMAX` case and the `CMP?MINMAX:B` cases.

Is that single-arg PHIs or do we have an extra edge into the middle BB?
I think that might be unexpected, at least costing-wise.  Maybe also to
some of the replacement code we have?

> OK for trunk and backport to all open branches since r14-3827-g30e6ee074588ba 
> was backported?
> Bootstrapped and tested on x86_64_linux-gnu with no regressions.
> 

Ok

Richard 

>PR tree-optimization/115143
> 
> gcc/ChangeLog:
> 
>* tree-ssa-phiopt.cc (minmax_replacement): Check for empty
>phi nodes for middle bbs for the case where middle bb is not empty.
> 
> gcc/testsuite/ChangeLog:
> 
>* gcc.c-torture/compile/pr115143-1.c: New test.
>* gcc.c-torture/compile/pr115143-2.c: New test.
>* gcc.c-torture/compile/pr115143-3.c: New test.
> 
> Signed-off-by: Andrew Pinski 
> ---
> .../gcc.c-torture/compile/pr115143-1.c| 21 +
> .../gcc.c-torture/compile/pr115143-2.c| 30 +++
> .../gcc.c-torture/compile/pr115143-3.c| 29 ++
> gcc/tree-ssa-phiopt.cc| 12 
> 4 files changed, 92 insertions(+)
> create mode 100644 gcc/testsuite/gcc.c-torture/compile/pr115143-1.c
> create mode 100644 gcc/testsuite/gcc.c-torture/compile/pr115143-2.c
> create mode 100644 gcc/testsuite/gcc.c-torture/compile/pr115143-3.c
> 
> diff --git a/gcc/testsuite/gcc.c-torture/compile/pr115143-1.c 
> b/gcc/testsuite/gcc.c-torture/compile/pr115143-1.c
> new file mode 100644
> index 000..5cb119ea432
> --- /dev/null
> +++ b/gcc/testsuite/gcc.c-torture/compile/pr115143-1.c
> @@ -0,0 +1,21 @@
> +/* PR tree-optimization/115143 */
> +/* This used to ICE.
> +   minmax part of phiopt would transform
> +   `a!=0?min(a, b) : 0` into `min(a,b)`
> +   which was correct except b was defined by a phi in the inner
> +   bb which was not handled. */
> +short a, d;
> +char b;
> +long c;
> +unsigned long e, f;
> +void g(unsigned long h) {
> +  if (c ? e : b)
> +if (e)
> +  if (d) {
> +a = f ? ({
> +  unsigned long i = d ? f : 0, j = e ? h : 0;
> +  i < j ? i : j;
> +}) : 0;
> +  }
> +}
> +
> diff --git a/gcc/testsuite/gcc.c-torture/compile/pr115143-2.c 
> b/gcc/testsuite/gcc.c-torture/compile/pr115143-2.c
> new file mode 100644
> index 000..05c3bbe9738
> --- /dev/null
> +++ b/gcc/testsuite/gcc.c-torture/compile/pr115143-2.c
> @@ -0,0 +1,30 @@
> +/* { dg-options "-fgimple" } */
> +/* PR tree-optimization/115143 */
> +/* This used to ICE.
> +   minmax part of phiopt would transform
> +   `a!=0?min(a, b) : 0` into `min(a,b)`
> +   which was correct except b was defined by a phi in the inner
> +   bb which was not handled. */
> +unsigned __GIMPLE (ssa,startwith("phiopt"))
> +foo (unsigned a, unsigned b)
> +{
> +  unsigned j;
> +  unsigned _23;
> +  unsigned _12;
> +
> +  __BB(2):
> +  if (a_6(D) != 0u)
> +goto __BB3;
> +  else
> +goto __BB4;
> +
> +  __BB(3):
> +  j_10 = __PHI (__BB2: b_11(D));
> +  _23 = __MIN (a_6(D), j_10);
> +  goto __BB4;
> +
> +  __BB(4):
> +  _12 = __PHI (__BB3: _23, __BB2: 0u);
> +  return _12;
> +
> +}
> diff --git a/gcc/testsuite/gcc.c-torture/compile/pr115143-3.c 
> b/gcc/testsuite/gcc.c-torture/compile/pr115143-3.c
> new file mode 100644
> index 000..53c5fb5588e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.c-torture/compile/pr115143-3.c
> @@ -0,0 +1,29 @@
> +/* { dg-options "-fgimple" } */
> +/* PR tree-optimization/115143 */
> +/* This used to ICE.
> +   minmax part of phiopt would transform
> +   `a!=0?min(a, b) : 0` into `min(a,b)`
> +   which was correct except b was defined by a phi in the inner
> +   bb which was not handled. */
> +unsigned __GIMPLE (ssa,startwith("phiopt"))
> +foo (unsigned a, unsigned b)
> +{
> +  unsigned j;
> +  unsigned _23;
> +  unsigned _12;
> +
> +  __BB(2):
> +  if (a_6(D) > 0u)
> +goto __BB3;
> +  else
> +goto __BB4;
> +
> +  __BB(3):
> +  j_10 = __PHI (__BB2: b_7(D));
> +  _23 = __MIN (a_6(D), j_10);
> +  goto __BB4;
> +
> +  __BB(4):
> +  _12 = __PHI (__BB3: _23, __BB2: 0u);
> +  return _12;
> +}
> diff --git a/gcc/tree-ssa-phiopt.cc b/gcc/tree-ssa-phiopt.cc
> index f166c3132cb..918cf50b589 100644
> --- a/gcc/tree-ssa-phiopt.cc
> +++ b/gcc/tree-ssa-phiopt.cc
> @@ -1925,6 +1925,10 @@ minmax_replacement (basic_block cond_bb, basic_block 
> middle_bb, basic_block alt_
>  || gimple_code (assign) != GIMPLE_ASSIGN)
>return false;
> 
> +  /* There cannot be any phi nodes in the middle bb. */
> +  if (!gimple_seq_empty_p (phi_nodes (middle_bb)))
> +return false;
> +
>   lhs = 

Re: [RFC] Merge strategy for all-SLP vectorizer

2024-05-17 Thread Richard Biener via Gcc
On Fri, 17 May 2024, Richard Sandiford wrote:

> Richard Biener via Gcc  writes:
> > Hi,
> >
> > I'd like to discuss how to go forward with getting the vectorizer to
> > all-SLP for this stage1.  While there is a personal branch with my
> > ongoing work (users/rguenth/vect-force-slp) branches haven't proved
> > themselves working well for collaboration.
> 
> Speaking for myself, the problem hasn't been so much the branch as
> lack of time.  I've been pretty swamped the last eight months or so
> (except for the time that I took off, which admittedly was quite a
> bit!), and so I never even got around to properly reading and replying
> to your message after the Cauldron.  It's been on the "this is important,
> I should make time to read and understand it properly" list all this time.
> Sorry about that. :(
> 
> I'm hoping to have time to work/help out on SLP stuff soon.
> 
> > The branch isn't ready to be merged in full but I have been picking
> > improvements to trunk last stage1 and some remaining bits in the past
> > weeks.  I have refrained from merging code paths that cannot be
> > exercised on trunk.
> >
> > There are two important set of changes on the branch, both critical
> > to get more testing on non-x86 targets.
> >
> >  1. enable single-lane SLP discovery
> >  2. avoid splitting store groups (9315bfc661432c3 and 4336060fe2db8ec
> > if you fetch the branch)
> >
> > The first point is also most annoying on the testsuite since doing
> > SLP instead of interleaving changes what we dump and thus tests
> > start to fail in random ways when you switch between both modes.
> > On the branch single-lane SLP discovery is gated with
> > --param vect-single-lane-slp.
> >
> > The branch has numerous changes to enable single-lane SLP for some
> > code paths that have SLP not implemented and where I did not bother
> > to try supporting multi-lane SLP at this point.  It also adds more
> > SLP discovery entry points.
> >
> > I'm not sure how to try merging these pieces to allow others to
> > more easily help out.  One possibility is to merge
> > --param vect-single-lane-slp defaulted off and pick dependent
> > changes even when they cause testsuite regressions with
> > vect-single-lane-slp=1.  Alternatively adjust the testsuite by
> > adding --param vect-single-lane-slp=0 and default to 1
> > (or keep the default).
> 
> FWIW, this one sounds good to me (the default to 1 version).
> I.e. mechanically add --param vect-single-lane-slp=0 to any tests
> that fail with the new default.  That means that the test that need
> fixing are easily greppable for anyone who wants to help.  Sometimes
> it'll just be a test update.  Sometimes it will be new vectoriser code.

OK.  Meanwhile I figured the most important part is 2. from above
since that enables single-lane in a grouped access (also covering
single-element interleaving).  This will cover all problematic cases
with respect to vectorizing loads and stores.  It also has less
testsuite fallout, mainly because we have a lot less coverage for
grouped stores without SLP.

So I'll try to produce a mergeable patch for part 2 and post it
for review next week.

Thanks,
Richard.

> Thanks,
> Richard
> 
> > Or require a clean testsuite with
> > --param vect-single-lane-slp defaulted to 1 but keep the --param
> > for debugging (and allow FAILs with 0).
> >
> > For fun I merged just single-lane discovery of non-grouped stores
> > and have that enabled by default.  On x86_64 this results in the
> > set of FAILs below.
> >
> > Any suggestions?
> >
> > Thanks,
> > Richard.
> >
> > FAIL: gcc.dg/vect/O3-pr39675-2.c scan-tree-dump-times vect "vectorizing 
> > stmts using SLP" 1
> > XPASS: gcc.dg/vect/no-scevccp-outer-12.c scan-tree-dump-times vect "OUTER 
> > LOOP VECTORIZED." 1
> > FAIL: gcc.dg/vect/no-section-anchors-vect-31.c scan-tree-dump-times vect 
> > "Alignment of access forced using peeling" 2
> > FAIL: gcc.dg/vect/no-section-anchors-vect-31.c scan-tree-dump-times vect 
> > "Vectorizing an unaligned access" 0
> > FAIL: gcc.dg/vect/no-section-anchors-vect-64.c scan-tree-dump-times vect 
> > "Alignment of access forced using peeling" 2
> > FAIL: gcc.dg/vect/no-section-anchors-vect-64.c scan-tree-dump-times vect 
> > "Vectorizing an unaligned access" 0
> > FAIL: gcc.dg/vect/no-section-anchors-vect-66.c scan-tree-dump-times vect 
> > "Alignment of access forced using peeling" 1
> > FAIL: gcc.dg/vect/no-section-anchors-vect-66.c scan-tree-dump-times 

gcc-wwwdocs branch master updated. 48be8366fcbf9246b2b5b1625febb5e9202842b8

2024-05-17 Thread Richard Biener via Gcc-cvs-wwwdocs
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "gcc-wwwdocs".

The branch, master has been updated
   via  48be8366fcbf9246b2b5b1625febb5e9202842b8 (commit)
  from  ed9ceba9b8b038f0e0f333798da7abe046679d0c (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -
commit 48be8366fcbf9246b2b5b1625febb5e9202842b8
Author: Richard Biener 
Date:   Fri May 17 13:48:56 2024 +0200

Fix trunk status

diff --git a/htdocs/index.html b/htdocs/index.html
index 63fbcdcd..c52bb30b 100644
--- a/htdocs/index.html
+++ b/htdocs/index.html
@@ -222,7 +222,7 @@ More news? Let ger...@pfeifer.com know!
   
   https://gcc.gnu.org/pipermail/gcc/2024-April/243824.html>2024-04-26
   
-  (regression fixes & docs only).
+  (general development).
   
   https://gcc.gnu.org/bugzilla/buglist.cgi?query_format=advanced[long serious-regressions query elided]>Serious

---

Summary of changes:
 htdocs/index.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


hooks/post-receive
-- 
gcc-wwwdocs


Re: [PATCH] MATCH: Maybe expand (T)(A + C1) * C2 and (T)(A + C1) * C2 + C3 [PR109393]

2024-05-17 Thread Richard Biener
On Fri, 17 May 2024, Manolis Tsamis wrote:

> On Fri, May 17, 2024 at 12:22 PM Richard Biener  wrote:
> >
> > On Fri, 17 May 2024, Manolis Tsamis wrote:
> >
> > > Hi Richard,
> > >
> > > While I was re-testing the latest version of this patch I noticed that
> > > it FAILs an AArch64 test, gcc.target/aarch64/subsp.c. With the patch
> > > we generate one instruction more:
> > >
> > > sbfiz   x1, x1, 4, 32
> > > stp x29, x30, [sp, -16]!
> > > add x1, x1, 16
> > > mov x29, sp
> > > sub sp, sp, x1
> > > mov x0, sp
> > > bl  foo
> > >
> > > Instead of:
> > >
> > > stp x29, x30, [sp, -16]!
> > > add w1, w1, 1
> > > mov x29, sp
> > > sub sp, sp, w1, sxtw 4
> > > mov x0, sp
> > > bl  foo
> > >
> > > I've looked at it but can't really find a way to solve the regression.
> > > Any thoughts on this?
> >
> > Can you explain what goes wrong?  As I said rewriting parts of
> > address calculation is tricky, there's always the chance that some
> > cases regress (see your observation in comment#4 of the PR).
> >
> 
> In this case the int -> sizetype cast ends up happening earlier. Instead of
> 
>   _7 = y_6(D) + 1;
>   _1 = (sizetype) _7;
>   _2 = _1 * 16;
> 
> We get
> 
>   _13 = (sizetype) y_6(D);
>   _15 = _13 + 1;
>   _2 = _15 * 16;
> 
> and then in RTL we have
> 
> x1 = ((sizetype) x1) << 4
> sp = sp - (x1 + 16)
> 
> instead of
> 
> x1 = x1 + 1
> sp = sp - ((sizetype) x1) << 4
> 
> which doesn't form sub sp, sp, w1, sxtw 4.
> 
> But more importantly, I realized that (in this case among others) the
> pattern is undone by (A * C) +- (B * C) -> (A+-B) * C and (A * C) +- A
> -> A * (C+-1). AFAIK having one pattern and its reverse is a bad thing
> so something needs to be changed.

Yes, we have that issue.  And we've guarded GIMPLE vs. non-GIMPLE and
have recursion limits in match to deal with this.  But yes, having
both is bad.  I'd say that clearly patterns reducing the number of
operations are good at least for canonicalization.

> One idea could be to only keep the larger one, ((T)(A + CST1)) * CST2 +
> CST3 -> ((T)(A) * CST2) + ((T)CST1 * CST2 + CST3).  It's not enough to
> deal with the testcases of the ticket but it does help in other cases.

The issue with such larger patterns is that they hint at the fact
the transform should happen with an eye on more than just the
small expression.  Thus not in match.pd but in a pass like reassoc
or SLSR or IVOPTs or even CSE itself.  We also have to avoid
doing changes that cannot be undone when canonicalizing.

Richard.

> Manolis
> 
> > Note that I still believe that avoiding the early and premature
> > promotion of the addition to unsigned is a good thing.
> >
> > Note the testcase in the PR is fixed with -fwrapv because then
> > we do _not_ perform this premature optimization.  Without -fwrapv
> > the optimization is valid but as you note we do not perform it
> > consistently - otherwise we wouldn't regress.
> >
> > Richard.
> >
> >
> >
> > > Thanks,
> > > Manolis
> > >
> > >
> > >
> > > On Thu, May 16, 2024 at 11:15 AM Richard Biener
> > >  wrote:
> > > >
> > > > On Tue, May 14, 2024 at 10:58 AM Manolis Tsamis 
> > > >  wrote:
> > > > >
> > > > > New patch with the requested changes can be found below.
> > > > >
> > > > > I don't know how much this affects SCEV, but I do believe that we
> > > > > should incorporate this change somehow. I've seen various cases of
> > > > > suboptimal address calculation codegen that boil down to this.
> > > >
> > > > This misses the ChangeLog (I assume it's unchanged) and indent
> > > > of the match.pd part is now off.
> > > >
> > > > Please fix that, the patch is OK with that change.
> > > >
> > > > Thanks,
> > > > Richard.
> > > >
> > > > > gcc/match.pd | 31 +++
> > > > > gcc/testsuite/gcc.dg/pr109393.c | 16 
> > > > > 2 files changed, 47 insertions(+)
> > > > > create mode 100644 gcc/testsuite/gcc.dg/pr109393.c
> > > > >
> > > > > diff --git a/gcc/match.pd b/gcc/match.pd

[gcc r12-10456] tree-optimization/112281 - loop distribution and zero dependence distances

2024-05-17 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:5db4b5449df8f59a61438f8db1836dbc9b53f02e

commit r12-10456-g5db4b5449df8f59a61438f8db1836dbc9b53f02e
Author: Richard Biener 
Date:   Mon Nov 20 13:39:52 2023 +0100

tree-optimization/112281 - loop distribution and zero dependence distances

The following fixes an omission in dependence testing for loop
distribution.  When the overall dependence distance is not zero but
the dependence direction in the innermost common loop is '=' there is
a conflict between the partitions and we have to merge them.
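A sketch of the conflicting shape (modeled on the pr112281-1.c testcase
below; the dependence annotations are editorial):

struct S { int a; } d[4];

void f (void)
{
  for (int c = 2; c; c--)        /* outer distance component: 1       */
    for (int e = 0; e < 2; e++)  /* innermost component: 0, i.e. '='  */
      {
        d[c] = d[c + 1];         /* candidate partition 1 */
        d[c + 1].a = 0;          /* candidate partition 2 */
      }
}

/* The write of d[c] in one c-iteration is read as d[c + 1] in the next
   c-iteration (c counts down), so the overall distance is not zero,
   but the e-indices of the conflicting accesses can be equal: the
   innermost direction is '=', hence the partitions must be merged.  */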

PR tree-optimization/112281
* tree-loop-distribution.cc
(loop_distribution::pg_add_dependence_edges): For = in the
innermost common loop record a partition conflict.

* gcc.dg/torture/pr112281-1.c: New testcase.
* gcc.dg/torture/pr112281-2.c: Likewise.

(cherry picked from commit 3b34902417259031823bff7f853f615a60464bbd)

Diff:
---
 gcc/testsuite/gcc.dg/torture/pr112281-1.c | 18 ++
 gcc/testsuite/gcc.dg/torture/pr112281-2.c | 18 ++
 gcc/tree-loop-distribution.cc | 18 ++
 3 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/torture/pr112281-1.c 
b/gcc/testsuite/gcc.dg/torture/pr112281-1.c
new file mode 100644
index ..711f5663195c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr112281-1.c
@@ -0,0 +1,18 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-loop-distribution" } */
+
+struct {
+  int : 8;
+  int a;
+} b, d[4] = {{0}, {0}, {0}, {5}};
+int c, e;
+int main() {
+  for (c = 2; c; c--)
+for (e = 0; e < 2; e++) {
+  d[c] = b = d[c + 1];
+  d[c + 1].a = 0;
+}
+  if (b.a != 0)
+__builtin_abort();
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/torture/pr112281-2.c 
b/gcc/testsuite/gcc.dg/torture/pr112281-2.c
new file mode 100644
index ..d7671e3322b4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr112281-2.c
@@ -0,0 +1,18 @@
+/* { dg-do run } */
+/* { dg-additional-options "-ftree-loop-distribution" } */
+
+struct {
+  int : 8;
+  int a;
+} b, d[4] = {{5}, {0}, {0}, {0}};
+int c, e;
+int main() {
+  for (c = 0; c < 2; c++)
+for (e = 0; e < 2; e++) {
+  d[c + 1] = b = d[c];
+  d[c].a = 0;
+}
+  if (b.a != 0)
+__builtin_abort();
+  return 0;
+}
diff --git a/gcc/tree-loop-distribution.cc b/gcc/tree-loop-distribution.cc
index 606eb05e64a5..1b7d2a1ea7d2 100644
--- a/gcc/tree-loop-distribution.cc
+++ b/gcc/tree-loop-distribution.cc
@@ -2117,9 +2117,6 @@ loop_distribution::pg_add_dependence_edges (struct graph 
*rdg, int dir,
}
  else if (DDR_ARE_DEPENDENT (ddr) == NULL_TREE)
{
- if (DDR_REVERSED_P (ddr))
-   this_dir = -this_dir;
-
  /* Known dependences can still be unordered througout the
 iteration space, see gcc.dg/tree-ssa/ldist-16.c and
 gcc.dg/tree-ssa/pr94969.c.  */
@@ -2132,7 +2129,20 @@ loop_distribution::pg_add_dependence_edges (struct graph 
*rdg, int dir,
  /* Else as the distance vector is lexicographic positive swap
 the dependence direction.  */
  else
-   this_dir = -this_dir;
+   {
+ if (DDR_REVERSED_P (ddr))
+   this_dir = -this_dir;
+ this_dir = -this_dir;
+
+ /* When the dependence distance of the innermost common
+loop of the DRs is zero we have a conflict.  */
+ auto l1 = gimple_bb (DR_STMT (dr1))->loop_father;
+ auto l2 = gimple_bb (DR_STMT (dr2))->loop_father;
+ int idx = index_in_loop_nest (find_common_loop (l1, l2)->num,
+   DDR_LOOP_NEST (ddr));
+ if (DDR_DIST_VECT (ddr, 0)[idx] == 0)
+   this_dir = 2;
+   }
}
  else
this_dir = 0;


[gcc r12-10458] middle-end/110176 - wrong zext (bool) <= (int) 4294967295u folding

2024-05-17 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:65e5547e5468ce404d0f9ebd646a1d63abf3a772

commit r12-10458-g65e5547e5468ce404d0f9ebd646a1d63abf3a772
Author: Richard Biener 
Date:   Wed Jan 31 14:40:24 2024 +0100

middle-end/110176 - wrong zext (bool) <= (int) 4294967295u folding

The following fixes a wrong pattern that didn't match the behavior
of the original fold_widened_comparison in that get_unwidened
returned a constant always in the wider type.  But here we're
using (int) 4294967295u without the conversion applied.  Fixed
by doing as earlier in the pattern - matching constants only
if the conversion was actually applied.
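A minimal sketch of the invariant the fixed fold must preserve (it
mirrors f() from pr110176.c below; the assertion framing is editorial):

#include <assert.h>

int main (void)
{
  for (int t = 0; t <= 1; t++)   /* zero-extended bool: tt is 0 or 1 */
    {
      int tt = t;
      unsigned x = -1u;
      int xx = x;                /* (int) 4294967295u == -1 */
      assert ((xx <= tt) == 1);  /* -1 <= {0,1} must always hold */
    }
  return 0;
}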

PR middle-end/110176
* match.pd (zext (bool) <= (int) 4294967295u): Make sure
to match INTEGER_CST only without outstanding conversion.

* gcc.dg/torture/pr110176.c: New testcase.

(cherry picked from commit 22dbfbe8767ff4c1d93e39f68ec7c2d5b1358beb)

Diff:
---
 gcc/match.pd| 12 -
 gcc/testsuite/gcc.dg/torture/pr110176.c | 46 +
 2 files changed, 52 insertions(+), 6 deletions(-)

diff --git a/gcc/match.pd b/gcc/match.pd
index 0938d56fa45f..45ed34205106 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -5379,19 +5379,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   >= TYPE_PRECISION (TREE_TYPE (@10)))
  && (TYPE_UNSIGNED (TREE_TYPE (@00))
  == TYPE_UNSIGNED (TREE_TYPE (@10
- || (TREE_CODE (@10) == INTEGER_CST
+ || (TREE_CODE (@1) == INTEGER_CST
  && INTEGRAL_TYPE_P (TREE_TYPE (@00))
- && int_fits_type_p (@10, TREE_TYPE (@00)
+ && int_fits_type_p (@1, TREE_TYPE (@00)
   (cmp @00 (convert @10))
-  (if (TREE_CODE (@10) == INTEGER_CST
+  (if (TREE_CODE (@1) == INTEGER_CST
   && INTEGRAL_TYPE_P (TREE_TYPE (@00))
-  && !int_fits_type_p (@10, TREE_TYPE (@00)))
+  && !int_fits_type_p (@1, TREE_TYPE (@00)))
(with
{
  tree min = lower_bound_in_type (TREE_TYPE (@10), TREE_TYPE (@00));
  tree max = upper_bound_in_type (TREE_TYPE (@10), TREE_TYPE (@00));
- bool above = integer_nonzerop (const_binop (LT_EXPR, type, max, @10));
- bool below = integer_nonzerop (const_binop (LT_EXPR, type, @10, min));
+ bool above = integer_nonzerop (const_binop (LT_EXPR, type, max, @1));
+ bool below = integer_nonzerop (const_binop (LT_EXPR, type, @1, min));
}
(if (above || below)
 (if (cmp == EQ_EXPR || cmp == NE_EXPR)
diff --git a/gcc/testsuite/gcc.dg/torture/pr110176.c 
b/gcc/testsuite/gcc.dg/torture/pr110176.c
new file mode 100644
index ..e41e3a0c3a7e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr110176.c
@@ -0,0 +1,46 @@
+/* { dg-do run } */
+
+int f(_Bool t)
+{
+int tt = t;
+unsigned x = -1;
+int xx = x;
+return xx <= tt;
+}
+
+int a, b;
+void c() {}
+__attribute__((noipa))
+void h() {__builtin_abort();}
+int d() {
+  unsigned f[1];
+  int i;
+  if (a)
+goto h;
+  f[0] = -1;
+  while (1) {
+c();
+for (; a < 1; a++) {
+  if (0) {
+  j:
+continue;
+  }
+  i = f[0];
+  if (a)
+break;
+  b = i >= (b == 0);
+}
+if (!b) {
+  if (0) {
+  h:
+goto j;
+  }
+  return 0;
+}
+h();
+  }
+}
+int main() {
+  d();
+  return 0;
+}


[gcc r12-10454] tree-optimization/112505 - bit-precision induction vectorization

2024-05-17 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:4a71557fbebe3fb4031d1c2adc4f89c89a8c6c62

commit r12-10454-g4a71557fbebe3fb4031d1c2adc4f89c89a8c6c62
Author: Richard Biener 
Date:   Thu Jan 11 14:00:33 2024 +0100

tree-optimization/112505 - bit-precision induction vectorization

Vectorization of bit-precision inductions isn't implemented but we
don't check this, instead we ICE during transform.

PR tree-optimization/112505
* tree-vect-loop.cc (vectorizable_induction): Reject
bit-precision induction.

* gcc.dg/vect/pr112505.c: New testcase.

(cherry picked from commit ec345df53556ec581590347f71c3d9ff3cdbca76)

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr112505.c | 14 ++
 gcc/tree-vect-loop.cc|  9 +
 2 files changed, 23 insertions(+)

diff --git a/gcc/testsuite/gcc.dg/vect/pr112505.c 
b/gcc/testsuite/gcc.dg/vect/pr112505.c
new file mode 100644
index ..56546c1095aa
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr112505.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3" } */
+
+short int w9;
+struct T {
+  short a : 14;
+  int b;
+};
+struct T v;
+void zc()
+{
+  for(int i = 0; i < 4; i ++)
+w9 *= v.b ? v.a-- < 0 : 0;
+}
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 86ee9e449e19..fd0e5a70a962 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8211,6 +8211,15 @@ vectorizable_induction (loop_vec_info loop_vinfo,
 
   step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (stmt_info);
   gcc_assert (step_expr != NULL_TREE);
+  if (INTEGRAL_TYPE_P (TREE_TYPE (step_expr))
+  && !type_has_mode_precision_p (TREE_TYPE (step_expr)))
+{
+  if (dump_enabled_p ())
+   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+"bit-precision induction vectorization not "
+"supported.\n");
+  return false;
+}
   tree step_vectype = get_same_sized_vectype (TREE_TYPE (step_expr), vectype);
 
   /* Check for backend support of PLUS/MINUS_EXPR. */


[gcc r12-10457] tree-optimization/111039 - abnormals and bit test merging

2024-05-17 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:47e6bff94d980e2fcb6bcb42df04d3b73bd67da7

commit r12-10457-g47e6bff94d980e2fcb6bcb42df04d3b73bd67da7
Author: Richard Biener 
Date:   Thu Aug 17 13:10:14 2023 +0200

tree-optimization/111039 - abnormals and bit test merging

The following guards the bit test merging code in if-combine against
the appearance of SSA names used in abnormal PHIs.

PR tree-optimization/111039
* tree-ssa-ifcombine.cc (ifcombine_ifandif): Check for
SSA_NAME_OCCURS_IN_ABNORMAL_PHI.

* gcc.dg/pr111039.c: New testcase.

(cherry picked from commit 482551a79a3d3f107f6239679ee74655cfe8707e)

Diff:
---
 gcc/testsuite/gcc.dg/pr111039.c | 15 +++
 gcc/tree-ssa-ifcombine.cc   |  7 +++
 2 files changed, 22 insertions(+)

diff --git a/gcc/testsuite/gcc.dg/pr111039.c b/gcc/testsuite/gcc.dg/pr111039.c
new file mode 100644
index ..bec9983b35f8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr111039.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O" } */
+
+int _setjmp ();
+void abcd ();
+void abcde ();
+void compiler_corruption_function(int flags)
+{
+  int nowait = flags & 1048576, isexpand = flags & 8388608;
+  abcd();
+  _setjmp(flags);
+  if (nowait && isexpand)
+flags &= 0;
+  abcde();
+}
diff --git a/gcc/tree-ssa-ifcombine.cc b/gcc/tree-ssa-ifcombine.cc
index ce9bbebf9480..b139328af224 100644
--- a/gcc/tree-ssa-ifcombine.cc
+++ b/gcc/tree-ssa-ifcombine.cc
@@ -415,6 +415,9 @@ ifcombine_ifandif (basic_block inner_cond_bb, bool 
inner_inv,
 {
   tree t, t2;
 
+  if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (name1))
+   return false;
+
   /* Do it.  */
   gsi = gsi_for_stmt (inner_cond);
   t = fold_build2 (LSHIFT_EXPR, TREE_TYPE (name1),
@@ -465,6 +468,10 @@ ifcombine_ifandif (basic_block inner_cond_bb, bool 
inner_inv,
   gimple_stmt_iterator gsi;
   tree t;
 
+  if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (name1)
+ || SSA_NAME_OCCURS_IN_ABNORMAL_PHI (name2))
+   return false;
+
   /* Find the common name which is bit-tested.  */
   if (name1 == name2)
;


[gcc r12-10455] tree-optimization/112495 - alias versioning and address spaces

2024-05-17 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:dbb5273996259b04350a1e3d35e633c51fc9310f

commit r12-10455-gdbb5273996259b04350a1e3d35e633c51fc9310f
Author: Richard Biener 
Date:   Mon Nov 13 10:20:37 2023 +0100

tree-optimization/112495 - alias versioning and address spaces

We are not correctly handling differing address spaces in dependence
analysis runtime alias check generation so refuse to do that.

PR tree-optimization/112495
* tree-data-ref.cc (runtime_alias_check_p): Reject checks
between different address spaces.

* gcc.target/i386/pr112495.c: New testcase.

(cherry picked from commit 0f593c0521caab8cfac53514b1a5e7d0d0dd1932)

Diff:
---
 gcc/testsuite/gcc.target/i386/pr112495.c | 12 
 gcc/tree-data-ref.cc |  7 +++
 2 files changed, 19 insertions(+)

diff --git a/gcc/testsuite/gcc.target/i386/pr112495.c 
b/gcc/testsuite/gcc.target/i386/pr112495.c
new file mode 100644
index ..21afbaa6945d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr112495.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+typedef struct { int v; } T1;
+typedef struct { T1 v[32]; } T2;
+
+T1 s;
+T1 f1() { return s; }
+
+void f2(__seg_gs T2 *p, int n) {
+  for (int i = 0; i < n; ++i) p->v[i] = f1();
+}
diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc
index 397792c3584c..0df4a3525f4c 100644
--- a/gcc/tree-data-ref.cc
+++ b/gcc/tree-data-ref.cc
@@ -1632,6 +1632,13 @@ runtime_alias_check_p (ddr_p ddr, class loop *loop, bool 
speed_p)
   "runtime alias check not supported for"
   " outer loop.\n");
 
+  /* FORNOW: We don't support handling different address spaces.  */
+  if (TYPE_ADDR_SPACE (TREE_TYPE (TREE_TYPE (DR_BASE_ADDRESS (DDR_A (ddr)
+  != TYPE_ADDR_SPACE (TREE_TYPE (TREE_TYPE (DR_BASE_ADDRESS (DDR_B 
(ddr))
+return opt_result::failure_at (DR_STMT (DDR_A (ddr)),
+  "runtime alias check between different "
+  "address spaces not supported.\n");
+
   return opt_result::success ();
 }


[gcc r12-10453] debug/112718 - reset all type units with -ffat-lto-objects

2024-05-17 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:1f41e8eef3da1d76c18fe1a93846054c53dc5a47

commit r12-10453-g1f41e8eef3da1d76c18fe1a93846054c53dc5a47
Author: Richard Biener 
Date:   Mon Jan 22 15:42:59 2024 +0100

debug/112718 - reset all type units with -ffat-lto-objects

When mixing -flto, -ffat-lto-objects and -fdebug-types-section we
fail to reset all type units after early output, resulting in an
ICE when attempting to add then duplicate sibling attributes.

PR debug/112718
* dwarf2out.cc (dwarf2out_finish): Reset all type units
for the fat part of an LTO compile.

* gcc.dg/debug/pr112718.c: New testcase.

(cherry picked from commit 7218f5050cb7163edae331f54ca163248ab48bfa)

Diff:
---
 gcc/dwarf2out.cc  | 12 
 gcc/testsuite/gcc.dg/debug/pr112718.c | 12 
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/gcc/dwarf2out.cc b/gcc/dwarf2out.cc
index d14ec0261b6b..cfe87cba4c4c 100644
--- a/gcc/dwarf2out.cc
+++ b/gcc/dwarf2out.cc
@@ -32162,24 +32162,12 @@ dwarf2out_finish (const char *filename)
   reset_dies (comp_unit_die ());
   for (limbo_die_node *node = cu_die_list; node; node = node->next)
reset_dies (node->die);
-
-  hash_table comdat_type_table (100);
   for (ctnode = comdat_type_list; ctnode != NULL; ctnode = ctnode->next)
{
- comdat_type_node **slot
- = comdat_type_table.find_slot (ctnode, INSERT);
-
- /* Don't reset types twice.  */
- if (*slot != HTAB_EMPTY_ENTRY)
-   continue;
-
  /* Remove the pointer to the line table.  */
  remove_AT (ctnode->root_die, DW_AT_stmt_list);
-
  if (debug_info_level >= DINFO_LEVEL_TERSE)
reset_dies (ctnode->root_die);
-
- *slot = ctnode;
}
 
   /* Reset die CU symbol so we don't output it twice.  */
diff --git a/gcc/testsuite/gcc.dg/debug/pr112718.c 
b/gcc/testsuite/gcc.dg/debug/pr112718.c
new file mode 100644
index ..ff80ca5a2981
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/debug/pr112718.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target lto } */
+/* { dg-options "-g -fdebug-types-section -flto -ffat-lto-objects" } */
+
+struct {
+  int h;
+  unsigned char data[20 + 24 * 6];
+} _EC_X9_62_PRIME_192V2;
+struct {
+  int h;
+  unsigned char data[20 + 24 * 6];
+} _EC_X9_62_PRIME_192V3;


[gcc r12-10452] tree-optimization/112793 - SLP of constant/external code-generated twice

2024-05-17 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:9bad5cf9ae446b367f666176537eb76e94cc4448

commit r12-10452-g9bad5cf9ae446b367f666176537eb76e94cc4448
Author: Richard Biener 
Date:   Wed Dec 13 14:23:31 2023 +0100

tree-optimization/112793 - SLP of constant/external code-generated twice

The following makes the attempt at code-generating a constant/external
SLP node twice well-formed, as that can happen when partitioning BB
vectorization attempts, where we keep constants/externals unpartitioned.

PR tree-optimization/112793
* tree-vect-slp.cc (vect_schedule_slp_node): Already
code-generated constant/external nodes are OK.

* g++.dg/vect/pr112793.cc: New testcase.

(cherry picked from commit d782ec8362eadc3169286eb1e39c631effd02323)

Diff:
---
 gcc/testsuite/g++.dg/vect/pr112793.cc | 32 
 gcc/tree-vect-slp.cc  | 16 +---
 2 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/gcc/testsuite/g++.dg/vect/pr112793.cc 
b/gcc/testsuite/g++.dg/vect/pr112793.cc
new file mode 100644
index ..258d7c1b1119
--- /dev/null
+++ b/gcc/testsuite/g++.dg/vect/pr112793.cc
@@ -0,0 +1,32 @@
+// { dg-do compile }
+// { dg-require-effective-target c++11 }
+// { dg-additional-options "-march=znver2" { target x86_64-*-* i?86-*-* } }
+
+typedef double T;
+T c, s;
+T a[16];
+struct Matrix4 {
+  Matrix4(){}
+  Matrix4(T e, T f, T i, T j) {
+r[1] = r[4] = e;
+r[5] = f;
+r[8] = i;
+r[9] = j;
+  }
+  Matrix4 operator*(Matrix4 a) {
+return Matrix4(
+   r[0] * a.r[4] + r[4] + r[15] + r[6],
+   r[1] * a.r[4] + 1 + 2 + 3,  r[0] * r[8] + 1 + 2 + 3,
+   r[1] * r[8] + r[1] + r[14] + r[2] * r[3]);
+  }
+  T r[16] = {};
+};
+Matrix4 t1, t2;
+Matrix4 tt;
+Matrix4 getRotAltAzToEquatorial()
+{
+  t2.r[4] =  0;
+  t1.r[1] =  -s;
+  t1.r[8] = 0;
+  return t1 * t2;
+}
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 26c989cbff9a..54e6a9e4224f 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -7240,12 +7240,6 @@ vect_schedule_slp_node (vec_info *vinfo,
   int i;
   slp_tree child;
 
-  /* For existing vectors there's nothing to do.  */
-  if (SLP_TREE_VEC_DEFS (node).exists ())
-return;
-
-  gcc_assert (SLP_TREE_VEC_STMTS (node).is_empty ());
-
   /* Vectorize externals and constants.  */
   if (SLP_TREE_DEF_TYPE (node) == vect_constant_def
   || SLP_TREE_DEF_TYPE (node) == vect_external_def)
@@ -7256,10 +7250,18 @@ vect_schedule_slp_node (vec_info *vinfo,
   if (!SLP_TREE_VECTYPE (node))
return;
 
-  vect_create_constant_vectors (vinfo, node);
+  /* There are two reasons vector defs might already exist.  The first
+is that we are vectorizing an existing vector def.  The second is
+when performing BB vectorization shared constant/external nodes
+are not split apart during partitioning so during the code-gen
+DFS walk we can end up visiting them twice.  */
+  if (! SLP_TREE_VEC_DEFS (node).exists ())
+   vect_create_constant_vectors (vinfo, node);
   return;
 }
 
+  gcc_assert (SLP_TREE_VEC_DEFS (node).is_empty ());
+
   stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (node);
 
   gcc_assert (SLP_TREE_NUMBER_OF_VEC_STMTS (node) != 0);


[gcc r14-10214] tree-optimization/114998 - use-after-free with loop distribution

2024-05-17 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:1e9ae50d4d160f6d557fc4cbbe95c4a36897c09f

commit r14-10214-g1e9ae50d4d160f6d557fc4cbbe95c4a36897c09f
Author: Richard Biener 
Date:   Fri May 10 14:19:49 2024 +0200

tree-optimization/114998 - use-after-free with loop distribution

When loop distribution releases a PHI node of the original IL it
can end up clobbering memory that is re-used, because upon releasing
its RDG it resets all stmt UIDs back to -1, even for stmts that were
already released.

The fix is to avoid resetting UIDs based on stmts in the RDG but
instead reset only those still present in the loop.

PR tree-optimization/114998
* tree-loop-distribution.cc (free_rdg): Take loop argument.
Reset UIDs of stmts still in the IL rather than all stmts
referenced from the RDG.
(loop_distribution::build_rdg): Pass loop to free_rdg.
(loop_distribution::distribute_loop): Likewise.
(loop_distribution::transform_reduction_loop): Likewise.

* gcc.dg/torture/pr114998.c: New testcase.

(cherry picked from commit 34d15a4d630a0d54eddb99bdab086c506e10dac5)

Diff:
---
 gcc/testsuite/gcc.dg/torture/pr114998.c | 35 +
 gcc/tree-loop-distribution.cc   | 24 --
 2 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/torture/pr114998.c 
b/gcc/testsuite/gcc.dg/torture/pr114998.c
new file mode 100644
index ..81fc1e077cb9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr114998.c
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-fno-tree-dce -ftree-loop-distribution" } */
+
+short a, d;
+int b, c, f, g, h, i, j[2], o;
+__attribute__((const)) int s(char r);
+int main() {
+  int l, m, k, n;
+  if (b) {
+char p;
+for (; p >= 0; p--) {
+  int e[] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,
+ 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1,
+ 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0};
+  if (j[p]) {
+int q[1];
+i = o;
+o = q[h];
+if (g)
+  n = d;
+m = 4;
+for (; m; m--) {
+  if (l)
+k |= c;
+  if (a)
+break;
+}
+  }
+  s(n);
+  f |= b;
+}
+  }
+  return 0;
+}
diff --git a/gcc/tree-loop-distribution.cc b/gcc/tree-loop-distribution.cc
index 95203fefa188..45932bae5e7f 100644
--- a/gcc/tree-loop-distribution.cc
+++ b/gcc/tree-loop-distribution.cc
@@ -778,7 +778,7 @@ loop_distribution::stmts_from_loop (class loop *loop, 
vec *stmts)
 /* Free the reduced dependence graph RDG.  */
 
 static void
-free_rdg (struct graph *rdg)
+free_rdg (struct graph *rdg, loop_p loop)
 {
   int i;
 
@@ -792,13 +792,25 @@ free_rdg (struct graph *rdg)
 
   if (v->data)
{
- gimple_set_uid (RDGV_STMT (v), -1);
  (RDGV_DATAREFS (v)).release ();
  free (v->data);
}
 }
 
   free_graph (rdg);
+
+  /* Reset UIDs of stmts still in the loop.  */
+  basic_block *bbs = get_loop_body (loop);
+  for (unsigned i = 0; i < loop->num_nodes; ++i)
+{
+  basic_block bb = bbs[i];
+  gimple_stmt_iterator gsi;
+  for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next ())
+   gimple_set_uid (gsi_stmt (gsi), -1);
+  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next ())
+   gimple_set_uid (gsi_stmt (gsi), -1);
+}
+  free (bbs);
 }
 
 struct graph *
@@ -812,7 +824,7 @@ loop_distribution::build_rdg (class loop *loop, 
control_dependences *cd)
   rdg = new_graph (stmts.length ());
   if (!create_rdg_vertices (rdg, stmts, loop))
 {
-  free_rdg (rdg);
+  free_rdg (rdg, loop);
   return NULL;
 }
   stmts.release ();
@@ -3062,7 +3074,7 @@ loop_distribution::distribute_loop (class loop *loop,
 "Loop %d not distributed: too many memory references.\n",
 loop->num);
 
-  free_rdg (rdg);
+  free_rdg (rdg, loop);
   loop_nest.release ();
   free_data_refs (datarefs_vec);
   delete ddrs_table;
@@ -3259,7 +3271,7 @@ loop_distribution::distribute_loop (class loop *loop,
   FOR_EACH_VEC_ELT (partitions, i, partition)
 partition_free (partition);
 
-  free_rdg (rdg);
+  free_rdg (rdg, loop);
   return nbp - *nb_calls;
 }
 
@@ -3665,7 +3677,7 @@ loop_distribution::transform_reduction_loop (loop_p loop)
   auto_bitmap partition_stmts;
   bitmap_set_range (partition_stmts, 0, rdg->n_vertices);
   find_single_drs (loop, rdg, partition_stmts, _dr, _dr);
-  free_rdg (rdg);
+  free_rdg (rdg, loop);
 
   /* Bail out if there is no single load.  */
   if (load_dr == NULL)


[RFC] Merge strategy for all-SLP vectorizer

2024-05-17 Thread Richard Biener via Gcc


Hi,

I'd like to discuss how to go forward with getting the vectorizer to
all-SLP for this stage1.  While there is a personal branch with my
ongoing work (users/rguenth/vect-force-slp), branches haven't proved
themselves working well for collaboration.  The branch isn't ready
to be merged in full but I have been picking improvements to trunk
last stage1 and some remaining bits in the past weeks.  I have
refrained from merging code paths that cannot be exercised on trunk.

There are two important set of changes on the branch, both critical
to get more testing on non-x86 targets.

 1. enable single-lane SLP discovery
 2. avoid splitting store groups (9315bfc661432c3 and 4336060fe2db8ec
if you fetch the branch)

The first point is also most annoying on the testsuite since doing
SLP instead of interleaving changes what we dump and thus tests
start to fail in random ways when you switch between both modes.
On the branch single-lane SLP discovery is gated with
--param vect-single-lane-slp.

The branch has numerous changes to enable single-lane SLP for some
code paths that have SLP not implemented and where I did not bother
to try supporting multi-lane SLP at this point.  It also adds more
SLP discovery entry points.

I'm not sure how to try merging these pieces to allow others to
more easily help out.  One possibility is to merge
--param vect-single-lane-slp defaulted off and pick dependent
changes even when they cause testsuite regressions with
vect-single-lane-slp=1.  Alternatively adjust the testsuite by
adding --param vect-single-lane-slp=0 and default to 1
(or keep the default).  Or require a clean testsuite with
--param vect-single-lane-slp defaulted to 1 but keep the --param
for debugging (and allow FAILs with 0).
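For illustration, the per-test opt-out under that last option might look
like this in a testcase (a sketch; the exact param spelling is whatever
ends up on trunk):

/* { dg-additional-options "--param vect-single-lane-slp=0" } */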

For fun I merged just single-lane discovery of non-grouped stores
and have that enabled by default.  On x86_64 this results in the
set of FAILs below.

Any suggestions?

Thanks,
Richard.

FAIL: gcc.dg/vect/O3-pr39675-2.c scan-tree-dump-times vect "vectorizing 
stmts using SLP" 1
XPASS: gcc.dg/vect/no-scevccp-outer-12.c scan-tree-dump-times vect "OUTER 
LOOP VECTORIZED." 1
FAIL: gcc.dg/vect/no-section-anchors-vect-31.c scan-tree-dump-times vect 
"Alignment of access forced using peeling" 2
FAIL: gcc.dg/vect/no-section-anchors-vect-31.c scan-tree-dump-times vect 
"Vectorizing an unaligned access" 0
FAIL: gcc.dg/vect/no-section-anchors-vect-64.c scan-tree-dump-times vect 
"Alignment of access forced using peeling" 2
FAIL: gcc.dg/vect/no-section-anchors-vect-64.c scan-tree-dump-times vect 
"Vectorizing an unaligned access" 0
FAIL: gcc.dg/vect/no-section-anchors-vect-66.c scan-tree-dump-times vect 
"Alignment of access forced using peeling" 1
FAIL: gcc.dg/vect/no-section-anchors-vect-66.c scan-tree-dump-times vect 
"Vectorizing an unaligned access" 0
FAIL: gcc.dg/vect/no-section-anchors-vect-68.c scan-tree-dump-times vect 
"Alignment of access forced using peeling" 2
FAIL: gcc.dg/vect/no-section-anchors-vect-68.c scan-tree-dump-times vect 
"Vectorizing an unaligned access" 0
FAIL: gcc.dg/vect/slp-12a.c -flto -ffat-lto-objects  scan-tree-dump-times 
vect "vectorizing stmts using SLP" 1
FAIL: gcc.dg/vect/slp-12a.c scan-tree-dump-times vect "vectorizing stmts 
using SLP" 1
FAIL: gcc.dg/vect/slp-19a.c -flto -ffat-lto-objects  scan-tree-dump-times 
vect "vectorizing stmts using SLP" 1
FAIL: gcc.dg/vect/slp-19a.c scan-tree-dump-times vect "vectorizing stmts 
using SLP" 1
FAIL: gcc.dg/vect/slp-19b.c -flto -ffat-lto-objects  scan-tree-dump-times 
vect "vectorizing stmts using SLP" 1
FAIL: gcc.dg/vect/slp-19b.c scan-tree-dump-times vect "vectorizing stmts 
using SLP" 1
FAIL: gcc.dg/vect/slp-19c.c -flto -ffat-lto-objects  scan-tree-dump-times 
vect "vectorized 1 loops" 1
FAIL: gcc.dg/vect/slp-19c.c -flto -ffat-lto-objects  scan-tree-dump-times 
vect "vectorizing stmts using SLP" 1
FAIL: gcc.dg/vect/slp-19c.c scan-tree-dump-times vect "vectorized 1 loops" 
1
FAIL: gcc.dg/vect/slp-19c.c scan-tree-dump-times vect "vectorizing stmts 
using SLP" 1
XPASS: gcc.dg/vect/tsvc/vect-tsvc-s1115.c -flto -ffat-lto-objects  
scan-tree-dump vect "vectorized 1 loops"
XPASS: gcc.dg/vect/tsvc/vect-tsvc-s1115.c scan-tree-dump vect "vectorized 
1 loops"
XPASS: gcc.dg/vect/tsvc/vect-tsvc-s114.c -flto -ffat-lto-objects  
scan-tree-dump vect "vectorized 1 loops"
XPASS: gcc.dg/vect/tsvc/vect-tsvc-s114.c scan-tree-dump vect "vectorized 1 
loops"
XPASS: gcc.dg/vect/tsvc/vect-tsvc-s1232.c -flto -ffat-lto-objects  
scan-tree-dump vect "vectorized 1 loops"
XPASS: gcc.dg/vect/tsvc/vect-tsvc-s1232.c scan-tree-dump vect "vectorized 
1 loops"
XPASS: gcc.dg/vect/tsvc/vect-tsvc-s257.c -flto -ffat-lto-objects  
scan-tree-dump vect "vectorized 1 loops"
XPASS: gcc.dg/vect/tsvc/vect-tsvc-s257.c scan-tree-dump vect "vectorized 1 
loops"
FAIL: gcc.dg/vect/vect-26.c -flto -ffat-lto-objects  scan-tree-dump-times 
vect "Alignment of access forced using peeling" 1
FAIL: gcc.dg/vect/vect-26.c -flto -ffat-lto-objects  scan-tree-dump-times 
vect 

Re: [PATCH] middle-end: Expand {u|s}dot product support in autovectorizer

2024-05-17 Thread Richard Biener
On Fri, May 17, 2024 at 11:56 AM Tamar Christina
 wrote:
>
> > -Original Message-
> > From: Richard Biener 
> > Sent: Friday, May 17, 2024 10:46 AM
> > To: Tamar Christina 
> > Cc: Victor Do Nascimento ; gcc-
> > patc...@gcc.gnu.org; Richard Sandiford ; Richard
> > Earnshaw ; Victor Do Nascimento
> > 
> > Subject: Re: [PATCH] middle-end: Expand {u|s}dot product support in
> > autovectorizer
> >
> > On Fri, May 17, 2024 at 11:05 AM Tamar Christina
> >  wrote:
> > >
> > > > -Original Message-
> > > > From: Richard Biener 
> > > > Sent: Friday, May 17, 2024 6:51 AM
> > > > To: Victor Do Nascimento 
> > > > Cc: gcc-patches@gcc.gnu.org; Richard Sandiford
> > ;
> > > > Richard Earnshaw ; Victor Do Nascimento
> > > > 
> > > > Subject: Re: [PATCH] middle-end: Expand {u|s}dot product support in
> > > > autovectorizer
> > > >
> > > > On Thu, May 16, 2024 at 4:40 PM Victor Do Nascimento
> > > >  wrote:
> > > > >
> > > > > From: Victor Do Nascimento 
> > > > >
> > > > > At present, the compiler offers the `{u|s|us}dot_prod_optab' direct
> > > > > optabs for dealing with vectorizable dot product code sequences.  The
> > > > > consequence of using a direct optab for this is that backend-pattern
> > > > > selection is only ever able to match against one datatype - either
> > > > > that of the operands or of the accumulated value, never both.
> > > > >
> > > > > With the introduction of the 2-way (un)signed dot-product insn [1][2]
> > > > > in AArch64 SVE2, the existing direct opcode approach is no longer
> > > > > sufficient for full specification of all the possible dot product
> > > > > machine instructions to be matched to the code sequence; a dot product
> > > > > resulting in VNx4SI may result from either dot products on VNx16QI or
> > > > > VNx8HI values for the 4- and 2-way dot product operations, 
> > > > > respectively.
> > > > >
> > > > > This means that the following example fails autovectorization:
> > > > >
> > > > > uint32_t foo(int n, uint16_t* data) {
> > > > >   uint32_t sum = 0;
> > > > >   for (int i=0; i<n; ++i) {
> > > > >     sum += data[i] * data[i];
> > > > >   }
> > > > >   return sum;
> > > > > }
> > > > >
> > > > > To remedy the issue a new optab is added, tentatively named
> > > > > `udot_prod_twoway_optab', whose selection is dependent upon checking
> > > > > of both input and output types involved in the operation.
> > > >
> > > > I don't like this too much.  I'll note we document dot_prod as
> > > >
> > > > @cindex @code{sdot_prod@var{m}} instruction pattern
> > > > @item @samp{sdot_prod@var{m}}
> > > >
> > > > Compute the sum of the products of two signed elements.
> > > > Operand 1 and operand 2 are of the same mode. Their
> > > > product, which is of a wider mode, is computed and added to operand 3.
> > > > Operand 3 is of a mode equal or wider than the mode of the product. The
> > > > result is placed in operand 0, which is of the same mode as operand 3.
> > > > @var{m} is the mode of operand 1 and operand 2.
> > > >
> > > > with no restriction on the wider mode but we don't specify it which is
> > > > bad design.  This should have been a convert optab with two modes
> > > > from the start - adding a _twoway variant is just a hack.
> > >
> > > We did discuss this at the time we started implementing it.  There were two
> > > options; one was indeed to change it to a convert dot_prod optab, but doing
> > > this means we have to update every target that uses it.
> > >
> > > Now that means 3 ISAs for AArch64, Arm, Arc, c6x, 2 for x86, loongson and
> > > altivec.
> > >
> > > Which sure could be possible, but there's also every use in the backends
> > > that need to be updated, and tested, which for some targets we don't even
> > > know how to begin.
> > >
> > > So it seems very hard to correct dotprod to a convert optab now.
> >
> > It's still the correct way to go.  At _least_ your new pattern should
> > have been this, otherwise what do you do when you have two-way, four-way
> > and eight-way variants?  Add yet another optab?

[gcc r15-626] middle-end/115110 - Fix view_converted_memref_p

2024-05-17 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:a5b3721c06646bf5b9b50a22964e8e2bd4d03f5f

commit r15-626-ga5b3721c06646bf5b9b50a22964e8e2bd4d03f5f
Author: Richard Biener 
Date:   Fri May 17 11:02:29 2024 +0200

middle-end/115110 - Fix view_converted_memref_p

view_converted_memref_p was checking the reference type against the
pointer type of the offset operand rather than its pointed-to type
which leads to all refs being subject to view-convert treatment
in get_alias_set causing numerous testsuite fails but with its
new uses from r15-512-g9b7cad5884f21c is also a wrong-code issue.

PR middle-end/115110
* tree-ssa-alias.cc (view_converted_memref_p): Fix.

Diff:
---
 gcc/tree-ssa-alias.cc | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/gcc/tree-ssa-alias.cc b/gcc/tree-ssa-alias.cc
index 9f5f69bcfad2..d64d6d02f4a8 100644
--- a/gcc/tree-ssa-alias.cc
+++ b/gcc/tree-ssa-alias.cc
@@ -2077,8 +2077,9 @@ view_converted_memref_p (tree base)
 {
   if (TREE_CODE (base) != MEM_REF && TREE_CODE (base) != TARGET_MEM_REF)
 return false;
-  return same_type_for_tbaa (TREE_TYPE (base),
-                             TREE_TYPE (TREE_OPERAND (base, 1))) != 1;
+  return (same_type_for_tbaa (TREE_TYPE (base),
+                              TREE_TYPE (TREE_TYPE (TREE_OPERAND (base, 1))))
+          != 1);
 }
 
 /* Return true if an indirect reference based on *PTR1 constrained
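
The one-level mixup is easiest to see on the operands themselves.  A
sketch using the accessors from the hunk above (not part of the
commit):

  /* For a MEM_REF, operand 1 is a constant offset whose pointer type
     encodes the type of the access.  */
  static tree
  memref_access_type (tree base)
  {
    tree op1 = TREE_OPERAND (base, 1); /* typed offset constant */
    tree ptr_type = TREE_TYPE (op1);   /* e.g. 'int *' -- what the buggy
                                          check compared against */
    return TREE_TYPE (ptr_type);       /* e.g. 'int' -- the access type
                                          the TBAA check needs */
  }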


Re: [PATCH] middle-end: Expand {u|s}dot product support in autovectorizer

2024-05-17 Thread Richard Biener
On Fri, May 17, 2024 at 11:05 AM Tamar Christina
 wrote:
>
> > -Original Message-
> > From: Richard Biener 
> > Sent: Friday, May 17, 2024 6:51 AM
> > To: Victor Do Nascimento 
> > Cc: gcc-patches@gcc.gnu.org; Richard Sandiford ;
> > Richard Earnshaw ; Victor Do Nascimento
> > 
> > Subject: Re: [PATCH] middle-end: Expand {u|s}dot product support in
> > autovectorizer
> >
> > On Thu, May 16, 2024 at 4:40 PM Victor Do Nascimento
> >  wrote:
> > >
> > > From: Victor Do Nascimento 
> > >
> > > At present, the compiler offers the `{u|s|us}dot_prod_optab' direct
> > > optabs for dealing with vectorizable dot product code sequences.  The
> > > consequence of using a direct optab for this is that backend-pattern
> > > selection is only ever able to match against one datatype - Either
> > > that of the operands or of the accumulated value, never both.
> > >
> > > With the introduction of the 2-way (un)signed dot-product insn [1][2]
> > > in AArch64 SVE2, the existing direct opcode approach is no longer
> > > sufficient for full specification of all the possible dot product
> > > machine instructions to be matched to the code sequence; a dot product
> > > resulting in VNx4SI may result from either dot products on VNx16QI or
> > > VNx8HI values for the 4- and 2-way dot product operations, respectively.
> > >
> > > This means that the following example fails autovectorization:
> > >
> > > uint32_t foo(int n, uint16_t* data) {
> > >   uint32_t sum = 0;
> > >   for (int i=0; i<n; ++i) {
> > >     sum += data[i] * data[i];
> > >   }
> > >   return sum;
> > > }
> > >
> > > To remedy the issue a new optab is added, tentatively named
> > > `udot_prod_twoway_optab', whose selection is dependent upon checking
> > > of both input and output types involved in the operation.
> >
> > I don't like this too much.  I'll note we document dot_prod as
> >
> > @cindex @code{sdot_prod@var{m}} instruction pattern
> > @item @samp{sdot_prod@var{m}}
> >
> > Compute the sum of the products of two signed elements.
> > Operand 1 and operand 2 are of the same mode. Their
> > product, which is of a wider mode, is computed and added to operand 3.
> > Operand 3 is of a mode equal or wider than the mode of the product. The
> > result is placed in operand 0, which is of the same mode as operand 3.
> > @var{m} is the mode of operand 1 and operand 2.
> >
> > with no restriction on the wider mode but we don't specify it which is
> > bad design.  This should have been a convert optab with two modes
> > from the start - adding a _twoway variant is just a hack.
>
> We did discuss this at the time we started implementing it.  There were two
> options; one was indeed to change it to a convert dot_prod optab, but doing
> this means we have to update every target that uses it.
>
> Now that means 3 ISAs for AArch64, Arm, Arc, c6x, 2 for x86, loongson and 
> altivec.
>
> Which sure could be possible, but there's also every use in the backends
> that need to be updated, and tested, which for some targets we don't even
> know how to begin.
>
> So it seems very hard to correct dotprod to a convert optab now.

It's still the correct way to go.  At _least_ your new pattern should
have been this,
otherwise what do you do when you have two-way, four-way and eight-way variants?
Add yet another optab?

Another thing is that when you do it your way you should fix the existing optab
to be two-way by documenting how the second mode derives from the first.

And sure, it's not the only optab suffering from this issue.

Richard.

> Tamar
>
> >
> > Richard.
> >
> > > In order to minimize changes to the existing codebase,
> > > `optab_for_tree_code' is renamed `optab_for_tree_code_1' and a new
> > > argument is added to its signature - `const_tree otype', allowing type
> > > information to be specified for both input and output types.  The
> > > existing interface is retained by defining a new `optab_for_tree_code',
> > > which serves as a shim to `optab_for_tree_code_1', passing old
> > > parameters as-is and setting the new `optype' argument to `NULL_TREE'.
> > >
> > > For DOT_PROD_EXPR tree codes, we can call `optab_for_tree_code_1'
> > > directly, passing it both types, adding the internal logic to the
> > > function to distinguish between competing optabs.
> > >
> > > Finally, necessary changes are made to `expand_widen_pattern_expr' to
> > > ensure the new icode can be correctly selected, given the new optab.
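
To make the direct-versus-convert distinction under discussion
concrete, a sketch of the two lookup styles (`optab_handler' and
`convert_optab_handler' are the existing query functions; the convert
optab named here is hypothetical):

  /* A direct optab is keyed on a single mode...  */
  insn_code four_way = optab_handler (sdot_prod_optab, V16QImode);

  /* ...while a convert optab is keyed on destination and source modes,
     so a V4SI dot product from V16QI (four-way) and one from V8HI
     (two-way) select different instructions.  */
  insn_code two_way = convert_optab_handler (sdot_prod_conv_optab, /* hypothetical */
                                             V4SImode, V8HImode);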

Re: [PATCH] MATCH: Maybe expand (T)(A + C1) * C2 and (T)(A + C1) * C2 + C3 [PR109393]

2024-05-17 Thread Richard Biener
On Fri, 17 May 2024, Manolis Tsamis wrote:

> Hi Richard,
> 
> While I was re-testing the latest version of this patch I noticed that
> it FAILs an AArch64 test, gcc.target/aarch64/subsp.c. With the patch
> we generate one instruction more:
> 
> sbfiz   x1, x1, 4, 32
> stp x29, x30, [sp, -16]!
> add x1, x1, 16
> mov x29, sp
> sub sp, sp, x1
> mov x0, sp
> bl  foo
> 
> Instead of:
> 
> stp x29, x30, [sp, -16]!
> add w1, w1, 1
> mov x29, sp
> sub sp, sp, w1, sxtw 4
> mov x0, sp
> bl  foo
> 
> I've looked at it but can't really find a way to solve the regression.
> Any thoughts on this?

Can you explain what goes wrong?  As I said rewriting parts of
address calculation is tricky, there's always the chance that some
cases regress (see your observation in comment#4 of the PR).

Note that I still believe that avoiding the early and premature
promotion of the addition to unsigned is a good thing.

Note the testcase in the PR is fixed with -fwrapv because then
we do _not_ perform this premature optimization.  Without -fwrapv
the optimization is valid but as you note we do not perform it
consistently - otherwise we wouldn't regress.

Richard.



> Thanks,
> Manolis
> 
> 
> 
> On Thu, May 16, 2024 at 11:15 AM Richard Biener
>  wrote:
> >
> > On Tue, May 14, 2024 at 10:58 AM Manolis Tsamis  
> > wrote:
> > >
> > > New patch with the requested changes can be found below.
> > >
> > > I don't know how much this affects SCEV, but I do believe that we
> > > should incorporate this change somehow. I've seen various cases of
> > > suboptimal address calculation codegen that boil down to this.
> >
> > This misses the ChangeLog (I assume it's unchanged) and indent
> > of the match.pd part is now off.
> >
> > Please fix that, the patch is OK with that change.
> >
> > Thanks,
> > Richard.
> >
> > > gcc/match.pd | 31 +++
> > > gcc/testsuite/gcc.dg/pr109393.c | 16 
> > > 2 files changed, 47 insertions(+)
> > > create mode 100644 gcc/testsuite/gcc.dg/pr109393.c
> > >
> > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > index 07e743ae464..1d642c205f0 100644
> > > --- a/gcc/match.pd
> > > +++ b/gcc/match.pd
> > > @@ -3650,6 +3650,37 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> > > (plus (convert @0) (op @2 (convert @1))
> > > #endif
> > > +/* ((T)(A + CST1)) * CST2 + CST3
> > > + -> ((T)(A) * CST2) + ((T)CST1 * CST2 + CST3)
> > > + Where (A + CST1) doesn't need to have a single use. */
> > > +#if GIMPLE
> > > + (for op (plus minus)
> > > + (simplify
> > > + (plus (mult:s (convert:s (op @0 INTEGER_CST@1)) INTEGER_CST@2)
> > > + INTEGER_CST@3)
> > > + (if (INTEGRAL_TYPE_P (TREE_TYPE (@0))
> > > + && INTEGRAL_TYPE_P (type)
> > > + && TYPE_PRECISION (type) > TYPE_PRECISION (TREE_TYPE (@0))
> > > + && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (@0))
> > > + && !TYPE_OVERFLOW_SANITIZED (TREE_TYPE (@0))
> > > + && TYPE_OVERFLOW_WRAPS (type))
> > > + (op (mult (convert @0) @2) (plus (mult (convert @1) @2) @3)
> > > +#endif
> > > +
> > > +/* ((T)(A + CST1)) * CST2 -> ((T)(A) * CST2) + ((T)CST1 * CST2) */
> > > +#if GIMPLE
> > > + (for op (plus minus)
> > > + (simplify
> > > + (mult (convert:s (op:s @0 INTEGER_CST@1)) INTEGER_CST@2)
> > > + (if (INTEGRAL_TYPE_P (TREE_TYPE (@0))
> > > + && INTEGRAL_TYPE_P (type)
> > > + && TYPE_PRECISION (type) > TYPE_PRECISION (TREE_TYPE (@0))
> > > + && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (@0))
> > > + && !TYPE_OVERFLOW_SANITIZED (TREE_TYPE (@0))
> > > + && TYPE_OVERFLOW_WRAPS (type))
> > > + (op (mult (convert @0) @2) (mult (convert @1) @2)
> > > +#endif
> > > +
> > > /* (T)(A) +- (T)(B) -> (T)(A +- B) only when (A +- B) could be simplified
> > > to a simple value. */
> > > (for op (plus minus)
> > > diff --git a/gcc/testsuite/gcc.dg/pr109393.c 
> > > b/gcc/testsuite/gcc.dg/pr109393.c
> > > new file mode 100644
> > > index 000..e9051273672
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/pr109393.c
> > > @@ -0,0 +1,16 @@
> > > +/* PR tree-optimization/109393 */
> > > +/* { dg-do compile } */
>
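
For reference, the effect of the rules above on a concrete input (a
hand-written illustration with arbitrary constants, not taken from the
patch or the PR):

  /* T = unsigned long (wrapping), A of type int (signed overflow
     undefined), CST1 = 1, CST2 = 16.  */
  unsigned long
  addr_calc (int a)
  {
    return (unsigned long) (a + 1) * 16;
    /* ...is rewritten to (unsigned long) a * 16 + 16, exposing the
       constant offset to later address-mode selection.  */
  }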

[PATCH] middle-end/115110 - Fix view_converted_memref_p

2024-05-17 Thread Richard Biener
view_converted_memref_p was checking the reference type against the
pointer type of the offset operand rather than its pointed-to type
which leads to all refs being subject to view-convert treatment
in get_alias_set causing numerous testsuite fails but with its
new uses from r15-512-g9b7cad5884f21c is also a wrong-code issue.

Bootstrap & regtest on x86_64-unknown-linux-gnu in progress.

PR middle-end/115110
* tree-ssa-alias.cc (view_converted_memref_p): Fix.
---
 gcc/tree-ssa-alias.cc | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/gcc/tree-ssa-alias.cc b/gcc/tree-ssa-alias.cc
index 9f5f69bcfad..d64d6d02f4a 100644
--- a/gcc/tree-ssa-alias.cc
+++ b/gcc/tree-ssa-alias.cc
@@ -2077,8 +2077,9 @@ view_converted_memref_p (tree base)
 {
   if (TREE_CODE (base) != MEM_REF && TREE_CODE (base) != TARGET_MEM_REF)
 return false;
-  return same_type_for_tbaa (TREE_TYPE (base),
-                             TREE_TYPE (TREE_OPERAND (base, 1))) != 1;
+  return (same_type_for_tbaa (TREE_TYPE (base),
+                              TREE_TYPE (TREE_TYPE (TREE_OPERAND (base, 1))))
+          != 1);
 }
 
 /* Return true if an indirect reference based on *PTR1 constrained
-- 
2.35.3


[gcc r15-622] Add missing check for const_pool in the escaped solutions

2024-05-17 Thread Richard Biener via Gcc-cvs
https://gcc.gnu.org/g:b420e0b920613c42f63252aa2478a8315dc37a13

commit r15-622-gb420e0b920613c42f63252aa2478a8315dc37a13
Author: Richard Biener 
Date:   Fri May 17 09:31:52 2024 +0200

Add missing check for const_pool in the escaped solutions

The ptr-vs-ptr compare folding using points-to info was missing a
check for const_pool being included in the escaped solution.  The
following fixes that, fixing the observed execute FAIL of
experimental/functional/searchers.cc

* tree-ssa-alias.h (pt_solution_includes_const_pool): Declare.
* tree-ssa-alias.cc (ptrs_compare_unequal): Use
pt_solution_includes_const_pool.
* tree-ssa-structalias.cc (pt_solution_includes_const_pool): New.

* gcc.dg/torture/20240517-1.c: New testcase.

Diff:
---
 gcc/testsuite/gcc.dg/torture/20240517-1.c | 26 ++
 gcc/tree-ssa-alias.cc |  3 ++-
 gcc/tree-ssa-alias.h  |  1 +
 gcc/tree-ssa-structalias.cc   | 11 +++
 4 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/torture/20240517-1.c 
b/gcc/testsuite/gcc.dg/torture/20240517-1.c
new file mode 100644
index ..ab83d3ca6fba
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/20240517-1.c
@@ -0,0 +1,26 @@
+/* { dg-do run } */
+/* { dg-additional-options "-fmerge-all-constants" } */
+
+char *p;
+
+char * __attribute__((noipa))
+foo () { return p+1; }
+
+volatile int z;
+
+int main()
+{
+  /* ESCAPED = CONST_POOL */
+  p = "Hello";
+  /* PT = ESCAPED */
+  char *x = foo ();
+  char *y;
+  /* y PT = CONST_POOL */
+  if (z)
+    y = "Baz";
+  else
+    y = "Hello" + 1;
+  if (y != x)
+    __builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-ssa-alias.cc b/gcc/tree-ssa-alias.cc
index 6d31fc836917..9f5f69bcfad2 100644
--- a/gcc/tree-ssa-alias.cc
+++ b/gcc/tree-ssa-alias.cc
@@ -501,7 +501,8 @@ ptrs_compare_unequal (tree ptr1, tree ptr2)
           || pi2->pt.vars_contains_interposable)
         return false;
       if ((!pi1->pt.null || !pi2->pt.null)
-          && (!pi1->pt.const_pool || !pi2->pt.const_pool))
+          && (!pt_solution_includes_const_pool (&pi1->pt)
+              || !pt_solution_includes_const_pool (&pi2->pt)))
         return !pt_solutions_intersect (&pi1->pt, &pi2->pt);
}
 }
diff --git a/gcc/tree-ssa-alias.h b/gcc/tree-ssa-alias.h
index e29dff583750..5cd64e722955 100644
--- a/gcc/tree-ssa-alias.h
+++ b/gcc/tree-ssa-alias.h
@@ -178,6 +178,7 @@ extern bool pt_solution_empty_p (const pt_solution *);
 extern bool pt_solution_singleton_or_null_p (struct pt_solution *, unsigned *);
 extern bool pt_solution_includes_global (struct pt_solution *, bool);
 extern bool pt_solution_includes (struct pt_solution *, const_tree);
+extern bool pt_solution_includes_const_pool (struct pt_solution *);
 extern bool pt_solutions_intersect (struct pt_solution *, struct pt_solution *);
 extern void pt_solution_reset (struct pt_solution *);
 extern void pt_solution_set (struct pt_solution *, bitmap, bool);
diff --git a/gcc/tree-ssa-structalias.cc b/gcc/tree-ssa-structalias.cc
index 0c6085b17662..61fb3610a172 100644
--- a/gcc/tree-ssa-structalias.cc
+++ b/gcc/tree-ssa-structalias.cc
@@ -7080,6 +7080,17 @@ pt_solution_includes (struct pt_solution *pt, const_tree decl)
   return res;
 }
 
+/* Return true if the points-to solution *PT contains a reference to a
+   constant pool entry.  */
+
+bool
+pt_solution_includes_const_pool (struct pt_solution *pt)
+{
+  return (pt->const_pool
+          || (pt->escaped && (!cfun || cfun->gimple_df->escaped.const_pool))
+          || (pt->ipa_escaped && ipa_escaped_pt.const_pool));
+}
+
 /* Return true if both points-to solutions PT1 and PT2 have a non-empty
intersection.  */
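
Reading the new testcase with its points-to annotations: `x' carries
the ESCAPED solution (it comes back from `foo'), `y' points directly
at CONST_POOL, and previously only a solution's own const_pool flag
was consulted, so the final compare could wrongly be folded to
unequal.  A standalone sketch of the trap this closes (assuming
constant merging is in effect):

  int
  pool_compare (const char *x) /* x derived from another "Hello" */
  {
    const char *y = "Hello" + 1;
    /* Under -fmerge-all-constants the two literals may share storage,
       so x == y is possible and folding x != y to 1 is wrong code.  */
    return x != y;
  }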


[PATCH] Add missing check for const_pool in the escaped solutions

2024-05-17 Thread Richard Biener
The ptr-vs-ptr compare folding using points-to info was missing a
check for const_pool being included in the escaped solution.  The
following fixes that, fixing the observed execute FAIL of
experimental/functional/searchers.cc

Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.

* tree-ssa-alias.h (pt_solution_includes_const_pool): Declare.
* tree-ssa-alias.cc (ptrs_compare_unequal): Use
pt_solution_includes_const_pool.
* tree-ssa-structalias.cc (pt_solution_includes_const_pool): New.

* gcc.dg/torture/20240517-1.c: New testcase.
---
 gcc/testsuite/gcc.dg/torture/20240517-1.c | 26 +++
 gcc/tree-ssa-alias.cc |  3 ++-
 gcc/tree-ssa-alias.h  |  1 +
 gcc/tree-ssa-structalias.cc   | 11 ++
 4 files changed, 40 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/torture/20240517-1.c

diff --git a/gcc/testsuite/gcc.dg/torture/20240517-1.c 
b/gcc/testsuite/gcc.dg/torture/20240517-1.c
new file mode 100644
index 000..ab83d3ca6fb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/20240517-1.c
@@ -0,0 +1,26 @@
+/* { dg-do run } */
+/* { dg-additional-options "-fmerge-all-constants" } */
+
+char *p;
+
+char * __attribute__((noipa))
+foo () { return p+1; }
+
+volatile int z;
+
+int main()
+{
+  /* ESCAPED = CONST_POOL */
+  p = "Hello";
+  /* PT = ESCAPED */
+  char *x = foo ();
+  char *y;
+  /* y PT = CONST_POOL */
+  if (z)
+    y = "Baz";
+  else
+    y = "Hello" + 1;
+  if (y != x)
+    __builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-ssa-alias.cc b/gcc/tree-ssa-alias.cc
index 6d31fc83691..9f5f69bcfad 100644
--- a/gcc/tree-ssa-alias.cc
+++ b/gcc/tree-ssa-alias.cc
@@ -501,7 +501,8 @@ ptrs_compare_unequal (tree ptr1, tree ptr2)
           || pi2->pt.vars_contains_interposable)
         return false;
       if ((!pi1->pt.null || !pi2->pt.null)
-          && (!pi1->pt.const_pool || !pi2->pt.const_pool))
+          && (!pt_solution_includes_const_pool (&pi1->pt)
+              || !pt_solution_includes_const_pool (&pi2->pt)))
         return !pt_solutions_intersect (&pi1->pt, &pi2->pt);
}
 }
diff --git a/gcc/tree-ssa-alias.h b/gcc/tree-ssa-alias.h
index e29dff58375..5cd64e72295 100644
--- a/gcc/tree-ssa-alias.h
+++ b/gcc/tree-ssa-alias.h
@@ -178,6 +178,7 @@ extern bool pt_solution_empty_p (const pt_solution *);
 extern bool pt_solution_singleton_or_null_p (struct pt_solution *, unsigned *);
 extern bool pt_solution_includes_global (struct pt_solution *, bool);
 extern bool pt_solution_includes (struct pt_solution *, const_tree);
+extern bool pt_solution_includes_const_pool (struct pt_solution *);
 extern bool pt_solutions_intersect (struct pt_solution *, struct pt_solution *);
 extern void pt_solution_reset (struct pt_solution *);
 extern void pt_solution_set (struct pt_solution *, bitmap, bool);
diff --git a/gcc/tree-ssa-structalias.cc b/gcc/tree-ssa-structalias.cc
index 0c6085b1766..61fb3610a17 100644
--- a/gcc/tree-ssa-structalias.cc
+++ b/gcc/tree-ssa-structalias.cc
@@ -7080,6 +7080,17 @@ pt_solution_includes (struct pt_solution *pt, const_tree decl)
   return res;
 }
 
+/* Return true if the points-to solution *PT contains a reference to a
+   constant pool entry.  */
+
+bool
+pt_solution_includes_const_pool (struct pt_solution *pt)
+{
+  return (pt->const_pool
+          || (pt->escaped && (!cfun || cfun->gimple_df->escaped.const_pool))
+          || (pt->ipa_escaped && ipa_escaped_pt.const_pool));
+}
+
 /* Return true if both points-to solutions PT1 and PT2 have a non-empty
intersection.  */
 
-- 
2.35.3


Re: [PATCH] middle-end: Drop __builtin_prefetch calls in autovectorization [PR114061]

2024-05-17 Thread Richard Biener
On Thu, May 16, 2024 at 11:19 PM Tamar Christina
 wrote:
>
> Hi,
>
> > -Original Message-
> > From: Victor Do Nascimento 
> > Sent: Thursday, May 16, 2024 2:57 PM
> > To: gcc-patches@gcc.gnu.org
> > Cc: Richard Sandiford ; Richard Earnshaw
> > ; Victor Do Nascimento
> > 
> > Subject: [PATCH] middle-end: Drop __builtin_prefetch calls in
> > autovectorization [PR114061]
> >
> > At present the autovectorizer fails to vectorize simple loops
> > involving calls to `__builtin_prefetch'.  A simple example of such
> > loop is given below:
> >
> > void foo(double * restrict a, double * restrict b, int n){
> >   int i;
> >   for(i=0; i<n; ++i){
> >     a[i] = a[i] + b[i];
> >     __builtin_prefetch(&(b[i+8]));
> >   }
> > }
> >
> > The failure stems from two issues:
> >
> > 1. Given that it is typically not possible to fully reason about a
> >    function call due to the possibility of side effects, the
> >    autovectorizer does not attempt to vectorize loops which make such
> >    calls.
> >
> >    Given the memory reference passed to `__builtin_prefetch', in the
> >    absence of assurances about its effect on the passed memory
> >    location the compiler deems the function unsafe to vectorize,
> >    marking it as clobbering memory in `vect_find_stmt_data_reference'.
> >    This leads to the failure in autovectorization.
> >
> > 2. Notwithstanding the above issue, though the prefetch statement
> >    would be classed as `vect_unused_in_scope', the loop invariant that
> >    is used in the address of the prefetch is the scalar loop's and not
> >    the vector loop's IV.  That is, it still uses `i' and not `vec_iv'
> >    because the instruction wasn't vectorized, causing DCE to think the
> >    value is live, such that we now have both the vector and scalar loop
> >    invariants actively used in the loop.
> >
> > This patch addresses both of these:
> >
> > 1. About the issue regarding the memory clobber, data prefetch does
> >    not generate faults if its address argument is invalid and does not
> >    write to memory.  Therefore, it does not alter the internal state
> >    of the program or its control flow under any circumstance.  As
> >    such, it is reasonable that the function be marked as not affecting
> >    memory contents.
> >
> >    To achieve this, we add the necessary logic to
> >    `get_references_in_stmt' to ensure that builtin functions are given
> >    the same treatment as internal functions.  If the gimple call is to
> >    a builtin function and its function code is `BUILT_IN_PREFETCH', we
> >    mark `clobbers_memory' as false.
> >
> > 2. Finding precedent in the way clobber statements are handled,
> >    whereby the vectorizer drops these from both the scalar and
> >    vectorized versions of a given loop, we choose to drop prefetch
> >    hints in a similar fashion.  This seems appropriate given how
> >    software prefetch hints are typically ignored by processors across
> >    architectures, as they seldom lead to performance gain over their
> >    hardware counterparts.
> >
> >PR target/114061
> >
> > gcc/ChangeLog:
> >
> >   * tree-data-ref.cc (get_references_in_stmt): set
> >   `clobbers_memory' to false for __builtin_prefetch.
> >   * tree-vect-loop.cc (vect_transform_loop): Drop all
> >   __builtin_prefetch calls from loops.
> >
> > gcc/testsuite/ChangeLog:
> >
> >   * gcc.dg/vect/vect-prefetch-drop.c: New test.
> > ---
> >  gcc/testsuite/gcc.dg/vect/vect-prefetch-drop.c | 14 ++
> >  gcc/tree-data-ref.cc   |  9 +
> >  gcc/tree-vect-loop.cc  |  7 ++-
> >  3 files changed, 29 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-prefetch-drop.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-prefetch-drop.c
> > b/gcc/testsuite/gcc.dg/vect/vect-prefetch-drop.c
> > new file mode 100644
> > index 000..57723a8c972
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-prefetch-drop.c
> > @@ -0,0 +1,14 @@
> > +/* { dg-do compile { target { aarch64*-*-* } } } */
> > +/* { dg-additional-options "-O3 -march=armv9.2-a+sve -fdump-tree-vect-details" { target { aarch64*-*-* } } } */
> > +
>
> See the review about two-way dotprod for comments on this.
> However this specific test does not need to check for any assembly 
> instructions.
>
> You're going from being unable to vectorize a function, to being able to
> vectorize it.
>
> So the `vectorized 1 loops` check is sufficient, then this will work for all 
> targets.
> This requires a check on vect_double (see 
> gcc/testsuite/lib/target-supports.exp)
>
> I'd also change the loop to just use int, as more targets will support
> vectorizing those (and of course add a vect_int check instead).
>
> > +void foo(double * restrict a, double * restrict b, int n){
> > +  int i;
> > +  for(i=0; i<n; ++i){
> > +    a[i] = a[i] + b[i];
> > +    __builtin_prefetch(&(b[i+8]));
> > +  }
> > +}
> > +
> > 
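
Point 1 of the description boils down to a one-line classification
change.  A sketch of the idea (the placement inside
`get_references_in_stmt' is hypothetical; the actual hunk is not shown
above):

  /* A data prefetch neither faults nor writes program-visible memory,
     so the call need not be treated as a memory clobber.  */
  if (gimple_call_builtin_p (stmt, BUILT_IN_PREFETCH))
    clobbers_memory = false;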

Re: [PATCH] middle-end: Expand {u|s}dot product support in autovectorizer

2024-05-16 Thread Richard Biener
On Thu, May 16, 2024 at 4:40 PM Victor Do Nascimento
 wrote:
>
> From: Victor Do Nascimento 
>
> At present, the compiler offers the `{u|s|us}dot_prod_optab' direct
> optabs for dealing with vectorizable dot product code sequences.  The
> consequence of using a direct optab for this is that backend-pattern
> selection is only ever able to match against one datatype - Either
> that of the operands or of the accumulated value, never both.
>
> With the introduction of the 2-way (un)signed dot-product insn [1][2]
> in AArch64 SVE2, the existing direct opcode approach is no longer
> sufficient for full specification of all the possible dot product
> machine instructions to be matched to the code sequence; a dot product
> resulting in VNx4SI may result from either dot products on VNx16QI or
> VNx8HI values for the 4- and 2-way dot product operations, respectively.
>
> This means that the following example fails autovectorization:
>
> uint32_t foo(int n, uint16_t* data) {
>   uint32_t sum = 0;
>   for (int i=0; i<n; ++i) {
>     sum += data[i] * data[i];
>   }
>   return sum;
> }
>
> To remedy the issue a new optab is added, tentatively named
> `udot_prod_twoway_optab', whose selection is dependent upon checking
> of both input and output types involved in the operation.

I don't like this too much.  I'll note we document dot_prod as

@cindex @code{sdot_prod@var{m}} instruction pattern
@item @samp{sdot_prod@var{m}}

Compute the sum of the products of two signed elements.
Operand 1 and operand 2 are of the same mode. Their
product, which is of a wider mode, is computed and added to operand 3.
Operand 3 is of a mode equal or wider than the mode of the product. The
result is placed in operand 0, which is of the same mode as operand 3.
@var{m} is the mode of operand 1 and operand 2.

with no restriction on the wider mode but we don't specify it which is
bad design.  This should have been a convert optab with two modes
from the start - adding a _twoway variant is just a hack.

Richard.

> In order to minimize changes to the existing codebase,
> `optab_for_tree_code' is renamed `optab_for_tree_code_1' and a new
> argument is added to its signature - `const_tree otype', allowing type
> information to be specified for both input and output types.  The
> existing interface is retained by defining a new `optab_for_tree_code',
> which serves as a shim to `optab_for_tree_code_1', passing old
> parameters as-is and setting the new `optype' argument to `NULL_TREE'.
>
> For DOT_PROD_EXPR tree codes, we can call `optab_for_tree_code_1'
> directly, passing it both types, adding the internal logic to the
> function to distinguish between competing optabs.
>
> Finally, necessary changes are made to `expand_widen_pattern_expr' to
> ensure the new icode can be correctly selected, given the new optab.
>
> [1] 
> https://developer.arm.com/documentation/ddi0602/2024-03/SVE-Instructions/UDOT--2-way--vectors---Unsigned-integer-dot-product-
> [2] 
> https://developer.arm.com/documentation/ddi0602/2024-03/SVE-Instructions/SDOT--2-way--vectors---Signed-integer-dot-product-
>
> gcc/ChangeLog:
>
> * config/aarch64/aarch64-sve2.md (@aarch64_sve_dotvnx4sivnx8hi):
> renamed to `dot_prod_twoway_vnx8hi'.
> * config/aarch64/aarch64-sve-builtins-base.cc (svdot_impl.expand):
> update icodes used in line with above rename.
> * optabs-tree.cc (optab_for_tree_code_1): Renamed
> `optab_for_tree_code' and added new argument.
> (optab_for_tree_code): Now a call to `optab_for_tree_code_1'.
> * optabs-tree.h (optab_for_tree_code_1): New.
> * optabs.cc (expand_widen_pattern_expr): Expand support for
> DOT_PROD_EXPR patterns.
> * optabs.def (udot_prod_twoway_optab): New.
> (sdot_prod_twoway_optab): Likewise.
> * tree-vect-patterns.cc (vect_supportable_direct_optab_p): Add
> support for misc optabs that use two modes.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/vect/vect-dotprod-twoway.c: New.
> ---
>  .../aarch64/aarch64-sve-builtins-base.cc  |  4 ++--
>  gcc/config/aarch64/aarch64-sve2.md|  2 +-
>  gcc/optabs-tree.cc| 23 --
>  gcc/optabs-tree.h |  2 ++
>  gcc/optabs.cc |  2 +-
>  gcc/optabs.def|  2 ++
>  .../gcc.dg/vect/vect-dotprod-twoway.c | 24 +++
>  gcc/tree-vect-patterns.cc |  2 +-
>  8 files changed, 54 insertions(+), 7 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-dotprod-twoway.c
>
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> index 0d2edf3f19e..e457db09f66 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> @@ -764,8 +764,8 @@ public:
>icode = (e.type_suffix (0).float_p
>? 

Re: [PATCH] tree-optimization/13962 - handle ptr-ptr compares in ptrs_compare_unequal

2024-05-16 Thread Richard Biener
On Thu, 16 May 2024, Jeff Law wrote:

> 
> 
> On 5/16/24 6:03 AM, Richard Biener wrote:
> > Now that we handle pt.null conservatively we can implement the missing
> > tracking of constant pool entries (aka STRING_CST) and handle
> > ptr-ptr compares using points-to info in ptrs_compare_unequal.
> > 
> > Bootstrapped on x86_64-unknown-linux-gnu, (re-)testing in progress.
> > 
> > Richard.
> > 
> >  PR tree-optimization/13962
> >  PR tree-optimization/96564
> >  * tree-ssa-alias.h (pt_solution::const_pool): New flag.
> >  * tree-ssa-alias.cc (ptrs_compare_unequal): Handle pointer-pointer
> >  compares.
> >  (dump_points_to_solution): Dump the const_pool flag, fix guard
> >  of flag dumping.
> >  * gimple-pretty-print.cc (pp_points_to_solution): Likewise.
> >  * tree-ssa-structalias.cc (find_what_var_points_to): Set
> >  the const_pool flag for STRING.
> >  (pt_solution_ior_into): Handle the const_pool flag.
> >  (ipa_escaped_pt): Initialize it.
> > 
> >  * gcc.dg/tree-ssa/alias-39.c: New testcase.
> >  * g++.dg/vect/pr68145.cc: Use -fno-tree-pta to avoid UB
> >  to manifest in transforms no longer vectorizing this testcase
> >  for an ICE.
> You might want to test this against 92539 as well.  There's a nonzero chance
> it'll resolve that one.

Unfortunately it doesn't.

Richard.
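
The kind of compare this enables folding is easiest to show directly
(a hand-written illustration in the spirit of the new alias-39.c
testcase, not its contents):

  int
  provably_unequal (void)
  {
    int a, b;
    int *p = &a, *q = &b;
    /* Points-to shows p and q reference distinct objects, so the
       compare folds to 0.  */
    return p == q;
  }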


  1   2   3   4   5   6   7   8   9   10   >