The following is a prototype for how to represent load/store-lanes within
SLP.  For now I have settled on a single load node with multiple permute
nodes (one for each loaded lane), and a single store node with a single
permute node feeding it.
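Schematically, as a summary of the dump below (this is not actual dump
output), the graph for the two-lane example has this shape:

  store node (ldst_lanes)
    -> store permute node (ldst_lanes)          fills the vect_array for .STORE_LANES
         -> operation node for lane 0 (+ 7)
              -> load permute node { 0[0] } (ldst_lanes)   extracts lane 0
                   -> load node (ldst_lanes)    one .LOAD_LANES for the whole group
         -> operation node for lane 1 (* 3)
              -> load permute node { 0[1] } (ldst_lanes)   extracts lane 1
                   -> load node (ldst_lanes)    the same, shared load node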
For

  for (int i = 0; i < 1024; ++i)
    {
      a[2*i] = b[2*i] + 7;
      a[2*i+1] = b[2*i+1] * 3;
    }

you get the following SLP graph, where I explain how things are set up
and code-generated:

t.c:3:21: note:   SLP graph after lowering permutations:
t.c:3:21: note:   node 0x578af60 (max_nunits=1, refcnt=1) vector([4,4]) int
t.c:3:21: note:   op template: *_6 = _7;
t.c:3:21: note:     stmt 0 *_6 = _7;
t.c:3:21: note:     stmt 1 *_12 = _13;
t.c:3:21: note:     children 0x578aff8

This is the store node; it is marked with ldst_lanes = true during SLP
discovery.  This node code-generates

  .MASK_LEN_STORE_LANES (vectp_a.10_30, 32B, { -1, ... }, loop_len_38, 0, vect_array.9);

t.c:3:21: note:   node 0x578aff8 (max_nunits=1, refcnt=1) vector([4,4]) int
t.c:3:21: note:   op: VEC_PERM_EXPR
t.c:3:21: note:     { }
t.c:3:21: note:     lane permutation { 0[0] 1[0] }
t.c:3:21: note:     children 0x578ab38 0x578ad98

This is the store-lane shuffling node; it is also marked with
ldst_lanes = true during SLP discovery.  This node code-generates

  vect_array.9[0] = vect__7.7_35;
  vect_array.9[1] = vect__13.8_33;

and it records the virtual defs as its vectorized defs.  This is a bit
awkward; on the store side it would be nicer to elide 'vect_array' and
instead make .MASK_LEN_STORE_LANES "varargs", passing the vector defs as
arguments (sketched after the dump below).

t.c:3:21: note:   node 0x578ab38 (max_nunits=4, refcnt=2) vector([4,4]) int
t.c:3:21: note:   op template: _7 = _5 + 7;
t.c:3:21: note:     stmt 0 _7 = _5 + 7;
t.c:3:21: note:     children 0x578abd0 0x578ac68
t.c:3:21: note:   node (constant) 0x578ac68 (max_nunits=1, refcnt=1)
t.c:3:21: note:     { 7 }
t.c:3:21: note:   node 0x578b978 (max_nunits=2, refcnt=2) vector(2) int
t.c:3:21: note:   op template: _13 = _11 * 3;
t.c:3:21: note:     stmt 0 _13 = _11 * 3;
t.c:3:21: note:     children 0x578b4b8 0x578b8e0
t.c:3:21: note:   node (constant) 0x578b8e0 (max_nunits=1, refcnt=1)
t.c:3:21: note:     { 3 }
t.c:3:21: note:   node 0x578abd0 (max_nunits=4, refcnt=2) vector([4,4]) int
t.c:3:21: note:   op: VEC_PERM_EXPR
t.c:3:21: note:     stmt 0 _5 = *_4;
t.c:3:21: note:     lane permutation { 0[0] }
t.c:3:21: note:     children 0x578b090
t.c:3:21: note:   node 0x578b4b8 (max_nunits=2, refcnt=2) vector(2) int
t.c:3:21: note:   op: VEC_PERM_EXPR
t.c:3:21: note:     stmt 0 _11 = *_10;
t.c:3:21: note:     lane permutation { 0[1] }
t.c:3:21: note:     children 0x578bbd8

The above two are the load-lane shufflings; they are marked with
ldst_lanes during load permutation lowering.  They code-generate

  _34 = vect_array.6[1];
  _36 = vect_array.6[0];

picking up vect_array.6 by looking at the virtual operand def in the
child node.  As with the store side this is awkward, but much harder to
avoid since we lack support for multiple SSA defs.  What could be done,
though, is to write vect_array.6 into SSA and use BIT_FIELD_REF for the
extracts (yes, that would require ARRAY_TYPE gimple registers); that
alternative is also sketched after the dump below.

t.c:3:21: note:   node 0x578b090 (max_nunits=4, refcnt=3) vector([4,4]) int
t.c:3:21: note:   op template: _5 = *_4;
t.c:3:21: note:     stmt 0 _5 = *_4;
t.c:3:21: note:     stmt 1 _11 = *_10;
t.c:3:21: note:     load permutation { 0 1 }

This is the load itself, marked ldst_lanes during load permutation
lowering.  It code-generates

  vect_array.6 = .MASK_LEN_LOAD_LANES (vectp_b.4_40, 32B, { -1, ... }, loop_len_38, 0);
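To make the two alternatives mentioned above concrete, here is a rough
sketch of the IL they would aim for - purely hypothetical, the "varargs"
form of .MASK_LEN_STORE_LANES does not exist and array_48 is a made-up
SSA name:

  ;; Load side: the loaded array as an SSA value, lanes extracted with
  ;; BIT_FIELD_REF instead of vect_array.6[n] (needs ARRAY_TYPE gimple
  ;; registers).
  array_48 = .MASK_LEN_LOAD_LANES (vectp_b.4_40, 32B, { -1, ... }, loop_len_38, 0);
  _36 = BIT_FIELD_REF <array_48, ..., 0>;
  _34 = BIT_FIELD_REF <array_48, ..., ...>;

  ;; Store side: a "varargs" store-lanes taking the vector defs directly,
  ;; eliding vect_array.9.
  .MASK_LEN_STORE_LANES (vectp_a.10_30, 32B, { -1, ... }, loop_len_38, 0,
                         vect__7.7_35, vect__13.8_33);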
The patch below (tested with just the above testcase and on riscv) runs
into the following, where I have not yet figured out what's wrong:

t.c:1:6: internal compiler error: Segmentation fault
    1 | void foo (int * __restrict a, int *b)
      |      ^~~
0x3ea25f5 internal_error(char const*, ...)
        /home/rguenther/src/gcc/gcc/diagnostic-global-context.cc:491
0x1ac12b6 crash_signal
        /home/rguenther/src/gcc/gcc/toplev.cc:319
0x10af391 tree_check(tree_node const*, char const*, int, char const*, tree_code)
        /home/rguenther/src/gcc/gcc/tree.h:3919
0x1140a0a TYPE_VECTOR_SUBPARTS(tree_node const*)
        /home/rguenther/src/gcc/gcc/tree.h:4254
0x1e7ab36 vect_set_loop_controls_directly
        /home/rguenther/src/gcc/gcc/tree-vect-loop-manip.cc:514

The patch gets ncopies wrong (I'm sure of that), and the way I create
virtual operands is surely going to break things.  It likely also gets
the support for doing load-lanes for multi-lane SLP sub-groups wrong.

The way this SLP representation forces the IFN and the array-access code
generation to be separated introduces problems not present with non-SLP.
Keeping the operation in one node would be manageable on the store side
by recording an incoming lane permutation for it.  On the load side we
could code-generate everything from the load node and make a special
contract between the "extract" permute and the load (but I suppose that's
already present with the patch below); the extract would then become just
a forwarder doing no code generation.
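Purely for illustration, the store side of that single-node alternative
might dump along the following lines (hypothetical output, node addresses
reused from above) - the incoming lane permutation is recorded on the
store node itself and the separate VEC_PERM node disappears:

t.c:3:21: note:   node 0x578af60 (max_nunits=1, refcnt=1) vector([4,4]) int
t.c:3:21: note:   op template: *_6 = _7;
t.c:3:21: note:     lane permutation { 0[0] 1[0] }
t.c:3:21: note:     children 0x578ab38 0x578ad98

The store node would then emit both the vect_array fills and the
.MASK_LEN_STORE_LANES call itself.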
I'm posting this as-is; I will probably try the alternative to decide
which is more ugly.  I guess I'm hoping for magic other ideas here ;)

Thanks,
Richard.

	* tree-vectorizer.h (_slp_tree::ldst_lanes): New flag to mark
	load, store and permute nodes.
	* tree-vect-slp.cc (vect_build_slp_instance): For stores, iff
	the target prefers store-lanes, discover single-lane sub-groups
	and do not perform interleaving lowering but mark the nodes
	with ldst_lanes.
	(vect_lower_load_permutations): When the target supports
	load-lanes and the loads all fit the pattern, split out only a
	single level of permutes and mark the load and permute nodes
	with ldst_lanes.
	(vectorizable_slp_permutation_1): Code-generate the
	load-/store-lane permute parts.
	* tree-vect-stmts.cc (get_group_load_store_type): Support
	load/store-lanes for SLP.
	(vectorizable_store): Hack code generation for store-lanes.
	(vectorizable_load): Hack code generation for load-lanes.
---
 gcc/tree-vect-slp.cc   | 146 +++++++++++++++++++++++++++++++++++++++--
 gcc/tree-vect-stmts.cc |  63 ++++++++++++++----
 gcc/tree-vectorizer.h  |   3 +
 3 files changed, 192 insertions(+), 20 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 2dc6d365303..14a3e166e73 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -120,6 +120,7 @@ _slp_tree::_slp_tree ()
   SLP_TREE_SIMD_CLONE_INFO (this) = vNULL;
   SLP_TREE_DEF_TYPE (this) = vect_uninitialized_def;
   SLP_TREE_CODE (this) = ERROR_MARK;
+  this->ldst_lanes = false;
   SLP_TREE_VECTYPE (this) = NULL_TREE;
   SLP_TREE_REPRESENTATIVE (this) = NULL;
   SLP_TREE_REF_COUNT (this) = 1;
@@ -3601,9 +3602,22 @@ vect_build_slp_instance (vec_info *vinfo,
 	 size >= 1.  */
       else if (is_a <loop_vec_info> (vinfo)
 	       && (i > 0 && i < group_size)
-	       && !vect_slp_prefer_store_lanes_p (vinfo,
-						  stmt_info, group_size, i))
-	{
+	       /*&& !vect_slp_prefer_store_lanes_p (vinfo,
+						  stmt_info, group_size, i)*/)
+	{
+	  /* There are targets that cannot do even/odd interleaving schemes
+	     so they absolutely need to use load/store-lanes.  For now
+	     force single-lane SLP for them - they would be happy with
+	     uniform power-of-two lanes (but depending on element size),
+	     but even if we can use 'i' as indicator we would need to
+	     backtrack when later lanes fail to discover with the same
+	     granularity.  */
+	  bool want_store_lanes
+	    = vect_slp_prefer_store_lanes_p (vinfo,
+					     stmt_info, group_size, 1 /*i*/);
+	  if (want_store_lanes)
+	    i = 1;
+
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_NOTE, vect_location,
 			     "Splitting SLP group at stmt %u\n", i);
@@ -3637,7 +3651,10 @@ vect_build_slp_instance (vec_info *vinfo,
 						      (max_nunits, end - start));
 	  rhs_nodes.safe_push (node);
 	  start = end;
-	  end = group_size;
+	  if (want_store_lanes)
+	    end = start + 1;
+	  else
+	    end = group_size;
 	}
       else
 	{
@@ -3711,7 +3728,14 @@ vect_build_slp_instance (vec_info *vinfo,
 	 we end up with two inputs.  That scheme makes sure we end up
 	 with permutes satisfying the restriction of requiring at
 	 most two vector inputs to produce a single vector output
-	 when the number of lanes is even.  */
+	 when the number of lanes is even.
+	 Leave the merge as is when we want to use store-lanes.  */
+      if (want_store_lanes)
+	{
+	  perm->ldst_lanes = 1;
+	  node->ldst_lanes = 1;
+	}
+      else
       while (SLP_TREE_CHILDREN (perm).length () > 2)
 	{
 	  /* When we have three equal sized groups left the pairwise
@@ -4057,6 +4081,36 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
   if (exact_log2 (group_lanes) == -1 && group_lanes != 3)
     return;
 
+  /* Verify if all load permutations can be implemented with a suitably
+     large element load-lanes operation.  */
+  unsigned ld_lanes_lanes = SLP_TREE_LANES (loads[0]);
+  if (exact_log2 (ld_lanes_lanes) == -1
+      /* ???  Need to handle ld_lanes_lanes != 1 correctly.  */
+      || !vect_load_lanes_supported (SLP_TREE_VECTYPE (loads[0]),
+				     group_lanes, false))
+    ld_lanes_lanes = 0;
+  else
+    for (slp_tree load : loads)
+      {
+	if (SLP_TREE_LANES (load) != ld_lanes_lanes)
+	  {
+	    ld_lanes_lanes = 0;
+	    break;
+	  }
+	unsigned first = SLP_TREE_LOAD_PERMUTATION (load)[0];
+	if (first % ld_lanes_lanes != 0)
+	  {
+	    ld_lanes_lanes = 0;
+	    break;
+	  }
+	for (unsigned i = 1; i < SLP_TREE_LANES (load); ++i)
+	  if (SLP_TREE_LOAD_PERMUTATION (load)[i] != first + i)
+	    {
+	      ld_lanes_lanes = 0;
+	      break;
+	    }
+      }
+
   for (slp_tree load : loads)
     {
       /* Leave masked or gather loads alone for now.  */
@@ -4071,7 +4125,8 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	 with a non-1:1 load permutation around instead of canonicalizing
 	 those into a load and a permute node.  Removing this early
 	 check would do such canonicalization.  */
-      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
+      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2
+	  && ld_lanes_lanes == 0)
 	continue;
 
       /* First build (and possibly re-use) a load node for the
@@ -4104,6 +4159,12 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	final_perm.quick_push
 	  (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i]));
 
+      if (ld_lanes_lanes != 0)
+	{
+	  l0->ldst_lanes = true;
+	  load->ldst_lanes = true;
+	}
+      else
       while (1)
 	{
 	  unsigned group_lanes = SLP_TREE_LANES (l0);
@@ -9705,6 +9766,8 @@ vect_add_slp_permutation (vec_info *vinfo, gimple_stmt_iterator *gsi,
   node->push_vec_def (perm_stmt);
 }
 
+extern tree create_vector_array (tree elem_type, unsigned HOST_WIDE_INT nelems);
+
 /* Subroutine of vectorizable_slp_permutation.  Check whether the target
    can perform permutation PERM on the (1 or 2) input nodes in CHILDREN.
    If GSI is nonnull, emit the permutation there.
@@ -9758,6 +9821,77 @@ vectorizable_slp_permutation_1 (vec_info *vinfo, gimple_stmt_iterator *gsi,
 
   gcc_assert (perm.length () == SLP_TREE_LANES (node));
 
+  /* Load/store-lanes permute.  */
+  if (node->ldst_lanes)
+    {
+      if (!gsi)
+	/* ???  We should have this flag set on the permute before a store
+	   or the permutes after a load.  Check vect_load_lanes_supported
+	   or leave that to the store?  */
+	return 1;
+      /* For the load case there should be a single child and
+	 its vector def should be the virtual SSA def of the
+	 .LOAD_LANES call.  Create the vect__5.7_48 = vect_array.6[n]
+	 load here plus punning of the vector type (in case an element
+	 represents multiple lanes).  */
+      if (children.length () == 1)
+	{
+	  slp_tree child = children[0];
+	  unsigned vec_num = (SLP_TREE_LANE_PERMUTATION (node)[0].second
+			      / SLP_TREE_LANES (node));
+	  for (tree def : SLP_TREE_VEC_DEFS (child))
+	    {
+	      gcc_assert (virtual_operand_p (def));
+	      tree decl = gimple_call_lhs (SSA_NAME_DEF_STMT (def));
+	      tree vec = make_ssa_name (TREE_TYPE (TREE_TYPE (decl)));
+	      gassign *new_stmt
+		= gimple_build_assign (vec,
+				       build4 (ARRAY_REF, TREE_TYPE (vec),
+					       decl, build_int_cstu
+						       (unsigned_type_node, vec_num),
+					       NULL_TREE, NULL_TREE));
+	      vect_finish_stmt_generation (vinfo, NULL, new_stmt, gsi);
+	      if (!useless_type_conversion_p (vectype, TREE_TYPE (vec)))
+		{
+		  vec = make_ssa_name (vectype);
+		  new_stmt = gimple_build_assign (vec, build1 (VIEW_CONVERT_EXPR, vectype, gimple_assign_lhs (new_stmt)));
+		  vect_finish_stmt_generation (vinfo, NULL, new_stmt, gsi);
+		}
+	      node->push_vec_def (vec);
+	    }
+	  return SLP_TREE_VEC_DEFS (node).length ();
+	}
+      /* For the store case create the array and the set of stores from
+	 all children vector defs.  The vector def should then be the
+	 virtual def of the last store into the array.  */
+      else
+	{
+	  unsigned vec_num = SLP_TREE_VEC_DEFS (children[0]).length ();
+	  for (unsigned i = 0; i < vec_num; ++i)
+	    {
+	      tree decl = create_vector_array (op_vectype,
+					       SLP_TREE_LANES (node)
+					       / SLP_TREE_LANES (children[0]));
+	      tree last_vdef = NULL_TREE;
+	      for (unsigned j = 0; j < children.length (); ++j)
+		{
+		  slp_tree child = children[j];
+		  tree ref = build4 (ARRAY_REF, op_vectype, decl,
+				     build_int_cstu (unsigned_type_node, j),
+				     NULL_TREE, NULL_TREE);
+		  gimple *new_stmt
+		    = gimple_build_assign (ref, SLP_TREE_VEC_DEFS (child)[i]);
+		  vect_finish_stmt_generation (vinfo, NULL, new_stmt, gsi);
+		  last_vdef = make_ssa_name (gimple_vop (cfun));
+		  SSA_NAME_DEF_STMT (last_vdef) = new_stmt;
+		  gimple_set_vdef (new_stmt, last_vdef);
+		}
+	      node->push_vec_def (last_vdef);
+	    }
+	  return vec_num;
+	}
+    }
+
   /* REPEATING_P is true if every output vector is guaranteed to use the
      same permute vector.  We can handle that case for both variable-length
      and constant-length vectors, but we only handle other cases for
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index fdcda0d2aba..ee28accca8c 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -146,7 +146,7 @@ record_stmt_cost (stmt_vector_for_cost *body_cost_vec, int count,
 
 /* Return a variable of type ELEM_TYPE[NELEMS].  */
 
-static tree
+tree
 create_vector_array (tree elem_type, unsigned HOST_WIDE_INT nelems)
 {
   return create_tmp_var (build_array_type_nelts (elem_type, nelems),
@@ -2069,6 +2069,14 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
 	     is irrelevant for them.  */
 	  *alignment_support_scheme = dr_unaligned_supported;
 	}
+      /* Try using LOAD/STORE_LANES.  */
+      else if (slp_node->ldst_lanes
+	       && (*lanes_ifn
+		   = (vls_type == VLS_LOAD
+		      ? vect_load_lanes_supported (vectype, group_size, masked_p)
+		      : vect_store_lanes_supported (vectype, group_size,
+						    masked_p))) != IFN_LAST)
+	*memory_access_type = VMAT_LOAD_STORE_LANES;
       else
 	*memory_access_type = VMAT_CONTIGUOUS;
 
@@ -8762,7 +8770,7 @@ vectorizable_store (vec_info *vinfo,
 
   if (memory_access_type == VMAT_LOAD_STORE_LANES)
     {
-      gcc_assert (!slp && grouped_store);
+      //gcc_assert (!slp && grouped_store);
       unsigned inside_cost = 0, prologue_cost = 0;
       /* For costing some adjacent vector stores, we'd like to cost with
 	 the total number of them once instead of cost each one by one.  */
@@ -8784,7 +8792,7 @@ vectorizable_store (vec_info *vinfo,
 	  op = vect_get_store_rhs (next_stmt_info);
 	  if (costing_p)
 	    update_prologue_cost (&prologue_cost, op);
-	  else
+	  else if (!slp)
 	    {
 	      vect_get_vec_defs_for_operand (vinfo, next_stmt_info, ncopies,
 					     op,
@@ -8799,15 +8807,15 @@ vectorizable_store (vec_info *vinfo,
 	{
 	  if (mask)
 	    {
-	      vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
-					     mask, &vec_masks,
-					     mask_vectype);
+	      if (slp_node)
+		vect_get_slp_defs (mask_node, &vec_masks);
+	      else
+		vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
+					       mask, &vec_masks,
+					       mask_vectype);
 	      vec_mask = vec_masks[0];
 	    }
 
-	  /* We should have catched mismatched types earlier.  */
-	  gcc_assert (
-	    useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
 	  dataref_ptr
 	    = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
 					NULL, offset, &dummy,
@@ -8819,11 +8827,17 @@ vectorizable_store (vec_info *vinfo,
 	    gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
 	  /* DR_CHAIN is then used as an input to
 	     vect_permute_store_chain().  */
+	  if (!slp)
+	    {
+	      /* We should have catched mismatched types earlier.  */
+	      gcc_assert (
+		useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
 	  for (i = 0; i < group_size; i++)
 	    {
 	      vec_oprnd = (*gvec_oprnds[i])[j];
 	      dr_chain[i] = vec_oprnd;
 	    }
+	  }
 	  if (mask)
 	    vec_mask = vec_masks[j];
 	  dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr, gsi,
@@ -8836,8 +8850,16 @@ vectorizable_store (vec_info *vinfo,
 	      continue;
 	    }
 
+	  tree vec_array;
+	  if (slp)
+	    {
+	      tree def = SLP_TREE_VEC_DEFS (SLP_TREE_CHILDREN (slp_node)[0])[j];
+	      vec_array = TREE_OPERAND (gimple_assign_lhs (SSA_NAME_DEF_STMT (def)), 0);
+	    }
+	  else
+	    {
 	  /* Get an array into which we can store the individual vectors.  */
-	  tree vec_array = create_vector_array (vectype, vec_num);
+	  vec_array = create_vector_array (vectype, vec_num);
 
 	  /* Invalidate the current contents of VEC_ARRAY.  This should become
 	     an RTL clobber too, which prevents the vector registers
@@ -8851,6 +8873,7 @@ vectorizable_store (vec_info *vinfo,
 	      write_vector_array (vinfo, stmt_info, gsi,
 				  vec_oprnd, vec_array, i);
 	    }
+	  }
 
 	  tree final_mask = NULL;
 	  tree final_len = NULL;
@@ -8917,9 +8940,10 @@ vectorizable_store (vec_info *vinfo,
 
 	  /* Record that VEC_ARRAY is now dead.  */
 	  vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
-	  if (j == 0)
+	  if (j == 0 && !slp)
 	    *vec_stmt = new_stmt;
-	  STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
+	  if (!slp)
+	    STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
 	}
 
       if (costing_p)
@@ -10765,7 +10789,7 @@ vectorizable_load (vec_info *vinfo,
     {
       gcc_assert (alignment_support_scheme == dr_aligned
 		  || alignment_support_scheme == dr_unaligned_supported);
-      gcc_assert (grouped_load && !slp);
+      //gcc_assert (grouped_load && !slp);
 
       unsigned int inside_cost = 0, prologue_cost = 0;
       /* For costing some adjacent vector loads, we'd like to cost with
@@ -10884,6 +10908,15 @@ vectorizable_load (vec_info *vinfo,
       gimple_call_set_nothrow (call, true);
       vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
 
+      if (slp)
+	{
+	  tree def = copy_ssa_name (gimple_vuse (call));
+	  gimple_set_vdef (call, def);
+	  SSA_NAME_DEF_STMT (def) = call;
+	  slp_node->push_vec_def (def);
+	}
+      else
+	{
       dr_chain.create (vec_num);
       /* Extract each vector into an SSA_NAME.  */
       for (i = 0; i < vec_num; i++)
@@ -10900,8 +10933,10 @@ vectorizable_load (vec_info *vinfo,
 
       vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
       dr_chain.release ();
+	}
 
-      *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
+      if (!slp_node)
+	*vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
     }
 
   if (costing_p)
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 8eb3ec4df86..ac288541c51 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -222,6 +222,9 @@ struct _slp_tree {
   unsigned int lanes;
   /* The operation of this node.  */
   enum tree_code code;
+  /* Whether uses of this load or feeders of this store are suitable
+     for load/store-lanes.  */
+  bool ldst_lanes;
 
   int vertex;
 
-- 
2.43.0