On Fri, 2 Sep 2022, Richard Sandiford wrote:

> In the PR we have two REDUC_PLUS SLP instances that share a common
> load of stride 4.  Each instance also has a unique contiguous load.
> 
> Initially all three loads are out of order, so have a nontrivial
> load permutation.  The layout pass puts them in order instead.
> For the two contiguous loads it is possible to do this by adjusting the
> SLP_LOAD_PERMUTATION to be { 0, 1, 2, 3 }.  But a SLP_LOAD_PERMUTATION
> of { 0, 4, 8, 12 } is rejected as unsupported, so the pass creates a
> separate VEC_PERM_EXPR instead.
> 
> Later the 4-stride load's initial SLP_LOAD_PERMUTATION is rejected too,
> so that the load gets replaced by an external node built from scalars.
> We then have an external node feeding a VEC_PERM_EXPR.
> 
> VEC_PERM_EXPRs created in this way do not have any associated
> SLP_TREE_SCALAR_STMTS.  This means that they do not affect the
> decision about which nodes should be in which subgraph for costing
> purposes.  If the VEC_PERM_EXPR is fed by a vect_external_def,
> then the VEC_PERM_EXPR's input doesn't affect that decision either.
> 
> The net effect is that a shared VEC_PERM_EXPR fed by an external def
> can appear in more than one subgraph.  This triggered an ICE in
> vect_schedule_node, which (rightly) expects to be called no more
> than once for the same internal def.
> 
> There seemed to be many possible fixes, including:
> 
> (1) Replace unsupported loads with external defs *before* doing
>     the layout optimisation.  This would avoid the need for the
>     VEC_PERM_EXPR altogether.
>
> (2) If the target doesn't support a load in its original layout,
>     stop the layout optimisation from checking whether the target
>     supports loads in any new candidate layout.  In other words,
>     treat all layouts as if they were supported whenever the
>     original layout is not in fact supported.
> 
>     I'd rather not do this.  In principle, the layout optimisation
>     could convert an unsupported layout to a supported one.
>     Selectively ignoring target support would work against that.
> 
>     We could try to look specifically for loads that will need
>     to be decomposed, but that just seems like admitting that
>     things are happening in the wrong order.
> 
> (3) Add SLP_TREE_SCALAR_STMTS to VEC_PERM_EXPRs.
> 
>     That would be OK for this case, but wouldn't be possible
>     for external defs that represent existing vectors.

In general it's good to provide SLP_TREE_SCALAR_STMTS when we
can, but yes, that's not a fix for the actual problem.

> (4) Make vect_schedule_slp share SCC info between subgraphs.
> 
>     It feels like that's working around the partitioning problem
>     rather than a real fix though.
> 
> (5) Directly ensure that internal def nodes belong to a single
>     subgraph.
> 
> (1) is probably the best long-term fix, but (5) is much simpler.
> The subgraph partitioning code already has a hash set to record
> which nodes have been visited; we just need to convert that to a
> map from nodes to instances instead.

Agreed.
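For readers outside GCC, the map-plus-leader scheme the patch converts to can be modeled in isolation.  The sketch below is a hypothetical standalone analogue, not the GCC code itself: it uses std::unordered_map in place of GCC's hash_map, plain int ids in place of slp_instance pointers, and invented names (map_to_instance, instance_id).

```cpp
#include <cassert>
#include <unordered_map>

// Instances are modeled as plain int ids; GCC uses slp_instance pointers.
using instance_id = int;

// Follow the leader chain to its ultimate end, compressing the path as
// we go (analogous to get_ultimate_leader in tree-vect-slp.cc).
static instance_id
get_ultimate_leader (instance_id i,
                     std::unordered_map<instance_id, instance_id> &leader)
{
  while (leader[i] != i)
    i = leader[i] = leader[leader[i]];
  return i;
}

// Record that KEY (a stmt or node id) belongs to INST.  If KEY was
// already claimed by a different instance, make INST the leader of that
// instance's ultimate leader, merging the two partitions.  Return true
// if KEY had been visited before (analogous to vect_map_to_instance).
template <typename T>
static bool
map_to_instance (instance_id inst, T key,
                 std::unordered_map<T, instance_id> &key_to_instance,
                 std::unordered_map<instance_id, instance_id> &leader)
{
  auto [it, inserted] = key_to_instance.emplace (key, inst);
  if (!inserted && it->second != inst)
    {
      instance_id key_leader = get_ultimate_leader (it->second, leader);
      if (key_leader != inst)
        leader[key_leader] = inst;
    }
  it->second = inst;
  return !inserted;
}
```

The "visited before" return value is what lets the walker stop recursing at a shared node while still merging the two instances' partitions under one ultimate leader, so each internal def ends up costed and scheduled in exactly one subgraph.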

> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?

OK.

Thanks,
Richard.

> Richard
> 
> 
> gcc/
>       PR tree-optimization/106787
>       * tree-vect-slp.cc (vect_map_to_instance): New function, split out
>       from...
>       (vect_bb_partition_graph_r): ...here.  Replace the visited set
>       with a map from nodes to instances.  Ensure that a node only
>       appears in one partition.
>       (vect_bb_partition_graph): Update accordingly.
> 
> gcc/testsuite/
>       * gcc.dg/vect/bb-slp-layout-19.c: New test.
> ---
>  gcc/testsuite/gcc.dg/vect/bb-slp-layout-19.c | 34 ++++++++++
>  gcc/tree-vect-slp.cc                         | 69 ++++++++++++--------
>  2 files changed, 77 insertions(+), 26 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-layout-19.c
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-19.c b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-19.c
> new file mode 100644
> index 00000000000..f075a83a25b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-19.c
> @@ -0,0 +1,34 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-fno-tree-loop-vectorize" } */
> +
> +extern int a[][4], b[][4], c[][4], d[4], e[4];
> +void f()
> +{
> +  int t0 = a[0][3];
> +  int t1 = a[1][3];
> +  int t2 = a[2][3];
> +  int t3 = a[3][3];
> +  int a0 = 0, a1 = 0, a2 = 0, a3 = 0, b0 = 0, b1 = 0, b2 = 0, b3 = 0;
> +  for (int j = 0; j < 100; ++j)
> +    for (int i = 0; i < 400; i += 4)
> +      {
> +     a0 += b[i][3] * t0;
> +     a1 += b[i][2] * t1;
> +     a2 += b[i][1] * t2;
> +     a3 += b[i][0] * t3;
> +     b0 += c[i][3] * t0;
> +     b1 += c[i][2] * t1;
> +     b2 += c[i][1] * t2;
> +     b3 += c[i][0] * t3;
> +      }
> +  d[0] = a0;
> +  d[1] = a1;
> +  d[2] = a2;
> +  d[3] = a3;
> +  e[0] = b0;
> +  e[1] = b1;
> +  e[2] = b2;
> +  e[3] = b3;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "add new stmt: \[^\\n\\r\]* = VEC_PERM_EXPR" 3 "slp1" { target { vect_int_mult && vect_perm } } } } */
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 1cf79eee4a6..59ec66a6f96 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -6435,47 +6435,64 @@ get_ultimate_leader (slp_instance instance,
>    return instance;
>  }
>  
> +namespace {
> +/* Subroutine of vect_bb_partition_graph_r.  Map KEY to INSTANCE in
> +   KEY_TO_INSTANCE, making INSTANCE the leader of any previous mapping
> +   for KEY.  Return true if KEY was already in KEY_TO_INSTANCE.
> +
> +   INSTANCE_LEADER is as for get_ultimate_leader.  */
> +
> +template<typename T>
> +bool
> +vect_map_to_instance (slp_instance instance, T key,
> +                   hash_map<T, slp_instance> &key_to_instance,
> +                   hash_map<slp_instance, slp_instance> &instance_leader)
> +{
> +  bool existed_p;
> +  slp_instance &key_instance = key_to_instance.get_or_insert (key, &existed_p);
> +  if (!existed_p)
> +    ;
> +  else if (key_instance != instance)
> +    {
> +      /* If we're running into a previously marked key make us the
> +      leader of the current ultimate leader.  This keeps the
> +      leader chain acyclic and works even when the current instance
> +      connects two previously independent graph parts.  */
> +      slp_instance key_leader
> +     = get_ultimate_leader (key_instance, instance_leader);
> +      if (key_leader != instance)
> +     instance_leader.put (key_leader, instance);
> +    }
> +  key_instance = instance;
> +  return existed_p;
> +}
> +}
> +
>  /* Worker of vect_bb_partition_graph, recurse on NODE.  */
>  
>  static void
>  vect_bb_partition_graph_r (bb_vec_info bb_vinfo,
>                          slp_instance instance, slp_tree node,
>                          hash_map<stmt_vec_info, slp_instance> &stmt_to_instance,
> -                        hash_map<slp_instance, slp_instance> &instance_leader,
> -                        hash_set<slp_tree> &visited)
> +                        hash_map<slp_tree, slp_instance> &node_to_instance,
> +                        hash_map<slp_instance, slp_instance> &instance_leader)
>  {
>    stmt_vec_info stmt_info;
>    unsigned i;
>  
>    FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), i, stmt_info)
> -    {
> -      bool existed_p;
> -      slp_instance &stmt_instance
> -     = stmt_to_instance.get_or_insert (stmt_info, &existed_p);
> -      if (!existed_p)
> -     ;
> -      else if (stmt_instance != instance)
> -     {
> -       /* If we're running into a previously marked stmt make us the
> -          leader of the current ultimate leader.  This keeps the
> -          leader chain acyclic and works even when the current instance
> -          connects two previously independent graph parts.  */
> -       slp_instance stmt_leader
> -         = get_ultimate_leader (stmt_instance, instance_leader);
> -       if (stmt_leader != instance)
> -         instance_leader.put (stmt_leader, instance);
> -     }
> -      stmt_instance = instance;
> -    }
> +    vect_map_to_instance (instance, stmt_info, stmt_to_instance,
> +                       instance_leader);
>  
> -  if (!SLP_TREE_SCALAR_STMTS (node).is_empty () && visited.add (node))
> +  if (vect_map_to_instance (instance, node, node_to_instance,
> +                         instance_leader))
>      return;
>  
>    slp_tree child;
>    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
>      if (child && SLP_TREE_DEF_TYPE (child) == vect_internal_def)
>        vect_bb_partition_graph_r (bb_vinfo, instance, child, stmt_to_instance,
> -                              instance_leader, visited);
> +                              node_to_instance, instance_leader);
>  }
>  
>  /* Partition the SLP graph into pieces that can be costed independently.  */
> @@ -6489,16 +6506,16 @@ vect_bb_partition_graph (bb_vec_info bb_vinfo)
>       corresponding SLP graph entry and upon visiting a previously
>       marked stmt, make the stmts leader the current SLP graph entry.  */
>    hash_map<stmt_vec_info, slp_instance> stmt_to_instance;
> +  hash_map<slp_tree, slp_instance> node_to_instance;
>    hash_map<slp_instance, slp_instance> instance_leader;
> -  hash_set<slp_tree> visited;
>    slp_instance instance;
>    for (unsigned i = 0; bb_vinfo->slp_instances.iterate (i, &instance); ++i)
>      {
>        instance_leader.put (instance, instance);
>        vect_bb_partition_graph_r (bb_vinfo,
>                                instance, SLP_INSTANCE_TREE (instance),
> -                              stmt_to_instance, instance_leader,
> -                              visited);
> +                              stmt_to_instance, node_to_instance,
> +                              instance_leader);
>      }
>  
>    /* Then collect entries to each independent subgraph.  */
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
