On Tue, 2 Aug 2022, Richard Sandiford wrote:

> Currently SLP tries to force permute operations "down" the graph
> from loads in the hope of reducing the total number of permutes
> needed or (in the best case) removing the need for the permutes
> entirely.  This patch tries to extend it as follows:
> 
> - Allow loads to take a different permutation from the one they
>   started with, rather than choosing between "original permutation"
>   and "no permutation".
> 
> - Allow changes in both directions, if the target supports the
>   reverse permute operation.
> 
> - Treat the placement of permute operations as a two-way dataflow
>   problem: after propagating information from leaves to roots (as now),
>   propagate information back up the graph.
> 
> - Take execution frequency into account when optimising for speed,
>   so that (for example) permutes inside loops have a higher cost
>   than permutes outside loops.
> 
> - Try to reduce the total number of permutes when optimising for
>   size, even if that increases the number of permutes on a given
>   execution path.
> 
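> To illustrate the dataflow and frequency points above, here's a
> deliberately simplified stand-alone toy of the two-pass layout
> choice.  Everything in it is made up for illustration: single scalar
> costs stand in for the patch's (depth, total) cost pairs and sreal
> weights, and a three-partition chain stands in for the general
> partition dag:
> 
> #include <stdio.h>
> 
> int
> main (void)
> {
>   enum { P = 3, L = 2 };
>   /* A toy chain of partitions, loosely modelled on f3 below:
>      0 = load of c (in the loop), 1 = the adds (in the loop),
>      2 = store of a (outside the loop).  Layout 0 is the original
>      element order and layout 1 is the reversed order.  local[p][l]
>      is a made-up frequency-weighted cost of partition p itself
>      operating in layout l: the load is naturally reversed, so using
>      layout 0 there would mean a permute on every iteration.  */
>   int local[P][L] = { { 10, 0 }, { 0, 0 }, { 0, 1 } };
>   /* ew[p] is the frequency-weighted cost of changing layout on the
>      edge from partition p - 1 to partition p.  */
>   int ew[P] = { 0, 10, 1 };
>   int fwd[P][L], bwd[P][L];
> 
>   /* Forward pass (leaves towards roots): fwd[p][l] is the cheapest
>      cost of computing partition p's result in layout l.  */
>   for (int l = 0; l < L; ++l)
>     fwd[0][l] = local[0][l];
>   for (int p = 1; p < P; ++p)
>     for (int l = 0; l < L; ++l)
>       {
>         fwd[p][l] = -1;
>         for (int k = 0; k < L; ++k)
>           {
>             int c = fwd[p - 1][k] + (l != k ? ew[p] : 0) + local[p][l];
>             if (fwd[p][l] < 0 || c < fwd[p][l])
>               fwd[p][l] = c;
>           }
>       }
> 
>   /* Backward pass (roots towards leaves): bwd[p][l] is the cheapest
>      cost of everything "after" p, given that p's result has layout l.  */
>   for (int l = 0; l < L; ++l)
>     bwd[P - 1][l] = 0;
>   for (int p = P - 2; p >= 0; --p)
>     for (int l = 0; l < L; ++l)
>       {
>         bwd[p][l] = -1;
>         for (int k = 0; k < L; ++k)
>           {
>             int c = bwd[p + 1][k] + (l != k ? ew[p + 1] : 0) + local[p + 1][k];
>             if (bwd[p][l] < 0 || c < bwd[p][l])
>               bwd[p][l] = c;
>           }
>       }
> 
>   /* Combine both directions to make the final choice.  With the
>      numbers above, the in-loop partitions stay in the reversed
>      layout and the single conversion happens outside the loop.  */
>   for (int p = 0; p < P; ++p)
>     {
>       int best = 0;
>       for (int l = 1; l < L; ++l)
>         if (fwd[p][l] + bwd[p][l] < fwd[p][best] + bwd[p][best])
>           best = l;
>       printf ("partition %d: layout %d\n", p, best);
>     }
>   return 0;
> }
> 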
> See the big block comment above vect_optimize_slp_pass for
> a detailed description.
> 
> A while back I posted a patch to extend the existing optimisation
> to non-consecutive loads.  This patch doesn't include that change
> (although it's a possible future addition).
> 
> The original motivation for doing this was to add a framework that would
> allow other layout differences in future.  The two main ones are:
> 
> - Make it easier to represent predicated operations, including
>   predicated operations with gaps.  E.g.:
> 
>      a[0] += 1;
>      a[1] += 1;
>      a[3] += 1;
> 
>   could be a single load/add/store for SVE.  We could handle this
>   by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
>   (depending on what's being counted).  We might need to move
>   elements between lanes at various points, like with permutes.
> 
>   (This would first mean adding support for stores with gaps;
>   a sketch in SVE intrinsics of the kind of code we're aiming for
>   follows this list.)
> 
> - Make it easier to switch between an even/odd and unpermuted layout
>   when switching between wide and narrow elements.  E.g. if a widening
>   operation produces an even vector and an odd vector, we should try
>   to keep operations on the wide elements in that order rather than
>   force them to be permuted back "in order".
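> 
> For the predicated-gaps example above, the kind of code we'd like
> to end up with is roughly the following, sketched as hand-written
> SVE ACLE intrinsics for a 128-bit vector (illustration only, not
> compiler output; f0 is just a placeholder name):
> 
> #include <arm_sve.h>
> 
> void
> f0 (int *a)
> {
>   svbool_t pg = svdupq_n_b32 (1, 1, 0, 1);  /* enable lanes 0, 1 and 3 */
>   svint32_t v = svld1_s32 (pg, a);          /* one predicated load */
>   v = svadd_n_s32_x (pg, v, 1);             /* one add */
>   svst1_s32 (pg, a, v);                     /* one predicated store */
> }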
> 
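> As a purely illustrative example of the even/odd point, in
> something like:
> 
> void f5(int *__restrict a, short *__restrict b, short *__restrict c)
> {
>   for (int i = 0; i < 100; ++i)
>     a[i] = b[i] * c[i] + 1;
> }
> 
> a target whose widening multiply produces one vector of even-indexed
> products and one of odd-indexed products could apply the "+ 1" to
> those two vectors directly, keeping the wide elements in even/odd
> order for as long as possible instead of permuting them back
> "in order" straight after the multiplication.  (f5 is a made-up
> example, not one of the cases the patch currently changes.)
> 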
> To give some examples of what the current patch does:
> 
> void f1(int *__restrict a, int *__restrict b, int *__restrict c,
>         int *__restrict d)
> {
>   a[0] = (b[1] << c[3]) - d[1];
>   a[1] = (b[0] << c[2]) - d[0];
>   a[2] = (b[3] << c[1]) - d[3];
>   a[3] = (b[2] << c[0]) - d[2];
> }
> 
> continues to produce the same code as before when optimising for
> speed: b, c and d are permuted at load time.  But when optimising
> for size we instead permute c into the same order as b+d and then
> permute the result of the arithmetic into the same order as a:
> 
>         ldr     q1, [x2]
>         ldr     q0, [x1]
>         ext     v1.16b, v1.16b, v1.16b, #8     // <------
>         sshl    v0.4s, v0.4s, v1.4s
>         ldr     q1, [x3]
>         sub     v0.4s, v0.4s, v1.4s
>         rev64   v0.4s, v0.4s                   // <------
>         str     q0, [x0]
>         ret
> 
> The following function:
> 
> void f2(int *__restrict a, int *__restrict b, int *__restrict c,
>         int *__restrict d)
> {
>   a[0] = (b[3] << c[3]) - d[3];
>   a[1] = (b[2] << c[2]) - d[2];
>   a[2] = (b[1] << c[1]) - d[1];
>   a[3] = (b[0] << c[0]) - d[0];
> }
> 
> continues to push the reverse down to just before the store,
> like the current code does.
> 
> In:
> 
> void f3(int *__restrict a, int *__restrict b, int *__restrict c,
>         int *__restrict d)
> {
>   for (int i = 0; i < 100; ++i)
>     {
>       a[0] = (a[0] + c[3]);
>       a[1] = (a[1] + c[2]);
>       a[2] = (a[2] + c[1]);
>       a[3] = (a[3] + c[0]);
>       c += 4;
>     }
> }
> 
> the loads of a are hoisted and the stores of a are sunk, so that
> only the load from c happens in the loop.  When optimising for
> speed, we prefer to have the loop operate on the reversed layout,
> changing on entry and exit from the loop:
> 
>         mov     x3, x0
>         adrp    x0, .LC0
>         add     x1, x2, 1600
>         ldr     q2, [x0, #:lo12:.LC0]
>         ldr     q0, [x3]
>         mov     v1.16b, v0.16b
>         tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
>         .p2align 3,,7
> .L6:
>         ldr     q1, [x2], 16
>         add     v0.4s, v0.4s, v1.4s
>         cmp     x2, x1
>         bne     .L6
>         mov     v1.16b, v0.16b
>         adrp    x0, .LC0
>         ldr     q2, [x0, #:lo12:.LC0]
>         tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
>         str     q0, [x3]
>         ret
> 
> Similarly, for the very artificial testcase:
> 
> void f4(int *__restrict a, int *__restrict b, int *__restrict c,
>         int *__restrict d)
> {
>   int a0 = a[0];
>   int a1 = a[1];
>   int a2 = a[2];
>   int a3 = a[3];
>   for (int i = 0; i < 100; ++i)
>     {
>       a0 ^= c[0];
>       a1 ^= c[1];
>       a2 ^= c[2];
>       a3 ^= c[3];
>       c += 4;
>       for (int j = 0; j < 100; ++j)
>       {
>         a0 += d[1];
>         a1 += d[0];
>         a2 += d[3];
>         a3 += d[2];
>         d += 4;
>       }
>       b[0] = a0;
>       b[1] = a1;
>       b[2] = a2;
>       b[3] = a3;
>       b += 4;
>     }
>   a[0] = a0;
>   a[1] = a1;
>   a[2] = a2;
>   a[3] = a3;
> }
> 
> the a vector in the inner loop maintains the order { 1, 0, 3, 2 },
> even though it's part of an SCC that includes the outer loop.
> In other words, this is a motivating case for not assigning
> permutes at SCC granularity.  The code we get is:
> 
>         ldr     q0, [x0]
>         mov     x4, x1
>         mov     x5, x0
>         add     x1, x3, 1600
>         add     x3, x4, 1600
>         .p2align 3,,7
> .L11:
>         ldr     q1, [x2], 16
>         sub     x0, x1, #1600
>         eor     v0.16b, v1.16b, v0.16b
>         rev64   v0.4s, v0.4s              // <---
>         .p2align 3,,7
> .L10:
>         ldr     q1, [x0], 16
>         add     v0.4s, v0.4s, v1.4s
>         cmp     x0, x1
>         bne     .L10
>         rev64   v0.4s, v0.4s              // <---
>         add     x1, x0, 1600
>         str     q0, [x4], 16
>         cmp     x3, x4
>         bne     .L11
>         str     q0, [x5]
>         ret
> 
> The patch is still incomplete, hence the FIXMEs, but it's at the stage
> where it passes bootstrap & regtest on aarch64-linux-gnu.  (There's one
> regression in slp-11b.c, but I think it's arguably restoring the
> expected load-lanes behaviour.)  Does this look like a plausible
> direction, before I go filling in the FIXME parts?

Yes, it looks nice!  I wonder whether you've spent much thought on
layout optimization for SLP patterns and/or how to integrate the two?

Thanks,
Richard.

> Thanks,
> Richard
> 
> ---
>  gcc/graphds.cc        |   13 +-
>  gcc/graphds.h         |    3 +-
>  gcc/hash-map-traits.h |   74 +-
>  gcc/hash-traits.h     |   96 ++-
>  gcc/tree-vect-slp.cc  | 1870 ++++++++++++++++++++++++++++++++---------
>  5 files changed, 1599 insertions(+), 457 deletions(-)
> 
> *** /tmp/iBbApg_graphds.cc    2022-08-02 08:16:23.975354155 +0100
> --- gcc/graphds.cc    2022-08-01 20:44:22.287691305 +0100
> ***************
> *** 281,287 ****
>      numbers assigned by the previous pass.  If SUBGRAPH is not NULL, it
>      specifies the subgraph of G whose strongly connected components we want
>     to determine.  If SKIP_EDGE_P is not NULL, it points to a callback function.
> !    Edge E will be skipped if callback function returns true.
>   
>      After running this function, v->component is the number of the strongly
>      connected component for each vertex of G.  Returns the number of the
> --- 281,294 ----
>      numbers assigned by the previous pass.  If SUBGRAPH is not NULL, it
>      specifies the subgraph of G whose strongly connected components we want
>     to determine.  If SKIP_EDGE_P is not NULL, it points to a callback function.
> !    Edge E will be skipped if callback function returns true.  If SCC_GROUPING
> !    is not null, the nodes will be added to it in the following order:
> ! 
> !    - If SCC A is a direct or indirect predecessor of SCC B in the SCC dag,
> !      A's nodes come before B's nodes.
> ! 
> !    - All of an SCC's nodes are listed consecutively, although the order
> !      of the nodes within an SCC is not really meaningful.
>   
>      After running this function, v->component is the number of the strongly
>      connected component for each vertex of G.  Returns the number of the
> ***************
> *** 289,295 ****
>   
>   int
>   graphds_scc (struct graph *g, bitmap subgraph,
> !          skip_edge_callback skip_edge_p)
>   {
>     int *queue = XNEWVEC (int, g->n_vertices);
>     vec<int> postorder = vNULL;
> --- 296,302 ----
>   
>   int
>   graphds_scc (struct graph *g, bitmap subgraph,
> !          skip_edge_callback skip_edge_p, vec<int> *scc_grouping)
>   {
>     int *queue = XNEWVEC (int, g->n_vertices);
>     vec<int> postorder = vNULL;
> ***************
> *** 317,323 ****
>   
>     for (i = 0; i < nq; i++)
>       queue[i] = postorder[nq - i - 1];
> !   comp = graphds_dfs (g, queue, nq, NULL, true, subgraph, skip_edge_p);
>   
>     free (queue);
>     postorder.release ();
> --- 324,330 ----
>   
>     for (i = 0; i < nq; i++)
>       queue[i] = postorder[nq - i - 1];
> !   comp = graphds_dfs (g, queue, nq, scc_grouping, true, subgraph, skip_edge_p);
>   
>     free (queue);
>     postorder.release ();
> *** /tmp/eRUAEk_graphds.h     2022-08-02 08:16:23.987354123 +0100
> --- gcc/graphds.h     2022-08-01 20:44:18.835700593 +0100
> ***************
> *** 58,64 ****
>   typedef bool (*skip_edge_callback) (struct graph_edge *);
>   int graphds_dfs (struct graph *, int *, int,
>                vec<int> *, bool, bitmap, skip_edge_callback = NULL);
> ! int graphds_scc (struct graph *, bitmap, skip_edge_callback = NULL);
>   void graphds_domtree (struct graph *, int, int *, int *, int *);
>   typedef void (*graphds_edge_callback) (struct graph *,
>                                      struct graph_edge *, void *);
> --- 58,65 ----
>   typedef bool (*skip_edge_callback) (struct graph_edge *);
>   int graphds_dfs (struct graph *, int *, int,
>                vec<int> *, bool, bitmap, skip_edge_callback = NULL);
> ! int graphds_scc (struct graph *, bitmap, skip_edge_callback = NULL,
> !              vec<int> * = NULL);
>   void graphds_domtree (struct graph *, int, int *, int *, int *);
>   typedef void (*graphds_edge_callback) (struct graph *,
>                                      struct graph_edge *, void *);
> *** /tmp/4zWIek_hash-map-traits.h     2022-08-02 08:16:23.995354102 +0100
> --- gcc/hash-map-traits.h     2022-08-01 20:44:22.287691305 +0100
> ***************
> *** 105,118 ****
>     static const bool maybe_mx = false;
>   };
>   
> ! /* Implement traits for a hash_map with values of type Value for cases
> !    in which the key cannot represent empty and deleted slots.  Instead
> !    record empty and deleted entries in Value.  Derived classes must
> !    implement the hash and equal_keys functions.  */
>   
> ! template <typename Value>
>   struct unbounded_hashmap_traits
>   {
>     template <typename T> static inline void remove (T &);
>     static const bool empty_zero_p = default_hash_traits <Value>::empty_zero_p;
>     template <typename T> static inline bool is_empty (const T &);
> --- 105,123 ----
>     static const bool maybe_mx = false;
>   };
>   
> ! /* Implement traits for a hash_map with keys of type Key and values of
> !    type Value for cases in which the key cannot represent empty and
> !    deleted slots.  Instead record empty and deleted entries in Value.  */
>   
> ! template <typename Key, typename Value>
>   struct unbounded_hashmap_traits
>   {
> +   typedef typename Key::value_type key_type;
> + 
> +   static hashval_t hash (const typename Key::value_type &);
> +   static bool equal_keys (const typename Key::value_type &,
> +                       const typename Key::compare_type &);
> + 
>     template <typename T> static inline void remove (T &);
>     static const bool empty_zero_p = default_hash_traits <Value>::empty_zero_p;
>     template <typename T> static inline bool is_empty (const T &);
> ***************
> *** 121,162 ****
>     template <typename T> static inline void mark_deleted (T &);
>   };
>   
> ! template <typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Value>::remove (T &entry)
>   {
>     default_hash_traits <Value>::remove (entry.m_value);
>   }
>   
> ! template <typename Value>
>   template <typename T>
>   inline bool
> ! unbounded_hashmap_traits <Value>::is_empty (const T &entry)
>   {
>     return default_hash_traits <Value>::is_empty (entry.m_value);
>   }
>   
> ! template <typename Value>
>   template <typename T>
>   inline bool
> ! unbounded_hashmap_traits <Value>::is_deleted (const T &entry)
>   {
>     return default_hash_traits <Value>::is_deleted (entry.m_value);
>   }
>   
> ! template <typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Value>::mark_empty (T &entry)
>   {
>     default_hash_traits <Value>::mark_empty (entry.m_value);
>   }
>   
> ! template <typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Value>::mark_deleted (T &entry)
>   {
>     default_hash_traits <Value>::mark_deleted (entry.m_value);
>   }
> --- 126,184 ----
>     template <typename T> static inline void mark_deleted (T &);
>   };
>   
> ! template <typename Key, typename Value>
> ! inline hashval_t
> ! unbounded_hashmap_traits <Key, Value>
> ! ::hash (const typename Key::value_type &key)
> ! {
> !   return Key::hash (key);
> ! }
> ! 
> ! template <typename Key, typename Value>
> ! inline bool
> ! unbounded_hashmap_traits <Key, Value>
> ! ::equal_keys (const typename Key::value_type &x,
> !           const typename Key::compare_type &y)
> ! {
> !   return Key::equal (x, y);
> ! }
> ! 
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Key, Value>::remove (T &entry)
>   {
>     default_hash_traits <Value>::remove (entry.m_value);
>   }
>   
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline bool
> ! unbounded_hashmap_traits <Key, Value>::is_empty (const T &entry)
>   {
>     return default_hash_traits <Value>::is_empty (entry.m_value);
>   }
>   
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline bool
> ! unbounded_hashmap_traits <Key, Value>::is_deleted (const T &entry)
>   {
>     return default_hash_traits <Value>::is_deleted (entry.m_value);
>   }
>   
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Key, Value>::mark_empty (T &entry)
>   {
>     default_hash_traits <Value>::mark_empty (entry.m_value);
>   }
>   
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Key, Value>::mark_deleted (T &entry)
>   {
>     default_hash_traits <Value>::mark_deleted (entry.m_value);
>   }
> ***************
> *** 166,190 ****
>      slots.  */
>   
>   template <typename Key, typename Value>
> ! struct unbounded_int_hashmap_traits : unbounded_hashmap_traits <Value>
> ! {
> !   typedef Key key_type;
> !   static inline hashval_t hash (Key);
> !   static inline bool equal_keys (Key, Key);
> ! };
> ! 
> ! template <typename Key, typename Value>
> ! inline hashval_t
> ! unbounded_int_hashmap_traits <Key, Value>::hash (Key k)
> ! {
> !   return k;
> ! }
> ! 
> ! template <typename Key, typename Value>
> ! inline bool
> ! unbounded_int_hashmap_traits <Key, Value>::equal_keys (Key k1, Key k2)
> ! {
> !   return k1 == k2;
> ! }
>   
>   #endif // HASH_MAP_TRAITS_H
> --- 188,194 ----
>      slots.  */
>   
>   template <typename Key, typename Value>
> ! using unbounded_int_hashmap_traits
> !   = unbounded_hashmap_traits <int_hash_base <Key>, Value>;
>   
>   #endif // HASH_MAP_TRAITS_H
> *** /tmp/FL7vGb_hash-traits.h 2022-08-02 08:16:24.011354061 +0100
> --- gcc/hash-traits.h 2022-08-01 20:44:22.287691305 +0100
> ***************
> *** 85,125 ****
>   {
>   }
>   
>   
> ! /* Hasher for integer type Type in which Empty is a spare value that can be
> !    used to mark empty slots.  If Deleted != Empty then Deleted is another
> !    spare value that can be used for deleted slots; if Deleted == Empty then
> !    hash table entries cannot be deleted.  */
> ! 
> ! template <typename Type, Type Empty, Type Deleted = Empty>
> ! struct int_hash : typed_noop_remove <Type>
>   {
>     typedef Type value_type;
>     typedef Type compare_type;
>   
>     static inline hashval_t hash (value_type);
>     static inline bool equal (value_type existing, value_type candidate);
> -   static inline void mark_deleted (Type &);
> -   static const bool empty_zero_p = Empty == 0;
> -   static inline void mark_empty (Type &);
> -   static inline bool is_deleted (Type);
> -   static inline bool is_empty (Type);
>   };
>   
> ! template <typename Type, Type Empty, Type Deleted>
>   inline hashval_t
> ! int_hash <Type, Empty, Deleted>::hash (value_type x)
>   {
>     return x;
>   }
>   
> ! template <typename Type, Type Empty, Type Deleted>
>   inline bool
> ! int_hash <Type, Empty, Deleted>::equal (value_type x, value_type y)
>   {
>     return x == y;
>   }
>   
>   template <typename Type, Type Empty, Type Deleted>
>   inline void
>   int_hash <Type, Empty, Deleted>::mark_deleted (Type &x)
> --- 85,135 ----
>   {
>   }
>   
> + /* Base traits for integer type Type, providing just the hash and
> +    comparison functionality.  */
>   
> ! template <typename Type>
> ! struct int_hash_base : typed_noop_remove <Type>
>   {
>     typedef Type value_type;
>     typedef Type compare_type;
>   
>     static inline hashval_t hash (value_type);
>     static inline bool equal (value_type existing, value_type candidate);
>   };
>   
> ! template <typename Type>
>   inline hashval_t
> ! int_hash_base <Type>::hash (value_type x)
>   {
>     return x;
>   }
>   
> ! template <typename Type>
>   inline bool
> ! int_hash_base <Type>::equal (value_type x, value_type y)
>   {
>     return x == y;
>   }
>   
> + /* Hasher for integer type Type in which Empty is a spare value that can be
> +    used to mark empty slots.  If Deleted != Empty then Deleted is another
> +    spare value that can be used for deleted slots; if Deleted == Empty then
> +    hash table entries cannot be deleted.  */
> + 
> + template <typename Type, Type Empty, Type Deleted = Empty>
> + struct int_hash : int_hash_base <Type>
> + {
> +   typedef Type value_type;
> +   typedef Type compare_type;
> + 
> +   static inline void mark_deleted (Type &);
> +   static const bool empty_zero_p = Empty == 0;
> +   static inline void mark_empty (Type &);
> +   static inline bool is_deleted (Type);
> +   static inline bool is_empty (Type);
> + };
> + 
>   template <typename Type, Type Empty, Type Deleted>
>   inline void
>   int_hash <Type, Empty, Deleted>::mark_deleted (Type &x)
> ***************
> *** 398,403 ****
> --- 408,467 ----
>     return T1::is_empty (x.first);
>   }
>   
> + /* Base traits for vectors, providing just the hash and comparison
> +    functionality.  Type gives the traits for the element type.  */
> + 
> + template <typename Type>
> + struct vec_hash_base
> + {
> +   typedef vec<typename Type::value_type> value_type;
> +   typedef vec<typename Type::compare_type> compare_type;
> + 
> +   static inline hashval_t hash (value_type);
> +   static inline bool equal (value_type, compare_type);
> + };
> + 
> + template <typename Type>
> + inline hashval_t
> + vec_hash_base <Type>::hash (value_type x)
> + {
> +   inchash::hash hstate;
> +   hstate.add_int (x.length ());
> +   for (auto &value : x)
> +     hstate.merge_hash (Type::hash (value));
> +   return hstate.end ();
> + }
> + 
> + template <typename Type>
> + inline bool
> + vec_hash_base <Type>::equal (value_type x, compare_type y)
> + {
> +   if (x.length () != y.length ())
> +     return false;
> +   for (unsigned int i = 0; i < x.length (); ++i)
> +     if (!Type::equal (x[i], y[i]))
> +       return false;
> +   return true;
> + }
> + 
> + /* Traits for vectors whose contents should be freed normally.  */
> + 
> + template <typename Type>
> + struct vec_free_hash_base : vec_hash_base <Type>
> + {
> +   static void remove (typename vec_hash_base <Type>::value_type &);
> + };
> + 
> + template <typename Type>
> + void
> + vec_free_hash_base <Type>
> + ::remove (typename vec_hash_base <Type>::value_type &x)
> + {
> +   for (auto &value : x)
> +     Type::remove (x);
> +   x.release ();
> + }
> + 
>   template <typename T> struct default_hash_traits : T {};
>   
>   template <typename T>
> *** /tmp/5ek0ya_tree-vect-slp.cc      2022-08-02 08:16:24.027354017 +0100
> --- gcc/tree-vect-slp.cc      2022-08-01 20:44:22.631690380 +0100
> ***************
> *** 20,25 ****
> --- 20,26 ----
>   <http://www.gnu.org/licenses/>.  */
>   
>   #include "config.h"
> + #define INCLUDE_ALGORITHM
>   #include "system.h"
>   #include "coretypes.h"
>   #include "backend.h"
> ***************
> *** 49,54 ****
> --- 50,57 ----
>   #include "tree-eh.h"
>   #include "tree-cfg.h"
>   #include "alloc-pool.h"
> + #include "sreal.h"
> + #include "predict.h"
>   
>   static bool vectorizable_slp_permutation (vec_info *, gimple_stmt_iterator *,
>                                         slp_tree, stmt_vector_for_cost *);
> ***************
> *** 304,309 ****
> --- 307,322 ----
>     oprnds_info.release ();
>   }
>   
> + /* Return the execution frequency of NODE (so that a higher value indicates
> +    a "more important" node when optimizing for speed).  */
> + 
> + static sreal
> + vect_slp_node_weight (slp_tree node)
> + {
> +   stmt_vec_info stmt_info = vect_orig_stmt (SLP_TREE_REPRESENTATIVE (node));
> +   basic_block bb = gimple_bb (stmt_info->stmt);
> +   return bb->count.to_sreal_scale (ENTRY_BLOCK_PTR_FOR_FN (cfun)->count);
> + }
>   
>   /* Return true if STMTS contains a pattern statement.  */
>   
> ***************
> *** 3531,3559 ****
>     return opt_result::success ();
>   }
>   
> ! struct slpg_vertex
>   {
> !   slpg_vertex (slp_tree node_)
> !     : node (node_), perm_in (-1), perm_out (-1) {}
>   
> !   int get_perm_materialized () const
> !     { return perm_in != perm_out ? perm_in : 0; }
>   
>     slp_tree node;
> !   /* The common permutation on the incoming lanes (towards SLP children).  */
> !   int perm_in;
> !   /* The permutation on the outgoing lanes (towards SLP parents).  When
> !      the node is a materialization point for a permute this differs
> !      from perm_in (and is then usually zero).  Materialization happens
> !      on the input side.  */
> !   int perm_out;
>   };
>   
>   /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
>   
> ! static void
> ! vect_slp_build_vertices (hash_set<slp_tree> &visited, slp_tree node,
> !                      vec<slpg_vertex> &vertices, vec<int> &leafs)
>   {
>     unsigned i;
>     slp_tree child;
> --- 3544,3994 ----
>     return opt_result::success ();
>   }
>   
> ! /* Estimates the cost of inserting layout changes into the SLP graph.  */
> ! 
> ! struct slpg_layout_cost
> ! {
> !   slpg_layout_cost () = default;
> !   slpg_layout_cost (sreal, bool);
> ! 
> !   bool operator== (const slpg_layout_cost &) const;
> !   bool operator!= (const slpg_layout_cost &) const;
> ! 
> !   bool is_better_than (const slpg_layout_cost &, bool) const;
> ! 
> !   void add_parallel_cost (const slpg_layout_cost &);
> !   void add_serial_cost (const slpg_layout_cost &);
> !   void split (unsigned int);
> ! 
> !   /* The longest sequence of layout changes needed during any traversal
> !      of the partition dag, weighted by execution frequency.
> ! 
> !      This is the most important metric when optimizing for speed, since
> !      it helps to ensure that we keep the number of operations on
> !      critical paths to a minimum.  */
> !   sreal depth = 0;
> ! 
> !   /* An estimate of the total number of operations needed.  It is weighted by
> !      execution frequency when optimizing for speed but not when optimizing for
> !      size.  In order to avoid double-counting, a node with a fanout of N will
> !      distribute 1/N of its total cost to each successor.
> ! 
> !      This is the most important metric when optimizing for size, since
> !      it helps to keep the total number of operations to a minimum.  */
> !   sreal total = 0;
> ! };
> ! 
> ! /* Construct costs for a node with weight WEIGHT.  A higher weight
> !    indicates more frequent execution.  IS_FOR_SIZE is true if we are
> !    optimizing for size rather than speed.  */
> ! 
> ! slpg_layout_cost::slpg_layout_cost (sreal weight, bool is_for_size)
> !   : depth (weight), total (is_for_size && weight > 0 ? 1 : weight)
> ! {
> ! }
> ! 
> ! bool
> ! slpg_layout_cost::operator== (const slpg_layout_cost &other) const
> ! {
> !   return depth == other.depth && total == other.total;
> ! }
> ! 
> ! bool
> ! slpg_layout_cost::operator!= (const slpg_layout_cost &other) const
> ! {
> !   return !operator== (other);
> ! }
> ! 
> ! /* Return true if these costs are better than OTHER.  IS_FOR_SIZE is
> !    true if we are optimizing for size rather than speed.  */
> ! 
> ! bool
> ! slpg_layout_cost::is_better_than (const slpg_layout_cost &other,
> !                               bool is_for_size) const
> ! {
> !   if (is_for_size)
> !     {
> !       if (total != other.total)
> !     return total < other.total;
> !       return depth < other.depth;
> !     }
> !   else
> !     {
> !       if (depth != other.depth)
> !     return depth < other.depth;
> !       return total < other.total;
> !     }
> ! }
> ! 
> ! /* Increase the costs to account for something with cost INPUT_COST
> !    happening in parallel with the current costs.  */
> ! 
> ! void
> ! slpg_layout_cost::add_parallel_cost (const slpg_layout_cost &input_cost)
>   {
> !   depth = std::max (depth, input_cost.depth);
> !   total += input_cost.total;
> ! }
> ! 
> ! /* Increase the costs to account for something with cost INPUT_COST
> !    happening in series with the current costs.  */
> ! 
> ! void
> ! slpg_layout_cost::add_serial_cost (const slpg_layout_cost &other)
> ! {
> !   depth += other.depth;
> !   total += other.total;
> ! }
> ! 
> ! /* Split the total cost among TIMES successors or predecessors.  */
>   
> ! void
> ! slpg_layout_cost::split (unsigned int times)
> ! {
> !   if (times > 1)
> !     total /= times;
> ! }
> ! 
> ! /* Information about one node in the SLP graph, for use during
> !    vect_optimize_slp_pass.  */
> ! 
> ! struct slpg_vertex
> ! {
> !   slpg_vertex (slp_tree node_) : node (node_) {}
>   
> +   /* The node itself.  */
>     slp_tree node;
> ! 
> !   /* Which partition the node belongs to, or -1 if none.  */
> !   int partition = -1;
> ! 
> !   /* The number of nodes that directly use the result of this one
> !      (i.e. the number of nodes that count this one as a child).  */
> !   unsigned int out_degree = 0;
> ! 
> !   /* The execution frequency of the node.  */
> !   sreal weight = 0;
> ! 
> !   /* The total execution frequency of all nodes that directly use the
> !      result of this one.  */
> !   sreal out_weight = 0;
> ! };
> ! 
> ! /* Information about one partition of the SLP graph, for use during
> !    vect_optimize_slp_pass.  */
> ! 
> ! struct slpg_partition_info
> ! {
> !   /* The nodes in the partition occupy indices [NODE_BEGIN, NODE_END)
> !      of m_partitioned_nodes.  */
> !   unsigned int node_begin = 0;
> !   unsigned int node_end = 0;
> ! 
> !   /* Which layout we've chosen to use for this partition, or -1 if
> !      we haven't picked one yet.  */
> !   int layout = -1;
> ! 
> !   /* The number of predecessors and successors in the partition dag.
> !      The predecessors always have lower partition numbers and the
> !      successors always have higher partition numbers.
> ! 
> !      Note that the directions of these edges are not necessarily the
> !      same as in the data flow graph.  For example, if an SCC contains
> !      an inner and an outer loop, the inner loop's partition will have
> !      at least two incoming edges from the outer loop's partition:
> !      one for a live-in value and one for a live-out value.  In data
> !      flow terms, the live-in edge would also be from the outer loop
> !      to the inner loop, but the live-out edge would be from the inner
> !      loop to the outer loop.  */
> !   unsigned int in_degree = 0;
> !   unsigned int out_degree = 0;
> ! };
> ! 
> ! /* Information about the costs of using a particular layout for a
> !    particular partition.  It can also say that the combination is
> !    invalid.  */
> ! 
> ! struct slpg_partition_layout_costs
> ! {
> !   bool is_possible () const { return cost.total != sreal::min (); }
> !   void mark_impossible () { cost.total = sreal::min (); }
> ! 
> !   /* The costs inherited from predecessor partitions plus the inherent
> !      cost of the layout within the node itself.  */
> !   slpg_layout_cost in_cost;
> ! 
> !   /* The costs inherited from successor partitions.  */
> !   slpg_layout_cost out_cost;
> ! 
> !   /* The sum of the above, when used.  */
> !   slpg_layout_cost cost;
> ! };
> ! 
> ! /* This class tries to optimize the layout of vectors in order to avoid
> !    unnecessary shuffling.  At the moment, the set of possible layouts is
> !    restricted to bijective permutes.
> ! 
> !    The goal of the pass depends on whether we're optimizing for size or
> !    for speed.  When optimizing for size, the goal is to reduce the overall
> !    number of layout changes (including layout changes implied by things
> !    like load permutes).  When optimizing for speed, the goal is to
> !    reduce the maximum latency attributable to layout changes on any
> !    non-cyclic path through the data flow graph.
> ! 
> !    For example, when optimizing a loop nest for speed, we will prefer
> !    to make layout changes outside of a loop rather than inside of a loop,
> !    and will prefer to make layout changes in parallel rather than serially,
> !    even if that increases the overall number of layout changes.
> ! 
> !    The high-level procedure is:
> ! 
> !    (1) Build a graph in which edges go from uses (parents) to definitions
> !        (children).
> ! 
> !    (2) Divide the graph into a dag of strongly-connected components (SCCs).
> ! 
> !    (3) Partition the nodes in each SCC based on their containing cfg loop,
> !        creating a dag of partitions.  The goal is now to assign a layout
> !        to each partition.
> ! 
> !    (4) Construct a set of vector layouts that are worth considering and
> !        calculate the local cost of using each layout in each partition.
> !        Some partitions might not allow a given layout at all.
> ! 
> !    (5) Perform a forward walk over the partition dag (from loads to stores)
> !        accumulating the "forward" cost of using each layout.  When visiting
> !        each partition, assign a tentative choice of layout to the partition
> !        and use that choice when calculating the cost of using a different
> !        layout in successor partitions.
> ! 
> !    (6) Perform a backward walk over the partition dag (from stores to loads),
> !        accumulating the "backward" cost of using each layout.  When visiting
> !        each partition, make a final choice of layout for that partition based
> !        on the accumulated forward costs (from (5)) and backward costs
> !        (from (6)).
> ! 
> !    (7) Apply the chosen layouts to the SLP graph.
> ! 
> !    For example, consider the SLP statements:
> ! 
> !    S1:      a_1 = load
> !        loop:
> !    S2:      a_2 = PHI<a_1, a_3>
> !    S3:      b_1 = load
> !    S4:      a_3 = a_2 + b_1
> !        exit:
> !    S5:      a_4 = PHI<a_3>
> !    S6:      store a_4
> ! 
> !    S2 and S4 form an SCC and are part of the same loop.  Every other
> !    statement is in a singleton SCC.  In this example there is a one-to-one
> !    mapping between SCCs and partitions and the partition dag looks like this:
> ! 
> !     S1     S3
> !      \     /
> !       S2+S4
> !         |
> !        S5
> !         |
> !        S6
> ! 
> !    S2, S3 and S4 will have a higher execution frequency than the other
> !    statements, so when optimizing for speed, the goal is to avoid any
> !    layout changes:
> ! 
> !    - within S3
> !    - within S2+S4
> !    - on the S3->S2+S4 edge
> ! 
> !    For example, if S3 was originally a reversing load, the goal of the
> !    pass is to make it an unreversed load and change the layout on the
> !    S1->S2+S4 and S2+S4->S5 edges to compensate.  (Changing the layout
> !    on S1->S2+S4 and S5->S6 would also be acceptable.)
> ! 
> !    The difference between SCCs and partitions becomes important if we
> !    add an outer loop:
> ! 
> !    S1:      a_1 = ...
> !        loop1:
> !    S2:      a_2 = PHI<a_1, a_6>
> !    S3:      b_1 = load
> !    S4:      a_3 = a_2 + b_1
> !        loop2:
> !    S5:      a_4 = PHI<a_3, a_5>
> !    S6:      c_1 = load
> !    S7:      a_5 = a_4 + c_1
> !        exit2:
> !    S8:      a_6 = PHI<a_5>
> !    S9:      store a_6
> !        exit1:
> ! 
> !    Here, S2, S4, S5, S7 and S8 form a single SCC.  However, we usually
> !    do not want restrictions in the outer loop to "infect" the decision
> !    for the inner loop.  For example, if an outer-loop node in the SCC
> !    contains a statement with a fixed layout, that should not prevent the
> !    inner loop from using a different layout.  Conversely, the inner loop
> !    should not dictate a layout to the outer loop: if the outer loop does
> !    a lot of computation, then it may not be efficient to do all of that
> !    computation in the inner loop's preferred layout.
> ! 
> !    We therefore partition the SCC into S2+S4+S8 (outer) and S5+S7 (inner).
> !    We also try to arrange partitions so that:
> ! 
> !    - the partition for an outer loop comes before the partition for
> !      an inner loop
> ! 
> !    - if a sibling loop A dominates a sibling loop B, A's partition
> !      comes before B's
> ! 
> !    So in this case, the partition dag is:
> ! 
> !     S1        S3
> !      \        /
> !       S2+S4+S8   S6
> !        |   \\    /
> !        |    S5+S7
> !        |
> !       S9
> ! 
> !    There are two edges from S2+S4+S8 to S5+S7: one for the edge S4->S5 and
> !    one for a reversal of the edge S7->S8.
> ! 
> !    The backward walk picks a layout for S5+S7 before S2+S4+S8.  The choice
> !    for S2+S4+S8 therefore has to balance the cost of using the outer loop's
> !    preferred layout against the cost of changing the layout on entry to the
> !    inner loop (S4->S5) and on exit from the inner loop (S7->S8 reversed).
> ! 
> !    ??? Although this works well when optimizing for speed, it has the
> !    downside when optimizing for size that the choice of layout for S5+S7
> !    is completely independent of S9.  Perhaps we should not partition SCCs
> !    when optimizing for size?
> ! 
> !    To give a concrete example of the difference between optimizing
> !    for size and speed, consider:
> ! 
> !    a[0] = (b[1] << c[3]) - d[1];
> !    a[1] = (b[0] << c[2]) - d[0];
> !    a[2] = (b[3] << c[1]) - d[3];
> !    a[3] = (b[2] << c[0]) - d[2];
> ! 
> !    There are three different layouts here: one for a, one for b and d,
> !    and one for c.  When optimizing for speed it is better to permute
> !    each of b, c and d into the order required by a, since those permutes
> !    happen in parallel.  But when optimizing for size, it is better to:
> ! 
> !    - permute c into the same order as b
> !    - do the arithmetic
> !    - permute the result into the order required by a
> ! 
> !    This gives 2 permutes rather than 3.  */
> ! 
> ! class vect_optimize_slp_pass
> ! {
> ! public:
> !   vect_optimize_slp_pass (vec_info *vinfo) : m_vinfo (vinfo) {}
> !   void run ();
> ! 
> ! private:
> !   /* Graph building.  */
> !   struct loop *containing_loop (slp_tree);
> !   bool is_cfg_latch_edge (graph_edge *);
> !   void build_vertices (hash_set<slp_tree> &, slp_tree);
> !   void build_vertices ();
> !   void build_graph ();
> ! 
> !   /* Partitioning.  */
> !   void create_partitions ();
> !   template<typename T> void for_each_partition_edge (unsigned int, T);
> ! 
> !   /* Layout selection.  */
> !   bool can_change_layouts (unsigned int, unsigned int);
> !   bool layout_is_reversible (unsigned int);
> !   unsigned int layout_cost (unsigned int, unsigned int);
> !   slpg_partition_layout_costs &partition_layout_costs (unsigned int,
> !                                                    unsigned int);
> !   bool partition_can_absorb_permutes (unsigned int);
> !   int node_absorption_cost (slp_tree, unsigned int);
> !   void start_choosing_layouts ();
> ! 
> !   /* Cost propagation.  */
> !   slpg_layout_cost edge_layout_cost (graph_edge *, unsigned int,
> !                                  unsigned int, unsigned int);
> !   slpg_layout_cost forward_cost (graph_edge *, unsigned int, unsigned int);
> !   slpg_layout_cost backward_cost (graph_edge *, unsigned int, unsigned int);
> !   void forward_pass ();
> !   void backward_pass ();
> ! 
> !   /* Rematerialization.  */
> !   slp_tree get_result_with_layout (slp_tree, unsigned int);
> !   void materialize ();
> ! 
> !   /* Clean-up.  */
> !   void remove_redundant_permutes ();
> ! 
> !   void dump ();
> ! 
> !   vec_info *m_vinfo;
> ! 
> !   /* True if we should optimize the graph for size, false if we should
> !      optimize it for speed.  (It wouldn't be easy to make this decision
> !      more locally.)  */
> !   bool m_optimize_size;
> ! 
> !   /* A graph of all SLP nodes, with edges leading from uses to definitions.
> !      In other words, a node's predecessors are its slp_tree parents and
> !      a node's successors are its slp_tree children.  */
> !   graph *m_slpg = nullptr;
> ! 
> !   /* The vertices of M_SLPG, indexed by slp_tree::vertex.  */
> !   auto_vec<slpg_vertex> m_vertices;
> ! 
> !   /* The list of all leaves of M_SLPG, such as external definitions,
> !      constants, and loads.  */
> !   auto_vec<int> m_leafs;
> ! 
> !   /* This array has one entry for every vector layout that we're considering.
> !      Element 0 is null and indicates "no change".  Other entries describe
> !      permutes that are inherent in the current graph and that we would
> !      like to reverse if possible.
> ! 
> !      For example, a permute { 1, 2, 3, 0 } means that something has
> !      effectively been permuted in that way, such as a load group
> !      { a[1], a[2], a[3], a[0] } (viewed as a permutation of a[0:3]).
> !      We'd then like to apply the reverse permute { 3, 0, 1, 2 }
> !      in order to put things "back" in order.  */
> !   auto_vec<vec<unsigned> > m_perms;
> ! 
> !   /* A partitioning of the nodes for which a layout must be chosen.
> !      Each partition represents an <SCC, cfg loop> pair; that is,
> !      nodes in different SCCs belong to different partitions, and nodes
> !      within an SCC are further partitioned according to the containing
> !      cfg loop.  Partition <SCC1, L1> comes before <SCC2, L2> if:
> ! 
> !      - SCC1 != SCC2 and SCC1 is a predecessor of SCC2 in a forward walk
> !        from leaves (such as loads) to roots (such as stores).
> ! 
> !      - SCC1 == SCC2 and L1's header strictly dominates L2's header.  */
> !   auto_vec<slpg_partition_info> m_partitions;
> ! 
> !   /* The list of all nodes for which a layout must be chosen.  Nodes for
> !      partition P come before the nodes for partition P+1.  Nodes within a
> !      partition are in reverse postorder.  */
> !   auto_vec<unsigned int> m_partitioned_nodes;
> ! 
> !   /* Index P * num-layouts + L contains the cost of using layout L
> !      for partition P.  */
> !   auto_vec<slpg_partition_layout_costs> m_partition_layout_costs;
> ! 
> !   /* Index N * num-layouts + L, if nonnull, is a node that provides the
> !      original output of node N adjusted to have layout L.  */
> !   auto_vec<slp_tree> m_node_layouts;
>   };
>   
>   /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
>   
> ! void
> ! vect_optimize_slp_pass::build_vertices (hash_set<slp_tree> &visited,
> !                                     slp_tree node)
>   {
>     unsigned i;
>     slp_tree child;
> ***************
> *** 3561,3568 ****
>     if (visited.add (node))
>       return;
>   
> !   node->vertex = vertices.length ();
> !   vertices.safe_push (slpg_vertex (node));
>   
>     bool leaf = true;
>     bool force_leaf = false;
> --- 3996,4003 ----
>     if (visited.add (node))
>       return;
>   
> !   node->vertex = m_vertices.length ();
> !   m_vertices.safe_push (slpg_vertex (node));
>   
>     bool leaf = true;
>     bool force_leaf = false;
> ***************
> *** 3570,3576 ****
>       if (child)
>         {
>       leaf = false;
> !     vect_slp_build_vertices (visited, child, vertices, leafs);
>         }
>       else
>         force_leaf = true;
> --- 4005,4011 ----
>       if (child)
>         {
>       leaf = false;
> !     build_vertices (visited, child);
>         }
>       else
>         force_leaf = true;
> ***************
> *** 3580,3600 ****
>        and inductions.  Force those SLP PHIs to act as leafs to make them
>        backwards reachable.  */
>     if (leaf || force_leaf)
> !     leafs.safe_push (node->vertex);
>   }
>   
>   /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
>   
> ! static void
> ! vect_slp_build_vertices (vec_info *info, vec<slpg_vertex> &vertices,
> !                      vec<int> &leafs)
>   {
>     hash_set<slp_tree> visited;
>     unsigned i;
>     slp_instance instance;
> !   FOR_EACH_VEC_ELT (info->slp_instances, i, instance)
> !     vect_slp_build_vertices (visited, SLP_INSTANCE_TREE (instance), vertices,
> !                          leafs);
>   }
>   
>   /* Apply (reverse) bijective PERM to VEC.  */
> --- 4015,4033 ----
>        and inductions.  Force those SLP PHIs to act as leafs to make them
>        backwards reachable.  */
>     if (leaf || force_leaf)
> !     m_leafs.safe_push (node->vertex);
>   }
>   
>   /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
>   
> ! void
> ! vect_optimize_slp_pass::build_vertices ()
>   {
>     hash_set<slp_tree> visited;
>     unsigned i;
>     slp_instance instance;
> !   FOR_EACH_VEC_ELT (m_vinfo->slp_instances, i, instance)
> !     build_vertices (visited, SLP_INSTANCE_TREE (instance));
>   }
>   
>   /* Apply (reverse) bijective PERM to VEC.  */
> ***************
> *** 3625,3699 ****
>       }
>   }
>   
> ! /* Return whether permutations PERM_A and PERM_B as recorded in the
> !    PERMS vector are equal.  */
>   
> ! static bool
> ! vect_slp_perms_eq (const vec<vec<unsigned> > &perms,
> !                int perm_a, int perm_b)
>   {
> !   return (perm_a == perm_b
> !       || (perm_a != -1 && perm_b != -1
> !           && perms[perm_a].length () == perms[perm_b].length ()
> !           && memcmp (&perms[perm_a][0], &perms[perm_b][0],
> !                      sizeof (unsigned) * perms[perm_a].length ()) == 0));
>   }
>   
> ! /* Optimize the SLP graph of VINFO.  */
>   
> ! void
> ! vect_optimize_slp (vec_info *vinfo)
>   {
> !   if (vinfo->slp_instances.is_empty ())
> !     return;
>   
> !   slp_tree node;
> !   unsigned i;
> !   auto_vec<slpg_vertex> vertices;
> !   auto_vec<int> leafs;
> !   vect_slp_build_vertices (vinfo, vertices, leafs);
>   
> !   struct graph *slpg = new_graph (vertices.length ());
> !   for (slpg_vertex &v : vertices)
>       for (slp_tree child : SLP_TREE_CHILDREN (v.node))
>         if (child)
> !     add_edge (slpg, v.node->vertex, child->vertex);
>   
> !   /* Compute (reverse) postorder on the inverted graph.  */
> !   auto_vec<int> ipo;
> !   graphds_dfs (slpg, &leafs[0], leafs.length (), &ipo, false, NULL, NULL);
> ! 
> !   auto_vec<vec<unsigned> > perms;
> !   perms.safe_push (vNULL); /* zero is no permute */
> ! 
> !   /* Produce initial permutations.  */
> !   for (i = 0; i < leafs.length (); ++i)
> !     {
> !       int idx = leafs[i];
> !       slp_tree node = vertices[idx].node;
> ! 
> !       /* Handle externals and constants optimistically throughout the
> !      iteration.  But treat existing vectors as fixed since we
> !      do not handle permuting them below.  */
> !       if ((SLP_TREE_DEF_TYPE (node) == vect_external_def
> !        && !SLP_TREE_VEC_DEFS (node).exists ())
> !       || SLP_TREE_DEF_TYPE (node) == vect_constant_def)
> !     continue;
>   
> !       /* Leafs do not change across iterations.  Note leafs also double
> !      as entries to the reverse graph.  */
> !       if (!slpg->vertices[idx].succ)
>       {
> !       vertices[idx].perm_in = 0;
> !       vertices[idx].perm_out = 0;
>       }
>   
>         /* Loads are the only thing generating permutes.  */
>         if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
>       continue;
>   
>         /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the
> !      node unpermuted, record this permute.  */
>         stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
>         if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
>       continue;
> --- 4058,4416 ----
>       }
>   }
>   
> ! /* Return the cfg loop that contains NODE.  */
>   
> ! struct loop *
> ! vect_optimize_slp_pass::containing_loop (slp_tree node)
>   {
> !   stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node);
> !   if (!rep)
> !     return ENTRY_BLOCK_PTR_FOR_FN (cfun)->loop_father;
> !   return gimple_bb (vect_orig_stmt (rep)->stmt)->loop_father;
>   }
>   
> ! /* Return true if UD (an edge from a use to a definition) is associated
> !    with a loop latch edge in the cfg.  */
>   
> ! bool
> ! vect_optimize_slp_pass::is_cfg_latch_edge (graph_edge *ud)
>   {
> !   slp_tree use = m_vertices[ud->src].node;
> !   slp_tree def = m_vertices[ud->dest].node;
> !   if (SLP_TREE_DEF_TYPE (use) != vect_internal_def
> !       || SLP_TREE_DEF_TYPE (def) != vect_internal_def)
> !     return false;
>   
> !   stmt_vec_info use_rep = vect_orig_stmt (SLP_TREE_REPRESENTATIVE (use));
> !   return (is_a<gphi *> (use_rep->stmt)
> !       && bb_loop_header_p (gimple_bb (use_rep->stmt))
> !       && containing_loop (def) == containing_loop (use));
> ! }
> ! 
> ! /* Build the graph.  Mark edges that correspond to cfg loop latch edges with
> !    a nonnull data field.  */
>   
> ! void
> ! vect_optimize_slp_pass::build_graph ()
> ! {
> !   build_vertices ();
> ! 
> !   m_slpg = new_graph (m_vertices.length ());
> !   for (slpg_vertex &v : m_vertices)
>       for (slp_tree child : SLP_TREE_CHILDREN (v.node))
>         if (child)
> !     {
> !       graph_edge *ud = add_edge (m_slpg, v.node->vertex, child->vertex);
> !       if (is_cfg_latch_edge (ud))
> !         ud->data = this;
> !     }
> ! }
>   
> ! /* Return true if E corresponds to a loop latch edge in the cfg.  */
>   
> ! static bool
> ! skip_cfg_latch_edges (graph_edge *e)
> ! {
> !   return e->data;
> ! }
> ! 
> ! /* Create the node partitions.  */
> ! 
> ! void
> ! vect_optimize_slp_pass::create_partitions ()
> ! {
> !   /* Calculate a postorder of the graph, ignoring edges that correspond
> !      to natural latch edges in the cfg.  Reading the vector from the
> !      end to the beginning gives the reverse postorder.  */
> !   auto_vec<int> initial_rpo;
> !   graphds_dfs (m_slpg, &m_leafs[0], m_leafs.length (), &initial_rpo,
> !            false, NULL, skip_cfg_latch_edges);
> !   gcc_assert (initial_rpo.length () == m_vertices.length ());
> ! 
> !   /* Calculate the strongly connected components of the graph.  */
> !   auto_vec<int> scc_grouping;
> !   unsigned int num_sccs = graphds_scc (m_slpg, NULL, NULL, &scc_grouping);
> ! 
> !   /* Create a new index order in which all nodes from the same SCC are
> !      consecutive.  Use scc_pos to record the index of the first node in
> !      each SCC.  */
> !   auto_vec<unsigned int> scc_pos (num_sccs);
> !   int last_component = -1;
> !   unsigned int node_count = 0;
> !   for (unsigned int node_i : scc_grouping)
> !     {
> !       if (last_component != m_slpg->vertices[node_i].component)
> !     {
> !       last_component = m_slpg->vertices[node_i].component;
> !       gcc_assert (last_component == int (scc_pos.length ()));
> !       scc_pos.quick_push (node_count);
> !     }
> !       node_count += 1;
> !     }
> !   gcc_assert (node_count == initial_rpo.length ()
> !           && last_component + 1 == int (num_sccs));
> ! 
> !   /* Use m_partitioned_nodes to group nodes into SCC order, with the nodes
> !      inside each SCC following the RPO we calculated above.  The fact that
> !      we ignored natural latch edges when calculating the RPO should ensure
> !      that, for natural loop nests:
> ! 
> !      - the first node that we encounter in a cfg loop is the loop header phi
> !      - the loop header phis are in dominance order
> ! 
> !      Arranging for this is an optimization (see below) rather than a
> !      correctness issue.  Unnatural loops with a tangled mess of backedges
> !      will still work correctly, but might give poorer results.
> ! 
> !      Also update scc_pos so that it gives 1 + the index of the last node
> !      in the SCC.  */
> !   m_partitioned_nodes.safe_grow (node_count);
> !   for (unsigned int old_i = initial_rpo.length (); old_i-- > 0;)
> !     {
> !       unsigned int node_i = initial_rpo[old_i];
> !       unsigned int new_i = scc_pos[m_slpg->vertices[node_i].component]++;
> !       m_partitioned_nodes[new_i] = node_i;
> !     }
> ! 
> !   /* Partition each SCC based on the containing cfg loop.  The order we
> !      constructed above should ensure that, for natural cfg loops, we'll
> !      create sub-SCC partitions for outer loops before the corresponding
> !      sub-SCC partitions for inner loops.  Similarly, when one sibling
> !      loop A dominates another sibling loop B, we should create a sub-SCC
> !      partition for A before a sub-SCC partition for B.
> ! 
> !      As above, nothing depends for correctness on whether this achieves
> !      a natural nesting, but we should get better results when it does.  */
> !   m_partitions.reserve (m_vertices.length ());
> !   unsigned int next_partition_i = 0;
> !   hash_map<struct loop *, int> loop_partitions;
> !   unsigned int rpo_begin = 0;
> !   unsigned int num_partitioned_nodes = 0;
> !   for (unsigned int rpo_end : scc_pos)
> !     {
> !       loop_partitions.empty ();
> !       for (unsigned int rpo_i = rpo_begin; rpo_i < rpo_end; ++rpo_i)
> !     {
> !       /* Handle externals and constants optimistically throughout.
> !          But treat existing vectors as fixed since we do not handle
> !          permuting them.  */
> !       unsigned int node_i = m_partitioned_nodes[rpo_i];
> !       slp_tree node = m_vertices[node_i].node;
> !       if ((SLP_TREE_DEF_TYPE (node) == vect_external_def
> !            && !SLP_TREE_VEC_DEFS (node).exists ())
> !           || SLP_TREE_DEF_TYPE (node) == vect_constant_def)
> !         m_vertices[node_i].partition = -1;
> !       else
> !         {
> !           struct loop *loop = containing_loop (node);
> !           bool existed;
> !           auto &partition_i
> !             = loop_partitions.get_or_insert (loop, &existed);
> !           if (!existed)
> !             {
> !               partition_i = next_partition_i++;
> !               m_partitions.quick_push (slpg_partition_info ());
> !             }
> !           m_vertices[node_i].partition = partition_i;
> !           num_partitioned_nodes += 1;
> !           m_partitions[partition_i].node_end += 1;
> !         }
> !     }
> !       rpo_begin = rpo_end;
> !     }
> ! 
> !   /* Assign ranges of consecutive node indices to each partition,
> !      in partition order.  Start with node_end being the same as
> !      node_begin so that the next loop can use it as a counter.  */
> !   unsigned int node_begin = 0;
> !   for (auto &partition : m_partitions)
> !     {
> !       partition.node_begin = node_begin;
> !       node_begin += partition.node_end;
> !       partition.node_end = partition.node_begin;
> !     }
> !   gcc_assert (node_begin == num_partitioned_nodes);
> ! 
> !   /* Finally build the list of nodes in partition order.  */
> !   m_partitioned_nodes.truncate (num_partitioned_nodes);
> !   for (unsigned int node_i = 0; node_i < m_vertices.length (); ++node_i)
> !     {
> !       int partition_i = m_vertices[node_i].partition;
> !       if (partition_i >= 0)
>       {
> !       unsigned int order_i = m_partitions[partition_i].node_end++;
> !       m_partitioned_nodes[order_i] = node_i;
>       }
> +     }
> + }
> + 
> + /* Look for edges from earlier partitions into node NODE_I and edges from
> +    node NODE_I into later partitions.  Call:
> + 
> +       FN (ud, other_node_i, other_partition_i)
> + 
> +    for each such use-to-def edge ud, where other_node_i and other_partition_i
> +    are the node and partition at the other end of the edge.  */
> + 
> + template<typename T>
> + void
> + vect_optimize_slp_pass::for_each_partition_edge (unsigned int node_i, T fn)
> + {
> +   int partition_i = m_vertices[node_i].partition;
> +   for (graph_edge *pred = m_slpg->vertices[node_i].pred;
> +        pred; pred = pred->pred_next)
> +     {
> +       int src_partition_i = m_vertices[pred->src].partition;
> +       if (src_partition_i >= 0 && src_partition_i != partition_i)
> +     fn (pred, pred->src, src_partition_i);
> +     }
> +   for (graph_edge *succ = m_slpg->vertices[node_i].succ;
> +        succ; succ = succ->succ_next)
> +     {
> +       int dest_partition_i = m_vertices[succ->dest].partition;
> +       if (dest_partition_i >= 0 && dest_partition_i != partition_i)
> +     fn (succ, succ->dest, dest_partition_i);
> +     }
> + }
> + 
> + /* Return true if the target is able to change a vector from layout
> +    FROM_LAYOUT_I to layout TO_LAYOUT_I.  */
> + 
> + bool
> + vect_optimize_slp_pass::can_change_layouts (unsigned int from_layout_i,
> +                                         unsigned int to_layout_i)
> + {
> +   /* FIXME: test what the target can support.  */
> +   (void) from_layout_i;
> +   (void) to_layout_i;
> +   return true;
> + }
> + 
> + /* We already know that the target can perform permute m_perms[LAYOUT_I].
> +    In our terms this means that we know that the target can go from a
> +    layout with m_perms[LAYOUT_I] reversed (layout LAYOUT_I) to one with
> +    it applied (layout 0, the original layout of the SLP graph).  Return
> +    true if the target can also go in the opposite direction: from the
> +    original layout to one in which m_perms[LAYOUT_I] has been reversed.  */
> + 
> + bool
> + vect_optimize_slp_pass::layout_is_reversible (unsigned int layout_i)
> + {
> +   return can_change_layouts (0, layout_i);
> + }
> + 
> + /* Return the cost (in arbtirary units) of going from layout FROM_LAYOUT_I
> +    to layout TO_LAYOUT_I.  */
> + 
> + unsigned int
> + vect_optimize_slp_pass::layout_cost (unsigned int from_layout_i,
> +                                  unsigned int to_layout_i)
> + {
> +   if (from_layout_i == to_layout_i)
> +     return 0;
> + 
> +   if (can_change_layouts (from_layout_i, to_layout_i))
> +     return 1;
> + 
> +   /* Change to layout 0 and then out again.  */
> +   gcc_assert (layout_is_reversible (from_layout_i));
> +   return 2;
> + }
> + 
> + /* Return the costs of assigning layout LAYOUT_I to partition PARTITION_I.  */
> + 
> + inline slpg_partition_layout_costs &
> + vect_optimize_slp_pass::partition_layout_costs (unsigned int partition_i,
> +                                             unsigned int layout_i)
> + {
> +   return m_partition_layout_costs[partition_i * m_perms.length () + layout_i];
> + }
> + 
> + /* Return true if partition PARTITION_I is structurally allowed to absorb
> +    neighboring permutes.  */
> + 
> + bool
> + vect_optimize_slp_pass::
> + partition_can_absorb_permutes (unsigned int partition_i)
> + {
> +   auto &partition = m_partitions[partition_i];
> +   if (partition.node_end != partition.node_begin + 1)
> +     return false;
> + 
> +   slp_tree node = m_vertices[m_partitioned_nodes[partition.node_begin]].node;
> +   if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> +     return true;
> + 
> +   stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node);
> +   if (rep
> +       && STMT_VINFO_DATA_REF (rep)
> +       && DR_IS_READ (STMT_VINFO_DATA_REF (rep)))
> +     return true;
> + 
> +   return false;
> + }
> + 
> + /* Check whether NODE (which is a VEC_PERM_EXPR or a load) can absorb the
> +    permute that would be required to go from layout 0 to layout LAYOUT_I.
> +    Return the cost of the change if so, in the same arbitrary units as
> +    for layout_cost.  Return -1 otherwise.  */
> + 
> + int
> + vect_optimize_slp_pass::node_absorption_cost (slp_tree node,
> +                                           unsigned int layout_i)
> + {
> +   if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> +     /* FIXME: check whether NODE can handle cases in which both the inputs
> +        and outputs have layout LAYOUT_I.  Return -1 if not.  */
> +     return 0;
> + 
> +   auto &vertex = m_vertices[node->vertex];
> +   auto &partition = m_partitions[vertex.partition];
> +   if (partition.layout == int (layout_i))
> +     return 0;
> +   /* FIXME: check whether NODE can be changed from its current layout
> +      to layout LAYOUT_I.  Return -1 if not.  */
> +   return 1;
> + }
> + 
> + /* Decide which element layouts we should consider using.  Mark which
> +    partitions can potentially use which layouts and the local costs of
> +    each choice.  */
> + 
> + void
> + vect_optimize_slp_pass::start_choosing_layouts ()
> + {
> +   /* Used to assign unique permute indices.  */
> +   using perm_hash = unbounded_hashmap_traits<
> +     vec_free_hash_base<int_hash_base<unsigned>>,
> +     int_hash<int, -1, -2>
> +   >;
> +   hash_map<vec<unsigned>, int, perm_hash> layout_ids;
> + 
> +   /* Layout 0 is "no change".  */
> +   m_perms.safe_push (vNULL);
> + 
> +   /* Create layouts from existing permutations.  */
> +   /* FIXME: cap the maximum number of layouts, to avoid quadraticness.  */
> +   for (unsigned int node_i : m_leafs)
> +     {
> +       int partition_i = m_vertices[node_i].partition;
> +       if (partition_i < 0)
> +     continue;
> + 
> +       /* Leafs also double as entries to the reverse graph.  Allow the
> +      layout of those to be changed.  */
> +       if (!m_slpg->vertices[node_i].succ)
> +     m_partitions[partition_i].layout = 0;
>   
>         /* Loads are the only thing generating permutes.  */
> +       slp_tree node = m_vertices[node_i].node;
>         if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
>       continue;
>   
>         /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the
> !      node unpermuted, record a layout that reverses this permute.  */
> !       gcc_assert (m_partitions[partition_i].layout == 0);
>         stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
>         if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
>       continue;
> ***************
> *** 3708,3720 ****
>         if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
>           any_permute = true;
>       }
> -       /* If there's no permute no need to split one out.  */
> -       if (!any_permute)
> -     continue;
>         /* If the span doesn't match we'd disrupt VF computation, avoid
>        that for now.  */
>         if (imax - imin + 1 != SLP_TREE_LANES (node))
>       continue;
>   
>         /* For now only handle true permutes, like
>        vect_attempt_slp_rearrange_stmts did.  This allows us to be lazy
> --- 4425,4442 ----
>         if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
>           any_permute = true;
>       }
>         /* If the span doesn't match we'd disrupt VF computation, avoid
>        that for now.  */
>         if (imax - imin + 1 != SLP_TREE_LANES (node))
>       continue;
> +       /* If there's no permute no need to split one out.  In this case
> +      we can consider turning the load into a permuted load, if that
> +      turns out to be cheaper than alternatives.  */
> +       if (!any_permute)
> +     {
> +       m_partitions[partition_i].layout = -1;
> +       continue;
> +     }
>   
>         /* For now only handle true permutes, like
>        vect_attempt_slp_rearrange_stmts did.  This allows us to be lazy
> ***************
> *** 3735,3985 ****
>         perm.safe_grow (SLP_TREE_LANES (node), true);
>         for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
>       perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> !       perms.safe_push (perm);
> !       vertices[idx].perm_in = perms.length () - 1;
> !       vertices[idx].perm_out = perms.length () - 1;
>       }
>   
> !   /* In addition to the above we have to mark outgoing permutes facing
> !      non-reduction graph entries that are not represented as to be
> !      materialized.  */
> !   for (slp_instance instance : vinfo->slp_instances)
>       if (SLP_INSTANCE_KIND (instance) == slp_inst_kind_ctor)
>         {
> !     /* Just setting perm_out isn't enough for the propagation to
> !        pick this up.  */
> !     vertices[SLP_INSTANCE_TREE (instance)->vertex].perm_in = 0;
> !     vertices[SLP_INSTANCE_TREE (instance)->vertex].perm_out = 0;
>         }
>   
> !   /* Propagate permutes along the graph and compute materialization points.  */
> !   bool changed;
> !   bool do_materialization = false;
> !   unsigned iteration = 0;
> !   do
> !     {
> !       changed = false;
> !       ++iteration;
> ! 
> !       if (dump_enabled_p ())
> !     dump_printf_loc (MSG_NOTE, vect_location,
> !                      "SLP optimize iteration %d\n", iteration);
>   
> !       for (i = vertices.length (); i > 0 ; --i)
>       {
> !       int idx = ipo[i-1];
> !       slp_tree node = vertices[idx].node;
>   
> !       /* Handle externals and constants optimistically throughout the
> !          iteration.  */
> !       if (SLP_TREE_DEF_TYPE (node) == vect_external_def
> !           || SLP_TREE_DEF_TYPE (node) == vect_constant_def)
> !         continue;
>   
> !       /* We still eventually have failed backedge SLP nodes in the
> !          graph, those are only cancelled when analyzing operations.
> !          Simply treat them as transparent ops, propagating permutes
> !          through them.  */
> !       if (SLP_TREE_DEF_TYPE (node) == vect_internal_def)
> !         {
> !           /* We do not handle stores with a permutation, so all
> !              incoming permutes must have been materialized.  */
> !           stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node);
> !           if (STMT_VINFO_DATA_REF (rep)
> !               && DR_IS_WRITE (STMT_VINFO_DATA_REF (rep)))
> !             {
> !               /* ???  We're forcing materialization in place
> !                  of the child here, we'd need special handling
> !                  in materialization to leave perm_in -1 here.  */
> !               vertices[idx].perm_in = 0;
> !               vertices[idx].perm_out = 0;
> !             }
> !           /* We cannot move a permute across an operation that is
> !              not independent on lanes.  Note this is an explicit
> !              negative list since that's much shorter than the respective
> !              positive one but it's critical to keep maintaining it.  */
> !           if (is_gimple_call (STMT_VINFO_STMT (rep)))
> !             switch (gimple_call_combined_fn (STMT_VINFO_STMT (rep)))
> !               {
> !               case CFN_COMPLEX_ADD_ROT90:
> !               case CFN_COMPLEX_ADD_ROT270:
> !               case CFN_COMPLEX_MUL:
> !               case CFN_COMPLEX_MUL_CONJ:
> !               case CFN_VEC_ADDSUB:
> !               case CFN_VEC_FMADDSUB:
> !               case CFN_VEC_FMSUBADD:
> !                 vertices[idx].perm_in = 0;
> !                 vertices[idx].perm_out = 0;
> !               default:;
> !               }
>           }
>   
> !       if (!slpg->vertices[idx].succ)
> !         /* Pick up pre-computed leaf values.  */
> !         ;
> !       else
>           {
> !           bool any_succ_perm_out_m1 = false;
> !           int perm_in = vertices[idx].perm_in;
> !           for (graph_edge *succ = slpg->vertices[idx].succ;
> !                succ; succ = succ->succ_next)
>               {
> !               int succ_idx = succ->dest;
> !               int succ_perm = vertices[succ_idx].perm_out;
> !               /* Handle unvisited (and constant) nodes optimistically.  */
> !               /* ???  But for constants once we want to handle
> !                  non-bijective permutes we have to verify the permute,
> !                  when unifying lanes, will not unify different constants.
> !                  For example see gcc.dg/vect/bb-slp-14.c for a case
> !                  that would break.  */
> !               if (succ_perm == -1)
> !                 {
> !                   /* When we handled a non-leaf optimistically, note
> !                      that so we can adjust its outgoing permute below.  */
> !                   slp_tree succ_node = vertices[succ_idx].node;
> !                   if (SLP_TREE_DEF_TYPE (succ_node) != vect_external_def
> !                       && SLP_TREE_DEF_TYPE (succ_node) != vect_constant_def)
> !                     any_succ_perm_out_m1 = true;
> !                   continue;
> !                 }
> !               if (perm_in == -1)
> !                 perm_in = succ_perm;
> !               else if (succ_perm == 0
> !                        || !vect_slp_perms_eq (perms, perm_in, succ_perm))
>                   {
> !                   perm_in = 0;
> !                   break;
>                   }
>               }
>   
> !           /* Adjust any incoming permutes we treated optimistically.  */
> !           if (perm_in != -1 && any_succ_perm_out_m1)
>               {
> !               for (graph_edge *succ = slpg->vertices[idx].succ;
> !                    succ; succ = succ->succ_next)
>                   {
> !                   slp_tree succ_node = vertices[succ->dest].node;
> !                   if (vertices[succ->dest].perm_out == -1
> !                       && SLP_TREE_DEF_TYPE (succ_node) != vect_external_def
> !                       && SLP_TREE_DEF_TYPE (succ_node) != vect_constant_def)
> !                     {
> !                       vertices[succ->dest].perm_out = perm_in;
> !                       /* And ensure this propagates.  */
> !                       if (vertices[succ->dest].perm_in == -1)
> !                         vertices[succ->dest].perm_in = perm_in;
> !                     }
>                   }
> !               changed = true;
> !             }
>   
> !           if (!vect_slp_perms_eq (perms, perm_in,
> !                                   vertices[idx].perm_in))
> !             {
> !               /* Make sure we eventually converge.  */
> !               gcc_checking_assert (vertices[idx].perm_in == -1
> !                                    || perm_in == 0);
> !               vertices[idx].perm_in = perm_in;
> ! 
> !               /* While we can handle VEC_PERM nodes as transparent
> !                  pass-through they can be a cheap materialization
> !                  point as well.  In addition they can act as source
> !                  of a random permutation as well.
> !                  The following ensures that former materialization
> !                  points that now have zero incoming permutes no
> !                  longer appear as such and that former "any" permutes
> !                  get pass-through.  We keep VEC_PERM nodes optimistic
> !                  as "any" outgoing permute though.  */
> !               if (vertices[idx].perm_out != 0
> !                   && SLP_TREE_CODE (node) != VEC_PERM_EXPR)
> !                 vertices[idx].perm_out = perm_in;
> !               changed = true;
> !             }
>           }
>   
> !       /* Elide pruning at materialization points in the first
> !          iteration phase.  */
> !       if (!do_materialization)
> !         continue;
>   
> !       int perm = vertices[idx].perm_out;
> !       if (perm == 0 || perm == -1)
>           continue;
>   
> !       /* Decide on permute materialization.  Look whether there's
> !          a use (pred) edge that is permuted differently than us.
> !          In that case mark ourselves so the permutation is applied.  */
> !       bool all_preds_permuted = slpg->vertices[idx].pred != NULL;
> !       if (all_preds_permuted)
> !         for (graph_edge *pred = slpg->vertices[idx].pred;
> !              pred; pred = pred->pred_next)
> !           {
> !             int pred_perm = vertices[pred->src].perm_in;
> !             gcc_checking_assert (pred_perm != -1);
> !             if (!vect_slp_perms_eq (perms, perm, pred_perm))
> !               {
> !                 all_preds_permuted = false;
> !                 break;
> !               }
> !           }
> !       if (!all_preds_permuted)
> !         {
> !           vertices[idx].perm_out = 0;
> !           changed = true;
>           }
> -     }
>   
> !       /* If the initial propagation converged, switch on materialization
> !      and re-propagate.  */
> !       if (!changed && !do_materialization)
>       {
> !       do_materialization = true;
> !       changed = true;
>       }
>       }
> !   while (changed);
> !   statistics_histogram_event (cfun, "SLP optimize perm iterations", iteration);
>   
> !   /* Materialize.  */
> !   for (i = 0; i < vertices.length (); ++i)
>       {
> !       int perm_in = vertices[i].perm_in;
> !       slp_tree node = vertices[i].node;
>   
> !       /* First permute invariant/external original successors, we handle
> !      those optimistically during propagation and duplicate them if
> !      they are used with different permutations.  */
> !       unsigned j;
> !       slp_tree child;
> !       if (perm_in > 0)
> !     FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), j, child)
> !       {
> !         if (!child
> !             || (SLP_TREE_DEF_TYPE (child) != vect_constant_def
> !                 && SLP_TREE_DEF_TYPE (child) != vect_external_def))
> !           continue;
>   
> !         /* If the vector is uniform there's nothing to do.  */
> !         if (vect_slp_tree_uniform_p (child))
> !           continue;
>   
> !         /* We can end up sharing some externals via two_operator
> !            handling.  Be prepared to unshare those.  */
> !         if (child->refcnt != 1)
> !           {
> !             gcc_assert (slpg->vertices[child->vertex].pred->pred_next);
> !             SLP_TREE_CHILDREN (node)[j] = child
> !               = vect_create_new_slp_node
> !                   (SLP_TREE_SCALAR_OPS (child).copy ());
> !           }
> !         vect_slp_permute (perms[perm_in],
> !                           SLP_TREE_SCALAR_OPS (child), true);
> !       }
>   
>         if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
>       {
>         /* Apply the common permutes to the input vectors.  */
> !       if (perm_in > 0)
> !         {
>             /* If the node is already a permute node we can apply
>                the permutation to the lane selection, effectively
>                materializing it on the incoming vectors.  */
> --- 4457,5092 ----
>         perm.safe_grow (SLP_TREE_LANES (node), true);
>         for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
>       perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> !       bool existed;
> !       int &layout_i = layout_ids.get_or_insert (perm, &existed);
> !       if (existed)
> !     perm.release ();
> !       else
> !     {
> !       layout_i = m_perms.length ();
> !       m_perms.safe_push (perm);
> !     }
> !       m_partitions[partition_i].layout = layout_i;
>       }
>   
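So a load permutation is biased by IMIN before being looked up, which is
what lets equal permutes from different interleaving groups share one
layout: loads of { b[5], b[4], b[7], b[6] } and { c[1], c[0], c[3], c[2] }
both canonicalize to { 1, 0, 3, 2 } and map to the same m_perms slot.
A standalone sketch of the dedup (std::map instead of the real hash
traits, names invented):

  #include <algorithm>
  #include <cstdio>
  #include <map>
  #include <vector>

  int
  main ()
  {
    /* Layout 0 is "no change"; later slots hold canonical permutes.  */
    std::vector<std::vector<unsigned int>> perms { {} };
    std::map<std::vector<unsigned int>, int> layout_ids;

    auto add_load = [&] (std::vector<unsigned int> perm)
      {
        unsigned int imin = *std::min_element (perm.begin (), perm.end ());
        for (auto &idx : perm)
          idx -= imin;
        auto ins = layout_ids.emplace (perm, (int) perms.size ());
        if (ins.second)
          perms.push_back (perm);
        return ins.first->second;
      };

    std::printf ("%d\n", add_load ({ 5, 4, 7, 6 }));  /* layout 1 */
    std::printf ("%d\n", add_load ({ 1, 0, 3, 2 }));  /* layout 1 again */
    return 0;
  }
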
> !   /* Initially assume that every layout is possible and has zero cost
> !      in every partition.  */
> !   m_partition_layout_costs.safe_grow_cleared (m_partitions.length ()
> !                                           * m_perms.length ());
> ! 
> !   /* We have to mark outgoing permutes facing non-reduction graph
> !      entries that are not represented as to be materialized.  */
> !   for (slp_instance instance : m_vinfo->slp_instances)
>       if (SLP_INSTANCE_KIND (instance) == slp_inst_kind_ctor)
>         {
> !     unsigned int node_i = SLP_INSTANCE_TREE (instance)->vertex;
> !     m_partitions[m_vertices[node_i].partition].layout = 0;
>         }
>   
> !   /* Check which layouts each node and partition can handle.  Calculate the
> !      weights associated with inserting layout changes on edges.  */
> !   m_optimize_size = true;
> !   for (unsigned int node_i : m_partitioned_nodes)
> !     {
> !       auto &vertex = m_vertices[node_i];
> !       unsigned int partition_i = vertex.partition;
> !       auto &partition = m_partitions[partition_i];
> !       slp_tree node = vertex.node;
> ! 
> !       if (stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node))
> !     {
> !       basic_block bb = gimple_bb (vect_orig_stmt (rep)->stmt);
> !       if (optimize_bb_for_speed_p (bb))
> !         m_optimize_size = false;
> ! 
> !       vertex.weight = vect_slp_node_weight (node);
> ! 
> !       /* We do not handle stores with a permutation, so all
> !          incoming permutes must have been materialized.  */
> !       if (STMT_VINFO_DATA_REF (rep)
> !           && DR_IS_WRITE (STMT_VINFO_DATA_REF (rep)))
> !         /* ???  We're forcing materialization in place
> !            of the child here, we'd need special handling
> !            in materialization to leave layout -1 here.  */
> !         partition.layout = 0;
> ! 
> !       /* We cannot change the layout of an operation that does not
> !          operate on each lane independently.  Note this is an explicit
> !          negative list since that's much shorter than the respective
> !          positive one but it's critical to keep maintaining it.  */
> !       if (is_gimple_call (STMT_VINFO_STMT (rep)))
> !         switch (gimple_call_combined_fn (STMT_VINFO_STMT (rep)))
> !           {
> !           case CFN_COMPLEX_ADD_ROT90:
> !           case CFN_COMPLEX_ADD_ROT270:
> !           case CFN_COMPLEX_MUL:
> !           case CFN_COMPLEX_MUL_CONJ:
> !           case CFN_VEC_ADDSUB:
> !           case CFN_VEC_FMADDSUB:
> !           case CFN_VEC_FMSUBADD:
> !             partition.layout = 0;
> !           default:;
> !           }
> !     }
>   
> !       if (partition.layout == 0)
>       {
> !       /* The node must stay in its current order.  */
> !       for (unsigned int layout_i = 1; layout_i < m_perms.length ();
> !            ++layout_i)
> !         partition_layout_costs (partition_i, layout_i).mark_impossible ();
> !     }
> !       else if (partition_can_absorb_permutes (partition_i))
> !     /* Calculate the cost of converting NODE to use each permute.
> !        If PARTITION.LAYOUT > 0 then that layout's permute will be free.  */
> !     for (unsigned int layout_i = 0; layout_i < m_perms.length ();
> !          ++layout_i)
> !       {
> !         auto &layout_costs = partition_layout_costs (partition_i,
> !                                                      layout_i);
> !         int factor = node_absorption_cost (node, layout_i);
> !         if (factor < 0)
> !           {
> !             layout_costs.mark_impossible ();
> !             continue;
> !           }
> !         gcc_assert (layout_costs.is_possible ());
> !         layout_costs.in_cost = { vertex.weight * factor, m_optimize_size };
> !       }
>   
> !       auto process_edge = [&](graph_edge *ud, unsigned int node2_i,
> !                           unsigned int partition2_i)
> !     {
> !       /* Count the number of edges from earlier partitions and the number
> !          of edges to later partitions.  */
> !       if (partition2_i < partition_i)
> !         partition.in_degree += 1;
> !       else
> !         partition.out_degree += 1;
>   
> !       /* If the current node uses the result of NODE2_I, accumulate
> !          the effects of that.  */
> !       if (ud->src == int (node_i))
> !         {
> !           m_vertices[node2_i].out_weight += vertex.weight;
> !           m_vertices[node2_i].out_degree += 1;
>           }
> +     };
> +       for_each_partition_edge (node_i, process_edge);
> +     }
>   
> !   /* Layout 0 represents the original order and layout LAYOUT_I represents
> !      the effect of reversing m_perms[LAYOUT_I].  If a predecessor partition
> !      does not support layout LAYOUT_I and if it also isn't possible to go
> !      from layout 0 to layout LAYOUT_I, then it is very difficult for a
> !      successor partition to support layout LAYOUT_I.  (The only alternative
> !      would be to go to layout LAYOUT_I from a different nonzero layout, but
> !      that's a niche case and not really worth supporting.)  */
> !   for (unsigned int layout_i = 1; layout_i < m_perms.length (); ++layout_i)
> !     {
> !       if (layout_is_reversible (layout_i))
> !     continue;
> ! 
> !       bool changed;
> !       do
> !     {
> !       changed = false;
> !       for (unsigned int node_i : m_partitioned_nodes)
>           {
> !           unsigned int partition_i = m_vertices[node_i].partition;
> !           if (partition_layout_costs (partition_i,
> !                                       layout_i).is_possible ())
> !             continue;
> !           for (graph_edge *pred = m_slpg->vertices[node_i].pred;
> !                pred; pred = pred->pred_next)
>               {
> !               unsigned int partition2_i = m_vertices[pred->src].partition;
> !               auto &costs = partition_layout_costs (partition2_i,
> !                                                     layout_i);
> !               if (costs.is_possible ())
>                   {
> !                   costs.mark_impossible ();
> !                   changed = true;
>                   }
>               }
> +         }
> +     }
> +       while (changed);
> +     }
> + }
> + 
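The pruning loop at the end is a monotone fixed point: once an
irreversible layout is impossible somewhere, every consumer transitively
loses it too, since a consumer could only recreate that layout by going
through layout 0.  A toy version on a three-partition def->use chain
(invented data, not part of the patch):

  #include <cstdio>
  #include <utility>
  #include <vector>

  int
  main ()
  {
    /* possible[P] says whether the irreversible layout under
       consideration is still possible in partition P.  */
    std::vector<bool> possible { false, true, true };
    std::vector<std::pair<int, int>> def_use { { 0, 1 }, { 1, 2 } };

    bool changed;
    do
      {
        changed = false;
        for (const auto &e : def_use)
          if (!possible[e.first] && possible[e.second])
            {
              /* The use cannot rebuild the layout from layout 0,
                 so it cannot keep it either.  */
              possible[e.second] = false;
              changed = true;
            }
      }
    while (changed);

    for (unsigned int p = 0; p < possible.size (); ++p)
      std::printf ("partition %u: %s\n", p,
                   possible[p] ? "possible" : "impossible");
    return 0;
  }
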
> + /* Return the cost of switching between layout LAYOUT1_I (at node NODE1_I)
> +    and layout LAYOUT2_I on cross-partition use-to-def edge UD.  */
> + 
> + slpg_layout_cost
> + vect_optimize_slp_pass::edge_layout_cost (graph_edge *ud,
> +                                       unsigned int node1_i,
> +                                       unsigned int layout1_i,
> +                                       unsigned int layout2_i)
> + {
> +   auto def_layout_i = ud->dest == int (node1_i) ? layout1_i : layout2_i;
> +   auto use_layout_i = ud->dest == int (node1_i) ? layout2_i : layout1_i;
> +   auto factor = layout_cost (def_layout_i, use_layout_i);
> + 
> +   /* We have a choice of putting the layout change at the site of the
> +      definition or at the site of the use.  Prefer the former when
> +      optimizing for size or when the execution frequency of the
> +      definition is no greater than the combined execution frequencies of
> +      the uses.  When putting the layout change at the site of the definition,
> +      divvy up the cost among all consumers.  */
> +   slpg_vertex &def_vertex = m_vertices[ud->dest];
> +   slpg_vertex &use_vertex = m_vertices[ud->src];
> +   if (m_optimize_size || def_vertex.weight <= def_vertex.out_weight)
> +     {
> +       slpg_layout_cost cost = { def_vertex.weight * factor, m_optimize_size };
> +       cost.split (def_vertex.out_degree);
> +       return cost;
> +     }
> +   return { use_vertex.weight * factor, m_optimize_size };
> + }
> + 
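In other words the change is charged to whichever side is globally
cheaper: the definition (with the cost split across its out-edges) when
its execution weight does not exceed the combined weight of its uses,
otherwise the individual use.  With invented numbers: a definition of
weight 10 with consumers of weights 4 and 8 has out_weight 12 >= 10, so
the change is charged at the definition and each of the two edges is
billed 5, even though the weight-4 use alone would have been cheaper for
its particular edge:

  #include <cstdio>

  int
  main ()
  {
    /* Invented frequencies and a unit-cost permute.  */
    double def_weight = 10, out_weight = 4 + 8;
    int out_degree = 2, factor = 1;
    bool optimize_size = false;

    bool prefer_def = optimize_size || def_weight <= out_weight;
    double per_edge = def_weight * factor / out_degree;
    if (prefer_def)
      std::printf ("charge the definition, %.1f per edge\n", per_edge);
    else
      std::printf ("charge each use its own weight\n");
    return 0;
  }
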
> + /* UD represents a use-def link between FROM_NODE_I and a node in a later
> +    partition; FROM_NODE_I could be the definition node or the use node.
> +    The node at the other end of the link wants to use layout TO_LAYOUT_I;
> +    return the cost of any necessary fix-ups on edge UD.
> + 
> +    At this point, FROM_NODE_I's partition has chosen the cheapest
> +    layout based on the information available so far, but this choice
> +    is only provisional.  */
> + 
> + slpg_layout_cost
> + vect_optimize_slp_pass::forward_cost (graph_edge *ud,
> +                                   unsigned int from_node_i,
> +                                   unsigned int to_layout_i)
> + {
> +   slpg_vertex &from_vertex = m_vertices[from_node_i];
> +   unsigned int from_partition_i = from_vertex.partition;
> +   slpg_partition_info &from_partition = m_partitions[from_partition_i];
> +   gcc_assert (from_partition.layout >= 0);
> + 
> +   /* First calculate the cost on the assumption that FROM_PARTITION sticks
> +      with its current layout preference.  */
> +   slpg_layout_cost cost
> +     = partition_layout_costs (from_partition_i,
> +                           from_partition.layout).in_cost;
> +   cost.split (from_partition.out_degree);
> +   cost.add_serial_cost (edge_layout_cost (ud, from_node_i,
> +                                       from_partition.layout, to_layout_i));
> + 
> +   /* Take the minimum of that cost and the cost that applies if
> +      FROM_PARTITION instead switches to TO_LAYOUT_I.  */
> +   auto &direct_layout_costs = partition_layout_costs (from_partition_i,
> +                                                   to_layout_i);
> +   if (direct_layout_costs.is_possible ())
> +     {
> +       slpg_layout_cost direct_cost = direct_layout_costs.in_cost;
> +       direct_cost.split (from_partition.out_degree);
> +       if (direct_cost.is_better_than (cost, m_optimize_size))
> +     cost = direct_cost;
> +     }
> + 
> +   /* If FROM_PARTITION is likely to be able to absorb the required outgoing
> +      permute, optimistically assume that that is what will happen.  */
> +   if (partition_can_absorb_permutes (from_partition_i))
> +     {
> +       int factor = node_absorption_cost (from_vertex.node, to_layout_i);
> +       if (factor >= 0)
> +     {
> +       sreal weight = from_vertex.weight * factor;
> +       slpg_layout_cost absorb_cost = { weight, m_optimize_size };
> +       if (absorb_cost.is_better_than (cost, m_optimize_size))
> +         cost = absorb_cost;
> +     }
> +     }
> + 
> +   return cost;
> + }
> + 
> + /* UD represents a use-def link between TO_NODE_I and a node in an earlier
> +    partition; TO_NODE_I could be the definition node or the use node.
> +    The node at the other end of the link wants to use layout FROM_LAYOUT_I;
> +    return the cost of any necessary fix-ups on edge UD.
> + 
> +    At this point, TO_NODE_I's partition has a fixed choice of layout.  */
> + 
> + slpg_layout_cost
> + vect_optimize_slp_pass::backward_cost (graph_edge *ud,
> +                                    unsigned int to_node_i,
> +                                    unsigned int from_layout_i)
> + {
> +   slpg_vertex &to_vertex = m_vertices[to_node_i];
> +   unsigned int to_partition_i = to_vertex.partition;
> +   slpg_partition_info &to_partition = m_partitions[to_partition_i];
> +   gcc_assert (to_partition.layout >= 0);
> + 
> +   slpg_layout_cost cost
> +     = partition_layout_costs (to_partition_i,
> +                           to_partition.layout).out_cost;
> +   cost.split (to_partition.in_degree);
> +   cost.add_serial_cost (edge_layout_cost (ud, to_node_i, to_partition.layout,
> +                                       from_layout_i));
> + 
> +   /* If TO_PARTITION is likely to be able to absorb the required input permute,
> +      optimistically assume that that is what will happen.  */
> +   if (partition_can_absorb_permutes (to_partition_i))
> +     {
> +       int factor = node_absorption_cost (to_vertex.node, from_layout_i);
> +       if (factor >= 0)
> +     {
> +       sreal weight = to_vertex.weight * factor;
> +       slpg_layout_cost absorb_cost = { weight, m_optimize_size };
> +       if (absorb_cost.is_better_than (cost, m_optimize_size))
> +         cost = absorb_cost;
> +     }
> +     }
> + 
> +   return cost;
> + }
>   
> ! /* Make a forward pass through the partitions, accumulating input costs.
> !    Make a tentative (provisional) choice of layout for each partition.  */
> ! 
> ! void
> ! vect_optimize_slp_pass::forward_pass ()
> ! {
> !   for (unsigned int partition_i = 0; partition_i < m_partitions.length ();
> !        ++partition_i)
> !     {
> !       auto &partition = m_partitions[partition_i];
> !       unsigned int min_layout_i = 0;
> !       slpg_layout_cost *min_layout_cost = nullptr;
> !       for (unsigned int layout_i = 0; layout_i < m_perms.length (); ++layout_i)
> !     {
> !       auto &layout_costs = partition_layout_costs (partition_i, layout_i);
> !       if (!layout_costs.is_possible ())
> !         continue;
> ! 
> !       /* Accumulate the costs from predecessor partitions.  */
> !       for (unsigned int order_i = partition.node_begin;
> !            order_i < partition.node_end; ++order_i)
> !         {
> !           unsigned int node_i = m_partitioned_nodes[order_i];
> !           auto add_cost = [&](graph_edge *ud, unsigned int from_node_i,
> !                               unsigned int from_partition_i)
>               {
> !               if (from_partition_i < partition_i)
>                   {
> !                   auto cost = forward_cost (ud, from_node_i, layout_i);
> !                   layout_costs.in_cost.add_parallel_cost (cost);
>                   }
> !             };
> !           for_each_partition_edge (node_i, add_cost);
> !         }
>   
> !       /* Record the layout with the lowest cost.  Prefer layout 0 in
> !          the event of a tie between it and another layout.  */
> !       if (!min_layout_cost
> !           || layout_costs.in_cost.is_better_than (*min_layout_cost,
> !                                                   m_optimize_size))
> !         {
> !           min_layout_i = layout_i;
> !           min_layout_cost = &layout_costs.in_cost;
>           }
> +     }
> +       gcc_assert (min_layout_cost);
> +       partition.layout = min_layout_i;
> +     }
> + }
>   
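One subtlety worth spelling out: in_cost accumulates the contributions
from the incoming edges as *parallel* costs, while a permute in series
with the partition itself is added *serially*.  Going by the
slpg_layout_cost definition earlier in the patch (depth tracks the
longest serial chain, total sums everything), the two combinators behave
like this standalone sketch (my reading, invented numbers):

  #include <algorithm>
  #include <cstdio>

  struct layout_cost
  {
    double depth, total;

    /* Independent inputs: chains run side by side.  */
    void add_parallel_cost (const layout_cost &c)
    {
      depth = std::max (depth, c.depth);
      total += c.total;
    }

    /* Dependent steps: chains concatenate.  */
    void add_serial_cost (const layout_cost &c)
    {
      depth += c.depth;
      total += c.total;
    }
  };

  int
  main ()
  {
    layout_cost in = { 0, 0 };
    in.add_parallel_cost ({ 2, 2 });  /* first incoming edge */
    in.add_parallel_cost ({ 3, 3 });  /* second incoming edge */
    in.add_serial_cost ({ 1, 1 });    /* permute on the node itself */
    std::printf ("depth %.1f total %.1f\n", in.depth, in.total); /* 4 6 */
    return 0;
  }
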
> ! /* Make a backward pass through the partitions, accumulating output costs.
> !    Make a final choice of layout for each partition.  */
>   
> ! void
> ! vect_optimize_slp_pass::backward_pass ()
> ! {
> !   for (unsigned int partition_i = m_partitions.length (); partition_i-- > 0;)
> !     {
> !       auto &partition = m_partitions[partition_i];
> !       unsigned int min_layout_i = 0;
> !       slpg_layout_cost *min_layout_cost = nullptr;
> !       unsigned int seen_min_layout_i = 0;
> !       slpg_layout_cost *seen_min_layout_cost = nullptr;
> !       for (unsigned int layout_i = 0; layout_i < m_perms.length (); ++layout_i)
> !     {
> !       auto &layout_costs = partition_layout_costs (partition_i, layout_i);
> !       if (!layout_costs.is_possible ())
>           continue;
>   
> !       /* Accumulate the costs from successor partitions.  Record whether
> !          any of them actually uses this layout. */
> !       bool seen = false;
> !       for (unsigned int order_i = partition.node_begin;
> !            order_i < partition.node_end; ++order_i)
> !         {
> !           unsigned int node_i = m_partitioned_nodes[order_i];
> !           auto add_cost = [&](graph_edge *ud, unsigned int to_node_i,
> !                               unsigned int to_partition_i)
> !             {
> !               if (to_partition_i > partition_i)
> !                 {
> !                   auto cost = backward_cost (ud, to_node_i, layout_i);
> !                   layout_costs.out_cost.add_parallel_cost (cost);
> !                   seen |= (m_partitions[to_partition_i].layout
> !                            == int (layout_i));
> !                 }
> !             };
> !           for_each_partition_edge (node_i, add_cost);
>           }
>   
> !       /* Locally combine the costs from the forward and backward passes.
> !          (This combined cost is not passed on, since that would lead
> !          to double counting.)  */
> !       layout_costs.cost = layout_costs.in_cost;
> !       layout_costs.cost.add_serial_cost (layout_costs.out_cost);
> ! 
> !       /* Record the layout with the lowest cost.  Prefer layout 0 in
> !          the event of a tie between it and another layout.  */
> !       if (!min_layout_cost
> !           || layout_costs.cost.is_better_than (*min_layout_cost,
> !                                                m_optimize_size))
> !         {
> !           min_layout_i = layout_i;
> !           min_layout_cost = &layout_costs.cost;
> !         }
> !       /* Likewise, but restrict the choice of layouts to those that
> !          are used by successor partitions.  */
> !       if (seen
> !           && (!seen_min_layout_cost
> !               || layout_costs.cost.is_better_than (*seen_min_layout_cost,
> !                                                    m_optimize_size)))
> !         {
> !           seen_min_layout_i = layout_i;
> !           seen_min_layout_cost = &layout_costs.cost;
> !         }
> !     }
> !       gcc_assert (min_layout_cost);
> ! 
> !       /* If the partition can absorb permutes, prefer a layout that has
> !      actually been used.  */
> !       if (seen_min_layout_cost
> !       && partition_can_absorb_permutes (partition_i))
> !     partition.layout = seen_min_layout_i;
> !       else
> !     partition.layout = min_layout_i;
> !     }
> ! }
> ! 
> ! /* Return a node that applies layout TO_LAYOUT_I to the original form of NODE.
> !    NODE already has the layout that was selected for its partition.  */
> ! 
> ! slp_tree
> ! vect_optimize_slp_pass::get_result_with_layout (slp_tree node,
> !                                             unsigned int to_layout_i)
> ! {
> !   unsigned int result_i = node->vertex * m_perms.length () + to_layout_i;
> !   slp_tree result = m_node_layouts[result_i];
> !   if (result)
> !     return result;
> ! 
> !   if (SLP_TREE_DEF_TYPE (node) == vect_constant_def
> !       || SLP_TREE_DEF_TYPE (node) == vect_external_def)
> !     {
> !       /* If the vector is uniform or unchanged, there's nothing to do.  */
> !       if (to_layout_i == 0 || vect_slp_tree_uniform_p (node))
> !     result = node;
> !       else
>       {
> !       auto scalar_ops = SLP_TREE_SCALAR_OPS (node).copy ();
> !       result = vect_create_new_slp_node (scalar_ops);
> !       vect_slp_permute (m_perms[to_layout_i], scalar_ops, true);
>       }
>       }
> !   else
> !     {
> !       unsigned int partition_i = m_vertices[node->vertex].partition;
> !       unsigned int from_layout_i = m_partitions[partition_i].layout;
> !       if (from_layout_i == to_layout_i)
> !     return node;
> ! 
> !       /* FIXME: if the result of a pre-existing VEC_PERM_EXPR is used
> !      twice, with two different layouts, then we could create
> !      parallel VEC_PERM_EXPRs rather than add a serial one.  */
> ! 
> !       /* FIXME: When calculating the costs, we considered the possibility
> !      of duplicating a VEC_PERM_EXPR at each use site, rather than
> !      having one at the definition site.  Do that here if it turns
> !      out to be cheaper.  */
> ! 
> !       if (dump_enabled_p ())
> !     dump_printf_loc (MSG_NOTE, vect_location,
> !                      "inserting permute node in place of %p\n",
> !                      node);
> ! 
> !       unsigned int num_lanes = SLP_TREE_LANES (node);
> !       result = vect_create_new_slp_node (1, VEC_PERM_EXPR);
> !       if (SLP_TREE_SCALAR_STMTS (node).length ())
> !     {
> !       auto &stmts = SLP_TREE_SCALAR_STMTS (result);
> !       stmts.safe_splice (SLP_TREE_SCALAR_STMTS (node));
> !       if (from_layout_i != 0)
> !         vect_slp_permute (m_perms[from_layout_i], stmts, false);
> !       if (to_layout_i != 0)
> !         vect_slp_permute (m_perms[to_layout_i], stmts, true);
> !     }
> !       SLP_TREE_REPRESENTATIVE (result) = SLP_TREE_REPRESENTATIVE (node);
> !       SLP_TREE_LANES (result) = num_lanes;
> !       SLP_TREE_VECTYPE (result) = SLP_TREE_VECTYPE (node);
> !       result->vertex = -1;
> ! 
> !       auto &lane_perm = SLP_TREE_LANE_PERMUTATION (result);
> !       lane_perm.create (num_lanes);
> !       for (unsigned j = 0; j < num_lanes; ++j)
> !     lane_perm.quick_push ({ 0, j });
> !       if (from_layout_i != 0)
> !     vect_slp_permute (m_perms[from_layout_i], lane_perm, false);
> !       if (to_layout_i != 0)
> !     vect_slp_permute (m_perms[to_layout_i], lane_perm, true);
> ! 
> !       node->refcnt++;
> !       SLP_TREE_CHILDREN (result).safe_push (node);
> !     }
> !   m_node_layouts[result_i] = result;
> !   return result;
> ! }
> ! 
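The lane_perm construction is just permutation composition: start from
the identity selection, undo FROM (so the input is back in the original
order) and then impose TO.  A standalone check of that algebra, under an
explicit lane convention that is my assumption (a vector in layout L
holds original element m_perms[L][i] in lane i; under the inverse
convention the roles of the inversions swap, but the composition idea is
identical):

  #include <cstdio>
  #include <vector>

  typedef std::vector<unsigned int> perm_t;

  /* result[i] = v[perm[i]].  */
  static perm_t
  apply (const perm_t &perm, const perm_t &v)
  {
    perm_t res (perm.size ());
    for (unsigned int i = 0; i < perm.size (); ++i)
      res[i] = v[perm[i]];
    return res;
  }

  /* Invert a bijective permutation.  */
  static perm_t
  invert (const perm_t &perm)
  {
    perm_t res (perm.size ());
    for (unsigned int i = 0; i < perm.size (); ++i)
      res[perm[i]] = i;
    return res;
  }

  int
  main ()
  {
    perm_t from { 1, 0, 3, 2 };       /* pair-swapped layout */
    perm_t to { 3, 2, 1, 0 };         /* fully reversed layout */
    perm_t orig { 10, 11, 12, 13 };

    perm_t in = apply (from, orig);   /* data as produced in FROM */
    perm_t net = apply (to, invert (from));
    for (unsigned int v : apply (net, in))
      std::printf ("%u ", v);         /* 13 12 11 10 == TO of orig */
    std::printf ("\n");
    return 0;
  }
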
> ! /* Print the partition graph and layout information to the dump file.  */
>   
> ! void
> ! vect_optimize_slp_pass::dump ()
> ! {
> !   dump_printf_loc (MSG_NOTE, vect_location,
> !                "SLP optimize permutes:\n");
> !   for (unsigned int layout_i = 1; layout_i < m_perms.length (); ++layout_i)
>       {
> !       dump_printf_loc (MSG_NOTE, vect_location, "  %d: { ", layout_i);
> !       const char *sep = "";
> !       for (unsigned int idx : m_perms[layout_i])
> !     {
> !       dump_printf (MSG_NOTE, "%s%d", sep, idx);
> !       sep = ", ";
> !     }
> !       dump_printf (MSG_NOTE, " }\n");
> !     }
> !   dump_printf_loc (MSG_NOTE, vect_location,
> !                "SLP optimize partitions:\n");
> !   for (unsigned int partition_i = 0; partition_i < m_partitions.length ();
> !        ++partition_i)
> !     {
> !       auto &partition = m_partitions[partition_i];
> !       dump_printf_loc (MSG_NOTE, vect_location,  "  -------------\n");
> !       dump_printf_loc (MSG_NOTE, vect_location,
> !                    "  partition %d (layout %d):\n",
> !                    partition_i, partition.layout);
> !       dump_printf_loc (MSG_NOTE, vect_location, "    nodes:\n");
> !       for (unsigned int order_i = partition.node_begin;
> !        order_i < partition.node_end; ++order_i)
> !     {
> !       unsigned int node_i = m_partitioned_nodes[order_i];
> !       slp_tree node = m_vertices[node_i].node;
> !       dump_printf_loc (MSG_NOTE, vect_location, "      - %p:\n",
> !                        (void *) m_vertices[node_i].node);
> !       dump_printf_loc (MSG_NOTE, vect_location,
> !                        "          weight: %f\n",
> !                        m_vertices[node_i].weight.to_double ());
> !       if (m_vertices[node_i].out_degree)
> !         dump_printf_loc (MSG_NOTE, vect_location,
> !                          "          out weight: %f (degree %d)\n",
> !                          m_vertices[node_i].out_weight.to_double (),
> !                          m_vertices[node_i].out_degree);
> !       if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> !         dump_printf_loc (MSG_NOTE, vect_location,
> !                          "          op: VEC_PERM_EXPR\n");
> !       else if (auto rep = SLP_TREE_REPRESENTATIVE (node))
> !         dump_printf_loc (MSG_NOTE, vect_location,
> !                          "          op template: %G", rep->stmt);
> !     }
> !       dump_printf_loc (MSG_NOTE, vect_location, "    edges:\n");
> !       for (unsigned int order_i = partition.node_begin;
> !        order_i < partition.node_end; ++order_i)
> !     {
> !       unsigned int node_i = m_partitioned_nodes[order_i];
> !       auto print_edge = [&](graph_edge *, unsigned int node2_i,
> !                             unsigned int partition2_i)
> !         {
> !           if (partition2_i < partition_i)
> !             dump_printf_loc (MSG_NOTE, vect_location,
> !                              "      - %p [%d] --> %p\n",
> !                              m_vertices[node2_i].node, partition2_i,
> !                              m_vertices[node_i].node);
> !           else
> !             dump_printf_loc (MSG_NOTE, vect_location,
> !                              "      - %p --> [%d] %p\n",
> !                              m_vertices[node_i].node, partition2_i,
> !                              m_vertices[node2_i].node);
> !         };
> !       for_each_partition_edge (node_i, print_edge);
> !     }
>   
> !       for (unsigned int layout_i = 0; layout_i < m_perms.length (); ++layout_i)
> !     {
> !       auto &layout_costs = partition_layout_costs (partition_i, layout_i);
> !       if (layout_costs.is_possible ())
> !         {
> !           dump_printf_loc (MSG_NOTE, vect_location,
> !                            "    layout %d:%s\n", layout_i,
> !                            partition.layout == int (layout_i)
> !                            ? " (*)" : "");
> ! #define TEMPLATE "{depth: %f, total: %f}"
> !           dump_printf_loc (MSG_NOTE, vect_location,
> !                            "        " TEMPLATE "\n",
> !                            layout_costs.in_cost.depth.to_double (),
> !                            layout_costs.in_cost.total.to_double ());
> !           dump_printf_loc (MSG_NOTE, vect_location,
> !                            "      + " TEMPLATE "\n",
> !                            layout_costs.out_cost.depth.to_double (),
> !                            layout_costs.out_cost.total.to_double ());
> !           dump_printf_loc (MSG_NOTE, vect_location,
> !                            "      = " TEMPLATE "\n",
> !                            layout_costs.cost.depth.to_double (),
> !                            layout_costs.cost.total.to_double ());
> ! #undef TEMPLATE
> !         }
> !       else
> !         dump_printf_loc (MSG_NOTE, vect_location,
> !                          "    layout %d: rejected\n", layout_i);
> !     }
> !     }
> ! }
>   
> ! /* Apply the chosen vector layouts to the SLP graph.  */
>   
> ! void
> ! vect_optimize_slp_pass::materialize ()
> ! {
> !   m_partition_layout_costs.release ();
> !   m_node_layouts.safe_grow_cleared (m_vertices.length () * m_perms.length ());
>   
> +   for (unsigned int node_i : m_partitioned_nodes)
> +     {
> +       /* Rearrange the scalar statements to match the chosen layout.  */
> +       slp_tree node = m_vertices[node_i].node;
> +       int layout_i = m_partitions[m_vertices[node_i].partition].layout;
> +       gcc_assert (layout_i >= 0);
> +       if (layout_i > 0)
> +     vect_slp_permute (m_perms[layout_i],
> +                       SLP_TREE_SCALAR_STMTS (node), true);
> + 
> +       /* Update load and lane permutes.  */
>         if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
>       {
>         /* Apply the common permutes to the input vectors.  */
> !       /* FIXME: check whether the target accepts this folding and
> !          handle it like a normal node if not.  That would include
> !          adjusting the follow-up loop below.  */
> !       unsigned int opno;
> !       slp_tree child;
> !       FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), opno, child)
> !         {
> !           unsigned int partition2_i = m_vertices[child->vertex].partition;
> !           int layout2_i = m_partitions[partition2_i].layout;
> !           if (layout2_i == 0)
> !             continue;
> ! 
>             /* If the node is already a permute node we can apply
>                the permutation to the lane selection, effectively
>                materializing it on the incoming vectors.  */
> ***************
> *** 3987,4141 ****
>               dump_printf_loc (MSG_NOTE, vect_location,
>                                "simplifying permute node %p\n",
>                                node);
> !           for (unsigned k = 0;
> !                k < SLP_TREE_LANE_PERMUTATION (node).length (); ++k)
> !             SLP_TREE_LANE_PERMUTATION (node)[k].second
> !               = perms[perm_in][SLP_TREE_LANE_PERMUTATION (node)[k].second];
> !         }
> !       /* Apply the anticipated output permute to the permute and
> !          stmt vectors.  */
> !       int perm_out = vertices[i].perm_out;
> !       if (perm_out > 0)
> !         {
> !           vect_slp_permute (perms[perm_out],
> !                             SLP_TREE_SCALAR_STMTS (node), true);
> !           vect_slp_permute (perms[perm_out],
> !                             SLP_TREE_LANE_PERMUTATION (node), true);
> !         }
> !     }
> !       else if (vertices[i].get_perm_materialized () != 0)
> !     {
> !       if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
> !         /* For loads simply drop the permutation, the load permutation
> !            already performs the desired permutation.  */
> !         ;
> !       else if (SLP_TREE_LANE_PERMUTATION (node).exists ())
> !         gcc_unreachable ();
> !       else
> !         {
> !           if (dump_enabled_p ())
> !             dump_printf_loc (MSG_NOTE, vect_location,
> !                              "inserting permute node in place of %p\n",
> !                              node);
> ! 
> !           /* Make a copy of NODE and in-place change it to a
> !              VEC_PERM node to permute the lanes of the copy.  */
> !           slp_tree copy = new _slp_tree;
> !           SLP_TREE_CHILDREN (copy) = SLP_TREE_CHILDREN (node);
> !           SLP_TREE_CHILDREN (node) = vNULL;
> !           SLP_TREE_SCALAR_STMTS (copy)
> !             = SLP_TREE_SCALAR_STMTS (node).copy ();
> !           vect_slp_permute (perms[perm_in],
> !                             SLP_TREE_SCALAR_STMTS (copy), true);
> !           gcc_assert (!SLP_TREE_SCALAR_OPS (node).exists ());
> !           SLP_TREE_REPRESENTATIVE (copy) = SLP_TREE_REPRESENTATIVE (node);
> !           gcc_assert (!SLP_TREE_LOAD_PERMUTATION (node).exists ());
> !           SLP_TREE_LANE_PERMUTATION (copy)
> !             = SLP_TREE_LANE_PERMUTATION (node);
> !           SLP_TREE_LANE_PERMUTATION (node) = vNULL;
> !           SLP_TREE_VECTYPE (copy) = SLP_TREE_VECTYPE (node);
> !           copy->refcnt = 1;
> !           copy->max_nunits = node->max_nunits;
> !           SLP_TREE_DEF_TYPE (copy) = SLP_TREE_DEF_TYPE (node);
> !           SLP_TREE_LANES (copy) = SLP_TREE_LANES (node);
> !           SLP_TREE_CODE (copy) = SLP_TREE_CODE (node);
> ! 
> !           /* Now turn NODE into a VEC_PERM.  */
> !           SLP_TREE_CHILDREN (node).safe_push (copy);
> !           SLP_TREE_LANE_PERMUTATION (node).create (SLP_TREE_LANES (node));
> !           for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> !             SLP_TREE_LANE_PERMUTATION (node)
> !               .quick_push (std::make_pair (0, perms[perm_in][j]));
> !           SLP_TREE_CODE (node) = VEC_PERM_EXPR;
> !         }
> !     }
> !       else if (perm_in > 0) /* perm_in == perm_out */
> !     {
> !       /* Apply the reverse permutation to our stmts.  */
> !       vect_slp_permute (perms[perm_in],
> !                         SLP_TREE_SCALAR_STMTS (node), true);
> !       /* And to the lane/load permutation, which we can simply
> !          make regular by design.  */
> !       if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
> !         {
> !           gcc_assert (!SLP_TREE_LANE_PERMUTATION (node).exists ());
> !           /* ???  When we handle non-bijective permutes the idea
> !              is that we can force the load-permutation to be
> !              { min, min + 1, min + 2, ... max }.  But then the
> !              scalar defs might no longer match the lane content
> !              which means wrong-code with live lane vectorization.
> !              So we possibly have to have NULL entries for those.  */
> !           vect_slp_permute (perms[perm_in],
> !                             SLP_TREE_LOAD_PERMUTATION (node), true);
> !         }
> !       else if (SLP_TREE_LANE_PERMUTATION (node).exists ())
> !         gcc_unreachable ();
>       }
>       }
>   
> !   /* Elide any permutations at BB reduction roots.  */
> !   if (is_a <bb_vec_info> (vinfo))
>       {
> !       for (slp_instance instance : vinfo->slp_instances)
>       {
> !       if (SLP_INSTANCE_KIND (instance) != slp_inst_kind_bb_reduc)
>           continue;
> !       slp_tree old = SLP_INSTANCE_TREE (instance);
> !       if (SLP_TREE_CODE (old) == VEC_PERM_EXPR
> !           && SLP_TREE_CHILDREN (old).length () == 1)
>           {
> !           slp_tree child = SLP_TREE_CHILDREN (old)[0];
> !           if (SLP_TREE_DEF_TYPE (child) == vect_external_def)
> !             {
> !               /* Preserve the special VEC_PERM we use to shield existing
> !                  vector defs from the rest.  But make it a no-op.  */
> !               unsigned i = 0;
> !               for (std::pair<unsigned, unsigned> &p
> !                    : SLP_TREE_LANE_PERMUTATION (old))
> !                 p.second = i++;
> !             }
> !           else
> !             {
> !               SLP_INSTANCE_TREE (instance) = child;
> !               SLP_TREE_REF_COUNT (child)++;
> !               vect_free_slp_tree (old);
> !             }
> !         }
> !       else if (SLP_TREE_LOAD_PERMUTATION (old).exists ()
> !                && SLP_TREE_REF_COUNT (old) == 1
> !                && vertices[old->vertex].get_perm_materialized () != 0)
> !         {
> !           /* ???  For loads the situation is more complex since
> !              we can't modify the permute in place in case the
> !              node is used multiple times.  In fact for loads this
> !              should be somehow handled in the propagation engine.  */
> !           /* Apply the reverse permutation to our stmts.  */
> !           int perm = vertices[old->vertex].get_perm_materialized ();
> !           vect_slp_permute (perms[perm],
> !                             SLP_TREE_SCALAR_STMTS (old), true);
> !           vect_slp_permute (perms[perm],
> !                             SLP_TREE_LOAD_PERMUTATION (old), true);
>           }
>       }
>       }
>   
> !   /* Free the perms vector used for propagation.  */
> !   while (!perms.is_empty ())
> !     perms.pop ().release ();
> !   free_graph (slpg);
> ! 
>   
> !   /* Now elide load permutations that are not necessary.  */
> !   for (i = 0; i < leafs.length (); ++i)
>       {
> !       node = vertices[leafs[i]].node;
>         if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
>       continue;
>   
>         /* In basic block vectorization we allow any subchain of an interleaving
>        chain.
>        FORNOW: not in loop SLP because of realignment complications.  */
> !       if (is_a <bb_vec_info> (vinfo))
>       {
>         bool subchain_p = true;
>         stmt_vec_info next_load_info = NULL;
> --- 5094,5173 ----
>               dump_printf_loc (MSG_NOTE, vect_location,
>                                "simplifying permute node %p\n",
>                                node);
> !           for (auto &entry : SLP_TREE_LANE_PERMUTATION (node))
> !             if (entry.first == opno)
> !               entry.second = m_perms[layout2_i][entry.second];
> !         }
> !       /* Reverse the permutation that we chose to reverse.  */
> !       if (layout_i > 0)
> !         vect_slp_permute (m_perms[layout_i],
> !                           SLP_TREE_LANE_PERMUTATION (node), true);
> !     }
> !       else
> !     {
> !       gcc_assert (!SLP_TREE_LANE_PERMUTATION (node).exists ());
> !       auto &load_perm = SLP_TREE_LOAD_PERMUTATION (node);
> !       if (layout_i > 0)
> !         /* ???  When we handle non-bijective permutes the idea
> !            is that we can force the load-permutation to be
> !            { min, min + 1, min + 2, ... max }.  But then the
> !            scalar defs might no longer match the lane content
> !            which means wrong-code with live lane vectorization.
> !            So we possibly have to have NULL entries for those.  */
> !         vect_slp_permute (m_perms[layout_i], load_perm, true);
>       }
>       }
>   
> !   /* Do this before any nodes disappear, since it involves a walk
> !      over the leaves.  */
> !   remove_redundant_permutes ();
> ! 
> !   /* Replace each child with a correctly laid-out version.  */
> !   for (unsigned int node_i : m_partitioned_nodes)
>       {
> !       slp_tree node = m_vertices[node_i].node;
> ! 
> !       /* The loop above adjusted permutes for each child's chosen layout,
> !      so we don't need to adjust them here.  */
> !       if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> !     continue;
> ! 
> !       int layout_i = m_partitions[m_vertices[node_i].partition].layout;
> !       gcc_assert (layout_i >= 0);
> !       unsigned j;
> !       slp_tree child;
> !       FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), j, child)
>       {
> !       if (!child)
>           continue;
> ! 
> !       slp_tree new_child = get_result_with_layout (child, layout_i);
> !       if (new_child != child)
>           {
> !           vect_free_slp_tree (child);
> !           SLP_TREE_CHILDREN (node)[j] = new_child;
> !           new_child->refcnt += 1;
>           }
>       }
>       }
> + }
>   
> ! /* Elide load permutations that are not necessary.  Such permutations might
> !    be pre-existing, rather than created by the layout optimizations.  */
>   
> ! void
> ! vect_optimize_slp_pass::remove_redundant_permutes ()
> ! {
> !   for (unsigned int node_i : m_leafs)
>       {
> !       slp_tree node = m_vertices[node_i].node;
>         if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
>       continue;
>   
>         /* In basic block vectorization we allow any subchain of an interleaving
>        chain.
>        FORNOW: not in loop SLP because of realignment complications.  */
> !       if (is_a <bb_vec_info> (m_vinfo))
>       {
>         bool subchain_p = true;
>         stmt_vec_info next_load_info = NULL;
> ***************
> *** 4160,4165 ****
> --- 5192,5198 ----
>       }
>         else
>       {
> +       loop_vec_info loop_vinfo = as_a<loop_vec_info> (m_vinfo);
>         stmt_vec_info load_info;
>         bool this_load_permuted = false;
>         unsigned j;
> ***************
> *** 4175,4182 ****
>             /* The load requires permutation when unrolling exposes
>                a gap either because the group is larger than the SLP
>                group-size or because there is a gap between the groups.  */
> !           && (known_eq (LOOP_VINFO_VECT_FACTOR (as_a <loop_vec_info> (vinfo)), 1U)
>                 || ((SLP_TREE_LANES (node) == DR_GROUP_SIZE (first_stmt_info))
>                     && DR_GROUP_GAP (first_stmt_info) == 0)))
>           {
> --- 5208,5214 ----
>             /* The load requires permutation when unrolling exposes
>                a gap either because the group is larger than the SLP
>                group-size or because there is a gap between the groups.  */
> !           && (known_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo), 1U)
>                 || ((SLP_TREE_LANES (node) == DR_GROUP_SIZE (first_stmt_info))
>                     && DR_GROUP_GAP (first_stmt_info) == 0)))
>           {
> ***************
> *** 4187,4192 ****
> --- 5219,5258 ----
>       }
>   }
>   
> + /* Main entry point for the optimization pass.  */
> + 
> + void
> + vect_optimize_slp_pass::run ()
> + {
> +   build_graph ();
> +   create_partitions ();
> +   start_choosing_layouts ();
> +   if (m_perms.length () > 1)
> +     {
> +       forward_pass ();
> +       backward_pass ();
> +       if (dump_enabled_p ())
> +     dump ();
> +       materialize ();
> +     }
> +   else
> +     remove_redundant_permutes ();
> +   /* Free the perms vector used for propagation.  */
> +   while (!m_perms.is_empty ())
> +     m_perms.pop ().release ();
> +   free_graph (m_slpg);
> + }
> + 
> + /* Optimize the SLP graph of VINFO.  */
> + 
> + void
> + vect_optimize_slp (vec_info *vinfo)
> + {
> +   if (vinfo->slp_instances.is_empty ())
> +     return;
> +   vect_optimize_slp_pass (vinfo).run ();
> + }
> + 
>   /* Gather loads reachable from the individual SLP graph entries.  */
>   
>   void
> ***************
> *** 6647,6653 ****
>   {
>     stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
>     int vec_index = 0;
> !   tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>     unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
>     unsigned int mask_element;
>     machine_mode mode;
> --- 7713,7719 ----
>   {
>     stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
>     int vec_index = 0;
> !   tree vectype = SLP_TREE_VECTYPE (node);
>     unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
>     unsigned int mask_element;
>     machine_mode mode;
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
