Ps. The code in comment is needed, but was commented out due to the early free 
issue described in the patch.

Thanks,
Tamar

> -----Original Message-----
> From: Tamar Christina <tamar.christ...@arm.com>
> Sent: Friday, February 19, 2021 2:42 PM
> To: gcc-patches@gcc.gnu.org
> Cc: nd <n...@arm.com>; rguent...@suse.de
> Subject: [PATCH] slp: fix sharing of SLP only patterns. (PR99149)
> 
> Hi Richi,
> 
> The attached testcase ICEs due to a couple of issues.
> In the testcase you have two SLP instances that share the majority of their
> definition with each other.  One tree defines a COMPLEX_MUL sequence
> and the other tree a COMPLEX_FMA.
> 
> The ice happens because:
> 
> 1. the refcounts are wrong, in particular the FMA case doesn't correctly count
> the references for the COMPLEX_MUL that it consumes.
> 
> 2. when the FMA is created it incorrectly assumes it can just tear apart the
> MUL node that it's consuming.  This is wrong and should only be done when
> there is no more uses of the node, in which case the vector only pattern is no
> longer relevant.
> 
> To fix the last part the SLP only pattern reset code was moved into
> vect_free_slp_tree which results in cleaner code.  I also think it does belong
> there since that function knows when there are no more uses of the node
> and so the pattern should be unmarked, so when the the vectorizer is
> inspecting the BB it doesn't find the now invalid vector only patterns.
> 
> This has the obvious problem in that, eventually after analysis is done, the
> entire SLP tree is dissolved before codegen.  Where we get into trouble as
> we have now dissolved the patterns too...
> 
> My initial thought was to add a parameter to vect_free_slp_tree, but I know
> you wouldn't like that.  So I am sending this patch up as an RFC.
> 
> PS. This testcase actually shows that the codegen we get in these cases is not
> optimal.  Currently this won't vectorize as the compiler thinks the vector
> version is too expensive.
> 
> My guess here is because the patterns now unshare the tree and it's likely
> costing the setup for the vector code twice?
> 
> Even with the shared code (without patterns, so same as GCC 10, or turning
> off the patterns) it won't vectorize it.
> 
> The scalar code is
> 
>         mov     w0, 0
>         ldr     x4, [x1, #:lo12:.LANCHOR0]
>         ldrsw   x2, [x3, 20]
>         ldr     x1, [x3, 8]
>         lsl     x2, x2, 3
>         ldrsw   x3, [x3, 16]
>         ldp     s3, s2, [x4]
>         add     x5, x1, x2
>         ldr     s0, [x1, x2]
>         lsl     x3, x3, 3
>         add     x4, x1, x3
>         fmul    s1, s2, s2
>         fnmsub  s1, s3, s3, s1
>         fmul    s0, s2, s0
>         fmadd   s0, s3, s2, s0
>         ldr     s3, [x1, x3]
>         ldr     s2, [x4, 4]
>         fadd    s3, s3, s1
>         fadd    s2, s0, s2
>         str     s3, [x1, x3]
>         str     s2, [x4, 4]
>         str     s1, [x1, x2]
>         str     s0, [x5, 4]
>         ret
> 
> and turning off the cost model we get
> 
>         movi    v1.2s, 0
>         mov     w0, 0
>         ldr     x4, [x1, #:lo12:.LANCHOR0]
>         ldrsw   x3, [x2, 16]
>         ldr     x1, [x2, 8]
>         ldrsw   x2, [x2, 20]
>         ldr     d0, [x4]
>         fcmla   v1.2s, v0.2s, v0.2s, #0
>         ldr     d2, [x1, x3, lsl 3]
>         fcmla   v2.2s, v0.2s, v0.2s, #0
>         fcmla   v2.2s, v0.2s, v0.2s, #90
>         str     d2, [x1, x3, lsl 3]
>         fcmla   v1.2s, v0.2s, v0.2s, #90
>         str     d1, [x1, x2, lsl 3]
> 
> however, if the pattern matcher doesn't create the FMA node because it
> would unshare the tree (which I think is a general heuristic that would work
> out to better code wouldn't it?) then we would get (with the cost model
> enabled even)
> 
>         movi    v0.2s, 0
>         mov     w0, 0
>         ldr     x4, [x1, #:lo12:.LANCHOR0]
>         ldrsw   x3, [x2, 16]
>         ldr     x1, [x2, 8]
>         ldr     d1, [x4]
>         fcmla   v0.2s, v1.2s, v1.2s, #0
>         fcmla   v0.2s, v1.2s, v1.2s, #90
>         ldrsw   x2, [x2, 20]
>         ldr     d1, [x1, x3, lsl 3]
>         fadd    v1.2s, v1.2s, v0.2s
>         str     d1, [x1, x3, lsl 3]
>         str     d0, [x1, x2, lsl 3]
>         ret
> 
> Which is the most optimal form.  So I think this should perhaps be handled in
> GCC 12 if there's a way to detect when you're going to unshare a sub-tree.
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>       PR tree-optimization/99149
>       * tree-vect-slp-patterns.c (vect_detect_pair_op): Don't recreate the
>       buffer.
>       (vect_slp_reset_pattern): Remove.
>       (complex_fma_pattern::matches): Remove call to
> vect_slp_reset_pattern.
>       (complex_mul_pattern::build, complex_fma_pattern::build,
>       complex_fms_pattern::build): Fix ref counts.
>       * tree-vect-slp.c (vect_free_slp_tree): Undo SLP only pattern
> relevancy
>       when node is being deleted.
>       * tree-vectorizer.c (vec_info::new_stmt_vec_info): Initialize value.
> 
> gcc/testsuite/ChangeLog:
> 
>       PR tree-optimization/99149
>       * gcc.dg/vect/pr99149.C: New test.
> 
> --- inline copy of patch --
> diff --git a/gcc/testsuite/gcc.dg/vect/pr99149.C
> b/gcc/testsuite/gcc.dg/vect/pr99149.C
> new file mode 100755
> index
> 0000000000000000000000000000000000000000..b12fe17e4ded148ce2bf67486
> e425dd65461a148
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/pr99149.C
> @@ -0,0 +1,25 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-w -O3 -march=armv8.3-a" { target {
> +aarch64*-*-* } } } */
> +
> +class a {
> +  float b;
> +  float c;
> +
> +public:
> +  a(float d, float e) : b(d), c(e) {}
> +  a operator+(a d) { return a(b + d.b, c + d.c); }
> +  a operator*(a d) { return a(b * b - c * c, b * c + c * d.b); } }; int
> +f, g; class {
> +  a *h;
> +  a *i;
> +
> +public:
> +  void j() {
> +    a k = h[0], l = i[g], m = k * i[f];
> +    i[g] = l + m;
> +    i[f] = m;
> +  }
> +} n;
> +main() { n.j(); }
> diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c index
> f0817da9f622d22e3df2e30410d1cf610b4ffa1d..1e2769662a54229ab8e24390f9
> 7dfe206f17ab57 100644
> --- a/gcc/tree-vect-slp-patterns.c
> +++ b/gcc/tree-vect-slp-patterns.c
> @@ -407,9 +407,8 @@ vect_detect_pair_op (slp_tree node1, slp_tree
> node2, lane_permutation_t &lanes,
> 
>    if (result != CMPLX_NONE && ops != NULL)
>      {
> -      ops->create (2);
> -      ops->quick_push (node1);
> -      ops->quick_push (node2);
> +      ops->safe_push (node1);
> +      ops->safe_push (node2);
>      }
>    return result;
>  }
> @@ -1090,15 +1089,17 @@ complex_mul_pattern::build (vec_info *vinfo)  {
>    slp_tree node;
>    unsigned i;
> +  slp_tree newnode
> +    = vect_build_combine_node (this->m_ops[0], this->m_ops[1],
> + *this->m_node);  SLP_TREE_REF_COUNT (this->m_ops[2])++;
> +
>    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (*this->m_node), i, node)
>      vect_free_slp_tree (node);
> 
>    /* First re-arrange the children.  */
>    SLP_TREE_CHILDREN (*this->m_node).reserve_exact (2);
>    SLP_TREE_CHILDREN (*this->m_node)[0] = this->m_ops[2];
> -  SLP_TREE_CHILDREN (*this->m_node)[1] =
> -    vect_build_combine_node (this->m_ops[0], this->m_ops[1], *this-
> >m_node);
> -  SLP_TREE_REF_COUNT (this->m_ops[2])++;
> +  SLP_TREE_CHILDREN (*this->m_node)[1] = newnode;
> 
>    /* And then rewrite the node itself.  */
>    complex_pattern::build (vinfo);
> @@ -1133,18 +1134,6 @@ class complex_fma_pattern : public
> complex_pattern
>      }
>  };
> 
> -/* Helper function to "reset" a previously matched node and undo the
> changes
> -   made enough so that the node is treated as an irrelevant node.  */
> -
> -static inline void
> -vect_slp_reset_pattern (slp_tree node)
> -{
> -  stmt_vec_info stmt_info = vect_orig_stmt (SLP_TREE_REPRESENTATIVE
> (node));
> -  STMT_VINFO_IN_PATTERN_P (stmt_info) = false;
> -  STMT_SLP_TYPE (stmt_info) = pure_slp;
> -  SLP_TREE_REPRESENTATIVE (node) = stmt_info; -}
> -
>  /* Pattern matcher for trying to match complex multiply and accumulate
>     and multiply and subtract patterns in SLP tree.
>     If the operation matches then IFN is set to the operation it matched and
> @@ -1208,15 +1197,6 @@ complex_fma_pattern::matches
> (complex_operation_t op,
>    if (!vect_pattern_validate_optab (ifn, vnode))
>      return IFN_LAST;
> 
> -  /* FMA matched ADD + CMUL.  During the matching of CMUL the
> -     stmt that starts the pattern is marked as being in a pattern,
> -     namely the CMUL.  When replacing this with a CFMA we have to
> -     unmark this statement as being in a pattern.  This is because
> -     vect_mark_pattern_stmts will only mark the current stmt as being
> -     in a pattern.  Later on when the scalar stmts are examined the
> -     old statement which is supposed to be irrelevant will point to
> -     CMUL unless we undo the pattern relationship here.  */
> -  vect_slp_reset_pattern (node);
>    ops->truncate (0);
>    ops->create (3);
> 
> @@ -1259,10 +1239,17 @@ complex_fma_pattern::recognize
> (slp_tree_to_load_perm_map_t *perm_cache,  void
> complex_fma_pattern::build (vec_info *vinfo)  {
> +  slp_tree node = SLP_TREE_CHILDREN (*this->m_node)[1];
> +
>    SLP_TREE_CHILDREN (*this->m_node).release ();
>    SLP_TREE_CHILDREN (*this->m_node).create (3);
>    SLP_TREE_CHILDREN (*this->m_node).safe_splice (this->m_ops);
> 
> +  SLP_TREE_REF_COUNT (this->m_ops[1])++;  SLP_TREE_REF_COUNT
> + (this->m_ops[2])++;
> +
> +  vect_free_slp_tree (node);
> +
>    complex_pattern::build (vinfo);
>  }
> 
> @@ -1427,6 +1414,11 @@ complex_fms_pattern::build (vec_info *vinfo)  {
>    slp_tree node;
>    unsigned i;
> +  slp_tree newnode =
> +    vect_build_combine_node (this->m_ops[2], this->m_ops[3],
> + *this->m_node);  SLP_TREE_REF_COUNT (this->m_ops[0])++;
> + SLP_TREE_REF_COUNT (this->m_ops[1])++;
> +
>    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (*this->m_node), i, node)
>      vect_free_slp_tree (node);
> 
> @@ -1436,10 +1428,7 @@ complex_fms_pattern::build (vec_info *vinfo)
>    /* First re-arrange the children.  */
>    SLP_TREE_CHILDREN (*this->m_node).quick_push (this->m_ops[0]);
>    SLP_TREE_CHILDREN (*this->m_node).quick_push (this->m_ops[1]);
> -  SLP_TREE_CHILDREN (*this->m_node).quick_push (
> -    vect_build_combine_node (this->m_ops[2], this->m_ops[3], *this-
> >m_node));
> -  SLP_TREE_REF_COUNT (this->m_ops[0])++;
> -  SLP_TREE_REF_COUNT (this->m_ops[1])++;
> +  SLP_TREE_CHILDREN (*this->m_node).quick_push (newnode);
> 
>    /* And then rewrite the node itself.  */
>    complex_pattern::build (vinfo);
> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c index
> ea8a97b01c6371791ac66de3e1dabfedee69cb67..65c2ff867ab41ea70367087dc
> 26fb6eea1375ffb 100644
> --- a/gcc/tree-vect-slp.c
> +++ b/gcc/tree-vect-slp.c
> @@ -146,6 +146,16 @@ vect_free_slp_tree (slp_tree node)
>      if (child)
>        vect_free_slp_tree (child);
> 
> +  /* If the node defines any SLP only patterns then those patterns are no
> +     longer valid and should be removed.  */  stmt_vec_info
> + rep_stmt_info = SLP_TREE_REPRESENTATIVE (node);  if (rep_stmt_info &&
> + STMT_VINFO_SLP_VECT_ONLY_PATTERN (rep_stmt_info))
> +    {
> +      stmt_vec_info stmt_info = vect_orig_stmt (rep_stmt_info);
> +      //STMT_VINFO_IN_PATTERN_P (stmt_info) = false;
> +      //STMT_SLP_TYPE (stmt_info) = STMT_SLP_TYPE (rep_stmt_info);
> +    }
> +
>    delete node;
>  }
> 
> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c index
> 5b45df3a4e00266b7530eb4da6985f0d940cb05b..63ba594f2276850a00fc37207
> 2d98326891f19e6 100644
> --- a/gcc/tree-vectorizer.c
> +++ b/gcc/tree-vectorizer.c
> @@ -695,6 +695,7 @@ vec_info::new_stmt_vec_info (gimple *stmt)
>    STMT_VINFO_REDUC_FN (res) = IFN_LAST;
>    STMT_VINFO_REDUC_IDX (res) = -1;
>    STMT_VINFO_SLP_VECT_ONLY (res) = false;
> +  STMT_VINFO_SLP_VECT_ONLY_PATTERN (res) = false;
>    STMT_VINFO_VEC_STMTS (res) = vNULL;
> 
>    if (is_a <loop_vec_info> (this)
> 
> 
> --

Reply via email to