Ps. The code in comment is needed, but was commented out due to the early free issue described in the patch.
Thanks, Tamar > -----Original Message----- > From: Tamar Christina <tamar.christ...@arm.com> > Sent: Friday, February 19, 2021 2:42 PM > To: gcc-patches@gcc.gnu.org > Cc: nd <n...@arm.com>; rguent...@suse.de > Subject: [PATCH] slp: fix sharing of SLP only patterns. (PR99149) > > Hi Richi, > > The attached testcase ICEs due to a couple of issues. > In the testcase you have two SLP instances that share the majority of their > definition with each other. One tree defines a COMPLEX_MUL sequence > and the other tree a COMPLEX_FMA. > > The ice happens because: > > 1. the refcounts are wrong, in particular the FMA case doesn't correctly count > the references for the COMPLEX_MUL that it consumes. > > 2. when the FMA is created it incorrectly assumes it can just tear apart the > MUL node that it's consuming. This is wrong and should only be done when > there is no more uses of the node, in which case the vector only pattern is no > longer relevant. > > To fix the last part the SLP only pattern reset code was moved into > vect_free_slp_tree which results in cleaner code. I also think it does belong > there since that function knows when there are no more uses of the node > and so the pattern should be unmarked, so when the the vectorizer is > inspecting the BB it doesn't find the now invalid vector only patterns. > > This has the obvious problem in that, eventually after analysis is done, the > entire SLP tree is dissolved before codegen. Where we get into trouble as > we have now dissolved the patterns too... > > My initial thought was to add a parameter to vect_free_slp_tree, but I know > you wouldn't like that. So I am sending this patch up as an RFC. > > PS. This testcase actually shows that the codegen we get in these cases is not > optimal. Currently this won't vectorize as the compiler thinks the vector > version is too expensive. > > My guess here is because the patterns now unshare the tree and it's likely > costing the setup for the vector code twice? > > Even with the shared code (without patterns, so same as GCC 10, or turning > off the patterns) it won't vectorize it. > > The scalar code is > > mov w0, 0 > ldr x4, [x1, #:lo12:.LANCHOR0] > ldrsw x2, [x3, 20] > ldr x1, [x3, 8] > lsl x2, x2, 3 > ldrsw x3, [x3, 16] > ldp s3, s2, [x4] > add x5, x1, x2 > ldr s0, [x1, x2] > lsl x3, x3, 3 > add x4, x1, x3 > fmul s1, s2, s2 > fnmsub s1, s3, s3, s1 > fmul s0, s2, s0 > fmadd s0, s3, s2, s0 > ldr s3, [x1, x3] > ldr s2, [x4, 4] > fadd s3, s3, s1 > fadd s2, s0, s2 > str s3, [x1, x3] > str s2, [x4, 4] > str s1, [x1, x2] > str s0, [x5, 4] > ret > > and turning off the cost model we get > > movi v1.2s, 0 > mov w0, 0 > ldr x4, [x1, #:lo12:.LANCHOR0] > ldrsw x3, [x2, 16] > ldr x1, [x2, 8] > ldrsw x2, [x2, 20] > ldr d0, [x4] > fcmla v1.2s, v0.2s, v0.2s, #0 > ldr d2, [x1, x3, lsl 3] > fcmla v2.2s, v0.2s, v0.2s, #0 > fcmla v2.2s, v0.2s, v0.2s, #90 > str d2, [x1, x3, lsl 3] > fcmla v1.2s, v0.2s, v0.2s, #90 > str d1, [x1, x2, lsl 3] > > however, if the pattern matcher doesn't create the FMA node because it > would unshare the tree (which I think is a general heuristic that would work > out to better code wouldn't it?) then we would get (with the cost model > enabled even) > > movi v0.2s, 0 > mov w0, 0 > ldr x4, [x1, #:lo12:.LANCHOR0] > ldrsw x3, [x2, 16] > ldr x1, [x2, 8] > ldr d1, [x4] > fcmla v0.2s, v1.2s, v1.2s, #0 > fcmla v0.2s, v1.2s, v1.2s, #90 > ldrsw x2, [x2, 20] > ldr d1, [x1, x3, lsl 3] > fadd v1.2s, v1.2s, v0.2s > str d1, [x1, x3, lsl 3] > str d0, [x1, x2, lsl 3] > ret > > Which is the most optimal form. So I think this should perhaps be handled in > GCC 12 if there's a way to detect when you're going to unshare a sub-tree. > > Thanks, > Tamar > > gcc/ChangeLog: > > PR tree-optimization/99149 > * tree-vect-slp-patterns.c (vect_detect_pair_op): Don't recreate the > buffer. > (vect_slp_reset_pattern): Remove. > (complex_fma_pattern::matches): Remove call to > vect_slp_reset_pattern. > (complex_mul_pattern::build, complex_fma_pattern::build, > complex_fms_pattern::build): Fix ref counts. > * tree-vect-slp.c (vect_free_slp_tree): Undo SLP only pattern > relevancy > when node is being deleted. > * tree-vectorizer.c (vec_info::new_stmt_vec_info): Initialize value. > > gcc/testsuite/ChangeLog: > > PR tree-optimization/99149 > * gcc.dg/vect/pr99149.C: New test. > > --- inline copy of patch -- > diff --git a/gcc/testsuite/gcc.dg/vect/pr99149.C > b/gcc/testsuite/gcc.dg/vect/pr99149.C > new file mode 100755 > index > 0000000000000000000000000000000000000000..b12fe17e4ded148ce2bf67486 > e425dd65461a148 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/pr99149.C > @@ -0,0 +1,25 @@ > +/* { dg-do compile } */ > +/* { dg-additional-options "-w -O3 -march=armv8.3-a" { target { > +aarch64*-*-* } } } */ > + > +class a { > + float b; > + float c; > + > +public: > + a(float d, float e) : b(d), c(e) {} > + a operator+(a d) { return a(b + d.b, c + d.c); } > + a operator*(a d) { return a(b * b - c * c, b * c + c * d.b); } }; int > +f, g; class { > + a *h; > + a *i; > + > +public: > + void j() { > + a k = h[0], l = i[g], m = k * i[f]; > + i[g] = l + m; > + i[f] = m; > + } > +} n; > +main() { n.j(); } > diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c index > f0817da9f622d22e3df2e30410d1cf610b4ffa1d..1e2769662a54229ab8e24390f9 > 7dfe206f17ab57 100644 > --- a/gcc/tree-vect-slp-patterns.c > +++ b/gcc/tree-vect-slp-patterns.c > @@ -407,9 +407,8 @@ vect_detect_pair_op (slp_tree node1, slp_tree > node2, lane_permutation_t &lanes, > > if (result != CMPLX_NONE && ops != NULL) > { > - ops->create (2); > - ops->quick_push (node1); > - ops->quick_push (node2); > + ops->safe_push (node1); > + ops->safe_push (node2); > } > return result; > } > @@ -1090,15 +1089,17 @@ complex_mul_pattern::build (vec_info *vinfo) { > slp_tree node; > unsigned i; > + slp_tree newnode > + = vect_build_combine_node (this->m_ops[0], this->m_ops[1], > + *this->m_node); SLP_TREE_REF_COUNT (this->m_ops[2])++; > + > FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (*this->m_node), i, node) > vect_free_slp_tree (node); > > /* First re-arrange the children. */ > SLP_TREE_CHILDREN (*this->m_node).reserve_exact (2); > SLP_TREE_CHILDREN (*this->m_node)[0] = this->m_ops[2]; > - SLP_TREE_CHILDREN (*this->m_node)[1] = > - vect_build_combine_node (this->m_ops[0], this->m_ops[1], *this- > >m_node); > - SLP_TREE_REF_COUNT (this->m_ops[2])++; > + SLP_TREE_CHILDREN (*this->m_node)[1] = newnode; > > /* And then rewrite the node itself. */ > complex_pattern::build (vinfo); > @@ -1133,18 +1134,6 @@ class complex_fma_pattern : public > complex_pattern > } > }; > > -/* Helper function to "reset" a previously matched node and undo the > changes > - made enough so that the node is treated as an irrelevant node. */ > - > -static inline void > -vect_slp_reset_pattern (slp_tree node) > -{ > - stmt_vec_info stmt_info = vect_orig_stmt (SLP_TREE_REPRESENTATIVE > (node)); > - STMT_VINFO_IN_PATTERN_P (stmt_info) = false; > - STMT_SLP_TYPE (stmt_info) = pure_slp; > - SLP_TREE_REPRESENTATIVE (node) = stmt_info; -} > - > /* Pattern matcher for trying to match complex multiply and accumulate > and multiply and subtract patterns in SLP tree. > If the operation matches then IFN is set to the operation it matched and > @@ -1208,15 +1197,6 @@ complex_fma_pattern::matches > (complex_operation_t op, > if (!vect_pattern_validate_optab (ifn, vnode)) > return IFN_LAST; > > - /* FMA matched ADD + CMUL. During the matching of CMUL the > - stmt that starts the pattern is marked as being in a pattern, > - namely the CMUL. When replacing this with a CFMA we have to > - unmark this statement as being in a pattern. This is because > - vect_mark_pattern_stmts will only mark the current stmt as being > - in a pattern. Later on when the scalar stmts are examined the > - old statement which is supposed to be irrelevant will point to > - CMUL unless we undo the pattern relationship here. */ > - vect_slp_reset_pattern (node); > ops->truncate (0); > ops->create (3); > > @@ -1259,10 +1239,17 @@ complex_fma_pattern::recognize > (slp_tree_to_load_perm_map_t *perm_cache, void > complex_fma_pattern::build (vec_info *vinfo) { > + slp_tree node = SLP_TREE_CHILDREN (*this->m_node)[1]; > + > SLP_TREE_CHILDREN (*this->m_node).release (); > SLP_TREE_CHILDREN (*this->m_node).create (3); > SLP_TREE_CHILDREN (*this->m_node).safe_splice (this->m_ops); > > + SLP_TREE_REF_COUNT (this->m_ops[1])++; SLP_TREE_REF_COUNT > + (this->m_ops[2])++; > + > + vect_free_slp_tree (node); > + > complex_pattern::build (vinfo); > } > > @@ -1427,6 +1414,11 @@ complex_fms_pattern::build (vec_info *vinfo) { > slp_tree node; > unsigned i; > + slp_tree newnode = > + vect_build_combine_node (this->m_ops[2], this->m_ops[3], > + *this->m_node); SLP_TREE_REF_COUNT (this->m_ops[0])++; > + SLP_TREE_REF_COUNT (this->m_ops[1])++; > + > FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (*this->m_node), i, node) > vect_free_slp_tree (node); > > @@ -1436,10 +1428,7 @@ complex_fms_pattern::build (vec_info *vinfo) > /* First re-arrange the children. */ > SLP_TREE_CHILDREN (*this->m_node).quick_push (this->m_ops[0]); > SLP_TREE_CHILDREN (*this->m_node).quick_push (this->m_ops[1]); > - SLP_TREE_CHILDREN (*this->m_node).quick_push ( > - vect_build_combine_node (this->m_ops[2], this->m_ops[3], *this- > >m_node)); > - SLP_TREE_REF_COUNT (this->m_ops[0])++; > - SLP_TREE_REF_COUNT (this->m_ops[1])++; > + SLP_TREE_CHILDREN (*this->m_node).quick_push (newnode); > > /* And then rewrite the node itself. */ > complex_pattern::build (vinfo); > diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c index > ea8a97b01c6371791ac66de3e1dabfedee69cb67..65c2ff867ab41ea70367087dc > 26fb6eea1375ffb 100644 > --- a/gcc/tree-vect-slp.c > +++ b/gcc/tree-vect-slp.c > @@ -146,6 +146,16 @@ vect_free_slp_tree (slp_tree node) > if (child) > vect_free_slp_tree (child); > > + /* If the node defines any SLP only patterns then those patterns are no > + longer valid and should be removed. */ stmt_vec_info > + rep_stmt_info = SLP_TREE_REPRESENTATIVE (node); if (rep_stmt_info && > + STMT_VINFO_SLP_VECT_ONLY_PATTERN (rep_stmt_info)) > + { > + stmt_vec_info stmt_info = vect_orig_stmt (rep_stmt_info); > + //STMT_VINFO_IN_PATTERN_P (stmt_info) = false; > + //STMT_SLP_TYPE (stmt_info) = STMT_SLP_TYPE (rep_stmt_info); > + } > + > delete node; > } > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c index > 5b45df3a4e00266b7530eb4da6985f0d940cb05b..63ba594f2276850a00fc37207 > 2d98326891f19e6 100644 > --- a/gcc/tree-vectorizer.c > +++ b/gcc/tree-vectorizer.c > @@ -695,6 +695,7 @@ vec_info::new_stmt_vec_info (gimple *stmt) > STMT_VINFO_REDUC_FN (res) = IFN_LAST; > STMT_VINFO_REDUC_IDX (res) = -1; > STMT_VINFO_SLP_VECT_ONLY (res) = false; > + STMT_VINFO_SLP_VECT_ONLY_PATTERN (res) = false; > STMT_VINFO_VEC_STMTS (res) = vNULL; > > if (is_a <loop_vec_info> (this) > > > --