Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
> On Thu, 22 Aug 2019 at 16:44, Richard Biener <richard.guent...@gmail.com>
> wrote:
>>
>> On Wed, Aug 21, 2019 at 8:24 PM Prathamesh Kulkarni
>> <prathamesh.kulka...@linaro.org> wrote:
>> >
>> > On Thu, 15 Aug 2019 at 01:50, Richard Sandiford
>> > <richard.sandif...@arm.com> wrote:
>> > >
>> > > Richard Biener <richard.guent...@gmail.com> writes:
>> > > > On Wed, Aug 14, 2019 at 6:49 PM Richard Biener
>> > > > <richard.guent...@gmail.com> wrote:
>> > > >>
>> > > >> On Wed, Aug 14, 2019 at 5:06 PM Prathamesh Kulkarni
>> > > >> <prathamesh.kulka...@linaro.org> wrote:
>> > > >> >
>> > > >> > Hi,
>> > > >> > The attached patch tries to fix PR86753.
>> > > >> >
>> > > >> > For the following test:
>> > > >> > void
>> > > >> > f1 (int *restrict x, int *restrict y, int *restrict z)
>> > > >> > {
>> > > >> >   for (int i = 0; i < 100; ++i)
>> > > >> >     x[i] = y[i] ? z[i] : 10;
>> > > >> > }
>> > > >> >
>> > > >> > the vect dump shows:
>> > > >> >   vect_cst__42 = { 0, ... };
>> > > >> >   vect_cst__48 = { 0, ... };
>> > > >> >
>> > > >> >   vect__4.7_41 = .MASK_LOAD (vectp_y.5_38, 4B, loop_mask_40);
>> > > >> >   _4 = *_3;
>> > > >> >   _5 = z_12(D) + _2;
>> > > >> >   mask__35.8_43 = vect__4.7_41 != vect_cst__42;
>> > > >> >   _35 = _4 != 0;
>> > > >> >   vec_mask_and_46 = mask__35.8_43 & loop_mask_40;
>> > > >> >   vect_iftmp.11_47 = .MASK_LOAD (vectp_z.9_44, 4B, vec_mask_and_46);
>> > > >> >   iftmp.0_13 = 0;
>> > > >> >   vect_iftmp.12_50 = VEC_COND_EXPR <vect__4.7_41 != vect_cst__48,
>> > > >> >                                     vect_iftmp.11_47, vect_cst__49>;
>> > > >> >
>> > > >> > and the following code-gen:
>> > > >> > L2:
>> > > >> >         ld1w    z0.s, p2/z, [x1, x3, lsl 2]
>> > > >> >         cmpne   p1.s, p3/z, z0.s, #0
>> > > >> >         cmpne   p0.s, p2/z, z0.s, #0
>> > > >> >         ld1w    z0.s, p0/z, [x2, x3, lsl 2]
>> > > >> >         sel     z0.s, p1, z0.s, z1.s
>> > > >> >
>> > > >> > We could reuse vec_mask_and_46 in the vec_cond_expr, since the
>> > > >> > conditions vect__4.7_41 != vect_cst__48 and
>> > > >> > vect__4.7_41 != vect_cst__42 are equivalent, and vect_iftmp.11_47
>> > > >> > depends on vect__4.7_41 != vect_cst__48.
>> > > >> >
>> > > >> > I suppose in general for vec_cond_expr <C, T, E>, if T comes from a
>> > > >> > masked load that is conditional on C, then we could reuse the mask
>> > > >> > used in the load in the vec_cond_expr?
>> > > >> >
>> > > >> > The patch maintains a hash_map cond_to_vec_mask from
>> > > >> > <cond, loop_mask> -> vec_mask (with the loop predicate applied).
>> > > >> > In prepare_load_store_mask, we record
>> > > >> > <cond, loop_mask> -> vec_mask & loop_mask, and in
>> > > >> > vectorizable_condition we check whether <cond, loop_mask> exists in
>> > > >> > cond_to_vec_mask; if found, the corresponding vec_mask is used as
>> > > >> > the first operand of the vec_cond_expr.
>> > > >> >
>> > > >> > <cond, loop_mask> is represented by cond_vmask_key, and the patch
>> > > >> > adds tree_cond_ops to represent the condition operator and operands,
>> > > >> > coming either from a cond_expr or from a gimple comparison stmt.
>> > > >> > If the stmt is not a comparison, it returns <ne_expr, lhs, 0> and
>> > > >> > inserts that into cond_to_vec_mask.
>> > > >> >
>> > > >> > With the patch, the redundant p1 is eliminated and sel uses p0 for
>> > > >> > the above test.
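(Just to restate the caching scheme for later reference: it boils down to a
small map keyed on the canonicalised condition plus the loop mask.  Here is a
stand-alone sketch of the idea, with made-up types instead of the real
tree/cond_vmask_key machinery, so purely illustrative rather than the patch
itself:)

  // Illustrative stand-alone model of cond_to_vec_mask: map a comparison
  // (code, op0, op1) plus the loop mask to the already-computed
  // "cond & loop_mask" so that a later VEC_COND_EXPR can reuse it.
  #include <map>
  #include <tuple>

  typedef int pseudo_tree;                      // stand-in for GCC's 'tree'
  typedef std::tuple<int, pseudo_tree, pseudo_tree, pseudo_tree> cond_key;

  static std::map<cond_key, pseudo_tree> cond_to_vec_mask;

  // Recording side (the analogue of prepare_load_store_mask): remember the
  // mask that was ANDed with the loop mask for a masked load/store.
  void
  record_vec_mask (int code, pseudo_tree op0, pseudo_tree op1,
                   pseudo_tree loop_mask, pseudo_tree vec_mask)
  {
    cond_to_vec_mask[cond_key (code, op0, op1, loop_mask)] = vec_mask;
  }

  // Lookup side (the analogue of vectorizable_condition): if the same
  // condition was already masked, reuse that mask as the VEC_COND_EXPR
  // condition instead of emitting a second compare.
  pseudo_tree
  lookup_vec_mask (int code, pseudo_tree op0, pseudo_tree op1,
                   pseudo_tree loop_mask, pseudo_tree original_cond)
  {
    auto it = cond_to_vec_mask.find (cond_key (code, op0, op1, loop_mask));
    return it != cond_to_vec_mask.end () ? it->second : original_cond;
  }

The real patch keys this on trees via cond_vmask_key/tree_cond_ops, of
course; the above is just the general shape.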
>> > > >> >
>> > > >> > For the following test:
>> > > >> > void
>> > > >> > f2 (int *restrict x, int *restrict y, int *restrict z, int fallback)
>> > > >> > {
>> > > >> >   for (int i = 0; i < 100; ++i)
>> > > >> >     x[i] = y[i] ? z[i] : fallback;
>> > > >> > }
>> > > >> >
>> > > >> > the input to the vectorizer has the operands swapped in the cond_expr:
>> > > >> >   _36 = _4 != 0;
>> > > >> >   iftmp.0_14 = .MASK_LOAD (_5, 32B, _36);
>> > > >> >   iftmp.0_8 = _4 == 0 ? fallback_12(D) : iftmp.0_14;
>> > > >> >
>> > > >> > So we need to check for the inverted condition in cond_to_vec_mask,
>> > > >> > and swap the operands.
>> > > >> > Does the patch look OK so far?
>> > > >> >
>> > > >> > One major issue remaining with the patch is value numbering.
>> > > >> > Currently, it does value numbering for the entire function using
>> > > >> > sccvn at the start of the vect pass, which is too expensive since
>> > > >> > we only need block-based VN.  I am looking into that.
>> > > >>
>> > > >> Why do you need it at all?  We run VN on the if-converted loop
>> > > >> bodies btw.
>> > >
>> > > This was my suggestion, but with the idea being to do the numbering
>> > > per-statement as we vectorise.  We'll then see pattern statements too.
>> > >
>> > > That's important because we use pattern statements to set the right
>> > > vector boolean type (e.g. vect_recog_mask_conversion_pattern).
>> > > So some of the masks we care about don't exist after if conversion.
>> > >
>> > > > Also I can't trivially see the equality of the masks and probably so
>> > > > can't VN.  Is it that we just don't bother to apply loop_mask to
>> > > > VEC_COND but there's no harm if we do?
>> > >
>> > > Yeah.  The idea of the optimisation is to decide when it's more
>> > > profitable to apply the loop mask, even though doing so isn't
>> > > necessary.  It would be hard to do after vectorisation because the
>> > > masks aren't equivalent.  We're relying on knowledge of how the
>> > > vectoriser uses the result.
>> > Hi,
>> > Sorry for the late response.  This is an updated patch that integrates
>> > block-based VN into the vect pass.
>> > The patch
>> > (a) Exports visit_stmt (renamed to vn_visit_stmt), plus vn_bb_init to
>> > initialize the VN state and vn_bb_free to free it.
>> > (b) Calls vn_visit_stmt in vect_transform_stmt for value numbering
>> > stmts.  We're only interested in obtaining value numbers, not in
>> > eliminating redundancies.
>> > Does it look in the right direction?
>>
>> It looks a bit odd to me.  I'd have expected it to work by generating
>> the stmts as before in the vectorizer and then, on the stmts we care
>> about, invoke vn_visit_stmt, which does both value-numbering and
>> elimination.  Alternatively you could ask the VN state to generate the
>> stmt for you via vn_nary_build_or_lookup () (certainly that needs a bit
>> more work).  One complication might be availability if you don't
>> value-number all stmts in the block, but well.  I'm not sure
>> constraining to a single block is necessary - I've thought of having a
>> "CSE"ing gimple_build for some time (add & CSE new stmts onto a
>> sequence), so one should keep this mode in mind when designing the one
>> working on an existing BB.  Note that as you write it, it depends on
>> visiting the stmts in proper order - is that guaranteed when, for
>> example, vectorizing SLP?
> Hi,
> Indeed, I wrote the function with the assumption that stmts would be
> visited in proper order.
> This doesn't affect SLP currently, because the call to vn_visit_stmt in
> vect_transform_stmt is conditional on cond_to_vec_mask, which is only
> allocated inside vect_transform_loop.
> But I agree we could make it more general.
> AFAIU, the idea of constraining VN to a single block was to avoid using
> defs from non-dominating scalar stmts during outer-loop vectorization.
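(Coming back to the f2 example quoted further up: on the lookup side this
only needs to try the inverted comparison as well and swap the two arms of
the vec_cond on a hit.  Again a stand-alone sketch with made-up types, not
the actual patch:)

  // Illustrative only: try the condition as written, then its inverse.
  // On an inverted hit, swap the then/else arms so the cached mask still
  // selects the right values ("_4 == 0 ? a : b" is "_4 != 0 ? b : a").
  #include <map>
  #include <tuple>
  #include <utility>

  enum cmp_code { CMP_EQ, CMP_NE };             // NE <-> EQ covers the f2 case
  typedef std::tuple<int, int, int, int> cond_key;
  static std::map<cond_key, int> cond_to_vec_mask;

  static cmp_code
  invert_cmp (cmp_code c)
  {
    return c == CMP_EQ ? CMP_NE : CMP_EQ;
  }

  bool
  find_mask_maybe_inverted (cmp_code code, int op0, int op1, int loop_mask,
                            int &then_val, int &else_val, int &mask_out)
  {
    auto it = cond_to_vec_mask.find (cond_key (code, op0, op1, loop_mask));
    if (it != cond_to_vec_mask.end ())
      {
        mask_out = it->second;                  // direct hit, nothing to swap
        return true;
      }
    it = cond_to_vec_mask.find (cond_key (invert_cmp (code), op0, op1,
                                          loop_mask));
    if (it != cond_to_vec_mask.end ())
      {
        std::swap (then_val, else_val);         // inverted hit: swap the arms
        mask_out = it->second;
        return true;
      }
    return false;
  }

That part seems independent of the value-numbering question though.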
Maybe we could do the numbering in a separate walk immediately before
the transform phase instead.

> * fmla_2.c regression with patch:
> This happens because, with the patch, forwprop4 is able to convert all
> three vec_cond_exprs to .cond_fma (), which results in three calls to
> fmla, regressing the test-case.  If matching with the inverted condition
> is disabled in the patch in vectorizable_condition, then the old
> behaviour is preserved.

Ugh, yeah.  This all stems from the fact that we don't get rid of the
first redundant store (before or after the patch).  I think it's just
luck that we managed to CSE the FMLAs feeding the two stores.  (That
wasn't what the test was added for; it was added for a suboptimal .md
choice.)

I think it'd be OK to XFAIL the current scan-assembler-times and add a
new one for 3 times instead of 2 (with a comment of course).  I've filed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91532 for the redundant
stores.

Thanks,
Richard
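P.S. In concrete terms the fmla_2.c change could look something like the
below.  The regex here is just a placeholder for whatever the test currently
scans for, so treat this as a sketch of the shape rather than the exact edit:

  /* Old count: only holds again once the redundant store (PR91532) is
     fixed, so XFAILed for now.  */
  /* { dg-final { scan-assembler-times {\tfmla\t} 2 { xfail *-*-* } } } */
  /* Count we currently get because of the redundant store (PR91532).  */
  /* { dg-final { scan-assembler-times {\tfmla\t} 3 } } */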