On Tue, 2 Nov 2021, Richard Sandiford wrote: > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > > On Tue, 2 Nov 2021, Tamar Christina wrote: > > > >> > -----Original Message----- > >> > From: Richard Biener <rguent...@suse.de> > >> > Sent: Tuesday, November 2, 2021 2:24 PM > >> > To: Tamar Christina <tamar.christ...@arm.com> > >> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com> > >> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful > >> > vectorization > >> > > >> > On Tue, 2 Nov 2021, Tamar Christina wrote: > >> > > >> > > Hi All, > >> > > > >> > > Following my current SVE predicate optimization series a problem has > >> > > presented itself in that the way vector masks are generated for masked > >> > > operations relies on CSE to share masks efficiently. > >> > > > >> > > The issue however is that masking is done using the & operand and & is > >> > > associative and so reassoc decides to reassociate the masked > >> > > operations. > >> > > >> > But it does this for the purpose of canonicalization and thus CSE. > >> > >> Yes, but it turns something like > >> > >> (a & b) & mask into a & (b & mask). > >> > >> When (a & b) is used somewhere else you now lose the CSE. So it's > >> actually hurting > >> In this case. > > > > OK, so that's a known "issue" with reassoc, it doesn't consider global > > CSE opportunities and I guess it pushes 'mask' to leaf if it is loop > > carried. > > > >> > > >> > > This makes CSE then unable to CSE an unmasked and a masked operation > >> > > leading to duplicate operations being performed. > >> > > > >> > > To counter this we want to add an RPO pass over the vectorized loop > >> > > body when vectorization succeeds. This makes it then no longer > >> > > reliant on the RTL level CSE. > >> > > > >> > > I have not added a testcase for this as it requires the changes in my > >> > > patch series, however the entire series relies on this patch to work > >> > > so all the tests there cover it. > >> > > > >> > > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-linux-gnu and > >> > > no issues. > >> > > > >> > > Ok for master? > >> > > >> > You are running VN over _all_ loop bodies rather only those vectorized. > >> > We loop over vectorized loops earlier for optimizing masked store > >> > sequences. > >> > I suppose you could hook in there. I'll also notice that we have > >> > pass_pre_slp_scalar_cleanup which eventually runs plus we have a late > >> > FRE. > >> > So I don't understand why it doesn't work to CSE later. > >> > > >> > >> Atm, say you have the conditions a > b, and a > b & a > c > >> > >> We generate > >> > >> mask1 = (a > b) & loop_mask > >> mask2 = (a > b & a > c) & loop_mask > >> > >> with the intention that mask1 can be re-used in mask2. > >> > >> Reassoc changes this to mask2 = a > b & (a > c & loop_mask) > >> > >> Which has now unmasked (a > b) in mask2, which leaves us unable to combine > >> the mask1 and mask2. It doesn't generate incorrect code, just inefficient. > >> > >> > for (i = 1; i < number_of_loops (cfun); i++) > >> > { > >> > loop_vec_info loop_vinfo; > >> > bool has_mask_store; > >> > > >> > loop = get_loop (cfun, i); > >> > if (!loop || !loop->aux) > >> > continue; > >> > loop_vinfo = (loop_vec_info) loop->aux; > >> > has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo); > >> > delete loop_vinfo; > >> > if (has_mask_store > >> > && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE)) > >> > optimize_mask_stores (loop); > >> > loop->aux = NULL; > >> > } > >> > > >> > >> Ah thanks, I'll make the changes. > > > > Note I think that full-blown CSE is a bit overkill just to counter > > a deficient reassoc (or VN). At least it is supposed to be "cheap" > > and can be conditionalized on loop masks being used as well. > > Not sure we should make this conditional on loop masks being used. > It seems either that: > > (a) the vectoriser is supposed to avoid creating code that has folding > or VN opportunities, in which case we need to generate the vectorised > code in a smarter way or > > (b) the vectoriser is allowed to create code that has folding or VN > opportunities, in which case it would be good to have a defined > place to get rid of them.
It's certainly (b), and the definitive place to get rid of those is the post-loop optimizer FRE pass. That just happens to be after a reassoc pass which makes FRE run into the pre-existing issue that we fail to capture all (or the best) possible CSE opportunity from separate associatable chains. > I'm just worried that if we make it conditional on loop masks, > we could see cases that in which non-loop-mask stuff is optimised > differently based on whether the loop has masks or not. E.g. > we might get worse code with an unpredicated main loop and > a predicated epilogue compared to a predicated main loop. Sure. Note for loop vectorization we can indeed reasonably easy CSE the main body and RPO VN should be O(region size) and cheap enough for this case (we could even add an extra cheap entry for single basic-blocks). For BB vectorization we have to rely on the full function FRE pass though. Richard. > Thanks, > Richard > > > >> Thanks, > >> Tamar > >> > >> > > >> > > Thanks, > >> > > Tamar > >> > > > >> > > gcc/ChangeLog: > >> > > > >> > > * tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN > >> > upon > >> > > successful vectorization. > >> > > > >> > > --- inline copy of patch -- > >> > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c index > >> > > > >> > 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c > >> > 660 > >> > > af4b6d32dc51 100644 > >> > > --- a/gcc/tree-vectorizer.c > >> > > +++ b/gcc/tree-vectorizer.c > >> > > @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3. If not see > >> > > #include "gimple-pretty-print.h" > >> > > #include "opt-problem.h" > >> > > #include "internal-fn.h" > >> > > - > >> > > +#include "tree-ssa-sccvn.h" > >> > > > >> > > /* Loop or bb location, with hotness information. */ > >> > > dump_user_location_t vect_location; @@ -1323,6 +1323,27 @@ > >> > > vectorize_loops (void) > >> > > ??? Also while we try hard to update loop-closed SSA form we > >> > > fail > >> > > to properly do this in some corner-cases (see PR56286). */ > >> > > rewrite_into_loop_closed_ssa (NULL, > >> > > TODO_update_ssa_only_virtuals); > >> > > + > >> > > + for (i = 1; i < number_of_loops (cfun); i++) > >> > > + { > >> > > + loop = get_loop (cfun, i); > >> > > + if (!loop || !single_exit (loop)) > >> > > + continue; > >> > > + > >> > > + bitmap exit_bbs; > >> > > + /* Perform local CSE, this esp. helps because we emit code for > >> > > + predicates that need to be shared for optimal predicate > >> > > usage. > >> > > + However reassoc will re-order them and prevent CSE from > >> > > working > >> > > + as it should. CSE only the loop body, not the entry. */ > >> > > + exit_bbs = BITMAP_ALLOC (NULL); > >> > > + bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index); > >> > > + bitmap_set_bit (exit_bbs, loop->latch->index); > >> > > + > >> > > + do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs); > >> > > + > >> > > + BITMAP_FREE (exit_bbs); > >> > > + } > >> > > + > >> > > return TODO_cleanup_cfg; > >> > > } > >> > > > >> > > > >> > > > >> > > > >> > > >> > -- > >> > Richard Biener <rguent...@suse.de> > >> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 > >> > Nuernberg, Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg) > >> > -- Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)