[Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

Richard Biener changed:

           What    |Removed                     |Added
 ----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
           Assignee|unassigned at gcc dot       |rguenth at gcc dot gnu.org
                   |gnu.org                     |
         Resolution|---                         |INVALID

--- Comment #14 from Richard Biener ---
Not a bug.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #13 from Hongtao.liu ---
(In reply to rguent...@suse.de from comment #12)
> On Wed, 15 Feb 2023, crazylht at gmail dot com wrote:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096
> >
> > --- Comment #11 from Hongtao.liu ---
> > > There's no other way to do N bit to two N/2 bit hi/lo (un)packing
> > > (there's a 2x N/2 bit -> N bit operation, for whatever reason).
> > > There's also no way to transform the d rgroup mask into the
> > > f rgroup mask for the first example, aka duplicate bits in place,
> > > { b0, b1, b2, ... bN } -> { b0, b0, b1, b1, b2, b2, ... bN, bN },
> > > nor the reverse.
> >
> > Can we just do a VIEW_CONVERT_EXPR on the vectype instead of the
> > mask_type?  I.e. we can do a VCE to transform V8SI to V16HI, then use
> > mask_load for V16HI with the same mask { b0, b0, b1, b1, b2, b2, ... },
> > then VCE it back to V8SI; that should be OK as long as the bits are
> > duplicated in place.  (Or VCE V16HI to V8SI, then use the mask
> > { b0, b1, b2, ..., bN }, and VCE V8SI back to V16HI after the masked
> > load/move.)
>
> Hmm, yes, if we arrange for the larger mask to be available that would
> work for loads and stores I guess.  It wouldn't work for arithmetic
> cond_* IFNs though.  It's also going to be a bit tricky within the
> masking framework - I'm going to see whether that works though, it might
> be a nice way to avoid an excessive number of masks for integer code
> at least.

There could be a limitation on nV (it should be a power of 2 for the
VCE?), i.e. there may be no suitable vectype for a VCE of src1's vectype
that lets us reuse the loop mask:

void foo (int* __restrict dest, int* src1, int* src2)
{
  for (int i = 0; i != 1; i++)
    dest[i] = src1[3*i] + src1[3*i + 1] + src1[3*i + 2];
}

Maybe AVX512 could use gather instructions for .MASK_LOAD_LANES so that
LOOP_MASK can be used?
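For reference, a rough sketch of that gather idea in AVX512F/VL
intrinsics - the gather_lanes name, the zeroing of inactive lanes and
the index vectors are illustrative assumptions, not what the vectorizer
would emit:

#include <immintrin.h>

/* Emulate a masked load-lanes of three-element structures by three
   gathers under the single 8-lane LOOP_MASK; inactive lanes are
   zeroed via the src operand.  Illustrative only.  */
static void
gather_lanes (const int *src1, __mmask8 loop_mask,
              __m256i *l0, __m256i *l1, __m256i *l2)
{
  /* Lane i loads src1[3*i], src1[3*i+1] and src1[3*i+2].  */
  const __m256i base = _mm256_setr_epi32 (0, 3, 6, 9, 12, 15, 18, 21);
  const __m256i zero = _mm256_setzero_si256 ();
  *l0 = _mm256_mmask_i32gather_epi32 (zero, loop_mask, base, src1, 4);
  *l1 = _mm256_mmask_i32gather_epi32 (zero, loop_mask,
                                      _mm256_add_epi32 (base, _mm256_set1_epi32 (1)),
                                      src1, 4);
  *l2 = _mm256_mmask_i32gather_epi32 (zero, loop_mask,
                                      _mm256_add_epi32 (base, _mm256_set1_epi32 (2)),
                                      src1, 4);
}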
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #12 from rguenther at suse dot de ---
On Wed, 15 Feb 2023, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096
>
> --- Comment #11 from Hongtao.liu ---
> > There's no other way to do N bit to two N/2 bit hi/lo (un)packing
> > (there's a 2x N/2 bit -> N bit operation, for whatever reason).
> > There's also no way to transform the d rgroup mask into the
> > f rgroup mask for the first example, aka duplicate bits in place,
> > { b0, b1, b2, ... bN } -> { b0, b0, b1, b1, b2, b2, ... bN, bN },
> > nor the reverse.
>
> Can we just do a VIEW_CONVERT_EXPR on the vectype instead of the
> mask_type?  I.e. we can do a VCE to transform V8SI to V16HI, then use
> mask_load for V16HI with the same mask { b0, b0, b1, b1, b2, b2, ... },
> then VCE it back to V8SI; that should be OK as long as the bits are
> duplicated in place.  (Or VCE V16HI to V8SI, then use the mask
> { b0, b1, b2, ..., bN }, and VCE V8SI back to V16HI after the masked
> load/move.)

Hmm, yes, if we arrange for the larger mask to be available that would
work for loads and stores I guess.  It wouldn't work for arithmetic
cond_* IFNs though.  It's also going to be a bit tricky within the
masking framework - I'm going to see whether that works though, it might
be a nice way to avoid an excessive number of masks for integer code
at least.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #11 from Hongtao.liu ---
> There's no other way to do N bit to two N/2 bit hi/lo (un)packing
> (there's a 2x N/2 bit -> N bit operation, for whatever reason).
> There's also no way to transform the d rgroup mask into the
> f rgroup mask for the first example, aka duplicate bits in place,
> { b0, b1, b2, ... bN } -> { b0, b0, b1, b1, b2, b2, ... bN, bN },
> nor the reverse.

Can we just do a VIEW_CONVERT_EXPR on the vectype instead of the
mask_type?  I.e. we can do a VCE to transform V8SI to V16HI, then use
mask_load for V16HI with the same mask { b0, b0, b1, b1, b2, b2, ... },
then VCE it back to V8SI; that should be OK as long as the bits are
duplicated in place.  (Or VCE V16HI to V8SI, then use the mask
{ b0, b1, b2, ..., bN }, and VCE V8SI back to V16HI after the masked
load/move.)
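For illustration, a minimal sketch of the load half of that idea with
AVX512VL/BW intrinsics; the mask-duplication loop and the function name
are assumptions - in the vectorizer the doubled mask would be the HI
rgroup's existing mask, not built bit by bit like this:

#include <immintrin.h>

/* Load 8 ints under an 8-bit lane mask by doing the load at V16HI
   granularity with each mask bit duplicated in place; viewing the
   256-bit result as V8SI afterwards is a no-op reinterpretation.  */
static __m256i
masked_load_via_v16hi (const int *p, __mmask8 b)
{
  __mmask16 m = 0;
  /* { b0, b1, ... } -> { b0, b0, b1, b1, ... }  */
  for (int i = 0; i < 8; i++)
    if (b & (1u << i))
      m |= (__mmask16) 3 << (2 * i);
  return _mm256_maskz_loadu_epi16 (m, p);
}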
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #9 from Richard Biener ---
(In reply to rsand...@gcc.gnu.org from comment #8)
> (In reply to rguent...@suse.de from comment #7)
> > More like precision, but x86 uses QImode for two-element, four-element
> > and eight-element masks (rather than two partial integer modes with
> > two and four bits of precision).
> Ah, OK.  So yeah, maybe the precision of the vector boolean element *
> the number of elements.

For SVE the following holds:

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 1996ecfee7a..9b24b481867 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -10097,6 +10097,12 @@ vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
                            TYPE_VECTOR_SUBPARTS (vectype)));
       gimple_seq seq = NULL;
       mask_type = truth_type_for (vectype);
+      /* Assert that both mask types have the same total number of value
+        bits.  */
+      gcc_assert (known_eq (TYPE_PRECISION (TREE_TYPE (TREE_TYPE (mask)))
+                           * TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)),
+                           TYPE_PRECISION (TREE_TYPE (mask_type))
+                           * TYPE_VECTOR_SUBPARTS (mask_type)));
       mask = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, mask);
       if (seq)
        gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);

For AVX the TYPE_PRECISION is always 1, so for unequal subparts we
cannot directly share masks.

I'm going to change LOOP_VINFO_MASKS from an array indexed by nV to a
two-dimensional table indexed by nV and bit-precision * subparts.  Well,
probably using a hash_map instead since this will be quite sparse.  Or
maybe not - but at least dynamically growing as we do now is difficult,
and subparts can be non-constant.
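As a data-structure sketch only (plain C with invented names standing in
for the hash_map), the proposed keying might look like:

/* Hypothetical model: masks keyed by (nV, precision * subparts)
   instead of by nV alone; a linear scan stands in for the sparse
   hash_map.  */
struct mask_key { unsigned nv; unsigned nbits; };
struct mask_entry { struct mask_key key; void *masks; };

static void *
lookup_masks (const struct mask_entry *tab, unsigned n,
              unsigned nv, unsigned nbits)
{
  for (unsigned i = 0; i < n; i++)
    if (tab[i].key.nv == nv && tab[i].key.nbits == nbits)
      return tab[i].masks;
  return 0;  /* not recorded yet */
}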
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #8 from rsandifo at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #7)
> More like precision, but x86 uses QImode for two-element, four-element
> and eight-element masks (rather than two partial integer modes with
> two and four bits of precision).
Ah, OK.  So yeah, maybe the precision of the vector boolean element *
the number of elements.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #7 from rguenther at suse dot de ---
On Mon, 10 Oct 2022, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096
>
> --- Comment #6 from rsandifo at gcc dot gnu.org ---
> The PR means supporting targets where this assumption doesn't hold.
> But I think we should test for it explicitly somehow.  For now we
> can probably assume that the assumption holds when the two mask
> modes have equal size.

More like precision, but x86 uses QImode for two-element, four-element
and eight-element masks (rather than two partial integer modes with
two and four bits of precision).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #6 from rsandifo at gcc dot gnu.org ---
I don't think we should key this off whether the masks are integers or
vectors.  In principle there's no reason why a vector-based approach
couldn't work the AVX512 way, or an integer approach the SVE way.
(build_truth_vector_type_for_mode is currently hard-coded to 1 bit per
element for integers, but that could change.)

At the moment we make the simplifying assumption that if a sequence of
nV masks is suitable for one (nS,nL) pair, we can reuse it for
(nS/2,nL/2) by VIEW_CONVERTing it.  This holds for all current targets
that support fully-masked loops.

The PR means supporting targets where this assumption doesn't hold.
But I think we should test for it explicitly somehow.  For now we can
probably assume that the assumption holds when the two mask modes have
equal size.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #5 from Richard Biener ---
(In reply to Andrew Stubbs from comment #4)
> I don't understand rgroups, but I can say that GCN masks are very simply
> one-bit-one-lane.  There are always 64 lanes, regardless of the type, so
> V64QImode has fewer bytes and bits than V64DImode (when written to
> memory).
>
> This is different to most other architectures, where the bit-size
> remains the same and the number of lanes varies with the inner type, and
> it has caused us some issues with invalid assumptions in GCC (e.g.
> "there's no need for sign-extends in vector registers" is not true for
> GCN).
>
> However, I think it's the same as you're describing for AVX512, at least
> in this respect.
>
> Incidentally, I'm on the cusp of adding multiple "virtual" vector sizes
> in the GCN backend (in lieu of implementing full mask support everywhere
> in the middle-end and fixing all the cost assumptions), so these
> VIEW_CONVERT_EXPR issues are getting worse.  I have a bunch of
> vec_extract patterns that fix up some of it.  Within the backend, the
> V32, V16, V8, V4 and V2 vectors are all really just 64-lane vectors with
> the mask preset, so the mask has to remain DImode or register allocation
> becomes tricky.

For the documentation test case GCN seems to skirt the issue by using
differently "sized" vectors, so it manages to get two different nV here,
one for the float and one for the double rgroup.  With AVX512 I get the
same nV and wrong code, re-using the mask of the floats:

  _51 = VIEW_CONVERT_EXPR<...>(loop_mask_40);

(the verifier does not ICE because it only checks modes).  GCN gets
nV == 1 for both rgroups of

void foo (float *f, double * __restrict d, int n)
{
  for (int i = 0; i < n; ++i)
    {
      f[i] += 1.0f;
      d[i] += 3.0;
    }
}

and here sharing the mask is OK.  So it looks like the sharing logic
depends on how we get to assign vector modes - GCN insists on handing
out only 64-lane vectors.  If you changed that and allowed mixing, I
guess you would run into similar issues as AVX512.  Handing out only
8-lane vectors would limit AVX512 quite a bit, so that doesn't sound
like a viable option for us.  RVV might be in the same situation as GCN
here.

For GCN with DImode mask vectors, at the point where V_C_Es would be
emitted we could assert that the numbers of lanes are the same (we
probably should, otherwise we'd have wrong code).  So eventually
special-casing integer mode masks might work out.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #4 from Andrew Stubbs ---
I don't understand rgroups, but I can say that GCN masks are very simply
one-bit-one-lane.  There are always 64 lanes, regardless of the type, so
V64QImode has fewer bytes and bits than V64DImode (when written to
memory).

This is different to most other architectures, where the bit-size
remains the same and the number of lanes varies with the inner type, and
it has caused us some issues with invalid assumptions in GCC (e.g.
"there's no need for sign-extends in vector registers" is not true for
GCN).

However, I think it's the same as you're describing for AVX512, at least
in this respect.

Incidentally, I'm on the cusp of adding multiple "virtual" vector sizes
in the GCN backend (in lieu of implementing full mask support everywhere
in the middle-end and fixing all the cost assumptions), so these
VIEW_CONVERT_EXPR issues are getting worse.  I have a bunch of
vec_extract patterns that fix up some of it.  Within the backend, the
V32, V16, V8, V4 and V2 vectors are all really just 64-lane vectors with
the mask preset, so the mask has to remain DImode or register allocation
becomes tricky.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

Richard Biener changed:

           What    |Removed                     |Added
 ----------------------------------------------------------------------------
                 CC|                            |ams at gcc dot gnu.org

--- Comment #3 from Richard Biener ---
(In reply to rsand...@gcc.gnu.org from comment #2)
> See the comment above rgroup_controls in tree-vectorizer.h for the
> current assumptions around loop predication.  If AVX512 wants something
> different then some extensions will be needed :-)

Coming back to this now.  Crucially

  If only the first three lanes are active, the masks we need are:

     f rgroup: 1 1 | 1 1 | 1 1 | 0 0
     d rgroup:  1  |  1  |  1  |  0

  Here we can use a mask calculated for f's rgroup for d's, but not
  vice versa.

seems to assume that the space in the mask vector for the "bools" of the
d rgroup is twice as large as that for the f rgroup.  For AVX512 there's
a single bit for each lane, independent of the width of the actual data.
So instead of

  Thus for each value of nV, it is enough to provide nV masks, with the
  mask being calculated based on the highest nL (or, equivalently, based
  on the highest nS) required by any rgroup with that nV.  We therefore
  represent the entire collection of masks as a two-level table, with
  the first level being indexed by nV - 1 (since nV == 0 doesn't exist)
  and the second being indexed by the mask index 0 <= i < nV.

we need a set of nV masks for each value of nS, and we can pick the
smallest nV for each nS and generate the corresponding larger nV masks
by a series of shifts.  In fact we can re-use the first vector (excess
bits are OK).  So for the example in tree-vectorizer.h

float *f;
double *d;
for (int i = 0; i < n; ++i)
  {
    f[i * 2 + 0] += 1.0f;
    f[i * 2 + 1] += 2.0f;
    d[i] += 3.0;
  }

we'd need to perform two WHILE_ULTs.  For

float *f;
double *d;
for (int i = 0; i < n; ++i)
  {
    f[i] += 1.0f;
    d[i] += 3.0;
  }

we'd compute the mask for the f rgroup with a WHILE_ULT, and we'll have
nV_d == 2 * nV_f; the first mask vector from f can be reused for d (but
not the other way around).  The second mask vector for d can be obtained
by kshiftr.

There's no other way to do N bit to two N/2 bit hi/lo (un)packing
(there's a 2x N/2 bit -> N bit operation, for whatever reason).
There's also no way to transform the d rgroup mask into the f rgroup
mask for the first example, aka duplicate bits in place,
{ b0, b1, b2, ... bN } -> { b0, b0, b1, b1, b2, b2, ... bN, bN },
nor the reverse.

So in reality it seems we need a set of mask vectors for the full set of
nS * nV combinations with AVX512.  Doing full masking with AVX2-style
vectors would work with the existing rgroup control scheme.

Currently the "key" to the AVX512 behavior is the use of scalar modes
for the mask vectors, but then that's also what GCN uses.  Do GCN mask
bits really map to bytes, to allow the currently used rgroup scheme?
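A minimal sketch of that reuse with AVX512 kmask intrinsics (the
function name is illustrative; in GIMPLE these would be V_C_Es plus a
shift of the mask):

#include <immintrin.h>

/* One bit per lane: the mask computed for 16 float lanes serves both
   8-lane double vectors - the first by reusing the low bits directly
   (excess high bits are OK), the second via kshiftr.  */
static void
d_masks_from_f_mask (__mmask16 f_mask, __mmask8 *d0, __mmask8 *d1)
{
  *d0 = (__mmask8) f_mask;                        /* low 8 lane bits */
  *d1 = (__mmask8) _kshiftri_mask16 (f_mask, 8);  /* high 8 lane bits */
}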
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #2 from rsandifo at gcc dot gnu.org ---
See the comment above rgroup_controls in tree-vectorizer.h for the
current assumptions around loop predication.  If AVX512 wants something
different then some extensions will be needed :-)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

Richard Biener changed:

           What    |Removed                     |Added
 ----------------------------------------------------------------------------
             Target|                            |x86_64-*-*
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #1 from Richard Biener ---
Take the simplified

void foo (int * __restrict dst, short *src, int n)
{
  for (int i = 0; i < n; ++i)
    dst[i] = src[2*i] + src[2*i+1];
}

Here we get vect_record_loop_mask twice with nvectors == 2 and V16HImode
for the load, which populates masks[2].  Then we once get V8SImode, but
also with nvectors == 2, which leaves the data unadjusted since it looks
at masks[2] as well - but if it were to come first we'd have recorded a
different mask vector type.  The masks seem to be constructed in a way
that produces two bits per lane (but we still apply it naively?!), and
the V_C_E does actually look wrong to me.  Huh.

With SVE I seem to get (besides a .LOAD_LANE version) permutes of the
mask vector:

  loop_mask_116 = VEC_PERM_EXPR <...>;
  loop_mask_117 = VEC_PERM_EXPR <...>;
...
  vect__69.20_118 = .MASK_LOAD (vectp_src.18_114, 16B, loop_mask_116);
  vectp_src.18_119 = vectp_src.18_114 + POLY_INT_CST [8, 8];
  vect__69.21_120 = .MASK_LOAD (vectp_src.18_119, 16B, loop_mask_117);
...
  .MASK_STORE (vectp_dst.25_129, 32B, loop_mask_131, vect__79.24_125);

so the original mask provider here is the larger element vector type,
and the smaller element masks are produced from that.  That's something
I'd expect for AVX512 as well; not sure where it goes "wrong".

See PR107093 for a patch implementing WHILE_ULT for AVX512.
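To make the mismatch concrete, here is a hedged model (not vectorizer
code) of the two tail masks the rgroups would need per vector when
n_active int lanes remain - two kmask bits per int's worth of shorts for
the V16HI loads versus one for the V8SI store, so sharing masks[2]
between them cannot be right:

/* Illustrative only: with one kmask bit per vector lane, the HI and
   SI rgroups need different mask values for the same n_active even
   though both record nvectors == 2.  */
static void
tail_masks (int n_active, unsigned short *hi_mask, unsigned char *si_mask)
{
  *hi_mask = (unsigned short) ((1u << (2 * n_active)) - 1); /* V16HI lanes */
  *si_mask = (unsigned char) ((1u << n_active) - 1);        /* V8SI lanes */
}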