[Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

Richard Biener changed:

           What    |Removed                     |Added
 ----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
           Assignee|unassigned at gcc dot       |rguenth at gcc dot gnu.org
                   |gnu.org                     |
         Resolution|---                         |INVALID

--- Comment #14 from Richard Biener ---
Not a bug.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #13 from Hongtao.liu ---
(In reply to rguent...@suse.de from comment #12)
> On Wed, 15 Feb 2023, crazylht at gmail dot com wrote:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096
> >
> > --- Comment #11 from Hongtao.liu ---
> > > There's no other way to do N bit to two N/2 bit hi/lo (un)packing
> > > (there's a 2x N/2 bit -> N bit operation, for whatever reason).
> > > There's also no way to transform the d rgroup mask into the
> > > f rgroup mask for the first example, aka duplicate bits in place,
> > > { b0, b1, b2, ... bN } -> { b0, b0, b1, b1, b2, b2, ... bN, bN },
> > > nor the reverse.
> >
> > Can we just do a VIEW_CONVERT_EXPR on the vectype instead of the
> > mask_type?  I.e. we can do a VCE to transform V8SI to V16HI, then use
> > mask_load for V16HI with the same mask { b0, b0, b1, b1, b2, b2, ... },
> > then VCE it back to V8SI; that should be OK as long as the bits are
> > duplicated in place.  (Or VCE V16HI to V8SI, then use the mask
> > { b0, b1, b2, ..., bN }, and VCE V8SI back to V16HI after the masked
> > load/move.)
>
> Hmm, yes, if we arrange for the larger mask to be available that would
> work for loads and stores I guess.  It wouldn't work for arithmetic
> cond_* IFNs though.  It's also going to be a bit tricky within the
> masking framework - I'm going to see whether that works though, it might
> be a nice way to avoid an excessive number of masks for integer code
> at least.

There could be a limitation on nV (it should be a power of 2 for the
VCE?), i.e. there may be no suitable vectype for a VCE of src1's vectype
that lets us reuse the loop mask:

void foo (int* __restrict dest, int* src1, int* src2)
{
  for (int i = 0; i != 1; i++)
    dest[i] = src1[3*i] + src1[3*i + 1] + src1[3*i + 2];
}

Maybe AVX512 could use gather instructions for .MASK_LOAD_LANES so that
LOOP_MASK can be used?
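For reference, a rough sketch of that gather idea in AVX512F/VL
intrinsics - the gather_lanes name, the zeroing of inactive lanes and
the index vectors are illustrative assumptions, not what the vectorizer
would emit:

#include <immintrin.h>

/* Emulate a masked load-lanes of three-element structures by three
   gathers under the single 8-lane LOOP_MASK; inactive lanes are
   zeroed via the src operand.  Illustrative only.  */
static void
gather_lanes (const int *src1, __mmask8 loop_mask,
              __m256i *l0, __m256i *l1, __m256i *l2)
{
  /* Lane i loads src1[3*i], src1[3*i+1] and src1[3*i+2].  */
  const __m256i base = _mm256_setr_epi32 (0, 3, 6, 9, 12, 15, 18, 21);
  const __m256i zero = _mm256_setzero_si256 ();
  *l0 = _mm256_mmask_i32gather_epi32 (zero, loop_mask, base, src1, 4);
  *l1 = _mm256_mmask_i32gather_epi32 (zero, loop_mask,
                                      _mm256_add_epi32 (base, _mm256_set1_epi32 (1)),
                                      src1, 4);
  *l2 = _mm256_mmask_i32gather_epi32 (zero, loop_mask,
                                      _mm256_add_epi32 (base, _mm256_set1_epi32 (2)),
                                      src1, 4);
}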
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #12 from rguenther at suse dot de ---
On Wed, 15 Feb 2023, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096
>
> --- Comment #11 from Hongtao.liu ---
> > There's no other way to do N bit to two N/2 bit hi/lo (un)packing
> > (there's a 2x N/2 bit -> N bit operation, for whatever reason).
> > There's also no way to transform the d rgroup mask into the
> > f rgroup mask for the first example, aka duplicate bits in place,
> > { b0, b1, b2, ... bN } -> { b0, b0, b1, b1, b2, b2, ... bN, bN },
> > nor the reverse.
>
> Can we just do a VIEW_CONVERT_EXPR on the vectype instead of the
> mask_type?  I.e. we can do a VCE to transform V8SI to V16HI, then use
> mask_load for V16HI with the same mask { b0, b0, b1, b1, b2, b2, ... },
> then VCE it back to V8SI; that should be OK as long as the bits are
> duplicated in place.  (Or VCE V16HI to V8SI, then use the mask
> { b0, b1, b2, ..., bN }, and VCE V8SI back to V16HI after the masked
> load/move.)

Hmm, yes, if we arrange for the larger mask to be available that would
work for loads and stores I guess.  It wouldn't work for arithmetic
cond_* IFNs though.  It's also going to be a bit tricky within the
masking framework - I'm going to see whether that works though, it might
be a nice way to avoid an excessive number of masks for integer code
at least.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #11 from Hongtao.liu ---
> There's no other way to do N bit to two N/2 bit hi/lo (un)packing
> (there's a 2x N/2 bit -> N bit operation, for whatever reason).
> There's also no way to transform the d rgroup mask into the
> f rgroup mask for the first example, aka duplicate bits in place,
> { b0, b1, b2, ... bN } -> { b0, b0, b1, b1, b2, b2, ... bN, bN },
> nor the reverse.

Can we just do a VIEW_CONVERT_EXPR on the vectype instead of the
mask_type?  I.e. we can do a VCE to transform V8SI to V16HI, then use
mask_load for V16HI with the same mask { b0, b0, b1, b1, b2, b2, ... },
then VCE it back to V8SI; that should be OK as long as the bits are
duplicated in place.  (Or VCE V16HI to V8SI, then use the mask
{ b0, b1, b2, ..., bN }, and VCE V8SI back to V16HI after the masked
load/move.)
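For illustration, a minimal sketch of the load half of that idea with
AVX512VL/BW intrinsics; the mask-duplication loop and the function name
are assumptions - in the vectorizer the doubled mask would be the HI
rgroup's existing mask, not built bit by bit like this:

#include <immintrin.h>

/* Load 8 ints under an 8-bit lane mask by doing the load at V16HI
   granularity with each mask bit duplicated in place; viewing the
   256-bit result as V8SI afterwards is a no-op reinterpretation.  */
static __m256i
masked_load_via_v16hi (const int *p, __mmask8 b)
{
  __mmask16 m = 0;
  /* { b0, b1, ... } -> { b0, b0, b1, b1, ... }  */
  for (int i = 0; i < 8; i++)
    if (b & (1u << i))
      m |= (__mmask16) 3 << (2 * i);
  return _mm256_maskz_loadu_epi16 (m, p);
}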
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #9 from Richard Biener ---
(In reply to rsand...@gcc.gnu.org from comment #8)
> (In reply to rguent...@suse.de from comment #7)
> > More like precision, but x86 uses QImode for two-element, four-element
> > and eight-element masks (rather than two partial integer modes with
> > two and four bits of precision).
> Ah, OK.  So yeah, maybe the precision of the vector boolean element *
> the number of elements.

For SVE the following holds:

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 1996ecfee7a..9b24b481867 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -10097,6 +10097,12 @@ vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
                            TYPE_VECTOR_SUBPARTS (vectype)));
       gimple_seq seq = NULL;
       mask_type = truth_type_for (vectype);
+      /* Assert that both mask types have the same total number of value
+        bits.  */
+      gcc_assert (known_eq (TYPE_PRECISION (TREE_TYPE (TREE_TYPE (mask)))
+                           * TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)),
+                           TYPE_PRECISION (TREE_TYPE (mask_type))
+                           * TYPE_VECTOR_SUBPARTS (mask_type)));
       mask = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, mask);
       if (seq)
        gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);

For AVX the TYPE_PRECISION is always 1, so for unequal subparts we
cannot directly share masks.

I'm going to change LOOP_VINFO_MASKS from an array indexed by nV to a
two-dimensional table indexed by nV and bit-precision * subparts.  Well,
probably using a hash_map instead since this will be quite sparse.  Or
maybe not - but at least dynamically growing as we do now is difficult,
and subparts can be non-constant.
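As a data-structure sketch only (plain C with invented names standing in
for the hash_map), the proposed keying might look like:

/* Hypothetical model: masks keyed by (nV, precision * subparts)
   instead of by nV alone; a linear scan stands in for the sparse
   hash_map.  */
struct mask_key { unsigned nv; unsigned nbits; };
struct mask_entry { struct mask_key key; void *masks; };

static void *
lookup_masks (const struct mask_entry *tab, unsigned n,
              unsigned nv, unsigned nbits)
{
  for (unsigned i = 0; i < n; i++)
    if (tab[i].key.nv == nv && tab[i].key.nbits == nbits)
      return tab[i].masks;
  return 0;  /* not recorded yet */
}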
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #8 from rsandifo at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #7)
> More like precision, but x86 uses QImode for two-element, four-element
> and eight-element masks (rather than two partial integer modes with
> two and four bits of precision).
Ah, OK.  So yeah, maybe the precision of the vector boolean element *
the number of elements.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #7 from rguenther at suse dot de ---
On Mon, 10 Oct 2022, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096
>
> --- Comment #6 from rsandifo at gcc dot gnu.org ---
> The PR means supporting targets where this assumption doesn't hold.
> But I think we should test for it explicitly somehow.  For now we
> can probably assume that the assumption holds when the two mask
> modes have equal size.

More like precision, but x86 uses QImode for two-element, four-element
and eight-element masks (rather than two partial integer modes with
two and four bits of precision).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #6 from rsandifo at gcc dot gnu.org ---
I don't think we should key this off whether the masks are integers or
vectors.  In principle there's no reason why a vector-based approach
couldn't work the AVX512 way, or an integer approach the SVE way.
(build_truth_vector_type_for_mode is currently hard-coded to 1 bit per
element for integers, but that could change.)

At the moment we make the simplifying assumption that if a sequence of
nV masks is suitable for one (nS,nL) pair, we can reuse it for
(nS/2,nL/2) by VIEW_CONVERTing it.  This holds for all current targets
that support fully-masked loops.

The PR means supporting targets where this assumption doesn't hold.
But I think we should test for it explicitly somehow.  For now we can
probably assume that the assumption holds when the two mask modes have
equal size.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #5 from Richard Biener ---
(In reply to Andrew Stubbs from comment #4)
> I don't understand rgroups, but I can say that GCN masks are very simply
> one-bit-one-lane.  There are always 64 lanes, regardless of the type, so
> V64QImode has fewer bytes and bits than V64DImode (when written to
> memory).
>
> This is different to most other architectures, where the bit-size
> remains the same and the number of lanes varies with the inner type, and
> it has caused us some issues with invalid assumptions in GCC (e.g.
> "there's no need for sign-extends in vector registers" is not true for
> GCN).
>
> However, I think it's the same as you're describing for AVX512, at least
> in this respect.
>
> Incidentally, I'm on the cusp of adding multiple "virtual" vector sizes
> in the GCN backend (in lieu of implementing full mask support everywhere
> in the middle-end and fixing all the cost assumptions), so these
> VIEW_CONVERT_EXPR issues are getting worse.  I have a bunch of
> vec_extract patterns that fix up some of it.  Within the backend, the
> V32, V16, V8, V4 and V2 vectors are all really just 64-lane vectors with
> the mask preset, so the mask has to remain DImode or register allocation
> becomes tricky.

For the documentation test case GCN seems to skirt the issue by using
differently "sized" vectors, so it manages to get two different nV here,
one for the float and one for the double rgroup.  With AVX512 I get the
same nV and wrong code, re-using the mask of the floats:

  _51 = VIEW_CONVERT_EXPR<...>(loop_mask_40);

(the verifier does not ICE because it only checks modes).  GCN gets
nV == 1 for both rgroups of

void foo (float *f, double * __restrict d, int n)
{
  for (int i = 0; i < n; ++i)
    {
      f[i] += 1.0f;
      d[i] += 3.0;
    }
}

and here sharing the mask is OK.  So it looks like the sharing logic
depends on how we get to assign vector modes - GCN insists on handing
out only 64-lane vectors.  If you changed that and allowed mixing, I
guess you would run into similar issues as AVX512.  Handing out only
8-lane vectors would limit AVX512 quite a bit, so that doesn't sound
like a viable option for us.  RVV might be in the same situation as GCN
here.

For GCN with DImode mask vectors, at the point where V_C_Es would be
emitted we could assert that the numbers of lanes are the same (we
probably should, otherwise we'd have wrong code).  So eventually
special-casing integer mode masks might work out.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #4 from Andrew Stubbs ---
I don't understand rgroups, but I can say that GCN masks are very simply
one-bit-one-lane.  There are always 64 lanes, regardless of the type, so
V64QImode has fewer bytes and bits than V64DImode (when written to
memory).

This is different to most other architectures, where the bit-size
remains the same and the number of lanes varies with the inner type, and
it has caused us some issues with invalid assumptions in GCC (e.g.
"there's no need for sign-extends in vector registers" is not true for
GCN).

However, I think it's the same as you're describing for AVX512, at least
in this respect.

Incidentally, I'm on the cusp of adding multiple "virtual" vector sizes
in the GCN backend (in lieu of implementing full mask support everywhere
in the middle-end and fixing all the cost assumptions), so these
VIEW_CONVERT_EXPR issues are getting worse.  I have a bunch of
vec_extract patterns that fix up some of it.  Within the backend, the
V32, V16, V8, V4 and V2 vectors are all really just 64-lane vectors with
the mask preset, so the mask has to remain DImode or register allocation
becomes tricky.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

Richard Biener changed:

           What    |Removed                     |Added
 ----------------------------------------------------------------------------
                 CC|                            |ams at gcc dot gnu.org

--- Comment #3 from Richard Biener ---
(In reply to rsand...@gcc.gnu.org from comment #2)
> See the comment above rgroup_controls in tree-vectorizer.h for the
> current assumptions around loop predication.  If AVX512 wants something
> different then some extensions will be needed :-)

Coming back to this now.  Crucially

  If only the first three lanes are active, the masks we need are:

     f rgroup: 1 1 | 1 1 | 1 1 | 0 0
     d rgroup:  1  |  1  |  1  |  0

  Here we can use a mask calculated for f's rgroup for d's, but not
  vice versa.

seems to assume that the space in the mask vector for the "bools" of the
d rgroup is twice as large as that for the f rgroup.  For AVX512 there's
a single bit for each lane, independent of the width of the actual data.
So instead of

  Thus for each value of nV, it is enough to provide nV masks, with the
  mask being calculated based on the highest nL (or, equivalently, based
  on the highest nS) required by any rgroup with that nV.  We therefore
  represent the entire collection of masks as a two-level table, with
  the first level being indexed by nV - 1 (since nV == 0 doesn't exist)
  and the second being indexed by the mask index 0 <= i < nV.

we need a set of nV masks for each value of nS, and we can pick the
smallest nV for each nS and generate the corresponding larger nV masks
by a series of shifts.  In fact we can re-use the first vector (excess
bits are OK).  So for the example in tree-vectorizer.h

float *f;
double *d;
for (int i = 0; i < n; ++i)
  {
    f[i * 2 + 0] += 1.0f;
    f[i * 2 + 1] += 2.0f;
    d[i] += 3.0;
  }

we'd need to perform two WHILE_ULTs.  For

float *f;
double *d;
for (int i = 0; i < n; ++i)
  {
    f[i] += 1.0f;
    d[i] += 3.0;
  }

we'd compute the mask for the f rgroup with a WHILE_ULT, and we'll have
nV_d == 2 * nV_f; the first mask vector from f can be reused for d (but
not the other way around).  The second mask vector for d can be obtained
by kshiftr.

There's no other way to do N bit to two N/2 bit hi/lo (un)packing
(there's a 2x N/2 bit -> N bit operation, for whatever reason).
There's also no way to transform the d rgroup mask into the f rgroup
mask for the first example, aka duplicate bits in place,
{ b0, b1, b2, ... bN } -> { b0, b0, b1, b1, b2, b2, ... bN, bN },
nor the reverse.

So in reality it seems we need a set of mask vectors for the full set of
nS * nV combinations with AVX512.  Doing full masking with AVX2-style
vectors would work with the existing rgroup control scheme.

Currently the "key" to the AVX512 behavior is the use of scalar modes
for the mask vectors, but then that's also what GCN uses.  Do GCN mask
bits really map to bytes, to allow the currently used rgroup scheme?
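A minimal sketch of that reuse with AVX512 kmask intrinsics (the
function name is illustrative; in GIMPLE these would be V_C_Es plus a
shift of the mask):

#include <immintrin.h>

/* One bit per lane: the mask computed for 16 float lanes serves both
   8-lane double vectors - the first by reusing the low bits directly
   (excess high bits are OK), the second via kshiftr.  */
static void
d_masks_from_f_mask (__mmask16 f_mask, __mmask8 *d0, __mmask8 *d1)
{
  *d0 = (__mmask8) f_mask;                        /* low 8 lane bits */
  *d1 = (__mmask8) _kshiftri_mask16 (f_mask, 8);  /* high 8 lane bits */
}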
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #2 from rsandifo at gcc dot gnu.org ---
See the comment above rgroup_controls in tree-vectorizer.h for the
current assumptions around loop predication.  If AVX512 wants something
different then some extensions will be needed :-)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

Richard Biener changed:

           What    |Removed                     |Added
 ----------------------------------------------------------------------------
             Target|                            |x86_64-*-*
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #1 from Richard Biener ---
Take the simplified

void foo (int * __restrict dst, short *src, int n)
{
  for (int i = 0; i < n; ++i)
    dst[i] = src[2*i] + src[2*i+1];
}

Here we get vect_record_loop_mask twice with nvectors == 2 and V16HImode
for the load, which populates masks[2].  Then we once get V8SImode, but
also with nvectors == 2, which leaves the data unadjusted since it looks
at masks[2] as well - but if it were to come first we'd have recorded a
different mask vector type.  The masks seem to be constructed in a way
that produces two bits per lane (but we still apply it naively?!), and
the V_C_E does actually look wrong to me.  Huh.

With SVE I seem to get (besides a .LOAD_LANE version) permutes of the
mask vector:

  loop_mask_116 = VEC_PERM_EXPR <...>;
  loop_mask_117 = VEC_PERM_EXPR <...>;
...
  vect__69.20_118 = .MASK_LOAD (vectp_src.18_114, 16B, loop_mask_116);
  vectp_src.18_119 = vectp_src.18_114 + POLY_INT_CST [8, 8];
  vect__69.21_120 = .MASK_LOAD (vectp_src.18_119, 16B, loop_mask_117);
...
  .MASK_STORE (vectp_dst.25_129, 32B, loop_mask_131, vect__79.24_125);

so the original mask provider here is the larger element vector type,
and the smaller element masks are produced from that.  That's something
I'd expect for AVX512 as well; not sure where it goes "wrong".

See PR107093 for a patch implementing WHILE_ULT for AVX512.
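To make the mismatch concrete, here is a hedged model (not vectorizer
code) of the two tail masks the rgroups would need per vector when
n_active int lanes remain - two kmask bits per int's worth of shorts for
the V16HI loads versus one for the V8SI store, so sharing masks[2]
between them cannot be right:

/* Illustrative only: with one kmask bit per vector lane, the HI and
   SI rgroups need different mask values for the same n_active even
   though both record nvectors == 2.  */
static void
tail_masks (int n_active, unsigned short *hi_mask, unsigned char *si_mask)
{
  *hi_mask = (unsigned short) ((1u << (2 * n_active)) - 1); /* V16HI lanes */
  *si_mask = (unsigned char) ((1u << n_active) - 1);        /* V8SI lanes */
}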