[Bug tree-optimization/104368] [12 Regression] Failure to vectorise conditional grouped accesses after PR102659

rguenth at gcc dot gnu.org via Gcc-bugs Thu, 03 Feb 2022 23:58:21 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104368


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2022-02-04
             Status|UNCONFIRMED                 |NEW
                 CC|                            |amacleod at redhat dot com

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  On x86 with AVX2 we don't get this vectorized anymore for the same
reason.

t.c:5:15: missed:  failed: evolution of base is not affine.
        base_address:
        offset from base address:
        constant offset from base address:
        step:
        base alignment: 0
        base misalignment: 0
        offset alignment: 0
        step alignment: 0
        base_object: *_8
Creating dr for *_12

if-conversion now produces

...
  _47 = (unsigned long) y_21(D);
...
# i_26 = PHI <i_23(8), 0(15)>
_1 = (long unsigned int) i_26;
_2 = _1 * 4;
_3 = x_20(D) + _2;
_4 = *_3;
_45 = (unsigned int) i_26;
_46 = _45 * 2;
_5 = (int) _46;
_6 = (long unsigned int) _5;
_7 = _6 * 4;
_48 = _47 + _7;
_8 = (int *) _48;
_49 = _4 > 0;
_9 = .MASK_LOAD (_8, 32B, _49);
_10 = _6 + 1;
_11 = _10 * 4;
_51 = _11 + _47;
_12 = (int *) _51;
_13 = .MASK_LOAD (_12, 32B, _49);
_52 = (unsigned int) _9;
_53 = (unsigned int) _13;
_54 = _52 + _53;
_14 = (int) _54;
.MASK_STORE (_3, 32B, _49, _14);
i_23 = i_26 + 1;
if (n_19(D) > i_23)
  goto <bb 8>; [89.00%]
else
  goto <bb 6>; [11.00%]


Note that if-conversion is correct in rewriting i*2 and i*2 + 1 to unsigned
arithmetic, since those computations now execute unconditionally and could
otherwise overflow.
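
For orientation, the IL above is consistent with a loop of the following shape.
This is a reconstruction from the dump, not the testcase attached to the PR, so
the function name `f` and its signature are assumptions.

```c
/* Hypothetical source loop inferred from the if-converted IL; the name
   and signature are assumptions, not taken from the PR.  */
void
f (int *restrict x, int *restrict y, int n)
{
  for (int i = 0; i < n; ++i)
    if (x[i] > 0)
      /* Grouped access: y[i*2] and y[i*2 + 1] are read only when the
         condition holds, so i*2 + 1 is evaluated in signed int only on
         that path.  */
      x[i] = y[i * 2] + y[i * 2 + 1];
}
```

After if-conversion the index computation runs on every iteration, which is why
it has to be carried out in unsigned arithmetic in the dump.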

In the end the issue is that the multiplication by the element size is
done in sizetype, so y[i*2] and y[i*2+1] might not be adjacent.  What
we miss is that if the stmts were executed in the original code then, because
signed overflow is undefined, they would always be adjacent.

IMHO the only good way to recover is to scrap the separate if-conversion step
and do vectorization on the original IL.  Or integrate the two passes
far enough to allow dataref analysis on the non-if-converted IL.

Another possibility (and long-standing TODO) is to teach SCEV analysis
to derive assumptions we can version the loop on - in this case that
i*2 + 1 does not overflow.
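
As a rough, hand-written illustration of what versioning on such an assumption
could look like at the source level (this is not what GCC emits; `f_versioned`
and `MAX_SAFE_N` are hypothetical names, and a 32-bit int is assumed):

```c
#include <limits.h>
#include <stddef.h>

/* Largest trip count n for which i*2 + 1 <= INT_MAX holds for every
   i in [0, n).  Hypothetical helper macro for this sketch.  */
#define MAX_SAFE_N ((INT_MAX - 1) / 2 + 1)

void
f_versioned (int *restrict x, int *restrict y, int n)
{
  if (n <= MAX_SAFE_N)
    {
      /* Version taken when the assumption holds: the index cannot
         overflow, so it can be computed in a wide type and is affine
         in i.  */
      for (int i = 0; i < n; ++i)
        {
          size_t idx = (size_t) i * 2;
          if (x[i] > 0)
            x[i] = y[idx] + y[idx + 1];
        }
    }
  else
    {
      /* Fallback version: the original loop, left untouched.  */
      for (int i = 0; i < n; ++i)
        if (x[i] > 0)
          x[i] = y[i * 2] + y[i * 2 + 1];
    }
}
```

Both versions compute the same result; only the first one exposes an index
whose evolution is trivially affine.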

Note that in this particular case we probably fail to see that

i is in [0,INT_MAX-1] and thus (unsigned)i * 2 + 1 never wraps

(unless I am missing something).  We have

  <bb 3> [local count: 955630226]:
  # RANGE [0, 2147483647] NONZERO 2147483647
  # i_26 = PHI <i_23(8), 0(15)>
  # RANGE [0, 2147483646] NONZERO 2147483647
  _1 = (long unsigned int) i_26;
  # RANGE [0, 8589934584] NONZERO 8589934588
  _2 = _1 * 4;
  # PT = null { D.2435 } (nonlocal, restrict)
  _3 = x_20(D) + _2;
  _4 = MEM[(int *)_3 clique 1 base 1];
  _45 = (unsigned int) i_26;
  _46 = _45 * 2;
  _5 = (int) _46;
  _6 = (long unsigned int) _5;
  _7 = _6 * 4;
  _48 = _47 + _7;

So unfortunately, while _1 has the correct range, i_26 does not, and neither
do the ifcvt-generated stmts.  It might be possible to run ranger on the
if-converted body.
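
The claim that (unsigned)i * 2 + 1 never wraps for i in [0,INT_MAX-1] can be
checked with a small sketch like the one below.  It assumes a 32-bit int, and
the helper name `no_wrap` is made up for illustration.

```c
#include <limits.h>
#include <stdint.h>

/* Return 1 if (unsigned)i * 2 + 1, computed in unsigned int, equals the
   same expression computed in 64 bits, i.e. the narrow computation did
   not wrap.  Assumes unsigned int is narrower than 64 bits.  */
int
no_wrap (int i)
{
  unsigned int narrow = (unsigned int) i * 2 + 1;
  uint64_t wide = (uint64_t) (unsigned int) i * 2 + 1;
  return narrow == wide;
}
```

For i = INT_MAX - 1 the exact value is 2 * (2^31 - 2) + 1 = 2^32 - 3, which
still fits in a 32-bit unsigned int.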

Andrew - if we'd like to do that, in tree-if-conv.cc in tree_if_conversion ()
after we've produced the final IL (after the call to ifcvt_hoist_invariants),
is there a way to invoke ranger on the stmts of the (single-BB) loop
and have it adjust the global ranges?  In particular - see above, it
would need to somehow improve the global range of the i_26 IV.

The pass creates blocks and destroys edges, so I'm not sure we can
reasonably keep a caching instance alive over its lifetime; the cost per loop
would therefore be a limiting factor.
