Extend GCC's loop vectorizer to recognize loops with predicate-guarded stores 
into a buffer with an offset incremented under the same predicate.
Enable the backend to emit AVX-512 VPCOMPRESS for the vectorized 
compressing-store path when that is profitable.

This RFC is in response to PR tree-optimization/91198 (GCC not generating 
AVX-512 compress/expand instructions in relevant cases):
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91198

We will be sending an initial prototype patch for phase 1 soon to gcc-patches 
list.

------------------------------------------------------------------------------

Running example
===============

The following kernel is used throughout this document to illustrate the 
proposed design. It is reduced from the SPEC CPU 2017 benchmark 527.cam4_r.

int
cam4_filter_pack_kernel_restrict (int N, double thresh,
                                  const double *restrict cwm,
                                  double *restrict coef,
                                  double *restrict fwaut,
                                  int *restrict ind)
{
  int ncols = 0;
  for (int i = 0; i < N; ++i)
    {
      coef[i] = 0.0;
      fwaut[i] = 0.0;

      if (cwm[i] > thresh)
        {
          ind[ncols] = i + 1;
          ncols++;
        }
    }
  return ncols;
}

Assumption: Before the loop vectorizer pass, in GIMPLE IR, the loops of 
interest combine three features: (1) a predicate that controls whether the 
current iteration executes the store or not, (2) a loop-carried PHI that counts 
how many elements have been stored so far,  and (3) a masked store whose 
address depends on that PHI. If-conversion pass has already run on the loop and 
it has created the masked store that uses that predicate and the value to be 
stored.

Note: This document focuses mostly on simple loops like the above running 
example. It also assumes the relevant pointers or arrays do not alias (e.g. 
restrict), which simplifies runtime address and bound checks. More complex 
cases (like multiple number of predicated stores using same or different 
induction variables, epilogue vectorization, etc.) and addition of runtime 
aliasing checks can be covered in an extended design as a follow-up.

------------------------------------------------------------------------------

This RFC outlines a phased approach: Phase 1 detects and classifies the pattern 
early, Phase 2 represents it with explicit internal functions, and Phase 3 
updates the predicated offset of the store and vectorizes the predicated store 
with compressing store using VPCOMPRESS instruction.

------------------------------------------------------------------------------

Phase 1 — Classification of the predicated PHI
==============================================

Phase 1 detects the pattern early by requiring a header PHI for the running 
offset which is updated only when the predicate holds (termed as predicated-PHI 
from now on), and a masked store in the same loop that uses the predicate and 
uses the predicated-PHI as index to get address of the store. It also requires 
that when the predicate is true, the predicated-PHI must advance by exactly one 
element per scalar iteration. Phase 1 records that predicated-PHI along with 
the masked store on which it is paired with.

Input

- GIMPLE after if-conversion: the inner loop containing the PHI for ncols, the 
conditional update, and .MASK_STORE on ind, as in the following GIMPLE IR of 
the running example.

Representative GIMPLE at vectorizer input:

  # ncols_49 = PHI <ncols_17(8), 0(17)>
  # i_51 = PHI <_58(8), 0(17)>
  _1 = (long unsigned int) i_51;
  _2 = _1 * 8;
  _11 = cwm_40(D) + _2;
  _12 = *_11;
  _58 = i_51 + 1;
  _56 = _12 > thresh_41(D);
  _13 = (long unsigned int) ncols_49;
  _14 = _13 * 4;
  _90 = _3 + _14;
  _15 = (int *) _90;
  .MASK_STORE (_15, 32B, _56, _58);
  _91 = (unsigned int) ncols_49;
  _92 = _91 + 1;
  ncols_44 = (int) _92;
  ncols_17 = _56 ? ncols_44 : ncols_49;

Output

Phase 1 performs the +1 legality check and records the predicated-PHI and pair 
it with the masked store: ncols_49 is marked as the predicated-PHI, and the 
loop’s vectorizer state records that PHI together with the original masked 
store .MASK_STORE (_15, 32B, _56, _58). Later phases use this recorded 
association as the anchor for internal-function rewrites and backend pattern 
selection.

------------------------------------------------------------------------------

Phase 2 — Internal functions and pattern recognition
====================================================

Phase 2 introduces two internal functions: one for the predicated-PHI update 
(if the predicate holds, add one, else leave it unchanged), and one for the 
compressing masked store, bundling the store's address, predicate mask, and 
value. They line up with the predicated-PHI and mask-store pair that Phase 1 
already recorded.

Input

- Legality information recorded in Phase 1 (predicated-PHI, paired 
IFN_MASK_STORE). When that classification remains valid, Phase 2 goes ahead 
with introducing and wiring the internal functions.

Output

- Pattern recognition rewrites the classified predicated-PHI and original 
IFN_MASK_STORE onto two internal functions: 
IFN_PREDICATED_INDEX_UPDATE(ncols_49, _56) for the predicated-PHI update, and 
IFN_MASK_COMPRESS_STORE(_15, _56, _58) for the compressing masked store (ptr, 
mask, value).

------------------------------------------------------------------------------

Phase 3 — Vector semantics, aliasing, costing
=============================================

This phase is further split into incremental milestones.

Milestone M1 — Updating the predicated-PHI in vectorized IR
-----------------------------------------------------------

Milestone M1 gives the predicated-PHI the right semantics in the vector loop: 
each vector step must advance the index by the sum of active predicate lanes.

Input

- Pattern IR from Phase 2: IFN_PREDICATED_INDEX_UPDATE(ncols_49, _56)

Output

- The predicated-PHI is updated from the widened predicate using one of two 
routes — (1) cast the comparison mask to a scalar value and apply popcount, or 
(2) use a horizontal lane reduction that yields the active-lane count. The 
exact choice can be decided later.

Milestone M2 — Dependence and alias
-----------------------------------

Milestone M2 integrates compress-store with dependence and alias analysis by 
treating it consistently with masked stores so that vectorization is legalized.

Milestone M3 — Costing
----------------------

Milestone M3 costs the vector compress path against scalar and alternative 
vector schemes.

Milestone M4 — Enable RTL to generate Vector compress store
-----------------------------------------------------------

Milestone M4 makes necessary changes in RTL expansion phase, to detect the 
internal function IFN_MASK_COMPRESS_STORE and use respective RTL expander 
routies to generate VPCOMPRESS.

Reply via email to