"Kewen.Lin" <li...@linux.ibm.com> writes:
> Hi Richard,
>
> on 2020/5/29 4:32 PM, Richard Sandiford wrote:
>> "Kewen.Lin" <li...@linux.ibm.com> writes:
>>> on 2020/5/27 6:02 PM, Richard Sandiford wrote:
>>>> "Kewen.Lin" <li...@linux.ibm.com> writes:
>>>>> Hi Richard,
>>>>>
>
> Snip ...
>
>>>
>>> Thanks a lot for your detailed explanation!  This proposal looks good
>>> based on the current implementations of masking and length.  I may be
>>> overthinking it, but I have a concern (below) about targets that gain
>>> support for both masking and length in future, e.g. if ppc adds
>>> SVE-like masking support.
>>>
>>> I assumed you meant that each vectorizable_* routine should record the
>>> objs for every partial vectorisation approach that is available.  If one
>>> target supports both, we would have both recorded but would finally
>>> decide not to do partial vectorisation, since both have records.  The
>>> target could disable length (e.g. through the optab) to resolve that,
>>> but there is a possibility that the masking support is imperfect
>>> initially, since ISA support can be gradual; that could make some
>>> vectorizable_* check or the final verification fail for masking, while
>>> the length approach would have worked but has been disabled.  We could
>>> then miss the chance to use partial vectorisation.
>>>
>>> The other assumption is that each vectorizable_* routine records only
>>> the first available partial vectorisation approach; let's assume masking
>>> takes preference.  Then it's fine to record just one even if a target
>>> supports both approaches, but we could still miss the chance to do
>>> partial vectorisation when some check or verification fails with masking
>>> but would succeed with length.
>>>
>>> Does this concern make sense?
>> 
>> There's nothing to stop us using masks and lengths in the same loop
>> in future if we need to.  It would “just” be a case of setting up both
>> the masks and the lengths in vect_set_loop_condition.  But the point is
>> that doing that would be extra code, and there's no point writing that
>> extra code until it's needed.
>> 
>> If some future arch does support both mask-based and length-based
>> approaches, I think that's even less reason to make a binary choice
>> between them.  How we prioritise the length and mask approaches when
>> both are available is something that we'll have to decide at the time.
>> 
>> If your concern is that the arch might support masked operations
>> without wanting them to be used for loop control, we could test for
>> that case by checking whether while_ult_optab is implemented.
>> 
>> Thanks,
>> Richard
>> 
>
> Thanks for your further explanation.  As you pointed out, my concern is
> just one case of mixing the mask-based and length-based approaches.  I
> hadn't realised that and assumed we would still use only one approach per
> loop, but that assumption doesn't hold.
>
> The attached v3 patch uses can_partial_vect_p.  During regression testing
> with the explicit vect-with-length-scope setting, I saw several reduction
> failures, so I updated vectorizable_condition to set can_partial_vect_p to
> false for !EXTRACT_LAST_REDUCTION, following your guidance that each
> routine should either record something or clear can_partial_vect_p.
>
> Bootstrapped/regtested on powerpc64le-linux-gnu P9; no notable failures
> found even with explicit vect-with-length-scope settings.
>
> But I hit one regression failure on aarch64-linux-gnu:
>
> PASS->FAIL: gcc.target/aarch64/sve/reduc_8.c -march=armv8.2-a+sve  
> scan-assembler-not \\tcmpeq\\tp[0-9]+\\.s,
>
> It's caused by the change to vectorizable_condition; without it, the
> outer loop can use full masking.  The reduction_type here is
> TREE_CODE_REDUCTION, so can_partial_vect_p gets cleared.
>
> From the optimized dump, the previous IR looks fine.  The reduction is
> for the inner loop, but we are checking partial vectorisation for the
> outer loop.  I'm not sure whether adjusting the current guard is
> reasonable for this case.  Could you share some insights?  Thanks in
> advance!
>
> BR,
> Kewen
> ------
> gcc/ChangeLog

It would be easier to review, and perhaps easier to bisect,
if some of the mechanical changes were split out.  E.g.:

- Rename can_fully_mask_p to can_use_partial_vectors_p.

- Rename fully_masked_p to using_partial_vectors_p.

- Rename things related to rgroup_masks.  I think “rgroup_controls”
  or “rgroup_guards” might be more descriptive than “rgroup_objs”.

These should be fairly mechanical changes and can happen ahead of
the main series.  It'll then be easier to see what's different
for masks and lengths, separately from the more mechanical stuff.

As far as:

+  union
+  {
+    /* The type of mask to use, based on the highest nS recorded above.  */
+    tree mask_type;
+    /* Any vector type to use these lengths.  */
+    tree vec_type;
+  };

goes, some parts of the code seem to use mask_type for lengths too,
which I'm a bit nervous about.  I think we should either be consistent
about which union field we use (always mask_type for masks, always
vec_type for lengths) or we should just rename mask_type to something
more generic.  Just "type" might be good enough with a suitable comment.
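
E.g. something like this, purely as a sketch of the second option
(the exact comment wording would need to match the final semantics):

  /* The type of the object associated with this rgroup: for masks,
     the mask type to use, based on the highest nS recorded above;
     for lengths, a vector type that the lengths apply to.  */
  tree type;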

>  {
>    tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
>    tree iv_type = LOOP_VINFO_MASK_IV_TYPE (loop_vinfo);
> -  tree mask_type = rgm->mask_type;
> -  unsigned int nscalars_per_iter = rgm->max_nscalars_per_iter;
> -  poly_uint64 nscalars_per_mask = TYPE_VECTOR_SUBPARTS (mask_type);
> +
> +  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
> +  if (!vect_for_masking)
> +    {
> +      /* Obtain target supported length type.  */
> +      scalar_int_mode len_mode = targetm.vectorize.length_mode;
> +      unsigned int len_prec = GET_MODE_PRECISION (len_mode);
> +      compare_type = build_nonstandard_integer_type (len_prec, true);
> +      /* Simply set iv_type as same as compare_type.  */
> +      iv_type = compare_type;

This might not be the best time to bring this up :-) but it seems
odd to be asking the target for the induction variable type here.
I got the impression that the hook was returning DImode, whereas
the PowerPC instructions only looked at the low 8 bits of the length.
If so, forcing a naturally 32-bit IV to DImode would insert extra
sign/zero extensions, even though the input to the length intrinsics
would have been happy with the 32-bit IV.

I think it would make sense to ask the target for its minimum
precision P (which would be 8 bits if the above is correct).
The starting point would then be the maximum of:

- this P
- the IV's natural precision
- the precision needed to hold:
    the maximum number of scalar iterations multiplied by the scale factor
    (to convert scalar counts to bytes)

If the IV might wrap at that precision without producing all-zero lengths,
it would be worth doubling the precision to avoid the wrapping issue,
provided that we don't go beyond BITS_PER_WORD.
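
As a rough sketch (all of the lower-case names below are placeholders
rather than existing interfaces, just to show the shape of the
calculation):

  /* min_prec is the target's minimum length precision (8 bits in the
     example above), natural_prec the IV's natural precision,
     max_niters the maximum number of scalar iterations and factor
     the scalar-count-to-bytes multiplier.  */
  unsigned HOST_WIDE_INT max_count = max_niters * factor;
  unsigned int count_prec = wi::min_precision (max_count, UNSIGNED);
  unsigned int prec = MAX (min_prec, MAX (natural_prec, count_prec));

  /* If the IV could wrap at this precision without giving all-zero
     lengths, double the precision to avoid the wrap, as long as that
     keeps us within BITS_PER_WORD.  */
  if (might_wrap_p && prec * 2 <= BITS_PER_WORD)
    prec *= 2;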

> +  tree obj_type = rgo->mask_type;
> +  /* Here, take nscalars_per_iter as nbytes_per_iter for length.  */
> +  unsigned int nscalars_per_iter = rgo->max_nscalars_per_iter;

I think whether we count scalars or count bytes is really a separate
decision that shouldn't be tied directly to using lengths.  Length-based
loads and stores on other arches might want to count scalars too.
I'm not saying you should add support for that (it wouldn't be tested),
but I think we should avoid structuring the code to make it harder to
add in future.

So I think nscalars_per_iter should always count scalars and anything
length-based should be separate.  Would it make sense to store the
length scale factor as a separate field?  I.e. using the terms
above the rgroup_masks comment, the length IV step is:

   factor * nS * VF == factor * nV * nL

That way, applying the factor becomes separate from lengths vs. masks.
The factor would also be useful in calculating the IV precision above.
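
Concretely, I mean something like this in the rgroup structure
(sketch only, not tied to particular field names):

  /* The number of scalars per iteration; this always counts scalars,
     whether we end up using masks or lengths.  */
  unsigned int max_nscalars_per_iter;

  /* A multiplier applied when converting scalar counts into the units
     that the loads and stores measure: e.g. the scalar type size in
     bytes for byte-based lengths, or 1 when lengths (or masks) count
     scalars directly.  */
  unsigned int factor;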

> [...]
> -/* Make LOOP iterate NITERS times using masking and WHILE_ULT calls.
> -   LOOP_VINFO describes the vectorization of LOOP.  NITERS is the
> -   number of iterations of the original scalar loop that should be
> -   handled by the vector loop.  NITERS_MAYBE_ZERO and FINAL_IV are
> -   as for vect_set_loop_condition.
> +/* Make LOOP iterate NITERS times using objects like masks (and
> +   WHILE_ULT calls) or lengths.  LOOP_VINFO describes the vectorization
> +   of LOOP.  NITERS is the number of iterations of the original scalar
> +   loop that should be handled by the vector loop.  NITERS_MAYBE_ZERO
> +   and FINAL_IV are as for vect_set_loop_condition.
>  
>     Insert the branch-back condition before LOOP_COND_GSI and return the
>     final gcond.  */
>  
>  static gcond *
> -vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo,
> -                             tree niters, tree final_iv,
> -                             bool niters_maybe_zero,
> -                             gimple_stmt_iterator loop_cond_gsi)
> +vect_set_loop_condition_partial (class loop *loop, loop_vec_info loop_vinfo,
> +                              tree niters, tree final_iv,
> +                              bool niters_maybe_zero,
> +                              gimple_stmt_iterator loop_cond_gsi)
>  {
>    gimple_seq preheader_seq = NULL;
>    gimple_seq header_seq = NULL;
>  
> +  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
> +
>    tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
> +  if (!vect_for_masking)
> +    {
> +      /* Obtain target supported length type as compare_type.  */
> +      scalar_int_mode len_mode = targetm.vectorize.length_mode;
> +      unsigned len_prec = GET_MODE_PRECISION (len_mode);
> +      compare_type = build_nonstandard_integer_type (len_prec, true);

Same comment as above about the choice of IV type.  We shouldn't
recalculate this multiple times.  It would be better to calculate
it upfront and store it in the loop_vinfo.
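
I.e. something like (the field names here are only for illustration):

  /* Once per loop, at the point where we decide that partial
     vectorisation is possible:  */
  LOOP_VINFO_PARTIAL_COMPARE_TYPE (loop_vinfo) = compare_type;
  LOOP_VINFO_PARTIAL_IV_TYPE (loop_vinfo) = iv_type;

and then vect_set_loop_condition_partial and friends would simply read
the types back instead of asking the target again.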

> @@ -2567,7 +2622,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>    if (vect_epilogues
>        && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>        && prolog_peeling >= 0
> -      && known_eq (vf, lowest_vf))
> +      && known_eq (vf, lowest_vf)
> +      && !LOOP_VINFO_FULLY_WITH_LENGTH_P (epilogue_vinfo))

Why's this check needed?

>      {
>        unsigned HOST_WIDE_INT eiters
>       = (LOOP_VINFO_INT_NITERS (loop_vinfo)
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 80e33b61be7..99e6cb904ba 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -813,8 +813,10 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
>      vec_outside_cost (0),
>      vec_inside_cost (0),
>      vectorizable (false),
> -    can_fully_mask_p (true),
> +    can_partial_vect_p (true),

I think “can_use_partial_vectors_p” reads better

>      fully_masked_p (false),
> +    fully_with_length_p (false),

I think it would be better if these two were a single flag
(using_partial_vectors_p), with masking vs. lengths being derived
information.
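
E.g. (sketch only, assuming the two containers end up being called
"masks" and "lens"):

  /* True if we have decided to use partially-populated vectors.  */
  bool using_partial_vectors_p;

with the mask/length distinction then following from which container
is populated:

#define LOOP_VINFO_FULLY_MASKED_P(L) \
  ((L)->using_partial_vectors_p && !(L)->masks.is_empty ())
#define LOOP_VINFO_FULLY_WITH_LENGTH_P(L) \
  ((L)->using_partial_vectors_p && !(L)->lens.is_empty ())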

> +    epil_partial_vect_p (false),
>      peeling_for_gaps (false),
>      peeling_for_niter (false),
>      no_data_dependencies (false),
> @@ -880,13 +882,25 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
>  void
>  release_vec_loop_masks (vec_loop_masks *masks)
>  {
> -  rgroup_masks *rgm;
> +  rgroup_objs *rgm;
>    unsigned int i;
>    FOR_EACH_VEC_ELT (*masks, i, rgm)
> -    rgm->masks.release ();
> +    rgm->objs.release ();
>    masks->release ();
>  }
>  
> +/* Free all levels of LENS.  */
> +
> +void
> +release_vec_loop_lens (vec_loop_lens *lens)
> +{
> +  rgroup_objs *rgl;
> +  unsigned int i;
> +  FOR_EACH_VEC_ELT (*lens, i, rgl)
> +    rgl->objs.release ();
> +  lens->release ();
> +}
> +

There's no need to duplicate this function.
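
The renamed structure lets a single helper cover both, e.g.
(assuming vec_loop_masks and vec_loop_lens both end up as vectors of
rgroup_objs):

/* Free all levels of OBJS, whether they hold masks or lengths.  */

void
release_vec_loop_objs (vec<rgroup_objs> *objs)
{
  rgroup_objs *rgo;
  unsigned int i;
  FOR_EACH_VEC_ELT (*objs, i, rgo)
    rgo->objs.release ();
  objs->release ();
}

called once for the masks container and once for the lens container.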

The overall approach looks good though.  I just think we need to work
through the details a bit more.

Thanks,
Richard
