"Kewen.Lin" <li...@linux.ibm.com> writes:
> on 2020/7/1 3:53 AM, Richard Sandiford wrote:
>> "Kewen.Lin" <li...@linux.ibm.com> writes:
>>>    poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>>> +  tree length_limit = NULL_TREE;
>>> +  /* For length, we need length_limit to check length in range.  */
>>> +  if (!vect_for_masking)
>>> +    {
>>> +      poly_uint64 len_limit = nscalars_per_ctrl * rgc->factor;
>>> +      length_limit = build_int_cst (compare_type, len_limit);
>>> +    }
>>>  
>>>    /* Calculate the maximum number of scalar values that the rgroup
>>>       handles in total, the number that it handles for each iteration
>>> @@ -434,12 +445,12 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>>    tree nscalars_total = niters;
>>>    tree nscalars_step = build_int_cst (iv_type, vf);
>>>    tree nscalars_skip = niters_skip;
>>> -  if (nscalars_per_iter != 1)
>>> +  if (nscalars_per_iter_ft != 1)
>>>      {
>>>        /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
>>>      these multiplications don't overflow.  */
>>> -      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
>>> -      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
>>> +      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter_ft);
>>> +      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter_ft);
>>>        nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
>>>                                  nscalars_total, compare_factor);
>>>        nscalars_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
>>> @@ -509,7 +520,7 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>>          NSCALARS_SKIP to that cannot overflow.  */
>>>       tree const_limit = build_int_cst (compare_type,
>>>                                         LOOP_VINFO_VECT_FACTOR (loop_vinfo)
>>> -                                       * nscalars_per_iter);
>>> +                                       * nscalars_per_iter_ft);
>>>       first_limit = gimple_build (preheader_seq, MIN_EXPR, compare_type,
>>>                                   nscalars_total, const_limit);
>>>       first_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,
>> 
>> It looks odd that we don't need to adjust the other nscalars_* values too.
>> E.g. the above seems to be comparing an unscaled nscalars_total with
>> a scaled nscalars_per_iter.  I think the units ought to “agree”,
>> both here and in the rest of the function.
>> 
>
> Sorry, I didn't quite follow this comment.  Both nscalars_total and
> nscalars_step are scaled here.  The only remaining related nscalars_* seems
> to be nscalars_skip, but length can't support skip.

Hmm, OK.  But in that case can you update the names of the variables
to match?  It's confusing to have some nscalars_* variables actually
count scalars (and thus have “nitems” equivalents) and other nscalars_*
variables count something else (and thus effectively be nitems_* variables
themselves).
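
E.g. (just to illustrate the naming I have in mind, not a complete change;
an "item" being a scalar for masks and e.g. a byte for lengths):

  tree nitems_total = niters;
  tree nitems_step = build_int_cst (iv_type, vf);
  tree nitems_skip = niters_skip;
  if (nitems_per_iter != 1)
    ...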

>
>>>     }
>>>  
>>> +      /* First iteration is full.  */
>> 
>> This comment belongs inside the “if”.
>> 
>
> Sorry, I might be missing something, but isn't this applied to both?

I meant it should be…

>
>>>        if (!init_ctrl)
>>> -   /* First iteration is full.  */
>>> -   init_ctrl = build_minus_one_cst (ctrl_type);
>>> +   {
>>> +     if (vect_for_masking)
>>> +       init_ctrl = build_minus_one_cst (ctrl_type);
>>> +     else
>>> +       init_ctrl = length_limit;
>>> +   }

  if (!init_ctrl)
    {
      /* First iteration is full.  */
      if (vect_for_masking)
        init_ctrl = build_minus_one_cst (ctrl_type);
      else
        init_ctrl = length_limit;
    }

since the comment only applies to the “!init_ctrl” case.  The point
of a nonnull init_ctrl is to create cases in which the first vector
is not a full vector.

>>>  
>>> […]
>>> @@ -2568,7 +2608,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>>>    if (vect_epilogues
>>>        && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>>>        && prolog_peeling >= 0
>>> -      && known_eq (vf, lowest_vf))
>>> +      && known_eq (vf, lowest_vf)
>>> +      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (epilogue_vinfo))
>>>      {
>>>        unsigned HOST_WIDE_INT eiters
>>>     = (LOOP_VINFO_INT_NITERS (loop_vinfo)
>> 
>> I'm still not really convinced that this check is right.  It feels
>> like it's hiding a problem elsewhere.
>> 
>
> The comment above this hunk is:
>
>   /* If we know the number of scalar iterations for the main loop we should
>      check whether after the main loop there are enough iterations left over
>      for the epilogue.  */
>
> So it's to check whether the ones in loop_vinfo->epilogue_vinfos can be
> removed.  And the main work in the loop is to remove epil_info from
> epilogue_vinfos.

Oops, I think I misread it as checking loop_vinfo rather than
epilogue_vinfo.  It makes more sense now. :-)

>>> +
>>> +  /* Work out how many bits we need to represent the length limit.  */
>>> +  unsigned int nscalars_per_iter_ft = rgl->max_nscalars_per_iter * rgl->factor;
>> 
>> I think this breaks the abstraction.  There's no guarantee that the
>> factor is the same for each rgroup_control, so there's no guarantee
>> that the maximum bytes per iter comes the last entry.  (Also, it'd
>> be better to avoid talking about bytes if we're trying to be general.)
>> I think we should take the maximum of each entry instead.
>> 
>
> Agree!  I guess the above "maximum bytes per iter" is a typo and you meant
> "maximum elements per iter"?  Yes, the code is for length in bytes, so
> checking the last entry is only reasonable for that.  Will update it to
> check all entries instead.

I meant bytes, since that's what the code is effectively calculating
(at least for Power).  I.e. I think this breaks the abstraction even
if we assume the Power scheme of measuring length, since in principle
it's possible to mix different vector sizes in the same vector region.
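
FWIW, the kind of "maximum of each entry" calculation I had in mind is
something like this (untested sketch, names purely illustrative, assuming
LOOP_VINFO_LENS is the vector of rgroup_controls that the patch adds,
along the lines of vect_get_max_nscalars_per_iter):

  unsigned int max_nitems_per_iter = 1;
  unsigned int i;
  rgroup_controls *rgl;
  FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), i, rgl)
    max_nitems_per_iter = MAX (max_nitems_per_iter,
                               rgl->max_nscalars_per_iter * rgl->factor);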

>>> +     we prefer to still use the niters type.  */
>>> +  unsigned int ni_prec
>>> +    = TYPE_PRECISION (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)));
>>> +  /* Prefer to use Pmode and wider IV to avoid narrow conversions.  */
>>> +  unsigned int pmode_prec = GET_MODE_BITSIZE (Pmode);
>>> +
>>> +  unsigned int required_prec = ni_prec;
>>> +  if (required_prec < pmode_prec)
>>> +    required_prec = pmode_prec;
>>> +
>>> +  tree iv_type = NULL_TREE;
>>> +  if (min_ni_prec > required_prec)
>>> +    {
>> 
>> Do we need this condition?  Looks like we could just do:
>> 
>>   min_ni_prec = MAX (min_ni_prec, GET_MODE_BITSIZE (Pmode));
>>   min_ni_prec = MAX (min_ni_prec, ni_prec);
>> 
>> and then run the loop below.
>> 
>
> I think the assumption holds that Pmode and the niters type are standard
> integral types?  If so, neither of them needs the loop below to build the
> integer type, but min_ni_prec does.  Does it make sense to differentiate them?

IMO we should handle them the same way, i.e. always use the loop.
For example, Pmode can be a partial integer mode on some targets,
so it isn't guaranteed to give a nice power-of-2 integer type.

Maybe having a special case would be worth it if this was performance-
critical code, but since it isn't, having all cases go through the same
path seems better.  It also means that the loop will get more testing
coverage.
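
I.e. something like this (untested, and assuming the loop keeps roughly
its current shape of searching for the first suitable integer mode):

  min_ni_prec = MAX (min_ni_prec, GET_MODE_BITSIZE (Pmode));
  min_ni_prec = MAX (min_ni_prec, ni_prec);

  tree iv_type = NULL_TREE;
  opt_scalar_int_mode tmode_iter;
  FOR_EACH_MODE_IN_CLASS (tmode_iter, MODE_INT)
    {
      scalar_mode tmode = tmode_iter.require ();
      unsigned int tbits = GET_MODE_BITSIZE (tmode);
      /* Pick the first available integer type that is wide enough.  */
      if (tbits >= min_ni_prec && targetm.scalar_mode_supported_p (tmode))
        {
          iv_type = build_nonstandard_integer_type (tbits, true);
          break;
        }
    }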

>>> +      /* Decide whether to use fully-masked approach.  */
>>> +      if (vect_verify_full_masking (loop_vinfo))
>>> +   LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
>>> +      /* Decide whether to use length-based approach.  */
>>> +      else if (vect_verify_loop_lens (loop_vinfo))
>>> +   {
>>> +     if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>>> +         || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
>>> +       {
>>> +         if (dump_enabled_p ())
>>> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>> +                            "can't vectorize this loop with length-based"
>>> +                            " partial vectors approach because peeling"
>>> +                            " for alignment or gaps is required.\n");
>>> +         LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>> +       }
>> 
>> Why are these peeling cases necessary?  Peeling for gaps should
>> just mean subtracting one scalar iteration from the iteration count
>> and shouldn't otherwise affect the main loop.  Similarly, peeling for
>> alignment can be handled in the normal way, with a scalar prologue loop.
>> 
>
> I was thinking of relaxing this later, to avoid handling too many cases in
> the first enablement patch.  Since the Power hardware level that supports
> vectors with length also supports unaligned load/store, I'd need to
> construct some test cases for them.  May I postpone it a bit?  Or would you
> prefer me to support it here?

I've no objection to postponing it if there are specific known
problems that make it difficult, but I think we should at least
say what they are.  On the face of it, I'm not sure why it doesn't
Just Work, since the way that we control the main loop should be
mostly orthogonal to how we handle peeled prologue iterations
and how we handle a single peeled epilogue iteration.

>>> @@ -9850,11 +9986,30 @@ vectorizable_condition (vec_info *vinfo,
>>>       return false;
>>>     }
>>>  
>>> -      if (loop_vinfo
>>> -     && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>>> -     && reduction_type == EXTRACT_LAST_REDUCTION)
>>> -   vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>>> -                          ncopies * vec_num, vectype, NULL);
>>> +      if (loop_vinfo && for_reduction
>>> +     && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
>>> +   {
>>> +     if (reduction_type == EXTRACT_LAST_REDUCTION)
>>> +       vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>>> +                              ncopies * vec_num, vectype, NULL);
>>> +     /* Using partial vectors can introduce inactive lanes in the last
>>> +        iteration, since full vector of condition results are operated,
>>> +        it's unsafe here.  But if we can AND the condition mask with
>>> +        loop mask, it would be safe then.  */
>>> +     else if (!loop_vinfo->scalar_cond_masked_set.is_empty ())
>>> +       {
>>> +         scalar_cond_masked_key cond (cond_expr, ncopies * vec_num);
>>> +         if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>>> +           {
>>> +             bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
>>> +             cond.code = invert_tree_comparison (cond.code, honor_nans);
>>> +             if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>>> +               LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>> +           }
>>> +       }
>>> +     else
>>> +       LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>> +   }
>>>  
>>>        STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
>>>        vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,
>> 
>> I don't understand this part.
>
> This is for the regression case on aarch64:
>
> PASS->FAIL: gcc.target/aarch64/sve/reduc_8.c -march=armv8.2-a+sve  scan-assembler-not \\tcmpeq\\tp[0-9]+\\.s,

OK, if this is an SVE thing, it should really be a separate patch.
(And thanks for testing SVE.)

> As you mentioned before, we would expect to record masks for partial-vector
> reduction, otherwise the inactive lanes would possibly be unsafe.  For this
> failed case, the reduction_type is TREE_CODE_REDUCTION, so we won't record a
> loop mask.  But it's still safe since the mask is further ANDed with some
> loop mask.  The difference looks like:
>
> Without mask AND loop mask optimization:
>
>   loop_mask =...
>   v1 = .MASK_LOAD (a, loop_mask)
>   mask1 = v1 == {cst, ...}                // unsafe since it's generated from full width.
>   mask2 = loop_mask & mask1               // safe, since it's ANDed with the loop mask?
>   v2 = .MASK_LOAD (b, mask2)
>   vres = VEC_COND_EXPR <mask1, vres, v2>  // unsafe because of mask1
>
> With mask AND loop mask optimization:
>
>   loop_mask =...
>   v1 = .MASK_LOAD (a, loop_mask)
>   mask1 = v1 == {cst, ...}
>   mask2 = loop_mask & mask1       
>   v2 = .MASK_LOAD (b, mask2)
>   vres = VEC_COND_EXPR <mask2, vres, v2>  // safe because of mask2?
>
>
> ANDing with the loop mask can make the unsafe inactive lanes safe.  So the
> fix here is to further check whether this optimisation is possible; if it
> is, we know it's safe.  Does it make sense?

But in this particular test, we're doing outer loop vectorisation,
and the only elements of vres that matter are the ones selected
by loop_mask (since those are the only ones that get stored out).
So applying the loop mask to the VEC_COND_EXPR is “just” an
(important) optimisation, rather than a correctness issue.
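
E.g., sketching the outer-loop situation in your notation (the masked
store at the end is hypothetical, just to show which lanes matter):

  loop_mask = ...
  v1 = .MASK_LOAD (a, loop_mask)
  mask1 = v1 == {cst, ...}
  mask2 = loop_mask & mask1
  v2 = .MASK_LOAD (b, mask2)
  vres = VEC_COND_EXPR <mask1, vres, v2>  // lanes outside loop_mask are junk...
  .MASK_STORE (c, loop_mask, vres)        // ...but only loop_mask lanes are stored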

What's causing the test to start failing with the patch?  I realise
you've probably already said, sorry, but it's been a large patch series
so it's hard to keep all the details committed to memory.

>>> @@ -11910,3 +12065,36 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>>>    *nunits_vectype_out = nunits_vectype;
>>>    return opt_result::success ();
>>>  }
>>> +
>>> +/* Generate and return statement sequence that sets vector length LEN that is:
>>> +
>>> +   min_of_start_and_end = min (START_INDEX, END_INDEX);
>>> +   left_len = END_INDEX - min_of_start_and_end;
>>> +   rhs = min (left_len, LEN_LIMIT);
>>> +   LEN = rhs;
>>> +
>>> +   TODO: for now, rs6000 supported vector with length only cares 8-bits, which
>>> +   means if we have left_len in bytes larger than 255, it can't be saturated to
>>> +   vector limit (vector size).  One target hook can be provided if other ports
>>> +   don't suffer this.
>>> +*/
>> 
>> Should be no line break before the */
>> 
>> Personally I think it'd be better to drop the TODO.  This isn't the only
>> place that would need to change if we allowed out-of-range lengths,
>> whereas the comment might give the impression that it is.
>> 
>
> Sorry, I might be missing something, but all undetermined lengths are
> generated here; are the other places you meant documentation or somewhere else?

For example, we'd need to start querying the length operand of the optabs
to see what length precision the target uses, since it would be invalid
to do this optimisation for IVs that are wider than that precision.
The routine above doesn't seem the right place to do that.
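
E.g. (purely hypothetical sketch -- the optab name and operand number are
just for illustration) the kind of query I mean is:

  /* See how wide the length operand of the target's len_load pattern is,
     and only do the MIN-based saturation for IVs that fit in it.  */
  insn_code icode = direct_optab_handler (len_load_optab, TYPE_MODE (vectype));
  if (icode != CODE_FOR_nothing)
    {
      machine_mode len_mode = insn_data[(int) icode].operand[2].mode;
      ...
    }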

It could also affect the semantics of the IFNs, if we ever added
folding rules for them.  So yeah, it boils down to this not being
a local decision for this routine -- it's tied to the optab and
IFN behaviour too.

Thanks,
Richard
