on 2020/5/26 下午5:44, Richard Biener wrote:
> On Tue, 26 May 2020, Kewen.Lin wrote:
> 
>> Hi Richi,
>>
>> on 2020/5/26 下午3:12, Richard Biener wrote:
>>> On Tue, 26 May 2020, Kewen.Lin wrote:
>>>
>>>> Hi all,
>>>>
>>>> This patch set adds support for vector load/store with length, Power 
>>>> ISA 3.0 brings instructions lxvl/stxvl to perform vector load/store with
>>>> length, it's good to be exploited for those cases we don't have enough
>>>> stuffs to fill in the whole vector like epilogues.
>>>>
>>>> This support mainly refers to the handlings for fully-predicated loop
>>>> but it also covers the epilogue usage.  Now it supports two modes
>>>> controlled by parameter vect-with-length-scope, it can support any
>>>> loops fully with length or just for those cases with small iteration
>>>> counts less than VF like epilogue, for now I don't have ready env to
>>>> benchmark it, but based on the current inefficient length generation,
>>>> I don't think it's a good idea to adopt vector with length for any loops.
>>>> For the main loop which used to be vectorized, it increases register
>>>> pressure and introduces extra computation for length, the pro for icache
>>>> seems not comparable.  But I think it might be a good idea to keep this
>>>> parameter there for functionality testing, further benchmarking and other
>>>> ports' potential future supports.
>>>
>>> Can you explain in more detail what "vector load/store with length" does?
>>> Is that a "simplified" masked load/store which instead of masking 
>>> arbitrary elements (and need a mask computed in the first place), masks
>>> elements > N (the length operand)?  Thus assuming a lane IV decrementing
>>> to zero that IV would be the natural argument for the length operand?
>>> If that's correct, what data are the remaining lanes filled with?
>>>
>>
>> The vector load/store have one GPR holding one length in bytes (called as
>> n here) to control how many bytes we want to load/store.  If n > vector_size
>> (on Power it's 16), n will be taken as 16, if n is zero, the storage access
>> won't happen, the result for load is vector zero, if n > 0 but < vector_size,
>> the remaining lanes will be filled with zero.  On Power, it checks 0:7 bits
>> of the length GPR, so the length should be adjusted.
>>
>> Your understanding is correct!  It's a "simplified" masked load/store, need
>> the length in bytes computed, only support continuous access.  For the lane
>> IV, the one should multiply with the lane bytes and be adjusted if needed.
>>
>>> From a look at the series description below you seem to add a new way
>>> of doing loads for this.  Did you review other ISAs (those I'm not
>>> familiar with myself too much are SVE, RISC-V and GCN) in GCC whether
>>> they have similar support and whether your approach can be supported
>>> there?  ISTR SVE must have some similar support - what's the reason
>>> you do not piggy-back on that?
>>
>> I may miss something, but I didn't find there is any support that meet with
>> this in GCC.  Good suggestion on ISAs, I didn't review other ISAs, for the
>> current implementation, I heavily referred to existing SVE fully predicated
>> loop support, it's stronger than "with length", it can deal with arbitrary
>> elements, only perform for the loop fully etc.
>>
>>>
>>> I think a load like I described above might be represented as
>>>
>>> _1 = __VIEW_CONVERT <v4df_t> (__MEM <double[n_2]> ((double *)p_3));
>>>
>>> not sure if that actually works out though.  But given it seems it
>>> is a contiguous load we shouldn't need an internal function here?
>>> [there's a possible size mismatch in the __VIEW_CONVERT above, I guess
>>> on RTL you end up with a paradoxical subreg - or an UNSPEC]
>>
>> IIUC, what you suggested is to avoid use IFN here.  May I know the reason?
>> is it due to the maintenance efforts on various IFNs?  I thought using
>> IFN is gentle.  And agreed, I had the concern that the size mismatch
>> problem here since the length can be large (extremely large probably, it
>> depends on target saturated limit), the converted vector size can be small.
>> Besides, the length can be a variable.
>>
>>>
>>> That said, I'm not very happy seeing yet another way of doing loads
>>> [for fully predicated loops].  I'd rather like to see a single
>>> representation on GIMPLE at least.
>>
>> OK.  Does it mean IFN isn't counted into this scope?  :)
> 
> Sure going with a new IFN is a possibility.  I'd just avoid that if
> possible ;)  How does SVE represent loads/stores for whilelt loops?
> Would it not be possible to share that representation somehow?
> 
> Richard.
> 

Got it.  :) Currently SVE uses IFNs .MASK_LOAD and .MASK_STORE for 
whileult loops:

  vect__1.5_14 = .MASK_LOAD (_11, 4B, loop_mask_9);
  ...
  .MASK_STORE (_1, 4B, loop_mask_9, vect__3.9_18);
  ...
  next_mask_26 = .WHILE_ULT (_2, 127, { 0, ... });
  if (next_mask_26 != { 0, ... })

I thought to share it once, but didn't feel it's a good idea since it's
like we take something as masks that isn't actually masks, easily confuse
people.  OTOH, Power or other potential targets probably supports these
kinds of masking instructions, easily to confuse people further.

But I definitely agree that we can refactor some codes to share with
each other when possible.

Btw, here the code with proposed IFNs looks like:

  vect__1.5_4 = .LEN_LOAD (_15, 4B, loop_len_5);
  ...
  .LEN_STORE (_1, 4B, loop_len_5, vect__3.9_18);
  ivtmp_23 = ivtmp_22 + 16;
  _25 = MIN_EXPR <ivtmp_23, 508>;
  _26 = 508 - _25;
  _27 = MIN_EXPR <_26, 16>;
  if (ivtmp_23 <= 507)


BR,
Kewen

>>>
>>> Will dig into the patch once the actual workings of those load/store with
>>> length is confirmed.
>>
>> Thanks a lot for your time!
>>
>>>
>>> I don't spot tree-vect-slp.c being changed - maybe that's not necessary
>>> for SLP operation, but please do not introduce new vectorizer features
>>> without supporting SLP operation at this point.
>>
>> Got it.  SLP is also one part to be supported, I forgot to state in the
>> original email.  I'll let it be for now.  :)
>>
>> BR,
>> Kewen
>>
> 

Reply via email to