On Thu, Sep 21, 2017 at 3:15 PM, Bill Schmidt <wschm...@linux.vnet.ibm.com> wrote:
> On Sep 21, 2017, at 2:37 AM, Richard Biener <richard.guent...@gmail.com> wrote:
>>
>> On Wed, Sep 20, 2017 at 9:17 PM, Bill Schmidt
>> <wschm...@linux.vnet.ibm.com> wrote:
>>> On Sep 20, 2017, at 2:14 PM, Bill Schmidt <wschm...@linux.vnet.ibm.com> wrote:
>>>>
>>>> On Sep 20, 2017, at 3:49 AM, Richard Sandiford
>>>> <richard.sandif...@linaro.org> wrote:
>>>>>
>>>>> But I think this shows up another problem. In the vectorised loop,
>>>>> we have 1 copy of the load and 4 copies of the ABS (after unpacking).
>>>>> But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4
>>>>> even for the loads. So I think it's a double whammy: we're quadrupling
>>>>> the real cost via the excessive ncopies_for_cost, and we're using the
>>>>> wrong access type too.
>>>>>
>>>>> The patch below should fix the second problem but isn't much use without
>>>>> the first. I think we need to adjust ncopies_for_cost in the recursive
>>>>> calls as well.
>>>>
>>>> Unfortunately we have to deal with the fact that ncopies_for_cost is set
>>>> once for the whole SLP instance, which is not ideal since we know the
>>>> number of loads and stores doesn't follow the same rules.
>>>>
>>>> What about the following? I *think* that group_size / nunits is sufficient
>>>> for VMAT_STRIDED_SLP and VMAT_CONTIGUOUS, for which permuted loads are
>>>> not possible (already handled differently), but I could well be missing
>>>> something.
>>>
>>> Uh, I need to do the obvious rounding of (group_size + nunits - 1) / nunits...
>>
>> I don't think any such simple approach will result in appropriate costing.
>> The current costing is conservative and you definitely don't want to be
>> overly optimistic here ;)
>>
>> You'd have to fully emulate the vectorizable_load code. Which means we
>> should probably refactor the code-gen part
>>
>>   if (memory_access_type == VMAT_ELEMENTWISE
>>       || memory_access_type == VMAT_STRIDED_SLP)
>>     {
>>       ...
>>
>> where it computes ltype, nloads, etc., do that computation up-front in the
>> analysis phase, storing the result somewhere in stmt_vinfo (err, or slp_node
>> or both -- think of hybrid SLP and/or the stmt used in different SLP
>> contexts), and use the info during transform. And hope to have enough info
>> to statically compute the number of loads and their size/alignment at
>> costing time.
>
> I suppose so. I was thinking that if these are truly strided loads, then the
> cost is predictable -- but only for targets that have hardware strided loads
> available, which are few, so this isn't conservative enough. The code in
> question didn't deal with VMAT_ELEMENTWISE, so the factoring is even
> more complicated -- the structures of the costing logic and the code gen
> logic don't match well at all.
>
> I'm unlikely to have time for a complex solution right now, so I'll stick a
> pointer to this conversation in the bugzilla and take my name off in case
> somebody gets to it sooner. But I'll probably keep poking at it in spare
> moments.
I hope to get to some costmodel cleanups for GCC 8.

Richard.

> Thanks,
> Bill
>>
>> Or go the full way of restructuring costing to have a prologue/body_cost_vec
>> in stmt_info and slp_node so that we can compute and store the cost
>> from vectorizable_* rather than duplicating the hard parts in SLP costing
>> code.
>>
>> Richard.
>>
>>> Bill
>>>>
>>>> Index: gcc/tree-vect-slp.c
>>>> ===================================================================
>>>> --- gcc/tree-vect-slp.c (revision 252760)
>>>> +++ gcc/tree-vect-slp.c (working copy)
>>>> @@ -1662,12 +1662,14 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>>>        stmt_info = vinfo_for_stmt (stmt);
>>>>        if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
>>>>          {
>>>> +          int group_size = GROUP_SIZE (stmt_info);
>>>> +          int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>>>>            vect_memory_access_type memory_access_type
>>>> -            = (STMT_VINFO_STRIDED_P (stmt_info)
>>>> +            = (STMT_VINFO_STRIDED_P (stmt_info) && group_size > nunits
>>>>                ? VMAT_STRIDED_SLP
>>>>                : VMAT_CONTIGUOUS);
>>>>            if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
>>>> -            vect_model_store_cost (stmt_info, ncopies_for_cost,
>>>> +            vect_model_store_cost (stmt_info, group_size / nunits,
>>>>                                     memory_access_type, vect_uninitialized_def,
>>>>                                     node, prologue_cost_vec, body_cost_vec);
>>>>            else
>>>> @@ -1714,6 +1716,8 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>>>                                      + nunits - 1) / nunits);
>>>>                ncopies_for_cost *= SLP_INSTANCE_UNROLLING_FACTOR (instance);
>>>>              }
>>>> +          else
>>>> +            ncopies_for_cost = group_size / nunits;
>>>>            /* Record the cost for the vector loads.  */
>>>>            vect_model_load_cost (stmt_info, ncopies_for_cost,
>>>>                                  memory_access_type, node, prologue_cost_vec,
>>>>
>>>> Bill
>>>>
>>>
>>
>
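
For readers following the arithmetic in the thread: the rounding Bill corrects himself on is plain ceiling division, i.e. the number of vectors with nunits lanes needed to cover group_size scalar elements. A minimal, self-contained sketch in C (the helper name here is made up for illustration and is not a GCC function):

#include <stdio.h>

/* Number of vector accesses of NUNITS lanes needed to cover GROUP_SIZE
   scalar elements: ceiling division, not truncation.  Illustrative
   helper only, not part of GCC.  */
static int
vector_copies_for_group (int group_size, int nunits)
{
  return (group_size + nunits - 1) / nunits;
}

int
main (void)
{
  /* Truncating division would claim 6 / 4 == 1 copy; two vector
     accesses are actually needed.  */
  printf ("6 elements, 4 lanes: %d copies\n", vector_copies_for_group (6, 4));
  printf ("8 elements, 4 lanes: %d copies\n", vector_copies_for_group (8, 4));
  return 0;
}

Truncation under-counts exactly when the group does not fill its last vector, which is the kind of overly optimistic costing Richard warns against.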
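
Richard's suggestion to compute nloads, ltype, and friends once during analysis and reuse them at transform time can be pictured with a small standalone sketch. Everything below is hypothetical and heavily simplified; the struct, the function, and the decomposition rule are illustrative stand-ins, not the actual vectorizable_load logic:

#include <stdio.h>

/* Hypothetical record of a load decomposition, decided once at analysis
   time and consulted by both costing and code generation.  */
struct load_decomposition
{
  int nloads;         /* number of loads emitted for the group  */
  int elts_per_load;  /* scalar elements covered by each load   */
};

/* "Analysis phase" (grossly simplified): if the whole group fits in one
   vector, use a single contiguous load; otherwise fall back to one load
   per scalar element, the elementwise/strided case discussed above.  */
static struct load_decomposition
analyze_group_loads (int group_size, int nunits)
{
  struct load_decomposition d;
  if (group_size <= nunits)
    {
      d.nloads = 1;
      d.elts_per_load = group_size;
    }
  else
    {
      d.nloads = group_size;
      d.elts_per_load = 1;
    }
  return d;
}

int
main (void)
{
  /* Costing and transform would both read the same record, so the cost
     charged matches the statements that are eventually emitted.  */
  struct load_decomposition d = analyze_group_loads (3, 4);
  printf ("group of 3, 4 lanes: nloads = %d, elts_per_load = %d\n",
          d.nloads, d.elts_per_load);
  d = analyze_group_loads (6, 4);
  printf ("group of 6, 4 lanes: nloads = %d, elts_per_load = %d\n",
          d.nloads, d.elts_per_load);
  return 0;
}

The design point is simply that the decision is made in one place; whether the record should live in stmt_vinfo, the slp_node, or both is exactly the hybrid-SLP question raised in the thread.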