On Wed, Sep 20, 2017 at 9:17 PM, Bill Schmidt
<wschm...@linux.vnet.ibm.com> wrote:
> On Sep 20, 2017, at 2:14 PM, Bill Schmidt <wschm...@linux.vnet.ibm.com> wrote:
>>
>> On Sep 20, 2017, at 3:49 AM, Richard Sandiford 
>> <richard.sandif...@linaro.org> wrote:
>>>
>>> But I think this shows up another problem.  In the vectorised loop,
>>> we have 1 copy of the load and 4 copies of the ABS (after unpacking).
>>> But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4
>>> even for the loads.  So I think it's a double whammy: we're quadrupling
>>> the real cost via the excessive ncopies_for_cost, and we're using the
>>> wrong access type too.
>>>
>>> The patch below should fix the second problem but isn't much use without
>>> the first.  I think we need to adjust ncopies_for_cost in the recursive
>>> calls as well.
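
[For concreteness, an assumed reproducer -- my illustration, not taken
from the thread.  With 128-bit vectors the loop below would need one
V16QI load but four V4SI ABS statements after unpacking, so costing the
load with ncopies_for_cost == 4 quadruples its real cost:]

  /* Assumed example: abs over chars, computed in an int intermediate.  */
  signed char a[1024];

  void
  f (void)
  {
    for (int i = 0; i < 1024; i++)
      a[i] = a[i] < 0 ? -a[i] : a[i];
  }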
>>
>> Unfortunately we have to deal with the fact that ncopies_for_cost is set
>> once for the whole SLP instance, which is not ideal since we know the
>> number of loads and stores doesn't follow the same rules.
>>
>> What about the following?  I *think* that group_size / nunits is
>> sufficient for VMAT_STRIDED_SLP and VMAT_CONTIGUOUS, for which permuted
>> loads are not possible (already handled differently), but I could well
>> be missing something.
>
> Uh, I need to do the obvious rounding of (group_size + nunits - 1) / nunits...
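
[A minimal sketch of the rounding in question, using the same macros as
the patch below:]

  int group_size = GROUP_SIZE (stmt_info);
  int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
  /* Vector loads needed to cover one group, rounded up: e.g.
     group_size = 6, nunits = 4 gives (6 + 4 - 1) / 4 = 2, where the
     plain group_size / nunits would give 1.  */
  int ncopies_for_cost = (group_size + nunits - 1) / nunits;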

I don't think any such simple approach will result in appropriate costing.
The current costing is conservative, and you definitely don't want to be
overly optimistic here ;)

You'd have to fully emulate the vectorizable_load code, which means we
should probably refactor its code-gen part

  if (memory_access_type == VMAT_ELEMENTWISE
      || memory_access_type == VMAT_STRIDED_SLP)
    {
...

where it computes ltype, nloads, etc.: do that computation up-front in the
analysis phase, storing the result somewhere in stmt_vinfo (err, or in the
slp_node, or both -- think of hybrid SLP and/or the same stmt used in
different SLP contexts), and use that info during transform.  Then hope we
have enough information to statically compute the number of loads and
their size/alignment at costing time.
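
[A self-contained sketch of the suggested refactoring; the type and
function below are simplified, hypothetical stand-ins, not existing GCC
API:]

  /* Decided once during the analysis phase and recorded (in real code,
     in stmt_vinfo and/or the slp_node); transform and costing would
     then reuse it instead of recomputing.  */
  struct load_layout
  {
    int nloads;     /* number of loads covering one group       */
    int load_elts;  /* elements per load (stands in for ltype)  */
  };

  struct load_layout
  compute_load_layout (int group_size, int nunits)
  {
    struct load_layout l;
    if (group_size >= nunits)
      {
        /* Whole-vector loads, rounded up over the group.  */
        l.load_elts = nunits;
        l.nloads = (group_size + nunits - 1) / nunits;
      }
    else
      {
        /* Group smaller than one vector: a single sub-vector load.  */
        l.load_elts = group_size;
        l.nloads = 1;
      }
    return l;
  }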

Or go the full way of restructuring costing to have a prologue/body_cost_vec
in stmt_info and slp_node so that we can compute and store the cost
from vectorizable_* rather than duplicating the hard parts in SLP costing code.
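
[Rough shape of that alternative -- the struct below is a hypothetical
sketch, not an existing GCC type; vec and stmt_info_for_cost are the
existing vectorizer costing types:]

  /* Each stmt_info / slp_node would own its cost vectors, filled in by
     the vectorizable_* routines at analysis time; the SLP costing code
     then only sums them instead of re-deriving the copy counts.  */
  struct sketch_stmt_cost_store
  {
    vec<stmt_info_for_cost> prologue_cost_vec;  /* one-off setup cost  */
    vec<stmt_info_for_cost> body_cost_vec;      /* per-copy body cost  */
  };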

Richard.

> Bill
>>
>> Index: gcc/tree-vect-slp.c
>> ===================================================================
>> --- gcc/tree-vect-slp.c       (revision 252760)
>> +++ gcc/tree-vect-slp.c       (working copy)
>> @@ -1662,12 +1662,14 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>    stmt_info = vinfo_for_stmt (stmt);
>>    if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
>>      {
>> +      int group_size = GROUP_SIZE (stmt_info);
>> +      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>>        vect_memory_access_type memory_access_type
>> -        = (STMT_VINFO_STRIDED_P (stmt_info)
>> +        = (STMT_VINFO_STRIDED_P (stmt_info) && group_size > nunits
>>             ? VMAT_STRIDED_SLP
>>             : VMAT_CONTIGUOUS);
>>        if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
>> -        vect_model_store_cost (stmt_info, ncopies_for_cost,
>> +        vect_model_store_cost (stmt_info, group_size / nunits,
>>                                 memory_access_type, vect_uninitialized_def,
>>                                 node, prologue_cost_vec, body_cost_vec);
>>        else
>> @@ -1714,6 +1716,8 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>                                 + nunits - 1) / nunits);
>>            ncopies_for_cost *= SLP_INSTANCE_UNROLLING_FACTOR (instance);
>>          }
>> +      else
>> +        ncopies_for_cost = group_size / nunits;
>>        /* Record the cost for the vector loads.  */
>>        vect_model_load_cost (stmt_info, ncopies_for_cost,
>>                              memory_access_type, node, prologue_cost_vec,
>>
>> Bill
>>
>
