On Thu, Sep 21, 2017 at 3:15 PM, Bill Schmidt <wschm...@linux.vnet.ibm.com> wrote:
> On Sep 21, 2017, at 2:37 AM, Richard Biener <richard.guent...@gmail.com> wrote:
>>
>> On Wed, Sep 20, 2017 at 9:17 PM, Bill Schmidt
>> <wschm...@linux.vnet.ibm.com> wrote:
>>> On Sep 20, 2017, at 2:14 PM, Bill Schmidt <wschm...@linux.vnet.ibm.com> wrote:
>>>>
>>>> On Sep 20, 2017, at 3:49 AM, Richard Sandiford
>>>> <richard.sandif...@linaro.org> wrote:
>>>>>
>>>>> But I think this shows up another problem. In the vectorised loop,
>>>>> we have 1 copy of the load and 4 copies of the ABS (after unpacking).
>>>>> But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4
>>>>> even for the loads. So I think it's a double whammy: we're quadrupling
>>>>> the real cost via the excessive ncopies_for_cost, and we're using the
>>>>> wrong access type too.
>>>>>
>>>>> The patch below should fix the second problem but isn't much use without
>>>>> the first. I think we need to adjust ncopies_for_cost in the recursive
>>>>> calls as well.
>>>>
>>>> Unfortunately we have to deal with the fact that ncopies_for_cost is set
>>>> once for the whole SLP instance, which is not ideal since we know the
>>>> number of loads and stores doesn't follow the same rules.
>>>>
>>>> What about the following? I *think* that group_size / nunits is sufficient
>>>> for VMAT_STRIDED_SLP and VMAT_CONTIGUOUS, for which permuted loads are
>>>> not possible (already handled differently), but I could well be missing
>>>> something.
>>>
>>> Uh, I need to do the obvious rounding of (group_size + nunits - 1) / nunits...
>>
>> I don't think any such simple approach will result in appropriate costing.
>> The current costing is conservative and you definitely don't want to be
>> overly optimistic here ;)
>>
>> You'd have to fully emulate the vectorizable_load code. Which means we
>> should probably refactor the code-gen part
>>
>>   if (memory_access_type == VMAT_ELEMENTWISE
>>       || memory_access_type == VMAT_STRIDED_SLP)
>>     {
>>       ...
>>
>> where it computes ltype, nloads, etc., do that computation up-front in the
>> analysis phase, storing the result somewhere in stmt_vinfo (err, or slp_node
>> or both -- think of hybrid SLP and/or the stmt used in different SLP
>> contexts), and use the info during transform. And hope to have enough info
>> to statically compute the number of loads and their size/alignment at
>> costing time.
>
> I suppose so. I was thinking that if these are truly strided loads, then the
> cost is predictable -- but only for targets that have hardware strided loads
> available, which are few, so this isn't conservative enough. The code in
> question didn't deal with VMAT_ELEMENTWISE, so the factoring is even
> more complicated -- the structures of the costing logic and the code gen
> logic don't match well at all.
>
> I'm unlikely to have time for a complex solution right now, so I'll stick a
> pointer to this conversation in the bugzilla and take my name off in case
> somebody gets to it sooner. But I'll probably keep poking at it in spare
> moments.
I hope to get to some costmodel cleanups for GCC 8.

Richard.

> Thanks,
> Bill
>>
>> Or go the full way of restructuring costing to have a prologue/body_cost_vec
>> in stmt_info and slp_node so that we can compute and store the cost
>> from vectorizable_* rather than duplicating the hard parts in SLP costing
>> code.
>>
>> Richard.
>>
>>> Bill
>>>>
>>>> Index: gcc/tree-vect-slp.c
>>>> ===================================================================
>>>> --- gcc/tree-vect-slp.c (revision 252760)
>>>> +++ gcc/tree-vect-slp.c (working copy)
>>>> @@ -1662,12 +1662,14 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>>>        stmt_info = vinfo_for_stmt (stmt);
>>>>        if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
>>>>          {
>>>> +          int group_size = GROUP_SIZE (stmt_info);
>>>> +          int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>>>>            vect_memory_access_type memory_access_type
>>>> -            = (STMT_VINFO_STRIDED_P (stmt_info)
>>>> +            = (STMT_VINFO_STRIDED_P (stmt_info) && group_size > nunits
>>>>                ? VMAT_STRIDED_SLP
>>>>                : VMAT_CONTIGUOUS);
>>>>            if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
>>>> -            vect_model_store_cost (stmt_info, ncopies_for_cost,
>>>> +            vect_model_store_cost (stmt_info, group_size / nunits,
>>>>                                     memory_access_type, vect_uninitialized_def,
>>>>                                     node, prologue_cost_vec, body_cost_vec);
>>>>            else
>>>> @@ -1714,6 +1716,8 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>>>                                      + nunits - 1) / nunits);
>>>>                ncopies_for_cost *= SLP_INSTANCE_UNROLLING_FACTOR (instance);
>>>>              }
>>>> +          else
>>>> +            ncopies_for_cost = group_size / nunits;
>>>>            /* Record the cost for the vector loads.  */
>>>>            vect_model_load_cost (stmt_info, ncopies_for_cost,
>>>>                                  memory_access_type, node, prologue_cost_vec,
>>>>
>>>> Bill
>>>>
>>>
>>
>
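
For readers following the arithmetic in the thread: the rounding Bill corrects himself on is plain ceiling division, i.e. the number of vectors with nunits lanes needed to cover group_size scalar elements. A minimal, self-contained sketch in C (the helper name here is made up for illustration and is not a GCC function):

#include <stdio.h>

/* Number of vector accesses of NUNITS lanes needed to cover GROUP_SIZE
   scalar elements: ceiling division, not truncation.  Illustrative
   helper only, not part of GCC.  */
static int
vector_copies_for_group (int group_size, int nunits)
{
  return (group_size + nunits - 1) / nunits;
}

int
main (void)
{
  /* Truncating division would claim 6 / 4 == 1 copy; two vector
     accesses are actually needed.  */
  printf ("6 elements, 4 lanes: %d copies\n", vector_copies_for_group (6, 4));
  printf ("8 elements, 4 lanes: %d copies\n", vector_copies_for_group (8, 4));
  return 0;
}

Truncation under-counts exactly when the group does not fill its last vector, which is the kind of overly optimistic costing Richard warns against.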
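
Richard's suggestion to compute nloads, ltype, and friends once during analysis and reuse them at transform time can be pictured with a small standalone sketch. Everything below is hypothetical and heavily simplified; the struct, the function, and the decomposition rule are illustrative stand-ins, not the actual vectorizable_load logic:

#include <stdio.h>

/* Hypothetical record of a load decomposition, decided once at analysis
   time and consulted by both costing and code generation.  */
struct load_decomposition
{
  int nloads;         /* number of loads emitted for the group  */
  int elts_per_load;  /* scalar elements covered by each load   */
};

/* "Analysis phase" (grossly simplified): if the whole group fits in one
   vector, use a single contiguous load; otherwise fall back to one load
   per scalar element, the elementwise/strided case discussed above.  */
static struct load_decomposition
analyze_group_loads (int group_size, int nunits)
{
  struct load_decomposition d;
  if (group_size <= nunits)
    {
      d.nloads = 1;
      d.elts_per_load = group_size;
    }
  else
    {
      d.nloads = group_size;
      d.elts_per_load = 1;
    }
  return d;
}

int
main (void)
{
  /* Costing and transform would both read the same record, so the cost
     charged matches the statements that are eventually emitted.  */
  struct load_decomposition d = analyze_group_loads (3, 4);
  printf ("group of 3, 4 lanes: nloads = %d, elts_per_load = %d\n",
          d.nloads, d.elts_per_load);
  d = analyze_group_loads (6, 4);
  printf ("group of 6, 4 lanes: nloads = %d, elts_per_load = %d\n",
          d.nloads, d.elts_per_load);
  return 0;
}

The design point is simply that the decision is made in one place; whether the record should live in stmt_vinfo, the slp_node, or both is exactly the hybrid-SLP question raised in the thread.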