On Wed, Sep 20, 2017 at 9:17 PM, Bill Schmidt <wschm...@linux.vnet.ibm.com> wrote:
> On Sep 20, 2017, at 2:14 PM, Bill Schmidt <wschm...@linux.vnet.ibm.com> wrote:
>>
>> On Sep 20, 2017, at 3:49 AM, Richard Sandiford
>> <richard.sandif...@linaro.org> wrote:
>>>
>>> But I think this shows up another problem.  In the vectorised loop,
>>> we have 1 copy of the load and 4 copies of the ABS (after unpacking).
>>> But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4
>>> even for the loads.  So I think it's a double whammy: we're quadrupling
>>> the real cost via the excessive ncopies_for_cost, and we're using the
>>> wrong access type too.
>>>
>>> The patch below should fix the second problem but isn't much use without
>>> the first.  I think we need to adjust ncopies_for_cost in the recursive
>>> calls as well.
>>
>> Unfortunately we have to deal with the fact that ncopies_for_cost is set
>> once for the whole SLP instance, which is not ideal since we know the
>> number of loads and stores doesn't follow the same rules.
>>
>> What about the following?  I *think* that group_size / nunits is
>> sufficient for VMAT_STRIDED_SLP and VMAT_CONTIGUOUS, for which permuted
>> loads are not possible (already handled differently), but I could well
>> be missing something.
>
> Uh, I need to do the obvious rounding of (group_size + nunits - 1) / nunits...
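[For context, the rounding Bill means is a plain ceiling division; a
minimal standalone sketch, using a hypothetical helper name rather than
an actual GCC function:

  /* Number of vector statements needed to cover GROUP_SIZE scalar
     accesses with vectors of NUNITS lanes, rounding up so that a
     partial final vector is still counted.  Hypothetical helper,
     for illustration only.  */
  static int
  group_ncopies (int group_size, int nunits)
  {
    return (group_size + nunits - 1) / nunits;
  }

E.g. group_size = 6 with nunits = 4 needs 2 vector loads, where the
plain group_size / nunits would undercount with 1.]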
I don't think any such simple approach will result in appropriate costing.
The current costing is conservative and you definitely don't want to be
overly optimistic here ;)  You'd have to fully emulate the
vectorizable_load code.  Which means we should probably refactor the
code-gen part

  if (memory_access_type == VMAT_ELEMENTWISE
      || memory_access_type == VMAT_STRIDED_SLP)
    {
...

where it computes ltype, nloads, etc.: do that computation up-front in
the analysis phase, storing the result somewhere in stmt_vinfo (err, or
slp_node, or both -- think of hybrid SLP and/or the stmt being used in
different SLP contexts), and use the info during transform.  And hope we
have enough info to statically compute the number of loads and their
size/alignment at costing time.

Or go the full way of restructuring costing to have a
prologue/body_cost_vec in stmt_info and slp_node so that we can compute
and store the cost from vectorizable_* rather than duplicating the hard
parts in the SLP costing code.

Richard.

> Bill
>>
>> Index: gcc/tree-vect-slp.c
>> ===================================================================
>> --- gcc/tree-vect-slp.c (revision 252760)
>> +++ gcc/tree-vect-slp.c (working copy)
>> @@ -1662,12 +1662,14 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>        stmt_info = vinfo_for_stmt (stmt);
>>        if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
>>          {
>> +          int group_size = GROUP_SIZE (stmt_info);
>> +          int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>>            vect_memory_access_type memory_access_type
>> -            = (STMT_VINFO_STRIDED_P (stmt_info)
>> +            = (STMT_VINFO_STRIDED_P (stmt_info) && group_size > nunits
>>                ? VMAT_STRIDED_SLP
>>                : VMAT_CONTIGUOUS);
>>            if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
>> -            vect_model_store_cost (stmt_info, ncopies_for_cost,
>> +            vect_model_store_cost (stmt_info, group_size / nunits,
>>                                     memory_access_type, vect_uninitialized_def,
>>                                     node, prologue_cost_vec, body_cost_vec);
>>            else
>> @@ -1714,6 +1716,8 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>                          + nunits - 1) / nunits);
>>                ncopies_for_cost *= SLP_INSTANCE_UNROLLING_FACTOR (instance);
>>              }
>> +          else
>> +            ncopies_for_cost = group_size / nunits;
>>            /* Record the cost for the vector loads.  */
>>            vect_model_load_cost (stmt_info, ncopies_for_cost,
>>                                  memory_access_type, node, prologue_cost_vec,
>>
>> Bill
>>
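[To make the first suggestion above concrete: a minimal sketch of
caching the strided-load shape at analysis time so that the costing and
transform phases agree.  The struct and field names are hypothetical,
not GCC's, and the sketch assumes GCC's internal tree type:

  /* Hypothetical cache, filled in once during the analysis phase and
     attached to the stmt_vinfo and/or the slp_node.  */
  struct strided_access_info
  {
    tree ltype;   /* type of each load code-gen will emit  */
    int nloads;   /* number of such loads per group        */
    int lnel;     /* scalar elements covered by each load  */
  };

Costing would then charge nloads loads of ltype, and the transform phase
would reuse the same values instead of recomputing ltype and nloads
inside vectorizable_load.]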