on 2023/8/14 22:16, Richard Sandiford wrote:
> "Kewen.Lin" <li...@linux.ibm.com> writes:
>> Hi Richard,
>>
>> on 2023/8/14 20:20, Richard Sandiford wrote:
>>> Thanks for the clean-ups. But...
>>>
>>> "Kewen.Lin" <li...@linux.ibm.com> writes:
>>>> Hi,
>>>>
>>>> Following Richi's suggestion [1], this patch is to move the
>>>> handlings on VMAT_GATHER_SCATTER in the final loop nest
>>>> of function vectorizable_load to its own loop. Basically
>>>> it duplicates the final loop nest, clean up some useless
>>>> set up code for the case of VMAT_GATHER_SCATTER, remove some
>>>> unreachable code. Also remove the corresponding handlings
>>>> in the final loop nest.
>>>>
>>>> Bootstrapped and regtested on x86_64-redhat-linux,
>>>> aarch64-linux-gnu and powerpc64{,le}-linux-gnu.
>>>>
>>>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/623329.html
>>>>
>>>> Is it ok for trunk?
>>>>
>>>> BR,
>>>> Kewen
>>>> -----
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>>     * tree-vect-stmts.cc (vectorizable_load): Move the handlings on
>>>>     VMAT_GATHER_SCATTER in the final loop nest to its own loop,
>>>>     and update the final nest accordingly.
>>>> ---
>>>>  gcc/tree-vect-stmts.cc | 361 +++++++++++++++++++++++++----------------
>>>>  1 file changed, 219 insertions(+), 142 deletions(-)
>>>
>>> ...that seems like quite a lot of +s. Is there nothing we can do to
>>> avoid the cut-&-paste?
>>
>> Thanks for the comments! I'm not sure if I get your question, if we
>> want to move out the handlings of VMAT_GATHER_SCATTER, the new +s seem
>> inevitable? Your concern is mainly about git blame history?
>
> No, it was more that 219-142=77, so it seems like a lot of lines
> are being duplicated rather than simply being moved. (Unlike for
> VMAT_LOAD_STORE_LANES, which was even a slight LOC saving, and so
> was a clear improvement.)
>
> So I was just wondering if there was any obvious factoring-out that
> could be done to reduce the duplication.

Ah, thanks for the clarification! I think the main duplication is at
the beginning and end of the loop body, let's take a look at them in
detail:

+  if (memory_access_type == VMAT_GATHER_SCATTER)
+    {
+      gcc_assert (alignment_support_scheme == dr_aligned
+                  || alignment_support_scheme == dr_unaligned_supported);
+      gcc_assert (!grouped_load && !slp_perm);
+
+      unsigned int inside_cost = 0, prologue_cost = 0;

// The lines above are newly added.

+      for (j = 0; j < ncopies; j++)
+        {
+          /* 1. Create the vector or array pointer update chain. */
+          if (j == 0 && !costing_p)
+            {
+              if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+                vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
+                                             slp_node, &gs_info, &dataref_ptr,
+                                             &vec_offsets);
+              else
+                dataref_ptr
+                  = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
+                                              at_loop, offset, &dummy, gsi,
+                                              &ptr_incr, false, bump);
+            }
+          else if (!costing_p)
+            {
+              gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
+              if (!STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+                dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
+                                               gsi, stmt_info, bump);
+            }

// These are for dataref_ptr. In the final loop nest we deal with more
// cases on simd_lane_access_p and diff_first_stmt_info, but don't handle
// STMT_VINFO_GATHER_SCATTER_P any more. Very little (one case) can be
// shared between the two, so IMHO factoring it out seems like overkill.

+
+          if (mask && !costing_p)
+            vec_mask = vec_masks[j];

// This is merged from the j == 0 and j != 0 cases.

+
+          gimple *new_stmt = NULL;
+          for (i = 0; i < vec_num; i++)
+            {
+              tree final_mask = NULL_TREE;
+              tree final_len = NULL_TREE;
+              tree bias = NULL_TREE;
+              if (!costing_p)
+                {
+                  if (loop_masks)
+                    final_mask
+                      = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
+                                            vec_num * ncopies, vectype,
+                                            vec_num * j + i);
+                  if (vec_mask)
+                    final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
+                                                   final_mask, vec_mask, gsi);
+
+                  if (i > 0 && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+                    dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
+                                                   gsi, stmt_info, bump);
+                }

// This part is directly copied from the original; the original gets
// updated by removing "&& !STMT_VINFO_GATHER_SCATTER_P". Due to its
// size, I didn't consider this before. Do you prefer me to factor this
// part out?

+              if (gs_info.ifn != IFN_LAST)
+                {
...
+                }
+              else
+                {
+                  /* Emulated gather-scatter. */
...

// This part is just moved from the original.

+              vec_dest = vect_create_destination_var (scalar_dest, vectype);
+              /* DATA_REF is null if we've already built the statement. */
+              if (data_ref)
+                {
+                  vect_copy_ref_info (data_ref, DR_REF (first_dr_info->dr));
+                  new_stmt = gimple_build_assign (vec_dest, data_ref);
+                }
+              new_temp = make_ssa_name (vec_dest, new_stmt);
+              gimple_set_lhs (new_stmt, new_temp);
+              vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
+
+              /* Store vector loads in the corresponding SLP_NODE. */
+              if (slp)
+                slp_node->push_vec_def (new_stmt);
+
+          if (!slp && !costing_p)
+            STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
+        }
+
+      if (!slp && !costing_p)
+        *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];

// This part is some subsequent handling. It's duplicated from the
// original, with some more useless code removed. I guess this part is
// not worth factoring out?

+      if (costing_p)
+        {
+          if (dump_enabled_p ())
+            dump_printf_loc (MSG_NOTE, vect_location,
+                             "vect_model_load_cost: inside_cost = %u, "
+                             "prologue_cost = %u .\n",
+                             inside_cost, prologue_cost);
+        }
+      return true;
+    }

// This duplicates the dumping; I guess it's unnecessary to factor it out.

Oh, I just noticed that this should be shortened to
"if (costing_p && dump_enabled_p ())" instead, the same as what's
adopted for the VMAT_LOAD_STORE_LANES dumping.

BR,
Kewen
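
For reference, the shortening of the dump check mentioned above can be illustrated with a minimal standalone sketch. This is not GCC code: `dump_enabled`, `dump_log` and `dump_note` are hypothetical stand-ins for `dump_enabled_p ()` and `dump_printf_loc`, and `forms_agree` is an illustrative helper. It only shows that the nested check and the merged condition log the same thing when the outer `if` has no other statements.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for GCC's dump_enabled_p () and dump_printf_loc ();
// they are assumptions for illustration, not the real GCC interfaces.
static bool dump_enabled = false;
static std::vector<std::string> dump_log;

static void dump_note(const std::string &msg) { dump_log.push_back(msg); }

static std::string cost_message(unsigned inside_cost, unsigned prologue_cost)
{
  return "vect_model_load_cost: inside_cost = " + std::to_string(inside_cost)
         + ", prologue_cost = " + std::to_string(prologue_cost) + " .\n";
}

// Form as posted in the patch: two nested checks.
static void report_cost_nested(bool costing_p, unsigned inside_cost,
                               unsigned prologue_cost)
{
  if (costing_p)
    {
      if (dump_enabled)
        dump_note(cost_message(inside_cost, prologue_cost));
    }
}

// Shortened form: one merged condition.
static void report_cost_merged(bool costing_p, unsigned inside_cost,
                               unsigned prologue_cost)
{
  if (costing_p && dump_enabled)
    dump_note(cost_message(inside_cost, prologue_cost));
}

// Returns true if both forms log identically for all four flag combinations.
static bool forms_agree()
{
  for (int c = 0; c < 2; c++)
    for (int d = 0; d < 2; d++)
      {
        dump_enabled = (d != 0);
        dump_log.clear();
        report_cost_nested(c != 0, 3, 1);
        std::vector<std::string> nested = dump_log;
        dump_log.clear();
        report_cost_merged(c != 0, 3, 1);
        if (nested != dump_log)
          return false;
      }
  return true;
}
```

Calling `forms_agree()` exercises every combination of the two flags. The equivalence relies on the outer `if (costing_p)` block containing nothing but the inner check, which is the case in the hunk quoted above.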