https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123190
--- Comment #6 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <[email protected]>:

https://gcc.gnu.org/g:948d33f490a6b0051376da6bdcf55223a552b30f

commit r16-6767-g948d33f490a6b0051376da6bdcf55223a552b30f
Author: Richard Biener <[email protected]>
Date:   Wed Jan 14 12:45:19 2026 +0100

    tree-optimization/123190 - fix costing of permuted contiguous loads

    The following fixes a regression from the time we split load groups
    along SLP boundaries.  When we face a permuted load from an access
    that is contiguous across loop iterations we emit code that loads
    the whole group and then emit required permutations.  The permutations
    might not need all those loads, and if we split the group we would
    not have emitted them.  Fortunately when analyzing a permutation
    we compute both the number of required permutes and the number of
    loads that will survive the following DCE.  So make sure to use that
    when costing.

    This allows the previously added testcase for PR123190 to undergo
    epilog vectorization also at -O2 plus when using non-generic tuning,
    such as tuning for Zen4 which ups the cost for XMM loads.

            PR tree-optimization/123190
            * tree-vectorizer.h (vect_load_store_data): Add n_loads member.
            * tree-vect-stmts.cc (get_load_store_type): Record the
            number of required loads for permuted loads.
            (vectorizable_load): Make use of this when costing loads for
            VMAT_CONTIGUOUS[_REVERSE].

            * gcc.dg/vect/costmodel/x86_64/costmodel-pr123190-1.c: Do not
            require -mtune=generic.
            * gcc.dg/vect/costmodel/x86_64/costmodel-pr123190-2.c: Add variant
            with -O2 instead of -O3, inner loop not unrolled.
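
To illustrate the costing idea described in the commit message, here is a rough toy model in C. It is not the code from tree-vect-stmts.cc: the function name count_needed_loads, its parameters, and the MAX_VECTOR_LOADS limit are all made up for this sketch, and it ignores everything the real vectorizer has to consider (vectorization factor, gaps, alignment, reversal). It only shows why the number of loads surviving DCE can be smaller than the number of loads covering the whole group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical upper bound for this sketch only.  */
#define MAX_VECTOR_LOADS 64

/* Toy model, NOT GCC's implementation.  A contiguous group of GROUP_SIZE
   scalar elements is covered by ceil (GROUP_SIZE / NLANES) vector loads.
   PERM lists the group elements the permutation selects.  Return how many
   of the covering vector loads are referenced at all, i.e. how many
   would survive dead code elimination in this simplified picture.  */
unsigned
count_needed_loads (const unsigned *perm, size_t nperm,
                    unsigned nlanes, unsigned group_size)
{
  assert (nlanes > 0 && group_size > 0);

  unsigned total_loads = (group_size + nlanes - 1) / nlanes;
  assert (total_loads <= MAX_VECTOR_LOADS);

  bool used[MAX_VECTOR_LOADS] = { false };
  unsigned needed = 0;

  for (size_t i = 0; i < nperm; i++)
    {
      assert (perm[i] < group_size);
      /* Which covering vector load holds this element?  */
      unsigned load_idx = perm[i] / nlanes;
      if (!used[load_idx])
        {
          used[load_idx] = true;
          needed++;
        }
    }

  return needed;
}
```

In this model, a group of 8 elements with 4-lane vectors is covered by 2 vector loads. A permutation selecting elements { 1, 0, 3, 2 } touches only the first of them, so costing 2 loads would overestimate the work, while a permutation selecting { 0, 4, 1, 5 } really needs both.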
