https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
--- Comment #47 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 29 Mar 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
>
> --- Comment #46 from Hongtao.liu <crazylht at gmail dot com> ---
> Another issue is splitting vector load to halves or elements, the latter
> requires scratch registers which may not be available, the former doesn't
> require extra register but may still trigger STLF stalls. For cray case,
> splitting to halves is equal to splitting to elements.
>
> For x86, there're sse/256_unaligned_load_optima would split 128/256-bit vector
> load to halves.

I suggest to try the easy case first, only split when splitting would
split to elements and when that doesn't require scratch registers.
For large N (number of elements) the separate loads + inserts will
eventually offset the penalty of a failing forwarding anyway, so it
is less obviously a win (or less obviously not a loss).
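
For readers following along, here is a minimal C sketch of the two access patterns under discussion for the N = 2 case: two scalar stores followed by one 128-bit load spanning both, versus the same load split into elements and reassembled. The names `load_whole`, `load_split` and `lanes_equal` are hypothetical and do not come from the GCC sources or from the testcase in this PR. The sketch relies on the GCC/Clang `vector_size` extension, and it only shows the source-level shape: at -O2 the compiler is likely to forward the stored values itself, and whether the wide load actually stalls depends on the microarchitecture.

```c
#include <assert.h>
#include <string.h>

/* GCC/Clang generic vector extension: two doubles in one 128-bit vector. */
typedef double v2df __attribute__((vector_size(16)));

/* Two 64-bit stores followed by one 128-bit load covering both.
   This is the shape where store-to-load forwarding may fail, because
   no single earlier store supplies all bytes of the load. */
void load_whole(double *buf, double a, double b, v2df *out)
{
    assert(buf && out);
    buf[0] = a;
    buf[1] = b;
    memcpy(out, buf, sizeof *out);   /* one wide load from buf */
}

/* The same load split to elements: each 64-bit load matches exactly
   one earlier 64-bit store, then the lanes are inserted into the
   vector.  Cost: N loads + N inserts instead of one load. */
void load_split(double *buf, double a, double b, v2df *out)
{
    assert(buf && out);
    buf[0] = a;
    buf[1] = b;
    double lo = buf[0];
    double hi = buf[1];
    v2df v = { lo, hi };
    *out = v;
}

/* Returns 1 when both lanes of x and y compare equal. */
int lanes_equal(const v2df *x, const v2df *y)
{
    return (*x)[0] == (*y)[0] && (*x)[1] == (*y)[1];
}
```

Both functions produce the same vector; they differ only in how the bytes are read back. With N = 2, splitting to halves and splitting to elements coincide, which is the situation described in comment #46. For larger N the split form grows linearly in loads and inserts, which is why the benefit becomes less clear.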