https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80844
--- Comment #3 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Jakub Jelinek from comment #2)
> It doesn't always zero, it can be pretty arbitrary.

Is it feasible to have it just load the first vector of elements, instead of broadcasting the identity value? i.e. do the vector equivalent of

    sum = a[0]
    for (i=1; ...)

i.e. peel the first iteration and optimize away the computation, leaving just the load. Another way to handle the actual loop body running zero times for counts between 1 and 2 full vectors is to put the loop entry point after the first load & accumulate.

(BTW, for operations like min/max/AND/OR where duplicate values don't affect the result, an unaligned final vector that overlaps already-processed elements would be much more efficient than a scalar cleanup for the last less-than-full-vector of elements, but you still need a scalar fallback if the total count can be smaller than 1 full vector...)
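
To make the suggestion concrete, here is a rough scalar-emulated sketch of both ideas for a max reduction: the accumulator is initialized by loading the first vector instead of broadcasting an identity value, and the tail is handled with one overlapping unaligned vector. This is only an illustration of the shape of the code, not what GCC emits. The names `VEC`, `vec4`, `load_u`, `vmax` and `max_reduce` are made up for this example, with `vec4` standing in for a real SIMD register.

```c
#include <stddef.h>
#include <string.h>

#define VEC 4  /* hypothetical vector width, in elements */

/* Stand-in for a SIMD register holding VEC ints. */
typedef struct { int lane[VEC]; } vec4;

/* Emulates an unaligned vector load. */
static vec4 load_u(const int *p)
{
    vec4 v;
    memcpy(v.lane, p, sizeof v.lane);
    return v;
}

/* Emulates an element-wise vector max. */
static vec4 vmax(vec4 a, vec4 b)
{
    for (int k = 0; k < VEC; k++)
        if (b.lane[k] > a.lane[k])
            a.lane[k] = b.lane[k];
    return a;
}

/* Maximum of a[0..n-1]; assumes n >= 1. */
int max_reduce(const int *a, size_t n)
{
    if (n < VEC) {
        /* Scalar fallback: fewer elements than one full vector. */
        int m = a[0];
        for (size_t i = 1; i < n; i++)
            if (a[i] > m)
                m = a[i];
        return m;
    }

    /* Peeled first iteration: just a load, no identity broadcast. */
    vec4 acc = load_u(a);

    size_t i = VEC;
    for (; i + VEC <= n; i += VEC)
        acc = vmax(acc, load_u(a + i));

    /* Tail: one unaligned vector ending at a[n-1]. It re-reads some
       elements, which is harmless because max is idempotent. */
    if (i < n)
        acc = vmax(acc, load_u(a + n - VEC));

    /* Horizontal reduction of the accumulator lanes. */
    int m = acc.lane[0];
    for (int k = 1; k < VEC; k++)
        if (acc.lane[k] > m)
            m = acc.lane[k];
    return m;
}
```

The overlapping tail only works for operations where counting an element twice doesn't change the result (min/max/AND/OR); a sum would still need a scalar or masked cleanup.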