https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80844

--- Comment #3 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Jakub Jelinek from comment #2)
> It doesn't always zero, it can be pretty arbitrary.

Is it feasible to have it just load the first vector of elements, instead of
broadcasting the identity value?  i.e. do the vector equivalent of 

 sum = a[0]
 for (i=1; ...)

i.e. peel the first iteration and optimize away the computation, leaving just
the load.  For counts between 1 and 2 full vectors, where the actual loop body
runs zero times, another option is to put the loop entry point after the
first load & accumulate.
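A minimal scalar sketch of the peeled-first-iteration idea, assuming a hypothetical 4-lane vector emulated with a plain C array (VLEN and sum_peeled are illustrative names, not GCC internals). The accumulator starts as a copy of the first vector of elements instead of a broadcast of 0, so the first "iteration" is just the load. For n between VLEN and 2*VLEN-1 the main loop body runs zero times. Counts below one full vector fall through to the scalar loop.

```c
#include <stddef.h>

/* Hypothetical vector width: 4 lanes, emulated with a plain array. */
#define VLEN 4

/* Sum n ints. The accumulator is initialized from a[0..VLEN-1]
   (the vector equivalent of sum = a[0]) rather than from a
   broadcast of the identity value 0. */
long sum_peeled(const int *a, size_t n)
{
    long total = 0;
    size_t i = 0;

    if (n >= VLEN) {
        long acc[VLEN];
        /* Peeled first iteration: only a load, no add against zeros. */
        for (size_t l = 0; l < VLEN; l++)
            acc[l] = a[l];
        /* Main loop over the remaining full vectors; runs zero times
           when VLEN <= n < 2*VLEN. */
        for (i = VLEN; i + VLEN <= n; i += VLEN)
            for (size_t l = 0; l < VLEN; l++)
                acc[l] += a[i + l];
        /* Horizontal reduction of the accumulator lanes. */
        for (size_t l = 0; l < VLEN; l++)
            total += acc[l];
    }
    /* Scalar cleanup: the leftover tail, or every element when n < VLEN. */
    for (; i < n; i++)
        total += a[i];
    return total;
}
```

The identity value never appears for n >= VLEN; it is only the starting value of the scalar total.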

(BTW, for operations like min/max/AND/OR where duplicate values don't affect
the result, an unaligned final vector would be much more efficient than a
scalar cleanup for the last less-than-full-vector of elements, but you still
need a scalar fallback if the total count can be smaller than 1 full vector...)
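A sketch of the unaligned-final-vector idea for min, again assuming a hypothetical 4-lane vector emulated with a plain array (VLEN and min_overlap_tail are illustrative names). Instead of a scalar cleanup loop, the tail is handled by one more vector load that ends exactly at a[n-1] and overlaps elements already processed. Seeing an element twice cannot change a min, so the overlap is harmless; this would not be valid for sum. A scalar fallback is still needed when n < VLEN.

```c
#include <stddef.h>

/* Hypothetical vector width: 4 lanes, emulated with a plain array. */
#define VLEN 4

static int min_int(int x, int y) { return x < y ? x : y; }

/* Minimum of n ints; requires n >= 1. */
int min_overlap_tail(const int *a, size_t n)
{
    int result;
    if (n < VLEN) {
        /* Scalar fallback: fewer elements than one full vector. */
        result = a[0];
        for (size_t i = 1; i < n; i++)
            result = min_int(result, a[i]);
        return result;
    }
    int acc[VLEN];
    size_t i;
    /* First vector is just a load (no broadcast of INT_MAX). */
    for (size_t l = 0; l < VLEN; l++)
        acc[l] = a[l];
    for (i = VLEN; i + VLEN <= n; i += VLEN)
        for (size_t l = 0; l < VLEN; l++)
            acc[l] = min_int(acc[l], a[i + l]);
    /* Tail: one unaligned vector covering a[n-VLEN .. n-1]. It may
       re-read elements already seen, which does not affect min. */
    if (i < n)
        for (size_t l = 0; l < VLEN; l++)
            acc[l] = min_int(acc[l], a[n - VLEN + l]);
    /* Horizontal reduction of the accumulator lanes. */
    result = acc[0];
    for (size_t l = 1; l < VLEN; l++)
        result = min_int(result, acc[l]);
    return result;
}
```

The same structure works for max, AND, and OR by swapping the lane operation, since all of them are idempotent.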
