https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What            |Removed                        |Added
----------------------------------------------------------------------------
           Assignee        |unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
     Ever confirmed        |0                              |1
             Status        |UNCONFIRMED                    |ASSIGNED
    Last reconfirmed       |                               |2023-02-09
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Adding -fopt-info shows:

t.c:3:21: optimized: loop vectorized using 8 byte vectors
t.c:1:6: optimized: loop with 7 iterations completely unrolled (header execution count 63136016)
Disabling unrolling instead shows:
.L2:
        leaq    (%rsi,%rax), %r8
        leaq    (%rdx,%rax), %rdi
        movl    (%r8), %ecx
        addl    (%rdi), %ecx
        movq    %r10, -8(%rsp)
        movl    %ecx, -8(%rsp)
        movq    -8(%rsp), %rcx
        movl    4(%rdi), %edi
        addl    4(%r8), %edi
        movq    %rcx, -16(%rsp)
        movl    %edi, -12(%rsp)
        movq    -16(%rsp), %rcx
        movq    %rcx, (%r9,%rax)
        addq    $8, %rax
        cmpq    $64, %rax
        jne     .L2
What happens is that vector lowering fails to emit a generic vector
addition (which it is supposed to materialize) and instead decomposes
the vector into scalar adds, which eventually results in
us spilling ...
The reason is that vector lowering does
/* Expand a vector operation to scalars; for integer types we can use
   special bit twiddling tricks to do the sums a word at a time, using
   function F_PARALLEL instead of F.  These tricks are done only if
   they can process at least four items, that is, only if the vector
   holds at least four items and if a word can hold four items.  */

static tree
expand_vector_addition (gimple_stmt_iterator *gsi,
                        elem_op_func f, elem_op_func f_parallel,
                        tree type, tree a, tree b, enum tree_code code)
{
  int parts_per_word = BITS_PER_WORD / vector_element_bits (type);

  if (INTEGRAL_TYPE_P (TREE_TYPE (type))
      && parts_per_word >= 4
      && nunits_for_known_piecewise_op (type) >= 4)
    return expand_vector_parallel (gsi, f_parallel,
                                   type, a, b, code);
  else
    return expand_vector_piecewise (gsi, f,
                                    type, TREE_TYPE (type),
                                    a, b, code, false);
}
so it only treats >= 4 elements as profitable to handle this way, but the
vectorizer doesn't seem to know that: it applies its own cost model here,
while vector lowering doesn't have any.