https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65847
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
           Keywords|            |missed-optimization
             Target|            |x86_64-*-*
             Status|UNCONFIRMED |NEW
   Last reconfirmed|            |2015-04-22
                 CC|            |rguenth at gcc dot gnu.org
     Ever confirmed|0           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  The issue is that the vectorizer thinks x and y reside in memory
and thus it vectorizes the code as

  <bb 2>:
  vect__2.5_11 = MEM[(double *)&x];
  vect__3.8_13 = MEM[(double *)&y];
  vect__4.9_14 = vect__2.5_11 + vect__3.8_13;
  MEM[(double *)&D.1840] = vect__4.9_14;
  return D.1840;

which looks good.  But then the ABI comes into play and passes x, y and the
return value in registers ...

Even the best vectorized sequence would still need four statements: two to
pack the arguments into vector registers, one add, and one unpack for the
return value.

So it seems the vectorizer should either be informed of this ABI detail or,
as a simple heuristic, never treat function arguments as "memory" it can
perform vector loads from (which probably means disabling group analysis on
them?).

On i?86 with SSE2 we get

  movupd  8(%esp), %xmm1
  movl    4(%esp), %eax
  movupd  24(%esp), %xmm0
  addpd   %xmm1, %xmm0
  movups  %xmm0, (%eax)

vs.

  movsd   16(%esp), %xmm0
  movl    4(%esp), %eax
  movsd   8(%esp), %xmm1
  addsd   32(%esp), %xmm0
  addsd   24(%esp), %xmm1
  movsd   %xmm0, 8(%eax)
  movsd   %xmm1, (%eax)

so there the vectorized code eventually even looks profitable (with
-mfpmath=sse), and a simple heuristic might pessimize things too much.
Replicating the calls.c code that computes how the arguments are passed
sounds odd, though.  Eventually the target can instead pessimize such loads
in the target cost model (at least it can apply a more reasonable
heuristic there).