https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89582
Bug ID: 89582
Summary: Suboptimal code generated for floating point struct at
-O3 compared to -O2
Product: gcc
Version: 8.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---
While testing the code for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89581 on
Linux x86-64, I noticed that the generated code is worse when compiled with -O3
than with -O2.
```
typedef struct {
    double x1;
    double x2;
} vdouble __attribute__((aligned(16)));

vdouble f(vdouble x, vdouble y)
{
    return (vdouble){x.x1 + y.x1, x.x2 + y.x2};
}
```
Compiling with `-O2` produces
```
f:
addsd %xmm3, %xmm1
addsd %xmm2, %xmm0
ret
```
With `-O3` or `-Ofast`, however, the code produced is
```
f:
movq %xmm0, -40(%rsp)
movq %xmm1, -32(%rsp)
movapd -40(%rsp), %xmm4
movq %xmm2, -24(%rsp)
movq %xmm3, -16(%rsp)
addpd -24(%rsp), %xmm4
movaps %xmm4, -40(%rsp)
movsd -32(%rsp), %xmm1
movsd -40(%rsp), %xmm0
ret
```
It seems that gcc tries to use a vector instruction (`addpd`) but has to go
through the stack to pack the four scalar argument registers into vectors and
to unpack the result again. I did a quick benchmark, which confirms that the
-O3 version is much slower than the -O2 version.
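For reference, a timing harness along the following lines can be used to compare the two builds. This is only a minimal sketch, not the benchmark I actually ran: the `bench` helper, the `step` values and the `noinline` attribute are assumptions added for illustration, and the loop forms a dependent chain, so it mostly measures call latency.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

typedef struct {
    double x1;
    double x2;
} vdouble __attribute__((aligned(16)));

/* noinline (an addition for this sketch) keeps f as a real call, so the
 * register-based argument passing shown in the listings is exercised. */
__attribute__((noinline)) vdouble f(vdouble x, vdouble y)
{
    return (vdouble){x.x1 + y.x1, x.x2 + y.x2};
}

/* Hypothetical helper: run n dependent calls to f, store the final
 * accumulator in *out, and return the elapsed wall-clock time in seconds. */
double bench(long n, vdouble *out)
{
    vdouble acc = {0.0, 0.0};
    vdouble step = {1.0, 2.0};
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < n; i++)
        acc = f(acc, step);   /* each call depends on the previous result */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *out = acc;
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```

Calling `bench` with a large iteration count from `main` and printing the returned time, once in a binary built with `-O2` and once with `-O3`, gives the comparison.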
Clang produces
```
f:
addsd %xmm2, %xmm0
addsd %xmm3, %xmm1
retq
```
Clang generates this at every optimization level above -O0, which seems appropriate.