I see speed regressions with gcc-4.4 when using SSE intrinsics on a
Core 2 (x86_64). The code is:
namespace detail
{

/** compute x1 * (1 + x2 * amount) */
__m128 inline amp_mod4_loop(__m128 x1, __m128 x2, __m128 amount, __m128 one)
{
    return _mm_mul_ps(x1,
                      _mm_add_ps(one,
                                 _mm_mul_ps(x2, amount)));
}

} /* namespace detail */

template <>
inline void amp_mod4(float * out, const float * in1, const float * in2,
                     const float amount, unsigned int n)
{
    n = n >> 2;
    const __m128 one = detail::gen_one();
    const __m128 amnt = _mm_set_ps1(amount);
    do
    {
        const __m128 x1 = _mm_load_ps(in1);
        in1 += 4;
        const __m128 x2 = _mm_load_ps(in2);
        in2 += 4;
        const __m128 result = detail::amp_mod4_loop(x1, x2, amnt, one);
        _mm_store_ps(out, result);
        out += 4;
    }
    while (--n);
}
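
The snippet calls detail::gen_one(), which is not shown. Judging from the
pcmpeqd/psrld/pslld sequence at the top of both listings, it probably builds
1.0f in registers instead of loading it from memory. A minimal sketch under
that assumption (the name gen_one_sketch and its body are guesses, not the
original helper):

```cpp
#include <emmintrin.h> /* SSE2 integer intrinsics; also pulls in xmmintrin.h */

namespace sketch
{

/* hypothetical stand-in for detail::gen_one(): all four lanes set to 1.0f */
inline __m128 gen_one_sketch()
{
    __m128i x = _mm_setzero_si128();
    x = _mm_cmpeq_epi32(x, x);   /* every bit set: 0xffffffff per lane */
    x = _mm_srli_epi32(x, 25);   /* keep the low 7 bits: 0x0000007f */
    x = _mm_slli_epi32(x, 23);   /* move them into the exponent: 0x3f800000 */
    return _mm_castsi128_ps(x);  /* reinterpret as four floats, no conversion */
}

} /* namespace sketch */
```

0x3f800000 is the IEEE 754 single-precision bit pattern of 1.0f, so the
shifts match the constant `one` used in x1 * (1 + x2 * amount).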
The results for different compilers (using hardware performance counters) are:

gcc-4.4:
    cycles: 1416276094
    branch misses: 425897

gcc-4.4 -march=core2:
    cycles: 1520034636
    branch misses: 3263912

gcc-4.3:
    cycles: 1548838336
    branch misses: 5990424

gcc-4.3 -march=core2:
    cycles: 1386605444
    branch misses: 5609

gcc-4.2:
    cycles: 1321697674
    branch misses: 3682
It seems that gcc-4.3 with -march=core2 and gcc-4.2 generate code that is
friendlier to the branch predictor. Tuning for core2 with gcc-4.4 actually
seems to generate worse code than the default tuning.
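
To put the cycle counts on one scale, they can be divided by the gcc-4.2
figure. The sketch below does only that arithmetic on the numbers reported
above; the struct and function names are made up for illustration:

```cpp
#include <cstdio>

/* cycle counts copied from the measurements above */
struct measurement
{
    const char * compiler;
    double cycles;
};

static const measurement results[] = {
    { "gcc-4.4",              1416276094.0 },
    { "gcc-4.4 -march=core2", 1520034636.0 },
    { "gcc-4.3",              1548838336.0 },
    { "gcc-4.3 -march=core2", 1386605444.0 },
    { "gcc-4.2",              1321697674.0 },
};

/* cycles relative to the gcc-4.2 build (last entry); 1.0 means same speed */
inline double relative_to_gcc42(const measurement & m)
{
    return m.cycles / results[4].cycles;
}

/* print one line per compiler: name and relative cycle count */
inline void print_relative()
{
    for (const measurement & m : results)
        std::printf("%-22s %.3f\n", m.compiler, relative_to_gcc42(m));
}
```

By this measure the gcc-4.4 -march=core2 build needs roughly 15% more cycles
than the gcc-4.2 build, and gcc-4.4 with default tuning roughly 7% more.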
The best code (gcc-4.2) is:
0000000000400de0 <bench_1_simd(unsigned int)>:
400de0: 66 0f ef c0 pxor %xmm0,%xmm0
400de4: c1 ef 02 shr $0x2,%edi
400de7: 0f 28 15 32 0f 00 00 movaps 0xf32(%rip),%xmm2 # 401d20 <_IO_stdin_used+0xb0>
400dee: 31 c0 xor %eax,%eax
400df0: 66 0f 76 c0 pcmpeqd %xmm0,%xmm0
400df4: 66 0f 72 d0 19 psrld $0x19,%xmm0
400df9: 66 0f 72 f0 17 pslld $0x17,%xmm0
400dfe: 0f 28 c8 movaps %xmm0,%xmm1
400e01: 0f 28 80 e0 26 60 00 movaps 0x6026e0(%rax),%xmm0
400e08: 0f 59 c2 mulps %xmm2,%xmm0
400e0b: 0f 58 c1 addps %xmm1,%xmm0
400e0e: 0f 59 80 e0 25 60 00 mulps 0x6025e0(%rax),%xmm0
400e15: 0f 29 80 e0 24 60 00 movaps %xmm0,0x6024e0(%rax)
400e1c: 48 83 c0 10 add $0x10,%rax
400e20: 83 ef 01 sub $0x1,%edi
400e23: 75 dc jne 400e01 <bench_1_simd(unsigned int)+0x21>
400e25: f3 c3 repz retq
400e27: 90 nop
400e28: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
The worst gcc-4.4 code (-march=core2) is about 15% slower than the gcc-4.2 code:
0000000000400e70 <bench_1_simd(unsigned int)>:
400e70: 66 0f ef d2 pxor %xmm2,%xmm2
400e74: 89 fa mov %edi,%edx
400e76: 66 0f 76 d2 pcmpeqd %xmm2,%xmm2
400e7a: c1 ea 02 shr $0x2,%edx
400e7d: 66 0f 72 d2 19 psrld $0x19,%xmm2
400e82: ff ca dec %edx
400e84: 66 0f 72 f2 17 pslld $0x17,%xmm2
400e89: 48 ff c2 inc %rdx
400e8c: 0f 28 0d 7d 17 00 00 movaps 0x177d(%rip),%xmm1 # 402610 <_IO_stdin_used+0xb0>
400e93: 48 c1 e2 04 shl $0x4,%rdx
400e97: 31 c0 xor %eax,%eax
400e99: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
400ea0: 0f 28 c1 movaps %xmm1,%xmm0
400ea3: 0f 59 80 e0 36 60 00 mulps 0x6036e0(%rax),%xmm0
400eaa: 0f 58 c2 addps %xmm2,%xmm0
400ead: 0f 59 80 e0 35 60 00 mulps 0x6035e0(%rax),%xmm0
400eb4: 0f 29 80 e0 34 60 00 movaps %xmm0,0x6034e0(%rax)
400ebb: 48 83 c0 10 add $0x10,%rax
400ebf: 48 39 d0 cmp %rdx,%rax
400ec2: 75 dc jne 400ea0 <bench_1_simd(unsigned int)+0x30>
400ec4: f3 c3 repz retq
400ec6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
400ecd: 00 00 00
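
One visible difference between the two listings is the loop control: gcc-4.2
keeps a separate down-counter (sub $0x1,%edi; jne), while gcc-4.4
-march=core2 precomputes an end offset and compares the index against it
(cmp %rdx,%rax; jne). Read back into source form, the two shapes look roughly
like this. It is a scalar sketch of my reading of the listings, not compiler
output, and it does not establish that this difference causes the extra
branch misses; all function names are illustrative:

```cpp
#include <cstddef>
#include <cstdint>

/* scalar stand-in for one iteration: four floats of x1 * (1 + x2 * amount) */
inline void body(float * out, const float * in1, const float * in2,
                 float amount, std::size_t i)
{
    for (std::size_t k = 0; k != 4; ++k)
        out[i + k] = in1[i + k] * (1.0f + in2[i + k] * amount);
}

/* gcc-4.2 shape: index plus a separate down-counter */
inline void loop_countdown(float * out, const float * in1, const float * in2,
                           float amount, unsigned int n)
{
    unsigned int count = n >> 2;
    std::size_t i = 0;
    do
    {
        body(out, in1, in2, amount, i);
        i += 4;
    }
    while (--count);
}

/* gcc-4.4 -march=core2 shape: end index computed up front, index compared to it */
inline void loop_compare_end(float * out, const float * in1, const float * in2,
                             float amount, unsigned int n)
{
    /* dec %edx wraps in 32 bits, inc %rdx is 64-bit: same trip count as --count */
    const std::uint64_t trips = std::uint64_t((n >> 2) - 1u) + 1u;
    const std::uint64_t end = trips * 4; /* shl $0x4 in bytes == 4 floats per trip */
    std::uint64_t i = 0;
    do
    {
        body(out, in1, in2, amount, i);
        i += 4;
    }
    while (i != end);
}
```

Both functions run the same number of iterations for any n; they differ only
in which value the loop-closing branch tests.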
--
Summary: [4.4 Regression] speed regression with sse intrinsics
Product: gcc
Version: 4.4.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: inline-asm
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: tim at klingt dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38671