Toon Moene wrote:
Tim Prince wrote:
> If you want those, you must request them with -mtune=barcelona.
OK, so it is an alignment issue (with -mtune=barcelona):
.L6:
movups 0(%rbp,%rax), %xmm0
movups (%rbx,%rax), %xmm1
incl %ecx
addps %xmm1, %xmm0
movaps %xmm0, (%r8,%rax)
addq $16, %rax
cmpl %r10d, %ecx
jb .L6
This is where IPA (interprocedural analysis) could help. I created the following main program:
      real a(10), b(10), c(10)
      a = 0.
      b = 1.
      print '(3(1x,z16))', loc(a), loc(b), loc(c)
      call sum(a, b, c, 10)
      print *, c(5)
      end
[ The first print statement assures me that a, b and c are 16-byte aligned
(the hexadecimal printout of each address ends in 0); the second one
prevents optimization from throwing everything away. ]
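To spell out the check in the note above: an address is 16-byte aligned exactly when its low four bits are zero, which is the same as its hexadecimal printout ending in 0. Below is a minimal sketch of that test in C. The helper name is_aligned16 is hypothetical and not part of the original program, which simply prints loc() of each array in hex.

```c
#include <stdint.h>

/* Hypothetical helper (not from the original post): returns 1 when the
 * address p is a multiple of 16, i.e. when its last hex digit is 0. */
int is_aligned16(const void *p)
{
    /* The low four bits of an address are exactly its last hex digit. */
    return ((uintptr_t)p & 15u) == 0;
}
```

For example, an address printed as 4004d0 passes this test, while 4004d8 is only 8-byte aligned and fails it.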
and then compiled it with:
$ /usr/snp/bin/gfortran -O3 -flto -fwhole-program main.f sum.f
So the alignment of a, b and c is known and is correct for vectorization.
Still, the loop in the subroutine looks like this (objdump -S a.out):
4004d0: 0f 28 c2 movaps %xmm2,%xmm0
4004d3: 0f 28 ca movaps %xmm2,%xmm1
4004d6: 0f 12 04 07 movlps (%rdi,%rax,1),%xmm0
4004da: 0f 12 0c 06 movlps (%rsi,%rax,1),%xmm1
4004de: 0f 16 44 07 08 movhps 0x8(%rdi,%rax,1),%xmm0
4004e3: 0f 16 4c 06 08 movhps 0x8(%rsi,%rax,1),%xmm1
4004e8: ff c1 inc %ecx
4004ea: 0f 58 c1 addps %xmm1,%xmm0
4004ed: 41 0f 29 04 02 movaps %xmm0,(%r10,%rax,1)
4004f2: 48 83 c0 10 add $0x10,%rax
4004f6: 44 39 c9 cmp %r9d,%ecx
4004f9: 72 d5 jb 4004d0 <main+0x1c0>
--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
At home: http://moene.org/~toon/
Progress of GNU Fortran: http://gcc.gnu.org/gcc-4.5/changes.html