Toon Moene wrote:

Tim Prince wrote:

 > If you want those, you must request them with -mtune=barcelona.

OK, so it is an alignment issue (with -mtune=barcelona):

.L6:
        movups  0(%rbp,%rax), %xmm0
        movups  (%rbx,%rax), %xmm1
        incl    %ecx
        addps   %xmm1, %xmm0
        movaps  %xmm0, (%r8,%rax)
        addq    $16, %rax
        cmpl    %r10d, %ecx
        jb      .L6

This is where IPA could help.  I created the following main program:

      real a(10), b(10), c(10)
      a = 0.
      b = 1.
      print '(3(1x,z16))', loc(a), loc(b), loc(c)
      call sum(a, b, c, 10)
      print *, c(5)
      end

[ The first print statement assures me a, b and c are 16 byte aligned
  (hexadecimal printout of address ends in 0), the second one prevents
  optimization from throwing away everything. ]

and then compiled it together with the subroutine:

$ /usr/snp/bin/gfortran -O3 -flto -fwhole-program main.f sum.f

So the alignment of a, b and c is known and is correct for vectorization. Still, the loop in the subroutine loads each operand in two unaligned halves (movlps/movhps) instead of with a single movaps (objdump -S a.out):

  4004d0:       0f 28 c2                movaps %xmm2,%xmm0
  4004d3:       0f 28 ca                movaps %xmm2,%xmm1
  4004d6:       0f 12 04 07             movlps (%rdi,%rax,1),%xmm0
  4004da:       0f 12 0c 06             movlps (%rsi,%rax,1),%xmm1
  4004de:       0f 16 44 07 08          movhps 0x8(%rdi,%rax,1),%xmm0
  4004e3:       0f 16 4c 06 08          movhps 0x8(%rsi,%rax,1),%xmm1
  4004e8:       ff c1                   inc    %ecx
  4004ea:       0f 58 c1                addps  %xmm1,%xmm0
  4004ed:       41 0f 29 04 02          movaps %xmm0,(%r10,%rax,1)
  4004f2:       48 83 c0 10             add    $0x10,%rax
  4004f6:       44 39 c9                cmp    %r9d,%ecx
  4004f9:       72 d5                   jb     4004d0 <main+0x1c0>
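
[ The alignment check done by eye above (hexadecimal address ending in 0)
  can also be made explicit in the program. A minimal sketch, assuming
  the GNU loc() extension already used in the main program and a 64-bit
  target, where loc() returns an 8-byte integer; the variable name
  "aligned" is only illustrative: ]

```fortran
      real a(10), b(10), c(10)
      logical aligned
c     loc() is a GNU extension that returns the address as an integer.
c     An address is 16-byte aligned when it is a multiple of 16.
c     16_8 assumes loc() yields an 8-byte integer (64-bit target).
      aligned = mod(loc(a), 16_8) .eq. 0
     &    .and. mod(loc(b), 16_8) .eq. 0
     &    .and. mod(loc(c), 16_8) .eq. 0
      print *, 'a, b, c all 16-byte aligned: ', aligned
      end
```

[ This only reports what the addresses happen to be at run time; it does
  not by itself tell the vectorizer anything about alignment. ]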

--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/
Progress of GNU Fortran: http://gcc.gnu.org/gcc-4.5/changes.html
