On 01/03/2014 10:11 PM, Jakub Jelinek wrote:

> Hi!
>
> On Fri, Jan 03, 2014 at 08:58:30PM +0100, Toon Moene wrote:
> > I don't doubt that would work, what I'm interested in, is (cat verintlin.f):
>
> Well, you need gather loads for that and there you hit PR target/59617.

I tried your patch, and the effect on the most heavily used loop in the full routine (not the part that I quoted before):

    160       DO JY = KLAT1,KLAT2
    161       DO JX = KLON1,KLON2
    162          IDX  = KP(JX,JY)
    163          IDY  = KQ(JX,JY)
    164          ILEV = KR(JX,JY)
...
    237      + + PBETA(JX,JY,4)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY+1,ILEV+1)
    238      +                  + PALFA(JX,JY,2)*PARG(IDX-1,IDY+1,ILEV+1)
    239      +                  + PALFA(JX,JY,3)*PARG(IDX  ,IDY+1,ILEV+1)
    240      +                  + PALFA(JX,JY,4)*PARG(IDX+1,IDY+1,ILEV+1) ) )
    241       ENDDO
    242       ENDDO

is (just counting assembler lines, i.e., instructions):

-Ofast -mavx2 -mfma:           627 lines in the .s file.

-Ofast -mavx2 -mfma -mavx512f: 588 lines in the .s file.

However, this routine is clearly memory bound: vectorization with the gather instruction (needed for the indirect addressing via IDX = KP(JX,JY), etc.) didn't bring any speed improvement.
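
For readers who don't have the full routine at hand, the indirect addressing pattern can be reduced to the following minimal sketch. It is not taken from verintlin.f: the program name, the array names (kidx, src, dst), the sizes and the index values are all made up for illustration, and it is written in free form rather than the fixed form of the listing above. Whether the compiler actually emits a gather for it depends on the GCC version and the flags used.

```fortran
! Minimal sketch of the indirect-addressing pattern.
! All names, sizes and index values are hypothetical, not from verintlin.f.
program gather_sketch
  implicit none
  integer, parameter :: n = 8, m = 16
  integer :: kidx(n)   ! index array, plays the role of KP(JX,JY)
  real    :: src(m)    ! source field, plays the role of PARG
  real    :: dst(n)
  integer :: j

  ! Fill the source field with recognizable values.
  do j = 1, m
     src(j) = 10.0 * real(j)
  end do

  ! Non-contiguous indices that are only known at run time.
  kidx = [3, 9, 1, 16, 7, 7, 12, 4]

  ! Each iteration reads src at a data-dependent position, so a plain
  ! contiguous vector load cannot be used; vectorizing this loop needs
  ! a gather load (for example AVX2 vgatherdps).
  do j = 1, n
     dst(j) = src(kidx(j))
  end do

  print *, dst
end program gather_sketch
```

The real loop does the same thing with three index arrays (KP, KQ, KR) and many loads from PARG per iteration, which is why the number of memory accesses dominates.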

The number of instructions accessing memory:

-Ofast -mavx2 -mfma:           364 lines in the .s file.

-Ofast -mavx2 -mfma -mavx512f: 221 lines in the .s file.

So there might be a clear improvement here ...

Thanks !

--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news
