[Bug middle-end/45379] [4.6 Regression] ~10% slowdown on test_fpu at revision 163278

2010-08-25 Thread rguenth at gcc dot gnu dot org


--- Comment #14 from rguenth at gcc dot gnu dot org  2010-08-25 10:03 ---
Subject: Bug 45379

Author: rguenth
Date: Wed Aug 25 10:03:19 2010
New Revision: 163540

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=163540
Log:
2010-08-25  Richard Guenther  rguent...@suse.de

	PR middle-end/45379
	* emit-rtl.c (set_mem_attributes_minus_bitpos): Handle
	TARGET_MEM_REF in alignment computation.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/emit-rtl.c


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45379




2010-08-25 Thread rguenth at gcc dot gnu dot org


--- Comment #15 from rguenth at gcc dot gnu dot org  2010-08-25 10:44 ---
Fixed.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|                            |FIXED



2010-08-25 Thread dominiq at lps dot ens dot fr


--- Comment #16 from dominiq at lps dot ens dot fr  2010-08-25 12:01 ---
/* ??? This isn't fully correct, we can't set the alignment from the
   type in all cases.  */

What is the meaning of this comment?



2010-08-25 Thread rguenther at suse dot de


--- Comment #17 from rguenther at suse dot de  2010-08-25 12:51 ---
Subject: Re:  [4.6 Regression] ~10% slowdown on test_fpu
 at revision 163278

On Wed, 25 Aug 2010, dominiq at lps dot ens dot fr wrote:

> --- Comment #16 from dominiq at lps dot ens dot fr  2010-08-25 12:01 ---
> /* ??? This isn't fully correct, we can't set the alignment from the
>    type in all cases.  */
>
> What is the meaning of this comment?

The meaning is that the type alignment does not always agree with
the alignment of the memory loaded/stored.
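
As a minimal sketch of that point (this is not GCC code, and the helper
name address_alignment is hypothetical): a 16-byte vector access into an
array of doubles can start at an element that is only 8-byte aligned, so
the alignment implied by the access type may overstate what the address
actually guarantees. The sketch assumes C11 and an 8-byte double.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: largest power of two, capped at max_align, that
   divides the address p.  This is the alignment the memory really has,
   independent of the type used to access it.  */
static size_t address_alignment(const void *p, size_t max_align)
{
    uintptr_t addr = (uintptr_t)p;
    size_t align = 1;

    /* max_align must be a nonzero power of two.  */
    assert(max_align != 0 && (max_align & (max_align - 1)) == 0);

    while (align < max_align && addr % (align * 2) == 0)
        align *= 2;
    return align;
}

/* A buffer deliberately placed on a 16-byte boundary.  */
static _Alignas(16) double buf[4];
```

Here address_alignment(&buf[0], 16) yields 16, while
address_alignment(&buf[1], 16) yields 8: a two-double access starting at
buf[1] has a 16-byte type but only 8-byte aligned memory.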



2010-08-24 Thread rguenth at gcc dot gnu dot org


--- Comment #6 from rguenth at gcc dot gnu dot org  2010-08-24 10:37 ---
Created an attachment (id=21555)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=21555&action=view)
patch

With this patch I get a similar looking diff.  I still can't reproduce runtime
differences on my Athlon X2, so can you verify the patch (it makes sense anyway
though)?



2010-08-24 Thread dominiq at lps dot ens dot fr


--- Comment #7 from dominiq at lps dot ens dot fr  2010-08-24 10:55 ---
With the patch in comment #6, I get a minor improvement, but do not recover the
timing before r163278:

r163277

Test1 - Gauss 2000 (101x101) inverts  1.9 sec  Err= 0.006
2.157u 0.074s 0:02.23 99.5% 0+0k 0+0io 0pf+0w

r163469 without patch

Test1 - Gauss 2000 (101x101) inverts  2.7 sec  Err= 0.006
2.903u 0.069s 0:02.97 99.6% 0+0k 0+0io 0pf+0w

r163517 with patch

Test1 - Gauss 2000 (101x101) inverts  2.5 sec  Err= 0.006
2.717u 0.073s 0:02.79 99.6% 0+0k 0+0io 0pf+0w
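
For reference, the relative slowdown implied by the reported "sec"
figures can be worked out as below. This is only an arithmetic sketch;
the function name slowdown_percent is hypothetical and not part of the
benchmark.

```c
/* Percentage by which the time `after` exceeds the time `before`.  */
static double slowdown_percent(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

With the numbers above, slowdown_percent(1.9, 2.7) is roughly 42 (r163469
without the patch against r163277) and slowdown_percent(1.9, 2.5) is
roughly 32 (r163517 with the patch), so the patch recovers only part of
the loss on this reduced test.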



2010-08-24 Thread rguenth at gcc dot gnu dot org


--- Comment #8 from rguenth at gcc dot gnu dot org  2010-08-24 11:37 ---
Do you see the slowdown as well if you drop -funroll-loops?  Do you see
the slowdown with just -O2?



2010-08-24 Thread dominiq at lps dot ens dot fr


--- Comment #9 from dominiq at lps dot ens dot fr  2010-08-24 11:47 ---
> Do you see the slowdown as well if you drop -funroll-loops?

Yes

[macbook] lin/test% gfc -Ofast test_fpu_red.f90
[macbook] lin/test% time a.out
Test1 - Gauss 2000 (101x101) inverts  3.0 sec  Err= 0.006
3.208u 0.072s 0:03.28 99.6% 0+0k 0+0io 0pf+0w
[macbook] lin/test% gfcp -Ofast test_fpu_red.f90
[macbook] lin/test% time a.out
Test1 - Gauss 2000 (101x101) inverts  2.2 sec  Err= 0.006
2.440u 0.076s 0:02.52 99.6% 0+0k 0+0io 0pf+0w

> Do you see the slowdown with just -O2?

No

[macbook] lin/test% gfc -O2 test_fpu_red.f90
[macbook] lin/test% time a.out
Test1 - Gauss 2000 (101x101) inverts  3.1 sec  Err= 0.006
3.328u 0.071s 0:03.40 99.7% 0+0k 0+0io 0pf+0w
[macbook] lin/test% gfcp -O2 test_fpu_red.f90
[macbook] lin/test% time a.out
Test1 - Gauss 2000 (101x101) inverts  3.1 sec  Err= 0.006
3.330u 0.073s 0:03.40 100.0%0+0k 0+0io 0pf+0w

but I see it with -O2 -ftree-vectorize

[macbook] lin/test% gfc -O2 -ftree-vectorize test_fpu_red.f90
[macbook] lin/test% time a.out
Test1 - Gauss 2000 (101x101) inverts  3.1 sec  Err= 0.006
3.318u 0.070s 0:03.39 99.7% 0+0k 0+0io 0pf+0w
[macbook] lin/test% gfcp -O2 -ftree-vectorize test_fpu_red.f90
[macbook] lin/test% time a.out
Test1 - Gauss 2000 (101x101) inverts  2.3 sec  Err= 0.006
2.498u 0.076s 0:02.57 99.6% 0+0k 0+0io 0pf+0w

although I do not see any difference in the outputs with
-ftree-vectorizer-verbose=2.



2010-08-24 Thread rguenth at gcc dot gnu dot org


--- Comment #10 from rguenth at gcc dot gnu dot org  2010-08-24 13:25 ---
Subject: Bug 45379

Author: rguenth
Date: Tue Aug 24 13:25:25 2010
New Revision: 163519

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=163519
Log:
2010-08-24  Richard Guenther  rguent...@suse.de

	PR middle-end/45379
	* tree-ssa-address.c (create_mem_ref_raw): Drop to MEM_REF
	if addr->index is NULL or zero.
	* tree-ssa-alias.c (indirect_refs_may_alias_p): Handle
	TARGET_MEM_REF more properly.
	(indirect_ref_may_alias_decl_p): Likewise.
	* emit-rtl.c (set_mem_attributes_minus_bitpos): Keep TARGET_MEM_REFs.
	* alias.c (ao_ref_from_mem): Handle TARGET_MEM_REF more
	properly.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/alias.c
    trunk/gcc/emit-rtl.c
    trunk/gcc/tree-ssa-address.c
    trunk/gcc/tree-ssa-alias.c



2010-08-24 Thread dominiq at lps dot ens dot fr


--- Comment #11 from dominiq at lps dot ens dot fr  2010-08-24 14:33 ---
Assembly for the inner loop

  do i = 1, n
     b(i,j) = b(i,j)-temp(i)*c
  end do

with -Ofast


r163277

L38:
        movsd   (%rsi,%rax), %xmm0
        addl    $1, %ecx
        movhpd  8(%rsi,%rax), %xmm0
        movapd  %xmm0, %xmm1
        movapd  (%rdi,%rax), %xmm0
        mulpd   %xmm3, %xmm1
        subpd   %xmm1, %xmm0
        movapd  %xmm0, (%rdi,%rax)
        addq    $16, %rax
        cmpl    $249, %ecx
        jbe     L38

r163519

L38:
        movsd   (%rdi,%rax), %xmm5
        addl    $1, %esi
        movhpd  8(%rdi,%rax), %xmm5
        movapd  %xmm5, %xmm1
        movsd   (%rcx,%rax), %xmm5
        mulpd   %xmm3, %xmm1
        movhpd  8(%rcx,%rax), %xmm5
        movapd  %xmm5, %xmm0
        subpd   %xmm1, %xmm0
        movlpd  %xmm0, (%rcx,%rax)
        movhpd  %xmm0, 8(%rcx,%rax)
        addq    $16, %rax
        cmpl    $249, %esi
        jbe     L38
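
For readers who want to look at the same kernel outside Fortran, here is
a rough C rendering of the inner loop. It is only a sketch: the function
name update_column and the restrict qualifiers are assumptions, and a C
compiler will not necessarily produce the code shown above. In the two
listings the difference is in how the b column is accessed: r163277 loads
and stores it with single aligned movapd instructions, while r163519
splits the load into movsd/movhpd and the store into movlpd/movhpd.

```c
#include <stddef.h>

/* C rendering of the Fortran inner loop
       b(i,j) = b(i,j) - temp(i)*c
   for one fixed column j; `col` stands for b(:,j).  */
static void update_column(double *restrict col, const double *restrict temp,
                          double c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        col[i] -= temp[i] * c;
}
```

Calling update_column(col, temp, c, n) once per column j corresponds to
one pass of the "do i" loop in the reduced test.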



2010-08-24 Thread rguenth at gcc dot gnu dot org


--- Comment #12 from rguenth at gcc dot gnu dot org  2010-08-24 15:48 ---
Try

Index: emit-rtl.c
===================================================================
--- emit-rtl.c  (revision 163519)
+++ emit-rtl.c  (working copy)
@@ -1615,6 +1615,11 @@ set_mem_attributes_minus_bitpos (rtx ref
       align = MAX (align, TYPE_ALIGN (type));
     }
 
+  else if (TREE_CODE (t) == TARGET_MEM_REF)
+    /* ??? This isn't fully correct, we can't set the alignment from the
+       type in all cases.  */
+    align = MAX (align, TYPE_ALIGN (type));
+
   else if (TREE_CODE (t) == MISALIGNED_INDIRECT_REF)
     {
       if (integer_zerop (TREE_OPERAND (t, 1)))



2010-08-24 Thread dominiq at lps dot ens dot fr


--- Comment #13 from dominiq at lps dot ens dot fr  2010-08-24 16:19 ---
With the patch in comment #12 I get

[macbook] lin/test% gfc -Ofast -funroll-loops test_fpu.f90
[macbook] lin/test% time a.out
  Benchmark running, hopefully as only ACTIVE task
  0.99755959009261719 
Test1 - Gauss 2000 (101x101) inverts  2.0 sec  Err= 0.006
Test2 - Crout 2000 (101x101) inverts  2.9 sec  Err= 0.014
Test3 - Crout  2 (1001x1001) inverts  3.4 sec  Err= 0.043
Test4 - Lapack 2 (1001x1001) inverts  2.6 sec  Err= 0.250
 total = 10.9 sec

11.103u 0.098s 0:11.21 99.8%0+0k 0+0io 0pf+0w

compared to

[macbook] lin/test% gfcp -Ofast -funroll-loops test_fpu.f90
[macbook] lin/test% time a.out
  Benchmark running, hopefully as only ACTIVE task
  0.99755959009261719 
Test1 - Gauss 2000 (101x101) inverts  2.0 sec  Err= 0.006
Test2 - Crout 2000 (101x101) inverts  2.9 sec  Err= 0.014
Test3 - Crout  2 (1001x1001) inverts  3.4 sec  Err= 0.043
Test4 - Lapack 2 (1001x1001) inverts  2.6 sec  Err= 0.250
 total = 10.9 sec

11.114u 0.101s 0:11.22 99.9%0+0k 0+0io 0pf+0w

So it fixes the slowdown. Thanks for the patch.



2010-08-23 Thread dominiq at lps dot ens dot fr


--- Comment #1 from dominiq at lps dot ens dot fr  2010-08-23 12:20 ---
Reduced test

! 
MODULE kinds
   INTEGER, PARAMETER :: RK8 = SELECTED_REAL_KIND(15, 300)
END MODULE kinds
! 
PROGRAM TEST_FPU  ! A number-crunching benchmark using matrix inversion.
USE kinds ! Implemented by:David Frank  dave_fr...@hotmail.com
IMPLICIT NONE ! Gauss  routine by: Tim Prince   n...@aol.com

REAL(RK8) :: pool(101,101,1000), pool3(1001,1001) ! random numbers to invert
EQUIVALENCE (pool,pool3)   ! use same pool numbers for test 3,4
REAL(RK8) :: a(101,101), a3(1001,1001) ! working matrices
REAL(RK8) :: avg_err, dt
INTEGER :: i, n, t(8), clock1, clock2, rate
CHARACTER (LEN=36) :: invert_id = &
  'Test1 - Gauss 2000 (101x101) inverts'

   CALL DATE_AND_TIME ( values = t )
   CALL RANDOM_NUMBER(pool) ! fill pool with random data ( 0. - 1. )
   CALL SYSTEM_CLOCK (clock1,rate)  ! get benchmark (n) start time

   DO i = 1,1000
      a = pool(:,:,i)         ! get next matrix to invert
      CALL Gauss (a,101)      ! invert a
      CALL Gauss (a,101)      ! invert a
   END DO

   avg_err = SUM(ABS(a-pool(:,:,1000)))/(101*101)   ! last matrix error
   CALL SYSTEM_CLOCK (clock2,rate)
   dt = (clock2-clock1)/DBLE(rate)  ! get benchmark (n) elapsed sec.
   WRITE (*,92) invert_id, dt, ' sec  Err=', avg_err
92 FORMAT (A,F5.1,A,F18.15)

END PROGRAM TEST_FPU

! 
SUBROUTINE Gauss (a,n)   ! Invert matrix by Gauss method
! 
USE kinds
IMPLICIT NONE

INTEGER :: n
REAL(RK8) :: a(n,n)

! - - - Local Variables - - -
REAL(RK8) :: b(n,n), c, d, temp(n)
INTEGER :: i, j, k, m, imax(1), ipvt(n)
! - - - - - - - - - - - - - -
b = a
ipvt = (/ (i, i = 1, n) /)

DO k = 1,n
   imax = MAXLOC(ABS(b(k:n,k)))
   m = k-1+imax(1)

   IF (m /= k) THEN
      ipvt( (/m,k/) ) = ipvt( (/k,m/) )
      b((/m,k/),:) = b((/k,m/),:)
   END IF
   d = 1/b(k,k)

   temp = b(:,k)
   DO j = 1, n
      c = b(k,j)*d
      b(:,j) = b(:,j)-temp*c
      b(k,j) = c
   END DO
   b(:,k) = temp*(-d)
   b(k,k) = d
END DO
a(:,ipvt) = b

END SUBROUTINE Gauss

gfcp is r163277, gfc is r163455

[macbook] lin/test% gfcp -Ofast -funroll-loops test_fpu_red.f90
[macbook] lin/test% time a.out
Test1 - Gauss 2000 (101x101) inverts  1.9 sec  Err= 0.006
2.156u 0.064s 0:02.22 99.5% 0+0k 0+0io 0pf+0w
[macbook] lin/test% gfc -Ofast -funroll-loops test_fpu_red.f90
[macbook] lin/test% time a.out
Test1 - Gauss 2000 (101x101) inverts  2.7 sec  Err= 0.006
2.906u 0.067s 0:02.99 98.9% 0+0k 0+0io 0pf+0w



2010-08-23 Thread rguenth at gcc dot gnu dot org


--- Comment #2 from rguenth at gcc dot gnu dot org  2010-08-23 17:04 ---
Can't reproduce on x86_64-linux.  Please try to pinpoint the codegen difference
that causes the slowdown.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|rguenther at suse dot de    |rguenth at gcc dot gnu dot
                   |                            |org
   Target Milestone|---                         |4.6.0



2010-08-23 Thread dominiq at lps dot ens dot fr


--- Comment #3 from dominiq at lps dot ens dot fr  2010-08-23 17:24 ---
> Can't reproduce on x86_64-linux.

My timings were on an Intel Core2Duo 2.53 GHz.

> Please try to pinpoint the codegen difference that causes the slowdown.

I don't know if this is what you asked for, but comparing the assembly
(fast -, slow +) I see the following kind of pattern in several places:

 L36:
-   leaq    1(%rsi), %r9
-   movq    %rsi, %r10
+   movq    %rdi, %r10
+   leaq    1(%rdi), %r9
    salq    $4, %r10
+   movsd   (%rsi,%r10), %xmm14
    salq    $4, %r9
-   movapd  (%rdi,%r10), %xmm5
-   leaq    2(%rsi), %r10
-   movapd  (%rdi,%r9), %xmm4
-   leaq    3(%rsi), %r9
+   movhpd  8(%rsi,%r10), %xmm14
+   leaq    2(%rdi), %r10
+   movapd  %xmm14, %xmm13
    salq    $4, %r10
-   andpd   %xmm0, %xmm5
+   andpd   %xmm0, %xmm13
+   movlpd  %xmm13, (%rcx)
+   movhpd  %xmm13, 8(%rcx)
+   movsd   (%rsi,%r9), %xmm12
+   movhpd  8(%rsi,%r9), %xmm12
+   leaq    3(%rdi), %r9
+   movapd  %xmm12, %xmm11
    salq    $4, %r9
-   movapd  (%rdi,%r10), %xmm3
-   leaq    4(%rsi), %r10
-   andpd   %xmm0, %xmm4
-   movapd  (%rdi,%r9), %xmm2
-   leaq    5(%rsi), %r9
+   andpd   %xmm0, %xmm11
+   movlpd  %xmm11, 16(%rcx)
+   movhpd  %xmm11, 24(%rcx)
+   movsd   (%rsi,%r10), %xmm10
+   movhpd  8(%rsi,%r10), %xmm10
+   leaq    4(%rdi), %r10
+   movapd  %xmm10, %xmm9
    salq    $4, %r10
-   andpd   %xmm0, %xmm3
+   andpd   %xmm0, %xmm9
+   movlpd  %xmm9, 32(%rcx)
+   movhpd  %xmm9, 40(%rcx)
+   movsd   (%rsi,%r9), %xmm8
+   movhpd  8(%rsi,%r9), %xmm8
+   leaq    5(%rdi), %r9
+   movapd  %xmm8, %xmm7
    salq    $4, %r9
-   movapd  (%rdi,%r10), %xmm1
-   leaq    6(%rsi), %r10
-   andpd   %xmm0, %xmm2
-   movapd  (%rdi,%r9), %xmm15
-   leaq    7(%rsi), %r9
+   andpd   %xmm0, %xmm7
+   movlpd  %xmm7, 48(%rcx)
+   movhpd  %xmm7, 56(%rcx)
+   movsd   (%rsi,%r10), %xmm6
+   movhpd  8(%rsi,%r10), %xmm6
+   leaq    6(%rdi), %r10
+   movapd  %xmm6, %xmm5
    salq    $4, %r10
-   andpd   %xmm0, %xmm1
+   andpd   %xmm0, %xmm5
+   movlpd  %xmm5, 64(%rcx)
+   movhpd  %xmm5, 72(%rcx)
+   movsd   (%rsi,%r9), %xmm4
+   movhpd  8(%rsi,%r9), %xmm4
+   leaq    7(%rdi), %r9
+   addq    $8, %rdi
+   movapd  %xmm4, %xmm3
    salq    $4, %r9
-   movapd  (%rdi,%r10), %xmm14
-   andpd   %xmm0, %xmm15
-   addq    $8, %rsi
-   movapd  (%rdi,%r9), %xmm13
+   andpd   %xmm0, %xmm3
+   movlpd  %xmm3, 80(%rcx)
+   movhpd  %xmm3, 88(%rcx)
+   movsd   (%rsi,%r10), %xmm2
+   movhpd  8(%rsi,%r10), %xmm2
+   movapd  %xmm2, %xmm1
+   andpd   %xmm0, %xmm1
+   movlpd  %xmm1, 96(%rcx)
+   movhpd  %xmm1, 104(%rcx)
+   movsd   (%rsi,%r9), %xmm15
+   movhpd  8(%rsi,%r9), %xmm15
+   movapd  %xmm15, %xmm14
    andpd   %xmm0, %xmm14
-   andpd   %xmm0, %xmm13
-   movlpd  %xmm5, (%rcx)
-   movhpd  %xmm5, 8(%rcx)
-   movlpd  %xmm4, 16(%rcx)
-   movhpd  %xmm4, 24(%rcx)
-   movlpd  %xmm3, 32(%rcx)
-   movhpd  %xmm3, 40(%rcx)
-   movlpd  %xmm2, 48(%rcx)
-   movhpd  %xmm2, 56(%rcx)
-   movlpd  %xmm1, 64(%rcx)
-   movhpd  %xmm1, 72(%rcx)
-   movlpd  %xmm15, 80(%rcx)
-   movhpd  %xmm15, 88(%rcx)
-   movlpd  %xmm14, 96(%rcx)
-   movhpd  %xmm14, 104(%rcx)
-   movlpd  %xmm13, 112(%rcx)
-   movhpd  %xmm13, 120(%rcx)
+   movlpd  %xmm14, 112(%rcx)
+   movhpd  %xmm14, 120(%rcx)
    subq    $-128, %rcx
-   cmpq    %r11, %rsi
+   cmpq    %r11, %rdi
    jb      L36



2010-08-23 Thread rguenth at gcc dot gnu dot org


--- Comment #4 from rguenth at gcc dot gnu dot org  2010-08-23 17:46 ---
Can you try

Index: gcc/emit-rtl.c
===================================================================
--- gcc/emit-rtl.c  (revision 163472)
+++ gcc/emit-rtl.c  (working copy)
@@ -1788,6 +1788,7 @@ set_mem_attributes_minus_bitpos (rtx ref
 
   /* If this is an indirect reference, record it.  */
   else if (TREE_CODE (t) == MEM_REF
+           || TREE_CODE (t) == TARGET_MEM_REF
            || TREE_CODE (t) == MISALIGNED_INDIRECT_REF)
     {
       expr = t;



2010-08-23 Thread dominiq at lps dot ens dot fr


--- Comment #5 from dominiq at lps dot ens dot fr  2010-08-23 19:01 ---
> Can you try ...

This does not change the timing for test_fpu.f90 or for the reduced test in
comment #1.

AFAICT the problem is within the loop

   DO j = 1, n
      c = b(k,j)*d
      do i = 1, n
         b(i,j) = b(i,j)-temp(i)*c
      end do
      b(k,j) = c
   END DO

(where it does not matter whether

   b(:,j) = b(:,j)-temp*c

is scalarized or not).

