[Bug middle-end/45379] [4.6 Regression] ~10% slowdown on test_fpu at revision 163278
--- Comment #14 from rguenth at gcc dot gnu dot org 2010-08-25 10:03 ---
Subject: Bug 45379

Author: rguenth
Date: Wed Aug 25 10:03:19 2010
New Revision: 163540

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=163540
Log:
2010-08-25  Richard Guenther  rguent...@suse.de

        PR middle-end/45379
        * emit-rtl.c (set_mem_attributes_minus_bitpos): Handle
        TARGET_MEM_REF in alignment computation.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/emit-rtl.c

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45379
[Bug middle-end/45379] [4.6 Regression] ~10% slowdown on test_fpu at revision 163278
--- Comment #15 from rguenth at gcc dot gnu dot org 2010-08-25 10:44 ---
Fixed.

--
rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|                            |FIXED

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45379
[Bug middle-end/45379] [4.6 Regression] ~10% slowdown on test_fpu at revision 163278
--- Comment #16 from dominiq at lps dot ens dot fr 2010-08-25 12:01 ---
/* ??? This isn't fully correct, we can't set the alignment from the
   type in all cases.  */

What is the meaning of this comment?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45379
[Bug middle-end/45379] [4.6 Regression] ~10% slowdown on test_fpu at revision 163278
--- Comment #17 from rguenther at suse dot de 2010-08-25 12:51 ---
Subject: Re: [4.6 Regression] ~10% slowdown on test_fpu at revision 163278

On Wed, 25 Aug 2010, dominiq at lps dot ens dot fr wrote:

> --- Comment #16 from dominiq at lps dot ens dot fr 2010-08-25 12:01 ---
> /* ??? This isn't fully correct, we can't set the alignment from the
>    type in all cases.  */
>
> What is the meaning of this comment?

The meaning is that the type alignment does not always agree with the
alignment of the memory loaded/stored.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45379
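A small illustration of that reply, as a sketch rather than anything taken from GCC: assuming 8-byte doubles and a 16-byte-aligned array base, the columns of the 101x101 matrix from the reduced test do not all start on 16-byte boundaries, even though a two-double vector type would nominally be 16-byte aligned. The helper names column_offset_bytes and known_alignment are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte offset of the first element of column `col`
   (0-based) in a column-major matrix with `rows` doubles per column. */
static size_t column_offset_bytes(size_t rows, size_t col)
{
    return rows * col * sizeof(double);
}

/* Largest power-of-two alignment, capped at `cap`, that an address
   `offset` bytes past a `cap`-aligned base is guaranteed to have. */
static size_t known_alignment(size_t offset, size_t cap)
{
    assert(cap != 0 && (cap & (cap - 1)) == 0); /* cap must be a power of two */

    size_t align = cap;
    while (align > 1 && offset % align != 0)
        align /= 2;
    return align;
}
```

With 8-byte doubles, column_offset_bytes(101, 1) is 808 and known_alignment(808, 16) is 8, so a 16-byte vector access to that column cannot rely on the alignment of its type.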
[Bug middle-end/45379] [4.6 Regression] ~10% slowdown on test_fpu at revision 163278
--- Comment #6 from rguenth at gcc dot gnu dot org 2010-08-24 10:37 ---
Created an attachment (id=21555)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=21555&action=view)
patch

With this patch I get a similar looking diff. I still can't reproduce runtime
differences on my Athlon X2, so can you verify the patch (it makes sense
anyway though)?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45379
[Bug middle-end/45379] [4.6 Regression] ~10% slowdown on test_fpu at revision 163278
--- Comment #7 from dominiq at lps dot ens dot fr 2010-08-24 10:55 ---
With the patch in comment #6, I get a minor improvement, but do not recover
the timing before r163278:

r163277
Test1 - Gauss 2000 (101x101) inverts 1.9 sec Err= 0.006
2.157u 0.074s 0:02.23 99.5% 0+0k 0+0io 0pf+0w

r163469 without patch
Test1 - Gauss 2000 (101x101) inverts 2.7 sec Err= 0.006
2.903u 0.069s 0:02.97 99.6% 0+0k 0+0io 0pf+0w

r163517 with patch
Test1 - Gauss 2000 (101x101) inverts 2.5 sec Err= 0.006
2.717u 0.073s 0:02.79 99.6% 0+0k 0+0io 0pf+0w

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45379
[Bug middle-end/45379] [4.6 Regression] ~10% slowdown on test_fpu at revision 163278
--- Comment #8 from rguenth at gcc dot gnu dot org 2010-08-24 11:37 ---
Do you see the slowdown as well if you drop -funroll-loops?
Do you see the slowdown with just -O2?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45379
[Bug middle-end/45379] [4.6 Regression] ~10% slowdown on test_fpu at revision 163278
--- Comment #9 from dominiq at lps dot ens dot fr 2010-08-24 11:47 ---
> Do you see the slowdown as well if you drop -funroll-loops?

Yes

[macbook] lin/test% gfc -Ofast test_fpu_red.f90
[macbook] lin/test% time a.out
Test1 - Gauss 2000 (101x101) inverts 3.0 sec Err= 0.006
3.208u 0.072s 0:03.28 99.6% 0+0k 0+0io 0pf+0w
[macbook] lin/test% gfcp -Ofast test_fpu_red.f90
[macbook] lin/test% time a.out
Test1 - Gauss 2000 (101x101) inverts 2.2 sec Err= 0.006
2.440u 0.076s 0:02.52 99.6% 0+0k 0+0io 0pf+0w

> Do you see the slowdown with just -O2?

No

[macbook] lin/test% gfc -O2 test_fpu_red.f90
[macbook] lin/test% time a.out
Test1 - Gauss 2000 (101x101) inverts 3.1 sec Err= 0.006
3.328u 0.071s 0:03.40 99.7% 0+0k 0+0io 0pf+0w
[macbook] lin/test% gfcp -O2 test_fpu_red.f90
[macbook] lin/test% time a.out
Test1 - Gauss 2000 (101x101) inverts 3.1 sec Err= 0.006
3.330u 0.073s 0:03.40 100.0% 0+0k 0+0io 0pf+0w

but I see it with -O2 -ftree-vectorize

[macbook] lin/test% gfc -O2 -ftree-vectorize test_fpu_red.f90
[macbook] lin/test% time a.out
Test1 - Gauss 2000 (101x101) inverts 3.1 sec Err= 0.006
3.318u 0.070s 0:03.39 99.7% 0+0k 0+0io 0pf+0w
[macbook] lin/test% gfcp -O2 -ftree-vectorize test_fpu_red.f90
[macbook] lin/test% time a.out
Test1 - Gauss 2000 (101x101) inverts 2.3 sec Err= 0.006
2.498u 0.076s 0:02.57 99.6% 0+0k 0+0io 0pf+0w

although I do not see any difference in the outputs with
-ftree-vectorizer-verbose=2.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45379
[Bug middle-end/45379] [4.6 Regression] ~10% slowdown on test_fpu at revision 163278
--- Comment #10 from rguenth at gcc dot gnu dot org 2010-08-24 13:25 ---
Subject: Bug 45379

Author: rguenth
Date: Tue Aug 24 13:25:25 2010
New Revision: 163519

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=163519
Log:
2010-08-24  Richard Guenther  rguent...@suse.de

        PR middle-end/45379
        * tree-ssa-address.c (create_mem_ref_raw): Drop to MEM_REF
        if addr->index is NULL or zero.
        * tree-ssa-alias.c (indirect_refs_may_alias_p): Handle
        TARGET_MEM_REF more properly.
        (indirect_ref_may_alias_decl_p): Likewise.
        * emit-rtl.c (set_mem_attributes_minus_bitpos): Keep
        TARGET_MEM_REFs.
        * alias.c (ao_ref_from_mem): Handle TARGET_MEM_REF more
        properly.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/alias.c
    trunk/gcc/emit-rtl.c
    trunk/gcc/tree-ssa-address.c
    trunk/gcc/tree-ssa-alias.c

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45379
[Bug middle-end/45379] [4.6 Regression] ~10% slowdown on test_fpu at revision 163278
--- Comment #11 from dominiq at lps dot ens dot fr 2010-08-24 14:33 ---
Assembly for the inner loop

            do i = 1, n
               b(i,j) = b(i,j)-temp(i)*c
            end do

with -Ofast

r163277

L38:
        movsd   (%rsi,%rax), %xmm0
        addl    $1, %ecx
        movhpd  8(%rsi,%rax), %xmm0
        movapd  %xmm0, %xmm1
        movapd  (%rdi,%rax), %xmm0
        mulpd   %xmm3, %xmm1
        subpd   %xmm1, %xmm0
        movapd  %xmm0, (%rdi,%rax)
        addq    $16, %rax
        cmpl    $249, %ecx
        jbe     L38

r163519

L38:
        movsd   (%rdi,%rax), %xmm5
        addl    $1, %esi
        movhpd  8(%rdi,%rax), %xmm5
        movapd  %xmm5, %xmm1
        movsd   (%rcx,%rax), %xmm5
        mulpd   %xmm3, %xmm1
        movhpd  8(%rcx,%rax), %xmm5
        movapd  %xmm5, %xmm0
        subpd   %xmm1, %xmm0
        movlpd  %xmm0, (%rcx,%rax)
        movhpd  %xmm0, 8(%rcx,%rax)
        addq    $16, %rax
        cmpl    $249, %esi
        jbe     L38

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45379
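The two listings differ in how b is accessed: r163277 loads and stores it with aligned movapd, while r163519 splits each access into movsd/movhpd and movlpd/movhpd halves. A rough C sketch of the r163277-style loop using SSE2 intrinsics follows, assuming b is 16-byte aligned and temp is not known to be. The name update_column is hypothetical, the scalar fallback is there only so the block builds on targets without SSE2, and a compiler is free to choose different instructions for the unaligned load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Computes b[i] -= temp[i] * c for i in [0, n).
   Assumption of this sketch: b is 16-byte aligned; temp may not be. */
static void update_column(double *b, const double *temp, double c, size_t n)
{
    assert(((uintptr_t)b & 15) == 0); /* precondition for the aligned accesses */

    size_t i = 0;
#if defined(__SSE2__)
    const __m128d vc = _mm_set1_pd(c);
    for (; i + 2 <= n; i += 2) {
        __m128d t = _mm_loadu_pd(temp + i); /* unaligned load of temp */
        __m128d x = _mm_load_pd(b + i);     /* aligned load of b */
        x = _mm_sub_pd(x, _mm_mul_pd(t, vc));
        _mm_store_pd(b + i, x);             /* aligned store to b */
    }
#endif
    for (; i < n; i++) /* scalar tail, or the whole loop without SSE2 */
        b[i] -= temp[i] * c;
}
```

Replacing _mm_load_pd and _mm_store_pd with _mm_loadu_pd and _mm_storeu_pd gives the conservative form that a compiler has to fall back to when it cannot prove the alignment of b.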
[Bug middle-end/45379] [4.6 Regression] ~10% slowdown on test_fpu at revision 163278
--- Comment #12 from rguenth at gcc dot gnu dot org 2010-08-24 15:48 ---
Try

Index: emit-rtl.c
===================================================================
--- emit-rtl.c  (revision 163519)
+++ emit-rtl.c  (working copy)
@@ -1615,6 +1615,11 @@ set_mem_attributes_minus_bitpos (rtx ref
          align = MAX (align, TYPE_ALIGN (type));
        }

+      else if (TREE_CODE (t) == TARGET_MEM_REF)
+       /* ??? This isn't fully correct, we can't set the alignment from the
+          type in all cases.  */
+       align = MAX (align, TYPE_ALIGN (type));
+
       else if (TREE_CODE (t) == MISALIGNED_INDIRECT_REF)
        {
          if (integer_zerop (TREE_OPERAND (t, 1)))

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45379
[Bug middle-end/45379] [4.6 Regression] ~10% slowdown on test_fpu at revision 163278
--- Comment #13 from dominiq at lps dot ens dot fr 2010-08-24 16:19 ---
With the patch in comment #12 I get

[macbook] lin/test% gfc -Ofast -funroll-loops test_fpu.f90
[macbook] lin/test% time a.out
  Benchmark running, hopefully as only ACTIVE task
  0.99755959009261719
Test1 - Gauss 2000 (101x101) inverts 2.0 sec Err= 0.006
Test2 - Crout 2000 (101x101) inverts 2.9 sec Err= 0.014
Test3 - Crout 2 (1001x1001) inverts 3.4 sec Err= 0.043
Test4 - Lapack 2 (1001x1001) inverts 2.6 sec Err= 0.250
total = 10.9 sec
11.103u 0.098s 0:11.21 99.8% 0+0k 0+0io 0pf+0w

compared to

[macbook] lin/test% gfcp -Ofast -funroll-loops test_fpu.f90
[macbook] lin/test% time a.out
  Benchmark running, hopefully as only ACTIVE task
  0.99755959009261719
Test1 - Gauss 2000 (101x101) inverts 2.0 sec Err= 0.006
Test2 - Crout 2000 (101x101) inverts 2.9 sec Err= 0.014
Test3 - Crout 2 (1001x1001) inverts 3.4 sec Err= 0.043
Test4 - Lapack 2 (1001x1001) inverts 2.6 sec Err= 0.250
total = 10.9 sec
11.114u 0.101s 0:11.22 99.9% 0+0k 0+0io 0pf+0w

So it fixes the slowdown. Thanks for the patch.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45379
[Bug middle-end/45379] [4.6 Regression] ~10% slowdown on test_fpu at revision 163278
--- Comment #1 from dominiq at lps dot ens dot fr 2010-08-23 12:20 ---
Reduced test

!
MODULE kinds
   INTEGER, PARAMETER :: RK8 = SELECTED_REAL_KIND(15, 300)
END MODULE kinds
!
PROGRAM TEST_FPU  ! A number-crunching benchmark using matrix inversion.
USE kinds         ! Implemented by:David Frank dave_fr...@hotmail.com
IMPLICIT NONE     ! Gauss routine by: Tim Prince n...@aol.com

REAL(RK8) :: pool(101,101,1000), pool3(1001,1001)  ! random numbers to invert
EQUIVALENCE (pool,pool3)                ! use same pool numbers for test 3,4
REAL(RK8) :: a(101,101), a3(1001,1001)  ! working matrices

REAL(RK8) :: avg_err, dt
INTEGER :: i, n, t(8), clock1, clock2, rate

CHARACTER (LEN=36) :: invert_id = 'Test1 - Gauss 2000 (101x101) inverts'

CALL DATE_AND_TIME ( values = t )

CALL RANDOM_NUMBER(pool)          ! fill pool with random data ( 0. - 1. )

CALL SYSTEM_CLOCK (clock1,rate)   ! get benchmark (n) start time

DO i = 1,1000
   a = pool(:,:,i)                ! get next matrix to invert
   CALL Gauss (a,101)             ! invert a
   CALL Gauss (a,101)             ! invert a
END DO
avg_err = SUM(ABS(a-pool(:,:,1000)))/(101*101)  ! last matrix error

CALL SYSTEM_CLOCK (clock2,rate)
dt = (clock2-clock1)/DBLE(rate)   ! get benchmark (n) elapsed sec.
WRITE (*,92) invert_id, dt, ' sec Err=', avg_err

92 FORMAT (A,F5.1,A,F18.15)

END PROGRAM TEST_FPU
!
SUBROUTINE Gauss (a,n)  ! Invert matrix by Gauss method
!
USE kinds
IMPLICIT NONE

INTEGER :: n
REAL(RK8) :: a(n,n)

! - - - Local Variables - - -
REAL(RK8) :: b(n,n), c, d, temp(n)
INTEGER :: i, j, k, m, imax(1), ipvt(n)
! - - - - - - - - - - - - - -
b = a
ipvt = (/ (i, i = 1, n) /)
DO k = 1,n
   imax = MAXLOC(ABS(b(k:n,k)))
   m = k-1+imax(1)
   IF (m /= k) THEN
      ipvt( (/m,k/) ) = ipvt( (/k,m/) )
      b((/m,k/),:) = b((/k,m/),:)
   END IF
   d = 1/b(k,k)
   temp = b(:,k)
   DO j = 1, n
      c = b(k,j)*d
      b(:,j) = b(:,j)-temp*c
      b(k,j) = c
   END DO
   b(:,k) = temp*(-d)
   b(k,k) = d
END DO
a(:,ipvt) = b

END SUBROUTINE Gauss

gfcp is r163277, gfc is r163455

[macbook] lin/test% gfcp -Ofast -funroll-loops test_fpu_red.f90
[macbook] lin/test% time a.out
Test1 - Gauss 2000 (101x101) inverts 1.9 sec Err= 0.006
2.156u 0.064s 0:02.22 99.5% 0+0k 0+0io 0pf+0w
[macbook] lin/test% gfc -Ofast -funroll-loops test_fpu_red.f90
[macbook] lin/test% time a.out
Test1 - Gauss 2000 (101x101) inverts 2.7 sec Err= 0.006
2.906u 0.067s 0:02.99 98.9% 0+0k 0+0io 0pf+0w

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45379
[Bug middle-end/45379] [4.6 Regression] ~10% slowdown on test_fpu at revision 163278
--- Comment #2 from rguenth at gcc dot gnu dot org 2010-08-23 17:04 ---
Can't reproduce on x86_64-linux. Please try to pinpoint the codegen
difference that causes the slowdown.

--
rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|rguenther at suse dot de    |rguenth at gcc dot gnu dot
                   |                            |org
   Target Milestone|---                         |4.6.0

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45379
[Bug middle-end/45379] [4.6 Regression] ~10% slowdown on test_fpu at revision 163278
--- Comment #3 from dominiq at lps dot ens dot fr 2010-08-23 17:24 ---
> Can't reproduce on x86_64-linux.

My timings were on an Intel Core2Duo 2.53GHz.

> Please try to pinpoint the codegen difference that causes the slowdown.

I don't know if this is what you ask for, but comparing assembly (fast -,
slow +) I see in several places the following kind of patterns:

 L36:
-       leaq    1(%rsi), %r9
-       movq    %rsi, %r10
+       movq    %rdi, %r10
+       leaq    1(%rdi), %r9
        salq    $4, %r10
+       movsd   (%rsi,%r10), %xmm14
        salq    $4, %r9
-       movapd  (%rdi,%r10), %xmm5
-       leaq    2(%rsi), %r10
-       movapd  (%rdi,%r9), %xmm4
-       leaq    3(%rsi), %r9
+       movhpd  8(%rsi,%r10), %xmm14
+       leaq    2(%rdi), %r10
+       movapd  %xmm14, %xmm13
        salq    $4, %r10
-       andpd   %xmm0, %xmm5
+       andpd   %xmm0, %xmm13
+       movlpd  %xmm13, (%rcx)
+       movhpd  %xmm13, 8(%rcx)
+       movsd   (%rsi,%r9), %xmm12
+       movhpd  8(%rsi,%r9), %xmm12
+       leaq    3(%rdi), %r9
+       movapd  %xmm12, %xmm11
        salq    $4, %r9
-       movapd  (%rdi,%r10), %xmm3
-       leaq    4(%rsi), %r10
-       andpd   %xmm0, %xmm4
-       movapd  (%rdi,%r9), %xmm2
-       leaq    5(%rsi), %r9
+       andpd   %xmm0, %xmm11
+       movlpd  %xmm11, 16(%rcx)
+       movhpd  %xmm11, 24(%rcx)
+       movsd   (%rsi,%r10), %xmm10
+       movhpd  8(%rsi,%r10), %xmm10
+       leaq    4(%rdi), %r10
+       movapd  %xmm10, %xmm9
        salq    $4, %r10
-       andpd   %xmm0, %xmm3
+       andpd   %xmm0, %xmm9
+       movlpd  %xmm9, 32(%rcx)
+       movhpd  %xmm9, 40(%rcx)
+       movsd   (%rsi,%r9), %xmm8
+       movhpd  8(%rsi,%r9), %xmm8
+       leaq    5(%rdi), %r9
+       movapd  %xmm8, %xmm7
        salq    $4, %r9
-       movapd  (%rdi,%r10), %xmm1
-       leaq    6(%rsi), %r10
-       andpd   %xmm0, %xmm2
-       movapd  (%rdi,%r9), %xmm15
-       leaq    7(%rsi), %r9
+       andpd   %xmm0, %xmm7
+       movlpd  %xmm7, 48(%rcx)
+       movhpd  %xmm7, 56(%rcx)
+       movsd   (%rsi,%r10), %xmm6
+       movhpd  8(%rsi,%r10), %xmm6
+       leaq    6(%rdi), %r10
+       movapd  %xmm6, %xmm5
        salq    $4, %r10
-       andpd   %xmm0, %xmm1
+       andpd   %xmm0, %xmm5
+       movlpd  %xmm5, 64(%rcx)
+       movhpd  %xmm5, 72(%rcx)
+       movsd   (%rsi,%r9), %xmm4
+       movhpd  8(%rsi,%r9), %xmm4
+       leaq    7(%rdi), %r9
+       addq    $8, %rdi
+       movapd  %xmm4, %xmm3
        salq    $4, %r9
-       movapd  (%rdi,%r10), %xmm14
-       andpd   %xmm0, %xmm15
-       addq    $8, %rsi
-       movapd  (%rdi,%r9), %xmm13
+       andpd   %xmm0, %xmm3
+       movlpd  %xmm3, 80(%rcx)
+       movhpd  %xmm3, 88(%rcx)
+       movsd   (%rsi,%r10), %xmm2
+       movhpd  8(%rsi,%r10), %xmm2
+       movapd  %xmm2, %xmm1
+       andpd   %xmm0, %xmm1
+       movlpd  %xmm1, 96(%rcx)
+       movhpd  %xmm1, 104(%rcx)
+       movsd   (%rsi,%r9), %xmm15
+       movhpd  8(%rsi,%r9), %xmm15
+       movapd  %xmm15, %xmm14
        andpd   %xmm0, %xmm14
-       andpd   %xmm0, %xmm13
-       movlpd  %xmm5, (%rcx)
-       movhpd  %xmm5, 8(%rcx)
-       movlpd  %xmm4, 16(%rcx)
-       movhpd  %xmm4, 24(%rcx)
-       movlpd  %xmm3, 32(%rcx)
-       movhpd  %xmm3, 40(%rcx)
-       movlpd  %xmm2, 48(%rcx)
-       movhpd  %xmm2, 56(%rcx)
-       movlpd  %xmm1, 64(%rcx)
-       movhpd  %xmm1, 72(%rcx)
-       movlpd  %xmm15, 80(%rcx)
-       movhpd  %xmm15, 88(%rcx)
-       movlpd  %xmm14, 96(%rcx)
-       movhpd  %xmm14, 104(%rcx)
-       movlpd  %xmm13, 112(%rcx)
-       movhpd  %xmm13, 120(%rcx)
+       movlpd  %xmm14, 112(%rcx)
+       movhpd  %xmm14, 120(%rcx)
        subq    $-128, %rcx
-       cmpq    %r11, %rsi
+       cmpq    %r11, %rdi
        jb      L36

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45379
[Bug middle-end/45379] [4.6 Regression] ~10% slowdown on test_fpu at revision 163278
--- Comment #4 from rguenth at gcc dot gnu dot org 2010-08-23 17:46 ---
Can you try

Index: gcc/emit-rtl.c
===================================================================
--- gcc/emit-rtl.c      (revision 163472)
+++ gcc/emit-rtl.c      (working copy)
@@ -1788,6 +1788,7 @@ set_mem_attributes_minus_bitpos (rtx ref

       /* If this is an indirect reference, record it.  */
       else if (TREE_CODE (t) == MEM_REF
+              || TREE_CODE (t) == TARGET_MEM_REF
               || TREE_CODE (t) == MISALIGNED_INDIRECT_REF)
        {
          expr = t;

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45379
[Bug middle-end/45379] [4.6 Regression] ~10% slowdown on test_fpu at revision 163278
--- Comment #5 from dominiq at lps dot ens dot fr 2010-08-23 19:01 ---
> Can you try
> ...

This does not change the timing for test_fpu.f90 and the reduced test in
comment #1. AFAICT the problem is within the loop

      DO j = 1, n
         c = b(k,j)*d
         do i = 1, n
            b(i,j) = b(i,j)-temp(i)*c
         end do
         b(k,j) = c
      END DO

(where it does not matter whether b(:,j) = b(:,j)-temp*c is scalarized or
not).

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45379