[Bug fortran/52473] CSHIFT slow - inline it?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 --- Comment #21 from Dominique d'Humieres --- > The original test case is still slow, so I guess we should > keep this open. Which test?
[Bug fortran/52473] CSHIFT slow - inline it?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 --- Comment #20 from Thomas Koenig --- The original test case is still slow, so I guess we should keep this open.
[Bug fortran/52473] CSHIFT slow - inline it?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 Martin Liška changed: What|Removed |Added CC||marxin at gcc dot gnu.org --- Comment #19 from Martin Liška --- Can the bug be marked as resolved?
[Bug fortran/52473] CSHIFT slow - inline it?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 --- Comment #18 from Jürgen Reuter --- The example by posted on May 20, 2017 on c.l.f. improved by Stefano Zaghi below shows a factor of 10-20 improvement now in gfortran 9.0.0 including the work by Thomas Koenig. $ ./a.out Elapsed CPU time = 0.337644997 Elapsed CPU time = 0.292241003 Elapsed CPU time = 0.265654006 Here is the code: program testme use, intrinsic :: iso_fortran_env implicit none integer(int32), parameter :: n = 200 real(real32) :: a(n,n,n), b(n,n,n) integer(int32) :: j, k real(real64) :: t1, t2 call random_number(a) do k = 1, 3 call cpu_time ( t1 ) do j = 1, 100 b = cshift(a, shift=1, DIM=k) end do call cpu_time ( t2 ) write ( *, * ) 'Elapsed CPU time = ', t2-t1 end do end program testme
[Bug fortran/52473] CSHIFT slow - inline it?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 --- Comment #17 from Thomas Koenig --- Large-scale cshift is OK now. The original test case, or inlining something like b = cshift(a,-1) - 2*a + cshift(a,1) to efficiently calculate a second derivative on a circular array would be quite nice.
[Bug fortran/52473] CSHIFT slow - inline it?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 --- Comment #16 from Dominique d'Humieres --- For the original test compiled with -Ofast, I get Testing explicit DO loops Dim =1 Elapsed CPU time = 0.846953988 Dim =2 Elapsed CPU time = 0.724469006 Dim =3 Elapsed CPU time = 0.680019855 Testing built-in cshift Dim =1 Elapsed CPU time = 0.506424904 Dim =2 Elapsed CPU time = 0.391590118 Dim =3 Elapsed CPU time = 0.352367163 An for the IDRIS test: = Call to test_eoshift = Order of matrix:1000 test_eoshift> 1) EOSHIFT Used CPU time ==> 2.976 ms test__eoshift> 2) DO loop Used CPU time ==> 0.832 ms Results OK Call to test_cshift Order of matrix:1000 test_cshift> 1) CSHIFT Used CPU time ==> 1.377 ms test__cshift> 2) DO loop Used CPU time ==> 1.051 ms Results OK = Call to test_reshape = Order of matrix:1000 test__reshape> 1) RESHAPE Used CPU time ==> 4.909 ms test__reshape> 2) DO loop Used CPU time ==> 0.877 ms Results OK May be this PR could be closed.
[Bug fortran/52473] CSHIFT slow - inline it?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 Thomas Koenig changed: What|Removed |Added Status|ASSIGNED|NEW Assignee|tkoenig at gcc dot gnu.org |unassigned at gcc dot gnu.org --- Comment #15 from Thomas Koenig --- What was done so far was an improvement for cshifts of large arrays. The code in the original program has quite small arrays, so inlining would actually make sense there. I don't think I will start this any time soon; unassigning.
[Bug fortran/52473] CSHIFT slow - inline it?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 --- Comment #14 from Thomas Koenig --- Author: tkoenig Date: Sat Jun 24 07:07:56 2017 New Revision: 249620 URL: https://gcc.gnu.org/viewcvs?rev=249620=gcc=rev Log: 2017-06-24 Thomas KoenigPR fortran/52473 * Makefile.am: Add i_cshift1a_c. Add rules to generate files from cshift1a.m4. * Makefile.in: Regenerated. * m4/cshift1a.m4: New file. * m4/cshift.m4 (cshift1): Split up inner loop by removing condition. Use memcpy where possible. Call helper functions based on dtype. * libgfortran.h: Add prototypes for cshift1_16_c10, cshift1_16_c16, cshift1_16_c4, cshift1_16_c8, cshift1_16_i1, cshift1_16_i16, cshift1_16_i2, cshift1_16_i4, cshift1_16_i8, cshift1_16_r10, cshift1_16_r16, cshift1_16_r4, cshift1_16_r8, cshift1_4_c10, cshift1_4_c16, cshift1_4_c4, cshift1_4_c8, cshift1_4_i1, cshift1_4_i16, cshift1_4_i2, cshift1_4_i4, cshift1_4_i8, cshift1_4_r10, cshift1_4_r16, cshift1_4_r4, cshift1_4_r8, cshift1_8_c10, cshift1_8_c16, cshift1_8_c4, cshift1_8_c8, cshift1_8_i1, cshift1_8_i16, cshift1_8_i2, cshift1_8_i4, cshift1_8_i8, cshift1_8_r10, cshift1_8_r16, cshift1_8_r4 and cshift1_8_r8. * generated/cshift1_16_c10.c: New file, generated from cshift1a.m4. * generated/cshift1_16_c16.c: New file, generated from cshift1a.m4. * generated/cshift1_16_c4.c: New file, generated from cshift1a.m4. * generated/cshift1_16_c8.c: New file, generated from cshift1a.m4. * generated/cshift1_16_i1.c: New file, generated from cshift1a.m4. * generated/cshift1_16_i16.c: New file, generated from cshift1a.m4. * generated/cshift1_16_i2.c: New file, generated from cshift1a.m4. * generated/cshift1_16_i4.c: New file, generated from cshift1a.m4. * generated/cshift1_16_i8.c: New file, generated from cshift1a.m4. * generated/cshift1_16_r10.c: New file, generated from cshift1a.m4. * generated/cshift1_16_r16.c: New file, generated from cshift1a.m4. * generated/cshift1_16_r4.c: New file, generated from cshift1a.m4. * generated/cshift1_16_r8.c: New file, generated from cshift1a.m4. * generated/cshift1_4_c10.c: New file, generated from cshift1a.m4. * generated/cshift1_4_c16.c: New file, generated from cshift1a.m4. * generated/cshift1_4_c4.c: New file, generated from cshift1a.m4. * generated/cshift1_4_c8.c: New file, generated from cshift1a.m4. * generated/cshift1_4_i1.c: New file, generated from cshift1a.m4. * generated/cshift1_4_i16.c: New file, generated from cshift1a.m4. * generated/cshift1_4_i2.c: New file, generated from cshift1a.m4. * generated/cshift1_4_i4.c: New file, generated from cshift1a.m4. * generated/cshift1_4_i8.c: New file, generated from cshift1a.m4. * generated/cshift1_4_r10.c: New file, generated from cshift1a.m4. * generated/cshift1_4_r16.c: New file, generated from cshift1a.m4. * generated/cshift1_4_r4.c: New file, generated from cshift1a.m4. * generated/cshift1_4_r8.c: New file, generated from cshift1a.m4. * generated/cshift1_8_c10.c: New file, generated from cshift1a.m4. * generated/cshift1_8_c16.c: New file, generated from cshift1a.m4. * generated/cshift1_8_c4.c: New file, generated from cshift1a.m4. * generated/cshift1_8_c8.c: New file, generated from cshift1a.m4. * generated/cshift1_8_i1.c: New file, generated from cshift1a.m4. * generated/cshift1_8_i16.c: New file, generated from cshift1a.m4. * generated/cshift1_8_i2.c: New file, generated from cshift1a.m4. * generated/cshift1_8_i4.c: New file, generated from cshift1a.m4. * generated/cshift1_8_i8.c: New file, generated from cshift1a.m4. * generated/cshift1_8_r10.c: New file, generated from cshift1a.m4. * generated/cshift1_8_r16.c: New file, generated from cshift1a.m4. * generated/cshift1_8_r4.c: New file, generated from cshift1a.m4. * generated/cshift1_8_r8.c: New file, generated from cshift1a.m4. 2017-06-24 Thomas Koenig PR fortran/52473 * gfortran.dg/cshift_2.f90: New test. Added: trunk/gcc/testsuite/gfortran.dg/cshift_2.f90 trunk/libgfortran/generated/cshift1_16_c10.c trunk/libgfortran/generated/cshift1_16_c16.c trunk/libgfortran/generated/cshift1_16_c4.c trunk/libgfortran/generated/cshift1_16_c8.c trunk/libgfortran/generated/cshift1_16_i1.c trunk/libgfortran/generated/cshift1_16_i16.c trunk/libgfortran/generated/cshift1_16_i2.c trunk/libgfortran/generated/cshift1_16_i4.c trunk/libgfortran/generated/cshift1_16_i8.c trunk/libgfortran/generated/cshift1_16_r10.c trunk/libgfortran/generated/cshift1_16_r16.c trunk/libgfortran/generated/cshift1_16_r4.c
[Bug fortran/52473] CSHIFT slow - inline it?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 --- Comment #13 from Thomas Koenig --- Author: tkoenig Date: Sun Jun 18 18:04:19 2017 New Revision: 249350 URL: https://gcc.gnu.org/viewcvs?rev=249350=gcc=rev Log: 2017-06-18 Thomas KoenigPR fortran/52473 * m4/cshift0.m4: For arrays that are contiguous up to shift, implement blocked algorighm for cshift. * generated/cshift0_c10.c: Regenerated. * generated/cshift0_c16.c: Regenerated. * generated/cshift0_c4.c: Regenerated. * generated/cshift0_c8.c: Regenerated. * generated/cshift0_i1.c: Regenerated. * generated/cshift0_i16.c: Regenerated. * generated/cshift0_i2.c: Regenerated. * generated/cshift0_i4.c: Regenerated. * generated/cshift0_i8.c: Regenerated. * generated/cshift0_r10.c: Regenerated. * generated/cshift0_r16.c: Regenerated. * generated/cshift0_r4.c: Regenerated. * generated/cshift0_r8.c: Regenerated. 2017-06-18 Thomas Koenig PR fortran/52473 * gfortran.dg/cshift_1.f90: New test. Added: trunk/gcc/testsuite/gfortran.dg/cshift_1.f90 Modified: trunk/gcc/testsuite/ChangeLog trunk/libgfortran/ChangeLog trunk/libgfortran/generated/cshift0_c10.c trunk/libgfortran/generated/cshift0_c16.c trunk/libgfortran/generated/cshift0_c4.c trunk/libgfortran/generated/cshift0_c8.c trunk/libgfortran/generated/cshift0_i1.c trunk/libgfortran/generated/cshift0_i16.c trunk/libgfortran/generated/cshift0_i2.c trunk/libgfortran/generated/cshift0_i4.c trunk/libgfortran/generated/cshift0_i8.c trunk/libgfortran/generated/cshift0_r10.c trunk/libgfortran/generated/cshift0_r16.c trunk/libgfortran/generated/cshift0_r4.c trunk/libgfortran/generated/cshift0_r8.c trunk/libgfortran/m4/cshift0.m4
[Bug fortran/52473] CSHIFT slow - inline it?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 --- Comment #12 from Thomas Koenig --- Created attachment 41542 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41542=edit Test case which still fails
[Bug fortran/52473] CSHIFT slow - inline it?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 --- Comment #11 from Thomas Koenig --- Created attachment 41541 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41541=edit Patch for the library, not yet quite correct This is a patch which brings a dramatic speedup for any cshift with dim>1, but it also causes a regression for a test case which I will attach next.
[Bug fortran/52473] CSHIFT slow - inline it?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 --- Comment #10 from Thomas Koenig --- Also see the discussion at https://groups.google.com/forum/#!topic/comp.lang.fortran/AI0F1Vpkc3I There is one thing that I do not understand. For the following test code, which compares straightforward DO loops for shifting a three-dimensional array by 1, the library version is actually faster than straightforward loops for dim=1. module kinds integer, parameter :: sp = selected_real_kind(6) ! Single precision integer, parameter :: dp = selected_real_kind(15) ! Double precision end module kinds module replacements use kinds contains subroutine cshift_sp_3_v1 (array, shift, dim, res) integer, parameter :: wp = sp real(kind=wp), dimension(:,:,:), intent(in), contiguous :: array integer, intent(in) :: shift, dim real(kind=wp), dimension(:,:,:), intent(out), contiguous :: res integer :: i,j,k integer :: sh, rsh integer :: n res = 0 if (dim == 1) then n = size(array,1) sh = modulo(shift, n) rsh = n - sh do k=1, size(array,3) do j=1, size(array,2) do i=1, rsh res(i,j,k) = array(i+sh,j,k) end do do i=rsh+1,n res(i,j,k) = array(i-rsh,j,k) end do end do end do else if (dim == 2) then n = size(array,2) sh = modulo(shift,n) rsh = n - sh do k=1, size(array,3) do j=1, rsh do i=1, size(array,1) res(i,j,k) = array(i,j+sh, k) end do end do do j=rsh+1, n do i=1, size(array,1) res(i,j,k) = array(i,j-rsh, k) end do end do end do else if (dim == 3) then n = size(array,3) sh = modulo(shift, n) rsh = n - sh do k=1, rsh do j=1, size(array,2) do i=1, size(array,1) res(i,j,k) = array(i, j, k+sh) end do end do end do do k=rsh+1, n do j=1, size(array,2) do i=1, size(array,1) res(i,j, k) = array(i, j, k-rsh) end do end do end do else stop "Wrong argument to dim" end if end subroutine cshift_sp_3_v1 end module replacements program testme use kinds use replacements implicit none integer, parameter :: wp = sp ! Working precision INTEGER, PARAMETER :: n = 200 real(kind=wp) :: a(n,n,n), b(n,n,n) integer i, j, k real t1, t2 print *,"Testing explicit DO loops" call random_number(a) do k = 1,3 call cpu_time ( t1 ) do j = 1, 100 call cshift_sp_3_v1 (a, 1, k, b) end do call cpu_time ( t2 ) write ( *, * ) 'Dim = ', k, ' Elapsed CPU time = ', t2-t1 end do print *,"Testing built-in cshift" do k = 1,3 call cpu_time ( t1 ) do j = 1, 100 b = cshift(a,1,k) end do call cpu_time ( t2 ) write ( *, * ) 'Dim = ', k, ' Elapsed CPU time = ', t2-t1 end do end program testme
[Bug fortran/52473] CSHIFT slow - inline it?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 --- Comment #9 from Dominique d'Humieres --- > How often or how important is this feature? Would it improve any known > benchmark or real world application in a significant manner? Egg and chicken problem! If CSHIFT is slow, nobody will use it. If nobody use it, nobody will be motivated to make it fast.
[Bug fortran/52473] CSHIFT slow - inline it?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 --- Comment #8 from Thomas Koenig --- (In reply to Jerry DeLisle from comment #7) > (In reply to Thomas Koenig from comment #6) > > Created attachment 41508 [details] > > What an unrolled cshift could look like > > > > This is what an unrolled version of cshift could look like, > > for a simple one-dimensional case. > > > > If the shifts are constants, all the if statements are > > optimized away at compile-time, so it is quite efficient. > > How often or how important is this feature? Would it improve any known > benchmark or real world application in a significant manner? There's protein.f90 - see PR 52934. Calculating a numerical derivative for a periodic function as diff = (cshift(a,1) - cshift(a,-1))/(2*h) seems natural, or the second derivative as d2 = (cshift(a,1) - 2*a + cshift(a,-1))/(h**2) Of course, to be really useful, this would also have to be extended to eoshift...
[Bug fortran/52473] CSHIFT slow - inline it?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 Jerry DeLisle changed: What|Removed |Added CC||jvdelisle at gcc dot gnu.org --- Comment #7 from Jerry DeLisle --- (In reply to Thomas Koenig from comment #6) > Created attachment 41508 [details] > What an unrolled cshift could look like > > This is what an unrolled version of cshift could look like, > for a simple one-dimensional case. > > If the shifts are constants, all the if statements are > optimized away at compile-time, so it is quite efficient. How often or how important is this feature? Would it improve any known benchmark or real world application in a significant manner?
[Bug fortran/52473] CSHIFT slow - inline it?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 --- Comment #6 from Thomas Koenig --- Created attachment 41508 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41508=edit What an unrolled cshift could look like This is what an unrolled version of cshift could look like, for a simple one-dimensional case. If the shifts are constants, all the if statements are optimized away at compile-time, so it is quite efficient.
[Bug fortran/52473] CSHIFT slow - inline it?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 Thomas Koenig changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |tkoenig at gcc dot gnu.org --- Comment #5 from Thomas Koenig --- Created attachment 41394 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41394=edit Benchmark of straight DO loops vs. library version Some numbers from https://groups.google.com/forum/#!topic/comp.lang.fortran/AI0F1Vpkc3I show that using straight DO loops could both help and hurt: $ ./a.out Testing explicit DO loops Dim =1 Elapsed CPU time =2.82861114 Dim =2 Elapsed CPU time =2.93245506 Dim =3 Elapsed CPU time =2.94523525 Testing built-in cshift Dim =1 Elapsed CPU time =1.65619278 Dim =2 Elapsed CPU time =2.80988693 Dim =3 Elapsed CPU time =7.13671684 I'll see what could be done using an explicit call to memcpy for the innermost loops.
[Bug fortran/52473] CSHIFT slow - inline it?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 --- Comment #4 from Thomas Koenig --- Looking at the code, inlining cshift for a constant shift could already be a good idea. So, change b = cshift(b,1) + cshift(b,-1) to a = b b(1) = a(2) + a(n) b(2:n-1) = a(1:n-2) + a(3:n) b(n) = a(1) + a(n-1)
[Bug fortran/52473] CSHIFT slow - inline it?
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 Dominique d'Humieres dominiq at lps dot ens.fr changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2013-06-16 Ever confirmed|0 |1 --- Comment #3 from Dominique d'Humieres dominiq at lps dot ens.fr --- I don't have access to pathf95, but the following test (modified for some IDRIS benchmarks) subroutine test_cshift() !- ! Test de la fonction intrinsèque cshift !- implicit none integer, parameter :: n=ordre integer:: i,j integer, dimension(n) :: vect=(/ (-i, i=1,n) /) real(kind=prec),dimension(n,n) :: t,td,t2,t1 real(kind=prec):: cste call impression_entete(test_cshift) print *, Order of matrix:, n ! Initialisations -- cste=exp(1._prec) t(:,:) = reshape( source=(/ (i*cste,i=1,n*n) /) , shape = (/ n,n /) ) td(:,:) = 0._prec call cpu_time( val_temps(1) ) td(:,:) = cshift(array=t(:,:), shift=vect(:), dim=1) call cpu_time( val_temps(2) ) t1(:,:) = td(:,:) call temps_consomme(test_cshift 1) CSHIFT) ! Décalage selon les lignes via des boucles td(:,:) = 0._prec call cpu_time( val_temps(1) ) do j=1,n do i=n+vect(j),1,-1 td(i-vect(j),j) = t(i,j) end do do i=1,-vect(j) td(i,j)=t(n+vect(j)+i,j) end do end do call cpu_time( val_temps(2) ) t2(:,:) = td(:,:) call temps_consomme(test__cshift 2) DO loop) if (sum(abs(t2-t1)) /= 0._prec) then print *,'Mauvais résultats ',sum(abs(t2-t1)) else print *,'Results OK' end if end subroutine test_cshift gives Call to test_cshift Order of matrix:1000 test_cshift 1) CSHIFT Used CPU time == 7.624 ms test__cshift 2) DO loop Used CPU time == 4.482 ms Results OK So CSHIFT is 50% slower than the DO loop. Note that I get similar results for EOSHIFT and RESHAPE: = Call to test_eoshift = Order of matrix:1000 test_eoshift 1) EOSHIFT Used CPU time == 7.443 ms test__eoshift 2) DO loop Used CPU time == 3.775 ms Results OK = Call to test_reshape = Order of matrix:1000 test__reshape 1) RESHAPE Used CPU time ==10.624 ms test__reshape 2) DO loop Used CPU time == 4.442 ms Results OK PR45689 should probably fixed at the same time.
[Bug fortran/52473] CSHIFT slow - inline it?
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 Tobias Burnus burnus at gcc dot gnu.org changed: What|Removed |Added CC||matz at gcc dot gnu.org --- Comment #2 from Tobias Burnus burnus at gcc dot gnu.org 2012-04-11 12:51:34 UTC --- *** Bug 52934 has been marked as a duplicate of this bug. ***
[Bug fortran/52473] CSHIFT slow - inline it?
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52473 Tobias Burnus burnus at gcc dot gnu.org changed: What|Removed |Added CC||burnus at gcc dot gnu.org --- Comment #1 from Tobias Burnus burnus at gcc dot gnu.org 2012-04-10 15:22:30 UTC --- Cf. also http://gcc.gnu.org/ml/fortran/2012-04/msg00041.html