[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067

Dominique d'Humieres changed:

           What    |Removed                     |Added
             Status|WAITING                     |RESOLVED
         Resolution|---                         |FIXED

--- Comment #48 from Dominique d'Humieres ---
> Is there anything left in this PR or could it be closed as FIXED?

No feedback, closing. Please open new PR(s) for remaining issue(s).
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067

Dominique d'Humieres changed:

           What    |Removed                     |Added
             Status|NEW                         |WAITING

--- Comment #47 from Dominique d'Humieres ---
Is there anything left in this PR or could it be closed as FIXED?
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067

--- Comment #46 from Tobias Burnus 2011-07-29 07:12:47 UTC ---
(In reply to comment #45)
[Commit to inline MINLOC/MAXLOC for a rank-1 array, which returns a single-element rank-1 array.]

On my ~5 year old Athlon64 x2, I get with "-Ofast -march=native" (and with or without -fwhole-program) a performance improvement of 3% (10.72s -> 10.41s).

The performance should further improve if one fuses the loops (cf. comment 42, but also comment 34 ff.) and if one could move the memory allocation/freeing of the automatic array DTEMP out of the loop (after inlining). (Recall that with -fstack-arrays/-Ofast, automatic arrays are allocated on the stack.)

As already mentioned indirectly in comment 0 (via PR31066): If one uses reciprocal approximation instructions and a Newton-Raphson step (namely: -mrecip), the performance improves a lot: 6.895s. By comparison, with ifort (11.1) -xHost -O3 the run time is 7.319s.
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067 --- Comment #45 from Jakub Jelinek 2011-07-28 20:56:57 UTC --- Author: jakub Date: Thu Jul 28 20:56:50 2011 New Revision: 176897 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=176897 Log: PR fortran/31067 * frontend-passes.c (optimize_minmaxloc): New function. (optimize_expr): Call it. * gfortran.dg/maxloc_2.f90: New test. * gfortran.dg/maxloc_3.f90: New test. * gfortran.dg/minloc_1.f90: New test. * gfortran.dg/minloc_2.f90: New test. * gfortran.dg/minloc_3.f90: New test. * gfortran.dg/minmaxloc_7.f90: New test. Added: trunk/gcc/testsuite/gfortran.dg/maxloc_2.f90 trunk/gcc/testsuite/gfortran.dg/maxloc_3.f90 trunk/gcc/testsuite/gfortran.dg/minloc_1.f90 trunk/gcc/testsuite/gfortran.dg/minloc_2.f90 trunk/gcc/testsuite/gfortran.dg/minloc_3.f90 trunk/gcc/testsuite/gfortran.dg/minmaxloc_7.f90 Modified: trunk/gcc/fortran/ChangeLog trunk/gcc/fortran/frontend-passes.c trunk/gcc/testsuite/ChangeLog
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067

Jakub Jelinek changed:

           What    |Removed                     |Added
  Attachment #24856|0                           |1
        is obsolete|                            |

--- Comment #44 from Jakub Jelinek 2011-07-28 16:00:54 UTC ---
Created attachment 24858
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24858
gcc47-pr31067.patch

Based on IRC discussions, this patch instead replaces MINLOC (rank1) with /( MINLOC (rank1, DIM=1) )/ so that it should work even with an allocatable LHS that should be reallocated on assignment etc. As a bonus, even e.g. if (any (minloc (rank1).ne.6)) etc. can be optimized. So far tested just with dg.exp=m*.f90
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067 --- Comment #43 from Jakub Jelinek 2011-07-28 13:51:32 UTC --- Created attachment 24856 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24856 gcc47-pr31067.patch Patch to optimize a = minloc (b) for rank 1 b into a = minloc (b, dim = 1), according to the standard the latter function is supposed to return the first element of the array returned by former function (which should return a rank 1, 1 element array). So, by instead initializing the (one element) array with the scalar we can get it actually inlined. Doing this during genericization looked much harder to me.
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067 --- Comment #42 from Richard Guenther 2011-07-25 15:39:54 UTC --- With gas_dyn changed to use MINLOC (DTEMP, 1) we now inline the intrinsic (but not with MINLOC (DTEMP), even though we know it'll be a single-element array result ...). We completely lack a way to fuse the loops though. Inlining the intrinsic gives a moderate 5% speedup.
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #41 from irar at il dot ibm dot com 2009-07-28 08:12 --- That requires pattern recognition. MIN/MAX_EXPR are recognized by the first phiopt pass, so MIN/MAXLOC should be either also recognized there or in the vectorizer. (The phiopt pass transforms if clause to MIN/MAX_EXPR. The vectorizer gets COND_EXPR after if-conversion pass). -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #40 from jakub at gcc dot gnu dot org 2009-07-27 14:51 --- If the cond_expr compute a minimum or maximum and the other cond_exprs compute something based on the IV at the extremum, then I don't see why it couldn't be vectorized by computing extremes of odd/even and corresponding values based on the IV at those points, then merging them in the final step, and similarly for bigger vectorization steps. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
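A scalar C model may make the idea in comment #40 easier to follow. This is only a sketch under stated assumptions, not what the vectorizer emits: the name `minloc_two_lanes` is hypothetical, the two "lanes" stand in for a two-element vector (even and odd indices), and the array is assumed to be free of NaNs. Each lane tracks its own extremum and the position derived from the IV; the lanes are merged in a final step, with the smaller position winning on ties so the first occurrence is still reported.

```c
#include <float.h>
#include <stddef.h>

/* Scalar model of a two-lane vectorization (hypothetical helper):
   lane 0 scans the even indices, lane 1 the odd ones, and the two
   partial results are merged after the loop.  Positions are 1-based,
   0 means "no element smaller than FLT_MAX was seen".  */
static size_t
minloc_two_lanes (const float *arr, size_t n)
{
  float limit[2] = { FLT_MAX, FLT_MAX };
  size_t pos[2] = { 0, 0 };
  size_t i;

  for (i = 0; i + 1 < n; i += 2)
    for (int lane = 0; lane < 2; lane++)
      if (arr[i + lane] < limit[lane])
        {
          limit[lane] = arr[i + lane];
          pos[lane] = i + lane + 1;
        }

  /* Merge step: the smaller value wins; on equal values the smaller
     (non-zero) position wins, which preserves first-occurrence order.  */
  float best = limit[0];
  size_t best_pos = pos[0];
  if (limit[1] < best
      || (limit[1] == best && pos[1] != 0 && pos[1] < best_pos))
    {
      best = limit[1];
      best_pos = pos[1];
    }

  /* Scalar epilogue for a trailing element when n is odd.  */
  for (; i < n; i++)
    if (arr[i] < best)
      {
        best = arr[i];
        best_pos = i + 1;
      }

  return best_pos;
}
```

The same merge would be repeated log2(width) times for wider vectors, which is roughly what the hand-written reduction in comment #36 does with shuffles.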
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #39 from burnus at gcc dot gnu dot org 2009-07-27 13:15 ---
(In reply to comment #38)
> However, the loop can be split: [..]
> making the first loop vectorizable (inner-most loop vectorization).

OK. I tried it with a Fortran program:
  http://users.physik.fu-berlin.de/~tburnus/tmp/vect-PR31067/maxloc.f90
  http://users.physik.fu-berlin.de/~tburnus/tmp/vect-PR31067/maxloc2.f90
  http://users.physik.fu-berlin.de/~tburnus/tmp/vect-PR31067/maxloc3.f90

maxloc.f90 is the program from comment 34 (maxloc.s = intel assembler)
maxloc2.f90 models what gfortran does for maxloc (maxloc.s = intel assembler)
maxloc3.f90 models the version with a split loop

The splitting plus vectorization makes the calculation 5% faster: 0m2.152s (maxloc3) vs 0m2.260s (maxloc2). Still, that's 35% more than ifort needs.

For some reason, maxloc2 with -fno-tree-vectorize takes only 0m1.840s. (Identical to intel's result for maxloc2/maxloc3. While for the original maxloc.f90, there is no performance effect, and for maxloc3 vectorization makes it faster.)

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #38 from irar at il dot ibm dot com 2009-07-27 12:44 ---
I am not sure that that kind of computation can be generated automatically, since in general the order of calculation of cond_expr cannot be changed.

However, the loop can be split:

  for (i = 0; i < end; i++)
    if (arr[i] < limit)
      limit = arr[i];

  for (i = 0; i < end; i++)
    if (arr[i] == limit)
      {
        pos = i + 1;
        break;
      }

making the first loop vectorizable (inner-most loop vectorization).

Ira

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
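The split described in comment #38 can be written out as two complete functions to compare against the fused loop. This is a minimal sketch: the names `minloc_fused` and `minloc_split` are made up for illustration, positions are 1-based as in Fortran, and the input is assumed to contain no NaNs. Note one difference that follows directly from the code: if every element equals FLT_MAX, the fused form never updates `pos` and returns 0, while the split form finds a match at the first element and returns 1.

```c
#include <float.h>
#include <stddef.h>

/* Fused form: one loop carries both the running minimum and its
   position, i.e. two dependent conditional reductions.  */
static size_t
minloc_fused (const float *arr, size_t n)
{
  float limit = FLT_MAX;
  size_t pos = 0;
  for (size_t i = 0; i < n; i++)
    if (arr[i] < limit)
      {
        limit = arr[i];
        pos = i + 1;
      }
  return pos;
}

/* Split form: the first loop is a plain min reduction, the second
   loop searches for the first element equal to that minimum.  */
static size_t
minloc_split (const float *arr, size_t n)
{
  float limit = FLT_MAX;
  for (size_t i = 0; i < n; i++)
    if (arr[i] < limit)
      limit = arr[i];

  for (size_t i = 0; i < n; i++)
    if (arr[i] == limit)
      return i + 1;
  return 0;
}
```

The cost of the split is a second pass over the array, which is why the timings reported later in the thread show only a small gain.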
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #37 from jakub at gcc dot gnu dot org 2009-07-27 11:10 --- Oh, and on 64-bit arches and float or 32-bit arches and double there is another complication - the comparison has different mode size from the cond_expr for pos. For 32-bit pos and 64-bit double it could perhaps just do the computation in 64-bit integers (vector of 2 (resp. 4 for avx)), for the other case it would need to shuffle the max and compute the pos in 2 vectors, or e.g. the Fortran FE could hand in the common case, emit a likely look for the case where array size is smaller than 4GB, using 32-bit pos zero extended to 64-bits at the end. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
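The last suggestion in comment #37 (track the position in 32 bits when the array is small enough, then zero-extend at the end) might look like the sketch below. The function name `minloc_float` and the exact size test are assumptions made for illustration; the comment itself only speaks of arrays smaller than 4GB. The point is that inside the fast path the compared value (`float`) and the position (`uint32_t`) have the same width, which is what a vector select wants.

```c
#include <float.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: minloc over floats returning a 64-bit 1-based
   position, with a fast path that keeps the position in 32 bits.  */
static uint64_t
minloc_float (const float *arr, uint64_t n)
{
  if (n < UINT32_MAX)
    {
      /* Common case: value and position are both 32 bits wide.  */
      float limit = FLT_MAX;
      uint32_t pos = 0;
      for (uint32_t i = 0; i < (uint32_t) n; i++)
        if (arr[i] < limit)
          {
            limit = arr[i];
            pos = i + 1;
          }
      return (uint64_t) pos;   /* zero-extend once, at the end */
    }

  /* Fallback for very large arrays: 64-bit position throughout.  */
  float limit = FLT_MAX;
  uint64_t pos = 0;
  for (uint64_t i = 0; i < n; i++)
    if (arr[i] < limit)
      {
        limit = arr[i];
        pos = i + 1;
      }
  return pos;
}
```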
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #36 from jakub at gcc dot gnu dot org 2009-07-27 11:02 ---
Here is the loop in C and vectorized by hand as well:

#include <emmintrin.h>

float arr[1024];

unsigned int
foo (unsigned int end)
{
  unsigned int pos = 1;
  unsigned int i;
  float limit = __FLT_MAX__;
  for (i = 0; i < end; i++)
    if (arr[i] < limit)
      {
        limit = arr[i];
        pos = i + 1;
      }
  return pos;
}

unsigned int
bar (unsigned int end)
{
  __m128 pos = (__m128) _mm_set1_epi32 (1);
  __m128 limit = _mm_set1_ps (__FLT_MAX__);
  __m128i curi = _mm_set_epi32 (4, 3, 2, 1);
  __m128i inc = _mm_set1_epi32 (4);
  unsigned int i = 0;
  if (end > 4)
    {
      for (; i < end - 4; i += 4)
        {
          __m128 val = _mm_loadu_ps (arr + i);
          __m128 mask = _mm_cmplt_ps (val, limit);
          limit = _mm_min_ps (limit, val);
          pos = _mm_andnot_ps (mask, pos);
          pos = _mm_or_ps (pos, _mm_and_ps (mask, (__m128) curi));
          curi = _mm_add_epi32 (curi, inc);
        }
      /* Reduction.  */
      __m128 tmp1 = _mm_movehl_ps (limit, limit);
      __m128 tmp2 = _mm_movehl_ps (pos, pos);
      __m128 mask = _mm_cmplt_ps (tmp1, limit);
      limit = _mm_min_ps (tmp1, limit);
      tmp2 = _mm_and_ps (mask, tmp2);
      pos = _mm_or_ps (tmp2, _mm_andnot_ps (mask, pos));
      tmp1 = _mm_shuffle_ps (limit, limit, _MM_SHUFFLE (1, 1, 1, 1));
      tmp2 = _mm_shuffle_ps (pos, pos, _MM_SHUFFLE (1, 1, 1, 1));
      mask = _mm_cmplt_ps (tmp1, limit);
      limit = _mm_min_ps (tmp1, limit);
      tmp2 = _mm_and_ps (mask, tmp2);
      pos = _mm_or_ps (tmp2, _mm_andnot_ps (mask, pos));
    }
  float limit_ = _mm_cvtss_f32 (limit);
  unsigned int pos_ = (unsigned int) _mm_cvtsi128_si32 ((__m128i) pos);
  for (; i < end; i++)
    if (arr[i] < limit_)
      {
        limit_ = arr[i];
        pos_ = i + 1;
      }
  return pos_;
}

int
main (void)
{
  unsigned int k;
  arr[0] = -1;
  arr[2] = -3;
  arr[8] = -5;
  arr[9] = -6;
  if (foo (32) != bar (32))
    __builtin_abort ();
  for (k = 10; k < 32; k++)
    {
      arr[k] = -k;
      if (foo (32) != bar (32))
        __builtin_abort ();
    }
  return 0;
}

Don't know how hard it would be to vectorize this in the vectorizer, but clearly icc manages to handle that.
The loop is:

  :
  # pos_22 = PHI
  # i_23 = PHI
  # limit_24 = PHI
  limit_11 = arr[i_23];
  D.2700_12 = limit_11 < limit_24;
  pos_1 = [cond_expr] D.2700_12 ? i_23 : pos_22;
  limit_4 = [cond_expr] D.2700_12 ? limit_11 : limit_24;
  i_15 = i_23 + 1;
  D.2703_9 = (long unsigned int) i_15;
  if (D.2703_9 < end_10(D))
    goto ;
  else
    goto ;

  :
  goto ;

before vectorization.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #35 from burnus at gcc dot gnu dot org 2009-07-27 09:18 ---
(In reply to comment #34)
> Does ifort vectorize the exact same implementation of minloc?

I tried to convert the minloc implementation into Fortran loops - and the result is at http://users.physik.fu-berlin.de/~tburnus/tmp/vect-PR31067/maxloc2.f90

$ ifort -O3 -xHost -diag-enable all maxloc2.f90
maxloc2.f90(25): (col. 5) remark: LOOP WAS VECTORIZED.
[timing: 0m1.384s]

$ gfortran -O3 -ffast-math -march=native -ftree-vectorize -ftree-vectorizer-verbose=5 maxloc2.f90
maxloc2.f90:1: note: vectorized 0 loops in function.
[timing: 0m2.212s]

In case it helps: I put the ifort assembler output at: http://users.physik.fu-berlin.de/~tburnus/tmp/vect-PR31067/maxloc2.s

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #34 from irar at il dot ibm dot com 2009-07-27 08:36 ---
(In reply to comment #33)
> Using the example from comment 23 with
...
> gfortran shows: test.f90:12: note: not vectorized: unsupported use in stmt.
> and needs 2.272s. (By comparison. 4.4 needs 3.688s.)

This is for the inner loop vectorization. For the outer loop we get:

tmp.f90:11: note: not vectorized: control flow in loop.

because of the if's. Maybe loop unswitching can help us.

Vectorizable outer-loops look like this:

      (pre-header)
         |
        header <---+
         |         |
        inner-loop |
         |         |
        tail ------+
         |
      (exit-bb)

Does ifort vectorize the exact same implementation of minloc?

Ira

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #33 from burnus at gcc dot gnu dot org 2009-07-26 09:50 --- (In reply to comment #32) > > Regarding the just committed inline version: It would be interesting to know > > whether it is vectorizable (with/without -ffinite-math-only [i.e. > > -ffast-math]). > > It depends on where it is inlined. It has to be vectorized in outer loop (see > my previous comment), so it needs another loop around it. Using the example from comment 23 with a) gfortran -O3 -ffast-math -march=native -ftree-vectorize -ftree-vectorizer-verbose=5 b) ifort -O3 -xHost -diag-enable all ifort shows: test.f90(12): (col. 9) remark: LOOP WAS VECTORIZED. and needs 1.476s. gfortran shows: test.f90:12: note: not vectorized: unsupported use in stmt. and needs 2.272s. (By comparison. 4.4 needs 3.688s.) -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #32 from irar at il dot ibm dot com 2009-07-26 07:48 --- (In reply to comment #30) > Regarding the just committed inline version: It would be interesting to know > whether it is vectorizable (with/without -ffinite-math-only [i.e. > -ffast-math]). It depends on where it is inlined. It has to be vectorized in outer loop (see my previous comment), so it needs another loop around it. Ira -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #31 from jakub at gcc dot gnu dot org 2009-07-24 08:30 --- Vectorization questions I'll defer to Ira. For !optimize I even had a change to forcibly use the function call instead of inline version. But it didn't really work, as there are only array versions of the library functions. In case of minloc/maxloc we could just set up a fake array descriptor for one element array and call the library functions (the *loc0 ones). But minval/maxval library functions only handle DIM=N cases, not finding a minimum/maximum of the whole array. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #30 from burnus at gcc dot gnu dot org 2009-07-24 08:19 --- Regarding the just committed inline version: It would be interesting to know whether it is vectorizable (with/without -ffinite-math-only [i.e. -ffast-math]). Additionally, for size-1 result arrays, the function should be inlined except for -O0 and -Os. That affects especially {min,max}loc as for {min,max}val this looks like a very special case. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #29 from jakub at gcc dot gnu dot org 2009-07-24 07:57 --- Subject: Bug 31067 Author: jakub Date: Fri Jul 24 07:57:13 2009 New Revision: 150041 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=150041 Log: PR fortran/40643 PR fortran/31067 * trans-intrinsic.c (gfc_conv_intrinsic_minmaxloc, gfc_conv_intrinsic_minmaxval): Handle Infinities and NaNs properly, optimize. * trans-array.c (gfc_trans_scalarized_loop_end): No longer static. * trans-array.h (gfc_trans_scalarized_loop_end): New prototype. * libgfortran.h (GFC_REAL_4_INFINITY, GFC_REAL_8_INFINITY, GFC_REAL_10_INFINITY, GFC_REAL_16_INFINITY, GFC_REAL_4_QUIET_NAN, GFC_REAL_8_QUIET_NAN, GFC_REAL_10_QUIET_NAN, GFC_REAL_16_QUIET_NAN): Define. * m4/iparm.m4 (atype_inf, atype_nan): Define. * m4/ifunction.m4: Formatting. * m4/iforeach.m4: Likewise. (START_FOREACH_FUNCTION): Initialize dest to all 1s, not all 0s. (START_FOREACH_BLOCK, FINISH_FOREACH_FUNCTION, FINISH_MASKED_FOREACH_FUNCTION): Run foreach block inside a loop until count[0] == extent[0]. * m4/minval.m4: Formatting. Handle NaNs and infinities. Optimize. * m4/maxval.m4: Likewise. * m4/minloc0.m4: Likewise. * m4/maxloc0.m4: Likewise. * m4/minloc1.m4: Likewise. * m4/maxloc1.m4: Likewise. * generated/maxloc0_16_i16.c: Regenerated. * generated/maxloc0_16_i1.c: Likewise. * generated/maxloc0_16_i2.c: Likewise. * generated/maxloc0_16_i4.c: Likewise. * generated/maxloc0_16_i8.c: Likewise. * generated/maxloc0_16_r10.c: Likewise. * generated/maxloc0_16_r16.c: Likewise. * generated/maxloc0_16_r4.c: Likewise. * generated/maxloc0_16_r8.c: Likewise. * generated/maxloc0_4_i16.c: Likewise. * generated/maxloc0_4_i1.c: Likewise. * generated/maxloc0_4_i2.c: Likewise. * generated/maxloc0_4_i4.c: Likewise. * generated/maxloc0_4_i8.c: Likewise. * generated/maxloc0_4_r10.c: Likewise. * generated/maxloc0_4_r16.c: Likewise. * generated/maxloc0_4_r4.c: Likewise. * generated/maxloc0_4_r8.c: Likewise. * generated/maxloc0_8_i16.c: Likewise. 
* generated/maxloc0_8_i1.c: Likewise. * generated/maxloc0_8_i2.c: Likewise. * generated/maxloc0_8_i4.c: Likewise. * generated/maxloc0_8_i8.c: Likewise. * generated/maxloc0_8_r10.c: Likewise. * generated/maxloc0_8_r16.c: Likewise. * generated/maxloc0_8_r4.c: Likewise. * generated/maxloc0_8_r8.c: Likewise. * generated/maxloc1_16_i16.c: Likewise. * generated/maxloc1_16_i1.c: Likewise. * generated/maxloc1_16_i2.c: Likewise. * generated/maxloc1_16_i4.c: Likewise. * generated/maxloc1_16_i8.c: Likewise. * generated/maxloc1_16_r10.c: Likewise. * generated/maxloc1_16_r16.c: Likewise. * generated/maxloc1_16_r4.c: Likewise. * generated/maxloc1_16_r8.c: Likewise. * generated/maxloc1_4_i16.c: Likewise. * generated/maxloc1_4_i1.c: Likewise. * generated/maxloc1_4_i2.c: Likewise. * generated/maxloc1_4_i4.c: Likewise. * generated/maxloc1_4_i8.c: Likewise. * generated/maxloc1_4_r10.c: Likewise. * generated/maxloc1_4_r16.c: Likewise. * generated/maxloc1_4_r4.c: Likewise. * generated/maxloc1_4_r8.c: Likewise. * generated/maxloc1_8_i16.c: Likewise. * generated/maxloc1_8_i1.c: Likewise. * generated/maxloc1_8_i2.c: Likewise. * generated/maxloc1_8_i4.c: Likewise. * generated/maxloc1_8_i8.c: Likewise. * generated/maxloc1_8_r10.c: Likewise. * generated/maxloc1_8_r16.c: Likewise. * generated/maxloc1_8_r4.c: Likewise. * generated/maxloc1_8_r8.c: Likewise. * generated/maxval_i16.c: Likewise. * generated/maxval_i1.c: Likewise. * generated/maxval_i2.c: Likewise. * generated/maxval_i4.c: Likewise. * generated/maxval_i8.c: Likewise. * generated/maxval_r10.c: Likewise. * generated/maxval_r16.c: Likewise. * generated/maxval_r4.c: Likewise. * generated/maxval_r8.c: Likewise. * generated/minloc0_16_i16.c: Likewise. * generated/minloc0_16_i1.c: Likewise. * generated/minloc0_16_i2.c: Likewise. * generated/minloc0_16_i4.c: Likewise. * generated/minloc0_16_i8.c: Likewise. * generated/minloc0_16_r10.c: Likewise. * generated/minloc0_16_r16.c: Likewise. * generated/minloc0_16_r4.c: Likewise. 
* generated/minloc0_16_r8.c: Likewise. * generated/minloc0_4_i16.c: Likewise. * generated/minloc0_4_i1.c: Likewise. * generated/minloc0_4_i2.c: Likewise. * generated/minloc0_4_i4.c: Likewise. * generated/minloc0_4_i8.c: Li
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #28 from irar at il dot ibm dot com 2009-07-20 12:03 --- I've just committed a patch that adds support of cond_expr in reductions in nested cycles (http://gcc.gnu.org/ml/gcc-patches/2009-07/msg01124.html). cond_expr cannot be vectorized in reduction of inner-most loop, because such reduction changes the order of computation, and that cannot be done for cond_expr. Ira -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #27 from irar at il dot ibm dot com 2009-07-05 06:48 --- (In reply to comment #23) > because there are two reductions in that loop which I think the vectorizer > cannot handle: Actually, the vectorizer can vectorize two reductions. I think, the problem is in cond_expr in reduction: > pos.0_3 = [cond_expr] D.1599_29 ? pos.0_32 : pos.0_31; > limit.2_5 = [cond_expr] D.1599_29 ? limit.2_22 : limit.2_8; I'll look into it. Ira -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #26 from burnus at gcc dot gnu dot org 2009-07-03 13:07 --- > has a superfluous check || (pos.0 == 0 && (*D.1568)[S.3 + D.1569] == limit.2) > at least for flag_finite_math_only. If the array cannot contain Inf or NaN > then it either has all elements == FLT_MAX, so pos will stay zero, or at > least one is less than FLT_MAX in which case pos will be adjusted anyway. I have not checked whether algorithm requires the check; NaN/Inf are possible, but maybe the check is still not needed. And if it is, one could enclose it in a if(!flag_finite_math_only) condition. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #25 from rguenth at gcc dot gnu dot org 2009-07-03 12:57 ---
Btw, the inlined minloc

    D.1570 = a.dim[0].lbound;
    D.1571 = a.dim[0].ubound;
    pos.0 = 0;
    {
      integer(kind=8) S.3;

      ({
        S.3 = D.1570;
        while (1)
          {
            ({
              if (S.3 > D.1571) goto L.3;
              offset.1 = 1 - D.1570;
              if ((*D.1568)[S.3 + D.1569] < limit.2 || pos.0 == 0 && (*D.1568)[S.3 + D.1569] == limit.2)
                {
                  ({
                    limit.2 = (*D.1568)[S.3 + D.1569];
                    pos.0 = S.3 + offset.1;
                  }) void
                }
              S.3 = S.3 + 1;
            }) void
          }
        L.3:;
      }) void

has a superfluous check || (pos.0 == 0 && (*D.1568)[S.3 + D.1569] == limit.2) at least for flag_finite_math_only. If the array cannot contain Inf or NaN then it either has all elements == FLT_MAX, so pos will stay zero, or at least one is less than FLT_MAX in which case pos will be adjusted anyway.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #24 from burnus at gcc dot gnu dot org 2009-07-03 12:40 --- > One issue is that > ISET = MINLOC (DTEMP) > will cause GCC to assume that DTEMP is clobbered. The problem is that while "MINLOC" is pure, we cannot use DECL_PURE_P as the result is passed by reference: (void) minloc(&isset, DTEMP) ^^--- result -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #23 from rguenth at gcc dot gnu dot org 2009-07-03 12:19 ---
We are not able to vectorize the loop in

program main
  implicit none
  integer, volatile, dimension(1) :: n
  real, allocatable :: a(:)
  integer :: i
  real :: t1, t2
  allocate (a(100))
  call random_number(a) ! negligible time
  !call cpu_time(t1)
  do i=1, 1000
    n = minloc(a, dim=1)
  end do
  !call cpu_time(t2)
  print *, n !, t2-t1
end program main

because there are two reductions in that loop which I think the vectorizer cannot handle:

  :
  # pos.0_31 = PHI
  # limit.2_8 = PHI
  # S.3_74 = PHI
  D.1593_21 = S.3_74 + pretmp.22_77;
  limit.2_22 = (*D.1568_14)[D.1593_21];
  D.1595_23 = limit.2_22 < limit.2_8;
  D.1596_24 = pos.0_31 == 0;
  D.1597_27 = limit.2_8 == limit.2_22;
  D.1598_28 = D.1597_27 & D.1596_24;
  D.1599_29 = D.1595_23 | D.1598_28;
  pos.0_32 = S.3_74 + pretmp.28_90;
  pos.0_3 = [cond_expr] D.1599_29 ? pos.0_32 : pos.0_31;
  limit.2_5 = [cond_expr] D.1599_29 ? limit.2_22 : limit.2_8;
  S.3_33 = S.3_74 + 1;
  if (S.3_33 > pretmp.22_81)
    goto ;
  else
    goto ;

  :
  goto ;

we reduce over limit.2_5 and pos.0_3. The intel compiler vectorizes the function just fine.

-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
                 CC|                            |irar at il dot ibm dot com,
                   |                            |rguenth at gcc dot gnu dot
                   |                            |org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #22 from rguenth at gcc dot gnu dot org 2009-07-03 10:00 --- One issue is that ISET = MINLOC (DTEMP) will cause GCC to assume that DTEMP is clobbered. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #21 from dominiq at lps dot ens dot fr 2008-01-02 20:27 --- > MATMUL is one distinctly possible one Paul, If you are interested, I have a variant of induct.f90 in which I have replaced three dot-products by the matrix-vector product for a total disaster on all compilers I have tried (for gfortran from 66s to 196s). -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #20 from pault at gcc dot gnu dot org 2008-01-02 19:30 --- (In reply to comment #19) > gfortran does inline most array intrinsics, but only if the result is a > scalar. > For most array intrinsics this isn't that much of a problem since usually one > uses the variant that returns a scalar, but MINLOC is different in that > usually > one wants to use the version that returns an array. If one implements this I > guess it would be straightforward to replicate the solution to many other > array > intrinsics as well. > Janne, In contemplating what to do with gfortran in the New Year, I have been mulling over in-lining of array intrinsics; MATMUL is one distinctly possible one, as are MINLOC and MAXLOC. Cheers Paul -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #19 from jb at gcc dot gnu dot org 2007-06-27 14:49 --- gfortran does inline most array intrinsics, but only if the result is a scalar. For most array intrinsics this isn't that much of a problem since usually one uses the variant that returns a scalar, but MINLOC is different in that usually one wants to use the version that returns an array. If one implements this I guess it would be straightforward to replicate the solution to many other array intrinsics as well. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #18 from tkoenig at gcc dot gnu dot org 2007-06-15 20:35 --- Too little time right now. Unassigning myself. -- tkoenig at gcc dot gnu dot org changed: What|Removed |Added AssignedTo|tkoenig at gcc dot gnu dot |unassigned at gcc dot gnu |org |dot org Status|ASSIGNED|NEW http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #17 from jb at gcc dot gnu dot org 2007-05-18 21:20 ---
Or even better (duh):

      REAL :: DTEMP

      DT = HUGE(1.0d0)
      DO I = 1, NODES
         DTEMP = DX(I)/(ABS(VEL(I))+SOUND(I))
         IF (DTEMP < DT) THEN
            DT = DTEMP
         END IF
      END DO

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #16 from jb at gcc dot gnu dot org 2007-05-18 21:15 ---
The critical thing with inlining array intrinsics, IMHO, is to give the optimizer more data to work with, allowing it to get rid of temp arrays, perform loop fusion or fission etc. So with a trivial benchmark like #15, you don't see any difference, except with potentially higher optimization for user code than libgfortran.

For example in gas_dyn/chozdt:

      REAL, DIMENSION (NODES) :: DTEMP
!---
! Profile for gfortran 4.3:
!  CPU_CLK_UNHALTED  L2_CACHE_MISS
!  samp    %runtim   samp   %tot
!  59887   22.4783   1484   10.9828 :      DTEMP = DX/(ABS(VEL) + SOUND)
! ifort 9.1 profile
!  40104   16.2034   1198    8.8166 :      DTEMP = DX/(ABS(VEL) + SOUND)
      DTEMP = DX/(ABS(VEL) + SOUND)
      ISET = MINLOC (DTEMP)
      DT = DTEMP(ISET(1))

If MINLOC were inlined, perhaps a sufficiently smart optimizer could convert this into the equivalent

      REAL :: DTEMPMIN
      INTEGER :: i

      DTEMPMIN = HUGE(1.0d0)
      DO I = 1, NODES
         DT = DX(I)/(ABS(VEL(I))+SOUND(I))
         IF (DT < DTEMPMIN) THEN
            DTEMPMIN = DT
         END IF
      END DO
      DT = DTEMPMIN

i.e. avoid the temporary array entirely. Yes, I guess this is quite a lot to ask.

-- 

jb at gcc dot gnu dot org changed:

           What    |Removed                     |Added
                 CC|                            |jb at gcc dot gnu dot org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
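For readers more at home in C, the transformation asked for in comment #16 can be modelled as below. This is a rough sketch, not the gas_dyn source: the function names, the `float` element type, and the small `abs_f` helper are assumptions made for illustration. The first function mirrors the array statement followed by MINLOC and an indexed load; the second is the fused loop that never materializes the temporary.

```c
#include <float.h>
#include <stddef.h>

/* Local absolute-value helper, used so the sketch needs no libm.  */
static float
abs_f (float v)
{
  return v < 0.0f ? -v : v;
}

/* As written: fill the temporary DTEMP, locate its minimum, then
   read the minimum back through the position.  */
static float
dt_with_temporary (const float *dx, const float *vel, const float *sound,
                   float *dtemp, size_t nodes)
{
  for (size_t i = 0; i < nodes; i++)
    dtemp[i] = dx[i] / (abs_f (vel[i]) + sound[i]);

  size_t iset = 0;                 /* 0-based index of the minimum */
  for (size_t i = 1; i < nodes; i++)
    if (dtemp[i] < dtemp[iset])
      iset = i;

  return nodes ? dtemp[iset] : FLT_MAX;
}

/* Fused: one pass, no temporary array, only the running minimum.  */
static float
dt_fused (const float *dx, const float *vel, const float *sound,
          size_t nodes)
{
  float dt = FLT_MAX;
  for (size_t i = 0; i < nodes; i++)
    {
      float dtemp = dx[i] / (abs_f (vel[i]) + sound[i]);
      if (dtemp < dt)
        dt = dtemp;
    }
  return dt;
}
```

The fused form touches each input element once and writes nothing back to memory, which is where the L2 miss counts in the profile above suggest the time is going.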
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #15 from tkoenig at gcc dot gnu dot org 2007-04-02 21:00 ---
The library version doesn't do too badly compared to the inline version:

$ cat benchmark-inline.f90
program main
  implicit none
  integer, dimension(1) :: n
  real, allocatable :: a(:)
  integer :: i
  allocate (a(100))
  call random_number(a)
  do i=1, 1000
    n = minloc(a, dim=1)
  end do
end program main
$ cat benchmark-library.f90
program main
  implicit none
  integer, dimension(1) :: n
  real, allocatable :: a(:)
  integer :: i
  allocate (a(100))
  call random_number(a)
  do i=1, 1000
    n = minloc(a)
  end do
end program main
$ gfortran -O3 -static benchmark-inline.f90 && ./a.out && time ./a.out

real    0m6.941s
user    0m5.232s
sys     0m0.016s
$ gfortran -O3 -static benchmark-library.f90 && ./a.out && time ./a.out

real    0m5.720s
user    0m5.472s
sys     0m0.004s

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #14 from tkoenig at gcc dot gnu dot org 2007-04-02 17:44 --- I'll give this another shot. Maybe inlining isn't even necessary for good performance... -- tkoenig at gcc dot gnu dot org changed: What|Removed |Added AssignedTo|unassigned at gcc dot gnu |tkoenig at gcc dot gnu dot |dot org |org Status|NEW |ASSIGNED Last reconfirmed|2007-03-07 21:09:53 |2007-04-02 17:44:28 date|| http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #13 from pault at gcc dot gnu dot org 2007-03-26 12:43 ---
(In reply to comment #11)
> (In reply to comment #10)

Thomas,

It does not look too bad: Look at the tail end of array_transfer -

  gfc_trans_create_temp_array (&se->pre, &se->post, se->loop,
                               info, mold_type, false, true, false);

  /* Cast the pointer to the result.  */
  tmp = gfc_conv_descriptor_data_get (info->descriptor);

This will produce a temporary array of the right dimension, together with its descriptor. For minmaxloc, mold_type will have to be replaced by the TREE_TYPE of the result (gfc_array_index_type, I suppose? or else, you will have to use gfc_typenode_for_spec (&expr->ts);, I think). tmp will now be a pointer to your result array.

This needs to appear fairly early on in minmaxloc, so that the array can be set to zero to initialize it and so that the location can be transferred to it. The standard checks will have to be made that ss->loop exists etc.. - just check some of the other array valued in-line intrinsics.

Having done this, you will need to replace the line (~2049), following "remember where we are", with a loop over the n dimensions (note, we do not have to restrict ourselves to one dimension:) ). Something like:

  /* Remember where we are.  */
  for (n = 0; n < loop.dimen; n++)
    {
      pos = build_fold_indirect_ref (gfc_conv_array_data (info->descriptor));
      pos = gfc_build_array_ref (pos, build_int_cst (gfc_array_index_type, n));
      gfc_add_modify_expr (&ifblock, pos, loop.loopvar[n]);
    }

should bang the position into the result array, which is transferred at the end with

  se->expr = info->descriptor;

Good luck

Cheers

Paul

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #12 from pault at gcc dot gnu dot org 2007-03-26 11:37 ---
(In reply to comment #11)
> (In reply to comment #10)
> Do you have any idea what I could do to turn this into an array?
> All the "interesting" gfc_conv_intrinsic_* functions have the
> "if (se->ss)" statement on top.

I'll put my thinking cap on. To first order, you have to follow the same route as array_transfer. I'll come back to you.

Paul

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #11 from tkoenig at gcc dot gnu dot org 2007-03-13 20:12 ---
(In reply to comment #10)
> Thomas, it's a bit kludgy, but why not add a constant expression = 1, if dim
> is not present?

Hi Paul,

unless I'm mistaken, this would also change the rank of the function to 0, which FX explained is wrong.

Do you have any idea what I could do to turn this into an array? All the "interesting" gfc_conv_intrinsic_* functions have the "if (se->ss)" statement on top.

Cheers,
Thomas

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #10 from pault at gcc dot gnu dot org 2007-03-12 19:04 ---
(In reply to comment #9)
> As a workaround, one could always use "minloc(...,dim=1)", then
> we get the inline version.

Thomas, it's a bit kludgy, but why not add a constant expression = 1, if dim is not present?

Cheers

Paul

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #9 from tkoenig at gcc dot gnu dot org 2007-03-11 19:43 ---
I have looked at this some more.

Changing gfc_conv_intrinsic_function so that we call gfc_conv_intrinsic_minmaxloc is easy enough:

@@ -3481,7 +3481,9 @@ gfc_conv_intrinsic_function (gfc_se * se
   name = &expr->value.function.name[2];

-  if (expr->rank > 0 && !expr->inline_noncopying_intrinsic)
+  if (expr->rank > 0 && !expr->inline_noncopying_intrinsic
+      && ! (expr->rank == 1 && (isym->generic_id == GFC_ISYM_MINLOC
+                                || isym->generic_id == GFC_ISYM_MAXLOC)))
     {
       lib = gfc_is_intrinsic_libcall (expr);
       if (lib != 0)

If we do that, we hit the "if (se->ss)" condition on top of that function, and we would have to handle scalarization of that one-trip loop. I currently have no idea how to go about that. Simply removing the condition doesn't work :-)

As a workaround, one could always use "minloc(...,dim=1)", then we get the inline version.

Unassigning myself.

--

tkoenig at gcc dot gnu dot org changed:

           What    |Removed                     |Added
         AssignedTo|tkoenig at gcc dot gnu dot  |unassigned at gcc dot gnu
                   |org                         |dot org
             Status|ASSIGNED                    |NEW

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #8 from tkoenig at gcc dot gnu dot org 2007-03-10 12:34 ---
(In reply to comment #7)
> (In reply to comment #6)
> > This makes minloc have rank 0, and allows for
> > inlining.
>
> No, it's wrong. See F95 13.14.71: "Result Characteristics. The result is of
> type default integer. If DIM is absent, the result is an array of rank one and
> of size equal to the rank of ARRAY; otherwise, the result is of rank n-1 and
> shape ..."

You're right. Back to the drawing board.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #7 from fxcoudert at gcc dot gnu dot org 2007-03-08 05:50 ---
(In reply to comment #6)
> This makes minloc have rank 0, and allows for
> inlining.

No, it's wrong. See F95 13.14.71: "Result Characteristics. The result is of type default integer. If DIM is absent, the result is an array of rank one and of size equal to the rank of ARRAY; otherwise, the result is of rank n-1 and shape ..."

Note in particular the example for case (i), on the next page: "The value of MINLOC((/4,3,6,3/)) is [2]".

If DIM is absent, the result is always an array of rank 1. Only if DIM is present, and the ARRAY is of rank 1, then MINLOC is a scalar: MINLOC((/4,3,6,3/),1) == 2

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
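The distinction between the two forms can be modelled outside Fortran. The following is a minimal C sketch under stated assumptions, not gfortran code: the names minloc_dim1 and minloc_nodim are hypothetical, and returning 0 for an empty array is a choice made here for illustration. It shows that for a rank-1 ARRAY both forms find the same position and differ only in the shape of the result.

```c
#include <assert.h>
#include <stddef.h>

/* Model of MINLOC (ARRAY, DIM=1) for a rank-1 ARRAY: the result is a
   scalar.  Returns the 1-based position of the first minimum, or 0 if
   the array is empty (an assumption of this sketch).  */
int minloc_dim1(const int *a, size_t n)
{
    int loc = 0;

    assert(a != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (loc == 0 || a[i] < a[loc - 1])
            loc = (int)(i + 1);
    }
    return loc;
}

/* Model of MINLOC (ARRAY) without DIM for a rank-1 ARRAY: the result is
   a rank-1 array whose size equals the rank of ARRAY, i.e. one element.  */
void minloc_nodim(const int *a, size_t n, int result[1])
{
    result[0] = minloc_dim1(a, n);
}
```

With the array (/4,3,6,3/) from the standard's example, minloc_nodim stores 2 in its one-element result, matching [2], while minloc_dim1 returns the plain value 2.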
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #6 from tkoenig at gcc dot gnu dot org 2007-03-07 21:29 ---
Created an attachment (id=13165)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=13165&action=view)
Setting the correct rank in minloc

This makes minloc have rank 0, and allows for inlining. I guess we'll find out now whether the inline code works :-)

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #5 from tkoenig at gcc dot gnu dot org 2007-03-07 21:09 ---
(In reply to comment #3)
> In gfc_conv_intrinsic_function, expr->rank is 0 for minval
> and 1 for minloc (which is bogus). I wonder where this is
> set...

To answer my own question: This is set in gfc_resolve_minloc.

I'll try to give it a shot.

--

tkoenig at gcc dot gnu dot org changed:

           What    |Removed                     |Added
                 CC|                            |tkoenig at gcc dot gnu dot
                   |                            |org
         AssignedTo|unassigned at gcc dot gnu   |tkoenig at gcc dot gnu dot
                   |dot org                     |org
             Status|NEW                         |ASSIGNED
   Last reconfirmed|2007-03-07 12:18:10         |2007-03-07 21:09:53
               date|                            |

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #4 from fxcoudert at gmail dot com 2007-03-07 21:09 ---
Subject: Re: MINLOC should sometimes be inlined (gas_dyn is so slw)

> In gfc_conv_intrinsic_function, expr->rank is 0 for minval
> and 1 for minloc (which is bogus).

It's not bogus. The MINLOC is an array of rank 1, that's correct.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #3 from tkoenig at gcc dot gnu dot org 2007-03-07 21:00 ---
(In reply to comment #2)
> No, because we never get into gfc_conv_intrinsic_minmaxloc. We translate the
> expression directly into a function call by calling
> gfc_conv_intrinsic_funcall() at the head of gfc_conv_intrinsic_function(),
> instead of going down the list. I wonder how SUM and PRODUCT are inlined...

In gfc_conv_intrinsic_function, expr->rank is 0 for minval and 1 for minloc (which is bogus). I wonder where this is set...

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #2 from fxcoudert at gcc dot gnu dot org 2007-03-07 12:18 ---
(In reply to comment #1)
> We do this for minval, and from glancing at
> gfc_conv_intrinsic_minmaxval and gfc_conv_intrinsic_minmaxloc,
> it should happen already.

No, because we never get into gfc_conv_intrinsic_minmaxloc. We translate the expression directly into a function call by calling gfc_conv_intrinsic_funcall() at the head of gfc_conv_intrinsic_function(), instead of going down the list. I wonder how SUM and PRODUCT are inlined...

--

fxcoudert at gcc dot gnu dot org changed:

           What    |Removed                     |Added
   Last reconfirmed|2007-03-07 11:27:52         |2007-03-07 12:18:10
               date|                            |

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
[Bug fortran/31067] MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
--- Comment #1 from tkoenig at gcc dot gnu dot org 2007-03-07 11:27 ---
(In reply to comment #0)
> Maybe we should have MINLOC inlined when there's no mask, stride 1 and
> one-dimensional?

Definitely. We do this for minval, and from glancing at gfc_conv_intrinsic_minmaxval and gfc_conv_intrinsic_minmaxloc, it should happen already.

--

tkoenig at gcc dot gnu dot org changed:

           What    |Removed                     |Added
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1
   Last reconfirmed|-00-00 00:00:00             |2007-03-07 11:27:52
               date|                            |

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067