[Bug tree-optimization/66873] fortran variant of outer-1.c not parallelized by autopar

2015-07-26 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

--- Comment #12 from vries at gcc dot gnu.org ---
Created attachment 36063
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36063&action=edit
autopar/outer-7.c

C example to reproduce the same problem
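For reference, a hedged sketch of the kind of C reproducer such an attachment could contain, assuming it mirrors the Fortran case by hand-linearizing the two-dimensional access in the spirit of autopar/outer-1.c (the actual outer-7.c may differ):
...
/* Sketch only: a C analogue of the Fortran testcase with a hand-linearized
   2-D access, so the store has a single access function.  */
extern void abort (void);

#define N 500

int x[N * N];

int
main (void)
{
  int i, j;

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      x[i * N + j] = i + j + 3;

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      if (x[i * N + j] != i + j + 3)
        abort ();

  return 0;
}
...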


[Bug tree-optimization/66873] fortran variant of outer-1.c not parallelized by autopar

2015-07-26 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

--- Comment #11 from vries at gcc dot gnu.org ---
Created attachment 36057
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36057&action=edit
Updated tentative patch


[Bug tree-optimization/66873] fortran variant of outer-1.c not parallelized by autopar

2015-07-24 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

vries at gcc dot gnu.org changed:

   What|Removed |Added

   Keywords|patch   |

--- Comment #10 from vries at gcc dot gnu.org ---
Retracting the patch. It doesn't make sense to use graphite by default before
addressing the compile-time-hog problem.


[Bug tree-optimization/66873] fortran variant of outer-1.c not parallelized by autopar

2015-07-24 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

--- Comment #9 from vries at gcc dot gnu.org ---
(In reply to vries from comment #7)
> Created attachment 35986 [details]
> Updated tentative patch
> 
> I found that always doing graphite before parloops resulted in failures to
> parallelize reduction testcases.
> 

That was due to not using -ffast-math.

With this approved patch ( https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01863.html ),
we no longer have that problem.

However, we then run into PR66997 - outer loop reduction fails to parallelize with
graphite.

Furthermore, the runtime of graphite is excessive in some cases, which prevents us
from using graphite by default; see for instance Bug 53852 - [4.9/5/6 Regression]
-ftree-loop-linear: large compile time / memory usage.
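For context, the reduction testcases in question are presumably floating-point reductions, since -ffast-math is what makes the difference: parallelizing such a loop requires reassociating the additions, which is only valid under -ffast-math (or -fassociative-math). A minimal illustrative sketch (mine, not taken from the testsuite; the function name is arbitrary):
...
/* Illustrative only: a floating-point reduction.  Parallelizing the loop
   means splitting the sum into per-thread partial sums, i.e. reassociating
   the FP additions, which is only allowed under -ffast-math.  */
double
sum (const double *a, int n)
{
  double s = 0.0;
  int i;

  for (i = 0; i < n; i++)
    s += a[i];

  return s;
}
...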


[Bug tree-optimization/66873] fortran variant of outer-1.c not parallelized by autopar

2015-07-15 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

vries at gcc dot gnu.org changed:

   What|Removed |Added

   Keywords||patch

--- Comment #8 from vries at gcc dot gnu.org ---
https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01332.html


[Bug tree-optimization/66873] fortran variant of outer-1.c not parallelized by autopar

2015-07-15 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

vries at gcc dot gnu.org changed:

   What|Removed |Added

  Attachment #35983 is obsolete|0       |1

--- Comment #7 from vries at gcc dot gnu.org ---
Created attachment 35986
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35986&action=edit
Updated tentative patch

I found that always doing graphite before parloops resulted in failures to
parallelize reduction testcases.

I've split things up now:
- first we do parloopsred, a parloops variant in which we only handle
reductions
- then we do graphite
- then we do the normal parloops

This seems to combine the best of graphite and parloops.

The only gotcha is that I had to disable pass_iv_canon when
tree_parallelize_loops > 1. It seems to interfere with graphite. I did not
observe any failures to parallelize due to not running pass_iv_canon.


[Bug tree-optimization/66873] fortran variant of outer-1.c not parallelized by autopar

2015-07-14 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

--- Comment #6 from vries at gcc dot gnu.org ---
(In reply to kargl from comment #5)
> Does the loop ordering matter?  Fortran is a column major language,
> so your nested loops are backwards.  One would normally write.
> 
>    do jj = 0, n - 1
>       do ii = 0, n - 1
>          x(ii, jj) = ii + jj + 3
>       end do
>    end do
> 
> where the first loop index varies most rapidly.

Thanks for letting me know. I'm obviously not fluent in Fortran.

Interchanging ii and jj in the array access of the example, and again disabling
pass_iv_canon, gives:
...
(Data Dep:
#(Data Ref:
#  bb: 4
#  stmt: x[_12] = _14;
#  ref: x[_12];
#  base_object: x;
#  Access function 0: {{0, +, 500}_3, +, 1}_4
#)
#(Data Ref:
#  bb: 4
#  stmt: x[_12] = _14;
#  ref: x[_12];
#  base_object: x;
#  Access function 0: {{0, +, 500}_3, +, 1}_4
#)
  access_fn_A: {{0, +, 500}_3, +, 1}_4
  access_fn_B: {{0, +, 500}_3, +, 1}_4

 (subscript
  iterations_that_access_an_element_twice_in_A: [0]
  last_conflict: scev_not_known
  iterations_that_access_an_element_twice_in_B: [0]
  last_conflict: scev_not_known
  (Subscript distance: 0 ))
  inner loop index: 0
  loop nest: (3 4 )
  distance_vector:   0   0
  distance_vector:   1 -500
  direction_vector: ==
  direction_vector: +-
)
  FAILED: data dependencies exist across iterations
...

Again, using -floop-parallelize-all allows the outer loop to be parallelized.


[Bug tree-optimization/66873] fortran variant of outer-1.c not parallelized by autopar

2015-07-14 Thread kargl at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

kargl at gcc dot gnu.org changed:

   What|Removed |Added

 CC||kargl at gcc dot gnu.org

--- Comment #5 from kargl at gcc dot gnu.org ---
(In reply to vries from comment #0)
> Consider this example, a fortran version of autopar/outer-1.c:
> ...
> program main
>   implicit none
>   integer, parameter :: n = 500
>   integer, dimension (0:n-1, 0:n-1) :: x
>   integer:: i, j, ii, jj
> 
>   do ii = 0, n - 1
>      do jj = 0, n - 1
>         x(ii, jj) = ii + jj + 3
>      end do
>   end do
> 
>   do i = 0, n - 1
>      do j = 0, n - 1
>         if (x(i, j) .ne. i + j + 3) call abort
>      end do
>   end do
> 
> end program main
> ...
> 
> When trying to parallelize this using -O2 -ftree-parallelize-loops=2, it
> fails on the dependencies:

Does the loop ordering matter?  Fortran is a column major language,
so your nested loops are backwards.  One would normally write.

   do jj = 0, n - 1
      do ii = 0, n - 1
         x(ii, jj) = ii + jj + 3
      end do
   end do

where the first loop index varies most rapidly.


[Bug tree-optimization/66873] fortran variant of outer-1.c not parallelized by autopar

2015-07-14 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

--- Comment #4 from vries at gcc dot gnu.org ---
Created attachment 35983
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35983&action=edit
Tentative patch

With this tentative patch, we use graphite analysis (if available) by default
for parloops. That way, we manage to parallelize the Fortran example using just
-ftree-parallelize-loops=2.


[Bug tree-optimization/66873] fortran variant of outer-1.c not parallelized by autopar

2015-07-14 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

--- Comment #3 from vries at gcc dot gnu.org ---
The Fortran example succeeds when -floop-parallelize-all is used, even though the
access function in graphite seems to be the same:
...
Access function 0: {{0, +, 1}_3, +, 500}_4
...


[Bug tree-optimization/66873] fortran variant of outer-1.c not parallelized by autopar

2015-07-14 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

--- Comment #2 from vries at gcc dot gnu.org ---
Another obvious difference is that the Fortran two-dimensional array access is
rewritten into a single-dimension array access:
...
  <bb 3>:
  # ii_7 = PHI <0(2), ii_16(7)>
  pretmp_52 = (integer(kind=8)) ii_7;
  pretmp_53 = pretmp_52 * 500;

  <bb 4>:
  # jj_10 = PHI <0(3), jj_15(5)>
  _11 = (integer(kind=8)) jj_10;
  _12 = _11 + pretmp_53;
  _13 = ii_7 + jj_10;
  _14 = _13 + 3;
  x[_12] = _14;
  jj_15 = jj_10 + 1;
  if (jj_10 == 499)
    goto <bb ...>;
  else
    goto <bb ...>;
...

The outer-1.c two-dimensional array access, by contrast, is still two-dimensional:
...
  <bb ...>:
  # i_34 = PHI <0(3), i_15(8)>
  goto <bb ...>;

  <bb 5>:
  # j_36 = PHI <0(9), j_14(4)>
  _11 = i_34 + j_36;
  _12 = _11 + 3;
  x[i_34][j_36] = _12;
  j_14 = j_36 + 1;
  if (N_9(D) > j_14)
    goto <bb ...>;
  else
    goto <bb ...>;
...

This results in different access functions, and the dependence analysis
succeeds:
...
(Data Dep:
#(Data Ref:
#  bb: 5
#  stmt: x[i_34][j_36] = _12;
#  ref: x[i_34][j_36];
#  base_object: x;
#  Access function 0: {0, +, 1}_4
#  Access function 1: {0, +, 1}_1
#)
#(Data Ref:
#  bb: 5
#  stmt: x[i_34][j_36] = _12;
#  ref: x[i_34][j_36];
#  base_object: x;
#  Access function 0: {0, +, 1}_4
#  Access function 1: {0, +, 1}_1
#)
  access_fn_A: {0, +, 1}_4
  access_fn_B: {0, +, 1}_4

 (subscript
  iterations_that_access_an_element_twice_in_A: [0]
  last_conflict: scev_not_known
  iterations_that_access_an_element_twice_in_B: [0]
  last_conflict: scev_not_known
  (Subscript distance: 0 ))
  access_fn_A: {0, +, 1}_1
  access_fn_B: {0, +, 1}_1

 (subscript
  iterations_that_access_an_element_twice_in_A: [0]
  last_conflict: scev_not_known
  iterations_that_access_an_element_twice_in_B: [0]
  last_conflict: scev_not_known
  (Subscript distance: 0 ))
  inner loop index: 0
  loop nest: (1 4 )
  distance_vector:   0   0
  direction_vector: ==
)
  SUCCESS: may be parallelized
...
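In C terms, the two shapes the dependence analyzer sees look roughly like this (illustration only, not taken from the dumps above; the names flat, twod and store are arbitrary):
...
/* Illustration: the Fortran store is lowered to a single linearized
   subscript, while outer-1.c keeps one subscript per array dimension,
   giving the analyzer one access function per dimension to compare.  */
#define N 500

static int flat[N * N];
static int twod[N][N];

void
store (int ii, int jj)
{
  flat[ii * N + jj] = ii + jj + 3;  /* one access function for one subscript */
  twod[ii][jj] = ii + jj + 3;       /* one access function per subscript */
}
...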


[Bug tree-optimization/66873] fortran variant of outer-1.c not parallelized by autopar

2015-07-14 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

vries at gcc dot gnu.org changed:

   What|Removed |Added

   Keywords||missed-optimization

--- Comment #1 from vries at gcc dot gnu.org ---
One noticeable difference from outer-1.c is that pass_iv_canon makes the inner
and outer loop ivs run downwards (from 500 to 0).

Removing pass_iv_canon from the pass list fixes that, but doesn't change
anything about the dependency analysis in parloops:
...
(Data Dep:
#(Data Ref:
#  bb: 4
#  stmt: x[_12] = _14;
#  ref: x[_12];
#  base_object: x;
#  Access function 0: {{0, +, 1}_3, +, 500}_4
#)
#(Data Ref:
#  bb: 4
#  stmt: x[_12] = _14;
#  ref: x[_12];
#  base_object: x;
#  Access function 0: {{0, +, 1}_3, +, 500}_4
#)
  access_fn_A: {{0, +, 1}_3, +, 500}_4
  access_fn_B: {{0, +, 1}_3, +, 500}_4

 (subscript
  iterations_that_access_an_element_twice_in_A: [0]
  last_conflict: scev_not_known
  iterations_that_access_an_element_twice_in_B: [0]
  last_conflict: scev_not_known
  (Subscript distance: 0 ))
  inner loop index: 0
  loop nest: (3 4 )
  distance_vector:   0   0
  distance_vector: 500  -1
  direction_vector: ==
  direction_vector: +-
)
  FAILED: data dependencies exist across iterations
...
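A worked illustration of why the linearized subscript defeats the test (my own arithmetic, not part of the dump; the variables a and b stand for the two loop ivs): with access function a + 500*b, the iteration points (a, b) and (a + 500, b - 1) hit the same flat index, which is exactly the spurious distance vector (500, -1) reported above. The second point would need a + 500 <= 499 and so never occurs in the real iteration space, but the dependence test does not use that bound and conservatively reports a cross-iteration dependence.
...
#include <assert.h>

int
main (void)
{
  int a = 3, b = 7;

  /* Same linearized index from two distinct iteration points; only the
     loop bound (0 <= a < 500) rules the second one out.  */
  assert (a + 500 * b == (a + 500) + 500 * (b - 1));
  return 0;
}
...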