Re: Do we need to do a loop invariant motion after loop interchange ?

2020-09-22 Thread Bin.Cheng via Gcc-patches
On Tue, Sep 22, 2020 at 10:30 AM HAO CHEN GUI  wrote:
>
> Bin,
>
> I just tested your patch on current trunk.  Here is my summary.
>
> 1. About some iv aren't moved out of inner loop (Lijia mentioned in his
> last email)
>
>[local count: 955630226]:
># l_32 = PHI <1(12), l_54(21)>
># ivtmp_165 = PHI <_446(12), ivtmp_155(21)>
>_26 = (integer(kind=8)) l_32;
>_27 = _25 + _26;
>y__I_lsm.119_136 = (*y_135(D))[_27];
>y__I_lsm.119_90 = m_55 != 1 ? y__I_lsm.119_136 : 0.0;
>_37 = _36 * stride.88_111;
>_38 = _35 + _37;
>_39 = _26 + _38;
>_40 = (*a_137(D))[_39];
>
> The offset _39 is not loop independent as it relies on _26. But _38 and
> _37 should be loop independent. So Lijia thought they should be moved
> out of loop.
>
> I checked the following pass and found that these  statements are
> eliminated after vertorizing and dce.
>
> In vect dump,
>
> simple.F:27:23: note:  -->vectorizing statement: _37 = _36 *
> stride.88_111;
> simple.F:27:23: note:  -->vectorizing statement: _38 = _35 + _37;
> simple.F:27:23: note:  -->vectorizing statement: _39 = _26 + _38;
> simple.F:27:23: note:  -->vectorizing statement: _40 = (*a_137(D))[_39];
> simple.F:27:23: note:  transform statement.
> simple.F:27:23: note:  transform load. ncopies = 1
> simple.F:27:23: note:  create vector_type-pointer variable to type:
> vector(2) real(kind=8)  vectorizing an array ref: (*a_137(D))
> simple.F:27:23: note:  created vectp_a.131_383
> simple.F:27:23: note:  add new stmt: vect__40.132_374 = MEM  real(kind=8)> [(real(kind=8) *)vectp_a.130_376];
>
> In dce dump,
>
> Deleting : _39 = _26 + _38;
>
> Deleting : _38 = _35 + _37;
>
> Deleting : _37 = _36 * stride.88_111;
>
> So it's reasonable to only consider data reference after loop
> interchange. Other statements may be eliminated or be moved out of loop
> in last lim pass if they're real expensive.
>
> 2. I tested the SPEC on powerpc64le-linux-gnu. 503.bwaves_r got 6.77%
> performance improvement with this patch. It has no impact on other
> benchmarks.
>
> 3. The patch passed bootstrapped and regression test on
> powerpc64le-linux-gnu.
>
> I think the patch works fine. Could you please add it into trunk? Thanks
> a lot.
Hmm, IIRC the patch was intended to show what the missing transform
is, and I think it has latent bugs which I haven't got time to refine.
As Richard mentioned, could you please explore this with the existing
LIM facility, rather than introducing new code implementing existing
transforms?

Thanks,
bin
>
>
> On 8/9/2020 下午 6:18, Bin.Cheng wrote:
> > On Mon, Sep 7, 2020 at 5:42 PM HAO CHEN GUI  wrote:
> >> Hi,
> >>
> >> I want to follow Lijia's work as I gained the performance benefit on
> >> some SPEC workloads by adding a im pass after loop interchange.  Could
> >> you send me the latest patches? I could do further testing. Thanks a lot.
> > Hi,
> > Hmm, not sure if this refers to me?  I only provided an example patch
> > (which isn't complete) before Lijia's.  Unfortunately I don't have any
> > latest patch about this either.
> > As Richard suggested, maybe you (if you work on this) can simplify the
> > implementation.  Anyway, we only need to hoist memory references here.
> >
> > Thanks,
> > bin
> >> https://gcc.gnu.org/pipermail/gcc/2020-February/232091.html
> >>


Re: Do we need to do a loop invariant motion after loop interchange ?

2020-09-22 Thread Richard Biener via Gcc-patches
On Tue, Sep 22, 2020 at 4:31 AM HAO CHEN GUI via Gcc-patches
 wrote:
>
> Bin,
>
> I just tested your patch on current trunk.  Here is my summary.
>
> 1. About some iv aren't moved out of inner loop (Lijia mentioned in his
> last email)
>
>[local count: 955630226]:
># l_32 = PHI <1(12), l_54(21)>
># ivtmp_165 = PHI <_446(12), ivtmp_155(21)>
>_26 = (integer(kind=8)) l_32;
>_27 = _25 + _26;
>y__I_lsm.119_136 = (*y_135(D))[_27];
>y__I_lsm.119_90 = m_55 != 1 ? y__I_lsm.119_136 : 0.0;
>_37 = _36 * stride.88_111;
>_38 = _35 + _37;
>_39 = _26 + _38;
>_40 = (*a_137(D))[_39];
>
> The offset _39 is not loop independent as it relies on _26. But _38 and
> _37 should be loop independent. So Lijia thought they should be moved
> out of loop.
>
> I checked the following pass and found that these  statements are
> eliminated after vertorizing and dce.
>
> In vect dump,
>
> simple.F:27:23: note:  -->vectorizing statement: _37 = _36 *
> stride.88_111;
> simple.F:27:23: note:  -->vectorizing statement: _38 = _35 + _37;
> simple.F:27:23: note:  -->vectorizing statement: _39 = _26 + _38;
> simple.F:27:23: note:  -->vectorizing statement: _40 = (*a_137(D))[_39];
> simple.F:27:23: note:  transform statement.
> simple.F:27:23: note:  transform load. ncopies = 1
> simple.F:27:23: note:  create vector_type-pointer variable to type:
> vector(2) real(kind=8)  vectorizing an array ref: (*a_137(D))
> simple.F:27:23: note:  created vectp_a.131_383
> simple.F:27:23: note:  add new stmt: vect__40.132_374 = MEM  real(kind=8)> [(real(kind=8) *)vectp_a.130_376];
>
> In dce dump,
>
> Deleting : _39 = _26 + _38;
>
> Deleting : _38 = _35 + _37;
>
> Deleting : _37 = _36 * stride.88_111;
>
> So it's reasonable to only consider data reference after loop
> interchange. Other statements may be eliminated or be moved out of loop
> in last lim pass if they're real expensive.
>
> 2. I tested the SPEC on powerpc64le-linux-gnu. 503.bwaves_r got 6.77%
> performance improvement with this patch. It has no impact on other
> benchmarks.
>
> 3. The patch passed bootstrapped and regression test on
> powerpc64le-linux-gnu.
>
> I think the patch works fine. Could you please add it into trunk? Thanks
> a lot.

I'd like to see us instead apply the existing LIM code (w/o store-motion)
to the loop nests we applied interchange to instead of adding yet another
invariant motion code.  It should be reasonably "easy" to make the code
have the complexity of the region it operates on (again, without store-motion
which would introduce some complexities).

Btw, with an extra LIM pass scheduled after interchange I observed ~20%
improvement on bwaves for x86_64, mainly due to better induction variable
selection IIRC.

Richard.

>
> On 8/9/2020 下午 6:18, Bin.Cheng wrote:
> > On Mon, Sep 7, 2020 at 5:42 PM HAO CHEN GUI  wrote:
> >> Hi,
> >>
> >> I want to follow Lijia's work as I gained the performance benefit on
> >> some SPEC workloads by adding a im pass after loop interchange.  Could
> >> you send me the latest patches? I could do further testing. Thanks a lot.
> > Hi,
> > Hmm, not sure if this refers to me?  I only provided an example patch
> > (which isn't complete) before Lijia's.  Unfortunately I don't have any
> > latest patch about this either.
> > As Richard suggested, maybe you (if you work on this) can simplify the
> > implementation.  Anyway, we only need to hoist memory references here.
> >
> > Thanks,
> > bin
> >> https://gcc.gnu.org/pipermail/gcc/2020-February/232091.html
> >>


Re: Do we need to do a loop invariant motion after loop interchange ?

2020-09-21 Thread HAO CHEN GUI via Gcc-patches

Bin,

I just tested your patch on current trunk.  Here is my summary.

1. About some iv aren't moved out of inner loop (Lijia mentioned in his 
last email)


  [local count: 955630226]:
  # l_32 = PHI <1(12), l_54(21)>
  # ivtmp_165 = PHI <_446(12), ivtmp_155(21)>
  _26 = (integer(kind=8)) l_32;
  _27 = _25 + _26;
  y__I_lsm.119_136 = (*y_135(D))[_27];
  y__I_lsm.119_90 = m_55 != 1 ? y__I_lsm.119_136 : 0.0;
  _37 = _36 * stride.88_111;
  _38 = _35 + _37;
  _39 = _26 + _38;
  _40 = (*a_137(D))[_39];

The offset _39 is not loop independent as it relies on _26. But _38 and 
_37 should be loop independent. So Lijia thought they should be moved 
out of loop.


I checked the following pass and found that these  statements are 
eliminated after vertorizing and dce.


In vect dump,

simple.F:27:23: note:  -->vectorizing statement: _37 = _36 * 
stride.88_111;

simple.F:27:23: note:  -->vectorizing statement: _38 = _35 + _37;
simple.F:27:23: note:  -->vectorizing statement: _39 = _26 + _38;
simple.F:27:23: note:  -->vectorizing statement: _40 = (*a_137(D))[_39];
simple.F:27:23: note:  transform statement.
simple.F:27:23: note:  transform load. ncopies = 1
simple.F:27:23: note:  create vector_type-pointer variable to type: 
vector(2) real(kind=8)  vectorizing an array ref: (*a_137(D))

simple.F:27:23: note:  created vectp_a.131_383
simple.F:27:23: note:  add new stmt: vect__40.132_374 = MEM real(kind=8)> [(real(kind=8) *)vectp_a.130_376];


In dce dump,

Deleting : _39 = _26 + _38;

Deleting : _38 = _35 + _37;

Deleting : _37 = _36 * stride.88_111;

So it's reasonable to only consider data reference after loop 
interchange. Other statements may be eliminated or be moved out of loop 
in last lim pass if they're real expensive.


2. I tested the SPEC on powerpc64le-linux-gnu. 503.bwaves_r got 6.77% 
performance improvement with this patch. It has no impact on other 
benchmarks.


3. The patch passed bootstrapped and regression test on 
powerpc64le-linux-gnu.


I think the patch works fine. Could you please add it into trunk? Thanks 
a lot.



On 8/9/2020 下午 6:18, Bin.Cheng wrote:

On Mon, Sep 7, 2020 at 5:42 PM HAO CHEN GUI  wrote:

Hi,

I want to follow Lijia's work as I gained the performance benefit on
some SPEC workloads by adding a im pass after loop interchange.  Could
you send me the latest patches? I could do further testing. Thanks a lot.

Hi,
Hmm, not sure if this refers to me?  I only provided an example patch
(which isn't complete) before Lijia's.  Unfortunately I don't have any
latest patch about this either.
As Richard suggested, maybe you (if you work on this) can simplify the
implementation.  Anyway, we only need to hoist memory references here.

Thanks,
bin

https://gcc.gnu.org/pipermail/gcc/2020-February/232091.html



Re: Do we need to do a loop invariant motion after loop interchange ?

2020-09-08 Thread Bin.Cheng via Gcc-patches
On Mon, Sep 7, 2020 at 5:42 PM HAO CHEN GUI  wrote:
>
> Hi,
>
> I want to follow Lijia's work as I gained the performance benefit on
> some SPEC workloads by adding a im pass after loop interchange.  Could
> you send me the latest patches? I could do further testing. Thanks a lot.
Hi,
Hmm, not sure if this refers to me?  I only provided an example patch
(which isn't complete) before Lijia's.  Unfortunately I don't have any
latest patch about this either.
As Richard suggested, maybe you (if you work on this) can simplify the
implementation.  Anyway, we only need to hoist memory references here.

Thanks,
bin
>
> https://gcc.gnu.org/pipermail/gcc/2020-February/232091.html
>


Re: Do we need to do a loop invariant motion after loop interchange ?

2020-09-07 Thread Richard Biener via Gcc-patches
On Mon, Sep 7, 2020 at 11:43 AM HAO CHEN GUI via Gcc-patches
 wrote:
>
> Hi,
>
> I want to follow Lijia's work as I gained the performance benefit on
> some SPEC workloads by adding a im pass after loop interchange.  Could
> you send me the latest patches? I could do further testing. Thanks a lot.
>
> https://gcc.gnu.org/pipermail/gcc/2020-February/232091.html

Given the patches are quite complex I wonder if executing region-based LIM
from interchange makes more sense here.  Martin might want to work on this.

Richard.


Do we need to do a loop invariant motion after loop interchange ?

2020-09-07 Thread HAO CHEN GUI via Gcc-patches

Hi,

I want to follow Lijia's work as I gained the performance benefit on 
some SPEC workloads by adding a im pass after loop interchange.  Could 
you send me the latest patches? I could do further testing. Thanks a lot.


https://gcc.gnu.org/pipermail/gcc/2020-February/232091.html