Re: [PATCH] combine: Allow combining two insns to two insns

2018-08-01 Thread Toon Moene

On 07/24/2018 07:18 PM, Segher Boessenkool wrote:


This patch allows combine to combine two insns into two.  This helps
in many cases, by reducing instruction path length, and also allowing
further combinations to happen.  PR85160 is a typical example of code
that it can improve.


I cannot state with certainty that the improvements to our most 
notorious routine between 8.2 and current trunk are solely due to this 
change, but the differences are telling (see attached Fortran code - the 
analysis is about the third loop).


Number of instructions for this loop (Skylake i9-7900).

gfortran82 -S -Ofast -march=native -mtune=native:

  458 verint.s.82.loop3

gfortran90 -S -Ofast -march=native -mtune=native:

  396 verint.s.90.loop3

But the most stunning difference is the use of the stack [ nn(rsp) ] - 
see the attached files ...
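(A quick way to quantify that, assuming the attachments are the plain
assembler files verint.s.82 and verint.s.90 suggested by the counts above:

   grep -c 'rsp)' verint.s.82
   grep -c 'rsp)' verint.s.90

grep -c counts matching lines, so this is only a rough proxy for the
number of stack accesses.)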


--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news
# 1 "/scratch/hirlam/hl_home/MPI/lib/src/grdy/verint.F"
# 1 ""
# 1 ""
# 1 "/scratch/hirlam/hl_home/MPI/lib/src/grdy/verint.F"
c Library:grdy $RCSfile$, $Revision: 7536 $
c checked in by $Author: ovignes $ at $Date: 2009-12-18 14:23:36 +0100 (Fri, 18 Dec 2009) $
c $State$, $Locker$
c $Log$
c Revision 1.3  1999/04/22 09:30:45  DagBjoerge
c MPP code
c
c Revision 1.2  1999/03/09 10:23:13  GerardCats
c Add SGI parallellisation directives DOACROSS
c
c Revision 1.1  1996/09/06 13:12:18  GCats
c Created from grdy.apl, 1 version 2.6.1, by Gerard Cats
c
  SUBROUTINE VERINT (
 I   KLON   , KLAT   , KLEV   , KINT  , KHALO
 I , KLON1  , KLON2  , KLAT1  , KLAT2
 I , KP , KQ , KR
 R , PARG   , PRES
 R , PALFH  , PBETH
 R , PALFA  , PBETA  , PGAMA   )
C
C***
C
C  VERINT - THREE DIMENSIONAL INTERPOLATION
C
C  PURPOSE:
C
C  THREE DIMENSIONAL INTERPOLATION
C
C  INPUT PARAMETERS:
C
C  KLON  NUMBER OF GRIDPOINTS IN X-DIRECTION
C  KLAT  NUMBER OF GRIDPOINTS IN Y-DIRECTION
C  KLEV  NUMBER OF VERTICAL LEVELS
C  KINT  TYPE OF INTERPOLATION
C        = 1 - LINEAR
C        = 2 - QUADRATIC
C        = 3 - CUBIC
C        = 4 - MIXED CUBIC/LINEAR
C  KLON1 FIRST GRIDPOINT IN X-DIRECTION
C  KLON2 LAST  GRIDPOINT IN X-DIRECTION
C  KLAT1 FIRST GRIDPOINT IN Y-DIRECTION
C  KLAT2 LAST  GRIDPOINT IN Y-DIRECTION
C  KP    ARRAY OF INDEXES FOR HORIZONTAL DISPLACEMENTS
C  KQ    ARRAY OF INDEXES FOR HORIZONTAL DISPLACEMENTS
C  KR    ARRAY OF INDEXES FOR VERTICAL   DISPLACEMENTS
C  PARG  ARRAY OF ARGUMENTS
C  PALFH ALFA HAT
C  PBETH BETA HAT
C  PALFA ARRAY OF WEIGHTS IN X-DIRECTION
C  PBETA ARRAY OF WEIGHTS IN Y-DIRECTION
C  PGAMA ARRAY OF WEIGHTS IN VERTICAL DIRECTION
C
C  OUTPUT PARAMETERS:
C
C  PRES  INTERPOLATED FIELD
C
C  HISTORY:
C
C  J.E. HAUGEN   1  1992
C
C***
C
  IMPLICIT NONE
C
  INTEGER KLON   , KLAT   , KLEV   , KINT   , KHALO,
 I   KLON1  , KLON2  , KLAT1  , KLAT2
C
  INTEGER   KP(KLON,KLAT), KQ(KLON,KLAT), KR(KLON,KLAT)
  REAL    PARG(2-KHALO:KLON+KHALO-1,2-KHALO:KLAT+KHALO-1,KLEV)  ,
 R   PRES(KLON,KLAT) ,
 R   PALFH(KLON,KLAT) ,  PBETH(KLON,KLAT)  ,
 R   PALFA(KLON,KLAT,4)   ,  PBETA(KLON,KLAT,4),
 R   PGAMA(KLON,KLAT,4)
C
  INTEGER JX, JY, IDX, IDY, ILEV
  REAL Z1MAH, Z1MBH
C
  IF (KINT.EQ.1) THEN
C  LINEAR INTERPOLATION
C
  DO JY = KLAT1,KLAT2
  DO JX = KLON1,KLON2
 IDX  = KP(JX,JY)
 IDY  = KQ(JX,JY)
 ILEV = KR(JX,JY)
C
 PRES(JX,JY) = PGAMA(JX,JY,1)*(
C
 +   PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV-1)
 +  + PALFA(JX,JY,2)*PARG(IDX  ,IDY-1,ILEV-1) )
 + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY  ,ILEV-1)
 +  + PALFA(JX,JY,2)*PARG(IDX  ,IDY  ,ILEV-1) ) )
C+
 +   + PGAMA(JX,JY,2)*(
C+
 +   PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV  )
 +  + PALFA(JX,JY,2)*PARG(IDX  ,IDY-1,ILEV  ) )
 + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY  ,ILEV  )
 +  + PALFA(JX,JY,2)*PARG(IDX  ,IDY  ,ILEV  ) ) )
  ENDDO
  ENDDO
C
  ELSE
 +IF (KINT.EQ.2) THEN
C  QUADRATIC INTERPOLATION
C
  DO JY = KLAT1,KLAT2
  DO JX = KLON1,KLON2
 IDX  = KP(JX,JY)
 IDY  = KQ(JX,JY)
 ILEV = KR(JX,JY)
C
 PRES(JX,JY) = PGAMA(JX,JY,1)*(
C
 +   PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV-1)
 +  + PALFA(JX,JY,2)*PARG(IDX  ,IDY-1,ILEV-1)
 +  + PALFA(JX,JY,3)*PARG(IDX+1,IDY-1,ILEV-1) )
 + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY  ,ILEV-1)
 +  + 

Re: [PATCH] combine: Allow combining two insns to two insns

2018-08-01 Thread Christophe Lyon
On Wed, 1 Aug 2018 at 11:40, Segher Boessenkool
 wrote:
>
> On Wed, Aug 01, 2018 at 10:27:31AM +0200, Christophe Lyon wrote:
> > On Tue, 31 Jul 2018 at 15:57, Segher Boessenkool
> >  wrote:
> > > On Tue, Jul 31, 2018 at 02:34:06PM +0200, Christophe Lyon wrote:
> > > > Since this was committed, I've noticed regressions
> > > > on aarch64:
> > > > FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:
> > >
> > > This went from
> > > and w0, w0, 255
> > > lsl w1, w0, 8
> > > orr w0, w1, w0, lsl 20
> > > ret
> > > to
> > > and w1, w0, 255
> > > ubfiz   w0, w0, 8, 8
> > > orr w0, w0, w1, lsl 20
> > > ret
> > > so it's neither an improvement nor a regression, just different code.
> > > The testcase wants no ANDs in the RTL.
> >
> > I didn't try to manually regenerate the code before and after the patch,
> > but if there was "and w0, w0, 255" before the patch, why did the test 
> > pass?
>
> It wasn't an AND in RTL (it was a ZERO_EXTEND).
>

Indeed, I missed the -dP in the options.

> > > > on arm-none-linux-gnueabi
> > > > FAIL: gfortran.dg/actual_array_constructor_1.f90   -O1  execution test
> > >
> > > That sounds bad.  Open a PR, maybe?
> > >
> > I've just filed PR86771
>
> Thanks.
>
>
> Segher


Re: [PATCH] combine: Allow combining two insns to two insns

2018-08-01 Thread Segher Boessenkool
On Wed, Aug 01, 2018 at 10:27:31AM +0200, Christophe Lyon wrote:
> On Tue, 31 Jul 2018 at 15:57, Segher Boessenkool
>  wrote:
> > On Tue, Jul 31, 2018 at 02:34:06PM +0200, Christophe Lyon wrote:
> > > Since this was committed, I've noticed regressions
> > > on aarch64:
> > > FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:
> >
> > This went from
> > and w0, w0, 255
> > lsl w1, w0, 8
> > orr w0, w1, w0, lsl 20
> > ret
> > to
> > and w1, w0, 255
> > ubfiz   w0, w0, 8, 8
> > orr w0, w0, w1, lsl 20
> > ret
> > so it's neither an improvement nor a regression, just different code.
> > The testcase wants no ANDs in the RTL.
> 
> I didn't try to manually regenerate the code before and after the patch,
> but if there was "and w0, w0, 255" before the patch, why did the test 
> pass?

It wasn't an AND in RTL (it was a ZERO_EXTEND).
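Schematically, in hand-written RTL (an illustration, not a compiler dump),
the same machine instruction can be described either way:

  (set (reg:SI 1) (zero_extend:SI (reg:QI 0)))           ;; no (and:...) here
  (set (reg:SI 1) (and:SI (reg:SI 0) (const_int 255)))   ;; matches "\(and:"

The testcase scans the RTL that -dP echoes into the assembler output, so
only the second form trips its scan-assembler-not pattern.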

> > > on arm-none-linux-gnueabi
> > > FAIL: gfortran.dg/actual_array_constructor_1.f90   -O1  execution test
> >
> > That sounds bad.  Open a PR, maybe?
> >
> I've just filed PR86771

Thanks.


Segher


Re: [PATCH] combine: Allow combining two insns to two insns

2018-08-01 Thread Christophe Lyon
On Tue, 31 Jul 2018 at 15:57, Segher Boessenkool
 wrote:
>
> Hi Christophe,
>
> On Tue, Jul 31, 2018 at 02:34:06PM +0200, Christophe Lyon wrote:
> > Since this was committed, I've noticed regressions
> > on aarch64:
> > FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:
>
> This went from
> and w0, w0, 255
> lsl w1, w0, 8
> orr w0, w1, w0, lsl 20
> ret
> to
> and w1, w0, 255
> ubfiz   w0, w0, 8, 8
> orr w0, w0, w1, lsl 20
> ret
> so it's neither an improvement nor a regression, just different code.
> The testcase wants no ANDs in the RTL.

I didn't try to manually regenerate the code before and after the patch,
but if there was "and w0, w0, 255" before the patch, why did the test pass?

> > on arm-none-linux-gnueabi
> > FAIL: gfortran.dg/actual_array_constructor_1.f90   -O1  execution test
>
> That sounds bad.  Open a PR, maybe?
>
I've just filed PR86771

> > On aarch64, I've also noticed a few others regressions but I'm not yet
> > 100% sure it's caused by this patch (bisect running):
> > gcc.target/aarch64/ashltidisi.c scan-assembler-times asr 4
>
>  ushift_53_i:
> -   uxtw x1, w0
> -   lsl x0, x1, 53
> -   lsr x1, x1, 11
> +   lsr w1, w0, 11
> +   lsl x0, x0, 53
> ret
>
>  shift_53_i:
> -   sxtw x1, w0
> -   lsl x0, x1, 53
> -   asr x1, x1, 11
> +   sbfx x1, x0, 11, 21
> +   lsl x0, x0, 53
> ret
>
> Both are improvements afaics.  The number of asr insns changes, sure.
>
Looks like an easy "fix" would be to change to "scan-assembler-times asr 3",
but maybe the aarch64 maintainers want to add more checks here (lsl/lsr counts).
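That is, something like the following in the testcase (a sketch; the
actual directive may be formatted differently):

  /* { dg-final { scan-assembler-times "asr" 3 } } */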

Don't you include arm/aarch64 in the 30 targets you used for testing?

>
> > gcc.target/aarch64/sve/var_stride_2.c -march=armv8.2-a+sve
> > scan-assembler-times \\tadd\\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 10\\n 2
>
> Skipping all the SVE tests, sorry.  Richard says they look like
> improvements, and exactly of the expected kind.  :-)
>
>
> Segher


Re: [PATCH] combine: Allow combining two insns to two insns

2018-07-31 Thread Richard Earnshaw (lists)
On 31/07/18 14:57, Segher Boessenkool wrote:
> Hi Christophe,
> 
> On Tue, Jul 31, 2018 at 02:34:06PM +0200, Christophe Lyon wrote:
>> Since this was committed, I've noticed regressions
>> on aarch64:
>> FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:
> 
> This went from
> and w0, w0, 255
> lsl w1, w0, 8

These are sequentially dependent.

> orr w0, w1, w0, lsl 20
> ret
> to
> and w1, w0, 255
> ubfiz   w0, w0, 8, 8

These can run in parallel.

So the change is a good one!  On a super-scalar machine we save a cycle.

R.

> orr w0, w0, w1, lsl 20
> ret
> so it's neither an improvement nor a regression, just different code.
> The testcase wants no ANDs in the RTL.
> 
> 
>> on arm-none-linux-gnueabi
>> FAIL: gfortran.dg/actual_array_constructor_1.f90   -O1  execution test
> 
> That sounds bad.  Open a PR, maybe?
> 
> 
>> On aarch64, I've also noticed a few others regressions but I'm not yet
>> 100% sure it's caused by this patch (bisect running):
>> gcc.target/aarch64/ashltidisi.c scan-assembler-times asr 4
> 
>  ushift_53_i:
> -   uxtw x1, w0
> -   lsl x0, x1, 53
> -   lsr x1, x1, 11
> +   lsr w1, w0, 11
> +   lsl x0, x0, 53
> ret
> 
>  shift_53_i:
> -   sxtw x1, w0
> -   lsl x0, x1, 53
> -   asr x1, x1, 11
> +   sbfx x1, x0, 11, 21
> +   lsl x0, x0, 53
> ret
> 
> Both are improvements afaics.  The number of asr insns changes, sure.
> 
> 
>> gcc.target/aarch64/sve/var_stride_2.c -march=armv8.2-a+sve
>> scan-assembler-times \\tadd\\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 10\\n 2
> 
> Skipping all the SVE tests, sorry.  Richard says they look like
> improvements, and exactly of the expected kind.  :-)
> 
> 
> Segher
> 



Re: [PATCH] combine: Allow combining two insns to two insns

2018-07-31 Thread Segher Boessenkool
On Tue, Jul 31, 2018 at 05:39:37AM -0700, H.J. Lu wrote:
> For
> 
> ---
> #define N 16
> float f[N];
> double d[N];
> int n[N];
> 
> __attribute__((noinline)) void
> f3 (void)
> {
>   int i;
>   for (i = 0; i < N; i++)
> d[i] = f[i];
> }
> ---
> 
> r263067 improved -O3 -mavx2 -mtune=generic -m64 from
> 
> .cfi_startproc
> vmovaps f(%rip), %xmm2
> vmovaps f+32(%rip), %xmm3
> vinsertf128 $0x1, f+16(%rip), %ymm2, %ymm0
> vcvtps2pd %xmm0, %ymm1
> vextractf128 $0x1, %ymm0, %xmm0
> vmovaps %xmm1, d(%rip)
> vextractf128 $0x1, %ymm1, d+16(%rip)
> vcvtps2pd %xmm0, %ymm0
> vmovaps %xmm0, d+32(%rip)
> vextractf128 $0x1, %ymm0, d+48(%rip)
> vinsertf128 $0x1, f+48(%rip), %ymm3, %ymm0
> vcvtps2pd %xmm0, %ymm1
> vextractf128 $0x1, %ymm0, %xmm0
> vmovaps %xmm1, d+64(%rip)
> vextractf128 $0x1, %ymm1, d+80(%rip)
> vcvtps2pd %xmm0, %ymm0
> vmovaps %xmm0, d+96(%rip)
> vextractf128 $0x1, %ymm0, d+112(%rip)
> vzeroupper
> ret
> .cfi_endproc
> 
> to
> 
> .cfi_startproc
> vcvtps2pd f(%rip), %ymm0
> vmovaps %xmm0, d(%rip)
> vextractf128 $0x1, %ymm0, d+16(%rip)
> vcvtps2pd f+16(%rip), %ymm0
> vmovaps %xmm0, d+32(%rip)
> vextractf128 $0x1, %ymm0, d+48(%rip)
> vcvtps2pd f+32(%rip), %ymm0
> vextractf128 $0x1, %ymm0, d+80(%rip)
> vmovaps %xmm0, d+64(%rip)
> vcvtps2pd f+48(%rip), %ymm0
> vextractf128 $0x1, %ymm0, d+112(%rip)
> vmovaps %xmm0, d+96(%rip)
> vzeroupper
> ret
> .cfi_endproc

I cannot really read AVX, but that looks like better code alright :-)


Segher


Re: [PATCH] combine: Allow combining two insns to two insns

2018-07-31 Thread Segher Boessenkool
Hi Christophe,

On Tue, Jul 31, 2018 at 02:34:06PM +0200, Christophe Lyon wrote:
> Since this was committed, I've noticed regressions
> on aarch64:
> FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:

This went from
and w0, w0, 255
lsl w1, w0, 8
orr w0, w1, w0, lsl 20
ret
to
and w1, w0, 255
ubfiz   w0, w0, 8, 8
orr w0, w0, w1, lsl 20
ret
so it's neither an improvement nor a regression, just different code.
The testcase wants no ANDs in the RTL.
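For reference, a minimal C sketch (a guess at the testcase's shape, not the
actual gcc.dg source) that computes what both sequences above compute:

  unsigned int
  foo (unsigned int x)
  {
    x &= 255;                      /* the and (or zero_extend)          */
    return (x << 8) | (x << 20);   /* the two shifted copies being ORed */
  }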


> on arm-none-linux-gnueabi
> FAIL: gfortran.dg/actual_array_constructor_1.f90   -O1  execution test

That sounds bad.  Open a PR, maybe?


> On aarch64, I've also noticed a few others regressions but I'm not yet
> 100% sure it's caused by this patch (bisect running):
> gcc.target/aarch64/ashltidisi.c scan-assembler-times asr 4

 ushift_53_i:
-   uxtw x1, w0
-   lsl x0, x1, 53
-   lsr x1, x1, 11
+   lsr w1, w0, 11
+   lsl x0, x0, 53
ret

 shift_53_i:
-   sxtw x1, w0
-   lsl x0, x1, 53
-   asr x1, x1, 11
+   sbfx x1, x0, 11, 21
+   lsl x0, x0, 53
ret

Both are improvements afaics.  The number of asr insns changes, sure.


> gcc.target/aarch64/sve/var_stride_2.c -march=armv8.2-a+sve
> scan-assembler-times \\tadd\\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 10\\n 2

Skipping all the SVE tests, sorry.  Richard says they look like
improvements, and exactly of the expected kind.  :-)


Segher


Re: [PATCH] combine: Allow combining two insns to two insns

2018-07-31 Thread Richard Sandiford
Christophe Lyon  writes:
> On Mon, 30 Jul 2018 at 18:09, Segher Boessenkool
>  wrote:
>>
>> On Tue, Jul 24, 2018 at 05:18:41PM +, Segher Boessenkool wrote:
>> > This patch allows combine to combine two insns into two.  This helps
>> > in many cases, by reducing instruction path length, and also allowing
>> > further combinations to happen.  PR85160 is a typical example of code
>> > that it can improve.
>> >
>> > This patch does not allow such combinations if either of the original
>> > instructions was a simple move instruction.  In those cases combining
>> > the two instructions increases register pressure without improving the
>> > code.  With this move test, register pressure no longer increases
>> > noticeably, as far as I can tell.
>> >
>> > (At first I also didn't allow either of the resulting insns to be a
>> > move instruction.  But that is actually a very good thing to have, as
>> > should have been obvious).
>> >
>> > Tested for many months; tested on about 30 targets.
>> >
>> > I'll commit this later this week if there are no objections.
>>
>> Done now, with the testcase at
>> https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01856.html .
>>
>
> Hi,
>
> Since this was committed, I've noticed regressions
> on aarch64:
> FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:
>
> on arm-none-linux-gnueabi
> FAIL: gfortran.dg/actual_array_constructor_1.f90   -O1  execution test
>
> On aarch64, I've also noticed a few others regressions but I'm not yet
> 100% sure it's caused by this patch (bisect running):
> gcc.target/aarch64/ashltidisi.c scan-assembler-times asr 4
> gcc.target/aarch64/sve/var_stride_2.c -march=armv8.2-a+sve
> scan-assembler-times \\tadd\\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 10\\n 2
> gcc.target/aarch64/sve/var_stride_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tlsl\\tx[0-9]+, x[0-9]+, 10\\n 2
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmeq\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> #0\\.0\\n 7
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmeq\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> z[0-9]+\\.d\\n 14
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmeq\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> #0\\.0\\n 5
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmeq\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> z[0-9]+\\.s\\n 10
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> #0\\.0\\n 21
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> z[0-9]+\\.d\\n 42
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> #0\\.0\\n 15
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> z[0-9]+\\.s\\n 30
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> #0\\.0\\n 21
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> z[0-9]+\\.d\\n 42
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> #0\\.0\\n 15
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> z[0-9]+\\.s\\n 30
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> #0\\.0\\n 21
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> z[0-9]+\\.d\\n 42
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> #0\\.0\\n 15
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> z[0-9]+\\.s\\n 30
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> #0\\.0\\n 21
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> z[0-9]+\\.d\\n 42
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> #0\\.0\\n 15
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> z[0-9]+\\.s\\n 30
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmne\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> #0\\.0\\n 21
> 

Re: [PATCH] combine: Allow combining two insns to two insns

2018-07-31 Thread H.J. Lu
On Wed, Jul 25, 2018 at 1:28 AM, Richard Biener
 wrote:
> On Tue, Jul 24, 2018 at 7:18 PM Segher Boessenkool
>  wrote:
>>
>> This patch allows combine to combine two insns into two.  This helps
>> in many cases, by reducing instruction path length, and also allowing
>> further combinations to happen.  PR85160 is a typical example of code
>> that it can improve.
>>
>> This patch does not allow such combinations if either of the original
>> instructions was a simple move instruction.  In those cases combining
>> the two instructions increases register pressure without improving the
> >> code.  With this move test, register pressure no longer increases
> >> noticeably, as far as I can tell.
>>
>> (At first I also didn't allow either of the resulting insns to be a
>> move instruction.  But that is actually a very good thing to have, as
>> should have been obvious).
>>
>> Tested for many months; tested on about 30 targets.
>>
>> I'll commit this later this week if there are no objections.
>
> Sounds good - but, _any_ testcase?  Please! ;)
>

Here is a testcase:

For

---
#define N 16
float f[N];
double d[N];
int n[N];

__attribute__((noinline)) void
f3 (void)
{
  int i;
  for (i = 0; i < N; i++)
d[i] = f[i];
}
---

r263067 improved -O3 -mavx2 -mtune=generic -m64 from

.cfi_startproc
vmovaps f(%rip), %xmm2
vmovaps f+32(%rip), %xmm3
vinsertf128 $0x1, f+16(%rip), %ymm2, %ymm0
vcvtps2pd %xmm0, %ymm1
vextractf128 $0x1, %ymm0, %xmm0
vmovaps %xmm1, d(%rip)
vextractf128 $0x1, %ymm1, d+16(%rip)
vcvtps2pd %xmm0, %ymm0
vmovaps %xmm0, d+32(%rip)
vextractf128 $0x1, %ymm0, d+48(%rip)
vinsertf128 $0x1, f+48(%rip), %ymm3, %ymm0
vcvtps2pd %xmm0, %ymm1
vextractf128 $0x1, %ymm0, %xmm0
vmovaps %xmm1, d+64(%rip)
vextractf128 $0x1, %ymm1, d+80(%rip)
vcvtps2pd %xmm0, %ymm0
vmovaps %xmm0, d+96(%rip)
vextractf128 $0x1, %ymm0, d+112(%rip)
vzeroupper
ret
.cfi_endproc

to

.cfi_startproc
vcvtps2pd f(%rip), %ymm0
vmovaps %xmm0, d(%rip)
vextractf128 $0x1, %ymm0, d+16(%rip)
vcvtps2pd f+16(%rip), %ymm0
vmovaps %xmm0, d+32(%rip)
vextractf128 $0x1, %ymm0, d+48(%rip)
vcvtps2pd f+32(%rip), %ymm0
vextractf128 $0x1, %ymm0, d+80(%rip)
vmovaps %xmm0, d+64(%rip)
vcvtps2pd f+48(%rip), %ymm0
vextractf128 $0x1, %ymm0, d+112(%rip)
vmovaps %xmm0, d+96(%rip)
vzeroupper
ret
.cfi_endproc

This is:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86752

H.J.


Re: [PATCH] combine: Allow combining two insns to two insns

2018-07-31 Thread Christophe Lyon
On Mon, 30 Jul 2018 at 18:09, Segher Boessenkool
 wrote:
>
> On Tue, Jul 24, 2018 at 05:18:41PM +, Segher Boessenkool wrote:
> > This patch allows combine to combine two insns into two.  This helps
> > in many cases, by reducing instruction path length, and also allowing
> > further combinations to happen.  PR85160 is a typical example of code
> > that it can improve.
> >
> > This patch does not allow such combinations if either of the original
> > instructions was a simple move instruction.  In those cases combining
> > the two instructions increases register pressure without improving the
> > code.  With this move test, register pressure no longer increases
> > noticeably, as far as I can tell.
> >
> > (At first I also didn't allow either of the resulting insns to be a
> > move instruction.  But that is actually a very good thing to have, as
> > should have been obvious).
> >
> > Tested for many months; tested on about 30 targets.
> >
> > I'll commit this later this week if there are no objections.
>
> Done now, with the testcase at 
> https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01856.html .
>

Hi,

Since this was committed, I've noticed regressions
on aarch64:
FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:

on arm-none-linux-gnueabi
FAIL: gfortran.dg/actual_array_constructor_1.f90   -O1  execution test

On aarch64, I've also noticed a few others regressions but I'm not yet
100% sure it's caused by this patch (bisect running):
gcc.target/aarch64/ashltidisi.c scan-assembler-times asr 4
gcc.target/aarch64/sve/var_stride_2.c -march=armv8.2-a+sve
scan-assembler-times \\tadd\\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 10\\n 2
gcc.target/aarch64/sve/var_stride_4.c -march=armv8.2-a+sve
scan-assembler-times \\tlsl\\tx[0-9]+, x[0-9]+, 10\\n 2
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmeq\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
#0\\.0\\n 7
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmeq\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
z[0-9]+\\.d\\n 14
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmeq\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
#0\\.0\\n 5
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmeq\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
z[0-9]+\\.s\\n 10
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
#0\\.0\\n 21
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
z[0-9]+\\.d\\n 42
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
#0\\.0\\n 15
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
z[0-9]+\\.s\\n 30
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
#0\\.0\\n 21
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
z[0-9]+\\.d\\n 42
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
#0\\.0\\n 15
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
z[0-9]+\\.s\\n 30
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
#0\\.0\\n 21
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
z[0-9]+\\.d\\n 42
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
#0\\.0\\n 15
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
z[0-9]+\\.s\\n 30
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
#0\\.0\\n 21
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
z[0-9]+\\.d\\n 42
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
#0\\.0\\n 15
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
z[0-9]+\\.s\\n 30
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmne\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
#0\\.0\\n 21
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmne\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
z[0-9]+\\.d\\n 42
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times 

Re: [PATCH] combine: Allow combining two insns to two insns

2018-07-30 Thread Segher Boessenkool
On Tue, Jul 24, 2018 at 05:18:41PM +, Segher Boessenkool wrote:
> This patch allows combine to combine two insns into two.  This helps
> in many cases, by reducing instruction path length, and also allowing
> further combinations to happen.  PR85160 is a typical example of code
> that it can improve.
> 
> This patch does not allow such combinations if either of the original
> instructions was a simple move instruction.  In those cases combining
> the two instructions increases register pressure without improving the
> code.  With this move test, register pressure no longer increases
> noticeably, as far as I can tell.
> 
> (At first I also didn't allow either of the resulting insns to be a
> move instruction.  But that is actually a very good thing to have, as
> should have been obvious).
> 
> Tested for many months; tested on about 30 targets.
> 
> I'll commit this later this week if there are no objections.

Done now, with the testcase at 
https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01856.html .


Segher


Re: [PATCH] combine: Allow combining two insns to two insns

2018-07-25 Thread Segher Boessenkool
On Wed, Jul 25, 2018 at 09:47:31AM -0400, David Malcolm wrote:
> > +/* Return whether X is just a single set, with the source
> > +   a general_operand.  */
> > +static bool
> > +is_just_move (rtx x)
> > +{
> > +  if (INSN_P (x))
> > +    x = PATTERN (x);
> > +
> > +  return (GET_CODE (x) == SET && general_operand (SET_SRC (x), VOIDmode));
> > +}
> 
> If I'm reading it right, the patch only calls this function on i2 and
> i3, which are known to be rtx_insn *, rather than just rtx.

I used to also have

  is_just_move (XVECEXP (newpat, 0, 0))

etc.; during most of combine you do not have instructions, just patterns.
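Roughly, the two call forms it had to accept (a sketch; elt_is_move is a
made-up name):

  i2_was_move = is_just_move (i2);                 /* an insn: INSN_P holds,
                                                      so x becomes PATTERN (x) */
  bool elt_is_move = is_just_move (XVECEXP (newpat, 0, 0));
                                                   /* a bare SET in a PARALLEL */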


Segher


Re: [PATCH] combine: Allow combining two insns to two insns

2018-07-25 Thread David Malcolm
On Tue, 2018-07-24 at 17:18 +, Segher Boessenkool wrote:
> This patch allows combine to combine two insns into two.  This helps
> in many cases, by reducing instruction path length, and also allowing
> further combinations to happen.  PR85160 is a typical example of code
> that it can improve.
> 
> This patch does not allow such combinations if either of the original
> instructions was a simple move instruction.  In those cases combining
> the two instructions increases register pressure without improving the
> code.  With this move test, register pressure no longer increases
> noticeably, as far as I can tell.
> 
> (At first I also didn't allow either of the resulting insns to be a
> move instruction.  But that is actually a very good thing to have, as
> should have been obvious).
> 
> Tested for many months; tested on about 30 targets.
> 
> I'll commit this later this week if there are no objections.
> 
> 
> Segher
> 
> 
> 2018-07-24  Segher Boessenkool  
> 
>   PR rtl-optimization/85160
>   * combine.c (is_just_move): New function.
>   (try_combine): Allow combining two instructions into two if
> neither of
>   the original instructions was a move.
> 
> ---
>  gcc/combine.c | 22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/gcc/combine.c b/gcc/combine.c
> index cfe0f19..d64e84d 100644
> --- a/gcc/combine.c
> +++ b/gcc/combine.c
> @@ -2604,6 +2604,17 @@ can_split_parallel_of_n_reg_sets (rtx_insn *insn, int n)
>return true;
>  }
>  
> +/* Return whether X is just a single set, with the source
> +   a general_operand.  */
> +static bool
> +is_just_move (rtx x)
> +{
> +  if (INSN_P (x))
> +    x = PATTERN (x);
> +
> +  return (GET_CODE (x) == SET && general_operand (SET_SRC (x), VOIDmode));
> +}

If I'm reading it right, the patch only calls this function on i2 and
i3, which are known to be rtx_insn *, rather than just rtx.

Hence the only way in which GET_CODE (x) can be SET is if the INSN_P
pattern test sets x to PATTERN (x) immediately above: it can't be a SET
otherwise - but this isn't obvious from the code.

Can this function take an rtx_insn * instead?  Maybe something like:

/* Return whether INSN's pattern is just a single set, with the source
   a general_operand.  */
static bool
is_just_move_p (rtx_insn *insn)
{
  if (!INSN_P (insn))
    return false;

  rtx x = PATTERN (insn);
  return (GET_CODE (x) == SET && general_operand (SET_SRC (x), VOIDmode));
}

or similar?

[...snip...]

Thanks; I hope this is constructive.
Dave


Re: [PATCH] combine: Allow combining two insns to two insns

2018-07-25 Thread Richard Biener
On Wed, Jul 25, 2018 at 11:50 AM Segher Boessenkool
 wrote:
>
> On Wed, Jul 25, 2018 at 10:28:30AM +0200, Richard Biener wrote:
> > On Tue, Jul 24, 2018 at 7:18 PM Segher Boessenkool
> >  wrote:
> > >
> > > This patch allows combine to combine two insns into two.  This helps
> > > in many cases, by reducing instruction path length, and also allowing
> > > further combinations to happen.  PR85160 is a typical example of code
> > > that it can improve.
> > >
> > > This patch does not allow such combinations if either of the original
> > > instructions was a simple move instruction.  In those cases combining
> > > the two instructions increases register pressure without improving the
> > > code.  With this move test, register pressure no longer increases
> > > noticeably, as far as I can tell.
> > >
> > > (At first I also didn't allow either of the resulting insns to be a
> > > move instruction.  But that is actually a very good thing to have, as
> > > should have been obvious).
> > >
> > > Tested for many months; tested on about 30 targets.
> > >
> > > I'll commit this later this week if there are no objections.
> >
> > Sounds good - but, _any_ testcase?  Please! ;)
>
> I only have target-specific ones.

Works for me.

>  Most *simple* ones will already be
> optimised by current code (via 3->2 combination).  But I've now got one
> that trunk does not optimise, and it can be confirmed with looking at
> the resulting machine code even (no need to look at the combine dump,
> which is a very good thing).  And it is a proper thing to test even: it
> tests that some source is compiled to properly optimised machine code.
>
> Any other kind of testcase is worse than useless, of course.
>
> Testing that it results in working code isn't very feasible or useful either.
>
>
> Segher


Re: [PATCH] combine: Allow combining two insns to two insns

2018-07-25 Thread Segher Boessenkool
On Wed, Jul 25, 2018 at 10:28:30AM +0200, Richard Biener wrote:
> On Tue, Jul 24, 2018 at 7:18 PM Segher Boessenkool
>  wrote:
> >
> > This patch allows combine to combine two insns into two.  This helps
> > in many cases, by reducing instruction path length, and also allowing
> > further combinations to happen.  PR85160 is a typical example of code
> > that it can improve.
> >
> > This patch does not allow such combinations if either of the original
> > instructions was a simple move instruction.  In those cases combining
> > the two instructions increases register pressure without improving the
> > code.  With this move test, register pressure no longer increases
> > noticeably, as far as I can tell.
> >
> > (At first I also didn't allow either of the resulting insns to be a
> > move instruction.  But that is actually a very good thing to have, as
> > should have been obvious).
> >
> > Tested for many months; tested on about 30 targets.
> >
> > I'll commit this later this week if there are no objections.
> 
> Sounds good - but, _any_ testcase?  Please! ;)

I only have target-specific ones.  Most *simple* ones will already be
optimised by current code (via 3->2 combination).  But I've now got one
that trunk does not optimise, and it can be confirmed by looking at
the resulting machine code even (no need to look at the combine dump,
which is a very good thing).  And it is a proper thing to test even: it
tests that some source is compiled to properly optimised machine code.

Any other kind of testcase is worse than useless, of course.

Testing that it results in working code isn't very feasible or useful either.


Segher


Re: [PATCH] combine: Allow combining two insns to two insns

2018-07-25 Thread Richard Biener
On Tue, Jul 24, 2018 at 7:18 PM Segher Boessenkool
 wrote:
>
> This patch allows combine to combine two insns into two.  This helps
> in many cases, by reducing instruction path length, and also allowing
> further combinations to happen.  PR85160 is a typical example of code
> that it can improve.
>
> This patch does not allow such combinations if either of the original
> instructions was a simple move instruction.  In those cases combining
> the two instructions increases register pressure without improving the
> code.  With this move test, register pressure no longer increases
> noticeably, as far as I can tell.
>
> (At first I also didn't allow either of the resulting insns to be a
> move instruction.  But that is actually a very good thing to have, as
> should have been obvious).
>
> Tested for many months; tested on about 30 targets.
>
> I'll commit this later this week if there are no objections.

Sounds good - but, _any_ testcase?  Please! ;)

Richard.

>
> Segher
>
>
> 2018-07-24  Segher Boessenkool  
>
> PR rtl-optimization/85160
> * combine.c (is_just_move): New function.
> (try_combine): Allow combining two instructions into two if neither of
> the original instructions was a move.
>
> ---
>  gcc/combine.c | 22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/combine.c b/gcc/combine.c
> index cfe0f19..d64e84d 100644
> --- a/gcc/combine.c
> +++ b/gcc/combine.c
> @@ -2604,6 +2604,17 @@ can_split_parallel_of_n_reg_sets (rtx_insn *insn, int n)
>return true;
>  }
>
> +/* Return whether X is just a single set, with the source
> +   a general_operand.  */
> +static bool
> +is_just_move (rtx x)
> +{
> +  if (INSN_P (x))
> +    x = PATTERN (x);
> +
> +  return (GET_CODE (x) == SET && general_operand (SET_SRC (x), VOIDmode));
> +}
> +
>  /* Try to combine the insns I0, I1 and I2 into I3.
> Here I0, I1 and I2 appear earlier than I3.
> I0 and I1 can be zero; then we combine just I2 into I3, or I1 and I2 into
> @@ -2668,6 +2679,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>int swap_i2i3 = 0;
>int split_i2i3 = 0;
>int changed_i3_dest = 0;
> +  bool i2_was_move = false, i3_was_move = false;
>
>int maxreg;
>rtx_insn *temp_insn;
> @@ -3059,6 +3071,10 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>return 0;
>  }
>
> +  /* Record whether i2 and i3 are trivial moves.  */
> +  i2_was_move = is_just_move (i2);
> +  i3_was_move = is_just_move (i3);
> +
>/* Record whether I2DEST is used in I2SRC and similarly for the other
>   cases.  Knowing this will help in register status updating below.  */
>i2dest_in_i2src = reg_overlap_mentioned_p (i2dest, i2src);
> @@ -4014,8 +4030,10 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>&& XVECLEN (newpat, 0) == 2
>&& GET_CODE (XVECEXP (newpat, 0, 0)) == SET
>&& GET_CODE (XVECEXP (newpat, 0, 1)) == SET
> -  && (i1 || set_noop_p (XVECEXP (newpat, 0, 0))
> - || set_noop_p (XVECEXP (newpat, 0, 1)))
> +  && (i1
> +  || set_noop_p (XVECEXP (newpat, 0, 0))
> +  || set_noop_p (XVECEXP (newpat, 0, 1))
> +  || (!i2_was_move && !i3_was_move))
>&& GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != ZERO_EXTRACT
>&& GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != STRICT_LOW_PART
>&& GET_CODE (SET_DEST (XVECEXP (newpat, 0, 1))) != ZERO_EXTRACT
> --
> 1.8.3.1
>


Re: [PATCH] combine: Allow combining two insns to two insns

2018-07-24 Thread Jeff Law
On 07/24/2018 11:18 AM, Segher Boessenkool wrote:
> This patch allows combine to combine two insns into two.  This helps
> in many cases, by reducing instruction path length, and also allowing
> further combinations to happen.  PR85160 is a typical example of code
> that it can improve.
> 
> This patch does not allow such combinations if either of the original
> instructions was a simple move instruction.  In those cases combining
> the two instructions increases register pressure without improving the
> code.  With this move test, register pressure no longer increases
> noticeably, as far as I can tell.
> 
> (At first I also didn't allow either of the resulting insns to be a
> move instruction.  But that is actually a very good thing to have, as
> should have been obvious).
> 
> Tested for many months; tested on about 30 targets.
> 
> I'll commit this later this week if there are no objections.
> 
> 
> Segher
> 
> 
> 2018-07-24  Segher Boessenkool  
> 
>   PR rtl-optimization/85160
>   * combine.c (is_just_move): New function.
>   (try_combine): Allow combining two instructions into two if neither of
>   the original instructions was a move.
I've had several instances where a 2->2 combination would be useful
through the years.  I didn't save any of those examples though...  Good
to see the limitation being addressed.

jeff


[PATCH] combine: Allow combining two insns to two insns

2018-07-24 Thread Segher Boessenkool
This patch allows combine to combine two insns into two.  This helps
in many cases, by reducing instruction path length, and also allowing
further combinations to happen.  PR85160 is a typical example of code
that it can improve.

This patch does not allow such combinations if either of the original
instructions was a simple move instruction.  In those cases combining
the two instructions increases register pressure without improving the
code.  With this move test, register pressure no longer increases
noticeably, as far as I can tell.

(At first I also didn't allow either of the resulting insns to be a
move instruction.  But that is actually a very good thing to have, as
should have been obvious).

Tested for many months; tested on about 30 targets.

I'll commit this later this week if there are no objections.


Segher


2018-07-24  Segher Boessenkool  

PR rtl-optimization/85160
* combine.c (is_just_move): New function.
(try_combine): Allow combining two instructions into two if neither of
the original instructions was a move.

---
 gcc/combine.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/gcc/combine.c b/gcc/combine.c
index cfe0f19..d64e84d 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -2604,6 +2604,17 @@ can_split_parallel_of_n_reg_sets (rtx_insn *insn, int n)
   return true;
 }
 
+/* Return whether X is just a single set, with the source
+   a general_operand.  */
+static bool
+is_just_move (rtx x)
+{
+  if (INSN_P (x))
+    x = PATTERN (x);
+
+  return (GET_CODE (x) == SET && general_operand (SET_SRC (x), VOIDmode));
+}
+
 /* Try to combine the insns I0, I1 and I2 into I3.
Here I0, I1 and I2 appear earlier than I3.
I0 and I1 can be zero; then we combine just I2 into I3, or I1 and I2 into
@@ -2668,6 +2679,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
   int swap_i2i3 = 0;
   int split_i2i3 = 0;
   int changed_i3_dest = 0;
+  bool i2_was_move = false, i3_was_move = false;
 
   int maxreg;
   rtx_insn *temp_insn;
@@ -3059,6 +3071,10 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
   return 0;
 }
 
+  /* Record whether i2 and i3 are trivial moves.  */
+  i2_was_move = is_just_move (i2);
+  i3_was_move = is_just_move (i3);
+
   /* Record whether I2DEST is used in I2SRC and similarly for the other
  cases.  Knowing this will help in register status updating below.  */
   i2dest_in_i2src = reg_overlap_mentioned_p (i2dest, i2src);
@@ -4014,8 +4030,10 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
   && XVECLEN (newpat, 0) == 2
   && GET_CODE (XVECEXP (newpat, 0, 0)) == SET
   && GET_CODE (XVECEXP (newpat, 0, 1)) == SET
-  && (i1 || set_noop_p (XVECEXP (newpat, 0, 0))
- || set_noop_p (XVECEXP (newpat, 0, 1)))
+  && (i1
+  || set_noop_p (XVECEXP (newpat, 0, 0))
+  || set_noop_p (XVECEXP (newpat, 0, 1))
+  || (!i2_was_move && !i3_was_move))
   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != ZERO_EXTRACT
   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != STRICT_LOW_PART
   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 1))) != ZERO_EXTRACT
-- 
1.8.3.1