Re: [PATCH] combine: Allow combining two insns to two insns
On 07/24/2018 07:18 PM, Segher Boessenkool wrote:

> This patch allows combine to combine two insns into two.  This helps
> in many cases, by reducing instruction path length, and also allowing
> further combinations to happen.  PR85160 is a typical example of code
> that it can improve.

I cannot state with certainty that the improvements to our most notorious
routine between 8.2 and current trunk are solely due to this change, but
the differences are telling (see attached Fortran code - the analysis is
about the third loop).

Number of instructions for this loop (Skylake i9-7900):

gfortran82 -S -Ofast -march=native -mtune=native: 458 verint.s.82.loop3
gfortran90 -S -Ofast -march=native -mtune=native: 396 verint.s.90.loop3

But the most stunning difference is the use of the stack [ nn(rsp) ] -
see the attached files ...

--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news

# 1 "/scratch/hirlam/hl_home/MPI/lib/src/grdy/verint.F"
# 1 ""
# 1 ""
# 1 "/scratch/hirlam/hl_home/MPI/lib/src/grdy/verint.F"
c Library:grdy $RCSfile$, $Revision: 7536 $
c checked in by $Author: ovignes $ at $Date: 2009-12-18 14:23:36 +0100 (Fri, 18 Dec 2009) $
c $State$, $Locker$
c $Log$
c Revision 1.3  1999/04/22 09:30:45  DagBjoerge
c MPP code
c
c Revision 1.2  1999/03/09 10:23:13  GerardCats
c Add SGI paralllellisation directives DOACROSS
c
c Revision 1.1  1996/09/06 13:12:18  GCats
c Created from grdy.apl, 1 version 2.6.1, by Gerard Cats
c
      SUBROUTINE VERINT (
     I   KLON  , KLAT  , KLEV  , KINT  , KHALO
     I , KLON1 , KLON2 , KLAT1 , KLAT2
     I , KP    , KQ    , KR
     R , PARG  , PRES
     R , PALFH , PBETH
     R , PALFA , PBETA , PGAMA )
C
C***
C
C     VERINT - THREE DIMENSIONAL INTERPOLATION
C
C     PURPOSE:
C
C     THREE DIMENSIONAL INTERPOLATION
C
C     INPUT PARAMETERS:
C
C     KLON   NUMBER OF GRIDPOINTS IN X-DIRECTION
C     KLAT   NUMBER OF GRIDPOINTS IN Y-DIRECTION
C     KLEV   NUMBER OF VERTICAL LEVELS
C     KINT   TYPE OF INTERPOLATION
C            = 1 - LINEAR
C            = 2 - QUADRATIC
C            = 3 - CUBIC
C            = 4 - MIXED CUBIC/LINEAR
C     KLON1  FIRST GRIDPOINT IN X-DIRECTION
C     KLON2  LAST GRIDPOINT IN X-DIRECTION
C     KLAT1  FIRST GRIDPOINT IN Y-DIRECTION
C     KLAT2  LAST GRIDPOINT IN Y-DIRECTION
C     KP     ARRAY OF INDEXES FOR HORIZONTAL DISPLACEMENTS
C     KQ     ARRAY OF INDEXES FOR HORIZONTAL DISPLACEMENTS
C     KR     ARRAY OF INDEXES FOR VERTICAL DISPLACEMENTS
C     PARG   ARRAY OF ARGUMENTS
C     PALFH  ALFA HAT
C     PBETH  BETA HAT
C     PALFA  ARRAY OF WEIGHTS IN X-DIRECTION
C     PBETA  ARRAY OF WEIGHTS IN Y-DIRECTION
C     PGAMA  ARRAY OF WEIGHTS IN VERTICAL DIRECTION
C
C     OUTPUT PARAMETERS:
C
C     PRES   INTERPOLATED FIELD
C
C     HISTORY:
C
C     J.E. HAUGEN 1 1992
C
C***
C
      IMPLICIT NONE
C
      INTEGER KLON  , KLAT  , KLEV  , KINT , KHALO,
     I        KLON1 , KLON2 , KLAT1 , KLAT2
C
      INTEGER KP(KLON,KLAT), KQ(KLON,KLAT), KR(KLON,KLAT)
      REAL    PARG(2-KHALO:KLON+KHALO-1,2-KHALO:KLAT+KHALO-1,KLEV) ,
     R        PRES(KLON,KLAT)    ,
     R        PALFH(KLON,KLAT)   , PBETH(KLON,KLAT)   ,
     R        PALFA(KLON,KLAT,4) , PBETA(KLON,KLAT,4) ,
     R        PGAMA(KLON,KLAT,4)
C
      INTEGER JX, JY, IDX, IDY, ILEV
      REAL    Z1MAH, Z1MBH
C
      IF (KINT.EQ.1) THEN
C
C     LINEAR INTERPOLATION
C
      DO JY = KLAT1,KLAT2
      DO JX = KLON1,KLON2
         IDX  = KP(JX,JY)
         IDY  = KQ(JX,JY)
         ILEV = KR(JX,JY)
C
         PRES(JX,JY) = PGAMA(JX,JY,1)*(
     +     PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV-1)
     +                    + PALFA(JX,JY,2)*PARG(IDX  ,IDY-1,ILEV-1) )
     +   + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY  ,ILEV-1)
     +                    + PALFA(JX,JY,2)*PARG(IDX  ,IDY  ,ILEV-1) ) )
     +                + PGAMA(JX,JY,2)*(
     +     PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV  )
     +                    + PALFA(JX,JY,2)*PARG(IDX  ,IDY-1,ILEV  ) )
     +   + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY  ,ILEV  )
     +                    + PALFA(JX,JY,2)*PARG(IDX  ,IDY  ,ILEV  ) ) )
      ENDDO
      ENDDO
C
      ELSEIF (KINT.EQ.2) THEN
C
C     QUADRATIC INTERPOLATION
C
      DO JY = KLAT1,KLAT2
      DO JX = KLON1,KLON2
         IDX  = KP(JX,JY)
         IDY  = KQ(JX,JY)
         ILEV = KR(JX,JY)
C
         PRES(JX,JY) = PGAMA(JX,JY,1)*(
     +     PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV-1)
     +                    + PALFA(JX,JY,2)*PARG(IDX  ,IDY-1,ILEV-1)
     +                    + PALFA(JX,JY,3)*PARG(IDX+1,IDY-1,ILEV-1) )
     +   + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY  ,ILEV-1)
     +
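For reference, the KINT=1 branch of the attached routine is a standard trilinear interpolation; in the routine's own naming (my transcription of the code above, not part of the original mail), the innermost statement evaluates:

```latex
\[
\mathrm{PRES}(jx,jy) \;=\;
\sum_{k=1}^{2} \mathrm{PGAMA}(jx,jy,k)
\sum_{j=1}^{2} \mathrm{PBETA}(jx,jy,j)
\sum_{i=1}^{2} \mathrm{PALFA}(jx,jy,i)\,
\mathrm{PARG}\bigl(\mathrm{IDX}+i-2,\;\mathrm{IDY}+j-2,\;\mathrm{ILEV}+k-2\bigr)
\]
```

i.e. eight weighted neighbour loads per grid point, which is why combine's ability to shorten the address-arithmetic instruction paths shows up so clearly in this loop.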
Re: [PATCH] combine: Allow combining two insns to two insns
On Wed, 1 Aug 2018 at 11:40, Segher Boessenkool wrote:
>
> On Wed, Aug 01, 2018 at 10:27:31AM +0200, Christophe Lyon wrote:
> > On Tue, 31 Jul 2018 at 15:57, Segher Boessenkool
> > wrote:
> > > On Tue, Jul 31, 2018 at 02:34:06PM +0200, Christophe Lyon wrote:
> > > > Since this was committed, I've noticed regressions
> > > > on aarch64:
> > > > FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:
> > >
> > > This went from
> > >    and   w0, w0, 255
> > >    lsl   w1, w0, 8
> > >    orr   w0, w1, w0, lsl 20
> > >    ret
> > > to
> > >    and   w1, w0, 255
> > >    ubfiz w0, w0, 8, 8
> > >    orr   w0, w0, w1, lsl 20
> > >    ret
> > > so it's neither an improvement nor a regression, just different code.
> > > The testcase wants no ANDs in the RTL.
> >
> > I didn't try to manually regenerate the code before and after the patch,
> > but if there was "and w0, w0, 255" before the patch, why did the test
> > pass?
>
> It wasn't an AND in RTL (it was a ZERO_EXTEND).
>
Indeed, I missed the -dP in the options.

> > > > on arm-none-linux-gnueabi
> > > > FAIL: gfortran.dg/actual_array_constructor_1.f90 -O1 execution test
> > >
> > > That sounds bad.  Open a PR, maybe?
> > >
> > I've just filed PR86771
>
> Thanks.
>
>
> Segher
Re: [PATCH] combine: Allow combining two insns to two insns
On Wed, Aug 01, 2018 at 10:27:31AM +0200, Christophe Lyon wrote:
> On Tue, 31 Jul 2018 at 15:57, Segher Boessenkool
> wrote:
> > On Tue, Jul 31, 2018 at 02:34:06PM +0200, Christophe Lyon wrote:
> > > Since this was committed, I've noticed regressions
> > > on aarch64:
> > > FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:
> >
> > This went from
> >    and   w0, w0, 255
> >    lsl   w1, w0, 8
> >    orr   w0, w1, w0, lsl 20
> >    ret
> > to
> >    and   w1, w0, 255
> >    ubfiz w0, w0, 8, 8
> >    orr   w0, w0, w1, lsl 20
> >    ret
> > so it's neither an improvement nor a regression, just different code.
> > The testcase wants no ANDs in the RTL.
>
> I didn't try to manually regenerate the code before and after the patch,
> but if there was "and w0, w0, 255" before the patch, why did the test
> pass?

It wasn't an AND in RTL (it was a ZERO_EXTEND).

> > > on arm-none-linux-gnueabi
> > > FAIL: gfortran.dg/actual_array_constructor_1.f90 -O1 execution test
> >
> > That sounds bad.  Open a PR, maybe?
> >
> I've just filed PR86771

Thanks.


Segher
Re: [PATCH] combine: Allow combining two insns to two insns
On Tue, 31 Jul 2018 at 15:57, Segher Boessenkool wrote:
>
> Hi Christophe,
>
> On Tue, Jul 31, 2018 at 02:34:06PM +0200, Christophe Lyon wrote:
> > Since this was committed, I've noticed regressions
> > on aarch64:
> > FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:
>
> This went from
>    and   w0, w0, 255
>    lsl   w1, w0, 8
>    orr   w0, w1, w0, lsl 20
>    ret
> to
>    and   w1, w0, 255
>    ubfiz w0, w0, 8, 8
>    orr   w0, w0, w1, lsl 20
>    ret
> so it's neither an improvement nor a regression, just different code.
> The testcase wants no ANDs in the RTL.

I didn't try to manually regenerate the code before and after the patch,
but if there was "and w0, w0, 255" before the patch, why did the test
pass?

> > on arm-none-linux-gnueabi
> > FAIL: gfortran.dg/actual_array_constructor_1.f90 -O1 execution test
>
> That sounds bad.  Open a PR, maybe?
>
I've just filed PR86771

> > On aarch64, I've also noticed a few others regressions but I'm not yet
> > 100% sure it's caused by this patch (bisect running):
> > gcc.target/aarch64/ashltidisi.c scan-assembler-times asr 4
>
> ushift_53_i:
> -   uxtw  x1, w0
> -   lsl   x0, x1, 53
> -   lsr   x1, x1, 11
> +   lsr   w1, w0, 11
> +   lsl   x0, x0, 53
>     ret
>
> shift_53_i:
> -   sxtw  x1, w0
> -   lsl   x0, x1, 53
> -   asr   x1, x1, 11
> +   sbfx  x1, x0, 11, 21
> +   lsl   x0, x0, 53
>     ret
>
> Both are improvements afais.  The number of asr insns changes, sure.
>
Looks like an easy "fix" would be to change to "scan-assembler-times asr 3"
but maybe the aarch64 maintainers want to add more checks here (lsl/lsr
counts)

Don't you include arm/aarch64 in the 30 targets you used for testing?

> > gcc.target/aarch64/sve/var_stride_2.c -march=armv8.2-a+sve
> > scan-assembler-times \\tadd\\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 10\\n 2
>
> Skipping all the SVE tests, sorry.  Richard says they look like
> improvements, and exactly of the expected kind. :-)
>
>
> Segher
Re: [PATCH] combine: Allow combining two insns to two insns
On 31/07/18 14:57, Segher Boessenkool wrote:
> Hi Christophe,
>
> On Tue, Jul 31, 2018 at 02:34:06PM +0200, Christophe Lyon wrote:
>> Since this was committed, I've noticed regressions
>> on aarch64:
>> FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:
>
> This went from
>    and   w0, w0, 255
>    lsl   w1, w0, 8

These are sequentially dependent.

>    orr   w0, w1, w0, lsl 20
>    ret
> to
>    and   w1, w0, 255
>    ubfiz w0, w0, 8, 8

These can run in parallel.  So the change is a good one!  On a
super-scalar machine we save a cycle.

R.

>    orr   w0, w0, w1, lsl 20
>    ret
> so it's neither an improvement nor a regression, just different code.
> The testcase wants no ANDs in the RTL.
>
>
>> on arm-none-linux-gnueabi
>> FAIL: gfortran.dg/actual_array_constructor_1.f90 -O1 execution test
>
> That sounds bad.  Open a PR, maybe?
>
>
>> On aarch64, I've also noticed a few others regressions but I'm not yet
>> 100% sure it's caused by this patch (bisect running):
>> gcc.target/aarch64/ashltidisi.c scan-assembler-times asr 4
>
> ushift_53_i:
> -   uxtw  x1, w0
> -   lsl   x0, x1, 53
> -   lsr   x1, x1, 11
> +   lsr   w1, w0, 11
> +   lsl   x0, x0, 53
>     ret
>
> shift_53_i:
> -   sxtw  x1, w0
> -   lsl   x0, x1, 53
> -   asr   x1, x1, 11
> +   sbfx  x1, x0, 11, 21
> +   lsl   x0, x0, 53
>     ret
>
> Both are improvements afais.  The number of asr insns changes, sure.
>
>
>> gcc.target/aarch64/sve/var_stride_2.c -march=armv8.2-a+sve
>> scan-assembler-times \\tadd\\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 10\\n 2
>
> Skipping all the SVE tests, sorry.  Richard says they look like
> improvements, and exactly of the expected kind. :-)
>
>
> Segher
>
Re: [PATCH] combine: Allow combining two insns to two insns
On Tue, Jul 31, 2018 at 05:39:37AM -0700, H.J. Lu wrote:
> For
>
> ---
> #define N 16
> float f[N];
> double d[N];
> int n[N];
>
> __attribute__((noinline)) void
> f3 (void)
> {
>   int i;
>   for (i = 0; i < N; i++)
>     d[i] = f[i];
> }
> ---
>
> r263067 improved -O3 -mavx2 -mtune=generic -m64 from
>
> 	.cfi_startproc
> 	vmovaps	f(%rip), %xmm2
> 	vmovaps	f+32(%rip), %xmm3
> 	vinsertf128	$0x1, f+16(%rip), %ymm2, %ymm0
> 	vcvtps2pd	%xmm0, %ymm1
> 	vextractf128	$0x1, %ymm0, %xmm0
> 	vmovaps	%xmm1, d(%rip)
> 	vextractf128	$0x1, %ymm1, d+16(%rip)
> 	vcvtps2pd	%xmm0, %ymm0
> 	vmovaps	%xmm0, d+32(%rip)
> 	vextractf128	$0x1, %ymm0, d+48(%rip)
> 	vinsertf128	$0x1, f+48(%rip), %ymm3, %ymm0
> 	vcvtps2pd	%xmm0, %ymm1
> 	vextractf128	$0x1, %ymm0, %xmm0
> 	vmovaps	%xmm1, d+64(%rip)
> 	vextractf128	$0x1, %ymm1, d+80(%rip)
> 	vcvtps2pd	%xmm0, %ymm0
> 	vmovaps	%xmm0, d+96(%rip)
> 	vextractf128	$0x1, %ymm0, d+112(%rip)
> 	vzeroupper
> 	ret
> 	.cfi_endproc
>
> to
>
> 	.cfi_startproc
> 	vcvtps2pd	f(%rip), %ymm0
> 	vmovaps	%xmm0, d(%rip)
> 	vextractf128	$0x1, %ymm0, d+16(%rip)
> 	vcvtps2pd	f+16(%rip), %ymm0
> 	vmovaps	%xmm0, d+32(%rip)
> 	vextractf128	$0x1, %ymm0, d+48(%rip)
> 	vcvtps2pd	f+32(%rip), %ymm0
> 	vextractf128	$0x1, %ymm0, d+80(%rip)
> 	vmovaps	%xmm0, d+64(%rip)
> 	vcvtps2pd	f+48(%rip), %ymm0
> 	vextractf128	$0x1, %ymm0, d+112(%rip)
> 	vmovaps	%xmm0, d+96(%rip)
> 	vzeroupper
> 	ret
> 	.cfi_endproc

I cannot really read AVX, but that looks like better code alright :-)


Segher
Re: [PATCH] combine: Allow combining two insns to two insns
Hi Christophe,

On Tue, Jul 31, 2018 at 02:34:06PM +0200, Christophe Lyon wrote:
> Since this was committed, I've noticed regressions
> on aarch64:
> FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:

This went from
   and   w0, w0, 255
   lsl   w1, w0, 8
   orr   w0, w1, w0, lsl 20
   ret
to
   and   w1, w0, 255
   ubfiz w0, w0, 8, 8
   orr   w0, w0, w1, lsl 20
   ret
so it's neither an improvement nor a regression, just different code.
The testcase wants no ANDs in the RTL.

> on arm-none-linux-gnueabi
> FAIL: gfortran.dg/actual_array_constructor_1.f90 -O1 execution test

That sounds bad.  Open a PR, maybe?

> On aarch64, I've also noticed a few others regressions but I'm not yet
> 100% sure it's caused by this patch (bisect running):
> gcc.target/aarch64/ashltidisi.c scan-assembler-times asr 4

ushift_53_i:
-   uxtw  x1, w0
-   lsl   x0, x1, 53
-   lsr   x1, x1, 11
+   lsr   w1, w0, 11
+   lsl   x0, x0, 53
    ret

shift_53_i:
-   sxtw  x1, w0
-   lsl   x0, x1, 53
-   asr   x1, x1, 11
+   sbfx  x1, x0, 11, 21
+   lsl   x0, x0, 53
    ret

Both are improvements afais.  The number of asr insns changes, sure.

> gcc.target/aarch64/sve/var_stride_2.c -march=armv8.2-a+sve
> scan-assembler-times \\tadd\\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 10\\n 2

Skipping all the SVE tests, sorry.  Richard says they look like
improvements, and exactly of the expected kind. :-)


Segher
Re: [PATCH] combine: Allow combining two insns to two insns
Christophe Lyon writes:
> On Mon, 30 Jul 2018 at 18:09, Segher Boessenkool
> wrote:
>>
>> On Tue, Jul 24, 2018 at 05:18:41PM +0000, Segher Boessenkool wrote:
>> > This patch allows combine to combine two insns into two.  This helps
>> > in many cases, by reducing instruction path length, and also allowing
>> > further combinations to happen.  PR85160 is a typical example of code
>> > that it can improve.
>> >
>> > This patch does not allow such combinations if either of the original
>> > instructions was a simple move instruction.  In those cases combining
>> > the two instructions increases register pressure without improving the
>> > code.  With this move test register pressure does no longer increase
>> > noticably as far as I can tell.
>> >
>> > (At first I also didn't allow either of the resulting insns to be a
>> > move instruction.  But that is actually a very good thing to have, as
>> > should have been obvious).
>> >
>> > Tested for many months; tested on about 30 targets.
>> >
>> > I'll commit this later this week if there are no objections.
>>
>> Done now, with the testcase at
>> https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01856.html .
>>
>
> Hi,
>
> Since this was committed, I've noticed regressions
> on aarch64:
> FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:
>
> on arm-none-linux-gnueabi
> FAIL: gfortran.dg/actual_array_constructor_1.f90 -O1 execution test
>
> On aarch64, I've also noticed a few others regressions but I'm not yet
> 100% sure it's caused by this patch (bisect running):
> gcc.target/aarch64/ashltidisi.c scan-assembler-times asr 4
> gcc.target/aarch64/sve/var_stride_2.c -march=armv8.2-a+sve
> scan-assembler-times \\tadd\\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 10\\n 2
> gcc.target/aarch64/sve/var_stride_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tlsl\\tx[0-9]+, x[0-9]+, 10\\n 2
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmeq\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> #0\\.0\\n 7
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmeq\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> z[0-9]+\\.d\\n 14
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmeq\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> #0\\.0\\n 5
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmeq\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> z[0-9]+\\.s\\n 10
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> #0\\.0\\n 21
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> z[0-9]+\\.d\\n 42
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> #0\\.0\\n 15
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> z[0-9]+\\.s\\n 30
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> #0\\.0\\n 21
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> z[0-9]+\\.d\\n 42
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> #0\\.0\\n 15
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> z[0-9]+\\.s\\n 30
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> #0\\.0\\n 21
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> z[0-9]+\\.d\\n 42
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> #0\\.0\\n 15
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> z[0-9]+\\.s\\n 30
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> #0\\.0\\n 21
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> z[0-9]+\\.d\\n 42
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> #0\\.0\\n 15
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s,
> z[0-9]+\\.s\\n 30
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
> scan-assembler-times \\tfcmne\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d,
> #0\\.0\\n 21
>
Re: [PATCH] combine: Allow combining two insns to two insns
On Wed, Jul 25, 2018 at 1:28 AM, Richard Biener wrote:
> On Tue, Jul 24, 2018 at 7:18 PM Segher Boessenkool
> wrote:
>>
>> This patch allows combine to combine two insns into two.  This helps
>> in many cases, by reducing instruction path length, and also allowing
>> further combinations to happen.  PR85160 is a typical example of code
>> that it can improve.
>>
>> This patch does not allow such combinations if either of the original
>> instructions was a simple move instruction.  In those cases combining
>> the two instructions increases register pressure without improving the
>> code.  With this move test register pressure does no longer increase
>> noticably as far as I can tell.
>>
>> (At first I also didn't allow either of the resulting insns to be a
>> move instruction.  But that is actually a very good thing to have, as
>> should have been obvious).
>>
>> Tested for many months; tested on about 30 targets.
>>
>> I'll commit this later this week if there are no objections.
>
> Sounds good - but, _any_ testcase?  Please! ;)
>

Here is a testcase:

For
---
#define N 16
float f[N];
double d[N];
int n[N];

__attribute__((noinline)) void
f3 (void)
{
  int i;
  for (i = 0; i < N; i++)
    d[i] = f[i];
}
---

r263067 improved -O3 -mavx2 -mtune=generic -m64 from

	.cfi_startproc
	vmovaps	f(%rip), %xmm2
	vmovaps	f+32(%rip), %xmm3
	vinsertf128	$0x1, f+16(%rip), %ymm2, %ymm0
	vcvtps2pd	%xmm0, %ymm1
	vextractf128	$0x1, %ymm0, %xmm0
	vmovaps	%xmm1, d(%rip)
	vextractf128	$0x1, %ymm1, d+16(%rip)
	vcvtps2pd	%xmm0, %ymm0
	vmovaps	%xmm0, d+32(%rip)
	vextractf128	$0x1, %ymm0, d+48(%rip)
	vinsertf128	$0x1, f+48(%rip), %ymm3, %ymm0
	vcvtps2pd	%xmm0, %ymm1
	vextractf128	$0x1, %ymm0, %xmm0
	vmovaps	%xmm1, d+64(%rip)
	vextractf128	$0x1, %ymm1, d+80(%rip)
	vcvtps2pd	%xmm0, %ymm0
	vmovaps	%xmm0, d+96(%rip)
	vextractf128	$0x1, %ymm0, d+112(%rip)
	vzeroupper
	ret
	.cfi_endproc

to

	.cfi_startproc
	vcvtps2pd	f(%rip), %ymm0
	vmovaps	%xmm0, d(%rip)
	vextractf128	$0x1, %ymm0, d+16(%rip)
	vcvtps2pd	f+16(%rip), %ymm0
	vmovaps	%xmm0, d+32(%rip)
	vextractf128	$0x1, %ymm0, d+48(%rip)
	vcvtps2pd	f+32(%rip), %ymm0
	vextractf128	$0x1, %ymm0, d+80(%rip)
	vmovaps	%xmm0, d+64(%rip)
	vcvtps2pd	f+48(%rip), %ymm0
	vextractf128	$0x1, %ymm0, d+112(%rip)
	vmovaps	%xmm0, d+96(%rip)
	vzeroupper
	ret
	.cfi_endproc

This is:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86752

H.J.
Re: [PATCH] combine: Allow combining two insns to two insns
On Mon, 30 Jul 2018 at 18:09, Segher Boessenkool wrote:
>
> On Tue, Jul 24, 2018 at 05:18:41PM +0000, Segher Boessenkool wrote:
> > This patch allows combine to combine two insns into two.  This helps
> > in many cases, by reducing instruction path length, and also allowing
> > further combinations to happen.  PR85160 is a typical example of code
> > that it can improve.
> >
> > This patch does not allow such combinations if either of the original
> > instructions was a simple move instruction.  In those cases combining
> > the two instructions increases register pressure without improving the
> > code.  With this move test register pressure does no longer increase
> > noticably as far as I can tell.
> >
> > (At first I also didn't allow either of the resulting insns to be a
> > move instruction.  But that is actually a very good thing to have, as
> > should have been obvious).
> >
> > Tested for many months; tested on about 30 targets.
> >
> > I'll commit this later this week if there are no objections.
>
> Done now, with the testcase at
> https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01856.html .
>

Hi,

Since this was committed, I've noticed regressions
on aarch64:
FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:

on arm-none-linux-gnueabi
FAIL: gfortran.dg/actual_array_constructor_1.f90 -O1 execution test

On aarch64, I've also noticed a few others regressions but I'm not yet
100% sure it's caused by this patch (bisect running):
gcc.target/aarch64/ashltidisi.c scan-assembler-times asr 4
gcc.target/aarch64/sve/var_stride_2.c -march=armv8.2-a+sve
scan-assembler-times \\tadd\\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 10\\n 2
gcc.target/aarch64/sve/var_stride_4.c -march=armv8.2-a+sve
scan-assembler-times \\tlsl\\tx[0-9]+, x[0-9]+, 10\\n 2
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmeq\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 7
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmeq\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 14
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmeq\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 5
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmeq\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 10
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmne\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times \\tfcmne\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve
scan-assembler-times
Re: [PATCH] combine: Allow combining two insns to two insns
On Tue, Jul 24, 2018 at 05:18:41PM +0000, Segher Boessenkool wrote:
> This patch allows combine to combine two insns into two.  This helps
> in many cases, by reducing instruction path length, and also allowing
> further combinations to happen.  PR85160 is a typical example of code
> that it can improve.
>
> This patch does not allow such combinations if either of the original
> instructions was a simple move instruction.  In those cases combining
> the two instructions increases register pressure without improving the
> code.  With this move test register pressure does no longer increase
> noticably as far as I can tell.
>
> (At first I also didn't allow either of the resulting insns to be a
> move instruction.  But that is actually a very good thing to have, as
> should have been obvious).
>
> Tested for many months; tested on about 30 targets.
>
> I'll commit this later this week if there are no objections.

Done now, with the testcase at
https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01856.html .


Segher
Re: [PATCH] combine: Allow combining two insns to two insns
On Wed, Jul 25, 2018 at 09:47:31AM -0400, David Malcolm wrote:
> > +/* Return whether X is just a single set, with the source
> > +   a general_operand.  */
> > +static bool
> > +is_just_move (rtx x)
> > +{
> > +  if (INSN_P (x))
> > +    x = PATTERN (x);
> > +
> > +  return (GET_CODE (x) == SET && general_operand (SET_SRC (x), VOIDmode));
> > +}
>
> If I'm reading it right, the patch only calls this function on i2 and
> i3, which are known to be rtx_insn *, rather than just rtx.

I used to also have is_just_move (XVECEXP (newpat, 0, 0)) etc.; during
most of combine you do not have instructions, just patterns.


Segher
Re: [PATCH] combine: Allow combining two insns to two insns
On Tue, 2018-07-24 at 17:18 +0000, Segher Boessenkool wrote:
> This patch allows combine to combine two insns into two.  This helps
> in many cases, by reducing instruction path length, and also allowing
> further combinations to happen.  PR85160 is a typical example of code
> that it can improve.
>
> This patch does not allow such combinations if either of the original
> instructions was a simple move instruction.  In those cases combining
> the two instructions increases register pressure without improving the
> code.  With this move test register pressure does no longer increase
> noticably as far as I can tell.
>
> (At first I also didn't allow either of the resulting insns to be a
> move instruction.  But that is actually a very good thing to have, as
> should have been obvious).
>
> Tested for many months; tested on about 30 targets.
>
> I'll commit this later this week if there are no objections.
>
>
> Segher
>
>
> 2018-07-24  Segher Boessenkool
>
> 	PR rtl-optimization/85160
> 	* combine.c (is_just_move): New function.
> 	(try_combine): Allow combining two instructions into two if
> 	neither of the original instructions was a move.
>
> ---
>  gcc/combine.c | 22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/combine.c b/gcc/combine.c
> index cfe0f19..d64e84d 100644
> --- a/gcc/combine.c
> +++ b/gcc/combine.c
> @@ -2604,6 +2604,17 @@ can_split_parallel_of_n_reg_sets (rtx_insn *insn, int n)
>    return true;
>  }
>
> +/* Return whether X is just a single set, with the source
> +   a general_operand.  */
> +static bool
> +is_just_move (rtx x)
> +{
> +  if (INSN_P (x))
> +    x = PATTERN (x);
> +
> +  return (GET_CODE (x) == SET && general_operand (SET_SRC (x), VOIDmode));
> +}

If I'm reading it right, the patch only calls this function on i2 and
i3, which are known to be rtx_insn *, rather than just rtx.

Hence the only way in which GET_CODE (x) can be SET is if the INSN_P
test sets x to PATTERN (x) immediately above: it can't be a SET
otherwise - but this isn't obvious from the code.

Can this function take an rtx_insn * instead?  Maybe something like:

/* Return whether INSN's pattern is just a single set, with the source
   a general_operand.  */
static bool
is_just_move_p (rtx_insn *insn)
{
  if (!INSN_P (insn))
    return false;
  rtx x = PATTERN (insn);
  return (GET_CODE (x) == SET && general_operand (SET_SRC (x), VOIDmode));
}

or similar?

[...snip...]

Thanks; I hope this is constructive.
Dave
Re: [PATCH] combine: Allow combining two insns to two insns
On Wed, Jul 25, 2018 at 11:50 AM Segher Boessenkool wrote:
>
> On Wed, Jul 25, 2018 at 10:28:30AM +0200, Richard Biener wrote:
> > On Tue, Jul 24, 2018 at 7:18 PM Segher Boessenkool
> > wrote:
> > >
> > > This patch allows combine to combine two insns into two.  This helps
> > > in many cases, by reducing instruction path length, and also allowing
> > > further combinations to happen.  PR85160 is a typical example of code
> > > that it can improve.
> > >
> > > This patch does not allow such combinations if either of the original
> > > instructions was a simple move instruction.  In those cases combining
> > > the two instructions increases register pressure without improving the
> > > code.  With this move test register pressure does no longer increase
> > > noticably as far as I can tell.
> > >
> > > (At first I also didn't allow either of the resulting insns to be a
> > > move instruction.  But that is actually a very good thing to have, as
> > > should have been obvious).
> > >
> > > Tested for many months; tested on about 30 targets.
> > >
> > > I'll commit this later this week if there are no objections.
> >
> > Sounds good - but, _any_ testcase?  Please! ;)
>
> I only have target-specific ones.

Works for me.

> Most *simple* ones will already be
> optimised by current code (via 3->2 combination).  But I've now got one
> that trunk does not optimise, and it can be confirmed with looking at
> the resulting machine code even (no need to look at the combine dump,
> which is a very good thing).  And it is a proper thing to test even: it
> tests that some source is compiled to properly optimised machine code.
>
> Any other kind of testcase is worse than useless, of course.
>
> Testing it results in working code isn't very feasible or useful either.
>
>
> Segher
Re: [PATCH] combine: Allow combining two insns to two insns
On Wed, Jul 25, 2018 at 10:28:30AM +0200, Richard Biener wrote:
> On Tue, Jul 24, 2018 at 7:18 PM Segher Boessenkool
> wrote:
> >
> > This patch allows combine to combine two insns into two.  This helps
> > in many cases, by reducing instruction path length, and also allowing
> > further combinations to happen.  PR85160 is a typical example of code
> > that it can improve.
> >
> > This patch does not allow such combinations if either of the original
> > instructions was a simple move instruction.  In those cases combining
> > the two instructions increases register pressure without improving the
> > code.  With this move test register pressure does no longer increase
> > noticably as far as I can tell.
> >
> > (At first I also didn't allow either of the resulting insns to be a
> > move instruction.  But that is actually a very good thing to have, as
> > should have been obvious).
> >
> > Tested for many months; tested on about 30 targets.
> >
> > I'll commit this later this week if there are no objections.
>
> Sounds good - but, _any_ testcase?  Please! ;)

I only have target-specific ones.  Most *simple* ones will already be
optimised by current code (via 3->2 combination).  But I've now got one
that trunk does not optimise, and it can be confirmed with looking at
the resulting machine code even (no need to look at the combine dump,
which is a very good thing).  And it is a proper thing to test even: it
tests that some source is compiled to properly optimised machine code.

Any other kind of testcase is worse than useless, of course.

Testing it results in working code isn't very feasible or useful either.


Segher
Re: [PATCH] combine: Allow combining two insns to two insns
On Tue, Jul 24, 2018 at 7:18 PM Segher Boessenkool wrote:
>
> This patch allows combine to combine two insns into two.  This helps
> in many cases, by reducing instruction path length, and also allowing
> further combinations to happen.  PR85160 is a typical example of code
> that it can improve.
>
> This patch does not allow such combinations if either of the original
> instructions was a simple move instruction.  In those cases combining
> the two instructions increases register pressure without improving the
> code.  With this move test, register pressure no longer increases
> noticeably as far as I can tell.
>
> (At first I also didn't allow either of the resulting insns to be a
> move instruction.  But that is actually a very good thing to have, as
> should have been obvious.)
>
> Tested for many months; tested on about 30 targets.
>
> I'll commit this later this week if there are no objections.

Sounds good - but, _any_ testcase?  Please! ;)

Richard.

>
> Segher
>
>
> 2018-07-24  Segher Boessenkool
>
> 	PR rtl-optimization/85160
> 	* combine.c (is_just_move): New function.
> 	(try_combine): Allow combining two instructions into two if neither of
> 	the original instructions was a move.
>
> ---
>  gcc/combine.c | 22 ++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/combine.c b/gcc/combine.c
> index cfe0f19..d64e84d 100644
> --- a/gcc/combine.c
> +++ b/gcc/combine.c
> @@ -2604,6 +2604,17 @@ can_split_parallel_of_n_reg_sets (rtx_insn *insn, int n)
>    return true;
>  }
>
> +/* Return whether X is just a single set, with the source
> +   a general_operand.  */
> +static bool
> +is_just_move (rtx x)
> +{
> +  if (INSN_P (x))
> +    x = PATTERN (x);
> +
> +  return (GET_CODE (x) == SET && general_operand (SET_SRC (x), VOIDmode));
> +}
> +
>  /* Try to combine the insns I0, I1 and I2 into I3.
>     Here I0, I1 and I2 appear earlier than I3.
>     I0 and I1 can be zero; then we combine just I2 into I3, or I1 and I2 into
> @@ -2668,6 +2679,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>    int swap_i2i3 = 0;
>    int split_i2i3 = 0;
>    int changed_i3_dest = 0;
> +  bool i2_was_move = false, i3_was_move = false;
>
>    int maxreg;
>    rtx_insn *temp_insn;
> @@ -3059,6 +3071,10 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>        return 0;
>      }
>
> +  /* Record whether i2 and i3 are trivial moves.  */
> +  i2_was_move = is_just_move (i2);
> +  i3_was_move = is_just_move (i3);
> +
>    /* Record whether I2DEST is used in I2SRC and similarly for the other
>       cases.  Knowing this will help in register status updating below.  */
>    i2dest_in_i2src = reg_overlap_mentioned_p (i2dest, i2src);
> @@ -4014,8 +4030,10 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>        && XVECLEN (newpat, 0) == 2
>        && GET_CODE (XVECEXP (newpat, 0, 0)) == SET
>        && GET_CODE (XVECEXP (newpat, 0, 1)) == SET
> -      && (i1 || set_noop_p (XVECEXP (newpat, 0, 0))
> -	  || set_noop_p (XVECEXP (newpat, 0, 1)))
> +      && (i1
> +	  || set_noop_p (XVECEXP (newpat, 0, 0))
> +	  || set_noop_p (XVECEXP (newpat, 0, 1))
> +	  || (!i2_was_move && !i3_was_move))
>        && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != ZERO_EXTRACT
>        && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != STRICT_LOW_PART
>        && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 1))) != ZERO_EXTRACT
> --
> 1.8.3.1
>
Re: [PATCH] combine: Allow combining two insns to two insns
On 07/24/2018 11:18 AM, Segher Boessenkool wrote:
> This patch allows combine to combine two insns into two.  This helps
> in many cases, by reducing instruction path length, and also allowing
> further combinations to happen.  PR85160 is a typical example of code
> that it can improve.
>
> This patch does not allow such combinations if either of the original
> instructions was a simple move instruction.  In those cases combining
> the two instructions increases register pressure without improving the
> code.  With this move test, register pressure no longer increases
> noticeably as far as I can tell.
>
> (At first I also didn't allow either of the resulting insns to be a
> move instruction.  But that is actually a very good thing to have, as
> should have been obvious.)
>
> Tested for many months; tested on about 30 targets.
>
> I'll commit this later this week if there are no objections.
>
>
> Segher
>
>
> 2018-07-24  Segher Boessenkool
>
> 	PR rtl-optimization/85160
> 	* combine.c (is_just_move): New function.
> 	(try_combine): Allow combining two instructions into two if neither of
> 	the original instructions was a move.

I've had several instances through the years where a 2->2 combination
would have been useful.  I didn't save any of those examples though...
Good to see the limitation being addressed.

jeff
[PATCH] combine: Allow combining two insns to two insns
This patch allows combine to combine two insns into two.  This helps in
many cases, by reducing instruction path length, and also allowing
further combinations to happen.  PR85160 is a typical example of code
that it can improve.

This patch does not allow such combinations if either of the original
instructions was a simple move instruction.  In those cases combining
the two instructions increases register pressure without improving the
code.  With this move test, register pressure no longer increases
noticeably as far as I can tell.

(At first I also didn't allow either of the resulting insns to be a
move instruction.  But that is actually a very good thing to have, as
should have been obvious.)

Tested for many months; tested on about 30 targets.

I'll commit this later this week if there are no objections.


Segher


2018-07-24  Segher Boessenkool

	PR rtl-optimization/85160
	* combine.c (is_just_move): New function.
	(try_combine): Allow combining two instructions into two if neither of
	the original instructions was a move.

---
 gcc/combine.c | 22 ++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/gcc/combine.c b/gcc/combine.c
index cfe0f19..d64e84d 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -2604,6 +2604,17 @@ can_split_parallel_of_n_reg_sets (rtx_insn *insn, int n)
   return true;
 }
 
+/* Return whether X is just a single set, with the source
+   a general_operand.  */
+static bool
+is_just_move (rtx x)
+{
+  if (INSN_P (x))
+    x = PATTERN (x);
+
+  return (GET_CODE (x) == SET && general_operand (SET_SRC (x), VOIDmode));
+}
+
 /* Try to combine the insns I0, I1 and I2 into I3.
    Here I0, I1 and I2 appear earlier than I3.
    I0 and I1 can be zero; then we combine just I2 into I3, or I1 and I2 into
@@ -2668,6 +2679,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
   int swap_i2i3 = 0;
   int split_i2i3 = 0;
   int changed_i3_dest = 0;
+  bool i2_was_move = false, i3_was_move = false;
 
   int maxreg;
   rtx_insn *temp_insn;
@@ -3059,6 +3071,10 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
       return 0;
     }
 
+  /* Record whether i2 and i3 are trivial moves.  */
+  i2_was_move = is_just_move (i2);
+  i3_was_move = is_just_move (i3);
+
   /* Record whether I2DEST is used in I2SRC and similarly for the other
      cases.  Knowing this will help in register status updating below.  */
   i2dest_in_i2src = reg_overlap_mentioned_p (i2dest, i2src);
@@ -4014,8 +4030,10 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
       && XVECLEN (newpat, 0) == 2
       && GET_CODE (XVECEXP (newpat, 0, 0)) == SET
       && GET_CODE (XVECEXP (newpat, 0, 1)) == SET
-      && (i1 || set_noop_p (XVECEXP (newpat, 0, 0))
-	  || set_noop_p (XVECEXP (newpat, 0, 1)))
+      && (i1
+	  || set_noop_p (XVECEXP (newpat, 0, 0))
+	  || set_noop_p (XVECEXP (newpat, 0, 1))
+	  || (!i2_was_move && !i3_was_move))
       && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != ZERO_EXTRACT
       && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != STRICT_LOW_PART
       && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 1))) != ZERO_EXTRACT
-- 
1.8.3.1
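To make the transformation concrete, here is a hypothetical RTL-level sketch of a 2->2 combination; the register numbers and operations are invented for illustration and are not taken from PR85160:

```
;; Before: i2 feeds i3, but i2's result (reg 100) is also used later,
;; so combine cannot simply merge i2 into i3 and delete i2.
(insn i2 (set (reg 100) (and (reg 101) (const_int 255))))
(insn i3 (set (reg 102) (plus (reg 100) (reg 103))))

;; After a 2->2 combination, both sets are kept, but i3 now contains
;; the full expression, which may match a cheaper target pattern
;; (e.g. a masked-add or fused instruction):
(insn i2' (set (reg 100) (and (reg 101) (const_int 255))))
(insn i3' (set (reg 102) (plus (and (reg 101) (const_int 255))
                               (reg 103))))

;; Neither original insn is a plain move (is_just_move returns false
;; for both), so the new "|| (!i2_was_move && !i3_was_move)" condition
;; allows the two-set PARALLEL to survive and be split back into two
;; insns, instead of being rejected as it was before the patch.
```

This also shows why the patch refuses the combination when one of the originals is a simple move: substituting a move's source into the other insn would only lengthen live ranges and raise register pressure without shortening the instruction path.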