Re: PATCH: PR target/50603: [x32] Unnecessary lea
On Thu, Oct 6, 2011 at 11:33 PM, H.J. Lu hjl.to...@gmail.com wrote: OTOH, x86_64 and i686 targets can also benefit from this change. If combine can't create more complex address (covered by lea), then it will simply propagate memory operand back into the add insn. It looks to me that we can't loose here, so: /* Improve address combine. */ if (code == PLUS MEM_P (src2)) src2 = force_reg (mode, src2); Any opinions? It doesn't work with 64bit libstdc++: Yeah, yeah. ix86_output_mi_thunk has some ... issues. Please try attached patch that introduces ix86_emit_binop and uses it in a bunch of places. I tried it on GCC. There are no regressions. The bugs are fixed for x32. Here are size comparison with GCC runtime libraries on ia32, x32 and x86-64: 884093 18600 27064 929757 e2fdd old libstdc++.so 884189 18600 27064 929853 e303d new libs/libstdc++.so The new code is mov 0xc(%edi),%eax mov %eax,0x8(%esi) mov -0xc(%eax),%eax mov 0x10(%edi),%edx lea 0x8(%esi,%eax,1),%eax The old one is mov 0xc(%edi),%edx lea 0x8(%esi),%eax mov %edx,0x8(%esi) add -0xc(%edx),%eax mov 0x10(%edi),%edx The new code merged lea+add into one lea, so it looks quite OK to me. Do you have some performance numbers? I will report performance numbers in a few days. The differences in SPEC CPU 2006 on ia32, x86-64 and x32 are within noise range. Great. Attached is a slightly updated patch, where we consider only integer-mode PLUS RTXes. 2011-10-07 Uros Bizjak ubiz...@gmail.com H.J. Lu hongjiu...@intel.com PR target/50603 * config/i386/i386.c (ix86_fixup_binary_operands): Force src2 of integer PLUS RTX to a register to improve address combine. testsuite/ChangeLog: 2011-10-07 Uros Bizjak ubiz...@gmail.com H.J. Lu hongjiu...@intel.com PR target/50603 * gcc.target/i386/pr50603.c: New test. Tested on x86_64-pc-linux-gnu {,-m32}, committed to mainline SVN. Uros. Index: config/i386/i386.c === --- config/i386/i386.c (revision 179645) +++ config/i386/i386.c (working copy) @@ -15798,6 +15798,12 @@ ix86_fixup_binary_operands (enum rtx_code code, en if (MEM_P (src1) !rtx_equal_p (dst, src1)) src1 = force_reg (mode, src1); + /* Improve address combine. */ + if (code == PLUS + GET_MODE_CLASS (mode) == MODE_INT + MEM_P (src2)) +src2 = force_reg (mode, src2); + operands[1] = src1; operands[2] = src2; return dst; Index: testsuite/gcc.target/i386/pr50603.c === --- testsuite/gcc.target/i386/pr50603.c (revision 0) +++ testsuite/gcc.target/i386/pr50603.c (revision 0) @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-options -O2 } */ + +extern int *foo; + +int +bar (int x) +{ + return foo[x]; +} +/* { dg-final { scan-assembler-not lea\[lq\] } } */
Re: PATCH: PR target/50603: [x32] Unnecessary lea
On Tue, Oct 4, 2011 at 11:58 AM, H.J. Lu hjl.to...@gmail.com wrote: On Tue, Oct 4, 2011 at 11:51 AM, Uros Bizjak ubiz...@gmail.com wrote: On Tue, Oct 4, 2011 at 8:37 PM, H.J. Lu hjl.to...@gmail.com wrote: OTOH, x86_64 and i686 targets can also benefit from this change. If combine can't create more complex address (covered by lea), then it will simply propagate memory operand back into the add insn. It looks to me that we can't loose here, so: /* Improve address combine. */ if (code == PLUS MEM_P (src2)) src2 = force_reg (mode, src2); Any opinions? It doesn't work with 64bit libstdc++: Yeah, yeah. ix86_output_mi_thunk has some ... issues. Please try attached patch that introduces ix86_emit_binop and uses it in a bunch of places. I tried it on GCC. There are no regressions. The bugs are fixed for x32. Here are size comparison with GCC runtime libraries on ia32, x32 and x86-64: 884093 18600 27064 929757 e2fdd old libstdc++.so 884189 18600 27064 929853 e303d new libs/libstdc++.so The new code is mov 0xc(%edi),%eax mov %eax,0x8(%esi) mov -0xc(%eax),%eax mov 0x10(%edi),%edx lea 0x8(%esi,%eax,1),%eax The old one is mov 0xc(%edi),%edx lea 0x8(%esi),%eax mov %edx,0x8(%esi) add -0xc(%edx),%eax mov 0x10(%edi),%edx The new code merged lea+add into one lea, so it looks quite OK to me. Do you have some performance numbers? I will report performance numbers in a few days. The differences in SPEC CPU 2006 on ia32, x86-64 and x32 are within noise range. -- H.J.
Re: PATCH: PR target/50603: [x32] Unnecessary lea
On Tue, Oct 4, 2011 at 1:00 AM, H.J. Lu hongjiu...@intel.com wrote: This patch improves address combine for x32 by forcing the memory memory operand of PLUS operation into register. Tested on Linux/x86-64 with -mx32. OK for trunk? Does the patch fix FAIL: gcc.target/i386/pr45670.c scan-assembler-not lea[lq] on x32 ? Uros.
Re: PATCH: PR target/50603: [x32] Unnecessary lea
On 10/04/2011 01:00 AM, H.J. Lu wrote: + else +{ + /* Improve address combine in x32 mode. */ + if (TARGET_X32 + code == PLUS + !MEM_P (dst) + !MEM_P (src1) + MEM_P (src2) ) + src2 = force_reg (mode, src2); +} Perhaps this is worthwhile also on non-x32? Paolo
Re: PATCH: PR target/50603: [x32] Unnecessary lea
On Tue, Oct 4, 2011 at 10:06 AM, Uros Bizjak ubiz...@gmail.com wrote: This patch improves address combine for x32 by forcing the memory memory operand of PLUS operation into register. Tested on Linux/x86-64 with -mx32. OK for trunk? Does the patch fix FAIL: gcc.target/i386/pr45670.c scan-assembler-not lea[lq] on x32 ? It does. Following patch is the same, but takes into account that non-matching memory can only be in src2, so it avoids a bunch of unnecessary checks. Can you please check the effects of the patch with some codesize benchmark? OTOH, x86_64 and i686 targets can also benefit from this change. If combine can't create more complex address (covered by lea), then it will simply propagate memory operand back into the add insn. It looks to me that we can't loose here, so: /* Improve address combine. */ if (code == PLUS MEM_P (src2)) src2 = force_reg (mode, src2); Any opinions? Uros.
Re: PATCH: PR target/50603: [x32] Unnecessary lea
On Tue, Oct 4, 2011 at 1:19 AM, Uros Bizjak ubiz...@gmail.com wrote: On Tue, Oct 4, 2011 at 10:06 AM, Uros Bizjak ubiz...@gmail.com wrote: This patch improves address combine for x32 by forcing the memory memory operand of PLUS operation into register. Tested on Linux/x86-64 with -mx32. OK for trunk? Does the patch fix FAIL: gcc.target/i386/pr45670.c scan-assembler-not lea[lq] on x32 ? It does. Following patch is the same, but takes into account that non-matching memory can only be in src2, so it avoids a bunch of unnecessary checks. Can you please check the effects of the patch with some codesize benchmark? OTOH, x86_64 and i686 targets can also benefit from this change. If combine can't create more complex address (covered by lea), then it will simply propagate memory operand back into the add insn. It looks to me that we can't loose here, so: /* Improve address combine. */ if (code == PLUS MEM_P (src2)) src2 = force_reg (mode, src2); Any opinions? This patch should fix PR 50603 as well as gcc.target/i386/pr45670.c. I will check its code size impact on SPEC CPU 2K/2006. -- H.J.
Re: PATCH: PR target/50603: [x32] Unnecessary lea
On Tue, Oct 4, 2011 at 1:19 AM, Uros Bizjak ubiz...@gmail.com wrote: On Tue, Oct 4, 2011 at 10:06 AM, Uros Bizjak ubiz...@gmail.com wrote: This patch improves address combine for x32 by forcing the memory memory operand of PLUS operation into register. Tested on Linux/x86-64 with -mx32. OK for trunk? Does the patch fix FAIL: gcc.target/i386/pr45670.c scan-assembler-not lea[lq] on x32 ? It does. Following patch is the same, but takes into account that non-matching memory can only be in src2, so it avoids a bunch of unnecessary checks. Can you please check the effects of the patch with some codesize benchmark? OTOH, x86_64 and i686 targets can also benefit from this change. If combine can't create more complex address (covered by lea), then it will simply propagate memory operand back into the add insn. It looks to me that we can't loose here, so: /* Improve address combine. */ if (code == PLUS MEM_P (src2)) src2 = force_reg (mode, src2); Any opinions? It doesn't work with 64bit libstdc++: /export/gnu/import/git/gcc/libstdc++-v3/src/strstream.cc: In member function ‘void std::strstream::_ZTv0_n24_NSt9strstreamD1Ev()’: /export/gnu/import/git/gcc/libstdc++-v3/src/strstream.cc:418:1: error: insn does not satisfy its constraints: (insn 3 2 4 (set (reg:DI 59) (mem:DI (plus:DI (reg:DI 39 r10) (const_int -24 [0xffe8])) [0 S8 A8])) 62 {*movdi_internal_rex64} (nil)) /export/gnu/import/git/gcc/libstdc++-v3/src/strstream.cc:418:1: internal compiler error: in final_scan_insn, at final.c:2648 Please submit a full bug report, with preprocessed source if appropriate. See http://gcc.gnu.org/bugs.html for instructions. make[7]: *** [strstream.lo] Error 1 make[7]: *** Waiting for unfinished jobs -- H.J.
Re: PATCH: PR target/50603: [x32] Unnecessary lea
On Tue, Oct 4, 2011 at 6:06 PM, H.J. Lu hjl.to...@gmail.com wrote: OTOH, x86_64 and i686 targets can also benefit from this change. If combine can't create more complex address (covered by lea), then it will simply propagate memory operand back into the add insn. It looks to me that we can't loose here, so: /* Improve address combine. */ if (code == PLUS MEM_P (src2)) src2 = force_reg (mode, src2); Any opinions? It doesn't work with 64bit libstdc++: Yeah, yeah. ix86_output_mi_thunk has some ... issues. Please try attached patch that introduces ix86_emit_binop and uses it in a bunch of places. Uros. Index: i386-protos.h === --- i386-protos.h (revision 179506) +++ i386-protos.h (working copy) @@ -94,6 +94,7 @@ extern bool ix86_lea_outperforms (rtx, unsigned in unsigned int, unsigned int); extern bool ix86_avoid_lea_for_add (rtx, rtx[]); extern bool ix86_avoid_lea_for_addr (rtx, rtx[]); +extern void ix86_emit_binop (enum rtx_code, enum machine_mode, rtx, rtx); extern void ix86_split_lea_for_addr (rtx[], enum machine_mode); extern bool ix86_lea_for_add_ok (rtx, rtx[]); extern bool ix86_vec_interleave_v2df_operator_ok (rtx operands[3], bool high); Index: i386.c === --- i386.c (revision 179506) +++ i386.c (working copy) @@ -15727,6 +15727,10 @@ ix86_fixup_binary_operands (enum rtx_code code, en if (MEM_P (src1) !rtx_equal_p (dst, src1)) src1 = force_reg (mode, src1); + /* Improve address combine. */ + if (code == PLUS MEM_P (src2)) +src2 = force_reg (mode, src2); + operands[1] = src1; operands[2] = src2; return dst; @@ -16470,6 +16474,20 @@ ix86_avoid_lea_for_addr (rtx insn, rtx operands[]) return !ix86_lea_outperforms (insn, regno0, regno1, regno2, split_cost); } +/* Emit x86 binary operand CODE in mode MODE, where the first operand + matches destination. RTX includes clobber of FLAGS_REG. */ + +extern void ix86_emit_binop (enum rtx_code code, enum machine_mode mode, +rtx dst, rtx src) +{ + rtx op, clob; + + op = gen_rtx_SET (VOIDmode, dst, gen_rtx_fmt_ee (code, mode, dst, src)); + clob = gen_rtx_CLOBBER (VOIDmode, gen_rtx_REG (CCmode, FLAGS_REG)); + + emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, op, clob))); +} + /* Split lea instructions into a sequence of instructions which are executed on ALU to avoid AGU stalls. It is assumed that it is allowed to clobber flags register @@ -16482,8 +16500,7 @@ ix86_split_lea_for_addr (rtx operands[], enum mach unsigned int regno1 = INVALID_REGNUM; unsigned int regno2 = INVALID_REGNUM; struct ix86_address parts; - rtx tmp, clob; - rtvec par; + rtx tmp; int ok, adds; ok = ix86_decompose_address (operands[1], parts); @@ -16515,14 +16532,7 @@ ix86_split_lea_for_addr (rtx operands[], enum mach gcc_assert (regno2 != regno0); for (adds = parts.scale; adds 0; adds--) - { - tmp = gen_rtx_PLUS (mode, operands[0], parts.index); - tmp = gen_rtx_SET (VOIDmode, operands[0], tmp); - clob = gen_rtx_CLOBBER (VOIDmode, - gen_rtx_REG (CCmode, FLAGS_REG)); - par = gen_rtvec (2, tmp, clob); - emit_insn (gen_rtx_PARALLEL (VOIDmode, par)); - } + ix86_emit_binop (PLUS, mode, operands[0], parts.index); } else { @@ -16531,30 +16541,14 @@ ix86_split_lea_for_addr (rtx operands[], enum mach emit_insn (gen_rtx_SET (VOIDmode, operands[0], parts.index)); /* Use shift for scaling. */ - tmp = gen_rtx_ASHIFT (mode, operands[0], - GEN_INT (exact_log2 (parts.scale))); - tmp = gen_rtx_SET (VOIDmode, operands[0], tmp); - clob = gen_rtx_CLOBBER (VOIDmode, gen_rtx_REG (CCmode, FLAGS_REG)); - par = gen_rtvec (2, tmp, clob); - emit_insn (gen_rtx_PARALLEL (VOIDmode, par)); + ix86_emit_binop (ASHIFT, mode, operands[0], + GEN_INT (exact_log2 (parts.scale))); if (parts.base) - { - tmp = gen_rtx_PLUS (mode, operands[0], parts.base); - tmp = gen_rtx_SET (VOIDmode, operands[0], tmp); - clob = gen_rtx_CLOBBER (VOIDmode, gen_rtx_REG (CCmode, FLAGS_REG)); - par = gen_rtvec (2, tmp, clob); - emit_insn (gen_rtx_PARALLEL (VOIDmode, par)); - } + ix86_emit_binop (PLUS, mode, operands[0], parts.base); if (parts.disp parts.disp != const0_rtx) - { - tmp = gen_rtx_PLUS (mode, operands[0], parts.disp); - tmp = gen_rtx_SET (VOIDmode, operands[0], tmp); - clob = gen_rtx_CLOBBER (VOIDmode, gen_rtx_REG (CCmode, FLAGS_REG)); - par = gen_rtvec (2, tmp,
Re: PATCH: PR target/50603: [x32] Unnecessary lea
On Tue, Oct 4, 2011 at 10:32 AM, Uros Bizjak ubiz...@gmail.com wrote: On Tue, Oct 4, 2011 at 6:06 PM, H.J. Lu hjl.to...@gmail.com wrote: OTOH, x86_64 and i686 targets can also benefit from this change. If combine can't create more complex address (covered by lea), then it will simply propagate memory operand back into the add insn. It looks to me that we can't loose here, so: /* Improve address combine. */ if (code == PLUS MEM_P (src2)) src2 = force_reg (mode, src2); Any opinions? It doesn't work with 64bit libstdc++: Yeah, yeah. ix86_output_mi_thunk has some ... issues. Please try attached patch that introduces ix86_emit_binop and uses it in a bunch of places. Uros. I tried it on GCC. There are no regressions. The bugs are fixed for x32. Here are size comparison with GCC runtime libraries on ia32, x32 and x86-64: textdata bss dec hex filename 103304 480 368 104152 196d8 32/libgcc/32/libgcc_s.so 103048 480 368 103896 195d8 ../../build-x86_64-linux/x86_64-unknown-linux-gnu/32/libgcc/32/libgcc_s.so textdata bss dec hex filename 10538463100 360 1057306 10221a 32/libgfortran/.libs/libgfortran.so 10535743100 360 1057034 10210a ../../build-x86_64-linux/x86_64-unknown-linux-gnu/32/libgfortran/.libs/libgfortran.so textdata bss dec hex filename 56817 536 156 57509e0a5 32/libgomp/.libs/libgomp.so 56817 536 156 57509 e0a5 ../../build-x86_64-linux/x86_64-unknown-linux-gnu/32/libgomp/.libs/libgomp.so textdata bss dec hex filename 1078151832 839016 948663 e79b7 32/libmudflap/.libs/libmudflap.so 1078151832 839016 948663 e79b7 ../../build-x86_64-linux/x86_64-unknown-linux-gnu/32/libmudflap/.libs/libmudflap.so textdata bss dec hex filename 1108901976 839052 951918 e866e 32/libmudflap/.libs/libmudflapth.so 1108901976 839052 951918 e866e ../../build-x86_64-linux/x86_64-unknown-linux-gnu/32/libmudflap/.libs/libmudflapth.so textdata bss dec hex filename 9911329124748 106773 1a115 32/libobjc/.libs/libobjc.so 9911329124748 106773 1a115 ../../build-x86_64-linux/x86_64-unknown-linux-gnu/32/libobjc/.libs/libobjc.so textdata bss dec hex filename 3574661148 12 358626 578e2 32/libquadmath/.libs/libquadmath.so 3572181148 12 358378 577ea ../../build-x86_64-linux/x86_64-unknown-linux-gnu/32/libquadmath/.libs/libquadmath.so textdata bss dec hex filename 6018 388 86414190e 32/libssp/.libs/libssp.so 6018 388 86414 190e ../../build-x86_64-linux/x86_64-unknown-linux-gnu/32/libssp/.libs/libssp.so textdata bss dec hex filename 884093 18600 27064 929757 e2fdd 32/libstdc++-v3/src/.libs/libstdc++.so 884189 18600 27064 929853 e303d ../../build-x86_64-linux/x86_64-unknown-linux-gnu/32/libstdc++-v3/src/.libs/libstdc++.so textdata bss dec hex filename 83012 944 696 84652 14aac libgcc/libgcc_s.so 83012 944 696 84652 14aac ../../build-x86_64-linux/x86_64-unknown-linux-gnu/libgcc/libgcc_s.so textdata bss dec hex filename 11445936520 568 1151681 1192c1 libgfortran/.libs/libgfortran.so 11445616520 568 1151649 1192a1 ../../build-x86_64-linux/x86_64-unknown-linux-gnu/libgfortran/.libs/libgfortran.so textdata bss dec hex filename 519581080 260 53298d032 libgomp/.libs/libgomp.so 519581080 260 53298 d032 ../../build-x86_64-linux/x86_64-unknown-linux-gnu/libgomp/.libs/libgomp.so textdata bss dec hex filename 1118313332 1634728 1749891 1ab383 libmudflap/.libs/libmudflap.so 1118313332 1634728 1749891 1ab383 ../../build-x86_64-linux/x86_64-unknown-linux-gnu/libmudflap/.libs/libmudflap.so textdata bss dec hex filename 1164303612 1634768 1754810 1ac6ba libmudflap/.libs/libmudflapth.so 1164303612 1634768 1754810 1ac6ba ../../build-x86_64-linux/x86_64-unknown-linux-gnu/libmudflap/.libs/libmudflapth.so textdata bss dec hex filename 10224048729320 116432 1c6d0 libobjc/.libs/libobjc.so 10224048729320 116432 1c6d0 ../../build-x86_64-linux/x86_64-unknown-linux-gnu/libobjc/.libs/libobjc.so textdata bss dec hex filename 2132151744 16 214975 347bf libquadmath/.libs/libquadmath.so 2132311744 16 214991 347cf ../../build-x86_64-linux/x86_64-unknown-linux-gnu/libquadmath/.libs/libquadmath.so textdata bss dec hex filename 6395 768 1671791c0b libssp/.libs/libssp.so 6395 768 167179 1c0b../../build-x86_64-linux/x86_64-unknown-linux-gnu/libssp/.libs/libssp.so textdata bss dec hex
Re: PATCH: PR target/50603: [x32] Unnecessary lea
On Tue, Oct 4, 2011 at 8:37 PM, H.J. Lu hjl.to...@gmail.com wrote: OTOH, x86_64 and i686 targets can also benefit from this change. If combine can't create more complex address (covered by lea), then it will simply propagate memory operand back into the add insn. It looks to me that we can't loose here, so: /* Improve address combine. */ if (code == PLUS MEM_P (src2)) src2 = force_reg (mode, src2); Any opinions? It doesn't work with 64bit libstdc++: Yeah, yeah. ix86_output_mi_thunk has some ... issues. Please try attached patch that introduces ix86_emit_binop and uses it in a bunch of places. I tried it on GCC. There are no regressions. The bugs are fixed for x32. Here are size comparison with GCC runtime libraries on ia32, x32 and x86-64: 884093 18600 27064 929757 e2fdd old libstdc++.so 884189 18600 27064 929853 e303d new libs/libstdc++.so The new code is mov 0xc(%edi),%eax mov %eax,0x8(%esi) mov -0xc(%eax),%eax mov 0x10(%edi),%edx lea 0x8(%esi,%eax,1),%eax The old one is mov 0xc(%edi),%edx lea 0x8(%esi),%eax mov %edx,0x8(%esi) add -0xc(%edx),%eax mov 0x10(%edi),%edx The new code merged lea+add into one lea, so it looks quite OK to me. Do you have some performance numbers? Thanks, Uros.
Re: PATCH: PR target/50603: [x32] Unnecessary lea
On Tue, Oct 4, 2011 at 11:51 AM, Uros Bizjak ubiz...@gmail.com wrote: On Tue, Oct 4, 2011 at 8:37 PM, H.J. Lu hjl.to...@gmail.com wrote: OTOH, x86_64 and i686 targets can also benefit from this change. If combine can't create more complex address (covered by lea), then it will simply propagate memory operand back into the add insn. It looks to me that we can't loose here, so: /* Improve address combine. */ if (code == PLUS MEM_P (src2)) src2 = force_reg (mode, src2); Any opinions? It doesn't work with 64bit libstdc++: Yeah, yeah. ix86_output_mi_thunk has some ... issues. Please try attached patch that introduces ix86_emit_binop and uses it in a bunch of places. I tried it on GCC. There are no regressions. The bugs are fixed for x32. Here are size comparison with GCC runtime libraries on ia32, x32 and x86-64: 884093 18600 27064 929757 e2fdd old libstdc++.so 884189 18600 27064 929853 e303d new libs/libstdc++.so The new code is mov 0xc(%edi),%eax mov %eax,0x8(%esi) mov -0xc(%eax),%eax mov 0x10(%edi),%edx lea 0x8(%esi,%eax,1),%eax The old one is mov 0xc(%edi),%edx lea 0x8(%esi),%eax mov %edx,0x8(%esi) add -0xc(%edx),%eax mov 0x10(%edi),%edx The new code merged lea+add into one lea, so it looks quite OK to me. Do you have some performance numbers? I will report performance numbers in a few days. -- H.J.