Re: PATCH: PR target/50603: [x32] Unnecessary lea

2011-10-07 Thread Uros Bizjak
On Thu, Oct 6, 2011 at 11:33 PM, H.J. Lu hjl.to...@gmail.com wrote:

 OTOH, x86_64 and i686 targets can also benefit from this change. If
 combine can't create more complex address (covered by lea), then it
 will simply propagate memory operand back into the add insn. It looks
 to me that we can't loose here, so:

  /* Improve address combine.  */
  if (code == PLUS  MEM_P (src2))
    src2 = force_reg (mode, src2);

 Any opinions?


 It doesn't work with 64bit libstdc++:

 Yeah, yeah. ix86_output_mi_thunk has some ...  issues.

 Please try attached patch that introduces ix86_emit_binop and uses it
 in a bunch of places.

 I tried it on GCC.  There are no regressions.  The bugs are fixed for x32.
 Here are size comparison with GCC runtime libraries on ia32, x32 and
 x86-64:

  884093   18600   27064  929757   e2fdd old libstdc++.so
  884189   18600   27064  929853   e303d new libs/libstdc++.so

 The new code is

 mov    0xc(%edi),%eax
 mov    %eax,0x8(%esi)
 mov    -0xc(%eax),%eax
 mov    0x10(%edi),%edx
 lea    0x8(%esi,%eax,1),%eax

 The old one is

 mov    0xc(%edi),%edx
 lea    0x8(%esi),%eax
 mov    %edx,0x8(%esi)
 add    -0xc(%edx),%eax
 mov    0x10(%edi),%edx

 The new code merged lea+add into one lea, so it looks quite OK to me.

 Do you have some performance numbers?


 I will report performance numbers in a few days.

 The differences in SPEC CPU 2006 on ia32, x86-64 and
 x32 are within noise range.

Great.

Attached is a slightly updated patch, where we consider only
integer-mode PLUS RTXes.

2011-10-07  Uros Bizjak  ubiz...@gmail.com
H.J. Lu  hongjiu...@intel.com

PR target/50603
* config/i386/i386.c (ix86_fixup_binary_operands): Force src2 of
integer PLUS RTX to a register to improve address combine.

testsuite/ChangeLog:

2011-10-07  Uros Bizjak  ubiz...@gmail.com
H.J. Lu  hongjiu...@intel.com

PR target/50603
* gcc.target/i386/pr50603.c: New test.

Tested on x86_64-pc-linux-gnu {,-m32}, committed to mainline SVN.

Uros.
Index: config/i386/i386.c
===
--- config/i386/i386.c  (revision 179645)
+++ config/i386/i386.c  (working copy)
@@ -15798,6 +15798,12 @@ ix86_fixup_binary_operands (enum rtx_code code, en
   if (MEM_P (src1)  !rtx_equal_p (dst, src1))
 src1 = force_reg (mode, src1);
 
+  /* Improve address combine.  */
+  if (code == PLUS
+   GET_MODE_CLASS (mode) == MODE_INT
+   MEM_P (src2))
+src2 = force_reg (mode, src2);
+
   operands[1] = src1;
   operands[2] = src2;
   return dst;
Index: testsuite/gcc.target/i386/pr50603.c
===
--- testsuite/gcc.target/i386/pr50603.c (revision 0)
+++ testsuite/gcc.target/i386/pr50603.c (revision 0)
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options -O2 } */
+
+extern int *foo;
+
+int
+bar (int x)
+{
+  return foo[x];
+}
+/* { dg-final { scan-assembler-not lea\[lq\] } } */


Re: PATCH: PR target/50603: [x32] Unnecessary lea

2011-10-06 Thread H.J. Lu
On Tue, Oct 4, 2011 at 11:58 AM, H.J. Lu hjl.to...@gmail.com wrote:
 On Tue, Oct 4, 2011 at 11:51 AM, Uros Bizjak ubiz...@gmail.com wrote:
 On Tue, Oct 4, 2011 at 8:37 PM, H.J. Lu hjl.to...@gmail.com wrote:

 OTOH, x86_64 and i686 targets can also benefit from this change. If
 combine can't create more complex address (covered by lea), then it
 will simply propagate memory operand back into the add insn. It looks
 to me that we can't loose here, so:

  /* Improve address combine.  */
  if (code == PLUS  MEM_P (src2))
    src2 = force_reg (mode, src2);

 Any opinions?


 It doesn't work with 64bit libstdc++:

 Yeah, yeah. ix86_output_mi_thunk has some ...  issues.

 Please try attached patch that introduces ix86_emit_binop and uses it
 in a bunch of places.

 I tried it on GCC.  There are no regressions.  The bugs are fixed for x32.
 Here are size comparison with GCC runtime libraries on ia32, x32 and
 x86-64:

  884093   18600   27064  929757   e2fdd old libstdc++.so
  884189   18600   27064  929853   e303d new libs/libstdc++.so

 The new code is

 mov    0xc(%edi),%eax
 mov    %eax,0x8(%esi)
 mov    -0xc(%eax),%eax
 mov    0x10(%edi),%edx
 lea    0x8(%esi,%eax,1),%eax

 The old one is

 mov    0xc(%edi),%edx
 lea    0x8(%esi),%eax
 mov    %edx,0x8(%esi)
 add    -0xc(%edx),%eax
 mov    0x10(%edi),%edx

 The new code merged lea+add into one lea, so it looks quite OK to me.

 Do you have some performance numbers?


 I will report performance numbers in a few days.

The differences in SPEC CPU 2006 on ia32, x86-64 and
x32 are within noise range.



-- 
H.J.


Re: PATCH: PR target/50603: [x32] Unnecessary lea

2011-10-04 Thread Uros Bizjak
On Tue, Oct 4, 2011 at 1:00 AM, H.J. Lu hongjiu...@intel.com wrote:
 This patch improves address combine for x32 by forcing the memory memory
 operand of PLUS operation into register.  Tested on Linux/x86-64 with
 -mx32.  OK for trunk?

Does the patch fix

FAIL: gcc.target/i386/pr45670.c scan-assembler-not lea[lq]

on x32 ?

Uros.


Re: PATCH: PR target/50603: [x32] Unnecessary lea

2011-10-04 Thread Paolo Bonzini

On 10/04/2011 01:00 AM, H.J. Lu wrote:

+  else
+{
+  /* Improve address combine in x32 mode.  */
+  if (TARGET_X32
+ code == PLUS
+ !MEM_P (dst)
+ !MEM_P (src1)
+ MEM_P (src2) )
+   src2 = force_reg (mode, src2);
+}


Perhaps this is worthwhile also on non-x32?

Paolo



Re: PATCH: PR target/50603: [x32] Unnecessary lea

2011-10-04 Thread Uros Bizjak
On Tue, Oct 4, 2011 at 10:06 AM, Uros Bizjak ubiz...@gmail.com wrote:
 This patch improves address combine for x32 by forcing the memory memory
 operand of PLUS operation into register.  Tested on Linux/x86-64 with
 -mx32.  OK for trunk?

 Does the patch fix

 FAIL: gcc.target/i386/pr45670.c scan-assembler-not lea[lq]

 on x32 ?

 It does.

 Following patch is the same, but takes into account that non-matching
 memory can only be in src2, so it avoids a bunch of unnecessary
 checks. Can you please check the effects of the patch with some
 codesize benchmark?

OTOH, x86_64 and i686 targets can also benefit from this change. If
combine can't create more complex address (covered by lea), then it
will simply propagate memory operand back into the add insn. It looks
to me that we can't loose here, so:

  /* Improve address combine.  */
  if (code == PLUS  MEM_P (src2))
src2 = force_reg (mode, src2);

Any opinions?

Uros.


Re: PATCH: PR target/50603: [x32] Unnecessary lea

2011-10-04 Thread H.J. Lu
On Tue, Oct 4, 2011 at 1:19 AM, Uros Bizjak ubiz...@gmail.com wrote:
 On Tue, Oct 4, 2011 at 10:06 AM, Uros Bizjak ubiz...@gmail.com wrote:
 This patch improves address combine for x32 by forcing the memory memory
 operand of PLUS operation into register.  Tested on Linux/x86-64 with
 -mx32.  OK for trunk?

 Does the patch fix

 FAIL: gcc.target/i386/pr45670.c scan-assembler-not lea[lq]

 on x32 ?

 It does.

 Following patch is the same, but takes into account that non-matching
 memory can only be in src2, so it avoids a bunch of unnecessary
 checks. Can you please check the effects of the patch with some
 codesize benchmark?

 OTOH, x86_64 and i686 targets can also benefit from this change. If
 combine can't create more complex address (covered by lea), then it
 will simply propagate memory operand back into the add insn. It looks
 to me that we can't loose here, so:

  /* Improve address combine.  */
  if (code == PLUS  MEM_P (src2))
    src2 = force_reg (mode, src2);

 Any opinions?

This patch should fix PR 50603 as well as gcc.target/i386/pr45670.c.
I will check its code size impact on SPEC CPU 2K/2006.

-- 
H.J.


Re: PATCH: PR target/50603: [x32] Unnecessary lea

2011-10-04 Thread H.J. Lu
On Tue, Oct 4, 2011 at 1:19 AM, Uros Bizjak ubiz...@gmail.com wrote:
 On Tue, Oct 4, 2011 at 10:06 AM, Uros Bizjak ubiz...@gmail.com wrote:
 This patch improves address combine for x32 by forcing the memory memory
 operand of PLUS operation into register.  Tested on Linux/x86-64 with
 -mx32.  OK for trunk?

 Does the patch fix

 FAIL: gcc.target/i386/pr45670.c scan-assembler-not lea[lq]

 on x32 ?

 It does.

 Following patch is the same, but takes into account that non-matching
 memory can only be in src2, so it avoids a bunch of unnecessary
 checks. Can you please check the effects of the patch with some
 codesize benchmark?

 OTOH, x86_64 and i686 targets can also benefit from this change. If
 combine can't create more complex address (covered by lea), then it
 will simply propagate memory operand back into the add insn. It looks
 to me that we can't loose here, so:

  /* Improve address combine.  */
  if (code == PLUS  MEM_P (src2))
    src2 = force_reg (mode, src2);

 Any opinions?


It doesn't work with 64bit libstdc++:

/export/gnu/import/git/gcc/libstdc++-v3/src/strstream.cc: In member
function ‘void std::strstream::_ZTv0_n24_NSt9strstreamD1Ev()’:
/export/gnu/import/git/gcc/libstdc++-v3/src/strstream.cc:418:1: error:
insn does not satisfy its constraints:
(insn 3 2 4 (set (reg:DI 59)
(mem:DI (plus:DI (reg:DI 39 r10)
(const_int -24 [0xffe8])) [0 S8 A8])) 62
{*movdi_internal_rex64}
 (nil))
/export/gnu/import/git/gcc/libstdc++-v3/src/strstream.cc:418:1:
internal compiler error: in final_scan_insn, at final.c:2648
Please submit a full bug report,
with preprocessed source if appropriate.
See http://gcc.gnu.org/bugs.html for instructions.
make[7]: *** [strstream.lo] Error 1
make[7]: *** Waiting for unfinished jobs


-- 
H.J.


Re: PATCH: PR target/50603: [x32] Unnecessary lea

2011-10-04 Thread Uros Bizjak
On Tue, Oct 4, 2011 at 6:06 PM, H.J. Lu hjl.to...@gmail.com wrote:

 OTOH, x86_64 and i686 targets can also benefit from this change. If
 combine can't create more complex address (covered by lea), then it
 will simply propagate memory operand back into the add insn. It looks
 to me that we can't loose here, so:

  /* Improve address combine.  */
  if (code == PLUS  MEM_P (src2))
    src2 = force_reg (mode, src2);

 Any opinions?


 It doesn't work with 64bit libstdc++:

Yeah, yeah. ix86_output_mi_thunk has some ...  issues.

Please try attached patch that introduces ix86_emit_binop and uses it
in a bunch of places.

Uros.
Index: i386-protos.h
===
--- i386-protos.h   (revision 179506)
+++ i386-protos.h   (working copy)
@@ -94,6 +94,7 @@ extern bool ix86_lea_outperforms (rtx, unsigned in
  unsigned int, unsigned int);
 extern bool ix86_avoid_lea_for_add (rtx, rtx[]);
 extern bool ix86_avoid_lea_for_addr (rtx, rtx[]);
+extern void ix86_emit_binop (enum rtx_code, enum machine_mode, rtx, rtx);
 extern void ix86_split_lea_for_addr (rtx[], enum machine_mode);
 extern bool ix86_lea_for_add_ok (rtx, rtx[]);
 extern bool ix86_vec_interleave_v2df_operator_ok (rtx operands[3], bool high);
Index: i386.c
===
--- i386.c  (revision 179506)
+++ i386.c  (working copy)
@@ -15727,6 +15727,10 @@ ix86_fixup_binary_operands (enum rtx_code code, en
   if (MEM_P (src1)  !rtx_equal_p (dst, src1))
 src1 = force_reg (mode, src1);
 
+  /* Improve address combine.  */
+  if (code == PLUS  MEM_P (src2))
+src2 = force_reg (mode, src2);
+
   operands[1] = src1;
   operands[2] = src2;
   return dst;
@@ -16470,6 +16474,20 @@ ix86_avoid_lea_for_addr (rtx insn, rtx operands[])
   return !ix86_lea_outperforms (insn, regno0, regno1, regno2, split_cost);
 }
 
+/* Emit x86 binary operand CODE in mode MODE, where the first operand
+   matches destination.  RTX includes clobber of FLAGS_REG.  */
+
+extern void ix86_emit_binop (enum rtx_code code, enum machine_mode mode,
+rtx dst, rtx src)
+{
+  rtx op, clob;
+
+  op = gen_rtx_SET (VOIDmode, dst, gen_rtx_fmt_ee (code, mode, dst, src));
+  clob = gen_rtx_CLOBBER (VOIDmode, gen_rtx_REG (CCmode, FLAGS_REG));
+  
+  emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, op, clob)));
+}
+
 /* Split lea instructions into a sequence of instructions
which are executed on ALU to avoid AGU stalls.
It is assumed that it is allowed to clobber flags register
@@ -16482,8 +16500,7 @@ ix86_split_lea_for_addr (rtx operands[], enum mach
   unsigned int regno1 = INVALID_REGNUM;
   unsigned int regno2 = INVALID_REGNUM;
   struct ix86_address parts;
-  rtx tmp, clob;
-  rtvec par;
+  rtx tmp;
   int ok, adds;
 
   ok = ix86_decompose_address (operands[1], parts);
@@ -16515,14 +16532,7 @@ ix86_split_lea_for_addr (rtx operands[], enum mach
  gcc_assert (regno2 != regno0);
 
  for (adds = parts.scale; adds  0; adds--)
-   {
- tmp = gen_rtx_PLUS (mode, operands[0], parts.index);
- tmp = gen_rtx_SET (VOIDmode, operands[0], tmp);
- clob = gen_rtx_CLOBBER (VOIDmode,
- gen_rtx_REG (CCmode, FLAGS_REG));
- par = gen_rtvec (2, tmp, clob);
- emit_insn (gen_rtx_PARALLEL (VOIDmode, par));
-   }
+   ix86_emit_binop (PLUS, mode, operands[0], parts.index);
}
   else
{
@@ -16531,30 +16541,14 @@ ix86_split_lea_for_addr (rtx operands[], enum mach
emit_insn (gen_rtx_SET (VOIDmode, operands[0], parts.index));
 
  /* Use shift for scaling.  */
- tmp = gen_rtx_ASHIFT (mode, operands[0],
-   GEN_INT (exact_log2 (parts.scale)));
- tmp = gen_rtx_SET (VOIDmode, operands[0], tmp);
- clob = gen_rtx_CLOBBER (VOIDmode, gen_rtx_REG (CCmode, FLAGS_REG));
- par = gen_rtvec (2, tmp, clob);
- emit_insn (gen_rtx_PARALLEL (VOIDmode, par));
+ ix86_emit_binop (ASHIFT, mode, operands[0],
+  GEN_INT (exact_log2 (parts.scale)));
 
  if (parts.base)
-   {
- tmp = gen_rtx_PLUS (mode, operands[0], parts.base);
- tmp = gen_rtx_SET (VOIDmode, operands[0], tmp);
- clob = gen_rtx_CLOBBER (VOIDmode, gen_rtx_REG (CCmode, 
FLAGS_REG));
- par = gen_rtvec (2, tmp, clob);
- emit_insn (gen_rtx_PARALLEL (VOIDmode, par));
-   }
+   ix86_emit_binop (PLUS, mode, operands[0], parts.base);
 
  if (parts.disp  parts.disp != const0_rtx)
-   {
- tmp = gen_rtx_PLUS (mode, operands[0], parts.disp);
- tmp = gen_rtx_SET (VOIDmode, operands[0], tmp);
- clob = gen_rtx_CLOBBER (VOIDmode, gen_rtx_REG (CCmode, 
FLAGS_REG));
- par = gen_rtvec (2, tmp, 

Re: PATCH: PR target/50603: [x32] Unnecessary lea

2011-10-04 Thread H.J. Lu
On Tue, Oct 4, 2011 at 10:32 AM, Uros Bizjak ubiz...@gmail.com wrote:
 On Tue, Oct 4, 2011 at 6:06 PM, H.J. Lu hjl.to...@gmail.com wrote:

 OTOH, x86_64 and i686 targets can also benefit from this change. If
 combine can't create more complex address (covered by lea), then it
 will simply propagate memory operand back into the add insn. It looks
 to me that we can't loose here, so:

  /* Improve address combine.  */
  if (code == PLUS  MEM_P (src2))
    src2 = force_reg (mode, src2);

 Any opinions?


 It doesn't work with 64bit libstdc++:

 Yeah, yeah. ix86_output_mi_thunk has some ...  issues.

 Please try attached patch that introduces ix86_emit_binop and uses it
 in a bunch of places.

 Uros.


I tried it on GCC.  There are no regressions.  The bugs are fixed for x32.
Here are size comparison with GCC runtime libraries on ia32, x32 and
x86-64:

   textdata bss dec hex filename
 103304 480 368  104152   196d8 32/libgcc/32/libgcc_s.so
 103048 480 368  103896 
195d8   
../../build-x86_64-linux/x86_64-unknown-linux-gnu/32/libgcc/32/libgcc_s.so
   textdata bss dec hex filename
10538463100 360 1057306  10221a 32/libgfortran/.libs/libgfortran.so
10535743100 360 1057034 
10210a  
../../build-x86_64-linux/x86_64-unknown-linux-gnu/32/libgfortran/.libs/libgfortran.so
   textdata bss dec hex filename
  56817 536 156   57509e0a5 32/libgomp/.libs/libgomp.so
  56817 536 156   57509 
e0a5
../../build-x86_64-linux/x86_64-unknown-linux-gnu/32/libgomp/.libs/libgomp.so
   textdata bss dec hex filename
 1078151832  839016  948663   e79b7 32/libmudflap/.libs/libmudflap.so
 1078151832  839016  948663 
e79b7   
../../build-x86_64-linux/x86_64-unknown-linux-gnu/32/libmudflap/.libs/libmudflap.so
   textdata bss dec hex filename
 1108901976  839052  951918   e866e 32/libmudflap/.libs/libmudflapth.so
 1108901976  839052  951918 
e866e   
../../build-x86_64-linux/x86_64-unknown-linux-gnu/32/libmudflap/.libs/libmudflapth.so
   textdata bss dec hex filename
  9911329124748  106773   1a115 32/libobjc/.libs/libobjc.so
  9911329124748  106773 
1a115   
../../build-x86_64-linux/x86_64-unknown-linux-gnu/32/libobjc/.libs/libobjc.so
   textdata bss dec hex filename
 3574661148  12  358626   578e2 32/libquadmath/.libs/libquadmath.so
 3572181148  12  358378 
577ea   
../../build-x86_64-linux/x86_64-unknown-linux-gnu/32/libquadmath/.libs/libquadmath.so
   textdata bss dec hex filename
   6018 388   86414190e 32/libssp/.libs/libssp.so
   6018 388   86414 
190e
../../build-x86_64-linux/x86_64-unknown-linux-gnu/32/libssp/.libs/libssp.so
   textdata bss dec hex filename
 884093   18600   27064  929757   e2fdd 32/libstdc++-v3/src/.libs/libstdc++.so
 884189   18600   27064  929853 
e303d   
../../build-x86_64-linux/x86_64-unknown-linux-gnu/32/libstdc++-v3/src/.libs/libstdc++.so
   textdata bss dec hex filename
  83012 944 696   84652   14aac libgcc/libgcc_s.so
  83012 944 696   84652 
14aac   ../../build-x86_64-linux/x86_64-unknown-linux-gnu/libgcc/libgcc_s.so
   textdata bss dec hex filename
11445936520 568 1151681  1192c1 libgfortran/.libs/libgfortran.so
11445616520 568 1151649 
1192a1  
../../build-x86_64-linux/x86_64-unknown-linux-gnu/libgfortran/.libs/libgfortran.so
   textdata bss dec hex filename
  519581080 260   53298d032 libgomp/.libs/libgomp.so
  519581080 260   53298 
d032
../../build-x86_64-linux/x86_64-unknown-linux-gnu/libgomp/.libs/libgomp.so
   textdata bss dec hex filename
 1118313332 1634728 1749891  1ab383 libmudflap/.libs/libmudflap.so
 1118313332 1634728 1749891 
1ab383  
../../build-x86_64-linux/x86_64-unknown-linux-gnu/libmudflap/.libs/libmudflap.so
   textdata bss dec hex filename
 1164303612 1634768 1754810  1ac6ba libmudflap/.libs/libmudflapth.so
 1164303612 1634768 1754810 
1ac6ba  
../../build-x86_64-linux/x86_64-unknown-linux-gnu/libmudflap/.libs/libmudflapth.so
   textdata bss dec hex filename
 10224048729320  116432   1c6d0 libobjc/.libs/libobjc.so
 10224048729320  116432 
1c6d0   
../../build-x86_64-linux/x86_64-unknown-linux-gnu/libobjc/.libs/libobjc.so
   textdata bss dec hex filename
 2132151744  16  214975   347bf libquadmath/.libs/libquadmath.so
 2132311744  16  214991 
347cf   
../../build-x86_64-linux/x86_64-unknown-linux-gnu/libquadmath/.libs/libquadmath.so
   textdata bss dec hex filename
   6395 768  1671791c0b libssp/.libs/libssp.so
   6395 768  167179 
1c0b../../build-x86_64-linux/x86_64-unknown-linux-gnu/libssp/.libs/libssp.so
   textdata bss dec hex 

Re: PATCH: PR target/50603: [x32] Unnecessary lea

2011-10-04 Thread Uros Bizjak
On Tue, Oct 4, 2011 at 8:37 PM, H.J. Lu hjl.to...@gmail.com wrote:

 OTOH, x86_64 and i686 targets can also benefit from this change. If
 combine can't create more complex address (covered by lea), then it
 will simply propagate memory operand back into the add insn. It looks
 to me that we can't loose here, so:

  /* Improve address combine.  */
  if (code == PLUS  MEM_P (src2))
    src2 = force_reg (mode, src2);

 Any opinions?


 It doesn't work with 64bit libstdc++:

 Yeah, yeah. ix86_output_mi_thunk has some ...  issues.

 Please try attached patch that introduces ix86_emit_binop and uses it
 in a bunch of places.

 I tried it on GCC.  There are no regressions.  The bugs are fixed for x32.
 Here are size comparison with GCC runtime libraries on ia32, x32 and
 x86-64:

  884093   18600   27064  929757   e2fdd old libstdc++.so
  884189   18600   27064  929853   e303d new libs/libstdc++.so

 The new code is

 mov    0xc(%edi),%eax
 mov    %eax,0x8(%esi)
 mov    -0xc(%eax),%eax
 mov    0x10(%edi),%edx
 lea    0x8(%esi,%eax,1),%eax

 The old one is

 mov    0xc(%edi),%edx
 lea    0x8(%esi),%eax
 mov    %edx,0x8(%esi)
 add    -0xc(%edx),%eax
 mov    0x10(%edi),%edx

The new code merged lea+add into one lea, so it looks quite OK to me.

Do you have some performance numbers?

Thanks,
Uros.


Re: PATCH: PR target/50603: [x32] Unnecessary lea

2011-10-04 Thread H.J. Lu
On Tue, Oct 4, 2011 at 11:51 AM, Uros Bizjak ubiz...@gmail.com wrote:
 On Tue, Oct 4, 2011 at 8:37 PM, H.J. Lu hjl.to...@gmail.com wrote:

 OTOH, x86_64 and i686 targets can also benefit from this change. If
 combine can't create more complex address (covered by lea), then it
 will simply propagate memory operand back into the add insn. It looks
 to me that we can't loose here, so:

  /* Improve address combine.  */
  if (code == PLUS  MEM_P (src2))
    src2 = force_reg (mode, src2);

 Any opinions?


 It doesn't work with 64bit libstdc++:

 Yeah, yeah. ix86_output_mi_thunk has some ...  issues.

 Please try attached patch that introduces ix86_emit_binop and uses it
 in a bunch of places.

 I tried it on GCC.  There are no regressions.  The bugs are fixed for x32.
 Here are size comparison with GCC runtime libraries on ia32, x32 and
 x86-64:

  884093   18600   27064  929757   e2fdd old libstdc++.so
  884189   18600   27064  929853   e303d new libs/libstdc++.so

 The new code is

 mov    0xc(%edi),%eax
 mov    %eax,0x8(%esi)
 mov    -0xc(%eax),%eax
 mov    0x10(%edi),%edx
 lea    0x8(%esi,%eax,1),%eax

 The old one is

 mov    0xc(%edi),%edx
 lea    0x8(%esi),%eax
 mov    %edx,0x8(%esi)
 add    -0xc(%edx),%eax
 mov    0x10(%edi),%edx

 The new code merged lea+add into one lea, so it looks quite OK to me.

 Do you have some performance numbers?


I will report performance numbers in a few days.

-- 
H.J.