Richard Henderson wrote:
> On 08/06/2012 11:34 AM, Ulrich Weigand wrote:
> > There is one particular inefficiency I have noticed.  This code:
> > 
> >   if (!__atomic_compare_exchange_n (&v, &expected, max, 0 , 0, 0))
> >     abort ();
> > 
> > from atomic-compare-exchange-3.c gets compiled into:
> > 
> >         l       %r3,0(%r2)
> >         larl    %r1,v
> >         cs      %r3,%r4,0(%r1)
> >         ipm     %r1
> >         sra     %r1,28
> >         st      %r3,0(%r2)
> >         ltr     %r1,%r1
> >         jne     .L3
> > 
> > which is extremely inefficient; it converts the condition code into
> > an integer using the slow ipm, sra sequence, just so that it can
> > convert the integer back into a condition code via ltr and branch
> > on it ...
> 
> This was caused (or perhaps abetted by) the representation of EQ
> as NE ^ 1.  With the subsequent truncation and zero-extend, I
> think combine reached its insn limit of 3 before seeing everything
> it needed to see.

Yes, combine isn't able to handle everything in one go, but it finds
an intermediate insn.  With the old __sync_compare_and_swap, it is
then able to optimize everything away in a second step; the only
reason this doesn't work any more is the intervening store.
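
(For comparison, a hand-written sketch rather than actual compiler
output: with __sync_bool_compare_and_swap there is no oldval to store
back, so the same test boils down to just the cs and a branch on its
condition code, roughly

        larl    %r1,v
        cs      %r3,%r4,0(%r1)
        jne     .L3

with %r3 holding the expected value and %r4 the new value, and no
ipm/sra/ltr round trip at all.)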

The following patch changes the builtin expander to pass a MEM oldval
as-is to the back-end expander, so that the back-end can move the
store ahead of the CC operation.  With that patch I'm also seeing
all the IPMs disappear.
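
(Again purely as an illustration, not actual compiler output: with the
store moved up, I'd expect the sequence from the example above to become
something like

        l       %r3,0(%r2)
        larl    %r1,v
        cs      %r3,%r4,0(%r1)
        st      %r3,0(%r2)
        jne     .L3

i.e. the write-back of oldval sits right after the cs and the branch tests
the cs condition code directly, so the ipm/sra/ltr sequence is gone.)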

[ Well, all except for this one:

> This happens because CSE notices the cbranch vs 0, and sets r116
> to zero along the path to
> 
>      32   if (!__atomic_compare_exchange_n (&v, &expected, 0, STRONG,
>                                             __ATOMIC_RELEASE,
>                                             __ATOMIC_ACQUIRE))
> 
> at which point CSE decides that it would be cheaper to "re-use"
> the zero already in r116 instead of loading another constant 0 here.
> After that, combine is hamstrung because r116 is not dead.

As you point out, this does indeed fix this problem as well:

> A short-term possibility is to have the CAS insns accept general_operand,
> so that the 0 gets merged.  
]


What do you think about this solution?  It has the advantage that
we still get the same xor code if we actually do need the ipm ...


Bye,
Ulrich


diff -ur gcc/builtins.c gcc.new/builtins.c
--- gcc/builtins.c      2012-08-07 16:04:45.054348099 +0200
+++ gcc.new/builtins.c  2012-08-07 15:44:01.304349225 +0200
@@ -5376,6 +5376,7 @@
 
   expect = expand_normal (CALL_EXPR_ARG (exp, 1));
   expect = convert_memory_address (Pmode, expect);
+  expect = gen_rtx_MEM (mode, expect);
   desired = expand_expr_force_mode (CALL_EXPR_ARG (exp, 2), mode);
 
   weak = CALL_EXPR_ARG (exp, 3);
@@ -5383,14 +5384,15 @@
   if (host_integerp (weak, 0) && tree_low_cst (weak, 0) != 0)
     is_weak = true;
 
-  oldval = copy_to_reg (gen_rtx_MEM (mode, expect));
-
+  oldval = expect;
   if (!expand_atomic_compare_and_swap ((target == const0_rtx ? NULL : &target),
                                       &oldval, mem, oldval, desired,
                                       is_weak, success, failure))
     return NULL_RTX;
 
-  emit_move_insn (gen_rtx_MEM (mode, expect), oldval);
+  if (oldval != expect)
+    emit_move_insn (expect, oldval);
+
   return target;
 }
 
diff -ur gcc/config/s390/s390.md gcc.new/config/s390/s390.md
--- gcc/config/s390/s390.md     2012-08-07 16:04:54.204348621 +0200
+++ gcc.new/config/s390/s390.md 2012-08-07 16:00:21.934348628 +0200
@@ -8870,7 +8870,7 @@
 
 (define_expand "atomic_compare_and_swap<mode>"
   [(match_operand:SI 0 "register_operand")     ;; bool success output
-   (match_operand:DGPR 1 "register_operand")   ;; oldval output
+   (match_operand:DGPR 1 "nonimmediate_operand");; oldval output
    (match_operand:DGPR 2 "memory_operand")     ;; memory
    (match_operand:DGPR 3 "register_operand")   ;; expected intput
    (match_operand:DGPR 4 "register_operand")   ;; newval intput
@@ -8879,9 +8879,17 @@
    (match_operand:SI 7 "const_int_operand")]   ;; failure model
   ""
 {
-  rtx cc, cmp;
+  rtx cc, cmp, output = operands[1];
+
+  if (!register_operand (output, <MODE>mode))
+    output = gen_reg_rtx (<MODE>mode);
+
   emit_insn (gen_atomic_compare_and_swap<mode>_internal
-            (operands[1], operands[2], operands[3], operands[4]));
+            (output, operands[2], operands[3], operands[4]));
+
+  if (output != operands[1])
+    emit_move_insn (operands[1], output);
+
   cc = gen_rtx_REG (CCZ1mode, CC_REGNUM);
   cmp = gen_rtx_EQ (SImode, cc, const0_rtx);
   emit_insn (gen_cstorecc4 (operands[0], cmp, cc, const0_rtx));


-- 
  Dr. Ulrich Weigand
  GNU Toolchain for Linux on System z and Cell BE
  ulrich.weig...@de.ibm.com
