[x86 PATCH] PR target/105930: Split *xordi3_doubleword after reload.

Roger Sayle Wed, 22 Jun 2022 04:39:40 -0700

This patch addresses PR target/105930 which is an ia32 stack frame size
regression in high-register pressure XOR-rich cryptography functions
reported by Linus Torvalds.  The underlying problem is once the limited
number of registers on the x86 are exhausted, the register allocator
has to decide which to spill, where some eviction choices lead to much
poorer code, but these consequences are difficult to predict in advance.


The patch below, which splits xordi3_doubleword and iordi3_doubleword
after reload (instead of before), significantly reduces the amount of
spill code and stack frame size, in what might appear to be an arbitrary
choice.

My explanation of this behaviour is that the mixing of pre-reload split
SImode instructions and post-reload split DImode instructions is
confusing some of the heuristics used by reload.  One might think
that splitting early gives the register allocator more freedom to
use available registers, but in practice the constraint that double
word values occupy consecutive registers (when ultimately used as a
DImode value) is the greater constraint.  Instead, I believe in this
case, the pseudo registers used in memory addressing, appear to be
double counted for split SImode instructions compared to unsplit
DImode instructions.  For the reduced test case in comment #13, this
leads to %eax being used to hold the long-lived argument pointer "v",
blocking the use of the ax:dx pair for processing double word values.
The important lines are at the very top of the assembly output:

GCC 11  [use %ecx to address memory, require a 24-byte stack frame]
        sub     esp, 24
        mov     ecx, DWORD PTR [esp+40]

GCC 12 [use %eax to address memory, require a 44-byte stack frame]
        sub     esp, 44
        mov     eax, DWORD PTR [esp+64]

Jakub's alternative proposed patch in comment #17 is to improve
consistency by splitting more instructions (rotates) before reload,
which shows a small improvement, but not to GCC v11 levels.

I have some follow-up patches (based on other approaches I've tried),
including splitting rotations by 32 after reload, and supporting
TImode operations via <dwi>, notice this same pattern is mentioned in
https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596201.html
but this patch below is the core minimal fix that's hopefully
suitable for benchmarking and possibly backporting to the 12 branch.
I believe that changes to the register allocator itself, to tweak how
stack slots are assigned and which values can be cheaply materialized
are out-of-scope for the release branches.


I'm curious what folks (especially Uros and Jakub) think?
This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  CSiBE also shows a (minor) code size reduction.
It would be great if someone could benchmark this patch, or alternatively
it can be baked on mainline to let the automatic benchmarking evaluate it,
then revert the patch if there are any observed performance issues.

Thoughts?


2022-06-22  Roger Sayle  <[email protected]>

gcc/ChangeLog
        PR target/105930
        * config/i386/i386.md (*<any_or>di3_doubleword): Split after
        reload.  Use rtx_equal_p to avoid creating memory-to-memory moves,
        and emit NOTE_INSN_DELETED if operand[2] is zero (i.e. with -O0).
        (define_insn <any_or><mode>_1): Renamed from *<any_or>mode_1
        to provide gen_<any_or>si_1 functions.

Roger
--

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 3093cb5..537fba21 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -10539,48 +10539,61 @@
   "ix86_expand_binary_operator (<CODE>, <MODE>mode, operands); DONE;")
 
 (define_insn_and_split "*<code>di3_doubleword"
-  [(set (match_operand:DI 0 "nonimmediate_operand")
+  [(set (match_operand:DI 0 "nonimmediate_operand" "=ro,r")
        (any_or:DI
-        (match_operand:DI 1 "nonimmediate_operand")
-        (match_operand:DI 2 "x86_64_szext_general_operand")))
+        (match_operand:DI 1 "nonimmediate_operand" "0,0")
+        (match_operand:DI 2 "x86_64_szext_general_operand" "re,o")))
    (clobber (reg:CC FLAGS_REG))]
   "!TARGET_64BIT
-   && ix86_binary_operator_ok (<CODE>, DImode, operands)
-   && ix86_pre_reload_split ()"
+   && ix86_binary_operator_ok (<CODE>, DImode, operands)"
   "#"
-  "&& 1"
+  "&& reload_completed"
   [(const_int 0)]
 {
+  /* This insn may disappear completely when operands[2] == const0_rtx
+     and operands[0] == operands[1], which requires a NOTE_INSN_DELETED.  */
+  bool emit_insn_deleted_note_p = false;
+
   split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]);
 
   if (operands[2] == const0_rtx)
-    emit_move_insn (operands[0], operands[1]);
+    {
+      if (!rtx_equal_p (operands[0], operands[1]))
+       emit_move_insn (operands[0], operands[1]);
+      else
+       emit_insn_deleted_note_p = true;
+    }
   else if (operands[2] == constm1_rtx)
     {
       if (<CODE> == IOR)
        emit_move_insn (operands[0], constm1_rtx);
       else
-       ix86_expand_unary_operator (NOT, SImode, &operands[0]);
+       emit_insn (gen_one_cmplsi2 (operands[0], operands[1]));
     }
   else
-    ix86_expand_binary_operator (<CODE>, SImode, &operands[0]);
+    emit_insn (gen_<code>si_1 (operands[0], operands[1], operands[2]));
 
   if (operands[5] == const0_rtx)
-    emit_move_insn (operands[3], operands[4]);
+    {
+      if (!rtx_equal_p (operands[3], operands[4]))
+       emit_move_insn (operands[3], operands[4]);
+      else if (emit_insn_deleted_note_p)
+       emit_note (NOTE_INSN_DELETED);
+    }
   else if (operands[5] == constm1_rtx)
     {
       if (<CODE> == IOR)
        emit_move_insn (operands[3], constm1_rtx);
       else
-       ix86_expand_unary_operator (NOT, SImode, &operands[3]);
+       emit_insn (gen_one_cmplsi2 (operands[3], operands[4]));
     }
   else
-    ix86_expand_binary_operator (<CODE>, SImode, &operands[3]);
+    emit_insn (gen_<code>si_1 (operands[3], operands[4], operands[5]));
 
   DONE;
 })
 
-(define_insn "*<code><mode>_1"
+(define_insn "<code><mode>_1"
   [(set (match_operand:SWI248 0 "nonimmediate_operand" "=rm,r,?k")
        (any_or:SWI248
         (match_operand:SWI248 1 "nonimmediate_operand" "%0,0,k")

[x86 PATCH] PR target/105930: Split *xordi3_doubleword after reload.

Reply via email to