This patch addresses PR target/105930 which is an ia32 stack frame size regression in high-register pressure XOR-rich cryptography functions reported by Linus Torvalds. The underlying problem is once the limited number of registers on the x86 are exhausted, the register allocator has to decide which to spill, where some eviction choices lead to much poorer code, but these consequences are difficult to predict in advance.
The patch below, which splits xordi3_doubleword and iordi3_doubleword after reload (instead of before), significantly reduces the amount of spill code and stack frame size, in what might appear to be an arbitrary choice. My explanation of this behaviour is that the mixing of pre-reload split SImode instructions and post-reload split DImode instructions is confusing some of the heuristics used by reload. One might think that splitting early gives the register allocator more freedom to use available registers, but in practice the constraint that double word values occupy consecutive registers (when ultimately used as a DImode value) is the greater constraint. Instead, I believe in this case, the pseudo registers used in memory addressing, appear to be double counted for split SImode instructions compared to unsplit DImode instructions. For the reduced test case in comment #13, this leads to %eax being used to hold the long-lived argument pointer "v", blocking the use of the ax:dx pair for processing double word values. The important lines are at the very top of the assembly output: GCC 11 [use %ecx to address memory, require a 24-byte stack frame] sub esp, 24 mov ecx, DWORD PTR [esp+40] GCC 12 [use %eax to address memory, require a 44-byte stack frame] sub esp, 44 mov eax, DWORD PTR [esp+64] Jakub's alternative proposed patch in comment #17 is to improve consistency by splitting more instructions (rotates) before reload, which shows a small improvement, but not to GCC v11 levels. I have some follow-up patches (based on other approaches I've tried), including splitting rotations by 32 after reload, and supporting TImode operations via <dwi>, notice this same pattern is mentioned in https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596201.html but this patch below is the core minimal fix that's hopefully suitable for benchmarking and possibly backporting to the 12 branch. I believe that changes to the register allocator itself, to tweak how stack slots are assigned and which values can be cheaply materialized are out-of-scope for the release branches. I'm curious what folks (especially Uros and Jakub) think? This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, with no new failures. CSiBE also shows a (minor) code size reduction. It would be great if someone could benchmark this patch, or alternatively it can be baked on mainline to let the automatic benchmarking evaluate it, then revert the patch if there are any observed performance issues. Thoughts? 2022-06-22 Roger Sayle <ro...@nextmovesoftware.com> gcc/ChangeLog PR target/105930 * config/i386/i386.md (*<any_or>di3_doubleword): Split after reload. Use rtx_equal_p to avoid creating memory-to-memory moves, and emit NOTE_INSN_DELETED if operand[2] is zero (i.e. with -O0). (define_insn <any_or><mode>_1): Renamed from *<any_or>mode_1 to provide gen_<any_or>si_1 functions. Roger --
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 3093cb5..537fba21 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -10539,48 +10539,61 @@ "ix86_expand_binary_operator (<CODE>, <MODE>mode, operands); DONE;") (define_insn_and_split "*<code>di3_doubleword" - [(set (match_operand:DI 0 "nonimmediate_operand") + [(set (match_operand:DI 0 "nonimmediate_operand" "=ro,r") (any_or:DI - (match_operand:DI 1 "nonimmediate_operand") - (match_operand:DI 2 "x86_64_szext_general_operand"))) + (match_operand:DI 1 "nonimmediate_operand" "0,0") + (match_operand:DI 2 "x86_64_szext_general_operand" "re,o"))) (clobber (reg:CC FLAGS_REG))] "!TARGET_64BIT - && ix86_binary_operator_ok (<CODE>, DImode, operands) - && ix86_pre_reload_split ()" + && ix86_binary_operator_ok (<CODE>, DImode, operands)" "#" - "&& 1" + "&& reload_completed" [(const_int 0)] { + /* This insn may disappear completely when operands[2] == const0_rtx + and operands[0] == operands[1], which requires a NOTE_INSN_DELETED. */ + bool emit_insn_deleted_note_p = false; + split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]); if (operands[2] == const0_rtx) - emit_move_insn (operands[0], operands[1]); + { + if (!rtx_equal_p (operands[0], operands[1])) + emit_move_insn (operands[0], operands[1]); + else + emit_insn_deleted_note_p = true; + } else if (operands[2] == constm1_rtx) { if (<CODE> == IOR) emit_move_insn (operands[0], constm1_rtx); else - ix86_expand_unary_operator (NOT, SImode, &operands[0]); + emit_insn (gen_one_cmplsi2 (operands[0], operands[1])); } else - ix86_expand_binary_operator (<CODE>, SImode, &operands[0]); + emit_insn (gen_<code>si_1 (operands[0], operands[1], operands[2])); if (operands[5] == const0_rtx) - emit_move_insn (operands[3], operands[4]); + { + if (!rtx_equal_p (operands[3], operands[4])) + emit_move_insn (operands[3], operands[4]); + else if (emit_insn_deleted_note_p) + emit_note (NOTE_INSN_DELETED); + } else if (operands[5] == constm1_rtx) { if (<CODE> == IOR) emit_move_insn (operands[3], constm1_rtx); else - ix86_expand_unary_operator (NOT, SImode, &operands[3]); + emit_insn (gen_one_cmplsi2 (operands[3], operands[4])); } else - ix86_expand_binary_operator (<CODE>, SImode, &operands[3]); + emit_insn (gen_<code>si_1 (operands[3], operands[4], operands[5])); DONE; }) -(define_insn "*<code><mode>_1" +(define_insn "<code><mode>_1" [(set (match_operand:SWI248 0 "nonimmediate_operand" "=rm,r,?k") (any_or:SWI248 (match_operand:SWI248 1 "nonimmediate_operand" "%0,0,k")