[gcc r15-775] i386: Correct insn_cost of movabsq.
https://gcc.gnu.org/g:a3b16e73a2d5b2d4d20ef6f2fd164cea633bbec8 commit r15-775-ga3b16e73a2d5b2d4d20ef6f2fd164cea633bbec8 Author: Roger Sayle Date: Wed May 22 16:45:48 2024 +0100 i386: Correct insn_cost of movabsq. This single-line patch fixes a strange quirk/glitch in i386's rtx_costs, which considers an instruction loading a 64-bit constant to be significantly cheaper than loading a 32-bit (or smaller) constant. Consider the two functions: unsigned long long foo() { return 0x0123456789abcdefULL; } unsigned int bar() { return 10; } and the corresponding lines from combine's dump file: insn_cost 1 for #: r98:DI=0x123456789abcdef insn_cost 4 for #: ax:SI=0xa The same issue can be seen in -dP assembler output. movabsq $81985529216486895, %rax # 5 [c=1 l=10] *movdi_internal/4 The problem is that pattern_cost's interpretation of rtx_costs contains "return cost > 0 ? cost : COSTS_N_INSNS (1)" where a zero value (for example a register or small immediate constant) is considered special, and equivalent to a single instruction, but all other values are treated verbatim. Hence to make x86_64's 10-byte long movabsq instruction slightly more expensive than a simple constant, rtx_costs needs to return COSTS_N_INSNS(1)+1 and not 1. With this change, the insn_cost of movabsq is the intended value 5: insn_cost 5 for #: r98:DI=0x123456789abcdef and movabsq $81985529216486895, %rax # 5 [c=5 l=10] *movdi_internal/4 2024-05-22 Roger Sayle gcc/ChangeLog * config/i386/i386.cc (ix86_rtx_costs) <case CONST_INT>: A CONST_INT that isn't x86_64_immediate_operand requires an extra (expensive) movabsq insn to load, so return COSTS_N_INSNS (1) + 1. 
Diff: --- gcc/config/i386/i386.cc | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index 69cd4ae05a7..3e2a3a194f1 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -21562,7 +21562,8 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno, if (x86_64_immediate_operand (x, VOIDmode)) *total = 0; else - *total = 1; + /* movabsq is slightly more expensive than a simple instruction. */ + *total = COSTS_N_INSNS (1) + 1; return true; case CONST_DOUBLE:
[x86_64 PATCH] Correct insn_cost of movabsq.
This single-line patch fixes a strange quirk/glitch in i386's rtx_costs, which considers an instruction loading a 64-bit constant to be significantly cheaper than loading a 32-bit (or smaller) constant. Consider the two functions: unsigned long long foo() { return 0x0123456789abcdefULL; } unsigned int bar() { return 10; } and the corresponding lines from combine's dump file: insn_cost 1 for #: r98:DI=0x123456789abcdef insn_cost 4 for #: ax:SI=0xa The same issue can be seen in -dP assembler output. movabsq $81985529216486895, %rax # 5 [c=1 l=10] *movdi_internal/4 The problem is that pattern_cost's interpretation of rtx_costs contains "return cost > 0 ? cost : COSTS_N_INSNS (1)" where a zero value (for example a register or small immediate constant) is considered special, and equivalent to a single instruction, but all other values are treated verbatim. Hence to make x86_64's 10-byte long movabsq instruction slightly more expensive than a simple constant, rtx_costs needs to return COSTS_N_INSNS(1)+1 and not 1. With this change, the insn_cost of movabsq is the intended value 5: insn_cost 5 for #: r98:DI=0x123456789abcdef and movabsq $81985529216486895, %rax # 5 [c=5 l=10] *movdi_internal/4 [I'd originally tried fixing this by adding an ix86_insn_cost target hook, but the testsuite is very sensitive to the costing of insns]. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-05-22 Roger Sayle gcc/ChangeLog * config/i386/i386.cc (ix86_rtx_costs) <case CONST_INT>: A CONST_INT that isn't x86_64_immediate_operand requires an extra (expensive) movabsq insn to load, so return COSTS_N_INSNS (1) + 1. 
Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index b4838b7..b4a9519 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -21569,7 +21569,7 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno, if (x86_64_immediate_operand (x, VOIDmode)) *total = 0; else - *total = 1; + *total = COSTS_N_INSNS (1) + 1; return true; case CONST_DOUBLE:
[gcc r15-774] Avoid ICE in except.cc on targets that don't support exceptions.
https://gcc.gnu.org/g:26df7b4684e201e66c09dd018603a248ddc5f437 commit r15-774-g26df7b4684e201e66c09dd018603a248ddc5f437 Author: Roger Sayle Date: Wed May 22 13:48:52 2024 +0100 Avoid ICE in except.cc on targets that don't support exceptions. A number of testcases currently fail on nvptx with the ICE: during RTL pass: final openmp-simd-2.c: In function 'foo': openmp-simd-2.c:28:1: internal compiler error: in get_personality_function, at expr.cc:14037 28 | } | ^ 0x98a38f get_personality_function(tree_node*) /home/roger/GCC/nvptx-none/gcc/gcc/expr.cc:14037 0x969d3b output_function_exception_table(int) /home/roger/GCC/nvptx-none/gcc/gcc/except.cc:3226 0x9b760d rest_of_handle_final /home/roger/GCC/nvptx-none/gcc/gcc/final.cc:4252 The simple oversight in output_function_exception_table is that it calls get_personality_function (immediately) before checking the target's except_unwind_info hook (which on nvptx always returns UI_NONE). The (perhaps obvious) fix is to move the assignments of fnname and personality after the tests of whether they are needed, and before their first use. 2024-05-22 Roger Sayle gcc/ChangeLog * except.cc (output_function_exception_table): Move call to get_personality_function after targetm_common.except_unwind_info check, to avoid ICE on targets that don't support exceptions. Diff: --- gcc/except.cc | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/gcc/except.cc b/gcc/except.cc index 2080fcc22e6..b5886e97be9 100644 --- a/gcc/except.cc +++ b/gcc/except.cc @@ -3222,9 +3222,6 @@ output_one_function_exception_table (int section) void output_function_exception_table (int section) { - const char *fnname = get_fnname_from_decl (current_function_decl); - rtx personality = get_personality_function (current_function_decl); - /* Not all functions need anything. 
*/ if (!crtl->uses_eh_lsda || targetm_common.except_unwind_info (&global_options) == UI_NONE) @@ -3234,6 +3231,9 @@ output_function_exception_table (int section) if (section == 1 && !crtl->eh.call_site_record_v[1]) return; + const char *fnname = get_fnname_from_decl (current_function_decl); + rtx personality = get_personality_function (current_function_decl); + if (personality) { assemble_external_libcall (personality);
[PATCH] Avoid ICE in except.cc on targets that don't support exceptions.
A number of testcases currently fail on nvptx with the ICE: during RTL pass: final openmp-simd-2.c: In function 'foo': openmp-simd-2.c:28:1: internal compiler error: in get_personality_function, at expr.cc:14037 28 | } | ^ 0x98a38f get_personality_function(tree_node*) /home/roger/GCC/nvptx-none/gcc/gcc/expr.cc:14037 0x969d3b output_function_exception_table(int) /home/roger/GCC/nvptx-none/gcc/gcc/except.cc:3226 0x9b760d rest_of_handle_final /home/roger/GCC/nvptx-none/gcc/gcc/final.cc:4252 The simple oversight in output_function_exception_table is that it calls get_personality_function (immediately) before checking the target's except_unwind_info hook (which on nvptx always returns UI_NONE). The (perhaps obvious) fix is to move the assignments of fnname and personality after the tests of whether they are needed, and before their first use. This patch has been tested on nvptx-none hosted on x86_64-pc-linux-gnu with no new failures in the testsuite, and ~220 fewer FAILs. Ok for mainline? 2024-05-22 Roger Sayle gcc/ChangeLog * except.cc (output_function_exception_table): Move call to get_personality_function after targetm_common.except_unwind_info check, to avoid ICE on targets that don't support exceptions. Thanks in advance, Roger -- diff --git a/gcc/except.cc b/gcc/except.cc index 2080fcc..b5886e9 100644 --- a/gcc/except.cc +++ b/gcc/except.cc @@ -3222,9 +3222,6 @@ output_one_function_exception_table (int section) void output_function_exception_table (int section) { - const char *fnname = get_fnname_from_decl (current_function_decl); - rtx personality = get_personality_function (current_function_decl); - /* Not all functions need anything. 
*/ if (!crtl->uses_eh_lsda || targetm_common.except_unwind_info (&global_options) == UI_NONE) @@ -3234,6 +3231,9 @@ output_function_exception_table (int section) if (section == 1 && !crtl->eh.call_site_record_v[1]) return; + const char *fnname = get_fnname_from_decl (current_function_decl); + rtx personality = get_personality_function (current_function_decl); + if (personality) { assemble_external_libcall (personality);
[gcc r15-648] nvptx: Correct pattern for popcountdi2 insn in nvptx.md.
https://gcc.gnu.org/g:1676ef6e91b902f592270e4bcf10b4fc342e200d commit r15-648-g1676ef6e91b902f592270e4bcf10b4fc342e200d Author: Roger Sayle Date: Sun May 19 09:49:45 2024 +0100 nvptx: Correct pattern for popcountdi2 insn in nvptx.md. The result of a POPCOUNT operation in RTL should have the same mode as its operand. This corrects the specification of popcount in the nvptx backend, splitting the current generic define_insn into two, one for popcountsi2 and the other for popcountdi2 (the latter with an explicit truncate). 2024-05-19 Roger Sayle gcc/ChangeLog * config/nvptx/nvptx.md (popcount<mode>2): Split into... (popcountsi2): define_insn handling SImode popcount. (popcountdi2): define_insn handling DImode popcount, with an explicit truncate:SI to produce an SImode result. Diff: --- gcc/config/nvptx/nvptx.md | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md index 96e6c9116080..ef7e3fb00fac 100644 --- a/gcc/config/nvptx/nvptx.md +++ b/gcc/config/nvptx/nvptx.md @@ -655,11 +655,18 @@ DONE; }) -(define_insn "popcount<mode>2" +(define_insn "popcountsi2" [(set (match_operand:SI 0 "nvptx_register_operand" "=R") - (popcount:SI (match_operand:SDIM 1 "nvptx_register_operand" "R")))] + (popcount:SI (match_operand:SI 1 "nvptx_register_operand" "R")))] "" - "%.\\tpopc.b%T1\\t%0, %1;") + "%.\\tpopc.b32\\t%0, %1;") + +(define_insn "popcountdi2" + [(set (match_operand:SI 0 "nvptx_register_operand" "=R") + (truncate:SI + (popcount:DI (match_operand:DI 1 "nvptx_register_operand" "R"))))] + "" + "%.\\tpopc.b64\\t%0, %1;") ;; Multiplication variants
[x86 SSE] Improve handling of ternlog instructions in i386/sse.md (v2)
Hi Hongtao, Many thanks for the review, bug fixes and suggestions for improvements. This revised version of the patch implements all of your corrections. In theory the "ternlog idx" should guarantee that some operands are non-null, but I agree that it's better defensive programming to check invariants not easily proved. Instead of calling ix86_expand_vector_move, I use ix86_broadcast_from_constant to achieve the same effect of using a broadcast when possible, but it has the benefit of still using a memory operand (instead of a vector load) when broadcasting isn't possible. There are other places that could benefit from the same trick, but I can address these in a follow-up patch (it may even be preferable to keep these as CONST_VECTOR during early RTL passes and lower to broadcast or constant pool using splitters). This revised patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-05-17 Roger Sayle Hongtao Liu gcc/ChangeLog PR target/115021 * config/i386/i386-expand.cc (ix86_expand_args_builtin): Call fixup_modeless_constant before testing predicates. Only call copy_to_mode_reg on memory operands (after the first one). (ix86_gen_bcst_mem): Helper function to convert a CONST_VECTOR into a VEC_DUPLICATE if possible. (ix86_ternlog_idx): Convert an RTX expression into a ternlog index between 0 and 255, recording the operands in ARGS, if possible, or return -1 if this is not possible/valid. (ix86_ternlog_leaf_p): Helper function to identify "leaves" of a ternlog expression, e.g. REG_P, MEM_P, CONST_VECTOR, etc. (ix86_ternlog_operand_p): Test whether an expression is suitable for and preferred as an UNSPEC_TERNLOG. (ix86_expand_ternlog_binop): Helper function to construct the binary operation corresponding to a sufficiently simple ternlog. 
(ix86_expand_ternlog_andnot): Helper function to construct an ANDN operation corresponding to a sufficiently simple ternlog. (ix86_expand_ternlog): Expand a 3-operand ternary logic expression, constructing either an UNSPEC_TERNLOG or simpler rtx expression. Called from builtin expanders and pre-reload splitters. * config/i386/i386-protos.h (ix86_ternlog_idx): Prototype here. (ix86_ternlog_operand_p): Likewise. (ix86_expand_ternlog): Likewise. * config/i386/predicates.md (ternlog_operand): New predicate that calls ix86_ternlog_operand_p. * config/i386/sse.md (_vpternlog_0): New define_insn_and_split that recognizes a SET_SRC of ternlog_operand and expands it via ix86_expand_ternlog pre-reload. (_vternlog_mask): Convert from define_insn to define_expand. Use ix86_expand_ternlog if the mask operand is ~0 (or 255 or -1). (*_vternlog_mask): define_insn renamed from above. gcc/testsuite/ChangeLog * gcc.target/i386/avx512f-andn-di-zmm-2.c: Update test case. * gcc.target/i386/avx512f-andn-si-zmm-2.c: Likewise. * gcc.target/i386/avx512f-orn-si-zmm-1.c: Likewise. * gcc.target/i386/avx512f-orn-si-zmm-2.c: Likewise. * gcc.target/i386/avx512f-vpternlogd-1.c: Likewise. * gcc.target/i386/avx512f-vpternlogq-1.c: Likewise. * gcc.target/i386/avx512vl-vpternlogd-1.c: Likewise. * gcc.target/i386/avx512vl-vpternlogq-1.c: Likewise. * gcc.target/i386/pr100711-3.c: Likewise. * gcc.target/i386/pr100711-4.c: Likewise. * gcc.target/i386/pr100711-5.c: Likewise. Thanks again, Roger -- > From: Hongtao Liu > Sent: 14 May 2024 09:46 > On Mon, May 13, 2024 at 5:57 AM Roger Sayle > wrote: > > > > This patch improves the way that the x86 backend recognizes and > > expands AVX512's bitwise ternary logic (vpternlog) instructions. > I like the patch. 
> > 1 file changed, 25 insertions(+), 1 deletion(-) > gcc/config/i386/i386-expand.cc | 26 > +- > > modified gcc/config/i386/i386-expand.cc > @@ -25601,6 +25601,7 @@ ix86_gen_bcst_mem (machine_mode mode, rtx x) > int ix86_ternlog_idx (rtx op, rtx *args) { > + /* Nice dynamic programming:) */ >int idx0, idx1; > >if (!op) > @@ -25651,6 +25652,7 @@ ix86_ternlog_idx (rtx op, rtx *args) > return 0xaa; > } >/* Maximum of one volatile memory reference per expression. */ > + /* According to comments, it should be && ? */ >if (side_effects_p (op) || side_effects_p (args[2])) > return -1; >if (rtx_equal_p (op, args[2])) > @@ -25666,6 +25668,8 @@ ix86_ternlog_idx (rtx op, rtx *args) > > case SUBREG: >if (!VECTOR_MODE_P (GET_MODE (SUBREG_REG (op))) > +
[x86 SSE] Improve handling of ternlog instructions in i386/sse.md
This patch improves the way that the x86 backend recognizes and expands AVX512's bitwise ternary logic (vpternlog) instructions. As a motivating example consider the following code which calculates the carry out from a (binary) full adder: typedef unsigned long long v4di __attribute((vector_size(32))); v4di foo(v4di a, v4di b, v4di c) { return (a & b) | ((a ^ b) & c); } with -O2 -march=cascadelake current mainline produces: foo: vpternlogq $96, %ymm0, %ymm1, %ymm2 vmovdqa %ymm0, %ymm3 vmovdqa %ymm2, %ymm0 vpternlogq $248, %ymm3, %ymm1, %ymm0 ret with the patch below, we now generate a single instruction: foo: vpternlogq $232, %ymm2, %ymm1, %ymm0 ret The AVX512 vpternlog[qd] instructions are a very cool addition to the x86 instruction set, that can calculate any Boolean function of three inputs in a single fast instruction. As the truth table for any three-input function has 8 rows, any specific function can be represented by specifying those bits, i.e. by an 8-bit byte, an immediate integer between 0 and 255. Examples of ternary functions and their indices are given below:
0x01 1: ~((b|a)|c)
0x02 2: (~(b|a))&c
0x03 3: ~(b|a)
0x04 4: (~(c|a))&b
0x05 5: ~(c|a)
0x06 6: (c^b)&~a
0x07 7: ~((c&b)|a)
0x08 8: (~a&c)&b (~a&b)&c (c&b)&~a
0x09 9: ~((c^b)|a)
0x0a 10: ~a&c
0x0b 11: ~((~c&b)|a) (~b|c)&~a
0x0c 12: ~a&b
0x0d 13: ~((~b&c)|a) (~c|b)&~a
0x0e 14: (c|b)&~a
0x0f 15: ~a
0x10 16: (~(c|b))&a
0x11 17: ~(c|b)
...
0xf4 244: (~c&b)|a
0xf5 245: ~c|a
0xf6 246: (c^b)|a
0xf7 247: (~(c&b))|a
0xf8 248: (c&b)|a
0xf9 249: (~(c^b))|a
0xfa 250: c|a
0xfb 251: (c|a)|~b (~b|a)|c (~b|c)|a
0xfc 252: b|a
0xfd 253: (b|a)|~c (~c|a)|b (~c|b)|a
0xfe 254: (b|a)|c (c|a)|b (c|b)|a
A naive implementation (in many compilers) might be to add define_insn patterns for all 256 different functions. The situation is even worse as many of these Boolean functions don't have a "canonical form" (as produced by simplify_rtx) and would each need multiple patterns. See the space-separated equivalent expressions in the table above. 
This need to provide instruction "templates" might explain why GCC, LLVM and ICC all exhibit similar coverage problems in their ability to recognize x86 ternlog ternary functions. Perhaps a unique feature of GCC's design is that in addition to regular define_insn templates, machine descriptions can also perform pattern matching via a match_operator (and its corresponding predicate). This patch introduces a ternlog_operand predicate that matches a (possibly infinite) set of expression trees, identifying those that have at most three unique operands. This then allows a define_insn_and_split to recognize suitable expressions and then transform them into the appropriate UNSPEC_VTERNLOG as a pre-reload splitter. This design allows combine to smash together arbitrarily complex Boolean expressions, then transform them into an UNSPEC before register allocation. As an "optimization", where possible ix86_expand_ternlog generates a simpler binary operation, using AND, XOR, IOR or ANDN where possible, and in a few cases attempts to "canonicalize" the ternlog, by reordering or duplicating operands, so that later CSE passes have a hope of spotting equivalent values. Another benefit of this patch is that it improves the code generated for PR target/115021 [see comment #1]. This patch leaves the existing ternlog patterns in sse.md (for now), many of which are made obsolete by these changes. In theory we now only need one define_insn for UNSPEC_VTERNLOG. One complication from these previous variants was that they inconsistently used decimal vs. hexadecimal to specify the immediate constant operand in assembly language, making the list of tweaks to the testsuite with this patch larger than it might have been. I propose to remove the vestigial patterns in a follow-up patch, once this approach has baked (proven to be stable) on mainline. 
This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-05-12 Roger Sayle gcc/ChangeLog PR target/115021 * config/i386/i386-expand.cc (ix86_expand_args_builtin): Call fixup_modeless_constant before testing predicates. Only call copy_to_mode_reg on memory operands (after the first one). (ix86_gen_bcst_mem): Helper function to convert a CONST_VECTOR into a VEC_DUPLICATE if possible. (ix86_ternlog_idx): Convert an RTX expression into a ternlog index between 0 and 255, recording the operands in ARGS, if possible or return -1 if this is not possible/valid. (ix86_ternlog_leaf_p): Helper function to identify "leaves" of a ternlog expression, e.g. REG_P, MEM_P, CONST_VECTOR, etc. (ix86_ternlog_operand_p): Test
[gcc r15-390] arm: Use uxtb rN, rM, ror #8 to implement zero_extract on armv6.
https://gcc.gnu.org/g:46077992180d6d86c86544df5e8cb943492d3b01 commit r15-390-g46077992180d6d86c86544df5e8cb943492d3b01 Author: Roger Sayle Date: Sun May 12 16:27:22 2024 +0100 arm: Use uxtb rN, rM, ror #8 to implement zero_extract on armv6. Examining the code generated for the following C snippet on a raspberry pi: int popcount_lut8(unsigned *buf, int n) { int cnt=0; unsigned int i; do { i = *buf; cnt += lut[i&255]; cnt += lut[i>>8&255]; cnt += lut[i>>16&255]; cnt += lut[i>>24]; buf++; } while(--n); return cnt; } I was surprised to see the following instruction sequence generated by the compiler: mov r5, r2, lsr #8 uxtb r5, r5 This sequence can be performed by a single ARM instruction: uxtb r5, r2, ror #8 The attached patch allows GCC's combine pass to take advantage of ARM's uxtb with rotate functionality to implement the above zero_extract, and likewise to use the sxtb with rotate to implement sign_extract. ARM's uxtb and sxtb can only be used with rotates of 0, 8, 16 and 24, and of these only the 8 and 16 are useful [ror #0 is a nop, and extends with ror #24 can be implemented using regular shifts], so the approach here is to add the six missing but useful instructions as 6 different define_insn in arm.md, rather than try to be clever with new predicates. Later ARM hardware has advanced bit field instructions, and earlier ARM cores didn't support extend-with-rotate, so this appears to only benefit armv6 era CPUs (e.g. the raspberry pi). Patch posted: https://gcc.gnu.org/legacy-ml/gcc-patches/2018-01/msg01339.html Approved by Kyrill Tkachov: https://gcc.gnu.org/legacy-ml/gcc-patches/2018-01/msg01881.html 2024-05-12 Roger Sayle Kyrill Tkachov * config/arm/arm.md (*arm_zeroextractsi2_8_8, *arm_signextractsi2_8_8, *arm_zeroextractsi2_8_16, *arm_signextractsi2_8_16, *arm_zeroextractsi2_16_8, *arm_signextractsi2_16_8): New. 2024-05-12 Roger Sayle Kyrill Tkachov * gcc.target/arm/extend-ror.c: New test. 
Diff: --- gcc/config/arm/arm.md | 66 +++ gcc/testsuite/gcc.target/arm/extend-ror.c | 38 ++ 2 files changed, 104 insertions(+) diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md index 1fd00146ca9e..f47e036a8034 100644 --- a/gcc/config/arm/arm.md +++ b/gcc/config/arm/arm.md @@ -12647,6 +12647,72 @@ "" ) +;; Implement zero_extract using uxtb/uxth instruction with +;; the ror #N qualifier when applicable. + +(define_insn "*arm_zeroextractsi2_8_8" + [(set (match_operand:SI 0 "s_register_operand" "=r") + (zero_extract:SI (match_operand:SI 1 "s_register_operand" "r") +(const_int 8) (const_int 8)))] + "TARGET_ARM && arm_arch6" + "uxtb%?\\t%0, %1, ror #8" + [(set_attr "predicable" "yes") + (set_attr "type" "extend")] +) + +(define_insn "*arm_zeroextractsi2_8_16" + [(set (match_operand:SI 0 "s_register_operand" "=r") + (zero_extract:SI (match_operand:SI 1 "s_register_operand" "r") +(const_int 8) (const_int 16)))] + "TARGET_ARM && arm_arch6" + "uxtb%?\\t%0, %1, ror #16" + [(set_attr "predicable" "yes") + (set_attr "type" "extend")] +) + +(define_insn "*arm_zeroextractsi2_16_8" + [(set (match_operand:SI 0 "s_register_operand" "=r") + (zero_extract:SI (match_operand:SI 1 "s_register_operand" "r") +(const_int 16) (const_int 8)))] + "TARGET_ARM && arm_arch6" + "uxth%?\\t%0, %1, ror #8" + [(set_attr "predicable" "yes") + (set_attr "type" "extend")] +) + +;; Implement sign_extract using sxtb/sxth instruction with +;; the ror #N qualifier when applicable. + +(define_insn "*arm_signextractsi2_8_8" + [(set (match_operand:SI 0 "s_register_operand" "=r") + (sign_extract:SI (match_operand:SI 1 "s_register_operand" "r") +(const_int 8) (const_int 8)))] + "TARGET_ARM && arm_arch6" + "sxtb%?\\t%0, %1, ror #8" + [(set_attr "predicable" "yes") + (set_attr "type" "extend")] +) + +(define_insn "*arm_signextractsi2_8_16" + [(set (match_operand:SI 0 "s_register_operand" "=r") + (sign_extract:SI (match_operand:SI 1 "s
[gcc r15-366] i386: Improve V[48]QI shifts on AVX512/SSE4.1
https://gcc.gnu.org/g:f5a8cdc1ef5d6aa2de60849c23658ac5298df7bb commit r15-366-gf5a8cdc1ef5d6aa2de60849c23658ac5298df7bb Author: Roger Sayle Date: Fri May 10 20:26:40 2024 +0100 i386: Improve V[48]QI shifts on AVX512/SSE4.1 The following one-line patch improves the code generated for V8QI and V4QI shifts when AVX512BW and AVX512VL functionality is available. For the testcase (from gcc.target/i386/vect-shiftv8qi.c): typedef signed char v8qi __attribute__ ((__vector_size__ (8))); v8qi foo (v8qi x) { return x >> 5; } GCC with -O2 -march=cascadelake currently generates: foo: movl $67372036, %eax vpsraw $5, %xmm0, %xmm2 vpbroadcastd %eax, %xmm1 movl $117901063, %eax vpbroadcastd %eax, %xmm3 vmovdqa %xmm1, %xmm0 vmovdqa %xmm3, -24(%rsp) vpternlogd $120, -24(%rsp), %xmm2, %xmm0 vpsubb %xmm1, %xmm0, %xmm0 ret with this patch we now generate the much improved: foo: vpmovsxbw %xmm0, %xmm0 vpsraw $5, %xmm0, %xmm0 vpmovwb %xmm0, %xmm0 ret This patch also fixes the FAILs of gcc.target/i386/vect-shiftv[48]qi.c when run with the additional -march=cascadelake flag, by splitting these tests into two; one form testing code generation with -msse2 (and -mno-avx512vl) as originally intended, and the other testing AVX512 code generation with an explicit -march=cascadelake. 2024-05-10 Roger Sayle Hongtao Liu gcc/ChangeLog * config/i386/i386-expand.cc (ix86_expand_vecop_qihi_partial): Don't attempt ix86_expand_vec_shift_qihi_constant on SSE4.1. gcc/testsuite/ChangeLog * gcc.target/i386/vect-shiftv4qi.c: Specify -mno-avx512vl. * gcc.target/i386/vect-shiftv8qi.c: Likewise. * gcc.target/i386/vect-shiftv4qi-2.c: New test case. * gcc.target/i386/vect-shiftv8qi-2.c: Likewise. 
Diff: --- gcc/config/i386/i386-expand.cc | 3 ++ gcc/testsuite/gcc.target/i386/vect-shiftv4qi-2.c | 43 gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c | 2 +- gcc/testsuite/gcc.target/i386/vect-shiftv8qi-2.c | 43 gcc/testsuite/gcc.target/i386/vect-shiftv8qi.c | 2 +- 5 files changed, 91 insertions(+), 2 deletions(-) diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index 2f27bfb484c2..1ab22fe79736 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -24283,6 +24283,9 @@ ix86_expand_vecop_qihi_partial (enum rtx_code code, rtx dest, rtx op1, rtx op2) if (CONST_INT_P (op2) && (code == ASHIFT || code == LSHIFTRT || code == ASHIFTRT) + /* With AVX512 it's cheaper to do vpmovsxbw/op/vpmovwb. + Even with SSE4.1 the alternative is better. */ + && !TARGET_SSE4_1 && ix86_expand_vec_shift_qihi_constant (code, qdest, qop1, qop2)) { emit_move_insn (dest, gen_lowpart (qimode, qdest)); diff --git a/gcc/testsuite/gcc.target/i386/vect-shiftv4qi-2.c b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi-2.c new file mode 100644 index ..abc1a276b043 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi-2.c @@ -0,0 +1,43 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=cascadelake" } */ + +#define N 4 + +typedef unsigned char __vu __attribute__ ((__vector_size__ (N))); +typedef signed char __vi __attribute__ ((__vector_size__ (N))); + +__vu sll (__vu a, int n) +{ + return a << n; +} + +__vu sll_c (__vu a) +{ + return a << 5; +} + +/* { dg-final { scan-assembler-times "vpsllw" 2 } } */ + +__vu srl (__vu a, int n) +{ + return a >> n; +} + +__vu srl_c (__vu a) +{ + return a >> 5; +} + +/* { dg-final { scan-assembler-times "vpsrlw" 2 } } */ + +__vi sra (__vi a, int n) +{ + return a >> n; +} + +__vi sra_c (__vi a) +{ + return a >> 5; +} + +/* { dg-final { scan-assembler-times "vpsraw" 2 } } */ diff --git a/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c index 
b7e45c2e8799..9b52582d01f8 100644 --- a/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c +++ b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -msse2" } */ +/* { dg-options "-O2 -msse2 -mno-avx2 -mno-avx512vl" } */ #define N 4 diff --git a/gcc/testsuite/gcc.target/i386/vect-shiftv8qi-2.c b/gcc/testsuite/gcc.target/i386/vect-shiftv8qi-2.c new file mode 100644 index ..52760f5a0607 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/vect-shiftv8qi-2.c @@ -0,0 +1,43 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -march=cascadelake" } */ + +#defi
Re: [x86 PATCH] Improve V[48]QI shifts on AVX512
Many thanks for the speedy review and correction/improvement. It's interesting that you spotted the ternlog "spill"... I have a patch that rewrites ternlog handling that's been waiting for stage 1, that would also fix this mem operand issue. I hope to submit it for review this weekend. Thanks again, Roger > From: Hongtao Liu > On Fri, May 10, 2024 at 6:26 AM Roger Sayle > wrote: > > > > > > The following one line patch improves the code generated for V8QI and > > V4QI shifts when AVX512BW and AVX512VL functionality is available. > + /* With AVX512 its cheaper to do vpmovsxbw/op/vpmovwb. */ > + && !(TARGET_AVX512BW && TARGET_AVX512VL && TARGET_SSE4_1) > && ix86_expand_vec_shift_qihi_constant (code, qdest, qop1, qop2)) I > think > TARGET_SSE4_1 is enough, it's always better w/ sse4.1 and above when not going > into ix86_expand_vec_shift_qihi_constant. > Others LGTM. > > > > For the testcase (from gcc.target/i386/vect-shiftv8qi.c): > > > > typedef signed char v8qi __attribute__ ((__vector_size__ (8))); v8qi > > foo (v8qi x) { return x >> 5; } > > > > GCC with -O2 -march=cascadelake currently generates: > > > > foo: movl $67372036, %eax > > vpsraw $5, %xmm0, %xmm2 > > vpbroadcastd %eax, %xmm1 > > movl $117901063, %eax > > vpbroadcastd %eax, %xmm3 > > vmovdqa %xmm1, %xmm0 > > vmovdqa %xmm3, -24(%rsp) > > vpternlogd $120, -24(%rsp), %xmm2, %xmm0 > It looks like a miss-optimization under AVX512, but it's a separate issue. > > vpsubb %xmm1, %xmm0, %xmm0 > > ret > > > > with this patch we now generate the much improved: > > > > foo: vpmovsxbw %xmm0, %xmm0 > > vpsraw $5, %xmm0, %xmm0 > > vpmovwb %xmm0, %xmm0 > > ret > > > > This patch also fixes the FAILs of gcc.target/i386/vect-shiftv[48]qi.c > > when run with the additional -march=cascadelake flag, by splitting > > these tests into two; one form testing code generation with -msse2 > > (and > > -mno-avx512vl) as originally intended, and the other testing AVX512 > > code generation with an explicit -march=cascadelake. 
> > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline? > > > > > > 2024-05-09 Roger Sayle > > > > gcc/ChangeLog > > * config/i386/i386-expand.cc (ix86_expand_vecop_qihi_partial): > > Don't attempt ix86_expand_vec_shift_qihi_constant on AVX512. > > > > gcc/testsuite/ChangeLog > > * gcc.target/i386/vect-shiftv4qi.c: Specify -mno-avx512vl. > > * gcc.target/i386/vect-shiftv8qi.c: Likewise. > > * gcc.target/i386/vect-shiftv4qi-2.c: New test case. > > * gcc.target/i386/vect-shiftv8qi-2.c: Likewise. > > > > > > Thanks in advance, > > Roger > > -- > > > -- > BR, > Hongtao
[x86 PATCH] Improve V[48]QI shifts on AVX512
The following one-line patch improves the code generated for V8QI and V4QI shifts when AVX512BW and AVX512VL functionality is available. For the testcase (from gcc.target/i386/vect-shiftv8qi.c): typedef signed char v8qi __attribute__ ((__vector_size__ (8))); v8qi foo (v8qi x) { return x >> 5; } GCC with -O2 -march=cascadelake currently generates: foo: movl $67372036, %eax vpsraw $5, %xmm0, %xmm2 vpbroadcastd %eax, %xmm1 movl $117901063, %eax vpbroadcastd %eax, %xmm3 vmovdqa %xmm1, %xmm0 vmovdqa %xmm3, -24(%rsp) vpternlogd $120, -24(%rsp), %xmm2, %xmm0 vpsubb %xmm1, %xmm0, %xmm0 ret with this patch we now generate the much improved: foo: vpmovsxbw %xmm0, %xmm0 vpsraw $5, %xmm0, %xmm0 vpmovwb %xmm0, %xmm0 ret This patch also fixes the FAILs of gcc.target/i386/vect-shiftv[48]qi.c when run with the additional -march=cascadelake flag, by splitting these tests into two; one form testing code generation with -msse2 (and -mno-avx512vl) as originally intended, and the other testing AVX512 code generation with an explicit -march=cascadelake. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-05-09 Roger Sayle gcc/ChangeLog * config/i386/i386-expand.cc (ix86_expand_vecop_qihi_partial): Don't attempt ix86_expand_vec_shift_qihi_constant on AVX512. gcc/testsuite/ChangeLog * gcc.target/i386/vect-shiftv4qi.c: Specify -mno-avx512vl. * gcc.target/i386/vect-shiftv8qi.c: Likewise. * gcc.target/i386/vect-shiftv4qi-2.c: New test case. * gcc.target/i386/vect-shiftv8qi-2.c: Likewise. 
Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index a613291..8eb31b2 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -24212,6 +24212,8 @@ ix86_expand_vecop_qihi_partial (enum rtx_code code, rtx dest, rtx op1, rtx op2) if (CONST_INT_P (op2) && (code == ASHIFT || code == LSHIFTRT || code == ASHIFTRT) + /* With AVX512 its cheaper to do vpmovsxbw/op/vpmovwb. */ + && !(TARGET_AVX512BW && TARGET_AVX512VL && TARGET_SSE4_1) && ix86_expand_vec_shift_qihi_constant (code, qdest, qop1, qop2)) { emit_move_insn (dest, gen_lowpart (qimode, qdest)); diff --git a/gcc/testsuite/gcc.target/i386/vect-shiftv4qi-2.c b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi-2.c new file mode 100644 index 000..abc1a27 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi-2.c @@ -0,0 +1,43 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=cascadelake" } */ + +#define N 4 + +typedef unsigned char __vu __attribute__ ((__vector_size__ (N))); +typedef signed char __vi __attribute__ ((__vector_size__ (N))); + +__vu sll (__vu a, int n) +{ + return a << n; +} + +__vu sll_c (__vu a) +{ + return a << 5; +} + +/* { dg-final { scan-assembler-times "vpsllw" 2 } } */ + +__vu srl (__vu a, int n) +{ + return a >> n; +} + +__vu srl_c (__vu a) +{ + return a >> 5; +} + +/* { dg-final { scan-assembler-times "vpsrlw" 2 } } */ + +__vi sra (__vi a, int n) +{ + return a >> n; +} + +__vi sra_c (__vi a) +{ + return a >> 5; +} + +/* { dg-final { scan-assembler-times "vpsraw" 2 } } */ diff --git a/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c index b7e45c2..9b52582 100644 --- a/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c +++ b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -msse2" } */ +/* { dg-options "-O2 -msse2 -mno-avx2 -mno-avx512vl" } */ #define N 4 diff --git 
a/gcc/testsuite/gcc.target/i386/vect-shiftv8qi-2.c b/gcc/testsuite/gcc.target/i386/vect-shiftv8qi-2.c new file mode 100644 index 000..52760f5 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/vect-shiftv8qi-2.c @@ -0,0 +1,43 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -march=cascadelake" } */ + +#define N 8 + +typedef unsigned char __vu __attribute__ ((__vector_size__ (N))); +typedef signed char __vi __attribute__ ((__vector_size__ (N))); + +__vu sll (__vu a, int n) +{ + return a << n; +} + +__vu sll_c (__vu a) +{ + return a << 5; +} + +/* { dg-final { scan-assembler-times "vpsllw" 2 } } */ + +__vu srl (__vu a, int n) +{ + return a >> n; +} + +__vu srl_c (__vu a) +{ + return a >> 5; +} + +/* { dg-final { scan-assembler-times "vpsrlw" 2 } } */ + +__vi sra (__vi a, int n) +{ + return a >> n; +} + +__vi sra_c (__vi a) +{ + return a >> 5; +} + +/* { dg-final { scan-assembler-times "vpsraw" 2 }
[gcc r15-352] Constant fold {-1,-1} << 1 in simplify-rtx.cc
https://gcc.gnu.org/g:f2449b55fb2d32fc4200667ba79847db31f6530d
commit r15-352-gf2449b55fb2d32fc4200667ba79847db31f6530d
Author: Roger Sayle
Date:   Thu May 9 22:45:54 2024 +0100

    Constant fold {-1,-1} << 1 in simplify-rtx.cc

This patch addresses a missed optimization opportunity in the RTL
optimization passes.  The function simplify_const_binary_operation
will constant fold binary operators with two CONST_INT operands,
and those with two CONST_VECTOR operands, but is missing compile-time
evaluation of binary operators with a CONST_VECTOR and a CONST_INT,
such as vector shifts and rotates.

The first version of this patch didn't contain a switch statement to
explicitly check for valid binary opcodes, which bootstrapped and
regression tested fine, but my paranoia has got the better of me,
so this version now checks that VEC_SELECT or some funky (future)
rtx_code doesn't cause problems.

2024-05-09  Roger Sayle

gcc/ChangeLog
	* simplify-rtx.cc (simplify_const_binary_operation): Constant
	fold binary operations where the LHS is CONST_VECTOR and the
	RHS is CONST_INT (or CONST_DOUBLE) such as vector shifts.
Diff: --- gcc/simplify-rtx.cc | 54 + 1 file changed, 54 insertions(+) diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index dceaa1ca..53f54d1d3928 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -5021,6 +5021,60 @@ simplify_const_binary_operation (enum rtx_code code, machine_mode mode, return gen_rtx_CONST_VECTOR (mode, v); } + if (VECTOR_MODE_P (mode) + && GET_CODE (op0) == CONST_VECTOR + && (CONST_SCALAR_INT_P (op1) || CONST_DOUBLE_AS_FLOAT_P (op1)) + && (CONST_VECTOR_DUPLICATE_P (op0) + || CONST_VECTOR_NUNITS (op0).is_constant ())) +{ + switch (code) + { + case PLUS: + case MINUS: + case MULT: + case DIV: + case MOD: + case UDIV: + case UMOD: + case AND: + case IOR: + case XOR: + case SMIN: + case SMAX: + case UMIN: + case UMAX: + case LSHIFTRT: + case ASHIFTRT: + case ASHIFT: + case ROTATE: + case ROTATERT: + case SS_PLUS: + case US_PLUS: + case SS_MINUS: + case US_MINUS: + case SS_ASHIFT: + case US_ASHIFT: + case COPYSIGN: + break; + default: + return NULL_RTX; + } + + unsigned int npatterns = (CONST_VECTOR_DUPLICATE_P (op0) + ? CONST_VECTOR_NPATTERNS (op0) + : CONST_VECTOR_NUNITS (op0).to_constant ()); + rtx_vector_builder builder (mode, npatterns, 1); + for (unsigned i = 0; i < npatterns; i++) + { + rtx x = simplify_binary_operation (code, GET_MODE_INNER (mode), +CONST_VECTOR_ELT (op0, i), op1); + if (!x || !valid_for_const_vector_p (mode, x)) + return 0; + builder.quick_push (x); + } + return builder.build (); +} + if (SCALAR_FLOAT_MODE_P (mode) && CONST_DOUBLE_AS_FLOAT_P (op0) && CONST_DOUBLE_AS_FLOAT_P (op1)
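The new code folds a CONST_VECTOR op CONST_INT by simplifying each lane against the scalar operand and rebuilding the vector. A self-contained model of that per-lane loop (the enum and function names are mine; GCC's actual rtx machinery is more general):

```c
#include <stdint.h>

/* Model of the new simplify_const_binary_operation loop: fold each lane
   of a constant vector against a constant scalar shift count.  */
enum vop { V_ASHIFT, V_LSHIFTRT, V_ASHIFTRT };

static int32_t
fold_lane (enum vop code, int32_t a, unsigned b)
{
  switch (code)
    {
    case V_ASHIFT:   return (int32_t) ((uint32_t) a << b);
    case V_LSHIFTRT: return (int32_t) ((uint32_t) a >> b);
    case V_ASHIFTRT: return a >> b;  /* GCC targets shift arithmetically.  */
    }
  return 0;
}

static void
fold_vector (enum vop code, const int32_t *v, unsigned nunits,
             unsigned b, int32_t *out)
{
  /* Like the patch, apply the scalar simplification lane by lane.  */
  for (unsigned i = 0; i < nunits; i++)
    out[i] = fold_lane (code, v[i], b);
}
```

The commit title's example falls out directly: folding {-1,-1} << 1 lane-wise yields {-2,-2} at compile time instead of emitting a vector shift.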
[gcc r15-222] PR target/106060: Improved SSE vector constant materialization on x86.
https://gcc.gnu.org/g:79649a5dcd81bc05c0ba591068c9075de43bd417
commit r15-222-g79649a5dcd81bc05c0ba591068c9075de43bd417
Author: Roger Sayle
Date:   Tue May 7 07:14:40 2024 +0100

    PR target/106060: Improved SSE vector constant materialization on x86.

This patch resolves PR target/106060 by providing efficient methods for
materializing/synthesizing special "vector" constants on x86.  Currently
there are three methods of materializing a vector constant: the most
general is to load a vector from the constant pool; secondly,
"duplicated" constants can be synthesized by moving an integer between
units and broadcasting (or shuffling) it; and finally the special cases
of the all-zeros vector and all-ones vector can be loaded via a single
SSE instruction.  This patch handles additional cases that can be
synthesized in two instructions: loading an all-ones vector followed by
one other SSE instruction.  Following my recent patch for PR
target/112992, there's conveniently a single place in i386-expand.cc
where these special cases can be handled.

Two examples are given in the original bugzilla PR for 106060.

__m256i should_be_cmpeq_abs () { return _mm256_set1_epi8 (1); }

is now generated (with -O3 -march=x86-64-v3) as:

	vpcmpeqd	%ymm0, %ymm0, %ymm0
	vpabsb	%ymm0, %ymm0
	ret

and

__m256i should_be_cmpeq_add () { return _mm256_set1_epi8 (-2); }

is now generated as:

	vpcmpeqd	%ymm0, %ymm0, %ymm0
	vpaddb	%ymm0, %ymm0, %ymm0
	ret

2024-05-07  Roger Sayle
	    Hongtao Liu

gcc/ChangeLog
	PR target/106060
	* config/i386/i386-expand.cc (enum ix86_vec_bcast_alg): New.
	(struct ix86_vec_bcast_map_simode_t): New type for table below.
	(ix86_vec_bcast_map_simode): Table of SImode constants that may
	be efficiently synthesized by a ix86_vec_bcast_alg method.
	(ix86_vec_bcast_map_simode_cmp): New comparator for bsearch.
	(ix86_vector_duplicate_simode_const): Efficiently synthesize
	V4SImode and V8SImode constants that duplicate special constants.
(ix86_vector_duplicate_value): Attempt to synthesize "special" vector constants using ix86_vector_duplicate_simode_const. * config/i386/i386.cc (ix86_rtx_costs) : ABS of a vector integer mode costs with a single SSE instruction. gcc/testsuite/ChangeLog PR target/106060 * gcc.target/i386/auto-init-8.c: Update test case. * gcc.target/i386/avx512fp16-13.c: Likewise. * gcc.target/i386/pr100865-9a.c: Likewise. * gcc.target/i386/pr101796-1.c: Likewise. * gcc.target/i386/pr106060-1.c: New test case. * gcc.target/i386/pr106060-2.c: Likewise. * gcc.target/i386/pr106060-3.c: Likewise. * gcc.target/i386/pr70314.c: Update test case. * gcc.target/i386/vect-shiftv4qi.c: Likewise. * gcc.target/i386/vect-shiftv8qi.c: Likewise. Diff: --- gcc/config/i386/i386-expand.cc | 364 - gcc/config/i386/i386.cc| 2 + gcc/testsuite/gcc.target/i386/auto-init-8.c| 2 +- gcc/testsuite/gcc.target/i386/avx512fp16-13.c | 3 - gcc/testsuite/gcc.target/i386/pr100865-9a.c| 2 +- gcc/testsuite/gcc.target/i386/pr101796-1.c | 6 +- gcc/testsuite/gcc.target/i386/pr106060-1.c | 12 + gcc/testsuite/gcc.target/i386/pr106060-2.c | 13 + gcc/testsuite/gcc.target/i386/pr106060-3.c | 14 + gcc/testsuite/gcc.target/i386/pr70314.c| 2 +- gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c | 2 +- gcc/testsuite/gcc.target/i386/vect-shiftv8qi.c | 2 +- 12 files changed, 411 insertions(+), 13 deletions(-) diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index 8bb8f21e686..a6132911e6a 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -15696,6 +15696,332 @@ s4fma_expand: gcc_unreachable (); } +/* See below where shifts are handled for explanation of this enum. 
 */
+enum ix86_vec_bcast_alg
+{
+  VEC_BCAST_PXOR,
+  VEC_BCAST_PCMPEQ,
+  VEC_BCAST_PABSB,
+  VEC_BCAST_PADDB,
+  VEC_BCAST_PSRLW,
+  VEC_BCAST_PSRLD,
+  VEC_BCAST_PSLLW,
+  VEC_BCAST_PSLLD
+};
+
+struct ix86_vec_bcast_map_simode_t
+{
+  unsigned int key;
+  enum ix86_vec_bcast_alg alg;
+  unsigned int arg;
+};
+
+/* This table must be kept sorted as values are looked-up using bsearch.  */
+static const ix86_vec_bcast_map_simode_t ix86_vec_bcast_map_simode[] = {
+  { 0x, VEC_BCAST_PXOR,    0 },
+  { 0x0001, VEC_BCAST_PSRLD,  31 },
+  { 0x0003, VEC_BCAST_PSRLD,  30 },
+  { 0x0007, VEC_BCAST_PSRLD,  29 },
+  { 0x000f, VEC_BCAST
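The table's idea can be checked with a scalar model: every entry starts from the all-ones value produced by vpcmpeqd and applies one more operation per lane. The helpers below are mine (one lane stands in for each vector element; the hex keys in the archived table above are truncated, so only the shift amounts are verified):

```c
#include <stdint.h>

/* Scalar models of the second instruction of each two-insn synthesis,
   applied to the all-ones lane produced by vpcmpeqd.  */
static uint32_t psrld (uint32_t x, unsigned n) { return x >> n; }
static uint32_t pslld (uint32_t x, unsigned n) { return x << n; }
static uint8_t  pabsb (int8_t x)  { return (uint8_t) (x < 0 ? -x : x); }
static uint8_t  paddb (uint8_t a, uint8_t b) { return (uint8_t) (a + b); }
```

For example, shifting the all-ones lane right by 31, 30 or 29 yields the broadcast constants 1, 3 and 7, and per-byte abs/add of all-ones yields set1_epi8(1) and set1_epi8(-2), matching the vpabsb/vpaddb sequences shown earlier.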
RE: [PATCH] PR middle-end/111701: signbit(x*x) vs -fsignaling-nans
> From: Richard Biener
> On Thu, May 2, 2024 at 11:34 AM Roger Sayle wrote:
> >
> > > From: Richard Biener
> > > On Fri, Apr 26, 2024 at 10:19 AM Roger Sayle wrote:
> > > >
> > > > This patch addresses PR middle-end/111701 where optimization of
> > > > signbit(x*x) using tree_nonnegative_p incorrectly eliminates a
> > > > floating point multiplication when the operands may potentially be
> > > > signaling NaNs.
> > > >
> > > > The above bug fix also provides a solution or work-around to the
> > > > tricky issue in PR middle-end/111701, that the results of IEEE
> > > > operations on NaNs are specified to return a NaN result, but fail
> > > > to (precisely) specify the exact NaN representation of this result.
> > > > Hence for the operation "-NaN*-NaN" different hardware
> > > > implementations (targets) return different results.  Ultimately
> > > > knowing what the resulting NaN "payload" of an operation is can
> > > > only be known by executing that operation at run-time, and I'd
> > > > suggest that GCC's -fsignaling-nans provides a mechanism for
> > > > handling code that uses NaN representations for
> > > > communication/signaling (which is a different but related concept
> > > > to IEEE's sNaN).
> > > >
> > > > One nice thing about this patch, which may or may not be a P2
> > > > regression fix, is that it only affects (improves) code compiled
> > > > with -fsignaling-nans so should be extremely safe even for this
> > > > point in stage 3.
> > > >
> > > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > > bootstrap and make -k check, both with and without
> > > > --target_board=unix{-m32} with no new failures.  Ok for mainline?
> > >
> > > Hmm, but the bugreports are not about sNaN but about the fact that
> > > the sign of the NaN produced by 0/0 or by -NaN*-NaN is not
> > > well-defined.  So I don't think this is the correct approach to fix
> > > this.
We'd > > > instead have to use tree_expr_maybe_nan_p () - and if NaN*NaN cannot > > > be -NaN (is that at least > > > specified?) then the RECURSE path should still work as well. > > > > If we ignore the bugzilla PR for now, can we agree that if x is a > > signaling NaN, that we shouldn't be eliminating x*x? i.e. that this > > patch fixes a real bug, but perhaps not (precisely) the one described in PR > middle-end/111701. > > This might or might not be covered by -fdelete-dead-exceptions - at least in > the > past we were OK with removing traps like for -ftrapv (-ftrapv makes signed > overflow no longer invoke undefined behavior) or when deleting loads that > might > trap (but those would invoke undefined behavior). > > I bet the C standard doesn't say anything about sNaNs or how an operation with > it has to behave in the abstract machine. We do document though that it > "disables optimizations that may change the number of exceptions visible with > signaling NaNs" which suggests that with -fsignaling-nans we have to preserve > all > such traps but I am very sure DCE will simply elide unused ops here (also for > other > FP operations with -ftrapping-math - but there we do not document that we > preserve all traps). > > With the patch the multiplication is only preserved because __builtin_signbit > still > uses it. A plain > > void foo(double x) > { >x*x; > } > > has the multiplication elided during gimplification already (even at -O0). void foo(double x) { double t = x*x; } when compiled with -fsignaling-nans -fexceptions -fnon-call-exceptions doesn't exhibit the above bug. Perhaps this short-coming of gimplification deserves its own Bugzilla PR? > So I don't think the patch is a meaningful improvement as to preserve > multiplications of sNaNs. > > Richard. 
> > Once the signaling NaN case is correctly handled, the use of
> > -fsignaling-nans can be used as a workaround for PR 111701, allowing
> > it to perhaps be reduced from a P2 to a P3 regression (or even not a
> > bug if the qNaN case is undefined behavior).
> > When I wrote this patch I was trying to help with GCC 14's stage 3.
> >
> > > > 2024-04-26  Roger Sayle
> > > >
> > > > gcc/ChangeLog
> > > > 	PR middle-end/111701
> > > > 	* fold-const.cc (tree_binary_nonnegative_warnv_p) <case MULT_EXPR>:
> > > > 	Split handling of floating point and integer types.  For equal
> > > > 	floating point operands, avoid optimization if the operand may be
> > > > 	a signaling NaN.
> > > >
> > > > gcc/testsuite/ChangeLog
> > > > 	PR middle-end/111701
> > > > 	* gcc.dg/pr111701-1.c: New test case.
> > > > 	* gcc.dg/pr111701-2.c: Likewise.
RE: [PATCH] PR middle-end/111701: signbit(x*x) vs -fsignaling-nans
> From: Richard Biener
> On Fri, Apr 26, 2024 at 10:19 AM Roger Sayle wrote:
> >
> > This patch addresses PR middle-end/111701 where optimization of
> > signbit(x*x) using tree_nonnegative_p incorrectly eliminates a
> > floating point multiplication when the operands may potentially be
> > signaling NaNs.
> >
> > The above bug fix also provides a solution or work-around to the
> > tricky issue in PR middle-end/111701, that the results of IEEE
> > operations on NaNs are specified to return a NaN result, but fail to
> > (precisely) specify the exact NaN representation of this result.
> > Hence for the operation "-NaN*-NaN" different hardware
> > implementations (targets) return different results.  Ultimately
> > knowing what the resulting NaN "payload" of an operation is can only
> > be known by executing that operation at run-time, and I'd suggest
> > that GCC's -fsignaling-nans provides a mechanism for handling code
> > that uses NaN representations for communication/signaling (which is a
> > different but related concept to IEEE's sNaN).
> >
> > One nice thing about this patch, which may or may not be a P2
> > regression fix, is that it only affects (improves) code compiled with
> > -fsignaling-nans so should be extremely safe even for this point in
> > stage 3.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures.  Ok for mainline?
>
> Hmm, but the bugreports are not about sNaN but about the fact that the
> sign of the NaN produced by 0/0 or by -NaN*-NaN is not well-defined.
> So I don't think this is the correct approach to fix this.  We'd
> instead have to use tree_expr_maybe_nan_p () - and if NaN*NaN cannot
> be -NaN (is that at least specified?) then the RECURSE path should
> still work as well.

If we ignore the bugzilla PR for now, can we agree that if x is a
signaling NaN, that we shouldn't be eliminating x*x?  i.e.
that this patch fixes a real bug, but perhaps not (precisely) the one
described in PR middle-end/111701.

Once the signaling NaN case is correctly handled, the use of
-fsignaling-nans can be used as a workaround for PR 111701, allowing it
to perhaps be reduced from a P2 to a P3 regression (or even not a bug
if the qNaN case is undefined behavior).  When I wrote this patch I was
trying to help with GCC 14's stage 3.

> > 2024-04-26  Roger Sayle
> >
> > gcc/ChangeLog
> > 	PR middle-end/111701
> > 	* fold-const.cc (tree_binary_nonnegative_warnv_p) <case MULT_EXPR>:
> > 	Split handling of floating point and integer types.  For equal
> > 	floating point operands, avoid optimization if the operand may be
> > 	a signaling NaN.
> >
> > gcc/testsuite/ChangeLog
> > 	PR middle-end/111701
> > 	* gcc.dg/pr111701-1.c: New test case.
> > 	* gcc.dg/pr111701-2.c: Likewise.
RE: [C PATCH] PR c/109618: ICE-after-error from error_mark_node.
> On Tue, Apr 30, 2024 at 10:23 AM Roger Sayle > wrote: > > Hi Richard, > > Thanks for looking into this. > > > > It’s not the call to size_binop_loc (for CEIL_DIV_EXPR) that's > > problematic, but the call to fold_convert_loc (loc, size_type_node, value) > > on line > 4009 of c-common.cc. > > At this point, value is (NOP_EXPR:sizetype (VAR_DECL:error_mark_node)). > > I see. Can we catch this when we build (NOP_EXPR:sizetype > (VAR_DECL:error_mark_node)) > and instead have it "build" error_mark_node? That's the tricky part. At the point the NOP_EXPR is built the VAR_DECL's type is valid. It's later when this variable gets redefined with a conflicting type that the shared VAR_DECL gets modified, setting its type to error_mark_node. Mutating this shared node, then potentially introduces error_operand_p at arbitrary places deep within an expression. Fortunately, we only have to worry about this in the unusual/exceptional case that seen_error() is true. > > Ultimately, it's the code in match.pd /* Handle cases of two > > conversions in a row. */ with the problematic line being (match.pd:4748): > > unsigned int inside_prec = element_precision (inside_type); > > > > Here inside_type is error_mark_node, and so tree type checking in > > element_precision throws an internal_error. > > > > There doesn’t seem to be a good way to fix this in element_precision, > > and it's complicated to reorganize the logic in match.pd's "with > > clause" inside the (ocvt (icvt@1 @0)), but perhaps a (ocvt > (icvt:non_error_type@1 @0))? > > > > The last place/opportunity the front-end could sanitize this operand > > before passing the dubious tree to the middle-end is > > c_sizeof_or_alignof_type (which alas doesn't appear in the backtrace due to > inlining). 
> > > > #5 0x0227b0e9 in internal_error ( > > gmsgid=gmsgid@entry=0x249c7b8 "tree check: expected class %qs, > > have %qs (%s) in %s, at %s:%d") at ../../gcc/gcc/diagnostic.cc:2232 > > #6 0x0081e32a in tree_class_check_failed (node=0x76c1ef30, > > cl=cl@entry=tcc_type, file=file@entry=0x2495f3f "../../gcc/gcc/tree.cc", > > line=line@entry=6795, function=function@entry=0x24961fe > "element_precision") > > at ../../gcc/gcc/tree.cc:9005 > > #7 0x0081ef4c in tree_class_check (__t=, > __class=tcc_type, > > __f=0x2495f3f "../../gcc/gcc/tree.cc", __l=6795, > > __g=0x24961fe "element_precision") at ../../gcc/gcc/tree.h:4067 > > #8 element_precision (type=, type@entry=0x76c1ef30) > > at ../../gcc/gcc/tree.cc:6795 > > #9 0x017f66a4 in generic_simplify_CONVERT_EXPR (loc=201632, > > code=, type=0x76c3e7e0, _p0=0x76dc95c0) > > at generic-match-6.cc:3386 > > #10 0x00c1b18c in fold_unary_loc (loc=201632, code=NOP_EXPR, > > type=0x76c3e7e0, op0=0x76dc95c0) at > > ../../gcc/gcc/fold-const.cc:9523 > > #11 0x00c1d94a in fold_build1_loc (loc=201632, code=NOP_EXPR, > > type=0x76c3e7e0, op0=0x76dc95c0) at > > ../../gcc/gcc/fold-const.cc:14165 > > #12 0x0094068c in c_expr_sizeof_expr (loc=loc@entry=201632, > expr=...) > > at ../../gcc/gcc/tree.h:3771 > > #13 0x0097f06c in c_parser_sizeof_expression (parser= out>) > > at ../../gcc/gcc/c/c-parser.cc:9932 > > > > > > I hope this explains what's happening. The size_binop_loc call is a > > bit of a red herring that returns the same tree it is given (as > > TYPE_PRECISION (char_type_node) == BITS_PER_UNIT), so it's the > > "TYPE_SIZE_UNIT (type)" which needs to be checked for the embedded > VAR_DECL with a TREE_TYPE of error_mark_node. > > > > As Andrew Pinski writes in comment #3, this one is trickier than average. > > > > A more comprehensive fix might be to write deep_error_operand_p which > > does more of a tree traversal checking error_operand_p within the > > unary and binary operators of an expression tree. 
> > > > Please let me know what you think/recommend. > > Best regards, > > Roger > > -- > > > > > -Original Message- > > > From: Richard Biener > > > Sent: 30 April 2024 08:38 > > > To: Roger Sayle > > > Cc: gcc-patches@gcc.gnu.org > > > Subject: Re: [C PATCH] PR c/109618: ICE-after-error from error_mark_node. > > > > > > On Tue, Apr 30, 2024 at 1:06 AM Roger Sayle > > > > > > wrote: > > > > > > > > > > > > This patch solves another ICE-a
RE: [C PATCH] PR c/109618: ICE-after-error from error_mark_node.
Hi Richard,
Thanks for looking into this.

It's not the call to size_binop_loc (for CEIL_DIV_EXPR) that's
problematic, but the call to fold_convert_loc (loc, size_type_node,
value) on line 4009 of c-common.cc.  At this point, value is
(NOP_EXPR:sizetype (VAR_DECL:error_mark_node)).

Ultimately, it's the code in match.pd /* Handle cases of two conversions
in a row.  */ with the problematic line being (match.pd:4748):

  unsigned int inside_prec = element_precision (inside_type);

Here inside_type is error_mark_node, and so tree type checking in
element_precision throws an internal_error.

There doesn't seem to be a good way to fix this in element_precision,
and it's complicated to reorganize the logic in match.pd's "with clause"
inside the (ocvt (icvt@1 @0)), but perhaps a
(ocvt (icvt:non_error_type@1 @0))?

The last place/opportunity the front-end could sanitize this operand
before passing the dubious tree to the middle-end is
c_sizeof_or_alignof_type (which alas doesn't appear in the backtrace
due to inlining).
#5  0x0227b0e9 in internal_error (
    gmsgid=gmsgid@entry=0x249c7b8 "tree check: expected class %qs, have %qs (%s) in %s, at %s:%d")
    at ../../gcc/gcc/diagnostic.cc:2232
#6  0x0081e32a in tree_class_check_failed (node=0x76c1ef30,
    cl=cl@entry=tcc_type, file=file@entry=0x2495f3f "../../gcc/gcc/tree.cc",
    line=line@entry=6795, function=function@entry=0x24961fe "element_precision")
    at ../../gcc/gcc/tree.cc:9005
#7  0x0081ef4c in tree_class_check (__t=, __class=tcc_type,
    __f=0x2495f3f "../../gcc/gcc/tree.cc", __l=6795,
    __g=0x24961fe "element_precision") at ../../gcc/gcc/tree.h:4067
#8  element_precision (type=, type@entry=0x76c1ef30)
    at ../../gcc/gcc/tree.cc:6795
#9  0x017f66a4 in generic_simplify_CONVERT_EXPR (loc=201632,
    code=, type=0x76c3e7e0, _p0=0x76dc95c0)
    at generic-match-6.cc:3386
#10 0x00c1b18c in fold_unary_loc (loc=201632, code=NOP_EXPR,
    type=0x76c3e7e0, op0=0x76dc95c0) at ../../gcc/gcc/fold-const.cc:9523
#11 0x00c1d94a in fold_build1_loc (loc=201632, code=NOP_EXPR,
    type=0x76c3e7e0, op0=0x76dc95c0) at ../../gcc/gcc/fold-const.cc:14165
#12 0x0094068c in c_expr_sizeof_expr (loc=loc@entry=201632, expr=...)
    at ../../gcc/gcc/tree.h:3771
#13 0x0097f06c in c_parser_sizeof_expression (parser=)
    at ../../gcc/gcc/c/c-parser.cc:9932

I hope this explains what's happening.  The size_binop_loc call is a bit
of a red herring that returns the same tree it is given (as
TYPE_PRECISION (char_type_node) == BITS_PER_UNIT), so it's the
"TYPE_SIZE_UNIT (type)" which needs to be checked for the embedded
VAR_DECL with a TREE_TYPE of error_mark_node.

As Andrew Pinski writes in comment #3, this one is trickier than
average.

A more comprehensive fix might be to write deep_error_operand_p which
does more of a tree traversal checking error_operand_p within the unary
and binary operators of an expression tree.

Please let me know what you think/recommend.
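The deep_error_operand_p idea sketched above can be modelled outside GCC with a toy tree type. This is purely illustrative: the struct and field names below are mine, standing in for GCC's tree, TREE_TYPE and TREE_OPERAND, and the real helper would of course use GCC's accessors:

```c
#include <stddef.h>

/* Toy model of an expression tree where a node (or its type) may have
   been poisoned with an error mark after a conflicting redeclaration.  */
struct toy_tree
{
  int is_error_mark;        /* models error_mark_node */
  struct toy_tree *type;    /* models TREE_TYPE */
  struct toy_tree *op[2];   /* models TREE_OPERAND for unary/binary ops */
};

/* Recursively check a node, its type, and its operands for poisoning,
   mirroring the proposed deep_error_operand_p.  */
static int
deep_error_operand_p (const struct toy_tree *t)
{
  if (t == NULL)
    return 0;
  if (t->is_error_mark || (t->type && t->type->is_error_mark))
    return 1;                       /* models error_operand_p (t) */
  return deep_error_operand_p (t->op[0])
         || deep_error_operand_p (t->op[1]);
}
```

This catches exactly the shape in the bug: a NOP_EXPR wrapping a VAR_DECL whose type was set to error_mark_node, which a shallow error_operand_p check on the outermost node misses.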
Best regards,
Roger
--

> -----Original Message-----
> From: Richard Biener
> Sent: 30 April 2024 08:38
> To: Roger Sayle
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [C PATCH] PR c/109618: ICE-after-error from error_mark_node.
>
> On Tue, Apr 30, 2024 at 1:06 AM Roger Sayle wrote:
> >
> > This patch solves another ICE-a
[C PATCH] PR c/109618: ICE-after-error from error_mark_node.
This patch solves another ICE-after-error problem in the C family
front-ends.  Upon a conflicting type redeclaration, the ambiguous type
is poisoned with an error_mark_node to indicate to the middle-end that
the type is suspect, but care has to be taken by the front-end to avoid
passing these malformed trees into the middle-end during error
recovery.  In this case, a var_decl with a poisoned type appears within
a sizeof() expression (wrapped in NOP_EXPR) which causes problems.

This revision of the patch tests seen_error() to avoid tree traversal
(STRIP_NOPs) in the most common case that an error hasn't occurred.
Both this version (and an earlier revision that didn't test seen_error)
have survived bootstrap and regression testing on x86_64-pc-linux-gnu.

As a consolation, this code also contains a minor performance
improvement, by avoiding trying to create (and folding away) a
CEIL_DIV_EXPR in the common case that "char" is a single byte.  The
current code relies on the middle-end's tree folding to recognize that
CEIL_DIV_EXPR of integer_one_node is a no-op, that can be optimized
away.

Ok for mainline?

2024-04-30  Roger Sayle

gcc/c-family/ChangeLog
	PR c/109618
	* c-common.cc (c_sizeof_or_alignof_type): If seen_error() check
	whether value is (a VAR_DECL) of type error_mark_node, or a
	NOP_EXPR thereof.  Avoid folding CEIL_DIV_EXPR for the common
	case where char_type is a single byte.

gcc/testsuite/ChangeLog
	PR c/109618
	* gcc.dg/pr109618.c: New test case.

Thanks in advance,
Roger
--

diff --git a/gcc/c-family/c-common.cc b/gcc/c-family/c-common.cc
index 6fa8243..be8ff09 100644
--- a/gcc/c-family/c-common.cc
+++ b/gcc/c-family/c-common.cc
@@ -3993,10 +3993,31 @@ c_sizeof_or_alignof_type (location_t loc,
   else
     {
       if (is_sizeof)
-	/* Convert in case a char is more than one unit.
 */
-	value = size_binop_loc (loc, CEIL_DIV_EXPR, TYPE_SIZE_UNIT (type),
-				size_int (TYPE_PRECISION (char_type_node)
-					  / BITS_PER_UNIT));
+	{
+	  value = TYPE_SIZE_UNIT (type);
+
+	  /* PR 109618: Check for erroneous types, stripping NOPs.  */
+	  if (seen_error ())
+	    {
+	      tree tmp = value;
+	      while (CONVERT_EXPR_P (tmp)
+		     || TREE_CODE (tmp) == NON_LVALUE_EXPR)
+		{
+		  if (TREE_TYPE (tmp) == error_mark_node)
+		    return error_mark_node;
+		  tmp = TREE_OPERAND (tmp, 0);
+		}
+	      if (tmp == error_mark_node
+		  || TREE_TYPE (tmp) == error_mark_node)
+		return error_mark_node;
+	    }
+
+	  /* Convert in case a char is more than one unit.  */
+	  if (TYPE_PRECISION (char_type_node) != BITS_PER_UNIT)
+	    value = size_binop_loc (loc, CEIL_DIV_EXPR, value,
+				    size_int (TYPE_PRECISION (char_type_node)
+					      / BITS_PER_UNIT));
+	}
       else if (min_alignof)
 	value = size_int (min_align_of_type (type));
       else
diff --git a/gcc/testsuite/gcc.dg/pr109618.c b/gcc/testsuite/gcc.dg/pr109618.c
new file mode 100644
index 000..f240907
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr109618.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O0" } */
+int foo()
+{
+  const unsigned int var_1 = 2;
+
+  char var_5[var_1];
+
+  int var_1[10]; /* { dg-error "conflicting type" } */
+
+  return sizeof(var_5);
+}
+
[PATCH] PR tree-opt/113673: Avoid load merging from potentially trapping additions.
This patch fixes PR tree-optimization/113673, a P2 ice-on-valid
regression caused by load merging of (ptr[0]<<8)+ptr[1] when -ftrapv
has been specified.  When the operator is | or ^ this is safe, but for
addition of signed integer types, a trap may be generated/required, so
merging this idiom into a single non-trapping instruction is
inappropriate, confusing the compiler by transforming a basic block
with an exception edge into one without.

One fix is to be more selective for PLUS_EXPR than for BIT_IOR_EXPR or
BIT_XOR_EXPR in gimple-ssa-store-merging.cc's find_bswap_or_nop_1
function.  An alternate solution might be to notice that in this idiom
the addition can't overflow, but that this detail wasn't apparent when
exception edges were added to the CFG.  In which case, it's safe to
remove (or mark for removal) the problematic exceptional edge.
Unfortunately updating the CFG is a part of the compiler that I'm less
familiar with.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?

2024-04-28  Roger Sayle

gcc/ChangeLog
	PR tree-optimization/113673
	* gimple-ssa-store-merging.cc (find_bswap_or_nop_1) <case PLUS_EXPR>:
	Don't perform load merging if a signed addition may trap.

gcc/testsuite/ChangeLog
	PR tree-optimization/113673
	* g++.dg/pr113673.C: New test case.

Thanks in advance,
Roger
--

diff --git a/gcc/gimple-ssa-store-merging.cc b/gcc/gimple-ssa-store-merging.cc
index cb0cb5f..41a1066 100644
--- a/gcc/gimple-ssa-store-merging.cc
+++ b/gcc/gimple-ssa-store-merging.cc
@@ -776,9 +776,16 @@ find_bswap_or_nop_1 (gimple *stmt, struct symbolic_number *n, int limit)
   switch (code)
     {
+    case PLUS_EXPR:
+      /* Don't perform load merging if this addition can trap.  */
+      if (cfun->can_throw_non_call_exceptions
+	  && INTEGRAL_TYPE_P (TREE_TYPE (rhs1))
+	  && TYPE_OVERFLOW_TRAPS (TREE_TYPE (rhs1)))
+	return NULL;
+      /* Fallthru.
 */
+
     case BIT_IOR_EXPR:
     case BIT_XOR_EXPR:
-    case PLUS_EXPR:
       source_stmt1 = find_bswap_or_nop_1 (rhs1_stmt, n, limit - 1);
       if (!source_stmt1)
diff --git a/gcc/testsuite/g++.dg/pr113673.C b/gcc/testsuite/g++.dg/pr113673.C
new file mode 100644
index 000..1148977
--- /dev/null
+++ b/gcc/testsuite/g++.dg/pr113673.C
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-Os -fnon-call-exceptions -ftrapv" } */
+
+struct s { ~s(); };
+void
+h (unsigned char *data, int c)
+{
+  s a1;
+  while (c)
+    {
+      int m = *data++ << 8;
+      m += *data++;
+    }
+}
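The idiom at the heart of the PR is a big-endian 16-bit load written as shift plus add. Because the shifted byte and the low byte occupy disjoint bits, PLUS, BIT_IOR and BIT_XOR all compute the same value here, which is why the merge is normally valid; only the signed PLUS can carry an exception edge under -ftrapv. A small self-contained illustration (function names are mine):

```c
#include <stdint.h>

/* Big-endian 16-bit load spelled three equivalent ways.  The promoted
   operands occupy disjoint bit ranges, so +, | and ^ agree; only the
   signed + form acquires an EH edge with -ftrapv -fnon-call-exceptions.  */
static uint16_t load_be16_add (const uint8_t *p)
{ return (uint16_t) ((p[0] << 8) + p[1]); }

static uint16_t load_be16_ior (const uint8_t *p)
{ return (uint16_t) ((p[0] << 8) | p[1]); }

static uint16_t load_be16_xor (const uint8_t *p)
{ return (uint16_t) ((p[0] << 8) ^ p[1]); }
```

The patch keeps merging the | and ^ spellings unconditionally and only suppresses the + spelling when the addition's type has trapping overflow.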
[PATCH] PR middle-end/111701: signbit(x*x) vs -fsignaling-nans
This patch addresses PR middle-end/111701 where optimization of signbit(x*x) using tree_nonnegative_p incorrectly eliminates a floating point multiplication when the operands may potentially be signaling NaNs. The above bug fix also provides a solution or work-around to the tricky issue in PR middle-end/111701, that the results of IEEE operations on NaNs are specified to return a NaN result, but fail to (precisely) specify the exact NaN representation of this result. Hence for the operation "-NaN*-NaN" different hardware implementations (targets) return different results. Ultimately knowing what the resulting NaN "payload" of an operation is can only be known by executing that operation at run-time, and I'd suggest that GCC's -fsignaling-nans provides a mechanism for handling code that uses NaN representations for communication/signaling (which is a different but related concept to IEEE's sNaN). One nice thing about this patch, which may or may not be a P2 regression fix, is that it only affects (improves) code compiled with -fsignaling-nans so should be extremely safe even for this point in stage 3. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-04-26 Roger Sayle gcc/ChangeLog PR middle-end/111701 * fold-const.cc (tree_binary_nonnegative_warnv_p) : Split handling of floating point and integer types. For equal floating point operands, avoid optimization if the operand may be a signaling NaN. gcc/testsuite/ChangeLog PR middle-end/111701 * gcc.dg/pr111701-1.c: New test case. * gcc.dg/pr111701-2.c: Likewise. 
Thanks in advance, Roger -- diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc index 7b26896..f7f174d 100644 --- a/gcc/fold-const.cc +++ b/gcc/fold-const.cc @@ -15076,16 +15076,27 @@ tree_binary_nonnegative_warnv_p (enum tree_code code, tree type, tree op0, break; case MULT_EXPR: - if (FLOAT_TYPE_P (type) || TYPE_OVERFLOW_UNDEFINED (type)) + if (FLOAT_TYPE_P (type)) { - /* x * x is always non-negative for floating point x -or without overflow. */ + /* x * x is non-negative for floating point x except +that -NaN*-NaN may return -NaN. PR middle-end/111701. */ + if (operand_equal_p (op0, op1, 0)) + { + if (!tree_expr_maybe_signaling_nan_p (op0) || RECURSE (op0)) + return true; + } + else if (RECURSE (op0) && RECURSE (op1)) + return true; + } + + if (ANY_INTEGRAL_TYPE_P (type) + && TYPE_OVERFLOW_UNDEFINED (type)) + { + /* x * x is always non-negative without overflow. */ if (operand_equal_p (op0, op1, 0) || (RECURSE (op0) && RECURSE (op1))) { - if (ANY_INTEGRAL_TYPE_P (type) - && TYPE_OVERFLOW_UNDEFINED (type)) - *strict_overflow_p = true; + *strict_overflow_p = true; return true; } } diff --git a/gcc/testsuite/gcc.dg/pr111701-1.c b/gcc/testsuite/gcc.dg/pr111701-1.c new file mode 100644 index 000..5cbfac2 --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr111701-1.c @@ -0,0 +1,14 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fsignaling-nans -fdump-tree-optimized" } */ + +int foo(double x) +{ +return __builtin_signbit(x*x); +} + +int bar(float x) +{ +return __builtin_signbit(x*x); +} + +/* { dg-final { scan-tree-dump-times " \\* " 2 "optimized" } } */ diff --git a/gcc/testsuite/gcc.dg/pr111701-2.c b/gcc/testsuite/gcc.dg/pr111701-2.c new file mode 100644 index 000..f79c7ba --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr111701-2.c @@ -0,0 +1,14 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-optimized" } */ + +int foo(double x) +{ +return __builtin_signbit(x*x); +} + +int bar(float x) +{ +return __builtin_signbit(x*x); +} + +/* { dg-final { scan-tree-dump-not " \\* 
" "optimized" } } */
[PATCH] PR target/114187: Fix ?Fmode SUBREG simplification in simplify_subreg.
This patch fixes PR target/114187, a typo/missed-optimization in simplify-rtx that's exposed by (my) changes to x86_64's parameter passing. The context is that construction of double word (TImode) values now uses the idiom:

(ior:TI (ashift:TI (zero_extend:TI (reg:DI x)) (const_int 64 [0x40]))
        (zero_extend:TI (reg:DI y)))

Extracting the DImode highpart and lowpart halves of this complex expression is supported by simplifications in simplify_subreg. The problem is that when the doubleword TImode value represents two DFmode fields, there isn't a direct simplification to extract the highpart or lowpart SUBREGs; instead GCC uses two steps: extract the DImode {high,low} part, and then cast the result back to a floating point mode, DFmode. The (buggy) code to do this is:

/* If the outer mode is not integral, try taking a subreg with the equivalent
   integer outer mode and then bitcasting the result.  Other simplifications
   rely on integer to integer subregs and we'd potentially miss out on
   optimizations otherwise.  */
if (known_gt (GET_MODE_SIZE (innermode), GET_MODE_SIZE (outermode))
    && SCALAR_INT_MODE_P (innermode)
    && !SCALAR_INT_MODE_P (outermode)
    && int_mode_for_size (GET_MODE_BITSIZE (outermode), 0).exists (&int_outermode))
  {
    rtx tem = simplify_subreg (int_outermode, op, innermode, byte);
    if (tem)
      return simplify_gen_subreg (outermode, tem, int_outermode, byte);
  }

The issue/mistake is that the second call, to simplify_gen_subreg, shouldn't use "byte" as the final argument; the offset has already been handled by the first call, to simplify_subreg, and this second call is just a type conversion from an integer mode to floating point (from DImode to DFmode). Interestingly, this mistake was already spotted by Richard Sandiford when the optimization was originally contributed in January 2023. https://gcc.gnu.org/pipermail/gcc-patches/2023-January/610920.html

>> Richard Sandiford writes:
>> Also, the final line should pass 0 rather than byte.
Unfortunately a miscommunication/misunderstanding in a later thread https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612898.html resulted in this correction being undone. Alas the lack of any test cases when the optimization was added/modified potentially contributed to this lapse. Using lowpart_subreg should avoid/reduce confusion in future. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-03-03 Roger Sayle gcc/ChangeLog PR target/114187 * simplify-rtx.cc (simplify_context::simplify_subreg): Call lowpart_subreg to perform type conversion, to avoid confusion over the offset to use in the call to simplify_reg_subreg. gcc/testsuite/ChangeLog PR target/114187 * g++.target/i386/pr114187.C: New test case. Thanks in advance, Roger -- diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index 36dd522..dceaa13 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -7846,7 +7846,7 @@ simplify_context::simplify_subreg (machine_mode outermode, rtx op, { rtx tem = simplify_subreg (int_outermode, op, innermode, byte); if (tem) - return simplify_gen_subreg (outermode, tem, int_outermode, byte); + return lowpart_subreg (outermode, tem, int_outermode); } /* If OP is a vector comparison and the subreg is not changing the diff --git a/gcc/testsuite/g++.target/i386/pr114187.C b/gcc/testsuite/g++.target/i386/pr114187.C new file mode 100644 index 000..69912a9 --- /dev/null +++ b/gcc/testsuite/g++.target/i386/pr114187.C @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +struct P2d { +double x, y; +}; + +double sumxy_p(P2d p) { +return p.x + p.y; +} + +/* { dg-final { scan-assembler-not "movq" } } */ +/* { dg-final { scan-assembler-not "xchg" } } */
[x86_64 PATCH] PR target/113690: Fix-up MULT REG_EQUAL notes in STV.
This patch fixes PR target/113690, an ICE-on-valid regression on x86_64 that exhibits with a specific combination of command line options. The cause is that x86's scalar-to-vector pass converts a chain of instructions from TImode to V1TImode, but fails to appropriately update the attached REG_EQUAL note. Given that multiplication isn't supported in V1TImode, the REG_NOTE handling code wasn't expecting to see a MULT. Easily solved with additional handling for other binary operators that may potentially (in future) have an immediate constant as the second operand that needs handling. For convenience, this code (re)factors the logic to convert a TImode constant into a V1TImode constant vector into a subroutine and reuses it. For the record, STV is actually doing something useful in this strange testcase, GCC with -O2 -fno-dce -fno-forward-propagate -fno-split-wide-types -funroll-loops generates: foo:movl$v, %eax pxor%xmm0, %xmm0 movaps %xmm0, 48(%rax) movaps %xmm0, (%rax) movaps %xmm0, 16(%rax) movaps %xmm0, 32(%rax) ret With the addition of -mno-stv (to disable the patched code) it gives: foo:movl$v, %eax movq$0, 48(%rax) movq$0, 56(%rax) movq$0, (%rax) movq$0, 8(%rax) movq$0, 16(%rax) movq$0, 24(%rax) movq$0, 32(%rax) movq$0, 40(%rax) ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-02-05 Roger Sayle gcc/ChangeLog PR target/113690 * config/i386/i386-features.cc (timode_convert_cst): New helper function to convert a TImode CONST_SCALAR_INT_P to a V1TImode CONST_VECTOR. (timode_scalar_chain::convert_op): Use timode_convert_cst. (timode_scalar_chain::convert_insn): If a REG_EQUAL note contains a binary operator where the second operand is an immediate integer constant, convert it to V1TImode using timode_convert_cst. Use timode_convert_cst. gcc/testsuite/ChangeLog PR target/113690 * gcc.target/i386/pr113690.c: New test case. 
Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc index 4020b27..90ada7d 100644 --- a/gcc/config/i386/i386-features.cc +++ b/gcc/config/i386/i386-features.cc @@ -1749,6 +1749,19 @@ timode_scalar_chain::fix_debug_reg_uses (rtx reg) } } +/* Helper function to convert immediate constant X to V1TImode. */ +static rtx +timode_convert_cst (rtx x) +{ + /* Prefer all ones vector in case of -1. */ + if (constm1_operand (x, TImode)) +return CONSTM1_RTX (V1TImode); + + rtx *v = XALLOCAVEC (rtx, 1); + v[0] = x; + return gen_rtx_CONST_VECTOR (V1TImode, gen_rtvec_v (1, v)); +} + /* Convert operand OP in INSN from TImode to V1TImode. */ void @@ -1775,18 +1788,8 @@ timode_scalar_chain::convert_op (rtx *op, rtx_insn *insn) } else if (CONST_SCALAR_INT_P (*op)) { - rtx vec_cst; rtx tmp = gen_reg_rtx (V1TImode); - - /* Prefer all ones vector in case of -1. */ - if (constm1_operand (*op, TImode)) - vec_cst = CONSTM1_RTX (V1TImode); - else - { - rtx *v = XALLOCAVEC (rtx, 1); - v[0] = *op; - vec_cst = gen_rtx_CONST_VECTOR (V1TImode, gen_rtvec_v (1, v)); - } + rtx vec_cst = timode_convert_cst (*op); if (!standard_sse_constant_p (vec_cst, V1TImode)) { @@ -1830,12 +1833,28 @@ timode_scalar_chain::convert_insn (rtx_insn *insn) tmp = find_reg_equal_equiv_note (insn); if (tmp) { - if (GET_MODE (XEXP (tmp, 0)) == TImode) - PUT_MODE (XEXP (tmp, 0), V1TImode); - else if (CONST_SCALAR_INT_P (XEXP (tmp, 0))) - XEXP (tmp, 0) - = gen_rtx_CONST_VECTOR (V1TImode, - gen_rtvec (1, XEXP (tmp, 0))); + rtx expr = XEXP (tmp, 0); + if (GET_MODE (expr) == TImode) + { + PUT_MODE (expr, V1TImode); + switch (GET_CODE (expr)) + { + case PLUS: + case MINUS: + case MULT: + case AND: + case IOR: + case XOR: + if (CONST_SCALAR_INT_P (XEXP (expr, 1))) + XEXP (expr, 1) = timode_convert_cst (XEXP (expr, 1)); + break; + + default: + break; + } + } + else if (CONST_SCALAR_INT_P (expr)) + XEXP (tmp, 0) = timode_convert_cst (expr); } } break; @@ -1876,7 +1895,7 @@ 
timode_scalar_chain::convert_insn (rtx_insn *insn
[tree-ssa PATCH] PR target/113560: Enhance is_widening_mult_rhs_p.
This patch resolves PR113560, a code quality regression from GCC12 affecting x86_64, by enhancing the middle-end's tree-ssa-math-opts.cc to recognize more instances of widening multiplications. The widening multiplication recognition code identifies cases like:

_1 = (unsigned __int128) x;
__res = _1 * 100;

but in the reported test case, the original input looks like:

_1 = (unsigned long long) x;
_2 = (unsigned __int128) _1;
__res = _2 * 100;

which gets optimized by constant folding during tree-ssa to:

_2 = x & 18446744073709551615;  // x & 0xffffffffffffffff
__res = _2 * 100;

where the BIT_AND_EXPR hides (has consumed) the extension operation. This reveals a more general deficiency (missed optimization opportunity) in widening multiplication recognition: both

__int128 foo(__int128 x, __int128 y) { return (x & 1000) * (y & 1000); }

and

unsigned __int128 bar(unsigned __int128 x, unsigned __int128 y) { return (x >> 80) * (y >> 80); }

should also be recognized as widening multiplications. Hence, rather than test explicitly for BIT_AND_EXPR (as in the first version of this patch), the more general solution is to make use of range information, as provided by tree_non_zero_bits.
As a demonstration of the observed improvements, function foo above currently with -O2 compiles on x86_64 to: foo:movq%rdi, %rsi movq%rdx, %r8 xorl%edi, %edi xorl%r9d, %r9d andl$1000, %esi andl$1000, %r8d movq%rdi, %rcx movq%r9, %rdx imulq %rsi, %rdx movq%rsi, %rax imulq %r8, %rcx addq%rdx, %rcx mulq%r8 addq%rdx, %rcx movq%rcx, %rdx ret with this patch, GCC recognizes the *w and instead generates: foo:movq%rdi, %rsi movq%rdx, %r8 andl$1000, %esi andl$1000, %r8d movq%rsi, %rax imulq %r8 ret which is perhaps easier to understand at the tree-level where __int128 foo (__int128 x, __int128 y) { __int128 _1; __int128 _2; __int128 _5; [local count: 1073741824]: _1 = x_3(D) & 1000; _2 = y_4(D) & 1000; _5 = _1 * _2; return _5; } gets transformed to: __int128 foo (__int128 x, __int128 y) { __int128 _1; __int128 _2; __int128 _5; signed long _7; signed long _8; [local count: 1073741824]: _1 = x_3(D) & 1000; _2 = y_4(D) & 1000; _7 = (signed long) _1; _8 = (signed long) _2; _5 = _7 w* _8; return _5; } This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-01-30 Roger Sayle gcc/ChangeLog PR target/113560 * tree-ssa-math-opts.cc (is_widening_mult_rhs_p): Use range information via tree_non_zero_bits to check if this operand is suitably extended for a widening (or highpart) multiplication. (convert_mult_to_widen): Insert explicit casts if the RHS or LHS isn't already of the claimed type. gcc/testsuite/ChangeLog PR target/113560 * g++.target/i386/pr113560.C: New test case. * gcc.target/i386/pr113560.c: Likewise. Thanks in advance, Roger -- diff --git a/gcc/testsuite/g++.target/i386/pr113560.C b/gcc/testsuite/g++.target/i386/pr113560.C new file mode 100644 index 000..179b68f --- /dev/null +++ b/gcc/testsuite/g++.target/i386/pr113560.C @@ -0,0 +1,19 @@ +/* { dg-do compile { target { ! 
ia32 } } } */ +/* { dg-options "-Ofast -std=c++23 -march=znver4" } */ + +#include +auto f(char *buf, unsigned long long in) noexcept +{ +unsigned long long hi{}; +auto lo{_mulx_u64(in, 0x2af31dc462ull, )}; +lo = _mulx_u64(lo, 100, ); +__builtin_memcpy(buf + 2, , 2); +return buf + 10; +} + +/* { dg-final { scan-assembler-times "mulx" 1 } } */ +/* { dg-final { scan-assembler-times "mulq" 1 } } */ +/* { dg-final { scan-assembler-not "addq" } } */ +/* { dg-final { scan-assembler-not "adcq" } } */ +/* { dg-final { scan-assembler-not "salq" } } */ +/* { dg-final { scan-assembler-not "shldq" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr113560.c b/gcc/testsuite/gcc.target/i386/pr113560.c new file mode 100644 index 000..ac2e01a --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr113560.c @@ -0,0 +1,17 @@ +/* { dg-do compile { target int128 } } */ +/* { dg-options "-O2" } */ + +unsigned __int128 foo(unsigned __int128 x, unsigned __int128 y) +{ + return (x & 1000) * (y & 1000); +} + +__int128 bar(__int128 x, __int128 y) +{ + return (x & 1000) * (y & 1000); +} + +/* { dg-final { scan-assembler-times "\tmulq" 1 } } */ +/* { dg-final { scan-assembler-times "\timulq" 1 } } */ +/* { dg-final { scan-assembler-not
[libatomic PATCH] PR other/113336: Fix libatomic testsuite regressions on ARM.
This patch is a revised version of the fix for PR other/113336. This patch has been tested on arm-linux-gnueabihf with --with-arch=armv6 with make bootstrap and make -k check where it fixes all of the FAILs in libatomic. Ok for mainline? 2024-01-28 Roger Sayle Victor Do Nascimento libatomic/ChangeLog PR other/113336 * Makefile.am: Build tas_1_2_.o on ARCH_ARM_LINUX * Makefile.in: Regenerate. Thanks in advance. Roger -- diff --git a/libatomic/Makefile.am b/libatomic/Makefile.am index cfad90124f9..eb04fa2fc60 100644 --- a/libatomic/Makefile.am +++ b/libatomic/Makefile.am @@ -139,6 +139,7 @@ if ARCH_ARM_LINUX IFUNC_OPTIONS = -march=armv7-a+fp -DHAVE_KERNEL64 libatomic_la_LIBADD += $(foreach s,$(SIZES),$(addsuffix _$(s)_1_.lo,$(SIZEOBJS))) libatomic_la_LIBADD += $(addsuffix _8_2_.lo,$(SIZEOBJS)) +libatomic_la_LIBADD += tas_1_2_.lo endif if ARCH_I386 IFUNC_OPTIONS = -march=i586
[middle-end PATCH] Constant fold {-1,-1} << 1 in simplify-rtx.cc
This patch addresses a missed optimization opportunity in the RTL optimization passes. The function simplify_const_binary_operation will constant fold binary operators with two CONST_INT operands, and those with two CONST_VECTOR operands, but is missing compile-time evaluation of binary operators with a CONST_VECTOR and a CONST_INT, such as vector shifts and rotates. My first version of this patch didn't contain a switch statement to explicitly check for valid binary opcodes, which bootstrapped and regression tested fine, but paranoia has got the better of me, so this version now checks that VEC_SELECT or some funky (future) rtx_code doesn't cause problems. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline (in stage 1)?

2024-01-26  Roger Sayle

gcc/ChangeLog
* simplify-rtx.cc (simplify_const_binary_operation): Constant
fold binary operations where the LHS is CONST_VECTOR and the
RHS is CONST_INT (or CONST_DOUBLE) such as vector shifts.
Thanks in advance, Roger -- diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index c7215cf..2e2809a 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -5021,6 +5021,60 @@ simplify_const_binary_operation (enum rtx_code code, machine_mode mode, return gen_rtx_CONST_VECTOR (mode, v); } + if (VECTOR_MODE_P (mode) + && GET_CODE (op0) == CONST_VECTOR + && (CONST_SCALAR_INT_P (op1) || CONST_DOUBLE_AS_FLOAT_P (op1)) + && (CONST_VECTOR_DUPLICATE_P (op0) + || CONST_VECTOR_NUNITS (op0).is_constant ())) +{ + switch (code) + { + case PLUS: + case MINUS: + case MULT: + case DIV: + case MOD: + case UDIV: + case UMOD: + case AND: + case IOR: + case XOR: + case SMIN: + case SMAX: + case UMIN: + case UMAX: + case LSHIFTRT: + case ASHIFTRT: + case ASHIFT: + case ROTATE: + case ROTATERT: + case SS_PLUS: + case US_PLUS: + case SS_MINUS: + case US_MINUS: + case SS_ASHIFT: + case US_ASHIFT: + case COPYSIGN: + break; + default: + return NULL_RTX; + } + + unsigned int npatterns = (CONST_VECTOR_DUPLICATE_P (op0) + ? CONST_VECTOR_NPATTERNS (op0) + : CONST_VECTOR_NUNITS (op0).to_constant ()); + rtx_vector_builder builder (mode, npatterns, 1); + for (unsigned i = 0; i < npatterns; i++) + { + rtx x = simplify_binary_operation (code, GET_MODE_INNER (mode), +CONST_VECTOR_ELT (op0, i), op1); + if (!x || !valid_for_const_vector_p (mode, x)) + return 0; + builder.quick_push (x); + } + return builder.build (); +} + if (SCALAR_FLOAT_MODE_P (mode) && CONST_DOUBLE_AS_FLOAT_P (op0) && CONST_DOUBLE_AS_FLOAT_P (op1)
RE: [x86 PATCH] PR target/106060: Improved SSE vector constant materialization.
Hi Hongtao, Many thanks for the review. Here's a revised version of my patch that addresses (most of) the issues you've raised. Firstly, the handling of zero and all_ones in this function is mostly for completeness/documentation; these standard_sse_constant_p values are (currently/normally) handled elsewhere. But I have added an "n_var == 0" optimization to ix86_expand_vector_init. As you've suggested, I've added explicit TARGET_SSE2 tests where required, and for consistency I've also added support for AVX512's V16SImode. As you've predicted, the eventual goal is to move this after combine (or reload) using define_insn_and_split, but that requires a significant restructuring that should be done in steps. This also interacts with a similar planned reorganization of TImode constant handling. If all 128-bit (vector) constants are acceptable before combine, then STV has the freedom to choose V1TImode (and this broadcast functionality) to implement TImode operations on immediate constants. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline (in stage 1)?

2024-01-25  Roger Sayle
            Hongtao Liu

gcc/ChangeLog
PR target/106060
* config/i386/i386-expand.cc (enum ix86_vec_bcast_alg): New.
(struct ix86_vec_bcast_map_simode_t): New type for table below.
(ix86_vec_bcast_map_simode): Table of SImode constants that may
be efficiently synthesized by a ix86_vec_bcast_alg method.
(ix86_vec_bcast_map_simode_cmp): New comparator for bsearch.
(ix86_vector_duplicate_simode_const): Efficiently synthesize
V4SImode and V8SImode constants that duplicate special constants.
(ix86_vector_duplicate_value): Attempt to synthesize "special"
vector constants using ix86_vector_duplicate_simode_const.
* config/i386/i386.cc (ix86_rtx_costs) : ABS of a vector
integer mode costs with a single SSE instruction.
gcc/testsuite/ChangeLog PR target/106060 * gcc.target/i386/auto-init-8.c: Update test case. * gcc.target/i386/avx512fp16-3.c: Likewise. * gcc.target/i386/pr100865-9a.c: Likewise. * gcc.target/i386/pr101796-1.c: Likewise. * gcc.target/i386/pr106060-1.c: New test case. * gcc.target/i386/pr106060-2.c: Likewise. * gcc.target/i386/pr106060-3.c: Likewise. * gcc.target/i386/pr70314.c: Update test case. * gcc.target/i386/vect-shiftv4qi.c: Likewise. * gcc.target/i386/vect-shiftv8qi.c: Likewise. Roger -- > -Original Message- > From: Hongtao Liu > Sent: 17 January 2024 03:13 > To: Roger Sayle > Cc: gcc-patches@gcc.gnu.org; Uros Bizjak > Subject: Re: [x86 PATCH] PR target/106060: Improved SSE vector constant > materialization. > > On Wed, Jan 17, 2024 at 5:59 AM Roger Sayle > wrote: > > > > > > I thought I'd just missed the bug fixing season of stage3, but there > > appears to a little latitude in early stage4 (for vector patches), so > > I'll post this now. > > > > This patch resolves PR target/106060 by providing efficient methods > > for materializing/synthesizing special "vector" constants on x86. > > Currently there are three methods of materializing a vector constant; > > the most general is to load a vector from the constant pool, secondly > "duplicated" > > constants can be synthesized by moving an integer between units and > > broadcasting (or shuffling it), and finally the special cases of the > > all-zeros vector and all-ones vectors can be loaded via a single SSE > > instruction. This patch handles additional cases that can be synthesized > > in two instructions, loading an all-ones vector followed by another > > SSE instruction. Following my recent patch for PR target/112992, > > there's conveniently a single place in i386-expand.cc where these > > special cases can be handled. > > > > Two examples are given in the original bugzilla PR for 106060. 
> > > > __m256i > > should_be_cmpeq_abs () > > { > > return _mm256_set1_epi8 (1); > > } > > > > is now generated (with -O3 -march=x86-64-v3) as: > > > > vpcmpeqd%ymm0, %ymm0, %ymm0 > > vpabsb %ymm0, %ymm0 > > ret > > > > and > > > > __m256i > > should_be_cmpeq_add () > > { > > return _mm256_set1_epi8 (-2); > > } > > > > is now generated as: > > > > vpcmpeqd%ymm0, %ymm0, %ymm0 > > vpaddb %ymm0, %ymm0, %ymm0 > > ret > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with
RE: [middle-end PATCH] Prefer PLUS over IOR in RTL expansion of multi-word shifts/rotates.
Hi Richard, Thanks for the speedy review. I completely agree this patch can wait for stage1, but it's related to some recent work Andrew Pinski has been doing in match.pd, so I thought I'd share it. Hypothetically, recognizing (x<<4)+(x>>60) as a rotation at the tree level might lead to a code quality regression, if RTL expansion doesn't know to lower it back to use PLUS on those targets with lea but without rotate.

> From: Richard Biener
> Sent: 19 January 2024 11:04
> On Thu, Jan 18, 2024 at 8:55 PM Roger Sayle wrote:
> >
> > This patch tweaks RTL expansion of multi-word shifts and rotates to
> > use PLUS rather than IOR for disjunctive operations. During expansion
> > of these operations, the middle-end creates RTL like (X<<C1)|(X>>C2)
> > where the constants C1 and C2 guarantee that bits don't overlap.
> > Hence the IOR can be performed by any any_or_plus operation, such as
> > IOR, XOR or PLUS; for word-size operations where carry chains aren't
> > an issue these should all be equally fast (single-cycle) instructions.
> > The benefit of this change is that targets with shift-and-add insns,
> > like x86's lea, can benefit from the LSHIFT-ADD form.
> > An example of a backend that benefits is ARC, which is demonstrated
> > by these two simple functions:
> >
> > unsigned long long foo(unsigned long long x) { return x<<2; }
> >
> > which with -O2 is currently compiled to:
> >
> > foo:    lsr     r2,r0,30
> >         asl_s   r1,r1,2
> >         asl_s   r0,r0,2
> >         j_s.d   [blink]
> >         or_s    r1,r1,r2
> >
> > with this patch becomes:
> >
> > foo:    lsr     r2,r0,30
> >         add2    r1,r2,r1
> >         j_s.d   [blink]
> >         asl_s   r0,r0,2
> >
> > unsigned long long bar(unsigned long long x) { return (x<<2)|(x>>62); }
> >
> > which with -O2 is currently compiled to 6 insns + return:
> >
> > bar:    lsr     r12,r0,30
> >         asl_s   r3,r1,2
> >         asl_s   r0,r0,2
> >         lsr_s   r1,r1,30
> >         or_s    r0,r0,r1
> >         j_s.d   [blink]
> >         or      r1,r12,r3
> >
> > with this patch becomes 4 insns + return:
> >
> > bar:    lsr     r3,r1,30
> >         lsr     r2,r0,30
> >         add2    r1,r2,r1
> >         j_s.d   [blink]
> >         add2    r0,r3,r0
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures. Ok for mainline?
>
> For expand_shift_1 you add
>
> +    where C is the bitsize of A.  If N cannot be zero,
> +    use PLUS instead of IOR.
>
> but I don't see a check ensuring this other than maybe CONST_INT_P (op1)
> suggesting that we never end up with const0_rtx here. OTOH why is N zero a
> problem and why is it not in the optabs.cc case where I don't see any such
> check (at least not obvious)?

Excellent question. A common mistake in writing a rotate function in C or C++ is to write something like (x>>n)|(x<<(64-n)) or (x<<n)|(x>>(64-n)), which invokes undefined behavior when n == 0. It's OK to recognize these as rotates (relying on the undefined behavior), but correct/portable code (and RTL) needs the correct idiom (x>>n)|(x<<((-n)&63)), which never invokes undefined behaviour. One interesting property of this idiom is that a shift by zero is then calculated as (x>>0)|(x<<0), which is x|x.
This should then reveal the problem: for all non-zero shift values the IOR can be replaced by PLUS, but for zero shifts, X|X isn't the same as X+X or X^X. This only applies to single word rotations, and not to multi-word shifts nor multi-word rotates, which explains why this test is only in one place. In theory, we could use ranger to check whether a rotate by a variable amount can ever be by zero bits, but the simplification used here is to continue using IOR for variable shifts, and PLUS for fixed/known shift values. The last remaining insight is that we only need to check for CONST_INT_P, as rotations/shifts by const0_rtx are handled earlier in this function (and eliminated by the tree optimizers), i.e. a rotation by a known constant is implicitly a rotation by a known non-zero constant. This is a little clearer if you read/cite more of the comment that was changed. Fortunately, this case is also well covered by the testsuite. I'd be happy to change the code to read:

(CONST_INT_P (op1) && op1 != const0_rtx) ? add_optab : ior_optab

but the test "if (op1 == const0_rtx)" already appears on line 2570 of expmed.cc.

> Since this doesn't seem to fix a regression it probably has to wait for
> stage1 to re-open.
>
> Thanks,
> Richard.
>
> > 2024-01-18  Roger Sayle
> >
> > gcc/ChangeLog
> > * expmed.cc (expand_shift_1): Use add_optab instead of ior_optab
> > to generate PLUS instead of IOR when unioning disjoint bitfields.
> > * optabs.cc (expand_subword_shift): Likewise.
> > (expand_binop): Likewise for double-word rotate.

Thanks again.
[middle-end PATCH] Prefer PLUS over IOR in RTL expansion of multi-word shifts/rotates.
This patch tweaks RTL expansion of multi-word shifts and rotates to use PLUS rather than IOR for disjunctive operations. During expansion of these operations, the middle-end creates RTL like (X<<C1)|(X>>C2) where the constants C1 and C2 guarantee that bits don't overlap. Hence the IOR can be performed by any any_or_plus operation, such as IOR, XOR or PLUS; for word-size operations where carry chains aren't an issue these should all be equally fast (single-cycle) instructions. The benefit of this change is that targets with shift-and-add insns, like x86's lea, can benefit from the LSHIFT-ADD form.

An example of a backend that benefits is ARC, which is demonstrated by these two simple functions:

unsigned long long foo(unsigned long long x) { return x<<2; }

which with -O2 is currently compiled to:

foo:    lsr     r2,r0,30
        asl_s   r1,r1,2
        asl_s   r0,r0,2
        j_s.d   [blink]
        or_s    r1,r1,r2

with this patch becomes:

foo:    lsr     r2,r0,30
        add2    r1,r2,r1
        j_s.d   [blink]
        asl_s   r0,r0,2

unsigned long long bar(unsigned long long x) { return (x<<2)|(x>>62); }

which with -O2 is currently compiled to 6 insns + return:

bar:    lsr     r12,r0,30
        asl_s   r3,r1,2
        asl_s   r0,r0,2
        lsr_s   r1,r1,30
        or_s    r0,r0,r1
        j_s.d   [blink]
        or      r1,r12,r3

with this patch becomes 4 insns + return:

bar:    lsr     r3,r1,30
        lsr     r2,r0,30
        add2    r1,r2,r1
        j_s.d   [blink]
        add2    r0,r3,r0

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline?

2024-01-18  Roger Sayle

gcc/ChangeLog
* expmed.cc (expand_shift_1): Use add_optab instead of ior_optab
to generate PLUS instead of IOR when unioning disjoint bitfields.
* optabs.cc (expand_subword_shift): Likewise.
(expand_binop): Likewise for double-word rotate.
Thanks in advance, Roger -- diff --git a/gcc/expmed.cc b/gcc/expmed.cc index 5916d6ed1bc..d1900f97f0c 100644 --- a/gcc/expmed.cc +++ b/gcc/expmed.cc @@ -2610,10 +2610,11 @@ expand_shift_1 (enum tree_code code, machine_mode mode, rtx shifted, else if (methods == OPTAB_LIB_WIDEN) { /* If we have been unable to open-code this by a rotation, -do it as the IOR of two shifts. I.e., to rotate A -by N bits, compute +do it as the IOR or PLUS of two shifts. I.e., to rotate +A by N bits, compute (A << N) | ((unsigned) A >> ((-N) & (C - 1))) -where C is the bitsize of A. +where C is the bitsize of A. If N cannot be zero, +use PLUS instead of IOR. It is theoretically possible that the target machine might not be able to perform either shift and hence we would @@ -2650,8 +2651,9 @@ expand_shift_1 (enum tree_code code, machine_mode mode, rtx shifted, temp1 = expand_shift_1 (left ? RSHIFT_EXPR : LSHIFT_EXPR, mode, shifted, other_amount, subtarget, 1); - return expand_binop (mode, ior_optab, temp, temp1, target, - unsignedp, methods); + return expand_binop (mode, + CONST_INT_P (op1) ? add_optab : ior_optab, + temp, temp1, target, unsignedp, methods); } temp = expand_binop (mode, diff --git a/gcc/optabs.cc b/gcc/optabs.cc index ce91f94ed43..dcd3e406719 100644 --- a/gcc/optabs.cc +++ b/gcc/optabs.cc @@ -566,8 +566,8 @@ expand_subword_shift (scalar_int_mode op1_mode, optab binoptab, if (tmp == 0) return false; - /* Now OR in the bits carried over from OUTOF_INPUT. */ - if (!force_expand_binop (word_mode, ior_optab, tmp, carries, + /* Now OR/PLUS in the bits carried over from OUTOF_INPUT. 
*/ + if (!force_expand_binop (word_mode, add_optab, tmp, carries, into_target, unsignedp, methods)) return false; } @@ -1937,7 +1937,7 @@ expand_binop (machine_mode mode, optab binoptab, rtx op0, rtx op1, NULL_RTX, unsignedp, next_methods); if (into_temp1 != 0 && into_temp2 != 0) - inter = expand_binop (word_mode, ior_optab, into_temp1, into_temp2, + inter = expand_binop (word_mode, add_optab, into_temp1, into_temp2, into_target, unsignedp, next_methods); else inter = 0; @@ -1953,7 +1953,7 @@ expand_binop (machine_mode mode, optab binoptab, rtx op0, rtx op1, NULL_RTX, unsignedp, next_methods); if (inter != 0 && outof_temp1 !=
[x86 PATCH] PR target/106060: Improved SSE vector constant materialization.
I thought I'd just missed the bug fixing season of stage3, but there appears to be a little latitude in early stage4 (for vector patches), so I'll post this now. This patch resolves PR target/106060 by providing efficient methods for materializing/synthesizing special "vector" constants on x86. Currently there are three methods of materializing a vector constant; the most general is to load a vector from the constant pool, secondly "duplicated" constants can be synthesized by moving an integer between units and broadcasting (or shuffling it), and finally the special cases of the all-zeros vector and all-ones vectors can be loaded via a single SSE instruction. This patch handles additional cases that can be synthesized in two instructions, loading an all-ones vector followed by another SSE instruction. Following my recent patch for PR target/112992, there's conveniently a single place in i386-expand.cc where these special cases can be handled. Two examples are given in the original bugzilla PR for 106060. __m256i should_be_cmpeq_abs () { return _mm256_set1_epi8 (1); } is now generated (with -O3 -march=x86-64-v3) as: vpcmpeqd%ymm0, %ymm0, %ymm0 vpabsb %ymm0, %ymm0 ret and __m256i should_be_cmpeq_add () { return _mm256_set1_epi8 (-2); } is now generated as: vpcmpeqd%ymm0, %ymm0, %ymm0 vpaddb %ymm0, %ymm0, %ymm0 ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-01-16 Roger Sayle gcc/ChangeLog PR target/106060 * config/i386/i386-expand.cc (enum ix86_vec_bcast_alg): New. (struct ix86_vec_bcast_map_simode_t): New type for table below. (ix86_vec_bcast_map_simode): Table of SImode constants that may be efficiently synthesized by an ix86_vec_bcast_alg method. (ix86_vec_bcast_map_simode_cmp): New comparator for bsearch. (ix86_vector_duplicate_simode_const): Efficiently synthesize V4SImode and V8SImode constants that duplicate special constants. 
(ix86_vector_duplicate_value): Attempt to synthesize "special" vector constants using ix86_vector_duplicate_simode_const. * config/i386/i386.cc (ix86_rtx_costs) : ABS of a vector integer mode costs with a single SSE instruction. gcc/testsuite/ChangeLog PR target/106060 * gcc.target/i386/auto-init-8.c: Update test case. * gcc.target/i386/avx512fp16-3.c: Likewise. * gcc.target/i386/pr100865-9a.c: Likewise. * gcc.target/i386/pr106060-1.c: New test case. * gcc.target/i386/pr106060-2.c: Likewise. * gcc.target/i386/pr106060-3.c: Likewise. * gcc.target/i386/pr70314-3.c: Update test case. * gcc.target/i386/vect-shiftv4qi.c: Likewise. * gcc.target/i386/vect-shiftv8qi.c: Likewise. Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index 52754e1..f8f8af6 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -15638,6 +15638,288 @@ s4fma_expand: gcc_unreachable (); } +/* See below where shifts are handled for explanation of this enum. */ +enum ix86_vec_bcast_alg +{ + VEC_BCAST_PXOR, + VEC_BCAST_PCMPEQ, + VEC_BCAST_PABSB, + VEC_BCAST_PADDB, + VEC_BCAST_PSRLW, + VEC_BCAST_PSRLD, + VEC_BCAST_PSLLW, + VEC_BCAST_PSLLD +}; + +struct ix86_vec_bcast_map_simode_t +{ + unsigned int key; + enum ix86_vec_bcast_alg alg; + unsigned int arg; +}; + +/* This table must be kept sorted as values are looked-up using bsearch. 
*/ +static const ix86_vec_bcast_map_simode_t ix86_vec_bcast_map_simode[] = { + { 0x00000000, VEC_BCAST_PXOR, 0 }, + { 0x00000001, VEC_BCAST_PSRLD, 31 }, + { 0x00000003, VEC_BCAST_PSRLD, 30 }, + { 0x00000007, VEC_BCAST_PSRLD, 29 }, + { 0x0000000f, VEC_BCAST_PSRLD, 28 }, + { 0x0000001f, VEC_BCAST_PSRLD, 27 }, + { 0x0000003f, VEC_BCAST_PSRLD, 26 }, + { 0x0000007f, VEC_BCAST_PSRLD, 25 }, + { 0x000000ff, VEC_BCAST_PSRLD, 24 }, + { 0x000001ff, VEC_BCAST_PSRLD, 23 }, + { 0x000003ff, VEC_BCAST_PSRLD, 22 }, + { 0x000007ff, VEC_BCAST_PSRLD, 21 }, + { 0x00000fff, VEC_BCAST_PSRLD, 20 }, + { 0x00001fff, VEC_BCAST_PSRLD, 19 }, + { 0x00003fff, VEC_BCAST_PSRLD, 18 }, + { 0x00007fff, VEC_BCAST_PSRLD, 17 }, + { 0x0000ffff, VEC_BCAST_PSRLD, 16 }, + { 0x00010001, VEC_BCAST_PSRLW, 15 }, + { 0x0001ffff, VEC_BCAST_PSRLD, 15 }, + { 0x00030003, VEC_BCAST_PSRLW, 14 }, + { 0x0003ffff, VEC_BCAST_PSRLD, 14 }, + { 0x00070007, VEC_BCAST_PSRLW, 13 }, + { 0x0007ffff, VEC_BCAST_PSRLD, 13 }, + { 0x000f000f, VEC_BCAST_PSRLW, 12 }, + { 0x000fffff, VEC_BCAST_PSRLD, 12 }, + { 0x001f001f, VEC_BCAST_PSRLW, 11 }, + { 0x001fffff, VEC_BCAST_PSRLD, 11 }, + { 0x003f003f, VEC_BCAST_PSRLW, 10 }, + { 0x003fffff, VEC_BCAST_PSRLD, 10 }, + { 0x
[PATCH] PR rtl-optimization/111267: Improved forward propagation.
This patch resolves PR rtl-optimization/111267 by improving RTL-level forward propagation. This x86_64 code quality regression was caused (exposed) by my changes to improve how x86's (TImode) argument passing is represented at the RTL-level (reducing the use of SUBREGs to catch more optimization opportunities in combine). The pitfall is that the more complex RTL representations expose a limitation in RTL's fwprop pass. At the heart of fwprop, in try_fwprop_subst_pattern, the logic can be summarized as three steps. Step 1 is a heuristic that rejects the propagation attempt if the expression is too complex, step 2 calls the backend's recog to see if the propagated/simplified instruction is recognizable/valid, and step 3 then calls src_cost to compare the rtx costs of the replacement vs. the original, and accepts the transformation if the final cost is the same or better. The logic error (or missed optimization opportunity) is that the step 1 heuristic that attempts to predict (second guess) the process is flawed. Ultimately the decision on whether to fwprop or not should depend solely on actual improvement, as measured by RTX costs. Hence the prototype fix in the bugzilla PR removes the heuristic of calling prop.profitable_p entirely, relying on the cost comparison in step 3. Unfortunately, things are a tiny bit more complicated. The cost comparison in fwprop uses the older set_src_cost API and not the newer (preferred) insn_cost API as currently used in combine. This means that the cost improvement comparisons are only done for single_set instructions (more complex PARALLELs etc. aren't supported). Hence we can only rely on skipping step 1 for that subset of instructions actually evaluated by step 3. The other subtlety is that to avoid potential infinite loops in fwprop we should only rely purely on rtx costs when the transformation is obviously an improvement. 
If the replacement has the same cost as the original, we can use the prop.profitable_p test to preserve the current behavior. Finally, to answer Richard Biener's remaining question about this approach: yes, there is an asymmetry between how patterns are handled and how REG_EQUAL notes are handled. For example, at the moment propagation into notes doesn't use rtx costs at all, and ultimately when fwprop is updated to use insn_cost, this (and recog) obviously isn't applicable to notes. There's no reason the logic need be identical between patterns and notes, and during stage4 we only need to update propagation into patterns to fix this P1 regression (notes and use of insn_cost can be done for GCC 15). For Jakub's reduced testcase: struct S { float a, b, c, d; }; int bar (struct S x, struct S y) { return x.b <= y.d && x.c >= y.a; } On x86_64-pc-linux-gnu with -O2 gcc currently generates: bar:movq%xmm2, %rdx movq%xmm3, %rax movq%xmm0, %rsi xchgq %rdx, %rax movq%rsi, %rcx movq%rax, %rsi movq%rdx, %rax shrq$32, %rcx shrq$32, %rax movd%ecx, %xmm4 movd%eax, %xmm0 comiss %xmm4, %xmm0 jb .L6 movd%esi, %xmm0 xorl%eax, %eax comiss %xmm0, %xmm1 setnb %al ret .L6:xorl%eax, %eax ret with this simple patch to fwprop, we now generate: bar:shufps $85, %xmm0, %xmm0 shufps $85, %xmm3, %xmm3 comiss %xmm0, %xmm3 jb .L6 xorl%eax, %eax comiss %xmm2, %xmm1 setnb %al ret .L6:xorl%eax, %eax ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Additionally, it also resolves the FAIL for gcc.target/i386/pr82580.c. Ok for mainline? 2024-01-16 Roger Sayle gcc/ChangeLog PR rtl-optimization/111267 * fwprop.cc (try_fwprop_subst_pattern): Only bail out early when !prop.profitable_p for instructions that are not single sets. When comparing costs, bail out if the cost is unchanged and !prop.profitable_p. 
gcc/testsuite/ChangeLog PR rtl-optimization/111267 * gcc.target/i386/pr111267.c: New test case. Thanks in advance (and to Jeff Law for his guidance/help), Roger -- diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc index 0c588f8..f06225a 100644 --- a/gcc/fwprop.cc +++ b/gcc/fwprop.cc @@ -449,7 +449,10 @@ try_fwprop_subst_pattern (obstack_watermark &attempt, insn_change &use_change, if (prop.num_replacements == 0) return false; - if (!prop.profitable_p ()) + if (!prop.profitable_p () + && (prop.changed_mem_p () + || use_insn->is_asm () + || !single_set (use_rtl))) { if (dump_file && (dump_flags & TDF_DETAILS)) fprintf (dump_file, "cannot propagate from insn %d into" @@ -481,7 +484,8 @@ try_fwprop_
[PATCH/RFC] Add --with-dwarf4 configure option.
This patch fixes three of the four unexpected failures that I'm seeing in the gcc testsuite on x86_64-pc-linux-gnu. The three FAILs are: FAIL: gcc.c-torture/execute/fprintf-2.c -O3 -g (test for excess errors) FAIL: gcc.c-torture/execute/printf-2.c -O3 -g (test for excess errors) FAIL: gcc.c-torture/execute/user-printf.c -O3 -g (test for excess errors) and are caused by the linker/toolchain (GNU ld 2.27 on RedHat 7) issuing a link-time warning: /usr/bin/ld: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information. This also explains why these c-torture tests only fail with -g. One solution might be to tweak/improve GCC's testsuite to ignore these warnings. However, ideally it should also be possible to configure gcc not to generate dwarf5 debugging information on systems that don't/can't support it. This patch supplements the current --with-dwarf2 configure option with the addition of a --with-dwarf4 option that adds a tm-dwarf4.h to $tm_file (using the same mechanism as --with-dwarf2) that changes/redefines DWARF_VERSION_DEFAULT to 4 (overriding the current default of 5). This patch has been tested on x86_64-pc-linux-gnu, with a full make bootstrap, both with and without --with-dwarf4. This fixes the three failures above, and causes no new failures outside of the gcc.dg/guality directory. Unfortunately, the guality testsuite contains a large number of tests that assume support for dwarf5 and don't (yet) check check_effective_target_dwarf5. Hopefully, adding --with-dwarf4 will help improve/test the portability of the guality testsuite. Ok for mainline? An alternative implementation might be to allow integer values for $with_dwarf such that --with-dwarf5, --with-dwarf3 etc. do the right thing. In fact, I'd originally misread the documentation and assumed --with-dwarf4 was already supported. 2024-01-14 Roger Sayle gcc/ChangeLog * configure.ac: Add a --with-dwarf4 option. * configure: Regenerate. 
* config/tm-dwarf4.h: New target file to define DWARF_VERSION_DEFAULT to 4. Thanks in advance, Roger -- diff --git a/gcc/configure.ac b/gcc/configure.ac index 596e5f2..2ce9093 100644 --- a/gcc/configure.ac +++ b/gcc/configure.ac @@ -1036,6 +1036,11 @@ AC_ARG_WITH(dwarf2, dwarf2="$with_dwarf2", dwarf2=no) +AC_ARG_WITH(dwarf4, +[AS_HELP_STRING([--with-dwarf4], [force the default debug format to be DWARF 4])], +dwarf4="$with_dwarf4", +dwarf4=no) + AC_ARG_ENABLE(shared, [AS_HELP_STRING([--disable-shared], [don't provide a shared libgcc])], [ @@ -1916,6 +1921,10 @@ if test x"$dwarf2" = xyes then tm_file="$tm_file tm-dwarf2.h" fi +if test x"$dwarf4" = xyes +then tm_file="$tm_file tm-dwarf4.h" +fi + # Say what files are being used for the output code and MD file. echo "Using \`$srcdir/config/$out_file' for machine-specific logic." echo "Using \`$srcdir/config/$md_file' as machine description file." diff --git a/gcc/config/tm-dwarf4.h b/gcc/config/tm-dwarf4.h new file mode 100644 index 000..9557b40 --- /dev/null +++ b/gcc/config/tm-dwarf4.h @@ -0,0 +1,3 @@ +/* Make Dwarf4 debugging info the default */ +#undef DWARF_VERSION_DEFAULT +#define DWARF_VERSION_DEFAULT 4
RE: [libatomic PATCH] Fix testsuite regressions on ARM [raspberry pi].
Hi Richard, As you've recommended, this issue has now been filed in bugzilla as PR other/113336. As explained in the new PR, libatomic's testsuite used to pass on armv6 (raspberry pi) in previous GCC releases, but the code was incorrect/non-synchronous; this was reported as PR target/107567 and PR target/109166. Now that those issues have been fixed, we now see that there's a missing dependency in libatomic that's required to implement this functionality correctly. I'm more convinced that my fix is correct, but it's perhaps a little disappointing that libatomic doesn't have a (multi-threaded) run-time test to search for race conditions, and confirm its implementations are correctly serializing. Please let me know what you think. Best regards, Roger -- > -Original Message- > From: Richard Earnshaw > Sent: 10 January 2024 15:34 > To: Roger Sayle ; gcc-patches@gcc.gnu.org > Subject: Re: [libatomic PATCH] Fix testsuite regressions on ARM [raspberry > pi]. > > > > On 08/01/2024 16:07, Roger Sayle wrote: > > > > Bootstrapping GCC on arm-linux-gnueabihf with --with-arch=armv6 > > currently has a large number of FAILs in libatomic (regressions since > > last time I attempted this). The failure mode is related to IFUNC > > handling with the file tas_8_2_.o containing an unresolved reference > > to the function libat_test_and_set_1_i2. > > > > Bearing in mind I've no idea what's going on, the following one line > > change, to build tas_1_2_.o when building tas_8_2_.o, resolves the > > problem for me and restores the libatomic testsuite to 44 expected > > passes and 5 unsupported tests [from 22 unexpected failures and 22 > > unresolved > testcases]. > > > > If this looks like the correct fix, I'm not confident with rebuilding > > Makefile.in with correct version of automake, so I'd very much > > appreciate it if someone/the reviewer/mainainer could please check this in > > for > me. > > Thanks in advance. 
> > > > > > 2024-01-08 Roger Sayle > > > > libatomic/ChangeLog > > * Makefile.am: Build tas_1_2_.o on ARCH_ARM_LINUX > > * Makefile.in: Regenerate. > > > > > > Roger > > -- > > > > Hi Roger, > > I don't really understand all this make foo :( so I'm not sure if this is the > right fix > either. If this is, as you say, a regression, have you been able to track > down when > it first started to occur? That might also help me to understand what > changed to > cause this. > > Perhaps we should have a PR for this, to make tracking the fixes easier. > > R.
[libatomic PATCH] Fix testsuite regressions on ARM [raspberry pi].
Bootstrapping GCC on arm-linux-gnueabihf with --with-arch=armv6 currently has a large number of FAILs in libatomic (regressions since last time I attempted this). The failure mode is related to IFUNC handling with the file tas_8_2_.o containing an unresolved reference to the function libat_test_and_set_1_i2. Bearing in mind I've no idea what's going on, the following one line change, to build tas_1_2_.o when building tas_8_2_.o, resolves the problem for me and restores the libatomic testsuite to 44 expected passes and 5 unsupported tests [from 22 unexpected failures and 22 unresolved testcases]. If this looks like the correct fix, I'm not confident with rebuilding Makefile.in with the correct version of automake, so I'd very much appreciate it if someone/the reviewer/maintainer could please check this in for me. Thanks in advance. 2024-01-08 Roger Sayle libatomic/ChangeLog * Makefile.am: Build tas_1_2_.o on ARCH_ARM_LINUX. * Makefile.in: Regenerate. Roger -- diff --git a/libatomic/Makefile.am b/libatomic/Makefile.am index cfad90124f9..e0988a18c9a 100644 --- a/libatomic/Makefile.am +++ b/libatomic/Makefile.am @@ -139,6 +139,7 @@ if ARCH_ARM_LINUX IFUNC_OPTIONS = -march=armv7-a+fp -DHAVE_KERNEL64 libatomic_la_LIBADD += $(foreach s,$(SIZES),$(addsuffix _$(s)_1_.lo,$(SIZEOBJS))) libatomic_la_LIBADD += $(addsuffix _8_2_.lo,$(SIZEOBJS)) +libatomic_la_LIBADD += $(addsuffix _1_2_.lo,$(SIZEOBJS)) endif if ARCH_I386 IFUNC_OPTIONS = -march=i586
RE: [x86_64 PATCH] PR target/112992: Optimize mode for broadcast of constants.
Hi Hongtao, Many thanks for the review. This revised patch implements several of your suggestions, specifically to use pshufd for V4SImode and punpcklqdq for V2DImode. These changes are demonstrated by the examples below: typedef unsigned int v4si __attribute((vector_size(16))); typedef unsigned long long v2di __attribute((vector_size(16))); v4si foo() { return (v4si){1,1,1,1}; } v2di bar() { return (v2di){1,1}; } The previous version of my patch generated: foo:movdqa .LC0(%rip), %xmm0 ret bar:movdqa .LC1(%rip), %xmm0 ret with this revised version, -O2 generates: foo:movl$1, %eax movd%eax, %xmm0 pshufd $0, %xmm0, %xmm0 ret bar:movl$1, %eax movq%rax, %xmm0 punpcklqdq %xmm0, %xmm0 ret However, if it's OK with you, I'd prefer to allow this function to return false, safely falling back to emitting a vector load from the constant pool rather than ICEing from a gcc_assert. For one thing this isn't an unrecoverable correctness issue, but at worst a missed optimization. The deeper reason is that this usefully provides a handle for tuning on different microarchitectures. On some (AMD?) machines, where !TARGET_INTER_UNIT_MOVES_TO_VEC, the first form above may be preferable to the second. Currently the start of ix86_convert_const_wide_int_to_broadcast disables broadcasts for !TARGET_INTER_UNIT_MOVES_TO_VEC even when an implementation doesn't require an inter unit move, such as a broadcast from memory. I plan follow-up patches that benefit from this flexibility. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? gcc/ChangeLog PR target/112992 * config/i386/i386-expand.cc (ix86_convert_const_wide_int_to_broadcast): Allow call to ix86_expand_vector_init_duplicate to fail, and return NULL_RTX. (ix86_broadcast_from_constant): Revert recent change; Return a suitable MEMREF independently of mode/target combinations. 
(ix86_expand_vector_move): Allow ix86_expand_vector_init_duplicate to decide whether expansion is possible/preferable. Only try forcing DImode constants to memory (and trying again) if calling ix86_expand_vector_init_duplicate fails with a DImode immediate constant. (ix86_expand_vector_init_duplicate) : Try using V4SImode for suitable immediate constants. : Try using V8SImode for suitable constants. : Fail for CONST_INT_P, i.e. use constant pool. : Likewise. : For CONST_INT_P try using V4SImode via widen. : For CONST_INT_P try using V8HImode via widen. : Handle CONST_INTs via simplify_binary_operation. Allow recursive calls to ix86_expand_vector_init_duplicate to fail. : For CONST_INT_P try V8SImode via widen. : For CONST_INT_P try V16HImode via widen. (ix86_expand_vector_init): Move try using a broadcast for all_same with ix86_expand_vector_init_duplicate before using constant pool. gcc/testsuite/ChangeLog * gcc.target/i386/auto-init-8.c: Update test case. * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Likewise. * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise. * gcc.target/i386/avx512fp16-13.c: Likewise. * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise. * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise. * gcc.target/i386/pr100865-1.c: Likewise. * gcc.target/i386/pr100865-10a.c: Likewise. * gcc.target/i386/pr100865-10b.c: Likewise. * gcc.target/i386/pr100865-2.c: Likewise. * gcc.target/i386/pr100865-3.c: Likewise. * gcc.target/i386/pr100865-4a.c: Likewise. * gcc.target/i386/pr100865-4b.c: Likewise. * gcc.target/i386/pr100865-5a.c: Likewise. * gcc.target/i386/pr100865-5b.c: Likewise. * gcc.target/i386/pr100865-9a.c: Likewise. * gcc.target/i386/pr100865-9b.c: Likewise. * gcc.target/i386/pr102021.c: Likewise. * gcc.target/i386/pr90773-17.c: Likewise. Thanks in advance. 
Roger -- > -Original Message- > From: Hongtao Liu > Sent: 02 January 2024 05:40 > To: Roger Sayle > Cc: gcc-patches@gcc.gnu.org; Uros Bizjak > Subject: Re: [x86_64 PATCH] PR target/112992: Optimize mode for broadcast of > constants. > > On Fri, Dec 22, 2023 at 6:25 PM Roger Sayle > wrote: > > > > > > This patch resolves the second part of PR target/112992, building upon > > Hongtao Liu's solution to the first part. > > > > The issue addressed by this patch is that when initializing vectors by > > broadcasting integer constants, the compiler has the flexibility to > > select the most appropriate vector mode to perform the broadcast, as &
[x86 PATCH] PR target/113231: Improved costs in Scalar-To-Vector (STV) pass.
This patch improves the cost/gain calculation used during the i386 backend's SImode/DImode scalar-to-vector (STV) conversion pass. The current code handles loads and stores, but doesn't consider that converting other scalar operations with a memory destination requires an explicit load before and an explicit store after the vector equivalent. To ease the review, the significant change looks like: /* For operations on memory operands, include the overhead of explicit load and store instructions. */ if (MEM_P (dst)) igain += !optimize_insn_for_size_p () ? (m * (ix86_cost->int_load[2] + ix86_cost->int_store[2]) - (ix86_cost->sse_load[sse_cost_idx] + ix86_cost->sse_store[sse_cost_idx])) : -COSTS_N_BYTES (8); however the patch itself is complicated by a change in indentation which leads to a number of lines with only whitespace changes. For architectures where integer load/store costs are the same as vector load/store costs, there should be no change without -Os/-Oz. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-01-06 Roger Sayle gcc/ChangeLog PR target/113231 * config/i386/i386-features.cc (compute_convert_gain): Include the overhead of explicit load and store (movd) instructions when converting non-store scalar operations with memory destinations. gcc/testsuite/ChangeLog PR target/113231 * gcc.target/i386/pr113231.c: New test case. 
Thanks again, Roger -- diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc index 4ae3e75..3677aef 100644 --- a/gcc/config/i386/i386-features.cc +++ b/gcc/config/i386/i386-features.cc @@ -563,183 +563,195 @@ general_scalar_chain::compute_convert_gain () else if (MEM_P (src) && REG_P (dst)) igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx]; else - switch (GET_CODE (src)) - { - case ASHIFT: - case ASHIFTRT: - case LSHIFTRT: - if (m == 2) - { - if (INTVAL (XEXP (src, 1)) >= 32) - igain += ix86_cost->add; - /* Gain for extend highpart case. */ - else if (GET_CODE (XEXP (src, 0)) == ASHIFT) - igain += ix86_cost->shift_const - ix86_cost->sse_op; - else - igain += ix86_cost->shift_const; - } - - igain += ix86_cost->shift_const - ix86_cost->sse_op; + { + /* For operations on memory operands, include the overhead +of explicit load and store instructions. */ + if (MEM_P (dst)) + igain += !optimize_insn_for_size_p () +? (m * (ix86_cost->int_load[2] ++ ix86_cost->int_store[2]) + - (ix86_cost->sse_load[sse_cost_idx] + + ix86_cost->sse_store[sse_cost_idx])) +: -COSTS_N_BYTES (8); - if (CONST_INT_P (XEXP (src, 0))) - igain -= vector_const_cost (XEXP (src, 0)); - break; + switch (GET_CODE (src)) + { + case ASHIFT: + case ASHIFTRT: + case LSHIFTRT: + if (m == 2) + { + if (INTVAL (XEXP (src, 1)) >= 32) + igain += ix86_cost->add; + /* Gain for extend highpart case. 
*/ + else if (GET_CODE (XEXP (src, 0)) == ASHIFT) + igain += ix86_cost->shift_const - ix86_cost->sse_op; + else + igain += ix86_cost->shift_const; + } - case ROTATE: - case ROTATERT: - igain += m * ix86_cost->shift_const; - if (TARGET_AVX512VL) - igain -= ix86_cost->sse_op; - else if (smode == DImode) - { - int bits = INTVAL (XEXP (src, 1)); - if ((bits & 0x0f) == 0) - igain -= ix86_cost->sse_op; - else if ((bits & 0x07) == 0) - igain -= 2 * ix86_cost->sse_op; - else - igain -= 3 * ix86_cost->sse_op; - } - else if (INTVAL (XEXP (src, 1)) == 16) - igain -= ix86_cost->sse_op; - else - igain -= 2 * ix86_cost->sse_op; - break; + igain += ix86_cost->shift_const - ix86_cost->sse_op; - case AND: - case IOR: - case XOR: - case PLUS: - case MINUS: - igain += m * ix86_cost->add - ix86_cost->sse_op; - /* Additional gain for
[middle-end PATCH take #2] Only call targetm.truly_noop_truncation for truncations.
Very many thanks (and a Happy New Year) to the pre-commit patch testing folks at linaro.org. Their testing has revealed that although my patch is clean on x86_64, it triggers some problems on aarch64 and arm. The issue (with the previous version of my patch) is that these platforms require a paradoxical subreg to be generated by the middle-end, where we were previously checking for truly_noop_truncation. This has been fixed (in revision 2) below. Where previously I had: @@ -66,7 +66,9 @@ gen_lowpart_general (machine_mode mode, rtx x) scalar_int_mode xmode; if (is_a <scalar_int_mode> (GET_MODE (x), &xmode) && GET_MODE_SIZE (xmode) <= UNITS_PER_WORD - && TRULY_NOOP_TRUNCATION_MODES_P (mode, xmode) + && (known_lt (GET_MODE_SIZE (mode), GET_MODE_SIZE (xmode)) + ? TRULY_NOOP_TRUNCATION_MODES_P (mode, xmode) + : known_eq (GET_MODE_SIZE (mode), GET_MODE_SIZE (xmode))) && !reload_completed) return gen_lowpart_general (mode, force_reg (xmode, x)); the correct change is: scalar_int_mode xmode; if (is_a <scalar_int_mode> (GET_MODE (x), &xmode) && GET_MODE_SIZE (xmode) <= UNITS_PER_WORD - && TRULY_NOOP_TRUNCATION_MODES_P (mode, xmode) + && (known_ge (GET_MODE_SIZE (mode), GET_MODE_SIZE (xmode)) + || TRULY_NOOP_TRUNCATION_MODES_P (mode, xmode)) && !reload_completed) return gen_lowpart_general (mode, force_reg (xmode, x)); i.e. we only call TRULY_NOOP_TRUNCATION_MODES_P when we know we have a truncation, but the behaviour of non-truncations is preserved (no longer depends upon unspecified behaviour) and gen_lowpart_general is called to create the paradoxical SUBREG. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? Hopefully this revision tests cleanly on the linaro.org CI pipeline. 2023-12-31 Roger Sayle gcc/ChangeLog * combine.cc (make_extraction): Confirm that OUTPREC is less than INPREC before calling TRULY_NOOP_TRUNCATION_MODES_P. * expmed.cc (store_bit_field_using_insv): Likewise. 
(extract_bit_field_using_extv): Likewise. (extract_bit_field_as_subreg): Likewise. * optabs-query.cc (get_best_extraction_insn): Likewise. * optabs.cc (expand_parity): Likewise. * rtlhooks.cc (gen_lowpart_general): Likewise. * simplify-rtx.cc (simplify_truncation): Disallow truncations to the same precision. (simplify_unary_operation_1) : Move optimization of truncations to the same mode earlier. > -Original Message- > From: Roger Sayle > Sent: 28 December 2023 15:35 > To: 'gcc-patches@gcc.gnu.org' > Cc: 'Jeff Law' > Subject: [middle-end PATCH] Only call targetm.truly_noop_truncation for > truncations. > > > The truly_noop_truncation target hook is documented, in target.def, as "true if it > is safe to convert a value of inprec bits to one of outprec bits (where outprec is > smaller than inprec) by merely operating on it as if it had only outprec bits", i.e. > the middle-end can use a SUBREG instead of a TRUNCATE. > > What's perhaps potentially a little ambiguous in the above description is whether > it is the caller or the callee that's responsible for ensuring or checking whether > "outprec < inprec". The name TRULY_NOOP_TRUNCATION_P, like > SUBREG_PROMOTED_P, may be prone to being understood as a predicate that > confirms that something is a no-op truncation or a promoted subreg, when in fact > the caller must first confirm this is a truncation/subreg and only then call the > "classification" macro. > > Alas making the following minor tweak (for testing) to the i386 backend: > > static bool > ix86_truly_noop_truncation (poly_uint64 outprec, poly_uint64 inprec) { > gcc_assert (outprec < inprec); > return true; > } > > #undef TARGET_TRULY_NOOP_TRUNCATION > #define TARGET_TRULY_NOOP_TRUNCATION ix86_truly_noop_truncation > > reveals that there are numerous callers in middle-end that rely on the default > behaviour of silently returning true for any (invalid) input. > These are fixed below. 
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and > make -k check, both with and without --target_board=unix{-m32} with no new > failures. Ok for mainline? > > > 2023-12-28 Roger Sayle > > gcc/ChangeLog > * combine.cc (make_extraction): Confirm that OUTPREC is less than > INPREC before calling TRULY_NOOP_TRUNCATION_MODES_P. > * expmed.cc (store_bit_field_using_insv): Likewise. > (extract_bit_field_using_extv): Likewise. > (extract_bit_field_as_subreg
RE: [x86_PATCH] peephole2 to resolve failure of gcc.target/i386/pr43644-2.c
Hi Uros, > From: Uros Bizjak > Sent: 28 December 2023 10:33 > On Fri, Dec 22, 2023 at 11:14 AM Roger Sayle > wrote: > > > > This patch resolves the failure of pr43644-2.c in the testsuite, a > > code quality test I added back in July, that started failing as the > > code GCC generates for 128-bit values (and their parameter passing) > > has been in flux. After a few attempts at tweaking pattern > > constraints in the hope of convincing reload to produce a more > > aggressive (but potentially > > unsafe) register allocation, I think the best solution is to use a > > peephole2 to catch/clean-up this specific case. > > > > Specifically, the function: > > > > unsigned __int128 foo(unsigned __int128 x, unsigned long long y) { > > return x+y; > > } > > > > currently generates: > > > > foo:movq%rdx, %rcx > > movq%rdi, %rax > > movq%rsi, %rdx > > addq%rcx, %rax > > adcq$0, %rdx > > ret > > > > and with this patch/peephole2 now generates: > > > > foo:movq%rdx, %rax > > movq%rsi, %rdx > > addq%rdi, %rax > > adcq$0, %rdx > > ret > > > > which I believe is optimal. > > How about simply moving the assignment to the MSB in the split pattern after > the > LSB calculation: > > [(set (match_dup 0) (match_dup 4)) > - (set (match_dup 5) (match_dup 2)) >(parallel [(set (reg:CCC FLAGS_REG) > (compare:CCC > (plus:DWIH (match_dup 0) (match_dup 1)) > (match_dup 0))) > (set (match_dup 0) > (plus:DWIH (match_dup 0) (match_dup 1)))]) > + (set (match_dup 5) (match_dup 2)) >(parallel [(set (match_dup 5) > (plus:DWIH > (plus:DWIH > > There is an earlyclobber on the output operand, so we are sure that > assignments > to (op 0) and (op 5) won't clobber anything. > cprop_hardreg pass will then do the cleanup for us, resulting in: > > foo: movq%rdi, %rax >addq%rdx, %rax >movq%rsi, %rdx > adcq$0, %rdx > > Uros. I agree. This is a much better fix. 
This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-12-31 Uros Bizjak Roger Sayle gcc/ChangeLog PR target/43644 * config/i386/i386.md (*add<dwi>3_doubleword_concat_zext): Tweak order of instructions after split, to minimize number of moves. gcc/testsuite/ChangeLog PR target/43644 * gcc.target/i386/pr43644-2.c: Expect 2 movq instructions. Thanks again (and Happy New Year). Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index e862368..6671274 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -6412,13 +6412,13 @@ "#" "&& reload_completed" [(set (match_dup 0) (match_dup 4)) - (set (match_dup 5) (match_dup 2)) (parallel [(set (reg:CCC FLAGS_REG) (compare:CCC (plus:DWIH (match_dup 0) (match_dup 1)) (match_dup 0))) (set (match_dup 0) (plus:DWIH (match_dup 0) (match_dup 1)))]) + (set (match_dup 5) (match_dup 2)) (parallel [(set (match_dup 5) (plus:DWIH (plus:DWIH diff --git a/gcc/testsuite/gcc.target/i386/pr43644-2.c b/gcc/testsuite/gcc.target/i386/pr43644-2.c index d470b0a..3316ac6 100644 --- a/gcc/testsuite/gcc.target/i386/pr43644-2.c +++ b/gcc/testsuite/gcc.target/i386/pr43644-2.c @@ -6,4 +6,4 @@ unsigned __int128 foo(unsigned __int128 x, unsigned long long y) return x+y; } -/* { dg-final { scan-assembler-times "movq" 1 } } */ +/* { dg-final { scan-assembler-times "movq" 2 } } */
RE: [PATCH] Improved RTL expansion of field assignments into promoted registers.
Hi Jeff,
Thanks for the speedy review.

> On 12/28/23 07:59, Roger Sayle wrote:
> > This patch fixes PR rtl-optimization/104914 by tweaking/improving the
> > way that fields are written into a pseudo register that needs to be
> > kept sign extended.
> Well, I think "fixes" is a bit of a stretch. We're avoiding the issue by changing the
> early RTL generation, but if I understand what's going on in the RTL optimizers
> and MIPS backend correctly, the core bug still remains. Admittedly I haven't put it
> under a debugger, but that MIPS definition of NOOP_TRUNCATION just seems
> badly wrong and is just waiting to pop its ugly head up again.

I think this really is the/a correct fix. The MIPS backend defines NOOP_TRUNCATION to false, so it's not correct to use a SUBREG to convert from DImode to SImode. The problem then is where in the compiler (middle-end or backend) is this invalid SUBREG being created and how can it be fixed. In this particular case, the fault is in RTL expansion. There may be other places where a SUBREG is inappropriately used instead of a TRUNCATE, but this is the place where things go wrong for PR rtl-optimization/104914. Once an inappropriate SImode SUBREG is in the RTL stream, it can remain harmlessly latent (most of the time), unless it gets split, simplified or spilled. Copying this SImode expression into its own pseudo results in incorrect code. One approach might be to use an UNSPEC for places where backend invariants are temporarily invalid, but in this case it's machine-independent middle-end code that's using SUBREGs as though the target were an x86/pdp11.

So I agree that on the surface, both of these appear to be identical:
> (set (reg:DI) (sign_extend:DI (truncate:SI (reg:DI))))
> (set (reg:DI) (sign_extend:DI (subreg:SI (reg:DI) 0)))
But should they get split or spilled by reload:
(set (reg_tmp:SI) (subreg:SI (reg:DI) 0))
(set (reg:DI) (sign_extend:DI (reg_tmp:SI)))
is invalid as the reg_tmp isn't correctly sign-extended for SImode.
But,
(set (reg_tmp:SI) (truncate:SI (reg:DI)))
(set (reg:DI) (sign_extend:DI (reg_tmp:SI)))
is fine. The difference is the instant in time, when the SUBREG's invariants aren't yet valid (and its contents shouldn't be thought of as SImode). On nvptx, where truly_noop_truncation is always "false", it'd show the same bug/failure, if it were not for the fact that nvptx doesn't attempt to store values in "mode extended" (SUBREG_PROMOTED_VAR_P) registers. The bug is really in MODE_REP_EXTENDED support.

> > The motivating example from the bugzilla PR is:
> >
> > extern void ext(int);
> > void foo(const unsigned char *buf) {
> >   int val;
> >   ((unsigned char*)&val)[0] = *buf++;
> >   ((unsigned char*)&val)[1] = *buf++;
> >   ((unsigned char*)&val)[2] = *buf++;
> >   ((unsigned char*)&val)[3] = *buf++;
> >   if(val > 0)
> >     ext(1);
> >   else
> >     ext(0);
> > }
> >
> > which at the end of the tree optimization passes looks like:
> >
> > void foo (const unsigned char * buf)
> > {
> >   int val;
> >   unsigned char _1;
> >   unsigned char _2;
> >   unsigned char _3;
> >   unsigned char _4;
> >   int val.5_5;
> >
> >   <bb 2> [local count: 1073741824]:
> >   _1 = *buf_7(D);
> >   MEM[(unsigned char *)&val] = _1;
> >   _2 = MEM[(const unsigned char *)buf_7(D) + 1B];
> >   MEM[(unsigned char *)&val + 1B] = _2;
> >   _3 = MEM[(const unsigned char *)buf_7(D) + 2B];
> >   MEM[(unsigned char *)&val + 2B] = _3;
> >   _4 = MEM[(const unsigned char *)buf_7(D) + 3B];
> >   MEM[(unsigned char *)&val + 3B] = _4;
> >   val.5_5 = val;
> >   if (val.5_5 > 0)
> >     goto <bb 3>; [59.00%]
> >   else
> >     goto <bb 4>; [41.00%]
> >
> >   <bb 3> [local count: 633507681]:
> >   ext (1);
> >   goto <bb 5>; [100.00%]
> >
> >   <bb 4> [local count: 440234144]:
> >   ext (0);
> >
> >   <bb 5> [local count: 1073741824]:
> >   val ={v} {CLOBBER(eol)};
> >   return;
> >
> > }
> >
> > Here four bytes are being sequentially written into the SImode value
> > val. On some platforms, such as MIPS64, this SImode value is kept in
> > a 64-bit register, suitably sign-extended.
The function > > expand_assignment contains logic to handle this via > > SUBREG_PROMOTED_VAR_P (around line 6264 in expr.cc) which outputs an > > explicit extension operation after each store_field (typically insv) to such > promoted/extended pseudos. > > > > The first observation is that there's no need to perform sign > > extension after each byte in the example above; the extension is only > > required after changes to the most significant byte (i.e.
[PATCH] MIPS: Implement TARGET_INSN_COSTS
The current (default) behavior is that when the target doesn't define TARGET_INSN_COST the middle-end uses the backend's TARGET_RTX_COSTS, so multiplications are slower than additions, but about the same size when optimizing for size (with -Os or -Oz). All of this gets disabled with your proposed patch. [If you don't check speed, you probably shouldn't touch insn_cost]. I agree that a backend can fine tune the (speed and size) costs of instructions (especially complex !single_set instructions) via attributes in the machine description, but these should be used to override/fine-tune rtx_costs, not override/replace/duplicate them. Having accurate rtx_costs also helps RTL expansion and the earlier optimizers, but insn_cost is used by combine and the later RTL optimization passes, once instructions have been recognized.

Might I also recommend that instead of insn_count*perf_ratio*4, or even the slightly better COSTS_N_INSNS (insn_count*perf_ratio), you encode the relative cost in the attribute, avoiding the multiplication (at runtime), and allowing fine tuning like "COSTS_N_INSNS(2) - 1". Likewise, COSTS_N_BYTES is a very useful macro for a backend to define/use in rtx_costs. Conveniently for many RISC machines, 1 instruction takes about 4 bytes, so COSTS_N_INSNS (1) is (approximately) comparable to COSTS_N_BYTES (4).

I hope this helps. Perhaps something like:

static int
mips_insn_cost (rtx_insn *insn, bool speed)
{
  int cost;
  if (recog_memoized (insn) >= 0)
    {
      if (speed)
	{
	  /* Use cost if provided.  */
	  cost = get_attr_cost (insn);
	  if (cost > 0)
	    return cost;
	}
      else
	{
	  /* If optimizing for size, we want the insn size.  */
	  return get_attr_length (insn);
	}
    }

  if (rtx set = single_set (insn))
    cost = set_rtx_cost (set, speed);
  else
    cost = pattern_cost (PATTERN (insn), speed);
  /* If the cost is zero, then it's likely a complex insn.  We don't
     want the cost of these to be less than something we know about.  */
  return cost ? cost : COSTS_N_INSNS (2);
}
[middle-end PATCH] Only call targetm.truly_noop_truncation for truncations.
The truly_noop_truncation target hook is documented, in target.def, as "true if it is safe to convert a value of inprec bits to one of outprec bits (where outprec is smaller than inprec) by merely operating on it as if it had only outprec bits", i.e. the middle-end can use a SUBREG instead of a TRUNCATE. What's perhaps potentially a little ambiguous in the above description is whether it is the caller or the callee that's responsible for ensuring or checking whether "outprec < inprec". The name TRULY_NOOP_TRUNCATION_P, like SUBREG_PROMOTED_P, may be prone to being understood as a predicate that confirms that something is a no-op truncation or a promoted subreg, when in fact the caller must first confirm this is a truncation/subreg and only then call the "classification" macro. Alas making the following minor tweak (for testing) to the i386 backend: static bool ix86_truly_noop_truncation (poly_uint64 outprec, poly_uint64 inprec) { gcc_assert (outprec < inprec); return true; } #undef TARGET_TRULY_NOOP_TRUNCATION #define TARGET_TRULY_NOOP_TRUNCATION ix86_truly_noop_truncation reveals that there are numerous callers in middle-end that rely on the default behaviour of silently returning true for any (invalid) input. These are fixed below. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-12-28 Roger Sayle gcc/ChangeLog * combine.cc (make_extraction): Confirm that OUTPREC is less than INPREC before calling TRULY_NOOP_TRUNCATION_MODES_P. * expmed.cc (store_bit_field_using_insv): Likewise. (extract_bit_field_using_extv): Likewise. (extract_bit_field_as_subreg): Likewise. * optabs-query.cc (get_best_extraction_insn): Likewise. * optabs.cc (expand_parity): Likewise. * rtlhooks.cc (gen_lowpart_general): Likewise. * simplify-rtx.cc (simplify_truncation): Disallow truncations to the same precision. 
(simplify_unary_operation_1) : Move optimization of truncations to the same mode earlier. Thanks in advance, Roger -- diff --git a/gcc/combine.cc b/gcc/combine.cc index f2c64a9..5aa2f57 100644 --- a/gcc/combine.cc +++ b/gcc/combine.cc @@ -7613,7 +7613,8 @@ make_extraction (machine_mode mode, rtx inner, HOST_WIDE_INT pos, && (pos == 0 || REG_P (inner)) && (inner_mode == tmode || !REG_P (inner) - || TRULY_NOOP_TRUNCATION_MODES_P (tmode, inner_mode) + || (known_lt (GET_MODE_SIZE (tmode), GET_MODE_SIZE (inner_mode)) + && TRULY_NOOP_TRUNCATION_MODES_P (tmode, inner_mode)) || reg_truncated_to_mode (tmode, inner)) && (! in_dest || (REG_P (inner) @@ -7856,6 +7857,8 @@ make_extraction (machine_mode mode, rtx inner, HOST_WIDE_INT pos, /* On the LHS, don't create paradoxical subregs implicitely truncating the register unless TARGET_TRULY_NOOP_TRUNCATION. */ if (in_dest + && known_lt (GET_MODE_SIZE (GET_MODE (inner)), + GET_MODE_SIZE (wanted_inner_mode)) && !TRULY_NOOP_TRUNCATION_MODES_P (GET_MODE (inner), wanted_inner_mode)) return NULL_RTX; diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index 0bba93f..8940d47 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -26707,6 +26707,16 @@ ix86_libm_function_max_error (unsigned cfn, machine_mode mode, #define TARGET_RUN_TARGET_SELFTESTS selftest::ix86_run_selftests #endif /* #if CHECKING_P */ +static bool +ix86_truly_noop_truncation (poly_uint64 outprec, poly_uint64 inprec) +{ + gcc_assert (outprec < inprec); + return true; +} + +#undef TARGET_TRULY_NOOP_TRUNCATION +#define TARGET_TRULY_NOOP_TRUNCATION ix86_truly_noop_truncation + struct gcc_target targetm = TARGET_INITIALIZER; #include "gt-i386.h" diff --git a/gcc/expmed.cc b/gcc/expmed.cc index 05331dd..6398bf9 100644 --- a/gcc/expmed.cc +++ b/gcc/expmed.cc @@ -651,6 +651,7 @@ store_bit_field_using_insv (const extraction_insn *insv, rtx op0, X) 0)) is (reg:N X). 
*/ if (GET_CODE (xop0) == SUBREG && REG_P (SUBREG_REG (xop0)) + && paradoxical_subreg_p (xop0) && !TRULY_NOOP_TRUNCATION_MODES_P (GET_MODE (SUBREG_REG (xop0)), op_mode)) { @@ -1585,7 +1586,11 @@ extract_bit_field_using_extv (const extraction_insn *extv, rtx op0, mode. Instead, create a temporary and use convert_move to set the target. */ if (REG_P (target) - && TRULY_NOOP_TRUNCATION_MODES_P (GET_MODE (target), ext_mode) + && (known_lt (GET_MODE_SIZE (GET_MODE (target)), + GET_
[PATCH] Improved RTL expansion of field assignments into promoted registers.
This patch fixes PR rtl-optimization/104914 by tweaking/improving the way that fields are written into a pseudo register that needs to be kept sign extended.

The motivating example from the bugzilla PR is:

extern void ext(int);
void foo(const unsigned char *buf) {
  int val;
  ((unsigned char*)&val)[0] = *buf++;
  ((unsigned char*)&val)[1] = *buf++;
  ((unsigned char*)&val)[2] = *buf++;
  ((unsigned char*)&val)[3] = *buf++;
  if(val > 0)
    ext(1);
  else
    ext(0);
}

which at the end of the tree optimization passes looks like:

void foo (const unsigned char * buf)
{
  int val;
  unsigned char _1;
  unsigned char _2;
  unsigned char _3;
  unsigned char _4;
  int val.5_5;

  <bb 2> [local count: 1073741824]:
  _1 = *buf_7(D);
  MEM[(unsigned char *)&val] = _1;
  _2 = MEM[(const unsigned char *)buf_7(D) + 1B];
  MEM[(unsigned char *)&val + 1B] = _2;
  _3 = MEM[(const unsigned char *)buf_7(D) + 2B];
  MEM[(unsigned char *)&val + 2B] = _3;
  _4 = MEM[(const unsigned char *)buf_7(D) + 3B];
  MEM[(unsigned char *)&val + 3B] = _4;
  val.5_5 = val;
  if (val.5_5 > 0)
    goto <bb 3>; [59.00%]
  else
    goto <bb 4>; [41.00%]

  <bb 3> [local count: 633507681]:
  ext (1);
  goto <bb 5>; [100.00%]

  <bb 4> [local count: 440234144]:
  ext (0);

  <bb 5> [local count: 1073741824]:
  val ={v} {CLOBBER(eol)};
  return;

}

Here four bytes are being sequentially written into the SImode value val. On some platforms, such as MIPS64, this SImode value is kept in a 64-bit register, suitably sign-extended. The function expand_assignment contains logic to handle this via SUBREG_PROMOTED_VAR_P (around line 6264 in expr.cc) which outputs an explicit extension operation after each store_field (typically insv) to such promoted/extended pseudos.

The first observation is that there's no need to perform sign extension after each byte in the example above; the extension is only required after changes to the most significant byte (i.e. to a field that overlaps the most significant bit). The bug fix is actually a bit more subtle, but at this point during code expansion it's not safe to use a SUBREG when sign-extending this field.
Currently, GCC generates (sign_extend:DI (subreg:SI (reg:DI) 0)), but combine (and other RTL optimizers) later realize that, because SImode values are always sign-extended in their 64-bit hard registers, this is a no-op and eliminate it. The trouble is that it's unsafe to refer to the SImode lowpart of a 64-bit register using SUBREG at those critical points when temporarily the value isn't correctly sign-extended, and the usual backend invariants don't hold. At these critical points, the middle-end needs to use an explicit TRUNCATE rtx (as this isn't a TRULY_NOOP_TRUNCATION), so that the explicit sign-extension looks like (sign_extend:DI (truncate:SI (reg:DI))), which avoids the problem. Note that MODE_REP_EXTENDED (NARROW, WIDE) != UNKNOWN implies (or should imply) !TRULY_NOOP_TRUNCATION (NARROW, WIDE). I've another (independent) patch that I'll post in a few minutes. This middle-end patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. The cc1 from a cross-compiler to mips64 appears to generate much better code for the above test case. Ok for mainline? 2023-12-28 Roger Sayle gcc/ChangeLog PR rtl-optimization/104914 * expr.cc (expand_assignment): When target is SUBREG_PROMOTED_VAR_P a sign or zero extension is only required if the modified field overlaps the SUBREG's most significant bit. On MODE_REP_EXTENDED targets, don't refer to the temporarily incorrectly extended value using a SUBREG, but instead generate an explicit TRUNCATE rtx. Thanks in advance, Roger -- diff --git a/gcc/expr.cc b/gcc/expr.cc index 9fef2bf6585..1a34b48e38f 100644 --- a/gcc/expr.cc +++ b/gcc/expr.cc @@ -6272,19 +6272,32 @@ expand_assignment (tree to, tree from, bool nontemporal) && known_eq (bitpos, 0) && known_eq (bitsize, GET_MODE_BITSIZE (GET_MODE (to_rtx result = store_expr (from, to_rtx, 0, nontemporal, false); - else + /* Check if the field overlaps the MSB, requiring extension.
*/ + else if (known_eq (bitpos + bitsize, +GET_MODE_BITSIZE (GET_MODE (to_rtx { - rtx to_rtx1 - = lowpart_subreg (subreg_unpromoted_mode (to_rtx), - SUBREG_REG (to_rtx), - subreg_promoted_mode (to_rtx)); + scalar_int_mode imode = subreg_unpromoted_mode (to_rtx); + scalar_int_mode omode = subreg_promoted_mode (to_rtx); + rtx to_rtx1 = lowpart_subreg (imode, SUBREG_REG (to_rtx), + omode); result = store_field (to_rtx1, bitsize, bitpos,
RE: [PATCH v3] EXPR: Emit an truncate if 31+ bits polluted for SImode
> > > What's exceedingly weird is T_N_T_M_P (DImode, SImode) isn't
> > > actually a truncation! The output precision is first, the input
> > > precision is second. The docs explicitly state the output precision
> > > should be smaller than the input precision (which makes sense for
> > > truncation).
> > >
> > > That's where I'd start with trying to untangle this mess.
> >
> > Thanks (both) for correcting my misunderstanding.
> > At the very least might I suggest that we introduce a new
> > TRULY_NOOP_EXTENSION_MODES_P target hook that MIPS can use for this
> > purpose? It'd help reduce confusion, and keep the
> > documentation/function naming correct.
> >
> Yes. It is good for me.
> T_N_T_M_P is a really confusing name.

Ignore my suggestion for a new target hook. GCC already has one. You shouldn't be using TRULY_NOOP_TRUNCATION_MODES_P with incorrectly ordered arguments. The correct target hook is TARGET_MODE_REP_EXTENDED, which the MIPS backend correctly defines via mips_mode_rep_extended. It's MIPS's definition of (and interpretation of) mips_truly_noop_truncation that's suspect.

My latest theory is that these sign extensions should be:
(set (reg:DI) (sign_extend:DI (truncate:SI (reg:DI))))
and not
(set (reg:DI) (sign_extend:DI (subreg:SI (reg:DI) 0)))
If the RTL optimizers ever split this instruction, the semantics of the SUBREG intermediate are incorrect. Another (less desirable) approach might be to use an UNSPEC.
RE: [PATCH v3] EXPR: Emit an truncate if 31+ bits polluted for SImode
> What's exceedingly weird is T_N_T_M_P (DImode, SImode) isn't actually a > truncation! The output precision is first, the input precision is second. > The docs > explicitly state the output precision should be smaller than the input > precision > (which makes sense for truncation). > > That's where I'd start with trying to untangle this mess. Thanks (both) for correcting my misunderstanding. At the very least might I suggest that we introduce a new TRULY_NOOP_EXTENSION_MODES_P target hook that MIPS can use for this purpose? It'd help reduce confusion, and keep the documentation/function naming correct. When Richard Sandiford "hookized" truly_noop_truncation in 2017 https://gcc.gnu.org/legacy-ml/gcc-patches/2017-09/msg00836.html he mentions the inprec/outprec confusion [deciding not to add a gcc_assert outprec < inprec here, which might be a good idea]. The next question is whether this is just TRULY_NOOP_SIGN_EXTENSION_MODES_P or whether there are any targets that usefully ensure some modes are zero-extended forms of others. TRULY_NOOP_ZERO_EXTENSION... My vote is that a DINS instruction that updates the most significant bit of an SImode value should then expand or define_insn_and_split with an explicit following sign-extension operation. To avoid this being eliminated by the RTL optimizers/combine the DINS should return a DImode result, with the following extension truncating it to canonical SImode form. This preserves the required backend invariant (and doesn't require tweaking machine-independent code in combine). SImode DINS instructions that don't/can't affect the MSB, can be a single SImode instruction. Cheers, Roger --
RE: Re: [PATCH v3] EXPR: Emit an truncate if 31+ bits polluted for SImode
> There's a PR in Bugzilla around this representational issue on MIPS, but I can't find > it straight away. Found it. It's PR rtl-optimization/104914, where we've already discussed this in comments #15 and #16. > -Original Message- > From: Roger Sayle > Sent: 24 December 2023 00:50 > To: 'gcc-patches@gcc.gnu.org' ; 'YunQiang Su' > > Cc: 'Jeff Law' > Subject: Re: [PATCH v3] EXPR: Emit an truncate if 31+ bits polluted for SImode > > > Hi YunQiang (and Jeff), > > > MIPS claims TRULY_NOOP_TRUNCATION_MODES_P (DImode, SImode)) == > true > > based on that the hard register is always sign-extended, but here the > > hard register is polluted by zero_extract. > > I suspect that the bug here is that the MIPS backend shouldn't be returning > true for TRULY_NOOP_TRUNCATION_MODES_P (DImode, SImode). It's true > that the backend stores SImode values in DImode registers by sign extending > them, but this doesn't mean that any DImode pseudo register can be truncated to > an SImode pseudo just by SUBREG/register naming. As you point out, if the high > bits of a DImode value are random, truncation isn't a no-op, and requires an > explicit sign-extension instruction. > > There's a PR in Bugzilla around this representational issue on MIPS, but I can't find > it straight away. > > Out of curiosity, how badly affected is the testsuite if mips.cc's > mips_truly_noop_truncation (poly_uint64 outprec, poly_uint64 inprec) is changed > to just return !TARGET_64BIT ? > > I agree with Jeff there's an invariant that isn't correctly being modelled by the > MIPS machine description. A machine description probably shouldn't define an > addsi3 pattern if what it actually supports is (sign_extend:DI (truncate:SI (plus:DI > (reg:DI x) (reg:DI y Trying to model this as SImode addition plus a > SUBREG_PROMOTED flag is less than ideal. > > Just my thoughts. I'm curious what other folks think. > > Cheers, > Roger > --
Re: [PATCH v3] EXPR: Emit an truncate if 31+ bits polluted for SImode
Hi YunQiang (and Jeff),

> MIPS claims TRULY_NOOP_TRUNCATION_MODES_P (DImode, SImode) == true
> based on that the hard register is always sign-extended, but here
> the hard register is polluted by zero_extract.

I suspect that the bug here is that the MIPS backend shouldn't be returning true for TRULY_NOOP_TRUNCATION_MODES_P (DImode, SImode). It's true that the backend stores SImode values in DImode registers by sign extending them, but this doesn't mean that any DImode pseudo register can be truncated to an SImode pseudo just by SUBREG/register naming. As you point out, if the high bits of a DImode value are random, truncation isn't a no-op, and requires an explicit sign-extension instruction. There's a PR in Bugzilla around this representational issue on MIPS, but I can't find it straight away. Out of curiosity, how badly affected is the testsuite if mips.cc's mips_truly_noop_truncation (poly_uint64 outprec, poly_uint64 inprec) is changed to just return !TARGET_64BIT ? I agree with Jeff there's an invariant that isn't correctly being modelled by the MIPS machine description. A machine description probably shouldn't define an addsi3 pattern if what it actually supports is (sign_extend:DI (truncate:SI (plus:DI (reg:DI x) (reg:DI y)))). Trying to model this as SImode addition plus a SUBREG_PROMOTED flag is less than ideal. Just my thoughts. I'm curious what other folks think. Cheers, Roger --
[ARC PATCH] Table-driven ashlsi implementation for better code/rtx_costs.
One of the cool features of the H8 backend is its use of tables to select optimal shift implementations for different CPU variants. This patch borrows (plagiarizes) that idiom for SImode left shifts in the ARC backend (for CPUs without a barrel-shifter). This provides a convenient mechanism for both selecting the best implementation strategy (for speed vs. size), and providing accurate rtx_costs [without duplicating a lot of logic]. Left shift RTX costs are especially important for use in synth_mult.

An example improvement is:

int foo(int x) { return 32768*x; }

which is now generated with -O2 -mcpu=em -mswap as:

foo:	bmsk_s	r0,r0,16
	swap	r0,r0
	j_s.d	[blink]
	ror	r0,r0

where previously the ARC backend would generate a loop:

foo:	mov	lp_count,15
	lp	2f
	add	r0,r0,r0
	nop
2:	# end single insn loop
	j_s	[blink]

Tested with a cross-compiler to arc-linux hosted on x86_64, with no new (compile-only) regressions from make -k check. Ok for mainline if this passes Claudiu's and/or Jeff's testing? [Thanks again to Jeff for finding the typo in my last ARC patch] 2023-12-23 Roger Sayle gcc/ChangeLog * config/arc/arc.cc (arc_shift_alg): New enumerated type for left shift implementation strategies. (arc_shift_info): Type for each entry of the shift strategy table. (arc_shift_context_idx): Return an integer value for each code generation context, used as an index. (arc_ashl_alg): Table indexed by context and shifted bit count. (arc_split_ashl): Use the arc_ashl_alg table to select SImode left shift implementation. (arc_rtx_costs) : Use the arc_ashl_alg table to provide accurate costs, when optimizing for speed or size. Thanks in advance, Roger -- diff --git a/gcc/config/arc/arc.cc b/gcc/config/arc/arc.cc index 3f4eb5a5736..925bffaa7d6 100644 --- a/gcc/config/arc/arc.cc +++ b/gcc/config/arc/arc.cc @@ -4222,6 +4222,253 @@ output_shift_loop (enum rtx_code code, rtx *operands) return ""; } +/* See below where shifts are handled for explanation of this enum.
*/
+enum arc_shift_alg
+{
+  SHIFT_MOVE,         /* Register-to-register move.  */
+  SHIFT_LOOP,         /* Zero-overhead loop implementation.  */
+  SHIFT_INLINE,       /* Multiple LSHIFTs and LSHIFT-PLUSs.  */
+  SHIFT_AND_ROT,      /* Bitwise AND, then ROTATERTs.  */
+  SHIFT_SWAP,         /* SWAP then multiple LSHIFTs/LSHIFT-PLUSs.  */
+  SHIFT_AND_SWAP_ROT  /* Bitwise AND, then SWAP, then ROTATERTs.  */
+};
+
+struct arc_shift_info {
+  enum arc_shift_alg alg;
+  unsigned int cost;
+};
+
+/* Return shift algorithm context, an index into the following tables.
+ * 0 for -Os (optimize for size)	3 for -O2 (optimized for speed)
+ * 1 for -Os -mswap TARGET_V2		4 for -O2 -mswap TARGET_V2
+ * 2 for -Os -mswap !TARGET_V2		5 for -O2 -mswap !TARGET_V2  */
+static unsigned int
+arc_shift_context_idx ()
+{
+  if (optimize_function_for_size_p (cfun))
+    {
+      if (!TARGET_SWAP)
+	return 0;
+      if (TARGET_V2)
+	return 1;
+      return 2;
+    }
+  else
+    {
+      if (!TARGET_SWAP)
+	return 3;
+      if (TARGET_V2)
+	return 4;
+      return 5;
+    }
+}
+
+static const arc_shift_info arc_ashl_alg[6][32] = {
+  { /* 0: -Os.
*/ +{ SHIFT_MOVE, COSTS_N_INSNS (1) }, /* 0 */ +{ SHIFT_INLINE, COSTS_N_INSNS (1) }, /* 1 */ +{ SHIFT_INLINE, COSTS_N_INSNS (2) }, /* 2 */ +{ SHIFT_INLINE, COSTS_N_INSNS (2) }, /* 3 */ +{ SHIFT_INLINE, COSTS_N_INSNS (3) }, /* 4 */ +{ SHIFT_INLINE, COSTS_N_INSNS (3) }, /* 5 */ +{ SHIFT_INLINE, COSTS_N_INSNS (3) }, /* 6 */ +{ SHIFT_INLINE, COSTS_N_INSNS (4) }, /* 7 */ +{ SHIFT_INLINE, COSTS_N_INSNS (4) }, /* 8 */ +{ SHIFT_INLINE, COSTS_N_INSNS (4) }, /* 9 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 10 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 11 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 12 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 13 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 14 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 15 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 16 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 17 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 18 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 19 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 20 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 21 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 22 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 23 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 24 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 25 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 26 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4)
[x86_64 PATCH] PR target/112992: Optimize mode for broadcast of constants.
This patch resolves the second part of PR target/112992, building upon Hongtao Liu's solution to the first part. The issue addressed by this patch is that when initializing vectors by broadcasting integer constants, the compiler has the flexibility to select the most appropriate vector mode to perform the broadcast, as long as the resulting vector has an identical bit pattern. For example, the following constants are all equivalent:

V4SImode  {0x01010101, 0x01010101, 0x01010101, 0x01010101 }
V8HImode  {0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101 }
V16QImode {0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, ... 0x01 }

So instruction sequences that construct any of these can be used to construct the others (with a suitable cast/SUBREG). On x86_64, it turns out that broadcasts of SImode constants are preferred, as DImode constants often require a longer movabs instruction, and HImode and QImode broadcasts require multiple uops on some architectures. Hence, SImode is always at least tied for the shortest/fastest implementation. Examples of this improvement can be seen in the testsuite.
gcc.target/i386/pr102021.c
Before:
   0:	48 b8 0c 00 0c 00 0c	movabs $0xc000c000c000c,%rax
   7:	00 0c 00
   a:	62 f2 fd 28 7c c0	vpbroadcastq %rax,%ymm0
  10:	c3	retq
After:
   0:	b8 0c 00 0c 00	mov    $0xc000c,%eax
   5:	62 f2 7d 28 7c c0	vpbroadcastd %eax,%ymm0
   b:	c3	retq

and gcc.target/i386/pr90773-17.c:
Before:
   0:	48 8b 15 00 00 00 00	mov    0x0(%rip),%rdx	# 7
   7:	b8 0c 00 00 00	mov    $0xc,%eax
   c:	62 f2 7d 08 7a c0	vpbroadcastb %eax,%xmm0
  12:	62 f1 7f 08 7f 02	vmovdqu8 %xmm0,(%rdx)
  18:	c7 42 0f 0c 0c 0c 0c	movl   $0xc0c0c0c,0xf(%rdx)
  1f:	c3	retq
After:
   0:	48 8b 15 00 00 00 00	mov    0x0(%rip),%rdx	# 7
   7:	b8 0c 0c 0c 0c	mov    $0xc0c0c0c,%eax
   c:	62 f2 7d 08 7c c0	vpbroadcastd %eax,%xmm0
  12:	62 f1 7f 08 7f 02	vmovdqu8 %xmm0,(%rdx)
  18:	c7 42 0f 0c 0c 0c 0c	movl   $0xc0c0c0c,0xf(%rdx)
  1f:	c3	retq

where according to Agner Fog's instruction tables broadcastd is slightly faster on some microarchitectures, for example Knight's Landing. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-12-21 Roger Sayle gcc/ChangeLog PR target/112992 * config/i386/i386-expand.cc (ix86_convert_const_wide_int_to_broadcast): Allow call to ix86_expand_vector_init_duplicate to fail, and return NULL_RTX. (ix86_broadcast_from_constant): Revert recent change; Return a suitable MEMREF independently of mode/target combinations. (ix86_expand_vector_move): Allow ix86_expand_vector_init_duplicate to decide whether expansion is possible/preferable. Only try forcing DImode constants to memory (and trying again) if calling ix86_expand_vector_init_duplicate fails with a DImode immediate constant. (ix86_expand_vector_init_duplicate) : Try using V4SImode for suitable immediate constants. : Try using V8SImode for suitable constants. : Use constant pool for AVX without AVX2. : Fail for CONST_INT_P, i.e. use constant pool. : Likewise. : For CONST_INT_P try using V4SImode via widen.
: For CONST_INT_P try using V8HImode via widen. : Handle CONST_INTs via simplify_binary_operation. Allow recursive calls to ix86_expand_vector_init_duplicate to fail. : For CONST_INT_P try V8SImode via widen. : For CONST_INT_P try V16HImode via widen. (ix86_expand_vector_init): Move try using a broadcast for all_same with ix86_expand_vector_init_duplicate before using constant pool. gcc/testsuite/ChangeLog * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Update test case. * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise. * gcc.target/i386/avx512fp16-13.c: Likewise. * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise. * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise. * gcc.target/i386/pr100865-10a.c: Likewise. * gcc.target/i386/pr100865-10b.c: Likewise. * gcc.target/i386/pr100865-11c.c: Likewise. * gcc.target/i386/pr100865-12c.c: Likewise. * gcc.target/i386/pr100865-2.c: Likewise. * gcc.target/i386/pr100865-3.c: Likewise. * gcc.target/i386/pr100865-4a.c: Likewise. * gcc.target/i386/pr100865-4b.c: Likewise. * gcc.target/i386/pr100865-5a.c: Likewise. * gcc.target/i386/pr100865-5b.c: Likewise. * gcc.target/i386/pr100865-9a.c: Likewise. * gcc.target/i386/pr100865-9b.c: Likewise. * gcc.target/i386/pr102021.c: Likewise
[x86_PATCH] peephole2 to resolve failure of gcc.target/i386/pr43644-2.c
This patch resolves the failure of pr43644-2.c in the testsuite, a code quality test I added back in July, that started failing as the code GCC generates for 128-bit values (and their parameter passing) has been in flux. After a few attempts at tweaking pattern constraints in the hope of convincing reload to produce a more aggressive (but potentially unsafe) register allocation, I think the best solution is to use a peephole2 to catch/clean-up this specific case.

Specifically, the function:

unsigned __int128 foo(unsigned __int128 x, unsigned long long y) {
  return x+y;
}

currently generates:

foo:	movq	%rdx, %rcx
	movq	%rdi, %rax
	movq	%rsi, %rdx
	addq	%rcx, %rax
	adcq	$0, %rdx
	ret

and with this patch/peephole2 now generates:

foo:	movq	%rdx, %rax
	movq	%rsi, %rdx
	addq	%rdi, %rax
	adcq	$0, %rdx
	ret

which I believe is optimal. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-12-21 Roger Sayle gcc/ChangeLog PR target/43644 * config/i386/i386.md (define_peephole2): Tweak register allocation of *add3_doubleword_concat_zext. gcc/testsuite/ChangeLog PR target/43644 * gcc.target/i386/pr43644-2.c: Expect 2 movq instructions. Thanks in advance, and for your patience with this testsuite noise.
Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index e862368..5967208 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -6428,6 +6428,38 @@ (clobber (reg:CC FLAGS_REG))])] "split_double_mode (mode, [0], 1, [0], [5]);") +(define_peephole2 + [(set (match_operand:SWI48 0 "general_reg_operand") + (match_operand:SWI48 1 "general_reg_operand")) + (set (match_operand:SWI48 2 "general_reg_operand") + (match_operand:SWI48 3 "general_reg_operand")) + (set (match_dup 1) (match_operand:SWI48 4 "general_reg_operand")) + (parallel [(set (reg:CCC FLAGS_REG) + (compare:CCC +(plus:SWI48 (match_dup 2) (match_dup 0)) +(match_dup 2))) + (set (match_dup 2) + (plus:SWI48 (match_dup 2) (match_dup 0)))])] + "REGNO (operands[0]) != REGNO (operands[1]) + && REGNO (operands[0]) != REGNO (operands[2]) + && REGNO (operands[0]) != REGNO (operands[3]) + && REGNO (operands[0]) != REGNO (operands[4]) + && REGNO (operands[1]) != REGNO (operands[2]) + && REGNO (operands[1]) != REGNO (operands[3]) + && REGNO (operands[1]) != REGNO (operands[4]) + && REGNO (operands[2]) != REGNO (operands[3]) + && REGNO (operands[2]) != REGNO (operands[4]) + && REGNO (operands[3]) != REGNO (operands[4]) + && peep2_reg_dead_p (4, operands[0])" + [(set (match_dup 2) (match_dup 1)) + (set (match_dup 1) (match_dup 4)) + (parallel [(set (reg:CCC FLAGS_REG) + (compare:CCC + (plus:SWI48 (match_dup 2) (match_dup 3)) + (match_dup 2))) + (set (match_dup 2) + (plus:SWI48 (match_dup 2) (match_dup 3)))])]) + (define_insn "*add_1" [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r,r,r,r,r") (plus:SWI48 diff --git a/gcc/testsuite/gcc.target/i386/pr43644-2.c b/gcc/testsuite/gcc.target/i386/pr43644-2.c index d470b0a..3316ac6 100644 --- a/gcc/testsuite/gcc.target/i386/pr43644-2.c +++ b/gcc/testsuite/gcc.target/i386/pr43644-2.c @@ -6,4 +6,4 @@ unsigned __int128 foo(unsigned __int128 x, unsigned long long y) return x+y; } -/* { dg-final { scan-assembler-times "movq" 1 } } */ +/* { 
dg-final { scan-assembler-times "movq" 2 } } */
[x86 PATCH] Improved TImode (128-bit) integer constants on x86_64.
This patch fixes two issues with the handling of 128-bit TImode integer constants in the x86_64 backend. The main issue is that GCC always tries to load 128-bit integer constants via broadcasts to vector SSE registers, even if the result is required in general registers. This is seen in the two closely related functions below:

__int128 m;
#define CONST (((__int128)0x0123456789abcdefULL<<64) | 0x0123456789abcdefULL)
void foo() { m &= CONST; }
void bar() { m = CONST; }

When compiled with -O2 -mavx, we currently generate:

foo:    movabsq $81985529216486895, %rax
        vmovq   %rax, %xmm0
        vpunpcklqdq     %xmm0, %xmm0, %xmm0
        vmovq   %xmm0, %rax
        vpextrq $1, %xmm0, %rdx
        andq    %rax, m(%rip)
        andq    %rdx, m+8(%rip)
        ret
bar:    movabsq $81985529216486895, %rax
        vmovq   %rax, %xmm1
        vpunpcklqdq     %xmm1, %xmm1, %xmm0
        vpextrq $1, %xmm0, %rdx
        vmovq   %xmm0, m(%rip)
        movq    %rdx, m+8(%rip)
        ret

With this patch we defer the decision to use a vector broadcast for TImode until we know we actually want an SSE register result, by moving the call to ix86_convert_const_wide_int_to_broadcast from the RTL expansion pass to the scalar-to-vector (STV) pass. With this change (and a minor tweak described below) we now generate:

foo:    movabsq $81985529216486895, %rax
        andq    %rax, m(%rip)
        andq    %rax, m+8(%rip)
        ret
bar:    movabsq $81985529216486895, %rax
        vmovq   %rax, %xmm0
        vpunpcklqdq     %xmm0, %xmm0, %xmm0
        vmovdqa %xmm0, m(%rip)
        ret

showing that we now correctly use vector mode broadcasts (only) where appropriate. The one minor tweak mentioned above is to enable the un-cprop hi/lo optimization, which I originally contributed back in September 2004 https://gcc.gnu.org/pipermail/gcc-patches/2004-September/148756.html even when not optimizing for size.
Without this (and currently with just -O2) the function foo above generates:

foo:    movabsq $81985529216486895, %rax
        movabsq $81985529216486895, %rdx
        andq    %rax, m(%rip)
        andq    %rdx, m+8(%rip)
        ret

I'm not sure why (back in 2004) I thought that avoiding the implicit "movq %rax, %rdx" instead of a second load was faster, perhaps avoiding a dependency to allow better scheduling, but nowadays "movq %rax, %rdx" is either eliminated by GCC's hardreg cprop pass, or special-cased by modern hardware, making the first foo preferable, not only shorter but also faster.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, and with/without -march=cascadelake, with no new failures. Ok for mainline?

2023-12-18  Roger Sayle

gcc/ChangeLog
	* config/i386/i386-expand.cc (ix86_convert_const_wide_int_to_broadcast):
	Remove static.
	(ix86_expand_move): Don't attempt to convert wide constants
	to SSE using ix86_convert_const_wide_int_to_broadcast here.
	(ix86_split_long_move): Always un-cprop multi-word constants.
	* config/i386/i386-expand.h (ix86_convert_const_wide_int_to_broadcast):
	Prototype here.
	* config/i386/i386-features.cc: Include i386-expand.h.
	(timode_scalar_chain::convert_insn): When converting TImode to
	V1TImode, try ix86_convert_const_wide_int_to_broadcast.

gcc/testsuite/ChangeLog
	* gcc.target/i386/movti-2.c: New test case.
	* gcc.target/i386/movti-3.c: Likewise.

Thanks in advance, Roger
--

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index fad4f34..57a108a 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -289,7 +289,7 @@ ix86_broadcast (HOST_WIDE_INT v, unsigned int width,
 /* Convert the CONST_WIDE_INT operand OP to broadcast in MODE.
*/ -static rtx +rtx ix86_convert_const_wide_int_to_broadcast (machine_mode mode, rtx op) { /* Don't use integer vector broadcast if we can't move from GPR to SSE @@ -541,14 +541,6 @@ ix86_expand_move (machine_mode mode, rtx operands[]) return; } } - else if (CONST_WIDE_INT_P (op1) - && GET_MODE_SIZE (mode) >= 16) - { - rtx tmp = ix86_convert_const_wide_int_to_broadcast - (GET_MODE (op0), op1); - if (tmp != nullptr) - op1 = tmp; - } } } @@ -6323,18 +6315,15 @@ ix86_split_long_move (rtx operands[]) } } - /* If optimizing for size, attempt to locally unCSE nonzero constants. */ - if (optimize_insn_for_size_p ()) -{ - for (j = 0; j < nparts - 1; j++) - if (CONST_INT_P (operands[6 + j]) - && operands[6 + j] != const0_rtx - && REG_P (operands[2 + j])) - for (i = j; i < nparts - 1; i++) - if (CONST_INT_P (operand
[PING] PR112380: Defend against CLOBBERs in RTX expressions in combine.cc
I'd like to ping my patch for PR rtl-optimization/112380. https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636203.html For those unfamiliar with the (clobber (const_int 0)) idiom used by combine, I'll explain a little of the ancient history... Back before time, in the prehistory of git/subversion/cvs or even ChangeLogs, in March 1987 to be precise, Richard Stallman's GCC version 0.9 had RTL optimization passes similar to those in use today. This far back, combine.c contained the function gen_lowpart_for_combine, which was documented as "Like gen_lowpart but for use in combine" where "it is not possible to create any new pseudoregs." and "return zero if we don't see a way to make a lowpart.". And indeed, this function returned (rtx)0, and the single caller of gen_lowpart_for_combine checked whether the return value was non-zero.

Unfortunately, gcc 0.9's combine also contained bugs; at three places in combine.c, it called gen_lowpart, the first of these looked like:

return gen_rtx (AND, GET_MODE (x),
                gen_lowpart (GET_MODE (x), XEXP (to, 0)),
                XEXP (to, 1));

Time passes, and by version 1.21 in May 1988 (in fact before the earliest ChangeLogs were introduced for version 1.17 in January 1988), this issue had been identified, and a helpful reminder placed at the top of the code:

/* It is not safe to use ordinary gen_lowpart in combine.
   Use gen_lowpart_for_combine instead. See comments there. */
#define gen_lowpart dont_use_gen_lowpart_you_dummy

However, to save a little effort, and avoid checking the return value for validity at all of the callers of gen_lowpart_for_combine, RMS invented the "(clobber (const_int 0))" idiom, which was returned instead of zero. The comment above gen_lowpart_for_combine was modified to state:

/* If for some reason this cannot do its job, an rtx (clobber (const_int 0)) is returned. An insn containing that will not be recognized.
*/

Aside: Around this time Bjarne Stroustrup was also trying to avoid testing function return values for validity, and so introduced exceptions into C++. Thirty-five years later this decision (short-cut) still haunts combine. Using "(clobber (const_int 0))", which, like error_mark_node, can appear anywhere in an RTX expression, makes it hard to impose strict typing (to catch things like a CLOBBER of a CLOBBER), and as shown by bugzilla's PR rtl-optimization/112380, these RTXes occasionally escape from combine to cause problems in generic RTL handling functions. This patch doesn't eliminate combine.cc's use of (clobber (const_int 0)); we still allocate memory to indicate exceptional conditions, and require the garbage collector to clean things up, but testing the values returned from functions for errors/exceptions is good software engineering, and hopefully a step in the right direction. I'd hoped allowing combine to continue exploring alternate simplifications would also lead to better code generation, but I've not been able to find any examples on x86_64.

This patch has been retested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, with no new failures. Ok for mainline?

2023-11-12  Roger Sayle

gcc/ChangeLog
	PR rtl-optimization/112380
	* combine.cc (find_split_point): Check if gen_lowpart returned
	a CLOBBER.
	(subst): Check if combine_simplify_rtx returned a CLOBBER.
	(simplify_set): Check if force_to_mode returned a CLOBBER.
	Check if gen_lowpart returned a CLOBBER.
	(expand_field_assignment): Likewise.
	(make_extraction): Check if force_to_mode returned a CLOBBER.
	(force_int_to_mode): Likewise.
	(simplify_and_const_int_1): Check if VAROP is a CLOBBER, after
	call to force_to_mode (and before).
	(simplify_comparison): Check if force_to_mode returned a CLOBBER.
	Check if gen_lowpart returned a CLOBBER.

gcc/testsuite/ChangeLog
	PR rtl-optimization/112380
	* gcc.dg/pr112380.c: New test case.

Thanks in advance, Roger
--
RE: [ARC PATCH] Add *extvsi_n_0 define_insn_and_split for PR 110717.
Hi Jeff, Doh! Great catch. The perils of not (yet) being able to actually run any ARC execution tests myself. > Shouldn't operands[4] be GEN_INT ((HOST_WIDE_INT_1U << tmp) - 1)? Yes(-ish), operands[4] should be GEN_INT(HOST_WIDE_INT_1U << (tmp - 1)). And the 32s in the test cases need to be 16s (the MSB of a five bit field is 16). You're probably also thinking the same thing that I am... that it might be possible to implement this in the middle-end, but things are complicated by combine's make_compound_operation/expand_compound_operation, and that combine doesn't (normally) like turning two instructions into three. Fingers-crossed the attached patch works better on the nightly testers. Thanks in advance, Roger -- > -Original Message- > From: Jeff Law > Sent: 07 December 2023 14:47 > To: Roger Sayle ; gcc-patches@gcc.gnu.org > Cc: 'Claudiu Zissulescu' > Subject: Re: [ARC PATCH] Add *extvsi_n_0 define_insn_and_split for PR 110717. > > On 12/5/23 06:59, Roger Sayle wrote: > > This patch improves the code generated for bitfield sign extensions on > > ARC cpus without a barrel shifter. > > > > > > Compiling the following test case: > > > > int foo(int x) { return (x<<27)>>27; } > > > > with -O2 -mcpu=em, generates two loops: > > > > foo:mov lp_count,27 > > lp 2f > > add r0,r0,r0 > > nop > > 2: # end single insn loop > > mov lp_count,27 > > lp 2f > > asr r0,r0 > > nop > > 2: # end single insn loop > > j_s [blink] > > > > > > and the closely related test case: > > > > struct S { int a : 5; }; > > int bar (struct S *p) { return p->a; } > > > > generates the slightly better: > > > > bar:ldb_s r0,[r0] > > mov_s r2,0;3 > > add3r0,r2,r0 > > sexb_s r0,r0 > > asr_s r0,r0 > > asr_s r0,r0 > > j_s.d [blink] > > asr_s r0,r0 > > > > which uses 6 instructions to perform this particular sign extension. 
> > It turns out that sign extensions can always be implemented using at > > most three instructions on ARC (without a barrel shifter) using the > > idiom ((x)^msb)-msb [as described in section "2-5 Sign Extension" > > of Henry Warren's book "Hacker's Delight"]. Using this, the sign > > extensions above on ARC's EM both become: > > > > bmsk_s r0,r0,4 > > xor r0,r0,32 > > sub r0,r0,32 > > > > which takes about 3 cycles, compared to the ~112 cycles for the loops > > in foo. > > > > > > Tested with a cross-compiler to arc-linux hosted on x86_64, with no > > new (compile-only) regressions from make -k check. > > Ok for mainline if this passes Claudiu's nightly testing? > > > > > > 2023-12-05 Roger Sayle > > > > gcc/ChangeLog > > * config/arc/arc.md (*extvsi_n_0): New define_insn_and_split to > > implement SImode sign extract using a AND, XOR and MINUS sequence. > > > > gcc/testsuite/ChangeLog > > * gcc.target/arc/extvsi-1.c: New test case. > > * gcc.target/arc/extvsi-2.c: Likewise. > > > > > > Thanks in advance, > > Roger > > -- > > > > > > patchar.txt > > > > diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md index > > bf9f88eff047..5ebaf2e20ab0 100644 > > --- a/gcc/config/arc/arc.md > > +++ b/gcc/config/arc/arc.md > > @@ -6127,6 +6127,26 @@ archs4x, archs4xd" > > "" > > [(set_attr "length" "8")]) > > > > +(define_insn_and_split "*extvsi_n_0" > > + [(set (match_operand:SI 0 "register_operand" "=r") > > + (sign_extract:SI (match_operand:SI 1 "register_operand" "0") > > +(match_operand:QI 2 "const_int_operand") > > +(const_int 0)))] > > + "!TARGET_BARREL_SHIFTER > > + && IN_RANGE (INTVAL (operands[2]), 2, > > + (optimize_insn_for_size_p () ? 28 : 30))" > > + "#" > > + "&& 1" > > +[(set (match_dup 0) (and:SI (match_dup 0) (match_dup 3))) (set > > +(match_dup 0) (xor:SI (match_dup 0) (match_dup 4))) (set (match_dup > > +0) (minus:SI (match_dup 0) (match_dup 4)))] { > > + int tmp = INTVAL (operands[2]); > > + operands[3] = GEN_INT (~(HOST_WIDE_INT_M1U &
[ARC PATCH] Add *extvsi_n_0 define_insn_and_split for PR 110717.
This patch improves the code generated for bitfield sign extensions on ARC cpus without a barrel shifter.

Compiling the following test case:

int foo(int x) { return (x<<27)>>27; }

with -O2 -mcpu=em, generates two loops:

foo:    mov     lp_count,27
        lp      2f
        add     r0,r0,r0
        nop
2:      # end single insn loop
        mov     lp_count,27
        lp      2f
        asr     r0,r0
        nop
2:      # end single insn loop
        j_s     [blink]

and the closely related test case:

struct S { int a : 5; };
int bar (struct S *p) { return p->a; }

generates the slightly better:

bar:    ldb_s   r0,[r0]
        mov_s   r2,0    ;3
        add3    r0,r2,r0
        sexb_s  r0,r0
        asr_s   r0,r0
        asr_s   r0,r0
        j_s.d   [blink]
        asr_s   r0,r0

which uses 6 instructions to perform this particular sign extension. It turns out that sign extensions can always be implemented using at most three instructions on ARC (without a barrel shifter) using the idiom ((x)^msb)-msb [as described in section "2-5 Sign Extension" of Henry Warren's book "Hacker's Delight"]. Using this, the sign extensions above on ARC's EM both become:

        bmsk_s  r0,r0,4
        xor     r0,r0,32
        sub     r0,r0,32

which takes about 3 cycles, compared to the ~112 cycles for the loops in foo.

Tested with a cross-compiler to arc-linux hosted on x86_64, with no new (compile-only) regressions from make -k check. Ok for mainline if this passes Claudiu's nightly testing?

2023-12-05  Roger Sayle

gcc/ChangeLog
	* config/arc/arc.md (*extvsi_n_0): New define_insn_and_split to
	implement SImode sign extract using an AND, XOR and MINUS sequence.

gcc/testsuite/ChangeLog
	* gcc.target/arc/extvsi-1.c: New test case.
	* gcc.target/arc/extvsi-2.c: Likewise.
Thanks in advance, Roger -- diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md index bf9f88eff047..5ebaf2e20ab0 100644 --- a/gcc/config/arc/arc.md +++ b/gcc/config/arc/arc.md @@ -6127,6 +6127,26 @@ archs4x, archs4xd" "" [(set_attr "length" "8")]) +(define_insn_and_split "*extvsi_n_0" + [(set (match_operand:SI 0 "register_operand" "=r") + (sign_extract:SI (match_operand:SI 1 "register_operand" "0") +(match_operand:QI 2 "const_int_operand") +(const_int 0)))] + "!TARGET_BARREL_SHIFTER + && IN_RANGE (INTVAL (operands[2]), 2, + (optimize_insn_for_size_p () ? 28 : 30))" + "#" + "&& 1" +[(set (match_dup 0) (and:SI (match_dup 0) (match_dup 3))) + (set (match_dup 0) (xor:SI (match_dup 0) (match_dup 4))) + (set (match_dup 0) (minus:SI (match_dup 0) (match_dup 4)))] +{ + int tmp = INTVAL (operands[2]); + operands[3] = GEN_INT (~(HOST_WIDE_INT_M1U << tmp)); + operands[4] = GEN_INT (HOST_WIDE_INT_1U << tmp); +} + [(set_attr "length" "14")]) + (define_insn_and_split "rotlsi3_cnt1" [(set (match_operand:SI 0 "dest_reg_operand""=r") (rotate:SI (match_operand:SI 1 "register_operand" "r") diff --git a/gcc/testsuite/gcc.target/arc/extvsi-1.c b/gcc/testsuite/gcc.target/arc/extvsi-1.c new file mode 100644 index ..eb53c78b4e6d --- /dev/null +++ b/gcc/testsuite/gcc.target/arc/extvsi-1.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mcpu=em" } */ +struct S { int a : 5; }; + +int foo (struct S *p) +{ + return p->a; +} + +/* { dg-final { scan-assembler "msk_s\\s+r0,r0,4" } } */ +/* { dg-final { scan-assembler "xor\\s+r0,r0,32" } } */ +/* { dg-final { scan-assembler "sub\\s+r0,r0,32" } } */ +/* { dg-final { scan-assembler-not "add3\\s+r0,r2,r0" } } */ +/* { dg-final { scan-assembler-not "sext_s\\s+r0,r0" } } */ +/* { dg-final { scan-assembler-not "asr_s\\s+r0,r0" } } */ diff --git a/gcc/testsuite/gcc.target/arc/extvsi-2.c b/gcc/testsuite/gcc.target/arc/extvsi-2.c new file mode 100644 index ..a0c6894259d4 --- /dev/null +++ b/gcc/testsuite/gcc.target/arc/extvsi-2.c @@ 
-0,0 +1,12 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mcpu=em" } */ + +int foo(int x) +{ + return (x<<27)>>27; +} + +/* { dg-final { scan-assembler "msk_s\\s+r0,r0,4" } } */ +/* { dg-final { scan-assembler "xor\\s+r0,r0,32" } } */ +/* { dg-final { scan-assembler "sub\\s+r0,r0,32" } } */ +/* { dg-final { scan-assembler-not "lp\\s+2f" } } */
[PATCH] Workaround array_slice constructor portability issues (with older g++).
The recent change to represent language and target attribute tables using vec.h's array_slice template class triggers an issue/bug in older g++ compilers, specifically the g++ 4.8.5 system compiler of older RedHat distributions. This exhibits as the following compilation errors during bootstrap:

../../gcc/gcc/c/c-lang.cc:55:2661: error: could not convert '(const scoped_attribute_specs* const*)(& c_objc_attribute_table)' from 'const scoped_attribute_specs* const*' to 'array_slice' struct lang_hooks lang_hooks = LANG_HOOKS_INITIALIZER;

../../gcc/gcc/c/c-decl.cc:4657:1: error: could not convert '(const attribute_spec*)(& std_attributes)' from 'const attribute_spec*' to 'array_slice'

Here the issue is with constructors of the form:

static const int table[] = { 1, 2, 3 };
array_slice<const int> t = table;

Perhaps there's a fix possible in vec.h (an additional constructor?), but the patch below fixes this issue by using one of array_slice's constructors (that takes a size) explicitly, rather than rely on template resolution. In the example above this looks like:

array_slice<const int> t (table, 3);

or equivalently

array_slice<const int> t = array_slice<const int> (table, 3);

or equivalently

array_slice<const int> t = array_slice<const int> (table, ARRAY_SIZE (table));

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap, where these changes allow the bootstrap to complete. Ok for mainline? This fix might not be ideal, but it both draws attention to the problem and restores bootstrap whilst better approaches are investigated. For example, an ARRAY_SLICE(table) macro might be appropriate if there isn't an easy/portable template resolution solution. Thoughts?

2023-12-03  Roger Sayle

gcc/c-family/ChangeLog
	* c-attribs.cc (c_common_gnu_attribute_table): Use an explicit
	array_slice constructor with an explicit size argument.
	(c_common_format_attribute_table): Likewise.

gcc/c/ChangeLog
	* c-decl.cc (std_attribute_table): Use an explicit array_slice
	constructor with an explicit size argument.
* c-objc-common.h (LANG_HOOKS_ATTRIBUTE_TABLE): Likewise. gcc/ChangeLog * config/i386/i386-options.cc (ix86_gnu_attribute_table): Use an explicit array_slice constructor with an explicit size argument. * config/i386/i386.cc (TARGET_ATTRIBUTE_TABLE): Likewise. gcc/cp/ChangeLog * cp-objcp-common.h (LANG_HOOKS_ATTRIBUTE_TABLE): Use an explicit array_slice constructor with an explicit size argument. * tree.cc (cxx_gnu_attribute_table): Likewise. (std_attribute_table): Likewise. gcc/lto/ChangeLog * lto-lang.cc (lto_gnu_attribute_table): Use an explicit array_slice constructor with an explicit size argument. (lto_format_attribute_table): Likewise. (LANG_HOOKS_ATTRIBUTE_TABLE): Likewise. Thanks in advance, Roger -- diff --git a/gcc/c-family/c-attribs.cc b/gcc/c-family/c-attribs.cc index 45af074..af83588 100644 --- a/gcc/c-family/c-attribs.cc +++ b/gcc/c-family/c-attribs.cc @@ -584,7 +584,9 @@ const struct attribute_spec c_common_gnu_attributes[] = const struct scoped_attribute_specs c_common_gnu_attribute_table = { - "gnu", c_common_gnu_attributes + "gnu", + array_slice(c_common_gnu_attributes, + ARRAY_SIZE (c_common_gnu_attributes)) }; /* Give the specifications for the format attributes, used by C and all @@ -603,7 +605,9 @@ const struct attribute_spec c_common_format_attributes[] = const struct scoped_attribute_specs c_common_format_attribute_table = { - "gnu", c_common_format_attributes + "gnu", + array_slice(c_common_format_attributes, + ARRAY_SIZE (c_common_format_attributes)) }; /* Returns TRUE iff the attribute indicated by ATTR_ID takes a plain diff --git a/gcc/c/c-decl.cc b/gcc/c/c-decl.cc index 248d1bb..a6984b0 100644 --- a/gcc/c/c-decl.cc +++ b/gcc/c/c-decl.cc @@ -4653,7 +4653,8 @@ static const attribute_spec std_attributes[] = const scoped_attribute_specs std_attribute_table = { - nullptr, std_attributes + nullptr, array_slice(std_attributes, +ARRAY_SIZE (std_attributes)) }; /* Create the predefined scalar types of C, diff --git a/gcc/c/c-objc-common.h 
b/gcc/c/c-objc-common.h index 426d938..021c651 100644 --- a/gcc/c/c-objc-common.h +++ b/gcc/c/c-objc-common.h @@ -83,7 +83,8 @@ static const scoped_attribute_specs *const c_objc_attribute_table[] = }; #undef LANG_HOOKS_ATTRIBUTE_TABLE -#define LANG_HOOKS_ATTRIBUTE_TABLE c_objc_attribute_table +#define LANG_HOOKS_ATTRIBUTE_TABLE \ +array_slice (c_objc_attribute_table, ARRAY_SIZE (c_objc_attribute_table)) #undef LANG_HOOKS_TREE_DUMP_DUMP_TREE_FN #define LANG_HOOKS_TREE_DUMP_DUMP_TREE_FN c_dump_tree diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc index 8776592..50b3425 100644 --- a/gcc/config/i386/i386-options.cc +++ b/gcc/config/i386/i3
[RISC-V PATCH] Improve style to work around PR 60994 in host compiler.
This simple patch allows me to build a cross-compiler to riscv using older versions of RedHat's system compiler. The issue is PR c++/60994 where g++ doesn't like the same name (demand_flags) to be used by both a variable and a (enumeration) type, which is also undesirable from a (GNU) coding style perspective. One solution is to rename the type to demand_flags_t, but a less invasive change is to simply use another identifier for the problematic local variable, renaming demand_flags to dflags. This patch has been tested by building cc1 of a cross-compiler to riscv64-unknown-linux-gnu using g++ 4.8.5 as the host compiler. Ok for mainline? 2023-12-01 Roger Sayle gcc/ChangeLog * config/riscv/riscv-vsetvl.cc (csetvl_info::parse_insn): Rename local variable from demand_flags to dflags, to avoid conflicting with (enumeration) type of the same name. Thanks in advance, Roger -- diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc index b3e07d4..9d11416 100644 --- a/gcc/config/riscv/riscv-vsetvl.cc +++ b/gcc/config/riscv/riscv-vsetvl.cc @@ -987,11 +987,11 @@ public: /* Determine the demand info of the RVV insn. */ m_max_sew = get_max_int_sew (); -unsigned demand_flags = 0; +unsigned dflags = 0; if (vector_config_insn_p (insn->rtl ())) { - demand_flags |= demand_flags::DEMAND_AVL_P; - demand_flags |= demand_flags::DEMAND_RATIO_P; + dflags |= demand_flags::DEMAND_AVL_P; + dflags |= demand_flags::DEMAND_RATIO_P; } else { @@ -1006,39 +1006,39 @@ public: available. 
*/ if (has_non_zero_avl ()) - demand_flags |= demand_flags::DEMAND_NON_ZERO_AVL_P; + dflags |= demand_flags::DEMAND_NON_ZERO_AVL_P; else - demand_flags |= demand_flags::DEMAND_AVL_P; + dflags |= demand_flags::DEMAND_AVL_P; } else - demand_flags |= demand_flags::DEMAND_AVL_P; + dflags |= demand_flags::DEMAND_AVL_P; } if (get_attr_ratio (insn->rtl ()) != INVALID_ATTRIBUTE) - demand_flags |= demand_flags::DEMAND_RATIO_P; + dflags |= demand_flags::DEMAND_RATIO_P; else { if (scalar_move_insn_p (insn->rtl ()) && m_ta) { - demand_flags |= demand_flags::DEMAND_GE_SEW_P; + dflags |= demand_flags::DEMAND_GE_SEW_P; m_max_sew = get_attr_type (insn->rtl ()) == TYPE_VFMOVFV ? get_max_float_sew () : get_max_int_sew (); } else - demand_flags |= demand_flags::DEMAND_SEW_P; + dflags |= demand_flags::DEMAND_SEW_P; if (!ignore_vlmul_insn_p (insn->rtl ())) - demand_flags |= demand_flags::DEMAND_LMUL_P; + dflags |= demand_flags::DEMAND_LMUL_P; } if (!m_ta) - demand_flags |= demand_flags::DEMAND_TAIL_POLICY_P; + dflags |= demand_flags::DEMAND_TAIL_POLICY_P; if (!m_ma) - demand_flags |= demand_flags::DEMAND_MASK_POLICY_P; + dflags |= demand_flags::DEMAND_MASK_POLICY_P; } -normalize_demand (demand_flags); +normalize_demand (dflags); /* Optimize AVL from the vsetvl instruction. */ insn_info *def_insn = extract_single_source (get_avl_def ());
[PATCH] PR112380: Defend against CLOBBERs in RTX expressions in combine.cc
This patch addresses PR rtl-optimization/112380, an ICE-on-valid regression where a (clobber (const_int 0)) encounters a sanity-checking gcc_assert (at line 7554) in simplify-rtx.cc. These CLOBBERs are used internally by GCC's combine pass much like error_mark_node is used by various language front-ends.

The solutions are either to handle/accept these CLOBBERs throughout (or in more places in) the middle-end's RTL optimizers, including functions in simplify-rtx.cc that are used by passes other than combine, and/or attempt to prevent these CLOBBERs escaping from try_combine into the RTX/RTL stream. The benefit of the second approach is that it actually allows for better optimization: when try_combine fails to simplify an expression, instead of substituting a CLOBBER to avoid the instruction pattern being recognized, noticing the CLOBBER often allows combine to attempt alternate simplifications/transformations, looking for those that can be recognized.

This patch is provided as two alternatives. The first is the minimal fix to address the CLOBBER encountered in the bugzilla PR. Assuming this approach is the correct fix to a latent bug/liability throughout combine.cc, the second alternative fixes many of the places that may potentially trigger problems in future, and allows combine to attempt more valid combinations/transformations. These were identified proactively by changing the "fail:" case in gen_lowpart_for_combine to return NULL_RTX, and working through the fall-out sufficient for x86_64 to bootstrap and regression test without new failures.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, with no new failures. Ok for mainline?

2023-11-12  Roger Sayle

gcc/ChangeLog
	PR rtl-optimization/112380
	* combine.cc (expand_field_assignment): Check if gen_lowpart
	returned a CLOBBER, and avoid calling simplify_gen_binary with
	it if so.
gcc/testsuite/ChangeLog PR rtl-optimization/112380 * gcc.dg/pr112380.c: New test case. gcc/ChangeLog PR rtl-optimization/112380 * combine.cc (find_split_point): Check if gen_lowpart returned a CLOBBER. (subst): Check if combine_simplify_rtx returned a CLOBBER. (simplify_set): Check if force_to_mode returned a CLOBBER. Check if gen_lowpart returned a CLOBBER. (expand_field_assignment): Likewise. (make_extraction): Check if force_to_mode returned a CLOBBER. (force_int_to_mode): Likewise. (simplify_and_const_int_1): Check if VAROP is a CLOBBER, after call to force_to_mode (and before). (simplify_comparison): Check if force_to_mode returned a CLOBBER. Check if gen_lowpart returned a CLOBBER. diff --git a/gcc/combine.cc b/gcc/combine.cc index 6344cd3..f2c64a9 100644 --- a/gcc/combine.cc +++ b/gcc/combine.cc @@ -7466,6 +7466,11 @@ expand_field_assignment (const_rtx x) if (!targetm.scalar_mode_supported_p (compute_mode)) break; + /* gen_lowpart_for_combine returns CLOBBER on failure. */ + rtx lowpart = gen_lowpart (compute_mode, SET_SRC (x)); + if (GET_CODE (lowpart) == CLOBBER) + break; + /* Now compute the equivalent expression. Make a copy of INNER for the SET_DEST in case it is a MEM into which we will substitute; we don't want shared RTL in that case. */ @@ -7480,9 +7485,7 @@ expand_field_assignment (const_rtx x) inner); masked = simplify_gen_binary (ASHIFT, compute_mode, simplify_gen_binary ( - AND, compute_mode, - gen_lowpart (compute_mode, SET_SRC (x)), - mask), + AND, compute_mode, lowpart, mask), pos); x = gen_rtx_SET (copy_rtx (inner), diff --git a/gcc/combine.cc b/gcc/combine.cc index 6344cd3..969eb9d 100644 --- a/gcc/combine.cc +++ b/gcc/combine.cc @@ -5157,36 +5157,37 @@ find_split_point (rtx *loc, rtx_insn *insn, bool set_src) always at least get 8-bit constants in an AND insn, which is true for every current RISC. 
*/ - if (unsignedp && len <= 8) + rtx lowpart = gen_lowpart (mode, inner); + if (lowpart && GET_CODE (lowpart) != CLOBBER) { - unsigned HOST_WIDE_INT mask - = (HOST_WIDE_INT_1U << len) - 1; - rtx pos_rtx = gen_int_shift_amount (mode, pos); - SUBST (SET_SRC (x), -gen_rtx_AND (mode, - gen_rtx_LSHIFTRT - (mode, gen_lowpart (mode, inner), pos_rtx), - gen_int_mode (mask, mode))); - - split = fin
[x86 PATCH] Improve reg pressure of double-word right-shift then truncate.
This patch improves register pressure during reload, inspired by PR 97756. Normally, a double-word right-shift by a constant produces a double-word result, the highpart of which is dead when followed by a truncation. The dead code calculating the high part gets cleaned up post-reload, so the issue isn't normally visible, except for the increased register pressure during reload, sometimes leading to odd register assignments. Providing a post-reload splitter, which clobbers a single wordmode result register instead of a doubleword result register, helps (a bit).

An example demonstrating this effect is:

#define MASK60 ((1ul << 60) - 1)
unsigned long foo (__uint128_t n)
{
  unsigned long a = n & MASK60;
  unsigned long b = (n >> 60);
  b = b & MASK60;
  unsigned long c = (n >> 120);
  return a+b+c;
}

which currently with -O2 generates (13 instructions):

foo:    movabsq $1152921504606846975, %rcx
        xchgq   %rdi, %rsi
        movq    %rsi, %rax
        shrdq   $60, %rdi, %rax
        movq    %rax, %rdx
        movq    %rsi, %rax
        movq    %rdi, %rsi
        andq    %rcx, %rax
        shrq    $56, %rsi
        andq    %rcx, %rdx
        addq    %rsi, %rax
        addq    %rdx, %rax
        ret

with this patch, we generate one less mov (12 instructions):

foo:    movabsq $1152921504606846975, %rcx
        xchgq   %rdi, %rsi
        movq    %rdi, %rdx
        movq    %rsi, %rax
        movq    %rdi, %rsi
        shrdq   $60, %rdi, %rdx
        andq    %rcx, %rax
        shrq    $56, %rsi
        addq    %rsi, %rax
        andq    %rcx, %rdx
        addq    %rdx, %rax
        ret

The significant difference is easier to see via diff:

< shrdq   $60, %rdi, %rax
< movq    %rax, %rdx
---
> shrdq   $60, %rdi, %rdx

Admittedly a single "mov" isn't much of a saving on modern architectures, but as demonstrated by the PR, people still track the number of them.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, with no new failures. Ok for mainline?

2023-11-12  Roger Sayle

gcc/ChangeLog
	* config/i386/i386.md (3_doubleword_lowpart): New
	define_insn_and_split to optimize register usage of doubleword
	right shifts followed by truncation.
Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 663db73..8a6928f 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -14833,6 +14833,31 @@ [(const_int 0)] "ix86_split_ (operands, operands[3], mode); DONE;") +;; Split truncations of TImode right shifts into x86_64_shrd_1. +;; Split truncations of DImode right shifts into x86_shrd_1. +(define_insn_and_split "3_doubleword_lowpart" + [(set (match_operand:DWIH 0 "register_operand" "=") + (subreg:DWIH + (any_shiftrt: (match_operand: 1 "register_operand" "r") +(match_operand:QI 2 "const_int_operand")) 0)) + (clobber (reg:CC FLAGS_REG))] + "UINTVAL (operands[2]) < * BITS_PER_UNIT" + "#" + "&& reload_completed" + [(parallel + [(set (match_dup 0) + (ior:DWIH (lshiftrt:DWIH (match_dup 0) (match_dup 2)) + (subreg:DWIH + (ashift: (zero_extend: (match_dup 3)) + (match_dup 4)) 0))) + (clobber (reg:CC FLAGS_REG))])] +{ + split_double_mode (mode, [1], 1, [1], [3]); + operands[4] = GEN_INT (( * BITS_PER_UNIT) - INTVAL (operands[2])); + if (!rtx_equal_p (operands[0], operands[3])) +emit_move_insn (operands[0], operands[3]); +}) + (define_insn "x86_64_shrd" [(set (match_operand:DI 0 "nonimmediate_operand" "+r*m") (ior:DI (lshiftrt:DI (match_dup 0)
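The split above relies on the identity that the low word of a double-word right shift depends only on the two input words combined SHRD-style, so no doubleword result register is needed. A minimal C sketch of that identity (the helper name `shrd64` is mine, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Low 64 bits of a 128-bit right shift by c (0 < c < 64),
   computed SHRD-style from the two 64-bit halves only.  */
static inline uint64_t
shrd64 (uint64_t lo, uint64_t hi, unsigned c)
{
  return (lo >> c) | (hi << (64 - c));
}
```

This is exactly what the x86 shrd instruction computes, and why the new pattern can clobber a single wordmode register.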
[ARC PATCH] Consistent use of whitespace in assembler templates.
This minor clean-up patch tweaks arc.md to use whitespace consistently in output templates, always using a TAB between the mnemonic and its operands, and avoiding spaces after commas between operands. There should be no functional changes with this patch, though several test cases' scan-assembler needed to be updated to use \s+ instead of testing for a TAB or a space explicitly. Tested with a cross-compiler to arc-linux hosted on x86_64, with no new (compile-only) regressions from make -k check. Ok for mainline if this passes Claudiu's nightly testing? 2023-11-06 Roger Sayle gcc/ChangeLog * config/arc/arc.md: Make output template whitespace consistent. gcc/testsuite/ChangeLog * gcc.target/arc/jli-1.c: Update dg-final whitespace. * gcc.target/arc/jli-2.c: Likewise. * gcc.target/arc/naked-1.c: Likewise. * gcc.target/arc/naked-2.c: Likewise. * gcc.target/arc/tmac-1.c: Likewise. * gcc.target/arc/tmac-2.c: Likewise. Thanks again, Roger -- diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md index 7702978..846aa32 100644 --- a/gcc/config/arc/arc.md +++ b/gcc/config/arc/arc.md @@ -669,26 +669,26 @@ archs4x, archs4xd" || (satisfies_constraint_Cm3 (operands[1]) && memory_operand (operands[0], QImode))" "@ - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - ldb%? %0,%1 - stb%? %1,%0 - ldb%? 
%0,%1 - xldb%U1 %0,%1 - ldb%U1%V1 %0,%1 - xstb%U0 %1,%0 - stb%U0%V0 %1,%0 - stb%U0%V0 %1,%0 - stb%U0%V0 %1,%0 - stb%U0%V0 %1,%0" + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + ldb%?\\t%0,%1 + stb%?\\t%1,%0 + ldb%?\\t%0,%1 + xldb%U1\\t%0,%1 + ldb%U1%V1\\t%0,%1 + xstb%U0\\t%1,%0 + stb%U0%V0\\t%1,%0 + stb%U0%V0\\t%1,%0 + stb%U0%V0\\t%1,%0 + stb%U0%V0\\t%1,%0" [(set_attr "type" "move,move,move,move,move,move,move,move,move,move,load,store,load,load,load,store,store,store,store,store") (set_attr "iscompact" "maybe,maybe,maybe,true,true,false,false,false,maybe_limm,false,true,true,true,false,false,false,false,false,false,false") (set_attr "predicable" "yes,no,yes,no,no,yes,no,yes,yes,yes,no,no,no,no,no,no,no,no,no,no") @@ -713,26 +713,26 @@ archs4x, archs4xd" || (satisfies_constraint_Cm3 (operands[1]) && memory_operand (operands[0], HImode))" "@ - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - ld%_%? %0,%1 - st%_%? 
%1,%0 - xld%_%U1 %0,%1 - ld%_%U1%V1 %0,%1 - xst%_%U0 %1,%0 - st%_%U0%V0 %1,%0 - st%_%U0%V0 %1,%0 - st%_%U0%V0 %1,%0 - st%_%U0%V0 %1,%0" + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + ld%_%?\\t%0,%1 + st%_%?\\t%1,%0 + xld%_%U1\\t%0,%1 + ld%_%U1%V1\\t%0,%1 + xst%_%U0\\t%1,%0 + st%_%U0%V0\\t%1,%0 + st%_%U0%V0\\t%1,%0 + st%_%U0%V0\\t%1,%0 + st%_%U0%V0\\t%1,%0" [(set_attr "type" "move,move,move,move,move,move,move,move,move,move,move,load,store,load,load,store,store,store,store,store") (set_attr "iscompact" "maybe,maybe,maybe,true,true,false,false,false,maybe_limm,maybe_limm,false,true,true,false,false,false,false,false,false,false") (set_attr "predicable" "yes,no,yes,no,no,yes,no,yes,yes,yes,yes,no,no,no,no,no,no,no,no,no") @@ -818,7 +818,7 @@ archs4x, archs4xd" (plus:SI (reg:SI SP_REG) (match_operand 1 "immediate_operand" "Cal")] "reload_completed" - "ld.a %0,[sp,%1]" + "ld.a\\t%0,[sp,%1]" [(set_attr "type" "load") (set_attr "length" "8")]) @@ -830,7 +830,7 @@ archs4x, archs4xd" (unspec:SI [(match_operand:SI 1 "register_operand" "c")] UNSPEC_ARC_DIRECT))] "" - "st%U0 %1,%0\;st%U0.di %1,%0" + "st%U0\\t%1,%0\;st%U0.di\\t%1,%0" [(set_attr "type" "store")]) ;; Combiner patterns for compare with zero @@ -944,7 +944,7 @@ archs4x, archs4xd" (set (match_operand:SI 0 "register_operand" "=w") (match_dup 3))] "" - "%O3.f %0,%1" + "%O3.f\\t%0,%1" [(set_attr "type" "compare") (set_attr "cond" "set_zn") (set_attr "length" "4")]) @@ -987,15 +
[ARC PATCH] Improved DImode rotates and right shifts by one bit.
This patch improves the code generated for DImode right shifts (both arithmetic and logical) by a single bit, and also for DImode rotates (both left and right) by a single bit. In approach, this is similar to the recently added DImode left shift by a single bit patch, but also builds upon i386.md's UNSPEC carry flag representation: https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632169.html

The benefits can be seen from the four new test cases:

long long ashr(long long x) { return x >> 1; }

Before:
ashr:	asl	r2,r1,31
	lsr_s	r0,r0
	or_s	r0,r0,r2
	j_s.d	[blink]
	asr_s	r1,r1,1

After:
ashr:	asr.f	r1,r1
	j_s.d	[blink]
	rrc	r0,r0

unsigned long long lshr(unsigned long long x) { return x >> 1; }

Before:
lshr:	asl	r2,r1,31
	lsr_s	r0,r0
	or_s	r0,r0,r2
	j_s.d	[blink]
	lsr_s	r1,r1

After:
lshr:	lsr.f	r1,r1
	j_s.d	[blink]
	rrc	r0,r0

unsigned long long rotl(unsigned long long x) { return (x<<1) | (x>>63); }

Before:
rotl:	lsr	r12,r1,31
	lsr	r2,r0,31
	asl_s	r3,r0,1
	asl_s	r1,r1,1
	or	r0,r12,r3
	j_s.d	[blink]
	or_s	r1,r1,r2

After:
rotl:	add.f	r0,r0,r0
	adc.f	r1,r1,r1
	j_s.d	[blink]
	add.cs	r0,r0,1

unsigned long long rotr(unsigned long long x) { return (x>>1) | (x<<63); }

Before:
rotr:	asl	r12,r1,31
	asl	r2,r0,31
	lsr_s	r3,r0
	lsr_s	r1,r1
	or	r0,r12,r3
	j_s.d	[blink]
	or_s	r1,r1,r2

After:
rotr:	asr.f	0,r0
	rrc.f	r1,r1
	j_s.d	[blink]
	rrc	r0,r0

On CPUs without a barrel shifter the improvements are even better. Tested with a cross-compiler to arc-linux hosted on x86_64, with no new (compile-only) regressions from make -k check. Ok for mainline if this passes Claudiu's nightly testing?

2023-11-06  Roger Sayle

gcc/ChangeLog
	* config/arc/arc.md (UNSPEC_ARC_CC_NEZ): New UNSPEC that
	represents the carry flag being set if the operand is non-zero.
	(adc_f): New define_insn representing adc with updated flags.
	(ashrdi3): New define_expand that only handles shifts by 1.
	(ashrdi3_cnt1): New pre-reload define_insn_and_split.
	(lshrdi3): New define_expand that only handles shifts by 1.
(lshrdi3_cnt1): New pre-reload define_insn_and_split. (rrcsi2): New define_insn for rrc (SImode rotate right through carry). (rrcsi2_carry): Likewise for rrc.f, as above but updating flags. (rotldi3): New define_expand that only handles rotates by 1. (rotldi3_cnt1): New pre-reload define_insn_and_split. (rotrdi3): New define_expand that only handles rotates by 1. (rotrdi3_cnt1): New pre-reload define_insn_and_split. (lshrsi3_cnt1_carry): New define_insn for lsr.f. (ashrsi3_cnt1_carry): New define_insn for asr.f. (btst_0_carry): New define_insn for asr.f without result. gcc/testsuite/ChangeLog * gcc.target/arc/ashrdi3-1.c: New test case. * gcc.target/arc/lshrdi3-1.c: Likewise. * gcc.target/arc/rotldi3-1.c: Likewise. * gcc.target/arc/rotrdi3-1.c: Likewise. Thanks in advance, Roger -- diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md index 7702978..97231b9 100644 --- a/gcc/config/arc/arc.md +++ b/gcc/config/arc/arc.md @@ -137,6 +137,7 @@ UNSPEC_ARC_VMAC2HU UNSPEC_ARC_VMPY2H UNSPEC_ARC_VMPY2HU + UNSPEC_ARC_CC_NEZ VUNSPEC_ARC_RTIE VUNSPEC_ARC_SYNC @@ -2790,6 +2791,31 @@ archs4x, archs4xd" (set_attr "type" "cc_arith") (set_attr "length" "4,4,4,4,8,8")]) +(define_insn "adc_f" + [(set (reg:CC_C CC_REG) + (compare:CC_C + (zero_extend:DI + (plus:SI + (plus:SI + (ltu:SI (reg:CC_C CC_REG) (const_int 0)) + (match_operand:SI 1 "register_operand" "%r")) + (match_operand:SI 2 "register_operand" "r"))) + (plus:DI + (ltu:DI (reg:CC_C CC_REG) (const_int 0)) + (zero_extend:DI (match_dup 1) + (set (match_operand:SI 0 "register_operand" "=r") + (plus:SI + (plus:SI + (ltu:SI (reg:CC_C CC_REG) (const_int 0)) + (match_dup 1)) + (match_dup 2)))] + "" + "adc.f\\t%0,%1,%2" + [(set_attr "cond" "set") + (set_attr "predicable" "no") + (set_attr "type" "cc_arith") + (set_attr "length" "4")]) + ; combiner-splitter cmp / scc -> cmp / adc (define_split [(set (match_operand:SI 0 "dest_reg_operand" "") @@ -3530,6 +3556,68 @@ archs4x, archs4xd" "" [(set_attr "length" "8")
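The two-instruction sequences work because asr.f shifts the high word while capturing its low bit in the carry flag, and rrc then rotates that carry into the top of the low word. A C model of the same data flow, with an explicit carry variable standing in for the ARC carry flag (the helper name is mine; signed >> is assumed arithmetic, as GCC guarantees):

```c
#include <assert.h>
#include <stdint.h>

/* Emulate "asr.f hi ; rrc lo": 64-bit arithmetic right shift by 1
   performed on 32-bit halves via an explicit carry bit.  */
static inline int64_t
ashr64_1 (int64_t x)
{
  uint32_t lo = (uint32_t) x;
  int32_t hi = (int32_t) (x >> 32);
  unsigned carry = hi & 1;                     /* asr.f sets C from bit 0 */
  hi >>= 1;                                    /* arithmetic shift of highpart */
  lo = (lo >> 1) | ((uint32_t) carry << 31);   /* rrc: carry -> bit 31 */
  return (int64_t) (((uint64_t) (uint32_t) hi << 32) | lo);
}
```

The lshr and rotr sequences differ only in how the first instruction treats the sign bit and whether the final carry feeds back into the high word.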
[ARC PATCH] Provide a TARGET_FOLD_BUILTIN target hook.
This patch implements an arc_fold_builtin target hook to allow ARC builtins to be folded at the tree-level. Currently this function converts __builtin_arc_swap into a LROTATE_EXPR at the tree-level, and evaluates __builtin_arc_norm and __builtin_arc_normw of integer constant arguments at compile-time. Because ARC_BUILTIN_SWAP is now handled at the tree-level, UNSPEC_ARC_SWAP is no longer used, allowing it and the "swap" define_insn to be removed. An example benefit of folding things at compile-time is that calling __builtin_arc_swap on the result of __builtin_arc_swap now eliminates both and generates no code, and likewise calling __builtin_arc_swap on a constant integer argument is evaluated at compile-time. Tested with a cross-compiler to arc-linux hosted on x86_64, with no new (compile-only) regressions from make -k check. Ok for mainline if this passes Claudiu's nightly testing?

2023-11-03  Roger Sayle

gcc/ChangeLog
	* config/arc/arc.cc (TARGET_FOLD_BUILTIN): Define to arc_fold_builtin.
	(arc_fold_builtin): New function.  Convert ARC_BUILTIN_SWAP into a
	rotate.  Evaluate ARC_BUILTIN_NORM and ARC_BUILTIN_NORMW of constant
	arguments.
	* config/arc/arc.md (UNSPEC_ARC_SWAP): Delete.
	(normw): Make output template/assembler whitespace consistent.
	(swap): Remove define_insn, only use of SWAP UNSPEC.
	* config/arc/builtins.def: Tweak indentation.
	(SWAP): Expand using rotlsi2_cnt16 instead of using swap.

gcc/testsuite/ChangeLog
	* gcc.target/arc/builtin_norm-1.c: New test case.
	* gcc.target/arc/builtin_norm-2.c: Likewise.
	* gcc.target/arc/builtin_normw-1.c: Likewise.
	* gcc.target/arc/builtin_normw-2.c: Likewise.
	* gcc.target/arc/builtin_swap-1.c: Likewise.
	* gcc.target/arc/builtin_swap-2.c: Likewise.
	* gcc.target/arc/builtin_swap-3.c: Likewise.
Thanks in advance, Roger -- diff --git a/gcc/config/arc/arc.cc b/gcc/config/arc/arc.cc index e209ad2..70ee410 100644 --- a/gcc/config/arc/arc.cc +++ b/gcc/config/arc/arc.cc @@ -643,6 +643,9 @@ static rtx arc_legitimize_address_0 (rtx, rtx, machine_mode mode); #undef TARGET_EXPAND_BUILTIN #define TARGET_EXPAND_BUILTIN arc_expand_builtin +#undef TARGET_FOLD_BUILTIN +#define TARGET_FOLD_BUILTIN arc_fold_builtin + #undef TARGET_BUILTIN_DECL #define TARGET_BUILTIN_DECL arc_builtin_decl @@ -7048,6 +7051,46 @@ arc_expand_builtin (tree exp, return const0_rtx; } +/* Implement TARGET_FOLD_BUILTIN. */ + +static tree +arc_fold_builtin (tree fndecl, int n_args ATTRIBUTE_UNUSED, tree *arg, + bool ignore ATTRIBUTE_UNUSED) +{ + unsigned int fcode = DECL_MD_FUNCTION_CODE (fndecl); + + switch (fcode) +{ +default: + break; + +case ARC_BUILTIN_SWAP: + return fold_build2 (LROTATE_EXPR, integer_type_node, arg[0], + build_int_cst (integer_type_node, 16)); + +case ARC_BUILTIN_NORM: + if (TREE_CODE (arg[0]) == INTEGER_CST + && !TREE_OVERFLOW (arg[0])) + { + wide_int arg0 = wi::to_wide (arg[0], 32); + wide_int result = wi::shwi (wi::clrsb (arg0), 32); + return wide_int_to_tree (integer_type_node, result); + } + break; + +case ARC_BUILTIN_NORMW: + if (TREE_CODE (arg[0]) == INTEGER_CST + && !TREE_OVERFLOW (arg[0])) + { + wide_int arg0 = wi::to_wide (arg[0], 16); + wide_int result = wi::shwi (wi::clrsb (arg0), 32); + return wide_int_to_tree (integer_type_node, result); + } + break; +} + return NULL_TREE; +} + /* Returns true if the operands[opno] is a valid compile-time constant to be used as register number in the code for builtins. Else it flags an error and returns false. 
*/ diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md index 96ff62d..9e81d13 100644 --- a/gcc/config/arc/arc.md +++ b/gcc/config/arc/arc.md @@ -116,7 +116,6 @@ UNSPEC_TLS_OFF UNSPEC_ARC_NORM UNSPEC_ARC_NORMW - UNSPEC_ARC_SWAP UNSPEC_ARC_DIVAW UNSPEC_ARC_DIRECT UNSPEC_ARC_LP @@ -4355,8 +4354,8 @@ archs4x, archs4xd" (clrsb:HI (match_operand:HI 1 "general_operand" "cL,Cal"] "TARGET_NORM" "@ - norm%_ \t%0, %1 - norm%_ \t%0, %1" + norm%_\\t%0,%1 + norm%_\\t%0,%1" [(set_attr "length" "4,8") (set_attr "type" "two_cycle_core,two_cycle_core")]) @@ -4453,18 +4452,6 @@ archs4x, archs4xd" [(set_attr "type" "unary") (set_attr "length" "20")]) -(define_insn "swap" - [(set (match_operand:SI 0 "dest_reg_operand" "=w,w,w") - (unspec:SI [(match_operand:SI 1 "general_operand" "L,Cal,c")] - UNSPEC_ARC_SWAP))] - "TARGET_SWAP" - "@ -
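The folds above have simple portable equivalents: __builtin_arc_swap is a rotate of the two 16-bit halves (hence LROTATE_EXPR by 16), and norm counts redundant leading sign bits, which is what wi::clrsb computes. A C sketch of the folded semantics (the function names are mine; this is not the GCC source):

```c
#include <assert.h>
#include <stdint.h>

/* swap: rotate a 32-bit value left by 16, exchanging its halves.
   Applying it twice is the identity, which is why swap(swap(x))
   folds to no code at all.  */
static inline uint32_t
arc_swap (uint32_t x)
{
  return (x << 16) | (x >> 16);
}

/* norm: number of redundant sign bits below the sign bit (clrsb).  */
static inline int
arc_norm (int32_t x)
{
  uint32_t u = (uint32_t) x;
  int n = 0;
  /* Count how many leading bits merely copy the sign bit.  */
  while (n < 31 && (((u >> 30) ^ (u >> 31)) & 1) == 0)
    {
      u <<= 1;
      n++;
    }
  return n;
}
```

With these definitions, constant arguments fold exactly as the new hook does, e.g. arc_norm (0) and arc_norm (-1) are both 31.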
[AVR PATCH] Improvements to SImode and PSImode shifts by constants.
This patch provides non-looping implementations for more SImode (32-bit) and PSImode (24-bit) shifts on AVR. For most cases, these are shorter and faster than using a loop, but for a few (controlled by optimize_size) they are a little larger but significantly faster. The approach is to perform byte-based shifts by 1, 2 or 3 bytes, followed by bit-based shifts (effectively in a narrower type) for the remaining bits, beyond 8, 16 or 24. For example, the simple test case below (inspired by PR 112268):

unsigned long foo(unsigned long x)
{
  return x >> 26;
}

gcc -O2 currently generates:

foo:	ldi r18,26
1:	lsr r25
	ror r24
	ror r23
	ror r22
	dec r18
	brne 1b
	ret

which is 8 instructions, and takes ~158 cycles. With this patch, we now generate:

foo:	mov r22,r25
	clr r23
	clr r24
	clr r25
	lsr r22
	lsr r22
	ret

which is 7 instructions, and takes ~7 cycles. One complication is that the modified functions sometimes use spaces instead of TABs, with occasional mistakes in GNU-style formatting, so I've fixed these indentation/whitespace issues. There's no change in the code for the cases previously handled/special-cased, with the exception of ashrqi3 reg,5 where with -Os a (4-instruction) loop is shorter than the five single-bit shifts of a fully unrolled implementation. This patch has been (partially) tested with a cross-compiler to avr-elf hosted on x86_64, without a simulator, where the compile-only tests in the gcc testsuite show no regressions. If someone could test this more thoroughly that would be great.

2023-11-02  Roger Sayle

gcc/ChangeLog
	* config/avr/avr.cc (ashlqi3_out): Fix indentation whitespace.
	(ashlhi3_out): Likewise.
	(avr_out_ashlpsi3): Likewise.  Handle shifts by 9 and 17-22.
	(ashlsi3_out): Fix formatting.  Handle shifts by 9 and 25-30.
	(ashrqi3_out): Use loop for shifts by 5 when optimizing for size.
	Fix indentation whitespace.
	(ashrhi3_out): Likewise.
	(avr_out_ashrpsi3): Likewise.  Handle shifts by 17.
	(ashrsi3_out): Fix indentation.  Handle shifts by 17 and 25.
(lshrqi3_out): Fix whitespace. (lshrhi3_out): Likewise. (avr_out_lshrpsi3): Likewise. Handle shifts by 9 and 17-22. (lshrsi3_out): Fix indentation. Handle shifts by 9,17,18 and 25-30. gcc/testsuite/ChangeLog * gcc.target/avr/ashlsi-1.c: New test case. * gcc.target/avr/ashlsi-2.c: Likewise. * gcc.target/avr/ashrsi-1.c: Likewise. * gcc.target/avr/ashrsi-2.c: Likewise. * gcc.target/avr/lshrsi-1.c: Likewise. * gcc.target/avr/lshrsi-2.c: Likewise. Thanks in advance, Roger -- diff --git a/gcc/config/avr/avr.cc b/gcc/config/avr/avr.cc index 5e0217de36fc..706599b4aa6a 100644 --- a/gcc/config/avr/avr.cc +++ b/gcc/config/avr/avr.cc @@ -6715,7 +6715,7 @@ ashlqi3_out (rtx_insn *insn, rtx operands[], int *len) fatal_insn ("internal compiler error. Incorrect shift:", insn); out_shift_with_cnt ("lsl %0", - insn, operands, len, 1); + insn, operands, len, 1); return ""; } @@ -6728,8 +6728,8 @@ ashlhi3_out (rtx_insn *insn, rtx operands[], int *len) if (CONST_INT_P (operands[2])) { int scratch = (GET_CODE (PATTERN (insn)) == PARALLEL - && XVECLEN (PATTERN (insn), 0) == 3 - && REG_P (operands[3])); +&& XVECLEN (PATTERN (insn), 0) == 3 +&& REG_P (operands[3])); int ldi_ok = test_hard_reg_class (LD_REGS, operands[0]); int k; int *t = len; @@ -6826,8 +6826,9 @@ ashlhi3_out (rtx_insn *insn, rtx operands[], int *len) "ror %A0"); case 8: - return *len = 2, ("mov %B0,%A1" CR_TAB - "clr %A0"); + *len = 2; + return ("mov %B0,%A1" CR_TAB + "clr %A0"); case 9: *len = 3; @@ -6974,7 +6975,7 @@ ashlhi3_out (rtx_insn *insn, rtx operands[], int *len) len = t; } out_shift_with_cnt ("lsl %A0" CR_TAB - "rol %B0", insn, operands, len, 2); + "rol %B0", insn, operands, len, 2); return ""; } @@ -6990,54 +6991,126 @@ avr_out_ashlpsi3 (rtx_insn *insn, rtx *op, int *plen) if (CONST_INT_P (op[2])) { switch (INTVAL (op[2])) -{ -default: - if (INTVAL (op[2]) < 24) -break; + { + default: + if (INTVAL (op[2]) < 24) + break; - return avr_asm_len ("clr %A0" CR_TAB - "clr %B0" CR_TAB - "clr %C0", op, plen, 3); + 
return avr_a
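The strategy the patch describes, shift by whole bytes via register moves and then finish with short bit shifts, can be sketched in C (the helper name `lshr32` and the byte-array framing are mine, purely illustrative of the decomposition):

```c
#include <assert.h>
#include <stdint.h>

/* Decompose a 32-bit logical right shift into a whole-byte move
   (the AVR mov/clr part) plus a short remaining bit shift (the
   lsr/ror part), mirroring the byte-based AVR strategy.  */
static inline uint32_t
lshr32 (uint32_t x, unsigned count)
{
  unsigned bytes = count / 8;       /* handled by register moves */
  unsigned bits = count % 8;        /* handled by 1-bit shifts */
  uint8_t b[4] = { (uint8_t) x, (uint8_t) (x >> 8),
                   (uint8_t) (x >> 16), (uint8_t) (x >> 24) };
  uint8_t r[4] = { 0, 0, 0, 0 };    /* clr the vacated bytes */
  for (unsigned i = 0; i + bytes < 4; i++)
    r[i] = b[i + bytes];            /* mov rN,rM */
  uint32_t y = r[0] | ((uint32_t) r[1] << 8)
               | ((uint32_t) r[2] << 16) | ((uint32_t) r[3] << 24);
  return y >> bits;                 /* remaining short shift */
}
```

For x >> 26 this is one byte move, three clears and two single-bit shifts, matching the 7-instruction sequence shown above.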
[AVR PATCH] Optimize (X>>C)&1 for C in [1, 4, 8, 16, 24] in *insv.any_shift..
This patch optimizes a few special cases in avr.md's *insv.any_shift. instruction. This template handles tests for a single bit, where the result has only a (possibly different) single bit set. Usually (currently) this always requires a three-instruction sequence of a BST, a CLR and a BLD (plus any additional CLR instructions to clear the rest of the result bytes). The special cases considered here are those that can be done with only two instructions (plus CLRs); an ANDI preceded by either a MOV, a SHIFT or a SWAP. Hence for C=1 in HImode, GCC with -O2 currently generates:

	bst r24,1
	clr r24
	clr r25
	bld r24,0

with this patch, we now generate:

	lsr r24
	andi r24,1
	clr r25

Likewise, HImode C=4 now becomes:

	swap r24
	andi r24,1
	clr r25

and SImode C=8 now becomes:

	mov r22,r23
	andi r22,1
	clr r23
	clr r24
	clr r25

I've not attempted to model the instruction length accurately for these special cases; the logic would be ugly, but it's safe to use the current (1 insn longer) length. This patch has been (partially) tested with a cross-compiler to avr-elf hosted on x86_64, without a simulator, where the compile-only tests in the gcc testsuite show no regressions. If someone could test this more thoroughly that would be great.

2023-11-02  Roger Sayle

gcc/ChangeLog
	* config/avr/avr.md (*insv.any_shift.): Optimize special cases
	of *insv.any_shift that save one instruction by using ANDI with
	either a MOV, a SHIFT or a SWAP.

gcc/testsuite/ChangeLog
	* gcc.target/avr/insvhi-1.c: New HImode test case.
	* gcc.target/avr/insvhi-2.c: Likewise.
	* gcc.target/avr/insvhi-3.c: Likewise.
	* gcc.target/avr/insvhi-4.c: Likewise.
	* gcc.target/avr/insvhi-5.c: Likewise.
	* gcc.target/avr/insvqi-1.c: New QImode test case.
	* gcc.target/avr/insvqi-2.c: Likewise.
	* gcc.target/avr/insvqi-3.c: Likewise.
	* gcc.target/avr/insvqi-4.c: Likewise.
	* gcc.target/avr/insvsi-1.c: New SImode test case.
	* gcc.target/avr/insvsi-2.c: Likewise.
	* gcc.target/avr/insvsi-3.c: Likewise.
	* gcc.target/avr/insvsi-4.c: Likewise.
* gcc.target/avr/insvsi-5.c: Likewise. * gcc.target/avr/insvsi-6.c: Likewise. Thanks in advance, Roger -- diff --git a/gcc/config/avr/avr.md b/gcc/config/avr/avr.md index 83dd15040b07..c2a1931733f8 100644 --- a/gcc/config/avr/avr.md +++ b/gcc/config/avr/avr.md @@ -9840,6 +9840,7 @@ (clobber (reg:CC REG_CC))] "reload_completed" { +int ldi_ok = test_hard_reg_class (LD_REGS, operands[0]); int shift = == ASHIFT ? INTVAL (operands[2]) : -INTVAL (operands[2]); int mask = GET_MODE_MASK (mode) & INTVAL (operands[3]); // Position of the output / input bit, respectively. @@ -9850,6 +9851,217 @@ operands[3] = GEN_INT (obit); operands[2] = GEN_INT (ibit); +/* Special cases requiring MOV to low byte and ANDI. */ +if ((shift & 7) == 0 && ldi_ok) + { + if (IN_RANGE (obit, 0, 7)) + { + if (shift == -8) + { + if ( == 2) + return "mov %A0,%B1\;andi %A0,lo8(1<<%3)\;clr %B0"; + if ( == 3) + return "mov %A0,%B1\;andi %A0,lo8(1<<%3)\;clr %B0\;clr %C0"; + if ( == 4 && !AVR_HAVE_MOVW) + return "mov %A0,%B1\;andi %A0,lo8(1<<%3)\;" +"clr %B0\;clr %C0\;clr %D0"; + } + else if (shift == -16) + { + if ( == 3) + return "mov %A0,%C1\;andi %A0,lo8(1<<%3)\;clr %B0\;clr %C0"; + if ( == 4 && !AVR_HAVE_MOVW) + return "mov %A0,%C1\;andi %A0,lo8(1<<%3)\;" +"clr %B0\;clr %C0\;clr %D0"; + } + else if (shift == -24 && !AVR_HAVE_MOVW) + return "mov %A0,%D1\;andi %A0,lo8(1<<%3)\;" +"clr %B0\;clr %C0\;clr %D0"; + } + + /* Special cases requiring MOV and ANDI. */ + else if (IN_RANGE (obit, 8, 15)) + { + if (shift == 8) + { + if ( == 2) + return "mov %B0,%A1\;andi %B0,lo8(1<<(%3-8))\;clr %A0"; + if ( == 3) + return "mov %B0,%A1\;andi %B0,lo8(1<<(%3-8))\;" +"clr %A0\;clr %C0"; + if ( == 4 && !AVR_HAVE_MOVW) + return "mov %B0,%A1\;andi %B0,lo8(1<<(%3-8))\;" +"clr %A0\;clr %C0\;clr %D0"; + } + else if (shift == -8) + { + if ( == 3) + ret
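The pattern above extracts one bit of the input and deposits it at one bit of the result. The C=1 HImode special case replaces the generic bst/clr/clr/bld sequence with a single-bit shift plus mask. A C model of the equivalence (the helper names are mine, hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Generic form: take bit `ibit` of x and place it at bit `obit`,
   zeroing everything else (the bst/clr/bld sequence).  */
static inline uint16_t
insv_generic (uint16_t x, unsigned ibit, unsigned obit)
{
  return (uint16_t) (((x >> ibit) & 1u) << obit);
}

/* The C=1 HImode special case, "lsr r24 ; andi r24,1 ; clr r25":
   shift the low byte once, mask bit 0, clear the high byte.  */
static inline uint16_t
insv_lsr_andi (uint16_t x)
{
  uint8_t lo = (uint8_t) x;
  lo >>= 1;          /* lsr r24 */
  lo &= 1;           /* andi r24,1 */
  return lo;         /* clr r25: high byte already zero */
}
```

The SWAP variant works the same way: a nibble swap repositions bit 4 to bit 0 before the ANDI mask.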
RE: [x86_64 PATCH] PR target/110551: Tweak mulx register allocation using peephole2.
Hi Uros, > From: Uros Bizjak > Sent: 01 November 2023 10:05 > Subject: Re: [x86_64 PATCH] PR target/110551: Tweak mulx register allocation > using peephole2. > > On Mon, Oct 30, 2023 at 6:27 PM Roger Sayle > wrote: > > > > > > This patch is a follow-up to my previous PR target/110551 patch, this > > time to address the additional move after mulx, seen on TARGET_BMI2 > > architectures (such as -march=haswell). The complication here is that > > the flexible multiple-set mulx instruction is introduced into RTL > > after reload, by split2, and therefore can't benefit from register > > preferencing. This results in RTL like the following: > > > > (insn 32 31 17 2 (parallel [ > > (set (reg:DI 4 si [orig:101 r ] [101]) > > (mult:DI (reg:DI 1 dx [109]) > > (reg:DI 5 di [109]))) > > (set (reg:DI 5 di [ r+8 ]) > > (umul_highpart:DI (reg:DI 1 dx [109]) > > (reg:DI 5 di [109]))) > > ]) "pr110551-2.c":8:17 -1 > > (nil)) > > > > (insn 17 32 9 2 (set (reg:DI 0 ax [107]) > > (reg:DI 5 di [ r+8 ])) "pr110551-2.c":9:40 90 {*movdi_internal} > > (expr_list:REG_DEAD (reg:DI 5 di [ r+8 ]) > > (nil))) > > > > Here insn 32, the mulx instruction, places its results in si and di, > > and then immediately after decides to move di to ax, with di now dead. > > This can be trivially cleaned up by a peephole2. I've added an > > additional constraint that the two SET_DESTs can't be the same > > register to avoid confusing the middle-end, but this has well-defined > > behaviour on x86_64/BMI2, encoding a umul_highpart. 
> > > > For the new test case, compiled on x86_64 with -O2 -march=haswell: > > > > Before: > > mulx64: movabsq $-7046029254386353131, %rdx > > mulx%rdi, %rsi, %rdi > > movq%rdi, %rax > > xorq%rsi, %rax > > ret > > > > After: > > mulx64: movabsq $-7046029254386353131, %rdx > > mulx%rdi, %rsi, %rax > > xorq%rsi, %rax > > ret > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline? > > It looks that your previous PR110551 patch regressed -march=cascadelake [1]. > Let's fix these regressions first. > > [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-October/634660.html > > Uros. This patch fixes that "regression". Originally, the test case in PR110551 contained one unnecessary mov on "default" x86_targets, but two extra movs on BMI2 targets, including -march=haswell and -march=cascadelake. The first patch eliminated one of these MOVs, this patch eliminates the second. I'm not sure that you can call it a regression, the added test failed when run with a non-standard -march setting. The good news is that test case doesn't have to be changed with this patch applied, i.e. the correct intended behaviour is no MOVs on all architectures. I'll admit the timing is unusual; I had already written and was regression testing a patch for the BMI2 issue, when the -march=cascadelake regression tester let me know it was required for folks that helpfully run the regression suite with non standard settings. i.e. a long standing bug that wasn't previously tested for by the testsuite. > > 2023-10-30 Roger Sayle > > > > gcc/ChangeLog > > PR target/110551 > > * config/i386/i386.md (*bmi2_umul3_1): Tidy condition > > as operands[2] with predicate register_operand must be !MEM_P. > > (peephole2): Optimize a mulx followed by a register-to-register > > move, to place result in the correct destination if possible. 
> > > > gcc/testsuite/ChangeLog > > PR target/110551 > > * gcc.target/i386/pr110551-2.c: New test case. > > Thanks again, Roger --
[x86_64 PATCH] PR target/110551: Tweak mulx register allocation using peephole2.
This patch is a follow-up to my previous PR target/110551 patch, this time to address the additional move after mulx, seen on TARGET_BMI2 architectures (such as -march=haswell). The complication here is that the flexible multiple-set mulx instruction is introduced into RTL after reload, by split2, and therefore can't benefit from register preferencing. This results in RTL like the following:

(insn 32 31 17 2 (parallel [
            (set (reg:DI 4 si [orig:101 r ] [101])
                (mult:DI (reg:DI 1 dx [109])
                    (reg:DI 5 di [109])))
            (set (reg:DI 5 di [ r+8 ])
                (umul_highpart:DI (reg:DI 1 dx [109])
                    (reg:DI 5 di [109])))
        ]) "pr110551-2.c":8:17 -1
     (nil))

(insn 17 32 9 2 (set (reg:DI 0 ax [107])
        (reg:DI 5 di [ r+8 ])) "pr110551-2.c":9:40 90 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 5 di [ r+8 ])
        (nil)))

Here insn 32, the mulx instruction, places its results in si and di, and then immediately after decides to move di to ax, with di now dead. This can be trivially cleaned up by a peephole2. I've added an additional constraint that the two SET_DESTs can't be the same register to avoid confusing the middle-end, but this has well-defined behaviour on x86_64/BMI2, encoding a umul_highpart.

For the new test case, compiled on x86_64 with -O2 -march=haswell:

Before:
mulx64:	movabsq	$-7046029254386353131, %rdx
	mulx	%rdi, %rsi, %rdi
	movq	%rdi, %rax
	xorq	%rsi, %rax
	ret

After:
mulx64:	movabsq	$-7046029254386353131, %rdx
	mulx	%rdi, %rsi, %rax
	xorq	%rsi, %rax
	ret

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline?

2023-10-30  Roger Sayle

gcc/ChangeLog
	PR target/110551
	* config/i386/i386.md (*bmi2_umul3_1): Tidy condition
	as operands[2] with predicate register_operand must be !MEM_P.
	(peephole2): Optimize a mulx followed by a register-to-register
	move, to place result in the correct destination if possible.
gcc/testsuite/ChangeLog PR target/110551 * gcc.target/i386/pr110551-2.c: New test case. Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index eb4121b..a314f1a 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -9747,13 +9747,37 @@ (match_operand:DWIH 3 "nonimmediate_operand" "rm"))) (set (match_operand:DWIH 1 "register_operand" "=r") (umul_highpart:DWIH (match_dup 2) (match_dup 3)))] - "TARGET_BMI2 - && !(MEM_P (operands[2]) && MEM_P (operands[3]))" + "TARGET_BMI2" "mulx\t{%3, %0, %1|%1, %0, %3}" [(set_attr "type" "imulx") (set_attr "prefix" "vex") (set_attr "mode" "")]) +;; Tweak *bmi2_umul3_1 to eliminate following mov. +(define_peephole2 + [(parallel [(set (match_operand:DWIH 0 "general_reg_operand") + (mult:DWIH (match_operand:DWIH 2 "register_operand") + (match_operand:DWIH 3 "nonimmediate_operand"))) + (set (match_operand:DWIH 1 "general_reg_operand") + (umul_highpart:DWIH (match_dup 2) (match_dup 3)))]) + (set (match_operand:DWIH 4 "general_reg_operand") + (match_operand:DWIH 5 "general_reg_operand"))] + "TARGET_BMI2 + && ((REGNO (operands[5]) == REGNO (operands[0]) +&& REGNO (operands[1]) != REGNO (operands[4])) + || (REGNO (operands[5]) == REGNO (operands[1]) + && REGNO (operands[0]) != REGNO (operands[4]))) + && peep2_reg_dead_p (2, operands[5])" + [(parallel [(set (match_dup 0) (mult:DWIH (match_dup 2) (match_dup 3))) + (set (match_dup 1) + (umul_highpart:DWIH (match_dup 2) (match_dup 3)))])] +{ + if (REGNO (operands[5]) == REGNO (operands[0])) +operands[0] = operands[4]; + else +operands[1] = operands[4]; +}) + (define_insn "*umul3_1" [(set (match_operand: 0 "register_operand" "=r,A") (mult: diff --git a/gcc/testsuite/gcc.target/i386/pr110551-2.c b/gcc/testsuite/gcc.target/i386/pr110551-2.c new file mode 100644 index 000..4936adf --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr110551-2.c @@ -0,0 +1,12 @@ +/* { dg-do compile { target int128 } } */ +/* { dg-options "-O2 -march=haswell" } */ 
+ +typedef unsigned long long uint64_t; + +uint64_t mulx64(uint64_t x) +{ +__uint128_t r = (__uint128_t)x * 0x9E3779B97F4A7C15ull; +return (uint64_t)r ^ (uint64_t)( r >> 64 ); +} + +/* { dg-final { scan-assembler-not "movq" } } */
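mulx produces both halves of the 64x64->128 widening multiply (mult and umul_highpart in the RTL above) in one instruction, with freely chosen destinations. For reference, the high half the peephole2 redirects can be computed portably without a 128-bit type; a sketch using the standard four-partial-product decomposition (the function name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* High 64 bits of the 128-bit product a*b (umul_highpart), built
   from four 32x32->64 partial products with carry propagation.  */
static inline uint64_t
umulh64 (uint64_t a, uint64_t b)
{
  uint64_t a_lo = (uint32_t) a, a_hi = a >> 32;
  uint64_t b_lo = (uint32_t) b, b_hi = b >> 32;
  uint64_t p0 = a_lo * b_lo;
  uint64_t p1 = a_lo * b_hi;
  uint64_t p2 = a_hi * b_lo;
  uint64_t p3 = a_hi * b_hi;
  /* Middle column: carries out of the low 64 bits.  */
  uint64_t mid = (p0 >> 32) + (uint32_t) p1 + (uint32_t) p2;
  return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```

On BMI2 targets a compiler can instead emit a single mulx for both halves, which is exactly what the test case's scan-assembler-not "movq" is checking stays move-free.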
RE: [ARC PATCH] Improve DImode left shift by a single bit.
Hi Jeff, > From: Jeff Law > Sent: 30 October 2023 15:09 > Subject: Re: [ARC PATCH] Improve DImode left shift by a single bit. > > On 10/28/23 07:05, Roger Sayle wrote: > > > > This patch improves the code generated for X << 1 (and for X + X) when > > X is 64-bit DImode, using the same two instruction code sequence used > > for DImode addition. > > > > For the test case: > > > > long long foo(long long x) { return x << 1; } > > > > GCC -O2 currently generates the following code: > > > > foo:lsr r2,r0,31 > > asl_s r1,r1,1 > > asl_s r0,r0,1 > > j_s.d [blink] > > or_sr1,r1,r2 > > > > and on CPU without a barrel shifter, i.e. -mcpu=em > > > > foo:add.f 0,r0,r0 > > asl_s r1,r1 > > rlc r2,0 > > asl_s r0,r0 > > j_s.d [blink] > > or_sr1,r1,r2 > > > > with this patch (both with and without a barrel shifter): > > > > foo:add.f r0,r0,r0 > > j_s.d [blink] > > adc r1,r1,r1 > > > > [For Jeff Law's benefit a similar optimization is also applicable to > > H8300H, that could also use a two instruction sequence (plus rts) but > > currently GCC generates 16 instructions (plus an rts) for foo above.] > > > > Tested with a cross-compiler to arc-linux hosted on x86_64, with no > > new (compile-only) regressions from make -k check. > > Ok for mainline if this passes Claudiu's nightly testing? > WRT H8. Bug filed so we don't lose track of it. We don't have DImode > operations > defined on the H8. First step would be DImode loads/stores and basic > arithmetic. The H8's machine description is impressively well organized. Would it make sense to add a doubleword.md, or should DImode support be added to each of the individual addsub.md, logical.md, shiftrotate.md etc..? The fact that register-to-register moves clobber some of the flags bits must also make reload's task very difficult (impossible?). Cheers, Roger --
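The optimization discussed above maps x << 1 onto the double-word addition x + x: add.f adds the low words and sets the carry flag, adc adds the high words plus that carry. A C model with an explicit carry variable, on 32-bit halves (helper name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Emulate "add.f r0,r0,r0 ; adc r1,r1,r1": 64-bit left shift by 1
   as a double-word add with an explicit carry bit.  */
static inline uint64_t
shl64_1 (uint64_t x)
{
  uint32_t lo = (uint32_t) x;
  uint32_t hi = (uint32_t) (x >> 32);
  uint32_t new_lo = lo + lo;            /* add.f: sum sets carry */
  unsigned carry = new_lo < lo;         /* carry out of the lowpart */
  uint32_t new_hi = hi + hi + carry;    /* adc: carry into highpart */
  return ((uint64_t) new_hi << 32) | new_lo;
}
```

The same two-instruction shape applies on any target with an add-with-carry instruction, which is the point of the H8300 aside.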
[ARC PATCH] Improved ARC rtx_costs/insn_cost for SHIFTs and ROTATEs.
This patch overhauls the ARC backend's insn_cost target hook, and makes some related improvements to rtx_costs, BRANCH_COST, etc. The primary goal is to allow the backend to indicate that shifts and rotates are slow (discouraged) when the CPU doesn't have a barrel shifter. I should also acknowledge Richard Sandiford for inspiring the use of set_cost in this rewrite of arc_insn_cost; this implementation borrows heavily from the target hooks for AArch64 and ARM. The motivating example is derived from PR rtl-optimization/110717.

struct S { int a : 5; };
unsigned int foo (struct S *p)
{
  return p->a;
}

With a barrel shifter, GCC -O2 generates the reasonable:

foo:	ldb_s	r0,[r0]
	asl_s	r0,r0,27
	j_s.d	[blink]
	asr_s	r0,r0,27

What's interesting is that during combine, the middle-end actually has two shifts by three bits, and a sign-extension from QI to SI.

Trying 8, 9 -> 11:
    8: r158:SI=r157:QI#0<<0x3
      REG_DEAD r157:QI
    9: r159:SI=sign_extend(r158:SI#0)
      REG_DEAD r158:SI
   11: r155:SI=r159:SI>>0x3
      REG_DEAD r159:SI

Whilst it's reasonable to simplify this to two shifts by 27 bits when the CPU has a barrel shifter, it's actually a significant pessimization when these shifts are implemented by loops. This combination can be prevented if the backend provides accurate-ish estimates for insn_cost. Previously, without a barrel shifter, GCC -O2 -mcpu=em generates:

foo:	ldb_s	r0,[r0]
	mov	lp_count,27
	lp	2f
	add	r0,r0,r0
	nop
2:	# end single insn loop
	mov	lp_count,27
	lp	2f
	asr	r0,r0
	nop
2:	# end single insn loop
	j_s	[blink]

which contains two loops and requires about ~113 cycles to execute. With this patch to rtx_cost/insn_cost, GCC -O2 -mcpu=em generates:

foo:	ldb_s	r0,[r0]
	mov_s	r2,0	;3
	add3	r0,r2,r0
	sexb_s	r0,r0
	asr_s	r0,r0
	asr_s	r0,r0
	j_s.d	[blink]
	asr_s	r0,r0

which requires only ~6 cycles, for the shorter shifts by 3 and sign extension. Tested with a cross-compiler to arc-linux hosted on x86_64, with no new (compile-only) regressions from make -k check.
Ok for mainline if this passes Claudiu's nightly testing? 2023-10-29 Roger Sayle gcc/ChangeLog * config/arc/arc.cc (arc_rtx_costs): Improve cost estimates. Provide reasonable values for SHIFTS and ROTATES by constant bit counts depending upon TARGET_BARREL_SHIFTER. (arc_insn_cost): Use insn attributes if the instruction is recognized. Avoid calling get_attr_length for type "multi", i.e. define_insn_and_split patterns without explicit type. Fall-back to set_rtx_cost for single_set and pattern_cost otherwise. * config/arc/arc.h (COSTS_N_BYTES): Define helper macro. (BRANCH_COST): Improve/correct definition. (LOGICAL_OP_NON_SHORT_CIRCUIT): Preserve previous behavior. Thanks again, Roger -- diff --git a/gcc/config/arc/arc.cc b/gcc/config/arc/arc.cc index 353ac69..ae83e5e 100644 --- a/gcc/config/arc/arc.cc +++ b/gcc/config/arc/arc.cc @@ -5492,7 +5492,7 @@ arc_rtx_costs (rtx x, machine_mode mode, int outer_code, case CONST: case LABEL_REF: case SYMBOL_REF: - *total = speed ? COSTS_N_INSNS (1) : COSTS_N_INSNS (4); + *total = speed ? COSTS_N_INSNS (1) : COSTS_N_BYTES (4); return true; case CONST_DOUBLE: @@ -5516,26 +5516,32 @@ arc_rtx_costs (rtx x, machine_mode mode, int outer_code, case ASHIFT: case ASHIFTRT: case LSHIFTRT: +case ROTATE: +case ROTATERT: + if (mode == DImode) + return false; if (TARGET_BARREL_SHIFTER) { - if (CONSTANT_P (XEXP (x, 0))) + *total = COSTS_N_INSNS (1); + if (CONSTANT_P (XEXP (x, 1))) { - *total += rtx_cost (XEXP (x, 1), mode, (enum rtx_code) code, + *total += rtx_cost (XEXP (x, 0), mode, (enum rtx_code) code, 0, speed); return true; } - *total = COSTS_N_INSNS (1); } else if (GET_CODE (XEXP (x, 1)) != CONST_INT) - *total = COSTS_N_INSNS (16); + *total = speed ? COSTS_N_INSNS (16) : COSTS_N_INSNS (4); else { - *total = COSTS_N_INSNS (INTVAL (XEXP ((x), 1))); - /* ??? want_to_gcse_p can throw negative shift counts at us, -and then panics when it gets a negative cost as result. -Seen for gcc.c-torture/compile/20020710-1.c -Os . 
*/ - if (*total < 0) - *total = 0; + int n = INTVAL (XEXP (x, 1)) & 31; + if (n < 4) + *total = COSTS_N_INSNS (n); + else + *total = speed ? COSTS_N_INSNS (n + 2) : COSTS_N_INSNS (4); + *total += rtx_cost (XEXP (x, 0), mode, (enum rtx_code) code, +
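The two competing sequences above are easy to cross-check in plain C. This is an illustrative sketch, not code from the patch; the function names are mine, and it assumes (as GCC guarantees in practice) that `>>` on a negative signed value is an arithmetic shift.

```c
#include <assert.h>
#include <stdint.h>

/* Wide-shift extraction of a signed 5-bit field: move the field to the
   top of the word, then arithmetic-shift it back down.  One cheap pair
   of instructions with a barrel shifter, but two 27-iteration loops
   without one.  */
static int32_t extract5_wide (uint8_t byte)
{
  return (int32_t) ((uint32_t) byte << 27) >> 27;
}

/* The cheaper sequence combine sees before simplification: shift left
   by 3 within the byte, sign-extend QI->SI (sexb_s), then arithmetic
   shift back by 3.  */
static int32_t extract5_narrow (uint8_t byte)
{
  int8_t t = (int8_t) (uint8_t) (byte << 3);  /* QI -> SI sign extension */
  return (int32_t) t >> 3;
}
```

With accurate insn_cost values the backend can steer combine toward the narrow form on CPUs without a barrel shifter, since three single-bit shifts plus a sign extension are far cheaper than two 27-bit shift loops.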
[ARC PATCH] Convert (signed<<31)>>31 to -(signed&1) without barrel shifter.
This patch optimizes PR middle-end/101955 for the ARC backend. On ARC CPUs with a barrel shifter, using two shifts is (probably) optimal as: asl_s r0,r0,31 asr_s r0,r0,31 but without a barrel shifter, GCC -O2 -mcpu=em currently generates: and r2,r0,1 ror r2,r2 add.f 0,r2,r2 sbc r0,r0,r0 with this patch, we now generate the smaller, faster and non-flags clobbering: bmsk_s r0,r0,0 neg_s r0,r0 Tested with a cross-compiler to arc-linux hosted on x86_64, with no new (compile-only) regressions from make -k check. Ok for mainline if this passes Claudiu's nightly testing? 2023-10-28 Roger Sayle gcc/ChangeLog PR middle-end/101955 * config/arc/arc.md (*extvsi_1_0): New define_insn_and_split to convert sign extract of the least significant bit into an AND $1 then a NEG when !TARGET_BARREL_SHIFTER. gcc/testsuite/ChangeLog PR middle-end/101955 * gcc.target/arc/pr101955.c: New test case. Thanks again, Roger -- diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md index ee43887..6471344 100644 --- a/gcc/config/arc/arc.md +++ b/gcc/config/arc/arc.md @@ -5873,6 +5873,20 @@ archs4x, archs4xd" (zero_extract:SI (match_dup 1) (match_dup 5) (match_dup 7)))]) (match_dup 1)]) +;; Split sign-extension of single least significant bit as and x,$1;neg x +(define_insn_and_split "*extvsi_1_0" + [(set (match_operand:SI 0 "register_operand" "=r") + (sign_extract:SI (match_operand:SI 1 "register_operand" "0") +(const_int 1) +(const_int 0)))] + "!TARGET_BARREL_SHIFTER" + "#" + "&& 1" + [(set (match_dup 0) (and:SI (match_dup 1) (const_int 1))) + (set (match_dup 0) (neg:SI (match_dup 0)))] + "" + [(set_attr "length" "8")]) + (define_insn_and_split "rotlsi3_cnt1" [(set (match_operand:SI 0 "dest_reg_operand""=r") (rotate:SI (match_operand:SI 1 "register_operand" "r") diff --git a/gcc/testsuite/gcc.target/arc/pr101955.c b/gcc/testsuite/gcc.target/arc/pr101955.c new file mode 100644 index 000..74bca3c --- /dev/null +++ b/gcc/testsuite/gcc.target/arc/pr101955.c @@ -0,0 +1,10 @@ +/* { dg-do compile } 
*/ +/* { dg-options "-O2 -mcpu=em" } */ + +int f(int a) +{ +return (a << 31) >> 31; +} + +/* { dg-final { scan-assembler "msk_s\\s+r0,r0,0" } } */ +/* { dg-final { scan-assembler "neg_s\\s+r0,r0" } } */
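The identity behind this transformation can be sanity-checked in C. A hedged sketch (the helper names are mine, and arithmetic right shift of negative values is assumed, as GCC provides):

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend bit 0 via two 31-bit shifts: cheap with a barrel
   shifter, a flag-clobbering multi-insn dance without one.  */
static int32_t sext_bit0_shifts (int32_t a)
{
  return (int32_t) ((uint32_t) a << 31) >> 31;
}

/* The replacement: mask the least significant bit, then negate,
   i.e. the bmsk_s/neg_s pair the patch generates.  */
static int32_t sext_bit0_andneg (int32_t a)
{
  return -(a & 1);
}
```

Both return -1 when bit 0 is set and 0 otherwise, which is exactly the sign_extract:SI of width 1 at position 0 that the new define_insn_and_split matches.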
[ARC PATCH] Improve DImode left shift by a single bit.
This patch improves the code generated for X << 1 (and for X + X) when X is 64-bit DImode, using the same two-instruction code sequence used for DImode addition. For the test case:

long long foo(long long x) { return x << 1; }

GCC -O2 currently generates the following code:

foo:	lsr	r2,r0,31
	asl_s	r1,r1,1
	asl_s	r0,r0,1
	j_s.d	[blink]
	or_s	r1,r1,r2

and on CPUs without a barrel shifter, i.e. -mcpu=em:

foo:	add.f	0,r0,r0
	asl_s	r1,r1
	rlc	r2,0
	asl_s	r0,r0
	j_s.d	[blink]
	or_s	r1,r1,r2

With this patch (both with and without a barrel shifter):

foo:	add.f	r0,r0,r0
	j_s.d	[blink]
	adc	r1,r1,r1

[For Jeff Law's benefit, a similar optimization is also applicable to H8300H, which could also use a two-instruction sequence (plus rts), but currently GCC generates 16 instructions (plus an rts) for foo above.] Tested with a cross-compiler to arc-linux hosted on x86_64, with no new (compile-only) regressions from make -k check. Ok for mainline if this passes Claudiu's nightly testing?

2023-10-28  Roger Sayle

gcc/ChangeLog
	* config/arc/arc.md (addsi3): Fix GNU-style code formatting.
	(adddi3): Change define_expand to generate an *adddi3.
	(*adddi3): New define_insn_and_split to lower DImode additions
	during the split1 pass (after combine and before reload).
	(ashldi3): New define_expand to (only) generate *ashldi3_cnt1
	for DImode left shifts by a single bit.
	(*ashldi3_cnt1): New define_insn_and_split to lower DImode
	left shifts by one bit to an *adddi3.

gcc/testsuite/ChangeLog
	* gcc.target/arc/adddi3-1.c: New test case.
	* gcc.target/arc/ashldi3-1.c: Likewise. 
Thanks in advance, Roger -- diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md index ee43887..fe5f48c 100644 --- a/gcc/config/arc/arc.md +++ b/gcc/config/arc/arc.md @@ -2675,19 +2675,28 @@ archs4x, archs4xd" (plus:SI (match_operand:SI 1 "register_operand" "") (match_operand:SI 2 "nonmemory_operand" "")))] "" - "if (flag_pic && arc_raw_symbolic_reference_mentioned_p (operands[2], false)) - { - operands[2]=force_reg(SImode, operands[2]); - } - ") +{ + if (flag_pic && arc_raw_symbolic_reference_mentioned_p (operands[2], false)) +operands[2] = force_reg (SImode, operands[2]); +}) (define_expand "adddi3" + [(parallel + [(set (match_operand:DI 0 "register_operand" "") + (plus:DI (match_operand:DI 1 "register_operand" "") +(match_operand:DI 2 "nonmemory_operand" ""))) + (clobber (reg:CC CC_REG))])]) + +(define_insn_and_split "*adddi3" [(set (match_operand:DI 0 "register_operand" "") (plus:DI (match_operand:DI 1 "register_operand" "") (match_operand:DI 2 "nonmemory_operand" ""))) (clobber (reg:CC CC_REG))] - "" - " + "arc_pre_reload_split ()" + "#" + "&& 1" + [(const_int 0)] +{ rtx l0 = gen_lowpart (SImode, operands[0]); rtx h0 = gen_highpart (SImode, operands[0]); rtx l1 = gen_lowpart (SImode, operands[1]); @@ -2719,11 +2728,12 @@ archs4x, archs4xd" gen_rtx_LTU (VOIDmode, gen_rtx_REG (CC_Cmode, CC_REG), GEN_INT (0)), gen_rtx_SET (h0, plus_constant (SImode, h0, 1; DONE; - } +} emit_insn (gen_add_f (l0, l1, l2)); emit_insn (gen_adc (h0, h1, h2)); DONE; -") +} + [(set_attr "length" "8")]) (define_insn "add_f" [(set (reg:CC_C CC_REG) @@ -3461,6 +3471,33 @@ archs4x, archs4xd" [(set_attr "type" "shift") (set_attr "length" "16,20")]) +;; DImode shifts + +(define_expand "ashldi3" + [(parallel + [(set (match_operand:DI 0 "register_operand") + (ashift:DI (match_operand:DI 1 "register_operand") + (match_operand:QI 2 "const_int_operand"))) + (clobber (reg:CC CC_REG))])] + "" +{ + if (operands[2] != const1_rtx) +FAIL; +}) + +(define_insn_and_split "*ashldi3_cnt1" + [(set 
(match_operand:DI 0 "register_operand") + (ashift:DI (match_operand:DI 1 "register_operand") + (const_int 1))) + (clobber (reg:CC CC_REG))] + "arc_pre_reload_split ()" + "#" + "&& 1" + [(parallel [(set (match_dup 0) (plus:DI (match_dup 1) (match_dup 1))) + (clobber (reg:CC CC_REG))])] + "" + [(set_attr "length" "8")]) + ;; Rotate instructions. (define_insn "rotrsi3_insn" diff --git
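The add.f/adc pair can be modelled in C on the two 32-bit halves. A sketch with my own naming; the `lo >> 31` stands in for the carry flag that add.f sets:

```c
#include <assert.h>
#include <stdint.h>

/* DImode x << 1 lowered to a 32-bit add-with-carry pair:
   add.f r0,r0,r0 ; adc r1,r1,r1.  */
static uint64_t shl1_adc (uint32_t lo, uint32_t hi)
{
  uint32_t carry = lo >> 31;     /* carry flag produced by add.f */
  uint32_t l = lo + lo;          /* add.f r0,r0,r0 */
  uint32_t h = hi + hi + carry;  /* adc   r1,r1,r1 */
  return ((uint64_t) h << 32) | l;
}
```

Because DImode addition already needs exactly this sequence, routing ashldi3-by-1 through *adddi3 gets the optimal code on both barrel-shifter and non-barrel-shifter CPUs.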
[wwwdocs] Get newlib via git in simtest-howto.html
A minor tweak to the documentation, to use git rather than cvs to obtain the latest version of newlib. Ok for mainline? 2023-10-27 Roger Sayle * htdocs/simtest-howto.html: Use git to obtain newlib. Cheers, Roger -- diff --git a/htdocs/simtest-howto.html b/htdocs/simtest-howto.html index 2e54476b..d9c027fd 100644 --- a/htdocs/simtest-howto.html +++ b/htdocs/simtest-howto.html @@ -59,9 +59,7 @@ contrib/gcc_update --touch cd ${TOP} -cvs -d :pserver:anon...@sourceware.org:/cvs/src login -# You will be prompted for a password; reply with "anoncvs". -cvs -d :pserver:anon...@sourceware.org:/cvs/src co newlib +git clone https://sourceware.org/git/newlib-cygwin.git newlib Check out the sim and binutils tree:
[ARC PATCH] Improved SImode shifts and rotates with -mswap.
This patch improves the code generated by the ARC back-end for CPUs without a barrel shifter but with -mswap. The -mswap option provides a SWAP instruction that implements SImode rotations by 16 bits, but also logical shift instructions (left and right) by 16 bits. Clearly these are also useful building blocks for implementing shifts by 17, 18, etc., which would otherwise require a loop. As a representative example:

int shl20 (int x) { return x << 20; }

GCC with -O2 -mcpu=em -mswap would previously generate:

shl20:	mov	lp_count,10
	lp	2f
	add	r0,r0,r0
	add	r0,r0,r0
2:	# end single insn loop
	j_s	[blink]

with this patch we now generate:

shl20:	mov_s	r2,0	;3
	lsl16	r0,r0
	add3	r0,r2,r0
	j_s.d	[blink]
	asl_s	r0,r0

Although both are four instructions (excluding the j_s), the original takes ~22 cycles, and the replacement ~4 cycles. Tested with a cross-compiler to arc-linux hosted on x86_64, with no new (compile-only) regressions from make -k check. Ok for mainline if this passes Claudiu's nightly testing?

2023-10-27  Roger Sayle

gcc/ChangeLog
	* config/arc/arc.cc (arc_split_ashl): Use lsl16 on TARGET_SWAP.
	(arc_split_ashr): Use swap and sign-extend on TARGET_SWAP.
	(arc_split_lshr): Use lsr16 on TARGET_SWAP.
	(arc_split_rotl): Use swap on TARGET_SWAP.
	(arc_split_rotr): Likewise.
	* config/arc/arc.md (ANY_ROTATE): New code iterator.
	(si2_cnt16): New define_insn for alternate form of swap
	instruction on TARGET_SWAP.
	(ashlsi2_cnt16): Rename from *ashlsi16_cnt16 and move earlier.
	(lshrsi2_cnt16): New define_insn for LSR16 instruction.
	(*ashlsi2_cnt16): See above.

gcc/testsuite/ChangeLog
	* gcc.target/arc/lsl16-1.c: New test case.
	* gcc.target/arc/lsr16-1.c: Likewise.
	* gcc.target/arc/swap-1.c: Likewise.
	* gcc.target/arc/swap-2.c: Likewise. 
Thanks in advance, Roger -- diff --git a/gcc/config/arc/arc.cc b/gcc/config/arc/arc.cc index 353ac69..e98692a 100644 --- a/gcc/config/arc/arc.cc +++ b/gcc/config/arc/arc.cc @@ -4256,6 +4256,17 @@ arc_split_ashl (rtx *operands) } return; } + else if (n >= 16 && n <= 22 && TARGET_SWAP && TARGET_V2) + { + emit_insn (gen_ashlsi2_cnt16 (operands[0], operands[1])); + if (n > 16) + { + operands[1] = operands[0]; + operands[2] = GEN_INT (n - 16); + arc_split_ashl (operands); + } + return; + } else if (n >= 29) { if (n < 31) @@ -4300,6 +4311,15 @@ arc_split_ashr (rtx *operands) emit_move_insn (operands[0], operands[1]); return; } + else if (n >= 16 && n <= 18 && TARGET_SWAP) + { + emit_insn (gen_rotrsi2_cnt16 (operands[0], operands[1])); + emit_insn (gen_extendhisi2 (operands[0], + gen_lowpart (HImode, operands[0]))); + while (--n >= 16) + emit_insn (gen_ashrsi3_cnt1 (operands[0], operands[0])); + return; + } else if (n == 30) { rtx tmp = gen_reg_rtx (SImode); @@ -4339,6 +4359,13 @@ arc_split_lshr (rtx *operands) emit_move_insn (operands[0], operands[1]); return; } + else if (n >= 16 && n <= 19 && TARGET_SWAP && TARGET_V2) + { + emit_insn (gen_lshrsi2_cnt16 (operands[0], operands[1])); + while (--n >= 16) + emit_insn (gen_lshrsi3_cnt1 (operands[0], operands[0])); + return; + } else if (n == 30) { rtx tmp = gen_reg_rtx (SImode); @@ -4385,6 +4412,19 @@ arc_split_rotl (rtx *operands) emit_insn (gen_rotrsi3_cnt1 (operands[0], operands[0])); return; } + else if (n >= 13 && n <= 16 && TARGET_SWAP) + { + emit_insn (gen_rotlsi2_cnt16 (operands[0], operands[1])); + while (++n <= 16) + emit_insn (gen_rotrsi3_cnt1 (operands[0], operands[0])); + return; + } + else if (n == 17 && TARGET_SWAP) + { + emit_insn (gen_rotlsi2_cnt16 (operands[0], operands[1])); + emit_insn (gen_rotlsi3_cnt1 (operands[0], operands[0])); + return; + } else if (n >= 16 || n == 12 || n == 14) { emit_insn (gen_rotrsi3_loop (operands[0], operands[1], @@ -4415,6 +4455,19 @@ arc_split_rotr (rtx *operands) 
emit_move_insn (operands[0], operands[1]); return; } + else if (n == 15 && TARGET_SWAP) + { + emit_insn (gen_rotrsi2_cnt16 (operands[0], operands[1])); + emit_insn (gen_rotlsi3_cnt1 (operands[0], operands[0])); + return; + } + e
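The decompositions the patch adds can be expressed in C. The helper names here are hypothetical; `rot16` models the SWAP instruction, and the 16-bit shifts model LSL16/LSR16:

```c
#include <assert.h>
#include <stdint.h>

/* SWAP implements a rotate by 16 bits.  */
static uint32_t rot16 (uint32_t x)
{
  return (x << 16) | (x >> 16);
}

/* Shift by 20 = LSL16 followed by a shift of the remaining 4 bits
   (implemented as add3 + asl in the generated code above).  */
static uint32_t shl20 (uint32_t x)
{
  return (x << 16) << 4;
}

/* Rotate left by 17 = SWAP followed by one single-bit rotate.  */
static uint32_t rotl17 (uint32_t x)
{
  uint32_t t = rot16 (x);
  return (t << 1) | (t >> 31);
}
```

In each case a loop of 17-22 single-bit steps collapses to one 16-bit building block plus a handful of single-bit shifts or rotates, which is where the ~22 to ~4 cycle improvement comes from.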
RE: [x86 PATCH] PR target/110511: Fix reg allocation for widening multiplications.
Hi Uros, I've tried your suggestions to see what would happen. Alas, allowing both operands to (i386's) widening multiplications to be nonimmediate_operand results in 90 additional testsuite unexpected failures and 41 unresolved testcases, around things like:

gcc.c-torture/compile/di.c:6:1: error: unrecognizable insn:
(insn 14 13 15 2 (parallel [ (set (reg:DI 98 [ _3 ]) (mult:DI (zero_extend:DI (mem/c:SI (plus:SI (reg/f:SI 93 virtual-stack-vars) (const_int -8 [0xfff8])) [1 a+0 S4 A64])) (zero_extend:DI (mem/c:SI (plus:SI (reg/f:SI 93 virtual-stack-vars) (const_int -16 [0xfff0])) [1 b+0 S4 A64] (clobber (reg:CC 17 flags)) ]) "gcc.c-torture/compile/di.c":5:12 -1 (nil))
during RTL pass: vregs
gcc.c-torture/compile/di.c:6:1: internal compiler error: in extract_insn, at recog.cc:2791

In my experiments, I've used nonimmediate_operand instead of general_operand, as a zero_extend of an immediate_operand, like const_int, would be non-canonical. In short, it's ok (common) for '%' to apply to operands with different predicates; reload will only swap things if the operand's predicates/constraints remain consistent. For example, see i386.c's *add_1 pattern. And as shown above it can't be left to (until) reload to decide which "mem" gets loaded into a register (which would be nice), as some passes before reload check both predicates and constraints. My original patch fixes PR 110511, using the same peephole2 idiom as already used elsewhere in i386.md. Ok for mainline?

> -Original Message-
> From: Uros Bizjak
> Sent: 19 October 2023 18:02
> To: Roger Sayle
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [x86 PATCH] PR target/110511: Fix reg allocation for widening
> multiplications. 
> > On Tue, Oct 17, 2023 at 9:05 PM Roger Sayle > wrote: > > > > > > This patch contains clean-ups of the widening multiplication patterns > > in i386.md, and provides variants of the existing highpart > > multiplication > > peephole2 transformations (that tidy up register allocation after > > reload), and thereby fixes PR target/110511, which is a superfluous > > move instruction. > > > > For the new test case, compiled on x86_64 with -O2. > > > > Before: > > mulx64: movabsq $-7046029254386353131, %rcx > > movq%rcx, %rax > > mulq%rdi > > xorq%rdx, %rax > > ret > > > > After: > > mulx64: movabsq $-7046029254386353131, %rax > > mulq%rdi > > xorq%rdx, %rax > > ret > > > > The clean-ups are (i) that operand 1 is consistently made > > register_operand and operand 2 becomes nonimmediate_operand, so that > > predicates match the constraints, (ii) the representation of the BMI2 > > mulx instruction is updated to use the new umul_highpart RTX, and > > (iii) because operands > > 0 and 1 have different modes in widening multiplications, "a" is a > > more appropriate constraint than "0" (which avoids spills/reloads > > containing SUBREGs). The new peephole2 transformations are based upon > > those at around line 9951 of i386.md, that begins with the comment ;; > > Highpart multiplication peephole2s to tweak register allocation. > > ;; mov imm,%rdx; mov %rdi,%rax; imulq %rdx -> mov imm,%rax; imulq > > %rdi > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline? > > > > > > 2023-10-17 Roger Sayle > > > > gcc/ChangeLog > > PR target/110511 > > * config/i386/i386.md (mul3): Make operands 1 and > > 2 take "regiser_operand" and "nonimmediate_operand" respectively. > > (mulqihi3): Likewise. > > (*bmi2_umul3_1): Operand 2 needs to be register_operand > > matching the %d constraint. 
Use umul_highpart RTX to represent > > the highpart multiplication. > > (*umul3_1): Operand 2 should use regiser_operand > > predicate, and "a" rather than "0" as operands 0 and 2 have > > different modes. > > (define_split): For mul to mulx conversion, use the new > > umul_highpart RTX representation. > > (*mul3_1): Operand 1 should be register_operand > > and the constraint %a as operands 0 and 1 have different modes. > > (*mulqihi3_1): Operand 1 should be register_
[NVPTX] Patch pings...
Random fact: there have been no changes to nvptx.md in 2023 apart from Jakub's tree-wide update to the copyright years in early January. Please can I ping two of my pending Nvidia nvptx patches: "Correct pattern for popcountdi2 insn in nvptx.md" from January https://gcc.gnu.org/pipermail/gcc-patches/2023-January/609571.html and "Update nvptx's bitrev2 pattern to use BITREVERSE rtx" from June https://gcc.gnu.org/pipermail/gcc-patches/2023-June/620994.html Both of these still apply cleanly (because nvptx.md hasn't changed). Thanks in advance, Roger --
[PATCH v2] PR 91865: Avoid ZERO_EXTEND of ZERO_EXTEND in make_compound_operation.
Hi Jeff, Many thanks for the review/approval of my fix for PR rtl-optimization/91865. Based on your and Richard Biener's feedback, I’d like to propose a revision calling simplify_unary_operation instead of simplify_const_unary_operation (i.e. Richi's recommendation). I was originally concerned that this might potentially result in unbounded recursion, and testing for ZERO_EXTEND was safer but "uglier", but testing hasn't shown any issues. If we do see issues in the future, it's easy to fall back to the previous version of this patch. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-10-25 Roger Sayle Richard Biener gcc/ChangeLog PR rtl-optimization/91865 * combine.cc (make_compound_operation): Avoid creating a ZERO_EXTEND of a ZERO_EXTEND. gcc/testsuite/ChangeLog PR rtl-optimization/91865 * gcc.target/msp430/pr91865.c: New test case. Thanks again, Roger -- > -Original Message- > From: Jeff Law > Sent: 19 October 2023 16:20 > > On 10/14/23 16:14, Roger Sayle wrote: > > > > This patch is my proposed solution to PR rtl-optimization/91865. > > Normally RTX simplification canonicalizes a ZERO_EXTEND of a > > ZERO_EXTEND to a single ZERO_EXTEND, but as shown in this PR it is > > possible for combine's make_compound_operation to unintentionally > > generate a non-canonical ZERO_EXTEND of a ZERO_EXTEND, which is > > unlikely to be matched by the backend. 
> > > > For the new test case: > > > > const int table[2] = {1, 2}; > > int foo (char i) { return table[i]; } > > > > compiling with -O2 -mlarge on msp430 we currently see: > > > > Trying 2 -> 7: > > 2: r25:HI=zero_extend(R12:QI) > >REG_DEAD R12:QI > > 7: r28:PSI=sign_extend(r25:HI)#0 > >REG_DEAD r25:HI > > Failed to match this instruction: > > (set (reg:PSI 28 [ iD.1772 ]) > > (zero_extend:PSI (zero_extend:HI (reg:QI 12 R12 [ iD.1772 ] > > > > which results in the following code: > > > > foo:AND #0xff, R12 > > RLAM.A #4, R12 { RRAM.A #4, R12 > > RLAM.A #1, R12 > > MOVX.W table(R12), R12 > > RETA > > > > With this patch, we now see: > > > > Trying 2 -> 7: > > 2: r25:HI=zero_extend(R12:QI) > >REG_DEAD R12:QI > > 7: r28:PSI=sign_extend(r25:HI)#0 > >REG_DEAD r25:HI > > Successfully matched this instruction: > > (set (reg:PSI 28 [ iD.1772 ]) > > (zero_extend:PSI (reg:QI 12 R12 [ iD.1772 ]))) allowing > > combination of insns 2 and 7 original costs 4 + 8 = 12 replacement > > cost 8 > > > > foo:MOV.B R12, R12 > > RLAM.A #1, R12 > > MOVX.W table(R12), R12 > > RETA > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline? > > > > 2023-10-14 Roger Sayle > > > > gcc/ChangeLog > > PR rtl-optimization/91865 > > * combine.cc (make_compound_operation): Avoid creating a > > ZERO_EXTEND of a ZERO_EXTEND. > Final question. Is there a reasonable expectation that we could get a > similar situation with sign extensions? If so we probably ought to try > and handle both. > > OK with the obvious change to handle nested sign extensions if you think it's > useful to do so. And OK as-is if you don't think handling nested sign > extensions is > useful. 
> > jeff diff --git a/gcc/combine.cc b/gcc/combine.cc index 360aa2f25e6..b1b16ac7bb2 100644 --- a/gcc/combine.cc +++ b/gcc/combine.cc @@ -8449,8 +8449,8 @@ make_compound_operation (rtx x, enum rtx_code in_code) if (code == ZERO_EXTEND) { new_rtx = make_compound_operation (XEXP (x, 0), next_code); - tem = simplify_const_unary_operation (ZERO_EXTEND, GET_MODE (x), - new_rtx, GET_MODE (XEXP (x, 0))); + tem = simplify_unary_operation (ZERO_EXTEND, GET_MODE (x), + new_rtx, GET_MODE (XEXP (x, 0))); if (tem) return tem; SUBST (XEXP (x, 0), new_rtx); diff --git a/gcc/testsuite/gcc.target/msp430/pr91865.c b/gcc/testsuite/gcc.target/msp430/pr91865.c new file mode 100644 index 000..8cc21c8b9e8 --- /dev/null +++ b/gcc/testsuite/gcc.target/msp430/pr91865.c @@ -0,0 +1,8 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mlarge" } */ + +const int table[2] = {1, 2}; +int foo (char i) { return table[i]; } + +/* { dg-final { scan-assembler-not "AND" } } */ +/* { dg-final { scan-assembler-not "RRAM" } } */
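The canonicalization being restored is easy to state in C: zero-extending twice from QImode gives the same value as zero-extending once, so only the single-extension form should ever reach the backend. A small semantic check (the function names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* zero_extend:PSI (zero_extend:HI (reg:QI)) -- the non-canonical
   nested form combine was accidentally producing ...  */
static uint32_t nested_zext (uint8_t q)
{
  return (uint32_t) (uint16_t) q;
}

/* ... is value-equivalent to the canonical single zero_extend,
   which the backend actually has a pattern for.  */
static uint32_t single_zext (uint8_t q)
{
  return (uint32_t) q;
}
```

Calling simplify_unary_operation (rather than only the const variant) lets combine collapse the nested form whenever the inner operand is itself an extension, not just when it is a constant.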
[x86 PATCH] Fine tune STV register conversion costs for -Os.
The eagle-eyed may have spotted that my recent testcases for DImode shifts on x86_64 included -mno-stv in the dg-options. This is because the Scalar-To-Vector (STV) pass currently transforms these shifts to use SSE vector operations, producing larger code even with -Os. The issue is that compute_convert_gain currently underestimates the size of instructions required for interunit moves, which is corrected with the patch below. For the simple test case:

unsigned long long shl1(unsigned long long x) { return x << 1; }

without this patch, GCC -m32 -Os -mavx2 currently generates:

shl1:	push	%ebp			// 1 byte
	mov	%esp,%ebp		// 2 bytes
	vmovq	0x8(%ebp),%xmm0		// 5 bytes
	pop	%ebp			// 1 byte
	vpaddq	%xmm0,%xmm0,%xmm0	// 4 bytes
	vmovd	%xmm0,%eax		// 4 bytes
	vpextrd	$0x1,%xmm0,%edx		// 6 bytes
	ret				// 1 byte
					// = 24 bytes total

with this patch, we now generate the shorter:

shl1:	push	%ebp			// 1 byte
	mov	%esp,%ebp		// 2 bytes
	mov	0x8(%ebp),%eax		// 3 bytes
	mov	0xc(%ebp),%edx		// 3 bytes
	pop	%ebp			// 1 byte
	add	%eax,%eax		// 2 bytes
	adc	%edx,%edx		// 2 bytes
	ret				// 1 byte
					// = 15 bytes total

Benchmarking using CSiBE shows that this patch saves 1361 bytes when compiling with -m32 -Os, and saves 172 bytes when compiling with -Os. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline?

2023-10-23  Roger Sayle

gcc/ChangeLog
	* config/i386/i386-features.cc (compute_convert_gain): Provide
	more accurate values (sizes) for inter-unit moves with -Os.

Thanks in advance, Roger --

diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc index cead397..6fac67e 100644 --- a/gcc/config/i386/i386-features.cc +++ b/gcc/config/i386/i386-features.cc @@ -752,11 +752,33 @@ general_scalar_chain::compute_convert_gain () fprintf (dump_file, " Instruction conversion gain: %d\n", gain); /* Cost the integer to sse and sse to integer moves. 
*/ - cost += n_sse_to_integer * ix86_cost->sse_to_integer; - /* ??? integer_to_sse but we only have that in the RA cost table. - Assume sse_to_integer/integer_to_sse are the same which they - are at the moment. */ - cost += n_integer_to_sse * ix86_cost->sse_to_integer; + if (!optimize_function_for_size_p (cfun)) +{ + cost += n_sse_to_integer * ix86_cost->sse_to_integer; + /* ??? integer_to_sse but we only have that in the RA cost table. + Assume sse_to_integer/integer_to_sse are the same which they + are at the moment. */ + cost += n_integer_to_sse * ix86_cost->sse_to_integer; +} + else if (TARGET_64BIT || smode == SImode) +{ + cost += n_sse_to_integer * COSTS_N_BYTES (4); + cost += n_integer_to_sse * COSTS_N_BYTES (4); +} + else if (TARGET_SSE4_1) +{ + /* vmovd (4 bytes) + vpextrd (6 bytes). */ + cost += n_sse_to_integer * COSTS_N_BYTES (10); + /* vmovd (4 bytes) + vpinsrd (6 bytes). */ + cost += n_integer_to_sse * COSTS_N_BYTES (10); +} + else +{ + /* movd (4 bytes) + psrlq (5 bytes) + movd (4 bytes). */ + cost += n_sse_to_integer * COSTS_N_BYTES (13); + /* movd (4 bytes) + movd (4 bytes) + unpckldq (4 bytes). */ + cost += n_integer_to_sse * COSTS_N_BYTES (12); +} if (dump_file) fprintf (dump_file, " Registers conversion cost: %d\n", cost);
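The -Os costing in the patch is pure byte arithmetic, which can be sketched directly. The COSTS_N_BYTES scale of 2 is my assumption about the i386 backend's macro, and the function name is invented:

```c
#include <assert.h>

#define COSTS_N_BYTES(n) ((n) * 2)  /* assumed i386 byte-cost scale */

/* Size cost of n_s2i SSE->integer and n_i2s integer->SSE moves for a
   DImode chain on 32-bit, mirroring the patch's -Os branches.  */
static int interunit_move_cost (int n_s2i, int n_i2s, int sse4_1)
{
  if (sse4_1)
    /* vmovd (4 bytes) + vpextrd (6 bytes); vmovd + vpinsrd likewise.  */
    return n_s2i * COSTS_N_BYTES (10) + n_i2s * COSTS_N_BYTES (10);
  /* movd + psrlq + movd = 13 bytes; movd + movd + unpckldq = 12 bytes.  */
  return n_s2i * COSTS_N_BYTES (13) + n_i2s * COSTS_N_BYTES (12);
}
```

With the old code the conversion cost was a single sse_to_integer tuning value per move, which badly understates the 10-13 byte instruction sequences actually needed on -m32, so STV looked like a size win when it was not.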
RE: [Patch] nvptx: Use fatal_error when -march= is missing not an assert [PR111093]
Hi Thomas, Tobias and Tom, Thanks for asking. Interestingly, I've a patch (attached) from last year that tackled some of the issues here. The surface problem is that nvptx's march and misa are related in complicated ways. Specifying an arch defines the range of valid isa's, and specifying an isa restricts the set of valid arches. The current approach, which I agree is problematic, is to force these to be specified (compatibly) on the cc1 command line. Certainly, an error is better than an abort. My proposed solution was to allow either to imply a default for the other, and only issue an error if they are explicitly specified incompatibly. One reason for supporting this approach was to ultimately support an -march=native in the driver (calling libcuda.so to determine the hardware available on the current machine). The other use case is bumping the "default" nvptx architecture to something more recent, say sm_53, by providing/honoring a default arch at configure time. Alas, it turns out that specifying a recent arch during GCC bootstrap allows the build to notice that the backend (now) supports 16-bit floats, which then prompts libgcc to contain the floathf and fixhf support that would be required. Then this in turn shows up as a limitation in the middle-end's handling of libcalls, for which I submitted a patch back in July 2022: https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598848.html That patch hasn't yet been approved, so the whole nvptx -march= patch series became backlogged/forgotten. Hopefully, the attached "proof-of-concept" patch looks interesting (food for thought). If this approach seems reasonable, I'm happy to brush the dust off, and resubmit it (or a series of pieces) for review. 
Best regards, Roger -- > -Original Message- > From: Thomas Schwinge > Sent: 18 October 2023 11:16 > To: Tobias Burnus > Cc: gcc-patches@gcc.gnu.org; Tom de Vries ; Roger Sayle > > Subject: Re: [Patch] nvptx: Use fatal_error when -march= is missing not an > assert > [PR111093] > > Hi Tobias! > > On 2023-10-16T11:18:45+0200, Tobias Burnus > wrote: > > While mkoffload ensures that there is always a -march=, nvptx's > > cc1 can also be run directly. > > > > In my case, I wanted to know which target-specific #define are > > available; hence, I did run: > >accel/nvptx-none/cc1 -E -dM < /dev/null which gave an ICE. After > > some debugging, the reasons was clear (missing -march=) but somehow a > > (fatal) error would have been nicer than an ICE + debugging. > > > > OK for mainline? > > Yes, thanks. I think I prefer this over hard-coding some default > 'ptx_isa_option' -- > but may be convinced otherwise (incremental change), if that's maybe more > convenient for others? (Roger?) > > > Grüße > Thomas > > > > nvptx: Use fatal_error when -march= is missing not an assert > > [PR111093] > > > > gcc/ChangeLog: > > > > PR target/111093 > > * config/nvptx/nvptx.cc (nvptx_option_override): Issue fatal error > > instead of an assert ICE when no -march= has been specified. > > > > diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc > > index edef39fb5e1..634c31673be 100644 > > --- a/gcc/config/nvptx/nvptx.cc > > +++ b/gcc/config/nvptx/nvptx.cc > > @@ -335,8 +335,9 @@ nvptx_option_override (void) > >init_machine_status = nvptx_init_machine_status; > > > >/* Via nvptx 'OPTION_DEFAULT_SPECS', '-misa' always appears on the > command > > - line. */ > > - gcc_checking_assert (OPTION_SET_P (ptx_isa_option)); > > + line; but handle the case that the compiler is not run via the > > + driver. 
*/ if (!OPTION_SET_P (ptx_isa_option)) > > +fatal_error (UNKNOWN_LOCATION, "%<-march=%> must be specified"); > > > >handle_ptx_version_option (); > > > - > Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 > München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas > Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht > München, HRB 106955 diff --git a/gcc/calls.cc b/gcc/calls.cc index 6dd6f73..8a18eae 100644 --- a/gcc/calls.cc +++ b/gcc/calls.cc @@ -4795,14 +4795,20 @@ emit_library_call_value_1 (int retval, rtx orgfun, rtx value, else { /* Convert to the proper mode if a promotion has been active. */ - if (GET_MODE (valreg) != outmode) + enum machine_mode valmode = GET_MODE (valreg); + if (valmode != outmode) { int unsignedp = TYPE_UNSIGNED (tfom); gcc_assert (promote_function_mode (tfom, outmode, ,
RE: [x86 PATCH] PR target/110551: Fix reg allocation for widening multiplications.
Many thanks to Tobias Burnus for pointing out the mistake/typo in the PR number. This fix is for PR 110551, not PR 110511. I'll update the ChangeLog and filename of the new testcase, if approved. Sorry for any inconvenience/confusion. Cheers, Roger -- > -Original Message- > From: Roger Sayle > Sent: 17 October 2023 20:06 > To: 'gcc-patches@gcc.gnu.org' > Cc: 'Uros Bizjak' > Subject: [x86 PATCH] PR target/110511: Fix reg allocation for widening > multiplications. > > > This patch contains clean-ups of the widening multiplication patterns in i386.md, > and provides variants of the existing highpart multiplication > peephole2 transformations (that tidy up register allocation after reload), and > thereby fixes PR target/110511, which is a superfluous move instruction. > > For the new test case, compiled on x86_64 with -O2. > > Before: > mulx64: movabsq $-7046029254386353131, %rcx > movq%rcx, %rax > mulq%rdi > xorq%rdx, %rax > ret > > After: > mulx64: movabsq $-7046029254386353131, %rax > mulq%rdi > xorq%rdx, %rax > ret > > The clean-ups are (i) that operand 1 is consistently made register_operand and > operand 2 becomes nonimmediate_operand, so that predicates match the > constraints, (ii) the representation of the BMI2 mulx instruction is updated to use > the new umul_highpart RTX, and (iii) because operands > 0 and 1 have different modes in widening multiplications, "a" is a more > appropriate constraint than "0" (which avoids spills/reloads containing SUBREGs). > The new peephole2 transformations are based upon those at around line 9951 of > i386.md, that begins with the comment ;; Highpart multiplication peephole2s to > tweak register allocation. > ;; mov imm,%rdx; mov %rdi,%rax; imulq %rdx -> mov imm,%rax; imulq %rdi > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and > make -k check, both with and without --target_board=unix{-m32} with no new > failures. Ok for mainline? 
> > > 2023-10-17 Roger Sayle > > gcc/ChangeLog > PR target/110511 > * config/i386/i386.md (mul3): Make operands 1 and > 2 take "register_operand" and "nonimmediate_operand" respectively. > (mulqihi3): Likewise. > (*bmi2_umul3_1): Operand 2 needs to be register_operand > matching the %d constraint. Use umul_highpart RTX to represent > the highpart multiplication. > (*umul3_1): Operand 2 should use register_operand > predicate, and "a" rather than "0" as operands 0 and 2 have > different modes. > (define_split): For mul to mulx conversion, use the new > umul_highpart RTX representation. > (*mul3_1): Operand 1 should be register_operand > and the constraint %a as operands 0 and 1 have different modes. > (*mulqihi3_1): Operand 1 should be register_operand matching > the constraint %0. > (define_peephole2): Provide widening multiplication variants > of the peephole2s that tweak highpart multiplication register > allocation. > > gcc/testsuite/ChangeLog > PR target/110511 > * gcc.target/i386/pr110511.c: New test case. > > > Thanks in advance, > Roger
[x86 PATCH] PR target/110511: Fix reg allocation for widening multiplications.
This patch contains clean-ups of the widening multiplication patterns in i386.md, and provides variants of the existing highpart multiplication peephole2 transformations (that tidy up register allocation after reload), and thereby fixes PR target/110511, which is a superfluous move instruction. For the new test case, compiled on x86_64 with -O2. Before: mulx64: movabsq $-7046029254386353131, %rcx movq %rcx, %rax mulq %rdi xorq %rdx, %rax ret After: mulx64: movabsq $-7046029254386353131, %rax mulq %rdi xorq %rdx, %rax ret The clean-ups are (i) that operand 1 is consistently made register_operand and operand 2 becomes nonimmediate_operand, so that predicates match the constraints, (ii) the representation of the BMI2 mulx instruction is updated to use the new umul_highpart RTX, and (iii) because operands 0 and 1 have different modes in widening multiplications, "a" is a more appropriate constraint than "0" (which avoids spills/reloads containing SUBREGs). The new peephole2 transformations are based upon those at around line 9951 of i386.md, that begins with the comment ;; Highpart multiplication peephole2s to tweak register allocation. ;; mov imm,%rdx; mov %rdi,%rax; imulq %rdx -> mov imm,%rax; imulq %rdi This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-10-17 Roger Sayle gcc/ChangeLog PR target/110511 * config/i386/i386.md (mul3): Make operands 1 and 2 take "register_operand" and "nonimmediate_operand" respectively. (mulqihi3): Likewise. (*bmi2_umul3_1): Operand 2 needs to be register_operand matching the %d constraint. Use umul_highpart RTX to represent the highpart multiplication. (*umul3_1): Operand 2 should use register_operand predicate, and "a" rather than "0" as operands 0 and 2 have different modes. (define_split): For mul to mulx conversion, use the new umul_highpart RTX representation.
(*mul3_1): Operand 1 should be register_operand and the constraint %a as operands 0 and 1 have different modes. (*mulqihi3_1): Operand 1 should be register_operand matching the constraint %0. (define_peephole2): Providing widening multiplication variants of the peephole2s that tweak highpart multiplication register allocation. gcc/testsuite/ChangeLog PR target/110511 * gcc.target/i386/pr110511.c: New test case. Thanks in advance, Roger diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 2a60df5..22f18c2 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -9710,33 +9710,29 @@ [(parallel [(set (match_operand: 0 "register_operand") (mult: (any_extend: - (match_operand:DWIH 1 "nonimmediate_operand")) + (match_operand:DWIH 1 "register_operand")) (any_extend: - (match_operand:DWIH 2 "register_operand" + (match_operand:DWIH 2 "nonimmediate_operand" (clobber (reg:CC FLAGS_REG))])]) (define_expand "mulqihi3" [(parallel [(set (match_operand:HI 0 "register_operand") (mult:HI (any_extend:HI - (match_operand:QI 1 "nonimmediate_operand")) + (match_operand:QI 1 "register_operand")) (any_extend:HI - (match_operand:QI 2 "register_operand" + (match_operand:QI 2 "nonimmediate_operand" (clobber (reg:CC FLAGS_REG))])] "TARGET_QIMODE_MATH") (define_insn "*bmi2_umul3_1" [(set (match_operand:DWIH 0 "register_operand" "=r") (mult:DWIH - (match_operand:DWIH 2 "nonimmediate_operand" "%d") + (match_operand:DWIH 2 "register_operand" "%d") (match_operand:DWIH 3 "nonimmediate_operand" "rm"))) (set (match_operand:DWIH 1 "register_operand" "=r") - (truncate:DWIH - (lshiftrt: - (mult: (zero_extend: (match_dup 2)) - (zero_extend: (match_dup 3))) - (match_operand:QI 4 "const_int_operand"] - "TARGET_BMI2 && INTVAL (operands[4]) == * BITS_PER_UNIT + (umul_highpart:DWIH (match_dup 2) (match_dup 3)))] + "TARGET_BMI2 && !(MEM_P (operands[2]) && MEM_P (operands[3]))" "mulx\t{%3, %0, %1|%1, %0, %3}" [(set_attr "type" &qu
RE: [x86 PATCH] PR 106245: Split (x<<31)>>31 as -(x&1) in i386.md
Hi Uros, Thanks for the speedy review. > From: Uros Bizjak > Sent: 17 October 2023 17:38 > > On Tue, Oct 17, 2023 at 3:08 PM Roger Sayle > wrote: > > > > > > This patch is the backend piece of a solution to PRs 101955 and > > 106245, that adds a define_insn_and_split to the i386 backend, to > > perform sign extension of a single (least significant) bit using AND $1 > > then NEG. > > > > Previously, (x<<31)>>31 would be generated as > > > > sall $31, %eax // 3 bytes > > sarl $31, %eax // 3 bytes > > > > with this patch the backend now generates: > > > > andl $1, %eax // 3 bytes > > negl %eax // 2 bytes > > > > Not only is this smaller in size, but microbenchmarking confirms that > > it's a performance win on both Intel and AMD; Intel sees only a 2% > > improvement (perhaps just a size effect), but AMD sees a 7% win. > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline? > > > > > > 2023-10-17 Roger Sayle > > > > gcc/ChangeLog > > PR middle-end/101955 > > PR tree-optimization/106245 > > * config/i386/i386.md (*extv_1_0): New define_insn_and_split. > > > > gcc/testsuite/ChangeLog > > PR middle-end/101955 > > PR tree-optimization/106245 > > * gcc.target/i386/pr106245-2.c: New test case. > > * gcc.target/i386/pr106245-3.c: New 32-bit test case. > > * gcc.target/i386/pr106245-4.c: New 64-bit test case. > > * gcc.target/i386/pr106245-5.c: Likewise. > > +;; Split sign-extension of single least significant bit as and x,$1;neg > +x (define_insn_and_split "*extv_1_0" > + [(set (match_operand:SWI48 0 "register_operand" "=r") > + (sign_extract:SWI48 (match_operand:SWI48 1 "register_operand" "0") > +(const_int 1) > +(const_int 0))) > + (clobber (reg:CC FLAGS_REG))] > + "" > + "#" > + "&& 1" > > No need to use "&&" for an empty insn constraint. Just use "reload_completed" > in > this case.
> > + [(parallel [(set (match_dup 0) (and:SWI48 (match_dup 1) (const_int 1))) > + (clobber (reg:CC FLAGS_REG))]) > + (parallel [(set (match_dup 0) (neg:SWI48 (match_dup 0))) > + (clobber (reg:CC FLAGS_REG))])]) > > Did you intend to split this after reload? If this is the case, then > reload_completed > is missing. Because this splitter neither requires the allocation of a new pseudo, nor a hard register assignment, i.e. it's a splitter that can be run before or after reload, it's written to split "whenever". If you'd prefer it to only split after reload, I agree a "reload_completed" can be added (alternatively, adding "ix86_pre_reload_split ()" would also work). I now see from "*load_tp_" that "" is perhaps preferred over "&& 1" in these cases. Please let me know which you prefer. Cheers, Roger
[x86 PATCH] PR 106245: Split (x<<31)>>31 as -(x&1) in i386.md
This patch is the backend piece of a solution to PRs 101955 and 106245, that adds a define_insn_and_split to the i386 backend, to perform sign extension of a single (least significant) bit using AND $1 then NEG. Previously, (x<<31)>>31 would be generated as sall $31, %eax // 3 bytes sarl $31, %eax // 3 bytes with this patch the backend now generates: andl $1, %eax // 3 bytes negl %eax // 2 bytes Not only is this smaller in size, but microbenchmarking confirms that it's a performance win on both Intel and AMD; Intel sees only a 2% improvement (perhaps just a size effect), but AMD sees a 7% win. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-10-17 Roger Sayle gcc/ChangeLog PR middle-end/101955 PR tree-optimization/106245 * config/i386/i386.md (*extv_1_0): New define_insn_and_split. gcc/testsuite/ChangeLog PR middle-end/101955 PR tree-optimization/106245 * gcc.target/i386/pr106245-2.c: New test case. * gcc.target/i386/pr106245-3.c: New 32-bit test case. * gcc.target/i386/pr106245-4.c: New 64-bit test case. * gcc.target/i386/pr106245-5.c: Likewise.
Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 2a60df5..b7309be0 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -3414,6 +3414,21 @@ [(set_attr "type" "imovx") (set_attr "mode" "SI")]) +;; Split sign-extension of single least significant bit as and x,$1;neg x +(define_insn_and_split "*extv_1_0" + [(set (match_operand:SWI48 0 "register_operand" "=r") + (sign_extract:SWI48 (match_operand:SWI48 1 "register_operand" "0") + (const_int 1) + (const_int 0))) + (clobber (reg:CC FLAGS_REG))] + "" + "#" + "&& 1" + [(parallel [(set (match_dup 0) (and:SWI48 (match_dup 1) (const_int 1))) + (clobber (reg:CC FLAGS_REG))]) + (parallel [(set (match_dup 0) (neg:SWI48 (match_dup 0))) + (clobber (reg:CC FLAGS_REG))])]) + (define_expand "extzv" [(set (match_operand:SWI248 0 "register_operand") (zero_extract:SWI248 (match_operand:SWI248 1 "register_operand") diff --git a/gcc/testsuite/gcc.target/i386/pr106245-2.c b/gcc/testsuite/gcc.target/i386/pr106245-2.c new file mode 100644 index 000..47b0d27 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106245-2.c @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +int f(int a) +{ +return (a << 31) >> 31; +} + +/* { dg-final { scan-assembler "andl" } } */ +/* { dg-final { scan-assembler "negl" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr106245-3.c b/gcc/testsuite/gcc.target/i386/pr106245-3.c new file mode 100644 index 000..4ec6342 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106245-3.c @@ -0,0 +1,11 @@ +/* { dg-do compile { target ia32 } } */ +/* { dg-options "-O2" } */ + +long long f(long long a) +{ +return (a << 63) >> 63; +} + +/* { dg-final { scan-assembler "andl" } } */ +/* { dg-final { scan-assembler "negl" } } */ +/* { dg-final { scan-assembler "cltd" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr106245-4.c b/gcc/testsuite/gcc.target/i386/pr106245-4.c new file mode 100644 index 000..ef77ee5 --- /dev/null +++ 
b/gcc/testsuite/gcc.target/i386/pr106245-4.c @@ -0,0 +1,10 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2" } */ + +long long f(long long a) +{ +return (a << 63) >> 63; +} + +/* { dg-final { scan-assembler "andl" } } */ +/* { dg-final { scan-assembler "negq" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr106245-5.c b/gcc/testsuite/gcc.target/i386/pr106245-5.c new file mode 100644 index 000..0351866 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106245-5.c @@ -0,0 +1,11 @@ +/* { dg-do compile { target int128 } } */ +/* { dg-options "-O2" } */ + +__int128 f(__int128 a) +{ + return (a << 127) >> 127; +} + +/* { dg-final { scan-assembler "andl" } } */ +/* { dg-final { scan-assembler "negq" } } */ +/* { dg-final { scan-assembler "cqto" } } */
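The identity this define_insn_and_split exploits can be checked directly in C++ (a sketch: the cast through uint32_t keeps the left shift well-defined, and the arithmetic right shift of a negative value is the GCC behavior the patch relies on):

```cpp
#include <cstdint>

// Two ways to sign-extend the least significant bit of x: the shift
// pair GCC previously emitted (sall $31; sarl $31) and the and/neg
// pair the new splitter generates.  Both map LSB 0 -> 0 and LSB 1 -> -1.
int32_t sext_lsb_shifts (int32_t x)
{
  // Left shift done in unsigned arithmetic to avoid signed-overflow UB;
  // the right shift of a negative int32_t is arithmetic on GCC.
  return (int32_t) ((uint32_t) x << 31) >> 31;
}

int32_t sext_lsb_and_neg (int32_t x)
{
  return -(x & 1);
}
```

The equivalence holds for every input, which is what licenses the split at any point before or after reload.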
RE: [PATCH] Support g++ 4.8 as a host compiler.
I'd like to ping my patch for restoring bootstrap using g++ 4.8.5 (the system compiler on RHEL 7 and later systems). https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632008.html Note the preprocessor #ifs can be removed; they are only there to document why the union u must have an explicit, empty (but not default) constructor. I completely agree with the various opinions that we might consider upgrading the minimum host compiler for many good reasons (Ada, D, newer C++ features etc.). It's inevitable that older compilers and systems can't be supported indefinitely. Having said that, I don't think that this unintentional trivial breakage, which has a safe one-line work-around, is sufficient cause (or non-negligible risk or support burden) to inconvenience a large number of GCC users (the impact/disruption to cfarm has already been mentioned). Interestingly, "scl enable devtoolset-XX" to use a newer host compiler, v10 or v11, results in a significant increase (100+) in unexpected failures I see during mainline regression testing using "make -k check" (on RedHat 7.9). (Older) system compilers, despite their flaws, are selected for their (overall) stability and maturity. If another patch/change hits the compiler next week that reasonably means that 4.8.5 can no longer be supported, so be it, but it's an annoying (and unnecessary?) inconvenience in the meantime. Perhaps we should file a Bugzilla PR indicating that the documentation and release notes need updating, if my fix isn't considered acceptable? Why this patch is a trigger issue (that requires significant discussion and deliberation) is somewhat of a mystery. Thanks in advance. Roger > -----Original Message----- > From: Jeff Law > Sent: 07 October 2023 17:20 > To: Roger Sayle ; gcc-patches@gcc.gnu.org > Cc: 'Richard Sandiford' > Subject: Re: [PATCH] Support g++ 4.8 as a host compiler.
> > > > On 10/4/23 16:19, Roger Sayle wrote: > > > > The recent patch to remove poly_int_pod triggers a bug in g++ 4.8.5's > > C++ 11 support which mistakenly believes poly_uint16 has a non-trivial > > constructor. This in turn prohibits it from being used as a member in > > a union (rtxunion) that is constructed statically, resulting in a (fatal) > > error during stage 1. A workaround is to add an explicit constructor > > to the problematic union, which allows mainline to be bootstrapped > > with the system compiler on older RedHat 7 systems. > > > > This patch has been tested on x86_64-pc-linux-gnu where it allows a > > bootstrap to complete when using g++ 4.8.5 as the host compiler. > > Ok for mainline? > > > > > > 2023-10-04 Roger Sayle > > > > gcc/ChangeLog > > * rtl.h (rtx_def::u): Add explicit constructor to work around > > an issue using g++ 4.8 as a host compiler. > I think the bigger question is whether or not we're going to step forward on > the > minimum build requirements. > > My recollection was we settled on gcc-4.8 for the benefit of RHEL 7 and > CentOS 7 > which are rapidly approaching EOL (June 2024). > > I would certainly support stepping forward to a more modern compiler for the > build requirements, which might make this patch obsolete. > > Jeff
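The shape of the workaround under discussion can be sketched with a hypothetical reduction (this is not GCC's actual rtl.h; poly_like and rtx_like_union are stand-in names for poly_uint16 and rtx_def's union u):

```cpp
// g++ 4.8.5 wrongly treats some trivially-constructible class types as
// having non-trivial constructors, and then rejects any union containing
// such a member inside a statically constructed object.  Giving the
// union an explicit, empty (not defaulted) constructor sidesteps the
// bogus diagnostic without changing the union's layout.
struct poly_like                  // stand-in for poly_uint16
{
  unsigned short coeffs[1];
};

union rtx_like_union              // stand-in for rtx_def's union u
{
  rtx_like_union () {}            // the one-line workaround
  poly_like p;
  int i;
};

static rtx_like_union global_u;   // must be constructible at static-init time
```

With a defaulted (`= default`) or absent constructor, g++ 4.8.5 reported the union as non-trivially constructible; the explicit empty body avoids that path in the old front end while newer compilers accept either form.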
RE: [PATCH] PR 91865: Avoid ZERO_EXTEND of ZERO_EXTEND in make_compound_operation.
Hi Jeff, Thanks for the speedy review(s). > From: Jeff Law > Sent: 15 October 2023 00:03 > To: Roger Sayle ; gcc-patches@gcc.gnu.org > Subject: Re: [PATCH] PR 91865: Avoid ZERO_EXTEND of ZERO_EXTEND in > make_compound_operation. > > On 10/14/23 16:14, Roger Sayle wrote: > > > > This patch is my proposed solution to PR rtl-optimization/91865. > > Normally RTX simplification canonicalizes a ZERO_EXTEND of a > > ZERO_EXTEND to a single ZERO_EXTEND, but as shown in this PR it is > > possible for combine's make_compound_operation to unintentionally > > generate a non-canonical ZERO_EXTEND of a ZERO_EXTEND, which is > > unlikely to be matched by the backend. > > > > For the new test case: > > > > const int table[2] = {1, 2}; > > int foo (char i) { return table[i]; } > > > > compiling with -O2 -mlarge on msp430 we currently see: > > > > Trying 2 -> 7: > > 2: r25:HI=zero_extend(R12:QI) > >REG_DEAD R12:QI > > 7: r28:PSI=sign_extend(r25:HI)#0 > >REG_DEAD r25:HI > > Failed to match this instruction: > > (set (reg:PSI 28 [ iD.1772 ]) > > (zero_extend:PSI (zero_extend:HI (reg:QI 12 R12 [ iD.1772 ])))) > > > > which results in the following code: > > > > foo: AND #0xff, R12 > > RLAM.A #4, R12 { RRAM.A #4, R12 > > RLAM.A #1, R12 > > MOVX.W table(R12), R12 > > RETA > > > > With this patch, we now see: > > > > Trying 2 -> 7: > > 2: r25:HI=zero_extend(R12:QI) > >REG_DEAD R12:QI > > 7: r28:PSI=sign_extend(r25:HI)#0 > >REG_DEAD r25:HI > > Successfully matched this instruction: > > (set (reg:PSI 28 [ iD.1772 ]) > > (zero_extend:PSI (reg:QI 12 R12 [ iD.1772 ]))) allowing > > combination of insns 2 and 7 original costs 4 + 8 = 12 replacement > > cost 8 > > > > foo: MOV.B R12, R12 > > RLAM.A #1, R12 > > MOVX.W table(R12), R12 > > RETA > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline?
> > > > 2023-10-14 Roger Sayle > > > > gcc/ChangeLog > > PR rtl-optimization/91865 > > * combine.cc (make_compound_operation): Avoid creating a > > ZERO_EXTEND of a ZERO_EXTEND. > > > > gcc/testsuite/ChangeLog > > PR rtl-optimization/91865 > > * gcc.target/msp430/pr91865.c: New test case. > Neither an ACK nor a NAK at this point. > > The bug report includes a patch from Segher which purports to fix this in > simplify-rtx. Any thoughts on Segher's approach and whether or not it should be > considered? > > The BZ also indicates that removal of 2 patterns from msp430.md would solve > this > too (though it may cause regressions elsewhere?). Any thoughts on that > approach > as well? > Great questions. I believe Segher's proposed patch (in comment #4) was an msp430-specific proof-of-concept workaround rather than intended to be a fix. Eliminating a ZERO_EXTEND simply by changing the mode of a hard register is not a solution that'll work on many platforms (and therefore not really suitable for target-independent middle-end code in the RTL optimizers). For example, zero_extend:TI (and:QI (reg:QI hard_r1) (const_int 0x0f)) can't universally be reduced to (and:TI (reg:TI hard_r1) (const_int 0x0f)). Notice that Segher's code doesn't check TARGET_HARD_REGNO_MODE_OK or TARGET_MODES_TIEABLE_P or any of the other backend hooks necessary to confirm such a transformation is safe/possible. Secondly, the hard register aspect is a bit of a red herring. This work-around fixes the issue in the original BZ description, but not the slightly modified test case in comment #2 (with a global variable). This doesn't have a hard register, but does have the dubious ZERO_EXTEND/SIGN_EXTEND of a ZERO_EXTEND. The underlying issue, which is applicable to all targets, is that combine.cc's make_compound_operation is expected to reverse the local transformations made by expand_compound_operation.
Hence, if an RTL expression is canonical going into expand_compound_operation, it is expected (hoped) to be canonical (and equivalent) coming out of make_compound_operation. Hence, rather than being an MSP430-specific issue, no target should expect (or be expected to see) a ZERO_EXTEND of a ZERO_EXTEND, or a SIGN_EXTEND of a ZERO_EXTEND in the RTL stream. Much like a binary operator with two CONST_INT operands, or a shift by zero, it's something the middle-end might reasonably be expected to
RE: [ARC PATCH] Split asl dst, 1, src into bset dst, 0, src to implement 1<<x.
I've done it again. ENOPATCH. From: Roger Sayle Sent: 15 October 2023 09:13 To: 'gcc-patches@gcc.gnu.org' Cc: 'Claudiu Zissulescu' Subject: [ARC PATCH] Split asl dst,1,src into bset dst,0,src to implement 1<<x. > gcc/ChangeLog * config/arc/arc.md (*ashlsi3_1): New pre-reload splitter to use bset dst,0,src to implement 1<<x. diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md index a936a8b..22af0bf 100644 --- a/gcc/config/arc/arc.md +++ b/gcc/config/arc/arc.md @@ -3421,6 +3421,22 @@ archs4x, archs4xd" (set_attr "predicable" "no,no,yes,no,no") (set_attr "cond" "nocond,canuse,canuse,nocond,nocond")]) +;; Split asl dst,1,src into bset dst,0,src. +(define_insn_and_split "*ashlsi3_1" + [(set (match_operand:SI 0 "dest_reg_operand") + (ashift:SI (const_int 1) + (match_operand:SI 1 "nonmemory_operand")))] + "!TARGET_BARREL_SHIFTER + && arc_pre_reload_split ()" + "#" + "&& 1" + [(set (match_dup 0) + (ior:SI (ashift:SI (const_int 1) (match_dup 1)) + (const_int 0)))] + "" + [(set_attr "type" "shift") + (set_attr "length" "8")]) + (define_insn_and_split "*ashlsi3_nobs" [(set (match_operand:SI 0 "dest_reg_operand") (ashift:SI (match_operand:SI 1 "register_operand")
[ARC PATCH] Split asl dst, 1, src into bset dst, 0, src to implement 1<<x.
This patch adds a pre-reload splitter to arc.md, to use the bset (set specific bit instruction) to implement 1<<x. gcc/ChangeLog * config/arc/arc.md (*ashlsi3_1): New pre-reload splitter to use bset dst,0,src to implement 1<<x.
[PATCH] Improved RTL expansion of 1LL << x.
This patch improves the initial RTL expanded for double word shifts on architectures with conditional moves, so that later passes don't need to clean-up unnecessary and/or unused instructions. Consider the general case, x << y, which is expanded well as: t1 = y & 32; t2 = 0; t3 = x_lo >> 1; t4 = y ^ ~0; t5 = t3 >> t4; tmp_hi = x_hi << y; tmp_hi |= t5; tmp_lo = x_lo << y; out_hi = t1 ? tmp_lo : tmp_hi; out_lo = t1 ? t2 : tmp_lo; which is nearly optimal, the only thing that can be improved is that using a unary NOT operation "t4 = ~y" is better than XOR with -1, on targets that support it. [Note the one_cmpl_optab expander didn't fall back to XOR when this code was originally written, but has been improved since]. Now consider the relatively common idiom of 1LL << y, which currently produces the RTL equivalent of: t1 = y & 32; t2 = 0; t3 = 1 >> 1; t4 = y ^ ~0; t5 = t3 >> t4; tmp_hi = 0 << y; tmp_hi |= t5; tmp_lo = 1 << y; out_hi = t1 ? tmp_lo : tmp_hi; out_lo = t1 ? t2 : tmp_lo; Notice here that t3 is always zero, so the assignment of t5 is a variable shift of zero, which expands to a loop on many smaller targets, a similar shift by zero in the first tmp_hi assignment (another loop), that the value of t4 is no longer required (as t3 is zero), and that the ultimate value of tmp_hi is always zero. Fortunately, for many (but perhaps not all) targets this mess gets cleaned up by later optimization passes. However, this patch avoids generating unnecessary RTL at expand time, by calling simplify_expand_binop instead of expand_binop, and avoiding generating dead or unnecessary code when intermediate values are known to be zero. For the 1LL << y test case above, we now generate: t1 = y & 32; t2 = 0; tmp_hi = 0; tmp_lo = 1 << y; out_hi = t1 ? tmp_lo : tmp_hi; out_lo = t1 ? t2 : tmp_lo; On arc-elf, for example, there are 18 RTL INSN_P instructions generated by expand before this patch, but only 12 with this patch (improving both compile-time and memory usage). 
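The word-by-word expansion described above can be written out directly for a 64-bit left shift built from 32-bit words (a sketch: variable names follow the t1..t5 pseudocode, with shift counts masked to 5 bits as on targets that truncate shift counts to the word size):

```cpp
#include <cstdint>

// C++ rendering of the double-word shift-left expansion for
// 0 <= y < 64: compute both the "small shift" (y < 32) and "big shift"
// (y >= 32) candidates, then select with the y & 32 test, exactly as
// in the t1..t5 RTL sequence quoted above.
uint64_t shl64_by_parts (uint32_t x_lo, uint32_t x_hi, unsigned y)
{
  unsigned t1 = y & 32;            // does the count cross a word boundary?
  uint32_t y5 = y & 31;            // count truncated to the word size
  uint32_t t3 = x_lo >> 1;
  uint32_t t4 = ~y & 31;           // the ~y trick; equals 31 - (y & 31)
  uint32_t t5 = t3 >> t4;          // bits carried from low to high word
  uint32_t tmp_hi = (x_hi << y5) | t5;
  uint32_t tmp_lo = x_lo << y5;
  uint32_t out_hi = t1 ? tmp_lo : tmp_hi;
  uint32_t out_lo = t1 ? 0u : tmp_lo;
  return ((uint64_t) out_hi << 32) | out_lo;
}
```

Note how setting x_lo = 1 and x_hi = 0 makes t3, t5 and the x_hi << y5 term identically zero, which is precisely the dead code the patch avoids emitting for the 1LL << y idiom.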
This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-10-15 Roger Sayle gcc/ChangeLog * optabs.cc (expand_subword_shift): Call simplify_expand_binop instead of expand_binop. Optimize cases (i.e. avoid generating RTL) when CARRIES or INTO_INPUT is zero. Use one_cmpl_optab (i.e. NOT) instead of xor_optab with ~0 to calculate ~OP1. Thanks in advance, Roger -- diff --git a/gcc/optabs.cc b/gcc/optabs.cc index e1898da..f0a048a 100644 --- a/gcc/optabs.cc +++ b/gcc/optabs.cc @@ -533,15 +533,13 @@ expand_subword_shift (scalar_int_mode op1_mode, optab binoptab, has unknown behavior. Do a single shift first, then shift by the remainder. It's OK to use ~OP1 as the remainder if shift counts are truncated to the mode size. */ - carries = expand_binop (word_mode, reverse_unsigned_shift, - outof_input, const1_rtx, 0, unsignedp, methods); - if (shift_mask == BITS_PER_WORD - 1) - { - tmp = immed_wide_int_const - (wi::minus_one (GET_MODE_PRECISION (op1_mode)), op1_mode); - tmp = simplify_expand_binop (op1_mode, xor_optab, op1, tmp, - 0, true, methods); - } + carries = simplify_expand_binop (word_mode, reverse_unsigned_shift, + outof_input, const1_rtx, 0, + unsignedp, methods); + if (carries == const0_rtx) + tmp = const0_rtx; + else if (shift_mask == BITS_PER_WORD - 1) + tmp = expand_unop (op1_mode, one_cmpl_optab, op1, 0, true); else { tmp = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1, @@ -552,22 +550,29 @@ expand_subword_shift (scalar_int_mode op1_mode, optab binoptab, } if (tmp == 0 || carries == 0) return false; - carries = expand_binop (word_mode, reverse_unsigned_shift, - carries, tmp, 0, unsignedp, methods); + if (carries != const0_rtx && tmp != const0_rtx) +carries = simplify_expand_binop (word_mode, reverse_unsigned_shift, +carries, tmp, 0, unsignedp, methods); if (carries == 0) return false; - /* Shift INTO_INPUT logically by OP1. 
This is the last use of INTO_INPUT - so the result can go directly into INTO_TARGET if convenient. */ - tmp = expand_binop (word_mode, unsigned_shift, into_input, op1, - into_target, unsignedp, methods); - if (tmp == 0) -return false; + if (into_input != const0_rt
[PATCH] PR 91865: Avoid ZERO_EXTEND of ZERO_EXTEND in make_compound_operation.
This patch is my proposed solution to PR rtl-optimization/91865. Normally RTX simplification canonicalizes a ZERO_EXTEND of a ZERO_EXTEND to a single ZERO_EXTEND, but as shown in this PR it is possible for combine's make_compound_operation to unintentionally generate a non-canonical ZERO_EXTEND of a ZERO_EXTEND, which is unlikely to be matched by the backend. For the new test case: const int table[2] = {1, 2}; int foo (char i) { return table[i]; } compiling with -O2 -mlarge on msp430 we currently see: Trying 2 -> 7: 2: r25:HI=zero_extend(R12:QI) REG_DEAD R12:QI 7: r28:PSI=sign_extend(r25:HI)#0 REG_DEAD r25:HI Failed to match this instruction: (set (reg:PSI 28 [ iD.1772 ]) (zero_extend:PSI (zero_extend:HI (reg:QI 12 R12 [ iD.1772 ])))) which results in the following code: foo: AND #0xff, R12 RLAM.A #4, R12 { RRAM.A #4, R12 RLAM.A #1, R12 MOVX.W table(R12), R12 RETA With this patch, we now see: Trying 2 -> 7: 2: r25:HI=zero_extend(R12:QI) REG_DEAD R12:QI 7: r28:PSI=sign_extend(r25:HI)#0 REG_DEAD r25:HI Successfully matched this instruction: (set (reg:PSI 28 [ iD.1772 ]) (zero_extend:PSI (reg:QI 12 R12 [ iD.1772 ]))) allowing combination of insns 2 and 7 original costs 4 + 8 = 12 replacement cost 8 foo: MOV.B R12, R12 RLAM.A #1, R12 MOVX.W table(R12), R12 RETA This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-10-14 Roger Sayle gcc/ChangeLog PR rtl-optimization/91865 * combine.cc (make_compound_operation): Avoid creating a ZERO_EXTEND of a ZERO_EXTEND. gcc/testsuite/ChangeLog PR rtl-optimization/91865 * gcc.target/msp430/pr91865.c: New test case.
Thanks in advance, Roger -- diff --git a/gcc/combine.cc b/gcc/combine.cc index 360aa2f25e6..f47ff596782 100644 --- a/gcc/combine.cc +++ b/gcc/combine.cc @@ -8453,6 +8453,9 @@ make_compound_operation (rtx x, enum rtx_code in_code) new_rtx, GET_MODE (XEXP (x, 0))); if (tem) return tem; + /* Avoid creating a ZERO_EXTEND of a ZERO_EXTEND. */ + if (GET_CODE (new_rtx) == ZERO_EXTEND) + new_rtx = XEXP (new_rtx, 0); SUBST (XEXP (x, 0), new_rtx); return x; } diff --git a/gcc/testsuite/gcc.target/msp430/pr91865.c b/gcc/testsuite/gcc.target/msp430/pr91865.c new file mode 100644 index 000..8cc21c8b9e8 --- /dev/null +++ b/gcc/testsuite/gcc.target/msp430/pr91865.c @@ -0,0 +1,8 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mlarge" } */ + +const int table[2] = {1, 2}; +int foo (char i) { return table[i]; } + +/* { dg-final { scan-assembler-not "AND" } } */ +/* { dg-final { scan-assembler-not "RRAM" } } */
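The canonicalization the patch restores has a direct C++ analogue: a zero-extension of a zero-extension can never differ from a single zero-extension from the narrowest mode, since each widening step only prepends zero bits.

```cpp
#include <cstdint>

// zero_extend:SI (zero_extend:HI (reg:QI)) versus
// zero_extend:SI (reg:QI) -- the two are equivalent for every input,
// which is why combine should never leave the nested form in the
// RTL stream.
uint32_t zext_twice (uint8_t x)
{
  uint16_t mid = (uint16_t) x;    // zero_extend:HI (reg:QI)
  return (uint32_t) mid;          // zero_extend:SI (zero_extend:HI ...)
}

uint32_t zext_once (uint8_t x)
{
  return (uint32_t) x;            // zero_extend:SI (reg:QI)
}
```

The patch simply collapses the inner extension before SUBSTing, so the backend only ever has to match the single-extension form.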
[PATCH] Optimize (ne:SI (subreg:QI (ashift:SI x 7) 0) 0) as (and:SI x 1).
This patch is the middle-end piece of an improvement to PRs 101955 and 106245, that adds a missing simplification to the RTL optimizers. This transformation is to simplify (char)(x << 7) != 0 as x & 1. Technically, the cast can be any truncation, where shift is by one less than the narrower type's precision, setting the most significant (only) bit from the least significant bit. This transformation applies to any target, but it's easy to see (and add a new test case) on x86, where the following function: int f(int a) { return (a << 31) >> 31; } currently gets compiled with -O2 to: foo: movl %edi, %eax sall $7, %eax sarb $7, %al movsbl %al, %eax ret but with this patch, we now generate the slightly simpler: foo: movl %edi, %eax sall $31, %eax sarl $31, %eax ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check with no new failures. Ok for mainline? 2023-10-10 Roger Sayle gcc/ChangeLog PR middle-end/101955 PR tree-optimization/106245 * simplify-rtx.cc (simplify_relational_operation_1): Simplify the RTL (ne:SI (subreg:QI (ashift:SI x 7) 0) 0) to (and:SI x 1). gcc/testsuite/ChangeLog * gcc.target/i386/pr106245-1.c: New test case. Thanks in advance, Roger -- diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index bd9443d..69d8757 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -6109,6 +6109,23 @@ simplify_context::simplify_relational_operation_1 (rtx_code code, break; } + /* (ne:SI (subreg:QI (ashift:SI x 7) 0) 0) -> (and:SI x 1).
*/ + if (code == NE + && op1 == const0_rtx + && (op0code == TRUNCATE + || (partial_subreg_p (op0) + && subreg_lowpart_p (op0))) + && SCALAR_INT_MODE_P (mode) + && STORE_FLAG_VALUE == 1) +{ + rtx tmp = XEXP (op0, 0); + if (GET_CODE (tmp) == ASHIFT + && GET_MODE (tmp) == mode + && CONST_INT_P (XEXP (tmp, 1)) + && is_int_mode (GET_MODE (op0), &int_mode) + && INTVAL (XEXP (tmp, 1)) == GET_MODE_PRECISION (int_mode) - 1) + return simplify_gen_binary (AND, mode, XEXP (tmp, 0), const1_rtx); +} return NULL_RTX; } diff --git a/gcc/testsuite/gcc.target/i386/pr106245-1.c b/gcc/testsuite/gcc.target/i386/pr106245-1.c new file mode 100644 index 000..a0403e9 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106245-1.c @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +int f(int a) +{ +return (a << 31) >> 31; +} + +/* { dg-final { scan-assembler-not "sarb" } } */ +/* { dg-final { scan-assembler-not "movsbl" } } */
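The simplification being added can be spelled out in C++ (a sketch of the identity, not of the simplify-rtx.cc code itself): truncating x << 7 to 8 bits keeps only bit 0 of x, moved into the sign position, so testing the truncated value against zero is the same as x & 1.

```cpp
#include <cstdint>

// (ne:SI (subreg:QI (ashift:SI x 7) 0) 0) versus (and:SI x 1),
// written as two equivalent C++ functions.  The shift is done in
// unsigned arithmetic to stay well-defined for negative inputs.
int lsb_via_trunc_shift (int x)
{
  return (uint8_t) ((unsigned) x << 7) != 0;
}

int lsb_via_and (int x)
{
  return x & 1;
}
```

Since STORE_FLAG_VALUE is 1 on x86, the != comparison and the AND produce the same 0/1 result, which is the precondition checked in the patch.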
[ARC PATCH] Improved SImode shifts and rotates on !TARGET_BARREL_SHIFTER.
This patch completes the ARC back-end's transition to using pre-reload splitters for SImode shifts and rotates on targets without a barrel shifter. The core part is that the shift_si3 define_insn is no longer needed, as shifts and rotates that don't require a loop are split before reload, and then because shift_si3_loop is the only caller of output_shift, both can be significantly cleaned up and simplified. The output_shift function (Claudiu's "the elephant in the room") is renamed output_shift_loop, which handles just the four instruction zero-overhead loop implementations. Aside from the clean-ups, the user visible changes are much improved implementations of SImode shifts and rotates on affected targets. For the function: unsigned int rotr_1 (unsigned int x) { return (x >> 1) | (x << 31); } GCC with -O2 -mcpu=em would previously generate: rotr_1: lsr_s r2,r0 bmsk_s r0,r0,0 ror r0,r0 j_s.d [blink] or_s r0,r0,r2 with this patch, we now generate: j_s.d [blink] ror r0,r0 For the function: unsigned int rotr_31 (unsigned int x) { return (x >> 31) | (x << 1); } GCC with -O2 -mcpu=em would previously generate: rotr_31: mov_s r2,r0 ;4 asl_s r0,r0 add.f 0,r2,r2 rlc r2,0 j_s.d [blink] or_s r0,r0,r2 with this patch we now generate an add.f followed by an adc: rotr_31: add.f r0,r0,r0 j_s.d [blink] add.cs r0,r0,1 Shifts by constants requiring a loop have been improved for even counts by performing two operations in each iteration: int shl10(int x) { return x >> 10; } Previously looked like: shl10: mov.f lp_count, 10 lpnz 2f asr r0,r0 nop 2: # end single insn loop j_s [blink] And now becomes: shl10: mov lp_count,5 lp 2f asr r0,r0 asr r0,r0 2: # end single insn loop j_s [blink] So emulating ARC's SWAP on architectures that don't have it: unsigned int rotr_16 (unsigned int x) { return (x >> 16) | (x << 16); } previously required 10 instructions and ~70 cycles: rotr_16: mov_s r2,r0 ;4 mov.f lp_count, 16 lpnz 2f add r0,r0,r0 nop 2: # end single insn loop mov.f lp_count, 16 lpnz 2f lsr
r2,r2 nop 2: # end single insn loop j_s.d [blink] or_s r0,r0,r2 now becomes just 4 instructions and ~18 cycles: rotr_16: mov lp_count,8 lp 2f ror r0,r0 ror r0,r0 2: # end single insn loop j_s [blink] This patch has been tested with a cross-compiler to arc-linux hosted on x86_64-pc-linux-gnu and (partially) tested with the compile-only portions of the testsuite with no regressions. Ok for mainline, if your own testing shows no issues? 2023-10-07 Roger Sayle gcc/ChangeLog * config/arc/arc-protos.h (output_shift): Rename to... (output_shift_loop): Tweak API to take an explicit rtx_code. (arc_split_ashl): Prototype new function here. (arc_split_ashr): Likewise. (arc_split_lshr): Likewise. (arc_split_rotl): Likewise. (arc_split_rotr): Likewise. * config/arc/arc.cc (output_shift): Delete local prototype. Rename. (output_shift_loop): New function replacing output_shift to output a zero-overhead loop for SImode shifts and rotates on ARC targets without a barrel shifter (i.e. no hardware support for these insns). (arc_split_ashl): New helper function to split *ashlsi3_nobs. (arc_split_ashr): New helper function to split *ashrsi3_nobs. (arc_split_lshr): New helper function to split *lshrsi3_nobs. (arc_split_rotl): New helper function to split *rotlsi3_nobs. (arc_split_rotr): New helper function to split *rotrsi3_nobs. * config/arc/arc.md (any_shift_rotate): New define_code_iterator. (define_code_attr insn): New code attribute to map to pattern name. (si3): New expander unifying previous ashlsi3, ashrsi3 and lshrsi3 define_expands. Adds rotlsi3 and rotrsi3. (*si3_nobs): New define_insn_and_split that unifies the previous *ashlsi3_nobs, *ashrsi3_nobs and *lshrsi3_nobs. We now call arc_split_ in arc.cc to implement each split. (shift_si3): Delete define_insn, all shifts/rotates are now split. (shift_si3_loop): Rename to... (si3_loop): define_insn to handle loop implementations of SImode shifts and rotates, calling output_shift_loop for the template. (rotrsi3): Rename to... 
(*rotrsi3_insn): define_insn for TARGET_BARREL_SHIFTER's ror. (*rotlsi3): New define_insn_and_split to transform left rotates into right rotates before reload. (rotlsi3_cnt1): New define_insn_and_split to implement a le
RE: [X86 PATCH] Implement doubleword right shifts by 1 bit using s[ha]r+rcr.
Grr! I've done it again. ENOPATCH. > -Original Message- > From: Roger Sayle > Sent: 06 October 2023 14:58 > To: 'gcc-patches@gcc.gnu.org' > Cc: 'Uros Bizjak' > Subject: [X86 PATCH] Implement doubleword right shifts by 1 bit using s[ha]r+rcr. > > > This patch tweaks the i386 back-end's ix86_split_ashr and ix86_split_lshr > functions to implement doubleword right shifts by 1 bit, using a shift of the > highpart that sets the carry flag followed by a rotate-carry-right > (RCR) instruction on the lowpart. > > Conceptually this is similar to the recent left shift patch, but with two > complicating factors. The first is that although the RCR sequence is shorter, and is > a ~3x performance improvement on AMD, my micro-benchmarking shows it > ~10% slower on Intel. Hence this patch also introduces a new > X86_TUNE_USE_RCR tuning parameter. The second is that I believe this is the > first time a "rotate-right-through-carry" and a right shift that sets the carry flag > from the least significant bit has been modelled in GCC RTL (on a MODE_CC > target). For this I've used the i386 back-end's UNSPEC_CC_NE which seems > appropriate. Finally rcrsi2 and rcrdi2 are separate define_insns so that we can > use their generator functions. > > For the pair of functions: > unsigned __int128 foo(unsigned __int128 x) { return x >> 1; } > __int128 bar(__int128 x) { return x >> 1; } > > with -O2 -march=znver4 we previously generated: > > foo:movq%rdi, %rax > movq%rsi, %rdx > shrdq $1, %rsi, %rax > shrq%rdx > ret > bar:movq%rdi, %rax > movq%rsi, %rdx > shrdq $1, %rsi, %rax > sarq%rdx > ret > > with this patch we now generate: > > foo:movq%rsi, %rdx > movq%rdi, %rax > shrq%rdx > rcrq%rax > ret > bar:movq%rsi, %rdx > movq%rdi, %rax > sarq%rdx > rcrq%rax > ret > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and > make -k check, both with and without --target_board=unix{-m32} with no new > failures. 
And to provide additional testing, I've also bootstrapped and regression > tested a version of this patch where the RCR is always generated (independent of > the -march target) again with no regressions. Ok for mainline? > > > 2023-10-06 Roger Sayle > > gcc/ChangeLog > * config/i386/i386-expand.c (ix86_split_ashr): Split shifts by > one into ashr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR > or -Oz. > (ix86_split_lshr): Likewise, split shifts by one bit into > lshr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR or -Oz. > * config/i386/i386.h (TARGET_USE_RCR): New backend macro. > * config/i386/i386.md (rcrsi2): New define_insn for rcrl. > (rcrdi2): New define_insn for rcrq. > (3_carry): New define_insn for right shifts that > set the carry flag from the least significant bit, modelled using > UNSPEC_CC_NE. > * config/i386/x86-tune.def (X86_TUNE_USE_RCR): New tuning parameter > controlling use of rcr 1 vs. shrd, which is significantly faster on > AMD processors. > > gcc/testsuite/ChangeLog > * gcc.target/i386/rcr-1.c: New 64-bit test case. > * gcc.target/i386/rcr-2.c: New 32-bit test case. > > > Thanks in advance, > Roger > -- diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index e42ff27..399eb8e 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -6496,6 +6496,22 @@ ix86_split_ashr (rtx *operands, rtx scratch, machine_mode mode) emit_insn (gen_ashr3 (low[0], low[0], GEN_INT (count - half_width))); } + else if (count == 1 + && (TARGET_USE_RCR || optimize_size > 1)) + { + if (!rtx_equal_p (operands[0], operands[1])) + emit_move_insn (operands[0], operands[1]); + if (mode == DImode) + { + emit_insn (gen_ashrsi3_carry (high[0], high[0])); + emit_insn (gen_rcrsi2 (low[0], low[0])); + } + else + { + emit_insn (gen_ashrdi3_carry (high[0], high[0])); + emit_insn (gen_rcrdi2 (low[0], low[0])); + } + } else { gen_shrd = mode == DImode ? 
gen_x86_shrd : gen_x86_64_shrd; @@ -6561,6 +6577,22 @@ ix86_split_lshr (rtx *operands, rtx scratch, machine_mode mode) emit_insn (gen_lshr3 (low[0], low[0], GEN_INT (count - ha
[X86 PATCH] Implement doubleword right shifts by 1 bit using s[ha]r+rcr.
This patch tweaks the i386 back-end's ix86_split_ashr and ix86_split_lshr functions to implement doubleword right shifts by 1 bit, using a shift of the highpart that sets the carry flag followed by a rotate-carry-right (RCR) instruction on the lowpart. Conceptually this is similar to the recent left shift patch, but with two complicating factors. The first is that although the RCR sequence is shorter, and is a ~3x performance improvement on AMD, my micro-benchmarking shows it ~10% slower on Intel. Hence this patch also introduces a new X86_TUNE_USE_RCR tuning parameter. The second is that I believe this is the first time a "rotate-right-through-carry" and a right shift that sets the carry flag from the least significant bit has been modelled in GCC RTL (on a MODE_CC target). For this I've used the i386 back-end's UNSPEC_CC_NE which seems appropriate. Finally rcrsi2 and rcrdi2 are separate define_insns so that we can use their generator functions. For the pair of functions: unsigned __int128 foo(unsigned __int128 x) { return x >> 1; } __int128 bar(__int128 x) { return x >> 1; } with -O2 -march=znver4 we previously generated: foo:movq%rdi, %rax movq%rsi, %rdx shrdq $1, %rsi, %rax shrq%rdx ret bar:movq%rdi, %rax movq%rsi, %rdx shrdq $1, %rsi, %rax sarq%rdx ret with this patch we now generate: foo:movq%rsi, %rdx movq%rdi, %rax shrq%rdx rcrq%rax ret bar:movq%rsi, %rdx movq%rdi, %rax sarq%rdx rcrq%rax ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. And to provide additional testing, I've also bootstrapped and regression tested a version of this patch where the RCR is always generated (independent of the -march target) again with no regressions. Ok for mainline? 2023-10-06 Roger Sayle gcc/ChangeLog * config/i386/i386-expand.c (ix86_split_ashr): Split shifts by one into ashr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR or -Oz. 
(ix86_split_lshr): Likewise, split shifts by one bit into lshr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR or -Oz. * config/i386/i386.h (TARGET_USE_RCR): New backend macro. * config/i386/i386.md (rcrsi2): New define_insn for rcrl. (rcrdi2): New define_insn for rcrq. (3_carry): New define_insn for right shifts that set the carry flag from the least significant bit, modelled using UNSPEC_CC_NE. * config/i386/x86-tune.def (X86_TUNE_USE_RCR): New tuning parameter controlling use of rcr 1 vs. shrd, which is significantly faster on AMD processors. gcc/testsuite/ChangeLog * gcc.target/i386/rcr-1.c: New 64-bit test case. * gcc.target/i386/rcr-2.c: New 32-bit test case. Thanks in advance, Roger --
RE: [X86 PATCH] Split lea into shorter left shift by 2 or 3 bits with -Oz.
Hi Uros, Very many thanks for the speedy reviews. Uros Bizjak wrote: > On Thu, Oct 5, 2023 at 11:06 AM Roger Sayle > wrote: > > > > > > This patch avoids long lea instructions for performing x<<2 and x<<3 > > by splitting them into shorter sal and move (or xchg instructions). > > Because this increases the number of instructions, but reduces the > > total size, its suitable for -Oz (but not -Os). > > > > The impact can be seen in the new test case: > > > > int foo(int x) { return x<<2; } > > int bar(int x) { return x<<3; } > > long long fool(long long x) { return x<<2; } long long barl(long long > > x) { return x<<3; } > > > > where with -O2 we generate: > > > > foo:lea0x0(,%rdi,4),%eax// 7 bytes > > retq > > bar:lea0x0(,%rdi,8),%eax// 7 bytes > > retq > > fool: lea0x0(,%rdi,4),%rax// 8 bytes > > retq > > barl: lea0x0(,%rdi,8),%rax// 8 bytes > > retq > > > > and with -Oz we now generate: > > > > foo:xchg %eax,%edi// 1 byte > > shl$0x2,%eax// 3 bytes > > retq > > bar:xchg %eax,%edi// 1 byte > > shl$0x3,%eax// 3 bytes > > retq > > fool: xchg %rax,%rdi// 2 bytes > > shl$0x2,%rax// 4 bytes > > retq > > barl: xchg %rax,%rdi// 2 bytes > > shl$0x3,%rax// 4 bytes > > retq > > > > Over the entirety of the CSiBE code size benchmark this saves 1347 > > bytes (0.037%) for x86_64, and 1312 bytes (0.036%) with -m32. > > Conveniently, there's already a backend function in i386.cc for > > deciding whether to split an lea into its component instructions, > > ix86_avoid_lea_for_addr, all that's required is an additional clause > > checking for -Oz (i.e. optimize_size > 1). > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board='unix{-m32}' > > with no new failures. 
Additional testing was performed by repeating > > these steps after removing the "optimize_size > 1" condition, so that > > suitable lea instructions were always split [-Oz is not heavily > > tested, so this invoked the new code during the bootstrap and > > regression testing], again with no regressions. Ok for mainline? > > > > > > 2023-10-05 Roger Sayle > > > > gcc/ChangeLog > > * config/i386/i386.cc (ix86_avoid_lea_for_addr): Split LEAs used > > to perform left shifts into shorter instructions with -Oz. > > > > gcc/testsuite/ChangeLog > > * gcc.target/i386/lea-2.c: New test case. > > > > OK, but ... > > @@ -0,0 +1,7 @@ > +/* { dg-do compile { target { ! ia32 } } } */ > > Is there a reason to avoid 32-bit targets? I'd expect that the optimization > also > triggers on x86_32 for 32bit integers. Good catch. You're 100% correct; because the test case just checks that an LEA is not used, and not for the specific sequence of shift instructions used instead, this test also passes with --target_board='unix{-m32}'. I'll remove the target clause from the dg-do compile directive. > +/* { dg-options "-Oz" } */ > +int foo(int x) { return x<<2; } > +int bar(int x) { return x<<3; } > +long long fool(long long x) { return x<<2; } long long barl(long long > +x) { return x<<3; } > +/* { dg-final { scan-assembler-not "lea\[lq\]" } } */ Thanks again. Roger --
RE: [X86 PATCH] Implement doubleword shift left by 1 bit using add+adc.
Doh! ENOPATCH. > -Original Message- > From: Roger Sayle > Sent: 05 October 2023 12:44 > To: 'gcc-patches@gcc.gnu.org' > Cc: 'Uros Bizjak' > Subject: [X86 PATCH] Implement doubleword shift left by 1 bit using add+adc. > > > This patch tweaks the i386 back-end's ix86_split_ashl to implement doubleword > left shifts by 1 bit, using an add followed by an add-with-carry (i.e. a doubleword > x+x) instead of using the x86's shld instruction. > The replacement sequence both requires fewer bytes and is faster on both Intel > and AMD architectures (from Agner Fog's latency tables and confirmed by my > own microbenchmarking). > > For the test case: > __int128 foo(__int128 x) { return x << 1; } > > with -O2 we previously generated: > > foo:movq%rdi, %rax > movq%rsi, %rdx > shldq $1, %rdi, %rdx > addq%rdi, %rax > ret > > with this patch we now generate: > > foo:movq%rdi, %rax > movq%rsi, %rdx > addq%rdi, %rax > adcq%rsi, %rdx > ret > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and > make -k check, both with and without --target_board=unix{-m32} with no new > failures. Ok for mainline? > > > 2023-10-05 Roger Sayle > > gcc/ChangeLog > * config/i386/i386-expand.cc (ix86_split_ashl): Split shifts by > one into add3_cc_overflow_1 followed by add3_carry. > * config/i386/i386.md (@add3_cc_overflow_1): Renamed from > "*add3_cc_overflow_1" to provide generator function. > > gcc/testsuite/ChangeLog > * gcc.target/i386/ashldi3-2.c: New 32-bit test case. > * gcc.target/i386/ashlti3-3.c: New 64-bit test case. 
> > > Thanks in advance, > Roger > -- diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index e42ff27..09e41c8 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -6342,6 +6342,18 @@ ix86_split_ashl (rtx *operands, rtx scratch, machine_mode mode) if (count > half_width) ix86_expand_ashl_const (high[0], count - half_width, mode); } + else if (count == 1) + { + if (!rtx_equal_p (operands[0], operands[1])) + emit_move_insn (operands[0], operands[1]); + rtx x3 = gen_rtx_REG (CCCmode, FLAGS_REG); + rtx x4 = gen_rtx_LTU (mode, x3, const0_rtx); + half_mode = mode == DImode ? SImode : DImode; + emit_insn (gen_add3_cc_overflow_1 (half_mode, low[0], +low[0], low[0])); + emit_insn (gen_add3_carry (half_mode, high[0], high[0], high[0], +x3, x4)); + } else { gen_shld = mode == DImode ? gen_x86_shld : gen_x86_64_shld; diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index eef8a0e..6a5bc16 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -8864,7 +8864,7 @@ [(set_attr "type" "alu") (set_attr "mode" "")]) -(define_insn "*add3_cc_overflow_1" +(define_insn "@add3_cc_overflow_1" [(set (reg:CCC FLAGS_REG) (compare:CCC (plus:SWI diff --git a/gcc/testsuite/gcc.target/i386/ashldi3-2.c b/gcc/testsuite/gcc.target/i386/ashldi3-2.c new file mode 100644 index 000..053389d --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/ashldi3-2.c @@ -0,0 +1,10 @@ +/* { dg-do compile { target ia32 } } */ +/* { dg-options "-O2 -mno-stv" } */ + +long long foo(long long x) +{ + return x << 1; +} + +/* { dg-final { scan-assembler "adcl" } } */ +/* { dg-final { scan-assembler-not "shldl" } } */ diff --git a/gcc/testsuite/gcc.target/i386/ashlti3-3.c b/gcc/testsuite/gcc.target/i386/ashlti3-3.c new file mode 100644 index 000..4f14ca0 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/ashlti3-3.c @@ -0,0 +1,10 @@ +/* { dg-do compile { target int128 } } */ +/* { dg-options "-O2" } */ + +__int128 foo(__int128 x) +{ + return x 
<< 1; +} + +/* { dg-final { scan-assembler "adcq" } } */ +/* { dg-final { scan-assembler-not "shldq" } } */
[X86 PATCH] Implement doubleword shift left by 1 bit using add+adc.
This patch tweaks the i386 back-end's ix86_split_ashl to implement doubleword left shifts by 1 bit, using an add followed by an add-with-carry (i.e. a doubleword x+x) instead of using the x86's shld instruction. The replacement sequence both requires fewer bytes and is faster on both Intel and AMD architectures (from Agner Fog's latency tables and confirmed by my own microbenchmarking). For the test case: __int128 foo(__int128 x) { return x << 1; } with -O2 we previously generated: foo:movq%rdi, %rax movq%rsi, %rdx shldq $1, %rdi, %rdx addq%rdi, %rax ret with this patch we now generate: foo:movq%rdi, %rax movq%rsi, %rdx addq%rdi, %rax adcq%rsi, %rdx ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-10-05 Roger Sayle gcc/ChangeLog * config/i386/i386-expand.cc (ix86_split_ashl): Split shifts by one into add3_cc_overflow_1 followed by add3_carry. * config/i386/i386.md (@add3_cc_overflow_1): Renamed from "*add3_cc_overflow_1" to provide generator function. gcc/testsuite/ChangeLog * gcc.target/i386/ashldi3-2.c: New 32-bit test case. * gcc.target/i386/ashlti3-3.c: New 64-bit test case. Thanks in advance, Roger --
[X86 PATCH] Split lea into shorter left shift by 2 or 3 bits with -Oz.
This patch avoids long lea instructions for performing x<<2 and x<<3 by splitting them into shorter sal and move (or xchg) instructions. Because this increases the number of instructions, but reduces the total size, it's suitable for -Oz (but not -Os). The impact can be seen in the new test case: int foo(int x) { return x<<2; } int bar(int x) { return x<<3; } long long fool(long long x) { return x<<2; } long long barl(long long x) { return x<<3; } where with -O2 we generate: foo: lea 0x0(,%rdi,4),%eax // 7 bytes retq bar: lea 0x0(,%rdi,8),%eax // 7 bytes retq fool: lea 0x0(,%rdi,4),%rax // 8 bytes retq barl: lea 0x0(,%rdi,8),%rax // 8 bytes retq and with -Oz we now generate: foo: xchg %eax,%edi // 1 byte shl $0x2,%eax // 3 bytes retq bar: xchg %eax,%edi // 1 byte shl $0x3,%eax // 3 bytes retq fool: xchg %rax,%rdi // 2 bytes shl $0x2,%rax // 4 bytes retq barl: xchg %rax,%rdi // 2 bytes shl $0x3,%rax // 4 bytes retq Over the entirety of the CSiBE code size benchmark this saves 1347 bytes (0.037%) for x86_64, and 1312 bytes (0.036%) with -m32. Conveniently, there's already a backend function in i386.cc for deciding whether to split an lea into its component instructions, ix86_avoid_lea_for_addr; all that's required is an additional clause checking for -Oz (i.e. optimize_size > 1). This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board='unix{-m32}' with no new failures. Additional testing was performed by repeating these steps after removing the "optimize_size > 1" condition, so that suitable lea instructions were always split [-Oz is not heavily tested, so this invoked the new code during the bootstrap and regression testing], again with no regressions. Ok for mainline? 2023-10-05 Roger Sayle gcc/ChangeLog * config/i386/i386.cc (ix86_avoid_lea_for_addr): Split LEAs used to perform left shifts into shorter instructions with -Oz. gcc/testsuite/ChangeLog * gcc.target/i386/lea-2.c: New test case. 
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index 477e6ce..9557bff 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -15543,6 +15543,13 @@ ix86_avoid_lea_for_addr (rtx_insn *insn, rtx operands[]) && (regno0 == regno1 || regno0 == regno2)) return true; + /* Split with -Oz if the encoding requires fewer bytes. */ + if (optimize_size > 1 + && parts.scale > 1 + && !parts.base + && (!parts.disp || parts.disp == const0_rtx)) +return true; + /* Check we need to optimize. */ if (!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun)) return false; diff --git a/gcc/testsuite/gcc.target/i386/lea-2.c b/gcc/testsuite/gcc.target/i386/lea-2.c new file mode 100644 index 000..20aded8 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/lea-2.c @@ -0,0 +1,7 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-Oz" } */ +int foo(int x) { return x<<2; } +int bar(int x) { return x<<3; } +long long fool(long long x) { return x<<2; } +long long barl(long long x) { return x<<3; } +/* { dg-final { scan-assembler-not "lea\[lq\]" } } */
[PATCH] Support g++ 4.8 as a host compiler.
The recent patch to remove poly_int_pod triggers a bug in g++ 4.8.5's C++11 support, which mistakenly believes poly_uint16 has a non-trivial constructor. This in turn prohibits it from being used as a member in a union (rtxunion) that is constructed statically, resulting in a (fatal) error during stage 1. A workaround is to add an explicit constructor to the problematic union, which allows mainline to be bootstrapped with the system compiler on older RedHat 7 systems. This patch has been tested on x86_64-pc-linux-gnu where it allows a bootstrap to complete when using g++ 4.8.5 as the host compiler. Ok for mainline? 2023-10-04 Roger Sayle gcc/ChangeLog * rtl.h (rtx_def::u): Add explicit constructor to work around an issue using g++ 4.8 as a host compiler. diff --git a/gcc/rtl.h b/gcc/rtl.h index 6850281..a7667f5 100644 --- a/gcc/rtl.h +++ b/gcc/rtl.h @@ -451,6 +451,9 @@ struct GTY((desc("0"), tag("0"), struct fixed_value fv; struct hwivec_def hwiv; struct const_poly_int_def cpi; +#if defined(__GNUC__) && GCC_VERSION < 5000 +u () {} +#endif } GTY ((special ("rtx_def"), desc ("GET_CODE (&%0)"))) u; };
PING: PR rtl-optimization/110701
There are a small handful of middle-end maintainers/reviewers who understand and appreciate the difference between the RTL statements: (set (subreg:HI (reg:SI x)) (reg:HI y)) and (set (strict_low_part (subreg:HI (reg:SI x))) (reg:HI y)) If one (or more) of them could please take a look at https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625532.html I'd very much appreciate it (one less wrong-code regression). Many thanks in advance, Roger --
RE: [ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER.
Hi Claudiu, Thanks for the answers to my technical questions. If you'd prefer to update arc.md's add3 pattern first, I'm happy to update/revise my patch based on this and your feedback, for example preferring add over asl_s (or controlling this choice with -Os). Thanks again. Roger -- > -Original Message- > From: Claudiu Zissulescu > Sent: 03 October 2023 15:26 > To: Roger Sayle ; gcc-patches@gcc.gnu.org > Subject: RE: [ARC PATCH] Split SImode shifts pre-reload on > !TARGET_BARREL_SHIFTER. > > Hi Roger, > > It was nice to meet you too. > > Thank you in looking into the ARC's non-Barrel Shifter configurations. I will dive > into your patch asap, but before starting here are a few of my comments: > > -Original Message- > From: Roger Sayle > Sent: Thursday, September 28, 2023 2:27 PM > To: gcc-patches@gcc.gnu.org > Cc: Claudiu Zissulescu > Subject: [ARC PATCH] Split SImode shifts pre-reload on > !TARGET_BARREL_SHIFTER. > > > Hi Claudiu, > It was great meeting up with you and the Synopsys ARC team at the GNU tools > Cauldron in Cambridge. > > This patch is the first in a series to improve SImode and DImode shifts and rotates > in the ARC backend. This first piece splits SImode shifts, for > !TARGET_BARREL_SHIFTER targets, after combine and before reload, in the split1 > pass, as suggested by the FIXME comment above output_shift in arc.cc. To do > this I've copied the implementation of the x86_pre_reload_split function from > i386 backend, and renamed it arc_pre_reload_split. > > Although the actual implementations of shifts remain the same (as in > output_shift), having them as explicit instructions in the RTL stream allows better > scheduling and use of compact forms when available. The benefits can be seen in > two short examples below. 
> > For the function: > unsigned int foo(unsigned int x, unsigned int y) { > return y << 2; > } > > GCC with -O2 -mcpu=em would previously generate: > foo:add r1,r1,r1 > add r1,r1,r1 > j_s.d [blink] > mov_s r0,r1 ;4 > > [CZI] The move shouldn't be generated indeed. The use of ADDs are slightly > beneficial for older ARCv1 arches. > > and with this patch now generates: > foo:asl_s r0,r1 > j_s.d [blink] > asl_s r0,r0 > > [CZI] Nice. This new sequence is as fast as we can get for our ARCv2 cpus. > > Notice the original (from shift_si3's output_shift) requires the shift sequence to be > monolithic with the same destination register as the source (requiring an extra > mov_s). The new version can eliminate this move, and schedule the second asl in > the branch delay slot of the return. > > For the function: > int x,y,z; > > void bar() > { > x <<= 3; > y <<= 3; > z <<= 3; > } > > GCC -O2 -mcpu=em currently generates: > bar:push_s r13 > ld.as r12,[gp,@x@sda] ;23 > ld.as r3,[gp,@y@sda] ;23 > mov r2,0 > add3 r12,r2,r12 > mov r2,0 > add3 r3,r2,r3 > ld.as r2,[gp,@z@sda] ;23 > st.as r12,[gp,@x@sda] ;26 > mov r13,0 > add3 r2,r13,r2 > st.as r3,[gp,@y@sda] ;26 > st.as r2,[gp,@z@sda] ;26 > j_s.d [blink] > pop_s r13 > > where each shift by 3, uses ARC's add3 instruction, which is similar to x86's lea > implementing x = (y<<3) + z, but requires the value zero to be placed in a > temporary register "z". Splitting this before reload allows these pseudos to be > shared/reused. 
With this patch, we get > > bar:ld.as r2,[gp,@x@sda] ;23 > mov_s r3,0;3 > add3r2,r3,r2 > ld.as r3,[gp,@y@sda] ;23 > st.as r2,[gp,@x@sda] ;26 > ld.as r2,[gp,@z@sda] ;23 > mov_s r12,0 ;3 > add3r3,r12,r3 > add3r2,r12,r2 > st.as r3,[gp,@y@sda] ;26 > st.as r2,[gp,@z@sda] ;26 > j_s [blink] > > [CZI] Looks great, but it also shows that I've forgot to add to ADD3 instruction the > Ra,LIMM,RC variant, which will lead to have instead of > mov_s r3,0;3 > add3r2,r3,r2 > Only this add3,0,r2, Indeed it is longer instruction but faster. > > Unfortunately, register allocation means that we only share two of the three > "mov_s z,0", but this is sufficient to reduce register pressure enough to avoid > spilling r13 in the prologue/epilogue. > > This patch also contains a (latent?) bug fix. The implementation of the default > insn "length" attribute, assumes instructions of type "shift" have two inpu
RE: [ARC PATCH] Use rlc r0, 0 to implement scc_ltu (i.e. carry_flag ? 1 : 0)
Hi Claudiu, > The patch looks sane. Have you run dejagnu test suite? I've not yet managed to set up an emulator or compile the entire toolchain, so my dejagnu results are only useful for catching (serious) problems in the compile-only tests: === gcc Summary === # of expected passes 91875 # of unexpected failures 23768 # of unexpected successes 23 # of expected failures 1038 # of unresolved testcases 19490 # of unsupported tests 3819 /home/roger/GCC/arc-linux/gcc/xgcc version 14.0.0 20230828 (experimental) (GCC) If someone could double-check there are no issues on real hardware that would be great. I'm not sure if ARC is one of the targets covered by Jeff Law's compile farm? > -Original Message- > From: Roger Sayle > Sent: Friday, September 29, 2023 6:54 PM > To: gcc-patches@gcc.gnu.org > Cc: Claudiu Zissulescu > Subject: [ARC PATCH] Use rlc r0,0 to implement scc_ltu (i.e. carry_flag ? 1 : 0) > > > This patch teaches the ARC backend that the contents of the carry flag can be > placed in an integer register conveniently using the "rlc rX,0" > instruction, which is a rotate-left-through-carry using zero as a source. > This is a convenient special case for the LTU form of the scc pattern. > > unsigned int foo(unsigned int x, unsigned int y) { > return (x+y) < x; > } > > With -O2 -mcpu=em this is currently compiled to: > > foo:add.f 0,r0,r1 > mov_s r0,1;3 > j_s.d [blink] > mov.hs r0,0 > > [which after an addition to set the carry flag, sets r0 to 1, followed by a > conditional assignment of r0 to zero if the carry flag is clear]. With the new > define_insn/optimization in this patch, this becomes: > > foo:add.f 0,r0,r1 > j_s.d [blink] > rlc r0,0 > > This define_insn is also a useful building block for implementing shifts and rotates. > > Tested on a cross-compiler to arc-linux (hosted on x86_64-pc-linux-gnu), and a > partial tool chain, where the new case passes and there are no new regressions. > Ok for mainline? 
> > > 2023-09-29 Roger Sayle > > gcc/ChangeLog > * config/arc/arc.md (CC_ltu): New mode iterator for CC and CC_C. > (scc_ltu_): New define_insn to handle LTU form of scc_insn. > (*scc_insn): Don't split to a conditional move sequence for LTU. > > gcc/testsuite/ChangeLog > * gcc.target/arc/scc-ltu.c: New test case. > > > Thanks in advance, > Roger > --
[ARC PATCH] Use rlc r0, 0 to implement scc_ltu (i.e. carry_flag ? 1 : 0)
This patch teaches the ARC backend that the contents of the carry flag can be placed in an integer register conveniently using the "rlc rX,0" instruction, which is a rotate-left-through-carry using zero as a source. This is a convenient special case for the LTU form of the scc pattern. unsigned int foo(unsigned int x, unsigned int y) { return (x+y) < x; } With -O2 -mcpu=em this is currently compiled to: foo:add.f 0,r0,r1 mov_s r0,1;3 j_s.d [blink] mov.hs r0,0 [which after an addition to set the carry flag, sets r0 to 1, followed by a conditional assignment of r0 to zero if the carry flag is clear]. With the new define_insn/optimization in this patch, this becomes: foo:add.f 0,r0,r1 j_s.d [blink] rlc r0,0 This define_insn is also a useful building block for implementing shifts and rotates. Tested on a cross-compiler to arc-linux (hosted on x86_64-pc-linux-gnu), and a partial tool chain, where the new case passes and there are no new regressions. Ok for mainline? 2023-09-29 Roger Sayle gcc/ChangeLog * config/arc/arc.md (CC_ltu): New mode iterator for CC and CC_C. (scc_ltu_): New define_insn to handle LTU form of scc_insn. (*scc_insn): Don't split to a conditional move sequence for LTU. gcc/testsuite/ChangeLog * gcc.target/arc/scc-ltu.c: New test case. 
Thanks in advance, Roger -- diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md index d37ecbf..fe2e7fb 100644 --- a/gcc/config/arc/arc.md +++ b/gcc/config/arc/arc.md @@ -3658,12 +3658,24 @@ archs4x, archs4xd" (define_expand "scc_insn" [(set (match_operand:SI 0 "dest_reg_operand" "=w") (match_operand:SI 1 ""))]) +(define_mode_iterator CC_ltu [CC_C CC]) + +(define_insn "scc_ltu_" + [(set (match_operand:SI 0 "dest_reg_operand" "=w") +(ltu:SI (reg:CC_ltu CC_REG) (const_int 0)))] + "" + "rlc\\t%0,0" + [(set_attr "type" "shift") + (set_attr "predicable" "no") + (set_attr "length" "4")]) + (define_insn_and_split "*scc_insn" [(set (match_operand:SI 0 "dest_reg_operand" "=w") (match_operator:SI 1 "proper_comparison_operator" [(reg CC_REG) (const_int 0)]))] "" "#" - "reload_completed" + "reload_completed + && GET_CODE (operands[1]) != LTU" [(set (match_dup 0) (const_int 1)) (cond_exec (match_dup 1) diff --git a/gcc/testsuite/gcc.target/arc/scc-ltu.c b/gcc/testsuite/gcc.target/arc/scc-ltu.c new file mode 100644 index 000..653c55d --- /dev/null +++ b/gcc/testsuite/gcc.target/arc/scc-ltu.c @@ -0,0 +1,12 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mcpu=em" } */ + +unsigned int foo(unsigned int x, unsigned int y) +{ + return (x+y) < x; +} + +/* { dg-final { scan-assembler "rlc\\s+r0,0" } } */ +/* { dg-final { scan-assembler "add.f\\s+0,r0,r1" } } */ +/* { dg-final { scan-assembler-not "mov_s\\s+r0,1" } } */ +/* { dg-final { scan-assembler-not "mov\.hs\\s+r0,0" } } */
RE: [RFC] expr: don't clear SUBREG_PROMOTED_VAR_P flag for a promoted subreg [target/111466]
I agree that this looks dubious. Normally, if the middle-end/optimizers wish to reuse a SUBREG in a context where the flags are not valid, it should create a new one with the desired flags, rather than "mutate" an existing (and possibly shared) RTX. I wonder if creating a new SUBREG here also fixes your problem? I'm not sure that clearing SUBREG_PROMOTED_VAR_P is needed at all, but given its motivation has been lost to history, it would good to have a plan B, if Jeff's alpha testing uncovers a subtle issue. Roger -- > -Original Message- > From: Vineet Gupta > Sent: 28 September 2023 22:44 > To: gcc-patches@gcc.gnu.org; Robin Dapp > Cc: kito.ch...@gmail.com; Jeff Law ; Palmer Dabbelt > ; gnu-toolch...@rivosinc.com; Roger Sayle > ; Jakub Jelinek ; Jivan > Hakobyan ; Vineet Gupta > Subject: [RFC] expr: don't clear SUBREG_PROMOTED_VAR_P flag for a promoted > subreg [target/111466] > > RISC-V suffers from extraneous sign extensions, despite/given the ABI guarantee > that 32-bit quantities are sign-extended into 64-bit registers, meaning incoming SI > function args need not be explicitly sign extended (so do SI return values as most > ALU insns implicitly sign-extend too.) > > Existing REE doesn't seem to handle this well and there are various ideas floating > around to smarten REE about it. > > RISC-V also seems to correctly implement middle-end hook PROMOTE_MODE > etc. > > Another approach would be to prevent EXPAND from generating the sign_extend > in the first place which this patch tries to do. > > The hunk being removed was introduced way back in 1994 as >5069803972 ("expand_expr, case CONVERT_EXPR .. clear the promotion flag") > > This survived full testsuite run for RISC-V rv64gc with surprisingly no > fallouts: test results before/after are exactly same. 
>
> |                                | # of unexpected case / # of unique unexpected case
> |                                | gcc      | g++   | gfortran |
> | rv64imafdc_zba_zbb_zbs_zicond/ | 264 / 87 | 5 / 2 | 72 / 12  |
> |   lp64d/medlow
>
> Granted, for something so old to have survived, there must be a valid
> reason. Unfortunately the original change didn't have additional
> commentary or a test case. That is not to say it can't/won't possibly
> break things on other arches/ABIs, hence the RFC for someone to scream
> that this is just bonkers, don't do this :-)
>
> I've explicitly CC'ed Jakub and Roger, who have last touched subreg
> promoted notes in expr.cc, for insight and/or screaming ;-)
>
> Thanks to Robin for narrowing this down in an amazing debugging session
> @ GNU Cauldron.
>
> ```
> foo2:
>         sext.w  a6,a1    <-- this goes away
>         beq     a1,zero,.L4
>         li      a5,0
>         li      a0,0
> .L3:
>         addw    a4,a2,a5
>         addw    a5,a3,a5
>         addw    a0,a4,a0
>         bltu    a5,a6,.L3
>         ret
> .L4:
>         li      a0,0
>         ret
> ```
>
> Signed-off-by: Vineet Gupta
> Co-developed-by: Robin Dapp
> ---
> gcc/expr.cc                               |  7 ---
> gcc/testsuite/gcc.target/riscv/pr111466.c | 15 +++
> 2 files changed, 15 insertions(+), 7 deletions(-)
> create mode 100644 gcc/testsuite/gcc.target/riscv/pr111466.c
>
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index 308ddc09e631..d259c6e53385 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -9332,13 +9332,6 @@ expand_expr_real_2 (sepops ops, rtx target, machine_mode tmode,
>           op0 = expand_expr (treeop0, target, VOIDmode, modifier);
>
> -         /* If the signedness of the conversion differs and OP0 is
> -            a promoted SUBREG, clear that indication since we now
> -            have to do the proper extension.
> -            */
> -         if (TYPE_UNSIGNED (TREE_TYPE (treeop0)) != unsignedp
> -             && GET_CODE (op0) == SUBREG)
> -           SUBREG_PROMOTED_VAR_P (op0) = 0;
> -
>          return REDUCE_BIT_FIELD (op0);
>        }
>
> diff --git a/gcc/testsuite/gcc.target/riscv/pr111466.c
> b/gcc/testsuite/gcc.target/riscv/pr111466.c
> new file mode 100644
> index ..007792466a51
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/pr111466.c
> @@ -0,0 +1,15 @@
> +/* Simplified variant of gcc.target/riscv/zba-adduw.c. */
> +
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv64gc_zba_zbs -mabi=lp64" } */
> +/* { dg-skip-if "" { *-*-* } { "-O0" } } */
> +
> +int foo2(int unused, int n, unsigned y, unsigned delta){
> +  int s = 0;
> +  unsigned int x = 0;
> +  for (;x<n; x += delta)
> +    s += x+y;
> +  return s;
> +}
> +
> +/* { dg-final { scan-assembler "\msext\M" } } */
> --
> 2.34.1
[ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER.
Hi Claudiu,

It was great meeting up with you and the Synopsys ARC team at the GNU tools Cauldron in Cambridge.

This patch is the first in a series to improve SImode and DImode shifts and rotates in the ARC backend. This first piece splits SImode shifts, for !TARGET_BARREL_SHIFTER targets, after combine and before reload, in the split1 pass, as suggested by the FIXME comment above output_shift in arc.cc. To do this I've copied the implementation of the x86_pre_reload_split function from the i386 backend, and renamed it arc_pre_reload_split.

Although the actual implementations of shifts remain the same (as in output_shift), having them as explicit instructions in the RTL stream allows better scheduling and use of compact forms when available. The benefits can be seen in the two short examples below.

For the function:

unsigned int foo(unsigned int x, unsigned int y)
{
  return y << 2;
}

GCC with -O2 -mcpu=em would previously generate:

foo:    add     r1,r1,r1
        add     r1,r1,r1
        j_s.d   [blink]
        mov_s   r0,r1   ;4

and with this patch now generates:

foo:    asl_s   r0,r1
        j_s.d   [blink]
        asl_s   r0,r0

Notice the original (from shift_si3's output_shift) requires the shift sequence to be monolithic with the same destination register as the source (requiring an extra mov_s). The new version can eliminate this move, and schedule the second asl in the branch delay slot of the return.

For the function:

int x,y,z;
void bar()
{
  x <<= 3;
  y <<= 3;
  z <<= 3;
}

GCC -O2 -mcpu=em currently generates:

bar:    push_s  r13
        ld.as   r12,[gp,@x@sda] ;23
        ld.as   r3,[gp,@y@sda]  ;23
        mov     r2,0
        add3    r12,r2,r12
        mov     r2,0
        add3    r3,r2,r3
        ld.as   r2,[gp,@z@sda]  ;23
        st.as   r12,[gp,@x@sda] ;26
        mov     r13,0
        add3    r2,r13,r2
        st.as   r3,[gp,@y@sda]  ;26
        st.as   r2,[gp,@z@sda]  ;26
        j_s.d   [blink]
        pop_s   r13

where each shift by 3 uses ARC's add3 instruction, which is similar to x86's lea, implementing x = (y<<3) + z, but requires the value zero to be placed in a temporary register "z". Splitting this before reload allows these pseudos to be shared/reused.
With this patch, we get:

bar:    ld.as   r2,[gp,@x@sda]  ;23
        mov_s   r3,0    ;3
        add3    r2,r3,r2
        ld.as   r3,[gp,@y@sda]  ;23
        st.as   r2,[gp,@x@sda]  ;26
        ld.as   r2,[gp,@z@sda]  ;23
        mov_s   r12,0   ;3
        add3    r3,r12,r3
        add3    r2,r12,r2
        st.as   r3,[gp,@y@sda]  ;26
        st.as   r2,[gp,@z@sda]  ;26
        j_s     [blink]

Unfortunately, register allocation means that we only share two of the three "mov_s z,0", but this is sufficient to reduce register pressure enough to avoid spilling r13 in the prologue/epilogue.

This patch also contains a (latent?) bug fix. The implementation of the default insn "length" attribute assumes instructions of type "shift" have two input operands and accesses operands[2], hence specializations of shifts that don't have an operands[2] need to be categorized as type "unary" (which results in the correct length).

This patch has been tested on a cross-compiler to arc-elf (hosted on x86_64-pc-linux-gnu), but because I've an incomplete tool chain many of the regression tests fail; however, there are no new failures with the new test cases added below. If you can confirm that there are no issues from additional testing, is this OK for mainline?

Finally, a quick technical question. ARC's zero overhead loops require at least two instructions in the loop, so currently the backend's implementation of shr20 pads the loop body with a "nop".

lshr20: mov.f   lp_count, 20
        lpnz    2f
        lsr     r0,r0
        nop
2:      # end single insn loop
        j_s     [blink]

could this be more efficiently implemented as:

lshr20: mov     lp_count, 10
        lp      2f
        lsr_s   r0,r0
        lsr_s   r0,r0
2:      # end single insn loop
        j_s     [blink]

i.e. half the number of iterations, but doing twice as much useful work in each iteration? Or might the nop be free on advanced microarchitectures, and/or the consecutive dependent shifts cause a pipeline stall?
It would be nice to fuse loops to implement rotations, such that rotr16 (aka swap) would look like:

rot16:  mov_s   r1,r0
        mov     lp_count, 16
        lp      2f
        asl_s   r0,r0
        lsr_s   r1,r1
2:      # end single insn loop
        j_s.d   [blink]
        or_s    r0,r0,r1

Thanks in advance,
Roger

2023-09-28  Roger Sayle

gcc/ChangeLog
        * config/arc/arc-protos.h (emit_shift): Delete prototype.
        (arc_pre_reload_split): New function prototype.
        * config/arc/arc.cc (emit_shift): Delete function.
        (arc_pre_reload_split): New predicate function, copied from i386,
        to schedule define_insn_and_split splitters to the split1 pass.
        * config/arc/arc.md (ashlsi3): Exp
RE: [x86_64 PATCH] Improve __int128 argument passing (in ix86_expand_move).
Hi Manolis,

Many thanks. If you haven't already, could you create/file a bug report at https://gcc.gnu.org/bugzilla/ which ensures this doesn't get lost/forgotten. It provides a PR number for tracking discussions, and patches/fixes with PR numbers are (often) prioritized during the review and approval process.

I'll investigate what's going on. Either my "improvements" need to be disabled for V2SF arguments, or the middle/back end needs to figure out how to efficiently shuffle these values, without reload moving them via integer registers, at least as efficiently as before. As you/clang show, we could do better.

Thanks again, and sorry for any inconvenience.
Best regards,
Roger
--

> -----Original Message-----
> From: Manolis Tsamis
> Sent: 01 September 2023 11:45
> To: Uros Bizjak
> Cc: Roger Sayle ; gcc-patches@gcc.gnu.org
> Subject: Re: [x86_64 PATCH] Improve __int128 argument passing (in
> ix86_expand_move).
>
> Hi Roger,
>
> I've (accidentally) found a codegen regression that I bisected down to
> this patch.
> For these two functions:
>
> typedef struct {
>     float minx, miny;
>     float maxx, maxy;
> } AABB;
>
> int TestOverlap(AABB a, AABB b) {
>     return a.minx <= b.maxx
>         && a.miny <= b.maxy
>         && a.maxx >= b.minx
>         && a.maxx >= b.minx;
> }
>
> int TestOverlap2(AABB a, AABB b) {
>     return a.miny <= b.maxy
>         && a.maxx >= b.minx;
> }
>
> GCC used to produce this code:
>
> TestOverlap:
>         comiss  xmm3, xmm0
>         movq    rdx, xmm0
>         movq    rsi, xmm1
>         movq    rax, xmm3
>         jb      .L10
>         shr     rdx, 32
>         shr     rax, 32
>         movd    xmm0, eax
>         movd    xmm4, edx
>         comiss  xmm0, xmm4
>         jb      .L10
>         movd    xmm1, esi
>         xor     eax, eax
>         comiss  xmm1, xmm2
>         setnb   al
>         ret
> .L10:
>         xor     eax, eax
>         ret
> TestOverlap2:
>         shufps  xmm0, xmm0, 85
>         shufps  xmm3, xmm3, 85
>         comiss  xmm3, xmm0
>         jb      .L17
>         xor     eax, eax
>         comiss  xmm1, xmm2
>         setnb   al
>         ret
> .L17:
>         xor     eax, eax
>         ret
>
> After this patch codegen gets much worse:
>
> TestOverlap:
>         movq    rax, xmm1
>         movq    rdx, xmm2
>         movq    rsi, xmm0
>         mov     rdi, rax
>         movq    rax, xmm3
>         mov     rcx, rsi
>         xchg    rdx, rax
>         movd    xmm1, edx
>         mov     rsi, rax
>         mov     rax, rdx
>         comiss  xmm1, xmm0
>         jb      .L10
>         shr     rcx, 32
>         shr     rax, 32
>         movd    xmm0, eax
>         movd    xmm4, ecx
>         comiss  xmm0, xmm4
>         jb      .L10
>         movd    xmm0, esi
>         movd    xmm1, edi
>         xor     eax, eax
>         comiss  xmm1, xmm0
>         setnb   al
>         ret
> .L10:
>         xor     eax, eax
>         ret
> TestOverlap2:
>         movq    rdx, xmm2
>         movq    rax, xmm3
>         movq    rsi, xmm0
>         xchg    rdx, rax
>         mov     rcx, rsi
>         mov     rsi, rax
>         mov     rax, rdx
>         shr     rcx, 32
>         shr     rax, 32
>         movd    xmm4, ecx
>         movd    xmm0, eax
>         comiss  xmm0, xmm4
>         jb      .L17
>         movd    xmm0, esi
>         xor     eax, eax
>         comiss  xmm1, xmm0
>         setnb   al
>         ret
> .L17:
>         xor     eax, eax
>         ret
>
> I saw that you've been improving i386 argument passing, so maybe this is
> just a missed case of these additions?
>
> (Can also be seen here https://godbolt.org/z/E4xrEn6KW)
>
> PS: I found the code that clang generates, with cmpleps + pextrw to avoid
> the fp->int->fp + shr, interesting. I wonder if something like this could
> be added to GCC as well.
>
> Thanks!
> Manolis
>
> On Thu, Jul 6, 2023 at 5:21 PM Uros Bizjak via Gcc-patches
> <gcc-patc...@gcc.gnu.org> wrote:
> >
> > On Thu, Jul 6, 2023 at 3:48 PM Roger Sayle wrote:
> > >
> > > > On Thu, Jul 6, 2023 at 2:04 PM Roger Sayle
> > > > wrote:
> > > > >
> > > > > Passing 128-bit integer (TImode) parameters on x86_64 can
> > > > > sometimes result in surprising code. Consider the example below
> > > > > (from PR 43644):
> > > > >
> > > > > __uint128 foo(__uint128 x