[gcc r15-775] i386: Correct insn_cost of movabsq.
https://gcc.gnu.org/g:a3b16e73a2d5b2d4d20ef6f2fd164cea633bbec8 commit r15-775-ga3b16e73a2d5b2d4d20ef6f2fd164cea633bbec8 Author: Roger Sayle Date: Wed May 22 16:45:48 2024 +0100 i386: Correct insn_cost of movabsq. This single-line patch fixes a strange quirk/glitch in i386's rtx_costs, which considers an instruction loading a 64-bit constant to be significantly cheaper than loading a 32-bit (or smaller) constant. Consider the two functions: unsigned long long foo() { return 0x0123456789abcdefULL; } unsigned int bar() { return 10; } and the corresponding lines from combine's dump file: insn_cost 1 for #: r98:DI=0x123456789abcdef insn_cost 4 for #: ax:SI=0xa The same issue can be seen in -dP assembler output. movabsq $81985529216486895, %rax # 5 [c=1 l=10] *movdi_internal/4 The problem is that pattern_cost's interpretation of rtx_costs contains "return cost > 0 ? cost : COSTS_N_INSNS (1)" where a zero value (for example a register or small immediate constant) is considered special, and equivalent to a single instruction, but all other values are treated verbatim. Hence to make x86_64's 10-byte long movabsq instruction slightly more expensive than a simple constant, rtx_costs needs to return COSTS_N_INSNS(1)+1 and not 1. With this change, the insn_cost of movabsq is the intended value 5: insn_cost 5 for #: r98:DI=0x123456789abcdef and movabsq $81985529216486895, %rax # 5 [c=5 l=10] *movdi_internal/4 2024-05-22 Roger Sayle gcc/ChangeLog * config/i386/i386.cc (ix86_rtx_costs) <case CONST_INT>: A CONST_INT that isn't x86_64_immediate_operand requires an extra (expensive) movabsq insn to load, so return COSTS_N_INSNS (1) + 1. 
Diff: --- gcc/config/i386/i386.cc | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index 69cd4ae05a7..3e2a3a194f1 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -21562,7 +21562,8 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno, if (x86_64_immediate_operand (x, VOIDmode)) *total = 0; else - *total = 1; + /* movabsq is slightly more expensive than a simple instruction. */ + *total = COSTS_N_INSNS (1) + 1; return true; case CONST_DOUBLE:
[x86_64 PATCH] Correct insn_cost of movabsq.
This single-line patch fixes a strange quirk/glitch in i386's rtx_costs, which considers an instruction loading a 64-bit constant to be significantly cheaper than loading a 32-bit (or smaller) constant. Consider the two functions: unsigned long long foo() { return 0x0123456789abcdefULL; } unsigned int bar() { return 10; } and the corresponding lines from combine's dump file: insn_cost 1 for #: r98:DI=0x123456789abcdef insn_cost 4 for #: ax:SI=0xa The same issue can be seen in -dP assembler output. movabsq $81985529216486895, %rax # 5 [c=1 l=10] *movdi_internal/4 The problem is that pattern_cost's interpretation of rtx_costs contains "return cost > 0 ? cost : COSTS_N_INSNS (1)" where a zero value (for example a register or small immediate constant) is considered special, and equivalent to a single instruction, but all other values are treated verbatim. Hence to make x86_64's 10-byte long movabsq instruction slightly more expensive than a simple constant, rtx_costs needs to return COSTS_N_INSNS(1)+1 and not 1. With this change, the insn_cost of movabsq is the intended value 5: insn_cost 5 for #: r98:DI=0x123456789abcdef and movabsq $81985529216486895, %rax # 5 [c=5 l=10] *movdi_internal/4 [I'd originally tried fixing this by adding an ix86_insn_cost target hook, but the testsuite is very sensitive to the costing of insns]. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-05-22 Roger Sayle gcc/ChangeLog * config/i386/i386.cc (ix86_rtx_costs) <case CONST_INT>: A CONST_INT that isn't x86_64_immediate_operand requires an extra (expensive) movabsq insn to load, so return COSTS_N_INSNS (1) + 1. 
Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index b4838b7..b4a9519 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -21569,7 +21569,7 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno, if (x86_64_immediate_operand (x, VOIDmode)) *total = 0; else - *total = 1; + *total = COSTS_N_INSNS (1) + 1; return true; case CONST_DOUBLE:
[gcc r15-774] Avoid ICE in except.cc on targets that don't support exceptions.
https://gcc.gnu.org/g:26df7b4684e201e66c09dd018603a248ddc5f437 commit r15-774-g26df7b4684e201e66c09dd018603a248ddc5f437 Author: Roger Sayle Date: Wed May 22 13:48:52 2024 +0100 Avoid ICE in except.cc on targets that don't support exceptions. A number of testcases currently fail on nvptx with the ICE: during RTL pass: final openmp-simd-2.c: In function 'foo': openmp-simd-2.c:28:1: internal compiler error: in get_personality_function, at expr.cc:14037 28 | } | ^ 0x98a38f get_personality_function(tree_node*) /home/roger/GCC/nvptx-none/gcc/gcc/expr.cc:14037 0x969d3b output_function_exception_table(int) /home/roger/GCC/nvptx-none/gcc/gcc/except.cc:3226 0x9b760d rest_of_handle_final /home/roger/GCC/nvptx-none/gcc/gcc/final.cc:4252 The simple oversight in output_function_exception_table is that it calls get_personality_function (immediately) before checking the target's except_unwind_info hook (which on nvptx always returns UI_NONE). The (perhaps obvious) fix is to move the assignments of fnname and personality after the tests of whether they are needed, and before their first use. 2024-05-22 Roger Sayle gcc/ChangeLog * except.cc (output_function_exception_table): Move call to get_personality_function after targetm_common.except_unwind_info check, to avoid ICE on targets that don't support exceptions. Diff: --- gcc/except.cc | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/gcc/except.cc b/gcc/except.cc index 2080fcc22e6..b5886e97be9 100644 --- a/gcc/except.cc +++ b/gcc/except.cc @@ -3222,9 +3222,6 @@ output_one_function_exception_table (int section) void output_function_exception_table (int section) { - const char *fnname = get_fnname_from_decl (current_function_decl); - rtx personality = get_personality_function (current_function_decl); - /* Not all functions need anything. 
*/ if (!crtl->uses_eh_lsda || targetm_common.except_unwind_info (&global_options) == UI_NONE) @@ -3234,6 +3231,9 @@ output_function_exception_table (int section) if (section == 1 && !crtl->eh.call_site_record_v[1]) return; + const char *fnname = get_fnname_from_decl (current_function_decl); + rtx personality = get_personality_function (current_function_decl); + if (personality) { assemble_external_libcall (personality);
[PATCH] Avoid ICE in except.cc on targets that don't support exceptions.
A number of testcases currently fail on nvptx with the ICE: during RTL pass: final openmp-simd-2.c: In function 'foo': openmp-simd-2.c:28:1: internal compiler error: in get_personality_function, at expr.cc:14037 28 | } | ^ 0x98a38f get_personality_function(tree_node*) /home/roger/GCC/nvptx-none/gcc/gcc/expr.cc:14037 0x969d3b output_function_exception_table(int) /home/roger/GCC/nvptx-none/gcc/gcc/except.cc:3226 0x9b760d rest_of_handle_final /home/roger/GCC/nvptx-none/gcc/gcc/final.cc:4252 The simple oversight in output_function_exception_table is that it calls get_personality_function (immediately) before checking the target's except_unwind_info hook (which on nvptx always returns UI_NONE). The (perhaps obvious) fix is to move the assignments of fnname and personality after the tests of whether they are needed, and before their first use. This patch has been tested on nvptx-none hosted on x86_64-pc-linux-gnu with no new failures in the testsuite, and ~220 fewer FAILs. Ok for mainline? 2024-05-22 Roger Sayle gcc/ChangeLog * except.cc (output_function_exception_table): Move call to get_personality_function after targetm_common.except_unwind_info check, to avoid ICE on targets that don't support exceptions. Thanks in advance, Roger -- diff --git a/gcc/except.cc b/gcc/except.cc index 2080fcc..b5886e9 100644 --- a/gcc/except.cc +++ b/gcc/except.cc @@ -3222,9 +3222,6 @@ output_one_function_exception_table (int section) void output_function_exception_table (int section) { - const char *fnname = get_fnname_from_decl (current_function_decl); - rtx personality = get_personality_function (current_function_decl); - /* Not all functions need anything. 
*/ if (!crtl->uses_eh_lsda || targetm_common.except_unwind_info (&global_options) == UI_NONE) @@ -3234,6 +3231,9 @@ output_function_exception_table (int section) if (section == 1 && !crtl->eh.call_site_record_v[1]) return; + const char *fnname = get_fnname_from_decl (current_function_decl); + rtx personality = get_personality_function (current_function_decl); + if (personality) { assemble_external_libcall (personality);
[gcc r15-648] nvptx: Correct pattern for popcountdi2 insn in nvptx.md.
https://gcc.gnu.org/g:1676ef6e91b902f592270e4bcf10b4fc342e200d commit r15-648-g1676ef6e91b902f592270e4bcf10b4fc342e200d Author: Roger Sayle Date: Sun May 19 09:49:45 2024 +0100 nvptx: Correct pattern for popcountdi2 insn in nvptx.md. The result of a POPCOUNT operation in RTL should have the same mode as its operand. This corrects the specification of popcount in the nvptx backend, splitting the current generic define_insn into two, one for popcountsi2 and the other for popcountdi2 (the latter with an explicit truncate). 2024-05-19 Roger Sayle gcc/ChangeLog * config/nvptx/nvptx.md (popcount<mode>2): Split into... (popcountsi2): define_insn handling SImode popcount. (popcountdi2): define_insn handling DImode popcount, with an explicit truncate:SI to produce an SImode result. Diff: --- gcc/config/nvptx/nvptx.md | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md index 96e6c9116080..ef7e3fb00fac 100644 --- a/gcc/config/nvptx/nvptx.md +++ b/gcc/config/nvptx/nvptx.md @@ -655,11 +655,18 @@ DONE; }) -(define_insn "popcount<mode>2" +(define_insn "popcountsi2" [(set (match_operand:SI 0 "nvptx_register_operand" "=R") - (popcount:SI (match_operand:SDIM 1 "nvptx_register_operand" "R")))] + (popcount:SI (match_operand:SI 1 "nvptx_register_operand" "R")))] "" - "%.\\tpopc.b%T1\\t%0, %1;") + "%.\\tpopc.b32\\t%0, %1;") + +(define_insn "popcountdi2" + [(set (match_operand:SI 0 "nvptx_register_operand" "=R") + (truncate:SI + (popcount:DI (match_operand:DI 1 "nvptx_register_operand" "R"))))] + "" + "%.\\tpopc.b64\\t%0, %1;") ;; Multiplication variants
[x86 SSE] Improve handling of ternlog instructions in i386/sse.md (v2)
Hi Hongtao, Many thanks for the review, bug fixes and suggestions for improvements. This revised version of the patch implements all of your corrections. In theory the "ternlog idx" should guarantee that some operands are non-null, but I agree that it's better defensive programming to check invariants not easily proved. Instead of calling ix86_expand_vector_move, I use ix86_broadcast_from_constant to achieve the same effect of using a broadcast when possible, but it has the benefit of still using a memory operand (instead of a vector load) when broadcasting isn't possible. There are other places that could benefit from the same trick, but I can address these in a follow-up patch (it may even be preferable to keep these as CONST_VECTOR during early RTL passes and lower to broadcast or constant pool using splitters). This revised patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-05-17 Roger Sayle Hongtao Liu gcc/ChangeLog PR target/115021 * config/i386/i386-expand.cc (ix86_expand_args_builtin): Call fixup_modeless_constant before testing predicates. Only call copy_to_mode_reg on memory operands (after the first one). (ix86_gen_bcst_mem): Helper function to convert a CONST_VECTOR into a VEC_DUPLICATE if possible. (ix86_ternlog_idx): Convert an RTX expression into a ternlog index between 0 and 255, recording the operands in ARGS, if possible, or return -1 if this is not possible/valid. (ix86_ternlog_leaf_p): Helper function to identify "leaves" of a ternlog expression, e.g. REG_P, MEM_P, CONST_VECTOR, etc. (ix86_ternlog_operand_p): Test whether an expression is suitable for and preferred as an UNSPEC_TERNLOG. (ix86_expand_ternlog_binop): Helper function to construct the binary operation corresponding to a sufficiently simple ternlog. 
(ix86_expand_ternlog_andnot): Helper function to construct an ANDN operation corresponding to a sufficiently simple ternlog. (ix86_expand_ternlog): Expand a 3-operand ternary logic expression, constructing either an UNSPEC_TERNLOG or simpler rtx expression. Called from builtin expanders and pre-reload splitters. * config/i386/i386-protos.h (ix86_ternlog_idx): Prototype here. (ix86_ternlog_operand_p): Likewise. (ix86_expand_ternlog): Likewise. * config/i386/predicates.md (ternlog_operand): New predicate that calls ix86_ternlog_operand_p. * config/i386/sse.md (_vpternlog_0): New define_insn_and_split that recognizes a SET_SRC of ternlog_operand and expands it via ix86_expand_ternlog pre-reload. (_vternlog_mask): Convert from define_insn to define_expand. Use ix86_expand_ternlog if the mask operand is ~0 (or 255 or -1). (*_vternlog_mask): define_insn renamed from above. gcc/testsuite/ChangeLog * gcc.target/i386/avx512f-andn-di-zmm-2.c: Update test case. * gcc.target/i386/avx512f-andn-si-zmm-2.c: Likewise. * gcc.target/i386/avx512f-orn-si-zmm-1.c: Likewise. * gcc.target/i386/avx512f-orn-si-zmm-2.c: Likewise. * gcc.target/i386/avx512f-vpternlogd-1.c: Likewise. * gcc.target/i386/avx512f-vpternlogq-1.c: Likewise. * gcc.target/i386/avx512vl-vpternlogd-1.c: Likewise. * gcc.target/i386/avx512vl-vpternlogq-1.c: Likewise. * gcc.target/i386/pr100711-3.c: Likewise. * gcc.target/i386/pr100711-4.c: Likewise. * gcc.target/i386/pr100711-5.c: Likewise. Thanks again, Roger -- > From: Hongtao Liu > Sent: 14 May 2024 09:46 > On Mon, May 13, 2024 at 5:57 AM Roger Sayle > wrote: > > > > This patch improves the way that the x86 backend recognizes and > > expands AVX512's bitwise ternary logic (vpternlog) instructions. > I like the patch. 
> > 1 file changed, 25 insertions(+), 1 deletion(-) > gcc/config/i386/i386-expand.cc | 26 > +- > > modified gcc/config/i386/i386-expand.cc > @@ -25601,6 +25601,7 @@ ix86_gen_bcst_mem (machine_mode mode, rtx x) > int ix86_ternlog_idx (rtx op, rtx *args) { > + /* Nice dynamic programming:) */ >int idx0, idx1; > >if (!op) > @@ -25651,6 +25652,7 @@ ix86_ternlog_idx (rtx op, rtx *args) > return 0xaa; > } >/* Maximum of one volatile memory reference per expression. */ > + /* According to comments, it should be && ? */ >if (side_effects_p (op) || side_effects_p (args[2])) > return -1; >if (rtx_equal_p (op, args[2])) > @@ -25666,6 +25668,8 @@ ix86_ternlog_idx (rtx op, rtx *args) > > case SUBREG: >if (!VECTOR_MODE_P (GET_MODE (SUBREG_REG (op))) > +
[x86 SSE] Improve handling of ternlog instructions in i386/sse.md
This patch improves the way that the x86 backend recognizes and expands AVX512's bitwise ternary logic (vpternlog) instructions. As a motivating example consider the following code which calculates the carry out from a (binary) full adder: typedef unsigned long long v4di __attribute((vector_size(32))); v4di foo(v4di a, v4di b, v4di c) { return (a & b) | ((a ^ b) & c); } with -O2 -march=cascadelake current mainline produces: foo: vpternlogq $96, %ymm0, %ymm1, %ymm2 vmovdqa %ymm0, %ymm3 vmovdqa %ymm2, %ymm0 vpternlogq $248, %ymm3, %ymm1, %ymm0 ret with the patch below, we now generate a single instruction: foo: vpternlogq $232, %ymm2, %ymm1, %ymm0 ret The AVX512 vpternlog[qd] instructions are a very cool addition to the x86 instruction set, that can calculate any Boolean function of three inputs in a single fast instruction. As the truth table for any three-input function has 8 rows, any specific function can be represented by specifying those bits, i.e. by an 8-bit byte, an immediate integer between 0 and 255. Examples of ternary functions and their indices are given below:
0x01 1: ~((b|a)|c)
0x02 2: (~(b|a))&c
0x03 3: ~(b|a)
0x04 4: (~(c|a))&b
0x05 5: ~(c|a)
0x06 6: (c^b)&~a
0x07 7: ~((c&b)|a)
0x08 8: (~a&c)&b (~a&b)&c (c&b)&~a
0x09 9: ~((c^b)|a)
0x0a 10: ~a&c
0x0b 11: ~((~c&b)|a) (~b|c)&~a
0x0c 12: ~a&b
0x0d 13: ~((~b&c)|a) (~c|b)&~a
0x0e 14: (c|b)&~a
0x0f 15: ~a
0x10 16: (~(c|b))&a
0x11 17: ~(c|b)
...
0xf4 244: (~c&b)|a
0xf5 245: ~c|a
0xf6 246: (c^b)|a
0xf7 247: (~(c&b))|a
0xf8 248: (c&b)|a
0xf9 249: (~(c^b))|a
0xfa 250: c|a
0xfb 251: (c|a)|~b (~b|a)|c (~b|c)|a
0xfc 252: b|a
0xfd 253: (b|a)|~c (~c|a)|b (~c|b)|a
0xfe 254: (b|a)|c (c|a)|b (c|b)|a
A naive implementation (in many compilers) might be to add define_insn patterns for all 256 different functions. The situation is even worse as many of these Boolean functions don't have a "canonical form" (as produced by simplify_rtx) and would each need multiple patterns. See the space-separated equivalent expressions in the table above. 
This need to provide instruction "templates" might explain why GCC, LLVM and ICC all exhibit similar coverage problems in their ability to recognize x86 ternlog ternary functions. Perhaps a unique feature of GCC's design is that in addition to regular define_insn templates, machine descriptions can also perform pattern matching via a match_operator (and its corresponding predicate). This patch introduces a ternlog_operand predicate that matches a (possibly infinite) set of expression trees, identifying those that have at most three unique operands. This then allows a define_insn_and_split to recognize suitable expressions and then transform them into the appropriate UNSPEC_VTERNLOG as a pre-reload splitter. This design allows combine to smash together arbitrarily complex Boolean expressions, then transform them into an UNSPEC before register allocation. As an "optimization", where possible ix86_expand_ternlog generates a simpler binary operation, using AND, XOR, IOR or ANDN where possible, and in a few cases attempts to "canonicalize" the ternlog, by reordering or duplicating operands, so that later CSE passes have a hope of spotting equivalent values. Another benefit of this patch is that it improves the code generated for PR target/115021 [see comment #1]. This patch leaves the existing ternlog patterns in sse.md (for now), many of which are made obsolete by these changes. In theory we now only need one define_insn for UNSPEC_VTERNLOG. One complication from these previous variants was that they inconsistently used decimal vs. hexadecimal to specify the immediate constant operand in assembly language, making the list of tweaks to the testsuite with this patch larger than it might have been. I propose to remove the vestigial patterns in a follow-up patch, once this approach has baked (proven to be stable) on mainline. 
This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-05-12 Roger Sayle gcc/ChangeLog PR target/115021 * config/i386/i386-expand.cc (ix86_expand_args_builtin): Call fixup_modeless_constant before testing predicates. Only call copy_to_mode_reg on memory operands (after the first one). (ix86_gen_bcst_mem): Helper function to convert a CONST_VECTOR into a VEC_DUPLICATE if possible. (ix86_ternlog_idx): Convert an RTX expression into a ternlog index between 0 and 255, recording the operands in ARGS, if possible or return -1 if this is not possible/valid. (ix86_ternlog_leaf_p): Helper function to identify "leaves" of a ternlog expression, e.g. REG_P, MEM_P, CONST_VECTOR, etc. (ix86_ternlog_operand_p): Test
[gcc r15-390] arm: Use uxtb rN, rM, ror #8 to implement zero_extract on armv6.
https://gcc.gnu.org/g:46077992180d6d86c86544df5e8cb943492d3b01 commit r15-390-g46077992180d6d86c86544df5e8cb943492d3b01 Author: Roger Sayle Date: Sun May 12 16:27:22 2024 +0100 arm: Use uxtb rN, rM, ror #8 to implement zero_extract on armv6. Examining the code generated for the following C snippet on a raspberry pi: int popcount_lut8(unsigned *buf, int n) { int cnt=0; unsigned int i; do { i = *buf; cnt += lut[i&255]; cnt += lut[i>>8&255]; cnt += lut[i>>16&255]; cnt += lut[i>>24]; buf++; } while(--n); return cnt; } I was surprised to see the following instruction sequence generated by the compiler: mov r5, r2, lsr #8 uxtb r5, r5 This sequence can be performed by a single ARM instruction: uxtb r5, r2, ror #8 The attached patch allows GCC's combine pass to take advantage of ARM's uxtb with rotate functionality to implement the above zero_extract, and likewise to use the sxtb with rotate to implement sign_extract. ARM's uxtb and sxtb can only be used with rotates of 0, 8, 16 and 24, and of these only the 8 and 16 are useful [ror #0 is a nop, and extends with ror #24 can be implemented using regular shifts], so the approach here is to add the six missing but useful instructions as 6 different define_insn in arm.md, rather than try to be clever with new predicates. Later ARM hardware has advanced bit field instructions, and earlier ARM cores didn't support extend-with-rotate, so this appears to only benefit armv6 era CPUs (e.g. the raspberry pi). Patch posted: https://gcc.gnu.org/legacy-ml/gcc-patches/2018-01/msg01339.html Approved by Kyrill Tkachov: https://gcc.gnu.org/legacy-ml/gcc-patches/2018-01/msg01881.html 2024-05-12 Roger Sayle Kyrill Tkachov * config/arm/arm.md (*arm_zeroextractsi2_8_8, *arm_signextractsi2_8_8, *arm_zeroextractsi2_8_16, *arm_signextractsi2_8_16, *arm_zeroextractsi2_16_8, *arm_signextractsi2_16_8): New. 2024-05-12 Roger Sayle Kyrill Tkachov * gcc.target/arm/extend-ror.c: New test. 
Diff: --- gcc/config/arm/arm.md | 66 +++ gcc/testsuite/gcc.target/arm/extend-ror.c | 38 ++ 2 files changed, 104 insertions(+) diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md index 1fd00146ca9e..f47e036a8034 100644 --- a/gcc/config/arm/arm.md +++ b/gcc/config/arm/arm.md @@ -12647,6 +12647,72 @@ "" ) +;; Implement zero_extract using uxtb/uxth instruction with +;; the ror #N qualifier when applicable. + +(define_insn "*arm_zeroextractsi2_8_8" + [(set (match_operand:SI 0 "s_register_operand" "=r") + (zero_extract:SI (match_operand:SI 1 "s_register_operand" "r") +(const_int 8) (const_int 8)))] + "TARGET_ARM && arm_arch6" + "uxtb%?\\t%0, %1, ror #8" + [(set_attr "predicable" "yes") + (set_attr "type" "extend")] +) + +(define_insn "*arm_zeroextractsi2_8_16" + [(set (match_operand:SI 0 "s_register_operand" "=r") + (zero_extract:SI (match_operand:SI 1 "s_register_operand" "r") +(const_int 8) (const_int 16)))] + "TARGET_ARM && arm_arch6" + "uxtb%?\\t%0, %1, ror #16" + [(set_attr "predicable" "yes") + (set_attr "type" "extend")] +) + +(define_insn "*arm_zeroextractsi2_16_8" + [(set (match_operand:SI 0 "s_register_operand" "=r") + (zero_extract:SI (match_operand:SI 1 "s_register_operand" "r") +(const_int 16) (const_int 8)))] + "TARGET_ARM && arm_arch6" + "uxth%?\\t%0, %1, ror #8" + [(set_attr "predicable" "yes") + (set_attr "type" "extend")] +) + +;; Implement sign_extract using sxtb/sxth instruction with +;; the ror #N qualifier when applicable. + +(define_insn "*arm_signextractsi2_8_8" + [(set (match_operand:SI 0 "s_register_operand" "=r") + (sign_extract:SI (match_operand:SI 1 "s_register_operand" "r") +(const_int 8) (const_int 8)))] + "TARGET_ARM && arm_arch6" + "sxtb%?\\t%0, %1, ror #8" + [(set_attr "predicable" "yes") + (set_attr "type" "extend")] +) + +(define_insn "*arm_signextractsi2_8_16" + [(set (match_operand:SI 0 "s_register_operand" "=r") + (sign_extract:SI (match_operand:SI 1 "s
[gcc r15-366] i386: Improve V[48]QI shifts on AVX512/SSE4.1
https://gcc.gnu.org/g:f5a8cdc1ef5d6aa2de60849c23658ac5298df7bb commit r15-366-gf5a8cdc1ef5d6aa2de60849c23658ac5298df7bb Author: Roger Sayle Date: Fri May 10 20:26:40 2024 +0100 i386: Improve V[48]QI shifts on AVX512/SSE4.1 The following one-line patch improves the code generated for V8QI and V4QI shifts when AVX512BW and AVX512VL functionality is available. For the testcase (from gcc.target/i386/vect-shiftv8qi.c): typedef signed char v8qi __attribute__ ((__vector_size__ (8))); v8qi foo (v8qi x) { return x >> 5; } GCC with -O2 -march=cascadelake currently generates: foo: movl $67372036, %eax vpsraw $5, %xmm0, %xmm2 vpbroadcastd %eax, %xmm1 movl $117901063, %eax vpbroadcastd %eax, %xmm3 vmovdqa %xmm1, %xmm0 vmovdqa %xmm3, -24(%rsp) vpternlogd $120, -24(%rsp), %xmm2, %xmm0 vpsubb %xmm1, %xmm0, %xmm0 ret with this patch we now generate the much improved: foo: vpmovsxbw %xmm0, %xmm0 vpsraw $5, %xmm0, %xmm0 vpmovwb %xmm0, %xmm0 ret This patch also fixes the FAILs of gcc.target/i386/vect-shiftv[48]qi.c when run with the additional -march=cascadelake flag, by splitting these tests into two; one form testing code generation with -msse2 (and -mno-avx512vl) as originally intended, and the other testing AVX512 code generation with an explicit -march=cascadelake. 2024-05-10 Roger Sayle Hongtao Liu gcc/ChangeLog * config/i386/i386-expand.cc (ix86_expand_vecop_qihi_partial): Don't attempt ix86_expand_vec_shift_qihi_constant on SSE4.1. gcc/testsuite/ChangeLog * gcc.target/i386/vect-shiftv4qi.c: Specify -mno-avx512vl. * gcc.target/i386/vect-shiftv8qi.c: Likewise. * gcc.target/i386/vect-shiftv4qi-2.c: New test case. * gcc.target/i386/vect-shiftv8qi-2.c: Likewise. 
Diff: --- gcc/config/i386/i386-expand.cc | 3 ++ gcc/testsuite/gcc.target/i386/vect-shiftv4qi-2.c | 43 gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c | 2 +- gcc/testsuite/gcc.target/i386/vect-shiftv8qi-2.c | 43 gcc/testsuite/gcc.target/i386/vect-shiftv8qi.c | 2 +- 5 files changed, 91 insertions(+), 2 deletions(-) diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index 2f27bfb484c2..1ab22fe79736 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -24283,6 +24283,9 @@ ix86_expand_vecop_qihi_partial (enum rtx_code code, rtx dest, rtx op1, rtx op2) if (CONST_INT_P (op2) && (code == ASHIFT || code == LSHIFTRT || code == ASHIFTRT) + /* With AVX512 it's cheaper to do vpmovsxbw/op/vpmovwb. + Even with SSE4.1 the alternative is better. */ + && !TARGET_SSE4_1 && ix86_expand_vec_shift_qihi_constant (code, qdest, qop1, qop2)) { emit_move_insn (dest, gen_lowpart (qimode, qdest)); diff --git a/gcc/testsuite/gcc.target/i386/vect-shiftv4qi-2.c b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi-2.c new file mode 100644 index ..abc1a276b043 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi-2.c @@ -0,0 +1,43 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=cascadelake" } */ + +#define N 4 + +typedef unsigned char __vu __attribute__ ((__vector_size__ (N))); +typedef signed char __vi __attribute__ ((__vector_size__ (N))); + +__vu sll (__vu a, int n) +{ + return a << n; +} + +__vu sll_c (__vu a) +{ + return a << 5; +} + +/* { dg-final { scan-assembler-times "vpsllw" 2 } } */ + +__vu srl (__vu a, int n) +{ + return a >> n; +} + +__vu srl_c (__vu a) +{ + return a >> 5; +} + +/* { dg-final { scan-assembler-times "vpsrlw" 2 } } */ + +__vi sra (__vi a, int n) +{ + return a >> n; +} + +__vi sra_c (__vi a) +{ + return a >> 5; +} + +/* { dg-final { scan-assembler-times "vpsraw" 2 } } */ diff --git a/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c index 
b7e45c2e8799..9b52582d01f8 100644 --- a/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c +++ b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -msse2" } */ +/* { dg-options "-O2 -msse2 -mno-avx2 -mno-avx512vl" } */ #define N 4 diff --git a/gcc/testsuite/gcc.target/i386/vect-shiftv8qi-2.c b/gcc/testsuite/gcc.target/i386/vect-shiftv8qi-2.c new file mode 100644 index ..52760f5a0607 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/vect-shiftv8qi-2.c @@ -0,0 +1,43 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -march=cascadelake" } */ + +#defi
Re: [x86 PATCH] Improve V[48]QI shifts on AVX512
Many thanks for the speedy review and correction/improvement. It's interesting that you spotted the ternlog "spill"... I have a patch that rewrites ternlog handling that's been waiting for stage 1, that would also fix this mem operand issue. I hope to submit it for review this weekend. Thanks again, Roger > From: Hongtao Liu > On Fri, May 10, 2024 at 6:26 AM Roger Sayle > wrote: > > > > > > The following one line patch improves the code generated for V8QI and > > V4QI shifts when AVX512BW and AVX512VL functionality is available. > + /* With AVX512 its cheaper to do vpmovsxbw/op/vpmovwb. */ > + && !(TARGET_AVX512BW && TARGET_AVX512VL && TARGET_SSE4_1) > && ix86_expand_vec_shift_qihi_constant (code, qdest, qop1, qop2)) I > think > TARGET_SSE4_1 is enough, it's always better w/ sse4.1 and above when not going > into ix86_expand_vec_shift_qihi_constant. > Others LGTM. > > > > For the testcase (from gcc.target/i386/vect-shiftv8qi.c): > > > > typedef signed char v8qi __attribute__ ((__vector_size__ (8))); v8qi > > foo (v8qi x) { return x >> 5; } > > > > GCC with -O2 -march=cascadelake currently generates: > > > > foo: movl $67372036, %eax > > vpsraw $5, %xmm0, %xmm2 > > vpbroadcastd %eax, %xmm1 > > movl $117901063, %eax > > vpbroadcastd %eax, %xmm3 > > vmovdqa %xmm1, %xmm0 > > vmovdqa %xmm3, -24(%rsp) > > vpternlogd $120, -24(%rsp), %xmm2, %xmm0 > It looks like a miss-optimization under AVX512, but it's a separate issue. > > vpsubb %xmm1, %xmm0, %xmm0 > > ret > > > > with this patch we now generate the much improved: > > > > foo: vpmovsxbw %xmm0, %xmm0 > > vpsraw $5, %xmm0, %xmm0 > > vpmovwb %xmm0, %xmm0 > > ret > > > > This patch also fixes the FAILs of gcc.target/i386/vect-shiftv[48]qi.c > > when run with the additional -march=cascadelake flag, by splitting > > these tests into two; one form testing code generation with -msse2 > > (and > > -mno-avx512vl) as originally intended, and the other testing AVX512 > > code generation with an explicit -march=cascadelake. 
> > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline? > > > > > > 2024-05-09 Roger Sayle > > > > gcc/ChangeLog > > * config/i386/i386-expand.cc (ix86_expand_vecop_qihi_partial): > > Don't attempt ix86_expand_vec_shift_qihi_constant on AVX512. > > > > gcc/testsuite/ChangeLog > > * gcc.target/i386/vect-shiftv4qi.c: Specify -mno-avx512vl. > > * gcc.target/i386/vect-shiftv8qi.c: Likewise. > > * gcc.target/i386/vect-shiftv4qi-2.c: New test case. > > * gcc.target/i386/vect-shiftv8qi-2.c: Likewise. > > > > > > Thanks in advance, > > Roger > > -- > > > -- > BR, > Hongtao
[x86 PATCH] Improve V[48]QI shifts on AVX512
The following one-line patch improves the code generated for V8QI and V4QI shifts when AVX512BW and AVX512VL functionality is available. For the testcase (from gcc.target/i386/vect-shiftv8qi.c): typedef signed char v8qi __attribute__ ((__vector_size__ (8))); v8qi foo (v8qi x) { return x >> 5; } GCC with -O2 -march=cascadelake currently generates: foo: movl $67372036, %eax vpsraw $5, %xmm0, %xmm2 vpbroadcastd %eax, %xmm1 movl $117901063, %eax vpbroadcastd %eax, %xmm3 vmovdqa %xmm1, %xmm0 vmovdqa %xmm3, -24(%rsp) vpternlogd $120, -24(%rsp), %xmm2, %xmm0 vpsubb %xmm1, %xmm0, %xmm0 ret with this patch we now generate the much improved: foo: vpmovsxbw %xmm0, %xmm0 vpsraw $5, %xmm0, %xmm0 vpmovwb %xmm0, %xmm0 ret This patch also fixes the FAILs of gcc.target/i386/vect-shiftv[48]qi.c when run with the additional -march=cascadelake flag, by splitting these tests into two; one form testing code generation with -msse2 (and -mno-avx512vl) as originally intended, and the other testing AVX512 code generation with an explicit -march=cascadelake. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-05-09 Roger Sayle gcc/ChangeLog * config/i386/i386-expand.cc (ix86_expand_vecop_qihi_partial): Don't attempt ix86_expand_vec_shift_qihi_constant on AVX512. gcc/testsuite/ChangeLog * gcc.target/i386/vect-shiftv4qi.c: Specify -mno-avx512vl. * gcc.target/i386/vect-shiftv8qi.c: Likewise. * gcc.target/i386/vect-shiftv4qi-2.c: New test case. * gcc.target/i386/vect-shiftv8qi-2.c: Likewise. 
Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index a613291..8eb31b2 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -24212,6 +24212,8 @@ ix86_expand_vecop_qihi_partial (enum rtx_code code, rtx dest, rtx op1, rtx op2) if (CONST_INT_P (op2) && (code == ASHIFT || code == LSHIFTRT || code == ASHIFTRT) + /* With AVX512 its cheaper to do vpmovsxbw/op/vpmovwb. */ + && !(TARGET_AVX512BW && TARGET_AVX512VL && TARGET_SSE4_1) && ix86_expand_vec_shift_qihi_constant (code, qdest, qop1, qop2)) { emit_move_insn (dest, gen_lowpart (qimode, qdest)); diff --git a/gcc/testsuite/gcc.target/i386/vect-shiftv4qi-2.c b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi-2.c new file mode 100644 index 000..abc1a27 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi-2.c @@ -0,0 +1,43 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=cascadelake" } */ + +#define N 4 + +typedef unsigned char __vu __attribute__ ((__vector_size__ (N))); +typedef signed char __vi __attribute__ ((__vector_size__ (N))); + +__vu sll (__vu a, int n) +{ + return a << n; +} + +__vu sll_c (__vu a) +{ + return a << 5; +} + +/* { dg-final { scan-assembler-times "vpsllw" 2 } } */ + +__vu srl (__vu a, int n) +{ + return a >> n; +} + +__vu srl_c (__vu a) +{ + return a >> 5; +} + +/* { dg-final { scan-assembler-times "vpsrlw" 2 } } */ + +__vi sra (__vi a, int n) +{ + return a >> n; +} + +__vi sra_c (__vi a) +{ + return a >> 5; +} + +/* { dg-final { scan-assembler-times "vpsraw" 2 } } */ diff --git a/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c index b7e45c2..9b52582 100644 --- a/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c +++ b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -msse2" } */ +/* { dg-options "-O2 -msse2 -mno-avx2 -mno-avx512vl" } */ #define N 4 diff --git 
a/gcc/testsuite/gcc.target/i386/vect-shiftv8qi-2.c b/gcc/testsuite/gcc.target/i386/vect-shiftv8qi-2.c new file mode 100644 index 000..52760f5 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/vect-shiftv8qi-2.c @@ -0,0 +1,43 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -march=cascadelake" } */ + +#define N 8 + +typedef unsigned char __vu __attribute__ ((__vector_size__ (N))); +typedef signed char __vi __attribute__ ((__vector_size__ (N))); + +__vu sll (__vu a, int n) +{ + return a << n; +} + +__vu sll_c (__vu a) +{ + return a << 5; +} + +/* { dg-final { scan-assembler-times "vpsllw" 2 } } */ + +__vu srl (__vu a, int n) +{ + return a >> n; +} + +__vu srl_c (__vu a) +{ + return a >> 5; +} + +/* { dg-final { scan-assembler-times "vpsrlw" 2 } } */ + +__vi sra (__vi a, int n) +{ + return a >> n; +} + +__vi sra_c (__vi a) +{ + return a >> 5; +} + +/* { dg-final { scan-assembler-times "vpsraw" 2 }
[gcc r15-352] Constant fold {-1,-1} << 1 in simplify-rtx.cc
https://gcc.gnu.org/g:f2449b55fb2d32fc4200667ba79847db31f6530d
commit r15-352-gf2449b55fb2d32fc4200667ba79847db31f6530d
Author: Roger Sayle
Date:   Thu May 9 22:45:54 2024 +0100

    Constant fold {-1,-1} << 1 in simplify-rtx.cc

This patch addresses a missed optimization opportunity in the RTL
optimization passes.  The function simplify_const_binary_operation
will constant fold binary operators with two CONST_INT operands,
and those with two CONST_VECTOR operands, but is missing compile-time
evaluation of binary operators with a CONST_VECTOR and a CONST_INT,
such as vector shifts and rotates.

The first version of this patch didn't contain a switch statement to
explicitly check for valid binary opcodes, which bootstrapped and
regression tested fine, but my paranoia has got the better of me,
so this version now checks that VEC_SELECT or some funky (future)
rtx_code doesn't cause problems.

2024-05-09  Roger Sayle

gcc/ChangeLog
	* simplify-rtx.cc (simplify_const_binary_operation): Constant
	fold binary operations where the LHS is CONST_VECTOR and the
	RHS is CONST_INT (or CONST_DOUBLE) such as vector shifts.
Diff: --- gcc/simplify-rtx.cc | 54 + 1 file changed, 54 insertions(+) diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index dceaa1ca..53f54d1d3928 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -5021,6 +5021,60 @@ simplify_const_binary_operation (enum rtx_code code, machine_mode mode, return gen_rtx_CONST_VECTOR (mode, v); } + if (VECTOR_MODE_P (mode) + && GET_CODE (op0) == CONST_VECTOR + && (CONST_SCALAR_INT_P (op1) || CONST_DOUBLE_AS_FLOAT_P (op1)) + && (CONST_VECTOR_DUPLICATE_P (op0) + || CONST_VECTOR_NUNITS (op0).is_constant ())) +{ + switch (code) + { + case PLUS: + case MINUS: + case MULT: + case DIV: + case MOD: + case UDIV: + case UMOD: + case AND: + case IOR: + case XOR: + case SMIN: + case SMAX: + case UMIN: + case UMAX: + case LSHIFTRT: + case ASHIFTRT: + case ASHIFT: + case ROTATE: + case ROTATERT: + case SS_PLUS: + case US_PLUS: + case SS_MINUS: + case US_MINUS: + case SS_ASHIFT: + case US_ASHIFT: + case COPYSIGN: + break; + default: + return NULL_RTX; + } + + unsigned int npatterns = (CONST_VECTOR_DUPLICATE_P (op0) + ? CONST_VECTOR_NPATTERNS (op0) + : CONST_VECTOR_NUNITS (op0).to_constant ()); + rtx_vector_builder builder (mode, npatterns, 1); + for (unsigned i = 0; i < npatterns; i++) + { + rtx x = simplify_binary_operation (code, GET_MODE_INNER (mode), +CONST_VECTOR_ELT (op0, i), op1); + if (!x || !valid_for_const_vector_p (mode, x)) + return 0; + builder.quick_push (x); + } + return builder.build (); +} + if (SCALAR_FLOAT_MODE_P (mode) && CONST_DOUBLE_AS_FLOAT_P (op0) && CONST_DOUBLE_AS_FLOAT_P (op1)
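The new code folds a CONST_VECTOR op CONST_INT by simplifying each lane against the scalar operand and rebuilding the vector. A self-contained model of that per-lane loop (the enum and function names are mine; GCC's actual rtx machinery is more general):

```c
#include <stdint.h>

/* Model of the new simplify_const_binary_operation loop: fold each lane
   of a constant vector against a constant scalar shift count.  */
enum vop { V_ASHIFT, V_LSHIFTRT, V_ASHIFTRT };

static int32_t
fold_lane (enum vop code, int32_t a, unsigned b)
{
  switch (code)
    {
    case V_ASHIFT:   return (int32_t) ((uint32_t) a << b);
    case V_LSHIFTRT: return (int32_t) ((uint32_t) a >> b);
    case V_ASHIFTRT: return a >> b;  /* GCC targets shift arithmetically.  */
    }
  return 0;
}

static void
fold_vector (enum vop code, const int32_t *v, unsigned nunits,
             unsigned b, int32_t *out)
{
  /* Like the patch, apply the scalar simplification lane by lane.  */
  for (unsigned i = 0; i < nunits; i++)
    out[i] = fold_lane (code, v[i], b);
}
```

The commit title's example falls out directly: folding {-1,-1} << 1 lane-wise yields {-2,-2} at compile time instead of emitting a vector shift.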
[gcc r15-222] PR target/106060: Improved SSE vector constant materialization on x86.
https://gcc.gnu.org/g:79649a5dcd81bc05c0ba591068c9075de43bd417
commit r15-222-g79649a5dcd81bc05c0ba591068c9075de43bd417
Author: Roger Sayle
Date:   Tue May 7 07:14:40 2024 +0100

    PR target/106060: Improved SSE vector constant materialization on x86.

This patch resolves PR target/106060 by providing efficient methods for
materializing/synthesizing special "vector" constants on x86.  Currently
there are three methods of materializing a vector constant: the most
general is to load a vector from the constant pool; secondly,
"duplicated" constants can be synthesized by moving an integer between
units and broadcasting (or shuffling) it; and finally the special cases
of the all-zeros vector and all-ones vector can be loaded via a single
SSE instruction.  This patch handles additional cases that can be
synthesized in two instructions: loading an all-ones vector followed by
one other SSE instruction.  Following my recent patch for PR
target/112992, there's conveniently a single place in i386-expand.cc
where these special cases can be handled.

Two examples are given in the original bugzilla PR for 106060.

__m256i should_be_cmpeq_abs () { return _mm256_set1_epi8 (1); }

is now generated (with -O3 -march=x86-64-v3) as:

	vpcmpeqd	%ymm0, %ymm0, %ymm0
	vpabsb	%ymm0, %ymm0
	ret

and

__m256i should_be_cmpeq_add () { return _mm256_set1_epi8 (-2); }

is now generated as:

	vpcmpeqd	%ymm0, %ymm0, %ymm0
	vpaddb	%ymm0, %ymm0, %ymm0
	ret

2024-05-07  Roger Sayle
	    Hongtao Liu

gcc/ChangeLog
	PR target/106060
	* config/i386/i386-expand.cc (enum ix86_vec_bcast_alg): New.
	(struct ix86_vec_bcast_map_simode_t): New type for table below.
	(ix86_vec_bcast_map_simode): Table of SImode constants that may
	be efficiently synthesized by a ix86_vec_bcast_alg method.
	(ix86_vec_bcast_map_simode_cmp): New comparator for bsearch.
	(ix86_vector_duplicate_simode_const): Efficiently synthesize
	V4SImode and V8SImode constants that duplicate special constants.
(ix86_vector_duplicate_value): Attempt to synthesize "special" vector constants using ix86_vector_duplicate_simode_const. * config/i386/i386.cc (ix86_rtx_costs) : ABS of a vector integer mode costs with a single SSE instruction. gcc/testsuite/ChangeLog PR target/106060 * gcc.target/i386/auto-init-8.c: Update test case. * gcc.target/i386/avx512fp16-13.c: Likewise. * gcc.target/i386/pr100865-9a.c: Likewise. * gcc.target/i386/pr101796-1.c: Likewise. * gcc.target/i386/pr106060-1.c: New test case. * gcc.target/i386/pr106060-2.c: Likewise. * gcc.target/i386/pr106060-3.c: Likewise. * gcc.target/i386/pr70314.c: Update test case. * gcc.target/i386/vect-shiftv4qi.c: Likewise. * gcc.target/i386/vect-shiftv8qi.c: Likewise. Diff: --- gcc/config/i386/i386-expand.cc | 364 - gcc/config/i386/i386.cc| 2 + gcc/testsuite/gcc.target/i386/auto-init-8.c| 2 +- gcc/testsuite/gcc.target/i386/avx512fp16-13.c | 3 - gcc/testsuite/gcc.target/i386/pr100865-9a.c| 2 +- gcc/testsuite/gcc.target/i386/pr101796-1.c | 6 +- gcc/testsuite/gcc.target/i386/pr106060-1.c | 12 + gcc/testsuite/gcc.target/i386/pr106060-2.c | 13 + gcc/testsuite/gcc.target/i386/pr106060-3.c | 14 + gcc/testsuite/gcc.target/i386/pr70314.c| 2 +- gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c | 2 +- gcc/testsuite/gcc.target/i386/vect-shiftv8qi.c | 2 +- 12 files changed, 411 insertions(+), 13 deletions(-) diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index 8bb8f21e686..a6132911e6a 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -15696,6 +15696,332 @@ s4fma_expand: gcc_unreachable (); } +/* See below where shifts are handled for explanation of this enum. 
 */
+enum ix86_vec_bcast_alg
+{
+  VEC_BCAST_PXOR,
+  VEC_BCAST_PCMPEQ,
+  VEC_BCAST_PABSB,
+  VEC_BCAST_PADDB,
+  VEC_BCAST_PSRLW,
+  VEC_BCAST_PSRLD,
+  VEC_BCAST_PSLLW,
+  VEC_BCAST_PSLLD
+};
+
+struct ix86_vec_bcast_map_simode_t
+{
+  unsigned int key;
+  enum ix86_vec_bcast_alg alg;
+  unsigned int arg;
+};
+
+/* This table must be kept sorted as values are looked-up using bsearch.  */
+static const ix86_vec_bcast_map_simode_t ix86_vec_bcast_map_simode[] = {
+  { 0x, VEC_BCAST_PXOR,    0 },
+  { 0x0001, VEC_BCAST_PSRLD,  31 },
+  { 0x0003, VEC_BCAST_PSRLD,  30 },
+  { 0x0007, VEC_BCAST_PSRLD,  29 },
+  { 0x000f, VEC_BCAST
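The table's idea can be checked with a scalar model: every entry starts from the all-ones value produced by vpcmpeqd and applies one more operation per lane. The helpers below are mine (one lane stands in for each vector element; the hex keys in the archived table above are truncated, so only the shift amounts are verified):

```c
#include <stdint.h>

/* Scalar models of the second instruction of each two-insn synthesis,
   applied to the all-ones lane produced by vpcmpeqd.  */
static uint32_t psrld (uint32_t x, unsigned n) { return x >> n; }
static uint32_t pslld (uint32_t x, unsigned n) { return x << n; }
static uint8_t  pabsb (int8_t x)  { return (uint8_t) (x < 0 ? -x : x); }
static uint8_t  paddb (uint8_t a, uint8_t b) { return (uint8_t) (a + b); }
```

For example, shifting the all-ones lane right by 31, 30 or 29 yields the broadcast constants 1, 3 and 7, and per-byte abs/add of all-ones yields set1_epi8(1) and set1_epi8(-2), matching the vpabsb/vpaddb sequences shown earlier.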
RE: [PATCH] PR middle-end/111701: signbit(x*x) vs -fsignaling-nans
> From: Richard Biener
> On Thu, May 2, 2024 at 11:34 AM Roger Sayle wrote:
> >
> > > From: Richard Biener
> > > On Fri, Apr 26, 2024 at 10:19 AM Roger Sayle wrote:
> > > >
> > > > This patch addresses PR middle-end/111701 where optimization of
> > > > signbit(x*x) using tree_nonnegative_p incorrectly eliminates a
> > > > floating point multiplication when the operands may potentially be
> > > > signaling NaNs.
> > > >
> > > > The above bug fix also provides a solution or work-around to the
> > > > tricky issue in PR middle-end/111701, that the results of IEEE
> > > > operations on NaNs are specified to return a NaN result, but fail
> > > > to (precisely) specify the exact NaN representation of this result.
> > > > Hence for the operation "-NaN*-NaN" different hardware
> > > > implementations (targets) return different results.  Ultimately
> > > > knowing what the resulting NaN "payload" of an operation is can
> > > > only be known by executing that operation at run-time, and I'd
> > > > suggest that GCC's -fsignaling-nans provides a mechanism for
> > > > handling code that uses NaN representations for
> > > > communication/signaling (which is a different but related concept
> > > > to IEEE's sNaN).
> > > >
> > > > One nice thing about this patch, which may or may not be a P2
> > > > regression fix, is that it only affects (improves) code compiled
> > > > with -fsignaling-nans so should be extremely safe even for this
> > > > point in stage 3.
> > > >
> > > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > > bootstrap and make -k check, both with and without
> > > > --target_board=unix{-m32} with no new failures.  Ok for mainline?
> > >
> > > Hmm, but the bugreports are not about sNaN but about the fact that
> > > the sign of the NaN produced by 0/0 or by -NaN*-NaN is not
> > > well-defined.  So I don't think this is the correct approach to fix
> > > this.
We'd > > > instead have to use tree_expr_maybe_nan_p () - and if NaN*NaN cannot > > > be -NaN (is that at least > > > specified?) then the RECURSE path should still work as well. > > > > If we ignore the bugzilla PR for now, can we agree that if x is a > > signaling NaN, that we shouldn't be eliminating x*x? i.e. that this > > patch fixes a real bug, but perhaps not (precisely) the one described in PR > middle-end/111701. > > This might or might not be covered by -fdelete-dead-exceptions - at least in > the > past we were OK with removing traps like for -ftrapv (-ftrapv makes signed > overflow no longer invoke undefined behavior) or when deleting loads that > might > trap (but those would invoke undefined behavior). > > I bet the C standard doesn't say anything about sNaNs or how an operation with > it has to behave in the abstract machine. We do document though that it > "disables optimizations that may change the number of exceptions visible with > signaling NaNs" which suggests that with -fsignaling-nans we have to preserve > all > such traps but I am very sure DCE will simply elide unused ops here (also for > other > FP operations with -ftrapping-math - but there we do not document that we > preserve all traps). > > With the patch the multiplication is only preserved because __builtin_signbit > still > uses it. A plain > > void foo(double x) > { >x*x; > } > > has the multiplication elided during gimplification already (even at -O0). void foo(double x) { double t = x*x; } when compiled with -fsignaling-nans -fexceptions -fnon-call-exceptions doesn't exhibit the above bug. Perhaps this short-coming of gimplification deserves its own Bugzilla PR? > So I don't think the patch is a meaningful improvement as to preserve > multiplications of sNaNs. > > Richard. 
> > Once the signaling NaN case is correctly handled, the use of
> > -fsignaling-nans can be used as a workaround for PR 111701, allowing
> > it to perhaps be reduced from a P2 to a P3 regression (or even not a
> > bug if the qNaN case is undefined behavior).
> > When I wrote this patch I was trying to help with GCC 14's stage 3.
> >
> > > > 2024-04-26  Roger Sayle
> > > >
> > > > gcc/ChangeLog
> > > > 	PR middle-end/111701
> > > > 	* fold-const.cc (tree_binary_nonnegative_warnv_p) <case MULT_EXPR>:
> > > > 	Split handling of floating point and integer types.  For equal
> > > > 	floating point operands, avoid optimization if the operand may be
> > > > 	a signaling NaN.
> > > >
> > > > gcc/testsuite/ChangeLog
> > > > 	PR middle-end/111701
> > > > 	* gcc.dg/pr111701-1.c: New test case.
> > > > 	* gcc.dg/pr111701-2.c: Likewise.
RE: [PATCH] PR middle-end/111701: signbit(x*x) vs -fsignaling-nans
> From: Richard Biener
> On Fri, Apr 26, 2024 at 10:19 AM Roger Sayle wrote:
> >
> > This patch addresses PR middle-end/111701 where optimization of
> > signbit(x*x) using tree_nonnegative_p incorrectly eliminates a
> > floating point multiplication when the operands may potentially be
> > signaling NaNs.
> >
> > The above bug fix also provides a solution or work-around to the
> > tricky issue in PR middle-end/111701, that the results of IEEE
> > operations on NaNs are specified to return a NaN result, but fail to
> > (precisely) specify the exact NaN representation of this result.
> > Hence for the operation "-NaN*-NaN" different hardware
> > implementations (targets) return different results.  Ultimately
> > knowing what the resulting NaN "payload" of an operation is can only
> > be known by executing that operation at run-time, and I'd suggest
> > that GCC's -fsignaling-nans provides a mechanism for handling code
> > that uses NaN representations for communication/signaling (which is a
> > different but related concept to IEEE's sNaN).
> >
> > One nice thing about this patch, which may or may not be a P2
> > regression fix, is that it only affects (improves) code compiled with
> > -fsignaling-nans so should be extremely safe even for this point in
> > stage 3.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures.  Ok for mainline?
>
> Hmm, but the bugreports are not about sNaN but about the fact that the
> sign of the NaN produced by 0/0 or by -NaN*-NaN is not well-defined.
> So I don't think this is the correct approach to fix this.  We'd
> instead have to use tree_expr_maybe_nan_p () - and if NaN*NaN cannot
> be -NaN (is that at least specified?) then the RECURSE path should
> still work as well.

If we ignore the bugzilla PR for now, can we agree that if x is a
signaling NaN, that we shouldn't be eliminating x*x?  i.e.
that this patch fixes a real bug, but perhaps not (precisely) the one
described in PR middle-end/111701.

Once the signaling NaN case is correctly handled, the use of
-fsignaling-nans can be used as a workaround for PR 111701, allowing it
to perhaps be reduced from a P2 to a P3 regression (or even not a bug
if the qNaN case is undefined behavior).  When I wrote this patch I was
trying to help with GCC 14's stage 3.

> > 2024-04-26  Roger Sayle
> >
> > gcc/ChangeLog
> > 	PR middle-end/111701
> > 	* fold-const.cc (tree_binary_nonnegative_warnv_p) <case MULT_EXPR>:
> > 	Split handling of floating point and integer types.  For equal
> > 	floating point operands, avoid optimization if the operand may be
> > 	a signaling NaN.
> >
> > gcc/testsuite/ChangeLog
> > 	PR middle-end/111701
> > 	* gcc.dg/pr111701-1.c: New test case.
> > 	* gcc.dg/pr111701-2.c: Likewise.
RE: [C PATCH] PR c/109618: ICE-after-error from error_mark_node.
> On Tue, Apr 30, 2024 at 10:23 AM Roger Sayle > wrote: > > Hi Richard, > > Thanks for looking into this. > > > > It’s not the call to size_binop_loc (for CEIL_DIV_EXPR) that's > > problematic, but the call to fold_convert_loc (loc, size_type_node, value) > > on line > 4009 of c-common.cc. > > At this point, value is (NOP_EXPR:sizetype (VAR_DECL:error_mark_node)). > > I see. Can we catch this when we build (NOP_EXPR:sizetype > (VAR_DECL:error_mark_node)) > and instead have it "build" error_mark_node? That's the tricky part. At the point the NOP_EXPR is built the VAR_DECL's type is valid. It's later when this variable gets redefined with a conflicting type that the shared VAR_DECL gets modified, setting its type to error_mark_node. Mutating this shared node, then potentially introduces error_operand_p at arbitrary places deep within an expression. Fortunately, we only have to worry about this in the unusual/exceptional case that seen_error() is true. > > Ultimately, it's the code in match.pd /* Handle cases of two > > conversions in a row. */ with the problematic line being (match.pd:4748): > > unsigned int inside_prec = element_precision (inside_type); > > > > Here inside_type is error_mark_node, and so tree type checking in > > element_precision throws an internal_error. > > > > There doesn’t seem to be a good way to fix this in element_precision, > > and it's complicated to reorganize the logic in match.pd's "with > > clause" inside the (ocvt (icvt@1 @0)), but perhaps a (ocvt > (icvt:non_error_type@1 @0))? > > > > The last place/opportunity the front-end could sanitize this operand > > before passing the dubious tree to the middle-end is > > c_sizeof_or_alignof_type (which alas doesn't appear in the backtrace due to > inlining). 
> > > > #5 0x0227b0e9 in internal_error ( > > gmsgid=gmsgid@entry=0x249c7b8 "tree check: expected class %qs, > > have %qs (%s) in %s, at %s:%d") at ../../gcc/gcc/diagnostic.cc:2232 > > #6 0x0081e32a in tree_class_check_failed (node=0x76c1ef30, > > cl=cl@entry=tcc_type, file=file@entry=0x2495f3f "../../gcc/gcc/tree.cc", > > line=line@entry=6795, function=function@entry=0x24961fe > "element_precision") > > at ../../gcc/gcc/tree.cc:9005 > > #7 0x0081ef4c in tree_class_check (__t=, > __class=tcc_type, > > __f=0x2495f3f "../../gcc/gcc/tree.cc", __l=6795, > > __g=0x24961fe "element_precision") at ../../gcc/gcc/tree.h:4067 > > #8 element_precision (type=, type@entry=0x76c1ef30) > > at ../../gcc/gcc/tree.cc:6795 > > #9 0x017f66a4 in generic_simplify_CONVERT_EXPR (loc=201632, > > code=, type=0x76c3e7e0, _p0=0x76dc95c0) > > at generic-match-6.cc:3386 > > #10 0x00c1b18c in fold_unary_loc (loc=201632, code=NOP_EXPR, > > type=0x76c3e7e0, op0=0x76dc95c0) at > > ../../gcc/gcc/fold-const.cc:9523 > > #11 0x00c1d94a in fold_build1_loc (loc=201632, code=NOP_EXPR, > > type=0x76c3e7e0, op0=0x76dc95c0) at > > ../../gcc/gcc/fold-const.cc:14165 > > #12 0x0094068c in c_expr_sizeof_expr (loc=loc@entry=201632, > expr=...) > > at ../../gcc/gcc/tree.h:3771 > > #13 0x0097f06c in c_parser_sizeof_expression (parser= out>) > > at ../../gcc/gcc/c/c-parser.cc:9932 > > > > > > I hope this explains what's happening. The size_binop_loc call is a > > bit of a red herring that returns the same tree it is given (as > > TYPE_PRECISION (char_type_node) == BITS_PER_UNIT), so it's the > > "TYPE_SIZE_UNIT (type)" which needs to be checked for the embedded > VAR_DECL with a TREE_TYPE of error_mark_node. > > > > As Andrew Pinski writes in comment #3, this one is trickier than average. > > > > A more comprehensive fix might be to write deep_error_operand_p which > > does more of a tree traversal checking error_operand_p within the > > unary and binary operators of an expression tree. 
> > > > Please let me know what you think/recommend. > > Best regards, > > Roger > > -- > > > > > -Original Message- > > > From: Richard Biener > > > Sent: 30 April 2024 08:38 > > > To: Roger Sayle > > > Cc: gcc-patches@gcc.gnu.org > > > Subject: Re: [C PATCH] PR c/109618: ICE-after-error from error_mark_node. > > > > > > On Tue, Apr 30, 2024 at 1:06 AM Roger Sayle > > > > > > wrote: > > > > > > > > > > > > This patch solves another ICE-a
RE: [C PATCH] PR c/109618: ICE-after-error from error_mark_node.
Hi Richard,
Thanks for looking into this.

It's not the call to size_binop_loc (for CEIL_DIV_EXPR) that's
problematic, but the call to fold_convert_loc (loc, size_type_node,
value) on line 4009 of c-common.cc.  At this point, value is
(NOP_EXPR:sizetype (VAR_DECL:error_mark_node)).

Ultimately, it's the code in match.pd /* Handle cases of two conversions
in a row.  */ with the problematic line being (match.pd:4748):

  unsigned int inside_prec = element_precision (inside_type);

Here inside_type is error_mark_node, and so tree type checking in
element_precision throws an internal_error.

There doesn't seem to be a good way to fix this in element_precision,
and it's complicated to reorganize the logic in match.pd's "with clause"
inside the (ocvt (icvt@1 @0)), but perhaps a
(ocvt (icvt:non_error_type@1 @0))?

The last place/opportunity the front-end could sanitize this operand
before passing the dubious tree to the middle-end is
c_sizeof_or_alignof_type (which alas doesn't appear in the backtrace
due to inlining).
#5  0x0227b0e9 in internal_error (
    gmsgid=gmsgid@entry=0x249c7b8 "tree check: expected class %qs, have %qs (%s) in %s, at %s:%d")
    at ../../gcc/gcc/diagnostic.cc:2232
#6  0x0081e32a in tree_class_check_failed (node=0x76c1ef30,
    cl=cl@entry=tcc_type, file=file@entry=0x2495f3f "../../gcc/gcc/tree.cc",
    line=line@entry=6795, function=function@entry=0x24961fe "element_precision")
    at ../../gcc/gcc/tree.cc:9005
#7  0x0081ef4c in tree_class_check (__t=, __class=tcc_type,
    __f=0x2495f3f "../../gcc/gcc/tree.cc", __l=6795,
    __g=0x24961fe "element_precision") at ../../gcc/gcc/tree.h:4067
#8  element_precision (type=, type@entry=0x76c1ef30)
    at ../../gcc/gcc/tree.cc:6795
#9  0x017f66a4 in generic_simplify_CONVERT_EXPR (loc=201632,
    code=, type=0x76c3e7e0, _p0=0x76dc95c0)
    at generic-match-6.cc:3386
#10 0x00c1b18c in fold_unary_loc (loc=201632, code=NOP_EXPR,
    type=0x76c3e7e0, op0=0x76dc95c0) at ../../gcc/gcc/fold-const.cc:9523
#11 0x00c1d94a in fold_build1_loc (loc=201632, code=NOP_EXPR,
    type=0x76c3e7e0, op0=0x76dc95c0) at ../../gcc/gcc/fold-const.cc:14165
#12 0x0094068c in c_expr_sizeof_expr (loc=loc@entry=201632, expr=...)
    at ../../gcc/gcc/tree.h:3771
#13 0x0097f06c in c_parser_sizeof_expression (parser=)
    at ../../gcc/gcc/c/c-parser.cc:9932

I hope this explains what's happening.  The size_binop_loc call is a bit
of a red herring that returns the same tree it is given (as
TYPE_PRECISION (char_type_node) == BITS_PER_UNIT), so it's the
"TYPE_SIZE_UNIT (type)" which needs to be checked for the embedded
VAR_DECL with a TREE_TYPE of error_mark_node.

As Andrew Pinski writes in comment #3, this one is trickier than
average.

A more comprehensive fix might be to write deep_error_operand_p which
does more of a tree traversal checking error_operand_p within the unary
and binary operators of an expression tree.

Please let me know what you think/recommend.
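The deep_error_operand_p idea sketched above can be modelled outside GCC with a toy tree type. This is purely illustrative: the struct and field names below are mine, standing in for GCC's tree, TREE_TYPE and TREE_OPERAND, and the real helper would of course use GCC's accessors:

```c
#include <stddef.h>

/* Toy model of an expression tree where a node (or its type) may have
   been poisoned with an error mark after a conflicting redeclaration.  */
struct toy_tree
{
  int is_error_mark;        /* models error_mark_node */
  struct toy_tree *type;    /* models TREE_TYPE */
  struct toy_tree *op[2];   /* models TREE_OPERAND for unary/binary ops */
};

/* Recursively check a node, its type, and its operands for poisoning,
   mirroring the proposed deep_error_operand_p.  */
static int
deep_error_operand_p (const struct toy_tree *t)
{
  if (t == NULL)
    return 0;
  if (t->is_error_mark || (t->type && t->type->is_error_mark))
    return 1;                       /* models error_operand_p (t) */
  return deep_error_operand_p (t->op[0])
         || deep_error_operand_p (t->op[1]);
}
```

This catches exactly the shape in the bug: a NOP_EXPR wrapping a VAR_DECL whose type was set to error_mark_node, which a shallow error_operand_p check on the outermost node misses.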
Best regards,
Roger
--

> -----Original Message-----
> From: Richard Biener
> Sent: 30 April 2024 08:38
> To: Roger Sayle
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [C PATCH] PR c/109618: ICE-after-error from error_mark_node.
>
> On Tue, Apr 30, 2024 at 1:06 AM Roger Sayle wrote:
> >
> > This patch solves another ICE-a
[C PATCH] PR c/109618: ICE-after-error from error_mark_node.
This patch solves another ICE-after-error problem in the C family
front-ends.  Upon a conflicting type redeclaration, the ambiguous type
is poisoned with an error_mark_node to indicate to the middle-end that
the type is suspect, but care has to be taken by the front-end to avoid
passing these malformed trees into the middle-end during error
recovery.  In this case, a var_decl with a poisoned type appears within
a sizeof() expression (wrapped in NOP_EXPR) which causes problems.

This revision of the patch tests seen_error() to avoid tree traversal
(STRIP_NOPs) in the most common case that an error hasn't occurred.
Both this version (and an earlier revision that didn't test seen_error)
have survived bootstrap and regression testing on x86_64-pc-linux-gnu.

As a consolation, this code also contains a minor performance
improvement, by avoiding trying to create (and folding away) a
CEIL_DIV_EXPR in the common case that "char" is a single byte.  The
current code relies on the middle-end's tree folding to recognize that
CEIL_DIV_EXPR of integer_one_node is a no-op, that can be optimized
away.

Ok for mainline?

2024-04-30  Roger Sayle

gcc/c-family/ChangeLog
	PR c/109618
	* c-common.cc (c_sizeof_or_alignof_type): If seen_error() check
	whether value is (a VAR_DECL) of type error_mark_node, or a
	NOP_EXPR thereof.  Avoid folding CEIL_DIV_EXPR for the common
	case where char_type is a single byte.

gcc/testsuite/ChangeLog
	PR c/109618
	* gcc.dg/pr109618.c: New test case.

Thanks in advance,
Roger
--

diff --git a/gcc/c-family/c-common.cc b/gcc/c-family/c-common.cc
index 6fa8243..be8ff09 100644
--- a/gcc/c-family/c-common.cc
+++ b/gcc/c-family/c-common.cc
@@ -3993,10 +3993,31 @@ c_sizeof_or_alignof_type (location_t loc,
   else
     {
       if (is_sizeof)
-	/* Convert in case a char is more than one unit.
 */
-	value = size_binop_loc (loc, CEIL_DIV_EXPR, TYPE_SIZE_UNIT (type),
-				size_int (TYPE_PRECISION (char_type_node)
-					  / BITS_PER_UNIT));
+	{
+	  value = TYPE_SIZE_UNIT (type);
+
+	  /* PR 109618: Check for erroneous types, stripping NOPs.  */
+	  if (seen_error ())
+	    {
+	      tree tmp = value;
+	      while (CONVERT_EXPR_P (tmp)
+		     || TREE_CODE (tmp) == NON_LVALUE_EXPR)
+		{
+		  if (TREE_TYPE (tmp) == error_mark_node)
+		    return error_mark_node;
+		  tmp = TREE_OPERAND (tmp, 0);
+		}
+	      if (tmp == error_mark_node
+		  || TREE_TYPE (tmp) == error_mark_node)
+		return error_mark_node;
+	    }
+
+	  /* Convert in case a char is more than one unit.  */
+	  if (TYPE_PRECISION (char_type_node) != BITS_PER_UNIT)
+	    value = size_binop_loc (loc, CEIL_DIV_EXPR, value,
+				    size_int (TYPE_PRECISION (char_type_node)
+					      / BITS_PER_UNIT));
+	}
       else if (min_alignof)
 	value = size_int (min_align_of_type (type));
       else
diff --git a/gcc/testsuite/gcc.dg/pr109618.c b/gcc/testsuite/gcc.dg/pr109618.c
new file mode 100644
index 000..f240907
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr109618.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O0" } */
+int foo()
+{
+  const unsigned int var_1 = 2;
+
+  char var_5[var_1];
+
+  int var_1[10]; /* { dg-error "conflicting type" } */
+
+  return sizeof(var_5);
+}
+
[PATCH] PR tree-opt/113673: Avoid load merging from potentially trapping additions.
This patch fixes PR tree-optimization/113673, a P2 ice-on-valid
regression caused by load merging of (ptr[0]<<8)+ptr[1] when -ftrapv
has been specified.  When the operator is | or ^ this is safe, but for
addition of signed integer types, a trap may be generated/required, so
merging this idiom into a single non-trapping instruction is
inappropriate, confusing the compiler by transforming a basic block
with an exception edge into one without.

One fix is to be more selective for PLUS_EXPR than for BIT_IOR_EXPR or
BIT_XOR_EXPR in gimple-ssa-store-merging.cc's find_bswap_or_nop_1
function.  An alternate solution might be to notice that in this idiom
the addition can't overflow, but that this detail wasn't apparent when
exception edges were added to the CFG.  In which case, it's safe to
remove (or mark for removal) the problematic exceptional edge.
Unfortunately updating the CFG is a part of the compiler that I'm less
familiar with.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?

2024-04-28  Roger Sayle

gcc/ChangeLog
	PR tree-optimization/113673
	* gimple-ssa-store-merging.cc (find_bswap_or_nop_1) <case PLUS_EXPR>:
	Don't perform load merging if a signed addition may trap.

gcc/testsuite/ChangeLog
	PR tree-optimization/113673
	* g++.dg/pr113673.C: New test case.

Thanks in advance,
Roger
--

diff --git a/gcc/gimple-ssa-store-merging.cc b/gcc/gimple-ssa-store-merging.cc
index cb0cb5f..41a1066 100644
--- a/gcc/gimple-ssa-store-merging.cc
+++ b/gcc/gimple-ssa-store-merging.cc
@@ -776,9 +776,16 @@ find_bswap_or_nop_1 (gimple *stmt, struct symbolic_number *n, int limit)
   switch (code)
     {
+    case PLUS_EXPR:
+      /* Don't perform load merging if this addition can trap.  */
+      if (cfun->can_throw_non_call_exceptions
+	  && INTEGRAL_TYPE_P (TREE_TYPE (rhs1))
+	  && TYPE_OVERFLOW_TRAPS (TREE_TYPE (rhs1)))
+	return NULL;
+      /* Fallthru.
 */
+
     case BIT_IOR_EXPR:
     case BIT_XOR_EXPR:
-    case PLUS_EXPR:
       source_stmt1 = find_bswap_or_nop_1 (rhs1_stmt, n, limit - 1);
       if (!source_stmt1)
diff --git a/gcc/testsuite/g++.dg/pr113673.C b/gcc/testsuite/g++.dg/pr113673.C
new file mode 100644
index 000..1148977
--- /dev/null
+++ b/gcc/testsuite/g++.dg/pr113673.C
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-Os -fnon-call-exceptions -ftrapv" } */
+
+struct s { ~s(); };
+void
+h (unsigned char *data, int c)
+{
+  s a1;
+  while (c)
+    {
+      int m = *data++ << 8;
+      m += *data++;
+    }
+}
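The idiom at the heart of the PR is a big-endian 16-bit load written as shift plus add. Because the shifted byte and the low byte occupy disjoint bits, PLUS, BIT_IOR and BIT_XOR all compute the same value here, which is why the merge is normally valid; only the signed PLUS can carry an exception edge under -ftrapv. A small self-contained illustration (function names are mine):

```c
#include <stdint.h>

/* Big-endian 16-bit load spelled three equivalent ways.  The promoted
   operands occupy disjoint bit ranges, so +, | and ^ agree; only the
   signed + form acquires an EH edge with -ftrapv -fnon-call-exceptions.  */
static uint16_t load_be16_add (const uint8_t *p)
{ return (uint16_t) ((p[0] << 8) + p[1]); }

static uint16_t load_be16_ior (const uint8_t *p)
{ return (uint16_t) ((p[0] << 8) | p[1]); }

static uint16_t load_be16_xor (const uint8_t *p)
{ return (uint16_t) ((p[0] << 8) ^ p[1]); }
```

The patch keeps merging the | and ^ spellings unconditionally and only suppresses the + spelling when the addition's type has trapping overflow.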
[PATCH] PR middle-end/111701: signbit(x*x) vs -fsignaling-nans
This patch addresses PR middle-end/111701 where optimization of signbit(x*x) using tree_nonnegative_p incorrectly eliminates a floating point multiplication when the operands may potentially be signaling NaNs. The above bug fix also provides a solution or work-around to the tricky issue in PR middle-end/111701, that the results of IEEE operations on NaNs are specified to return a NaN result, but fail to (precisely) specify the exact NaN representation of this result. Hence for the operation "-NaN*-NaN" different hardware implementations (targets) return different results. Ultimately knowing what the resulting NaN "payload" of an operation is can only be known by executing that operation at run-time, and I'd suggest that GCC's -fsignaling-nans provides a mechanism for handling code that uses NaN representations for communication/signaling (which is a different but related concept to IEEE's sNaN). One nice thing about this patch, which may or may not be a P2 regression fix, is that it only affects (improves) code compiled with -fsignaling-nans so should be extremely safe even for this point in stage 3. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-04-26 Roger Sayle gcc/ChangeLog PR middle-end/111701 * fold-const.cc (tree_binary_nonnegative_warnv_p) : Split handling of floating point and integer types. For equal floating point operands, avoid optimization if the operand may be a signaling NaN. gcc/testsuite/ChangeLog PR middle-end/111701 * gcc.dg/pr111701-1.c: New test case. * gcc.dg/pr111701-2.c: Likewise. 
Thanks in advance, Roger -- diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc index 7b26896..f7f174d 100644 --- a/gcc/fold-const.cc +++ b/gcc/fold-const.cc @@ -15076,16 +15076,27 @@ tree_binary_nonnegative_warnv_p (enum tree_code code, tree type, tree op0, break; case MULT_EXPR: - if (FLOAT_TYPE_P (type) || TYPE_OVERFLOW_UNDEFINED (type)) + if (FLOAT_TYPE_P (type)) { - /* x * x is always non-negative for floating point x -or without overflow. */ + /* x * x is non-negative for floating point x except +that -NaN*-NaN may return -NaN. PR middle-end/111701. */ + if (operand_equal_p (op0, op1, 0)) + { + if (!tree_expr_maybe_signaling_nan_p (op0) || RECURSE (op0)) + return true; + } + else if (RECURSE (op0) && RECURSE (op1)) + return true; + } + + if (ANY_INTEGRAL_TYPE_P (type) + && TYPE_OVERFLOW_UNDEFINED (type)) + { + /* x * x is always non-negative without overflow. */ if (operand_equal_p (op0, op1, 0) || (RECURSE (op0) && RECURSE (op1))) { - if (ANY_INTEGRAL_TYPE_P (type) - && TYPE_OVERFLOW_UNDEFINED (type)) - *strict_overflow_p = true; + *strict_overflow_p = true; return true; } } diff --git a/gcc/testsuite/gcc.dg/pr111701-1.c b/gcc/testsuite/gcc.dg/pr111701-1.c new file mode 100644 index 000..5cbfac2 --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr111701-1.c @@ -0,0 +1,14 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fsignaling-nans -fdump-tree-optimized" } */ + +int foo(double x) +{ +return __builtin_signbit(x*x); +} + +int bar(float x) +{ +return __builtin_signbit(x*x); +} + +/* { dg-final { scan-tree-dump-times " \\* " 2 "optimized" } } */ diff --git a/gcc/testsuite/gcc.dg/pr111701-2.c b/gcc/testsuite/gcc.dg/pr111701-2.c new file mode 100644 index 000..f79c7ba --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr111701-2.c @@ -0,0 +1,14 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-optimized" } */ + +int foo(double x) +{ +return __builtin_signbit(x*x); +} + +int bar(float x) +{ +return __builtin_signbit(x*x); +} + +/* { dg-final { scan-tree-dump-not " \\* 
" "optimized" } } */
[PATCH] PR target/114187: Fix ?Fmode SUBREG simplification in simplify_subreg.
This patch fixes PR target/114187, a typo/missed-optimization in simplify-rtx that's exposed by (my) changes to x86_64's parameter passing. The context is that construction of double word (TImode) values now uses the idiom:

(ior:TI (ashift:TI (zero_extend:TI (reg:DI x)) (const_int 64 [0x40]))
        (zero_extend:TI (reg:DI y)))

Extracting the DImode highpart and lowpart halves of this complex expression is supported by simplifications in simplify_subreg. The problem is that when the doubleword TImode value represents two DFmode fields, there isn't a direct simplification to extract the highpart or lowpart SUBREGs; instead GCC uses two steps: extract the DImode {high,low} part, and then cast the result back to a floating point mode, DFmode. The (buggy) code to do this is:

/* If the outer mode is not integral, try taking a subreg with the equivalent
   integer outer mode and then bitcasting the result.  Other simplifications
   rely on integer to integer subregs and we'd potentially miss out on
   optimizations otherwise.  */
if (known_gt (GET_MODE_SIZE (innermode), GET_MODE_SIZE (outermode))
    && SCALAR_INT_MODE_P (innermode)
    && !SCALAR_INT_MODE_P (outermode)
    && int_mode_for_size (GET_MODE_BITSIZE (outermode), 0).exists (&int_outermode))
  {
    rtx tem = simplify_subreg (int_outermode, op, innermode, byte);
    if (tem)
      return simplify_gen_subreg (outermode, tem, int_outermode, byte);
  }

The issue/mistake is that the second call, to simplify_gen_subreg, shouldn't use "byte" as the final argument; the offset has already been handled by the first call, to simplify_subreg, and this second call is just a type conversion from an integer mode to floating point (from DImode to DFmode). Interestingly, this mistake was already spotted by Richard Sandiford when the optimization was originally contributed in January 2023. https://gcc.gnu.org/pipermail/gcc-patches/2023-January/610920.html

>> Richard Sandiford writes:
>> Also, the final line should pass 0 rather than byte.
Unfortunately a miscommunication/misunderstanding in a later thread https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612898.html resulted in this correction being undone. Alas the lack of any test cases when the optimization was added/modified potentially contributed to this lapse. Using lowpart_subreg should avoid/reduce confusion in future. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-03-03 Roger Sayle gcc/ChangeLog PR target/114187 * simplify-rtx.cc (simplify_context::simplify_subreg): Call lowpart_subreg to perform type conversion, to avoid confusion over the offset to use in the call to simplify_reg_subreg. gcc/testsuite/ChangeLog PR target/114187 * g++.target/i386/pr114187.C: New test case. Thanks in advance, Roger -- diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index 36dd522..dceaa13 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -7846,7 +7846,7 @@ simplify_context::simplify_subreg (machine_mode outermode, rtx op, { rtx tem = simplify_subreg (int_outermode, op, innermode, byte); if (tem) - return simplify_gen_subreg (outermode, tem, int_outermode, byte); + return lowpart_subreg (outermode, tem, int_outermode); } /* If OP is a vector comparison and the subreg is not changing the diff --git a/gcc/testsuite/g++.target/i386/pr114187.C b/gcc/testsuite/g++.target/i386/pr114187.C new file mode 100644 index 000..69912a9 --- /dev/null +++ b/gcc/testsuite/g++.target/i386/pr114187.C @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +struct P2d { +double x, y; +}; + +double sumxy_p(P2d p) { +return p.x + p.y; +} + +/* { dg-final { scan-assembler-not "movq" } } */ +/* { dg-final { scan-assembler-not "xchg" } } */
[x86_64 PATCH] PR target/113690: Fix-up MULT REG_EQUAL notes in STV.
This patch fixes PR target/113690, an ICE-on-valid regression on x86_64 that exhibits with a specific combination of command line options. The cause is that x86's scalar-to-vector pass converts a chain of instructions from TImode to V1TImode, but fails to appropriately update the attached REG_EQUAL note. Given that multiplication isn't supported in V1TImode, the REG_NOTE handling code wasn't expecting to see a MULT. Easily solved with additional handling for other binary operators that may potentially (in future) have an immediate constant as the second operand that needs handling. For convenience, this code (re)factors the logic to convert a TImode constant into a V1TImode constant vector into a subroutine and reuses it. For the record, STV is actually doing something useful in this strange testcase, GCC with -O2 -fno-dce -fno-forward-propagate -fno-split-wide-types -funroll-loops generates: foo:movl$v, %eax pxor%xmm0, %xmm0 movaps %xmm0, 48(%rax) movaps %xmm0, (%rax) movaps %xmm0, 16(%rax) movaps %xmm0, 32(%rax) ret With the addition of -mno-stv (to disable the patched code) it gives: foo:movl$v, %eax movq$0, 48(%rax) movq$0, 56(%rax) movq$0, (%rax) movq$0, 8(%rax) movq$0, 16(%rax) movq$0, 24(%rax) movq$0, 32(%rax) movq$0, 40(%rax) ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-02-05 Roger Sayle gcc/ChangeLog PR target/113690 * config/i386/i386-features.cc (timode_convert_cst): New helper function to convert a TImode CONST_SCALAR_INT_P to a V1TImode CONST_VECTOR. (timode_scalar_chain::convert_op): Use timode_convert_cst. (timode_scalar_chain::convert_insn): If a REG_EQUAL note contains a binary operator where the second operand is an immediate integer constant, convert it to V1TImode using timode_convert_cst. Use timode_convert_cst. gcc/testsuite/ChangeLog PR target/113690 * gcc.target/i386/pr113690.c: New test case. 
Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc index 4020b27..90ada7d 100644 --- a/gcc/config/i386/i386-features.cc +++ b/gcc/config/i386/i386-features.cc @@ -1749,6 +1749,19 @@ timode_scalar_chain::fix_debug_reg_uses (rtx reg) } } +/* Helper function to convert immediate constant X to V1TImode. */ +static rtx +timode_convert_cst (rtx x) +{ + /* Prefer all ones vector in case of -1. */ + if (constm1_operand (x, TImode)) +return CONSTM1_RTX (V1TImode); + + rtx *v = XALLOCAVEC (rtx, 1); + v[0] = x; + return gen_rtx_CONST_VECTOR (V1TImode, gen_rtvec_v (1, v)); +} + /* Convert operand OP in INSN from TImode to V1TImode. */ void @@ -1775,18 +1788,8 @@ timode_scalar_chain::convert_op (rtx *op, rtx_insn *insn) } else if (CONST_SCALAR_INT_P (*op)) { - rtx vec_cst; rtx tmp = gen_reg_rtx (V1TImode); - - /* Prefer all ones vector in case of -1. */ - if (constm1_operand (*op, TImode)) - vec_cst = CONSTM1_RTX (V1TImode); - else - { - rtx *v = XALLOCAVEC (rtx, 1); - v[0] = *op; - vec_cst = gen_rtx_CONST_VECTOR (V1TImode, gen_rtvec_v (1, v)); - } + rtx vec_cst = timode_convert_cst (*op); if (!standard_sse_constant_p (vec_cst, V1TImode)) { @@ -1830,12 +1833,28 @@ timode_scalar_chain::convert_insn (rtx_insn *insn) tmp = find_reg_equal_equiv_note (insn); if (tmp) { - if (GET_MODE (XEXP (tmp, 0)) == TImode) - PUT_MODE (XEXP (tmp, 0), V1TImode); - else if (CONST_SCALAR_INT_P (XEXP (tmp, 0))) - XEXP (tmp, 0) - = gen_rtx_CONST_VECTOR (V1TImode, - gen_rtvec (1, XEXP (tmp, 0))); + rtx expr = XEXP (tmp, 0); + if (GET_MODE (expr) == TImode) + { + PUT_MODE (expr, V1TImode); + switch (GET_CODE (expr)) + { + case PLUS: + case MINUS: + case MULT: + case AND: + case IOR: + case XOR: + if (CONST_SCALAR_INT_P (XEXP (expr, 1))) + XEXP (expr, 1) = timode_convert_cst (XEXP (expr, 1)); + break; + + default: + break; + } + } + else if (CONST_SCALAR_INT_P (expr)) + XEXP (tmp, 0) = timode_convert_cst (expr); } } break; @@ -1876,7 +1895,7 @@ 
timode_scalar_chain::convert_insn (rtx_insn *insn
[tree-ssa PATCH] PR target/113560: Enhance is_widening_mult_rhs_p.
This patch resolves PR113560, a code quality regression from GCC12 affecting x86_64, by enhancing the middle-end's tree-ssa-math-opts.cc to recognize more instances of widening multiplications. The widening multiplication recognition code identifies cases like:

_1 = (unsigned __int128) x;
__res = _1 * 100;

but in the reported test case, the original input looks like:

_1 = (unsigned long long) x;
_2 = (unsigned __int128) _1;
__res = _2 * 100;

which gets optimized by constant folding during tree-ssa to:

_2 = x & 18446744073709551615;  // x & 0xffffffffffffffff
__res = _2 * 100;

where the BIT_AND_EXPR hides (has consumed) the extension operation. This reveals a more general deficiency (missed optimization opportunity) in widening multiplication recognition: both

__int128 foo(__int128 x, __int128 y) { return (x & 1000) * (y & 1000); }

and

unsigned __int128 bar(unsigned __int128 x, unsigned __int128 y) { return (x >> 80) * (y >> 80); }

should also be recognized as widening multiplications. Hence, rather than test explicitly for BIT_AND_EXPR (as in the first version of this patch), the more general solution is to make use of range information, as provided by tree_non_zero_bits.
As a demonstration of the observed improvements, function foo above currently with -O2 compiles on x86_64 to: foo:movq%rdi, %rsi movq%rdx, %r8 xorl%edi, %edi xorl%r9d, %r9d andl$1000, %esi andl$1000, %r8d movq%rdi, %rcx movq%r9, %rdx imulq %rsi, %rdx movq%rsi, %rax imulq %r8, %rcx addq%rdx, %rcx mulq%r8 addq%rdx, %rcx movq%rcx, %rdx ret with this patch, GCC recognizes the *w and instead generates: foo:movq%rdi, %rsi movq%rdx, %r8 andl$1000, %esi andl$1000, %r8d movq%rsi, %rax imulq %r8 ret which is perhaps easier to understand at the tree-level where __int128 foo (__int128 x, __int128 y) { __int128 _1; __int128 _2; __int128 _5; [local count: 1073741824]: _1 = x_3(D) & 1000; _2 = y_4(D) & 1000; _5 = _1 * _2; return _5; } gets transformed to: __int128 foo (__int128 x, __int128 y) { __int128 _1; __int128 _2; __int128 _5; signed long _7; signed long _8; [local count: 1073741824]: _1 = x_3(D) & 1000; _2 = y_4(D) & 1000; _7 = (signed long) _1; _8 = (signed long) _2; _5 = _7 w* _8; return _5; } This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-01-30 Roger Sayle gcc/ChangeLog PR target/113560 * tree-ssa-math-opts.cc (is_widening_mult_rhs_p): Use range information via tree_non_zero_bits to check if this operand is suitably extended for a widening (or highpart) multiplication. (convert_mult_to_widen): Insert explicit casts if the RHS or LHS isn't already of the claimed type. gcc/testsuite/ChangeLog PR target/113560 * g++.target/i386/pr113560.C: New test case. * gcc.target/i386/pr113560.c: Likewise. Thanks in advance, Roger -- diff --git a/gcc/testsuite/g++.target/i386/pr113560.C b/gcc/testsuite/g++.target/i386/pr113560.C new file mode 100644 index 000..179b68f --- /dev/null +++ b/gcc/testsuite/g++.target/i386/pr113560.C @@ -0,0 +1,19 @@ +/* { dg-do compile { target { ! 
ia32 } } } */ +/* { dg-options "-Ofast -std=c++23 -march=znver4" } */ + +#include +auto f(char *buf, unsigned long long in) noexcept +{ +unsigned long long hi{}; +auto lo{_mulx_u64(in, 0x2af31dc462ull, )}; +lo = _mulx_u64(lo, 100, ); +__builtin_memcpy(buf + 2, , 2); +return buf + 10; +} + +/* { dg-final { scan-assembler-times "mulx" 1 } } */ +/* { dg-final { scan-assembler-times "mulq" 1 } } */ +/* { dg-final { scan-assembler-not "addq" } } */ +/* { dg-final { scan-assembler-not "adcq" } } */ +/* { dg-final { scan-assembler-not "salq" } } */ +/* { dg-final { scan-assembler-not "shldq" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr113560.c b/gcc/testsuite/gcc.target/i386/pr113560.c new file mode 100644 index 000..ac2e01a --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr113560.c @@ -0,0 +1,17 @@ +/* { dg-do compile { target int128 } } */ +/* { dg-options "-O2" } */ + +unsigned __int128 foo(unsigned __int128 x, unsigned __int128 y) +{ + return (x & 1000) * (y & 1000); +} + +__int128 bar(__int128 x, __int128 y) +{ + return (x & 1000) * (y & 1000); +} + +/* { dg-final { scan-assembler-times "\tmulq" 1 } } */ +/* { dg-final { scan-assembler-times "\timulq" 1 } } */ +/* { dg-final { scan-assembler-not
[libatomic PATCH] PR other/113336: Fix libatomic testsuite regressions on ARM.
This patch is a revised version of the fix for PR other/113336. This patch has been tested on arm-linux-gnueabihf with --with-arch=armv6 with make bootstrap and make -k check where it fixes all of the FAILs in libatomic. Ok for mainline? 2024-01-28 Roger Sayle Victor Do Nascimento libatomic/ChangeLog PR other/113336 * Makefile.am: Build tas_1_2_.o on ARCH_ARM_LINUX * Makefile.in: Regenerate. Thanks in advance. Roger -- diff --git a/libatomic/Makefile.am b/libatomic/Makefile.am index cfad90124f9..eb04fa2fc60 100644 --- a/libatomic/Makefile.am +++ b/libatomic/Makefile.am @@ -139,6 +139,7 @@ if ARCH_ARM_LINUX IFUNC_OPTIONS = -march=armv7-a+fp -DHAVE_KERNEL64 libatomic_la_LIBADD += $(foreach s,$(SIZES),$(addsuffix _$(s)_1_.lo,$(SIZEOBJS))) libatomic_la_LIBADD += $(addsuffix _8_2_.lo,$(SIZEOBJS)) +libatomic_la_LIBADD += tas_1_2_.lo endif if ARCH_I386 IFUNC_OPTIONS = -march=i586
[middle-end PATCH] Constant fold {-1,-1} << 1 in simplify-rtx.cc
This patch addresses a missed optimization opportunity in the RTL optimization passes. The function simplify_const_binary_operation will constant fold binary operators with two CONST_INT operands, and those with two CONST_VECTOR operands, but is missing compile-time evaluation of binary operators with a CONST_VECTOR and a CONST_INT, such as vector shifts and rotates. My first version of this patch didn't contain a switch statement to explicitly check for valid binary opcodes, which bootstrapped and regression tested fine, but paranoia has got the better of me, so this version now checks that VEC_SELECT or some funky (future) rtx_code doesn't cause problems. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline (in stage 1)?

2024-01-26  Roger Sayle

gcc/ChangeLog
* simplify-rtx.cc (simplify_const_binary_operation): Constant
fold binary operations where the LHS is CONST_VECTOR and the
RHS is CONST_INT (or CONST_DOUBLE) such as vector shifts.
Thanks in advance, Roger -- diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index c7215cf..2e2809a 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -5021,6 +5021,60 @@ simplify_const_binary_operation (enum rtx_code code, machine_mode mode, return gen_rtx_CONST_VECTOR (mode, v); } + if (VECTOR_MODE_P (mode) + && GET_CODE (op0) == CONST_VECTOR + && (CONST_SCALAR_INT_P (op1) || CONST_DOUBLE_AS_FLOAT_P (op1)) + && (CONST_VECTOR_DUPLICATE_P (op0) + || CONST_VECTOR_NUNITS (op0).is_constant ())) +{ + switch (code) + { + case PLUS: + case MINUS: + case MULT: + case DIV: + case MOD: + case UDIV: + case UMOD: + case AND: + case IOR: + case XOR: + case SMIN: + case SMAX: + case UMIN: + case UMAX: + case LSHIFTRT: + case ASHIFTRT: + case ASHIFT: + case ROTATE: + case ROTATERT: + case SS_PLUS: + case US_PLUS: + case SS_MINUS: + case US_MINUS: + case SS_ASHIFT: + case US_ASHIFT: + case COPYSIGN: + break; + default: + return NULL_RTX; + } + + unsigned int npatterns = (CONST_VECTOR_DUPLICATE_P (op0) + ? CONST_VECTOR_NPATTERNS (op0) + : CONST_VECTOR_NUNITS (op0).to_constant ()); + rtx_vector_builder builder (mode, npatterns, 1); + for (unsigned i = 0; i < npatterns; i++) + { + rtx x = simplify_binary_operation (code, GET_MODE_INNER (mode), +CONST_VECTOR_ELT (op0, i), op1); + if (!x || !valid_for_const_vector_p (mode, x)) + return 0; + builder.quick_push (x); + } + return builder.build (); +} + if (SCALAR_FLOAT_MODE_P (mode) && CONST_DOUBLE_AS_FLOAT_P (op0) && CONST_DOUBLE_AS_FLOAT_P (op1)
RE: [x86 PATCH] PR target/106060: Improved SSE vector constant materialization.
Hi Hongtao, Many thanks for the review. Here's a revised version of my patch that addresses (most of) the issues you've raised. Firstly, the handling of zero and all_ones in this function is mostly for completeness/documentation; these standard_sse_constant_p values are (currently/normally) handled elsewhere. But I have added an "n_var == 0" optimization to ix86_expand_vector_init. As you've suggested, I've added explicit TARGET_SSE2 tests where required, and for consistency I've also added support for AVX512's V16SImode. As you've predicted, the eventual goal is to move this after combine (or reload) using define_insn_and_split, but that requires a significant restructuring that should be done in steps. This also interacts with a similar planned reorganization of TImode constant handling. If all 128-bit (vector) constants are acceptable before combine, then STV has the freedom to choose V1TImode (and this broadcast functionality) to implement TImode operations on immediate constants. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline (in stage 1)?

2024-01-25  Roger Sayle
            Hongtao Liu

gcc/ChangeLog
PR target/106060
* config/i386/i386-expand.cc (enum ix86_vec_bcast_alg): New.
(struct ix86_vec_bcast_map_simode_t): New type for table below.
(ix86_vec_bcast_map_simode): Table of SImode constants that may
be efficiently synthesized by a ix86_vec_bcast_alg method.
(ix86_vec_bcast_map_simode_cmp): New comparator for bsearch.
(ix86_vector_duplicate_simode_const): Efficiently synthesize
V4SImode and V8SImode constants that duplicate special constants.
(ix86_vector_duplicate_value): Attempt to synthesize "special"
vector constants using ix86_vector_duplicate_simode_const.
* config/i386/i386.cc (ix86_rtx_costs) : ABS of a vector
integer mode costs with a single SSE instruction.
gcc/testsuite/ChangeLog PR target/106060 * gcc.target/i386/auto-init-8.c: Update test case. * gcc.target/i386/avx512fp16-3.c: Likewise. * gcc.target/i386/pr100865-9a.c: Likewise. * gcc.target/i386/pr101796-1.c: Likewise. * gcc.target/i386/pr106060-1.c: New test case. * gcc.target/i386/pr106060-2.c: Likewise. * gcc.target/i386/pr106060-3.c: Likewise. * gcc.target/i386/pr70314.c: Update test case. * gcc.target/i386/vect-shiftv4qi.c: Likewise. * gcc.target/i386/vect-shiftv8qi.c: Likewise. Roger -- > -Original Message- > From: Hongtao Liu > Sent: 17 January 2024 03:13 > To: Roger Sayle > Cc: gcc-patches@gcc.gnu.org; Uros Bizjak > Subject: Re: [x86 PATCH] PR target/106060: Improved SSE vector constant > materialization. > > On Wed, Jan 17, 2024 at 5:59 AM Roger Sayle > wrote: > > > > > > I thought I'd just missed the bug fixing season of stage3, but there > > appears to a little latitude in early stage4 (for vector patches), so > > I'll post this now. > > > > This patch resolves PR target/106060 by providing efficient methods > > for materializing/synthesizing special "vector" constants on x86. > > Currently there are three methods of materializing a vector constant; > > the most general is to load a vector from the constant pool, secondly > "duplicated" > > constants can be synthesized by moving an integer between units and > > broadcasting (or shuffling it), and finally the special cases of the > > all-zeros vector and all-ones vectors can be loaded via a single SSE > > instruction. This patch handles additional cases that can be synthesized > > in two instructions, loading an all-ones vector followed by another > > SSE instruction. Following my recent patch for PR target/112992, > > there's conveniently a single place in i386-expand.cc where these > > special cases can be handled. > > > > Two examples are given in the original bugzilla PR for 106060. 
> > > > __m256i > > should_be_cmpeq_abs () > > { > > return _mm256_set1_epi8 (1); > > } > > > > is now generated (with -O3 -march=x86-64-v3) as: > > > > vpcmpeqd%ymm0, %ymm0, %ymm0 > > vpabsb %ymm0, %ymm0 > > ret > > > > and > > > > __m256i > > should_be_cmpeq_add () > > { > > return _mm256_set1_epi8 (-2); > > } > > > > is now generated as: > > > > vpcmpeqd%ymm0, %ymm0, %ymm0 > > vpaddb %ymm0, %ymm0, %ymm0 > > ret > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with
RE: [middle-end PATCH] Prefer PLUS over IOR in RTL expansion of multi-word shifts/rotates.
Hi Richard, Thanks for the speedy review. I completely agree this patch can wait for stage1, but it's related to some recent work Andrew Pinski has been doing in match.pd, so I thought I'd share it. Hypothetically, recognizing (x<<4)+(x>>60) as a rotation at the tree level might lead to a code quality regression, if RTL expansion doesn't know to lower it back to use PLUS on those targets with lea but without rotate.

> From: Richard Biener
> Sent: 19 January 2024 11:04
> On Thu, Jan 18, 2024 at 8:55 PM Roger Sayle wrote:
> >
> > This patch tweaks RTL expansion of multi-word shifts and rotates to
> > use PLUS rather than IOR for disjunctive operations. During expansion
> > of these operations, the middle-end creates RTL like (X<<C1)|(X>>C2)
> > where the constants C1 and C2 guarantee that bits don't overlap.
> > Hence the IOR can be performed by any any_or_plus operation, such as
> > IOR, XOR or PLUS; for word-size operations where carry chains aren't
> > an issue these should all be equally fast (single-cycle) instructions.
> > The benefit of this change is that targets with shift-and-add insns,
> > like x86's lea, can benefit from the LSHIFT-ADD form.
> > An example of a backend that benefits is ARC, which is demonstrated
> > by these two simple functions:
> >
> > unsigned long long foo(unsigned long long x) { return x<<2; }
> >
> > which with -O2 is currently compiled to:
> >
> > foo:    lsr     r2,r0,30
> >         asl_s   r1,r1,2
> >         asl_s   r0,r0,2
> >         j_s.d   [blink]
> >         or_s    r1,r1,r2
> >
> > with this patch becomes:
> >
> > foo:    lsr     r2,r0,30
> >         add2    r1,r2,r1
> >         j_s.d   [blink]
> >         asl_s   r0,r0,2
> >
> > unsigned long long bar(unsigned long long x) { return (x<<2)|(x>>62); }
> >
> > which with -O2 is currently compiled to 6 insns + return:
> >
> > bar:    lsr     r12,r0,30
> >         asl_s   r3,r1,2
> >         asl_s   r0,r0,2
> >         lsr_s   r1,r1,30
> >         or_s    r0,r0,r1
> >         j_s.d   [blink]
> >         or      r1,r12,r3
> >
> > with this patch becomes 4 insns + return:
> >
> > bar:    lsr     r3,r1,30
> >         lsr     r2,r0,30
> >         add2    r1,r2,r1
> >         j_s.d   [blink]
> >         add2    r0,r3,r0
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures. Ok for mainline?
>
> For expand_shift_1 you add
>
> +    where C is the bitsize of A.  If N cannot be zero,
> +    use PLUS instead of IOR.
>
> but I don't see a check ensuring this other than maybe CONST_INT_P (op1)
> suggesting that we never end up with const0_rtx here. OTOH why is N zero a
> problem and why is it not in the optabs.cc case where I don't see any such
> check (at least not obvious)?

Excellent question. A common mistake in writing a rotate function in C or C++ is to write something like (x>>n)|(x<<(64-n)) or (x<<n)|(x>>(64-n)), which invokes undefined behavior when n == 0. It's OK to recognize these as rotates (relying on the undefined behavior), but correct/portable code (and RTL) needs the correct idiom (x>>n)|(x<<((-n)&63)), which never invokes undefined behaviour. One interesting property of this idiom is that a shift by zero is then calculated as (x>>0)|(x<<0), which is x|x.
This should then reveal the problem: for all non-zero shift values the IOR can be replaced by PLUS, but for zero shifts, X|X isn't the same as X+X or X^X. This only applies to single word rotations, and not to multi-word shifts nor multi-word rotates, which explains why this test is only in one place. In theory, we could use ranger to check whether a rotate by a variable amount can ever be by zero bits, but the simplification used here is to continue using IOR for variable shifts, and PLUS for fixed/known shift values. The last remaining insight is that we only need to check for CONST_INT_P, as rotations/shifts by const0_rtx are handled earlier in this function (and eliminated by the tree optimizers), i.e. a rotation by a known constant is implicitly a rotation by a known non-zero constant. This is a little clearer if you read/cite more of the comment that was changed. Fortunately, this case is also well covered by the testsuite. I'd be happy to change the code to read:

(CONST_INT_P (op1) && op1 != const0_rtx) ? add_optab : ior_optab

but the test "if (op1 == const0_rtx)" already appears on line 2570 of expmed.cc.

> Since this doesn't seem to fix a regression it probably has to wait for
> stage1 to re-open.
>
> Thanks,
> Richard.
>
> > 2024-01-18  Roger Sayle
> >
> > gcc/ChangeLog
> > * expmed.cc (expand_shift_1): Use add_optab instead of ior_optab
> > to generate PLUS instead of IOR when unioning disjoint bitfields.
> > * optabs.cc (expand_subword_shift): Likewise.
> > (expand_binop): Likewise for double-word rotate.

Thanks again.
[middle-end PATCH] Prefer PLUS over IOR in RTL expansion of multi-word shifts/rotates.
This patch tweaks RTL expansion of multi-word shifts and rotates to use PLUS rather than IOR for disjunctive operations. During expansion of these operations, the middle-end creates RTL like (X<<C1)|(X>>C2) where the constants C1 and C2 guarantee that bits don't overlap. Hence the IOR can be performed by any any_or_plus operation, such as IOR, XOR or PLUS; for word-size operations where carry chains aren't an issue these should all be equally fast (single-cycle) instructions. The benefit of this change is that targets with shift-and-add insns, like x86's lea, can benefit from the LSHIFT-ADD form.

An example of a backend that benefits is ARC, which is demonstrated by these two simple functions:

unsigned long long foo(unsigned long long x) { return x<<2; }

which with -O2 is currently compiled to:

foo:    lsr     r2,r0,30
        asl_s   r1,r1,2
        asl_s   r0,r0,2
        j_s.d   [blink]
        or_s    r1,r1,r2

with this patch becomes:

foo:    lsr     r2,r0,30
        add2    r1,r2,r1
        j_s.d   [blink]
        asl_s   r0,r0,2

unsigned long long bar(unsigned long long x) { return (x<<2)|(x>>62); }

which with -O2 is currently compiled to 6 insns + return:

bar:    lsr     r12,r0,30
        asl_s   r3,r1,2
        asl_s   r0,r0,2
        lsr_s   r1,r1,30
        or_s    r0,r0,r1
        j_s.d   [blink]
        or      r1,r12,r3

with this patch becomes 4 insns + return:

bar:    lsr     r3,r1,30
        lsr     r2,r0,30
        add2    r1,r2,r1
        j_s.d   [blink]
        add2    r0,r3,r0

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline?

2024-01-18  Roger Sayle

gcc/ChangeLog
* expmed.cc (expand_shift_1): Use add_optab instead of ior_optab
to generate PLUS instead of IOR when unioning disjoint bitfields.
* optabs.cc (expand_subword_shift): Likewise.
(expand_binop): Likewise for double-word rotate.
Thanks in advance, Roger -- diff --git a/gcc/expmed.cc b/gcc/expmed.cc index 5916d6ed1bc..d1900f97f0c 100644 --- a/gcc/expmed.cc +++ b/gcc/expmed.cc @@ -2610,10 +2610,11 @@ expand_shift_1 (enum tree_code code, machine_mode mode, rtx shifted, else if (methods == OPTAB_LIB_WIDEN) { /* If we have been unable to open-code this by a rotation, -do it as the IOR of two shifts. I.e., to rotate A -by N bits, compute +do it as the IOR or PLUS of two shifts. I.e., to rotate +A by N bits, compute (A << N) | ((unsigned) A >> ((-N) & (C - 1))) -where C is the bitsize of A. +where C is the bitsize of A. If N cannot be zero, +use PLUS instead of IOR. It is theoretically possible that the target machine might not be able to perform either shift and hence we would @@ -2650,8 +2651,9 @@ expand_shift_1 (enum tree_code code, machine_mode mode, rtx shifted, temp1 = expand_shift_1 (left ? RSHIFT_EXPR : LSHIFT_EXPR, mode, shifted, other_amount, subtarget, 1); - return expand_binop (mode, ior_optab, temp, temp1, target, - unsignedp, methods); + return expand_binop (mode, + CONST_INT_P (op1) ? add_optab : ior_optab, + temp, temp1, target, unsignedp, methods); } temp = expand_binop (mode, diff --git a/gcc/optabs.cc b/gcc/optabs.cc index ce91f94ed43..dcd3e406719 100644 --- a/gcc/optabs.cc +++ b/gcc/optabs.cc @@ -566,8 +566,8 @@ expand_subword_shift (scalar_int_mode op1_mode, optab binoptab, if (tmp == 0) return false; - /* Now OR in the bits carried over from OUTOF_INPUT. */ - if (!force_expand_binop (word_mode, ior_optab, tmp, carries, + /* Now OR/PLUS in the bits carried over from OUTOF_INPUT. 
*/ + if (!force_expand_binop (word_mode, add_optab, tmp, carries, into_target, unsignedp, methods)) return false; } @@ -1937,7 +1937,7 @@ expand_binop (machine_mode mode, optab binoptab, rtx op0, rtx op1, NULL_RTX, unsignedp, next_methods); if (into_temp1 != 0 && into_temp2 != 0) - inter = expand_binop (word_mode, ior_optab, into_temp1, into_temp2, + inter = expand_binop (word_mode, add_optab, into_temp1, into_temp2, into_target, unsignedp, next_methods); else inter = 0; @@ -1953,7 +1953,7 @@ expand_binop (machine_mode mode, optab binoptab, rtx op0, rtx op1, NULL_RTX, unsignedp, next_methods); if (inter != 0 && outof_temp1 !=
[x86 PATCH] PR target/106060: Improved SSE vector constant materialization.
I thought I'd just missed the bug fixing season of stage3, but there appears to be a little latitude in early stage4 (for vector patches), so I'll post this now. This patch resolves PR target/106060 by providing efficient methods for materializing/synthesizing special "vector" constants on x86. Currently there are three methods of materializing a vector constant; the most general is to load a vector from the constant pool, secondly "duplicated" constants can be synthesized by moving an integer between units and broadcasting (or shuffling it), and finally the special cases of the all-zeros vector and all-ones vectors can be loaded via a single SSE instruction. This patch handles additional cases that can be synthesized in two instructions, loading an all-ones vector followed by another SSE instruction. Following my recent patch for PR target/112992, there's conveniently a single place in i386-expand.cc where these special cases can be handled. Two examples are given in the original bugzilla PR for 106060. __m256i should_be_cmpeq_abs () { return _mm256_set1_epi8 (1); } is now generated (with -O3 -march=x86-64-v3) as: vpcmpeqd%ymm0, %ymm0, %ymm0 vpabsb %ymm0, %ymm0 ret and __m256i should_be_cmpeq_add () { return _mm256_set1_epi8 (-2); } is now generated as: vpcmpeqd%ymm0, %ymm0, %ymm0 vpaddb %ymm0, %ymm0, %ymm0 ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-01-16 Roger Sayle gcc/ChangeLog PR target/106060 * config/i386/i386-expand.cc (enum ix86_vec_bcast_alg): New. (struct ix86_vec_bcast_map_simode_t): New type for table below. (ix86_vec_bcast_map_simode): Table of SImode constants that may be efficiently synthesized by an ix86_vec_bcast_alg method. (ix86_vec_bcast_map_simode_cmp): New comparator for bsearch. (ix86_vector_duplicate_simode_const): Efficiently synthesize V4SImode and V8SImode constants that duplicate special constants. 
(ix86_vector_duplicate_value): Attempt to synthesize "special" vector constants using ix86_vector_duplicate_simode_const. * config/i386/i386.cc (ix86_rtx_costs) : ABS of a vector integer mode costs with a single SSE instruction. gcc/testsuite/ChangeLog PR target/106060 * gcc.target/i386/auto-init-8.c: Update test case. * gcc.target/i386/avx512fp16-3.c: Likewise. * gcc.target/i386/pr100865-9a.c: Likewise. * gcc.target/i386/pr106060-1.c: New test case. * gcc.target/i386/pr106060-2.c: Likewise. * gcc.target/i386/pr106060-3.c: Likewise. * gcc.target/i386/pr70314-3.c: Update test case. * gcc.target/i386/vect-shiftv4qi.c: Likewise. * gcc.target/i386/vect-shiftv8qi.c: Likewise. Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index 52754e1..f8f8af6 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -15638,6 +15638,288 @@ s4fma_expand: gcc_unreachable (); } +/* See below where shifts are handled for explanation of this enum. */ +enum ix86_vec_bcast_alg +{ + VEC_BCAST_PXOR, + VEC_BCAST_PCMPEQ, + VEC_BCAST_PABSB, + VEC_BCAST_PADDB, + VEC_BCAST_PSRLW, + VEC_BCAST_PSRLD, + VEC_BCAST_PSLLW, + VEC_BCAST_PSLLD +}; + +struct ix86_vec_bcast_map_simode_t +{ + unsigned int key; + enum ix86_vec_bcast_alg alg; + unsigned int arg; +}; + +/* This table must be kept sorted as values are looked-up using bsearch. 
*/ +static const ix86_vec_bcast_map_simode_t ix86_vec_bcast_map_simode[] = { + { 0x00000000, VEC_BCAST_PXOR, 0 }, + { 0x00000001, VEC_BCAST_PSRLD, 31 }, + { 0x00000003, VEC_BCAST_PSRLD, 30 }, + { 0x00000007, VEC_BCAST_PSRLD, 29 }, + { 0x0000000f, VEC_BCAST_PSRLD, 28 }, + { 0x0000001f, VEC_BCAST_PSRLD, 27 }, + { 0x0000003f, VEC_BCAST_PSRLD, 26 }, + { 0x0000007f, VEC_BCAST_PSRLD, 25 }, + { 0x000000ff, VEC_BCAST_PSRLD, 24 }, + { 0x000001ff, VEC_BCAST_PSRLD, 23 }, + { 0x000003ff, VEC_BCAST_PSRLD, 22 }, + { 0x000007ff, VEC_BCAST_PSRLD, 21 }, + { 0x00000fff, VEC_BCAST_PSRLD, 20 }, + { 0x00001fff, VEC_BCAST_PSRLD, 19 }, + { 0x00003fff, VEC_BCAST_PSRLD, 18 }, + { 0x00007fff, VEC_BCAST_PSRLD, 17 }, + { 0x0000ffff, VEC_BCAST_PSRLD, 16 }, + { 0x00010001, VEC_BCAST_PSRLW, 15 }, + { 0x0001ffff, VEC_BCAST_PSRLD, 15 }, + { 0x00030003, VEC_BCAST_PSRLW, 14 }, + { 0x0003ffff, VEC_BCAST_PSRLD, 14 }, + { 0x00070007, VEC_BCAST_PSRLW, 13 }, + { 0x0007ffff, VEC_BCAST_PSRLD, 13 }, + { 0x000f000f, VEC_BCAST_PSRLW, 12 }, + { 0x000fffff, VEC_BCAST_PSRLD, 12 }, + { 0x001f001f, VEC_BCAST_PSRLW, 11 }, + { 0x001fffff, VEC_BCAST_PSRLD, 11 }, + { 0x003f003f, VEC_BCAST_PSRLW, 10 }, + { 0x003fffff, VEC_BCAST_PSRLD, 10 }, + { 0x
[PATCH] PR rtl-optimization/111267: Improved forward propagation.
This patch resolves PR rtl-optimization/111267 by improving RTL-level forward propagation. This x86_64 code quality regression was caused (exposed) by my changes to improve how x86's (TImode) argument passing is represented at the RTL-level (reducing the use of SUBREGs to catch more optimization opportunities in combine). The pitfall is that the more complex RTL representations expose a limitation in RTL's fwprop pass. At the heart of fwprop, in try_fwprop_subst_pattern, the logic can be summarized as three steps. Step 1 is a heuristic that rejects the propagation attempt if the expression is too complex, step 2 calls the backend's recog to see if the propagated/simplified instruction is recognizable/valid, and step 3 then calls src_cost to compare the rtx costs of the replacement vs. the original, and accepts the transformation if the final cost is the same or better. The logic error (or missed optimization opportunity) is that the step 1 heuristic that attempts to predict (second guess) the process is flawed. Ultimately the decision on whether to fwprop or not should depend solely on actual improvement, as measured by RTX costs. Hence the prototype fix in the bugzilla PR removes the heuristic of calling prop.profitable_p entirely, relying on the cost comparison in step 3. Unfortunately, things are a tiny bit more complicated. The cost comparison in fwprop uses the older set_src_cost API and not the newer (preferred) insn_cost API as currently used in combine. This means that the cost improvement comparisons are only done for single_set instructions (more complex PARALLELs etc. aren't supported). Hence we can only rely on skipping step 1 for that subset of instructions actually evaluated by step 3. The other subtlety is that to avoid potential infinite loops in fwprop we should only rely purely on rtx costs when the transformation is obviously an improvement. 
If the replacement has the same cost as the original, we can use the prop.profitable_p test to preserve the current behavior. Finally, to answer Richard Biener's remaining question about this approach: yes, there is an asymmetry between how patterns are handled and how REG_EQUAL notes are handled. For example, at the moment propagation into notes doesn't use rtx costs at all, and ultimately when fwprop is updated to use insn_cost, this (and recog) obviously isn't applicable to notes. There's no reason the logic need be identical between patterns and notes, and during stage4 we only need to update propagation into patterns to fix this P1 regression (notes and use of insn_cost can be done for GCC 15). For Jakub's reduced testcase: struct S { float a, b, c, d; }; int bar (struct S x, struct S y) { return x.b <= y.d && x.c >= y.a; } On x86_64-pc-linux-gnu with -O2 gcc currently generates: bar:movq%xmm2, %rdx movq%xmm3, %rax movq%xmm0, %rsi xchgq %rdx, %rax movq%rsi, %rcx movq%rax, %rsi movq%rdx, %rax shrq$32, %rcx shrq$32, %rax movd%ecx, %xmm4 movd%eax, %xmm0 comiss %xmm4, %xmm0 jb .L6 movd%esi, %xmm0 xorl%eax, %eax comiss %xmm0, %xmm1 setnb %al ret .L6:xorl%eax, %eax ret with this simple patch to fwprop, we now generate: bar:shufps $85, %xmm0, %xmm0 shufps $85, %xmm3, %xmm3 comiss %xmm0, %xmm3 jb .L6 xorl%eax, %eax comiss %xmm2, %xmm1 setnb %al ret .L6:xorl%eax, %eax ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Additionally, it also resolves the FAIL for gcc.target/i386/pr82580.c. Ok for mainline? 2024-01-16 Roger Sayle gcc/ChangeLog PR rtl-optimization/111267 * fwprop.cc (try_fwprop_subst_pattern): Only bail out early when !prop.profitable_p for instructions that are not single sets. When comparing costs, bail out if the cost is unchanged and !prop.profitable_p. 
gcc/testsuite/ChangeLog PR rtl-optimization/111267 * gcc.target/i386/pr111267.c: New test case. Thanks in advance (and to Jeff Law for his guidance/help), Roger -- diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc index 0c588f8..f06225a 100644 --- a/gcc/fwprop.cc +++ b/gcc/fwprop.cc @@ -449,7 +449,10 @@ try_fwprop_subst_pattern (obstack_watermark &attempt, insn_change &use_change, if (prop.num_replacements == 0) return false; - if (!prop.profitable_p ()) + if (!prop.profitable_p () + && (prop.changed_mem_p () + || use_insn->is_asm () + || !single_set (use_rtl))) { if (dump_file && (dump_flags & TDF_DETAILS)) fprintf (dump_file, "cannot propagate from insn %d into" @@ -481,7 +484,8 @@ try_fwprop_
[PATCH/RFC] Add --with-dwarf4 configure option.
This patch fixes three of the four unexpected failures that I'm seeing in the gcc testsuite on x86_64-pc-linux-gnu. The three FAILs are: FAIL: gcc.c-torture/execute/fprintf-2.c -O3 -g (test for excess errors) FAIL: gcc.c-torture/execute/printf-2.c -O3 -g (test for excess errors) FAIL: gcc.c-torture/execute/user-printf.c -O3 -g (test for excess errors) and are caused by the linker/toolchain (GNU ld 2.27 on RedHat 7) issuing a link-time warning: /usr/bin/ld: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information. This also explains why these c-torture tests only fail with -g. One solution might be to tweak/improve GCC's testsuite to ignore these warnings. However, ideally it should also be possible to configure gcc not to generate dwarf5 debugging information on systems that don't/can't support it. This patch supplements the current --with-dwarf2 configure option with the addition of a --with-dwarf4 option that adds a tm-dwarf4.h to $tm_file (using the same mechanism as --with-dwarf2) that changes/redefines DWARF_VERSION_DEFAULT to 4 (overriding the current default of 5). This patch has been tested on x86_64-pc-linux-gnu, with a full make bootstrap, both with and without --with-dwarf4. This fixes the three failures above, and causes no new failures outside of the gcc.dg/guality directory. Unfortunately, the guality testsuite contains a large number of tests that assume support for dwarf5 and don't (yet) check check_effective_target_dwarf5. Hopefully, adding --with-dwarf4 will help improve/test the portability of the guality testsuite. Ok for mainline? An alternative implementation might be to allow integer values for $with_dwarf such that --with-dwarf5, --with-dwarf3 etc. do the right thing. In fact, I'd originally misread the documentation and assumed --with-dwarf4 was already supported. 2024-01-14 Roger Sayle gcc/ChangeLog * configure.ac: Add a --with-dwarf4 option. * configure: Regenerate. 
* config/tm-dwarf4.h: New target file to define DWARF_VERSION_DEFAULT to 4. Thanks in advance, Roger -- diff --git a/gcc/configure.ac b/gcc/configure.ac index 596e5f2..2ce9093 100644 --- a/gcc/configure.ac +++ b/gcc/configure.ac @@ -1036,6 +1036,11 @@ AC_ARG_WITH(dwarf2, dwarf2="$with_dwarf2", dwarf2=no) +AC_ARG_WITH(dwarf4, +[AS_HELP_STRING([--with-dwarf4], [force the default debug format to be DWARF 4])], +dwarf4="$with_dwarf4", +dwarf4=no) + AC_ARG_ENABLE(shared, [AS_HELP_STRING([--disable-shared], [don't provide a shared libgcc])], [ @@ -1916,6 +1921,10 @@ if test x"$dwarf2" = xyes then tm_file="$tm_file tm-dwarf2.h" fi +if test x"$dwarf4" = xyes +then tm_file="$tm_file tm-dwarf4.h" +fi + # Say what files are being used for the output code and MD file. echo "Using \`$srcdir/config/$out_file' for machine-specific logic." echo "Using \`$srcdir/config/$md_file' as machine description file." diff --git a/gcc/config/tm-dwarf4.h b/gcc/config/tm-dwarf4.h new file mode 100644 index 000..9557b40 --- /dev/null +++ b/gcc/config/tm-dwarf4.h @@ -0,0 +1,3 @@ +/* Make Dwarf4 debugging info the default */ +#undef DWARF_VERSION_DEFAULT +#define DWARF_VERSION_DEFAULT 4
RE: [libatomic PATCH] Fix testsuite regressions on ARM [raspberry pi].
Hi Richard, As you've recommended, this issue has now been filed in bugzilla as PR other/113336. As explained in the new PR, libatomic's testsuite used to pass on armv6 (raspberry pi) in previous GCC releases, but the code was incorrect/non-synchronous; this was reported as PR target/107567 and PR target/109166. Now that those issues have been fixed, we now see that there's a missing dependency in libatomic that's required to implement this functionality correctly. I'm more convinced that my fix is correct, but it's perhaps a little disappointing that libatomic doesn't have a (multi-threaded) run-time test to search for race conditions, and confirm its implementations are correctly serializing. Please let me know what you think. Best regards, Roger -- > -Original Message- > From: Richard Earnshaw > Sent: 10 January 2024 15:34 > To: Roger Sayle ; gcc-patches@gcc.gnu.org > Subject: Re: [libatomic PATCH] Fix testsuite regressions on ARM [raspberry > pi]. > > > > On 08/01/2024 16:07, Roger Sayle wrote: > > > > Bootstrapping GCC on arm-linux-gnueabihf with --with-arch=armv6 > > currently has a large number of FAILs in libatomic (regressions since > > last time I attempted this). The failure mode is related to IFUNC > > handling with the file tas_8_2_.o containing an unresolved reference > > to the function libat_test_and_set_1_i2. > > > > Bearing in mind I've no idea what's going on, the following one line > > change, to build tas_1_2_.o when building tas_8_2_.o, resolves the > > problem for me and restores the libatomic testsuite to 44 expected > > passes and 5 unsupported tests [from 22 unexpected failures and 22 > > unresolved > testcases]. > > > > If this looks like the correct fix, I'm not confident with rebuilding > > Makefile.in with correct version of automake, so I'd very much > > appreciate it if someone/the reviewer/mainainer could please check this in > > for > me. > > Thanks in advance. 
> > > > > > 2024-01-08 Roger Sayle > > > > libatomic/ChangeLog > > * Makefile.am: Build tas_1_2_.o on ARCH_ARM_LINUX > > * Makefile.in: Regenerate. > > > > > > Roger > > -- > > > > Hi Roger, > > I don't really understand all this make foo :( so I'm not sure if this is the > right fix > either. If this is, as you say, a regression, have you been able to track > down when > it first started to occur? That might also help me to understand what > changed to > cause this. > > Perhaps we should have a PR for this, to make tracking the fixes easier. > > R.
[libatomic PATCH] Fix testsuite regressions on ARM [raspberry pi].
Bootstrapping GCC on arm-linux-gnueabihf with --with-arch=armv6 currently has a large number of FAILs in libatomic (regressions since last time I attempted this). The failure mode is related to IFUNC handling with the file tas_8_2_.o containing an unresolved reference to the function libat_test_and_set_1_i2. Bearing in mind I've no idea what's going on, the following one line change, to build tas_1_2_.o when building tas_8_2_.o, resolves the problem for me and restores the libatomic testsuite to 44 expected passes and 5 unsupported tests [from 22 unexpected failures and 22 unresolved testcases]. If this looks like the correct fix, I'm not confident with rebuilding Makefile.in with the correct version of automake, so I'd very much appreciate it if someone/the reviewer/maintainer could please check this in for me. Thanks in advance. 2024-01-08 Roger Sayle libatomic/ChangeLog * Makefile.am: Build tas_1_2_.o on ARCH_ARM_LINUX. * Makefile.in: Regenerate. Roger -- diff --git a/libatomic/Makefile.am b/libatomic/Makefile.am index cfad90124f9..e0988a18c9a 100644 --- a/libatomic/Makefile.am +++ b/libatomic/Makefile.am @@ -139,6 +139,7 @@ if ARCH_ARM_LINUX IFUNC_OPTIONS = -march=armv7-a+fp -DHAVE_KERNEL64 libatomic_la_LIBADD += $(foreach s,$(SIZES),$(addsuffix _$(s)_1_.lo,$(SIZEOBJS))) libatomic_la_LIBADD += $(addsuffix _8_2_.lo,$(SIZEOBJS)) +libatomic_la_LIBADD += $(addsuffix _1_2_.lo,$(SIZEOBJS)) endif if ARCH_I386 IFUNC_OPTIONS = -march=i586
RE: [x86_64 PATCH] PR target/112992: Optimize mode for broadcast of constants.
Hi Hongtao, Many thanks for the review. This revised patch implements several of your suggestions, specifically to use pshufd for V4SImode and punpcklqdq for V2DImode. These changes are demonstrated by the examples below: typedef unsigned int v4si __attribute((vector_size(16))); typedef unsigned long long v2di __attribute((vector_size(16))); v4si foo() { return (v4si){1,1,1,1}; } v2di bar() { return (v2di){1,1}; } The previous version of my patch generated: foo:movdqa .LC0(%rip), %xmm0 ret bar:movdqa .LC1(%rip), %xmm0 ret with this revised version, -O2 generates: foo:movl$1, %eax movd%eax, %xmm0 pshufd $0, %xmm0, %xmm0 ret bar:movl$1, %eax movq%rax, %xmm0 punpcklqdq %xmm0, %xmm0 ret However, if it's OK with you, I'd prefer to allow this function to return false, safely falling back to emitting a vector load from the constant pool rather than ICEing from a gcc_assert. For one thing this isn't an unrecoverable correctness issue, but at worst a missed optimization. The deeper reason is that this usefully provides a handle for tuning on different microarchitectures. On some (AMD?) machines, where !TARGET_INTER_UNIT_MOVES_TO_VEC, the first form above may be preferable to the second. Currently the start of ix86_convert_const_wide_int_to_broadcast disables broadcasts for !TARGET_INTER_UNIT_MOVES_TO_VEC even when an implementation doesn't require an inter unit move, such as a broadcast from memory. I plan follow-up patches that benefit from this flexibility. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? gcc/ChangeLog PR target/112992 * config/i386/i386-expand.cc (ix86_convert_const_wide_int_to_broadcast): Allow call to ix86_expand_vector_init_duplicate to fail, and return NULL_RTX. (ix86_broadcast_from_constant): Revert recent change; Return a suitable MEMREF independently of mode/target combinations. 
(ix86_expand_vector_move): Allow ix86_expand_vector_init_duplicate to decide whether expansion is possible/preferable. Only try forcing DImode constants to memory (and trying again) if calling ix86_expand_vector_init_duplicate fails with a DImode immediate constant. (ix86_expand_vector_init_duplicate) : Try using V4SImode for suitable immediate constants. : Try using V8SImode for suitable constants. : Fail for CONST_INT_P, i.e. use constant pool. : Likewise. : For CONST_INT_P try using V4SImode via widen. : For CONST_INT_P try using V8HImode via widen. : Handle CONST_INTs via simplify_binary_operation. Allow recursive calls to ix86_expand_vector_init_duplicate to fail. : For CONST_INT_P try V8SImode via widen. : For CONST_INT_P try V16HImode via widen. (ix86_expand_vector_init): Move try using a broadcast for all_same with ix86_expand_vector_init_duplicate before using constant pool. gcc/testsuite/ChangeLog * gcc.target/i386/auto-init-8.c: Update test case. * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Likewise. * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise. * gcc.target/i386/avx512fp16-13.c: Likewise. * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise. * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise. * gcc.target/i386/pr100865-1.c: Likewise. * gcc.target/i386/pr100865-10a.c: Likewise. * gcc.target/i386/pr100865-10b.c: Likewise. * gcc.target/i386/pr100865-2.c: Likewise. * gcc.target/i386/pr100865-3.c: Likewise. * gcc.target/i386/pr100865-4a.c: Likewise. * gcc.target/i386/pr100865-4b.c: Likewise. * gcc.target/i386/pr100865-5a.c: Likewise. * gcc.target/i386/pr100865-5b.c: Likewise. * gcc.target/i386/pr100865-9a.c: Likewise. * gcc.target/i386/pr100865-9b.c: Likewise. * gcc.target/i386/pr102021.c: Likewise. * gcc.target/i386/pr90773-17.c: Likewise. Thanks in advance. 
Roger -- > -Original Message- > From: Hongtao Liu > Sent: 02 January 2024 05:40 > To: Roger Sayle > Cc: gcc-patches@gcc.gnu.org; Uros Bizjak > Subject: Re: [x86_64 PATCH] PR target/112992: Optimize mode for broadcast of > constants. > > On Fri, Dec 22, 2023 at 6:25 PM Roger Sayle > wrote: > > > > > > This patch resolves the second part of PR target/112992, building upon > > Hongtao Liu's solution to the first part. > > > > The issue addressed by this patch is that when initializing vectors by > > broadcasting integer constants, the compiler has the flexibility to > > select the most appropriate vector mode to perform the broadcast, as &
[x86 PATCH] PR target/113231: Improved costs in Scalar-To-Vector (STV) pass.
This patch improves the cost/gain calculation used during the i386 backend's SImode/DImode scalar-to-vector (STV) conversion pass. The current code handles loads and stores, but doesn't consider that converting other scalar operations with a memory destination requires an explicit load before and an explicit store after the vector equivalent. To ease the review, the significant change looks like: /* For operations on memory operands, include the overhead of explicit load and store instructions. */ if (MEM_P (dst)) igain += !optimize_insn_for_size_p () ? (m * (ix86_cost->int_load[2] + ix86_cost->int_store[2]) - (ix86_cost->sse_load[sse_cost_idx] + ix86_cost->sse_store[sse_cost_idx])) : -COSTS_N_BYTES (8); however the patch itself is complicated by a change in indentation which leads to a number of lines with only whitespace changes. For architectures where integer load/store costs are the same as vector load/store costs, there should be no change without -Os/-Oz. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-01-06 Roger Sayle gcc/ChangeLog PR target/113231 * config/i386/i386-features.cc (compute_convert_gain): Include the overhead of explicit load and store (movd) instructions when converting non-store scalar operations with memory destinations. gcc/testsuite/ChangeLog PR target/113231 * gcc.target/i386/pr113231.c: New test case. 
Thanks again, Roger -- diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc index 4ae3e75..3677aef 100644 --- a/gcc/config/i386/i386-features.cc +++ b/gcc/config/i386/i386-features.cc @@ -563,183 +563,195 @@ general_scalar_chain::compute_convert_gain () else if (MEM_P (src) && REG_P (dst)) igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx]; else - switch (GET_CODE (src)) - { - case ASHIFT: - case ASHIFTRT: - case LSHIFTRT: - if (m == 2) - { - if (INTVAL (XEXP (src, 1)) >= 32) - igain += ix86_cost->add; - /* Gain for extend highpart case. */ - else if (GET_CODE (XEXP (src, 0)) == ASHIFT) - igain += ix86_cost->shift_const - ix86_cost->sse_op; - else - igain += ix86_cost->shift_const; - } - - igain += ix86_cost->shift_const - ix86_cost->sse_op; + { + /* For operations on memory operands, include the overhead +of explicit load and store instructions. */ + if (MEM_P (dst)) + igain += !optimize_insn_for_size_p () +? (m * (ix86_cost->int_load[2] ++ ix86_cost->int_store[2]) + - (ix86_cost->sse_load[sse_cost_idx] + + ix86_cost->sse_store[sse_cost_idx])) +: -COSTS_N_BYTES (8); - if (CONST_INT_P (XEXP (src, 0))) - igain -= vector_const_cost (XEXP (src, 0)); - break; + switch (GET_CODE (src)) + { + case ASHIFT: + case ASHIFTRT: + case LSHIFTRT: + if (m == 2) + { + if (INTVAL (XEXP (src, 1)) >= 32) + igain += ix86_cost->add; + /* Gain for extend highpart case. 
*/ + else if (GET_CODE (XEXP (src, 0)) == ASHIFT) + igain += ix86_cost->shift_const - ix86_cost->sse_op; + else + igain += ix86_cost->shift_const; + } - case ROTATE: - case ROTATERT: - igain += m * ix86_cost->shift_const; - if (TARGET_AVX512VL) - igain -= ix86_cost->sse_op; - else if (smode == DImode) - { - int bits = INTVAL (XEXP (src, 1)); - if ((bits & 0x0f) == 0) - igain -= ix86_cost->sse_op; - else if ((bits & 0x07) == 0) - igain -= 2 * ix86_cost->sse_op; - else - igain -= 3 * ix86_cost->sse_op; - } - else if (INTVAL (XEXP (src, 1)) == 16) - igain -= ix86_cost->sse_op; - else - igain -= 2 * ix86_cost->sse_op; - break; + igain += ix86_cost->shift_const - ix86_cost->sse_op; - case AND: - case IOR: - case XOR: - case PLUS: - case MINUS: - igain += m * ix86_cost->add - ix86_cost->sse_op; - /* Additional gain for
[middle-end PATCH take #2] Only call targetm.truly_noop_truncation for truncations.
Very many thanks (and a Happy New Year) to the pre-commit patch testing folks at linaro.org. Their testing has revealed that although my patch is clean on x86_64, it triggers some problems on aarch64 and arm. The issue (with the previous version of my patch) is that these platforms require a paradoxical subreg to be generated by the middle-end, where we were previously checking for truly_noop_truncation. This has been fixed (in revision 2) below. Where previously I had: @@ -66,7 +66,9 @@ gen_lowpart_general (machine_mode mode, rtx x) scalar_int_mode xmode; if (is_a <scalar_int_mode> (GET_MODE (x), &xmode) && GET_MODE_SIZE (xmode) <= UNITS_PER_WORD - && TRULY_NOOP_TRUNCATION_MODES_P (mode, xmode) + && (known_lt (GET_MODE_SIZE (mode), GET_MODE_SIZE (xmode)) + ? TRULY_NOOP_TRUNCATION_MODES_P (mode, xmode) + : known_eq (GET_MODE_SIZE (mode), GET_MODE_SIZE (xmode))) && !reload_completed) return gen_lowpart_general (mode, force_reg (xmode, x)); the correct change is: scalar_int_mode xmode; if (is_a <scalar_int_mode> (GET_MODE (x), &xmode) && GET_MODE_SIZE (xmode) <= UNITS_PER_WORD - && TRULY_NOOP_TRUNCATION_MODES_P (mode, xmode) + && (known_ge (GET_MODE_SIZE (mode), GET_MODE_SIZE (xmode)) + || TRULY_NOOP_TRUNCATION_MODES_P (mode, xmode)) && !reload_completed) return gen_lowpart_general (mode, force_reg (xmode, x)); i.e. we only call TRULY_NOOP_TRUNCATION_MODES_P when we know we have a truncation, but the behaviour of non-truncations is preserved (no longer depends upon unspecified behaviour) and gen_lowpart_general is called to create the paradoxical SUBREG. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? Hopefully this revision tests cleanly on the linaro.org CI pipeline. 2023-12-31 Roger Sayle gcc/ChangeLog * combine.cc (make_extraction): Confirm that OUTPREC is less than INPREC before calling TRULY_NOOP_TRUNCATION_MODES_P. * expmed.cc (store_bit_field_using_insv): Likewise. 
(extract_bit_field_using_extv): Likewise. (extract_bit_field_as_subreg): Likewise. * optabs-query.cc (get_best_extraction_insn): Likewise. * optabs.cc (expand_parity): Likewise. * rtlhooks.cc (gen_lowpart_general): Likewise. * simplify-rtx.cc (simplify_truncation): Disallow truncations to the same precision. (simplify_unary_operation_1) : Move optimization of truncations to the same mode earlier. > -Original Message- > From: Roger Sayle > Sent: 28 December 2023 15:35 > To: 'gcc-patches@gcc.gnu.org' > Cc: 'Jeff Law' > Subject: [middle-end PATCH] Only call targetm.truly_noop_truncation for > truncations. > > > The truly_noop_truncation target hook is documented, in target.def, as "true if it > is safe to convert a value of inprec bits to one of outprec bits (where outprec is > smaller than inprec) by merely operating on it as if it had only outprec bits", i.e. > the middle-end can use a SUBREG instead of a TRUNCATE. > > What's perhaps potentially a little ambiguous in the above description is whether > it is the caller or the callee that's responsible for ensuring or checking whether > "outprec < inprec". The name TRULY_NOOP_TRUNCATION_P, like > SUBREG_PROMOTED_P, may be prone to being understood as a predicate that > confirms that something is a no-op truncation or a promoted subreg, when in fact > the caller must first confirm this is a truncation/subreg and only then call the > "classification" macro. > > Alas making the following minor tweak (for testing) to the i386 backend: > > static bool > ix86_truly_noop_truncation (poly_uint64 outprec, poly_uint64 inprec) { > gcc_assert (outprec < inprec); > return true; > } > > #undef TARGET_TRULY_NOOP_TRUNCATION > #define TARGET_TRULY_NOOP_TRUNCATION ix86_truly_noop_truncation > > reveals that there are numerous callers in middle-end that rely on the default > behaviour of silently returning true for any (invalid) input. > These are fixed below. 
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and > make -k check, both with and without --target_board=unix{-m32} with no new > failures. Ok for mainline? > > > 2023-12-28 Roger Sayle > > gcc/ChangeLog > * combine.cc (make_extraction): Confirm that OUTPREC is less than > INPREC before calling TRULY_NOOP_TRUNCATION_MODES_P. > * expmed.cc (store_bit_field_using_insv): Likewise. > (extract_bit_field_using_extv): Likewise. > (extract_bit_field_as_subreg
RE: [x86_PATCH] peephole2 to resolve failure of gcc.target/i386/pr43644-2.c
Hi Uros, > From: Uros Bizjak > Sent: 28 December 2023 10:33 > On Fri, Dec 22, 2023 at 11:14 AM Roger Sayle > wrote: > > > > This patch resolves the failure of pr43644-2.c in the testsuite, a > > code quality test I added back in July, that started failing as the > > code GCC generates for 128-bit values (and their parameter passing) > > has been in flux. After a few attempts at tweaking pattern > > constraints in the hope of convincing reload to produce a more > > aggressive (but potentially > > unsafe) register allocation, I think the best solution is to use a > > peephole2 to catch/clean-up this specific case. > > > > Specifically, the function: > > > > unsigned __int128 foo(unsigned __int128 x, unsigned long long y) { > > return x+y; > > } > > > > currently generates: > > > > foo:movq%rdx, %rcx > > movq%rdi, %rax > > movq%rsi, %rdx > > addq%rcx, %rax > > adcq$0, %rdx > > ret > > > > and with this patch/peephole2 now generates: > > > > foo:movq%rdx, %rax > > movq%rsi, %rdx > > addq%rdi, %rax > > adcq$0, %rdx > > ret > > > > which I believe is optimal. > > How about simply moving the assignment to the MSB in the split pattern after > the > LSB calculation: > > [(set (match_dup 0) (match_dup 4)) > - (set (match_dup 5) (match_dup 2)) >(parallel [(set (reg:CCC FLAGS_REG) > (compare:CCC > (plus:DWIH (match_dup 0) (match_dup 1)) > (match_dup 0))) > (set (match_dup 0) > (plus:DWIH (match_dup 0) (match_dup 1)))]) > + (set (match_dup 5) (match_dup 2)) >(parallel [(set (match_dup 5) > (plus:DWIH > (plus:DWIH > > There is an earlyclobber on the output operand, so we are sure that > assignments > to (op 0) and (op 5) won't clobber anything. > cprop_hardreg pass will then do the cleanup for us, resulting in: > > foo: movq%rdi, %rax >addq%rdx, %rax >movq%rsi, %rdx > adcq$0, %rdx > > Uros. I agree. This is a much better fix. 
This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-12-31 Uros Bizjak Roger Sayle gcc/ChangeLog PR target/43644 * config/i386/i386.md (*add<dwi>3_doubleword_concat_zext): Tweak order of instructions after split, to minimize number of moves. gcc/testsuite/ChangeLog PR target/43644 * gcc.target/i386/pr43644-2.c: Expect 2 movq instructions. Thanks again (and Happy New Year). Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index e862368..6671274 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -6412,13 +6412,13 @@ "#" "&& reload_completed" [(set (match_dup 0) (match_dup 4)) - (set (match_dup 5) (match_dup 2)) (parallel [(set (reg:CCC FLAGS_REG) (compare:CCC (plus:DWIH (match_dup 0) (match_dup 1)) (match_dup 0))) (set (match_dup 0) (plus:DWIH (match_dup 0) (match_dup 1)))]) + (set (match_dup 5) (match_dup 2)) (parallel [(set (match_dup 5) (plus:DWIH (plus:DWIH diff --git a/gcc/testsuite/gcc.target/i386/pr43644-2.c b/gcc/testsuite/gcc.target/i386/pr43644-2.c index d470b0a..3316ac6 100644 --- a/gcc/testsuite/gcc.target/i386/pr43644-2.c +++ b/gcc/testsuite/gcc.target/i386/pr43644-2.c @@ -6,4 +6,4 @@ unsigned __int128 foo(unsigned __int128 x, unsigned long long y) return x+y; } -/* { dg-final { scan-assembler-times "movq" 1 } } */ +/* { dg-final { scan-assembler-times "movq" 2 } } */
RE: [PATCH] Improved RTL expansion of field assignments into promoted registers.
Hi Jeff,
Thanks for the speedy review.

> On 12/28/23 07:59, Roger Sayle wrote:
> > This patch fixes PR rtl-optimization/104914 by tweaking/improving the
> > way that fields are written into a pseudo register that needs to be
> > kept sign extended.
> Well, I think "fixes" is a bit of a stretch. We're avoiding the issue by changing the
> early RTL generation, but if I understand what's going on in the RTL optimizers
> and MIPS backend correctly, the core bug still remains. Admittedly I haven't put it
> under a debugger, but that MIPS definition of NOOP_TRUNCATION just seems
> badly wrong and is just waiting to pop its ugly head up again.

I think this really is the/a correct fix. The MIPS backend defines NOOP_TRUNCATION to false, so it's not correct to use a SUBREG to convert from DImode to SImode. The problem then is where in the compiler (middle-end or backend) is this invalid SUBREG being created and how can it be fixed. In this particular case, the fault is in RTL expansion. There may be other places where a SUBREG is inappropriately used instead of a TRUNCATE, but this is the place where things go wrong for PR rtl-optimization/104914. Once an inappropriate SImode SUBREG is in the RTL stream, it can remain harmlessly latent (most of the time), unless it gets split, simplified or spilled. Copying this SImode expression into its own pseudo results in incorrect code. One approach might be to use an UNSPEC for places where backend invariants are temporarily invalid, but in this case it's machine-independent middle-end code that's using SUBREGs as though the target were an x86/pdp11.

So I agree that on the surface, both of these appear to be identical:
> (set (reg:DI) (sign_extend:DI (truncate:SI (reg:DI))))
> (set (reg:DI) (sign_extend:DI (subreg:SI (reg:DI) 0)))
But should they get split or spilled by reload:
(set (reg_tmp:SI) (subreg:SI (reg:DI) 0))
(set (reg:DI) (sign_extend:DI (reg_tmp:SI)))
is invalid as the reg_tmp isn't correctly sign-extended for SImode.
But,
(set (reg_tmp:SI) (truncate:SI (reg:DI)))
(set (reg:DI) (sign_extend:DI (reg_tmp:SI)))
is fine. The difference is the instant in time, when the SUBREG's invariants aren't yet valid (and its contents shouldn't be thought of as SImode). On nvptx, where truly_noop_truncation is always "false", it'd show the same bug/failure, if it were not for the fact that nvptx doesn't attempt to store values in "mode extended" (SUBREG_PROMOTED_VAR_P) registers. The bug is really in MODE_REP_EXTENDED support.

> > The motivating example from the bugzilla PR is:
> >
> > extern void ext(int);
> > void foo(const unsigned char *buf) {
> >   int val;
> >   ((unsigned char*)&val)[0] = *buf++;
> >   ((unsigned char*)&val)[1] = *buf++;
> >   ((unsigned char*)&val)[2] = *buf++;
> >   ((unsigned char*)&val)[3] = *buf++;
> >   if(val > 0)
> >     ext(1);
> >   else
> >     ext(0);
> > }
> >
> > which at the end of the tree optimization passes looks like:
> >
> > void foo (const unsigned char * buf)
> > {
> >   int val;
> >   unsigned char _1;
> >   unsigned char _2;
> >   unsigned char _3;
> >   unsigned char _4;
> >   int val.5_5;
> >
> >   <bb 2> [local count: 1073741824]:
> >   _1 = *buf_7(D);
> >   MEM[(unsigned char *)&val] = _1;
> >   _2 = MEM[(const unsigned char *)buf_7(D) + 1B];
> >   MEM[(unsigned char *)&val + 1B] = _2;
> >   _3 = MEM[(const unsigned char *)buf_7(D) + 2B];
> >   MEM[(unsigned char *)&val + 2B] = _3;
> >   _4 = MEM[(const unsigned char *)buf_7(D) + 3B];
> >   MEM[(unsigned char *)&val + 3B] = _4;
> >   val.5_5 = val;
> >   if (val.5_5 > 0)
> >     goto <bb 3>; [59.00%]
> >   else
> >     goto <bb 4>; [41.00%]
> >
> >   <bb 3> [local count: 633507681]:
> >   ext (1);
> >   goto <bb 5>; [100.00%]
> >
> >   <bb 4> [local count: 440234144]:
> >   ext (0);
> >
> >   <bb 5> [local count: 1073741824]:
> >   val ={v} {CLOBBER(eol)};
> >   return;
> >
> > }
> >
> > Here four bytes are being sequentially written into the SImode value
> > val. On some platforms, such as MIPS64, this SImode value is kept in
> > a 64-bit register, suitably sign-extended.
The function > > expand_assignment contains logic to handle this via > > SUBREG_PROMOTED_VAR_P (around line 6264 in expr.cc) which outputs an > > explicit extension operation after each store_field (typically insv) to such > promoted/extended pseudos. > > > > The first observation is that there's no need to perform sign > > extension after each byte in the example above; the extension is only > > required after changes to the most significant byte (i.e.
[PATCH] MIPS: Implement TARGET_INSN_COSTS
The current (default) behavior is that when the target doesn't define TARGET_INSN_COST the middle-end uses the backend's TARGET_RTX_COSTS, so multiplications are slower than additions, but about the same size when optimizing for size (with -Os or -Oz). All of this gets disabled with your proposed patch. [If you don't check speed, you probably shouldn't touch insn_cost]. I agree that a backend can fine tune the (speed and size) costs of instructions (especially complex !single_set instructions) via attributes in the machine description, but these should be used to override/fine-tune rtx_costs, not override/replace/duplicate them. Having accurate rtx_costs also helps RTL expansion and the earlier optimizers, but insn_cost is used by combine and the later RTL optimization passes, once instructions have been recognized.

Might I also recommend that instead of insn_count*perf_ratio*4, or even the slightly better COSTS_N_INSNS (insn_count*perf_ratio), you encode the relative cost in the attribute, avoiding the multiplication (at runtime), and allowing fine tuning like "COSTS_N_INSNS(2) - 1". Likewise, COSTS_N_BYTES is a very useful macro for a backend to define/use in rtx_costs. Conveniently for many RISC machines, 1 instruction takes about 4 bytes, so COSTS_N_INSNS (1) is (approximately) comparable to COSTS_N_BYTES (4).

I hope this helps. Perhaps something like:

static int
mips_insn_cost (rtx_insn *insn, bool speed)
{
  int cost;
  if (recog_memoized (insn) >= 0)
    {
      if (speed)
	{
	  /* Use cost if provided.  */
	  cost = get_attr_cost (insn);
	  if (cost > 0)
	    return cost;
	}
      else
	{
	  /* If optimizing for size, we want the insn size.  */
	  return get_attr_length (insn);
	}
    }

  if (rtx set = single_set (insn))
    cost = set_rtx_cost (set, speed);
  else
    cost = pattern_cost (PATTERN (insn), speed);
  /* If the cost is zero, then it's likely a complex insn.  We don't
     want the cost of these to be less than something we know about.  */
  return cost ? cost : COSTS_N_INSNS (2);
}
[middle-end PATCH] Only call targetm.truly_noop_truncation for truncations.
The truly_noop_truncation target hook is documented, in target.def, as "true if it is safe to convert a value of inprec bits to one of outprec bits (where outprec is smaller than inprec) by merely operating on it as if it had only outprec bits", i.e. the middle-end can use a SUBREG instead of a TRUNCATE. What's perhaps potentially a little ambiguous in the above description is whether it is the caller or the callee that's responsible for ensuring or checking whether "outprec < inprec". The name TRULY_NOOP_TRUNCATION_P, like SUBREG_PROMOTED_P, may be prone to being understood as a predicate that confirms that something is a no-op truncation or a promoted subreg, when in fact the caller must first confirm this is a truncation/subreg and only then call the "classification" macro. Alas making the following minor tweak (for testing) to the i386 backend: static bool ix86_truly_noop_truncation (poly_uint64 outprec, poly_uint64 inprec) { gcc_assert (outprec < inprec); return true; } #undef TARGET_TRULY_NOOP_TRUNCATION #define TARGET_TRULY_NOOP_TRUNCATION ix86_truly_noop_truncation reveals that there are numerous callers in middle-end that rely on the default behaviour of silently returning true for any (invalid) input. These are fixed below. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-12-28 Roger Sayle gcc/ChangeLog * combine.cc (make_extraction): Confirm that OUTPREC is less than INPREC before calling TRULY_NOOP_TRUNCATION_MODES_P. * expmed.cc (store_bit_field_using_insv): Likewise. (extract_bit_field_using_extv): Likewise. (extract_bit_field_as_subreg): Likewise. * optabs-query.cc (get_best_extraction_insn): Likewise. * optabs.cc (expand_parity): Likewise. * rtlhooks.cc (gen_lowpart_general): Likewise. * simplify-rtx.cc (simplify_truncation): Disallow truncations to the same precision. 
(simplify_unary_operation_1) : Move optimization of truncations to the same mode earlier. Thanks in advance, Roger -- diff --git a/gcc/combine.cc b/gcc/combine.cc index f2c64a9..5aa2f57 100644 --- a/gcc/combine.cc +++ b/gcc/combine.cc @@ -7613,7 +7613,8 @@ make_extraction (machine_mode mode, rtx inner, HOST_WIDE_INT pos, && (pos == 0 || REG_P (inner)) && (inner_mode == tmode || !REG_P (inner) - || TRULY_NOOP_TRUNCATION_MODES_P (tmode, inner_mode) + || (known_lt (GET_MODE_SIZE (tmode), GET_MODE_SIZE (inner_mode)) + && TRULY_NOOP_TRUNCATION_MODES_P (tmode, inner_mode)) || reg_truncated_to_mode (tmode, inner)) && (! in_dest || (REG_P (inner) @@ -7856,6 +7857,8 @@ make_extraction (machine_mode mode, rtx inner, HOST_WIDE_INT pos, /* On the LHS, don't create paradoxical subregs implicitely truncating the register unless TARGET_TRULY_NOOP_TRUNCATION. */ if (in_dest + && known_lt (GET_MODE_SIZE (GET_MODE (inner)), + GET_MODE_SIZE (wanted_inner_mode)) && !TRULY_NOOP_TRUNCATION_MODES_P (GET_MODE (inner), wanted_inner_mode)) return NULL_RTX; diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index 0bba93f..8940d47 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -26707,6 +26707,16 @@ ix86_libm_function_max_error (unsigned cfn, machine_mode mode, #define TARGET_RUN_TARGET_SELFTESTS selftest::ix86_run_selftests #endif /* #if CHECKING_P */ +static bool +ix86_truly_noop_truncation (poly_uint64 outprec, poly_uint64 inprec) +{ + gcc_assert (outprec < inprec); + return true; +} + +#undef TARGET_TRULY_NOOP_TRUNCATION +#define TARGET_TRULY_NOOP_TRUNCATION ix86_truly_noop_truncation + struct gcc_target targetm = TARGET_INITIALIZER; #include "gt-i386.h" diff --git a/gcc/expmed.cc b/gcc/expmed.cc index 05331dd..6398bf9 100644 --- a/gcc/expmed.cc +++ b/gcc/expmed.cc @@ -651,6 +651,7 @@ store_bit_field_using_insv (const extraction_insn *insv, rtx op0, X) 0)) is (reg:N X). 
*/ if (GET_CODE (xop0) == SUBREG && REG_P (SUBREG_REG (xop0)) + && paradoxical_subreg_p (xop0) && !TRULY_NOOP_TRUNCATION_MODES_P (GET_MODE (SUBREG_REG (xop0)), op_mode)) { @@ -1585,7 +1586,11 @@ extract_bit_field_using_extv (const extraction_insn *extv, rtx op0, mode. Instead, create a temporary and use convert_move to set the target. */ if (REG_P (target) - && TRULY_NOOP_TRUNCATION_MODES_P (GET_MODE (target), ext_mode) + && (known_lt (GET_MODE_SIZE (GET_MODE (target)), + GET_
[PATCH] Improved RTL expansion of field assignments into promoted registers.
This patch fixes PR rtl-optimization/104914 by tweaking/improving the way that fields are written into a pseudo register that needs to be kept sign extended.

The motivating example from the bugzilla PR is:

extern void ext(int);
void foo(const unsigned char *buf) {
  int val;
  ((unsigned char*)&val)[0] = *buf++;
  ((unsigned char*)&val)[1] = *buf++;
  ((unsigned char*)&val)[2] = *buf++;
  ((unsigned char*)&val)[3] = *buf++;
  if(val > 0)
    ext(1);
  else
    ext(0);
}

which at the end of the tree optimization passes looks like:

void foo (const unsigned char * buf)
{
  int val;
  unsigned char _1;
  unsigned char _2;
  unsigned char _3;
  unsigned char _4;
  int val.5_5;

  <bb 2> [local count: 1073741824]:
  _1 = *buf_7(D);
  MEM[(unsigned char *)&val] = _1;
  _2 = MEM[(const unsigned char *)buf_7(D) + 1B];
  MEM[(unsigned char *)&val + 1B] = _2;
  _3 = MEM[(const unsigned char *)buf_7(D) + 2B];
  MEM[(unsigned char *)&val + 2B] = _3;
  _4 = MEM[(const unsigned char *)buf_7(D) + 3B];
  MEM[(unsigned char *)&val + 3B] = _4;
  val.5_5 = val;
  if (val.5_5 > 0)
    goto <bb 3>; [59.00%]
  else
    goto <bb 4>; [41.00%]

  <bb 3> [local count: 633507681]:
  ext (1);
  goto <bb 5>; [100.00%]

  <bb 4> [local count: 440234144]:
  ext (0);

  <bb 5> [local count: 1073741824]:
  val ={v} {CLOBBER(eol)};
  return;

}

Here four bytes are being sequentially written into the SImode value val. On some platforms, such as MIPS64, this SImode value is kept in a 64-bit register, suitably sign-extended. The function expand_assignment contains logic to handle this via SUBREG_PROMOTED_VAR_P (around line 6264 in expr.cc) which outputs an explicit extension operation after each store_field (typically insv) to such promoted/extended pseudos.

The first observation is that there's no need to perform sign extension after each byte in the example above; the extension is only required after changes to the most significant byte (i.e. to a field that overlaps the most significant bit). The bug fix is actually a bit more subtle, but at this point during code expansion it's not safe to use a SUBREG when sign-extending this field.
Currently, GCC generates (sign_extend:DI (subreg:SI (reg:DI) 0)), but combine (and other RTL optimizers) later realize that, because SImode values are always sign-extended in their 64-bit hard registers, this is a no-op and eliminate it. The trouble is that it's unsafe to refer to the SImode lowpart of a 64-bit register using SUBREG at those critical points when temporarily the value isn't correctly sign-extended, and the usual backend invariants don't hold. At these critical points, the middle-end needs to use an explicit TRUNCATE rtx (as this isn't a TRULY_NOOP_TRUNCATION), so that the explicit sign-extension looks like (sign_extend:DI (truncate:SI (reg:DI))), which avoids the problem. Note that MODE_REP_EXTENDED (NARROW, WIDE) != UNKNOWN implies (or should imply) !TRULY_NOOP_TRUNCATION (NARROW, WIDE). I've another (independent) patch that I'll post in a few minutes. This middle-end patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. The cc1 from a cross-compiler to mips64 appears to generate much better code for the above test case. Ok for mainline? 2023-12-28 Roger Sayle gcc/ChangeLog PR rtl-optimization/104914 * expr.cc (expand_assignment): When target is SUBREG_PROMOTED_VAR_P a sign or zero extension is only required if the modified field overlaps the SUBREG's most significant bit. On MODE_REP_EXTENDED targets, don't refer to the temporarily incorrectly extended value using a SUBREG, but instead generate an explicit TRUNCATE rtx. Thanks in advance, Roger -- diff --git a/gcc/expr.cc b/gcc/expr.cc index 9fef2bf6585..1a34b48e38f 100644 --- a/gcc/expr.cc +++ b/gcc/expr.cc @@ -6272,19 +6272,32 @@ expand_assignment (tree to, tree from, bool nontemporal) && known_eq (bitpos, 0) && known_eq (bitsize, GET_MODE_BITSIZE (GET_MODE (to_rtx result = store_expr (from, to_rtx, 0, nontemporal, false); - else + /* Check if the field overlaps the MSB, requiring extension.
*/ + else if (known_eq (bitpos + bitsize, +GET_MODE_BITSIZE (GET_MODE (to_rtx { - rtx to_rtx1 - = lowpart_subreg (subreg_unpromoted_mode (to_rtx), - SUBREG_REG (to_rtx), - subreg_promoted_mode (to_rtx)); + scalar_int_mode imode = subreg_unpromoted_mode (to_rtx); + scalar_int_mode omode = subreg_promoted_mode (to_rtx); + rtx to_rtx1 = lowpart_subreg (imode, SUBREG_REG (to_rtx), + omode); result = store_field (to_rtx1, bitsize, bitpos,
RE: [PATCH v3] EXPR: Emit an truncate if 31+ bits polluted for SImode
> > > What's exceedingly weird is T_N_T_M_P (DImode, SImode) isn't
> > > actually a truncation! The output precision is first, the input
> > > precision is second. The docs explicitly state the output precision
> > > should be smaller than the input precision (which makes sense for
> > > truncation).
> > >
> > > That's where I'd start with trying to untangle this mess.
> >
> > Thanks (both) for correcting my misunderstanding.
> > At the very least might I suggest that we introduce a new
> > TRULY_NOOP_EXTENSION_MODES_P target hook that MIPS can use for this
> > purpose? It'd help reduce confusion, and keep the
> > documentation/function naming correct.
> >
> Yes. It is good for me.
> T_N_T_M_P is a really confusing name.

Ignore my suggestion for a new target hook. GCC already has one. You shouldn't be using TRULY_NOOP_TRUNCATION_MODES_P with incorrectly ordered arguments. The correct target hook is TARGET_MODE_REP_EXTENDED, which the MIPS backend correctly defines via mips_mode_rep_extended. It's MIPS's definition of (and interpretation of) mips_truly_noop_truncation that's suspect.

My latest theory is that these sign extensions should be:
(set (reg:DI) (sign_extend:DI (truncate:SI (reg:DI))))
and not
(set (reg:DI) (sign_extend:DI (subreg:SI (reg:DI) 0)))
If the RTL optimizers ever split this instruction, the semantics of the SUBREG intermediate are incorrect. Another (less desirable) approach might be to use an UNSPEC.
RE: [PATCH v3] EXPR: Emit an truncate if 31+ bits polluted for SImode
> What's exceedingly weird is T_N_T_M_P (DImode, SImode) isn't actually a > truncation! The output precision is first, the input precision is second. > The docs > explicitly state the output precision should be smaller than the input > precision > (which makes sense for truncation). > > That's where I'd start with trying to untangle this mess. Thanks (both) for correcting my misunderstanding. At the very least might I suggest that we introduce a new TRULY_NOOP_EXTENSION_MODES_P target hook that MIPS can use for this purpose? It'd help reduce confusion, and keep the documentation/function naming correct. When Richard Sandiford "hookized" truly_noop_truncation in 2017 https://gcc.gnu.org/legacy-ml/gcc-patches/2017-09/msg00836.html he mentions the inprec/outprec confusion [deciding not to add a gcc_assert outprec < inprec here, which might be a good idea]. The next question is whether this is just TRULY_NOOP_SIGN_EXTENSION_MODES_P or whether there are any targets that usefully ensure some modes are zero-extended forms of others. TRULY_NOOP_ZERO_EXTENSION... My vote is that a DINS instruction that updates the most significant bit of an SImode value should then expand or define_insn_and_split with an explicit following sign-extension operation. To avoid this being eliminated by the RTL optimizers/combine the DINS should return a DImode result, with the following extension truncating it to canonical SImode form. This preserves the required backend invariant (and doesn't require tweaking machine-independent code in combine). SImode DINS instructions that don't/can't affect the MSB, can be a single SImode instruction. Cheers, Roger --
RE: Re: [PATCH v3] EXPR: Emit an truncate if 31+ bits polluted for SImode
> There's a PR in Bugzilla around this representational issue on MIPS, but I can't find > it straight away. Found it. It's PR rtl-optimization/104914, where we've already discussed this in comments #15 and #16. > -Original Message- > From: Roger Sayle > Sent: 24 December 2023 00:50 > To: 'gcc-patches@gcc.gnu.org' ; 'YunQiang Su' > > Cc: 'Jeff Law' > Subject: Re: [PATCH v3] EXPR: Emit an truncate if 31+ bits polluted for SImode > > > Hi YunQiang (and Jeff), > > > MIPS claims TRULY_NOOP_TRUNCATION_MODES_P (DImode, SImode)) == > true > > based on that the hard register is always sign-extended, but here the > > hard register is polluted by zero_extract. > > I suspect that the bug here is that the MIPS backend shouldn't be returning > true for TRULY_NOOP_TRUNCATION_MODES_P (DImode, SImode). It's true > that the backend stores SImode values in DImode registers by sign extending > them, but this doesn't mean that any DImode pseudo register can be truncated to > an SImode pseudo just by SUBREG/register naming. As you point out, if the high > bits of a DImode value are random, truncation isn't a no-op, and requires an > explicit sign-extension instruction. > > There's a PR in Bugzilla around this representational issue on MIPS, but I can't find > it straight away. > > Out of curiosity, how badly affected is the testsuite if mips.cc's > mips_truly_noop_truncation (poly_uint64 outprec, poly_uint64 inprec) is changed > to just return !TARGET_64BIT ? > > I agree with Jeff there's an invariant that isn't correctly being modelled by the > MIPS machine description. A machine description probably shouldn't define an > addsi3 pattern if what it actually supports is (sign_extend:DI (truncate:SI (plus:DI > (reg:DI x) (reg:DI y Trying to model this as SImode addition plus a > SUBREG_PROMOTED flag is less than ideal. > > Just my thoughts. I'm curious what other folks think. > > Cheers, > Roger > --
Re: [PATCH v3] EXPR: Emit an truncate if 31+ bits polluted for SImode
Hi YunQiang (and Jeff),

> MIPS claims TRULY_NOOP_TRUNCATION_MODES_P (DImode, SImode) == true
> based on that the hard register is always sign-extended, but here
> the hard register is polluted by zero_extract.

I suspect that the bug here is that the MIPS backend shouldn't be returning true for TRULY_NOOP_TRUNCATION_MODES_P (DImode, SImode). It's true that the backend stores SImode values in DImode registers by sign extending them, but this doesn't mean that any DImode pseudo register can be truncated to an SImode pseudo just by SUBREG/register naming. As you point out, if the high bits of a DImode value are random, truncation isn't a no-op, and requires an explicit sign-extension instruction. There's a PR in Bugzilla around this representational issue on MIPS, but I can't find it straight away. Out of curiosity, how badly affected is the testsuite if mips.cc's mips_truly_noop_truncation (poly_uint64 outprec, poly_uint64 inprec) is changed to just return !TARGET_64BIT ? I agree with Jeff there's an invariant that isn't correctly being modelled by the MIPS machine description. A machine description probably shouldn't define an addsi3 pattern if what it actually supports is (sign_extend:DI (truncate:SI (plus:DI (reg:DI x) (reg:DI y)))). Trying to model this as SImode addition plus a SUBREG_PROMOTED flag is less than ideal. Just my thoughts. I'm curious what other folks think. Cheers, Roger --
[ARC PATCH] Table-driven ashlsi implementation for better code/rtx_costs.
One of the cool features of the H8 backend is its use of tables to select optimal shift implementations for different CPU variants. This patch borrows (plagiarizes) that idiom for SImode left shifts in the ARC backend (for CPUs without a barrel-shifter). This provides a convenient mechanism for both selecting the best implementation strategy (for speed vs. size), and providing accurate rtx_costs [without duplicating a lot of logic]. Left shift RTX costs are especially important for use in synth_mult.

An example improvement is:

int foo(int x) { return 32768*x; }

which is now generated with -O2 -mcpu=em -mswap as:

foo:	bmsk_s	r0,r0,16
	swap	r0,r0
	j_s.d	[blink]
	ror	r0,r0

where previously the ARC backend would generate a loop:

foo:	mov	lp_count,15
	lp	2f
	add	r0,r0,r0
	nop
2:	# end single insn loop
	j_s	[blink]

Tested with a cross-compiler to arc-linux hosted on x86_64, with no new (compile-only) regressions from make -k check. Ok for mainline if this passes Claudiu's and/or Jeff's testing? [Thanks again to Jeff for finding the typo in my last ARC patch] 2023-12-23 Roger Sayle gcc/ChangeLog * config/arc/arc.cc (arc_shift_alg): New enumerated type for left shift implementation strategies. (arc_shift_info): Type for each entry of the shift strategy table. (arc_shift_context_idx): Return an integer value for each code generation context, used as an index. (arc_ashl_alg): Table indexed by context and shifted bit count. (arc_split_ashl): Use the arc_ashl_alg table to select SImode left shift implementation. (arc_rtx_costs) : Use the arc_ashl_alg table to provide accurate costs, when optimizing for speed or size. Thanks in advance, Roger -- diff --git a/gcc/config/arc/arc.cc b/gcc/config/arc/arc.cc index 3f4eb5a5736..925bffaa7d6 100644 --- a/gcc/config/arc/arc.cc +++ b/gcc/config/arc/arc.cc @@ -4222,6 +4222,253 @@ output_shift_loop (enum rtx_code code, rtx *operands) return ""; } +/* See below where shifts are handled for explanation of this enum.
*/
+enum arc_shift_alg
+{
+  SHIFT_MOVE,         /* Register-to-register move.  */
+  SHIFT_LOOP,         /* Zero-overhead loop implementation.  */
+  SHIFT_INLINE,       /* Multiple LSHIFTs and LSHIFT-PLUSs.  */
+  SHIFT_AND_ROT,      /* Bitwise AND, then ROTATERTs.  */
+  SHIFT_SWAP,         /* SWAP then multiple LSHIFTs/LSHIFT-PLUSs.  */
+  SHIFT_AND_SWAP_ROT  /* Bitwise AND, then SWAP, then ROTATERTs.  */
+};
+
+struct arc_shift_info {
+  enum arc_shift_alg alg;
+  unsigned int cost;
+};
+
+/* Return shift algorithm context, an index into the following tables.
+ * 0 for -Os (optimize for size)	3 for -O2 (optimized for speed)
+ * 1 for -Os -mswap TARGET_V2		4 for -O2 -mswap TARGET_V2
+ * 2 for -Os -mswap !TARGET_V2		5 for -O2 -mswap !TARGET_V2  */
+static unsigned int
+arc_shift_context_idx ()
+{
+  if (optimize_function_for_size_p (cfun))
+    {
+      if (!TARGET_SWAP)
+	return 0;
+      if (TARGET_V2)
+	return 1;
+      return 2;
+    }
+  else
+    {
+      if (!TARGET_SWAP)
+	return 3;
+      if (TARGET_V2)
+	return 4;
+      return 5;
+    }
+}
+
+static const arc_shift_info arc_ashl_alg[6][32] = {
+  { /* 0: -Os.
*/ +{ SHIFT_MOVE, COSTS_N_INSNS (1) }, /* 0 */ +{ SHIFT_INLINE, COSTS_N_INSNS (1) }, /* 1 */ +{ SHIFT_INLINE, COSTS_N_INSNS (2) }, /* 2 */ +{ SHIFT_INLINE, COSTS_N_INSNS (2) }, /* 3 */ +{ SHIFT_INLINE, COSTS_N_INSNS (3) }, /* 4 */ +{ SHIFT_INLINE, COSTS_N_INSNS (3) }, /* 5 */ +{ SHIFT_INLINE, COSTS_N_INSNS (3) }, /* 6 */ +{ SHIFT_INLINE, COSTS_N_INSNS (4) }, /* 7 */ +{ SHIFT_INLINE, COSTS_N_INSNS (4) }, /* 8 */ +{ SHIFT_INLINE, COSTS_N_INSNS (4) }, /* 9 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 10 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 11 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 12 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 13 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 14 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 15 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 16 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 17 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 18 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 19 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 20 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 21 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 22 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 23 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 24 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 25 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4) }, /* 26 */ +{ SHIFT_LOOP, COSTS_N_INSNS (4)
[x86_64 PATCH] PR target/112992: Optimize mode for broadcast of constants.
This patch resolves the second part of PR target/112992, building upon Hongtao Liu's solution to the first part. The issue addressed by this patch is that when initializing vectors by broadcasting integer constants, the compiler has the flexibility to select the most appropriate vector mode to perform the broadcast, as long as the resulting vector has an identical bit pattern. For example, the following constants are all equivalent:

V4SImode  {0x01010101, 0x01010101, 0x01010101, 0x01010101 }
V8HImode  {0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101 }
V16QImode {0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, ... 0x01 }

So instruction sequences that construct any of these can be used to construct the others (with a suitable cast/SUBREG). On x86_64, it turns out that broadcasts of SImode constants are preferred, as DImode constants often require a longer movabs instruction, and HImode and QImode broadcasts require multiple uops on some architectures. Hence, SImode is always at least tied for the shortest/fastest implementation. Examples of this improvement can be seen in the testsuite.
gcc.target/i386/pr102021.c
Before:
   0:	48 b8 0c 00 0c 00 0c	movabs $0xc000c000c000c,%rax
   7:	00 0c 00
   a:	62 f2 fd 28 7c c0	vpbroadcastq %rax,%ymm0
  10:	c3	retq
After:
   0:	b8 0c 00 0c 00	mov    $0xc000c,%eax
   5:	62 f2 7d 28 7c c0	vpbroadcastd %eax,%ymm0
   b:	c3	retq

and gcc.target/i386/pr90773-17.c:
Before:
   0:	48 8b 15 00 00 00 00	mov    0x0(%rip),%rdx	# 7
   7:	b8 0c 00 00 00	mov    $0xc,%eax
   c:	62 f2 7d 08 7a c0	vpbroadcastb %eax,%xmm0
  12:	62 f1 7f 08 7f 02	vmovdqu8 %xmm0,(%rdx)
  18:	c7 42 0f 0c 0c 0c 0c	movl   $0xc0c0c0c,0xf(%rdx)
  1f:	c3	retq
After:
   0:	48 8b 15 00 00 00 00	mov    0x0(%rip),%rdx	# 7
   7:	b8 0c 0c 0c 0c	mov    $0xc0c0c0c,%eax
   c:	62 f2 7d 08 7c c0	vpbroadcastd %eax,%xmm0
  12:	62 f1 7f 08 7f 02	vmovdqu8 %xmm0,(%rdx)
  18:	c7 42 0f 0c 0c 0c 0c	movl   $0xc0c0c0c,0xf(%rdx)
  1f:	c3	retq

where according to Agner Fog's instruction tables broadcastd is slightly faster on some microarchitectures, for example Knight's Landing. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-12-21 Roger Sayle gcc/ChangeLog PR target/112992 * config/i386/i386-expand.cc (ix86_convert_const_wide_int_to_broadcast): Allow call to ix86_expand_vector_init_duplicate to fail, and return NULL_RTX. (ix86_broadcast_from_constant): Revert recent change; Return a suitable MEMREF independently of mode/target combinations. (ix86_expand_vector_move): Allow ix86_expand_vector_init_duplicate to decide whether expansion is possible/preferable. Only try forcing DImode constants to memory (and trying again) if calling ix86_expand_vector_init_duplicate fails with a DImode immediate constant. (ix86_expand_vector_init_duplicate) : Try using V4SImode for suitable immediate constants. : Try using V8SImode for suitable constants. : Use constant pool for AVX without AVX2. : Fail for CONST_INT_P, i.e. use constant pool. : Likewise. : For CONST_INT_P try using V4SImode via widen.
: For CONST_INT_P try using V8HImode via widen. : Handle CONST_INTs via simplify_binary_operation. Allow recursive calls to ix86_expand_vector_init_duplicate to fail. : For CONST_INT_P try V8SImode via widen. : For CONST_INT_P try V16HImode via widen. (ix86_expand_vector_init): Move try using a broadcast for all_same with ix86_expand_vector_init_duplicate before using constant pool. gcc/testsuite/ChangeLog * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Update test case. * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise. * gcc.target/i386/avx512fp16-13.c: Likewise. * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise. * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise. * gcc.target/i386/pr100865-10a.c: Likewise. * gcc.target/i386/pr100865-10b.c: Likewise. * gcc.target/i386/pr100865-11c.c: Likewise. * gcc.target/i386/pr100865-12c.c: Likewise. * gcc.target/i386/pr100865-2.c: Likewise. * gcc.target/i386/pr100865-3.c: Likewise. * gcc.target/i386/pr100865-4a.c: Likewise. * gcc.target/i386/pr100865-4b.c: Likewise. * gcc.target/i386/pr100865-5a.c: Likewise. * gcc.target/i386/pr100865-5b.c: Likewise. * gcc.target/i386/pr100865-9a.c: Likewise. * gcc.target/i386/pr100865-9b.c: Likewise. * gcc.target/i386/pr102021.c: Likewise
[x86_PATCH] peephole2 to resolve failure of gcc.target/i386/pr43644-2.c
This patch resolves the failure of pr43644-2.c in the testsuite, a code quality test I added back in July, that started failing as the code GCC generates for 128-bit values (and their parameter passing) has been in flux. After a few attempts at tweaking pattern constraints in the hope of convincing reload to produce a more aggressive (but potentially unsafe) register allocation, I think the best solution is to use a peephole2 to catch/clean-up this specific case.

Specifically, the function:

unsigned __int128 foo(unsigned __int128 x, unsigned long long y) {
  return x+y;
}

currently generates:

foo:	movq	%rdx, %rcx
	movq	%rdi, %rax
	movq	%rsi, %rdx
	addq	%rcx, %rax
	adcq	$0, %rdx
	ret

and with this patch/peephole2 now generates:

foo:	movq	%rdx, %rax
	movq	%rsi, %rdx
	addq	%rdi, %rax
	adcq	$0, %rdx
	ret

which I believe is optimal. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-12-21 Roger Sayle gcc/ChangeLog PR target/43644 * config/i386/i386.md (define_peephole2): Tweak register allocation of *add3_doubleword_concat_zext. gcc/testsuite/ChangeLog PR target/43644 * gcc.target/i386/pr43644-2.c: Expect 2 movq instructions. Thanks in advance, and for your patience with this testsuite noise.
Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index e862368..5967208 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -6428,6 +6428,38 @@ (clobber (reg:CC FLAGS_REG))])] "split_double_mode (mode, [0], 1, [0], [5]);") +(define_peephole2 + [(set (match_operand:SWI48 0 "general_reg_operand") + (match_operand:SWI48 1 "general_reg_operand")) + (set (match_operand:SWI48 2 "general_reg_operand") + (match_operand:SWI48 3 "general_reg_operand")) + (set (match_dup 1) (match_operand:SWI48 4 "general_reg_operand")) + (parallel [(set (reg:CCC FLAGS_REG) + (compare:CCC +(plus:SWI48 (match_dup 2) (match_dup 0)) +(match_dup 2))) + (set (match_dup 2) + (plus:SWI48 (match_dup 2) (match_dup 0)))])] + "REGNO (operands[0]) != REGNO (operands[1]) + && REGNO (operands[0]) != REGNO (operands[2]) + && REGNO (operands[0]) != REGNO (operands[3]) + && REGNO (operands[0]) != REGNO (operands[4]) + && REGNO (operands[1]) != REGNO (operands[2]) + && REGNO (operands[1]) != REGNO (operands[3]) + && REGNO (operands[1]) != REGNO (operands[4]) + && REGNO (operands[2]) != REGNO (operands[3]) + && REGNO (operands[2]) != REGNO (operands[4]) + && REGNO (operands[3]) != REGNO (operands[4]) + && peep2_reg_dead_p (4, operands[0])" + [(set (match_dup 2) (match_dup 1)) + (set (match_dup 1) (match_dup 4)) + (parallel [(set (reg:CCC FLAGS_REG) + (compare:CCC + (plus:SWI48 (match_dup 2) (match_dup 3)) + (match_dup 2))) + (set (match_dup 2) + (plus:SWI48 (match_dup 2) (match_dup 3)))])]) + (define_insn "*add_1" [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r,r,r,r,r") (plus:SWI48 diff --git a/gcc/testsuite/gcc.target/i386/pr43644-2.c b/gcc/testsuite/gcc.target/i386/pr43644-2.c index d470b0a..3316ac6 100644 --- a/gcc/testsuite/gcc.target/i386/pr43644-2.c +++ b/gcc/testsuite/gcc.target/i386/pr43644-2.c @@ -6,4 +6,4 @@ unsigned __int128 foo(unsigned __int128 x, unsigned long long y) return x+y; } -/* { dg-final { scan-assembler-times "movq" 1 } } */ +/* { 
dg-final { scan-assembler-times "movq" 2 } } */
[x86 PATCH] Improved TImode (128-bit) integer constants on x86_64.
This patch fixes two issues with the handling of 128-bit TImode integer constants in the x86_64 backend. The main issue is that GCC always tries to load 128-bit integer constants via broadcasts to vector SSE registers, even if the result is required in general registers. This is seen in the two closely related functions below:

__int128 m;
#define CONST (((__int128)0x0123456789abcdefULL<<64) | 0x0123456789abcdefULL)
void foo() { m &= CONST; }
void bar() { m = CONST; }

When compiled with -O2 -mavx, we currently generate:

foo:    movabsq $81985529216486895, %rax
        vmovq   %rax, %xmm0
        vpunpcklqdq     %xmm0, %xmm0, %xmm0
        vmovq   %xmm0, %rax
        vpextrq $1, %xmm0, %rdx
        andq    %rax, m(%rip)
        andq    %rdx, m+8(%rip)
        ret
bar:    movabsq $81985529216486895, %rax
        vmovq   %rax, %xmm1
        vpunpcklqdq     %xmm1, %xmm1, %xmm0
        vpextrq $1, %xmm0, %rdx
        vmovq   %xmm0, m(%rip)
        movq    %rdx, m+8(%rip)
        ret

With this patch we defer the decision to use a vector broadcast for TImode until we know we actually want an SSE register result, by moving the call to ix86_convert_const_wide_int_to_broadcast from the RTL expansion pass to the scalar-to-vector (STV) pass. With this change (and a minor tweak described below) we now generate:

foo:    movabsq $81985529216486895, %rax
        andq    %rax, m(%rip)
        andq    %rax, m+8(%rip)
        ret
bar:    movabsq $81985529216486895, %rax
        vmovq   %rax, %xmm0
        vpunpcklqdq     %xmm0, %xmm0, %xmm0
        vmovdqa %xmm0, m(%rip)
        ret

showing that we now correctly use vector mode broadcasts (only) where appropriate. The one minor tweak mentioned above is to enable the un-cprop hi/lo optimization, which I originally contributed back in September 2004 https://gcc.gnu.org/pipermail/gcc-patches/2004-September/148756.html even when not optimizing for size.
Without this (and currently with just -O2) the function foo above generates:

foo:    movabsq $81985529216486895, %rax
        movabsq $81985529216486895, %rdx
        andq    %rax, m(%rip)
        andq    %rdx, m+8(%rip)
        ret

I'm not sure why (back in 2004) I thought that avoiding the implicit "movq %rax, %rdx" instead of a second load was faster, perhaps avoiding a dependency to allow better scheduling, but nowadays "movq %rax, %rdx" is either eliminated by GCC's hardreg cprop pass, or special-cased by modern hardware, making the first foo preferable, not only shorter but also faster.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, and with/without -march=cascadelake, with no new failures. Ok for mainline?

2023-12-18  Roger Sayle

gcc/ChangeLog
	* config/i386/i386-expand.cc (ix86_convert_const_wide_int_to_broadcast):
	Remove static.
	(ix86_expand_move): Don't attempt to convert wide constants
	to SSE using ix86_convert_const_wide_int_to_broadcast here.
	(ix86_split_long_move): Always un-cprop multi-word constants.
	* config/i386/i386-expand.h (ix86_convert_const_wide_int_to_broadcast):
	Prototype here.
	* config/i386/i386-features.cc: Include i386-expand.h.
	(timode_scalar_chain::convert_insn): When converting TImode to
	V1TImode, try ix86_convert_const_wide_int_to_broadcast.

gcc/testsuite/ChangeLog
	* gcc.target/i386/movti-2.c: New test case.
	* gcc.target/i386/movti-3.c: Likewise.

Thanks in advance, Roger
--

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index fad4f34..57a108a 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -289,7 +289,7 @@ ix86_broadcast (HOST_WIDE_INT v, unsigned int width,
 /* Convert the CONST_WIDE_INT operand OP to broadcast in MODE.
*/ -static rtx +rtx ix86_convert_const_wide_int_to_broadcast (machine_mode mode, rtx op) { /* Don't use integer vector broadcast if we can't move from GPR to SSE @@ -541,14 +541,6 @@ ix86_expand_move (machine_mode mode, rtx operands[]) return; } } - else if (CONST_WIDE_INT_P (op1) - && GET_MODE_SIZE (mode) >= 16) - { - rtx tmp = ix86_convert_const_wide_int_to_broadcast - (GET_MODE (op0), op1); - if (tmp != nullptr) - op1 = tmp; - } } } @@ -6323,18 +6315,15 @@ ix86_split_long_move (rtx operands[]) } } - /* If optimizing for size, attempt to locally unCSE nonzero constants. */ - if (optimize_insn_for_size_p ()) -{ - for (j = 0; j < nparts - 1; j++) - if (CONST_INT_P (operands[6 + j]) - && operands[6 + j] != const0_rtx - && REG_P (operands[2 + j])) - for (i = j; i < nparts - 1; i++) - if (CONST_INT_P (operand
[PING] PR112380: Defend against CLOBBERs in RTX expressions in combine.cc
I'd like to ping my patch for PR rtl-optimization/112380. https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636203.html For those unfamiliar with the (clobber (const_int 0)) idiom used by combine, I'll explain a little of the ancient history... Back before time, in the prehistory of git/subversion/cvs or even ChangeLogs, in March 1987 to be precise, Richard Stallman's GCC version 0.9 had RTL optimization passes similar to those in use today. This far back, combine.c contained the function gen_lowpart_for_combine, which was documented as "Like gen_lowpart but for use in combine" where "it is not possible to create any new pseudoregs." and "return zero if we don't see a way to make a lowpart.". And indeed, this function returned (rtx)0, and the single caller of gen_lowpart_for_combine checked whether the return value was non-zero.

Unfortunately, gcc 0.9's combine also contained bugs; at three places in combine.c, it called gen_lowpart, the first of these looked like:

return gen_rtx (AND, GET_MODE (x),
                gen_lowpart (GET_MODE (x), XEXP (to, 0)),
                XEXP (to, 1));

Time passes, and by version 1.21 in May 1988 (in fact before the earliest ChangeLogs were introduced for version 1.17 in January 1988), this issue had been identified, and a helpful reminder placed at the top of the code:

/* It is not safe to use ordinary gen_lowpart in combine.
   Use gen_lowpart_for_combine instead. See comments there. */
#define gen_lowpart dont_use_gen_lowpart_you_dummy

However, to save a little effort, and avoid checking the return value for validity at all of the callers of gen_lowpart_for_combine, RMS invented the "(clobber (const_int 0))" idiom, which was returned instead of zero. The comment above gen_lowpart_for_combine was modified to state:

/* If for some reason this cannot do its job, an rtx (clobber (const_int 0)) is returned. An insn containing that will not be recognized.
*/

Aside: Around this time Bjarne Stroustrup was also trying to avoid testing function return values for validity, and so introduced exceptions into C++. Thirty-five years later this decision (short-cut) still haunts combine. Using "(clobber (const_int 0))", which, like error_mark_node, can appear anywhere in an RTX expression, makes it hard to impose strict typing (to catch things like a CLOBBER of a CLOBBER), and as shown by bugzilla's PR rtl-optimization/112380, these RTXes occasionally escape from combine to cause problems in generic RTL handling functions. This patch doesn't eliminate combine.cc's use of (clobber (const_int 0)); we still allocate memory to indicate exceptional conditions, and require the garbage collector to clean things up, but testing the values returned from functions for errors/exceptions is good software engineering, and hopefully a step in the right direction. I'd hoped allowing combine to continue exploring alternate simplifications would also lead to better code generation, but I've not been able to find any examples on x86_64.

This patch has been retested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, with no new failures. Ok for mainline?

2023-11-12  Roger Sayle

gcc/ChangeLog
	PR rtl-optimization/112380
	* combine.cc (find_split_point): Check if gen_lowpart returned
	a CLOBBER.
	(subst): Check if combine_simplify_rtx returned a CLOBBER.
	(simplify_set): Check if force_to_mode returned a CLOBBER.
	Check if gen_lowpart returned a CLOBBER.
	(expand_field_assignment): Likewise.
	(make_extraction): Check if force_to_mode returned a CLOBBER.
	(force_int_to_mode): Likewise.
	(simplify_and_const_int_1): Check if VAROP is a CLOBBER, after
	call to force_to_mode (and before).
	(simplify_comparison): Check if force_to_mode returned a CLOBBER.
	Check if gen_lowpart returned a CLOBBER.

gcc/testsuite/ChangeLog
	PR rtl-optimization/112380
	* gcc.dg/pr112380.c: New test case.

Thanks in advance, Roger
--
RE: [ARC PATCH] Add *extvsi_n_0 define_insn_and_split for PR 110717.
Hi Jeff, Doh! Great catch. The perils of not (yet) being able to actually run any ARC execution tests myself. > Shouldn't operands[4] be GEN_INT ((HOST_WIDE_INT_1U << tmp) - 1)? Yes(-ish), operands[4] should be GEN_INT(HOST_WIDE_INT_1U << (tmp - 1)). And the 32s in the test cases need to be 16s (the MSB of a five bit field is 16). You're probably also thinking the same thing that I am... that it might be possible to implement this in the middle-end, but things are complicated by combine's make_compound_operation/expand_compound_operation, and that combine doesn't (normally) like turning two instructions into three. Fingers-crossed the attached patch works better on the nightly testers. Thanks in advance, Roger -- > -Original Message- > From: Jeff Law > Sent: 07 December 2023 14:47 > To: Roger Sayle ; gcc-patches@gcc.gnu.org > Cc: 'Claudiu Zissulescu' > Subject: Re: [ARC PATCH] Add *extvsi_n_0 define_insn_and_split for PR 110717. > > On 12/5/23 06:59, Roger Sayle wrote: > > This patch improves the code generated for bitfield sign extensions on > > ARC cpus without a barrel shifter. > > > > > > Compiling the following test case: > > > > int foo(int x) { return (x<<27)>>27; } > > > > with -O2 -mcpu=em, generates two loops: > > > > foo:mov lp_count,27 > > lp 2f > > add r0,r0,r0 > > nop > > 2: # end single insn loop > > mov lp_count,27 > > lp 2f > > asr r0,r0 > > nop > > 2: # end single insn loop > > j_s [blink] > > > > > > and the closely related test case: > > > > struct S { int a : 5; }; > > int bar (struct S *p) { return p->a; } > > > > generates the slightly better: > > > > bar:ldb_s r0,[r0] > > mov_s r2,0;3 > > add3r0,r2,r0 > > sexb_s r0,r0 > > asr_s r0,r0 > > asr_s r0,r0 > > j_s.d [blink] > > asr_s r0,r0 > > > > which uses 6 instructions to perform this particular sign extension. 
> > It turns out that sign extensions can always be implemented using at > > most three instructions on ARC (without a barrel shifter) using the > > idiom ((x)^msb)-msb [as described in section "2-5 Sign Extension" > > of Henry Warren's book "Hacker's Delight"]. Using this, the sign > > extensions above on ARC's EM both become: > > > > bmsk_s r0,r0,4 > > xor r0,r0,32 > > sub r0,r0,32 > > > > which takes about 3 cycles, compared to the ~112 cycles for the loops > > in foo. > > > > > > Tested with a cross-compiler to arc-linux hosted on x86_64, with no > > new (compile-only) regressions from make -k check. > > Ok for mainline if this passes Claudiu's nightly testing? > > > > > > 2023-12-05 Roger Sayle > > > > gcc/ChangeLog > > * config/arc/arc.md (*extvsi_n_0): New define_insn_and_split to > > implement SImode sign extract using a AND, XOR and MINUS sequence. > > > > gcc/testsuite/ChangeLog > > * gcc.target/arc/extvsi-1.c: New test case. > > * gcc.target/arc/extvsi-2.c: Likewise. > > > > > > Thanks in advance, > > Roger > > -- > > > > > > patchar.txt > > > > diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md index > > bf9f88eff047..5ebaf2e20ab0 100644 > > --- a/gcc/config/arc/arc.md > > +++ b/gcc/config/arc/arc.md > > @@ -6127,6 +6127,26 @@ archs4x, archs4xd" > > "" > > [(set_attr "length" "8")]) > > > > +(define_insn_and_split "*extvsi_n_0" > > + [(set (match_operand:SI 0 "register_operand" "=r") > > + (sign_extract:SI (match_operand:SI 1 "register_operand" "0") > > +(match_operand:QI 2 "const_int_operand") > > +(const_int 0)))] > > + "!TARGET_BARREL_SHIFTER > > + && IN_RANGE (INTVAL (operands[2]), 2, > > + (optimize_insn_for_size_p () ? 28 : 30))" > > + "#" > > + "&& 1" > > +[(set (match_dup 0) (and:SI (match_dup 0) (match_dup 3))) (set > > +(match_dup 0) (xor:SI (match_dup 0) (match_dup 4))) (set (match_dup > > +0) (minus:SI (match_dup 0) (match_dup 4)))] { > > + int tmp = INTVAL (operands[2]); > > + operands[3] = GEN_INT (~(HOST_WIDE_INT_M1U &
[ARC PATCH] Add *extvsi_n_0 define_insn_and_split for PR 110717.
This patch improves the code generated for bitfield sign extensions on ARC cpus without a barrel shifter.

Compiling the following test case:

int foo(int x) { return (x<<27)>>27; }

with -O2 -mcpu=em, generates two loops:

foo:    mov     lp_count,27
        lp      2f
        add     r0,r0,r0
        nop
2:      # end single insn loop
        mov     lp_count,27
        lp      2f
        asr     r0,r0
        nop
2:      # end single insn loop
        j_s     [blink]

and the closely related test case:

struct S { int a : 5; };
int bar (struct S *p) { return p->a; }

generates the slightly better:

bar:    ldb_s   r0,[r0]
        mov_s   r2,0    ;3
        add3    r0,r2,r0
        sexb_s  r0,r0
        asr_s   r0,r0
        asr_s   r0,r0
        j_s.d   [blink]
        asr_s   r0,r0

which uses 6 instructions to perform this particular sign extension. It turns out that sign extensions can always be implemented using at most three instructions on ARC (without a barrel shifter) using the idiom ((x)^msb)-msb [as described in section "2-5 Sign Extension" of Henry Warren's book "Hacker's Delight"]. Using this, the sign extensions above on ARC's EM both become:

        bmsk_s  r0,r0,4
        xor     r0,r0,32
        sub     r0,r0,32

which takes about 3 cycles, compared to the ~112 cycles for the loops in foo.

Tested with a cross-compiler to arc-linux hosted on x86_64, with no new (compile-only) regressions from make -k check. Ok for mainline if this passes Claudiu's nightly testing?

2023-12-05  Roger Sayle

gcc/ChangeLog
	* config/arc/arc.md (*extvsi_n_0): New define_insn_and_split to
	implement SImode sign extract using an AND, XOR and MINUS sequence.

gcc/testsuite/ChangeLog
	* gcc.target/arc/extvsi-1.c: New test case.
	* gcc.target/arc/extvsi-2.c: Likewise.
Thanks in advance, Roger -- diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md index bf9f88eff047..5ebaf2e20ab0 100644 --- a/gcc/config/arc/arc.md +++ b/gcc/config/arc/arc.md @@ -6127,6 +6127,26 @@ archs4x, archs4xd" "" [(set_attr "length" "8")]) +(define_insn_and_split "*extvsi_n_0" + [(set (match_operand:SI 0 "register_operand" "=r") + (sign_extract:SI (match_operand:SI 1 "register_operand" "0") +(match_operand:QI 2 "const_int_operand") +(const_int 0)))] + "!TARGET_BARREL_SHIFTER + && IN_RANGE (INTVAL (operands[2]), 2, + (optimize_insn_for_size_p () ? 28 : 30))" + "#" + "&& 1" +[(set (match_dup 0) (and:SI (match_dup 0) (match_dup 3))) + (set (match_dup 0) (xor:SI (match_dup 0) (match_dup 4))) + (set (match_dup 0) (minus:SI (match_dup 0) (match_dup 4)))] +{ + int tmp = INTVAL (operands[2]); + operands[3] = GEN_INT (~(HOST_WIDE_INT_M1U << tmp)); + operands[4] = GEN_INT (HOST_WIDE_INT_1U << tmp); +} + [(set_attr "length" "14")]) + (define_insn_and_split "rotlsi3_cnt1" [(set (match_operand:SI 0 "dest_reg_operand""=r") (rotate:SI (match_operand:SI 1 "register_operand" "r") diff --git a/gcc/testsuite/gcc.target/arc/extvsi-1.c b/gcc/testsuite/gcc.target/arc/extvsi-1.c new file mode 100644 index ..eb53c78b4e6d --- /dev/null +++ b/gcc/testsuite/gcc.target/arc/extvsi-1.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mcpu=em" } */ +struct S { int a : 5; }; + +int foo (struct S *p) +{ + return p->a; +} + +/* { dg-final { scan-assembler "msk_s\\s+r0,r0,4" } } */ +/* { dg-final { scan-assembler "xor\\s+r0,r0,32" } } */ +/* { dg-final { scan-assembler "sub\\s+r0,r0,32" } } */ +/* { dg-final { scan-assembler-not "add3\\s+r0,r2,r0" } } */ +/* { dg-final { scan-assembler-not "sext_s\\s+r0,r0" } } */ +/* { dg-final { scan-assembler-not "asr_s\\s+r0,r0" } } */ diff --git a/gcc/testsuite/gcc.target/arc/extvsi-2.c b/gcc/testsuite/gcc.target/arc/extvsi-2.c new file mode 100644 index ..a0c6894259d4 --- /dev/null +++ b/gcc/testsuite/gcc.target/arc/extvsi-2.c @@ 
-0,0 +1,12 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mcpu=em" } */ + +int foo(int x) +{ + return (x<<27)>>27; +} + +/* { dg-final { scan-assembler "msk_s\\s+r0,r0,4" } } */ +/* { dg-final { scan-assembler "xor\\s+r0,r0,32" } } */ +/* { dg-final { scan-assembler "sub\\s+r0,r0,32" } } */ +/* { dg-final { scan-assembler-not "lp\\s+2f" } } */
[PATCH] Workaround array_slice constructor portability issues (with older g++).
The recent change to represent language and target attribute tables using vec.h's array_slice template class triggers an issue/bug in older g++ compilers, specifically the g++ 4.8.5 system compiler of older RedHat distributions. This exhibits as the following compilation errors during bootstrap:

../../gcc/gcc/c/c-lang.cc:55:2661: error: could not convert '(const scoped_attribute_specs* const*)(& c_objc_attribute_table)' from 'const scoped_attribute_specs* const*' to 'array_slice' struct lang_hooks lang_hooks = LANG_HOOKS_INITIALIZER;

../../gcc/gcc/c/c-decl.cc:4657:1: error: could not convert '(const attribute_spec*)(& std_attributes)' from 'const attribute_spec*' to 'array_slice'

Here the issue is with constructors of the form:

static const int table[] = { 1, 2, 3 };
array_slice<const int> t = table;

Perhaps there's a fix possible in vec.h (an additional constructor?), but the patch below fixes this issue by using one of array_slice's constructors (that takes a size) explicitly, rather than rely on template resolution. In the example above this looks like:

array_slice<const int> t (table, 3);

or equivalently

array_slice<const int> t = array_slice<const int> (table, 3);

or equivalently

array_slice<const int> t = array_slice<const int> (table, ARRAY_SIZE (table));

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap, where these changes allow the bootstrap to complete. Ok for mainline? This fix might not be ideal, but it both draws attention to the problem and restores bootstrap whilst better approaches are investigated. For example, an ARRAY_SLICE(table) macro might be appropriate if there isn't an easy/portable template resolution solution. Thoughts?

2023-12-03  Roger Sayle

gcc/c-family/ChangeLog
	* c-attribs.cc (c_common_gnu_attribute_table): Use an explicit
	array_slice constructor with an explicit size argument.
	(c_common_format_attribute_table): Likewise.

gcc/c/ChangeLog
	* c-decl.cc (std_attribute_table): Use an explicit array_slice
	constructor with an explicit size argument.
* c-objc-common.h (LANG_HOOKS_ATTRIBUTE_TABLE): Likewise. gcc/ChangeLog * config/i386/i386-options.cc (ix86_gnu_attribute_table): Use an explicit array_slice constructor with an explicit size argument. * config/i386/i386.cc (TARGET_ATTRIBUTE_TABLE): Likewise. gcc/cp/ChangeLog * cp-objcp-common.h (LANG_HOOKS_ATTRIBUTE_TABLE): Use an explicit array_slice constructor with an explicit size argument. * tree.cc (cxx_gnu_attribute_table): Likewise. (std_attribute_table): Likewise. gcc/lto/ChangeLog * lto-lang.cc (lto_gnu_attribute_table): Use an explicit array_slice constructor with an explicit size argument. (lto_format_attribute_table): Likewise. (LANG_HOOKS_ATTRIBUTE_TABLE): Likewise. Thanks in advance, Roger -- diff --git a/gcc/c-family/c-attribs.cc b/gcc/c-family/c-attribs.cc index 45af074..af83588 100644 --- a/gcc/c-family/c-attribs.cc +++ b/gcc/c-family/c-attribs.cc @@ -584,7 +584,9 @@ const struct attribute_spec c_common_gnu_attributes[] = const struct scoped_attribute_specs c_common_gnu_attribute_table = { - "gnu", c_common_gnu_attributes + "gnu", + array_slice(c_common_gnu_attributes, + ARRAY_SIZE (c_common_gnu_attributes)) }; /* Give the specifications for the format attributes, used by C and all @@ -603,7 +605,9 @@ const struct attribute_spec c_common_format_attributes[] = const struct scoped_attribute_specs c_common_format_attribute_table = { - "gnu", c_common_format_attributes + "gnu", + array_slice(c_common_format_attributes, + ARRAY_SIZE (c_common_format_attributes)) }; /* Returns TRUE iff the attribute indicated by ATTR_ID takes a plain diff --git a/gcc/c/c-decl.cc b/gcc/c/c-decl.cc index 248d1bb..a6984b0 100644 --- a/gcc/c/c-decl.cc +++ b/gcc/c/c-decl.cc @@ -4653,7 +4653,8 @@ static const attribute_spec std_attributes[] = const scoped_attribute_specs std_attribute_table = { - nullptr, std_attributes + nullptr, array_slice(std_attributes, +ARRAY_SIZE (std_attributes)) }; /* Create the predefined scalar types of C, diff --git a/gcc/c/c-objc-common.h 
b/gcc/c/c-objc-common.h index 426d938..021c651 100644 --- a/gcc/c/c-objc-common.h +++ b/gcc/c/c-objc-common.h @@ -83,7 +83,8 @@ static const scoped_attribute_specs *const c_objc_attribute_table[] = }; #undef LANG_HOOKS_ATTRIBUTE_TABLE -#define LANG_HOOKS_ATTRIBUTE_TABLE c_objc_attribute_table +#define LANG_HOOKS_ATTRIBUTE_TABLE \ +array_slice (c_objc_attribute_table, ARRAY_SIZE (c_objc_attribute_table)) #undef LANG_HOOKS_TREE_DUMP_DUMP_TREE_FN #define LANG_HOOKS_TREE_DUMP_DUMP_TREE_FN c_dump_tree diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc index 8776592..50b3425 100644 --- a/gcc/config/i386/i386-options.cc +++ b/gcc/config/i386/i3
[RISC-V PATCH] Improve style to work around PR 60994 in host compiler.
This simple patch allows me to build a cross-compiler to riscv using older versions of RedHat's system compiler. The issue is PR c++/60994 where g++ doesn't like the same name (demand_flags) to be used by both a variable and a (enumeration) type, which is also undesirable from a (GNU) coding style perspective. One solution is to rename the type to demand_flags_t, but a less invasive change is to simply use another identifier for the problematic local variable, renaming demand_flags to dflags. This patch has been tested by building cc1 of a cross-compiler to riscv64-unknown-linux-gnu using g++ 4.8.5 as the host compiler. Ok for mainline? 2023-12-01 Roger Sayle gcc/ChangeLog * config/riscv/riscv-vsetvl.cc (csetvl_info::parse_insn): Rename local variable from demand_flags to dflags, to avoid conflicting with (enumeration) type of the same name. Thanks in advance, Roger -- diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc index b3e07d4..9d11416 100644 --- a/gcc/config/riscv/riscv-vsetvl.cc +++ b/gcc/config/riscv/riscv-vsetvl.cc @@ -987,11 +987,11 @@ public: /* Determine the demand info of the RVV insn. */ m_max_sew = get_max_int_sew (); -unsigned demand_flags = 0; +unsigned dflags = 0; if (vector_config_insn_p (insn->rtl ())) { - demand_flags |= demand_flags::DEMAND_AVL_P; - demand_flags |= demand_flags::DEMAND_RATIO_P; + dflags |= demand_flags::DEMAND_AVL_P; + dflags |= demand_flags::DEMAND_RATIO_P; } else { @@ -1006,39 +1006,39 @@ public: available. 
*/ if (has_non_zero_avl ()) - demand_flags |= demand_flags::DEMAND_NON_ZERO_AVL_P; + dflags |= demand_flags::DEMAND_NON_ZERO_AVL_P; else - demand_flags |= demand_flags::DEMAND_AVL_P; + dflags |= demand_flags::DEMAND_AVL_P; } else - demand_flags |= demand_flags::DEMAND_AVL_P; + dflags |= demand_flags::DEMAND_AVL_P; } if (get_attr_ratio (insn->rtl ()) != INVALID_ATTRIBUTE) - demand_flags |= demand_flags::DEMAND_RATIO_P; + dflags |= demand_flags::DEMAND_RATIO_P; else { if (scalar_move_insn_p (insn->rtl ()) && m_ta) { - demand_flags |= demand_flags::DEMAND_GE_SEW_P; + dflags |= demand_flags::DEMAND_GE_SEW_P; m_max_sew = get_attr_type (insn->rtl ()) == TYPE_VFMOVFV ? get_max_float_sew () : get_max_int_sew (); } else - demand_flags |= demand_flags::DEMAND_SEW_P; + dflags |= demand_flags::DEMAND_SEW_P; if (!ignore_vlmul_insn_p (insn->rtl ())) - demand_flags |= demand_flags::DEMAND_LMUL_P; + dflags |= demand_flags::DEMAND_LMUL_P; } if (!m_ta) - demand_flags |= demand_flags::DEMAND_TAIL_POLICY_P; + dflags |= demand_flags::DEMAND_TAIL_POLICY_P; if (!m_ma) - demand_flags |= demand_flags::DEMAND_MASK_POLICY_P; + dflags |= demand_flags::DEMAND_MASK_POLICY_P; } -normalize_demand (demand_flags); +normalize_demand (dflags); /* Optimize AVL from the vsetvl instruction. */ insn_info *def_insn = extract_single_source (get_avl_def ());
[PATCH] PR112380: Defend against CLOBBERs in RTX expressions in combine.cc
This patch addresses PR rtl-optimization/112380, an ICE-on-valid regression where a (clobber (const_int 0)) encounters a sanity-checking gcc_assert (at line 7554) in simplify-rtx.cc. These CLOBBERs are used internally by GCC's combine pass much like error_mark_node is used by various language front-ends.

The solutions are either to handle/accept these CLOBBERs throughout (or in more places in) the middle-end's RTL optimizers, including functions in simplify-rtx.cc that are used by passes other than combine, and/or attempt to prevent these CLOBBERs escaping from try_combine into the RTX/RTL stream. The benefit of the second approach is that it actually allows for better optimization: when try_combine fails to simplify an expression, instead of substituting a CLOBBER to avoid the instruction pattern being recognized, noticing the CLOBBER often allows combine to attempt alternate simplifications/transformations, looking for those that can be recognized.

This patch is provided as two alternatives. The first is the minimal fix to address the CLOBBER encountered in the bugzilla PR. Assuming this approach is the correct fix to a latent bug/liability throughout combine.cc, the second alternative fixes many of the places that may potentially trigger problems in future, and allows combine to attempt more valid combinations/transformations. These were identified proactively by changing the "fail:" case in gen_lowpart_for_combine to return NULL_RTX, and working through the fall-out sufficient for x86_64 to bootstrap and regression test without new failures.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, with no new failures. Ok for mainline?

2023-11-12  Roger Sayle

gcc/ChangeLog
	PR rtl-optimization/112380
	* combine.cc (expand_field_assignment): Check if gen_lowpart
	returned a CLOBBER, and avoid calling simplify_gen_binary with
	it if so.
gcc/testsuite/ChangeLog PR rtl-optimization/112380 * gcc.dg/pr112380.c: New test case. gcc/ChangeLog PR rtl-optimization/112380 * combine.cc (find_split_point): Check if gen_lowpart returned a CLOBBER. (subst): Check if combine_simplify_rtx returned a CLOBBER. (simplify_set): Check if force_to_mode returned a CLOBBER. Check if gen_lowpart returned a CLOBBER. (expand_field_assignment): Likewise. (make_extraction): Check if force_to_mode returned a CLOBBER. (force_int_to_mode): Likewise. (simplify_and_const_int_1): Check if VAROP is a CLOBBER, after call to force_to_mode (and before). (simplify_comparison): Check if force_to_mode returned a CLOBBER. Check if gen_lowpart returned a CLOBBER. diff --git a/gcc/combine.cc b/gcc/combine.cc index 6344cd3..f2c64a9 100644 --- a/gcc/combine.cc +++ b/gcc/combine.cc @@ -7466,6 +7466,11 @@ expand_field_assignment (const_rtx x) if (!targetm.scalar_mode_supported_p (compute_mode)) break; + /* gen_lowpart_for_combine returns CLOBBER on failure. */ + rtx lowpart = gen_lowpart (compute_mode, SET_SRC (x)); + if (GET_CODE (lowpart) == CLOBBER) + break; + /* Now compute the equivalent expression. Make a copy of INNER for the SET_DEST in case it is a MEM into which we will substitute; we don't want shared RTL in that case. */ @@ -7480,9 +7485,7 @@ expand_field_assignment (const_rtx x) inner); masked = simplify_gen_binary (ASHIFT, compute_mode, simplify_gen_binary ( - AND, compute_mode, - gen_lowpart (compute_mode, SET_SRC (x)), - mask), + AND, compute_mode, lowpart, mask), pos); x = gen_rtx_SET (copy_rtx (inner), diff --git a/gcc/combine.cc b/gcc/combine.cc index 6344cd3..969eb9d 100644 --- a/gcc/combine.cc +++ b/gcc/combine.cc @@ -5157,36 +5157,37 @@ find_split_point (rtx *loc, rtx_insn *insn, bool set_src) always at least get 8-bit constants in an AND insn, which is true for every current RISC. 
*/ - if (unsignedp && len <= 8) + rtx lowpart = gen_lowpart (mode, inner); + if (lowpart && GET_CODE (lowpart) != CLOBBER) { - unsigned HOST_WIDE_INT mask - = (HOST_WIDE_INT_1U << len) - 1; - rtx pos_rtx = gen_int_shift_amount (mode, pos); - SUBST (SET_SRC (x), -gen_rtx_AND (mode, - gen_rtx_LSHIFTRT - (mode, gen_lowpart (mode, inner), pos_rtx), - gen_int_mode (mask, mode))); - - split = fin
[x86 PATCH] Improve reg pressure of double-word right-shift then truncate.
This patch improves register pressure during reload, inspired by PR 97756. Normally, a double-word right-shift by a constant produces a double-word result, the highpart of which is dead when followed by a truncation. The dead code calculating the high part gets cleaned up post-reload, so the issue isn't normally visible, except for the increased register pressure during reload, sometimes leading to odd register assignments. Providing a post-reload splitter, which clobbers a single wordmode result register instead of a doubleword result register, helps (a bit).

An example demonstrating this effect is:

#define MASK60 ((1ul << 60) - 1)
unsigned long foo (__uint128_t n)
{
  unsigned long a = n & MASK60;
  unsigned long b = (n >> 60);
  b = b & MASK60;
  unsigned long c = (n >> 120);
  return a+b+c;
}

which currently with -O2 generates (13 instructions):

foo:    movabsq $1152921504606846975, %rcx
        xchgq   %rdi, %rsi
        movq    %rsi, %rax
        shrdq   $60, %rdi, %rax
        movq    %rax, %rdx
        movq    %rsi, %rax
        movq    %rdi, %rsi
        andq    %rcx, %rax
        shrq    $56, %rsi
        andq    %rcx, %rdx
        addq    %rsi, %rax
        addq    %rdx, %rax
        ret

with this patch, we generate one less mov (12 instructions):

foo:    movabsq $1152921504606846975, %rcx
        xchgq   %rdi, %rsi
        movq    %rdi, %rdx
        movq    %rsi, %rax
        movq    %rdi, %rsi
        shrdq   $60, %rdi, %rdx
        andq    %rcx, %rax
        shrq    $56, %rsi
        addq    %rsi, %rax
        andq    %rcx, %rdx
        addq    %rdx, %rax
        ret

The significant difference is easier to see via diff:

< shrdq   $60, %rdi, %rax
< movq    %rax, %rdx
---
> shrdq   $60, %rdi, %rdx

Admittedly a single "mov" isn't much of a saving on modern architectures, but as demonstrated by the PR, people still track the number of them.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, with no new failures. Ok for mainline?

2023-11-12  Roger Sayle

gcc/ChangeLog
	* config/i386/i386.md (3_doubleword_lowpart): New
	define_insn_and_split to optimize register usage of doubleword
	right shifts followed by truncation.
Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 663db73..8a6928f 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -14833,6 +14833,31 @@ [(const_int 0)] "ix86_split_ (operands, operands[3], mode); DONE;") +;; Split truncations of TImode right shifts into x86_64_shrd_1. +;; Split truncations of DImode right shifts into x86_shrd_1. +(define_insn_and_split "3_doubleword_lowpart" + [(set (match_operand:DWIH 0 "register_operand" "=") + (subreg:DWIH + (any_shiftrt: (match_operand: 1 "register_operand" "r") +(match_operand:QI 2 "const_int_operand")) 0)) + (clobber (reg:CC FLAGS_REG))] + "UINTVAL (operands[2]) < * BITS_PER_UNIT" + "#" + "&& reload_completed" + [(parallel + [(set (match_dup 0) + (ior:DWIH (lshiftrt:DWIH (match_dup 0) (match_dup 2)) + (subreg:DWIH + (ashift: (zero_extend: (match_dup 3)) + (match_dup 4)) 0))) + (clobber (reg:CC FLAGS_REG))])] +{ + split_double_mode (mode, [1], 1, [1], [3]); + operands[4] = GEN_INT (( * BITS_PER_UNIT) - INTVAL (operands[2])); + if (!rtx_equal_p (operands[0], operands[3])) +emit_move_insn (operands[0], operands[3]); +}) + (define_insn "x86_64_shrd" [(set (match_operand:DI 0 "nonimmediate_operand" "+r*m") (ior:DI (lshiftrt:DI (match_dup 0)
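The split above relies on the identity that the low word of a double-word right shift depends only on the two input words combined SHRD-style, so no doubleword result register is needed. A minimal C sketch of that identity (the helper name `shrd64` is mine, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Low 64 bits of a 128-bit right shift by c (0 < c < 64),
   computed SHRD-style from the two 64-bit halves only.  */
static inline uint64_t
shrd64 (uint64_t lo, uint64_t hi, unsigned c)
{
  return (lo >> c) | (hi << (64 - c));
}
```

This is exactly what the x86 shrd instruction computes, and why the new pattern can clobber a single wordmode register.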
[ARC PATCH] Consistent use of whitespace in assembler templates.
This minor clean-up patch tweaks arc.md to use whitespace consistently in output templates, always using a TAB between the mnemonic and its operands, and avoiding spaces after commas between operands. There should be no functional changes with this patch, though several test cases' scan-assembler needed to be updated to use \s+ instead of testing for a TAB or a space explicitly. Tested with a cross-compiler to arc-linux hosted on x86_64, with no new (compile-only) regressions from make -k check. Ok for mainline if this passes Claudiu's nightly testing? 2023-11-06 Roger Sayle gcc/ChangeLog * config/arc/arc.md: Make output template whitespace consistent. gcc/testsuite/ChangeLog * gcc.target/arc/jli-1.c: Update dg-final whitespace. * gcc.target/arc/jli-2.c: Likewise. * gcc.target/arc/naked-1.c: Likewise. * gcc.target/arc/naked-2.c: Likewise. * gcc.target/arc/tmac-1.c: Likewise. * gcc.target/arc/tmac-2.c: Likewise. Thanks again, Roger -- diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md index 7702978..846aa32 100644 --- a/gcc/config/arc/arc.md +++ b/gcc/config/arc/arc.md @@ -669,26 +669,26 @@ archs4x, archs4xd" || (satisfies_constraint_Cm3 (operands[1]) && memory_operand (operands[0], QImode))" "@ - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - ldb%? %0,%1 - stb%? %1,%0 - ldb%? 
%0,%1 - xldb%U1 %0,%1 - ldb%U1%V1 %0,%1 - xstb%U0 %1,%0 - stb%U0%V0 %1,%0 - stb%U0%V0 %1,%0 - stb%U0%V0 %1,%0 - stb%U0%V0 %1,%0" + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + ldb%?\\t%0,%1 + stb%?\\t%1,%0 + ldb%?\\t%0,%1 + xldb%U1\\t%0,%1 + ldb%U1%V1\\t%0,%1 + xstb%U0\\t%1,%0 + stb%U0%V0\\t%1,%0 + stb%U0%V0\\t%1,%0 + stb%U0%V0\\t%1,%0 + stb%U0%V0\\t%1,%0" [(set_attr "type" "move,move,move,move,move,move,move,move,move,move,load,store,load,load,load,store,store,store,store,store") (set_attr "iscompact" "maybe,maybe,maybe,true,true,false,false,false,maybe_limm,false,true,true,true,false,false,false,false,false,false,false") (set_attr "predicable" "yes,no,yes,no,no,yes,no,yes,yes,yes,no,no,no,no,no,no,no,no,no,no") @@ -713,26 +713,26 @@ archs4x, archs4xd" || (satisfies_constraint_Cm3 (operands[1]) && memory_operand (operands[0], HImode))" "@ - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - mov%? %0,%1 - ld%_%? %0,%1 - st%_%? 
%1,%0 - xld%_%U1 %0,%1 - ld%_%U1%V1 %0,%1 - xst%_%U0 %1,%0 - st%_%U0%V0 %1,%0 - st%_%U0%V0 %1,%0 - st%_%U0%V0 %1,%0 - st%_%U0%V0 %1,%0" + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + mov%?\\t%0,%1 + ld%_%?\\t%0,%1 + st%_%?\\t%1,%0 + xld%_%U1\\t%0,%1 + ld%_%U1%V1\\t%0,%1 + xst%_%U0\\t%1,%0 + st%_%U0%V0\\t%1,%0 + st%_%U0%V0\\t%1,%0 + st%_%U0%V0\\t%1,%0 + st%_%U0%V0\\t%1,%0" [(set_attr "type" "move,move,move,move,move,move,move,move,move,move,move,load,store,load,load,store,store,store,store,store") (set_attr "iscompact" "maybe,maybe,maybe,true,true,false,false,false,maybe_limm,maybe_limm,false,true,true,false,false,false,false,false,false,false") (set_attr "predicable" "yes,no,yes,no,no,yes,no,yes,yes,yes,yes,no,no,no,no,no,no,no,no,no") @@ -818,7 +818,7 @@ archs4x, archs4xd" (plus:SI (reg:SI SP_REG) (match_operand 1 "immediate_operand" "Cal")] "reload_completed" - "ld.a %0,[sp,%1]" + "ld.a\\t%0,[sp,%1]" [(set_attr "type" "load") (set_attr "length" "8")]) @@ -830,7 +830,7 @@ archs4x, archs4xd" (unspec:SI [(match_operand:SI 1 "register_operand" "c")] UNSPEC_ARC_DIRECT))] "" - "st%U0 %1,%0\;st%U0.di %1,%0" + "st%U0\\t%1,%0\;st%U0.di\\t%1,%0" [(set_attr "type" "store")]) ;; Combiner patterns for compare with zero @@ -944,7 +944,7 @@ archs4x, archs4xd" (set (match_operand:SI 0 "register_operand" "=w") (match_dup 3))] "" - "%O3.f %0,%1" + "%O3.f\\t%0,%1" [(set_attr "type" "compare") (set_attr "cond" "set_zn") (set_attr "length" "4")]) @@ -987,15 +
[ARC PATCH] Improved DImode rotates and right shifts by one bit.
This patch improves the code generated for DImode right shifts (both arithmetic and logical) by a single bit, and also for DImode rotates (both left and right) by a single bit. In approach, this is similar to the recently added DImode left shift by a single bit patch, but also builds upon i386.md's UNSPEC carry flag representation: https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632169.html

The benefits can be seen from the four new test cases:

long long ashr(long long x) { return x >> 1; }

Before:
ashr:	asl	r2,r1,31
	lsr_s	r0,r0
	or_s	r0,r0,r2
	j_s.d	[blink]
	asr_s	r1,r1,1

After:
ashr:	asr.f	r1,r1
	j_s.d	[blink]
	rrc	r0,r0

unsigned long long lshr(unsigned long long x) { return x >> 1; }

Before:
lshr:	asl	r2,r1,31
	lsr_s	r0,r0
	or_s	r0,r0,r2
	j_s.d	[blink]
	lsr_s	r1,r1

After:
lshr:	lsr.f	r1,r1
	j_s.d	[blink]
	rrc	r0,r0

unsigned long long rotl(unsigned long long x) { return (x<<1) | (x>>63); }

Before:
rotl:	lsr	r12,r1,31
	lsr	r2,r0,31
	asl_s	r3,r0,1
	asl_s	r1,r1,1
	or	r0,r12,r3
	j_s.d	[blink]
	or_s	r1,r1,r2

After:
rotl:	add.f	r0,r0,r0
	adc.f	r1,r1,r1
	j_s.d	[blink]
	add.cs	r0,r0,1

unsigned long long rotr(unsigned long long x) { return (x>>1) | (x<<63); }

Before:
rotr:	asl	r12,r1,31
	asl	r2,r0,31
	lsr_s	r3,r0
	lsr_s	r1,r1
	or	r0,r12,r3
	j_s.d	[blink]
	or_s	r1,r1,r2

After:
rotr:	asr.f	0,r0
	rrc.f	r1,r1
	j_s.d	[blink]
	rrc	r0,r0

On CPUs without a barrel shifter the improvements are even better. Tested with a cross-compiler to arc-linux hosted on x86_64, with no new (compile-only) regressions from make -k check. Ok for mainline if this passes Claudiu's nightly testing?

2023-11-06  Roger Sayle

gcc/ChangeLog
	* config/arc/arc.md (UNSPEC_ARC_CC_NEZ): New UNSPEC that
	represents the carry flag being set if the operand is non-zero.
	(adc_f): New define_insn representing adc with updated flags.
	(ashrdi3): New define_expand that only handles shifts by 1.
	(ashrdi3_cnt1): New pre-reload define_insn_and_split.
	(lshrdi3): New define_expand that only handles shifts by 1.
(lshrdi3_cnt1): New pre-reload define_insn_and_split. (rrcsi2): New define_insn for rrc (SImode rotate right through carry). (rrcsi2_carry): Likewise for rrc.f, as above but updating flags. (rotldi3): New define_expand that only handles rotates by 1. (rotldi3_cnt1): New pre-reload define_insn_and_split. (rotrdi3): New define_expand that only handles rotates by 1. (rotrdi3_cnt1): New pre-reload define_insn_and_split. (lshrsi3_cnt1_carry): New define_insn for lsr.f. (ashrsi3_cnt1_carry): New define_insn for asr.f. (btst_0_carry): New define_insn for asr.f without result. gcc/testsuite/ChangeLog * gcc.target/arc/ashrdi3-1.c: New test case. * gcc.target/arc/lshrdi3-1.c: Likewise. * gcc.target/arc/rotldi3-1.c: Likewise. * gcc.target/arc/rotrdi3-1.c: Likewise. Thanks in advance, Roger -- diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md index 7702978..97231b9 100644 --- a/gcc/config/arc/arc.md +++ b/gcc/config/arc/arc.md @@ -137,6 +137,7 @@ UNSPEC_ARC_VMAC2HU UNSPEC_ARC_VMPY2H UNSPEC_ARC_VMPY2HU + UNSPEC_ARC_CC_NEZ VUNSPEC_ARC_RTIE VUNSPEC_ARC_SYNC @@ -2790,6 +2791,31 @@ archs4x, archs4xd" (set_attr "type" "cc_arith") (set_attr "length" "4,4,4,4,8,8")]) +(define_insn "adc_f" + [(set (reg:CC_C CC_REG) + (compare:CC_C + (zero_extend:DI + (plus:SI + (plus:SI + (ltu:SI (reg:CC_C CC_REG) (const_int 0)) + (match_operand:SI 1 "register_operand" "%r")) + (match_operand:SI 2 "register_operand" "r"))) + (plus:DI + (ltu:DI (reg:CC_C CC_REG) (const_int 0)) + (zero_extend:DI (match_dup 1) + (set (match_operand:SI 0 "register_operand" "=r") + (plus:SI + (plus:SI + (ltu:SI (reg:CC_C CC_REG) (const_int 0)) + (match_dup 1)) + (match_dup 2)))] + "" + "adc.f\\t%0,%1,%2" + [(set_attr "cond" "set") + (set_attr "predicable" "no") + (set_attr "type" "cc_arith") + (set_attr "length" "4")]) + ; combiner-splitter cmp / scc -> cmp / adc (define_split [(set (match_operand:SI 0 "dest_reg_operand" "") @@ -3530,6 +3556,68 @@ archs4x, archs4xd" "" [(set_attr "length" "8")
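The two-instruction sequences work because asr.f shifts the high word while capturing its low bit in the carry flag, and rrc then rotates that carry into the top of the low word. A C model of the same data flow, with an explicit carry variable standing in for the ARC carry flag (the helper name is mine; signed >> is assumed arithmetic, as GCC guarantees):

```c
#include <assert.h>
#include <stdint.h>

/* Emulate "asr.f hi ; rrc lo": 64-bit arithmetic right shift by 1
   performed on 32-bit halves via an explicit carry bit.  */
static inline int64_t
ashr64_1 (int64_t x)
{
  uint32_t lo = (uint32_t) x;
  int32_t hi = (int32_t) (x >> 32);
  unsigned carry = hi & 1;                     /* asr.f sets C from bit 0 */
  hi >>= 1;                                    /* arithmetic shift of highpart */
  lo = (lo >> 1) | ((uint32_t) carry << 31);   /* rrc: carry -> bit 31 */
  return (int64_t) (((uint64_t) (uint32_t) hi << 32) | lo);
}
```

The lshr and rotr sequences differ only in how the first instruction treats the sign bit and whether the final carry feeds back into the high word.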
[ARC PATCH] Provide a TARGET_FOLD_BUILTIN target hook.
This patch implements an arc_fold_builtin target hook to allow ARC builtins to be folded at the tree-level. Currently this function converts __builtin_arc_swap into a LROTATE_EXPR at the tree-level, and evaluates __builtin_arc_norm and __builtin_arc_normw of integer constant arguments at compile-time. Because ARC_BUILTIN_SWAP is now handled at the tree-level, UNSPEC_ARC_SWAP is no longer used, allowing it and the "swap" define_insn to be removed. An example benefit of folding things at compile-time is that calling __builtin_arc_swap on the result of __builtin_arc_swap now eliminates both and generates no code, and likewise calling __builtin_arc_swap on a constant integer argument is evaluated at compile-time. Tested with a cross-compiler to arc-linux hosted on x86_64, with no new (compile-only) regressions from make -k check. Ok for mainline if this passes Claudiu's nightly testing?

2023-11-03  Roger Sayle

gcc/ChangeLog
	* config/arc/arc.cc (TARGET_FOLD_BUILTIN): Define to arc_fold_builtin.
	(arc_fold_builtin): New function.  Convert ARC_BUILTIN_SWAP into a
	rotate.  Evaluate ARC_BUILTIN_NORM and ARC_BUILTIN_NORMW of constant
	arguments.
	* config/arc/arc.md (UNSPEC_ARC_SWAP): Delete.
	(normw): Make output template/assembler whitespace consistent.
	(swap): Remove define_insn, only use of SWAP UNSPEC.
	* config/arc/builtins.def: Tweak indentation.
	(SWAP): Expand using rotlsi2_cnt16 instead of using swap.

gcc/testsuite/ChangeLog
	* gcc.target/arc/builtin_norm-1.c: New test case.
	* gcc.target/arc/builtin_norm-2.c: Likewise.
	* gcc.target/arc/builtin_normw-1.c: Likewise.
	* gcc.target/arc/builtin_normw-2.c: Likewise.
	* gcc.target/arc/builtin_swap-1.c: Likewise.
	* gcc.target/arc/builtin_swap-2.c: Likewise.
	* gcc.target/arc/builtin_swap-3.c: Likewise.
Thanks in advance, Roger -- diff --git a/gcc/config/arc/arc.cc b/gcc/config/arc/arc.cc index e209ad2..70ee410 100644 --- a/gcc/config/arc/arc.cc +++ b/gcc/config/arc/arc.cc @@ -643,6 +643,9 @@ static rtx arc_legitimize_address_0 (rtx, rtx, machine_mode mode); #undef TARGET_EXPAND_BUILTIN #define TARGET_EXPAND_BUILTIN arc_expand_builtin +#undef TARGET_FOLD_BUILTIN +#define TARGET_FOLD_BUILTIN arc_fold_builtin + #undef TARGET_BUILTIN_DECL #define TARGET_BUILTIN_DECL arc_builtin_decl @@ -7048,6 +7051,46 @@ arc_expand_builtin (tree exp, return const0_rtx; } +/* Implement TARGET_FOLD_BUILTIN. */ + +static tree +arc_fold_builtin (tree fndecl, int n_args ATTRIBUTE_UNUSED, tree *arg, + bool ignore ATTRIBUTE_UNUSED) +{ + unsigned int fcode = DECL_MD_FUNCTION_CODE (fndecl); + + switch (fcode) +{ +default: + break; + +case ARC_BUILTIN_SWAP: + return fold_build2 (LROTATE_EXPR, integer_type_node, arg[0], + build_int_cst (integer_type_node, 16)); + +case ARC_BUILTIN_NORM: + if (TREE_CODE (arg[0]) == INTEGER_CST + && !TREE_OVERFLOW (arg[0])) + { + wide_int arg0 = wi::to_wide (arg[0], 32); + wide_int result = wi::shwi (wi::clrsb (arg0), 32); + return wide_int_to_tree (integer_type_node, result); + } + break; + +case ARC_BUILTIN_NORMW: + if (TREE_CODE (arg[0]) == INTEGER_CST + && !TREE_OVERFLOW (arg[0])) + { + wide_int arg0 = wi::to_wide (arg[0], 16); + wide_int result = wi::shwi (wi::clrsb (arg0), 32); + return wide_int_to_tree (integer_type_node, result); + } + break; +} + return NULL_TREE; +} + /* Returns true if the operands[opno] is a valid compile-time constant to be used as register number in the code for builtins. Else it flags an error and returns false. 
*/ diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md index 96ff62d..9e81d13 100644 --- a/gcc/config/arc/arc.md +++ b/gcc/config/arc/arc.md @@ -116,7 +116,6 @@ UNSPEC_TLS_OFF UNSPEC_ARC_NORM UNSPEC_ARC_NORMW - UNSPEC_ARC_SWAP UNSPEC_ARC_DIVAW UNSPEC_ARC_DIRECT UNSPEC_ARC_LP @@ -4355,8 +4354,8 @@ archs4x, archs4xd" (clrsb:HI (match_operand:HI 1 "general_operand" "cL,Cal"] "TARGET_NORM" "@ - norm%_ \t%0, %1 - norm%_ \t%0, %1" + norm%_\\t%0,%1 + norm%_\\t%0,%1" [(set_attr "length" "4,8") (set_attr "type" "two_cycle_core,two_cycle_core")]) @@ -4453,18 +4452,6 @@ archs4x, archs4xd" [(set_attr "type" "unary") (set_attr "length" "20")]) -(define_insn "swap" - [(set (match_operand:SI 0 "dest_reg_operand" "=w,w,w") - (unspec:SI [(match_operand:SI 1 "general_operand" "L,Cal,c")] - UNSPEC_ARC_SWAP))] - "TARGET_SWAP" - "@ -
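The folds above have simple portable equivalents: __builtin_arc_swap is a rotate of the two 16-bit halves (hence LROTATE_EXPR by 16), and norm counts redundant leading sign bits, which is what wi::clrsb computes. A C sketch of the folded semantics (the function names are mine; this is not the GCC source):

```c
#include <assert.h>
#include <stdint.h>

/* swap: rotate a 32-bit value left by 16, exchanging its halves.
   Applying it twice is the identity, which is why swap(swap(x))
   folds to no code at all.  */
static inline uint32_t
arc_swap (uint32_t x)
{
  return (x << 16) | (x >> 16);
}

/* norm: number of redundant sign bits below the sign bit (clrsb).  */
static inline int
arc_norm (int32_t x)
{
  uint32_t u = (uint32_t) x;
  int n = 0;
  /* Count how many leading bits merely copy the sign bit.  */
  while (n < 31 && (((u >> 30) ^ (u >> 31)) & 1) == 0)
    {
      u <<= 1;
      n++;
    }
  return n;
}
```

With these definitions, constant arguments fold exactly as the new hook does, e.g. arc_norm (0) and arc_norm (-1) are both 31.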
[AVR PATCH] Improvements to SImode and PSImode shifts by constants.
This patch provides non-looping implementations for more SImode (32-bit) and PSImode (24-bit) shifts on AVR. For most cases, these are shorter and faster than using a loop, but for a few (controlled by optimize_size) they are a little larger but significantly faster. The approach is to perform byte-based shifts by 1, 2 or 3 bytes, followed by bit-based shifts (effectively in a narrower type) for the remaining bits, beyond 8, 16 or 24. For example, the simple test case below (inspired by PR 112268):

unsigned long foo(unsigned long x)
{
  return x >> 26;
}

gcc -O2 currently generates:

foo:	ldi r18,26
1:	lsr r25
	ror r24
	ror r23
	ror r22
	dec r18
	brne 1b
	ret

which is 8 instructions, and takes ~158 cycles. With this patch, we now generate:

foo:	mov r22,r25
	clr r23
	clr r24
	clr r25
	lsr r22
	lsr r22
	ret

which is 7 instructions, and takes ~7 cycles. One complication is that the modified functions sometimes use spaces instead of TABs, with occasional mistakes in GNU-style formatting, so I've fixed these indentation/whitespace issues. There's no change in the code for the cases previously handled/special-cased, with the exception of ashrqi3 reg,5 where with -Os a (4-instruction) loop is shorter than the five single-bit shifts of a fully unrolled implementation. This patch has been (partially) tested with a cross-compiler to avr-elf hosted on x86_64, without a simulator, where the compile-only tests in the gcc testsuite show no regressions. If someone could test this more thoroughly that would be great.

2023-11-02  Roger Sayle

gcc/ChangeLog
	* config/avr/avr.cc (ashlqi3_out): Fix indentation whitespace.
	(ashlhi3_out): Likewise.
	(avr_out_ashlpsi3): Likewise.  Handle shifts by 9 and 17-22.
	(ashlsi3_out): Fix formatting.  Handle shifts by 9 and 25-30.
	(ashrqi3_out): Use loop for shifts by 5 when optimizing for size.
	Fix indentation whitespace.
	(ashrhi3_out): Likewise.
	(avr_out_ashrpsi3): Likewise.  Handle shifts by 17.
	(ashrsi3_out): Fix indentation.  Handle shifts by 17 and 25.
(lshrqi3_out): Fix whitespace. (lshrhi3_out): Likewise. (avr_out_lshrpsi3): Likewise. Handle shifts by 9 and 17-22. (lshrsi3_out): Fix indentation. Handle shifts by 9,17,18 and 25-30. gcc/testsuite/ChangeLog * gcc.target/avr/ashlsi-1.c: New test case. * gcc.target/avr/ashlsi-2.c: Likewise. * gcc.target/avr/ashrsi-1.c: Likewise. * gcc.target/avr/ashrsi-2.c: Likewise. * gcc.target/avr/lshrsi-1.c: Likewise. * gcc.target/avr/lshrsi-2.c: Likewise. Thanks in advance, Roger -- diff --git a/gcc/config/avr/avr.cc b/gcc/config/avr/avr.cc index 5e0217de36fc..706599b4aa6a 100644 --- a/gcc/config/avr/avr.cc +++ b/gcc/config/avr/avr.cc @@ -6715,7 +6715,7 @@ ashlqi3_out (rtx_insn *insn, rtx operands[], int *len) fatal_insn ("internal compiler error. Incorrect shift:", insn); out_shift_with_cnt ("lsl %0", - insn, operands, len, 1); + insn, operands, len, 1); return ""; } @@ -6728,8 +6728,8 @@ ashlhi3_out (rtx_insn *insn, rtx operands[], int *len) if (CONST_INT_P (operands[2])) { int scratch = (GET_CODE (PATTERN (insn)) == PARALLEL - && XVECLEN (PATTERN (insn), 0) == 3 - && REG_P (operands[3])); +&& XVECLEN (PATTERN (insn), 0) == 3 +&& REG_P (operands[3])); int ldi_ok = test_hard_reg_class (LD_REGS, operands[0]); int k; int *t = len; @@ -6826,8 +6826,9 @@ ashlhi3_out (rtx_insn *insn, rtx operands[], int *len) "ror %A0"); case 8: - return *len = 2, ("mov %B0,%A1" CR_TAB - "clr %A0"); + *len = 2; + return ("mov %B0,%A1" CR_TAB + "clr %A0"); case 9: *len = 3; @@ -6974,7 +6975,7 @@ ashlhi3_out (rtx_insn *insn, rtx operands[], int *len) len = t; } out_shift_with_cnt ("lsl %A0" CR_TAB - "rol %B0", insn, operands, len, 2); + "rol %B0", insn, operands, len, 2); return ""; } @@ -6990,54 +6991,126 @@ avr_out_ashlpsi3 (rtx_insn *insn, rtx *op, int *plen) if (CONST_INT_P (op[2])) { switch (INTVAL (op[2])) -{ -default: - if (INTVAL (op[2]) < 24) -break; + { + default: + if (INTVAL (op[2]) < 24) + break; - return avr_asm_len ("clr %A0" CR_TAB - "clr %B0" CR_TAB - "clr %C0", op, plen, 3); + 
return avr_a
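The strategy the patch describes, shift by whole bytes via register moves and then finish with short bit shifts, can be sketched in C (the helper name `lshr32` and the byte-array framing are mine, purely illustrative of the decomposition):

```c
#include <assert.h>
#include <stdint.h>

/* Decompose a 32-bit logical right shift into a whole-byte move
   (the AVR mov/clr part) plus a short remaining bit shift (the
   lsr/ror part), mirroring the byte-based AVR strategy.  */
static inline uint32_t
lshr32 (uint32_t x, unsigned count)
{
  unsigned bytes = count / 8;       /* handled by register moves */
  unsigned bits = count % 8;        /* handled by 1-bit shifts */
  uint8_t b[4] = { (uint8_t) x, (uint8_t) (x >> 8),
                   (uint8_t) (x >> 16), (uint8_t) (x >> 24) };
  uint8_t r[4] = { 0, 0, 0, 0 };    /* clr the vacated bytes */
  for (unsigned i = 0; i + bytes < 4; i++)
    r[i] = b[i + bytes];            /* mov rN,rM */
  uint32_t y = r[0] | ((uint32_t) r[1] << 8)
               | ((uint32_t) r[2] << 16) | ((uint32_t) r[3] << 24);
  return y >> bits;                 /* remaining short shift */
}
```

For x >> 26 this is one byte move, three clears and two single-bit shifts, matching the 7-instruction sequence shown above.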
[AVR PATCH] Optimize (X>>C)&1 for C in [1, 4, 8, 16, 24] in *insv.any_shift..
This patch optimizes a few special cases in avr.md's *insv.any_shift. instruction. This template handles tests for a single bit, where the result has only a (possibly different) single bit set. Usually (currently) this always requires a three-instruction sequence of a BST, a CLR and a BLD (plus any additional CLR instructions to clear the rest of the result bytes). The special cases considered here are those that can be done with only two instructions (plus CLRs); an ANDI preceded by either a MOV, a SHIFT or a SWAP. Hence for C=1 in HImode, GCC with -O2 currently generates:

	bst r24,1
	clr r24
	clr r25
	bld r24,0

with this patch, we now generate:

	lsr r24
	andi r24,1
	clr r25

Likewise, HImode C=4 now becomes:

	swap r24
	andi r24,1
	clr r25

and SImode C=8 now becomes:

	mov r22,r23
	andi r22,1
	clr r23
	clr r24
	clr r25

I've not attempted to model the instruction length accurately for these special cases; the logic would be ugly, but it's safe to use the current (1 insn longer) length. This patch has been (partially) tested with a cross-compiler to avr-elf hosted on x86_64, without a simulator, where the compile-only tests in the gcc testsuite show no regressions. If someone could test this more thoroughly that would be great.

2023-11-02  Roger Sayle

gcc/ChangeLog
	* config/avr/avr.md (*insv.any_shift.): Optimize special cases
	of *insv.any_shift that save one instruction by using ANDI with
	either a MOV, a SHIFT or a SWAP.

gcc/testsuite/ChangeLog
	* gcc.target/avr/insvhi-1.c: New HImode test case.
	* gcc.target/avr/insvhi-2.c: Likewise.
	* gcc.target/avr/insvhi-3.c: Likewise.
	* gcc.target/avr/insvhi-4.c: Likewise.
	* gcc.target/avr/insvhi-5.c: Likewise.
	* gcc.target/avr/insvqi-1.c: New QImode test case.
	* gcc.target/avr/insvqi-2.c: Likewise.
	* gcc.target/avr/insvqi-3.c: Likewise.
	* gcc.target/avr/insvqi-4.c: Likewise.
	* gcc.target/avr/insvsi-1.c: New SImode test case.
	* gcc.target/avr/insvsi-2.c: Likewise.
	* gcc.target/avr/insvsi-3.c: Likewise.
	* gcc.target/avr/insvsi-4.c: Likewise.
* gcc.target/avr/insvsi-5.c: Likewise. * gcc.target/avr/insvsi-6.c: Likewise. Thanks in advance, Roger -- diff --git a/gcc/config/avr/avr.md b/gcc/config/avr/avr.md index 83dd15040b07..c2a1931733f8 100644 --- a/gcc/config/avr/avr.md +++ b/gcc/config/avr/avr.md @@ -9840,6 +9840,7 @@ (clobber (reg:CC REG_CC))] "reload_completed" { +int ldi_ok = test_hard_reg_class (LD_REGS, operands[0]); int shift = == ASHIFT ? INTVAL (operands[2]) : -INTVAL (operands[2]); int mask = GET_MODE_MASK (mode) & INTVAL (operands[3]); // Position of the output / input bit, respectively. @@ -9850,6 +9851,217 @@ operands[3] = GEN_INT (obit); operands[2] = GEN_INT (ibit); +/* Special cases requiring MOV to low byte and ANDI. */ +if ((shift & 7) == 0 && ldi_ok) + { + if (IN_RANGE (obit, 0, 7)) + { + if (shift == -8) + { + if ( == 2) + return "mov %A0,%B1\;andi %A0,lo8(1<<%3)\;clr %B0"; + if ( == 3) + return "mov %A0,%B1\;andi %A0,lo8(1<<%3)\;clr %B0\;clr %C0"; + if ( == 4 && !AVR_HAVE_MOVW) + return "mov %A0,%B1\;andi %A0,lo8(1<<%3)\;" +"clr %B0\;clr %C0\;clr %D0"; + } + else if (shift == -16) + { + if ( == 3) + return "mov %A0,%C1\;andi %A0,lo8(1<<%3)\;clr %B0\;clr %C0"; + if ( == 4 && !AVR_HAVE_MOVW) + return "mov %A0,%C1\;andi %A0,lo8(1<<%3)\;" +"clr %B0\;clr %C0\;clr %D0"; + } + else if (shift == -24 && !AVR_HAVE_MOVW) + return "mov %A0,%D1\;andi %A0,lo8(1<<%3)\;" +"clr %B0\;clr %C0\;clr %D0"; + } + + /* Special cases requiring MOV and ANDI. */ + else if (IN_RANGE (obit, 8, 15)) + { + if (shift == 8) + { + if ( == 2) + return "mov %B0,%A1\;andi %B0,lo8(1<<(%3-8))\;clr %A0"; + if ( == 3) + return "mov %B0,%A1\;andi %B0,lo8(1<<(%3-8))\;" +"clr %A0\;clr %C0"; + if ( == 4 && !AVR_HAVE_MOVW) + return "mov %B0,%A1\;andi %B0,lo8(1<<(%3-8))\;" +"clr %A0\;clr %C0\;clr %D0"; + } + else if (shift == -8) + { + if ( == 3) + ret
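The pattern above extracts one bit of the input and deposits it at one bit of the result. The C=1 HImode special case replaces the generic bst/clr/clr/bld sequence with a single-bit shift plus mask. A C model of the equivalence (the helper names are mine, hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Generic form: take bit `ibit` of x and place it at bit `obit`,
   zeroing everything else (the bst/clr/bld sequence).  */
static inline uint16_t
insv_generic (uint16_t x, unsigned ibit, unsigned obit)
{
  return (uint16_t) (((x >> ibit) & 1u) << obit);
}

/* The C=1 HImode special case, "lsr r24 ; andi r24,1 ; clr r25":
   shift the low byte once, mask bit 0, clear the high byte.  */
static inline uint16_t
insv_lsr_andi (uint16_t x)
{
  uint8_t lo = (uint8_t) x;
  lo >>= 1;          /* lsr r24 */
  lo &= 1;           /* andi r24,1 */
  return lo;         /* clr r25: high byte already zero */
}
```

The SWAP variant works the same way: a nibble swap repositions bit 4 to bit 0 before the ANDI mask.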
RE: [x86_64 PATCH] PR target/110551: Tweak mulx register allocation using peephole2.
Hi Uros, > From: Uros Bizjak > Sent: 01 November 2023 10:05 > Subject: Re: [x86_64 PATCH] PR target/110551: Tweak mulx register allocation > using peephole2. > > On Mon, Oct 30, 2023 at 6:27 PM Roger Sayle > wrote: > > > > > > This patch is a follow-up to my previous PR target/110551 patch, this > > time to address the additional move after mulx, seen on TARGET_BMI2 > > architectures (such as -march=haswell). The complication here is that > > the flexible multiple-set mulx instruction is introduced into RTL > > after reload, by split2, and therefore can't benefit from register > > preferencing. This results in RTL like the following: > > > > (insn 32 31 17 2 (parallel [ > > (set (reg:DI 4 si [orig:101 r ] [101]) > > (mult:DI (reg:DI 1 dx [109]) > > (reg:DI 5 di [109]))) > > (set (reg:DI 5 di [ r+8 ]) > > (umul_highpart:DI (reg:DI 1 dx [109]) > > (reg:DI 5 di [109]))) > > ]) "pr110551-2.c":8:17 -1 > > (nil)) > > > > (insn 17 32 9 2 (set (reg:DI 0 ax [107]) > > (reg:DI 5 di [ r+8 ])) "pr110551-2.c":9:40 90 {*movdi_internal} > > (expr_list:REG_DEAD (reg:DI 5 di [ r+8 ]) > > (nil))) > > > > Here insn 32, the mulx instruction, places its results in si and di, > > and then immediately after decides to move di to ax, with di now dead. > > This can be trivially cleaned up by a peephole2. I've added an > > additional constraint that the two SET_DESTs can't be the same > > register to avoid confusing the middle-end, but this has well-defined > > behaviour on x86_64/BMI2, encoding a umul_highpart. 
> > > > For the new test case, compiled on x86_64 with -O2 -march=haswell: > > > > Before: > > mulx64: movabsq $-7046029254386353131, %rdx > > mulx%rdi, %rsi, %rdi > > movq%rdi, %rax > > xorq%rsi, %rax > > ret > > > > After: > > mulx64: movabsq $-7046029254386353131, %rdx > > mulx%rdi, %rsi, %rax > > xorq%rsi, %rax > > ret > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline? > > It looks that your previous PR110551 patch regressed -march=cascadelake [1]. > Let's fix these regressions first. > > [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-October/634660.html > > Uros. This patch fixes that "regression". Originally, the test case in PR110551 contained one unnecessary mov on "default" x86_targets, but two extra movs on BMI2 targets, including -march=haswell and -march=cascadelake. The first patch eliminated one of these MOVs, this patch eliminates the second. I'm not sure that you can call it a regression, the added test failed when run with a non-standard -march setting. The good news is that test case doesn't have to be changed with this patch applied, i.e. the correct intended behaviour is no MOVs on all architectures. I'll admit the timing is unusual; I had already written and was regression testing a patch for the BMI2 issue, when the -march=cascadelake regression tester let me know it was required for folks that helpfully run the regression suite with non standard settings. i.e. a long standing bug that wasn't previously tested for by the testsuite. > > 2023-10-30 Roger Sayle > > > > gcc/ChangeLog > > PR target/110551 > > * config/i386/i386.md (*bmi2_umul3_1): Tidy condition > > as operands[2] with predicate register_operand must be !MEM_P. > > (peephole2): Optimize a mulx followed by a register-to-register > > move, to place result in the correct destination if possible. 
> > > > gcc/testsuite/ChangeLog > > PR target/110551 > > * gcc.target/i386/pr110551-2.c: New test case. > > Thanks again, Roger --
[x86_64 PATCH] PR target/110551: Tweak mulx register allocation using peephole2.
This patch is a follow-up to my previous PR target/110551 patch, this time to address the additional move after mulx, seen on TARGET_BMI2 architectures (such as -march=haswell). The complication here is that the flexible multiple-set mulx instruction is introduced into RTL after reload, by split2, and therefore can't benefit from register preferencing. This results in RTL like the following:

(insn 32 31 17 2 (parallel [
            (set (reg:DI 4 si [orig:101 r ] [101])
                (mult:DI (reg:DI 1 dx [109])
                    (reg:DI 5 di [109])))
            (set (reg:DI 5 di [ r+8 ])
                (umul_highpart:DI (reg:DI 1 dx [109])
                    (reg:DI 5 di [109])))
        ]) "pr110551-2.c":8:17 -1
     (nil))

(insn 17 32 9 2 (set (reg:DI 0 ax [107])
        (reg:DI 5 di [ r+8 ])) "pr110551-2.c":9:40 90 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 5 di [ r+8 ])
        (nil)))

Here insn 32, the mulx instruction, places its results in si and di, and then immediately after decides to move di to ax, with di now dead. This can be trivially cleaned up by a peephole2. I've added an additional constraint that the two SET_DESTs can't be the same register to avoid confusing the middle-end, but this has well-defined behaviour on x86_64/BMI2, encoding a umul_highpart.

For the new test case, compiled on x86_64 with -O2 -march=haswell:

Before:
mulx64:	movabsq	$-7046029254386353131, %rdx
	mulx	%rdi, %rsi, %rdi
	movq	%rdi, %rax
	xorq	%rsi, %rax
	ret

After:
mulx64:	movabsq	$-7046029254386353131, %rdx
	mulx	%rdi, %rsi, %rax
	xorq	%rsi, %rax
	ret

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline?

2023-10-30  Roger Sayle

gcc/ChangeLog
	PR target/110551
	* config/i386/i386.md (*bmi2_umul3_1): Tidy condition
	as operands[2] with predicate register_operand must be !MEM_P.
	(peephole2): Optimize a mulx followed by a register-to-register
	move, to place result in the correct destination if possible.
gcc/testsuite/ChangeLog PR target/110551 * gcc.target/i386/pr110551-2.c: New test case. Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index eb4121b..a314f1a 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -9747,13 +9747,37 @@ (match_operand:DWIH 3 "nonimmediate_operand" "rm"))) (set (match_operand:DWIH 1 "register_operand" "=r") (umul_highpart:DWIH (match_dup 2) (match_dup 3)))] - "TARGET_BMI2 - && !(MEM_P (operands[2]) && MEM_P (operands[3]))" + "TARGET_BMI2" "mulx\t{%3, %0, %1|%1, %0, %3}" [(set_attr "type" "imulx") (set_attr "prefix" "vex") (set_attr "mode" "")]) +;; Tweak *bmi2_umul3_1 to eliminate following mov. +(define_peephole2 + [(parallel [(set (match_operand:DWIH 0 "general_reg_operand") + (mult:DWIH (match_operand:DWIH 2 "register_operand") + (match_operand:DWIH 3 "nonimmediate_operand"))) + (set (match_operand:DWIH 1 "general_reg_operand") + (umul_highpart:DWIH (match_dup 2) (match_dup 3)))]) + (set (match_operand:DWIH 4 "general_reg_operand") + (match_operand:DWIH 5 "general_reg_operand"))] + "TARGET_BMI2 + && ((REGNO (operands[5]) == REGNO (operands[0]) +&& REGNO (operands[1]) != REGNO (operands[4])) + || (REGNO (operands[5]) == REGNO (operands[1]) + && REGNO (operands[0]) != REGNO (operands[4]))) + && peep2_reg_dead_p (2, operands[5])" + [(parallel [(set (match_dup 0) (mult:DWIH (match_dup 2) (match_dup 3))) + (set (match_dup 1) + (umul_highpart:DWIH (match_dup 2) (match_dup 3)))])] +{ + if (REGNO (operands[5]) == REGNO (operands[0])) +operands[0] = operands[4]; + else +operands[1] = operands[4]; +}) + (define_insn "*umul3_1" [(set (match_operand: 0 "register_operand" "=r,A") (mult: diff --git a/gcc/testsuite/gcc.target/i386/pr110551-2.c b/gcc/testsuite/gcc.target/i386/pr110551-2.c new file mode 100644 index 000..4936adf --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr110551-2.c @@ -0,0 +1,12 @@ +/* { dg-do compile { target int128 } } */ +/* { dg-options "-O2 -march=haswell" } */ 
+ +typedef unsigned long long uint64_t; + +uint64_t mulx64(uint64_t x) +{ +__uint128_t r = (__uint128_t)x * 0x9E3779B97F4A7C15ull; +return (uint64_t)r ^ (uint64_t)( r >> 64 ); +} + +/* { dg-final { scan-assembler-not "movq" } } */
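mulx produces both halves of the 64x64->128 widening multiply (mult and umul_highpart in the RTL above) in one instruction, with freely chosen destinations. For reference, the high half the peephole2 redirects can be computed portably without a 128-bit type; a sketch using the standard four-partial-product decomposition (the function name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* High 64 bits of the 128-bit product a*b (umul_highpart), built
   from four 32x32->64 partial products with carry propagation.  */
static inline uint64_t
umulh64 (uint64_t a, uint64_t b)
{
  uint64_t a_lo = (uint32_t) a, a_hi = a >> 32;
  uint64_t b_lo = (uint32_t) b, b_hi = b >> 32;
  uint64_t p0 = a_lo * b_lo;
  uint64_t p1 = a_lo * b_hi;
  uint64_t p2 = a_hi * b_lo;
  uint64_t p3 = a_hi * b_hi;
  /* Middle column: carries out of the low 64 bits.  */
  uint64_t mid = (p0 >> 32) + (uint32_t) p1 + (uint32_t) p2;
  return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```

On BMI2 targets a compiler can instead emit a single mulx for both halves, which is exactly what the test case's scan-assembler-not "movq" is checking stays move-free.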
RE: [ARC PATCH] Improve DImode left shift by a single bit.
Hi Jeff, > From: Jeff Law > Sent: 30 October 2023 15:09 > Subject: Re: [ARC PATCH] Improve DImode left shift by a single bit. > > On 10/28/23 07:05, Roger Sayle wrote: > > > > This patch improves the code generated for X << 1 (and for X + X) when > > X is 64-bit DImode, using the same two instruction code sequence used > > for DImode addition. > > > > For the test case: > > > > long long foo(long long x) { return x << 1; } > > > > GCC -O2 currently generates the following code: > > > > foo:lsr r2,r0,31 > > asl_s r1,r1,1 > > asl_s r0,r0,1 > > j_s.d [blink] > > or_sr1,r1,r2 > > > > and on CPU without a barrel shifter, i.e. -mcpu=em > > > > foo:add.f 0,r0,r0 > > asl_s r1,r1 > > rlc r2,0 > > asl_s r0,r0 > > j_s.d [blink] > > or_sr1,r1,r2 > > > > with this patch (both with and without a barrel shifter): > > > > foo:add.f r0,r0,r0 > > j_s.d [blink] > > adc r1,r1,r1 > > > > [For Jeff Law's benefit a similar optimization is also applicable to > > H8300H, that could also use a two instruction sequence (plus rts) but > > currently GCC generates 16 instructions (plus an rts) for foo above.] > > > > Tested with a cross-compiler to arc-linux hosted on x86_64, with no > > new (compile-only) regressions from make -k check. > > Ok for mainline if this passes Claudiu's nightly testing? > WRT H8. Bug filed so we don't lose track of it. We don't have DImode > operations > defined on the H8. First step would be DImode loads/stores and basic > arithmetic. The H8's machine description is impressively well organized. Would it make sense to add a doubleword.md, or should DImode support be added to each of the individual addsub.md, logical.md, shiftrotate.md etc..? The fact that register-to-register moves clobber some of the flags bits must also make reload's task very difficult (impossible?). Cheers, Roger --
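The optimization discussed above maps x << 1 onto the double-word addition x + x: add.f adds the low words and sets the carry flag, adc adds the high words plus that carry. A C model with an explicit carry variable, on 32-bit halves (helper name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Emulate "add.f r0,r0,r0 ; adc r1,r1,r1": 64-bit left shift by 1
   as a double-word add with an explicit carry bit.  */
static inline uint64_t
shl64_1 (uint64_t x)
{
  uint32_t lo = (uint32_t) x;
  uint32_t hi = (uint32_t) (x >> 32);
  uint32_t new_lo = lo + lo;            /* add.f: sum sets carry */
  unsigned carry = new_lo < lo;         /* carry out of the lowpart */
  uint32_t new_hi = hi + hi + carry;    /* adc: carry into highpart */
  return ((uint64_t) new_hi << 32) | new_lo;
}
```

The same two-instruction shape applies on any target with an add-with-carry instruction, which is the point of the H8300 aside.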
[ARC PATCH] Improved ARC rtx_costs/insn_cost for SHIFTs and ROTATEs.
This patch overhauls the ARC backend's insn_cost target hook, and makes some related improvements to rtx_costs, BRANCH_COST, etc. The primary goal is to allow the backend to indicate that shifts and rotates are slow (discouraged) when the CPU doesn't have a barrel shifter. I should also acknowledge Richard Sandiford for inspiring the use of set_cost in this rewrite of arc_insn_cost; this implementation borrows heavily from the target hooks for AArch64 and ARM. The motivating example is derived from PR rtl-optimization/110717.

struct S { int a : 5; };
unsigned int foo (struct S *p)
{
  return p->a;
}

With a barrel shifter, GCC -O2 generates the reasonable:

foo:	ldb_s	r0,[r0]
	asl_s	r0,r0,27
	j_s.d	[blink]
	asr_s	r0,r0,27

What's interesting is that during combine, the middle-end actually has two shifts by three bits, and a sign-extension from QI to SI.

Trying 8, 9 -> 11:
    8: r158:SI=r157:QI#0<<0x3
      REG_DEAD r157:QI
    9: r159:SI=sign_extend(r158:SI#0)
      REG_DEAD r158:SI
   11: r155:SI=r159:SI>>0x3
      REG_DEAD r159:SI

Whilst it's reasonable to simplify this to two shifts by 27 bits when the CPU has a barrel shifter, it's actually a significant pessimization when these shifts are implemented by loops. This combination can be prevented if the backend provides accurate-ish estimates for insn_cost. Previously, without a barrel shifter, GCC -O2 -mcpu=em generates:

foo:	ldb_s	r0,[r0]
	mov	lp_count,27
	lp	2f
	add	r0,r0,r0
	nop
2:	# end single insn loop
	mov	lp_count,27
	lp	2f
	asr	r0,r0
	nop
2:	# end single insn loop
	j_s	[blink]

which contains two loops and requires about ~113 cycles to execute. With this patch to rtx_cost/insn_cost, GCC -O2 -mcpu=em generates:

foo:	ldb_s	r0,[r0]
	mov_s	r2,0	;3
	add3	r0,r2,r0
	sexb_s	r0,r0
	asr_s	r0,r0
	asr_s	r0,r0
	j_s.d	[blink]
	asr_s	r0,r0

which requires only ~6 cycles, for the shorter shifts by 3 and sign extension. Tested with a cross-compiler to arc-linux hosted on x86_64, with no new (compile-only) regressions from make -k check.
Ok for mainline if this passes Claudiu's nightly testing? 2023-10-29 Roger Sayle gcc/ChangeLog * config/arc/arc.cc (arc_rtx_costs): Improve cost estimates. Provide reasonable values for SHIFTS and ROTATES by constant bit counts depending upon TARGET_BARREL_SHIFTER. (arc_insn_cost): Use insn attributes if the instruction is recognized. Avoid calling get_attr_length for type "multi", i.e. define_insn_and_split patterns without explicit type. Fall-back to set_rtx_cost for single_set and pattern_cost otherwise. * config/arc/arc.h (COSTS_N_BYTES): Define helper macro. (BRANCH_COST): Improve/correct definition. (LOGICAL_OP_NON_SHORT_CIRCUIT): Preserve previous behavior. Thanks again, Roger -- diff --git a/gcc/config/arc/arc.cc b/gcc/config/arc/arc.cc index 353ac69..ae83e5e 100644 --- a/gcc/config/arc/arc.cc +++ b/gcc/config/arc/arc.cc @@ -5492,7 +5492,7 @@ arc_rtx_costs (rtx x, machine_mode mode, int outer_code, case CONST: case LABEL_REF: case SYMBOL_REF: - *total = speed ? COSTS_N_INSNS (1) : COSTS_N_INSNS (4); + *total = speed ? COSTS_N_INSNS (1) : COSTS_N_BYTES (4); return true; case CONST_DOUBLE: @@ -5516,26 +5516,32 @@ arc_rtx_costs (rtx x, machine_mode mode, int outer_code, case ASHIFT: case ASHIFTRT: case LSHIFTRT: +case ROTATE: +case ROTATERT: + if (mode == DImode) + return false; if (TARGET_BARREL_SHIFTER) { - if (CONSTANT_P (XEXP (x, 0))) + *total = COSTS_N_INSNS (1); + if (CONSTANT_P (XEXP (x, 1))) { - *total += rtx_cost (XEXP (x, 1), mode, (enum rtx_code) code, + *total += rtx_cost (XEXP (x, 0), mode, (enum rtx_code) code, 0, speed); return true; } - *total = COSTS_N_INSNS (1); } else if (GET_CODE (XEXP (x, 1)) != CONST_INT) - *total = COSTS_N_INSNS (16); + *total = speed ? COSTS_N_INSNS (16) : COSTS_N_INSNS (4); else { - *total = COSTS_N_INSNS (INTVAL (XEXP ((x), 1))); - /* ??? want_to_gcse_p can throw negative shift counts at us, -and then panics when it gets a negative cost as result. -Seen for gcc.c-torture/compile/20020710-1.c -Os . 
*/ - if (*total < 0) - *total = 0; + int n = INTVAL (XEXP (x, 1)) & 31; + if (n < 4) + *total = COSTS_N_INSNS (n); + else + *total = speed ? COSTS_N_INSNS (n + 2) : COSTS_N_INSNS (4); + *total += rtx_cost (XEXP (x, 0), mode, (enum rtx_code) code, +
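The two competing sequences above are easy to cross-check in plain C. This is an illustrative sketch, not code from the patch; the function names are mine, and it assumes (as GCC guarantees in practice) that `>>` on a negative signed value is an arithmetic shift.

```c
#include <assert.h>
#include <stdint.h>

/* Wide-shift extraction of a signed 5-bit field: move the field to the
   top of the word, then arithmetic-shift it back down.  One cheap pair
   of instructions with a barrel shifter, but two 27-iteration loops
   without one.  */
static int32_t extract5_wide (uint8_t byte)
{
  return (int32_t) ((uint32_t) byte << 27) >> 27;
}

/* The cheaper sequence combine sees before simplification: shift left
   by 3 within the byte, sign-extend QI->SI (sexb_s), then arithmetic
   shift back by 3.  */
static int32_t extract5_narrow (uint8_t byte)
{
  int8_t t = (int8_t) (uint8_t) (byte << 3);  /* QI -> SI sign extension */
  return (int32_t) t >> 3;
}
```

With accurate insn_cost values the backend can steer combine toward the narrow form on CPUs without a barrel shifter, since three single-bit shifts plus a sign extension are far cheaper than two 27-bit shift loops.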
[ARC PATCH] Convert (signed<<31)>>31 to -(signed&1) without barrel shifter.
This patch optimizes PR middle-end/101955 for the ARC backend. On ARC CPUs with a barrel shifter, using two shifts is (probably) optimal as: asl_s r0,r0,31 asr_s r0,r0,31 but without a barrel shifter, GCC -O2 -mcpu=em currently generates: and r2,r0,1 ror r2,r2 add.f 0,r2,r2 sbc r0,r0,r0 with this patch, we now generate the smaller, faster and non-flags clobbering: bmsk_s r0,r0,0 neg_s r0,r0 Tested with a cross-compiler to arc-linux hosted on x86_64, with no new (compile-only) regressions from make -k check. Ok for mainline if this passes Claudiu's nightly testing? 2023-10-28 Roger Sayle gcc/ChangeLog PR middle-end/101955 * config/arc/arc.md (*extvsi_1_0): New define_insn_and_split to convert sign extract of the least significant bit into an AND $1 then a NEG when !TARGET_BARREL_SHIFTER. gcc/testsuite/ChangeLog PR middle-end/101955 * gcc.target/arc/pr101955.c: New test case. Thanks again, Roger -- diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md index ee43887..6471344 100644 --- a/gcc/config/arc/arc.md +++ b/gcc/config/arc/arc.md @@ -5873,6 +5873,20 @@ archs4x, archs4xd" (zero_extract:SI (match_dup 1) (match_dup 5) (match_dup 7)))]) (match_dup 1)]) +;; Split sign-extension of single least significant bit as and x,$1;neg x +(define_insn_and_split "*extvsi_1_0" + [(set (match_operand:SI 0 "register_operand" "=r") + (sign_extract:SI (match_operand:SI 1 "register_operand" "0") +(const_int 1) +(const_int 0)))] + "!TARGET_BARREL_SHIFTER" + "#" + "&& 1" + [(set (match_dup 0) (and:SI (match_dup 1) (const_int 1))) + (set (match_dup 0) (neg:SI (match_dup 0)))] + "" + [(set_attr "length" "8")]) + (define_insn_and_split "rotlsi3_cnt1" [(set (match_operand:SI 0 "dest_reg_operand""=r") (rotate:SI (match_operand:SI 1 "register_operand" "r") diff --git a/gcc/testsuite/gcc.target/arc/pr101955.c b/gcc/testsuite/gcc.target/arc/pr101955.c new file mode 100644 index 000..74bca3c --- /dev/null +++ b/gcc/testsuite/gcc.target/arc/pr101955.c @@ -0,0 +1,10 @@ +/* { dg-do compile } 
*/ +/* { dg-options "-O2 -mcpu=em" } */ + +int f(int a) +{ +return (a << 31) >> 31; +} + +/* { dg-final { scan-assembler "msk_s\\s+r0,r0,0" } } */ +/* { dg-final { scan-assembler "neg_s\\s+r0,r0" } } */
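The identity behind this transformation can be sanity-checked in C. A hedged sketch (the helper names are mine, and arithmetic right shift of negative values is assumed, as GCC provides):

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend bit 0 via two 31-bit shifts: cheap with a barrel
   shifter, a flag-clobbering multi-insn dance without one.  */
static int32_t sext_bit0_shifts (int32_t a)
{
  return (int32_t) ((uint32_t) a << 31) >> 31;
}

/* The replacement: mask the least significant bit, then negate,
   i.e. the bmsk_s/neg_s pair the patch generates.  */
static int32_t sext_bit0_andneg (int32_t a)
{
  return -(a & 1);
}
```

Both return -1 when bit 0 is set and 0 otherwise, which is exactly the sign_extract:SI of width 1 at position 0 that the new define_insn_and_split matches.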
[ARC PATCH] Improve DImode left shift by a single bit.
This patch improves the code generated for X << 1 (and for X + X) when X is 64-bit DImode, using the same two-instruction code sequence used for DImode addition. For the test case:

long long foo(long long x) { return x << 1; }

GCC -O2 currently generates the following code:

foo:	lsr	r2,r0,31
	asl_s	r1,r1,1
	asl_s	r0,r0,1
	j_s.d	[blink]
	or_s	r1,r1,r2

and on CPUs without a barrel shifter, i.e. -mcpu=em:

foo:	add.f	0,r0,r0
	asl_s	r1,r1
	rlc	r2,0
	asl_s	r0,r0
	j_s.d	[blink]
	or_s	r1,r1,r2

With this patch (both with and without a barrel shifter):

foo:	add.f	r0,r0,r0
	j_s.d	[blink]
	adc	r1,r1,r1

[For Jeff Law's benefit, a similar optimization is also applicable to H8300H, which could also use a two-instruction sequence (plus rts), but currently GCC generates 16 instructions (plus an rts) for foo above.] Tested with a cross-compiler to arc-linux hosted on x86_64, with no new (compile-only) regressions from make -k check. Ok for mainline if this passes Claudiu's nightly testing?

2023-10-28  Roger Sayle

gcc/ChangeLog
	* config/arc/arc.md (addsi3): Fix GNU-style code formatting.
	(adddi3): Change define_expand to generate an *adddi3.
	(*adddi3): New define_insn_and_split to lower DImode additions
	during the split1 pass (after combine and before reload).
	(ashldi3): New define_expand to (only) generate *ashldi3_cnt1
	for DImode left shifts by a single bit.
	(*ashldi3_cnt1): New define_insn_and_split to lower DImode
	left shifts by one bit to an *adddi3.

gcc/testsuite/ChangeLog
	* gcc.target/arc/adddi3-1.c: New test case.
	* gcc.target/arc/ashldi3-1.c: Likewise. 
Thanks in advance, Roger -- diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md index ee43887..fe5f48c 100644 --- a/gcc/config/arc/arc.md +++ b/gcc/config/arc/arc.md @@ -2675,19 +2675,28 @@ archs4x, archs4xd" (plus:SI (match_operand:SI 1 "register_operand" "") (match_operand:SI 2 "nonmemory_operand" "")))] "" - "if (flag_pic && arc_raw_symbolic_reference_mentioned_p (operands[2], false)) - { - operands[2]=force_reg(SImode, operands[2]); - } - ") +{ + if (flag_pic && arc_raw_symbolic_reference_mentioned_p (operands[2], false)) +operands[2] = force_reg (SImode, operands[2]); +}) (define_expand "adddi3" + [(parallel + [(set (match_operand:DI 0 "register_operand" "") + (plus:DI (match_operand:DI 1 "register_operand" "") +(match_operand:DI 2 "nonmemory_operand" ""))) + (clobber (reg:CC CC_REG))])]) + +(define_insn_and_split "*adddi3" [(set (match_operand:DI 0 "register_operand" "") (plus:DI (match_operand:DI 1 "register_operand" "") (match_operand:DI 2 "nonmemory_operand" ""))) (clobber (reg:CC CC_REG))] - "" - " + "arc_pre_reload_split ()" + "#" + "&& 1" + [(const_int 0)] +{ rtx l0 = gen_lowpart (SImode, operands[0]); rtx h0 = gen_highpart (SImode, operands[0]); rtx l1 = gen_lowpart (SImode, operands[1]); @@ -2719,11 +2728,12 @@ archs4x, archs4xd" gen_rtx_LTU (VOIDmode, gen_rtx_REG (CC_Cmode, CC_REG), GEN_INT (0)), gen_rtx_SET (h0, plus_constant (SImode, h0, 1; DONE; - } +} emit_insn (gen_add_f (l0, l1, l2)); emit_insn (gen_adc (h0, h1, h2)); DONE; -") +} + [(set_attr "length" "8")]) (define_insn "add_f" [(set (reg:CC_C CC_REG) @@ -3461,6 +3471,33 @@ archs4x, archs4xd" [(set_attr "type" "shift") (set_attr "length" "16,20")]) +;; DImode shifts + +(define_expand "ashldi3" + [(parallel + [(set (match_operand:DI 0 "register_operand") + (ashift:DI (match_operand:DI 1 "register_operand") + (match_operand:QI 2 "const_int_operand"))) + (clobber (reg:CC CC_REG))])] + "" +{ + if (operands[2] != const1_rtx) +FAIL; +}) + +(define_insn_and_split "*ashldi3_cnt1" + [(set 
(match_operand:DI 0 "register_operand") + (ashift:DI (match_operand:DI 1 "register_operand") + (const_int 1))) + (clobber (reg:CC CC_REG))] + "arc_pre_reload_split ()" + "#" + "&& 1" + [(parallel [(set (match_dup 0) (plus:DI (match_dup 1) (match_dup 1))) + (clobber (reg:CC CC_REG))])] + "" + [(set_attr "length" "8")]) + ;; Rotate instructions. (define_insn "rotrsi3_insn" diff --git
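The add.f/adc pair can be modelled in C on the two 32-bit halves. A sketch with my own naming; the `lo >> 31` stands in for the carry flag that add.f sets:

```c
#include <assert.h>
#include <stdint.h>

/* DImode x << 1 lowered to a 32-bit add-with-carry pair:
   add.f r0,r0,r0 ; adc r1,r1,r1.  */
static uint64_t shl1_adc (uint32_t lo, uint32_t hi)
{
  uint32_t carry = lo >> 31;     /* carry flag produced by add.f */
  uint32_t l = lo + lo;          /* add.f r0,r0,r0 */
  uint32_t h = hi + hi + carry;  /* adc   r1,r1,r1 */
  return ((uint64_t) h << 32) | l;
}
```

Because DImode addition already needs exactly this sequence, routing ashldi3-by-1 through *adddi3 gets the optimal code on both barrel-shifter and non-barrel-shifter CPUs.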
[wwwdocs] Get newlib via git in simtest-howto.html
A minor tweak to the documentation, to use git rather than cvs to obtain the latest version of newlib. Ok for mainline? 2023-10-27 Roger Sayle * htdocs/simtest-howto.html: Use git to obtain newlib. Cheers, Roger -- diff --git a/htdocs/simtest-howto.html b/htdocs/simtest-howto.html index 2e54476b..d9c027fd 100644 --- a/htdocs/simtest-howto.html +++ b/htdocs/simtest-howto.html @@ -59,9 +59,7 @@ contrib/gcc_update --touch cd ${TOP} -cvs -d :pserver:anon...@sourceware.org:/cvs/src login -# You will be prompted for a password; reply with "anoncvs". -cvs -d :pserver:anon...@sourceware.org:/cvs/src co newlib +git clone https://sourceware.org/git/newlib-cygwin.git newlib Check out the sim and binutils tree:
[ARC PATCH] Improved SImode shifts and rotates with -mswap.
This patch improves the code generated by the ARC back-end for CPUs without a barrel shifter but with -mswap. The -mswap option provides a SWAP instruction that implements SImode rotations by 16 bits, but also logical shift instructions (left and right) by 16 bits. Clearly these are also useful building blocks for implementing shifts by 17, 18, etc., which would otherwise require a loop. As a representative example:

int shl20 (int x) { return x << 20; }

GCC with -O2 -mcpu=em -mswap would previously generate:

shl20:	mov	lp_count,10
	lp	2f
	add	r0,r0,r0
	add	r0,r0,r0
2:	# end single insn loop
	j_s	[blink]

with this patch we now generate:

shl20:	mov_s	r2,0	;3
	lsl16	r0,r0
	add3	r0,r2,r0
	j_s.d	[blink]
	asl_s	r0,r0

Although both are four instructions (excluding the j_s), the original takes ~22 cycles, and the replacement ~4 cycles. Tested with a cross-compiler to arc-linux hosted on x86_64, with no new (compile-only) regressions from make -k check. Ok for mainline if this passes Claudiu's nightly testing?

2023-10-27  Roger Sayle

gcc/ChangeLog
	* config/arc/arc.cc (arc_split_ashl): Use lsl16 on TARGET_SWAP.
	(arc_split_ashr): Use swap and sign-extend on TARGET_SWAP.
	(arc_split_lshr): Use lsr16 on TARGET_SWAP.
	(arc_split_rotl): Use swap on TARGET_SWAP.
	(arc_split_rotr): Likewise.
	* config/arc/arc.md (ANY_ROTATE): New code iterator.
	(si2_cnt16): New define_insn for alternate form of swap
	instruction on TARGET_SWAP.
	(ashlsi2_cnt16): Rename from *ashlsi16_cnt16 and move earlier.
	(lshrsi2_cnt16): New define_insn for LSR16 instruction.
	(*ashlsi2_cnt16): See above.

gcc/testsuite/ChangeLog
	* gcc.target/arc/lsl16-1.c: New test case.
	* gcc.target/arc/lsr16-1.c: Likewise.
	* gcc.target/arc/swap-1.c: Likewise.
	* gcc.target/arc/swap-2.c: Likewise. 
Thanks in advance, Roger -- diff --git a/gcc/config/arc/arc.cc b/gcc/config/arc/arc.cc index 353ac69..e98692a 100644 --- a/gcc/config/arc/arc.cc +++ b/gcc/config/arc/arc.cc @@ -4256,6 +4256,17 @@ arc_split_ashl (rtx *operands) } return; } + else if (n >= 16 && n <= 22 && TARGET_SWAP && TARGET_V2) + { + emit_insn (gen_ashlsi2_cnt16 (operands[0], operands[1])); + if (n > 16) + { + operands[1] = operands[0]; + operands[2] = GEN_INT (n - 16); + arc_split_ashl (operands); + } + return; + } else if (n >= 29) { if (n < 31) @@ -4300,6 +4311,15 @@ arc_split_ashr (rtx *operands) emit_move_insn (operands[0], operands[1]); return; } + else if (n >= 16 && n <= 18 && TARGET_SWAP) + { + emit_insn (gen_rotrsi2_cnt16 (operands[0], operands[1])); + emit_insn (gen_extendhisi2 (operands[0], + gen_lowpart (HImode, operands[0]))); + while (--n >= 16) + emit_insn (gen_ashrsi3_cnt1 (operands[0], operands[0])); + return; + } else if (n == 30) { rtx tmp = gen_reg_rtx (SImode); @@ -4339,6 +4359,13 @@ arc_split_lshr (rtx *operands) emit_move_insn (operands[0], operands[1]); return; } + else if (n >= 16 && n <= 19 && TARGET_SWAP && TARGET_V2) + { + emit_insn (gen_lshrsi2_cnt16 (operands[0], operands[1])); + while (--n >= 16) + emit_insn (gen_lshrsi3_cnt1 (operands[0], operands[0])); + return; + } else if (n == 30) { rtx tmp = gen_reg_rtx (SImode); @@ -4385,6 +4412,19 @@ arc_split_rotl (rtx *operands) emit_insn (gen_rotrsi3_cnt1 (operands[0], operands[0])); return; } + else if (n >= 13 && n <= 16 && TARGET_SWAP) + { + emit_insn (gen_rotlsi2_cnt16 (operands[0], operands[1])); + while (++n <= 16) + emit_insn (gen_rotrsi3_cnt1 (operands[0], operands[0])); + return; + } + else if (n == 17 && TARGET_SWAP) + { + emit_insn (gen_rotlsi2_cnt16 (operands[0], operands[1])); + emit_insn (gen_rotlsi3_cnt1 (operands[0], operands[0])); + return; + } else if (n >= 16 || n == 12 || n == 14) { emit_insn (gen_rotrsi3_loop (operands[0], operands[1], @@ -4415,6 +4455,19 @@ arc_split_rotr (rtx *operands) 
emit_move_insn (operands[0], operands[1]); return; } + else if (n == 15 && TARGET_SWAP) + { + emit_insn (gen_rotrsi2_cnt16 (operands[0], operands[1])); + emit_insn (gen_rotlsi3_cnt1 (operands[0], operands[0])); + return; + } + e
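The decompositions the patch adds can be expressed in C. The helper names here are hypothetical; `rot16` models the SWAP instruction, and the 16-bit shifts model LSL16/LSR16:

```c
#include <assert.h>
#include <stdint.h>

/* SWAP implements a rotate by 16 bits.  */
static uint32_t rot16 (uint32_t x)
{
  return (x << 16) | (x >> 16);
}

/* Shift by 20 = LSL16 followed by a shift of the remaining 4 bits
   (implemented as add3 + asl in the generated code above).  */
static uint32_t shl20 (uint32_t x)
{
  return (x << 16) << 4;
}

/* Rotate left by 17 = SWAP followed by one single-bit rotate.  */
static uint32_t rotl17 (uint32_t x)
{
  uint32_t t = rot16 (x);
  return (t << 1) | (t >> 31);
}
```

In each case a loop of 17-22 single-bit steps collapses to one 16-bit building block plus a handful of single-bit shifts or rotates, which is where the ~22 to ~4 cycle improvement comes from.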
RE: [x86 PATCH] PR target/110511: Fix reg allocation for widening multiplications.
Hi Uros, I've tried your suggestions to see what would happen. Alas, allowing both operands to (i386's) widening multiplications to be nonimmediate_operand results in 90 additional testsuite unexpected failures and 41 unresolved testcases, around things like:

gcc.c-torture/compile/di.c:6:1: error: unrecognizable insn:
(insn 14 13 15 2 (parallel [ (set (reg:DI 98 [ _3 ]) (mult:DI (zero_extend:DI (mem/c:SI (plus:SI (reg/f:SI 93 virtual-stack-vars) (const_int -8 [0xfff8])) [1 a+0 S4 A64])) (zero_extend:DI (mem/c:SI (plus:SI (reg/f:SI 93 virtual-stack-vars) (const_int -16 [0xfff0])) [1 b+0 S4 A64] (clobber (reg:CC 17 flags)) ]) "gcc.c-torture/compile/di.c":5:12 -1 (nil))
during RTL pass: vregs
gcc.c-torture/compile/di.c:6:1: internal compiler error: in extract_insn, at recog.cc:2791

In my experiments, I've used nonimmediate_operand instead of general_operand, as a zero_extend of an immediate_operand, like const_int, would be non-canonical. In short, it's ok (common) for '%' to apply to operands with different predicates; reload will only swap things if the operand's predicates/constraints remain consistent. For example, see i386.c's *add_1 pattern. And as shown above it can't be left to (until) reload to decide which "mem" gets loaded into a register (which would be nice), as some passes before reload check both predicates and constraints. My original patch fixes PR 110511, using the same peephole2 idiom as already used elsewhere in i386.md. Ok for mainline?

> -Original Message-
> From: Uros Bizjak
> Sent: 19 October 2023 18:02
> To: Roger Sayle
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [x86 PATCH] PR target/110511: Fix reg allocation for widening
> multiplications. 
> > On Tue, Oct 17, 2023 at 9:05 PM Roger Sayle > wrote: > > > > > > This patch contains clean-ups of the widening multiplication patterns > > in i386.md, and provides variants of the existing highpart > > multiplication > > peephole2 transformations (that tidy up register allocation after > > reload), and thereby fixes PR target/110511, which is a superfluous > > move instruction. > > > > For the new test case, compiled on x86_64 with -O2. > > > > Before: > > mulx64: movabsq $-7046029254386353131, %rcx > > movq%rcx, %rax > > mulq%rdi > > xorq%rdx, %rax > > ret > > > > After: > > mulx64: movabsq $-7046029254386353131, %rax > > mulq%rdi > > xorq%rdx, %rax > > ret > > > > The clean-ups are (i) that operand 1 is consistently made > > register_operand and operand 2 becomes nonimmediate_operand, so that > > predicates match the constraints, (ii) the representation of the BMI2 > > mulx instruction is updated to use the new umul_highpart RTX, and > > (iii) because operands > > 0 and 1 have different modes in widening multiplications, "a" is a > > more appropriate constraint than "0" (which avoids spills/reloads > > containing SUBREGs). The new peephole2 transformations are based upon > > those at around line 9951 of i386.md, that begins with the comment ;; > > Highpart multiplication peephole2s to tweak register allocation. > > ;; mov imm,%rdx; mov %rdi,%rax; imulq %rdx -> mov imm,%rax; imulq > > %rdi > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline? > > > > > > 2023-10-17 Roger Sayle > > > > gcc/ChangeLog > > PR target/110511 > > * config/i386/i386.md (mul3): Make operands 1 and > > 2 take "regiser_operand" and "nonimmediate_operand" respectively. > > (mulqihi3): Likewise. > > (*bmi2_umul3_1): Operand 2 needs to be register_operand > > matching the %d constraint. 
Use umul_highpart RTX to represent > > the highpart multiplication. > > (*umul3_1): Operand 2 should use regiser_operand > > predicate, and "a" rather than "0" as operands 0 and 2 have > > different modes. > > (define_split): For mul to mulx conversion, use the new > > umul_highpart RTX representation. > > (*mul3_1): Operand 1 should be register_operand > > and the constraint %a as operands 0 and 1 have different modes. > > (*mulqihi3_1): Operand 1 should be register_
[NVPTX] Patch pings...
Random fact: there have been no changes to nvptx.md in 2023 apart from Jakub's tree-wide update to the copyright years in early January. Please can I ping two of my pending Nvidia nvptx patches: "Correct pattern for popcountdi2 insn in nvptx.md" from January https://gcc.gnu.org/pipermail/gcc-patches/2023-January/609571.html and "Update nvptx's bitrev2 pattern to use BITREVERSE rtx" from June https://gcc.gnu.org/pipermail/gcc-patches/2023-June/620994.html Both of these still apply cleanly (because nvptx.md hasn't changed). Thanks in advance, Roger --
[PATCH v2] PR 91865: Avoid ZERO_EXTEND of ZERO_EXTEND in make_compound_operation.
Hi Jeff, Many thanks for the review/approval of my fix for PR rtl-optimization/91865. Based on your and Richard Biener's feedback, I’d like to propose a revision calling simplify_unary_operation instead of simplify_const_unary_operation (i.e. Richi's recommendation). I was originally concerned that this might potentially result in unbounded recursion, and testing for ZERO_EXTEND was safer but "uglier", but testing hasn't shown any issues. If we do see issues in the future, it's easy to fall back to the previous version of this patch. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-10-25 Roger Sayle Richard Biener gcc/ChangeLog PR rtl-optimization/91865 * combine.cc (make_compound_operation): Avoid creating a ZERO_EXTEND of a ZERO_EXTEND. gcc/testsuite/ChangeLog PR rtl-optimization/91865 * gcc.target/msp430/pr91865.c: New test case. Thanks again, Roger -- > -Original Message- > From: Jeff Law > Sent: 19 October 2023 16:20 > > On 10/14/23 16:14, Roger Sayle wrote: > > > > This patch is my proposed solution to PR rtl-optimization/91865. > > Normally RTX simplification canonicalizes a ZERO_EXTEND of a > > ZERO_EXTEND to a single ZERO_EXTEND, but as shown in this PR it is > > possible for combine's make_compound_operation to unintentionally > > generate a non-canonical ZERO_EXTEND of a ZERO_EXTEND, which is > > unlikely to be matched by the backend. 
> > > > For the new test case: > > > > const int table[2] = {1, 2}; > > int foo (char i) { return table[i]; } > > > > compiling with -O2 -mlarge on msp430 we currently see: > > > > Trying 2 -> 7: > > 2: r25:HI=zero_extend(R12:QI) > >REG_DEAD R12:QI > > 7: r28:PSI=sign_extend(r25:HI)#0 > >REG_DEAD r25:HI > > Failed to match this instruction: > > (set (reg:PSI 28 [ iD.1772 ]) > > (zero_extend:PSI (zero_extend:HI (reg:QI 12 R12 [ iD.1772 ] > > > > which results in the following code: > > > > foo:AND #0xff, R12 > > RLAM.A #4, R12 { RRAM.A #4, R12 > > RLAM.A #1, R12 > > MOVX.W table(R12), R12 > > RETA > > > > With this patch, we now see: > > > > Trying 2 -> 7: > > 2: r25:HI=zero_extend(R12:QI) > >REG_DEAD R12:QI > > 7: r28:PSI=sign_extend(r25:HI)#0 > >REG_DEAD r25:HI > > Successfully matched this instruction: > > (set (reg:PSI 28 [ iD.1772 ]) > > (zero_extend:PSI (reg:QI 12 R12 [ iD.1772 ]))) allowing > > combination of insns 2 and 7 original costs 4 + 8 = 12 replacement > > cost 8 > > > > foo:MOV.B R12, R12 > > RLAM.A #1, R12 > > MOVX.W table(R12), R12 > > RETA > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline? > > > > 2023-10-14 Roger Sayle > > > > gcc/ChangeLog > > PR rtl-optimization/91865 > > * combine.cc (make_compound_operation): Avoid creating a > > ZERO_EXTEND of a ZERO_EXTEND. > Final question. Is there a reasonable expectation that we could get a > similar situation with sign extensions? If so we probably ought to try > and handle both. > > OK with the obvious change to handle nested sign extensions if you think it's > useful to do so. And OK as-is if you don't think handling nested sign > extensions is > useful. 
> > jeff diff --git a/gcc/combine.cc b/gcc/combine.cc index 360aa2f25e6..b1b16ac7bb2 100644 --- a/gcc/combine.cc +++ b/gcc/combine.cc @@ -8449,8 +8449,8 @@ make_compound_operation (rtx x, enum rtx_code in_code) if (code == ZERO_EXTEND) { new_rtx = make_compound_operation (XEXP (x, 0), next_code); - tem = simplify_const_unary_operation (ZERO_EXTEND, GET_MODE (x), - new_rtx, GET_MODE (XEXP (x, 0))); + tem = simplify_unary_operation (ZERO_EXTEND, GET_MODE (x), + new_rtx, GET_MODE (XEXP (x, 0))); if (tem) return tem; SUBST (XEXP (x, 0), new_rtx); diff --git a/gcc/testsuite/gcc.target/msp430/pr91865.c b/gcc/testsuite/gcc.target/msp430/pr91865.c new file mode 100644 index 000..8cc21c8b9e8 --- /dev/null +++ b/gcc/testsuite/gcc.target/msp430/pr91865.c @@ -0,0 +1,8 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mlarge" } */ + +const int table[2] = {1, 2}; +int foo (char i) { return table[i]; } + +/* { dg-final { scan-assembler-not "AND" } } */ +/* { dg-final { scan-assembler-not "RRAM" } } */
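The canonicalization being restored is easy to state in C: zero-extending twice from QImode gives the same value as zero-extending once, so only the single-extension form should ever reach the backend. A small semantic check (the function names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* zero_extend:PSI (zero_extend:HI (reg:QI)) -- the non-canonical
   nested form combine was accidentally producing ...  */
static uint32_t nested_zext (uint8_t q)
{
  return (uint32_t) (uint16_t) q;
}

/* ... is value-equivalent to the canonical single zero_extend,
   which the backend actually has a pattern for.  */
static uint32_t single_zext (uint8_t q)
{
  return (uint32_t) q;
}
```

Calling simplify_unary_operation (rather than only the const variant) lets combine collapse the nested form whenever the inner operand is itself an extension, not just when it is a constant.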
[x86 PATCH] Fine tune STV register conversion costs for -Os.
The eagle-eyed may have spotted that my recent testcases for DImode shifts on x86_64 included -mno-stv in the dg-options. This is because the Scalar-To-Vector (STV) pass currently transforms these shifts to use SSE vector operations, producing larger code even with -Os. The issue is that compute_convert_gain currently underestimates the size of instructions required for interunit moves, which is corrected with the patch below. For the simple test case:

unsigned long long shl1(unsigned long long x) { return x << 1; }

without this patch, GCC -m32 -Os -mavx2 currently generates:

shl1:	push	%ebp			// 1 byte
	mov	%esp,%ebp		// 2 bytes
	vmovq	0x8(%ebp),%xmm0		// 5 bytes
	pop	%ebp			// 1 byte
	vpaddq	%xmm0,%xmm0,%xmm0	// 4 bytes
	vmovd	%xmm0,%eax		// 4 bytes
	vpextrd	$0x1,%xmm0,%edx		// 6 bytes
	ret				// 1 byte
					// = 24 bytes total

with this patch, we now generate the shorter:

shl1:	push	%ebp			// 1 byte
	mov	%esp,%ebp		// 2 bytes
	mov	0x8(%ebp),%eax		// 3 bytes
	mov	0xc(%ebp),%edx		// 3 bytes
	pop	%ebp			// 1 byte
	add	%eax,%eax		// 2 bytes
	adc	%edx,%edx		// 2 bytes
	ret				// 1 byte
					// = 15 bytes total

Benchmarking using CSiBE shows that this patch saves 1361 bytes when compiling with -m32 -Os, and saves 172 bytes when compiling with -Os. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline?

2023-10-23  Roger Sayle

gcc/ChangeLog
	* config/i386/i386-features.cc (compute_convert_gain): Provide
	more accurate values (sizes) for inter-unit moves with -Os.

Thanks in advance, Roger --

diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc index cead397..6fac67e 100644 --- a/gcc/config/i386/i386-features.cc +++ b/gcc/config/i386/i386-features.cc @@ -752,11 +752,33 @@ general_scalar_chain::compute_convert_gain () fprintf (dump_file, " Instruction conversion gain: %d\n", gain); /* Cost the integer to sse and sse to integer moves. 
*/ - cost += n_sse_to_integer * ix86_cost->sse_to_integer; - /* ??? integer_to_sse but we only have that in the RA cost table. - Assume sse_to_integer/integer_to_sse are the same which they - are at the moment. */ - cost += n_integer_to_sse * ix86_cost->sse_to_integer; + if (!optimize_function_for_size_p (cfun)) +{ + cost += n_sse_to_integer * ix86_cost->sse_to_integer; + /* ??? integer_to_sse but we only have that in the RA cost table. + Assume sse_to_integer/integer_to_sse are the same which they + are at the moment. */ + cost += n_integer_to_sse * ix86_cost->sse_to_integer; +} + else if (TARGET_64BIT || smode == SImode) +{ + cost += n_sse_to_integer * COSTS_N_BYTES (4); + cost += n_integer_to_sse * COSTS_N_BYTES (4); +} + else if (TARGET_SSE4_1) +{ + /* vmovd (4 bytes) + vpextrd (6 bytes). */ + cost += n_sse_to_integer * COSTS_N_BYTES (10); + /* vmovd (4 bytes) + vpinsrd (6 bytes). */ + cost += n_integer_to_sse * COSTS_N_BYTES (10); +} + else +{ + /* movd (4 bytes) + psrlq (5 bytes) + movd (4 bytes). */ + cost += n_sse_to_integer * COSTS_N_BYTES (13); + /* movd (4 bytes) + movd (4 bytes) + unpckldq (4 bytes). */ + cost += n_integer_to_sse * COSTS_N_BYTES (12); +} if (dump_file) fprintf (dump_file, " Registers conversion cost: %d\n", cost);
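The -Os costing in the patch is pure byte arithmetic, which can be sketched directly. The COSTS_N_BYTES scale of 2 is my assumption about the i386 backend's macro, and the function name is invented:

```c
#include <assert.h>

#define COSTS_N_BYTES(n) ((n) * 2)  /* assumed i386 byte-cost scale */

/* Size cost of n_s2i SSE->integer and n_i2s integer->SSE moves for a
   DImode chain on 32-bit, mirroring the patch's -Os branches.  */
static int interunit_move_cost (int n_s2i, int n_i2s, int sse4_1)
{
  if (sse4_1)
    /* vmovd (4 bytes) + vpextrd (6 bytes); vmovd + vpinsrd likewise.  */
    return n_s2i * COSTS_N_BYTES (10) + n_i2s * COSTS_N_BYTES (10);
  /* movd + psrlq + movd = 13 bytes; movd + movd + unpckldq = 12 bytes.  */
  return n_s2i * COSTS_N_BYTES (13) + n_i2s * COSTS_N_BYTES (12);
}
```

With the old code the conversion cost was a single sse_to_integer tuning value per move, which badly understates the 10-13 byte instruction sequences actually needed on -m32, so STV looked like a size win when it was not.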
RE: [Patch] nvptx: Use fatal_error when -march= is missing not an assert [PR111093]
Hi Thomas, Tobias and Tom, Thanks for asking. Interestingly, I've a patch (attached) from last year that tackled some of the issues here. The surface problem is that nvptx's march and misa are related in complicated ways. Specifying an arch defines the range of valid isa's, and specifying an isa restricts the set of valid arches. The current approach, which I agree is problematic, is to force these to be specified (compatibly) on the cc1 command line. Certainly, an error is better than an abort. My proposed solution was to allow either to imply a default for the other, and only issue an error if they are explicitly specified incompatibly. One reason for supporting this approach was to ultimately support an -march=native in the driver (calling libcuda.so to determine the hardware available on the current machine). The other use case is bumping the "default" nvptx architecture to something more recent, say sm_53, by providing/honoring a default arch at configure time. Alas, it turns out that specifying a recent arch during GCC bootstrap allows the build to notice that the backend (now) supports 16-bit floats, which then prompts libgcc to contain the floathf and fixhf support that would be required. Then this in turn shows up as a limitation in the middle-end's handling of libcalls, for which I submitted a patch back in July 2022: https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598848.html That patch hasn't yet been approved, so the whole nvptx -march= patch series became backlogged/forgotten. Hopefully, the attached "proof-of-concept" patch looks interesting (food for thought). If this approach seems reasonable, I'm happy to brush the dust off, and resubmit it (or a series of pieces) for review. 
Best regards, Roger -- > -Original Message- > From: Thomas Schwinge > Sent: 18 October 2023 11:16 > To: Tobias Burnus > Cc: gcc-patches@gcc.gnu.org; Tom de Vries ; Roger Sayle > > Subject: Re: [Patch] nvptx: Use fatal_error when -march= is missing not an > assert > [PR111093] > > Hi Tobias! > > On 2023-10-16T11:18:45+0200, Tobias Burnus > wrote: > > While mkoffload ensures that there is always a -march=, nvptx's > > cc1 can also be run directly. > > > > In my case, I wanted to know which target-specific #define are > > available; hence, I did run: > >accel/nvptx-none/cc1 -E -dM < /dev/null which gave an ICE. After > > some debugging, the reasons was clear (missing -march=) but somehow a > > (fatal) error would have been nicer than an ICE + debugging. > > > > OK for mainline? > > Yes, thanks. I think I prefer this over hard-coding some default > 'ptx_isa_option' -- > but may be convinced otherwise (incremental change), if that's maybe more > convenient for others? (Roger?) > > > Grüße > Thomas > > > > nvptx: Use fatal_error when -march= is missing not an assert > > [PR111093] > > > > gcc/ChangeLog: > > > > PR target/111093 > > * config/nvptx/nvptx.cc (nvptx_option_override): Issue fatal error > > instead of an assert ICE when no -march= has been specified. > > > > diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc > > index edef39fb5e1..634c31673be 100644 > > --- a/gcc/config/nvptx/nvptx.cc > > +++ b/gcc/config/nvptx/nvptx.cc > > @@ -335,8 +335,9 @@ nvptx_option_override (void) > >init_machine_status = nvptx_init_machine_status; > > > >/* Via nvptx 'OPTION_DEFAULT_SPECS', '-misa' always appears on the > command > > - line. */ > > - gcc_checking_assert (OPTION_SET_P (ptx_isa_option)); > > + line; but handle the case that the compiler is not run via the > > + driver. 
*/ if (!OPTION_SET_P (ptx_isa_option)) > > +fatal_error (UNKNOWN_LOCATION, "%<-march=%> must be specified"); > > > >handle_ptx_version_option (); > > > - > Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 > München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas > Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht > München, HRB 106955 diff --git a/gcc/calls.cc b/gcc/calls.cc index 6dd6f73..8a18eae 100644 --- a/gcc/calls.cc +++ b/gcc/calls.cc @@ -4795,14 +4795,20 @@ emit_library_call_value_1 (int retval, rtx orgfun, rtx value, else { /* Convert to the proper mode if a promotion has been active. */ - if (GET_MODE (valreg) != outmode) + enum machine_mode valmode = GET_MODE (valreg); + if (valmode != outmode) { int unsignedp = TYPE_UNSIGNED (tfom); gcc_assert (promote_function_mode (tfom, outmode, ,
RE: [x86 PATCH] PR target/110551: Fix reg allocation for widening multiplications.
Many thanks to Tobias Burnus for pointing out the mistake/typo in the PR number. This fix is for PR 110551, not PR 110511. I'll update the ChangeLog and filename of the new testcase, if approved. Sorry for any inconvenience/confusion. Cheers, Roger -- > -Original Message- > From: Roger Sayle > Sent: 17 October 2023 20:06 > To: 'gcc-patches@gcc.gnu.org' > Cc: 'Uros Bizjak' > Subject: [x86 PATCH] PR target/110511: Fix reg allocation for widening > multiplications. > > > This patch contains clean-ups of the widening multiplication patterns in i386.md, > and provides variants of the existing highpart multiplication > peephole2 transformations (that tidy up register allocation after reload), and > thereby fixes PR target/110511, which is a superfluous move instruction. > > For the new test case, compiled on x86_64 with -O2. > > Before: > mulx64: movabsq $-7046029254386353131, %rcx > movq%rcx, %rax > mulq%rdi > xorq%rdx, %rax > ret > > After: > mulx64: movabsq $-7046029254386353131, %rax > mulq%rdi > xorq%rdx, %rax > ret > > The clean-ups are (i) that operand 1 is consistently made register_operand and > operand 2 becomes nonimmediate_operand, so that predicates match the > constraints, (ii) the representation of the BMI2 mulx instruction is updated to use > the new umul_highpart RTX, and (iii) because operands > 0 and 1 have different modes in widening multiplications, "a" is a more > appropriate constraint than "0" (which avoids spills/reloads containing SUBREGs). > The new peephole2 transformations are based upon those at around line 9951 of > i386.md, that begins with the comment ;; Highpart multiplication peephole2s to > tweak register allocation. > ;; mov imm,%rdx; mov %rdi,%rax; imulq %rdx -> mov imm,%rax; imulq %rdi > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and > make -k check, both with and without --target_board=unix{-m32} with no new > failures. Ok for mainline? 
> > > 2023-10-17 Roger Sayle > > gcc/ChangeLog > PR target/110511 > * config/i386/i386.md (mul3): Make operands 1 and > 2 take "register_operand" and "nonimmediate_operand" respectively. > (mulqihi3): Likewise. > (*bmi2_umul3_1): Operand 2 needs to be register_operand > matching the %d constraint. Use umul_highpart RTX to represent > the highpart multiplication. > (*umul3_1): Operand 2 should use register_operand > predicate, and "a" rather than "0" as operands 0 and 2 have > different modes. > (define_split): For mul to mulx conversion, use the new > umul_highpart RTX representation. > (*mul3_1): Operand 1 should be register_operand > and the constraint %a as operands 0 and 1 have different modes. > (*mulqihi3_1): Operand 1 should be register_operand matching > the constraint %0. > (define_peephole2): Provide widening multiplication variants > of the peephole2s that tweak highpart multiplication register > allocation. > > gcc/testsuite/ChangeLog > PR target/110511 > * gcc.target/i386/pr110511.c: New test case. > > > Thanks in advance, > Roger
[x86 PATCH] PR target/110511: Fix reg allocation for widening multiplications.
This patch contains clean-ups of the widening multiplication patterns in i386.md, and provides variants of the existing highpart multiplication peephole2 transformations (that tidy up register allocation after reload), and thereby fixes PR target/110511, which is a superfluous move instruction. For the new test case, compiled on x86_64 with -O2. Before: mulx64: movabsq $-7046029254386353131, %rcx movq %rcx, %rax mulq %rdi xorq %rdx, %rax ret After: mulx64: movabsq $-7046029254386353131, %rax mulq %rdi xorq %rdx, %rax ret The clean-ups are (i) that operand 1 is consistently made register_operand and operand 2 becomes nonimmediate_operand, so that predicates match the constraints, (ii) the representation of the BMI2 mulx instruction is updated to use the new umul_highpart RTX, and (iii) because operands 0 and 1 have different modes in widening multiplications, "a" is a more appropriate constraint than "0" (which avoids spills/reloads containing SUBREGs). The new peephole2 transformations are based upon those at around line 9951 of i386.md, that begins with the comment ;; Highpart multiplication peephole2s to tweak register allocation. ;; mov imm,%rdx; mov %rdi,%rax; imulq %rdx -> mov imm,%rax; imulq %rdi This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-10-17 Roger Sayle gcc/ChangeLog PR target/110511 * config/i386/i386.md (mul3): Make operands 1 and 2 take "register_operand" and "nonimmediate_operand" respectively. (mulqihi3): Likewise. (*bmi2_umul3_1): Operand 2 needs to be register_operand matching the %d constraint. Use umul_highpart RTX to represent the highpart multiplication. (*umul3_1): Operand 2 should use register_operand predicate, and "a" rather than "0" as operands 0 and 2 have different modes. (define_split): For mul to mulx conversion, use the new umul_highpart RTX representation.
(*mul3_1): Operand 1 should be register_operand and the constraint %a as operands 0 and 1 have different modes. (*mulqihi3_1): Operand 1 should be register_operand matching the constraint %0. (define_peephole2): Providing widening multiplication variants of the peephole2s that tweak highpart multiplication register allocation. gcc/testsuite/ChangeLog PR target/110511 * gcc.target/i386/pr110511.c: New test case. Thanks in advance, Roger diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 2a60df5..22f18c2 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -9710,33 +9710,29 @@ [(parallel [(set (match_operand: 0 "register_operand") (mult: (any_extend: - (match_operand:DWIH 1 "nonimmediate_operand")) + (match_operand:DWIH 1 "register_operand")) (any_extend: - (match_operand:DWIH 2 "register_operand" + (match_operand:DWIH 2 "nonimmediate_operand" (clobber (reg:CC FLAGS_REG))])]) (define_expand "mulqihi3" [(parallel [(set (match_operand:HI 0 "register_operand") (mult:HI (any_extend:HI - (match_operand:QI 1 "nonimmediate_operand")) + (match_operand:QI 1 "register_operand")) (any_extend:HI - (match_operand:QI 2 "register_operand" + (match_operand:QI 2 "nonimmediate_operand" (clobber (reg:CC FLAGS_REG))])] "TARGET_QIMODE_MATH") (define_insn "*bmi2_umul3_1" [(set (match_operand:DWIH 0 "register_operand" "=r") (mult:DWIH - (match_operand:DWIH 2 "nonimmediate_operand" "%d") + (match_operand:DWIH 2 "register_operand" "%d") (match_operand:DWIH 3 "nonimmediate_operand" "rm"))) (set (match_operand:DWIH 1 "register_operand" "=r") - (truncate:DWIH - (lshiftrt: - (mult: (zero_extend: (match_dup 2)) - (zero_extend: (match_dup 3))) - (match_operand:QI 4 "const_int_operand"] - "TARGET_BMI2 && INTVAL (operands[4]) == * BITS_PER_UNIT + (umul_highpart:DWIH (match_dup 2) (match_dup 3)))] + "TARGET_BMI2 && !(MEM_P (operands[2]) && MEM_P (operands[3]))" "mulx\t{%3, %0, %1|%1, %0, %3}" [(set_attr "type" &qu
RE: [x86 PATCH] PR 106245: Split (x<<31)>>31 as -(x&1) in i386.md
Hi Uros, Thanks for the speedy review. > From: Uros Bizjak > Sent: 17 October 2023 17:38 > > On Tue, Oct 17, 2023 at 3:08 PM Roger Sayle > wrote: > > > > > > This patch is the backend piece of a solution to PRs 101955 and > > 106245, that adds a define_insn_and_split to the i386 backend, to > > perform sign extension of a single (least significant) bit using AND $1 > > then NEG. > > > > Previously, (x<<31)>>31 would be generated as > > > > sall $31, %eax // 3 bytes > > sarl $31, %eax // 3 bytes > > > > with this patch the backend now generates: > > > > andl $1, %eax // 3 bytes > > negl %eax // 2 bytes > > > > Not only is this smaller in size, but microbenchmarking confirms that > > it's a performance win on both Intel and AMD; Intel sees only a 2% > > improvement (perhaps just a size effect), but AMD sees a 7% win. > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline? > > > > > > 2023-10-17 Roger Sayle > > > > gcc/ChangeLog > > PR middle-end/101955 > > PR tree-optimization/106245 > > * config/i386/i386.md (*extv_1_0): New define_insn_and_split. > > > > gcc/testsuite/ChangeLog > > PR middle-end/101955 > > PR tree-optimization/106245 > > * gcc.target/i386/pr106245-2.c: New test case. > > * gcc.target/i386/pr106245-3.c: New 32-bit test case. > > * gcc.target/i386/pr106245-4.c: New 64-bit test case. > > * gcc.target/i386/pr106245-5.c: Likewise. > > +;; Split sign-extension of single least significant bit as and x,$1;neg > +x (define_insn_and_split "*extv_1_0" > + [(set (match_operand:SWI48 0 "register_operand" "=r") > + (sign_extract:SWI48 (match_operand:SWI48 1 "register_operand" "0") > +(const_int 1) > +(const_int 0))) > + (clobber (reg:CC FLAGS_REG))] > + "" > + "#" > + "&& 1" > > No need to use "&&" for an empty insn constraint. Just use "reload_completed" > in > this case.
> > + [(parallel [(set (match_dup 0) (and:SWI48 (match_dup 1) (const_int 1))) > + (clobber (reg:CC FLAGS_REG))]) > + (parallel [(set (match_dup 0) (neg:SWI48 (match_dup 0))) > + (clobber (reg:CC FLAGS_REG))])]) > > Did you intend to split this after reload? If this is the case, then > reload_completed > is missing. Because this splitter neither requires the allocation of a new pseudo, nor a hard register assignment, i.e. it's a splitter that can be run before or after reload, it's written to split "whenever". If you'd prefer it to only split after reload, I agree a "reload_completed" can be added (alternatively, adding "ix86_pre_reload_split ()" would also work). I now see from "*load_tp_" that "" is perhaps preferred over "&& 1" in these cases. Please let me know which you prefer. Cheers, Roger
[x86 PATCH] PR 106245: Split (x<<31)>>31 as -(x&1) in i386.md
This patch is the backend piece of a solution to PRs 101955 and 106245, that adds a define_insn_and_split to the i386 backend, to perform sign extension of a single (least significant) bit using AND $1 then NEG. Previously, (x<<31)>>31 would be generated as sall $31, %eax // 3 bytes sarl $31, %eax // 3 bytes with this patch the backend now generates: andl $1, %eax // 3 bytes negl %eax // 2 bytes Not only is this smaller in size, but microbenchmarking confirms that it's a performance win on both Intel and AMD; Intel sees only a 2% improvement (perhaps just a size effect), but AMD sees a 7% win. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-10-17 Roger Sayle gcc/ChangeLog PR middle-end/101955 PR tree-optimization/106245 * config/i386/i386.md (*extv_1_0): New define_insn_and_split. gcc/testsuite/ChangeLog PR middle-end/101955 PR tree-optimization/106245 * gcc.target/i386/pr106245-2.c: New test case. * gcc.target/i386/pr106245-3.c: New 32-bit test case. * gcc.target/i386/pr106245-4.c: New 64-bit test case. * gcc.target/i386/pr106245-5.c: Likewise.
Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 2a60df5..b7309be0 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -3414,6 +3414,21 @@ [(set_attr "type" "imovx") (set_attr "mode" "SI")]) +;; Split sign-extension of single least significant bit as and x,$1;neg x +(define_insn_and_split "*extv_1_0" + [(set (match_operand:SWI48 0 "register_operand" "=r") + (sign_extract:SWI48 (match_operand:SWI48 1 "register_operand" "0") + (const_int 1) + (const_int 0))) + (clobber (reg:CC FLAGS_REG))] + "" + "#" + "&& 1" + [(parallel [(set (match_dup 0) (and:SWI48 (match_dup 1) (const_int 1))) + (clobber (reg:CC FLAGS_REG))]) + (parallel [(set (match_dup 0) (neg:SWI48 (match_dup 0))) + (clobber (reg:CC FLAGS_REG))])]) + (define_expand "extzv" [(set (match_operand:SWI248 0 "register_operand") (zero_extract:SWI248 (match_operand:SWI248 1 "register_operand") diff --git a/gcc/testsuite/gcc.target/i386/pr106245-2.c b/gcc/testsuite/gcc.target/i386/pr106245-2.c new file mode 100644 index 000..47b0d27 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106245-2.c @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +int f(int a) +{ +return (a << 31) >> 31; +} + +/* { dg-final { scan-assembler "andl" } } */ +/* { dg-final { scan-assembler "negl" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr106245-3.c b/gcc/testsuite/gcc.target/i386/pr106245-3.c new file mode 100644 index 000..4ec6342 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106245-3.c @@ -0,0 +1,11 @@ +/* { dg-do compile { target ia32 } } */ +/* { dg-options "-O2" } */ + +long long f(long long a) +{ +return (a << 63) >> 63; +} + +/* { dg-final { scan-assembler "andl" } } */ +/* { dg-final { scan-assembler "negl" } } */ +/* { dg-final { scan-assembler "cltd" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr106245-4.c b/gcc/testsuite/gcc.target/i386/pr106245-4.c new file mode 100644 index 000..ef77ee5 --- /dev/null +++ 
b/gcc/testsuite/gcc.target/i386/pr106245-4.c @@ -0,0 +1,10 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2" } */ + +long long f(long long a) +{ +return (a << 63) >> 63; +} + +/* { dg-final { scan-assembler "andl" } } */ +/* { dg-final { scan-assembler "negq" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr106245-5.c b/gcc/testsuite/gcc.target/i386/pr106245-5.c new file mode 100644 index 000..0351866 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106245-5.c @@ -0,0 +1,11 @@ +/* { dg-do compile { target int128 } } */ +/* { dg-options "-O2" } */ + +__int128 f(__int128 a) +{ + return (a << 127) >> 127; +} + +/* { dg-final { scan-assembler "andl" } } */ +/* { dg-final { scan-assembler "negq" } } */ +/* { dg-final { scan-assembler "cqto" } } */
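The identity this define_insn_and_split exploits can be checked directly in C++ (a sketch: the cast through uint32_t keeps the left shift well-defined, and the arithmetic right shift of a negative value is the GCC behavior the patch relies on):

```cpp
#include <cstdint>

// Two ways to sign-extend the least significant bit of x: the shift
// pair GCC previously emitted (sall $31; sarl $31) and the and/neg
// pair the new splitter generates.  Both map LSB 0 -> 0 and LSB 1 -> -1.
int32_t sext_lsb_shifts (int32_t x)
{
  // Left shift done in unsigned arithmetic to avoid signed-overflow UB;
  // the right shift of a negative int32_t is arithmetic on GCC.
  return (int32_t) ((uint32_t) x << 31) >> 31;
}

int32_t sext_lsb_and_neg (int32_t x)
{
  return -(x & 1);
}
```

The equivalence holds for every input, which is what licenses the split at any point before or after reload.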
RE: [PATCH] Support g++ 4.8 as a host compiler.
I'd like to ping my patch for restoring bootstrap using g++ 4.8.5 (the system compiler on RHEL 7 and later systems). https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632008.html Note the preprocessor #ifs can be removed; they are only there to document why the union u must have an explicit, empty (but not default) constructor. I completely agree with the various opinions that we might consider upgrading the minimum host compiler for many good reasons (Ada, D, newer C++ features etc.). It's inevitable that older compilers and systems can't be supported indefinitely. Having said that, I don't think that this unintentional trivial breakage, which has a safe one-line work-around, is sufficient cause (or non-negligible risk or support burden) to inconvenience a large number of GCC users (the impact/disruption to cfarm has already been mentioned). Interestingly, "scl enable devtoolset-XX" to use a newer host compiler, v10 or v11, results in a significant increase (100+) in unexpected failures I see during mainline regression testing using "make -k check" (on RedHat 7.9). (Older) system compilers, despite their flaws, are selected for their (overall) stability and maturity. If another patch/change hits the compiler next week that reasonably means that 4.8.5 can no longer be supported, so be it, but it's an annoying (and unnecessary?) inconvenience in the meantime. Perhaps we should file a Bugzilla PR indicating that the documentation and release notes need updating, if my fix isn't considered acceptable? Why this patch is a trigger issue (that requires significant discussion and deliberation) is somewhat of a mystery. Thanks in advance. Roger > -----Original Message----- > From: Jeff Law > Sent: 07 October 2023 17:20 > To: Roger Sayle ; gcc-patches@gcc.gnu.org > Cc: 'Richard Sandiford' > Subject: Re: [PATCH] Support g++ 4.8 as a host compiler.
> > > > On 10/4/23 16:19, Roger Sayle wrote: > > > > The recent patch to remove poly_int_pod triggers a bug in g++ 4.8.5's > > C++ 11 support which mistakenly believes poly_uint16 has a non-trivial > > constructor. This in turn prohibits it from being used as a member in > > a union (rtxunion) that is constructed statically, resulting in a (fatal) > > error during stage 1. A workaround is to add an explicit constructor > > to the problematic union, which allows mainline to be bootstrapped > > with the system compiler on older RedHat 7 systems. > > > > This patch has been tested on x86_64-pc-linux-gnu where it allows a > > bootstrap to complete when using g++ 4.8.5 as the host compiler. > > Ok for mainline? > > > > > > 2023-10-04 Roger Sayle > > > > gcc/ChangeLog > > * rtl.h (rtx_def::u): Add explicit constructor to work around > > an issue using g++ 4.8 as a host compiler. > I think the bigger question is whether or not we're going to step forward on > the > minimum build requirements. > > My recollection was we settled on gcc-4.8 for the benefit of RHEL 7 and > CentOS 7 > which are rapidly approaching EOL (June 2024). > > I would certainly support stepping forward to a more modern compiler for the > build requirements, which might make this patch obsolete. > > Jeff
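The shape of the workaround under discussion can be sketched with a hypothetical reduction (this is not GCC's actual rtl.h; poly_like and rtx_like_union are stand-in names for poly_uint16 and rtx_def's union u):

```cpp
// g++ 4.8.5 wrongly treats some trivially-constructible class types as
// having non-trivial constructors, and then rejects any union containing
// such a member inside a statically constructed object.  Giving the
// union an explicit, empty (not defaulted) constructor sidesteps the
// bogus diagnostic without changing the union's layout.
struct poly_like                  // stand-in for poly_uint16
{
  unsigned short coeffs[1];
};

union rtx_like_union              // stand-in for rtx_def's union u
{
  rtx_like_union () {}            // the one-line workaround
  poly_like p;
  int i;
};

static rtx_like_union global_u;   // must be constructible at static-init time
```

With a defaulted (`= default`) or absent constructor, g++ 4.8.5 reported the union as non-trivially constructible; the explicit empty body avoids that path in the old front end while newer compilers accept either form.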
RE: [PATCH] PR 91865: Avoid ZERO_EXTEND of ZERO_EXTEND in make_compound_operation.
Hi Jeff, Thanks for the speedy review(s). > From: Jeff Law > Sent: 15 October 2023 00:03 > To: Roger Sayle ; gcc-patches@gcc.gnu.org > Subject: Re: [PATCH] PR 91865: Avoid ZERO_EXTEND of ZERO_EXTEND in > make_compound_operation. > > On 10/14/23 16:14, Roger Sayle wrote: > > > > This patch is my proposed solution to PR rtl-optimization/91865. > > Normally RTX simplification canonicalizes a ZERO_EXTEND of a > > ZERO_EXTEND to a single ZERO_EXTEND, but as shown in this PR it is > > possible for combine's make_compound_operation to unintentionally > > generate a non-canonical ZERO_EXTEND of a ZERO_EXTEND, which is > > unlikely to be matched by the backend. > > > > For the new test case: > > > > const int table[2] = {1, 2}; > > int foo (char i) { return table[i]; } > > > > compiling with -O2 -mlarge on msp430 we currently see: > > > > Trying 2 -> 7: > > 2: r25:HI=zero_extend(R12:QI) > >REG_DEAD R12:QI > > 7: r28:PSI=sign_extend(r25:HI)#0 > >REG_DEAD r25:HI > > Failed to match this instruction: > > (set (reg:PSI 28 [ iD.1772 ]) > > (zero_extend:PSI (zero_extend:HI (reg:QI 12 R12 [ iD.1772 ])))) > > > > which results in the following code: > > > > foo: AND #0xff, R12 > > RLAM.A #4, R12 { RRAM.A #4, R12 > > RLAM.A #1, R12 > > MOVX.W table(R12), R12 > > RETA > > > > With this patch, we now see: > > > > Trying 2 -> 7: > > 2: r25:HI=zero_extend(R12:QI) > >REG_DEAD R12:QI > > 7: r28:PSI=sign_extend(r25:HI)#0 > >REG_DEAD r25:HI > > Successfully matched this instruction: > > (set (reg:PSI 28 [ iD.1772 ]) > > (zero_extend:PSI (reg:QI 12 R12 [ iD.1772 ]))) allowing > > combination of insns 2 and 7 original costs 4 + 8 = 12 replacement > > cost 8 > > > > foo: MOV.B R12, R12 > > RLAM.A #1, R12 > > MOVX.W table(R12), R12 > > RETA > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline?
> > > > 2023-10-14 Roger Sayle > > > > gcc/ChangeLog > > PR rtl-optimization/91865 > > * combine.cc (make_compound_operation): Avoid creating a > > ZERO_EXTEND of a ZERO_EXTEND. > > > > gcc/testsuite/ChangeLog > > PR rtl-optimization/91865 > > * gcc.target/msp430/pr91865.c: New test case. > Neither an ACK nor a NAK at this point. > > The bug report includes a patch from Segher which purports to fix this in > simplify-rtx. Any thoughts on Segher's approach and whether or not it should be > considered? > > The BZ also indicates that removal of 2 patterns from msp430.md would solve > this > too (though it may cause regressions elsewhere?). Any thoughts on that > approach > as well? > Great questions. I believe Segher's proposed patch (in comment #4) was an msp430-specific proof-of-concept workaround rather than intended to be a fix. Eliminating a ZERO_EXTEND simply by changing the mode of a hard register is not a solution that'll work on many platforms (and therefore not really suitable for target-independent middle-end code in the RTL optimizers). For example, zero_extend:TI (and:QI (reg:QI hard_r1) (const_int 0x0f)) can't universally be reduced to (and:TI (reg:TI hard_r1) (const_int 0x0f)). Notice that Segher's code doesn't check TARGET_HARD_REGNO_MODE_OK or TARGET_MODES_TIEABLE_P or any of the other backend hooks necessary to confirm such a transformation is safe/possible. Secondly, the hard register aspect is a bit of a red herring. This work-around fixes the issue in the original BZ description, but not the slightly modified test case in comment #2 (with a global variable). This doesn't have a hard register, but does have the dubious ZERO_EXTEND/SIGN_EXTEND of a ZERO_EXTEND. The underlying issue, which is applicable to all targets, is that combine.cc's make_compound_operation is expected to reverse the local transformations made by expand_compound_operation.
Hence, if an RTL expression is canonical going into expand_compound_operation, it is expected (hoped) to be canonical (and equivalent) coming out of make_compound_operation. Hence, rather than being an MSP430-specific issue, no target should expect (or be expected to see) a ZERO_EXTEND of a ZERO_EXTEND, or a SIGN_EXTEND of a ZERO_EXTEND in the RTL stream. Much like a binary operator with two CONST_INT operands, or a shift by zero, it's something the middle-end might reasonably be expected to
RE: [ARC PATCH] Split asl dst, 1, src into bset dst, 0, src to implement 1<<x.
I've done it again. ENOPATCH. From: Roger Sayle Sent: 15 October 2023 09:13 To: 'gcc-patches@gcc.gnu.org' Cc: 'Claudiu Zissulescu' Subject: [ARC PATCH] Split asl dst,1,src into bset dst,0,src to implement 1<<x. > gcc/ChangeLog * config/arc/arc.md (*ashlsi3_1): New pre-reload splitter to use bset dst,0,src to implement 1<<x. diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md index a936a8b..22af0bf 100644 --- a/gcc/config/arc/arc.md +++ b/gcc/config/arc/arc.md @@ -3421,6 +3421,22 @@ archs4x, archs4xd" (set_attr "predicable" "no,no,yes,no,no") (set_attr "cond" "nocond,canuse,canuse,nocond,nocond")]) +;; Split asl dst,1,src into bset dst,0,src. +(define_insn_and_split "*ashlsi3_1" + [(set (match_operand:SI 0 "dest_reg_operand") + (ashift:SI (const_int 1) + (match_operand:SI 1 "nonmemory_operand")))] + "!TARGET_BARREL_SHIFTER + && arc_pre_reload_split ()" + "#" + "&& 1" + [(set (match_dup 0) + (ior:SI (ashift:SI (const_int 1) (match_dup 1)) + (const_int 0)))] + "" + [(set_attr "type" "shift") + (set_attr "length" "8")]) + (define_insn_and_split "*ashlsi3_nobs" [(set (match_operand:SI 0 "dest_reg_operand") (ashift:SI (match_operand:SI 1 "register_operand")
[ARC PATCH] Split asl dst, 1, src into bset dst, 0, src to implement 1<<x.
This patch adds a pre-reload splitter to arc.md, to use the bset (set specific bit instruction) to implement 1<<x. gcc/ChangeLog * config/arc/arc.md (*ashlsi3_1): New pre-reload splitter to use bset dst,0,src to implement 1<<x.
[PATCH] Improved RTL expansion of 1LL << x.
This patch improves the initial RTL expanded for double word shifts on architectures with conditional moves, so that later passes don't need to clean-up unnecessary and/or unused instructions. Consider the general case, x << y, which is expanded well as: t1 = y & 32; t2 = 0; t3 = x_lo >> 1; t4 = y ^ ~0; t5 = t3 >> t4; tmp_hi = x_hi << y; tmp_hi |= t5; tmp_lo = x_lo << y; out_hi = t1 ? tmp_lo : tmp_hi; out_lo = t1 ? t2 : tmp_lo; which is nearly optimal, the only thing that can be improved is that using a unary NOT operation "t4 = ~y" is better than XOR with -1, on targets that support it. [Note the one_cmpl_optab expander didn't fall back to XOR when this code was originally written, but has been improved since]. Now consider the relatively common idiom of 1LL << y, which currently produces the RTL equivalent of: t1 = y & 32; t2 = 0; t3 = 1 >> 1; t4 = y ^ ~0; t5 = t3 >> t4; tmp_hi = 0 << y; tmp_hi |= t5; tmp_lo = 1 << y; out_hi = t1 ? tmp_lo : tmp_hi; out_lo = t1 ? t2 : tmp_lo; Notice here that t3 is always zero, so the assignment of t5 is a variable shift of zero, which expands to a loop on many smaller targets, a similar shift by zero in the first tmp_hi assignment (another loop), that the value of t4 is no longer required (as t3 is zero), and that the ultimate value of tmp_hi is always zero. Fortunately, for many (but perhaps not all) targets this mess gets cleaned up by later optimization passes. However, this patch avoids generating unnecessary RTL at expand time, by calling simplify_expand_binop instead of expand_binop, and avoiding generating dead or unnecessary code when intermediate values are known to be zero. For the 1LL << y test case above, we now generate: t1 = y & 32; t2 = 0; tmp_hi = 0; tmp_lo = 1 << y; out_hi = t1 ? tmp_lo : tmp_hi; out_lo = t1 ? t2 : tmp_lo; On arc-elf, for example, there are 18 RTL INSN_P instructions generated by expand before this patch, but only 12 with this patch (improving both compile-time and memory usage). 
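The word-by-word expansion described above can be written out directly for a 64-bit left shift built from 32-bit words (a sketch: variable names follow the t1..t5 pseudocode, with shift counts masked to 5 bits as on targets that truncate shift counts to the word size):

```cpp
#include <cstdint>

// C++ rendering of the double-word shift-left expansion for
// 0 <= y < 64: compute both the "small shift" (y < 32) and "big shift"
// (y >= 32) candidates, then select with the y & 32 test, exactly as
// in the t1..t5 RTL sequence quoted above.
uint64_t shl64_by_parts (uint32_t x_lo, uint32_t x_hi, unsigned y)
{
  unsigned t1 = y & 32;            // does the count cross a word boundary?
  uint32_t y5 = y & 31;            // count truncated to the word size
  uint32_t t3 = x_lo >> 1;
  uint32_t t4 = ~y & 31;           // the ~y trick; equals 31 - (y & 31)
  uint32_t t5 = t3 >> t4;          // bits carried from low to high word
  uint32_t tmp_hi = (x_hi << y5) | t5;
  uint32_t tmp_lo = x_lo << y5;
  uint32_t out_hi = t1 ? tmp_lo : tmp_hi;
  uint32_t out_lo = t1 ? 0u : tmp_lo;
  return ((uint64_t) out_hi << 32) | out_lo;
}
```

Note how setting x_lo = 1 and x_hi = 0 makes t3, t5 and the x_hi << y5 term identically zero, which is precisely the dead code the patch avoids emitting for the 1LL << y idiom.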
This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-10-15 Roger Sayle gcc/ChangeLog * optabs.cc (expand_subword_shift): Call simplify_expand_binop instead of expand_binop. Optimize cases (i.e. avoid generating RTL) when CARRIES or INTO_INPUT is zero. Use one_cmpl_optab (i.e. NOT) instead of xor_optab with ~0 to calculate ~OP1. Thanks in advance, Roger -- diff --git a/gcc/optabs.cc b/gcc/optabs.cc index e1898da..f0a048a 100644 --- a/gcc/optabs.cc +++ b/gcc/optabs.cc @@ -533,15 +533,13 @@ expand_subword_shift (scalar_int_mode op1_mode, optab binoptab, has unknown behavior. Do a single shift first, then shift by the remainder. It's OK to use ~OP1 as the remainder if shift counts are truncated to the mode size. */ - carries = expand_binop (word_mode, reverse_unsigned_shift, - outof_input, const1_rtx, 0, unsignedp, methods); - if (shift_mask == BITS_PER_WORD - 1) - { - tmp = immed_wide_int_const - (wi::minus_one (GET_MODE_PRECISION (op1_mode)), op1_mode); - tmp = simplify_expand_binop (op1_mode, xor_optab, op1, tmp, - 0, true, methods); - } + carries = simplify_expand_binop (word_mode, reverse_unsigned_shift, + outof_input, const1_rtx, 0, + unsignedp, methods); + if (carries == const0_rtx) + tmp = const0_rtx; + else if (shift_mask == BITS_PER_WORD - 1) + tmp = expand_unop (op1_mode, one_cmpl_optab, op1, 0, true); else { tmp = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1, @@ -552,22 +550,29 @@ expand_subword_shift (scalar_int_mode op1_mode, optab binoptab, } if (tmp == 0 || carries == 0) return false; - carries = expand_binop (word_mode, reverse_unsigned_shift, - carries, tmp, 0, unsignedp, methods); + if (carries != const0_rtx && tmp != const0_rtx) +carries = simplify_expand_binop (word_mode, reverse_unsigned_shift, +carries, tmp, 0, unsignedp, methods); if (carries == 0) return false; - /* Shift INTO_INPUT logically by OP1. 
This is the last use of INTO_INPUT - so the result can go directly into INTO_TARGET if convenient. */ - tmp = expand_binop (word_mode, unsigned_shift, into_input, op1, - into_target, unsignedp, methods); - if (tmp == 0) -return false; + if (into_input != const0_rt
[PATCH] PR 91865: Avoid ZERO_EXTEND of ZERO_EXTEND in make_compound_operation.
This patch is my proposed solution to PR rtl-optimization/91865. Normally RTX simplification canonicalizes a ZERO_EXTEND of a ZERO_EXTEND to a single ZERO_EXTEND, but as shown in this PR it is possible for combine's make_compound_operation to unintentionally generate a non-canonical ZERO_EXTEND of a ZERO_EXTEND, which is unlikely to be matched by the backend. For the new test case: const int table[2] = {1, 2}; int foo (char i) { return table[i]; } compiling with -O2 -mlarge on msp430 we currently see: Trying 2 -> 7: 2: r25:HI=zero_extend(R12:QI) REG_DEAD R12:QI 7: r28:PSI=sign_extend(r25:HI)#0 REG_DEAD r25:HI Failed to match this instruction: (set (reg:PSI 28 [ iD.1772 ]) (zero_extend:PSI (zero_extend:HI (reg:QI 12 R12 [ iD.1772 ])))) which results in the following code: foo: AND #0xff, R12 RLAM.A #4, R12 { RRAM.A #4, R12 RLAM.A #1, R12 MOVX.W table(R12), R12 RETA With this patch, we now see: Trying 2 -> 7: 2: r25:HI=zero_extend(R12:QI) REG_DEAD R12:QI 7: r28:PSI=sign_extend(r25:HI)#0 REG_DEAD r25:HI Successfully matched this instruction: (set (reg:PSI 28 [ iD.1772 ]) (zero_extend:PSI (reg:QI 12 R12 [ iD.1772 ]))) allowing combination of insns 2 and 7 original costs 4 + 8 = 12 replacement cost 8 foo: MOV.B R12, R12 RLAM.A #1, R12 MOVX.W table(R12), R12 RETA This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-10-14 Roger Sayle gcc/ChangeLog PR rtl-optimization/91865 * combine.cc (make_compound_operation): Avoid creating a ZERO_EXTEND of a ZERO_EXTEND. gcc/testsuite/ChangeLog PR rtl-optimization/91865 * gcc.target/msp430/pr91865.c: New test case.
Thanks in advance, Roger -- diff --git a/gcc/combine.cc b/gcc/combine.cc index 360aa2f25e6..f47ff596782 100644 --- a/gcc/combine.cc +++ b/gcc/combine.cc @@ -8453,6 +8453,9 @@ make_compound_operation (rtx x, enum rtx_code in_code) new_rtx, GET_MODE (XEXP (x, 0))); if (tem) return tem; + /* Avoid creating a ZERO_EXTEND of a ZERO_EXTEND. */ + if (GET_CODE (new_rtx) == ZERO_EXTEND) + new_rtx = XEXP (new_rtx, 0); SUBST (XEXP (x, 0), new_rtx); return x; } diff --git a/gcc/testsuite/gcc.target/msp430/pr91865.c b/gcc/testsuite/gcc.target/msp430/pr91865.c new file mode 100644 index 000..8cc21c8b9e8 --- /dev/null +++ b/gcc/testsuite/gcc.target/msp430/pr91865.c @@ -0,0 +1,8 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mlarge" } */ + +const int table[2] = {1, 2}; +int foo (char i) { return table[i]; } + +/* { dg-final { scan-assembler-not "AND" } } */ +/* { dg-final { scan-assembler-not "RRAM" } } */
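The canonicalization the patch restores has a direct C++ analogue: a zero-extension of a zero-extension can never differ from a single zero-extension from the narrowest mode, since each widening step only prepends zero bits.

```cpp
#include <cstdint>

// zero_extend:SI (zero_extend:HI (reg:QI)) versus
// zero_extend:SI (reg:QI) -- the two are equivalent for every input,
// which is why combine should never leave the nested form in the
// RTL stream.
uint32_t zext_twice (uint8_t x)
{
  uint16_t mid = (uint16_t) x;    // zero_extend:HI (reg:QI)
  return (uint32_t) mid;          // zero_extend:SI (zero_extend:HI ...)
}

uint32_t zext_once (uint8_t x)
{
  return (uint32_t) x;            // zero_extend:SI (reg:QI)
}
```

The patch simply collapses the inner extension before SUBSTing, so the backend only ever has to match the single-extension form.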
[PATCH] Optimize (ne:SI (subreg:QI (ashift:SI x 7) 0) 0) as (and:SI x 1).
This patch is the middle-end piece of an improvement to PRs 101955 and 106245, that adds a missing simplification to the RTL optimizers. This transformation is to simplify (char)(x << 7) != 0 as x & 1. Technically, the cast can be any truncation, where shift is by one less than the narrower type's precision, setting the most significant (only) bit from the least significant bit. This transformation applies to any target, but it's easy to see (and add a new test case) on x86, where the following function: int f(int a) { return (a << 31) >> 31; } currently gets compiled with -O2 to: foo: movl %edi, %eax sall $7, %eax sarb $7, %al movsbl %al, %eax ret but with this patch, we now generate the slightly simpler: foo: movl %edi, %eax sall $31, %eax sarl $31, %eax ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check with no new failures. Ok for mainline? 2023-10-10 Roger Sayle gcc/ChangeLog PR middle-end/101955 PR tree-optimization/106245 * simplify-rtx.cc (simplify_relational_operation_1): Simplify the RTL (ne:SI (subreg:QI (ashift:SI x 7) 0) 0) to (and:SI x 1). gcc/testsuite/ChangeLog * gcc.target/i386/pr106245-1.c: New test case. Thanks in advance, Roger -- diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index bd9443d..69d8757 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -6109,6 +6109,23 @@ simplify_context::simplify_relational_operation_1 (rtx_code code, break; } + /* (ne:SI (subreg:QI (ashift:SI x 7) 0) 0) -> (and:SI x 1).
*/ + if (code == NE + && op1 == const0_rtx + && (op0code == TRUNCATE + || (partial_subreg_p (op0) + && subreg_lowpart_p (op0))) + && SCALAR_INT_MODE_P (mode) + && STORE_FLAG_VALUE == 1) +{ + rtx tmp = XEXP (op0, 0); + if (GET_CODE (tmp) == ASHIFT + && GET_MODE (tmp) == mode + && CONST_INT_P (XEXP (tmp, 1)) + && is_int_mode (GET_MODE (op0), &int_mode) + && INTVAL (XEXP (tmp, 1)) == GET_MODE_PRECISION (int_mode) - 1) + return simplify_gen_binary (AND, mode, XEXP (tmp, 0), const1_rtx); +} return NULL_RTX; } diff --git a/gcc/testsuite/gcc.target/i386/pr106245-1.c b/gcc/testsuite/gcc.target/i386/pr106245-1.c new file mode 100644 index 000..a0403e9 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106245-1.c @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +int f(int a) +{ +return (a << 31) >> 31; +} + +/* { dg-final { scan-assembler-not "sarb" } } */ +/* { dg-final { scan-assembler-not "movsbl" } } */
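The simplification being added can be spelled out in C++ (a sketch of the identity, not of the simplify-rtx.cc code itself): truncating x << 7 to 8 bits keeps only bit 0 of x, moved into the sign position, so testing the truncated value against zero is the same as x & 1.

```cpp
#include <cstdint>

// (ne:SI (subreg:QI (ashift:SI x 7) 0) 0) versus (and:SI x 1),
// written as two equivalent C++ functions.  The shift is done in
// unsigned arithmetic to stay well-defined for negative inputs.
int lsb_via_trunc_shift (int x)
{
  return (uint8_t) ((unsigned) x << 7) != 0;
}

int lsb_via_and (int x)
{
  return x & 1;
}
```

Since STORE_FLAG_VALUE is 1 on x86, the != comparison and the AND produce the same 0/1 result, which is the precondition checked in the patch.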
[ARC PATCH] Improved SImode shifts and rotates on !TARGET_BARREL_SHIFTER.
This patch completes the ARC back-end's transition to using pre-reload splitters for SImode shifts and rotates on targets without a barrel shifter. The core part is that the shift_si3 define_insn is no longer needed, as shifts and rotates that don't require a loop are split before reload, and then because shift_si3_loop is the only caller of output_shift, both can be significantly cleaned up and simplified. The output_shift function (Claudiu's "the elephant in the room") is renamed output_shift_loop, which handles just the four instruction zero-overhead loop implementations. Aside from the clean-ups, the user visible changes are much improved implementations of SImode shifts and rotates on affected targets. For the function: unsigned int rotr_1 (unsigned int x) { return (x >> 1) | (x << 31); } GCC with -O2 -mcpu=em would previously generate: rotr_1: lsr_s r2,r0 bmsk_s r0,r0,0 ror r0,r0 j_s.d [blink] or_s r0,r0,r2 with this patch, we now generate: j_s.d [blink] ror r0,r0 For the function: unsigned int rotr_31 (unsigned int x) { return (x >> 31) | (x << 1); } GCC with -O2 -mcpu=em would previously generate: rotr_31: mov_s r2,r0 ;4 asl_s r0,r0 add.f 0,r2,r2 rlc r2,0 j_s.d [blink] or_s r0,r0,r2 with this patch we now generate an add.f followed by an adc: rotr_31: add.f r0,r0,r0 j_s.d [blink] add.cs r0,r0,1 Shifts by constants requiring a loop have been improved for even counts by performing two operations in each iteration: int shl10(int x) { return x >> 10; } Previously looked like: shl10: mov.f lp_count, 10 lpnz 2f asr r0,r0 nop 2: # end single insn loop j_s [blink] And now becomes: shl10: mov lp_count,5 lp 2f asr r0,r0 asr r0,r0 2: # end single insn loop j_s [blink] So emulating ARC's SWAP on architectures that don't have it: unsigned int rotr_16 (unsigned int x) { return (x >> 16) | (x << 16); } previously required 10 instructions and ~70 cycles: rotr_16: mov_s r2,r0 ;4 mov.f lp_count, 16 lpnz 2f add r0,r0,r0 nop 2: # end single insn loop mov.f lp_count, 16 lpnz 2f lsr
r2,r2 nop 2: # end single insn loop j_s.d [blink] or_s r0,r0,r2 now becomes just 4 instructions and ~18 cycles: rotr_16: mov lp_count,8 lp 2f ror r0,r0 ror r0,r0 2: # end single insn loop j_s [blink] This patch has been tested with a cross-compiler to arc-linux hosted on x86_64-pc-linux-gnu and (partially) tested with the compile-only portions of the testsuite with no regressions. Ok for mainline, if your own testing shows no issues? 2023-10-07 Roger Sayle gcc/ChangeLog * config/arc/arc-protos.h (output_shift): Rename to... (output_shift_loop): Tweak API to take an explicit rtx_code. (arc_split_ashl): Prototype new function here. (arc_split_ashr): Likewise. (arc_split_lshr): Likewise. (arc_split_rotl): Likewise. (arc_split_rotr): Likewise. * config/arc/arc.cc (output_shift): Delete local prototype. Rename. (output_shift_loop): New function replacing output_shift to output a zero-overhead loop for SImode shifts and rotates on ARC targets without a barrel shifter (i.e. no hardware support for these insns). (arc_split_ashl): New helper function to split *ashlsi3_nobs. (arc_split_ashr): New helper function to split *ashrsi3_nobs. (arc_split_lshr): New helper function to split *lshrsi3_nobs. (arc_split_rotl): New helper function to split *rotlsi3_nobs. (arc_split_rotr): New helper function to split *rotrsi3_nobs. * config/arc/arc.md (any_shift_rotate): New define_code_iterator. (define_code_attr insn): New code attribute to map to pattern name. (si3): New expander unifying previous ashlsi3, ashrsi3 and lshrsi3 define_expands. Adds rotlsi3 and rotrsi3. (*si3_nobs): New define_insn_and_split that unifies the previous *ashlsi3_nobs, *ashrsi3_nobs and *lshrsi3_nobs. We now call arc_split_ in arc.cc to implement each split. (shift_si3): Delete define_insn, all shifts/rotates are now split. (shift_si3_loop): Rename to... (si3_loop): define_insn to handle loop implementations of SImode shifts and rotates, calling output_shift_loop for the template. (rotrsi3): Rename to... 
(*rotrsi3_insn): define_insn for TARGET_BARREL_SHIFTER's ror. (*rotlsi3): New define_insn_and_split to transform left rotates into right rotates before reload. (rotlsi3_cnt1): New define_insn_and_split to implement a le
RE: [X86 PATCH] Implement doubleword right shifts by 1 bit using s[ha]r+rcr.
Grr! I've done it again. ENOPATCH. > -Original Message- > From: Roger Sayle > Sent: 06 October 2023 14:58 > To: 'gcc-patches@gcc.gnu.org' > Cc: 'Uros Bizjak' > Subject: [X86 PATCH] Implement doubleword right shifts by 1 bit using s[ha]r+rcr. > > > This patch tweaks the i386 back-end's ix86_split_ashr and ix86_split_lshr > functions to implement doubleword right shifts by 1 bit, using a shift of the > highpart that sets the carry flag followed by a rotate-carry-right > (RCR) instruction on the lowpart. > > Conceptually this is similar to the recent left shift patch, but with two > complicating factors. The first is that although the RCR sequence is shorter, and is > a ~3x performance improvement on AMD, my micro-benchmarking shows it > ~10% slower on Intel. Hence this patch also introduces a new > X86_TUNE_USE_RCR tuning parameter. The second is that I believe this is the > first time a "rotate-right-through-carry" and a right shift that sets the carry flag > from the least significant bit has been modelled in GCC RTL (on a MODE_CC > target). For this I've used the i386 back-end's UNSPEC_CC_NE which seems > appropriate. Finally rcrsi2 and rcrdi2 are separate define_insns so that we can > use their generator functions. > > For the pair of functions: > unsigned __int128 foo(unsigned __int128 x) { return x >> 1; } > __int128 bar(__int128 x) { return x >> 1; } > > with -O2 -march=znver4 we previously generated: > > foo:movq%rdi, %rax > movq%rsi, %rdx > shrdq $1, %rsi, %rax > shrq%rdx > ret > bar:movq%rdi, %rax > movq%rsi, %rdx > shrdq $1, %rsi, %rax > sarq%rdx > ret > > with this patch we now generate: > > foo:movq%rsi, %rdx > movq%rdi, %rax > shrq%rdx > rcrq%rax > ret > bar:movq%rsi, %rdx > movq%rdi, %rax > sarq%rdx > rcrq%rax > ret > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and > make -k check, both with and without --target_board=unix{-m32} with no new > failures. 
And to provide additional testing, I've also bootstrapped and regression > tested a version of this patch where the RCR is always generated (independent of > the -march target) again with no regressions. Ok for mainline? > > > 2023-10-06 Roger Sayle > > gcc/ChangeLog > * config/i386/i386-expand.c (ix86_split_ashr): Split shifts by > one into ashr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR > or -Oz. > (ix86_split_lshr): Likewise, split shifts by one bit into > lshr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR or -Oz. > * config/i386/i386.h (TARGET_USE_RCR): New backend macro. > * config/i386/i386.md (rcrsi2): New define_insn for rcrl. > (rcrdi2): New define_insn for rcrq. > (3_carry): New define_insn for right shifts that > set the carry flag from the least significant bit, modelled using > UNSPEC_CC_NE. > * config/i386/x86-tune.def (X86_TUNE_USE_RCR): New tuning parameter > controlling use of rcr 1 vs. shrd, which is significantly faster on > AMD processors. > > gcc/testsuite/ChangeLog > * gcc.target/i386/rcr-1.c: New 64-bit test case. > * gcc.target/i386/rcr-2.c: New 32-bit test case. > > > Thanks in advance, > Roger > -- diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index e42ff27..399eb8e 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -6496,6 +6496,22 @@ ix86_split_ashr (rtx *operands, rtx scratch, machine_mode mode) emit_insn (gen_ashr3 (low[0], low[0], GEN_INT (count - half_width))); } + else if (count == 1 + && (TARGET_USE_RCR || optimize_size > 1)) + { + if (!rtx_equal_p (operands[0], operands[1])) + emit_move_insn (operands[0], operands[1]); + if (mode == DImode) + { + emit_insn (gen_ashrsi3_carry (high[0], high[0])); + emit_insn (gen_rcrsi2 (low[0], low[0])); + } + else + { + emit_insn (gen_ashrdi3_carry (high[0], high[0])); + emit_insn (gen_rcrdi2 (low[0], low[0])); + } + } else { gen_shrd = mode == DImode ? 
gen_x86_shrd : gen_x86_64_shrd; @@ -6561,6 +6577,22 @@ ix86_split_lshr (rtx *operands, rtx scratch, machine_mode mode) emit_insn (gen_lshr3 (low[0], low[0], GEN_INT (count - ha
[X86 PATCH] Implement doubleword right shifts by 1 bit using s[ha]r+rcr.
This patch tweaks the i386 back-end's ix86_split_ashr and ix86_split_lshr functions to implement doubleword right shifts by 1 bit, using a shift of the highpart that sets the carry flag followed by a rotate-carry-right (RCR) instruction on the lowpart. Conceptually this is similar to the recent left shift patch, but with two complicating factors. The first is that although the RCR sequence is shorter, and is a ~3x performance improvement on AMD, my micro-benchmarking shows it ~10% slower on Intel. Hence this patch also introduces a new X86_TUNE_USE_RCR tuning parameter. The second is that I believe this is the first time a "rotate-right-through-carry" and a right shift that sets the carry flag from the least significant bit has been modelled in GCC RTL (on a MODE_CC target). For this I've used the i386 back-end's UNSPEC_CC_NE which seems appropriate. Finally rcrsi2 and rcrdi2 are separate define_insns so that we can use their generator functions. For the pair of functions: unsigned __int128 foo(unsigned __int128 x) { return x >> 1; } __int128 bar(__int128 x) { return x >> 1; } with -O2 -march=znver4 we previously generated: foo:movq%rdi, %rax movq%rsi, %rdx shrdq $1, %rsi, %rax shrq%rdx ret bar:movq%rdi, %rax movq%rsi, %rdx shrdq $1, %rsi, %rax sarq%rdx ret with this patch we now generate: foo:movq%rsi, %rdx movq%rdi, %rax shrq%rdx rcrq%rax ret bar:movq%rsi, %rdx movq%rdi, %rax sarq%rdx rcrq%rax ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. And to provide additional testing, I've also bootstrapped and regression tested a version of this patch where the RCR is always generated (independent of the -march target) again with no regressions. Ok for mainline? 2023-10-06 Roger Sayle gcc/ChangeLog * config/i386/i386-expand.c (ix86_split_ashr): Split shifts by one into ashr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR or -Oz. 
(ix86_split_lshr): Likewise, split shifts by one bit into lshr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR or -Oz. * config/i386/i386.h (TARGET_USE_RCR): New backend macro. * config/i386/i386.md (rcrsi2): New define_insn for rcrl. (rcrdi2): New define_insn for rcrq. (3_carry): New define_insn for right shifts that set the carry flag from the least significant bit, modelled using UNSPEC_CC_NE. * config/i386/x86-tune.def (X86_TUNE_USE_RCR): New tuning parameter controlling use of rcr 1 vs. shrd, which is significantly faster on AMD processors. gcc/testsuite/ChangeLog * gcc.target/i386/rcr-1.c: New 64-bit test case. * gcc.target/i386/rcr-2.c: New 32-bit test case. Thanks in advance, Roger --
RE: [X86 PATCH] Split lea into shorter left shift by 2 or 3 bits with -Oz.
Hi Uros, Very many thanks for the speedy reviews. Uros Bizjak wrote: > On Thu, Oct 5, 2023 at 11:06 AM Roger Sayle > wrote: > > > > > > This patch avoids long lea instructions for performing x<<2 and x<<3 > > by splitting them into shorter sal and move (or xchg instructions). > > Because this increases the number of instructions, but reduces the > > total size, its suitable for -Oz (but not -Os). > > > > The impact can be seen in the new test case: > > > > int foo(int x) { return x<<2; } > > int bar(int x) { return x<<3; } > > long long fool(long long x) { return x<<2; } long long barl(long long > > x) { return x<<3; } > > > > where with -O2 we generate: > > > > foo:lea0x0(,%rdi,4),%eax// 7 bytes > > retq > > bar:lea0x0(,%rdi,8),%eax// 7 bytes > > retq > > fool: lea0x0(,%rdi,4),%rax// 8 bytes > > retq > > barl: lea0x0(,%rdi,8),%rax// 8 bytes > > retq > > > > and with -Oz we now generate: > > > > foo:xchg %eax,%edi// 1 byte > > shl$0x2,%eax// 3 bytes > > retq > > bar:xchg %eax,%edi// 1 byte > > shl$0x3,%eax// 3 bytes > > retq > > fool: xchg %rax,%rdi// 2 bytes > > shl$0x2,%rax// 4 bytes > > retq > > barl: xchg %rax,%rdi// 2 bytes > > shl$0x3,%rax// 4 bytes > > retq > > > > Over the entirety of the CSiBE code size benchmark this saves 1347 > > bytes (0.037%) for x86_64, and 1312 bytes (0.036%) with -m32. > > Conveniently, there's already a backend function in i386.cc for > > deciding whether to split an lea into its component instructions, > > ix86_avoid_lea_for_addr, all that's required is an additional clause > > checking for -Oz (i.e. optimize_size > 1). > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board='unix{-m32}' > > with no new failures. 
Additional testing was performed by repeating > > these steps after removing the "optimize_size > 1" condition, so that > > suitable lea instructions were always split [-Oz is not heavily > > tested, so this invoked the new code during the bootstrap and > > regression testing], again with no regressions. Ok for mainline? > > > > > > 2023-10-05 Roger Sayle > > > > gcc/ChangeLog > > * config/i386/i386.cc (ix86_avoid_lea_for_addr): Split LEAs used > > to perform left shifts into shorter instructions with -Oz. > > > > gcc/testsuite/ChangeLog > > * gcc.target/i386/lea-2.c: New test case. > > > > OK, but ... > > @@ -0,0 +1,7 @@ > +/* { dg-do compile { target { ! ia32 } } } */ > > Is there a reason to avoid 32-bit targets? I'd expect that the optimization > also > triggers on x86_32 for 32bit integers. Good catch. You're 100% correct; because the test case just checks that an LEA is not used, and not for the specific sequence of shift instructions used instead, this test also passes with --target_board='unix{-m32}'. I'll remove the target clause from the dg-do compile directive. > +/* { dg-options "-Oz" } */ > +int foo(int x) { return x<<2; } > +int bar(int x) { return x<<3; } > +long long fool(long long x) { return x<<2; } long long barl(long long > +x) { return x<<3; } > +/* { dg-final { scan-assembler-not "lea\[lq\]" } } */ Thanks again. Roger --
RE: [X86 PATCH] Implement doubleword shift left by 1 bit using add+adc.
Doh! ENOPATCH. > -Original Message- > From: Roger Sayle > Sent: 05 October 2023 12:44 > To: 'gcc-patches@gcc.gnu.org' > Cc: 'Uros Bizjak' > Subject: [X86 PATCH] Implement doubleword shift left by 1 bit using add+adc. > > > This patch tweaks the i386 back-end's ix86_split_ashl to implement doubleword > left shifts by 1 bit, using an add followed by an add-with-carry (i.e. a doubleword > x+x) instead of using the x86's shld instruction. > The replacement sequence both requires fewer bytes and is faster on both Intel > and AMD architectures (from Agner Fog's latency tables and confirmed by my > own microbenchmarking). > > For the test case: > __int128 foo(__int128 x) { return x << 1; } > > with -O2 we previously generated: > > foo:movq%rdi, %rax > movq%rsi, %rdx > shldq $1, %rdi, %rdx > addq%rdi, %rax > ret > > with this patch we now generate: > > foo:movq%rdi, %rax > movq%rsi, %rdx > addq%rdi, %rax > adcq%rsi, %rdx > ret > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and > make -k check, both with and without --target_board=unix{-m32} with no new > failures. Ok for mainline? > > > 2023-10-05 Roger Sayle > > gcc/ChangeLog > * config/i386/i386-expand.cc (ix86_split_ashl): Split shifts by > one into add3_cc_overflow_1 followed by add3_carry. > * config/i386/i386.md (@add3_cc_overflow_1): Renamed from > "*add3_cc_overflow_1" to provide generator function. > > gcc/testsuite/ChangeLog > * gcc.target/i386/ashldi3-2.c: New 32-bit test case. > * gcc.target/i386/ashlti3-3.c: New 64-bit test case. 
> > > Thanks in advance, > Roger > -- diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index e42ff27..09e41c8 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -6342,6 +6342,18 @@ ix86_split_ashl (rtx *operands, rtx scratch, machine_mode mode) if (count > half_width) ix86_expand_ashl_const (high[0], count - half_width, mode); } + else if (count == 1) + { + if (!rtx_equal_p (operands[0], operands[1])) + emit_move_insn (operands[0], operands[1]); + rtx x3 = gen_rtx_REG (CCCmode, FLAGS_REG); + rtx x4 = gen_rtx_LTU (mode, x3, const0_rtx); + half_mode = mode == DImode ? SImode : DImode; + emit_insn (gen_add3_cc_overflow_1 (half_mode, low[0], +low[0], low[0])); + emit_insn (gen_add3_carry (half_mode, high[0], high[0], high[0], +x3, x4)); + } else { gen_shld = mode == DImode ? gen_x86_shld : gen_x86_64_shld; diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index eef8a0e..6a5bc16 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -8864,7 +8864,7 @@ [(set_attr "type" "alu") (set_attr "mode" "")]) -(define_insn "*add3_cc_overflow_1" +(define_insn "@add3_cc_overflow_1" [(set (reg:CCC FLAGS_REG) (compare:CCC (plus:SWI diff --git a/gcc/testsuite/gcc.target/i386/ashldi3-2.c b/gcc/testsuite/gcc.target/i386/ashldi3-2.c new file mode 100644 index 000..053389d --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/ashldi3-2.c @@ -0,0 +1,10 @@ +/* { dg-do compile { target ia32 } } */ +/* { dg-options "-O2 -mno-stv" } */ + +long long foo(long long x) +{ + return x << 1; +} + +/* { dg-final { scan-assembler "adcl" } } */ +/* { dg-final { scan-assembler-not "shldl" } } */ diff --git a/gcc/testsuite/gcc.target/i386/ashlti3-3.c b/gcc/testsuite/gcc.target/i386/ashlti3-3.c new file mode 100644 index 000..4f14ca0 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/ashlti3-3.c @@ -0,0 +1,10 @@ +/* { dg-do compile { target int128 } } */ +/* { dg-options "-O2" } */ + +__int128 foo(__int128 x) +{ + return x 
<< 1; +} + +/* { dg-final { scan-assembler "adcq" } } */ +/* { dg-final { scan-assembler-not "shldq" } } */
[X86 PATCH] Implement doubleword shift left by 1 bit using add+adc.
This patch tweaks the i386 back-end's ix86_split_ashl to implement doubleword left shifts by 1 bit, using an add followed by an add-with-carry (i.e. a doubleword x+x) instead of using the x86's shld instruction. The replacement sequence both requires fewer bytes and is faster on both Intel and AMD architectures (from Agner Fog's latency tables and confirmed by my own microbenchmarking). For the test case: __int128 foo(__int128 x) { return x << 1; } with -O2 we previously generated: foo:movq%rdi, %rax movq%rsi, %rdx shldq $1, %rdi, %rdx addq%rdi, %rax ret with this patch we now generate: foo:movq%rdi, %rax movq%rsi, %rdx addq%rdi, %rax adcq%rsi, %rdx ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-10-05 Roger Sayle gcc/ChangeLog * config/i386/i386-expand.cc (ix86_split_ashl): Split shifts by one into add3_cc_overflow_1 followed by add3_carry. * config/i386/i386.md (@add3_cc_overflow_1): Renamed from "*add3_cc_overflow_1" to provide generator function. gcc/testsuite/ChangeLog * gcc.target/i386/ashldi3-2.c: New 32-bit test case. * gcc.target/i386/ashlti3-3.c: New 64-bit test case. Thanks in advance, Roger --
[X86 PATCH] Split lea into shorter left shift by 2 or 3 bits with -Oz.
This patch avoids long lea instructions for performing x<<2 and x<<3 by splitting them into shorter sal and move (or xchg) instructions. Because this increases the number of instructions, but reduces the total size, it's suitable for -Oz (but not -Os). The impact can be seen in the new test case: int foo(int x) { return x<<2; } int bar(int x) { return x<<3; } long long fool(long long x) { return x<<2; } long long barl(long long x) { return x<<3; } where with -O2 we generate: foo: lea 0x0(,%rdi,4),%eax // 7 bytes retq bar: lea 0x0(,%rdi,8),%eax // 7 bytes retq fool: lea 0x0(,%rdi,4),%rax // 8 bytes retq barl: lea 0x0(,%rdi,8),%rax // 8 bytes retq and with -Oz we now generate: foo: xchg %eax,%edi // 1 byte shl $0x2,%eax // 3 bytes retq bar: xchg %eax,%edi // 1 byte shl $0x3,%eax // 3 bytes retq fool: xchg %rax,%rdi // 2 bytes shl $0x2,%rax // 4 bytes retq barl: xchg %rax,%rdi // 2 bytes shl $0x3,%rax // 4 bytes retq Over the entirety of the CSiBE code size benchmark this saves 1347 bytes (0.037%) for x86_64, and 1312 bytes (0.036%) with -m32. Conveniently, there's already a backend function in i386.cc for deciding whether to split an lea into its component instructions, ix86_avoid_lea_for_addr; all that's required is an additional clause checking for -Oz (i.e. optimize_size > 1). This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board='unix{-m32}' with no new failures. Additional testing was performed by repeating these steps after removing the "optimize_size > 1" condition, so that suitable lea instructions were always split [-Oz is not heavily tested, so this invoked the new code during the bootstrap and regression testing], again with no regressions. Ok for mainline? 2023-10-05 Roger Sayle gcc/ChangeLog * config/i386/i386.cc (ix86_avoid_lea_for_addr): Split LEAs used to perform left shifts into shorter instructions with -Oz. gcc/testsuite/ChangeLog * gcc.target/i386/lea-2.c: New test case. 
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index 477e6ce..9557bff 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -15543,6 +15543,13 @@ ix86_avoid_lea_for_addr (rtx_insn *insn, rtx operands[]) && (regno0 == regno1 || regno0 == regno2)) return true; + /* Split with -Oz if the encoding requires fewer bytes. */ + if (optimize_size > 1 + && parts.scale > 1 + && !parts.base + && (!parts.disp || parts.disp == const0_rtx)) +return true; + /* Check we need to optimize. */ if (!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun)) return false; diff --git a/gcc/testsuite/gcc.target/i386/lea-2.c b/gcc/testsuite/gcc.target/i386/lea-2.c new file mode 100644 index 000..20aded8 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/lea-2.c @@ -0,0 +1,7 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-Oz" } */ +int foo(int x) { return x<<2; } +int bar(int x) { return x<<3; } +long long fool(long long x) { return x<<2; } +long long barl(long long x) { return x<<3; } +/* { dg-final { scan-assembler-not "lea\[lq\]" } } */
[PATCH] Support g++ 4.8 as a host compiler.
The recent patch to remove poly_int_pod triggers a bug in g++ 4.8.5's C++11 support, which mistakenly believes poly_uint16 has a non-trivial constructor. This in turn prohibits it from being used as a member in a union (rtxunion) that is constructed statically, resulting in a (fatal) error during stage 1. A workaround is to add an explicit constructor to the problematic union, which allows mainline to be bootstrapped with the system compiler on older RedHat 7 systems. This patch has been tested on x86_64-pc-linux-gnu where it allows a bootstrap to complete when using g++ 4.8.5 as the host compiler. Ok for mainline? 2023-10-04 Roger Sayle gcc/ChangeLog * rtl.h (rtx_def::u): Add explicit constructor to work around an issue using g++ 4.8 as a host compiler. diff --git a/gcc/rtl.h b/gcc/rtl.h index 6850281..a7667f5 100644 --- a/gcc/rtl.h +++ b/gcc/rtl.h @@ -451,6 +451,9 @@ struct GTY((desc("0"), tag("0"), struct fixed_value fv; struct hwivec_def hwiv; struct const_poly_int_def cpi; +#if defined(__GNUC__) && GCC_VERSION < 5000 +u () {} +#endif } GTY ((special ("rtx_def"), desc ("GET_CODE (&%0)"))) u; };
PING: PR rtl-optimization/110701
There are a small handful of middle-end maintainers/reviewers who understand and appreciate the difference between the RTL statements: (set (subreg:HI (reg:SI x)) (reg:HI y)) and (set (strict_low_part (subreg:HI (reg:SI x))) (reg:HI y)) If one (or more) of them could please take a look at https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625532.html I'd very much appreciate it (one less wrong-code regression). Many thanks in advance, Roger --
RE: [ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER.
Hi Claudiu, Thanks for the answers to my technical questions. If you'd prefer to update arc.md's add3 pattern first, I'm happy to update/revise my patch based on this and your feedback, for example preferring add over asl_s (or controlling this choice with -Os). Thanks again. Roger -- > -Original Message- > From: Claudiu Zissulescu > Sent: 03 October 2023 15:26 > To: Roger Sayle ; gcc-patches@gcc.gnu.org > Subject: RE: [ARC PATCH] Split SImode shifts pre-reload on > !TARGET_BARREL_SHIFTER. > > Hi Roger, > > It was nice to meet you too. > > Thank you in looking into the ARC's non-Barrel Shifter configurations. I will dive > into your patch asap, but before starting here are a few of my comments: > > -Original Message- > From: Roger Sayle > Sent: Thursday, September 28, 2023 2:27 PM > To: gcc-patches@gcc.gnu.org > Cc: Claudiu Zissulescu > Subject: [ARC PATCH] Split SImode shifts pre-reload on > !TARGET_BARREL_SHIFTER. > > > Hi Claudiu, > It was great meeting up with you and the Synopsys ARC team at the GNU tools > Cauldron in Cambridge. > > This patch is the first in a series to improve SImode and DImode shifts and rotates > in the ARC backend. This first piece splits SImode shifts, for > !TARGET_BARREL_SHIFTER targets, after combine and before reload, in the split1 > pass, as suggested by the FIXME comment above output_shift in arc.cc. To do > this I've copied the implementation of the x86_pre_reload_split function from > i386 backend, and renamed it arc_pre_reload_split. > > Although the actual implementations of shifts remain the same (as in > output_shift), having them as explicit instructions in the RTL stream allows better > scheduling and use of compact forms when available. The benefits can be seen in > two short examples below. 
> > For the function: > unsigned int foo(unsigned int x, unsigned int y) { > return y << 2; > } > > GCC with -O2 -mcpu=em would previously generate: > foo:add r1,r1,r1 > add r1,r1,r1 > j_s.d [blink] > mov_s r0,r1 ;4 > > [CZI] The move shouldn't be generated indeed. The use of ADDs are slightly > beneficial for older ARCv1 arches. > > and with this patch now generates: > foo:asl_s r0,r1 > j_s.d [blink] > asl_s r0,r0 > > [CZI] Nice. This new sequence is as fast as we can get for our ARCv2 cpus. > > Notice the original (from shift_si3's output_shift) requires the shift sequence to be > monolithic with the same destination register as the source (requiring an extra > mov_s). The new version can eliminate this move, and schedule the second asl in > the branch delay slot of the return. > > For the function: > int x,y,z; > > void bar() > { > x <<= 3; > y <<= 3; > z <<= 3; > } > > GCC -O2 -mcpu=em currently generates: > bar:push_s r13 > ld.as r12,[gp,@x@sda] ;23 > ld.as r3,[gp,@y@sda] ;23 > mov r2,0 > add3 r12,r2,r12 > mov r2,0 > add3 r3,r2,r3 > ld.as r2,[gp,@z@sda] ;23 > st.as r12,[gp,@x@sda] ;26 > mov r13,0 > add3 r2,r13,r2 > st.as r3,[gp,@y@sda] ;26 > st.as r2,[gp,@z@sda] ;26 > j_s.d [blink] > pop_s r13 > > where each shift by 3, uses ARC's add3 instruction, which is similar to x86's lea > implementing x = (y<<3) + z, but requires the value zero to be placed in a > temporary register "z". Splitting this before reload allows these pseudos to be > shared/reused. 
With this patch, we get > > bar:ld.as r2,[gp,@x@sda] ;23 > mov_s r3,0;3 > add3r2,r3,r2 > ld.as r3,[gp,@y@sda] ;23 > st.as r2,[gp,@x@sda] ;26 > ld.as r2,[gp,@z@sda] ;23 > mov_s r12,0 ;3 > add3r3,r12,r3 > add3r2,r12,r2 > st.as r3,[gp,@y@sda] ;26 > st.as r2,[gp,@z@sda] ;26 > j_s [blink] > > [CZI] Looks great, but it also shows that I've forgot to add to ADD3 instruction the > Ra,LIMM,RC variant, which will lead to have instead of > mov_s r3,0;3 > add3r2,r3,r2 > Only this add3,0,r2, Indeed it is longer instruction but faster. > > Unfortunately, register allocation means that we only share two of the three > "mov_s z,0", but this is sufficient to reduce register pressure enough to avoid > spilling r13 in the prologue/epilogue. > > This patch also contains a (latent?) bug fix. The implementation of the default > insn "length" attribute, assumes instructions of type "shift" have two inpu
RE: [ARC PATCH] Use rlc r0, 0 to implement scc_ltu (i.e. carry_flag ? 1 : 0)
Hi Claudiu, > The patch looks sane. Have you run dejagnu test suite? I've not yet managed to set up an emulator or compile the entire toolchain, so my dejagnu results are only useful for catching (serious) problems in the compile-only tests: === gcc Summary === # of expected passes 91875 # of unexpected failures 23768 # of unexpected successes 23 # of expected failures 1038 # of unresolved testcases 19490 # of unsupported tests 3819 /home/roger/GCC/arc-linux/gcc/xgcc version 14.0.0 20230828 (experimental) (GCC) If someone could double-check there are no issues on real hardware that would be great. I'm not sure if ARC is one of the targets covered by Jeff Law's compile farm? > -Original Message- > From: Roger Sayle > Sent: Friday, September 29, 2023 6:54 PM > To: gcc-patches@gcc.gnu.org > Cc: Claudiu Zissulescu > Subject: [ARC PATCH] Use rlc r0,0 to implement scc_ltu (i.e. carry_flag ? 1 : 0) > > > This patch teaches the ARC backend that the contents of the carry flag can be > placed in an integer register conveniently using the "rlc rX,0" > instruction, which is a rotate-left-through-carry using zero as a source. > This is a convenient special case for the LTU form of the scc pattern. > > unsigned int foo(unsigned int x, unsigned int y) { > return (x+y) < x; > } > > With -O2 -mcpu=em this is currently compiled to: > > foo:add.f 0,r0,r1 > mov_s r0,1;3 > j_s.d [blink] > mov.hs r0,0 > > [which after an addition to set the carry flag, sets r0 to 1, followed by a > conditional assignment of r0 to zero if the carry flag is clear]. With the new > define_insn/optimization in this patch, this becomes: > > foo:add.f 0,r0,r1 > j_s.d [blink] > rlc r0,0 > > This define_insn is also a useful building block for implementing shifts and rotates. > > Tested on a cross-compiler to arc-linux (hosted on x86_64-pc-linux-gnu), and a > partial tool chain, where the new case passes and there are no new regressions. > Ok for mainline? 
> > > 2023-09-29 Roger Sayle > > gcc/ChangeLog > * config/arc/arc.md (CC_ltu): New mode iterator for CC and CC_C. > (scc_ltu_): New define_insn to handle LTU form of scc_insn. > (*scc_insn): Don't split to a conditional move sequence for LTU. > > gcc/testsuite/ChangeLog > * gcc.target/arc/scc-ltu.c: New test case. > > > Thanks in advance, > Roger > --
[ARC PATCH] Use rlc r0, 0 to implement scc_ltu (i.e. carry_flag ? 1 : 0)
This patch teaches the ARC backend that the contents of the carry flag can be placed in an integer register conveniently using the "rlc rX,0" instruction, which is a rotate-left-through-carry using zero as a source. This is a convenient special case for the LTU form of the scc pattern. unsigned int foo(unsigned int x, unsigned int y) { return (x+y) < x; } With -O2 -mcpu=em this is currently compiled to: foo:add.f 0,r0,r1 mov_s r0,1;3 j_s.d [blink] mov.hs r0,0 [which after an addition to set the carry flag, sets r0 to 1, followed by a conditional assignment of r0 to zero if the carry flag is clear]. With the new define_insn/optimization in this patch, this becomes: foo:add.f 0,r0,r1 j_s.d [blink] rlc r0,0 This define_insn is also a useful building block for implementing shifts and rotates. Tested on a cross-compiler to arc-linux (hosted on x86_64-pc-linux-gnu), and a partial tool chain, where the new case passes and there are no new regressions. Ok for mainline? 2023-09-29 Roger Sayle gcc/ChangeLog * config/arc/arc.md (CC_ltu): New mode iterator for CC and CC_C. (scc_ltu_): New define_insn to handle LTU form of scc_insn. (*scc_insn): Don't split to a conditional move sequence for LTU. gcc/testsuite/ChangeLog * gcc.target/arc/scc-ltu.c: New test case. 
Thanks in advance, Roger -- diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md index d37ecbf..fe2e7fb 100644 --- a/gcc/config/arc/arc.md +++ b/gcc/config/arc/arc.md @@ -3658,12 +3658,24 @@ archs4x, archs4xd" (define_expand "scc_insn" [(set (match_operand:SI 0 "dest_reg_operand" "=w") (match_operand:SI 1 ""))]) +(define_mode_iterator CC_ltu [CC_C CC]) + +(define_insn "scc_ltu_" + [(set (match_operand:SI 0 "dest_reg_operand" "=w") +(ltu:SI (reg:CC_ltu CC_REG) (const_int 0)))] + "" + "rlc\\t%0,0" + [(set_attr "type" "shift") + (set_attr "predicable" "no") + (set_attr "length" "4")]) + (define_insn_and_split "*scc_insn" [(set (match_operand:SI 0 "dest_reg_operand" "=w") (match_operator:SI 1 "proper_comparison_operator" [(reg CC_REG) (const_int 0)]))] "" "#" - "reload_completed" + "reload_completed + && GET_CODE (operands[1]) != LTU" [(set (match_dup 0) (const_int 1)) (cond_exec (match_dup 1) diff --git a/gcc/testsuite/gcc.target/arc/scc-ltu.c b/gcc/testsuite/gcc.target/arc/scc-ltu.c new file mode 100644 index 000..653c55d --- /dev/null +++ b/gcc/testsuite/gcc.target/arc/scc-ltu.c @@ -0,0 +1,12 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mcpu=em" } */ + +unsigned int foo(unsigned int x, unsigned int y) +{ + return (x+y) < x; +} + +/* { dg-final { scan-assembler "rlc\\s+r0,0" } } */ +/* { dg-final { scan-assembler "add.f\\s+0,r0,r1" } } */ +/* { dg-final { scan-assembler-not "mov_s\\s+r0,1" } } */ +/* { dg-final { scan-assembler-not "mov\.hs\\s+r0,0" } } */
RE: [RFC] expr: don't clear SUBREG_PROMOTED_VAR_P flag for a promoted subreg [target/111466]
I agree that this looks dubious. Normally, if the middle-end/optimizers wish to reuse a SUBREG in a context where the flags are not valid, it should create a new one with the desired flags, rather than "mutate" an existing (and possibly shared) RTX. I wonder if creating a new SUBREG here also fixes your problem? I'm not sure that clearing SUBREG_PROMOTED_VAR_P is needed at all, but given its motivation has been lost to history, it would good to have a plan B, if Jeff's alpha testing uncovers a subtle issue. Roger -- > -Original Message- > From: Vineet Gupta > Sent: 28 September 2023 22:44 > To: gcc-patches@gcc.gnu.org; Robin Dapp > Cc: kito.ch...@gmail.com; Jeff Law ; Palmer Dabbelt > ; gnu-toolch...@rivosinc.com; Roger Sayle > ; Jakub Jelinek ; Jivan > Hakobyan ; Vineet Gupta > Subject: [RFC] expr: don't clear SUBREG_PROMOTED_VAR_P flag for a promoted > subreg [target/111466] > > RISC-V suffers from extraneous sign extensions, despite/given the ABI guarantee > that 32-bit quantities are sign-extended into 64-bit registers, meaning incoming SI > function args need not be explicitly sign extended (so do SI return values as most > ALU insns implicitly sign-extend too.) > > Existing REE doesn't seem to handle this well and there are various ideas floating > around to smarten REE about it. > > RISC-V also seems to correctly implement middle-end hook PROMOTE_MODE > etc. > > Another approach would be to prevent EXPAND from generating the sign_extend > in the first place which this patch tries to do. > > The hunk being removed was introduced way back in 1994 as >5069803972 ("expand_expr, case CONVERT_EXPR .. clear the promotion flag") > > This survived full testsuite run for RISC-V rv64gc with surprisingly no > fallouts: test results before/after are exactly same. 
>
> |                                | # of unexpected case / # of unique unexpected case
> |                                | gcc      | g++   | gfortran |
> | rv64imafdc_zba_zbb_zbs_zicond/ | 264 / 87 | 5 / 2 | 72 / 12  |
> |   lp64d/medlow
>
> Granted, for something so old to have survived, there must be a valid
> reason. Unfortunately the original change didn't have additional
> commentary or a test case. That is not to say it can't/won't possibly
> break things on other arches/ABIs, hence the RFC for someone to scream
> that this is just bonkers, don't do this :-)
>
> I've explicitly CC'ed Jakub and Roger, who have last touched subreg
> promoted notes in expr.cc, for insight and/or screaming ;-)
>
> Thanks to Robin for narrowing this down in an amazing debugging session
> @ GNU Cauldron.
>
> ```
> foo2:
>         sext.w  a6,a1    <-- this goes away
>         beq     a1,zero,.L4
>         li      a5,0
>         li      a0,0
> .L3:
>         addw    a4,a2,a5
>         addw    a5,a3,a5
>         addw    a0,a4,a0
>         bltu    a5,a6,.L3
>         ret
> .L4:
>         li      a0,0
>         ret
> ```
>
> Signed-off-by: Vineet Gupta
> Co-developed-by: Robin Dapp
> ---
> gcc/expr.cc                               |  7 ---
> gcc/testsuite/gcc.target/riscv/pr111466.c | 15 +++
> 2 files changed, 15 insertions(+), 7 deletions(-)
> create mode 100644 gcc/testsuite/gcc.target/riscv/pr111466.c
>
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index 308ddc09e631..d259c6e53385 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -9332,13 +9332,6 @@ expand_expr_real_2 (sepops ops, rtx target, machine_mode tmode,
>           op0 = expand_expr (treeop0, target, VOIDmode, modifier);
>
> -         /* If the signedness of the conversion differs and OP0 is
> -            a promoted SUBREG, clear that indication since we now
> -            have to do the proper extension.
> -            */
> -         if (TYPE_UNSIGNED (TREE_TYPE (treeop0)) != unsignedp
> -             && GET_CODE (op0) == SUBREG)
> -           SUBREG_PROMOTED_VAR_P (op0) = 0;
> -
>          return REDUCE_BIT_FIELD (op0);
>        }
>
> diff --git a/gcc/testsuite/gcc.target/riscv/pr111466.c
> b/gcc/testsuite/gcc.target/riscv/pr111466.c
> new file mode 100644
> index ..007792466a51
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/pr111466.c
> @@ -0,0 +1,15 @@
> +/* Simplified variant of gcc.target/riscv/zba-adduw.c. */
> +
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv64gc_zba_zbs -mabi=lp64" } */
> +/* { dg-skip-if "" { *-*-* } { "-O0" } } */
> +
> +int foo2(int unused, int n, unsigned y, unsigned delta){
> +  int s = 0;
> +  unsigned int x = 0;
> +  for (;x<n; x += delta)
> +    s += x+y;
> +  return s;
> +}
> +
> +/* { dg-final { scan-assembler "\msext\M" } } */
> --
> 2.34.1
[ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER.
Hi Claudiu,

It was great meeting up with you and the Synopsys ARC team at the GNU tools Cauldron in Cambridge.

This patch is the first in a series to improve SImode and DImode shifts and rotates in the ARC backend. This first piece splits SImode shifts, for !TARGET_BARREL_SHIFTER targets, after combine and before reload, in the split1 pass, as suggested by the FIXME comment above output_shift in arc.cc. To do this I've copied the implementation of the x86_pre_reload_split function from the i386 backend, and renamed it arc_pre_reload_split.

Although the actual implementations of shifts remain the same (as in output_shift), having them as explicit instructions in the RTL stream allows better scheduling and use of compact forms when available. The benefits can be seen in the two short examples below.

For the function:

unsigned int foo(unsigned int x, unsigned int y)
{
  return y << 2;
}

GCC with -O2 -mcpu=em would previously generate:

foo:    add     r1,r1,r1
        add     r1,r1,r1
        j_s.d   [blink]
        mov_s   r0,r1   ;4

and with this patch now generates:

foo:    asl_s   r0,r1
        j_s.d   [blink]
        asl_s   r0,r0

Notice the original (from shift_si3's output_shift) requires the shift sequence to be monolithic with the same destination register as the source (requiring an extra mov_s). The new version can eliminate this move, and schedule the second asl in the branch delay slot of the return.

For the function:

int x,y,z;
void bar()
{
  x <<= 3;
  y <<= 3;
  z <<= 3;
}

GCC -O2 -mcpu=em currently generates:

bar:    push_s  r13
        ld.as   r12,[gp,@x@sda] ;23
        ld.as   r3,[gp,@y@sda]  ;23
        mov     r2,0
        add3    r12,r2,r12
        mov     r2,0
        add3    r3,r2,r3
        ld.as   r2,[gp,@z@sda]  ;23
        st.as   r12,[gp,@x@sda] ;26
        mov     r13,0
        add3    r2,r13,r2
        st.as   r3,[gp,@y@sda]  ;26
        st.as   r2,[gp,@z@sda]  ;26
        j_s.d   [blink]
        pop_s   r13

where each shift by 3 uses ARC's add3 instruction, which is similar to x86's lea, implementing x = (y<<3) + z, but requires the value zero to be placed in a temporary register "z". Splitting this before reload allows these pseudos to be shared/reused.
With this patch, we get:

bar:    ld.as   r2,[gp,@x@sda]  ;23
        mov_s   r3,0    ;3
        add3    r2,r3,r2
        ld.as   r3,[gp,@y@sda]  ;23
        st.as   r2,[gp,@x@sda]  ;26
        ld.as   r2,[gp,@z@sda]  ;23
        mov_s   r12,0   ;3
        add3    r3,r12,r3
        add3    r2,r12,r2
        st.as   r3,[gp,@y@sda]  ;26
        st.as   r2,[gp,@z@sda]  ;26
        j_s     [blink]

Unfortunately, register allocation means that we only share two of the three "mov_s z,0", but this is sufficient to reduce register pressure enough to avoid spilling r13 in the prologue/epilogue.

This patch also contains a (latent?) bug fix. The implementation of the default insn "length" attribute assumes instructions of type "shift" have two input operands and accesses operands[2], hence specializations of shifts that don't have an operands[2] need to be categorized as type "unary" (which results in the correct length).

This patch has been tested on a cross-compiler to arc-elf (hosted on x86_64-pc-linux-gnu), but because I've an incomplete tool chain many of the regression tests fail; however, there are no new failures with the new test cases added below. If you can confirm that there are no issues from additional testing, is this OK for mainline?

Finally, a quick technical question. ARC's zero overhead loops require at least two instructions in the loop, so currently the backend's implementation of shr20 pads the loop body with a "nop".

lshr20: mov.f   lp_count, 20
        lpnz    2f
        lsr     r0,r0
        nop
2:      # end single insn loop
        j_s     [blink]

could this be more efficiently implemented as:

lshr20: mov     lp_count, 10
        lp      2f
        lsr_s   r0,r0
        lsr_s   r0,r0
2:      # end single insn loop
        j_s     [blink]

i.e. half the number of iterations, but doing twice as much useful work in each iteration? Or might the nop be free on advanced microarchitectures, and/or the consecutive dependent shifts cause a pipeline stall?
It would be nice to fuse loops to implement rotations, such that rotr16 (aka swap) would look like:

rot16:  mov_s   r1,r0
        mov     lp_count, 16
        lp      2f
        asl_s   r0,r0
        lsr_s   r1,r1
2:      # end single insn loop
        j_s.d   [blink]
        or_s    r0,r0,r1

Thanks in advance,
Roger

2023-09-28  Roger Sayle

gcc/ChangeLog
        * config/arc/arc-protos.h (emit_shift): Delete prototype.
        (arc_pre_reload_split): New function prototype.
        * config/arc/arc.cc (emit_shift): Delete function.
        (arc_pre_reload_split): New predicate function, copied from i386,
        to schedule define_insn_and_split splitters to the split1 pass.
        * config/arc/arc.md (ashlsi3): Exp
RE: [x86_64 PATCH] Improve __int128 argument passing (in ix86_expand_move).
Hi Manolis,

Many thanks. If you haven't already, could you create/file a bug report at https://gcc.gnu.org/bugzilla/ which ensures this doesn't get lost/forgotten. It provides a PR number for tracking discussions, and patches/fixes with PR numbers are (often) prioritized during the review and approval process.

I'll investigate what's going on. Either my "improvements" need to be disabled for V2SF arguments, or the middle/back end needs to figure out how to efficiently shuffle these values, without reload moving them via integer registers, at least as efficiently as before. As you/clang show, we could do better.

Thanks again, and sorry for any inconvenience.
Best regards,
Roger
--

> -----Original Message-----
> From: Manolis Tsamis
> Sent: 01 September 2023 11:45
> To: Uros Bizjak
> Cc: Roger Sayle ; gcc-patches@gcc.gnu.org
> Subject: Re: [x86_64 PATCH] Improve __int128 argument passing (in
> ix86_expand_move).
>
> Hi Roger,
>
> I've (accidentally) found a codegen regression that I bisected down to
> this patch.
> For these two functions:
>
> typedef struct {
>     float minx, miny;
>     float maxx, maxy;
> } AABB;
>
> int TestOverlap(AABB a, AABB b) {
>     return a.minx <= b.maxx
>         && a.miny <= b.maxy
>         && a.maxx >= b.minx
>         && a.maxx >= b.minx;
> }
>
> int TestOverlap2(AABB a, AABB b) {
>     return a.miny <= b.maxy
>         && a.maxx >= b.minx;
> }
>
> GCC used to produce this code:
>
> TestOverlap:
>         comiss  xmm3, xmm0
>         movq    rdx, xmm0
>         movq    rsi, xmm1
>         movq    rax, xmm3
>         jb      .L10
>         shr     rdx, 32
>         shr     rax, 32
>         movd    xmm0, eax
>         movd    xmm4, edx
>         comiss  xmm0, xmm4
>         jb      .L10
>         movd    xmm1, esi
>         xor     eax, eax
>         comiss  xmm1, xmm2
>         setnb   al
>         ret
> .L10:
>         xor     eax, eax
>         ret
> TestOverlap2:
>         shufps  xmm0, xmm0, 85
>         shufps  xmm3, xmm3, 85
>         comiss  xmm3, xmm0
>         jb      .L17
>         xor     eax, eax
>         comiss  xmm1, xmm2
>         setnb   al
>         ret
> .L17:
>         xor     eax, eax
>         ret
>
> After this patch codegen gets much worse:
>
> TestOverlap:
>         movq    rax, xmm1
>         movq    rdx, xmm2
>         movq    rsi, xmm0
>         mov     rdi, rax
>         movq    rax, xmm3
>         mov     rcx, rsi
>         xchg    rdx, rax
>         movd    xmm1, edx
>         mov     rsi, rax
>         mov     rax, rdx
>         comiss  xmm1, xmm0
>         jb      .L10
>         shr     rcx, 32
>         shr     rax, 32
>         movd    xmm0, eax
>         movd    xmm4, ecx
>         comiss  xmm0, xmm4
>         jb      .L10
>         movd    xmm0, esi
>         movd    xmm1, edi
>         xor     eax, eax
>         comiss  xmm1, xmm0
>         setnb   al
>         ret
> .L10:
>         xor     eax, eax
>         ret
> TestOverlap2:
>         movq    rdx, xmm2
>         movq    rax, xmm3
>         movq    rsi, xmm0
>         xchg    rdx, rax
>         mov     rcx, rsi
>         mov     rsi, rax
>         mov     rax, rdx
>         shr     rcx, 32
>         shr     rax, 32
>         movd    xmm4, ecx
>         movd    xmm0, eax
>         comiss  xmm0, xmm4
>         jb      .L17
>         movd    xmm0, esi
>         xor     eax, eax
>         comiss  xmm1, xmm0
>         setnb   al
>         ret
> .L17:
>         xor     eax, eax
>         ret
>
> I saw that you've been improving i386 argument passing, so maybe this is
> just a missed case of these additions?
>
> (Can also be seen here https://godbolt.org/z/E4xrEn6KW)
>
> PS: I found the code that clang generates, with cmpleps + pextrw to avoid
> the fp->int->fp + shr, interesting. I wonder if something like this could
> be added to GCC as well.
>
> Thanks!
> Manolis
>
> On Thu, Jul 6, 2023 at 5:21 PM Uros Bizjak via Gcc-patches
> <gcc-patc...@gcc.gnu.org> wrote:
> >
> > On Thu, Jul 6, 2023 at 3:48 PM Roger Sayle wrote:
> > >
> > > > On Thu, Jul 6, 2023 at 2:04 PM Roger Sayle
> > > > wrote:
> > > > >
> > > > > Passing 128-bit integer (TImode) parameters on x86_64 can
> > > > > sometimes result in surprising code. Consider the example below
> > > > > (from PR 43644):
> > > > >
> > > > > __uint128 foo(__uint128 x