Re: [ARC PATCH] Improved ARC rtx_costs/insn_cost for SHIFTs and ROTATEs.

2023-10-30 Thread Jeff Law




On 10/29/23 03:16, Roger Sayle wrote:


This patch overhauls the ARC backend's insn_cost target hook, and makes
some related improvements to rtx_costs, BRANCH_COST, etc.  The primary
goal is to allow the backend to indicate that shifts and rotates are
slow (discouraged) when the CPU doesn't have a barrel shifter. I should
also acknowledge Richard Sandiford for inspiring the use of set_cost
in this rewrite of arc_insn_cost; this implementation borrows heavily
from the target hooks for AArch64 and ARM.

The motivating example is derived from PR rtl-optimization/110717.

struct S { int a : 5; };
unsigned int foo (struct S *p) {
   return p->a;
}

With a barrel shifter, GCC -O2 generates the reasonable:

foo:    ldb_s   r0,[r0]
        asl_s   r0,r0,27
        j_s.d   [blink]
        asr_s   r0,r0,27

What's interesting is that during combine, the middle-end actually
has two shifts by three bits, and a sign-extension from QI to SI.

Trying 8, 9 -> 11:
    8: r158:SI=r157:QI#0<<0x3
      REG_DEAD r157:QI
    9: r159:SI=sign_extend(r158:SI#0)
      REG_DEAD r158:SI
   11: r155:SI=r159:SI>>0x3
      REG_DEAD r159:SI

Whilst it's reasonable to simplify this to two shifts by 27 bits when
the CPU has a barrel shifter, it's actually a significant pessimization
when these shifts are implemented by loops.  This combination can be
prevented if the backend provides accurate-ish estimates for insn_cost.
Same scenario on the H8, though we already had the cost issues under
control: a byte load (effectively a shift by 24), 3-bit shifts, and an
extension.


Jeff


Re: [ARC PATCH] Improved ARC rtx_costs/insn_cost for SHIFTs and ROTATEs.

2023-10-30 Thread Claudiu Zissulescu Ianculescu
Hi Roger,

You have a block of 8 spaces that needs to be replaced by tabs:
gcc/config/arc/arc.cc:5538:0:   if (n < 4)

Please fix the above, and proceed with your commit.

Thank you,
Claudiu



[ARC PATCH] Improved ARC rtx_costs/insn_cost for SHIFTs and ROTATEs.

2023-10-29 Thread Roger Sayle

This patch overhauls the ARC backend's insn_cost target hook, and makes
some related improvements to rtx_costs, BRANCH_COST, etc.  The primary
goal is to allow the backend to indicate that shifts and rotates are
slow (discouraged) when the CPU doesn't have a barrel shifter. I should
also acknowledge Richard Sandiford for inspiring the use of set_cost
in this rewrite of arc_insn_cost; this implementation borrows heavily
from the target hooks for AArch64 and ARM.

The motivating example is derived from PR rtl-optimization/110717.

struct S { int a : 5; };
unsigned int foo (struct S *p) {
  return p->a;
}

With a barrel shifter, GCC -O2 generates the reasonable:

foo:    ldb_s   r0,[r0]
        asl_s   r0,r0,27
        j_s.d   [blink]
        asr_s   r0,r0,27

What's interesting is that during combine, the middle-end actually
has two shifts by three bits, and a sign-extension from QI to SI.

Trying 8, 9 -> 11:
    8: r158:SI=r157:QI#0<<0x3
      REG_DEAD r157:QI
    9: r159:SI=sign_extend(r158:SI#0)
      REG_DEAD r158:SI
   11: r155:SI=r159:SI>>0x3
      REG_DEAD r159:SI

Whilst it's reasonable to simplify this to two shifts by 27 bits when
the CPU has a barrel shifter, it's actually a significant pessimization
when these shifts are implemented by loops.  This combination can be
prevented if the backend provides accurate-ish estimates for insn_cost.
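
To make the trade-off concrete, here is a rough C-level view of the two
equivalent extraction sequences (an illustrative sketch with made-up
function names, not code from the patch; it leans on GCC's arithmetic
right shift and modulo narrowing-conversion semantics):

/* Both return p->a sign-extended to 32 bits; the field occupies
   bits 0-4 of the first byte.  */
unsigned int foo_wide_shifts (struct S *p)    /* the combined form */
{
  unsigned int x = *(const unsigned char *) p;
  /* Two shifts by 27: ~54 single-bit steps without a barrel shifter.  */
  return (unsigned int) ((int) (x << 27) >> 27);
}

unsigned int foo_narrow_shifts (struct S *p)  /* the pre-combine form */
{
  unsigned int x = *(const unsigned char *) p;
  int t = (signed char) (x << 3);   /* shift by 3, sign-extend the byte */
  return (unsigned int) (t >> 3);   /* shift by 3: a handful of insns   */
}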


Previously, without a barrel shifter, GCC -O2 -mcpu=em generates:

foo:    ldb_s   r0,[r0]
        mov     lp_count,27
        lp      2f
        add     r0,r0,r0
        nop
2:      # end single insn loop
        mov     lp_count,27
        lp      2f
        asr     r0,r0
        nop
2:      # end single insn loop
        j_s     [blink]

which contains two loops and requires ~113 cycles to execute.
With this patch to rtx_cost/insn_cost, GCC -O2 -mcpu=em generates:

foo:    ldb_s   r0,[r0]
        mov_s   r2,0    ;3
        add3    r0,r2,r0
        sexb_s  r0,r0
        asr_s   r0,r0
        asr_s   r0,r0
        j_s.d   [blink]
        asr_s   r0,r0

which requires only ~6 cycles, using the shorter shifts by 3 and a sign
extension.
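
For reference, the sequence above corresponds roughly to the following C
(an annotated sketch, not compiler output; add3 computes b + (c << 3), and
the narrowing cast relies on GCC's modulo conversion semantics):

unsigned int foo_equiv (struct S *p)
{
  unsigned int x = *(const unsigned char *) p;  /* ldb_s  r0,[r0]             */
  unsigned int t = 0 + (x << 3);                /* mov_s r2,0; add3 r0,r2,r0  */
  int s = (signed char) t;                      /* sexb_s r0,r0               */
  return (unsigned int) (s >> 3);               /* three asr_s: shift by 3    */
}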


Tested with a cross-compiler to arc-linux hosted on x86_64,
with no new (compile-only) regressions from make -k check.
Ok for mainline if this passes Claudiu's nightly testing?


2023-10-29  Roger Sayle  

gcc/ChangeLog
        * config/arc/arc.cc (arc_rtx_costs): Improve cost estimates.
        Provide reasonable values for SHIFTS and ROTATES by constant
        bit counts depending upon TARGET_BARREL_SHIFTER.
        (arc_insn_cost): Use insn attributes if the instruction is
        recognized.  Avoid calling get_attr_length for type "multi",
        i.e. define_insn_and_split patterns without explicit type.
        Fall back to set_rtx_cost for single_set and pattern_cost
        otherwise.
        * config/arc/arc.h (COSTS_N_BYTES): Define helper macro.
        (BRANCH_COST): Improve/correct definition.
        (LOGICAL_OP_NON_SHORT_CIRCUIT): Preserve previous behavior.
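
For readers unfamiliar with the hook, the arc_insn_cost strategy described
in the ChangeLog amounts to roughly the shape below (a hypothetical sketch
only, not the committed implementation; the real attribute handling and
cost formulas differ):

/* Sketch: recognized insns are costed from their attributes; anything
   else falls back to the generic RTL costs.  */
static int
arc_insn_cost (rtx_insn *insn, bool speed)
{
  if (recog_memoized (insn) >= 0)
    {
      /* "multi" is the default type of define_insn_and_split patterns
         without an explicit type, so their length attribute is not yet
         meaningful and should not be queried.  */
      if (!speed && get_attr_type (insn) != TYPE_MULTI)
        return COSTS_N_BYTES (get_attr_length (insn));
    }
  /* Fall back: set_rtx_cost for a single_set, pattern_cost otherwise.  */
  rtx set = single_set (insn);
  return set ? set_rtx_cost (set, speed) : pattern_cost (PATTERN (insn), speed);
}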


Thanks again,
Roger
--

diff --git a/gcc/config/arc/arc.cc b/gcc/config/arc/arc.cc
index 353ac69..ae83e5e 100644
--- a/gcc/config/arc/arc.cc
+++ b/gcc/config/arc/arc.cc
@@ -5492,7 +5492,7 @@ arc_rtx_costs (rtx x, machine_mode mode, int outer_code,
     case CONST:
     case LABEL_REF:
     case SYMBOL_REF:
-      *total = speed ? COSTS_N_INSNS (1) : COSTS_N_INSNS (4);
+      *total = speed ? COSTS_N_INSNS (1) : COSTS_N_BYTES (4);
       return true;
 
     case CONST_DOUBLE:
@@ -5516,26 +5516,32 @@ arc_rtx_costs (rtx x, machine_mode mode, int outer_code,
     case ASHIFT:
     case ASHIFTRT:
     case LSHIFTRT:
+    case ROTATE:
+    case ROTATERT:
+      if (mode == DImode)
+        return false;
       if (TARGET_BARREL_SHIFTER)
         {
-          if (CONSTANT_P (XEXP (x, 0)))
+          *total = COSTS_N_INSNS (1);
+          if (CONSTANT_P (XEXP (x, 1)))
             {
-              *total += rtx_cost (XEXP (x, 1), mode, (enum rtx_code) code,
+              *total += rtx_cost (XEXP (x, 0), mode, (enum rtx_code) code,
                                   0, speed);
               return true;
             }
-          *total = COSTS_N_INSNS (1);
         }
       else if (GET_CODE (XEXP (x, 1)) != CONST_INT)
-        *total = COSTS_N_INSNS (16);
+        *total = speed ? COSTS_N_INSNS (16) : COSTS_N_INSNS (4);
       else
         {
-          *total = COSTS_N_INSNS (INTVAL (XEXP ((x), 1)));
-          /* ??? want_to_gcse_p can throw negative shift counts at us,
-             and then panics when it gets a negative cost as result.
-             Seen for gcc.c-torture/compile/20020710-1.c -Os .  */
-          if (*total < 0)
-            *total = 0;
+          int n = INTVAL (XEXP (x, 1)) & 31;
+          if (n < 4)
+            *total = COSTS_N_INSNS (n);
+          else
+            *total = speed ? COSTS_N_INSNS (n + 2) : COSTS_N_INSNS (4);
+          *total += rtx_cost (XEXP (x, 0), mode, (enum rtx_code) code,
+                              0, speed);
+          return true;