Re: [PATCH] Check if loading const from mem is faster

Jiufu Guo via Gcc-patches Mon, 28 Feb 2022 01:18:25 -0800

Richard Biener <rguent...@suse.de> writes:

> On Fri, 25 Feb 2022, Jiufu Guo wrote:
>
>> Richard Biener <rguent...@suse.de> writes:
>> 
>> > On Fri, 25 Feb 2022, Jiufu Guo wrote:
>> >
>> >> Richard Biener <rguent...@suse.de> writes:
>> >> 
>> >> > On Thu, 24 Feb 2022, Jiufu Guo wrote:
>> >> >
>> >> >> Jiufu Guo via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> >> >> 
>> >> >> > Segher Boessenkool <seg...@kernel.crashing.org> writes:
>> >> >> >
>> >> >> >> On Wed, Feb 23, 2022 at 02:02:59PM +0100, Richard Biener wrote:
>> >> >> >>> I'm assuming we're always dealing with
>> >> >> >>> 
>> >> >> >>>   (set (reg:MODE ..) <src_folded>)
>> >> >> >>> 
>> >> >> >>> here and CSE is not substituting into random places of an
>> >> >> >>> instruction(?).  I don't know what 'rtx_cost' should evaluate
>> >> >> >>> to for a constant, if it should implicitely evaluate the cost
>> >> >> >>> of putting the result into a register for example.
>> >> >> >>
>> >> >> >> rtx_cost is no good here (and in most places).  rtx_cost should be 0
>> >> >> >> for anything that is used as input in a machine instruction -- but 
>> >> >> >> you
>> >> >> >> need much more context to determine that.  insn_cost is much 
>> >> >> >> simpler and
>> >> >> >> much easier to use.
>> >> >> >>
>> >> >> >>> Using RTX_COST with SET and 1 at least looks no worse than using
>> >> >> >>> your proposed new target hook and comparing it with the original
>> >> >> >>> unfolded src (again with SET and 1).
>> >> >> >>
>> >> >> >> It is required to generate valid instructions no matter what, before
>> >> >> >> the pass has finished that is.  On all more modern architectures it 
>> >> >> >> is
>> >> >> >> futile to think you can usefully consider the cost of an RTL 
>> >> >> >> expression
>> >> >> >> and derive a real-world cost of the generated code from that.
>> >> >> >
>> >> >> > Thanks Segher for pointing out these!  Here is  another reason that I
>> >> >> > did not use rtx_cost: in a few passes, there are codes to check the
>> >> >> > constants and store them in constant pool.  I'm thinking to 
>> >> >> > integerate
>> >> >> > those codes in a consistent way.
>> >> >> 
>> >> >> Hi Segher, Richard!
>> >> >> 
>> >> >> I'm thinking the way like: For a constant,
>> >> >> 1. if the constant could be used as an immediate for the
>> >> >> instruction, then retreated as an operand;
>> >> >> 2. otherwise if the constant can not be stored into a
>> >> >> constant pool, then handle through instructions;
>> >> >> 3. if it is faster to access constant from pool, then emit
>> >> >> constant as data(.rodata);
>> >> >> 4. otherwise, handle the constant by instructions.
>> >> >> 
>> >> >> And to store the constant into a pool, besides force_const_mem,
>> >> >> create reference (toc) may be needed on some platforms.
>> >> >> 
>> >> >> For this particular issue in CSE, there is already code that
>> >> >> tries to put constant into a pool (invoke force_const_mem).
>> >> >> While the code is too late.  So, we may check the constant
>> >> >> earlier and store it into constant pool if profitable.
>> >> >> 
>> >> >> And another thing as Segher pointed out, CSE is doing too
>> >> >> much work.  It may be ok to separate the constant handling
>> >> >> logic from CSE.
>> >> >
>> >> > Not sure - CSE just is value numbering, I don't see that it does
>> >> > more than that.  Yes, it might have developed "heuristics" over
>> >> > the years what to CSE and to what and where to substitute and
>> >> > where not.  But in the end it does just value numbering.
>> >> >
>> >> >> 
>> >> >> I update a new version patch as follow (did not seprate CSE):
>> >> >
>> >> > How is the new target hook better in any way compared to rtx_cost
>> >> > or insn_cost?  It looks like a total hack.
>> >> >
>> >> > I suppose the actual way of materializing a constant is done
>> >> > behind GCCs back and not exposed anywhere?  But instead you
>> >> > claim the constants are valid when they actually are not?
>> >> > Isn't the problem then that the rs6000 backend lies?
>> >> 
>> >> Hi Richard,
>> >> 
>> >> Thanks for your comments and sugguestions!
>> >> 
>> >> Materializing a constant should be done behind GCC.
>> >> On rs6000, in expand pass, during emit_move, the constant is
>> >> checked and store into constant pool if necessary.
>> >> Some other platforms are doing a similar thing, e.g.
>> >> ix86_expand_vector_move, alpha_expand_mov,...
>> >> mips_legitimize_const_move.
>> >> 
>> >> But, it does not as we expect, force_const_mem is also 
>> >> exposed other places (not only ira/reload for stack reference).
>> >> 
>> >> CSE is one place, for example, CSE first retrieve the constant
>> >> from insn's equal, but it also tries to put constant into
>> >> pool for some condition (the condition was introduced at
>> >> early age of cse.c, and it is rare to run into in trunk).
>> >> In some aspects, IMHO, this seems not a great work of CSE.
>> >> 
>> >> And this is how the 'invalid(or say slow)' constant occurs.
>> >> e.g.  before cse:
>> >>     7: r119:DI=[unspec[`*.LC0',%r2:DI] 47]
>> >>       REG_EQUAL 0x100803004101001
>> >> after cse: 
>> >>     7: r119:DI=0x100803004101001
>> >>       REG_EQUAL 0x100803004101001
>> >> 
>> >> As you pointed out: we can also avoid this transformation through
>> >> rtx_cost/insn_cost by estimating the COST more accurately for
>> >>  "r119:DI=0x100803004101001". (r119:DI=0x100803004101001 will not
>> >> be a real final instruction.)
>> >
>> > At which point does this become the final instruction?  I suppose
>> > CSE belives this constant is legitimate and the insn is recognized
>> > just fine in CSE?  (that's what I mean with "behind GCCs back")
>> >
>> It would become final instructions during split pass on rs6000,
>> other target, e.g. alpha, seem also do split it.
>> >> Is it necessary to refine this constant handling for CSE, or even
>> >> to eliminate the logic on constant extracting for an insn, beside
>> >> updating rtx_cost/insn_cost?
>> >
>> > So if CSE really just transforms
>> >
>> >>     7: r119:DI=[unspec[`*.LC0',%r2:DI] 47]
>> >>       REG_EQUAL 0x100803004101001
>> >
>> > to
>> >
>> >>     7: r119:DI=0x100803004101001
>> >>       REG_EQUAL 0x100803004101001
>> >
>> > then why can't rtx_cost (SET, 1) or insn_cost () be used to
>> > accurately tell it that the immediate is going to be a lot
>> > more expensive?
>> >
>> > That is, rtx_cost (CONST_INT-rtx, DI, SET, 1, ...) is accurate
>> > enough to be treated as an actual insn (it _might_ be part of
>> > a parallel I guess, but that's unlikely) and thus the backend
>> > should have a very good idea of the many instruction it needs
>> > for this.  Likewise rtx_cost ([unspec[`*.LC0',%r2:DI] 47], DI, SET, 1, 
>> > ...)
>> > should receive accurate cost that can be compared to other
>> > rtx_cost (..., DI, SET, 1, ...)
>> >
>> > And CSE doesn't even need to create fake insns here since IMHO
>> > rtx_cost is good enough to capture the full insn.  Targets can
>> > choose to split out a set_cost from their rtx_cost/insn_cost
>> > hooks for this case for example.
>> 
>> Hi Richard,
>> 
>> Right, we can fix this issue by updating rtx_cost/insn_cost to
>> tell that this kind of constants is a lot of expansive.
>> 
>> To update rtx_cost, we can use a trivial patch (shown as end of
>> this mail) to fix this particular issue.
>> To use insn_cost, additional work is replacing rtx_cost with
>> insn_cost for CSE, maybe more suitable for stage1.
>> 
>> So, it would be too far to fix this by refactoring the logic of
>> constant handling. :-)
>> 
>> Thanks for your comments!
>> 
>> BR,
>> Jiufu
>> 
>> >
>> > Richard.
>> >
>> >> Any sugguestions?
>> >> 
>> >> >
>> >> > Btw, all of this is of course not appropriate for stage4 and changes
>> >> > to CSE need testing on more than one target.
>> >> I would do more evaluation, thanks!
>> >> 
>> >> Jiufu
>> >> 
>> >> >
>> >> > Richard.
>> >> >
>> >> >> Thanks for the comments and suggestions again!
>> >> >> 
>> >> >> 
>> >> >> BR,
>> >> >> Jiufu
>> >> >> 
>> 
>> Part of the experimental patch for rs6000, which could help
>> to mitigate inaccurate rtx cost on constant.
>
> Yes, that sort of thing is exactly what I had in mind.  I assume
> the corresponding insn_costs hook does the same?


Hi Richard!

Thanks for your comments!

On most targets, insn_cost also uses code like COSTS_N_INSNS, while
insn_cost is also checking insn's attributes (like length,cost, size).
Since cse is using rtx_cost, to use insn_cost, we would need to
construct an insn first.

For this idea, I hacked a patch, which is using insn_cost partially,
and could be more aggressively enhanced.

BR,
Jiufu

diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index d7a7cfe860f..b6ca1283b18 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -22131,6 +22131,16 @@ rs6000_debug_rtx_costs (rtx x, machine_mode mode, int 
outer_code,
 static int
 rs6000_insn_cost (rtx_insn *insn, bool speed)
 {
+  /* Handle special 'invalid' insn. */
+  rtx set = PATTERN (insn);
+  if (GET_CODE (set) == SET && CONSTANT_P (SET_SRC (set)))
+    {
+      rtx src = SET_SRC (set);
+      machine_mode mode = GET_MODE (SET_DEST (set));
+      if (CONST_INT_P (src) || CONST_WIDE_INT_P (src) || CONST_DOUBLE_P (src))
+       return COSTS_N_INSNS (num_insns_constant (src, mode));
+    }
+  
   if (recog_memoized (insn) < 0)
     return 0;
 
diff --git a/gcc/cse.cc b/gcc/cse.cc
index a18b599d324..53fdb3fff58 100644
--- a/gcc/cse.cc
+++ b/gcc/cse.cc
@@ -456,6 +456,8 @@ struct table_elt
   (REG_P (X) ? 0 : notreg_cost (X, MODE, SET, 1))
 #define COST_IN(X, MODE, OUTER, OPNO)                                  \
   (REG_P (X) ? 0 : notreg_cost (X, MODE, OUTER, OPNO))
+#define COST_SRC(I, X, MODE)                                           \
+  (REG_P (X) ? 0 : notreg_cost (X, MODE, SET, 1, I))
 
 /* Get the number of times this register has been updated in this
    basic block.  */
@@ -523,7 +525,8 @@ static bitmap cse_ebb_live_in, cse_ebb_live_out;
 static sbitmap cse_visited_basic_blocks;
 
 static bool fixed_base_plus_p (rtx x);
-static int notreg_cost (rtx, machine_mode, enum rtx_code, int);
+static int notreg_cost (rtx, machine_mode, enum rtx_code, int, rtx_insn*);
+static int insn_cost_x (rtx_insn *, rtx);
 static int preferable (int, int, int, int);
 static void new_basic_block (void);
 static void make_new_qty (unsigned int, machine_mode);
@@ -698,7 +701,8 @@ preferable (int cost_a, int regcost_a, int cost_b, int 
regcost_b)
    from COST macro to keep it simple.  */
 
 static int
-notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
+notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno,
+            rtx_insn *insn = NULL)
 {
   scalar_int_mode int_mode, inner_mode;
   return ((GET_CODE (x) == SUBREG
@@ -709,9 +713,20 @@ notreg_cost (rtx x, machine_mode mode, enum rtx_code 
outer, int opno)
           && subreg_lowpart_p (x)
           && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
          ? 0
+         : insn != NULL ? insn_cost_x (insn, x)
          : rtx_cost (x, mode, outer, opno, optimize_this_for_speed_p) * 2);
 }
 
+/* Internal function, to get cost when use X to replace source of insn
+   which is a SET.  */
+
+static int
+insn_cost_x (rtx_insn *insn, rtx x)
+{
+  SET_SRC (PATTERN (insn)) = x;
+  return insn_cost (insn, optimize_this_for_speed_p);
+}
+
 
 /* Initialize CSE_REG_INFO_TABLE.  */
 
@@ -4603,6 +4618,7 @@ cse_insn (rtx_insn *insn)
 
      Nothing in this loop changes the hash table or the register chains.  */
 
+  rtx_insn *tmp_insn = NULL;
   for (i = 0; i < n_sets; i++)
     {
       bool repeat = false;
@@ -4638,6 +4654,10 @@ cse_insn (rtx_insn *insn)
       mode = GET_MODE (src) == VOIDmode ? GET_MODE (dest) : GET_MODE (src);
       sets[i].mode = mode;
 
+      if (tmp_insn == NULL_RTX && src && dest && dest != pc_rtx
+         && src != pc_rtx)
+       tmp_insn = gen_move_insn (copy_rtx (dest), src);
+
       if (src_eqv)
        {
          machine_mode eqvmode = mode;
@@ -5103,7 +5123,7 @@ cse_insn (rtx_insn *insn)
            src_cost = src_regcost = -1;
          else
            {
-             src_cost = COST (src, mode);
+             src_cost = COST_SRC (tmp_insn, src, mode);
              src_regcost = approx_reg_cost (src);
            }
        }
@@ -5114,7 +5134,7 @@ cse_insn (rtx_insn *insn)
            src_eqv_cost = src_eqv_regcost = -1;
          else
            {
-             src_eqv_cost = COST (src_eqv_here, mode);
+             src_eqv_cost = COST_SRC (tmp_insn, src_eqv_here, mode);
              src_eqv_regcost = approx_reg_cost (src_eqv_here);
            }
        }
@@ -5125,7 +5145,7 @@ cse_insn (rtx_insn *insn)
            src_folded_cost = src_folded_regcost = -1;
          else
            {
-             src_folded_cost = COST (src_folded, mode);
+             src_folded_cost = COST_SRC (tmp_insn, src_folded, mode);
              src_folded_regcost = approx_reg_cost (src_folded);
            }
        }
@@ -5136,7 +5156,7 @@ cse_insn (rtx_insn *insn)
            src_related_cost = src_related_regcost = -1;
          else
            {
-             src_related_cost = COST (src_related, mode);
+             src_related_cost = COST_SRC (tmp_insn, src_related, mode);
              src_related_regcost = approx_reg_cost (src_related);
 
              /* If a const-anchor is used to synthesize a constant that
@@ -5406,7 +5426,7 @@ cse_insn (rtx_insn *insn)
              src_folded = force_const_mem (mode, trial);
              if (src_folded)
                {
-                 src_folded_cost = COST (src_folded, mode);
+                 src_folded_cost = COST_SRC (tmp_insn, src_folded, mode);
                  src_folded_regcost = approx_reg_cost (src_folded);
                }
            }
@@ -5460,7 +5480,9 @@ cse_insn (rtx_insn *insn)
                  /* If we had a constant that is cheaper than what we are now
                     setting SRC to, use that constant.  We ignored it when we
                     thought we could make this into a no-op.  */
-                 if (src_const && COST (src_const, mode) < COST (src, mode)
+                 if (src_const
+                     && COST_SRC (tmp_insn, src_const, mode) 
+                          < COST_SRC (tmp_insn, src, mode)
                      && validate_change (insn, &SET_SRC (sets[i].rtl),
                                          src_const, 0))
                    src = src_const;

>
> Richard.
>
>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>> index e82a47f4c0e..e429ae2bcf0 100644
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -21884,6 +21884,14 @@ rs6000_rtx_costs (rtx x, machine_mode mode, int 
>> outer_code,
>>  
>>      case CONST_DOUBLE:
>>      case CONST_WIDE_INT:
>> +      /* Set const to a reg, it may needs a few insns.  */
>> +      if (outer_code == SET)
>> +    {
>> +      *total = COSTS_N_INSNS (num_insns_constant (x, mode));
>> +      return true;
>> +    }
>> +      /* FALLTHRU */
>> +
>>      case CONST:
>>      case HIGH:
>>      case SYMBOL_REF:

Re: [PATCH] Check if loading const from mem is faster

Reply via email to