Re: [PATCH] Check if loading const from mem is faster

2022-03-10 Thread Jiufu Guo via Gcc-patches


Hi!

Richard Biener  writes:

> On Thu, 10 Mar 2022, Jiufu Guo wrote:
>
>> 
>> Hi!
>> 
>> Richard Biener  writes:
>> 
>> > On Wed, 9 Mar 2022, Jiufu Guo wrote:
>> >
>> >> 
>> >> Hi!
>> >> 
>> >> Richard Biener  writes:
>> >> 
>> >> > On Tue, 8 Mar 2022, Jiufu Guo wrote:
>> >> >
>> >> >> Jiufu Guo  writes:
>> >> >> 
>> >> >> Hi!
>> >> >> 
>> >> >> > Hi Sehger,
>> >> >> >
>> >> >> > Segher Boessenkool  writes:
>> >> >> >
>> >> >> >> On Tue, Mar 01, 2022 at 10:28:57PM +0800, Jiufu Guo wrote:
>> >> >> >>> Segher Boessenkool  writes:
>> >> >> >>> > No.  insn_cost is only for correct, existing instructions, not 
>> >> >> >>> > for
>> >> >> >>> > made-up nonsense.  I created insn_cost precisely to get away 
>> >> >> >>> > from that
>> >> >> >>> > aspect of rtx_cost (and some other issues, like, it is 
>> >> >> >>> > incredibly hard
>> >> >> >>> > and cumbersome to write a correct rtx_cost).
>> >> >> >>> 
>> >> >> >>> Thanks! The implementations of hook insn_cost are align with this
>> >> >> >>> design, they are  checking insn's attributes and COSTS_N_INSNS.
>> >> >> >>> 
>> >> >> >>> One question on the speciall case: 
>> >> >> >>> For instruction: "r119:DI=0x100803004101001"
>> >> >> >>> Would we treat it as valid instruction?
>> >> >> >>
>> >> >> >> Currently we do, alternative 6 in *movdi_internal64: we allow any 
>> >> >> >> r<-n.
>> >> >> >> This is costed as 5 insns (cost=20).
>> >> >> >>
>> >> >> >> It generally is better to split things into patterns close to the
>> >> >> >> eventual machine isntructions as early as possible: all the more 
>> >> >> >> generic
>> >> >> >> optimisations can take advantage of that then.
>> >> >> > Get it!
>> >> >> >>
>> >> >> >>> A patch, which is attached the end of this mail, accepts
>> >> >> >>> "r119:DI=0x100803004101001" as input of insn_cost.
>> >> >> >>> In this patch, 
>> >> >> >>> - A tmp instruction is generated via make_insn_raw.
>> >> >> >>> - A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
>> >> >> >>> - In hook of insn_cost, checking the special 'constant' 
>> >> >> >>> instruction.
>> >> >> >>> Are these make sense?
>> >> >> >>
>> >> >> >> I'll review that patch inline.
>> >> >> 
>> >> >> I drafted a new patch that replace rtx_cost with insn_cost for cse.cc.
>> >> >> Different from the previous partial patch, this patch replaces all 
>> >> >> usage
>> >> >> of rtx_cost. It may be better/aggressive than previous one.
>> >> >
>> >> > I think there's no advantage for using insn_cost over rtx_cost for
>> >> > the simple SET case.
>> >> 
>> >> Thanks for your comments and raise this concern.
>> >> 
>> >> For those targets which do not implement insn_cost, insn_cost calls
>> >> rtx_cost through pattern_cost, then insn_cost is equal to rtx_cost.
>> >> 
>> >> While, for those targets which have insn_cost, it seems insn_cost would
>> >> be better(or say more accurate/consistent?) than rtx_cost. Since:
>> >> - insn_cost recog the insn first, and compute cost through something
>> >
>> > target hooks are expected to call recog on the insn but the generic
>> > fallback does not!?  Or do you say a target _could_ call recog?
>> > I think it would be valid to only expect recognized insns here
>> > and thus cse.cc obligation would be to call regoc on the "fake"
>> > instruction which then raises the obvious issue whether you should
>> > ever call recog on something "fake".
>> Thanks Richard! I also feel this is a main point of insn_cost.
>> From my understanding: it would be better to let insn_cost check
>> the valid recognizable insns; this would be the major purpose of
>> insn_cost.  While, I'm also wondering, we may let it go for 'fake'
>> instruction for current implementations (e.g. arm_insn_cost) which
>> behaviors to walk/check pattern. 
>
> Note for the CSE case you'll always end up with the single move
> pattern so it's somewhat pointless to do the recog & insn_cost
> dance there.  For move cost using rtx_cost (SET, ..) should
> be good enough.  One could argue we should standardize a (wrapping)
> API like move_cost (enum machine_mode mode, rtx to, rtx from),
> with to/from optional (if omitted a REG of MODE).  But the existing
> rtx_cost target hook implementation should be sufficient to handle
> it, without building a fake insn and without doing (the pointless,
> if not failing) recog process.

Thanks so much for your comments and suggestions!

Using rtx_cost is able to handle those things in cse.cc.
For insn_cost, it is good for checking an instruction.  To simulate
'rtx_cost' using insn_cost, there is excess work: 'recog' on the setting
of rtx expr and alternative estimation. 

I just wondering we may prefer to use insn_cost for its consistency and
accuracy. :-) So, the patch is prepared.


BR,
Jiufu

>
> Richard.
>
>> >
>> > I also see that rs6000_insn_cost does
>> >
>> > static int
>> > rs6000_insn_cost (rtx_insn *insn, bool speed)
>> > {
>> >   if (recog_memoized (insn) < 0)
>> > return 0;
>> >
>> > so not recognized insns become quite 

Re: [PATCH] Check if loading const from mem is faster

2022-03-09 Thread Richard Biener via Gcc-patches
On Thu, 10 Mar 2022, Jiufu Guo wrote:

> 
> Hi!
> 
> Richard Biener  writes:
> 
> > On Wed, 9 Mar 2022, Jiufu Guo wrote:
> >
> >> 
> >> Hi!
> >> 
> >> Richard Biener  writes:
> >> 
> >> > On Tue, 8 Mar 2022, Jiufu Guo wrote:
> >> >
> >> >> Jiufu Guo  writes:
> >> >> 
> >> >> Hi!
> >> >> 
> >> >> > Hi Sehger,
> >> >> >
> >> >> > Segher Boessenkool  writes:
> >> >> >
> >> >> >> On Tue, Mar 01, 2022 at 10:28:57PM +0800, Jiufu Guo wrote:
> >> >> >>> Segher Boessenkool  writes:
> >> >> >>> > No.  insn_cost is only for correct, existing instructions, not for
> >> >> >>> > made-up nonsense.  I created insn_cost precisely to get away from 
> >> >> >>> > that
> >> >> >>> > aspect of rtx_cost (and some other issues, like, it is incredibly 
> >> >> >>> > hard
> >> >> >>> > and cumbersome to write a correct rtx_cost).
> >> >> >>> 
> >> >> >>> Thanks! The implementations of hook insn_cost are align with this
> >> >> >>> design, they are  checking insn's attributes and COSTS_N_INSNS.
> >> >> >>> 
> >> >> >>> One question on the speciall case: 
> >> >> >>> For instruction: "r119:DI=0x100803004101001"
> >> >> >>> Would we treat it as valid instruction?
> >> >> >>
> >> >> >> Currently we do, alternative 6 in *movdi_internal64: we allow any 
> >> >> >> r<-n.
> >> >> >> This is costed as 5 insns (cost=20).
> >> >> >>
> >> >> >> It generally is better to split things into patterns close to the
> >> >> >> eventual machine isntructions as early as possible: all the more 
> >> >> >> generic
> >> >> >> optimisations can take advantage of that then.
> >> >> > Get it!
> >> >> >>
> >> >> >>> A patch, which is attached the end of this mail, accepts
> >> >> >>> "r119:DI=0x100803004101001" as input of insn_cost.
> >> >> >>> In this patch, 
> >> >> >>> - A tmp instruction is generated via make_insn_raw.
> >> >> >>> - A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
> >> >> >>> - In hook of insn_cost, checking the special 'constant' instruction.
> >> >> >>> Are these make sense?
> >> >> >>
> >> >> >> I'll review that patch inline.
> >> >> 
> >> >> I drafted a new patch that replace rtx_cost with insn_cost for cse.cc.
> >> >> Different from the previous partial patch, this patch replaces all usage
> >> >> of rtx_cost. It may be better/aggressive than previous one.
> >> >
> >> > I think there's no advantage for using insn_cost over rtx_cost for
> >> > the simple SET case.
> >> 
> >> Thanks for your comments and raise this concern.
> >> 
> >> For those targets which do not implement insn_cost, insn_cost calls
> >> rtx_cost through pattern_cost, then insn_cost is equal to rtx_cost.
> >> 
> >> While, for those targets which have insn_cost, it seems insn_cost would
> >> be better(or say more accurate/consistent?) than rtx_cost. Since:
> >> - insn_cost recog the insn first, and compute cost through something
> >
> > target hooks are expected to call recog on the insn but the generic
> > fallback does not!?  Or do you say a target _could_ call recog?
> > I think it would be valid to only expect recognized insns here
> > and thus cse.cc obligation would be to call regoc on the "fake"
> > instruction which then raises the obvious issue whether you should
> > ever call recog on something "fake".
> Thanks Richard! I also feel this is a main point of insn_cost.
> From my understanding: it would be better to let insn_cost check
> the valid recognizable insns; this would be the major purpose of
> insn_cost.  While, I'm also wondering, we may let it go for 'fake'
> instruction for current implementations (e.g. arm_insn_cost) which
> behaviors to walk/check pattern. 

Note for the CSE case you'll always end up with the single move
pattern so it's somewhat pointless to do the recog & insn_cost
dance there.  For move cost using rtx_cost (SET, ..) should
be good enough.  One could argue we should standardize a (wrapping)
API like move_cost (enum machine_mode mode, rtx to, rtx from),
with to/from optional (if omitted a REG of MODE).  But the existing
rtx_cost target hook implementation should be sufficient to handle
it, without building a fake insn and without doing (the pointless,
if not failing) recog process.

Richard.

> >
> > I also see that rs6000_insn_cost does
> >
> > static int
> > rs6000_insn_cost (rtx_insn *insn, bool speed)
> > {
> >   if (recog_memoized (insn) < 0)
> > return 0;
> >
> > so not recognized insns become quite cheap - it looks like
> > insn_cost has no documented failure mode and initial implementors
> > chickened out asserting that an insn is / can be recognized.
> > In fact I'd assert that INSN_CODE (insn) >= 0 simply because there's
> > no failure mode and the _caller_ should have made sure the
> > insn can be recognized.  That said, the insn_cost should do that.
> Thanks! Using the assert would help to catch un-recog failures.
> 
> > (is INSN_CODE () == 0 a real thing?)
> 0 is specialy, it is some thing like CODE_FOR_nothing. :-)
> 
> >
> >> (like length/cost attributes from .md file) for the 'machine insn'.

Re: [PATCH] Check if loading const from mem is faster

2022-03-09 Thread Jiufu Guo via Gcc-patches


Hi!

Richard Biener  writes:

> On Wed, 9 Mar 2022, Jiufu Guo wrote:
>
>> 
>> Hi!
>> 
>> Richard Biener  writes:
>> 
>> > On Tue, 8 Mar 2022, Jiufu Guo wrote:
>> >
>> >> Jiufu Guo  writes:
>> >> 
>> >> Hi!
>> >> 
>> >> > Hi Sehger,
>> >> >
>> >> > Segher Boessenkool  writes:
>> >> >
>> >> >> On Tue, Mar 01, 2022 at 10:28:57PM +0800, Jiufu Guo wrote:
>> >> >>> Segher Boessenkool  writes:
>> >> >>> > No.  insn_cost is only for correct, existing instructions, not for
>> >> >>> > made-up nonsense.  I created insn_cost precisely to get away from 
>> >> >>> > that
>> >> >>> > aspect of rtx_cost (and some other issues, like, it is incredibly 
>> >> >>> > hard
>> >> >>> > and cumbersome to write a correct rtx_cost).
>> >> >>> 
>> >> >>> Thanks! The implementations of hook insn_cost are align with this
>> >> >>> design, they are  checking insn's attributes and COSTS_N_INSNS.
>> >> >>> 
>> >> >>> One question on the speciall case: 
>> >> >>> For instruction: "r119:DI=0x100803004101001"
>> >> >>> Would we treat it as valid instruction?
>> >> >>
>> >> >> Currently we do, alternative 6 in *movdi_internal64: we allow any r<-n.
>> >> >> This is costed as 5 insns (cost=20).
>> >> >>
>> >> >> It generally is better to split things into patterns close to the
>> >> >> eventual machine isntructions as early as possible: all the more 
>> >> >> generic
>> >> >> optimisations can take advantage of that then.
>> >> > Get it!
>> >> >>
>> >> >>> A patch, which is attached the end of this mail, accepts
>> >> >>> "r119:DI=0x100803004101001" as input of insn_cost.
>> >> >>> In this patch, 
>> >> >>> - A tmp instruction is generated via make_insn_raw.
>> >> >>> - A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
>> >> >>> - In hook of insn_cost, checking the special 'constant' instruction.
>> >> >>> Are these make sense?
>> >> >>
>> >> >> I'll review that patch inline.
>> >> 
>> >> I drafted a new patch that replace rtx_cost with insn_cost for cse.cc.
>> >> Different from the previous partial patch, this patch replaces all usage
>> >> of rtx_cost. It may be better/aggressive than previous one.
>> >
>> > I think there's no advantage for using insn_cost over rtx_cost for
>> > the simple SET case.
>> 
>> Thanks for your comments and raise this concern.
>> 
>> For those targets which do not implement insn_cost, insn_cost calls
>> rtx_cost through pattern_cost, then insn_cost is equal to rtx_cost.
>> 
>> While, for those targets which have insn_cost, it seems insn_cost would
>> be better(or say more accurate/consistent?) than rtx_cost. Since:
>> - insn_cost recog the insn first, and compute cost through something
>
> target hooks are expected to call recog on the insn but the generic
> fallback does not!?  Or do you say a target _could_ call recog?
> I think it would be valid to only expect recognized insns here
> and thus cse.cc obligation would be to call regoc on the "fake"
> instruction which then raises the obvious issue whether you should
> ever call recog on something "fake".
Thanks Richard! I also feel this is a main point of insn_cost.
>From my understanding: it would be better to let insn_cost check
the valid recognizable insns; this would be the major purpose of
insn_cost.  While, I'm also wondering, we may let it go for 'fake'
instruction for current implementations (e.g. arm_insn_cost) which
behaviors to walk/check pattern. 

>
> I also see that rs6000_insn_cost does
>
> static int
> rs6000_insn_cost (rtx_insn *insn, bool speed)
> {
>   if (recog_memoized (insn) < 0)
> return 0;
>
> so not recognized insns become quite cheap - it looks like
> insn_cost has no documented failure mode and initial implementors
> chickened out asserting that an insn is / can be recognized.
> In fact I'd assert that INSN_CODE (insn) >= 0 simply because there's
> no failure mode and the _caller_ should have made sure the
> insn can be recognized.  That said, the insn_cost should do that.
Thanks! Using the assert would help to catch un-recog failures.

> (is INSN_CODE () == 0 a real thing?)
0 is specialy, it is some thing like CODE_FOR_nothing. :-)

>
>> (like length/cost attributes from .md file) for the 'machine insn'.
>> - rtx_cost estimates the cost through analyzing the 'rtx content'.
>> The accurate estimation relates to the context.
>> 
>> For a special example: "%r100 = C", as a previous patch, by tunning
>> target's rtx_cost hook, cost could be computed according to the value
>> of C. insn_cost may just model the cost in the define of the machine
>> instruction.
>> 
>> These reasons are my initial thoughts.  Segher may have better
>> explain. :-) 
>
> OK, so in theory the targets insn_cost can use insn attributes
> which allows to keep the cost parts of an instruction close to the
> instruction patterns which is probably a good thing for
> maintainability.  But as you point out this requires recognizing
> of possibly generated instructions.  And before reload it has
> the issue that the alternative is not determined which 

Re: [PATCH] Check if loading const from mem is faster

2022-03-08 Thread Richard Biener via Gcc-patches
On Wed, 9 Mar 2022, Jiufu Guo wrote:

> 
> Hi!
> 
> Richard Biener  writes:
> 
> > On Tue, 8 Mar 2022, Jiufu Guo wrote:
> >
> >> Jiufu Guo  writes:
> >> 
> >> Hi!
> >> 
> >> > Hi Sehger,
> >> >
> >> > Segher Boessenkool  writes:
> >> >
> >> >> On Tue, Mar 01, 2022 at 10:28:57PM +0800, Jiufu Guo wrote:
> >> >>> Segher Boessenkool  writes:
> >> >>> > No.  insn_cost is only for correct, existing instructions, not for
> >> >>> > made-up nonsense.  I created insn_cost precisely to get away from 
> >> >>> > that
> >> >>> > aspect of rtx_cost (and some other issues, like, it is incredibly 
> >> >>> > hard
> >> >>> > and cumbersome to write a correct rtx_cost).
> >> >>> 
> >> >>> Thanks! The implementations of hook insn_cost are align with this
> >> >>> design, they are  checking insn's attributes and COSTS_N_INSNS.
> >> >>> 
> >> >>> One question on the speciall case: 
> >> >>> For instruction: "r119:DI=0x100803004101001"
> >> >>> Would we treat it as valid instruction?
> >> >>
> >> >> Currently we do, alternative 6 in *movdi_internal64: we allow any r<-n.
> >> >> This is costed as 5 insns (cost=20).
> >> >>
> >> >> It generally is better to split things into patterns close to the
> >> >> eventual machine isntructions as early as possible: all the more generic
> >> >> optimisations can take advantage of that then.
> >> > Get it!
> >> >>
> >> >>> A patch, which is attached the end of this mail, accepts
> >> >>> "r119:DI=0x100803004101001" as input of insn_cost.
> >> >>> In this patch, 
> >> >>> - A tmp instruction is generated via make_insn_raw.
> >> >>> - A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
> >> >>> - In hook of insn_cost, checking the special 'constant' instruction.
> >> >>> Are these make sense?
> >> >>
> >> >> I'll review that patch inline.
> >> 
> >> I drafted a new patch that replace rtx_cost with insn_cost for cse.cc.
> >> Different from the previous partial patch, this patch replaces all usage
> >> of rtx_cost. It may be better/aggressive than previous one.
> >
> > I think there's no advantage for using insn_cost over rtx_cost for
> > the simple SET case.
> 
> Thanks for your comments and raise this concern.
> 
> For those targets which do not implement insn_cost, insn_cost calls
> rtx_cost through pattern_cost, then insn_cost is equal to rtx_cost.
> 
> While, for those targets which have insn_cost, it seems insn_cost would
> be better(or say more accurate/consistent?) than rtx_cost. Since:
> - insn_cost recog the insn first, and compute cost through something

target hooks are expected to call recog on the insn but the generic
fallback does not!?  Or do you say a target _could_ call recog?
I think it would be valid to only expect recognized insns here
and thus cse.cc obligation would be to call regoc on the "fake"
instruction which then raises the obvious issue whether you should
ever call recog on something "fake".

I also see that rs6000_insn_cost does

static int
rs6000_insn_cost (rtx_insn *insn, bool speed)
{
  if (recog_memoized (insn) < 0)
return 0;

so not recognized insns become quite cheap - it looks like
insn_cost has no documented failure mode and initial implementors
chickened out asserting that an insn is / can be recognized.
In fact I'd assert that INSN_CODE (insn) >= 0 simply because there's
no failure mode and the _caller_ should have made sure the
insn can be recognized.  That said, the insn_cost should do that.
(is INSN_CODE () == 0 a real thing?)

> (like length/cost attributes from .md file) for the 'machine insn'.
> - rtx_cost estimates the cost through analyzing the 'rtx content'.
> The accurate estimation relates to the context.
> 
> For a special example: "%r100 = C", as a previous patch, by tunning
> target's rtx_cost hook, cost could be computed according to the value
> of C. insn_cost may just model the cost in the define of the machine
> instruction.
> 
> These reasons are my initial thoughts.  Segher may have better
> explain. :-) 

OK, so in theory the targets insn_cost can use insn attributes
which allows to keep the cost parts of an instruction close to the
instruction patterns which is probably a good thing for
maintainability.  But as you point out this requires recognizing
of possibly generated instructions.  And before reload it has
the issue that the alternative is not determined which means the
true cost cannot be determined without walking the pattern,
like on x86 where many instruction operands have both register
and memory alternatives.

> To replace rtx_cost with insn_cost, this patch build a SET instruction:
> "%r = rtx_expr", then using "%r = rtx_expr" from insn_cost to simulate
> the cost of "rtx_expr" from rtx_cost.
> 
> 
> BR,
> Jiufu
> 
> >
> > Richard.
> >
> >> With this patch, bootstrap pass.
> >> From regtest, only output of fusion-p10-ldcmpi.c is changed, and the
> >> change seems as expected.
> >> 
> >> 
> >> BR,
> >> Jiufu
> >> 
> >> diff --git a/gcc/cse.cc b/gcc/cse.cc
> >> index a18b599d324..e623ad298db 100644
> >> 

Re: [PATCH] Check if loading const from mem is faster

2022-03-08 Thread Jiufu Guo via Gcc-patches


Hi!

Richard Biener  writes:

> On Tue, 8 Mar 2022, Jiufu Guo wrote:
>
>> Jiufu Guo  writes:
>> 
>> Hi!
>> 
>> > Hi Sehger,
>> >
>> > Segher Boessenkool  writes:
>> >
>> >> On Tue, Mar 01, 2022 at 10:28:57PM +0800, Jiufu Guo wrote:
>> >>> Segher Boessenkool  writes:
>> >>> > No.  insn_cost is only for correct, existing instructions, not for
>> >>> > made-up nonsense.  I created insn_cost precisely to get away from that
>> >>> > aspect of rtx_cost (and some other issues, like, it is incredibly hard
>> >>> > and cumbersome to write a correct rtx_cost).
>> >>> 
>> >>> Thanks! The implementations of hook insn_cost are align with this
>> >>> design, they are  checking insn's attributes and COSTS_N_INSNS.
>> >>> 
>> >>> One question on the speciall case: 
>> >>> For instruction: "r119:DI=0x100803004101001"
>> >>> Would we treat it as valid instruction?
>> >>
>> >> Currently we do, alternative 6 in *movdi_internal64: we allow any r<-n.
>> >> This is costed as 5 insns (cost=20).
>> >>
>> >> It generally is better to split things into patterns close to the
>> >> eventual machine isntructions as early as possible: all the more generic
>> >> optimisations can take advantage of that then.
>> > Get it!
>> >>
>> >>> A patch, which is attached the end of this mail, accepts
>> >>> "r119:DI=0x100803004101001" as input of insn_cost.
>> >>> In this patch, 
>> >>> - A tmp instruction is generated via make_insn_raw.
>> >>> - A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
>> >>> - In hook of insn_cost, checking the special 'constant' instruction.
>> >>> Are these make sense?
>> >>
>> >> I'll review that patch inline.
>> 
>> I drafted a new patch that replace rtx_cost with insn_cost for cse.cc.
>> Different from the previous partial patch, this patch replaces all usage
>> of rtx_cost. It may be better/aggressive than previous one.
>
> I think there's no advantage for using insn_cost over rtx_cost for
> the simple SET case.

Thanks for your comments and raise this concern.

For those targets which do not implement insn_cost, insn_cost calls
rtx_cost through pattern_cost, then insn_cost is equal to rtx_cost.

While, for those targets which have insn_cost, it seems insn_cost would
be better(or say more accurate/consistent?) than rtx_cost. Since:
- insn_cost recog the insn first, and compute cost through something
(like length/cost attributes from .md file) for the 'machine insn'.
- rtx_cost estimates the cost through analyzing the 'rtx content'.
The accurate estimation relates to the context.

For a special example: "%r100 = C", as a previous patch, by tunning
target's rtx_cost hook, cost could be computed according to the value
of C. insn_cost may just model the cost in the define of the machine
instruction.

These reasons are my initial thoughts.  Segher may have better
explain. :-) 

To replace rtx_cost with insn_cost, this patch build a SET instruction:
"%r = rtx_expr", then using "%r = rtx_expr" from insn_cost to simulate
the cost of "rtx_expr" from rtx_cost.


BR,
Jiufu

>
> Richard.
>
>> With this patch, bootstrap pass.
>> From regtest, only output of fusion-p10-ldcmpi.c is changed, and the
>> change seems as expected.
>> 
>> 
>> BR,
>> Jiufu
>> 
>> diff --git a/gcc/cse.cc b/gcc/cse.cc
>> index a18b599d324..e623ad298db 100644
>> --- a/gcc/cse.cc
>> +++ b/gcc/cse.cc
>> @@ -262,6 +262,9 @@ static struct qty_table_elem *qty_table;
>>  static rtx_insn *this_insn;
>>  static bool optimize_this_for_speed_p;
>>  
>> +/* Used for insn_cost. */
>> +static rtx_insn *estimate_insn;
>> +
>>  /* Index by register number, gives the number of the next (or
>> previous) register in the chain of registers sharing the same
>> value.
>> @@ -445,7 +448,7 @@ struct table_elt
>>  /* Compute cost of X, as stored in the `cost' field of a table_elt.  Fixed
>> hard registers and pointers into the frame are the cheapest with a cost
>> of 0.  Next come pseudos with a cost of one and other hard registers with
>> -   a cost of 2.  Aside from these special cases, call `rtx_cost'.  */
>> +   a cost of 2.  Aside from these special cases, call `insn_cost'.  */
>>  
>>  #define CHEAP_REGNO(N)  
>> \
>>(REGNO_PTR_FRAME_P (N)\
>> @@ -698,18 +701,33 @@ preferable (int cost_a, int regcost_a, int cost_b, int 
>> regcost_b)
>> from COST macro to keep it simple.  */
>>  
>>  static int
>> -notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
>> +notreg_cost (rtx x, machine_mode mode, enum rtx_code /*outer*/, int 
>> /*opno*/)
>>  {
>>scalar_int_mode int_mode, inner_mode;
>> -  return ((GET_CODE (x) == SUBREG
>> -   && REG_P (SUBREG_REG (x))
>> -   && is_int_mode (mode, _mode)
>> -   && is_int_mode (GET_MODE (SUBREG_REG (x)), _mode)
>> -   && GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
>> -   && subreg_lowpart_p (x)
>> -   && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, 

Re: [PATCH] Check if loading const from mem is faster

2022-03-08 Thread Richard Biener via Gcc-patches
On Tue, 8 Mar 2022, Jiufu Guo wrote:

> Jiufu Guo  writes:
> 
> Hi!
> 
> > Hi Sehger,
> >
> > Segher Boessenkool  writes:
> >
> >> On Tue, Mar 01, 2022 at 10:28:57PM +0800, Jiufu Guo wrote:
> >>> Segher Boessenkool  writes:
> >>> > No.  insn_cost is only for correct, existing instructions, not for
> >>> > made-up nonsense.  I created insn_cost precisely to get away from that
> >>> > aspect of rtx_cost (and some other issues, like, it is incredibly hard
> >>> > and cumbersome to write a correct rtx_cost).
> >>> 
> >>> Thanks! The implementations of hook insn_cost are align with this
> >>> design, they are  checking insn's attributes and COSTS_N_INSNS.
> >>> 
> >>> One question on the speciall case: 
> >>> For instruction: "r119:DI=0x100803004101001"
> >>> Would we treat it as valid instruction?
> >>
> >> Currently we do, alternative 6 in *movdi_internal64: we allow any r<-n.
> >> This is costed as 5 insns (cost=20).
> >>
> >> It generally is better to split things into patterns close to the
> >> eventual machine isntructions as early as possible: all the more generic
> >> optimisations can take advantage of that then.
> > Get it!
> >>
> >>> A patch, which is attached the end of this mail, accepts
> >>> "r119:DI=0x100803004101001" as input of insn_cost.
> >>> In this patch, 
> >>> - A tmp instruction is generated via make_insn_raw.
> >>> - A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
> >>> - In hook of insn_cost, checking the special 'constant' instruction.
> >>> Are these make sense?
> >>
> >> I'll review that patch inline.
> 
> I drafted a new patch that replace rtx_cost with insn_cost for cse.cc.
> Different from the previous partial patch, this patch replaces all usage
> of rtx_cost. It may be better/aggressive than previous one.

I think there's no advantage for using insn_cost over rtx_cost for
the simple SET case.

Richard.

> With this patch, bootstrap pass.
> From regtest, only output of fusion-p10-ldcmpi.c is changed, and the
> change seems as expected.
> 
> 
> BR,
> Jiufu
> 
> diff --git a/gcc/cse.cc b/gcc/cse.cc
> index a18b599d324..e623ad298db 100644
> --- a/gcc/cse.cc
> +++ b/gcc/cse.cc
> @@ -262,6 +262,9 @@ static struct qty_table_elem *qty_table;
>  static rtx_insn *this_insn;
>  static bool optimize_this_for_speed_p;
>  
> +/* Used for insn_cost. */
> +static rtx_insn *estimate_insn;
> +
>  /* Index by register number, gives the number of the next (or
> previous) register in the chain of registers sharing the same
> value.
> @@ -445,7 +448,7 @@ struct table_elt
>  /* Compute cost of X, as stored in the `cost' field of a table_elt.  Fixed
> hard registers and pointers into the frame are the cheapest with a cost
> of 0.  Next come pseudos with a cost of one and other hard registers with
> -   a cost of 2.  Aside from these special cases, call `rtx_cost'.  */
> +   a cost of 2.  Aside from these special cases, call `insn_cost'.  */
>  
>  #define CHEAP_REGNO(N)   
> \
>(REGNO_PTR_FRAME_P (N) \
> @@ -698,18 +701,33 @@ preferable (int cost_a, int regcost_a, int cost_b, int 
> regcost_b)
> from COST macro to keep it simple.  */
>  
>  static int
> -notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
> +notreg_cost (rtx x, machine_mode mode, enum rtx_code /*outer*/, int /*opno*/)
>  {
>scalar_int_mode int_mode, inner_mode;
> -  return ((GET_CODE (x) == SUBREG
> -&& REG_P (SUBREG_REG (x))
> -&& is_int_mode (mode, _mode)
> -&& is_int_mode (GET_MODE (SUBREG_REG (x)), _mode)
> -&& GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
> -&& subreg_lowpart_p (x)
> -&& TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
> -   ? 0
> -   : rtx_cost (x, mode, outer, opno, optimize_this_for_speed_p) * 2);
> +  if (GET_CODE (x) == SUBREG && REG_P (SUBREG_REG (x))
> +  && is_int_mode (mode, _mode)
> +  && is_int_mode (GET_MODE (SUBREG_REG (x)), _mode)
> +  && GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
> +  && subreg_lowpart_p (x)
> +  && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
> +return 0;
> +
> +  if (estimate_insn == NULL)
> +{
> +  estimate_insn = make_insn_raw (
> + gen_rtx_SET (gen_rtx_REG (mode, LAST_VIRTUAL_REGISTER + 1), x));
> +  SET_PREV_INSN (estimate_insn) = NULL_RTX;
> +  SET_NEXT_INSN (estimate_insn) = NULL_RTX;
> +  INSN_LOCATION (estimate_insn) = 0;
> +}
> +  else
> +{
> +  /* Update for new context.  */
> +  INSN_CODE (estimate_insn) = -1;
> +  PUT_MODE (SET_DEST (PATTERN (estimate_insn)), mode);
> +  SET_SRC (PATTERN (estimate_insn)) = x;
> +}
> +  return insn_cost (estimate_insn, optimize_this_for_speed_p);
>  }
>  
>  
> @@ -6667,6 +6685,7 @@ cse_main (rtx_insn *f ATTRIBUTE_UNUSED, int nregs)
>  
>init_recog ();
>init_alias_analysis ();
> +  estimate_insn = NULL;
>  

Re: [PATCH] Check if loading const from mem is faster

2022-03-08 Thread Jiufu Guo via Gcc-patches
Jiufu Guo  writes:

Hi!

> Hi Sehger,
>
> Segher Boessenkool  writes:
>
>> On Tue, Mar 01, 2022 at 10:28:57PM +0800, Jiufu Guo wrote:
>>> Segher Boessenkool  writes:
>>> > No.  insn_cost is only for correct, existing instructions, not for
>>> > made-up nonsense.  I created insn_cost precisely to get away from that
>>> > aspect of rtx_cost (and some other issues, like, it is incredibly hard
>>> > and cumbersome to write a correct rtx_cost).
>>> 
>>> Thanks! The implementations of hook insn_cost are align with this
>>> design, they are  checking insn's attributes and COSTS_N_INSNS.
>>> 
>>> One question on the speciall case: 
>>> For instruction: "r119:DI=0x100803004101001"
>>> Would we treat it as valid instruction?
>>
>> Currently we do, alternative 6 in *movdi_internal64: we allow any r<-n.
>> This is costed as 5 insns (cost=20).
>>
>> It generally is better to split things into patterns close to the
>> eventual machine isntructions as early as possible: all the more generic
>> optimisations can take advantage of that then.
> Get it!
>>
>>> A patch, which is attached the end of this mail, accepts
>>> "r119:DI=0x100803004101001" as input of insn_cost.
>>> In this patch, 
>>> - A tmp instruction is generated via make_insn_raw.
>>> - A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
>>> - In hook of insn_cost, checking the special 'constant' instruction.
>>> Are these make sense?
>>
>> I'll review that patch inline.

I drafted a new patch that replace rtx_cost with insn_cost for cse.cc.
Different from the previous partial patch, this patch replaces all usage
of rtx_cost. It may be better/aggressive than previous one.

With this patch, bootstrap pass.
>From regtest, only output of fusion-p10-ldcmpi.c is changed, and the
change seems as expected.


BR,
Jiufu

diff --git a/gcc/cse.cc b/gcc/cse.cc
index a18b599d324..e623ad298db 100644
--- a/gcc/cse.cc
+++ b/gcc/cse.cc
@@ -262,6 +262,9 @@ static struct qty_table_elem *qty_table;
 static rtx_insn *this_insn;
 static bool optimize_this_for_speed_p;
 
+/* Used for insn_cost. */
+static rtx_insn *estimate_insn;
+
 /* Index by register number, gives the number of the next (or
previous) register in the chain of registers sharing the same
value.
@@ -445,7 +448,7 @@ struct table_elt
 /* Compute cost of X, as stored in the `cost' field of a table_elt.  Fixed
hard registers and pointers into the frame are the cheapest with a cost
of 0.  Next come pseudos with a cost of one and other hard registers with
-   a cost of 2.  Aside from these special cases, call `rtx_cost'.  */
+   a cost of 2.  Aside from these special cases, call `insn_cost'.  */
 
 #define CHEAP_REGNO(N) \
   (REGNO_PTR_FRAME_P (N)   \
@@ -698,18 +701,33 @@ preferable (int cost_a, int regcost_a, int cost_b, int 
regcost_b)
from COST macro to keep it simple.  */
 
 static int
-notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
+notreg_cost (rtx x, machine_mode mode, enum rtx_code /*outer*/, int /*opno*/)
 {
   scalar_int_mode int_mode, inner_mode;
-  return ((GET_CODE (x) == SUBREG
-  && REG_P (SUBREG_REG (x))
-  && is_int_mode (mode, _mode)
-  && is_int_mode (GET_MODE (SUBREG_REG (x)), _mode)
-  && GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
-  && subreg_lowpart_p (x)
-  && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
- ? 0
- : rtx_cost (x, mode, outer, opno, optimize_this_for_speed_p) * 2);
+  if (GET_CODE (x) == SUBREG && REG_P (SUBREG_REG (x))
+  && is_int_mode (mode, _mode)
+  && is_int_mode (GET_MODE (SUBREG_REG (x)), _mode)
+  && GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
+  && subreg_lowpart_p (x)
+  && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
+return 0;
+
+  if (estimate_insn == NULL)
+{
+  estimate_insn = make_insn_raw (
+   gen_rtx_SET (gen_rtx_REG (mode, LAST_VIRTUAL_REGISTER + 1), x));
+  SET_PREV_INSN (estimate_insn) = NULL_RTX;
+  SET_NEXT_INSN (estimate_insn) = NULL_RTX;
+  INSN_LOCATION (estimate_insn) = 0;
+}
+  else
+{
+  /* Update for new context.  */
+  INSN_CODE (estimate_insn) = -1;
+  PUT_MODE (SET_DEST (PATTERN (estimate_insn)), mode);
+  SET_SRC (PATTERN (estimate_insn)) = x;
+}
+  return insn_cost (estimate_insn, optimize_this_for_speed_p);
 }
 
 
@@ -6667,6 +6685,7 @@ cse_main (rtx_insn *f ATTRIBUTE_UNUSED, int nregs)
 
   init_recog ();
   init_alias_analysis ();
+  estimate_insn = NULL;
 
   reg_eqv_table = XNEWVEC (struct reg_eqv_elem, nregs);
 


Re: [PATCH] Check if loading const from mem is faster

2022-03-03 Thread Jiufu Guo via Gcc-patches


Hi Sehger,

Segher Boessenkool  writes:

> On Tue, Mar 01, 2022 at 10:28:57PM +0800, Jiufu Guo wrote:
>> Segher Boessenkool  writes:
>> > No.  insn_cost is only for correct, existing instructions, not for
>> > made-up nonsense.  I created insn_cost precisely to get away from that
>> > aspect of rtx_cost (and some other issues, like, it is incredibly hard
>> > and cumbersome to write a correct rtx_cost).
>> 
>> Thanks! The implementations of hook insn_cost are align with this
>> design, they are  checking insn's attributes and COSTS_N_INSNS.
>> 
>> One question on the speciall case: 
>> For instruction: "r119:DI=0x100803004101001"
>> Would we treat it as valid instruction?
>
> Currently we do, alternative 6 in *movdi_internal64: we allow any r<-n.
> This is costed as 5 insns (cost=20).
>
> It generally is better to split things into patterns close to the
> eventual machine isntructions as early as possible: all the more generic
> optimisations can take advantage of that then.
Get it!
>
>> A patch, which is attached the end of this mail, accepts
>> "r119:DI=0x100803004101001" as input of insn_cost.
>> In this patch, 
>> - A tmp instruction is generated via make_insn_raw.
>> - A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
>> - In hook of insn_cost, checking the special 'constant' instruction.
>> Are these make sense?
>
> I'll review that patch inline.
>
>> > That is one reason why it is better to generate (close to) machine
>> > insns as early as possible: it makes it much easier to estimate
>> > realistic costs.  (Another important reason is it allows other
>> > optimisations, without us having to do any work for it!)
>> Get it!  In the middle of an optimization pass, 'interim'
>> instruction maybe acceptable.  While it would better to outputs
>> only contains 'valid machine insn' from any RTL passes.
>
> Acceptable only if there is a very good reason for it, really :-(
>
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -22131,6 +22131,16 @@ rs6000_debug_rtx_costs (rtx x, machine_mode mode, 
>> int outer_code,
>>  static int
>>  rs6000_insn_cost (rtx_insn *insn, bool speed)
>>  {
>> +  /* Handle special 'constant int' insn. */
>> +  rtx set = PATTERN (insn);
>> +  if (GET_CODE (set) == SET && CONSTANT_P (SET_SRC (set)))
>> +{
>> +  rtx src = SET_SRC (set);
>> +  machine_mode mode = GET_MODE (SET_DEST (set));
>> +  if (CONST_INT_P (src) || CONST_WIDE_INT_P (src))
>> +return COSTS_N_INSNS (num_insns_constant (src, mode));
>> +}
>> +  
>>if (recog_memoized (insn) < 0)
>>  return 0;
>
> Why would such a set not recog()?
Thanks.  This code is not need at the top of function insn_cost.
recog_memoized could check insn_code on 'insn'.
>
> Needs a comment in any case, to say what this is a workaround for.
>
>> +static int insn_cost_x (rtx_insn *, rtx);
>
> Don't declare functions, just put their definitions before their first
> use.  (And use a better name please :-) )
Get it. :-)
>
>>  static int
>> -notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
>> +notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno,
>> + rtx_insn *insn = NULL)
>
> Don't use default arguments like this, it is an abomination.
Thanks.
>
>> @@ -709,9 +713,21 @@ notreg_cost (rtx x, machine_mode mode, enum rtx_code 
>> outer, int opno)
>> && subreg_lowpart_p (x)
>> && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
>>? 0
>> +  : insn != NULL ? insn_cost_x (insn, x)
>>: rtx_cost (x, mode, outer, opno, optimize_this_for_speed_p) * 2);
>>  }
>
> You can just always use insn_cost?  insn_cost -> pattern_cost ->
> set_src_cost -> rtx_cost.  That works for COST at least, not sure about
> COST_IN, maybe that needs a little more care (cse.c works with invalid
> insns all over the place :-( )
>
This experiement patch just replace part of rtx_cost with insn_cost.
In case, COST is called outside cse_insn, 'insn' may not be set, and
then 'insn_cost' may not work.   This would need to be enhanced.
>>  
>> +/* Internal function, to get cost when use X to replace source of insn
>> +   which is a SET.  */
>> +
>> +static int
>> +insn_cost_x (rtx_insn *insn, rtx x)
>> +{
>> +  INSN_CODE (insn) = -1;
>> +  SET_SRC (PATTERN (insn)) = x;
>> +  return insn_cost (insn, optimize_this_for_speed_p);
>> +}
>
> You need to restore stuff as well?
In this patch, this function is called on a tmp_insn, so I did not
restore it.  If using the original 'insn' of cse_insn to invoked
'insn_cost_x', fields of 'insn' should be restored.
>
>> @@ -4603,6 +4619,7 @@ cse_insn (rtx_insn *insn)
>>  
>>   Nothing in this loop changes the hash table or the register chains.  */
>>  
>> +  rtx_insn *tmp_insn = NULL;
>>for (i = 0; i < n_sets; i++)
>>  {
>>bool repeat = false;
>> @@ -4638,6 +4655,10 @@ cse_insn (rtx_insn *insn)
>>mode = GET_MODE (src) == VOIDmode ? GET_MODE (dest) : GET_MODE (src);
>>

Re: [PATCH] Check if loading const from mem is faster

2022-03-03 Thread Jiufu Guo via Gcc-patches


Hi,

Jeff Law  writes:

> On 3/1/2022 12:47 AM, Richard Biener via Gcc-patches wrote:
>> On Tue, 1 Mar 2022, Jiufu Guo wrote:
>>
>>> Segher Boessenkool  writes:
>>>
 On Thu, Feb 24, 2022 at 09:50:28AM +0100, Richard Biener wrote:
> On Thu, 24 Feb 2022, Jiufu Guo wrote:
>> And another thing as Segher pointed out, CSE is doing too
>> much work.  It may be ok to separate the constant handling
>> logic from CSE.
> Not sure - CSE just is value numbering, I don't see that it does
> more than that.  Yes, it might have developed "heuristics" over
> the years what to CSE and to what and where to substitute and
> where not.  But in the end it does just value numbering.
 It also does various micro-optimisations, like all the CC things it
 does.

 It is not very good at doing the CSE job, but it cannot easily be
 replaced by a better implementation because it does many other small
 optimisations (that are not done elsewhere).

>>> Thanks a lot for these comments! I'm also wondering if we would
>>> rewrite this cse.cc or refactor it in some aspects.
>> I think time is better spent elsewhere ... I don't think CSE is as
>> bad as Segher depicts it - it might do "CC things" and other bits
>> but in the end that's going to be instruction/expression combination
>> things that "fit" likely because a value lattice (or just nonzero bits
>> in the cselib variant) existed.
>>
>> So what might be interesting would be to work towards cleansing
>> CSE of those, producing testcases and making sure a better fit
>> pass (combine? fwprop? compare-elim?) performs the desired
>> optimization.
>>
>> But I'm not really sure what Segher is talking about - I suppose
>> it must be magic done inside cselib (which only does analysis),
>> not in cse.cc itself.
> I also don't see cse.cc has being *that* bad. It's largely the
> same bits as we had before SSA and I wouldn't be surprised if there
> are things in there that aren't needed anymore.
>
> For example, while we have excised HAVE_cc0, I think there are still
> remnants of those concepts lying around (see this_insn_cc0 and
> friends).
>
> I'd always hoped we'd get to a point where we could eliminate the
> follow-jumps and cse-around-loops bits, but we never managed to
> accomplish that.  When I last looked (a long long time ago), there
> were still things that got exposed when we lowered from gimple to RTL
> that were picked up by those options.   Changing to work  with
> dominators to find EBBs would be a nice cleanup, but the code as it
> stands right now works.
>
> I wouldn't be surprised if the costing stuff could use some serious
> cleanup.  But I don't see it as inherently broken, just very dated.
>
> But in the end, it's really just value-numbering as you say.  It could
> be rewritten to be more modern, but I doubt it's going to make much of
> a real difference in the end.  If there are things that fit logically
> elsehwere, sure move them where they more logically fit, but I doubt
> there's a lot of this stuff.
>
> jeff

Thanks for your comments and suggestions!

I had a quick test about cse1/cse2 on spec2017. Compare with "-O3",
we can see both positive and negative impacts for "-O3
-fdisable-rtl-cse1 -fdisable-rtl-cse2".  we can see performance gain on
500.perlbench_r(+1.83%),  538.imagick_r(0.7%) 520.omnetpp_r(+0.6%);
and performance recession on  548.exchange2_r(-1.97%) 557.xz_r(-0.7%)
on Power9.  The performance change on 500.perlbench_r would relate
to the behavior of 'constant loading'.

The performance impactions may be directly or indirectly caused by
sub-behaviors of current cse.cc.  And the data would change on
on different targets. Anyway, this data may also indicate that
we could clean up or enhance some functionalities in cse.cc.

BR,
Jiufu


Re: [PATCH] Check if loading const from mem is faster

2022-03-02 Thread Segher Boessenkool
On Tue, Mar 01, 2022 at 10:28:57PM +0800, Jiufu Guo wrote:
> Segher Boessenkool  writes:
> > No.  insn_cost is only for correct, existing instructions, not for
> > made-up nonsense.  I created insn_cost precisely to get away from that
> > aspect of rtx_cost (and some other issues, like, it is incredibly hard
> > and cumbersome to write a correct rtx_cost).
> 
> Thanks! The implementations of hook insn_cost are align with this
> design, they are  checking insn's attributes and COSTS_N_INSNS.
> 
> One question on the speciall case: 
> For instruction: "r119:DI=0x100803004101001"
> Would we treat it as valid instruction?

Currently we do, alternative 6 in *movdi_internal64: we allow any r<-n.
This is costed as 5 insns (cost=20).

It generally is better to split things into patterns close to the
eventual machine isntructions as early as possible: all the more generic
optimisations can take advantage of that then.

> A patch, which is attached the end of this mail, accepts
> "r119:DI=0x100803004101001" as input of insn_cost.
> In this patch, 
> - A tmp instruction is generated via make_insn_raw.
> - A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
> - In hook of insn_cost, checking the special 'constant' instruction.
> Are these make sense?

I'll review that patch inline.

> > That is one reason why it is better to generate (close to) machine
> > insns as early as possible: it makes it much easier to estimate
> > realistic costs.  (Another important reason is it allows other
> > optimisations, without us having to do any work for it!)
> Get it!  In the middle of an optimization pass, 'interim'
> instruction maybe acceptable.  While it would better to outputs
> only contains 'valid machine insn' from any RTL passes.

Acceptable only if there is a very good reason for it, really :-(

> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -22131,6 +22131,16 @@ rs6000_debug_rtx_costs (rtx x, machine_mode mode, 
> int outer_code,
>  static int
>  rs6000_insn_cost (rtx_insn *insn, bool speed)
>  {
> +  /* Handle special 'constant int' insn. */
> +  rtx set = PATTERN (insn);
> +  if (GET_CODE (set) == SET && CONSTANT_P (SET_SRC (set)))
> +{
> +  rtx src = SET_SRC (set);
> +  machine_mode mode = GET_MODE (SET_DEST (set));
> +  if (CONST_INT_P (src) || CONST_WIDE_INT_P (src))
> + return COSTS_N_INSNS (num_insns_constant (src, mode));
> +}
> +  
>if (recog_memoized (insn) < 0)
>  return 0;

Why would such a set not recog()?

Needs a comment in any case, to say what this is a workaround for.

> +static int insn_cost_x (rtx_insn *, rtx);

Don't declare functions, just put their definitions before their first
use.  (And use a better name please :-) )

>  static int
> -notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
> +notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno,
> +  rtx_insn *insn = NULL)

Don't use default arguments like this, it is an abomination.

> @@ -709,9 +713,21 @@ notreg_cost (rtx x, machine_mode mode, enum rtx_code 
> outer, int opno)
>  && subreg_lowpart_p (x)
>  && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
> ? 0
> +   : insn != NULL ? insn_cost_x (insn, x)
> : rtx_cost (x, mode, outer, opno, optimize_this_for_speed_p) * 2);
>  }

You can just always use insn_cost?  insn_cost -> pattern_cost ->
set_src_cost -> rtx_cost.  That works for COST at least, not sure about
COST_IN, maybe that needs a little more care (cse.c works with invalid
insns all over the place :-( )

>  
> +/* Internal function, to get cost when use X to replace source of insn
> +   which is a SET.  */
> +
> +static int
> +insn_cost_x (rtx_insn *insn, rtx x)
> +{
> +  INSN_CODE (insn) = -1;
> +  SET_SRC (PATTERN (insn)) = x;
> +  return insn_cost (insn, optimize_this_for_speed_p);
> +}

You need to restore stuff as well?

> @@ -4603,6 +4619,7 @@ cse_insn (rtx_insn *insn)
>  
>   Nothing in this loop changes the hash table or the register chains.  */
>  
> +  rtx_insn *tmp_insn = NULL;
>for (i = 0; i < n_sets; i++)
>  {
>bool repeat = false;
> @@ -4638,6 +4655,10 @@ cse_insn (rtx_insn *insn)
>mode = GET_MODE (src) == VOIDmode ? GET_MODE (dest) : GET_MODE (src);
>sets[i].mode = mode;
>  
> +  if (tmp_insn == NULL_RTX && src && dest && dest != pc_rtx
> +   && src != pc_rtx)
> + tmp_insn = make_insn_raw (gen_rtx_SET (copy_rtx (dest), copy_rtx(src)));

src and dest are always non-nil here.  I'll have to read the code better
to know about the (pc) stuff.

> @@ -5103,7 +5124,7 @@ cse_insn (rtx_insn *insn)
>   src_cost = src_regcost = -1;
> else
>   {
> -   src_cost = COST (src, mode);
> +   src_cost = COST_SRC (tmp_insn, src, mode);

I think you can just leave this as COST?


Segher


Re: [PATCH] Check if loading const from mem is faster

2022-03-02 Thread Jeff Law via Gcc-patches




On 3/1/2022 12:47 AM, Richard Biener via Gcc-patches wrote:

On Tue, 1 Mar 2022, Jiufu Guo wrote:


Segher Boessenkool  writes:


On Thu, Feb 24, 2022 at 09:50:28AM +0100, Richard Biener wrote:

On Thu, 24 Feb 2022, Jiufu Guo wrote:

And another thing as Segher pointed out, CSE is doing too
much work.  It may be ok to separate the constant handling
logic from CSE.

Not sure - CSE just is value numbering, I don't see that it does
more than that.  Yes, it might have developed "heuristics" over
the years what to CSE and to what and where to substitute and
where not.  But in the end it does just value numbering.

It also does various micro-optimisations, like all the CC things it
does.

It is not very good at doing the CSE job, but it cannot easily be
replaced by a better implementation because it does many other small
optimisations (that are not done elsewhere).


Thanks a lot for these comments! I'm also wondering if we would
rewrite this cse.cc or refactor it in some aspects.

I think time is better spent elsewhere ... I don't think CSE is as
bad as Segher depicts it - it might do "CC things" and other bits
but in the end that's going to be instruction/expression combination
things that "fit" likely because a value lattice (or just nonzero bits
in the cselib variant) existed.

So what might be interesting would be to work towards cleansing
CSE of those, producing testcases and making sure a better fit
pass (combine? fwprop? compare-elim?) performs the desired
optimization.

But I'm not really sure what Segher is talking about - I suppose
it must be magic done inside cselib (which only does analysis),
not in cse.cc itself.
I also don't see cse.cc has being *that* bad. It's largely the same 
bits as we had before SSA and I wouldn't be surprised if there are 
things in there that aren't needed anymore.


For example, while we have excised HAVE_cc0, I think there are still 
remnants of those concepts lying around (see this_insn_cc0 and friends).


I'd always hoped we'd get to a point where we could eliminate the 
follow-jumps and cse-around-loops bits, but we never managed to 
accomplish that.  When I last looked (a long long time ago), there were 
still things that got exposed when we lowered from gimple to RTL that 
were picked up by those options.   Changing to work  with dominators to 
find EBBs would be a nice cleanup, but the code as it stands right now 
works.


I wouldn't be surprised if the costing stuff could use some serious 
cleanup.  But I don't see it as inherently broken, just very dated.


But in the end, it's really just value-numbering as you say.  It could 
be rewritten to be more modern, but I doubt it's going to make much of a 
real difference in the end.  If there are things that fit logically 
elsehwere, sure move them where they more logically fit, but I doubt 
there's a lot of this stuff.


jeff



Re: [PATCH] Check if loading const from mem is faster

2022-03-01 Thread Jiufu Guo via Gcc-patches
Segher Boessenkool  writes:

Hi!

> Hi!
>
> On Thu, Feb 24, 2022 at 03:48:54PM +0800, Jiufu Guo wrote:
>> Segher Boessenkool  writes:
>> > That is the problem yes.  You need insns to call insn_cost on.  You can
>> > look in combine.c:combine_validate_cost to see how this can be done; but
>> > you need to have some code to generate in the first place, and for CSE
>> > it isn't always clear what code to generate, it really is based on RTL
>> > expressions having a cost.
>> 
>> Hi Segher,
>> 
>> Thanks! combine_validate_cost is useful to help me on
>> evaluating the costs of several instructions or replacements.
>> 
>> As you pointed out, at CSE, it may not be clear to know what
>> extact insn sequences will be generated. Actually,  the same
>> issue also exists on RTL expression.  At CSE, it may not clear
>> the exact cost, since the real instructions maybe emitted in
>> very late passes.
>
> But there will be RTL insns already.  Those may not correspond 1-1 to
> the eventual machine insns (ideally they do most of the time though),
> but it should be possible to estimate pretty accurate costs for them.
>
> Costs are never exact anyway, it is only one number, while in reality
> there are many dimensions.
>
>> To get the accurate cost, we may analyze the constant in the
>> hook(insn_cost or rtx_cost) and estimate the possible final
>> instructions and then calculate the costs.
>
> Yes.
>
>> We discussed one idea: let the hook insn_cost accept
>> any interim instruction, and estimate the real instruction
>> base on the interim insn, and then return the estimated
>> costs.
>
> No.  insn_cost is only for correct, existing instructions, not for
> made-up nonsense.  I created insn_cost precisely to get away from that
> aspect of rtx_cost (and some other issues, like, it is incredibly hard
> and cumbersome to write a correct rtx_cost).

Thanks! The implementations of hook insn_cost are align with this
design, they are  checking insn's attributes and COSTS_N_INSNS.

One question on the speciall case: 
For instruction: "r119:DI=0x100803004101001"
Would we treat it as valid instruction?

A patch, which is attached the end of this mail, accepts
"r119:DI=0x100803004101001" as input of insn_cost.
In this patch, 
- A tmp instruction is generated via make_insn_raw.
- A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
- In hook of insn_cost, checking the special 'constant' instruction.
Are these make sense?

>
>> For example: input insn "r119:DI=0x100803004101001" to
>> insn_cost; and in rs6000_insn_cost (for ppc), analyze
>> constant "0x100803004101001" which would need 5 insns;
>> then rs6000_insn_cost sumarize the cost of 5 insns.
>> 
>> A minor concern: because we know that reading this
>> constant from the pool is faster than building it by insns,
>> we will generate instructions to load constant from the pool
>> finally, do not emit 5 real instructions to build the value.
>> So, we are more interested in if it is faster to load from
>> pool or not.
>
> That is one reason why it is better to generate (close to) machine
> insns as early as possible: it makes it much easier to estimate
> realistic costs.  (Another important reason is it allows other
> optimisations, without us having to do any work for it!)
Get it!  In the middle of an optimization pass, 'interim'
instruction maybe acceptable.  While it would better to outputs
only contains 'valid machine insn' from any RTL passes.

BR,
Jiufu
>
>
> Segher


diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index d7a7cfe860f..1184e6ad65b 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -22131,6 +22131,16 @@ rs6000_debug_rtx_costs (rtx x, machine_mode mode, int 
outer_code,
 static int
 rs6000_insn_cost (rtx_insn *insn, bool speed)
 {
+  /* Handle special 'constant int' insn. */
+  rtx set = PATTERN (insn);
+  if (GET_CODE (set) == SET && CONSTANT_P (SET_SRC (set)))
+{
+  rtx src = SET_SRC (set);
+  machine_mode mode = GET_MODE (SET_DEST (set));
+  if (CONST_INT_P (src) || CONST_WIDE_INT_P (src))
+   return COSTS_N_INSNS (num_insns_constant (src, mode));
+}
+  
   if (recog_memoized (insn) < 0)
 return 0;
 
diff --git a/gcc/cse.cc b/gcc/cse.cc
index a18b599d324..4705ad82084 100644
--- a/gcc/cse.cc
+++ b/gcc/cse.cc
@@ -456,6 +456,8 @@ struct table_elt
   (REG_P (X) ? 0 : notreg_cost (X, MODE, SET, 1))
 #define COST_IN(X, MODE, OUTER, OPNO)  \
   (REG_P (X) ? 0 : notreg_cost (X, MODE, OUTER, OPNO))
+#define COST_SRC(I, X, MODE)   \
+  (REG_P (X) ? 0 : notreg_cost (X, MODE, SET, 1, I))
 
 /* Get the number of times this register has been updated in this
basic block.  */
@@ -523,7 +525,8 @@ static bitmap cse_ebb_live_in, cse_ebb_live_out;
 static sbitmap cse_visited_basic_blocks;
 
 static bool fixed_base_plus_p (rtx x);
-static int notreg_cost (rtx, machine_mode, enum rtx_code, int);
+static int notreg_cost 

Re: [PATCH] Check if loading const from mem is faster

2022-03-01 Thread Jiufu Guo via Gcc-patches
Richard Biener  writes:

> On Tue, 1 Mar 2022, Jiufu Guo wrote:
>
>> Segher Boessenkool  writes:
>> 
>> > On Thu, Feb 24, 2022 at 09:50:28AM +0100, Richard Biener wrote:
>> >> On Thu, 24 Feb 2022, Jiufu Guo wrote:
>> >> > And another thing as Segher pointed out, CSE is doing too
>> >> > much work.  It may be ok to separate the constant handling
>> >> > logic from CSE.
>> >> 
>> >> Not sure - CSE just is value numbering, I don't see that it does
>> >> more than that.  Yes, it might have developed "heuristics" over
>> >> the years what to CSE and to what and where to substitute and
>> >> where not.  But in the end it does just value numbering.
>> >
>> > It also does various micro-optimisations, like all the CC things it
>> > does.
>> >
>> > It is not very good at doing the CSE job, but it cannot easily be
>> > replaced by a better implementation because it does many other small
>> > optimisations (that are not done elsewhere).
>> >
>> 
>> Thanks a lot for these comments! I'm also wondering if we would
>> rewrite this cse.cc or refactor it in some aspects.
>
> I think time is better spent elsewhere ... I don't think CSE is as
> bad as Segher depicts it - it might do "CC things" and other bits
> but in the end that's going to be instruction/expression combination
> things that "fit" likely because a value lattice (or just nonzero bits
> in the cselib variant) existed.
>
> So what might be interesting would be to work towards cleansing
> CSE of those, producing testcases and making sure a better fit
> pass (combine? fwprop? compare-elim?) performs the desired
> optimization.
Hi Richard,

I'm not sure. I guess, those micro-optimizations may be also the
the things which could be move to other pass, like fwprop/combining...
Segher would correct me. :-)
cse.cc contains the code to take care of constant/ const_anchors,
expr folding, condition code/jump...

BR,
Jiufu 

>
> But I'm not really sure what Segher is talking about - I suppose
> it must be magic done inside cselib (which only does analysis),
> not in cse.cc itself.
>
> Richard.
>
>> BR,
>> Jiufu
>> 
>> >
>> > Segher
>> 


Re: [PATCH] Check if loading const from mem is faster

2022-02-28 Thread Richard Biener via Gcc-patches
On Tue, 1 Mar 2022, Jiufu Guo wrote:

> Segher Boessenkool  writes:
> 
> > On Thu, Feb 24, 2022 at 09:50:28AM +0100, Richard Biener wrote:
> >> On Thu, 24 Feb 2022, Jiufu Guo wrote:
> >> > And another thing as Segher pointed out, CSE is doing too
> >> > much work.  It may be ok to separate the constant handling
> >> > logic from CSE.
> >> 
> >> Not sure - CSE just is value numbering, I don't see that it does
> >> more than that.  Yes, it might have developed "heuristics" over
> >> the years what to CSE and to what and where to substitute and
> >> where not.  But in the end it does just value numbering.
> >
> > It also does various micro-optimisations, like all the CC things it
> > does.
> >
> > It is not very good at doing the CSE job, but it cannot easily be
> > replaced by a better implementation because it does many other small
> > optimisations (that are not done elsewhere).
> >
> 
> Thanks a lot for these comments! I'm also wondering if we would
> rewrite this cse.cc or refactor it in some aspects.

I think time is better spent elsewhere ... I don't think CSE is as
bad as Segher depicts it - it might do "CC things" and other bits
but in the end that's going to be instruction/expression combination
things that "fit" likely because a value lattice (or just nonzero bits
in the cselib variant) existed.

So what might be interesting would be to work towards cleansing
CSE of those, producing testcases and making sure a better fit
pass (combine? fwprop? compare-elim?) performs the desired
optimization.

But I'm not really sure what Segher is talking about - I suppose
it must be magic done inside cselib (which only does analysis),
not in cse.cc itself.

Richard.

> BR,
> Jiufu
> 
> >
> > Segher
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)


Re: [PATCH] Check if loading const from mem is faster

2022-02-28 Thread Jiufu Guo via Gcc-patches
Segher Boessenkool  writes:

> On Thu, Feb 24, 2022 at 09:50:28AM +0100, Richard Biener wrote:
>> On Thu, 24 Feb 2022, Jiufu Guo wrote:
>> > And another thing as Segher pointed out, CSE is doing too
>> > much work.  It may be ok to separate the constant handling
>> > logic from CSE.
>> 
>> Not sure - CSE just is value numbering, I don't see that it does
>> more than that.  Yes, it might have developed "heuristics" over
>> the years what to CSE and to what and where to substitute and
>> where not.  But in the end it does just value numbering.
>
> It also does various micro-optimisations, like all the CC things it
> does.
>
> It is not very good at doing the CSE job, but it cannot easily be
> replaced by a better implementation because it does many other small
> optimisations (that are not done elsewhere).
>

Thanks a lot for these comments! I'm also wondering if we would
rewrite this cse.cc or refactor it in some aspects.

BR,
Jiufu

>
> Segher


Re: [PATCH] Check if loading const from mem is faster

2022-02-28 Thread Segher Boessenkool
On Thu, Feb 24, 2022 at 09:50:28AM +0100, Richard Biener wrote:
> On Thu, 24 Feb 2022, Jiufu Guo wrote:
> > And another thing as Segher pointed out, CSE is doing too
> > much work.  It may be ok to separate the constant handling
> > logic from CSE.
> 
> Not sure - CSE just is value numbering, I don't see that it does
> more than that.  Yes, it might have developed "heuristics" over
> the years what to CSE and to what and where to substitute and
> where not.  But in the end it does just value numbering.

It also does various micro-optimisations, like all the CC things it
does.

It is not very good at doing the CSE job, but it cannot easily be
replaced by a better implementation because it does many other small
optimisations (that are not done elsewhere).


Segher


Re: [PATCH] Check if loading const from mem is faster

2022-02-28 Thread Segher Boessenkool
Hi!

On Thu, Feb 24, 2022 at 03:48:54PM +0800, Jiufu Guo wrote:
> Segher Boessenkool  writes:
> > That is the problem yes.  You need insns to call insn_cost on.  You can
> > look in combine.c:combine_validate_cost to see how this can be done; but
> > you need to have some code to generate in the first place, and for CSE
> > it isn't always clear what code to generate, it really is based on RTL
> > expressions having a cost.
> 
> Hi Segher,
> 
> Thanks! combine_validate_cost is useful to help me on
> evaluating the costs of several instructions or replacements.
> 
> As you pointed out, at CSE, it may not be clear to know what
> extact insn sequences will be generated. Actually,  the same
> issue also exists on RTL expression.  At CSE, it may not clear
> the exact cost, since the real instructions maybe emitted in
> very late passes.

But there will be RTL insns already.  Those may not correspond 1-1 to
the eventual machine insns (ideally they do most of the time though),
but it should be possible to estimate pretty accurate costs for them.

Costs are never exact anyway, it is only one number, while in reality
there are many dimensions.

> To get the accurate cost, we may analyze the constant in the
> hook(insn_cost or rtx_cost) and estimate the possible final
> instructions and then calculate the costs.

Yes.

> We discussed one idea: let the hook insn_cost accept
> any interim instruction, and estimate the real instruction
> base on the interim insn, and then return the estimated
> costs.

No.  insn_cost is only for correct, existing instructions, not for
made-up nonsense.  I created insn_cost precisely to get away from that
aspect of rtx_cost (and some other issues, like, it is incredibly hard
and cumbersome to write a correct rtx_cost).

> For example: input insn "r119:DI=0x100803004101001" to
> insn_cost; and in rs6000_insn_cost (for ppc), analyze
> constant "0x100803004101001" which would need 5 insns;
> then rs6000_insn_cost sumarize the cost of 5 insns.
> 
> A minor concern: because we know that reading this
> constant from the pool is faster than building it by insns,
> we will generate instructions to load constant from the pool
> finally, do not emit 5 real instructions to build the value.
> So, we are more interested in if it is faster to load from
> pool or not.

That is one reason why it is better to generate (close to) machine
insns as early as possible: it makes it much easier to estimate
realistic costs.  (Another important reason is it allows other
optimisations, without us having to do any work for it!)


Segher


Re: [PATCH] Check if loading const from mem is faster

2022-02-28 Thread Jiufu Guo via Gcc-patches
Richard Biener  writes:

> On Fri, 25 Feb 2022, Jiufu Guo wrote:
>
>> Richard Biener  writes:
>> 
>> > On Fri, 25 Feb 2022, Jiufu Guo wrote:
>> >
>> >> Richard Biener  writes:
>> >> 
>> >> > On Thu, 24 Feb 2022, Jiufu Guo wrote:
>> >> >
>> >> >> Jiufu Guo via Gcc-patches  writes:
>> >> >> 
>> >> >> > Segher Boessenkool  writes:
>> >> >> >
>> >> >> >> On Wed, Feb 23, 2022 at 02:02:59PM +0100, Richard Biener wrote:
>> >> >> >>> I'm assuming we're always dealing with
>> >> >> >>> 
>> >> >> >>>   (set (reg:MODE ..) )
>> >> >> >>> 
>> >> >> >>> here and CSE is not substituting into random places of an
>> >> >> >>> instruction(?).  I don't know what 'rtx_cost' should evaluate
>> >> >> >>> to for a constant, if it should implicitely evaluate the cost
>> >> >> >>> of putting the result into a register for example.
>> >> >> >>
>> >> >> >> rtx_cost is no good here (and in most places).  rtx_cost should be 0
>> >> >> >> for anything that is used as input in a machine instruction -- but 
>> >> >> >> you
>> >> >> >> need much more context to determine that.  insn_cost is much 
>> >> >> >> simpler and
>> >> >> >> much easier to use.
>> >> >> >>
>> >> >> >>> Using RTX_COST with SET and 1 at least looks no worse than using
>> >> >> >>> your proposed new target hook and comparing it with the original
>> >> >> >>> unfolded src (again with SET and 1).
>> >> >> >>
>> >> >> >> It is required to generate valid instructions no matter what, before
>> >> >> >> the pass has finished that is.  On all more modern architectures it 
>> >> >> >> is
>> >> >> >> futile to think you can usefully consider the cost of an RTL 
>> >> >> >> expression
>> >> >> >> and derive a real-world cost of the generated code from that.
>> >> >> >
>> >> >> > Thanks Segher for pointing out these!  Here is  another reason that I
>> >> >> > did not use rtx_cost: in a few passes, there are codes to check the
>> >> >> > constants and store them in constant pool.  I'm thinking to 
>> >> >> > integerate
>> >> >> > those codes in a consistent way.
>> >> >> 
>> >> >> Hi Segher, Richard!
>> >> >> 
>> >> >> I'm thinking the way like: For a constant,
>> >> >> 1. if the constant could be used as an immediate for the
>> >> >> instruction, then retreated as an operand;
>> >> >> 2. otherwise if the constant can not be stored into a
>> >> >> constant pool, then handle through instructions;
>> >> >> 3. if it is faster to access constant from pool, then emit
>> >> >> constant as data(.rodata);
>> >> >> 4. otherwise, handle the constant by instructions.
>> >> >> 
>> >> >> And to store the constant into a pool, besides force_const_mem,
>> >> >> create reference (toc) may be needed on some platforms.
>> >> >> 
>> >> >> For this particular issue in CSE, there is already code that
>> >> >> tries to put constant into a pool (invoke force_const_mem).
>> >> >> While the code is too late.  So, we may check the constant
>> >> >> earlier and store it into constant pool if profitable.
>> >> >> 
>> >> >> And another thing as Segher pointed out, CSE is doing too
>> >> >> much work.  It may be ok to separate the constant handling
>> >> >> logic from CSE.
>> >> >
>> >> > Not sure - CSE just is value numbering, I don't see that it does
>> >> > more than that.  Yes, it might have developed "heuristics" over
>> >> > the years what to CSE and to what and where to substitute and
>> >> > where not.  But in the end it does just value numbering.
>> >> >
>> >> >> 
>> >> >> I update a new version patch as follow (did not seprate CSE):
>> >> >
>> >> > How is the new target hook better in any way compared to rtx_cost
>> >> > or insn_cost?  It looks like a total hack.
>> >> >
>> >> > I suppose the actual way of materializing a constant is done
>> >> > behind GCCs back and not exposed anywhere?  But instead you
>> >> > claim the constants are valid when they actually are not?
>> >> > Isn't the problem then that the rs6000 backend lies?
>> >> 
>> >> Hi Richard,
>> >> 
>> >> Thanks for your comments and sugguestions!
>> >> 
>> >> Materializing a constant should be done behind GCC.
>> >> On rs6000, in expand pass, during emit_move, the constant is
>> >> checked and store into constant pool if necessary.
>> >> Some other platforms are doing a similar thing, e.g.
>> >> ix86_expand_vector_move, alpha_expand_mov,...
>> >> mips_legitimize_const_move.
>> >> 
>> >> But, it does not as we expect, force_const_mem is also 
>> >> exposed other places (not only ira/reload for stack reference).
>> >> 
>> >> CSE is one place, for example, CSE first retrieve the constant
>> >> from insn's equal, but it also tries to put constant into
>> >> pool for some condition (the condition was introduced at
>> >> early age of cse.c, and it is rare to run into in trunk).
>> >> In some aspects, IMHO, this seems not a great work of CSE.
>> >> 
>> >> And this is how the 'invalid(or say slow)' constant occurs.
>> >> e.g.  before cse:
>> >> 7: r119:DI=[unspec[`*.LC0',%r2:DI] 47]
>> >>   REG_EQUAL 0x100803004101001
>> >> after cse: 

Re: [PATCH] Check if loading const from mem is faster

2022-02-25 Thread Richard Biener via Gcc-patches
On Fri, 25 Feb 2022, Jiufu Guo wrote:

> Richard Biener  writes:
> 
> > On Fri, 25 Feb 2022, Jiufu Guo wrote:
> >
> >> Richard Biener  writes:
> >> 
> >> > On Thu, 24 Feb 2022, Jiufu Guo wrote:
> >> >
> >> >> Jiufu Guo via Gcc-patches  writes:
> >> >> 
> >> >> > Segher Boessenkool  writes:
> >> >> >
> >> >> >> On Wed, Feb 23, 2022 at 02:02:59PM +0100, Richard Biener wrote:
> >> >> >>> I'm assuming we're always dealing with
> >> >> >>> 
> >> >> >>>   (set (reg:MODE ..) )
> >> >> >>> 
> >> >> >>> here and CSE is not substituting into random places of an
> >> >> >>> instruction(?).  I don't know what 'rtx_cost' should evaluate
> >> >> >>> to for a constant, if it should implicitely evaluate the cost
> >> >> >>> of putting the result into a register for example.
> >> >> >>
> >> >> >> rtx_cost is no good here (and in most places).  rtx_cost should be 0
> >> >> >> for anything that is used as input in a machine instruction -- but 
> >> >> >> you
> >> >> >> need much more context to determine that.  insn_cost is much simpler 
> >> >> >> and
> >> >> >> much easier to use.
> >> >> >>
> >> >> >>> Using RTX_COST with SET and 1 at least looks no worse than using
> >> >> >>> your proposed new target hook and comparing it with the original
> >> >> >>> unfolded src (again with SET and 1).
> >> >> >>
> >> >> >> It is required to generate valid instructions no matter what, before
> >> >> >> the pass has finished that is.  On all more modern architectures it 
> >> >> >> is
> >> >> >> futile to think you can usefully consider the cost of an RTL 
> >> >> >> expression
> >> >> >> and derive a real-world cost of the generated code from that.
> >> >> >
> >> >> > Thanks Segher for pointing out these!  Here is  another reason that I
> >> >> > did not use rtx_cost: in a few passes, there are codes to check the
> >> >> > constants and store them in constant pool.  I'm thinking to integerate
> >> >> > those codes in a consistent way.
> >> >> 
> >> >> Hi Segher, Richard!
> >> >> 
> >> >> I'm thinking the way like: For a constant,
> >> >> 1. if the constant could be used as an immediate for the
> >> >> instruction, then retreated as an operand;
> >> >> 2. otherwise if the constant can not be stored into a
> >> >> constant pool, then handle through instructions;
> >> >> 3. if it is faster to access constant from pool, then emit
> >> >> constant as data(.rodata);
> >> >> 4. otherwise, handle the constant by instructions.
> >> >> 
> >> >> And to store the constant into a pool, besides force_const_mem,
> >> >> create reference (toc) may be needed on some platforms.
> >> >> 
> >> >> For this particular issue in CSE, there is already code that
> >> >> tries to put constant into a pool (invoke force_const_mem).
> >> >> While the code is too late.  So, we may check the constant
> >> >> earlier and store it into constant pool if profitable.
> >> >> 
> >> >> And another thing as Segher pointed out, CSE is doing too
> >> >> much work.  It may be ok to separate the constant handling
> >> >> logic from CSE.
> >> >
> >> > Not sure - CSE just is value numbering, I don't see that it does
> >> > more than that.  Yes, it might have developed "heuristics" over
> >> > the years what to CSE and to what and where to substitute and
> >> > where not.  But in the end it does just value numbering.
> >> >
> >> >> 
> >> >> I update a new version patch as follow (did not seprate CSE):
> >> >
> >> > How is the new target hook better in any way compared to rtx_cost
> >> > or insn_cost?  It looks like a total hack.
> >> >
> >> > I suppose the actual way of materializing a constant is done
> >> > behind GCCs back and not exposed anywhere?  But instead you
> >> > claim the constants are valid when they actually are not?
> >> > Isn't the problem then that the rs6000 backend lies?
> >> 
> >> Hi Richard,
> >> 
> >> Thanks for your comments and sugguestions!
> >> 
> >> Materializing a constant should be done behind GCC.
> >> On rs6000, in expand pass, during emit_move, the constant is
> >> checked and store into constant pool if necessary.
> >> Some other platforms are doing a similar thing, e.g.
> >> ix86_expand_vector_move, alpha_expand_mov,...
> >> mips_legitimize_const_move.
> >> 
> >> But, it does not as we expect, force_const_mem is also 
> >> exposed other places (not only ira/reload for stack reference).
> >> 
> >> CSE is one place, for example, CSE first retrieve the constant
> >> from insn's equal, but it also tries to put constant into
> >> pool for some condition (the condition was introduced at
> >> early age of cse.c, and it is rare to run into in trunk).
> >> In some aspects, IMHO, this seems not a great work of CSE.
> >> 
> >> And this is how the 'invalid(or say slow)' constant occurs.
> >> e.g.  before cse:
> >> 7: r119:DI=[unspec[`*.LC0',%r2:DI] 47]
> >>   REG_EQUAL 0x100803004101001
> >> after cse: 
> >> 7: r119:DI=0x100803004101001
> >>   REG_EQUAL 0x100803004101001
> >> 
> >> As you pointed out: we can also avoid this transformation 

Re: [PATCH] Check if loading const from mem is faster

2022-02-25 Thread Jiufu Guo via Gcc-patches
Richard Biener  writes:

> On Fri, 25 Feb 2022, Jiufu Guo wrote:
>
>> Richard Biener  writes:
>> 
>> > On Thu, 24 Feb 2022, Jiufu Guo wrote:
>> >
>> >> Jiufu Guo via Gcc-patches  writes:
>> >> 
>> >> > Segher Boessenkool  writes:
>> >> >
>> >> >> On Wed, Feb 23, 2022 at 02:02:59PM +0100, Richard Biener wrote:
>> >> >>> I'm assuming we're always dealing with
>> >> >>> 
>> >> >>>   (set (reg:MODE ..) )
>> >> >>> 
>> >> >>> here and CSE is not substituting into random places of an
>> >> >>> instruction(?).  I don't know what 'rtx_cost' should evaluate
>> >> >>> to for a constant, if it should implicitely evaluate the cost
>> >> >>> of putting the result into a register for example.
>> >> >>
>> >> >> rtx_cost is no good here (and in most places).  rtx_cost should be 0
>> >> >> for anything that is used as input in a machine instruction -- but you
>> >> >> need much more context to determine that.  insn_cost is much simpler 
>> >> >> and
>> >> >> much easier to use.
>> >> >>
>> >> >>> Using RTX_COST with SET and 1 at least looks no worse than using
>> >> >>> your proposed new target hook and comparing it with the original
>> >> >>> unfolded src (again with SET and 1).
>> >> >>
>> >> >> It is required to generate valid instructions no matter what, before
>> >> >> the pass has finished that is.  On all more modern architectures it is
>> >> >> futile to think you can usefully consider the cost of an RTL expression
>> >> >> and derive a real-world cost of the generated code from that.
>> >> >
>> >> > Thanks Segher for pointing out these!  Here is  another reason that I
>> >> > did not use rtx_cost: in a few passes, there are codes to check the
>> >> > constants and store them in constant pool.  I'm thinking to integerate
>> >> > those codes in a consistent way.
>> >> 
>> >> Hi Segher, Richard!
>> >> 
>> >> I'm thinking the way like: For a constant,
>> >> 1. if the constant could be used as an immediate for the
>> >> instruction, then retreated as an operand;
>> >> 2. otherwise if the constant can not be stored into a
>> >> constant pool, then handle through instructions;
>> >> 3. if it is faster to access constant from pool, then emit
>> >> constant as data(.rodata);
>> >> 4. otherwise, handle the constant by instructions.
>> >> 
>> >> And to store the constant into a pool, besides force_const_mem,
>> >> create reference (toc) may be needed on some platforms.
>> >> 
>> >> For this particular issue in CSE, there is already code that
>> >> tries to put constant into a pool (invoke force_const_mem).
>> >> While the code is too late.  So, we may check the constant
>> >> earlier and store it into constant pool if profitable.
>> >> 
>> >> And another thing as Segher pointed out, CSE is doing too
>> >> much work.  It may be ok to separate the constant handling
>> >> logic from CSE.
>> >
>> > Not sure - CSE just is value numbering, I don't see that it does
>> > more than that.  Yes, it might have developed "heuristics" over
>> > the years what to CSE and to what and where to substitute and
>> > where not.  But in the end it does just value numbering.
>> >
>> >> 
>> >> I update a new version patch as follow (did not seprate CSE):
>> >
>> > How is the new target hook better in any way compared to rtx_cost
>> > or insn_cost?  It looks like a total hack.
>> >
>> > I suppose the actual way of materializing a constant is done
>> > behind GCCs back and not exposed anywhere?  But instead you
>> > claim the constants are valid when they actually are not?
>> > Isn't the problem then that the rs6000 backend lies?
>> 
>> Hi Richard,
>> 
>> Thanks for your comments and sugguestions!
>> 
>> Materializing a constant should be done behind GCC.
>> On rs6000, in expand pass, during emit_move, the constant is
>> checked and store into constant pool if necessary.
>> Some other platforms are doing a similar thing, e.g.
>> ix86_expand_vector_move, alpha_expand_mov,...
>> mips_legitimize_const_move.
>> 
>> But, it does not as we expect, force_const_mem is also 
>> exposed other places (not only ira/reload for stack reference).
>> 
>> CSE is one place, for example, CSE first retrieve the constant
>> from insn's equal, but it also tries to put constant into
>> pool for some condition (the condition was introduced at
>> early age of cse.c, and it is rare to run into in trunk).
>> In some aspects, IMHO, this seems not a great work of CSE.
>> 
>> And this is how the 'invalid(or say slow)' constant occurs.
>> e.g.  before cse:
>> 7: r119:DI=[unspec[`*.LC0',%r2:DI] 47]
>>   REG_EQUAL 0x100803004101001
>> after cse: 
>> 7: r119:DI=0x100803004101001
>>   REG_EQUAL 0x100803004101001
>> 
>> As you pointed out: we can also avoid this transformation through
>> rtx_cost/insn_cost by estimating the COST more accurately for
>>  "r119:DI=0x100803004101001". (r119:DI=0x100803004101001 will not
>> be a real final instruction.)
>
> At which point does this become the final instruction?  I suppose
> CSE belives this constant is legitimate and the 

Re: [PATCH] Check if loading const from mem is faster

2022-02-25 Thread Richard Biener via Gcc-patches
On Fri, 25 Feb 2022, Jiufu Guo wrote:

> Richard Biener  writes:
> 
> > On Thu, 24 Feb 2022, Jiufu Guo wrote:
> >
> >> Jiufu Guo via Gcc-patches  writes:
> >> 
> >> > Segher Boessenkool  writes:
> >> >
> >> >> On Wed, Feb 23, 2022 at 02:02:59PM +0100, Richard Biener wrote:
> >> >>> I'm assuming we're always dealing with
> >> >>> 
> >> >>>   (set (reg:MODE ..) )
> >> >>> 
> >> >>> here and CSE is not substituting into random places of an
> >> >>> instruction(?).  I don't know what 'rtx_cost' should evaluate
> >> >>> to for a constant, if it should implicitely evaluate the cost
> >> >>> of putting the result into a register for example.
> >> >>
> >> >> rtx_cost is no good here (and in most places).  rtx_cost should be 0
> >> >> for anything that is used as input in a machine instruction -- but you
> >> >> need much more context to determine that.  insn_cost is much simpler and
> >> >> much easier to use.
> >> >>
> >> >>> Using RTX_COST with SET and 1 at least looks no worse than using
> >> >>> your proposed new target hook and comparing it with the original
> >> >>> unfolded src (again with SET and 1).
> >> >>
> >> >> It is required to generate valid instructions no matter what, before
> >> >> the pass has finished that is.  On all more modern architectures it is
> >> >> futile to think you can usefully consider the cost of an RTL expression
> >> >> and derive a real-world cost of the generated code from that.
> >> >
> >> > Thanks Segher for pointing out these!  Here is  another reason that I
> >> > did not use rtx_cost: in a few passes, there are codes to check the
> >> > constants and store them in constant pool.  I'm thinking to integerate
> >> > those codes in a consistent way.
> >> 
> >> Hi Segher, Richard!
> >> 
> >> I'm thinking the way like: For a constant,
> >> 1. if the constant could be used as an immediate for the
> >> instruction, then retreated as an operand;
> >> 2. otherwise if the constant can not be stored into a
> >> constant pool, then handle through instructions;
> >> 3. if it is faster to access constant from pool, then emit
> >> constant as data(.rodata);
> >> 4. otherwise, handle the constant by instructions.
> >> 
> >> And to store the constant into a pool, besides force_const_mem,
> >> create reference (toc) may be needed on some platforms.
> >> 
> >> For this particular issue in CSE, there is already code that
> >> tries to put constant into a pool (invoke force_const_mem).
> >> While the code is too late.  So, we may check the constant
> >> earlier and store it into constant pool if profitable.
> >> 
> >> And another thing as Segher pointed out, CSE is doing too
> >> much work.  It may be ok to separate the constant handling
> >> logic from CSE.
> >
> > Not sure - CSE just is value numbering, I don't see that it does
> > more than that.  Yes, it might have developed "heuristics" over
> > the years what to CSE and to what and where to substitute and
> > where not.  But in the end it does just value numbering.
> >
> >> 
> >> I update a new version patch as follow (did not seprate CSE):
> >
> > How is the new target hook better in any way compared to rtx_cost
> > or insn_cost?  It looks like a total hack.
> >
> > I suppose the actual way of materializing a constant is done
> > behind GCCs back and not exposed anywhere?  But instead you
> > claim the constants are valid when they actually are not?
> > Isn't the problem then that the rs6000 backend lies?
> 
> Hi Richard,
> 
> Thanks for your comments and sugguestions!
> 
> Materializing a constant should be done behind GCC.
> On rs6000, in expand pass, during emit_move, the constant is
> checked and store into constant pool if necessary.
> Some other platforms are doing a similar thing, e.g.
> ix86_expand_vector_move, alpha_expand_mov,...
> mips_legitimize_const_move.
> 
> But, it does not as we expect, force_const_mem is also 
> exposed other places (not only ira/reload for stack reference).
> 
> CSE is one place, for example, CSE first retrieve the constant
> from insn's equal, but it also tries to put constant into
> pool for some condition (the condition was introduced at
> early age of cse.c, and it is rare to run into in trunk).
> In some aspects, IMHO, this seems not a great work of CSE.
> 
> And this is how the 'invalid(or say slow)' constant occurs.
> e.g.  before cse:
> 7: r119:DI=[unspec[`*.LC0',%r2:DI] 47]
>   REG_EQUAL 0x100803004101001
> after cse: 
> 7: r119:DI=0x100803004101001
>   REG_EQUAL 0x100803004101001
> 
> As you pointed out: we can also avoid this transformation through
> rtx_cost/insn_cost by estimating the COST more accurately for
>  "r119:DI=0x100803004101001". (r119:DI=0x100803004101001 will not
> be a real final instruction.)

At which point does this become the final instruction?  I suppose
CSE belives this constant is legitimate and the insn is recognized
just fine in CSE?  (that's what I mean with "behind GCCs back")

> Is it necessary to refine this constant handling for CSE, or even
> 

Re: [PATCH] Check if loading const from mem is faster

2022-02-24 Thread Jiufu Guo via Gcc-patches
Richard Biener  writes:

> On Thu, 24 Feb 2022, Jiufu Guo wrote:
>
>> Jiufu Guo via Gcc-patches  writes:
>> 
>> > Segher Boessenkool  writes:
>> >
>> >> On Wed, Feb 23, 2022 at 02:02:59PM +0100, Richard Biener wrote:
>> >>> I'm assuming we're always dealing with
>> >>> 
>> >>>   (set (reg:MODE ..) )
>> >>> 
>> >>> here and CSE is not substituting into random places of an
>> >>> instruction(?).  I don't know what 'rtx_cost' should evaluate
>> >>> to for a constant, if it should implicitely evaluate the cost
>> >>> of putting the result into a register for example.
>> >>
>> >> rtx_cost is no good here (and in most places).  rtx_cost should be 0
>> >> for anything that is used as input in a machine instruction -- but you
>> >> need much more context to determine that.  insn_cost is much simpler and
>> >> much easier to use.
>> >>
>> >>> Using RTX_COST with SET and 1 at least looks no worse than using
>> >>> your proposed new target hook and comparing it with the original
>> >>> unfolded src (again with SET and 1).
>> >>
>> >> It is required to generate valid instructions no matter what, before
>> >> the pass has finished that is.  On all more modern architectures it is
>> >> futile to think you can usefully consider the cost of an RTL expression
>> >> and derive a real-world cost of the generated code from that.
>> >
>> > Thanks Segher for pointing out these!  Here is  another reason that I
>> > did not use rtx_cost: in a few passes, there are codes to check the
>> > constants and store them in constant pool.  I'm thinking to integerate
>> > those codes in a consistent way.
>> 
>> Hi Segher, Richard!
>> 
>> I'm thinking the way like: For a constant,
>> 1. if the constant could be used as an immediate for the
>> instruction, then retreated as an operand;
>> 2. otherwise if the constant can not be stored into a
>> constant pool, then handle through instructions;
>> 3. if it is faster to access constant from pool, then emit
>> constant as data(.rodata);
>> 4. otherwise, handle the constant by instructions.
>> 
>> And to store the constant into a pool, besides force_const_mem,
>> create reference (toc) may be needed on some platforms.
>> 
>> For this particular issue in CSE, there is already code that
>> tries to put constant into a pool (invoke force_const_mem).
>> While the code is too late.  So, we may check the constant
>> earlier and store it into constant pool if profitable.
>> 
>> And another thing as Segher pointed out, CSE is doing too
>> much work.  It may be ok to separate the constant handling
>> logic from CSE.
>
> Not sure - CSE just is value numbering, I don't see that it does
> more than that.  Yes, it might have developed "heuristics" over
> the years what to CSE and to what and where to substitute and
> where not.  But in the end it does just value numbering.
>
>> 
>> I update a new version patch as follow (did not seprate CSE):
>
> How is the new target hook better in any way compared to rtx_cost
> or insn_cost?  It looks like a total hack.
>
> I suppose the actual way of materializing a constant is done
> behind GCCs back and not exposed anywhere?  But instead you
> claim the constants are valid when they actually are not?
> Isn't the problem then that the rs6000 backend lies?

Hi Richard,

Thanks for your comments and sugguestions!

Materializing a constant should be done behind GCC.
On rs6000, in expand pass, during emit_move, the constant is
checked and store into constant pool if necessary.
Some other platforms are doing a similar thing, e.g.
ix86_expand_vector_move, alpha_expand_mov,...
mips_legitimize_const_move.

But, it does not as we expect, force_const_mem is also 
exposed other places (not only ira/reload for stack reference).

CSE is one place, for example, CSE first retrieve the constant
from insn's equal, but it also tries to put constant into
pool for some condition (the condition was introduced at
early age of cse.c, and it is rare to run into in trunk).
In some aspects, IMHO, this seems not a great work of CSE.

And this is how the 'invalid(or say slow)' constant occurs.
e.g.  before cse:
7: r119:DI=[unspec[`*.LC0',%r2:DI] 47]
  REG_EQUAL 0x100803004101001
after cse: 
7: r119:DI=0x100803004101001
  REG_EQUAL 0x100803004101001

As you pointed out: we can also avoid this transformation through
rtx_cost/insn_cost by estimating the COST more accurately for
 "r119:DI=0x100803004101001". (r119:DI=0x100803004101001 will not
be a real final instruction.)

Is it necessary to refine this constant handling for CSE, or even
to eliminate the logic on constant extracting for an insn, beside
updating rtx_cost/insn_cost?
Any sugguestions?

>
> Btw, all of this is of course not appropriate for stage4 and changes
> to CSE need testing on more than one target.
I would do more evaluation, thanks!

Jiufu

>
> Richard.
>
>> Thanks for the comments and suggestions again!
>> 
>> 
>> BR,
>> Jiufu
>> 
>> ---
>>  gcc/config/rs6000/rs6000.cc   | 39 ++-
>>  

Re: [PATCH] Check if loading const from mem is faster

2022-02-24 Thread Richard Biener via Gcc-patches
On Thu, 24 Feb 2022, Jiufu Guo wrote:

> Jiufu Guo via Gcc-patches  writes:
> 
> > Segher Boessenkool  writes:
> >
> >> On Wed, Feb 23, 2022 at 02:02:59PM +0100, Richard Biener wrote:
> >>> I'm assuming we're always dealing with
> >>> 
> >>>   (set (reg:MODE ..) )
> >>> 
> >>> here and CSE is not substituting into random places of an
> >>> instruction(?).  I don't know what 'rtx_cost' should evaluate
> >>> to for a constant, if it should implicitely evaluate the cost
> >>> of putting the result into a register for example.
> >>
> >> rtx_cost is no good here (and in most places).  rtx_cost should be 0
> >> for anything that is used as input in a machine instruction -- but you
> >> need much more context to determine that.  insn_cost is much simpler and
> >> much easier to use.
> >>
> >>> Using RTX_COST with SET and 1 at least looks no worse than using
> >>> your proposed new target hook and comparing it with the original
> >>> unfolded src (again with SET and 1).
> >>
> >> It is required to generate valid instructions no matter what, before
> >> the pass has finished that is.  On all more modern architectures it is
> >> futile to think you can usefully consider the cost of an RTL expression
> >> and derive a real-world cost of the generated code from that.
> >
> > Thanks Segher for pointing out these!  Here is  another reason that I
> > did not use rtx_cost: in a few passes, there are codes to check the
> > constants and store them in constant pool.  I'm thinking to integerate
> > those codes in a consistent way.
> 
> Hi Segher, Richard!
> 
> I'm thinking the way like: For a constant,
> 1. if the constant could be used as an immediate for the
> instruction, then retreated as an operand;
> 2. otherwise if the constant can not be stored into a
> constant pool, then handle through instructions;
> 3. if it is faster to access constant from pool, then emit
> constant as data(.rodata);
> 4. otherwise, handle the constant by instructions.
> 
> And to store the constant into a pool, besides force_const_mem,
> create reference (toc) may be needed on some platforms.
> 
> For this particular issue in CSE, there is already code that
> tries to put constant into a pool (invoke force_const_mem).
> While the code is too late.  So, we may check the constant
> earlier and store it into constant pool if profitable.
> 
> And another thing as Segher pointed out, CSE is doing too
> much work.  It may be ok to separate the constant handling
> logic from CSE.

Not sure - CSE just is value numbering, I don't see that it does
more than that.  Yes, it might have developed "heuristics" over
the years what to CSE and to what and where to substitute and
where not.  But in the end it does just value numbering.

> 
> I update a new version patch as follow (did not seprate CSE):

How is the new target hook better in any way compared to rtx_cost
or insn_cost?  It looks like a total hack.

I suppose the actual way of materializing a constant is done
behind GCCs back and not exposed anywhere?  But instead you
claim the constants are valid when they actually are not?
Isn't the problem then that the rs6000 backend lies?

Btw, all of this is of course not appropriate for stage4 and changes
to CSE need testing on more than one target.

Richard.

> Thanks for the comments and suggestions again!
> 
> 
> BR,
> Jiufu
> 
> ---
>  gcc/config/rs6000/rs6000.cc   | 39 ++-
>  gcc/cse.cc| 36 -
>  gcc/doc/tm.texi   |  5 +++
>  gcc/doc/tm.texi.in|  2 +
>  gcc/target.def|  8 
>  gcc/targhooks.cc  |  6 +++
>  gcc/targhooks.h   |  1 +
>  .../gcc.target/powerpc/medium_offset.c|  2 +-
>  gcc/testsuite/gcc.target/powerpc/pr63281.c| 14 +++
>  gcc/testsuite/gcc.target/powerpc/pr93012.c|  2 +-
>  10 files changed, 84 insertions(+), 31 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c
> 
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index d7a7cfe860f..0a8f487d516 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -1361,6 +1361,9 @@ static const struct attribute_spec 
> rs6000_attribute_table[] =
>  #undef TARGET_CANNOT_FORCE_CONST_MEM
>  #define TARGET_CANNOT_FORCE_CONST_MEM rs6000_cannot_force_const_mem
>  
> +#undef TARGET_FASTER_LOADING_CONSTANT
> +#define TARGET_FASTER_LOADING_CONSTANT rs6000_faster_loading_const
> +
>  #undef TARGET_DELEGITIMIZE_ADDRESS
>  #define TARGET_DELEGITIMIZE_ADDRESS rs6000_delegitimize_address
>  
> @@ -9684,8 +9687,8 @@ rs6000_init_stack_protect_guard (void)
>  static bool
>  rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
>  {
> -  if (GET_CODE (x) == HIGH
> -  && GET_CODE (XEXP (x, 0)) == UNSPEC)
> +  /* Exclude CONSTANT HIGH part.  */
> +  if (GET_CODE (x) == HIGH)
>  

Re: [PATCH] Check if loading const from mem is faster

2022-02-23 Thread Jiufu Guo via Gcc-patches
Segher Boessenkool  writes:

> On Wed, Feb 23, 2022 at 07:32:55PM +0800, guojiufu wrote:
>> >We already have TARGET_INSN_COST which you could ask for a cost.
>> >Like if we'd have a single_set then just temporarily substitute
>> >the RHS with the candidate and cost the insns and compare against
>> >the original insn cost.  So why exactly do you need a new hook
>> >for this particular situation?
>> 
>> Thanks for pointing out this! Segher also mentioned this before.
>> Currently, CSE is using rtx_cost. Using insn_cost to replace
>> rtx_cost would be a good idea for all necessary places including CSE.
>
> I have updated many places that used rtx_cost to use insn_cost instead,
> over the years (as a fallback the generic insn_cost will use rtx_cost).
> CSE is the biggest remaining thing.  There is a long tail left as well
> of course.
>
>> For this particular case: check the cost for constants.
>> I did not use insn_cost. Because to use insn_cost, we may need
>> to create a recognizable insn temporarily, and for some kind of
>> constants we may need to create a sequence instructions on some
>> platform, e.g. "li xx; ori ; sldi .." on ppc64, and check the
>> sum cost of those instructions. If only create one fake
>> instruction, the insn_cost may not return the accurate cost either.
>
> That is the problem yes.  You need insns to call insn_cost on.  You can
> look in combine.c:combine_validate_cost to see how this can be done; but
> you need to have some code to generate in the first place, and for CSE
> it isn't always clear what code to generate, it really is based on RTL
> expressions having a cost.

Hi Segher,

Thanks! combine_validate_cost is useful to help me on
evaluating the costs of several instructions or replacements.

As you pointed out, at CSE, it may not be clear to know what
extact insn sequences will be generated. Actually,  the same
issue also exists on RTL expression.  At CSE, it may not clear
the exact cost, since the real instructions maybe emitted in
very late passes.

To get the accurate cost, we may analyze the constant in the
hook(insn_cost or rtx_cost) and estimate the possible final
instructions and then calculate the costs.

We discussed one idea: let the hook insn_cost accept
any interim instruction, and estimate the real instruction
base on the interim insn, and then return the estimated
costs.

For example: input insn "r119:DI=0x100803004101001" to
insn_cost; and in rs6000_insn_cost (for ppc), analyze
constant "0x100803004101001" which would need 5 insns;
then rs6000_insn_cost sumarize the cost of 5 insns.

A minor concern: because we know that reading this
constant from the pool is faster than building it by insns,
we will generate instructions to load constant from the pool
finally, do not emit 5 real instructions to build the value.
So, we are more interested in if it is faster to load from
pool or not.


BR,
Jiufu 

>
>
> Segher


Re: [PATCH] Check if loading const from mem is faster

2022-02-23 Thread Jiufu Guo via Gcc-patches
Jiufu Guo via Gcc-patches  writes:

> Segher Boessenkool  writes:
>
>> On Wed, Feb 23, 2022 at 02:02:59PM +0100, Richard Biener wrote:
>>> I'm assuming we're always dealing with
>>> 
>>>   (set (reg:MODE ..) )
>>> 
>>> here and CSE is not substituting into random places of an
>>> instruction(?).  I don't know what 'rtx_cost' should evaluate
>>> to for a constant, if it should implicitely evaluate the cost
>>> of putting the result into a register for example.
>>
>> rtx_cost is no good here (and in most places).  rtx_cost should be 0
>> for anything that is used as input in a machine instruction -- but you
>> need much more context to determine that.  insn_cost is much simpler and
>> much easier to use.
>>
>>> Using RTX_COST with SET and 1 at least looks no worse than using
>>> your proposed new target hook and comparing it with the original
>>> unfolded src (again with SET and 1).
>>
>> It is required to generate valid instructions no matter what, before
>> the pass has finished that is.  On all more modern architectures it is
>> futile to think you can usefully consider the cost of an RTL expression
>> and derive a real-world cost of the generated code from that.
>
> Thanks Segher for pointing out these!  Here is  another reason that I
> did not use rtx_cost: in a few passes, there are codes to check the
> constants and store them in constant pool.  I'm thinking to integerate
> those codes in a consistent way.

Hi Segher, Richard!

I'm thinking the way like: For a constant,
1. if the constant could be used as an immediate for the
instruction, then retreated as an operand;
2. otherwise if the constant can not be stored into a
constant pool, then handle through instructions;
3. if it is faster to access constant from pool, then emit
constant as data(.rodata);
4. otherwise, handle the constant by instructions.

And to store the constant into a pool, besides force_const_mem,
create reference (toc) may be needed on some platforms.

For this particular issue in CSE, there is already code that
tries to put constant into a pool (invoke force_const_mem).
While the code is too late.  So, we may check the constant
earlier and store it into constant pool if profitable.

And another thing as Segher pointed out, CSE is doing too
much work.  It may be ok to separate the constant handling
logic from CSE.

I update a new version patch as follow (did not seprate CSE):

Thanks for the comments and suggestions again!


BR,
Jiufu

---
 gcc/config/rs6000/rs6000.cc   | 39 ++-
 gcc/cse.cc| 36 -
 gcc/doc/tm.texi   |  5 +++
 gcc/doc/tm.texi.in|  2 +
 gcc/target.def|  8 
 gcc/targhooks.cc  |  6 +++
 gcc/targhooks.h   |  1 +
 .../gcc.target/powerpc/medium_offset.c|  2 +-
 gcc/testsuite/gcc.target/powerpc/pr63281.c| 14 +++
 gcc/testsuite/gcc.target/powerpc/pr93012.c|  2 +-
 10 files changed, 84 insertions(+), 31 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c

diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index d7a7cfe860f..0a8f487d516 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -1361,6 +1361,9 @@ static const struct attribute_spec 
rs6000_attribute_table[] =
 #undef TARGET_CANNOT_FORCE_CONST_MEM
 #define TARGET_CANNOT_FORCE_CONST_MEM rs6000_cannot_force_const_mem
 
+#undef TARGET_FASTER_LOADING_CONSTANT
+#define TARGET_FASTER_LOADING_CONSTANT rs6000_faster_loading_const
+
 #undef TARGET_DELEGITIMIZE_ADDRESS
 #define TARGET_DELEGITIMIZE_ADDRESS rs6000_delegitimize_address
 
@@ -9684,8 +9687,8 @@ rs6000_init_stack_protect_guard (void)
 static bool
 rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
 {
-  if (GET_CODE (x) == HIGH
-  && GET_CODE (XEXP (x, 0)) == UNSPEC)
+  /* Exclude CONSTANT HIGH part.  */
+  if (GET_CODE (x) == HIGH)
 return true;
 
   /* A TLS symbol in the TOC cannot contain a sum.  */
@@ -10483,6 +10486,30 @@ rs6000_emit_move_si_sf_subreg (rtx dest, rtx source, 
machine_mode mode)
   return false;
 }
 
+
+/* Implement TARGET_FASTER_LOADING_CONSTANT.  */
+
+static bool
+rs6000_faster_loading_const (machine_mode mode, rtx dest, rtx src)
+{
+  gcc_assert (CONSTANT_P (src));
+
+  if (GET_MODE_CLASS (mode) == MODE_CC || mode == VOIDmode)
+return false;
+  if (GET_CODE (src) == HIGH)
+return false;
+  if (toc_relative_expr_p (src, false, NULL, NULL))
+return false;
+  if (rs6000_cannot_force_const_mem (mode, src))
+return false;
+
+  if (REG_P (dest) && FP_REGNO_P (REGNO (dest)))
+return true;
+  if (!CONST_INT_P (src))
+return true;
+  return num_insns_constant (src, mode) > (TARGET_PCREL ? 1 : 2);
+}
+
 /* Emit a move from SOURCE to DEST in mode MODE.  */
 void
 rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
@@ 

Re: [PATCH] Check if loading const from mem is faster

2022-02-23 Thread Jiufu Guo via Gcc-patches
Segher Boessenkool  writes:

> On Wed, Feb 23, 2022 at 02:02:59PM +0100, Richard Biener wrote:
>> I'm assuming we're always dealing with
>> 
>>   (set (reg:MODE ..) )
>> 
>> here and CSE is not substituting into random places of an
>> instruction(?).  I don't know what 'rtx_cost' should evaluate
>> to for a constant, if it should implicitely evaluate the cost
>> of putting the result into a register for example.
>
> rtx_cost is no good here (and in most places).  rtx_cost should be 0
> for anything that is used as input in a machine instruction -- but you
> need much more context to determine that.  insn_cost is much simpler and
> much easier to use.
>
>> Using RTX_COST with SET and 1 at least looks no worse than using
>> your proposed new target hook and comparing it with the original
>> unfolded src (again with SET and 1).
>
> It is required to generate valid instructions no matter what, before
> the pass has finished that is.  On all more modern architectures it is
> futile to think you can usefully consider the cost of an RTL expression
> and derive a real-world cost of the generated code from that.

Thanks Segher for pointing out these!  Here is  another reason that I
did not use rtx_cost: in a few passes, there are codes to check the
constants and store them in constant pool.  I'm thinking to integerate
those codes in a consistent way.


BR,
Jiufu

>
> But there is so much more wrong with cse.c :-(
>
>
> Segher


Re: [PATCH] Check if loading const from mem is faster

2022-02-23 Thread Segher Boessenkool
On Wed, Feb 23, 2022 at 07:32:55PM +0800, guojiufu wrote:
> >We already have TARGET_INSN_COST which you could ask for a cost.
> >Like if we'd have a single_set then just temporarily substitute
> >the RHS with the candidate and cost the insns and compare against
> >the original insn cost.  So why exactly do you need a new hook
> >for this particular situation?
> 
> Thanks for pointing out this! Segher also mentioned this before.
> Currently, CSE is using rtx_cost. Using insn_cost to replace
> rtx_cost would be a good idea for all necessary places including CSE.

I have updated many places that used rtx_cost to use insn_cost instead,
over the years (as a fallback the generic insn_cost will use rtx_cost).
CSE is the biggest remaining thing.  There is a long tail left as well
of course.

> For this particular case: check the cost for constants.
> I did not use insn_cost. Because to use insn_cost, we may need
> to create a recognizable insn temporarily, and for some kind of
> constants we may need to create a sequence instructions on some
> platform, e.g. "li xx; ori ; sldi .." on ppc64, and check the
> sum cost of those instructions. If only create one fake
> instruction, the insn_cost may not return the accurate cost either.

That is the problem yes.  You need insns to call insn_cost on.  You can
look in combine.c:combine_validate_cost to see how this can be done; but
you need to have some code to generate in the first place, and for CSE
it isn't always clear what code to generate, it really is based on RTL
expressions having a cost.


Segher


Re: [PATCH] Check if loading const from mem is faster

2022-02-23 Thread Segher Boessenkool
On Wed, Feb 23, 2022 at 02:02:59PM +0100, Richard Biener wrote:
> I'm assuming we're always dealing with
> 
>   (set (reg:MODE ..) )
> 
> here and CSE is not substituting into random places of an
> instruction(?).  I don't know what 'rtx_cost' should evaluate
> to for a constant, if it should implicitely evaluate the cost
> of putting the result into a register for example.

rtx_cost is no good here (and in most places).  rtx_cost should be 0
for anything that is used as input in a machine instruction -- but you
need much more context to determine that.  insn_cost is much simpler and
much easier to use.

> Using RTX_COST with SET and 1 at least looks no worse than using
> your proposed new target hook and comparing it with the original
> unfolded src (again with SET and 1).

It is required to generate valid instructions no matter what, before
the pass has finished that is.  On all more modern architectures it is
futile to think you can usefully consider the cost of an RTL expression
and derive a real-world cost of the generated code from that.

But there is so much more wrong with cse.c :-(


Segher


Re: [PATCH] Check if loading const from mem is faster

2022-02-23 Thread Richard Biener via Gcc-patches
On Wed, 23 Feb 2022, guojiufu wrote:

> 
> 
> On 2/22/22 PM3:26, Richard Biener wrote:
> > On Tue, 22 Feb 2022, Jiufu Guo wrote:
> > 
> >> Hi,
> >>
> >> For constants, there are some codes to check: if it is able to put
> >> to instruction as an immediate operand or it is profitable to load from
> >> mem.  There are still some places that could be improved for platforms.
> >>
> >> This patch could handle PR63281/57836.  This patch does not change
> >> too much on the code like force_const_mem and legitimate_constant_p.
> >> We may integrate these APIs for passes like expand/cse/combine
> >> as a whole solution in the future (maybe better for stage1?).
> >>
> >> Bootstrap and regtest pass on ppc64le and x86_64. Is this ok for trunk?
> >> Thanks for comments!
> > 
> > I'm not sure whether we need a new hook here, but iff, then I think
> > whether loading a constant (from memory?) is faster or not depends
> > on the context.  So what's the exact situation and the two variants
> > you are costing against each other?  I assume (since you are
> > touching CSE) you are costing
> 
> Hi Richard,
> 
> Thanks for your review!
> 
> In some contexts, it may be faster to load from memory for some
> constant value, and for some constant value, it would be faster
> to put into immediate of few (1 or 2) instructions.
> 
> For example 0x1234567812345678, on ppc64, we may need 3 instructions
> to build it, and then it would be better to put it in .rodata, and
> then load it from memory.
> 
> Currently, we already have hooks TARGET_CANNOT_FORCE_CONST_MEM and
> TARGET_LEGITIMATE_CONSTANT_P.
> 
> TARGET_CANNOT_FORCE_CONST_MEM is used to check if one 'rtx' can be
> store into the constant pool.
> On some targets (e.g. alpha), TARGET_LEGITIMATE_CONSTANT_P does the
> behavior like what we expect:
> 
> I once thought to use TARGET_LEGITIMATE_CONSTANT_P too.
> But in general, it seems this hook is designed to check if one
> 'rtx' could be used as an immediate instruction. This hook is used
> in RTL passes: ira/reload. It is also used in recog.cc and expr.cc.
> 
> In other words, I feel, whether putting a constant in the constant
> pool, we could check:
> - If TARGET_CANNOT_FORCE_CONST_MEM returns true, we should not put
> the 'constant' in the constant pool.
> - If TARGET_LEGITIMATE_CONSTANT_P returns true, then the 'constant'
> would be immediate of **one** instruction, and not put to constant
> pool.
> - If the new hook TARGET_FASTER_LOADING_CONSTANT returns true, then
> the 'constant' would be stored in the constant pool.
> Otherwise, it would be better to use an instructions-seq to build the
> 'constant'.
> This is why I introduce a new hook.

I agree TARGET_CANNOT_FORCE_CONST_MEM and TARGET_LEGITIMATE_CONSTANT_P
are not the correct vehicle for a cost based consideration, they
are used for correctness checks.

> We may also use the new hook at other places, e.g. expand/combining...
> where is calling force_const_mem.
> 
> Any suggestions?
> 
> > 
> >(set (...) (mem))  (before CSE)
> > 
> > against
> > 
> >(set (...) (immediate))  (what CSE does now)
> > 
> > vs.
> > 
> >(set (...) (mem))  (original, no CSE)
> > 
> > ?  With the new hook you are skipping _all_ of the following loops
> > logic which does look like a quite bad design and hack (not that
> > I am very familiar with the candidate / costing logic in there).
> 
> At cse_insn, in the following loop of the code, it is also testing
> the constant and try to put into memory:
> 
>   else if (crtl->uses_const_pool
>&& CONSTANT_P (trial)
>&& !CONST_INT_P (trial)
>&& (src_folded == 0 || !MEM_P (src_folded))
>&& GET_MODE_CLASS (mode) != MODE_CC
>&& mode != VOIDmode)
> {
>   src_folded = force_const_mem (mode, trial);
>   if (src_folded)
> {
>   src_folded_cost = COST (src_folded, mode);
>   src_folded_regcost = approx_reg_cost (src_folded);
> }
> }
> 
> This code is at the end of the loop, it would only be tested for
> the next iteration. It may be better to test "if need to put the
> constant into memory" for all iterations.
>
> The current patch is adding an additional test before the loop.
> I will update the patch to integrate these two places!

I'm assuming we're always dealing with

  (set (reg:MODE ..) )

here and CSE is not substituting into random places of an
instruction(?).  I don't know what 'rtx_cost' should evaluate
to for a constant, if it should implicitely evaluate the cost
of putting the result into a register for example.

The RTX_COST target hook at least has some context
(outer_code and opno) and COST passes SET and 1 here.

So why is adjusting the RTX_COST hook not enough then for this case?
Using RTX_COST with SET and 1 at least looks no worse than using
your proposed new target hook and comparing it with the original
unfolded 

Re: [PATCH] Check if loading const from mem is faster

2022-02-23 Thread guojiufu via Gcc-patches




On 2/22/22 PM3:26, Richard Biener wrote:

On Tue, 22 Feb 2022, Jiufu Guo wrote:


Hi,

For constants, there are some codes to check: if it is able to put
to instruction as an immediate operand or it is profitable to load from
mem.  There are still some places that could be improved for platforms.

This patch could handle PR63281/57836.  This patch does not change
too much on the code like force_const_mem and legitimate_constant_p.
We may integrate these APIs for passes like expand/cse/combine
as a whole solution in the future (maybe better for stage1?).

Bootstrap and regtest pass on ppc64le and x86_64. Is this ok for trunk?
Thanks for comments!


I'm not sure whether we need a new hook here, but iff, then I think
whether loading a constant (from memory?) is faster or not depends
on the context.  So what's the exact situation and the two variants
you are costing against each other?  I assume (since you are
touching CSE) you are costing


Hi Richard,

Thanks for your review!

In some contexts, it may be faster to load from memory for some
constant value, and for some constant value, it would be faster
to put into immediate of few (1 or 2) instructions.

For example 0x1234567812345678, on ppc64, we may need 3 instructions
to build it, and then it would be better to put it in .rodata, and
then load it from memory.

Currently, we already have hooks TARGET_CANNOT_FORCE_CONST_MEM and
TARGET_LEGITIMATE_CONSTANT_P.

TARGET_CANNOT_FORCE_CONST_MEM is used to check if one 'rtx' can be
store into the constant pool.
On some targets (e.g. alpha), TARGET_LEGITIMATE_CONSTANT_P does the
behavior like what we expect:

I once thought to use TARGET_LEGITIMATE_CONSTANT_P too.
But in general, it seems this hook is designed to check if one
'rtx' could be used as an immediate instruction. This hook is used
in RTL passes: ira/reload. It is also used in recog.cc and expr.cc.

In other words, I feel, whether putting a constant in the constant
pool, we could check:
- If TARGET_CANNOT_FORCE_CONST_MEM returns true, we should not put
the 'constant' in the constant pool.
- If TARGET_LEGITIMATE_CONSTANT_P returns true, then the 'constant'
would be immediate of **one** instruction, and not put to constant
pool.
- If the new hook TARGET_FASTER_LOADING_CONSTANT returns true, then
the 'constant' would be stored in the constant pool.
Otherwise, it would be better to use an instructions-seq to build the
'constant'.
This is why I introduce a new hook.

We may also use the new hook at other places, e.g. expand/combining...
where is calling force_const_mem.

Any suggestions?



   (set (...) (mem))  (before CSE)

against

   (set (...) (immediate))  (what CSE does now)

vs.

   (set (...) (mem))  (original, no CSE)

?  With the new hook you are skipping _all_ of the following loops
logic which does look like a quite bad design and hack (not that
I am very familiar with the candidate / costing logic in there).


At cse_insn, in the following loop of the code, it is also testing
the constant and try to put into memory:

  else if (crtl->uses_const_pool
   && CONSTANT_P (trial)
   && !CONST_INT_P (trial)
   && (src_folded == 0 || !MEM_P (src_folded))
   && GET_MODE_CLASS (mode) != MODE_CC
   && mode != VOIDmode)
{
  src_folded = force_const_mem (mode, trial);
  if (src_folded)
{
  src_folded_cost = COST (src_folded, mode);
  src_folded_regcost = approx_reg_cost (src_folded);
}
}

This code is at the end of the loop, it would only be tested for
the next iteration. It may be better to test "if need to put the
constant into memory" for all iterations.

The current patch is adding an additional test before the loop.
I will update the patch to integrate these two places!



We already have TARGET_INSN_COST which you could ask for a cost.
Like if we'd have a single_set then just temporarily substitute
the RHS with the candidate and cost the insns and compare against
the original insn cost.  So why exactly do you need a new hook
for this particular situation?


Thanks for pointing out this! Segher also mentioned this before.
Currently, CSE is using rtx_cost. Using insn_cost to replace
rtx_cost would be a good idea for all necessary places including CSE.

For this particular case: check the cost for constants.
I did not use insn_cost. Because to use insn_cost, we may need
to create a recognizable insn temporarily, and for some kind of
constants we may need to create a sequence instructions on some
platform, e.g. "li xx; ori ; sldi .." on ppc64, and check the
sum cost of those instructions. If only create one fake
instruction, the insn_cost may not return the accurate cost either.

BR,
Jiufu



Thanks,
Richard.




BR,
Jiufu

gcc/ChangeLog:

PR target/94393
PR rtl-optimization/63281
* config/rs6000/rs6000.cc 

Re: [PATCH] Check if loading const from mem is faster

2022-02-22 Thread guojiufu via Gcc-patches

On 2022-02-23 01:30, Segher Boessenkool wrote:

Hi Jiu Fu,

On Tue, Feb 22, 2022 at 02:53:13PM +0800, Jiufu Guo wrote:

 static bool
 rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, 
rtx x)

 {
-  if (GET_CODE (x) == HIGH
-  && GET_CODE (XEXP (x, 0)) == UNSPEC)
+  if (GET_CODE (x) == HIGH)
 return true;



Hi Segher,


This isn't explained anywhere.  "Update" is not enough ;-)
Thanks! I will add explanations for it.   This excludes all 'HIGH' for 
'x' code,

like function "rs6000_emit_move" also check if the code is 'HIGH'.

And on P10, I also encounter this kind of case like:
 (high:DI (symbol_ref:DI ("var_1") [flags 0xc0] var_1>))

Which fail to store into .rodata.




CSE is the pass that is most ancient and still causing problems left 
and

right.  It should be rewritten sooner rather than later.

The problem with that is that the pass does so much more than just CSE,
and we don't want to lose all those other things.  So it will be a slow
arduous affair of peeling off bits into separate passes, I think :-(


Yes, it does a lot of work. One of the additional works is checking 
'folding out

constants and putting constant in memory'.

BR,
Jiufu



Doing actual CSE without all the restrictive restrictions our pass has
historically had isn't the hard part!


Segher


Re: [PATCH] Check if loading const from mem is faster

2022-02-22 Thread Segher Boessenkool
Hi Jiu Fu,

On Tue, Feb 22, 2022 at 02:53:13PM +0800, Jiufu Guo wrote:
>  static bool
>  rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
>  {
> -  if (GET_CODE (x) == HIGH
> -  && GET_CODE (XEXP (x, 0)) == UNSPEC)
> +  if (GET_CODE (x) == HIGH)
>  return true;

This isn't explained anywhere.  "Update" is not enough ;-)


CSE is the pass that is most ancient and still causing problems left and
right.  It should be rewritten sooner rather than later.

The problem with that is that the pass does so much more than just CSE,
and we don't want to lose all those other things.  So it will be a slow
arduous affair of peeling off bits into separate passes, I think :-(

Doing actual CSE without all the restrictive restrictions our pass has
historically had isn't the hard part!


Segher


Re: [PATCH] Check if loading const from mem is faster

2022-02-21 Thread Richard Biener via Gcc-patches
On Tue, 22 Feb 2022, Jiufu Guo wrote:

> Hi,
> 
> For constants, there are some codes to check: if it is able to put
> to instruction as an immediate operand or it is profitable to load from
> mem.  There are still some places that could be improved for platforms.
> 
> This patch could handle PR63281/57836.  This patch does not change
> too much on the code like force_const_mem and legitimate_constant_p.
> We may integrate these APIs for passes like expand/cse/combine
> as a whole solution in the future (maybe better for stage1?).
> 
> Bootstrap and regtest pass on ppc64le and x86_64. Is this ok for trunk?
> Thanks for comments!

I'm not sure whether we need a new hook here, but iff, then I think
whether loading a constant (from memory?) is faster or not depends
on the context.  So what's the exact situation and the two variants
you are costing against each other?  I assume (since you are
touching CSE) you are costing

  (set (...) (mem))  (before CSE)

against

  (set (...) (immediate))  (what CSE does now)

vs.

  (set (...) (mem))  (original, no CSE)

?  With the new hook you are skipping _all_ of the following loops
logic which does look like a quite bad design and hack (not that
I am very familiar with the candidate / costing logic in there).

We already have TARGET_INSN_COST which you could ask for a cost.
Like if we'd have a single_set then just temporarily substitute
the RHS with the candidate and cost the insns and compare against
the original insn cost.  So why exactly do you need a new hook
for this particular situation?

Thanks,
Richard.


> 
> BR,
> Jiufu
> 
> gcc/ChangeLog:
> 
>   PR target/94393
>   PR rtl-optimization/63281
>   * config/rs6000/rs6000.cc (TARGET_FASTER_LOADING_CONSTANT):
>   New hook.
>   (rs6000_faster_loading_const): New hook.
>   (rs6000_cannot_force_const_mem): Update.
>   (rs6000_emit_move): Use rs6000_faster_loading_const.
>   * cse.cc (cse_insn): Use new hook.
>   * doc/tm.texi.in: New hook TARGET_FASTER_LOADING_CONSTANT.
>   * doc/tm.texi: Regenerate.
>   * target.def: New hook faster_loading_constant.
>   * targhooks.cc (default_faster_loading_constant): New.
>   * targhooks.h (default_faster_loading_constant): Declare it.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.target/powerpc/medium_offset.c: Update.
>   * gcc.target/powerpc/pr93012.c: Update.
>   * gcc.target/powerpc/pr63281.c: New test.
> 
> ---
>  gcc/config/rs6000/rs6000.cc   | 38 ++-
>  gcc/cse.cc|  5 +++
>  gcc/doc/tm.texi   |  5 +++
>  gcc/doc/tm.texi.in|  2 +
>  gcc/target.def|  8 
>  gcc/targhooks.cc  |  6 +++
>  .../gcc.target/powerpc/medium_offset.c|  2 +-
>  gcc/testsuite/gcc.target/powerpc/pr63281.c| 14 +++
>  gcc/testsuite/gcc.target/powerpc/pr93012.c|  2 +-
>  9 files changed, 71 insertions(+), 11 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c
> 
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index d7a7cfe860f..beb21a0bb2b 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -1361,6 +1361,9 @@ static const struct attribute_spec 
> rs6000_attribute_table[] =
>  #undef TARGET_CANNOT_FORCE_CONST_MEM
>  #define TARGET_CANNOT_FORCE_CONST_MEM rs6000_cannot_force_const_mem
>  
> +#undef TARGET_FASTER_LOADING_CONSTANT
> +#define TARGET_FASTER_LOADING_CONSTANT rs6000_faster_loading_const
> +
>  #undef TARGET_DELEGITIMIZE_ADDRESS
>  #define TARGET_DELEGITIMIZE_ADDRESS rs6000_delegitimize_address
>  
> @@ -9684,8 +9687,7 @@ rs6000_init_stack_protect_guard (void)
>  static bool
>  rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
>  {
> -  if (GET_CODE (x) == HIGH
> -  && GET_CODE (XEXP (x, 0)) == UNSPEC)
> +  if (GET_CODE (x) == HIGH)
>  return true;
>  
>/* A TLS symbol in the TOC cannot contain a sum.  */
> @@ -10483,6 +10485,30 @@ rs6000_emit_move_si_sf_subreg (rtx dest, rtx source, 
> machine_mode mode)
>return false;
>  }
>  
> +
> +/* Implement TARGET_FASTER_LOADING_CONSTANT.  */
> +
> +static bool
> +rs6000_faster_loading_const (machine_mode mode, rtx dest, rtx src)
> +{
> +  gcc_assert (CONSTANT_P (src));
> +
> +  if (GET_MODE_CLASS (mode) == MODE_CC || mode == VOIDmode)
> +return false;
> +  if (GET_CODE (src) == HIGH)
> +return false;
> +  if (toc_relative_expr_p (src, false, NULL, NULL))
> +return false;
> +  if (rs6000_cannot_force_const_mem (mode, src))
> +return false;
> +
> +  if (REG_P (dest) && FP_REGNO_P (REGNO (dest)))
> +return true;
> +  if (!CONST_INT_P (src))
> +return true;
> +  return num_insns_constant (src, mode) > (TARGET_PCREL ? 1 : 2);
> +}
> +
>  /* Emit a move from SOURCE to DEST in mode MODE.  */
>  void
>  rs6000_emit_move (rtx dest, rtx source, 

[PATCH] Check if loading const from mem is faster

2022-02-21 Thread Jiufu Guo via Gcc-patches
Hi,

For constants, there are some codes to check: if it is able to put
to instruction as an immediate operand or it is profitable to load from
mem.  There are still some places that could be improved for platforms.

This patch could handle PR63281/57836.  This patch does not change
too much on the code like force_const_mem and legitimate_constant_p.
We may integrate these APIs for passes like expand/cse/combine
as a whole solution in the future (maybe better for stage1?).

Bootstrap and regtest pass on ppc64le and x86_64. Is this ok for trunk?
Thanks for comments!


BR,
Jiufu

gcc/ChangeLog:

PR target/94393
PR rtl-optimization/63281
* config/rs6000/rs6000.cc (TARGET_FASTER_LOADING_CONSTANT):
New hook.
(rs6000_faster_loading_const): New hook.
(rs6000_cannot_force_const_mem): Update.
(rs6000_emit_move): Use rs6000_faster_loading_const.
* cse.cc (cse_insn): Use new hook.
* doc/tm.texi.in: New hook TARGET_FASTER_LOADING_CONSTANT.
* doc/tm.texi: Regenerate.
* target.def: New hook faster_loading_constant.
* targhooks.cc (default_faster_loading_constant): New.
* targhooks.h (default_faster_loading_constant): Declare it.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/medium_offset.c: Update.
* gcc.target/powerpc/pr93012.c: Update.
* gcc.target/powerpc/pr63281.c: New test.

---
 gcc/config/rs6000/rs6000.cc   | 38 ++-
 gcc/cse.cc|  5 +++
 gcc/doc/tm.texi   |  5 +++
 gcc/doc/tm.texi.in|  2 +
 gcc/target.def|  8 
 gcc/targhooks.cc  |  6 +++
 .../gcc.target/powerpc/medium_offset.c|  2 +-
 gcc/testsuite/gcc.target/powerpc/pr63281.c| 14 +++
 gcc/testsuite/gcc.target/powerpc/pr93012.c|  2 +-
 9 files changed, 71 insertions(+), 11 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c

diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index d7a7cfe860f..beb21a0bb2b 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -1361,6 +1361,9 @@ static const struct attribute_spec 
rs6000_attribute_table[] =
 #undef TARGET_CANNOT_FORCE_CONST_MEM
 #define TARGET_CANNOT_FORCE_CONST_MEM rs6000_cannot_force_const_mem
 
+#undef TARGET_FASTER_LOADING_CONSTANT
+#define TARGET_FASTER_LOADING_CONSTANT rs6000_faster_loading_const
+
 #undef TARGET_DELEGITIMIZE_ADDRESS
 #define TARGET_DELEGITIMIZE_ADDRESS rs6000_delegitimize_address
 
@@ -9684,8 +9687,7 @@ rs6000_init_stack_protect_guard (void)
 static bool
 rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
 {
-  if (GET_CODE (x) == HIGH
-  && GET_CODE (XEXP (x, 0)) == UNSPEC)
+  if (GET_CODE (x) == HIGH)
 return true;
 
   /* A TLS symbol in the TOC cannot contain a sum.  */
@@ -10483,6 +10485,30 @@ rs6000_emit_move_si_sf_subreg (rtx dest, rtx source, 
machine_mode mode)
   return false;
 }
 
+
+/* Implement TARGET_FASTER_LOADING_CONSTANT.  */
+
+static bool
+rs6000_faster_loading_const (machine_mode mode, rtx dest, rtx src)
+{
+  gcc_assert (CONSTANT_P (src));
+
+  if (GET_MODE_CLASS (mode) == MODE_CC || mode == VOIDmode)
+return false;
+  if (GET_CODE (src) == HIGH)
+return false;
+  if (toc_relative_expr_p (src, false, NULL, NULL))
+return false;
+  if (rs6000_cannot_force_const_mem (mode, src))
+return false;
+
+  if (REG_P (dest) && FP_REGNO_P (REGNO (dest)))
+return true;
+  if (!CONST_INT_P (src))
+return true;
+  return num_insns_constant (src, mode) > (TARGET_PCREL ? 1 : 2);
+}
+
 /* Emit a move from SOURCE to DEST in mode MODE.  */
 void
 rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
@@ -10860,13 +10886,7 @@ rs6000_emit_move (rtx dest, rtx source, machine_mode 
mode)
operands[1] = create_TOC_reference (operands[1], operands[0]);
   else if (mode == Pmode
   && CONSTANT_P (operands[1])
-  && GET_CODE (operands[1]) != HIGH
-  && ((REG_P (operands[0])
-   && FP_REGNO_P (REGNO (operands[0])))
-  || !CONST_INT_P (operands[1])
-  || (num_insns_constant (operands[1], mode)
-  > (TARGET_CMODEL != CMODEL_SMALL ? 3 : 2)))
-  && !toc_relative_expr_p (operands[1], false, NULL, NULL)
+  && rs6000_faster_loading_const (mode, operands[0], operands[1])
   && (TARGET_CMODEL == CMODEL_SMALL
   || can_create_pseudo_p ()
   || (REG_P (operands[0])
diff --git a/gcc/cse.cc b/gcc/cse.cc
index a18b599d324..c33d3c7e911 100644
--- a/gcc/cse.cc
+++ b/gcc/cse.cc
@@ -5158,6 +5158,11 @@ cse_insn (rtx_insn *insn)
   if (dest == pc_rtx && src_const && GET_CODE (src_const) == LABEL_REF)
src_folded = src_const, src_folded_cost = src_folded_regcost = -1;
 
+