Re: [PATCH] PR rtl-optimization/110587: Reduce useless moves in compile-time hog.

2023-07-27 Thread Richard Biener via Gcc-patches



> Am 27.07.2023 um 19:12 schrieb Roger Sayle :
> 
> 
> Hi Richard,
> 
> You're 100% right.  It’s possible to significantly clean-up this code, 
> replacing
> the body of the conditional with a call to force_reg and simplifying the 
> conditions
> under which it is called.  These improvements are implemented in the patch
> below, which has been tested on x86_64-pc-linux-gnu, with a bootstrap and
> make -k check, both with and without -m32, as usual.
> 
> Interestingly, the CONCAT clause afterwards is still required (I've learned 
> something
> new),  as calling force_reg (or gen_reg_rtx) with HCmode, actually returns a 
> CONCAT
> instead of a REG,

Heh, interesting.

> so although the code looks dead, it's required to build libgcc during
> a bootstrap.  But the remaining clean-up is good, reducing the number of 
> source lines
> and making the logic easier to understand.
> 
> Ok for mainline?

Ok.

Thanks,
Richard 

> 2023-07-27  Roger Sayle  
>Richard Biener  
> 
> gcc/ChangeLog
>PR middle-end/28071
>PR rtl-optimization/110587
>* expr.cc (emit_group_load_1): Simplify logic for calling
>force_reg on ORIG_SRC, to avoid making a copy if the source
>is already in a pseudo register.
> 
> Roger
> --
> 
>> -Original Message-
>> From: Richard Biener 
>> Sent: 25 July 2023 12:50
>> 
>>> On Tue, Jul 25, 2023 at 1:31 PM Roger Sayle 
>>> wrote:
>>> 
>>> This patch is the third in series of fixes for PR
>>> rtl-optimization/110587, a compile-time regression with -O0, that
>>> attempts to address the underlying cause.  As noted previously, the
>>> pathological test case pr28071.c contains a large number of useless
>>> register-to-register moves that can produce quadratic behaviour (in
>>> LRA).  These move are generated during RTL expansion in
>>> emit_group_load_1, where the middle-end attempts to simplify the
>>> source before calling extract_bit_field.  This is reasonable if the
>>> source is a complex expression (from before the tree-ssa optimizers),
>>> or a SUBREG, or a hard register, but it's not particularly useful to
>>> copy a pseudo register into a new pseudo register.  This patch eliminates 
>>> that
>> redundancy.
>>> 
>>> The -fdump-tree-expand for pr28071.c compiled with -O0 currently
>>> contains 777K lines, with this patch it contains 717K lines, i.e.
>>> saving about 60K lines (admittedly of debugging text output, but it makes 
>>> the
>> point).
>>> 
>>> 
>>> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
>>> and make -k check, both with and without --target_board=unix{-m32}
>>> with no new failures.  Ok for mainline?
>>> 
>>> As always, I'm happy to revert this change quickly if there's a
>>> problem, and investigate why this additional copy might (still) be
>>> needed on other
>>> non-x86 targets.
>> 
>> @@ -2622,6 +2622,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src,
>> tree type,
>> be loaded directly into the destination.  */
>>   src = orig_src;
>>   if (!MEM_P (orig_src)
>> + && (!REG_P (orig_src) || HARD_REGISTER_P (orig_src))
>>  && (!CONSTANT_P (orig_src)
>>  || (GET_MODE (orig_src) != mode
>>  && GET_MODE (orig_src) != VOIDmode)))
>> 
>> so that means the code guarded by the conditional could instead be 
>> transformed
>> to
>> 
>>   src = force_reg (mode, orig_src);
>> 
>> ?  Btw, the || (GET_MODE (orig_src) != mode && GET_MODE (orig_src) !=
>> VOIDmode) case looks odd as in that case we'd use GET_MODE (orig_src) for the
>> move ... that might also mean we have to use force_reg (GET_MODE (orig_src) 
>> ==
>> VOIDmode ? mode : GET_MODE (orig_src), orig_src))
>> 
>> Otherwise I think this is OK, as said, using force_reg somehow would improve
>> readability here I think.
>> 
>> I also wonder how the
>> 
>>  else if (GET_CODE (src) == CONCAT)
>> 
>> case will ever trigger with the current code.
>> 
>> Richard.
>> 
>>> 
>>> 2023-07-25  Roger Sayle  
>>> 
>>> gcc/ChangeLog
>>>PR middle-end/28071
>>>PR rtl-optimization/110587
>>>* expr.cc (emit_group_load_1): Avoid copying a pseudo register into
>>>a new pseudo register, i.e. only copy hard regs into a new pseudo.
>>> 
>>> 
> 
> 


RE: [PATCH] PR rtl-optimization/110587: Reduce useless moves in compile-time hog.

2023-07-27 Thread Roger Sayle

Hi Richard,

You're 100% right.  It’s possible to significantly clean-up this code, replacing
the body of the conditional with a call to force_reg and simplifying the 
conditions
under which it is called.  These improvements are implemented in the patch
below, which has been tested on x86_64-pc-linux-gnu, with a bootstrap and
make -k check, both with and without -m32, as usual.

Interestingly, the CONCAT clause afterwards is still required (I've learned 
something
new),  as calling force_reg (or gen_reg_rtx) with HCmode, actually returns a 
CONCAT
instead of a REG, so although the code looks dead, it's required to build 
libgcc during
a bootstrap.  But the remaining clean-up is good, reducing the number of source 
lines
and making the logic easier to understand.

Ok for mainline?

2023-07-27  Roger Sayle  
Richard Biener  

gcc/ChangeLog
PR middle-end/28071
PR rtl-optimization/110587
* expr.cc (emit_group_load_1): Simplify logic for calling
force_reg on ORIG_SRC, to avoid making a copy if the source
is already in a pseudo register.

Roger
--

> -Original Message-
> From: Richard Biener 
> Sent: 25 July 2023 12:50
> 
> On Tue, Jul 25, 2023 at 1:31 PM Roger Sayle 
> wrote:
> >
> > This patch is the third in series of fixes for PR
> > rtl-optimization/110587, a compile-time regression with -O0, that
> > attempts to address the underlying cause.  As noted previously, the
> > pathological test case pr28071.c contains a large number of useless
> > register-to-register moves that can produce quadratic behaviour (in
> > LRA).  These move are generated during RTL expansion in
> > emit_group_load_1, where the middle-end attempts to simplify the
> > source before calling extract_bit_field.  This is reasonable if the
> > source is a complex expression (from before the tree-ssa optimizers),
> > or a SUBREG, or a hard register, but it's not particularly useful to
> > copy a pseudo register into a new pseudo register.  This patch eliminates 
> > that
> redundancy.
> >
> > The -fdump-tree-expand for pr28071.c compiled with -O0 currently
> > contains 777K lines, with this patch it contains 717K lines, i.e.
> > saving about 60K lines (admittedly of debugging text output, but it makes 
> > the
> point).
> >
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures.  Ok for mainline?
> >
> > As always, I'm happy to revert this change quickly if there's a
> > problem, and investigate why this additional copy might (still) be
> > needed on other
> > non-x86 targets.
> 
> @@ -2622,6 +2622,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src,
> tree type,
>  be loaded directly into the destination.  */
>src = orig_src;
>if (!MEM_P (orig_src)
> + && (!REG_P (orig_src) || HARD_REGISTER_P (orig_src))
>   && (!CONSTANT_P (orig_src)
>   || (GET_MODE (orig_src) != mode
>   && GET_MODE (orig_src) != VOIDmode)))
> 
> so that means the code guarded by the conditional could instead be transformed
> to
> 
>src = force_reg (mode, orig_src);
> 
> ?  Btw, the || (GET_MODE (orig_src) != mode && GET_MODE (orig_src) !=
> VOIDmode) case looks odd as in that case we'd use GET_MODE (orig_src) for the
> move ... that might also mean we have to use force_reg (GET_MODE (orig_src) ==
> VOIDmode ? mode : GET_MODE (orig_src), orig_src))
> 
> Otherwise I think this is OK, as said, using force_reg somehow would improve
> readability here I think.
> 
> I also wonder how the
> 
>   else if (GET_CODE (src) == CONCAT)
> 
> case will ever trigger with the current code.
> 
> Richard.
> 
> >
> > 2023-07-25  Roger Sayle  
> >
> > gcc/ChangeLog
> > PR middle-end/28071
> > PR rtl-optimization/110587
> > * expr.cc (emit_group_load_1): Avoid copying a pseudo register into
> > a new pseudo register, i.e. only copy hard regs into a new pseudo.
> >
> >

diff --git a/gcc/expr.cc b/gcc/expr.cc
index fff09dc..174f8ac 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -2622,16 +2622,11 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, 
tree type,
 be loaded directly into the destination.  */
   src = orig_src;
   if (!MEM_P (orig_src)
- && (!CONSTANT_P (orig_src)
- || (GET_MODE (orig_src) != mode
- && GET_MODE (orig_src) != VOIDmode)))
+ && (!REG_P (orig_src) || HARD_REGISTER_P (orig_src))
+ && !CONSTANT_P (orig_src))
{
- if (GET_MODE (orig_src) == VOIDmode)
-   src = gen_reg_rtx (mode);
- else
-   src = gen_reg_rtx (GET_MODE (orig_src));
-
- emit_move_insn (src, orig_src);
+ gcc_assert (GET_MODE (orig_src) != VOIDmode);
+ src = force_reg (GET_MODE (orig_src), orig_src);
}
 
   /* Optimize the access just a bit.  */


Re: [PATCH] PR rtl-optimization/110587: Reduce useless moves in compile-time hog.

2023-07-25 Thread Richard Biener via Gcc-patches
On Tue, Jul 25, 2023 at 1:31 PM Roger Sayle  wrote:
>
>
> This patch is the third in series of fixes for PR rtl-optimization/110587,
> a compile-time regression with -O0, that attempts to address the underlying
> cause.  As noted previously, the pathological test case pr28071.c contains
> a large number of useless register-to-register moves that can produce
> quadratic behaviour (in LRA).  These move are generated during RTL
> expansion in emit_group_load_1, where the middle-end attempts to simplify
> the source before calling extract_bit_field.  This is reasonable if the
> source is a complex expression (from before the tree-ssa optimizers), or
> a SUBREG, or a hard register, but it's not particularly useful to copy
> a pseudo register into a new pseudo register.  This patch eliminates that
> redundancy.
>
> The -fdump-tree-expand for pr28071.c compiled with -O0 currently contains
> 777K lines, with this patch it contains 717K lines, i.e. saving about 60K
> lines (admittedly of debugging text output, but it makes the point).
>
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
> As always, I'm happy to revert this change quickly if there's a problem,
> and investigate why this additional copy might (still) be needed on other
> non-x86 targets.

@@ -2622,6 +2622,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx
orig_src, tree type,
 be loaded directly into the destination.  */
   src = orig_src;
   if (!MEM_P (orig_src)
+ && (!REG_P (orig_src) || HARD_REGISTER_P (orig_src))
  && (!CONSTANT_P (orig_src)
  || (GET_MODE (orig_src) != mode
  && GET_MODE (orig_src) != VOIDmode)))

so that means the code guarded by the conditional could instead
be transformed to

   src = force_reg (mode, orig_src);

?  Btw, the || (GET_MODE (orig_src) != mode && GET_MODE (orig_src) != VOIDmode)
case looks odd as in that case we'd use GET_MODE (orig_src) for the
move ... that
might also mean we have to use force_reg (GET_MODE (orig_src) ==
VOIDmode ? mode : GET_MODE (orig_src), orig_src))

Otherwise I think this is OK, as said, using
force_reg somehow would improve readability here I think.

I also wonder how the

  else if (GET_CODE (src) == CONCAT)

case will ever trigger with the current code.

Richard.

>
> 2023-07-25  Roger Sayle  
>
> gcc/ChangeLog
> PR middle-end/28071
> PR rtl-optimization/110587
> * expr.cc (emit_group_load_1): Avoid copying a pseudo register into
> a new pseudo register, i.e. only copy hard regs into a new pseudo.
>
>
> Thanks in advance,
> Roger
> --
>