Re: [PATCH] PR rtl-optimization/110587: Reduce useless moves in compile-time hog.
> Am 27.07.2023 um 19:12 schrieb Roger Sayle : > > > Hi Richard, > > You're 100% right. It’s possible to significantly clean-up this code, > replacing > the body of the conditional with a call to force_reg and simplifying the > conditions > under which it is called. These improvements are implemented in the patch > below, which has been tested on x86_64-pc-linux-gnu, with a bootstrap and > make -k check, both with and without -m32, as usual. > > Interestingly, the CONCAT clause afterwards is still required (I've learned > something > new), as calling force_reg (or gen_reg_rtx) with HCmode, actually returns a > CONCAT > instead of a REG, Heh, interesting. > so although the code looks dead, it's required to build libgcc during > a bootstrap. But the remaining clean-up is good, reducing the number of > source lines > and making the logic easier to understand. > > Ok for mainline? Ok. Thanks, Richard > 2023-07-27 Roger Sayle >Richard Biener > > gcc/ChangeLog >PR middle-end/28071 >PR rtl-optimization/110587 >* expr.cc (emit_group_load_1): Simplify logic for calling >force_reg on ORIG_SRC, to avoid making a copy if the source >is already in a pseudo register. > > Roger > -- > >> -Original Message- >> From: Richard Biener >> Sent: 25 July 2023 12:50 >> >>> On Tue, Jul 25, 2023 at 1:31 PM Roger Sayle >>> wrote: >>> >>> This patch is the third in series of fixes for PR >>> rtl-optimization/110587, a compile-time regression with -O0, that >>> attempts to address the underlying cause. As noted previously, the >>> pathological test case pr28071.c contains a large number of useless >>> register-to-register moves that can produce quadratic behaviour (in >>> LRA). These move are generated during RTL expansion in >>> emit_group_load_1, where the middle-end attempts to simplify the >>> source before calling extract_bit_field. This is reasonable if the >>> source is a complex expression (from before the tree-ssa optimizers), >>> or a SUBREG, or a hard register, but it's not particularly useful to >>> copy a pseudo register into a new pseudo register. This patch eliminates >>> that >> redundancy. >>> >>> The -fdump-tree-expand for pr28071.c compiled with -O0 currently >>> contains 777K lines, with this patch it contains 717K lines, i.e. >>> saving about 60K lines (admittedly of debugging text output, but it makes >>> the >> point). >>> >>> >>> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap >>> and make -k check, both with and without --target_board=unix{-m32} >>> with no new failures. Ok for mainline? >>> >>> As always, I'm happy to revert this change quickly if there's a >>> problem, and investigate why this additional copy might (still) be >>> needed on other >>> non-x86 targets. >> >> @@ -2622,6 +2622,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, >> tree type, >> be loaded directly into the destination. */ >> src = orig_src; >> if (!MEM_P (orig_src) >> + && (!REG_P (orig_src) || HARD_REGISTER_P (orig_src)) >> && (!CONSTANT_P (orig_src) >> || (GET_MODE (orig_src) != mode >> && GET_MODE (orig_src) != VOIDmode))) >> >> so that means the code guarded by the conditional could instead be >> transformed >> to >> >> src = force_reg (mode, orig_src); >> >> ? Btw, the || (GET_MODE (orig_src) != mode && GET_MODE (orig_src) != >> VOIDmode) case looks odd as in that case we'd use GET_MODE (orig_src) for the >> move ... that might also mean we have to use force_reg (GET_MODE (orig_src) >> == >> VOIDmode ? mode : GET_MODE (orig_src), orig_src)) >> >> Otherwise I think this is OK, as said, using force_reg somehow would improve >> readability here I think. >> >> I also wonder how the >> >> else if (GET_CODE (src) == CONCAT) >> >> case will ever trigger with the current code. >> >> Richard. >> >>> >>> 2023-07-25 Roger Sayle >>> >>> gcc/ChangeLog >>>PR middle-end/28071 >>>PR rtl-optimization/110587 >>>* expr.cc (emit_group_load_1): Avoid copying a pseudo register into >>>a new pseudo register, i.e. only copy hard regs into a new pseudo. >>> >>> > >
RE: [PATCH] PR rtl-optimization/110587: Reduce useless moves in compile-time hog.
Hi Richard, You're 100% right. It’s possible to significantly clean-up this code, replacing the body of the conditional with a call to force_reg and simplifying the conditions under which it is called. These improvements are implemented in the patch below, which has been tested on x86_64-pc-linux-gnu, with a bootstrap and make -k check, both with and without -m32, as usual. Interestingly, the CONCAT clause afterwards is still required (I've learned something new), as calling force_reg (or gen_reg_rtx) with HCmode, actually returns a CONCAT instead of a REG, so although the code looks dead, it's required to build libgcc during a bootstrap. But the remaining clean-up is good, reducing the number of source lines and making the logic easier to understand. Ok for mainline? 2023-07-27 Roger Sayle Richard Biener gcc/ChangeLog PR middle-end/28071 PR rtl-optimization/110587 * expr.cc (emit_group_load_1): Simplify logic for calling force_reg on ORIG_SRC, to avoid making a copy if the source is already in a pseudo register. Roger -- > -Original Message- > From: Richard Biener > Sent: 25 July 2023 12:50 > > On Tue, Jul 25, 2023 at 1:31 PM Roger Sayle > wrote: > > > > This patch is the third in series of fixes for PR > > rtl-optimization/110587, a compile-time regression with -O0, that > > attempts to address the underlying cause. As noted previously, the > > pathological test case pr28071.c contains a large number of useless > > register-to-register moves that can produce quadratic behaviour (in > > LRA). These move are generated during RTL expansion in > > emit_group_load_1, where the middle-end attempts to simplify the > > source before calling extract_bit_field. This is reasonable if the > > source is a complex expression (from before the tree-ssa optimizers), > > or a SUBREG, or a hard register, but it's not particularly useful to > > copy a pseudo register into a new pseudo register. This patch eliminates > > that > redundancy. > > > > The -fdump-tree-expand for pr28071.c compiled with -O0 currently > > contains 777K lines, with this patch it contains 717K lines, i.e. > > saving about 60K lines (admittedly of debugging text output, but it makes > > the > point). > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline? > > > > As always, I'm happy to revert this change quickly if there's a > > problem, and investigate why this additional copy might (still) be > > needed on other > > non-x86 targets. > > @@ -2622,6 +2622,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, > tree type, > be loaded directly into the destination. */ >src = orig_src; >if (!MEM_P (orig_src) > + && (!REG_P (orig_src) || HARD_REGISTER_P (orig_src)) > && (!CONSTANT_P (orig_src) > || (GET_MODE (orig_src) != mode > && GET_MODE (orig_src) != VOIDmode))) > > so that means the code guarded by the conditional could instead be transformed > to > >src = force_reg (mode, orig_src); > > ? Btw, the || (GET_MODE (orig_src) != mode && GET_MODE (orig_src) != > VOIDmode) case looks odd as in that case we'd use GET_MODE (orig_src) for the > move ... that might also mean we have to use force_reg (GET_MODE (orig_src) == > VOIDmode ? mode : GET_MODE (orig_src), orig_src)) > > Otherwise I think this is OK, as said, using force_reg somehow would improve > readability here I think. > > I also wonder how the > > else if (GET_CODE (src) == CONCAT) > > case will ever trigger with the current code. > > Richard. > > > > > 2023-07-25 Roger Sayle > > > > gcc/ChangeLog > > PR middle-end/28071 > > PR rtl-optimization/110587 > > * expr.cc (emit_group_load_1): Avoid copying a pseudo register into > > a new pseudo register, i.e. only copy hard regs into a new pseudo. > > > > diff --git a/gcc/expr.cc b/gcc/expr.cc index fff09dc..174f8ac 100644 --- a/gcc/expr.cc +++ b/gcc/expr.cc @@ -2622,16 +2622,11 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree type, be loaded directly into the destination. */ src = orig_src; if (!MEM_P (orig_src) - && (!CONSTANT_P (orig_src) - || (GET_MODE (orig_src) != mode - && GET_MODE (orig_src) != VOIDmode))) + && (!REG_P (orig_src) || HARD_REGISTER_P (orig_src)) + && !CONSTANT_P (orig_src)) { - if (GET_MODE (orig_src) == VOIDmode) - src = gen_reg_rtx (mode); - else - src = gen_reg_rtx (GET_MODE (orig_src)); - - emit_move_insn (src, orig_src); + gcc_assert (GET_MODE (orig_src) != VOIDmode); + src = force_reg (GET_MODE (orig_src), orig_src); } /* Optimize the access just a bit. */
Re: [PATCH] PR rtl-optimization/110587: Reduce useless moves in compile-time hog.
On Tue, Jul 25, 2023 at 1:31 PM Roger Sayle wrote: > > > This patch is the third in series of fixes for PR rtl-optimization/110587, > a compile-time regression with -O0, that attempts to address the underlying > cause. As noted previously, the pathological test case pr28071.c contains > a large number of useless register-to-register moves that can produce > quadratic behaviour (in LRA). These move are generated during RTL > expansion in emit_group_load_1, where the middle-end attempts to simplify > the source before calling extract_bit_field. This is reasonable if the > source is a complex expression (from before the tree-ssa optimizers), or > a SUBREG, or a hard register, but it's not particularly useful to copy > a pseudo register into a new pseudo register. This patch eliminates that > redundancy. > > The -fdump-tree-expand for pr28071.c compiled with -O0 currently contains > 777K lines, with this patch it contains 717K lines, i.e. saving about 60K > lines (admittedly of debugging text output, but it makes the point). > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > and make -k check, both with and without --target_board=unix{-m32} > with no new failures. Ok for mainline? > > As always, I'm happy to revert this change quickly if there's a problem, > and investigate why this additional copy might (still) be needed on other > non-x86 targets. @@ -2622,6 +2622,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree type, be loaded directly into the destination. */ src = orig_src; if (!MEM_P (orig_src) + && (!REG_P (orig_src) || HARD_REGISTER_P (orig_src)) && (!CONSTANT_P (orig_src) || (GET_MODE (orig_src) != mode && GET_MODE (orig_src) != VOIDmode))) so that means the code guarded by the conditional could instead be transformed to src = force_reg (mode, orig_src); ? Btw, the || (GET_MODE (orig_src) != mode && GET_MODE (orig_src) != VOIDmode) case looks odd as in that case we'd use GET_MODE (orig_src) for the move ... that might also mean we have to use force_reg (GET_MODE (orig_src) == VOIDmode ? mode : GET_MODE (orig_src), orig_src)) Otherwise I think this is OK, as said, using force_reg somehow would improve readability here I think. I also wonder how the else if (GET_CODE (src) == CONCAT) case will ever trigger with the current code. Richard. > > 2023-07-25 Roger Sayle > > gcc/ChangeLog > PR middle-end/28071 > PR rtl-optimization/110587 > * expr.cc (emit_group_load_1): Avoid copying a pseudo register into > a new pseudo register, i.e. only copy hard regs into a new pseudo. > > > Thanks in advance, > Roger > -- >