On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw <richard.earns...@foss.arm.com> wrote: > > On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote: > > On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw > > <richard.earns...@foss.arm.com> wrote: > >> > >> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote: > >>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw > >>> <richard.earns...@foss.arm.com> wrote: > >>>> > >>>> On 20/10/2020 12:22, Richard Earnshaw wrote: > >>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote: > >>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw > >>>>>> <richard.earns...@foss.arm.com> wrote: > >>>>>>> > >>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote: > >>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw > >>>>>>>> <richard.earns...@foss.arm.com> wrote: > >>>>>>>>> > >>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote: > >>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw > >>>>>>>>>> <richard.earns...@foss.arm.com> wrote: > >>>>>>>>>>> > >>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote: > >>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load > >>>>>>>>>>>> delta > >>>>>>>>>>>> from code memory (like we do without -mpure-code). > >>>>>>>>>>>> > >>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series of > >>>>>>>>>>>> movs/adds/lsls. > >>>>>>>>>>>> > >>>>>>>>>>>> We also do some cleanup by not emitting the function address and > >>>>>>>>>>>> delta > >>>>>>>>>>>> via .word directives at the end of the thunk since we don't use > >>>>>>>>>>>> them > >>>>>>>>>>>> with -mpure-code. > >>>>>>>>>>>> > >>>>>>>>>>>> No need for new testcases, this bug was already identified by > >>>>>>>>>>>> eg. pr46287-3.C > >>>>>>>>>>>> > >>>>>>>>>>>> 2020-09-29 Christophe Lyon <christophe.l...@linaro.org> > >>>>>>>>>>>> > >>>>>>>>>>>> gcc/ > >>>>>>>>>>>> * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta > >>>>>>>>>>>> in r3 and > >>>>>>>>>>>> do not emit function address and delta when -mpure-code is > >>>>>>>>>>>> used. > >>>>>>>>>>> > >>>>>>>>>> Hi Richard, > >>>>>>>>>> > >>>>>>>>>> Thanks for your comments. > >>>>>>>>>> > >>>>>>>>>>> There are some optimizations you can make to this code. > >>>>>>>>>>> > >>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be > >>>>>>>>>>> better > >>>>>>>>>>> to just expand a mov of 255 followed by an add. > >>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't > >>>>>>>>>> noticed > >>>>>>>>>> before, so I can write something similar indeed. > >>>>>>>>>> > >>>>>>>>>> However, I'm note quite sure to understand the benefit in the split > >>>>>>>>>> when -mpure-code is NOT used. > >>>>>>>>>> Consider: > >>>>>>>>>> int f3_1 (void) { return 510; } > >>>>>>>>>> int f3_2 (void) { return 511; } > >>>>>>>>>> Compile with -O2 -mcpu=cortex-m0: > >>>>>>>>>> f3_1: > >>>>>>>>>> movs r0, #255 > >>>>>>>>>> lsls r0, r0, #1 > >>>>>>>>>> bx lr > >>>>>>>>>> f3_2: > >>>>>>>>>> ldr r0, .L4 > >>>>>>>>>> bx lr > >>>>>>>>>> > >>>>>>>>>> The splitter makes the code bigger, does it "compensate" for this > >>>>>>>>>> by > >>>>>>>>>> not having to load the constant? > >>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken into > >>>>>>>>>> account when comparing code size, > >>>>>>>>> > >>>>>>>>> Yes, the size of the literal pool entry needs to be taken into > >>>>>>>>> account. > >>>>>>>>> It might happen that the entry could be shared with another use of > >>>>>>>>> that > >>>>>>>>> literal, but in general that's rare. > >>>>>>>>> > >>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three > >>>>>>>>>> thumb1 instructions would be equivalent in size compared to loading > >>>>>>>>>> from the literal pool. Should the 256-510 range be extended? > >>>>>>>>> > >>>>>>>>> It's a bit borderline at three instructions when literal pools are > >>>>>>>>> not > >>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite > >>>>>>>>> small due > >>>>>>>>> to the limited pc offsets we can use. I think on balance we > >>>>>>>>> probably > >>>>>>>>> want to use the instruction sequence unless optimizing for size. > >>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> This is also true for > >>>>>>>>>>> the literal pools alternative as well, so should be handled > >>>>>>>>>>> before all > >>>>>>>>>>> this. > >>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample is > >>>>>>>>>> compiled as: > >>>>>>>>>> f3_1: > >>>>>>>>>> movs r0, #255 > >>>>>>>>>> lsls r0, r0, #1 > >>>>>>>>>> bx lr > >>>>>>>>>> f3_2: > >>>>>>>>>> movs r0, #1 > >>>>>>>>>> lsls r0, r0, #8 > >>>>>>>>>> adds r0, r0, #255 > >>>>>>>>>> bx lr > >>>>>>>>>> > >>>>>>>>>> so the "return 510" case is already handled as without -mpure-code. > >>>>>>>>> > >>>>>>>>> I was thinking specifically of the thunk sequence where you seem to > >>>>>>>>> be > >>>>>>>>> emitting instructions directly rather than generating RTL. The > >>>>>>>>> examples > >>>>>>>>> you show here are not thunks. > >>>>>>>>> > >>>>>>>> OK thanks for the clarification. > >>>>>>>> > >>>>>>>> Here is an updated version, split into 3 patches to hopefully make > >>>>>>>> review easier. > >>>>>>>> They apply on top of my other mpure-code patches for PR96967 and > >>>>>>>> PR96770: > >>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html > >>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html > >>>>>>>> > >>>>>>>> I kept it this way to make incremental changes easier to understand. > >>>>>>>> > >>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance > >>>>>>>> easier, > >>>>>>>> I have updated thumb1_gen_const_int() so that it can generate either > >>>>>>>> RTL or > >>>>>>>> asm. This way, all the code used to build thumb-1 constants is in the > >>>>>>>> same place, > >>>>>>>> in case we need to improve/fix it later. We now generate shorter > >>>>>>>> sequences in > >>>>>>>> several cases matching your comments. > >>>>>>>> > >>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern > >>>>>>>> and > >>>>>>>> calls thumb1_gen_const_int. > >>>>>>>> > >>>>>>>> Patch 3: Update of the original patch in this thread, now calls > >>>>>>>> thumb1_gen_const_int. > >>>>>>> > >>>>>>> Yuk! Those changes to thumb1_gen_const_int are horrible. > >>>>>>> > >>>>>>> I think we should be able to leverage the fact that the compiler can > >>>>>>> use > >>>>>>> C++ now to do much better than that, for example by making that > >>>>>>> function > >>>>>>> a template. For example (and this is just a sketch): > >>>>>>> > >>>>>> > >>>>>> Indeed! I didn't think about it since there is no other use of > >>>>>> templates in arm.c yet. > >>>>>> I'll send an update soon. > >>>>>> > >>>>>> Other than that, does the approach look OK to you? > >>>>>> > >>>>> > >>>>> Yes, I think this is heading in the right direction. Bringing the two > >>>>> immediate generating operations into a single function can only be a > >>>>> good thing. > >>>>> > >>>>> Looking again at your example constant sequences, I see: > >>>>> > >>>>> 0x1000010: > >>>>> movs r3, #16 > >>>>> lsls r3, #16 > >>>>> adds r3, #1 > >>>>> lsls r3, #4 > >>>>> 0x1000011: > >>>>> movs r3, #1 > >>>>> lsls r3, #24 > >>>>> adds r3, #17 > >>>>> > >>>>> The first of these looks odd, given the second sequence. Why doesn't > >>>>> the first expand to > >>>>> > >>>>> 0x1000010: > >>>>> movs r3, #16 > >>>>> lsls r3, #16 > >>>>> adds r3, #16 > >>>>> > >>>> Err, I mean to: > >>>> > >>>> > >>>> 0x1000010: > >>>> movs r3, #1 > >>>> lsls r3, #24 > >>>> adds r3, #16 > >>>> > >>>> ? > >>> > >>> Because I first try to right-shift the constant, hoping to reduce its > >>> range and need less instructions to build the higher part, then > >>> left-shift back. > >>> > >>> In this particular case, we'd need to realize that there are many > >>> zeros "inside" the constant. > >>> > >>> If I remove the part that tries to reduce the range, I do get that > >>> sequence, but for 764 I now generate > >>> movs r3, #2 > >>> lsls r3, #8 > >>> adds r3, #252 > >>> instead of > >>> movs r3, #191 > >>> lsls r3, #2 > >>> > >>> A possibility would be to try both approaches and keep the shortest one. > >> > >> Lets leave that for now, it's not important to fixing the main issue; > >> but we should remember we need to come back to it at some point. > >> > > Thanks, that's what I was thinking too. > > > >> There are other tricks as well, such as > >> > >> 0xffffff > >> > >> can be done as > >> > >> 0x1000000 - 1 > >> > >> and > >> > >> 0xfffffd > >> > >> as > >> > >> 0x1000000 - 3 > >> > >> but these can wait as well. > >> > > > > Didn't we already need to handle such tricks? I'm surprised this > > wasn't needed earlier. > > > > I don't think we ever worried about them. Most of them need at least 3 > instructions so aren't a code size saving over using a literal pool entry. > OK, this will also help when using -mslow-flash-data.
Here are updated patches, now using a template as you suggested. Thanks, Christophe > R. > > > > >> > >> R. > >> > >>> > >>> > >>>>> > >>>>> R. > >>>>> > >>>>>> Thanks, > >>>>>> > >>>>>> Christophe > >>>>>> > >>>>>>> class t1_rtl > >>>>>>> { > >>>>>>> public: > >>>>>>> void ashift(int a) { gen_rtx_ASHIFT(a); } > >>>>>>> void rshift(int b) { gen_rtx_SHIFTRT(b); } > >>>>>>> }; > >>>>>>> > >>>>>>> class t1_print > >>>>>>> { > >>>>>>> public: > >>>>>>> t1_print (FILE *f) : t_file(f) {} > >>>>>>> void ashift (int a) { fprintf (t_file, "a shift %d\n", a); } > >>>>>>> void rshift (int b) { fprintf (t_file, "r shift %d\n", b); } > >>>>>>> private: > >>>>>>> FILE *t_file; > >>>>>>> }; > >>>>>>> > >>>>>>> template <class T> > >>>>>>> void thumb1_gen_const_int(T t, int f) > >>>>>>> { > >>>>>>> // Expansion of thumb1_gen_const_int ... > >>>>>>> t.ashift(f); > >>>>>>> } > >>>>>>> > >>>>>>> // Usage... > >>>>>>> void f1() > >>>>>>> { > >>>>>>> // Use the RTL expander > >>>>>>> t1_rtl g; > >>>>>>> thumb1_gen_const_int (g, 3); > >>>>>>> } > >>>>>>> > >>>>>>> void f2() > >>>>>>> { > >>>>>>> // Use the printf expander writing to stdout > >>>>>>> t1_print g(stdout); > >>>>>>> thumb1_gen_const_int (g, 3); > >>>>>>> } > >>>>>>> > >>>>>>> With this you can write thumb1_gen_const_int without having to worry > >>>>>>> about which expander is being used in each instance and the template > >>>>>>> expansion will use the right version. > >>>>>>> > >>>>>>> R. > >>>>>>> > >>>>>>>> > >>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> I also suspect (but haven't check) that the base adjustment will > >>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4). If > >>>>>>>>>>> that is > >>>>>>>>>>> the case then you could generate n/4 and then shift it left by 2 > >>>>>>>>>>> for an > >>>>>>>>>>> even greater range of literals. > >>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY > >>>>>>>>>> case, > >>>>>>>>>> I'll update my patch. > >>>>>>>>>> > >>>>>>>>>>> More generally, any sequence of up to > >>>>>>>>>>> three thumb1 instructions will be no larger, and probably as fast > >>>>>>>>>>> as the > >>>>>>>>>>> existing literal pool fall back. > >>>>>>>>>>> > >>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your > >>>>>>>>>>> code will > >>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two > >>>>>>>>>>> shifts could > >>>>>>>>>>> be merged into a single shift-by-16. > >>>>>>>>>> > >>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const. > >>>>>>>>>> > >>>>>>>>>>> Finally, I'd really like to see some executable tests for this, > >>>>>>>>>>> if at > >>>>>>>>>>> all possible. > >>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase > >>>>>>>>>> that showed the problem. There are also: > >>>>>>>>>> g++.dg/opt/thunk1.C > >>>>>>>>>> g++.dg/ipa/pr46984.C > >>>>>>>>>> g++.dg/torture/pr46287.C > >>>>>>>>>> g++.dg/torture/pr45699.C > >>>>>>>>>> > >>>>>>>>>> Do you want that I copy one of these in the arm subdir and add > >>>>>>>>>> -mpure-code in dg-options? > >>>>>>>>> > >>>>>>>>> On reflection, probably not - that just makes things more > >>>>>>>>> complicated > >>>>>>>>> with all the dg-options mess (I'm worried about interactions with > >>>>>>>>> other > >>>>>>>>> sets of options on the command line and the fall-out from that). If > >>>>>>>>> someone cares about pure-code they should be doing full testsuite > >>>>>>>>> runs > >>>>>>>>> with it enabled and that should be sufficient. > >>>>>>>> > >>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a > >>>>>>>> modified simulator. > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> > >>>>>>>> Christophe > >>>>>>>> > >>>>>>>>> > >>>>>>>>> R. > >>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> > >>>>>>>>>> Christophe > >>>>>>>>>> > >>>>>>>>>>> R. > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> k# (use "git pull" to merge the remote branch into yours) > >>>>>>>>>>>> --- > >>>>>>>>>>>> gcc/config/arm/arm.c | 91 > >>>>>>>>>>>> +++++++++++++++++++++++++++++++++++++--------------- > >>>>>>>>>>>> 1 file changed, 66 insertions(+), 25 deletions(-) > >>>>>>>>>>>> > >>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c > >>>>>>>>>>>> index ceeb91f..62abeb5 100644 > >>>>>>>>>>>> --- a/gcc/config/arm/arm.c > >>>>>>>>>>>> +++ b/gcc/config/arm/arm.c > >>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, > >>>>>>>>>>>> HOST_WIDE_INT delta, > >>>>>>>>>>>> { > >>>>>>>>>>>> if (mi_delta > 255) > >>>>>>>>>>>> { > >>>>>>>>>>>> - fputs ("\tldr\tr3, ", file); > >>>>>>>>>>>> - assemble_name (file, label); > >>>>>>>>>>>> - fputs ("+4\n", file); > >>>>>>>>>>>> + /* With -mpure-code, we cannot load delta from the > >>>>>>>>>>>> constant > >>>>>>>>>>>> + pool: we build it explicitly. */ > >>>>>>>>>>>> + if (target_pure_code) > >>>>>>>>>>>> + { > >>>>>>>>>>>> + bool mov_done_p = false; > >>>>>>>>>>>> + int i; > >>>>>>>>>>>> + > >>>>>>>>>>>> + /* Emit upper 3 bytes if needed. */ > >>>>>>>>>>>> + for (i = 0; i < 3; i++) > >>>>>>>>>>>> + { > >>>>>>>>>>>> + int byte = (mi_delta >> (8 * (3 - i))) & 0xff; > >>>>>>>>>>>> + > >>>>>>>>>>>> + if (byte) > >>>>>>>>>>>> + { > >>>>>>>>>>>> + if (mov_done_p) > >>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, #%d\n", > >>>>>>>>>>>> byte); > >>>>>>>>>>>> + else > >>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, #%d\n", > >>>>>>>>>>>> byte); > >>>>>>>>>>>> + mov_done_p = true; > >>>>>>>>>>>> + } > >>>>>>>>>>>> + > >>>>>>>>>>>> + if (mov_done_p) > >>>>>>>>>>>> + asm_fprintf (file, "\tlsls\tr3, #8\n"); > >>>>>>>>>>>> + } > >>>>>>>>>>>> + > >>>>>>>>>>>> + /* Emit lower byte if needed. */ > >>>>>>>>>>>> + if (!mov_done_p) > >>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & > >>>>>>>>>>>> 0xff); > >>>>>>>>>>>> + else if (mi_delta & 0xff) > >>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & > >>>>>>>>>>>> 0xff); > >>>>>>>>>>>> + } > >>>>>>>>>>>> + else > >>>>>>>>>>>> + { > >>>>>>>>>>>> + fputs ("\tldr\tr3, ", file); > >>>>>>>>>>>> + assemble_name (file, label); > >>>>>>>>>>>> + fputs ("+4\n", file); > >>>>>>>>>>>> + } > >>>>>>>>>>>> asm_fprintf (file, "\t%ss\t%r, %r, r3\n", > >>>>>>>>>>>> mi_op, this_regno, this_regno); > >>>>>>>>>>>> } > >>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, > >>>>>>>>>>>> HOST_WIDE_INT delta, > >>>>>>>>>>>> fputs ("\tpop\t{r3}\n", file); > >>>>>>>>>>>> > >>>>>>>>>>>> fprintf (file, "\tbx\tr12\n"); > >>>>>>>>>>>> - ASM_OUTPUT_ALIGN (file, 2); > >>>>>>>>>>>> - assemble_name (file, label); > >>>>>>>>>>>> - fputs (":\n", file); > >>>>>>>>>>>> - if (flag_pic) > >>>>>>>>>>>> + > >>>>>>>>>>>> + /* With -mpure-code, we don't need to emit literals for > >>>>>>>>>>>> the > >>>>>>>>>>>> + function address and delta since we emitted code to build > >>>>>>>>>>>> + them. */ > >>>>>>>>>>>> + if (!target_pure_code) > >>>>>>>>>>>> { > >>>>>>>>>>>> - /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ > >>>>>>>>>>>> - rtx tem = XEXP (DECL_RTL (function), 0); > >>>>>>>>>>>> - /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so > >>>>>>>>>>>> the PC > >>>>>>>>>>>> - pipeline offset is four rather than eight. Adjust > >>>>>>>>>>>> the offset > >>>>>>>>>>>> - accordingly. */ > >>>>>>>>>>>> - tem = plus_constant (GET_MODE (tem), tem, > >>>>>>>>>>>> - TARGET_THUMB1_ONLY ? -3 : -7); > >>>>>>>>>>>> - tem = gen_rtx_MINUS (GET_MODE (tem), > >>>>>>>>>>>> - tem, > >>>>>>>>>>>> - gen_rtx_SYMBOL_REF (Pmode, > >>>>>>>>>>>> - ggc_strdup > >>>>>>>>>>>> (labelpc))); > >>>>>>>>>>>> - assemble_integer (tem, 4, BITS_PER_WORD, 1); > >>>>>>>>>>>> - } > >>>>>>>>>>>> - else > >>>>>>>>>>>> - /* Output ".word .LTHUNKn". */ > >>>>>>>>>>>> - assemble_integer (XEXP (DECL_RTL (function), 0), 4, > >>>>>>>>>>>> BITS_PER_WORD, 1); > >>>>>>>>>>>> + ASM_OUTPUT_ALIGN (file, 2); > >>>>>>>>>>>> + assemble_name (file, label); > >>>>>>>>>>>> + fputs (":\n", file); > >>>>>>>>>>>> + if (flag_pic) > >>>>>>>>>>>> + { > >>>>>>>>>>>> + /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ > >>>>>>>>>>>> + rtx tem = XEXP (DECL_RTL (function), 0); > >>>>>>>>>>>> + /* For TARGET_THUMB1_ONLY the thunk is in Thumb > >>>>>>>>>>>> mode, so the PC > >>>>>>>>>>>> + pipeline offset is four rather than eight. > >>>>>>>>>>>> Adjust the offset > >>>>>>>>>>>> + accordingly. */ > >>>>>>>>>>>> + tem = plus_constant (GET_MODE (tem), tem, > >>>>>>>>>>>> + TARGET_THUMB1_ONLY ? -3 : -7); > >>>>>>>>>>>> + tem = gen_rtx_MINUS (GET_MODE (tem), > >>>>>>>>>>>> + tem, > >>>>>>>>>>>> + gen_rtx_SYMBOL_REF (Pmode, > >>>>>>>>>>>> + ggc_strdup > >>>>>>>>>>>> (labelpc))); > >>>>>>>>>>>> + assemble_integer (tem, 4, BITS_PER_WORD, 1); > >>>>>>>>>>>> + } > >>>>>>>>>>>> + else > >>>>>>>>>>>> + /* Output ".word .LTHUNKn". */ > >>>>>>>>>>>> + assemble_integer (XEXP (DECL_RTL (function), 0), 4, > >>>>>>>>>>>> BITS_PER_WORD, 1); > >>>>>>>>>>>> > >>>>>>>>>>>> - if (TARGET_THUMB1_ONLY && mi_delta > 255) > >>>>>>>>>>>> - assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1); > >>>>>>>>>>>> + if (TARGET_THUMB1_ONLY && mi_delta > 255) > >>>>>>>>>>>> + assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, > >>>>>>>>>>>> 1); > >>>>>>>>>>>> + } > >>>>>>>>>>>> } > >>>>>>>>>>>> else > >>>>>>>>>>>> { > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>> > >>>>>>> > >>>>> > >>>> > >> >
From 592f8871ea23530409a030fd4a5806d34d7b6b7f Mon Sep 17 00:00:00 2001 From: Christophe Lyon <christophe.l...@linaro.org> Date: Tue, 29 Sep 2020 19:40:39 +0000 Subject: [PATCH 3/3] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code When -mpure-code is used, we cannot load delta from code memory (like we do without -mpure-code). This patch builds the value of mi_delta into r3 with a series of movs/adds/lsls. We also do some cleanup by not emitting the function address and delta via .word directives at the end of the thunk since we don't use them with -mpure-code. No need for new testcases, this bug was already identified by: g++.dg/ipa/pr46287-3.C g++.dg/ipa/pr46984.C g++.dg/opt/thunk1.C g++.dg/torture/pr46287.C g++.dg/torture/pr45699.C 2020-10-12 Christophe Lyon <christophe.l...@linaro.org> gcc/ * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and do not emit function address and delta when -mpure-code is used. --- gcc/config/arm/arm.c | 67 ++++++++++++++++++++++++++++++++-------------------- 1 file changed, 42 insertions(+), 25 deletions(-) diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index 0a78344..8be64f2 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -28473,9 +28473,19 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta, { if (mi_delta > 255) { - fputs ("\tldr\tr3, ", file); - assemble_name (file, label); - fputs ("+4\n", file); + /* With -mpure-code, we cannot load MI_DELTA from the + constant pool: we build it explicitly. */ + if (target_pure_code) + { + t1_print r3 (file, 3); + thumb1_gen_const_int_1 (r3, mi_delta); + } + else + { + fputs ("\tldr\tr3, ", file); + assemble_name (file, label); + fputs ("+4\n", file); + } asm_fprintf (file, "\t%ss\t%r, %r, r3\n", mi_op, this_regno, this_regno); } @@ -28511,30 +28521,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta, fputs ("\tpop\t{r3}\n", file); fprintf (file, "\tbx\tr12\n"); - ASM_OUTPUT_ALIGN (file, 2); - assemble_name (file, label); - fputs (":\n", file); - if (flag_pic) + + /* With -mpure-code, we don't need to emit literals for the + function address and delta since we emitted code to build + them. */ + if (!target_pure_code) { - /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ - rtx tem = XEXP (DECL_RTL (function), 0); - /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC - pipeline offset is four rather than eight. Adjust the offset - accordingly. */ - tem = plus_constant (GET_MODE (tem), tem, - TARGET_THUMB1_ONLY ? -3 : -7); - tem = gen_rtx_MINUS (GET_MODE (tem), - tem, - gen_rtx_SYMBOL_REF (Pmode, - ggc_strdup (labelpc))); - assemble_integer (tem, 4, BITS_PER_WORD, 1); - } - else - /* Output ".word .LTHUNKn". */ - assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1); + ASM_OUTPUT_ALIGN (file, 2); + assemble_name (file, label); + fputs (":\n", file); + if (flag_pic) + { + /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ + rtx tem = XEXP (DECL_RTL (function), 0); + /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC + pipeline offset is four rather than eight. Adjust the offset + accordingly. */ + tem = plus_constant (GET_MODE (tem), tem, + TARGET_THUMB1_ONLY ? -3 : -7); + tem = gen_rtx_MINUS (GET_MODE (tem), + tem, + gen_rtx_SYMBOL_REF (Pmode, + ggc_strdup (labelpc))); + assemble_integer (tem, 4, BITS_PER_WORD, 1); + } + else + /* Output ".word .LTHUNKn". */ + assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1); - if (TARGET_THUMB1_ONLY && mi_delta > 255) - assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1); + if (TARGET_THUMB1_ONLY && mi_delta > 255) + assemble_integer (GEN_INT (mi_delta), 4, BITS_PER_WORD, 1); + } } else { -- 2.7.4
From e5a75d819e976b8d8fb0aad80d7450c98d012cbf Mon Sep 17 00:00:00 2001 From: Christophe Lyon <christophe.l...@linaro.org> Date: Fri, 9 Oct 2020 15:18:07 +0000 Subject: [PATCH 2/3] arm: Call thumb1_gen_const_int from thumb1_movsi_insn thumb1_movsi_insn used the same algorithm to build a constant in asm than thumb1_gen_const_int does in RTL. Since the previous patch added support for asm generation in thumb1_gen_const_int, this patch calls it from thumb1_movsi_insn to avoid duplication. 2020-10-22 Christophe Lyon <christophe.l...@linaro.org> gcc/ * config/arm/thumb1.md (thumb1_movsi_insn): Call thumb1_gen_const_int. --- gcc/config/arm/thumb1.md | 37 ++++--------------------------------- 1 file changed, 4 insertions(+), 33 deletions(-) diff --git a/gcc/config/arm/thumb1.md b/gcc/config/arm/thumb1.md index 2258a52..e8e8c11 100644 --- a/gcc/config/arm/thumb1.md +++ b/gcc/config/arm/thumb1.md @@ -688,40 +688,11 @@ (define_insn "*thumb1_movsi_insn" } else if (GET_CODE (operands[1]) == CONST_INT) { - int i; - HOST_WIDE_INT op1 = INTVAL (operands[1]); - bool mov_done_p = false; - rtx ops[2]; - ops[0] = operands[0]; - - /* Emit upper 3 bytes if needed. */ - for (i = 0; i < 3; i++) - { - int byte = (op1 >> (8 * (3 - i))) & 0xff; - - if (byte) - { - ops[1] = GEN_INT (byte); - if (mov_done_p) - output_asm_insn ("adds\t%0, %1", ops); - else - output_asm_insn ("movs\t%0, %1", ops); - mov_done_p = true; - } - - if (mov_done_p) - output_asm_insn ("lsls\t%0, #8", ops); - } - - /* Emit lower byte if needed. */ - ops[1] = GEN_INT (op1 & 0xff); - if (!mov_done_p) - output_asm_insn ("movs\t%0, %1", ops); - else if (op1 & 0xff) - output_asm_insn ("adds\t%0, %1", ops); - return ""; + thumb1_gen_const_int (operands[0], INTVAL (operands[1])); + return \"\"; } - gcc_unreachable (); + + gcc_unreachable (); case 8: return "ldr\t%0, %1"; case 9: return "str\t%1, %0"; -- 2.7.4
From ad092339f4172141fc0e567cec724f452d95a603 Mon Sep 17 00:00:00 2001 From: Christophe Lyon <christophe.l...@linaro.org> Date: Fri, 9 Oct 2020 12:42:39 +0000 Subject: [PATCH 1/3] arm: Improve thumb1_gen_const_int Enable thumb1_gen_const_int to generate RTL or asm depending on the context, so that we avoid duplicating code to handle constants in thumb-1 with -mpure-code. Use a template so that the algorithm is effectively shared, and rely on two classes to handle the actual emission as RTL or asm. The generated sequence is improved to handle right-shiftable and small values with less instructions. We now generate: 128: movs r0, r0, #128 264: movs r3, #33 lsls r3, #3 510: movs r3, #255 lsls r3, #1 512: movs r3, #1 lsls r3, #9 764: movs r3, #191 lsls r3, #2 65536: movs r3, #1 lsls r3, #16 0x123456: movs r3, #18 ;0x12 lsls r3, #8 adds r3, #52 ;0x34 lsls r3, #8 adds r3, #86 ;0x56 0x1123456: movs r3, #137 ;0x89 lsls r3, #8 adds r3, #26 ;0x1a lsls r3, #8 adds r3, #43 ;0x2b lsls r3, #1 0x1000010: movs r3, #16 lsls r3, #16 adds r3, #1 lsls r3, #4 0x1000011: movs r3, #1 lsls r3, #24 adds r3, #17 2020-10-22 Christophe Lyon <christophe.l...@linaro.org> gcc/ * config/arm/arm.c (t1_rtl, t1_print): New classes. (thumb1_gen_const_int_1): New helper function. (thumb1_gen_const_int): Add capability to emit either RTL or asm, improve generated code by calling thumb1_gen_const_int_1. --- gcc/config/arm/arm.c | 195 ++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 163 insertions(+), 32 deletions(-) diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index ceeb91f..0a78344 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -4513,38 +4513,6 @@ const_ok_for_dimode_op (HOST_WIDE_INT i, enum rtx_code code) } } -/* Emit a sequence of movs/adds/shift to produce a 32-bit constant. - Avoid generating useless code when one of the bytes is zero. */ -void -thumb1_gen_const_int (rtx op0, HOST_WIDE_INT op1) -{ - bool mov_done_p = false; - int i; - - /* Emit upper 3 bytes if needed. */ - for (i = 0; i < 3; i++) - { - int byte = (op1 >> (8 * (3 - i))) & 0xff; - - if (byte) - { - emit_set_insn (op0, mov_done_p - ? gen_rtx_PLUS (SImode,op0, GEN_INT (byte)) - : GEN_INT (byte)); - mov_done_p = true; - } - - if (mov_done_p) - emit_set_insn (op0, gen_rtx_ASHIFT (SImode, op0, GEN_INT (8))); - } - - /* Emit lower byte if needed. */ - if (!mov_done_p) - emit_set_insn (op0, GEN_INT (op1 & 0xff)); - else if (op1 & 0xff) - emit_set_insn (op0, gen_rtx_PLUS (SImode, op0, GEN_INT (op1 & 0xff))); -} - /* Emit a sequence of insns to handle a large constant. CODE is the code of the operation required, it can be any of SET, PLUS, IOR, AND, XOR, MINUS; @@ -28244,6 +28212,169 @@ arm_internal_label (FILE *stream, const char *prefix, unsigned long labelno) default_internal_label (stream, prefix, labelno); } +/* Define classes to generate code as RTL or output asm to a file. + Using templates then allows to use the same code to output code + sequences in the two formats. */ +class t1_rtl +{ + public: + t1_rtl (rtx dst) : dst (dst) {} + + void mov (int val) + { + emit_set_insn (dst, GEN_INT (val)); + } + + void add (int val) + { + emit_set_insn (dst, gen_rtx_PLUS (SImode, dst, GEN_INT (val))); + } + + void ashift (int shift) + { + emit_set_insn (dst, gen_rtx_ASHIFT (SImode, dst, GEN_INT (shift))); + } + + private: + rtx dst; +}; + +class t1_print +{ + public: + t1_print (FILE *f, int regno) : t_file (f), dst_regno (regno) {} + + void mov (int val) + { + asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val); + } + + void add (int val) + { + asm_fprintf (t_file, "\tadds\tr%d, #%d\n", dst_regno, val); + } + + void ashift (int shift) + { + asm_fprintf (t_file, "\tlsls\tr%d, #%d\n", dst_regno, shift); + } + + private: + FILE *t_file; + int dst_regno; +}; + +/* Emit a sequence of movs/adds/shift to produce a 32-bit constant. + Avoid generating useless code when one of the bytes is zero. */ +template <class T> +void +thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1) +{ + bool mov_done_p = false; + int val = op1; + int shift = 0; + int i; + + if (val == 0) + { + dst.mov (val); + return; + } + + /* In the general case, we need 7 instructions to build + a 32 bits constant (1 movs, 3 lsls, 3 adds). We can + do better if VAL is small enough, or + right-shiftable by a suitable amount. If the + right-shift enables to encode at least one less byte, + it's worth it: we save a adds and a lsls at the + expense of a final lsls. */ + int final_shift = number_of_first_bit_set (val); + + int leading_zeroes = clz_hwi (val); + int number_of_bytes_needed + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes) + / BITS_PER_UNIT) + 1; + int number_of_bytes_needed2 + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift) + / BITS_PER_UNIT) + 1; + + if (number_of_bytes_needed2 < number_of_bytes_needed) + val >>= final_shift; + else + final_shift = 0; + + /* If we are in a very small range, we can use either a single movs + or movs+adds. */ + if ((val >= 0) && (val <= 510)) + { + if (val > 255) + { + int high = val - 255; + + dst.mov (high); + dst.add (255); + } + else + dst.mov (val); + + if (final_shift > 0) + dst.ashift (final_shift); + } + else + { + /* General case, emit upper 3 bytes as needed. */ + for (i = 0; i < 3; i++) + { + int byte = (val >> (8 * (3 - i))) & 0xff; + + if (byte) + { + /* We are about to emit new bits, stop accumulating a + shift amount, and left-shift only if we have already + emitted some upper bits. */ + if (mov_done_p) + { + dst.ashift (shift); + dst.add (byte); + } + else + dst.mov (byte); + + /* Stop accumulating shift amount since we've just + emitted some bits. */ + shift = 0; + + mov_done_p = true; + } + + if (mov_done_p) + shift += 8; + } + + /* Emit lower byte. */ + if (!mov_done_p) + dst.mov (val & 0xff); + else + { + dst.ashift (shift); + if (val & 0xff) + dst.add (val & 0xff); + } + + if (final_shift > 0) + dst.ashift (final_shift); + } +} + +/* Proxy for thumb1.md, since the t1_print and t1_rtl classes are not + exported. */ +void +thumb1_gen_const_int (rtx dst, HOST_WIDE_INT op1) +{ + t1_rtl t (dst); + thumb1_gen_const_int_1 (t, op1); +} + /* Output code to add DELTA to the first argument, and then jump to FUNCTION. Used for C++ multiple inheritance. */ -- 2.7.4