Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-11-02 Thread Richard Earnshaw via Gcc-patches
On 02/11/2020 10:24, Christophe Lyon via Gcc-patches wrote:
> Hi,
> 
> On Fri, 30 Oct 2020 at 13:49, Richard Earnshaw
>  wrote:
>>
>> On 29/10/2020 19:18, Richard Earnshaw via Gcc-patches wrote:
>>> On 28/10/2020 18:10, Christophe Lyon via Gcc-patches wrote:
 On Wed, 28 Oct 2020 at 18:44, Richard Earnshaw
  wrote:
>
> On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote:
>> On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote:
>>> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
>>>  wrote:

 On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
>  wrote:
>>
>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
>>>  wrote:

 On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
>  wrote:
>>
>> On 20/10/2020 12:22, Richard Earnshaw wrote:
>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
 On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
  wrote:
>
> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>>  wrote:
>>>
>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
 On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
  wrote:
>
> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches 
> wrote:
>> When mi_delta is > 255 and -mpure-code is used, we 
>> cannot load delta
>> from code memory (like we do without -mpure-code).
>>
>> This patch builds the value of mi_delta into r3 with a 
>> series of
>> movs/adds/lsls.
>>
>> We also do some cleanup by not emitting the function 
>> address and delta
>> via .word directives at the end of the thunk since we 
>> don't use them
>> with -mpure-code.
>>
>> No need for new testcases, this bug was already 
>> identified by
>> eg. pr46287-3.C
>>
>> 2020-09-29  Christophe Lyon  
>>
>>   gcc/
>>   * config/arm/arm.c (arm_thumb1_mi_thunk): Build 
>> mi_delta in r3 and
>>   do not emit function address and delta when 
>> -mpure-code is used.
>
 Hi Richard,

 Thanks for your comments.

> There are some optimizations you can make to this code.
>
> Firstly, for values between 256 and 510 (inclusive), it 
> would be better
> to just expand a mov of 255 followed by an add.
 I now see the splitted for the "Pe" constraint which I 
 hadn't noticed
 before, so I can write something similar indeed.

 However, I'm note quite sure to understand the benefit in 
 the split
 when -mpure-code is NOT used.
 Consider:
 int f3_1 (void) { return 510; }
 int f3_2 (void) { return 511; }
 Compile with -O2 -mcpu=cortex-m0:
 f3_1:
 movsr0, #255
 lslsr0, r0, #1
 bx  lr
 f3_2:
 ldr r0, .L4
 bx  lr

 The splitter makes the code bigger, does it "compensate" 
 for this by
 not having to load the constant?
 Actually the constant uses 4 more bytes, which should be 
 taken into
 account when comparing code size,
>>>
>>> Yes, the size of the literal pool entry needs to be taken 
>>> into account.
>>>  It might happen that the entry could be shared with 
>>> another use of that
>>> literal, but in general that's rare.
>>>
 so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-11-02 Thread Christophe Lyon via Gcc-patches
Hi,

On Fri, 30 Oct 2020 at 13:49, Richard Earnshaw
 wrote:
>
> On 29/10/2020 19:18, Richard Earnshaw via Gcc-patches wrote:
> > On 28/10/2020 18:10, Christophe Lyon via Gcc-patches wrote:
> >> On Wed, 28 Oct 2020 at 18:44, Richard Earnshaw
> >>  wrote:
> >>>
> >>> On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote:
>  On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote:
> > On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
> >  wrote:
> >>
> >> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
> >>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
> >>>  wrote:
> 
>  On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
> > On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
> >  wrote:
> >>
> >> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
> >>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
> >>>  wrote:
> 
>  On 20/10/2020 12:22, Richard Earnshaw wrote:
> > On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
> >> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
> >>  wrote:
> >>>
> >>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>  On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>   wrote:
> >
> > On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
> >> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
> >>  wrote:
> >>>
> >>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches 
> >>> wrote:
>  When mi_delta is > 255 and -mpure-code is used, we 
>  cannot load delta
>  from code memory (like we do without -mpure-code).
> 
>  This patch builds the value of mi_delta into r3 with a 
>  series of
>  movs/adds/lsls.
> 
>  We also do some cleanup by not emitting the function 
>  address and delta
>  via .word directives at the end of the thunk since we 
>  don't use them
>  with -mpure-code.
> 
>  No need for new testcases, this bug was already 
>  identified by
>  eg. pr46287-3.C
> 
>  2020-09-29  Christophe Lyon  
> 
>    gcc/
>    * config/arm/arm.c (arm_thumb1_mi_thunk): Build 
>  mi_delta in r3 and
>    do not emit function address and delta when 
>  -mpure-code is used.
> >>>
> >> Hi Richard,
> >>
> >> Thanks for your comments.
> >>
> >>> There are some optimizations you can make to this code.
> >>>
> >>> Firstly, for values between 256 and 510 (inclusive), it 
> >>> would be better
> >>> to just expand a mov of 255 followed by an add.
> >> I now see the splitted for the "Pe" constraint which I 
> >> hadn't noticed
> >> before, so I can write something similar indeed.
> >>
> >> However, I'm note quite sure to understand the benefit in 
> >> the split
> >> when -mpure-code is NOT used.
> >> Consider:
> >> int f3_1 (void) { return 510; }
> >> int f3_2 (void) { return 511; }
> >> Compile with -O2 -mcpu=cortex-m0:
> >> f3_1:
> >> movsr0, #255
> >> lslsr0, r0, #1
> >> bx  lr
> >> f3_2:
> >> ldr r0, .L4
> >> bx  lr
> >>
> >> The splitter makes the code bigger, does it "compensate" 
> >> for this by
> >> not having to load the constant?
> >> Actually the constant uses 4 more bytes, which should be 
> >> taken into
> >> account when comparing code size,
> >
> > Yes, the size of the literal pool entry needs to be taken 
> > into account.
> >  It might happen that the entry could be shared with 
> > another use of that
> > literal, but in general that's rare.
> >
> >> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below 
> >> three
> >> thumb1 instructions would 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-30 Thread Richard Earnshaw via Gcc-patches
On 29/10/2020 19:18, Richard Earnshaw via Gcc-patches wrote:
> On 28/10/2020 18:10, Christophe Lyon via Gcc-patches wrote:
>> On Wed, 28 Oct 2020 at 18:44, Richard Earnshaw
>>  wrote:
>>>
>>> On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote:
 On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote:
> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
>  wrote:
>>
>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
>>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
>>>  wrote:

 On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
>  wrote:
>>
>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
>>>  wrote:

 On 20/10/2020 12:22, Richard Earnshaw wrote:
> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>>  wrote:
>>>
>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
 On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
  wrote:
>
> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>>  wrote:
>>>
>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
 When mi_delta is > 255 and -mpure-code is used, we cannot 
 load delta
 from code memory (like we do without -mpure-code).

 This patch builds the value of mi_delta into r3 with a 
 series of
 movs/adds/lsls.

 We also do some cleanup by not emitting the function 
 address and delta
 via .word directives at the end of the thunk since we 
 don't use them
 with -mpure-code.

 No need for new testcases, this bug was already identified 
 by
 eg. pr46287-3.C

 2020-09-29  Christophe Lyon  

   gcc/
   * config/arm/arm.c (arm_thumb1_mi_thunk): Build 
 mi_delta in r3 and
   do not emit function address and delta when 
 -mpure-code is used.
>>>
>> Hi Richard,
>>
>> Thanks for your comments.
>>
>>> There are some optimizations you can make to this code.
>>>
>>> Firstly, for values between 256 and 510 (inclusive), it 
>>> would be better
>>> to just expand a mov of 255 followed by an add.
>> I now see the splitted for the "Pe" constraint which I 
>> hadn't noticed
>> before, so I can write something similar indeed.
>>
>> However, I'm note quite sure to understand the benefit in 
>> the split
>> when -mpure-code is NOT used.
>> Consider:
>> int f3_1 (void) { return 510; }
>> int f3_2 (void) { return 511; }
>> Compile with -O2 -mcpu=cortex-m0:
>> f3_1:
>> movsr0, #255
>> lslsr0, r0, #1
>> bx  lr
>> f3_2:
>> ldr r0, .L4
>> bx  lr
>>
>> The splitter makes the code bigger, does it "compensate" for 
>> this by
>> not having to load the constant?
>> Actually the constant uses 4 more bytes, which should be 
>> taken into
>> account when comparing code size,
>
> Yes, the size of the literal pool entry needs to be taken 
> into account.
>  It might happen that the entry could be shared with another 
> use of that
> literal, but in general that's rare.
>
>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below 
>> three
>> thumb1 instructions would be equivalent in size compared to 
>> loading
>> from the literal pool. Should the 256-510 range be extended?
>
> It's a bit borderline at three instructions when literal 
> pools are not
> expensive to 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-29 Thread Richard Earnshaw via Gcc-patches
On 28/10/2020 18:10, Christophe Lyon via Gcc-patches wrote:
> On Wed, 28 Oct 2020 at 18:44, Richard Earnshaw
>  wrote:
>>
>> On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote:
>>> On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote:
 On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
  wrote:
>
> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
>>  wrote:
>>>
>>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
 On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
  wrote:
>
> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
>>  wrote:
>>>
>>> On 20/10/2020 12:22, Richard Earnshaw wrote:
 On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>  wrote:
>>
>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>>>  wrote:

 On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>  wrote:
>>
>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>>> When mi_delta is > 255 and -mpure-code is used, we cannot 
>>> load delta
>>> from code memory (like we do without -mpure-code).
>>>
>>> This patch builds the value of mi_delta into r3 with a 
>>> series of
>>> movs/adds/lsls.
>>>
>>> We also do some cleanup by not emitting the function 
>>> address and delta
>>> via .word directives at the end of the thunk since we don't 
>>> use them
>>> with -mpure-code.
>>>
>>> No need for new testcases, this bug was already identified 
>>> by
>>> eg. pr46287-3.C
>>>
>>> 2020-09-29  Christophe Lyon  
>>>
>>>   gcc/
>>>   * config/arm/arm.c (arm_thumb1_mi_thunk): Build 
>>> mi_delta in r3 and
>>>   do not emit function address and delta when 
>>> -mpure-code is used.
>>
> Hi Richard,
>
> Thanks for your comments.
>
>> There are some optimizations you can make to this code.
>>
>> Firstly, for values between 256 and 510 (inclusive), it 
>> would be better
>> to just expand a mov of 255 followed by an add.
> I now see the splitted for the "Pe" constraint which I hadn't 
> noticed
> before, so I can write something similar indeed.
>
> However, I'm note quite sure to understand the benefit in the 
> split
> when -mpure-code is NOT used.
> Consider:
> int f3_1 (void) { return 510; }
> int f3_2 (void) { return 511; }
> Compile with -O2 -mcpu=cortex-m0:
> f3_1:
> movsr0, #255
> lslsr0, r0, #1
> bx  lr
> f3_2:
> ldr r0, .L4
> bx  lr
>
> The splitter makes the code bigger, does it "compensate" for 
> this by
> not having to load the constant?
> Actually the constant uses 4 more bytes, which should be 
> taken into
> account when comparing code size,

 Yes, the size of the literal pool entry needs to be taken into 
 account.
  It might happen that the entry could be shared with another 
 use of that
 literal, but in general that's rare.

> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below 
> three
> thumb1 instructions would be equivalent in size compared to 
> loading
> from the literal pool. Should the 256-510 range be extended?

 It's a bit borderline at three instructions when literal pools 
 are not
 expensive to use, but in thumb1 literal pools tend to be quite 
 small due
 to the limited pc offsets we can use.  I think on balance we 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-28 Thread Christophe Lyon via Gcc-patches
On Wed, 28 Oct 2020 at 18:44, Richard Earnshaw
 wrote:
>
> On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote:
> > On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote:
> >> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
> >>  wrote:
> >>>
> >>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
>  On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
>   wrote:
> >
> > On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
> >> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
> >>  wrote:
> >>>
> >>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
>  On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
>   wrote:
> >
> > On 20/10/2020 12:22, Richard Earnshaw wrote:
> >> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
> >>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
> >>>  wrote:
> 
>  On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
> > On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
> >  wrote:
> >>
> >> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
> >>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
> >>>  wrote:
> 
>  On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> > When mi_delta is > 255 and -mpure-code is used, we cannot 
> > load delta
> > from code memory (like we do without -mpure-code).
> >
> > This patch builds the value of mi_delta into r3 with a 
> > series of
> > movs/adds/lsls.
> >
> > We also do some cleanup by not emitting the function 
> > address and delta
> > via .word directives at the end of the thunk since we don't 
> > use them
> > with -mpure-code.
> >
> > No need for new testcases, this bug was already identified 
> > by
> > eg. pr46287-3.C
> >
> > 2020-09-29  Christophe Lyon  
> >
> >   gcc/
> >   * config/arm/arm.c (arm_thumb1_mi_thunk): Build 
> > mi_delta in r3 and
> >   do not emit function address and delta when 
> > -mpure-code is used.
> 
> >>> Hi Richard,
> >>>
> >>> Thanks for your comments.
> >>>
>  There are some optimizations you can make to this code.
> 
>  Firstly, for values between 256 and 510 (inclusive), it 
>  would be better
>  to just expand a mov of 255 followed by an add.
> >>> I now see the splitted for the "Pe" constraint which I hadn't 
> >>> noticed
> >>> before, so I can write something similar indeed.
> >>>
> >>> However, I'm note quite sure to understand the benefit in the 
> >>> split
> >>> when -mpure-code is NOT used.
> >>> Consider:
> >>> int f3_1 (void) { return 510; }
> >>> int f3_2 (void) { return 511; }
> >>> Compile with -O2 -mcpu=cortex-m0:
> >>> f3_1:
> >>> movsr0, #255
> >>> lslsr0, r0, #1
> >>> bx  lr
> >>> f3_2:
> >>> ldr r0, .L4
> >>> bx  lr
> >>>
> >>> The splitter makes the code bigger, does it "compensate" for 
> >>> this by
> >>> not having to load the constant?
> >>> Actually the constant uses 4 more bytes, which should be 
> >>> taken into
> >>> account when comparing code size,
> >>
> >> Yes, the size of the literal pool entry needs to be taken into 
> >> account.
> >>  It might happen that the entry could be shared with another 
> >> use of that
> >> literal, but in general that's rare.
> >>
> >>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below 
> >>> three
> >>> thumb1 instructions would be equivalent in size compared to 
> >>> loading
> >>> from the literal pool. Should the 256-510 range be extended?
> >>
> >> It's a bit borderline at three instructions when literal pools 
> >> are not
> >> expensive to use, but in thumb1 literal pools tend to be quite 
> >> small due
> >> to the limited pc offsets we can use.  I think on balance we 
> >> probably
> >> want to use the instruction sequence 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-28 Thread Richard Earnshaw via Gcc-patches
On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote:
> On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote:
>> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
>>  wrote:
>>>
>>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
 On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
  wrote:
>
> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
>>  wrote:
>>>
>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
 On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
  wrote:
>
> On 20/10/2020 12:22, Richard Earnshaw wrote:
>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>>>  wrote:

 On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>  wrote:
>>
>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>>>  wrote:

 On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> When mi_delta is > 255 and -mpure-code is used, we cannot 
> load delta
> from code memory (like we do without -mpure-code).
>
> This patch builds the value of mi_delta into r3 with a series 
> of
> movs/adds/lsls.
>
> We also do some cleanup by not emitting the function address 
> and delta
> via .word directives at the end of the thunk since we don't 
> use them
> with -mpure-code.
>
> No need for new testcases, this bug was already identified by
> eg. pr46287-3.C
>
> 2020-09-29  Christophe Lyon  
>
>   gcc/
>   * config/arm/arm.c (arm_thumb1_mi_thunk): Build 
> mi_delta in r3 and
>   do not emit function address and delta when -mpure-code 
> is used.

>>> Hi Richard,
>>>
>>> Thanks for your comments.
>>>
 There are some optimizations you can make to this code.

 Firstly, for values between 256 and 510 (inclusive), it would 
 be better
 to just expand a mov of 255 followed by an add.
>>> I now see the splitted for the "Pe" constraint which I hadn't 
>>> noticed
>>> before, so I can write something similar indeed.
>>>
>>> However, I'm note quite sure to understand the benefit in the 
>>> split
>>> when -mpure-code is NOT used.
>>> Consider:
>>> int f3_1 (void) { return 510; }
>>> int f3_2 (void) { return 511; }
>>> Compile with -O2 -mcpu=cortex-m0:
>>> f3_1:
>>> movsr0, #255
>>> lslsr0, r0, #1
>>> bx  lr
>>> f3_2:
>>> ldr r0, .L4
>>> bx  lr
>>>
>>> The splitter makes the code bigger, does it "compensate" for 
>>> this by
>>> not having to load the constant?
>>> Actually the constant uses 4 more bytes, which should be taken 
>>> into
>>> account when comparing code size,
>>
>> Yes, the size of the literal pool entry needs to be taken into 
>> account.
>>  It might happen that the entry could be shared with another use 
>> of that
>> literal, but in general that's rare.
>>
>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>>> thumb1 instructions would be equivalent in size compared to 
>>> loading
>>> from the literal pool. Should the 256-510 range be extended?
>>
>> It's a bit borderline at three instructions when literal pools 
>> are not
>> expensive to use, but in thumb1 literal pools tend to be quite 
>> small due
>> to the limited pc offsets we can use.  I think on balance we 
>> probably
>> want to use the instruction sequence unless optimizing for size.
>>
>>>
>>>
 This is also true for
 the literal pools alternative as well, so should be handled 
 before all
 this.
>>> I am not sure what you mean: with -mpure-code, the above 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-27 Thread Richard Earnshaw via Gcc-patches
On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote:
> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
>  wrote:
>>
>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
>>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
>>>  wrote:

 On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
>  wrote:
>>
>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
>>>  wrote:

 On 20/10/2020 12:22, Richard Earnshaw wrote:
> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>>  wrote:
>>>
>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
 On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
  wrote:
>
> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>>  wrote:
>>>
>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
 When mi_delta is > 255 and -mpure-code is used, we cannot load 
 delta
 from code memory (like we do without -mpure-code).

 This patch builds the value of mi_delta into r3 with a series 
 of
 movs/adds/lsls.

 We also do some cleanup by not emitting the function address 
 and delta
 via .word directives at the end of the thunk since we don't 
 use them
 with -mpure-code.

 No need for new testcases, this bug was already identified by
 eg. pr46287-3.C

 2020-09-29  Christophe Lyon  

   gcc/
   * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta 
 in r3 and
   do not emit function address and delta when -mpure-code 
 is used.
>>>
>> Hi Richard,
>>
>> Thanks for your comments.
>>
>>> There are some optimizations you can make to this code.
>>>
>>> Firstly, for values between 256 and 510 (inclusive), it would 
>>> be better
>>> to just expand a mov of 255 followed by an add.
>> I now see the splitted for the "Pe" constraint which I hadn't 
>> noticed
>> before, so I can write something similar indeed.
>>
>> However, I'm note quite sure to understand the benefit in the 
>> split
>> when -mpure-code is NOT used.
>> Consider:
>> int f3_1 (void) { return 510; }
>> int f3_2 (void) { return 511; }
>> Compile with -O2 -mcpu=cortex-m0:
>> f3_1:
>> movsr0, #255
>> lslsr0, r0, #1
>> bx  lr
>> f3_2:
>> ldr r0, .L4
>> bx  lr
>>
>> The splitter makes the code bigger, does it "compensate" for 
>> this by
>> not having to load the constant?
>> Actually the constant uses 4 more bytes, which should be taken 
>> into
>> account when comparing code size,
>
> Yes, the size of the literal pool entry needs to be taken into 
> account.
>  It might happen that the entry could be shared with another use 
> of that
> literal, but in general that's rare.
>
>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>> thumb1 instructions would be equivalent in size compared to 
>> loading
>> from the literal pool. Should the 256-510 range be extended?
>
> It's a bit borderline at three instructions when literal pools 
> are not
> expensive to use, but in thumb1 literal pools tend to be quite 
> small due
> to the limited pc offsets we can use.  I think on balance we 
> probably
> want to use the instruction sequence unless optimizing for size.
>
>>
>>
>>> This is also true for
>>> the literal pools alternative as well, so should be handled 
>>> before all
>>> this.
>> I am not sure what you mean: with -mpure-code, the above sample 
>> is compiled as:
>> f3_1:
>> movsr0, #255
>> lslsr0, r0, #1
>> bx  lr

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-26 Thread Christophe Lyon via Gcc-patches
On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
 wrote:
>
> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
> > On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
> >  wrote:
> >>
> >> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
> >>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
> >>>  wrote:
> 
>  On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
> > On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
> >  wrote:
> >>
> >> On 20/10/2020 12:22, Richard Earnshaw wrote:
> >>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
>  On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>   wrote:
> >
> > On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
> >> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
> >>  wrote:
> >>>
> >>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>  On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>   wrote:
> >
> > On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> >> When mi_delta is > 255 and -mpure-code is used, we cannot load 
> >> delta
> >> from code memory (like we do without -mpure-code).
> >>
> >> This patch builds the value of mi_delta into r3 with a series 
> >> of
> >> movs/adds/lsls.
> >>
> >> We also do some cleanup by not emitting the function address 
> >> and delta
> >> via .word directives at the end of the thunk since we don't 
> >> use them
> >> with -mpure-code.
> >>
> >> No need for new testcases, this bug was already identified by
> >> eg. pr46287-3.C
> >>
> >> 2020-09-29  Christophe Lyon  
> >>
> >>   gcc/
> >>   * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta 
> >> in r3 and
> >>   do not emit function address and delta when -mpure-code 
> >> is used.
> >
>  Hi Richard,
> 
>  Thanks for your comments.
> 
> > There are some optimizations you can make to this code.
> >
> > Firstly, for values between 256 and 510 (inclusive), it would 
> > be better
> > to just expand a mov of 255 followed by an add.
>  I now see the splitted for the "Pe" constraint which I hadn't 
>  noticed
>  before, so I can write something similar indeed.
> 
>  However, I'm note quite sure to understand the benefit in the 
>  split
>  when -mpure-code is NOT used.
>  Consider:
>  int f3_1 (void) { return 510; }
>  int f3_2 (void) { return 511; }
>  Compile with -O2 -mcpu=cortex-m0:
>  f3_1:
>  movsr0, #255
>  lslsr0, r0, #1
>  bx  lr
>  f3_2:
>  ldr r0, .L4
>  bx  lr
> 
>  The splitter makes the code bigger, does it "compensate" for 
>  this by
>  not having to load the constant?
>  Actually the constant uses 4 more bytes, which should be taken 
>  into
>  account when comparing code size,
> >>>
> >>> Yes, the size of the literal pool entry needs to be taken into 
> >>> account.
> >>>  It might happen that the entry could be shared with another use 
> >>> of that
> >>> literal, but in general that's rare.
> >>>
>  so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>  thumb1 instructions would be equivalent in size compared to 
>  loading
>  from the literal pool. Should the 256-510 range be extended?
> >>>
> >>> It's a bit borderline at three instructions when literal pools 
> >>> are not
> >>> expensive to use, but in thumb1 literal pools tend to be quite 
> >>> small due
> >>> to the limited pc offsets we can use.  I think on balance we 
> >>> probably
> >>> want to use the instruction sequence unless optimizing for size.
> >>>
> 
> 
> > This is also true for
> > the literal pools alternative as well, so should be handled 
> > before all
> > this.
>  I am not sure what you mean: with -mpure-code, the above sample 
>  is compiled as:
>  f3_1:
>  movsr0, #255
>  lslsr0, r0, #1
>  bx  lr
>  f3_2:
>  movsr0, #1
> 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-22 Thread Richard Earnshaw via Gcc-patches
On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
>  wrote:
>>
>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
>>>  wrote:

 On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
>  wrote:
>>
>> On 20/10/2020 12:22, Richard Earnshaw wrote:
>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
 On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
  wrote:
>
> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>>  wrote:
>>>
>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
 On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
  wrote:
>
> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>> When mi_delta is > 255 and -mpure-code is used, we cannot load 
>> delta
>> from code memory (like we do without -mpure-code).
>>
>> This patch builds the value of mi_delta into r3 with a series of
>> movs/adds/lsls.
>>
>> We also do some cleanup by not emitting the function address and 
>> delta
>> via .word directives at the end of the thunk since we don't use 
>> them
>> with -mpure-code.
>>
>> No need for new testcases, this bug was already identified by
>> eg. pr46287-3.C
>>
>> 2020-09-29  Christophe Lyon  
>>
>>   gcc/
>>   * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta 
>> in r3 and
>>   do not emit function address and delta when -mpure-code is 
>> used.
>
 Hi Richard,

 Thanks for your comments.

> There are some optimizations you can make to this code.
>
> Firstly, for values between 256 and 510 (inclusive), it would be 
> better
> to just expand a mov of 255 followed by an add.
 I now see the splitted for the "Pe" constraint which I hadn't 
 noticed
 before, so I can write something similar indeed.

 However, I'm note quite sure to understand the benefit in the split
 when -mpure-code is NOT used.
 Consider:
 int f3_1 (void) { return 510; }
 int f3_2 (void) { return 511; }
 Compile with -O2 -mcpu=cortex-m0:
 f3_1:
 movsr0, #255
 lslsr0, r0, #1
 bx  lr
 f3_2:
 ldr r0, .L4
 bx  lr

 The splitter makes the code bigger, does it "compensate" for this 
 by
 not having to load the constant?
 Actually the constant uses 4 more bytes, which should be taken into
 account when comparing code size,
>>>
>>> Yes, the size of the literal pool entry needs to be taken into 
>>> account.
>>>  It might happen that the entry could be shared with another use of 
>>> that
>>> literal, but in general that's rare.
>>>
 so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
 thumb1 instructions would be equivalent in size compared to loading
 from the literal pool. Should the 256-510 range be extended?
>>>
>>> It's a bit borderline at three instructions when literal pools are 
>>> not
>>> expensive to use, but in thumb1 literal pools tend to be quite 
>>> small due
>>> to the limited pc offsets we can use.  I think on balance we 
>>> probably
>>> want to use the instruction sequence unless optimizing for size.
>>>


> This is also true for
> the literal pools alternative as well, so should be handled 
> before all
> this.
 I am not sure what you mean: with -mpure-code, the above sample is 
 compiled as:
 f3_1:
 movsr0, #255
 lslsr0, r0, #1
 bx  lr
 f3_2:
 movsr0, #1
 lslsr0, r0, #8
 addsr0, r0, #255
 bx  lr

 so the "return 510" case is already handled as without -mpure-code.
>>>
>>> I was thinking specifically of the thunk sequence where you seem to 
>>> be
>>> emitting instructions directly 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-22 Thread Christophe Lyon via Gcc-patches
On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
 wrote:
>
> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
> > On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
> >  wrote:
> >>
> >> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
> >>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
> >>>  wrote:
> 
>  On 20/10/2020 12:22, Richard Earnshaw wrote:
> > On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
> >> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
> >>  wrote:
> >>>
> >>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>  On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>   wrote:
> >
> > On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
> >> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
> >>  wrote:
> >>>
> >>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>  When mi_delta is > 255 and -mpure-code is used, we cannot load 
>  delta
>  from code memory (like we do without -mpure-code).
> 
>  This patch builds the value of mi_delta into r3 with a series of
>  movs/adds/lsls.
> 
>  We also do some cleanup by not emitting the function address and 
>  delta
>  via .word directives at the end of the thunk since we don't use 
>  them
>  with -mpure-code.
> 
>  No need for new testcases, this bug was already identified by
>  eg. pr46287-3.C
> 
>  2020-09-29  Christophe Lyon  
> 
>    gcc/
>    * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta 
>  in r3 and
>    do not emit function address and delta when -mpure-code is 
>  used.
> >>>
> >> Hi Richard,
> >>
> >> Thanks for your comments.
> >>
> >>> There are some optimizations you can make to this code.
> >>>
> >>> Firstly, for values between 256 and 510 (inclusive), it would be 
> >>> better
> >>> to just expand a mov of 255 followed by an add.
> >> I now see the splitted for the "Pe" constraint which I hadn't 
> >> noticed
> >> before, so I can write something similar indeed.
> >>
> >> However, I'm note quite sure to understand the benefit in the split
> >> when -mpure-code is NOT used.
> >> Consider:
> >> int f3_1 (void) { return 510; }
> >> int f3_2 (void) { return 511; }
> >> Compile with -O2 -mcpu=cortex-m0:
> >> f3_1:
> >> movsr0, #255
> >> lslsr0, r0, #1
> >> bx  lr
> >> f3_2:
> >> ldr r0, .L4
> >> bx  lr
> >>
> >> The splitter makes the code bigger, does it "compensate" for this 
> >> by
> >> not having to load the constant?
> >> Actually the constant uses 4 more bytes, which should be taken into
> >> account when comparing code size,
> >
> > Yes, the size of the literal pool entry needs to be taken into 
> > account.
> >  It might happen that the entry could be shared with another use of 
> > that
> > literal, but in general that's rare.
> >
> >> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
> >> thumb1 instructions would be equivalent in size compared to loading
> >> from the literal pool. Should the 256-510 range be extended?
> >
> > It's a bit borderline at three instructions when literal pools are 
> > not
> > expensive to use, but in thumb1 literal pools tend to be quite 
> > small due
> > to the limited pc offsets we can use.  I think on balance we 
> > probably
> > want to use the instruction sequence unless optimizing for size.
> >
> >>
> >>
> >>> This is also true for
> >>> the literal pools alternative as well, so should be handled 
> >>> before all
> >>> this.
> >> I am not sure what you mean: with -mpure-code, the above sample is 
> >> compiled as:
> >> f3_1:
> >> movsr0, #255
> >> lslsr0, r0, #1
> >> bx  lr
> >> f3_2:
> >> movsr0, #1
> >> lslsr0, r0, #8
> >> addsr0, r0, #255
> >> bx  lr
> >>
> >> so the "return 510" case is already handled as without -mpure-code.
> >
> > I was thinking specifically of the thunk sequence where you seem to 
> > be
> > emitting instructions directly rather than generating RTL.  The 
> > examples
> > 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-21 Thread Richard Earnshaw via Gcc-patches
On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
>  wrote:
>>
>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
>>>  wrote:

 On 20/10/2020 12:22, Richard Earnshaw wrote:
> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>>  wrote:
>>>
>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
 On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
  wrote:
>
> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>>  wrote:
>>>
>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
 When mi_delta is > 255 and -mpure-code is used, we cannot load 
 delta
 from code memory (like we do without -mpure-code).

 This patch builds the value of mi_delta into r3 with a series of
 movs/adds/lsls.

 We also do some cleanup by not emitting the function address and 
 delta
 via .word directives at the end of the thunk since we don't use 
 them
 with -mpure-code.

 No need for new testcases, this bug was already identified by
 eg. pr46287-3.C

 2020-09-29  Christophe Lyon  

   gcc/
   * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in 
 r3 and
   do not emit function address and delta when -mpure-code is 
 used.
>>>
>> Hi Richard,
>>
>> Thanks for your comments.
>>
>>> There are some optimizations you can make to this code.
>>>
>>> Firstly, for values between 256 and 510 (inclusive), it would be 
>>> better
>>> to just expand a mov of 255 followed by an add.
>> I now see the splitted for the "Pe" constraint which I hadn't noticed
>> before, so I can write something similar indeed.
>>
>> However, I'm note quite sure to understand the benefit in the split
>> when -mpure-code is NOT used.
>> Consider:
>> int f3_1 (void) { return 510; }
>> int f3_2 (void) { return 511; }
>> Compile with -O2 -mcpu=cortex-m0:
>> f3_1:
>> movsr0, #255
>> lslsr0, r0, #1
>> bx  lr
>> f3_2:
>> ldr r0, .L4
>> bx  lr
>>
>> The splitter makes the code bigger, does it "compensate" for this by
>> not having to load the constant?
>> Actually the constant uses 4 more bytes, which should be taken into
>> account when comparing code size,
>
> Yes, the size of the literal pool entry needs to be taken into 
> account.
>  It might happen that the entry could be shared with another use of 
> that
> literal, but in general that's rare.
>
>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>> thumb1 instructions would be equivalent in size compared to loading
>> from the literal pool. Should the 256-510 range be extended?
>
> It's a bit borderline at three instructions when literal pools are not
> expensive to use, but in thumb1 literal pools tend to be quite small 
> due
> to the limited pc offsets we can use.  I think on balance we probably
> want to use the instruction sequence unless optimizing for size.
>
>>
>>
>>> This is also true for
>>> the literal pools alternative as well, so should be handled before 
>>> all
>>> this.
>> I am not sure what you mean: with -mpure-code, the above sample is 
>> compiled as:
>> f3_1:
>> movsr0, #255
>> lslsr0, r0, #1
>> bx  lr
>> f3_2:
>> movsr0, #1
>> lslsr0, r0, #8
>> addsr0, r0, #255
>> bx  lr
>>
>> so the "return 510" case is already handled as without -mpure-code.
>
> I was thinking specifically of the thunk sequence where you seem to be
> emitting instructions directly rather than generating RTL.  The 
> examples
> you show here are not thunks.
>
 OK thanks for the clarification.

 Here is an updated version, split into 3 patches to hopefully make
 review easier.
 They apply on top of my other mpure-code patches for PR96967 and 
 PR96770:
 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-21 Thread Christophe Lyon via Gcc-patches
On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
 wrote:
>
> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
> > On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
> >  wrote:
> >>
> >> On 20/10/2020 12:22, Richard Earnshaw wrote:
> >>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
>  On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>   wrote:
> >
> > On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
> >> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
> >>  wrote:
> >>>
> >>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>  On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>   wrote:
> >
> > On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> >> When mi_delta is > 255 and -mpure-code is used, we cannot load 
> >> delta
> >> from code memory (like we do without -mpure-code).
> >>
> >> This patch builds the value of mi_delta into r3 with a series of
> >> movs/adds/lsls.
> >>
> >> We also do some cleanup by not emitting the function address and 
> >> delta
> >> via .word directives at the end of the thunk since we don't use 
> >> them
> >> with -mpure-code.
> >>
> >> No need for new testcases, this bug was already identified by
> >> eg. pr46287-3.C
> >>
> >> 2020-09-29  Christophe Lyon  
> >>
> >>   gcc/
> >>   * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in 
> >> r3 and
> >>   do not emit function address and delta when -mpure-code is 
> >> used.
> >
>  Hi Richard,
> 
>  Thanks for your comments.
> 
> > There are some optimizations you can make to this code.
> >
> > Firstly, for values between 256 and 510 (inclusive), it would be 
> > better
> > to just expand a mov of 255 followed by an add.
>  I now see the splitted for the "Pe" constraint which I hadn't noticed
>  before, so I can write something similar indeed.
> 
>  However, I'm note quite sure to understand the benefit in the split
>  when -mpure-code is NOT used.
>  Consider:
>  int f3_1 (void) { return 510; }
>  int f3_2 (void) { return 511; }
>  Compile with -O2 -mcpu=cortex-m0:
>  f3_1:
>  movsr0, #255
>  lslsr0, r0, #1
>  bx  lr
>  f3_2:
>  ldr r0, .L4
>  bx  lr
> 
>  The splitter makes the code bigger, does it "compensate" for this by
>  not having to load the constant?
>  Actually the constant uses 4 more bytes, which should be taken into
>  account when comparing code size,
> >>>
> >>> Yes, the size of the literal pool entry needs to be taken into 
> >>> account.
> >>>  It might happen that the entry could be shared with another use of 
> >>> that
> >>> literal, but in general that's rare.
> >>>
>  so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>  thumb1 instructions would be equivalent in size compared to loading
>  from the literal pool. Should the 256-510 range be extended?
> >>>
> >>> It's a bit borderline at three instructions when literal pools are not
> >>> expensive to use, but in thumb1 literal pools tend to be quite small 
> >>> due
> >>> to the limited pc offsets we can use.  I think on balance we probably
> >>> want to use the instruction sequence unless optimizing for size.
> >>>
> 
> 
> > This is also true for
> > the literal pools alternative as well, so should be handled before 
> > all
> > this.
>  I am not sure what you mean: with -mpure-code, the above sample is 
>  compiled as:
>  f3_1:
>  movsr0, #255
>  lslsr0, r0, #1
>  bx  lr
>  f3_2:
>  movsr0, #1
>  lslsr0, r0, #8
>  addsr0, r0, #255
>  bx  lr
> 
>  so the "return 510" case is already handled as without -mpure-code.
> >>>
> >>> I was thinking specifically of the thunk sequence where you seem to be
> >>> emitting instructions directly rather than generating RTL.  The 
> >>> examples
> >>> you show here are not thunks.
> >>>
> >> OK thanks for the clarification.
> >>
> >> Here is an updated version, split into 3 patches to hopefully make
> >> review easier.
> >> They apply on top of my other mpure-code patches for PR96967 and 
> >> PR96770:
> >> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
> >> 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-21 Thread Richard Earnshaw via Gcc-patches
On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
>  wrote:
>>
>> On 20/10/2020 12:22, Richard Earnshaw wrote:
>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
 On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
  wrote:
>
> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>>  wrote:
>>>
>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
 On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
  wrote:
>
> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
>> from code memory (like we do without -mpure-code).
>>
>> This patch builds the value of mi_delta into r3 with a series of
>> movs/adds/lsls.
>>
>> We also do some cleanup by not emitting the function address and 
>> delta
>> via .word directives at the end of the thunk since we don't use them
>> with -mpure-code.
>>
>> No need for new testcases, this bug was already identified by
>> eg. pr46287-3.C
>>
>> 2020-09-29  Christophe Lyon  
>>
>>   gcc/
>>   * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 
>> and
>>   do not emit function address and delta when -mpure-code is 
>> used.
>
 Hi Richard,

 Thanks for your comments.

> There are some optimizations you can make to this code.
>
> Firstly, for values between 256 and 510 (inclusive), it would be 
> better
> to just expand a mov of 255 followed by an add.
 I now see the splitted for the "Pe" constraint which I hadn't noticed
 before, so I can write something similar indeed.

 However, I'm note quite sure to understand the benefit in the split
 when -mpure-code is NOT used.
 Consider:
 int f3_1 (void) { return 510; }
 int f3_2 (void) { return 511; }
 Compile with -O2 -mcpu=cortex-m0:
 f3_1:
 movsr0, #255
 lslsr0, r0, #1
 bx  lr
 f3_2:
 ldr r0, .L4
 bx  lr

 The splitter makes the code bigger, does it "compensate" for this by
 not having to load the constant?
 Actually the constant uses 4 more bytes, which should be taken into
 account when comparing code size,
>>>
>>> Yes, the size of the literal pool entry needs to be taken into account.
>>>  It might happen that the entry could be shared with another use of that
>>> literal, but in general that's rare.
>>>
 so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
 thumb1 instructions would be equivalent in size compared to loading
 from the literal pool. Should the 256-510 range be extended?
>>>
>>> It's a bit borderline at three instructions when literal pools are not
>>> expensive to use, but in thumb1 literal pools tend to be quite small due
>>> to the limited pc offsets we can use.  I think on balance we probably
>>> want to use the instruction sequence unless optimizing for size.
>>>


> This is also true for
> the literal pools alternative as well, so should be handled before all
> this.
 I am not sure what you mean: with -mpure-code, the above sample is 
 compiled as:
 f3_1:
 movsr0, #255
 lslsr0, r0, #1
 bx  lr
 f3_2:
 movsr0, #1
 lslsr0, r0, #8
 addsr0, r0, #255
 bx  lr

 so the "return 510" case is already handled as without -mpure-code.
>>>
>>> I was thinking specifically of the thunk sequence where you seem to be
>>> emitting instructions directly rather than generating RTL.  The examples
>>> you show here are not thunks.
>>>
>> OK thanks for the clarification.
>>
>> Here is an updated version, split into 3 patches to hopefully make
>> review easier.
>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
>>
>> I kept it this way to make incremental changes easier to understand.
>>
>> Patch 1: With the hope to avoid confusion and make maintenance easier,
>> I have updated thumb1_gen_const_int() so that it can generate either RTL 
>> or
>> asm. This way, all the code used to build thumb-1 constants is in the
>> same place,
>>  in 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-21 Thread Christophe Lyon via Gcc-patches
On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
 wrote:
>
> On 20/10/2020 12:22, Richard Earnshaw wrote:
> > On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
> >> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
> >>  wrote:
> >>>
> >>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>  On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>   wrote:
> >
> > On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
> >> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
> >>  wrote:
> >>>
> >>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>  When mi_delta is > 255 and -mpure-code is used, we cannot load delta
>  from code memory (like we do without -mpure-code).
> 
>  This patch builds the value of mi_delta into r3 with a series of
>  movs/adds/lsls.
> 
>  We also do some cleanup by not emitting the function address and 
>  delta
>  via .word directives at the end of the thunk since we don't use them
>  with -mpure-code.
> 
>  No need for new testcases, this bug was already identified by
>  eg. pr46287-3.C
> 
>  2020-09-29  Christophe Lyon  
> 
>    gcc/
>    * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 
>  and
>    do not emit function address and delta when -mpure-code is 
>  used.
> >>>
> >> Hi Richard,
> >>
> >> Thanks for your comments.
> >>
> >>> There are some optimizations you can make to this code.
> >>>
> >>> Firstly, for values between 256 and 510 (inclusive), it would be 
> >>> better
> >>> to just expand a mov of 255 followed by an add.
> >> I now see the splitted for the "Pe" constraint which I hadn't noticed
> >> before, so I can write something similar indeed.
> >>
> >> However, I'm note quite sure to understand the benefit in the split
> >> when -mpure-code is NOT used.
> >> Consider:
> >> int f3_1 (void) { return 510; }
> >> int f3_2 (void) { return 511; }
> >> Compile with -O2 -mcpu=cortex-m0:
> >> f3_1:
> >> movsr0, #255
> >> lslsr0, r0, #1
> >> bx  lr
> >> f3_2:
> >> ldr r0, .L4
> >> bx  lr
> >>
> >> The splitter makes the code bigger, does it "compensate" for this by
> >> not having to load the constant?
> >> Actually the constant uses 4 more bytes, which should be taken into
> >> account when comparing code size,
> >
> > Yes, the size of the literal pool entry needs to be taken into account.
> >  It might happen that the entry could be shared with another use of that
> > literal, but in general that's rare.
> >
> >> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
> >> thumb1 instructions would be equivalent in size compared to loading
> >> from the literal pool. Should the 256-510 range be extended?
> >
> > It's a bit borderline at three instructions when literal pools are not
> > expensive to use, but in thumb1 literal pools tend to be quite small due
> > to the limited pc offsets we can use.  I think on balance we probably
> > want to use the instruction sequence unless optimizing for size.
> >
> >>
> >>
> >>> This is also true for
> >>> the literal pools alternative as well, so should be handled before all
> >>> this.
> >> I am not sure what you mean: with -mpure-code, the above sample is 
> >> compiled as:
> >> f3_1:
> >> movsr0, #255
> >> lslsr0, r0, #1
> >> bx  lr
> >> f3_2:
> >> movsr0, #1
> >> lslsr0, r0, #8
> >> addsr0, r0, #255
> >> bx  lr
> >>
> >> so the "return 510" case is already handled as without -mpure-code.
> >
> > I was thinking specifically of the thunk sequence where you seem to be
> > emitting instructions directly rather than generating RTL.  The examples
> > you show here are not thunks.
> >
>  OK thanks for the clarification.
> 
>  Here is an updated version, split into 3 patches to hopefully make
>  review easier.
>  They apply on top of my other mpure-code patches for PR96967 and PR96770:
>  https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
>  https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
> 
>  I kept it this way to make incremental changes easier to understand.
> 
>  Patch 1: With the hope to avoid confusion and make maintenance easier,
>  I have updated thumb1_gen_const_int() so that it can generate either RTL 
>  or
>  asm. This way, all the code used to build thumb-1 constants is in the
>  same place,
>   in case we need to improve/fix it later. We now generate shorter 
> 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-20 Thread Richard Earnshaw via Gcc-patches
On 20/10/2020 12:22, Richard Earnshaw wrote:
> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>>  wrote:
>>>
>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
 On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
  wrote:
>
> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>>  wrote:
>>>
>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
 When mi_delta is > 255 and -mpure-code is used, we cannot load delta
 from code memory (like we do without -mpure-code).

 This patch builds the value of mi_delta into r3 with a series of
 movs/adds/lsls.

 We also do some cleanup by not emitting the function address and delta
 via .word directives at the end of the thunk since we don't use them
 with -mpure-code.

 No need for new testcases, this bug was already identified by
 eg. pr46287-3.C

 2020-09-29  Christophe Lyon  

   gcc/
   * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 
 and
   do not emit function address and delta when -mpure-code is used.
>>>
>> Hi Richard,
>>
>> Thanks for your comments.
>>
>>> There are some optimizations you can make to this code.
>>>
>>> Firstly, for values between 256 and 510 (inclusive), it would be better
>>> to just expand a mov of 255 followed by an add.
>> I now see the splitted for the "Pe" constraint which I hadn't noticed
>> before, so I can write something similar indeed.
>>
>> However, I'm note quite sure to understand the benefit in the split
>> when -mpure-code is NOT used.
>> Consider:
>> int f3_1 (void) { return 510; }
>> int f3_2 (void) { return 511; }
>> Compile with -O2 -mcpu=cortex-m0:
>> f3_1:
>> movsr0, #255
>> lslsr0, r0, #1
>> bx  lr
>> f3_2:
>> ldr r0, .L4
>> bx  lr
>>
>> The splitter makes the code bigger, does it "compensate" for this by
>> not having to load the constant?
>> Actually the constant uses 4 more bytes, which should be taken into
>> account when comparing code size,
>
> Yes, the size of the literal pool entry needs to be taken into account.
>  It might happen that the entry could be shared with another use of that
> literal, but in general that's rare.
>
>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>> thumb1 instructions would be equivalent in size compared to loading
>> from the literal pool. Should the 256-510 range be extended?
>
> It's a bit borderline at three instructions when literal pools are not
> expensive to use, but in thumb1 literal pools tend to be quite small due
> to the limited pc offsets we can use.  I think on balance we probably
> want to use the instruction sequence unless optimizing for size.
>
>>
>>
>>> This is also true for
>>> the literal pools alternative as well, so should be handled before all
>>> this.
>> I am not sure what you mean: with -mpure-code, the above sample is 
>> compiled as:
>> f3_1:
>> movsr0, #255
>> lslsr0, r0, #1
>> bx  lr
>> f3_2:
>> movsr0, #1
>> lslsr0, r0, #8
>> addsr0, r0, #255
>> bx  lr
>>
>> so the "return 510" case is already handled as without -mpure-code.
>
> I was thinking specifically of the thunk sequence where you seem to be
> emitting instructions directly rather than generating RTL.  The examples
> you show here are not thunks.
>
 OK thanks for the clarification.

 Here is an updated version, split into 3 patches to hopefully make
 review easier.
 They apply on top of my other mpure-code patches for PR96967 and PR96770:
 https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
 https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html

 I kept it this way to make incremental changes easier to understand.

 Patch 1: With the hope to avoid confusion and make maintenance easier,
 I have updated thumb1_gen_const_int() so that it can generate either RTL or
 asm. This way, all the code used to build thumb-1 constants is in the
 same place,
  in case we need to improve/fix it later. We now generate shorter 
 sequences in
 several cases matching your comments.

 Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
 calls thumb1_gen_const_int.

 Patch 3: Update of the original patch in this thread, now calls
 thumb1_gen_const_int.
>>>
>>> Yuk!  Those changes to thumb1_gen_const_int are 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-20 Thread Richard Earnshaw via Gcc-patches
On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>  wrote:
>>
>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>>>  wrote:

 On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>  wrote:
>>
>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
>>> from code memory (like we do without -mpure-code).
>>>
>>> This patch builds the value of mi_delta into r3 with a series of
>>> movs/adds/lsls.
>>>
>>> We also do some cleanup by not emitting the function address and delta
>>> via .word directives at the end of the thunk since we don't use them
>>> with -mpure-code.
>>>
>>> No need for new testcases, this bug was already identified by
>>> eg. pr46287-3.C
>>>
>>> 2020-09-29  Christophe Lyon  
>>>
>>>   gcc/
>>>   * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
>>>   do not emit function address and delta when -mpure-code is used.
>>
> Hi Richard,
>
> Thanks for your comments.
>
>> There are some optimizations you can make to this code.
>>
>> Firstly, for values between 256 and 510 (inclusive), it would be better
>> to just expand a mov of 255 followed by an add.
> I now see the splitted for the "Pe" constraint which I hadn't noticed
> before, so I can write something similar indeed.
>
> However, I'm note quite sure to understand the benefit in the split
> when -mpure-code is NOT used.
> Consider:
> int f3_1 (void) { return 510; }
> int f3_2 (void) { return 511; }
> Compile with -O2 -mcpu=cortex-m0:
> f3_1:
> movsr0, #255
> lslsr0, r0, #1
> bx  lr
> f3_2:
> ldr r0, .L4
> bx  lr
>
> The splitter makes the code bigger, does it "compensate" for this by
> not having to load the constant?
> Actually the constant uses 4 more bytes, which should be taken into
> account when comparing code size,

 Yes, the size of the literal pool entry needs to be taken into account.
  It might happen that the entry could be shared with another use of that
 literal, but in general that's rare.

> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
> thumb1 instructions would be equivalent in size compared to loading
> from the literal pool. Should the 256-510 range be extended?

 It's a bit borderline at three instructions when literal pools are not
 expensive to use, but in thumb1 literal pools tend to be quite small due
 to the limited pc offsets we can use.  I think on balance we probably
 want to use the instruction sequence unless optimizing for size.

>
>
>> This is also true for
>> the literal pools alternative as well, so should be handled before all
>> this.
> I am not sure what you mean: with -mpure-code, the above sample is 
> compiled as:
> f3_1:
> movsr0, #255
> lslsr0, r0, #1
> bx  lr
> f3_2:
> movsr0, #1
> lslsr0, r0, #8
> addsr0, r0, #255
> bx  lr
>
> so the "return 510" case is already handled as without -mpure-code.

 I was thinking specifically of the thunk sequence where you seem to be
 emitting instructions directly rather than generating RTL.  The examples
 you show here are not thunks.

>>> OK thanks for the clarification.
>>>
>>> Here is an updated version, split into 3 patches to hopefully make
>>> review easier.
>>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
>>>
>>> I kept it this way to make incremental changes easier to understand.
>>>
>>> Patch 1: With the hope to avoid confusion and make maintenance easier,
>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or
>>> asm. This way, all the code used to build thumb-1 constants is in the
>>> same place,
>>>  in case we need to improve/fix it later. We now generate shorter sequences 
>>> in
>>> several cases matching your comments.
>>>
>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
>>> calls thumb1_gen_const_int.
>>>
>>> Patch 3: Update of the original patch in this thread, now calls
>>> thumb1_gen_const_int.
>>
>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
>>
>> I think we should be able to leverage the fact that the compiler can use
>> C++ now to do much better than that, for example by making that function
>> a template. 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-19 Thread Christophe Lyon via Gcc-patches
On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
 wrote:
>
> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
> > On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
> >  wrote:
> >>
> >> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
> >>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
> >>>  wrote:
> 
>  On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> > When mi_delta is > 255 and -mpure-code is used, we cannot load delta
> > from code memory (like we do without -mpure-code).
> >
> > This patch builds the value of mi_delta into r3 with a series of
> > movs/adds/lsls.
> >
> > We also do some cleanup by not emitting the function address and delta
> > via .word directives at the end of the thunk since we don't use them
> > with -mpure-code.
> >
> > No need for new testcases, this bug was already identified by
> > eg. pr46287-3.C
> >
> > 2020-09-29  Christophe Lyon  
> >
> >   gcc/
> >   * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
> >   do not emit function address and delta when -mpure-code is used.
> 
> >>> Hi Richard,
> >>>
> >>> Thanks for your comments.
> >>>
>  There are some optimizations you can make to this code.
> 
>  Firstly, for values between 256 and 510 (inclusive), it would be better
>  to just expand a mov of 255 followed by an add.
> >>> I now see the splitted for the "Pe" constraint which I hadn't noticed
> >>> before, so I can write something similar indeed.
> >>>
> >>> However, I'm note quite sure to understand the benefit in the split
> >>> when -mpure-code is NOT used.
> >>> Consider:
> >>> int f3_1 (void) { return 510; }
> >>> int f3_2 (void) { return 511; }
> >>> Compile with -O2 -mcpu=cortex-m0:
> >>> f3_1:
> >>> movsr0, #255
> >>> lslsr0, r0, #1
> >>> bx  lr
> >>> f3_2:
> >>> ldr r0, .L4
> >>> bx  lr
> >>>
> >>> The splitter makes the code bigger, does it "compensate" for this by
> >>> not having to load the constant?
> >>> Actually the constant uses 4 more bytes, which should be taken into
> >>> account when comparing code size,
> >>
> >> Yes, the size of the literal pool entry needs to be taken into account.
> >>  It might happen that the entry could be shared with another use of that
> >> literal, but in general that's rare.
> >>
> >>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
> >>> thumb1 instructions would be equivalent in size compared to loading
> >>> from the literal pool. Should the 256-510 range be extended?
> >>
> >> It's a bit borderline at three instructions when literal pools are not
> >> expensive to use, but in thumb1 literal pools tend to be quite small due
> >> to the limited pc offsets we can use.  I think on balance we probably
> >> want to use the instruction sequence unless optimizing for size.
> >>
> >>>
> >>>
>  This is also true for
>  the literal pools alternative as well, so should be handled before all
>  this.
> >>> I am not sure what you mean: with -mpure-code, the above sample is 
> >>> compiled as:
> >>> f3_1:
> >>> movsr0, #255
> >>> lslsr0, r0, #1
> >>> bx  lr
> >>> f3_2:
> >>> movsr0, #1
> >>> lslsr0, r0, #8
> >>> addsr0, r0, #255
> >>> bx  lr
> >>>
> >>> so the "return 510" case is already handled as without -mpure-code.
> >>
> >> I was thinking specifically of the thunk sequence where you seem to be
> >> emitting instructions directly rather than generating RTL.  The examples
> >> you show here are not thunks.
> >>
> > OK thanks for the clarification.
> >
> > Here is an updated version, split into 3 patches to hopefully make
> > review easier.
> > They apply on top of my other mpure-code patches for PR96967 and PR96770:
> > https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
> > https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
> >
> > I kept it this way to make incremental changes easier to understand.
> >
> > Patch 1: With the hope to avoid confusion and make maintenance easier,
> > I have updated thumb1_gen_const_int() so that it can generate either RTL or
> > asm. This way, all the code used to build thumb-1 constants is in the
> > same place,
> >  in case we need to improve/fix it later. We now generate shorter sequences 
> > in
> > several cases matching your comments.
> >
> > Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
> > calls thumb1_gen_const_int.
> >
> > Patch 3: Update of the original patch in this thread, now calls
> > thumb1_gen_const_int.
>
> Yuk!  Those changes to thumb1_gen_const_int are horrible.
>
> I think we should be able to leverage the fact that the compiler can use
> C++ now to do much better than that, for example by making that function
> a template.  For example (and this is just a sketch):
>

Indeed! I didn't think 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-19 Thread Richard Earnshaw via Gcc-patches
On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>  wrote:
>>
>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>>>  wrote:

 On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
> from code memory (like we do without -mpure-code).
>
> This patch builds the value of mi_delta into r3 with a series of
> movs/adds/lsls.
>
> We also do some cleanup by not emitting the function address and delta
> via .word directives at the end of the thunk since we don't use them
> with -mpure-code.
>
> No need for new testcases, this bug was already identified by
> eg. pr46287-3.C
>
> 2020-09-29  Christophe Lyon  
>
>   gcc/
>   * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
>   do not emit function address and delta when -mpure-code is used.

>>> Hi Richard,
>>>
>>> Thanks for your comments.
>>>
 There are some optimizations you can make to this code.

 Firstly, for values between 256 and 510 (inclusive), it would be better
 to just expand a mov of 255 followed by an add.
>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
>>> before, so I can write something similar indeed.
>>>
>>> However, I'm note quite sure to understand the benefit in the split
>>> when -mpure-code is NOT used.
>>> Consider:
>>> int f3_1 (void) { return 510; }
>>> int f3_2 (void) { return 511; }
>>> Compile with -O2 -mcpu=cortex-m0:
>>> f3_1:
>>> movsr0, #255
>>> lslsr0, r0, #1
>>> bx  lr
>>> f3_2:
>>> ldr r0, .L4
>>> bx  lr
>>>
>>> The splitter makes the code bigger, does it "compensate" for this by
>>> not having to load the constant?
>>> Actually the constant uses 4 more bytes, which should be taken into
>>> account when comparing code size,
>>
>> Yes, the size of the literal pool entry needs to be taken into account.
>>  It might happen that the entry could be shared with another use of that
>> literal, but in general that's rare.
>>
>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>>> thumb1 instructions would be equivalent in size compared to loading
>>> from the literal pool. Should the 256-510 range be extended?
>>
>> It's a bit borderline at three instructions when literal pools are not
>> expensive to use, but in thumb1 literal pools tend to be quite small due
>> to the limited pc offsets we can use.  I think on balance we probably
>> want to use the instruction sequence unless optimizing for size.
>>
>>>
>>>
 This is also true for
 the literal pools alternative as well, so should be handled before all
 this.
>>> I am not sure what you mean: with -mpure-code, the above sample is compiled 
>>> as:
>>> f3_1:
>>> movsr0, #255
>>> lslsr0, r0, #1
>>> bx  lr
>>> f3_2:
>>> movsr0, #1
>>> lslsr0, r0, #8
>>> addsr0, r0, #255
>>> bx  lr
>>>
>>> so the "return 510" case is already handled as without -mpure-code.
>>
>> I was thinking specifically of the thunk sequence where you seem to be
>> emitting instructions directly rather than generating RTL.  The examples
>> you show here are not thunks.
>>
> OK thanks for the clarification.
> 
> Here is an updated version, split into 3 patches to hopefully make
> review easier.
> They apply on top of my other mpure-code patches for PR96967 and PR96770:
> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
> 
> I kept it this way to make incremental changes easier to understand.
> 
> Patch 1: With the hope to avoid confusion and make maintenance easier,
> I have updated thumb1_gen_const_int() so that it can generate either RTL or
> asm. This way, all the code used to build thumb-1 constants is in the
> same place,
>  in case we need to improve/fix it later. We now generate shorter sequences in
> several cases matching your comments.
> 
> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
> calls thumb1_gen_const_int.
> 
> Patch 3: Update of the original patch in this thread, now calls
> thumb1_gen_const_int.

Yuk!  Those changes to thumb1_gen_const_int are horrible.

I think we should be able to leverage the fact that the compiler can use
C++ now to do much better than that, for example by making that function
a template.  For example (and this is just a sketch):

class t1_rtl
{
 public:
  void ashift(int a) { gen_rtx_ASHIFT(a); }
  void rshift(int b) { gen_rtx_SHIFTRT(b); }
};

class t1_print
{
 public:
  t1_print (FILE *f) : t_file(f) {}
  void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
  void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
 private:
  FILE 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-12 Thread Christophe Lyon via Gcc-patches
On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
 wrote:
>
> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
> > On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
> >  wrote:
> >>
> >> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> >>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
> >>> from code memory (like we do without -mpure-code).
> >>>
> >>> This patch builds the value of mi_delta into r3 with a series of
> >>> movs/adds/lsls.
> >>>
> >>> We also do some cleanup by not emitting the function address and delta
> >>> via .word directives at the end of the thunk since we don't use them
> >>> with -mpure-code.
> >>>
> >>> No need for new testcases, this bug was already identified by
> >>> eg. pr46287-3.C
> >>>
> >>> 2020-09-29  Christophe Lyon  
> >>>
> >>>   gcc/
> >>>   * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
> >>>   do not emit function address and delta when -mpure-code is used.
> >>
> > Hi Richard,
> >
> > Thanks for your comments.
> >
> >> There are some optimizations you can make to this code.
> >>
> >> Firstly, for values between 256 and 510 (inclusive), it would be better
> >> to just expand a mov of 255 followed by an add.
> > I now see the splitted for the "Pe" constraint which I hadn't noticed
> > before, so I can write something similar indeed.
> >
> > However, I'm note quite sure to understand the benefit in the split
> > when -mpure-code is NOT used.
> > Consider:
> > int f3_1 (void) { return 510; }
> > int f3_2 (void) { return 511; }
> > Compile with -O2 -mcpu=cortex-m0:
> > f3_1:
> > movsr0, #255
> > lslsr0, r0, #1
> > bx  lr
> > f3_2:
> > ldr r0, .L4
> > bx  lr
> >
> > The splitter makes the code bigger, does it "compensate" for this by
> > not having to load the constant?
> > Actually the constant uses 4 more bytes, which should be taken into
> > account when comparing code size,
>
> Yes, the size of the literal pool entry needs to be taken into account.
>  It might happen that the entry could be shared with another use of that
> literal, but in general that's rare.
>
> > so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
> > thumb1 instructions would be equivalent in size compared to loading
> > from the literal pool. Should the 256-510 range be extended?
>
> It's a bit borderline at three instructions when literal pools are not
> expensive to use, but in thumb1 literal pools tend to be quite small due
> to the limited pc offsets we can use.  I think on balance we probably
> want to use the instruction sequence unless optimizing for size.
>
> >
> >
> >> This is also true for
> >> the literal pools alternative as well, so should be handled before all
> >> this.
> > I am not sure what you mean: with -mpure-code, the above sample is compiled 
> > as:
> > f3_1:
> > movsr0, #255
> > lslsr0, r0, #1
> > bx  lr
> > f3_2:
> > movsr0, #1
> > lslsr0, r0, #8
> > addsr0, r0, #255
> > bx  lr
> >
> > so the "return 510" case is already handled as without -mpure-code.
>
> I was thinking specifically of the thunk sequence where you seem to be
> emitting instructions directly rather than generating RTL.  The examples
> you show here are not thunks.
>
OK thanks for the clarification.

Here is an updated version, split into 3 patches to hopefully make
review easier.
They apply on top of my other mpure-code patches for PR96967 and PR96770:
https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html

I kept it this way to make incremental changes easier to understand.

Patch 1: With the hope to avoid confusion and make maintenance easier,
I have updated thumb1_gen_const_int() so that it can generate either RTL or
asm. This way, all the code used to build thumb-1 constants is in the
same place,
 in case we need to improve/fix it later. We now generate shorter sequences in
several cases matching your comments.

Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
calls thumb1_gen_const_int.

Patch 3: Update of the original patch in this thread, now calls
thumb1_gen_const_int.

>
> >
> >>  I also suspect (but haven't check) that the base adjustment will
> >> most commonly be a multiple of the machine word size (ie 4).  If that is
> >> the case then you could generate n/4 and then shift it left by 2 for an
> >> even greater range of literals.
> > I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
> > I'll update my patch.
> >
> >>  More generally, any sequence of up to
> >> three thumb1 instructions will be no larger, and probably as fast as the
> >> existing literal pool fall back.
> >>
> >> Secondly, if the value is, for example, 65536 (0x1), your code will
> >> emit a mov followed by two shift-by-8 instructions; the two shifts could
> >> be merged into 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-08 Thread Richard Earnshaw via Gcc-patches
On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>  wrote:
>>
>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
>>> from code memory (like we do without -mpure-code).
>>>
>>> This patch builds the value of mi_delta into r3 with a series of
>>> movs/adds/lsls.
>>>
>>> We also do some cleanup by not emitting the function address and delta
>>> via .word directives at the end of the thunk since we don't use them
>>> with -mpure-code.
>>>
>>> No need for new testcases, this bug was already identified by
>>> eg. pr46287-3.C
>>>
>>> 2020-09-29  Christophe Lyon  
>>>
>>>   gcc/
>>>   * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
>>>   do not emit function address and delta when -mpure-code is used.
>>
> Hi Richard,
> 
> Thanks for your comments.
> 
>> There are some optimizations you can make to this code.
>>
>> Firstly, for values between 256 and 510 (inclusive), it would be better
>> to just expand a mov of 255 followed by an add.
> I now see the splitted for the "Pe" constraint which I hadn't noticed
> before, so I can write something similar indeed.
> 
> However, I'm note quite sure to understand the benefit in the split
> when -mpure-code is NOT used.
> Consider:
> int f3_1 (void) { return 510; }
> int f3_2 (void) { return 511; }
> Compile with -O2 -mcpu=cortex-m0:
> f3_1:
> movsr0, #255
> lslsr0, r0, #1
> bx  lr
> f3_2:
> ldr r0, .L4
> bx  lr
> 
> The splitter makes the code bigger, does it "compensate" for this by
> not having to load the constant?
> Actually the constant uses 4 more bytes, which should be taken into
> account when comparing code size,

Yes, the size of the literal pool entry needs to be taken into account.
 It might happen that the entry could be shared with another use of that
literal, but in general that's rare.

> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
> thumb1 instructions would be equivalent in size compared to loading
> from the literal pool. Should the 256-510 range be extended?

It's a bit borderline at three instructions when literal pools are not
expensive to use, but in thumb1 literal pools tend to be quite small due
to the limited pc offsets we can use.  I think on balance we probably
want to use the instruction sequence unless optimizing for size.

> 
> 
>> This is also true for
>> the literal pools alternative as well, so should be handled before all
>> this.
> I am not sure what you mean: with -mpure-code, the above sample is compiled 
> as:
> f3_1:
> movsr0, #255
> lslsr0, r0, #1
> bx  lr
> f3_2:
> movsr0, #1
> lslsr0, r0, #8
> addsr0, r0, #255
> bx  lr
> 
> so the "return 510" case is already handled as without -mpure-code.

I was thinking specifically of the thunk sequence where you seem to be
emitting instructions directly rather than generating RTL.  The examples
you show here are not thunks.


> 
>>  I also suspect (but haven't check) that the base adjustment will
>> most commonly be a multiple of the machine word size (ie 4).  If that is
>> the case then you could generate n/4 and then shift it left by 2 for an
>> even greater range of literals.
> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
> I'll update my patch.
> 
>>  More generally, any sequence of up to
>> three thumb1 instructions will be no larger, and probably as fast as the
>> existing literal pool fall back.
>>
>> Secondly, if the value is, for example, 65536 (0x1), your code will
>> emit a mov followed by two shift-by-8 instructions; the two shifts could
>> be merged into a single shift-by-16.
> 
> Right, I'll try to make use of thumb_shiftable_const.
> 
>> Finally, I'd really like to see some executable tests for this, if at
>> all possible.
> I mentioned pr46287-3.C, but that's not the only existing testcase
> that showed the problem. There are also:
> g++.dg/opt/thunk1.C
> g++.dg/ipa/pr46984.C
> g++.dg/torture/pr46287.C
> g++.dg/torture/pr45699.C
> 
> Do you want that I copy one of these in the arm subdir and add
> -mpure-code in dg-options?

On reflection, probably not - that just makes things more complicated
with all the dg-options mess (I'm worried about interactions with other
sets of options on the command line and the fall-out from that).  If
someone cares about pure-code they should be doing full testsuite runs
with it enabled and that should be sufficient.

R.

> 
> Thanks,
> 
> Christophe
> 
>> R.
>>
>>>
>>> k#   (use "git pull" to merge the remote branch into yours)
>>> ---
>>>  gcc/config/arm/arm.c | 91 
>>> +---
>>>  1 file changed, 66 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
>>> index ceeb91f..62abeb5 100644
>>> --- 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-08 Thread Christophe Lyon via Gcc-patches
On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
 wrote:
>
> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> > When mi_delta is > 255 and -mpure-code is used, we cannot load delta
> > from code memory (like we do without -mpure-code).
> >
> > This patch builds the value of mi_delta into r3 with a series of
> > movs/adds/lsls.
> >
> > We also do some cleanup by not emitting the function address and delta
> > via .word directives at the end of the thunk since we don't use them
> > with -mpure-code.
> >
> > No need for new testcases, this bug was already identified by
> > eg. pr46287-3.C
> >
> > 2020-09-29  Christophe Lyon  
> >
> >   gcc/
> >   * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
> >   do not emit function address and delta when -mpure-code is used.
>
Hi Richard,

Thanks for your comments.

> There are some optimizations you can make to this code.
>
> Firstly, for values between 256 and 510 (inclusive), it would be better
> to just expand a mov of 255 followed by an add.
I now see the splitted for the "Pe" constraint which I hadn't noticed
before, so I can write something similar indeed.

However, I'm note quite sure to understand the benefit in the split
when -mpure-code is NOT used.
Consider:
int f3_1 (void) { return 510; }
int f3_2 (void) { return 511; }
Compile with -O2 -mcpu=cortex-m0:
f3_1:
movsr0, #255
lslsr0, r0, #1
bx  lr
f3_2:
ldr r0, .L4
bx  lr

The splitter makes the code bigger, does it "compensate" for this by
not having to load the constant?
Actually the constant uses 4 more bytes, which should be taken into
account when comparing code size,
so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
thumb1 instructions would be equivalent in size compared to loading
from the literal pool. Should the 256-510 range be extended?


> This is also true for
> the literal pools alternative as well, so should be handled before all
> this.
I am not sure what you mean: with -mpure-code, the above sample is compiled as:
f3_1:
movsr0, #255
lslsr0, r0, #1
bx  lr
f3_2:
movsr0, #1
lslsr0, r0, #8
addsr0, r0, #255
bx  lr

so the "return 510" case is already handled as without -mpure-code.

>  I also suspect (but haven't check) that the base adjustment will
> most commonly be a multiple of the machine word size (ie 4).  If that is
> the case then you could generate n/4 and then shift it left by 2 for an
> even greater range of literals.
I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
I'll update my patch.

>  More generally, any sequence of up to
> three thumb1 instructions will be no larger, and probably as fast as the
> existing literal pool fall back.
>
> Secondly, if the value is, for example, 65536 (0x1), your code will
> emit a mov followed by two shift-by-8 instructions; the two shifts could
> be merged into a single shift-by-16.

Right, I'll try to make use of thumb_shiftable_const.

> Finally, I'd really like to see some executable tests for this, if at
> all possible.
I mentioned pr46287-3.C, but that's not the only existing testcase
that showed the problem. There are also:
g++.dg/opt/thunk1.C
g++.dg/ipa/pr46984.C
g++.dg/torture/pr46287.C
g++.dg/torture/pr45699.C

Do you want that I copy one of these in the arm subdir and add
-mpure-code in dg-options?

Thanks,

Christophe

> R.
>
> >
> > k#   (use "git pull" to merge the remote branch into yours)
> > ---
> >  gcc/config/arm/arm.c | 91 
> > +---
> >  1 file changed, 66 insertions(+), 25 deletions(-)
> >
> > diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> > index ceeb91f..62abeb5 100644
> > --- a/gcc/config/arm/arm.c
> > +++ b/gcc/config/arm/arm.c
> > @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, 
> > HOST_WIDE_INT delta,
> >  {
> >if (mi_delta > 255)
> >   {
> > -   fputs ("\tldr\tr3, ", file);
> > -   assemble_name (file, label);
> > -   fputs ("+4\n", file);
> > +   /* With -mpure-code, we cannot load delta from the constant
> > +  pool: we build it explicitly.  */
> > +   if (target_pure_code)
> > + {
> > +   bool mov_done_p = false;
> > +   int i;
> > +
> > +   /* Emit upper 3 bytes if needed.  */
> > +   for (i = 0; i < 3; i++)
> > + {
> > +   int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
> > +
> > +   if (byte)
> > + {
> > +   if (mov_done_p)
> > + asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
> > +   else
> > + asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
> > +   mov_done_p = true;
> > + }
> > +
> > +   if (mov_done_p)
> > + asm_fprintf (file, "\tlsls\tr3, #8\n");
> > +  

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-06 Thread Richard Earnshaw via Gcc-patches
On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
> from code memory (like we do without -mpure-code).
> 
> This patch builds the value of mi_delta into r3 with a series of
> movs/adds/lsls.
> 
> We also do some cleanup by not emitting the function address and delta
> via .word directives at the end of the thunk since we don't use them
> with -mpure-code.
> 
> No need for new testcases, this bug was already identified by
> eg. pr46287-3.C
> 
> 2020-09-29  Christophe Lyon  
> 
>   gcc/
>   * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
>   do not emit function address and delta when -mpure-code is used.

There are some optimizations you can make to this code.

Firstly, for values between 256 and 510 (inclusive), it would be better
to just expand a mov of 255 followed by an add.  This is also true for
the literal pools alternative as well, so should be handled before all
this.  I also suspect (but haven't check) that the base adjustment will
most commonly be a multiple of the machine word size (ie 4).  If that is
the case then you could generate n/4 and then shift it left by 2 for an
even greater range of literals.  More generally, any sequence of up to
three thumb1 instructions will be no larger, and probably as fast as the
existing literal pool fall back.

Secondly, if the value is, for example, 65536 (0x1), your code will
emit a mov followed by two shift-by-8 instructions; the two shifts could
be merged into a single shift-by-16.

Finally, I'd really like to see some executable tests for this, if at
all possible.

R.

> 
> k#   (use "git pull" to merge the remote branch into yours)
> ---
>  gcc/config/arm/arm.c | 91 
> +---
>  1 file changed, 66 insertions(+), 25 deletions(-)
> 
> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> index ceeb91f..62abeb5 100644
> --- a/gcc/config/arm/arm.c
> +++ b/gcc/config/arm/arm.c
> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT 
> delta,
>  {
>if (mi_delta > 255)
>   {
> -   fputs ("\tldr\tr3, ", file);
> -   assemble_name (file, label);
> -   fputs ("+4\n", file);
> +   /* With -mpure-code, we cannot load delta from the constant
> +  pool: we build it explicitly.  */
> +   if (target_pure_code)
> + {
> +   bool mov_done_p = false;
> +   int i;
> +
> +   /* Emit upper 3 bytes if needed.  */
> +   for (i = 0; i < 3; i++)
> + {
> +   int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
> +
> +   if (byte)
> + {
> +   if (mov_done_p)
> + asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
> +   else
> + asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
> +   mov_done_p = true;
> + }
> +
> +   if (mov_done_p)
> + asm_fprintf (file, "\tlsls\tr3, #8\n");
> + }
> +
> +   /* Emit lower byte if needed.  */
> +   if (!mov_done_p)
> + asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
> +   else if (mi_delta & 0xff)
> + asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
> + }
> +   else
> + {
> +   fputs ("\tldr\tr3, ", file);
> +   assemble_name (file, label);
> +   fputs ("+4\n", file);
> + }
> asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
>  mi_op, this_regno, this_regno);
>   }
> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, 
> HOST_WIDE_INT delta,
>   fputs ("\tpop\t{r3}\n", file);
>  
>fprintf (file, "\tbx\tr12\n");
> -  ASM_OUTPUT_ALIGN (file, 2);
> -  assemble_name (file, label);
> -  fputs (":\n", file);
> -  if (flag_pic)
> +
> +  /* With -mpure-code, we don't need to emit literals for the
> +  function address and delta since we emitted code to build
> +  them.  */
> +  if (!target_pure_code)
>   {
> -   /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> -   rtx tem = XEXP (DECL_RTL (function), 0);
> -   /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> -  pipeline offset is four rather than eight.  Adjust the offset
> -  accordingly.  */
> -   tem = plus_constant (GET_MODE (tem), tem,
> -TARGET_THUMB1_ONLY ? -3 : -7);
> -   tem = gen_rtx_MINUS (GET_MODE (tem),
> -tem,
> -gen_rtx_SYMBOL_REF (Pmode,
> -ggc_strdup (labelpc)));
> -   assemble_integer (tem, 4, BITS_PER_WORD, 1);
> - }
> -  else
> - /* Output ".word .LTHUNKn".  */
> - assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> 

Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-06 Thread Christophe Lyon via Gcc-patches
ping?

On Tue, 29 Sep 2020 at 21:50, Christophe Lyon
 wrote:
>
> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
> from code memory (like we do without -mpure-code).
>
> This patch builds the value of mi_delta into r3 with a series of
> movs/adds/lsls.
>
> We also do some cleanup by not emitting the function address and delta
> via .word directives at the end of the thunk since we don't use them
> with -mpure-code.
>
> No need for new testcases, this bug was already identified by
> eg. pr46287-3.C
>
> 2020-09-29  Christophe Lyon  
>
> gcc/
> * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
> do not emit function address and delta when -mpure-code is used.
>
> k#   (use "git pull" to merge the remote branch into yours)
> ---
>  gcc/config/arm/arm.c | 91 
> +---
>  1 file changed, 66 insertions(+), 25 deletions(-)
>
> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> index ceeb91f..62abeb5 100644
> --- a/gcc/config/arm/arm.c
> +++ b/gcc/config/arm/arm.c
> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT 
> delta,
>  {
>if (mi_delta > 255)
> {
> - fputs ("\tldr\tr3, ", file);
> - assemble_name (file, label);
> - fputs ("+4\n", file);
> + /* With -mpure-code, we cannot load delta from the constant
> +pool: we build it explicitly.  */
> + if (target_pure_code)
> +   {
> + bool mov_done_p = false;
> + int i;
> +
> + /* Emit upper 3 bytes if needed.  */
> + for (i = 0; i < 3; i++)
> +   {
> + int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
> +
> + if (byte)
> +   {
> + if (mov_done_p)
> +   asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
> + else
> +   asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
> + mov_done_p = true;
> +   }
> +
> + if (mov_done_p)
> +   asm_fprintf (file, "\tlsls\tr3, #8\n");
> +   }
> +
> + /* Emit lower byte if needed.  */
> + if (!mov_done_p)
> +   asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
> + else if (mi_delta & 0xff)
> +   asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
> +   }
> + else
> +   {
> + fputs ("\tldr\tr3, ", file);
> + assemble_name (file, label);
> + fputs ("+4\n", file);
> +   }
>   asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
>mi_op, this_regno, this_regno);
> }
> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, 
> HOST_WIDE_INT delta,
> fputs ("\tpop\t{r3}\n", file);
>
>fprintf (file, "\tbx\tr12\n");
> -  ASM_OUTPUT_ALIGN (file, 2);
> -  assemble_name (file, label);
> -  fputs (":\n", file);
> -  if (flag_pic)
> +
> +  /* With -mpure-code, we don't need to emit literals for the
> +function address and delta since we emitted code to build
> +them.  */
> +  if (!target_pure_code)
> {
> - /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> - rtx tem = XEXP (DECL_RTL (function), 0);
> - /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> -pipeline offset is four rather than eight.  Adjust the offset
> -accordingly.  */
> - tem = plus_constant (GET_MODE (tem), tem,
> -  TARGET_THUMB1_ONLY ? -3 : -7);
> - tem = gen_rtx_MINUS (GET_MODE (tem),
> -  tem,
> -  gen_rtx_SYMBOL_REF (Pmode,
> -  ggc_strdup (labelpc)));
> - assemble_integer (tem, 4, BITS_PER_WORD, 1);
> -   }
> -  else
> -   /* Output ".word .LTHUNKn".  */
> -   assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> + ASM_OUTPUT_ALIGN (file, 2);
> + assemble_name (file, label);
> + fputs (":\n", file);
> + if (flag_pic)
> +   {
> + /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> + rtx tem = XEXP (DECL_RTL (function), 0);
> + /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> +pipeline offset is four rather than eight.  Adjust the offset
> +accordingly.  */
> + tem = plus_constant (GET_MODE (tem), tem,
> +  TARGET_THUMB1_ONLY ? -3 : -7);
> + tem = gen_rtx_MINUS (GET_MODE (tem),
> +  tem,
> +  gen_rtx_SYMBOL_REF (Pmode,
> +  

[PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-09-29 Thread Christophe Lyon via Gcc-patches
When mi_delta is > 255 and -mpure-code is used, we cannot load delta
from code memory (like we do without -mpure-code).

This patch builds the value of mi_delta into r3 with a series of
movs/adds/lsls.

We also do some cleanup by not emitting the function address and delta
via .word directives at the end of the thunk since we don't use them
with -mpure-code.

No need for new testcases, this bug was already identified by
eg. pr46287-3.C

2020-09-29  Christophe Lyon  

gcc/
* config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
do not emit function address and delta when -mpure-code is used.

k#   (use "git pull" to merge the remote branch into yours)
---
 gcc/config/arm/arm.c | 91 +---
 1 file changed, 66 insertions(+), 25 deletions(-)

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index ceeb91f..62abeb5 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT 
delta,
 {
   if (mi_delta > 255)
{
- fputs ("\tldr\tr3, ", file);
- assemble_name (file, label);
- fputs ("+4\n", file);
+ /* With -mpure-code, we cannot load delta from the constant
+pool: we build it explicitly.  */
+ if (target_pure_code)
+   {
+ bool mov_done_p = false;
+ int i;
+
+ /* Emit upper 3 bytes if needed.  */
+ for (i = 0; i < 3; i++)
+   {
+ int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
+
+ if (byte)
+   {
+ if (mov_done_p)
+   asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
+ else
+   asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
+ mov_done_p = true;
+   }
+
+ if (mov_done_p)
+   asm_fprintf (file, "\tlsls\tr3, #8\n");
+   }
+
+ /* Emit lower byte if needed.  */
+ if (!mov_done_p)
+   asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
+ else if (mi_delta & 0xff)
+   asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
+   }
+ else
+   {
+ fputs ("\tldr\tr3, ", file);
+ assemble_name (file, label);
+ fputs ("+4\n", file);
+   }
  asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
   mi_op, this_regno, this_regno);
}
@@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT 
delta,
fputs ("\tpop\t{r3}\n", file);
 
   fprintf (file, "\tbx\tr12\n");
-  ASM_OUTPUT_ALIGN (file, 2);
-  assemble_name (file, label);
-  fputs (":\n", file);
-  if (flag_pic)
+
+  /* With -mpure-code, we don't need to emit literals for the
+function address and delta since we emitted code to build
+them.  */
+  if (!target_pure_code)
{
- /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
- rtx tem = XEXP (DECL_RTL (function), 0);
- /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
-pipeline offset is four rather than eight.  Adjust the offset
-accordingly.  */
- tem = plus_constant (GET_MODE (tem), tem,
-  TARGET_THUMB1_ONLY ? -3 : -7);
- tem = gen_rtx_MINUS (GET_MODE (tem),
-  tem,
-  gen_rtx_SYMBOL_REF (Pmode,
-  ggc_strdup (labelpc)));
- assemble_integer (tem, 4, BITS_PER_WORD, 1);
-   }
-  else
-   /* Output ".word .LTHUNKn".  */
-   assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
+ ASM_OUTPUT_ALIGN (file, 2);
+ assemble_name (file, label);
+ fputs (":\n", file);
+ if (flag_pic)
+   {
+ /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
+ rtx tem = XEXP (DECL_RTL (function), 0);
+ /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
+pipeline offset is four rather than eight.  Adjust the offset
+accordingly.  */
+ tem = plus_constant (GET_MODE (tem), tem,
+  TARGET_THUMB1_ONLY ? -3 : -7);
+ tem = gen_rtx_MINUS (GET_MODE (tem),
+  tem,
+  gen_rtx_SYMBOL_REF (Pmode,
+  ggc_strdup (labelpc)));
+ assemble_integer (tem, 4, BITS_PER_WORD, 1);
+   }
+ else
+   /* Output ".word .LTHUNKn".  */
+   assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 
1);
 
-  if (TARGET_THUMB1_ONLY && mi_delta > 255)
-