Re: [RFC] propagation leap over memory copy for struct

2022-11-09 Thread Richard Biener via Gcc-patches
On Wed, 9 Nov 2022, Jiufu Guo wrote:

> Hi,
> 
> Richard Biener  writes:
> 
> > On Mon, Oct 31, 2022 at 11:14 PM Jeff Law via Gcc-patches
> >  wrote:
> >>
> >>
> >> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
> >> > Hi,
> >> >
> >> > We know that for struct variable assignment, memory copy may be used.
> >> > And for memcpy, we may load and store more bytes as possible at one time.
> >> > While it may be not best here:
> >> > 1. Before/after stuct variable assignment, the vaiable may be operated.
> >> > And it is hard for some optimizations to leap over memcpy.  Then some 
> >> > struct
> >> > operations may be sub-optimimal.  Like the issue in PR65421.
> >> > 2. The size of struct is constant mostly, the memcpy would be expanded.  
> >> > Using
> >> > small size to load/store and executing in parallel may not slower than 
> >> > using
> >> > large size to loat/store. (sure, more registers may be used for smaller 
> >> > bytes.)
> >> >
> >> >
> >> > In PR65421, For source code as below:
> >> > t.c
> >> > #define FN 4
> >> > typedef struct { double a[FN]; } A;
> >> >
> >> > A foo (const A *a) { return *a; }
> >> > A bar (const A a) { return a; }
> >>
> >> So the first question in my mind is can we do better at the gimple
> >> phase?  For the second case in particular can't we just "return a"
> >> rather than copying a into  then returning ?  This feels
> >> a lot like the return value optimization from C++.  I'm not sure if it
> >> applies to the first case or not, it's been a long time since I looked
> >> at NRV optimizations, but it might be worth poking around in there a bit
> >> (tree-nrv.cc).
> >>
> >>
> >> But even so, these kinds of things are still bound to happen, so it's
> >> probably worth thinking about if we can do better in RTL as well.
> >>
> >>
> >> The first thing that comes to my mind is to annotate memcpy calls that
> >> are structure assignments.  The idea here is that we may want to expand
> >> a memcpy differently in those cases.   Changing how we expand an opaque
> >> memcpy call is unlikely to be beneficial in most cases.  But changing
> >> how we expand a structure copy may be beneficial by exposing the
> >> underlying field values.   This would roughly correspond to your method #1.
> >>
> >> Or instead of changing how we expand, teach the optimizers about these
> >> annotated memcpy calls -- they're just a a copy of each field.   That's
> >> how CSE and the propagators could treat them. After some point we'd
> >> lower them in the usual ways, but at least early in the RTL pipeline we
> >> could keep them as annotated memcpy calls.  This roughly corresponds to
> >> your second suggestion.
> >
> > In the end it depends on the access patterns so some analysis like SRA
> > performs would be nice.  The issue with expanding memcpy on GIMPLE
> > is that we currently cannot express 'rep; movb;' or other target specific
> > sequences from the cpymem like optabs on GIMPLE and recovering those
> > from piecewise copies on RTL is going to be difficult.
> Actually, it is a special memcpy: it is generated when expanding the
> struct assignment (expand_assignment/store_expr/emit_block_move).
> We could introduce a function block_move_for_record for struct types, and
> this function could be a target hook that generates target-specific sequences.
> For example:
> r125:DF=[r112:DI+0x20]
> r126:DF=[r112:DI+0x28]
> [r112:DI]=r125:DF
> [r112:DI+0x8]=r126:DF
> 
> After expanding, the following passes (cse/prop/dse/...) could optimize the
> insn sequences, e.g. "[r112:DI+0x20]=f1;r125:DF=[r112:DI+0x20];
> [r112:DI]=r125:DF;r129:DF=[r112:DI]" ==> "r129:DF=f1"
> 
> And if the small load/store insns still occur in late passes,
> e.g. combine, we could recover the insns into a better sequence:
> r125:DF=[r112:DI+0x20];r126:DF=[r112:DI+0x28]
> ==>
> r155:V2DI=[r112:DI+0x20];
> 
> Any comments? Thanks!

As said, the best copy decomposition depends on the follow-up uses
and the argument-passing ABI, which is why I suggested performing an
SRA-like analysis that collects the access patterns and uses them
to drive the heuristic expanding this special memcpy.
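As a source-level sketch of why the access patterns matter (illustrative only — the actual decision happens on GIMPLE/RTL, and the function names here are made up), a field-wise copy exposes each element as a scalar move that the propagators can track, while an opaque memcpy call hides them:

```c
#include <assert.h>
#include <string.h>

#define FN 4
typedef struct { double a[FN]; } A;

/* Opaque block copy: optimizers must first look through the memcpy
   call before they can propagate individual field values.  */
static double via_memcpy (const A *src)
{
  A tmp;
  memcpy (&tmp, src, sizeof (A));
  return tmp.a[0];
}

/* Field-wise copy: every element is an explicit scalar move, so CSE
   and the propagators can forward the stored values directly.  */
static double via_fields (const A *src)
{
  A tmp;
  for (int i = 0; i < FN; i++)
    tmp.a[i] = src->a[i];
  return tmp.a[0];
}
```

Both forms compute the same value; the difference is only in what the optimizers can see.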

Richard.


Re: [RFC] propagation leap over memory copy for struct

2022-11-09 Thread Jiufu Guo via Gcc-patches
Hi,

Richard Biener  writes:

> On Mon, Oct 31, 2022 at 11:14 PM Jeff Law via Gcc-patches
>  wrote:
>>
>>
>> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
>> > Hi,
>> >
>> > We know that for struct variable assignment, memory copy may be used.
>> > And for memcpy, we may load and store more bytes as possible at one time.
>> > While it may be not best here:
>> > 1. Before/after stuct variable assignment, the vaiable may be operated.
>> > And it is hard for some optimizations to leap over memcpy.  Then some 
>> > struct
>> > operations may be sub-optimimal.  Like the issue in PR65421.
>> > 2. The size of struct is constant mostly, the memcpy would be expanded.  
>> > Using
>> > small size to load/store and executing in parallel may not slower than 
>> > using
>> > large size to loat/store. (sure, more registers may be used for smaller 
>> > bytes.)
>> >
>> >
>> > In PR65421, For source code as below:
>> > t.c
>> > #define FN 4
>> > typedef struct { double a[FN]; } A;
>> >
>> > A foo (const A *a) { return *a; }
>> > A bar (const A a) { return a; }
>>
>> So the first question in my mind is can we do better at the gimple
>> phase?  For the second case in particular can't we just "return a"
>> rather than copying a into  then returning ?  This feels
>> a lot like the return value optimization from C++.  I'm not sure if it
>> applies to the first case or not, it's been a long time since I looked
>> at NRV optimizations, but it might be worth poking around in there a bit
>> (tree-nrv.cc).
>>
>>
>> But even so, these kinds of things are still bound to happen, so it's
>> probably worth thinking about if we can do better in RTL as well.
>>
>>
>> The first thing that comes to my mind is to annotate memcpy calls that
>> are structure assignments.  The idea here is that we may want to expand
>> a memcpy differently in those cases.   Changing how we expand an opaque
>> memcpy call is unlikely to be beneficial in most cases.  But changing
>> how we expand a structure copy may be beneficial by exposing the
>> underlying field values.   This would roughly correspond to your method #1.
>>
>> Or instead of changing how we expand, teach the optimizers about these
>> annotated memcpy calls -- they're just a a copy of each field.   That's
>> how CSE and the propagators could treat them. After some point we'd
>> lower them in the usual ways, but at least early in the RTL pipeline we
>> could keep them as annotated memcpy calls.  This roughly corresponds to
>> your second suggestion.
>
> In the end it depends on the access patterns so some analysis like SRA
> performs would be nice.  The issue with expanding memcpy on GIMPLE
> is that we currently cannot express 'rep; movb;' or other target specific
> sequences from the cpymem like optabs on GIMPLE and recovering those
> from piecewise copies on RTL is going to be difficult.
Actually, it is a special memcpy: it is generated when expanding the
struct assignment (expand_assignment/store_expr/emit_block_move).
We could introduce a function block_move_for_record for struct types,
and this function could be a target hook that generates target-specific
sequences.  For example:
r125:DF=[r112:DI+0x20]
r126:DF=[r112:DI+0x28]
[r112:DI]=r125:DF
[r112:DI+0x8]=r126:DF

After expanding, the following passes (cse/prop/dse/...) could optimize the
insn sequences, e.g. "[r112:DI+0x20]=f1;r125:DF=[r112:DI+0x20];
[r112:DI]=r125:DF;r129:DF=[r112:DI]" ==> "r129:DF=f1"

And if the small load/store insns still occur in late passes,
e.g. combine, we could recover the insns into a better sequence:
r125:DF=[r112:DI+0x20];r126:DF=[r112:DI+0x28]
==>
r155:V2DI=[r112:DI+0x20];

Any comments? Thanks!
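The folding sketched above — a store, a block copy, and a reload collapsing to a single move — has this source-level analogue (a sketch of the semantics only; the real transformation is on the RTL insns shown above):

```c
#include <assert.h>

typedef struct { double a[4]; } A;

/* Source-level analogue of the insn chain
     [r112:DI+0x20]=f1; r125:DF=[r112:DI+0x20];
     [r112:DI]=r125:DF; r129:DF=[r112:DI]
   which CSE/propagation/DSE should fold to a single move r129=f1.  */
static double copy_then_read (double f1)
{
  A src = { { f1, 0.0, 0.0, 0.0 } };  /* store the incoming value  */
  A dest = src;                       /* struct assignment (the block copy) */
  return dest.a[0];                   /* reload the same value     */
}
```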

BR,
Jeff(Jiufu)


>
>>
>> jeff
>>
>>
>>


Re: [RFC] propagation leap over memory copy for struct

2022-11-09 Thread Richard Biener via Gcc-patches
On Wed, 9 Nov 2022, Jiufu Guo wrote:

> Jiufu Guo via Gcc-patches  writes:
> 
> > Richard Biener  writes:
> >
> >> On Tue, 1 Nov 2022, Jiufu Guo wrote:
> >>
> >>> Segher Boessenkool  writes:
> >>> 
> >>> > On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote:
> >>> >> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
> >>> >> >We know that for struct variable assignment, memory copy may be used.
> >>> >> >And for memcpy, we may load and store more bytes as possible at one 
> >>> >> >time.
> >>> >> >While it may be not best here:
> >>> >
> >>> >> So the first question in my mind is can we do better at the gimple 
> >>> >> phase?  For the second case in particular can't we just "return a" 
> >>> >> rather than copying a into  then returning ?  This 
> >>> >> feels 
> >>> >> a lot like the return value optimization from C++.  I'm not sure if it 
> >>> >> applies to the first case or not, it's been a long time since I looked 
> >>> >> at NRV optimizations, but it might be worth poking around in there a 
> >>> >> bit 
> >>> >> (tree-nrv.cc).
> >>> >
> >>> > If it is a bigger struct you end up with quite a lot of stuff in
> >>> > registers.  GCC will eventually put that all in memory so it will work
> >>> > out fine in the end, but you are likely to get inefficient code.
> >>> Yes.  We may need to use memory to save regiters for big struct.
> >>> Small struct may be practical to use registers.  We may leverage the
> >>> idea that: some type of small struct are passing to function through
> >>> registers. 
> >>> 
> >>> >
> >>> > OTOH, 8 bytes isn't as big as we would want these days, is it?  So it
> >>> > would be useful to put smaller temportaries, say 32 bytes and smaller,
> >>> > in registers instead of in memory.
> >>> I think you mean:  we should try to registers to avoid memory accesing,
> >>> and using registers would be ok for more bytes memcpy(32bytes).
> >>> Great sugguestion, thanks a lot!
> >>> 
> >>> Like below idea:
> >>> [r100:TI, r101:TI] = src;  //Or r100:OI/OO = src;
> >>> dest = [r100:TI, r101:TI];
> >>> 
> >>> Currently, for 8bytes structure, we are using TImode for it.
> >>> And subreg/fwprop/cse passes are able to optimize it as expected.
> >>> Two concerns here: larger int modes(OI/OO/..) may be not introduced yet;
> >>> I'm not sure if current infrastructure supports to use two more
> >>> registers for one structure.
> >>> 
> >>> >
> >>> >> But even so, these kinds of things are still bound to happen, so it's 
> >>> >> probably worth thinking about if we can do better in RTL as well.
> >>> >
> >>> > Always.  It is a mistake to think that having better high-level
> >>> > optimisations means that you don't need good low-level optimisations
> >>> > anymore: in fact deficiencies there become more glaringly apparent if
> >>> > the early pipeline opts become better :-)
> >>> Understant, thanks :)
> >>> 
> >>> >
> >>> >> The first thing that comes to my mind is to annotate memcpy calls that 
> >>> >> are structure assignments.  The idea here is that we may want to 
> >>> >> expand 
> >>> >> a memcpy differently in those cases.   Changing how we expand an 
> >>> >> opaque 
> >>> >> memcpy call is unlikely to be beneficial in most cases.  But changing 
> >>> >> how we expand a structure copy may be beneficial by exposing the 
> >>> >> underlying field values.   This would roughly correspond to your 
> >>> >> method 
> >>> >> #1.
> >>> >> 
> >>> >> Or instead of changing how we expand, teach the optimizers about these 
> >>> >> annotated memcpy calls -- they're just a a copy of each field.   
> >>> >> That's 
> >>> >> how CSE and the propagators could treat them. After some point we'd 
> >>> >> lower them in the usual ways, but at least early in the RTL pipeline 
> >>> >> we 
> >>> >> could keep them as annotated memcpy calls.  This roughly corresponds 
> >>> >> to 
> >>> >> your second suggestion.
> >>> >
> >>> > Ideally this won't ever make it as far as RTL, if the structures do not
> >>> > need to go via memory.  All high-level optimissations should have been
> >>> > done earlier, and hopefully it was not expand tiself that forced stuff
> >>> > into memory!  :-/
> >>> Currently, after early gimple optimization, the struct member accessing
> >>> may still need to be in memory (if the mode of the struct is BLK).
> >>> For example:
> >>> 
> >>> _Bool foo (const A a) { return a.a[0] > 1.0; }
> >>> 
> >>> The optimized gimple would be:
> >>>   _1 = a.a[0];
> >>>   _3 = _1 > 1.0e+0;
> >>>   return _3;
> >>> 
> >>> During expand to RTL, parm 'a' is store to memory from arg regs firstly,
> >>> and "a.a[0]" is also reading from memory.  It may be better to use
> >>> "f1" for "a.a[0]" here.
> >>> 
> >>> Maybe, method3 is similar with your idea: using "parallel:BLK {DF;DF;DF; 
> >>> DF}"
> >>> for the struct (BLK may be changed), and using 4 DF registers to access
> >>> the structure in expand pass.
> >>
> >> I think for cases like this it might be a good idea to perform
> >> SRA-like analysis at RTL 

Re: [RFC] propagation leap over memory copy for struct

2022-11-08 Thread Jiufu Guo via Gcc-patches
Jiufu Guo via Gcc-patches  writes:

> Richard Biener  writes:
>
>> On Tue, 1 Nov 2022, Jiufu Guo wrote:
>>
>>> Segher Boessenkool  writes:
>>> 
>>> > On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote:
>>> >> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
>>> >> >We know that for struct variable assignment, memory copy may be used.
>>> >> >And for memcpy, we may load and store more bytes as possible at one 
>>> >> >time.
>>> >> >While it may be not best here:
>>> >
>>> >> So the first question in my mind is can we do better at the gimple 
>>> >> phase?  For the second case in particular can't we just "return a" 
>>> >> rather than copying a into  then returning ?  This feels 
>>> >> a lot like the return value optimization from C++.  I'm not sure if it 
>>> >> applies to the first case or not, it's been a long time since I looked 
>>> >> at NRV optimizations, but it might be worth poking around in there a bit 
>>> >> (tree-nrv.cc).
>>> >
>>> > If it is a bigger struct you end up with quite a lot of stuff in
>>> > registers.  GCC will eventually put that all in memory so it will work
>>> > out fine in the end, but you are likely to get inefficient code.
>>> Yes.  We may need to use memory to save regiters for big struct.
>>> Small struct may be practical to use registers.  We may leverage the
>>> idea that: some type of small struct are passing to function through
>>> registers. 
>>> 
>>> >
>>> > OTOH, 8 bytes isn't as big as we would want these days, is it?  So it
>>> > would be useful to put smaller temportaries, say 32 bytes and smaller,
>>> > in registers instead of in memory.
>>> I think you mean:  we should try to registers to avoid memory accesing,
>>> and using registers would be ok for more bytes memcpy(32bytes).
>>> Great sugguestion, thanks a lot!
>>> 
>>> Like below idea:
>>> [r100:TI, r101:TI] = src;  //Or r100:OI/OO = src;
>>> dest = [r100:TI, r101:TI];
>>> 
>>> Currently, for 8bytes structure, we are using TImode for it.
>>> And subreg/fwprop/cse passes are able to optimize it as expected.
>>> Two concerns here: larger int modes(OI/OO/..) may be not introduced yet;
>>> I'm not sure if current infrastructure supports to use two more
>>> registers for one structure.
>>> 
>>> >
>>> >> But even so, these kinds of things are still bound to happen, so it's 
>>> >> probably worth thinking about if we can do better in RTL as well.
>>> >
>>> > Always.  It is a mistake to think that having better high-level
>>> > optimisations means that you don't need good low-level optimisations
>>> > anymore: in fact deficiencies there become more glaringly apparent if
>>> > the early pipeline opts become better :-)
>>> Understant, thanks :)
>>> 
>>> >
>>> >> The first thing that comes to my mind is to annotate memcpy calls that 
>>> >> are structure assignments.  The idea here is that we may want to expand 
>>> >> a memcpy differently in those cases.   Changing how we expand an opaque 
>>> >> memcpy call is unlikely to be beneficial in most cases.  But changing 
>>> >> how we expand a structure copy may be beneficial by exposing the 
>>> >> underlying field values.   This would roughly correspond to your method 
>>> >> #1.
>>> >> 
>>> >> Or instead of changing how we expand, teach the optimizers about these 
>>> >> annotated memcpy calls -- they're just a a copy of each field.   That's 
>>> >> how CSE and the propagators could treat them. After some point we'd 
>>> >> lower them in the usual ways, but at least early in the RTL pipeline we 
>>> >> could keep them as annotated memcpy calls.  This roughly corresponds to 
>>> >> your second suggestion.
>>> >
>>> > Ideally this won't ever make it as far as RTL, if the structures do not
>>> > need to go via memory.  All high-level optimissations should have been
>>> > done earlier, and hopefully it was not expand tiself that forced stuff
>>> > into memory!  :-/
>>> Currently, after early gimple optimization, the struct member accessing
>>> may still need to be in memory (if the mode of the struct is BLK).
>>> For example:
>>> 
>>> _Bool foo (const A a) { return a.a[0] > 1.0; }
>>> 
>>> The optimized gimple would be:
>>>   _1 = a.a[0];
>>>   _3 = _1 > 1.0e+0;
>>>   return _3;
>>> 
>>> During expand to RTL, parm 'a' is store to memory from arg regs firstly,
>>> and "a.a[0]" is also reading from memory.  It may be better to use
>>> "f1" for "a.a[0]" here.
>>> 
>>> Maybe, method3 is similar with your idea: using "parallel:BLK {DF;DF;DF; 
>>> DF}"
>>> for the struct (BLK may be changed), and using 4 DF registers to access
>>> the structure in expand pass.
>>
>> I think for cases like this it might be a good idea to perform
>> SRA-like analysis at RTL expansion time when we know how parameters
>> arrive (in pieces) and take that knowledge into account when
>> assigning the RTL to a decl.  The same applies for the return ABI.
>> Since we rely on RTL to elide copies to/from return/argument
>> registers/slots we have to assign "layout compatible" registers
>> to the 

Re: [RFC] propagation leap over memory copy for struct

2022-11-07 Thread Jiufu Guo via Gcc-patches
Richard Biener  writes:

> On Tue, 1 Nov 2022, Jiufu Guo wrote:
>
>> Segher Boessenkool  writes:
>> 
>> > On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote:
>> >> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
>> >> >We know that for struct variable assignment, memory copy may be used.
>> >> >And for memcpy, we may load and store more bytes as possible at one time.
>> >> >While it may be not best here:
>> >
>> >> So the first question in my mind is can we do better at the gimple 
>> >> phase?  For the second case in particular can't we just "return a" 
>> >> rather than copying a into  then returning ?  This feels 
>> >> a lot like the return value optimization from C++.  I'm not sure if it 
>> >> applies to the first case or not, it's been a long time since I looked 
>> >> at NRV optimizations, but it might be worth poking around in there a bit 
>> >> (tree-nrv.cc).
>> >
>> > If it is a bigger struct you end up with quite a lot of stuff in
>> > registers.  GCC will eventually put that all in memory so it will work
>> > out fine in the end, but you are likely to get inefficient code.
>> Yes.  We may need to use memory to save regiters for big struct.
>> Small struct may be practical to use registers.  We may leverage the
>> idea that: some type of small struct are passing to function through
>> registers. 
>> 
>> >
>> > OTOH, 8 bytes isn't as big as we would want these days, is it?  So it
>> > would be useful to put smaller temportaries, say 32 bytes and smaller,
>> > in registers instead of in memory.
>> I think you mean:  we should try to registers to avoid memory accesing,
>> and using registers would be ok for more bytes memcpy(32bytes).
>> Great sugguestion, thanks a lot!
>> 
>> Like below idea:
>> [r100:TI, r101:TI] = src;  //Or r100:OI/OO = src;
>> dest = [r100:TI, r101:TI];
>> 
>> Currently, for 8bytes structure, we are using TImode for it.
>> And subreg/fwprop/cse passes are able to optimize it as expected.
>> Two concerns here: larger int modes(OI/OO/..) may be not introduced yet;
>> I'm not sure if current infrastructure supports to use two more
>> registers for one structure.
>> 
>> >
>> >> But even so, these kinds of things are still bound to happen, so it's 
>> >> probably worth thinking about if we can do better in RTL as well.
>> >
>> > Always.  It is a mistake to think that having better high-level
>> > optimisations means that you don't need good low-level optimisations
>> > anymore: in fact deficiencies there become more glaringly apparent if
>> > the early pipeline opts become better :-)
>> Understant, thanks :)
>> 
>> >
>> >> The first thing that comes to my mind is to annotate memcpy calls that 
>> >> are structure assignments.  The idea here is that we may want to expand 
>> >> a memcpy differently in those cases.   Changing how we expand an opaque 
>> >> memcpy call is unlikely to be beneficial in most cases.  But changing 
>> >> how we expand a structure copy may be beneficial by exposing the 
>> >> underlying field values.   This would roughly correspond to your method 
>> >> #1.
>> >> 
>> >> Or instead of changing how we expand, teach the optimizers about these 
>> >> annotated memcpy calls -- they're just a a copy of each field.   That's 
>> >> how CSE and the propagators could treat them. After some point we'd 
>> >> lower them in the usual ways, but at least early in the RTL pipeline we 
>> >> could keep them as annotated memcpy calls.  This roughly corresponds to 
>> >> your second suggestion.
>> >
>> > Ideally this won't ever make it as far as RTL, if the structures do not
>> > need to go via memory.  All high-level optimissations should have been
>> > done earlier, and hopefully it was not expand tiself that forced stuff
>> > into memory!  :-/
>> Currently, after early gimple optimization, the struct member accessing
>> may still need to be in memory (if the mode of the struct is BLK).
>> For example:
>> 
>> _Bool foo (const A a) { return a.a[0] > 1.0; }
>> 
>> The optimized gimple would be:
>>   _1 = a.a[0];
>>   _3 = _1 > 1.0e+0;
>>   return _3;
>> 
>> During expand to RTL, parm 'a' is store to memory from arg regs firstly,
>> and "a.a[0]" is also reading from memory.  It may be better to use
>> "f1" for "a.a[0]" here.
>> 
>> Maybe, method3 is similar with your idea: using "parallel:BLK {DF;DF;DF; DF}"
>> for the struct (BLK may be changed), and using 4 DF registers to access
>> the structure in expand pass.
>
> I think for cases like this it might be a good idea to perform
> SRA-like analysis at RTL expansion time when we know how parameters
> arrive (in pieces) and take that knowledge into account when
> assigning the RTL to a decl.  The same applies for the return ABI.
> Since we rely on RTL to elide copies to/from return/argument
> registers/slots we have to assign "layout compatible" registers
> to the corresponding auto vars.
>
Thanks for pointing this out!
This looks like a kind of SRA, especially for the parameter and return value.
As you pointed out, there is 

Re: [RFC] propagation leap over memory copy for struct

2022-11-05 Thread Richard Biener via Gcc-patches
On Tue, 1 Nov 2022, Jiufu Guo wrote:

> Segher Boessenkool  writes:
> 
> > On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote:
> >> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
> >> >We know that for struct variable assignment, memory copy may be used.
> >> >And for memcpy, we may load and store more bytes as possible at one time.
> >> >While it may be not best here:
> >
> >> So the first question in my mind is can we do better at the gimple 
> >> phase?  For the second case in particular can't we just "return a" 
> >> rather than copying a into  then returning ?  This feels 
> >> a lot like the return value optimization from C++.  I'm not sure if it 
> >> applies to the first case or not, it's been a long time since I looked 
> >> at NRV optimizations, but it might be worth poking around in there a bit 
> >> (tree-nrv.cc).
> >
> > If it is a bigger struct you end up with quite a lot of stuff in
> > registers.  GCC will eventually put that all in memory so it will work
> > out fine in the end, but you are likely to get inefficient code.
> Yes.  We may need to use memory to save regiters for big struct.
> Small struct may be practical to use registers.  We may leverage the
> idea that: some type of small struct are passing to function through
> registers. 
> 
> >
> > OTOH, 8 bytes isn't as big as we would want these days, is it?  So it
> > would be useful to put smaller temportaries, say 32 bytes and smaller,
> > in registers instead of in memory.
> I think you mean:  we should try to registers to avoid memory accesing,
> and using registers would be ok for more bytes memcpy(32bytes).
> Great sugguestion, thanks a lot!
> 
> Like below idea:
> [r100:TI, r101:TI] = src;  //Or r100:OI/OO = src;
> dest = [r100:TI, r101:TI];
> 
> Currently, for 8bytes structure, we are using TImode for it.
> And subreg/fwprop/cse passes are able to optimize it as expected.
> Two concerns here: larger int modes(OI/OO/..) may be not introduced yet;
> I'm not sure if current infrastructure supports to use two more
> registers for one structure.
> 
> >
> >> But even so, these kinds of things are still bound to happen, so it's 
> >> probably worth thinking about if we can do better in RTL as well.
> >
> > Always.  It is a mistake to think that having better high-level
> > optimisations means that you don't need good low-level optimisations
> > anymore: in fact deficiencies there become more glaringly apparent if
> > the early pipeline opts become better :-)
> Understant, thanks :)
> 
> >
> >> The first thing that comes to my mind is to annotate memcpy calls that 
> >> are structure assignments.  The idea here is that we may want to expand 
> >> a memcpy differently in those cases.   Changing how we expand an opaque 
> >> memcpy call is unlikely to be beneficial in most cases.  But changing 
> >> how we expand a structure copy may be beneficial by exposing the 
> >> underlying field values.   This would roughly correspond to your method 
> >> #1.
> >> 
> >> Or instead of changing how we expand, teach the optimizers about these 
> >> annotated memcpy calls -- they're just a a copy of each field.   That's 
> >> how CSE and the propagators could treat them. After some point we'd 
> >> lower them in the usual ways, but at least early in the RTL pipeline we 
> >> could keep them as annotated memcpy calls.  This roughly corresponds to 
> >> your second suggestion.
> >
> > Ideally this won't ever make it as far as RTL, if the structures do not
> > need to go via memory.  All high-level optimissations should have been
> > done earlier, and hopefully it was not expand tiself that forced stuff
> > into memory!  :-/
> Currently, after early gimple optimization, the struct member accessing
> may still need to be in memory (if the mode of the struct is BLK).
> For example:
> 
> _Bool foo (const A a) { return a.a[0] > 1.0; }
> 
> The optimized gimple would be:
>   _1 = a.a[0];
>   _3 = _1 > 1.0e+0;
>   return _3;
> 
> During expand to RTL, parm 'a' is store to memory from arg regs firstly,
> and "a.a[0]" is also reading from memory.  It may be better to use
> "f1" for "a.a[0]" here.
> 
> Maybe, method3 is similar with your idea: using "parallel:BLK {DF;DF;DF; DF}"
> for the struct (BLK may be changed), and using 4 DF registers to access
> the structure in expand pass.

I think for cases like this it might be a good idea to perform
SRA-like analysis at RTL expansion time when we know how parameters
arrive (in pieces) and take that knowledge into account when
assigning the RTL to a decl.  The same applies for the return ABI.
Since we rely on RTL to elide copies to/from return/argument
registers/slots we have to assign "layout compatible" registers
to the corresponding auto vars.
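One possible shape for that SRA-like check, as a toy sketch (all names hypothetical, not GCC code): collect the (offset, size) pairs at which the decl is later accessed, and allow per-piece register assignment only when the pieces tile the aggregate exactly — roughly the condition for the pieces to be "layout compatible" with how the value arrives or is returned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A recorded scalar access into an aggregate: byte offset and width.  */
struct access { size_t offset, size; };

/* Toy heuristic: the aggregate can be decomposed along the recorded
   accesses iff they are non-overlapping and cover it exactly.
   Assumes the accesses are sorted by offset.  */
static bool decomposable_p (const struct access *accs, int n, size_t total)
{
  size_t pos = 0;
  for (int i = 0; i < n; i++)
    {
      if (accs[i].offset != pos)
        return false;            /* gap or overlap: keep the block copy */
      pos += accs[i].size;
    }
  return pos == total;           /* must cover the whole aggregate */
}
```

For the PR65421 struct (four doubles, 32 bytes), four 8-byte accesses at offsets 0/8/16/24 would qualify; a pattern with gaps would not.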

> 
> Thanks again for your kindly and helpful comments!
> 
> BR,
> Jeff(Jiufu)
> 
> >
> >
> > Segher
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, 

Re: [RFC] propagation leap over memory copy for struct

2022-11-05 Thread Richard Biener via Gcc-patches
On Mon, Oct 31, 2022 at 11:14 PM Jeff Law via Gcc-patches
 wrote:
>
>
> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
> > Hi,
> >
> > We know that for struct variable assignment, memory copy may be used.
> > And for memcpy, we may load and store more bytes as possible at one time.
> > While it may be not best here:
> > 1. Before/after stuct variable assignment, the vaiable may be operated.
> > And it is hard for some optimizations to leap over memcpy.  Then some struct
> > operations may be sub-optimimal.  Like the issue in PR65421.
> > 2. The size of struct is constant mostly, the memcpy would be expanded.  
> > Using
> > small size to load/store and executing in parallel may not slower than using
> > large size to loat/store. (sure, more registers may be used for smaller 
> > bytes.)
> >
> >
> > In PR65421, For source code as below:
> > t.c
> > #define FN 4
> > typedef struct { double a[FN]; } A;
> >
> > A foo (const A *a) { return *a; }
> > A bar (const A a) { return a; }
>
> So the first question in my mind is can we do better at the gimple
> phase?  For the second case in particular can't we just "return a"
> rather than copying a into  then returning ?  This feels
> a lot like the return value optimization from C++.  I'm not sure if it
> applies to the first case or not, it's been a long time since I looked
> at NRV optimizations, but it might be worth poking around in there a bit
> (tree-nrv.cc).
>
>
> But even so, these kinds of things are still bound to happen, so it's
> probably worth thinking about if we can do better in RTL as well.
>
>
> The first thing that comes to my mind is to annotate memcpy calls that
> are structure assignments.  The idea here is that we may want to expand
> a memcpy differently in those cases.   Changing how we expand an opaque
> memcpy call is unlikely to be beneficial in most cases.  But changing
> how we expand a structure copy may be beneficial by exposing the
> underlying field values.   This would roughly correspond to your method #1.
>
> Or instead of changing how we expand, teach the optimizers about these
> annotated memcpy calls -- they're just a a copy of each field.   That's
> how CSE and the propagators could treat them. After some point we'd
> lower them in the usual ways, but at least early in the RTL pipeline we
> could keep them as annotated memcpy calls.  This roughly corresponds to
> your second suggestion.

In the end it depends on the access patterns, so some analysis like the
one SRA performs would be nice.  The issue with expanding memcpy on GIMPLE
is that we currently cannot express 'rep movsb' or other target-specific
sequences from the cpymem-like optabs on GIMPLE, and recovering those
from piecewise copies on RTL is going to be difficult.
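To illustrate the recovery problem (a sketch, not GCC code): once a fixed-size copy has been lowered to piecewise word moves like the following, recognising that the independent load/store pairs are jointly equivalent to one cpymem-style block operation (such as x86 `rep movsb`) again requires non-trivial analysis.

```c
#include <stdint.h>
#include <string.h>

/* Piecewise expansion of a 32-byte copy: four independent word-sized
   load/store pairs.  Each pair is easy for the propagators to track,
   but re-fusing the whole loop into one block operation is the hard
   direction described above.  */
static void copy32_piecewise (void *dst, const void *src)
{
  for (size_t off = 0; off < 32; off += 8)
    {
      uint64_t w;
      memcpy (&w, (const char *) src + off, sizeof w);  /* word load  */
      memcpy ((char *) dst + off, &w, sizeof w);        /* word store */
    }
}
```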

>
> jeff
>
>
>


Re: [RFC] propagation leap over memory copy for struct

2022-10-31 Thread Jiufu Guo via Gcc-patches
Segher Boessenkool  writes:

> On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote:
>> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
>> >We know that for struct variable assignment, memory copy may be used.
>> >And for memcpy, we may load and store more bytes as possible at one time.
>> >While it may be not best here:
>
>> So the first question in my mind is can we do better at the gimple 
>> phase?  For the second case in particular can't we just "return a" 
>> rather than copying a into  then returning ?  This feels 
>> a lot like the return value optimization from C++.  I'm not sure if it 
>> applies to the first case or not, it's been a long time since I looked 
>> at NRV optimizations, but it might be worth poking around in there a bit 
>> (tree-nrv.cc).
>
> If it is a bigger struct you end up with quite a lot of stuff in
> registers.  GCC will eventually put that all in memory so it will work
> out fine in the end, but you are likely to get inefficient code.
Yes.  We may need to use memory instead of registers for big structs.
For small structs it may be practical to use registers.  We may leverage
the idea that some kinds of small structs are passed to functions
through registers.

>
> OTOH, 8 bytes isn't as big as we would want these days, is it?  So it
> would be useful to put smaller temporaries, say 32 bytes and smaller,
> in registers instead of in memory.
I think you mean: we should try to use registers to avoid memory
accesses, and using registers would be fine even for larger memcpys
(32 bytes).  Great suggestion, thanks a lot!

Like the idea below:
[r100:TI, r101:TI] = src;  //Or r100:OI/OO = src;
dest = [r100:TI, r101:TI];

Currently, for a 16-byte structure, we are using TImode for it,
and the subreg/fwprop/cse passes are able to optimize it as expected.
Two concerns here: larger integer modes (OI/OO/...) may not be
introduced yet, and I'm not sure whether the current infrastructure
supports using two or more registers for one structure.

>
>> But even so, these kinds of things are still bound to happen, so it's 
>> probably worth thinking about if we can do better in RTL as well.
>
> Always.  It is a mistake to think that having better high-level
> optimisations means that you don't need good low-level optimisations
> anymore: in fact deficiencies there become more glaringly apparent if
> the early pipeline opts become better :-)
Understood, thanks :)

>
>> The first thing that comes to my mind is to annotate memcpy calls that 
>> are structure assignments.  The idea here is that we may want to expand 
>> a memcpy differently in those cases.   Changing how we expand an opaque 
>> memcpy call is unlikely to be beneficial in most cases.  But changing 
>> how we expand a structure copy may be beneficial by exposing the 
>> underlying field values.   This would roughly correspond to your method 
>> #1.
>> 
>> Or instead of changing how we expand, teach the optimizers about these 
>> annotated memcpy calls -- they're just a a copy of each field.   That's 
>> how CSE and the propagators could treat them. After some point we'd 
>> lower them in the usual ways, but at least early in the RTL pipeline we 
>> could keep them as annotated memcpy calls.  This roughly corresponds to 
>> your second suggestion.
>
> Ideally this won't ever make it as far as RTL, if the structures do not
> need to go via memory.  All high-level optimisations should have been
> done earlier, and hopefully it was not expand itself that forced stuff
> into memory!  :-/
Currently, after early gimple optimization, struct member accesses
may still need to go through memory (if the mode of the struct is BLK).
For example:

_Bool foo (const A a) { return a.a[0] > 1.0; }

The optimized gimple would be:
  _1 = a.a[0];
  _3 = _1 > 1.0e+0;
  return _3;

During expansion to RTL, parm 'a' is first stored to memory from the
argument registers, and "a.a[0]" is then read back from memory.  It may
be better to use "f1" for "a.a[0]" here.

Maybe method3 is similar to your idea: using "parallel:BLK {DF;DF;DF;DF}"
for the struct (BLK may be changed), and using 4 DF registers to access
the structure in the expand pass.


Thanks again for your kind and helpful comments!

BR,
Jeff(Jiufu)

>
>
> Segher


Re: [RFC] propagation leap over memory copy for struct

2022-10-31 Thread Jiufu Guo via Gcc-patches
Jeff Law  writes:

> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
>> Hi,
>>
>> We know that for struct variable assignment, a memory copy may be used.
>> And for memcpy, we may load and store as many bytes as possible at one time.
>> But it may not be best here:
>> 1. Before/after a struct variable assignment, the variable may be operated
>> on, and it is hard for some optimizations to leap over the memcpy.  Then
>> some struct operations may be sub-optimal, like the issue in PR65421.
>> 2. The size of a struct is mostly constant, so the memcpy would be
>> expanded.  Using small sizes to load/store and executing in parallel may
>> not be slower than using large sizes to load/store (though more registers
>> may be used for the smaller pieces).
>>
>>
>> In PR65421, For source code as below:
>> t.c
>> #define FN 4
>> typedef struct { double a[FN]; } A;
>>
>> A foo (const A *a) { return *a; }
>> A bar (const A a) { return a; }
>
> So the first question in my mind is can we do better at the gimple
> phase?  For the second case in particular can't we just "return a"
> rather than copying a into <retval> then returning <retval>?  This
> feels a lot like the return value optimization from C++.  I'm not sure
> if it applies to the first case or not, it's been a long time since I
> looked at NRV optimizations, but it might be worth poking around in
> there a bit (tree-nrv.cc).
Thanks for pointing out this idea!

Currently the optimized gimple looks like:
  D.3334 = a;
  return D.3334;

and
  D.3336 = *a_2(D);
  return D.3336;

It may be better to have:
"return a;" and "return *a;"
-

If the code looks like:
typedef struct { double a[3]; long l;} A; //mix types
A foo (const A a) { return a; }
A bar (const A *a) { return *a; }

The current optimized gimple looks like:
  <retval> = a;
  return <retval>;
and
  <retval> = *a_2(D);
  return <retval>;

"return a;" and "return *a;" may work here too.
>
>
> But even so, these kinds of things are still bound to happen, so it's
> probably worth thinking about if we can do better in RTL as well. 
>
Yep, thanks!
>
> The first thing that comes to my mind is to annotate memcpy calls that
> are structure assignments.  The idea here is that we may want to
> expand a memcpy differently in those cases.   Changing how we expand
> an opaque memcpy call is unlikely to be beneficial in most cases.  But
> changing how we expand a structure copy may be beneficial by exposing
> the underlying field values.   This would roughly correspond to your
> method #1.
Right.  For a general memcpy, we would read/write larger chunks at one
time.  Reading/writing small fields may only be beneficial for structure
assignment.

>
> Or instead of changing how we expand, teach the optimizers about these
> annotated memcpy calls -- they're just a copy of each field.  
> That's how CSE and the propagators could treat them. After some point
> we'd lower them in the usual ways, but at least early in the RTL
> pipeline we could keep them as annotated memcpy calls.  This roughly
> corresponds to your second suggestion.
Thanks for your insights about this idea!  Use the annotated memcpy for
early optimizations, and treat it as a general memcpy in the later
passes.


Thanks again for your very helpful comments and suggestions!

BR,
Jeff(Jiufu)

>
>
> jeff


Re: [RFC] propagation leap over memory copy for struct

2022-10-31 Thread Jiufu Guo via Gcc-patches
Segher Boessenkool  writes:

> Hi!
>
> On Mon, Oct 31, 2022 at 10:42:35AM +0800, Jiufu Guo wrote:
>> #define FN 4
>> typedef struct { double a[FN]; } A;
>> 
>> A foo (const A *a) { return *a; }
>> A bar (const A a) { return a; }
>> ///
>> 
>> If FN<=2, the size of "A" fits into TImode, and this code can be optimized 
>> (by subreg/cse/fwprop/cprop) as:
>> ---
>> foo:
>> .LFB0:
>> .cfi_startproc
>> blr
>> 
>> bar:
>> .LFB1:
>>  .cfi_startproc
>>  lfd 2,8(3)
>>  lfd 1,0(3)
>>  blr
>> 
>
> I think you swapped foo and bar here?
Oh, thanks!
>
>> If the size of "A" is larger than any INT mode size, RTL insns would be 
>> generated as:
>>13: r125:V2DI=[r112:DI+0x20]
>>14: r126:V2DI=[r112:DI+0x30]
>>15: [r112:DI]=r125:V2DI
>>16: [r112:DI+0x10]=r126:V2DI  /// memcpy for assignment: D.3338 = arg;
>>17: r127:DF=[r112:DI]
>>18: r128:DF=[r112:DI+0x8]
>>19: r129:DF=[r112:DI+0x10]
>>20: r130:DF=[r112:DI+0x18]
>> 
>> 
>> I'm thinking about ways to improve this.
>> Method1: One way may be changing the memory copy by referencing the type 
>> of the struct if the size of struct is not too big. And generate insns 
>> like the below:
>>13: r125:DF=[r112:DI+0x20]
>>15: r126:DF=[r112:DI+0x28]
>>17: r127:DF=[r112:DI+0x30]
>>19: r128:DF=[r112:DI+0x38]
>>14: [r112:DI]=r125:DF
>>16: [r112:DI+0x8]=r126:DF
>>18: [r112:DI+0x10]=r127:DF
>>20: [r112:DI+0x18]=r128:DF
>>21: r129:DF=[r112:DI]
>>22: r130:DF=[r112:DI+0x8]
>>23: r131:DF=[r112:DI+0x10]
>>24: r132:DF=[r112:DI+0x18]
>
> This is much worse though?  The expansion with memcpy used V2DI, which
> typically is close to 2x faster than DFmode accesses.
Using V2DI helps to access twice as many bytes at one time as DF/DI.
But since those reads can be executed in parallel, it would not be too
bad to use DF/DI.

>
> Or are you trying to avoid small reads of large stores here?  Those
> aren't so bad, large reads of small stores is the nastiness we need to
> avoid.
Here, I want to use two DF reads, because optimizations like cse/fwprop/dse
can eliminate those memory accesses when the sizes match.
>
> The code we have now does
>
>15: [r112:DI]=r125:V2DI
> ...
>17: r127:DF=[r112:DI]
>18: r128:DF=[r112:DI+0x8]
>
> Can you make this optimised to not use a memory temporary at all, just
> immediately assign from r125 to r127 and r128?
r125 is not directly assigned to r127/r128, since insn 15 and insns
17/18 are expanded from different gimple stmts:
  D.3331 = a;     ==> insn 15 is generated for the struct assignment here.
  return D.3331;  ==> insns 17/18 are prepared for the return registers.

I'm trying to eliminate those memory temporaries, but have not found a
good way yet.  Methods 1-3 are the ideas I'm trying to use to delete
those temporaries.

>
>> Method2: One way may be enhancing CSE to make it able to treat one large
>> memory slot as two(or more) combined slots: 
>>13: r125:V2DI#0=[r112:DI+0x20]
>>13': r125:V2DI#8=[r112:DI+0x28]
>>15: [r112:DI]#0=r125:V2DI#0
>>15': [r112:DI]#8=r125:V2DI#8
>> 
>> This may seems more hack in CSE.
>
> The current CSE pass is the pass most in need of a full rewrite, and
> has been for many years.  It does a lot of things, important things
> that we should not lose, but it does a pretty bad job of CSE.
>
>> Method3: For some record type, use "PARALLEL:BLK" instead "MEM:BLK".
>
> :BLK can never be optimised well.  It always has to live in memory, by
> definition.

Thanks for your suggestions!

BR,
Jeff (Jiufu)
>
>
> Segher


Re: [RFC] propagation leap over memory copy for struct

2022-10-31 Thread Segher Boessenkool
On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote:
> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
> >We know that for struct variable assignment, a memory copy may be used.
> >And for memcpy, we may load and store as many bytes as possible at one time.
> >But it may not be best here:

> So the first question in my mind is can we do better at the gimple 
> phase?  For the second case in particular can't we just "return a" 
> rather than copying a into <retval> then returning <retval>?  This feels 
> a lot like the return value optimization from C++.  I'm not sure if it 
> applies to the first case or not, it's been a long time since I looked 
> at NRV optimizations, but it might be worth poking around in there a bit 
> (tree-nrv.cc).

If it is a bigger struct you end up with quite a lot of stuff in
registers.  GCC will eventually put that all in memory so it will work
out fine in the end, but you are likely to get inefficient code.

OTOH, 8 bytes isn't as big as we would want these days, is it?  So it
would be useful to put smaller temporaries, say 32 bytes and smaller,
in registers instead of in memory.

> But even so, these kinds of things are still bound to happen, so it's 
> probably worth thinking about if we can do better in RTL as well.

Always.  It is a mistake to think that having better high-level
optimisations means that you don't need good low-level optimisations
anymore: in fact deficiencies there become more glaringly apparent if
the early pipeline opts become better :-)

> The first thing that comes to my mind is to annotate memcpy calls that 
> are structure assignments.  The idea here is that we may want to expand 
> a memcpy differently in those cases.   Changing how we expand an opaque 
> memcpy call is unlikely to be beneficial in most cases.  But changing 
> how we expand a structure copy may be beneficial by exposing the 
> underlying field values.   This would roughly correspond to your method 
> #1.
> 
> Or instead of changing how we expand, teach the optimizers about these 
> annotated memcpy calls -- they're just a a copy of each field.   That's 
> how CSE and the propagators could treat them. After some point we'd 
> lower them in the usual ways, but at least early in the RTL pipeline we 
> could keep them as annotated memcpy calls.  This roughly corresponds to 
> your second suggestion.

Ideally this won't ever make it as far as RTL, if the structures do not
need to go via memory.  All high-level optimisations should have been
done earlier, and hopefully it was not expand itself that forced stuff
into memory!  :-/


Segher


Re: [RFC] propagation leap over memory copy for struct

2022-10-31 Thread Segher Boessenkool
Hi!

On Mon, Oct 31, 2022 at 10:42:35AM +0800, Jiufu Guo wrote:
> #define FN 4
> typedef struct { double a[FN]; } A;
> 
> A foo (const A *a) { return *a; }
> A bar (const A a) { return a; }
> ///
> 
> If FN<=2, the size of "A" fits into TImode, and this code can be optimized 
> (by subreg/cse/fwprop/cprop) as:
> ---
> foo:
> .LFB0:
> .cfi_startproc
> blr
> 
> bar:
> .LFB1:
>   .cfi_startproc
>   lfd 2,8(3)
>   lfd 1,0(3)
>   blr
> 

I think you swapped foo and bar here?

> If the size of "A" is larger than any INT mode size, RTL insns would be 
> generated as:
>13: r125:V2DI=[r112:DI+0x20]
>14: r126:V2DI=[r112:DI+0x30]
>15: [r112:DI]=r125:V2DI
>16: [r112:DI+0x10]=r126:V2DI  /// memcpy for assignment: D.3338 = arg;
>17: r127:DF=[r112:DI]
>18: r128:DF=[r112:DI+0x8]
>19: r129:DF=[r112:DI+0x10]
>20: r130:DF=[r112:DI+0x18]
> 
> 
> I'm thinking about ways to improve this.
> Method1: One way may be changing the memory copy by referencing the type 
> of the struct if the size of struct is not too big. And generate insns 
> like the below:
>13: r125:DF=[r112:DI+0x20]
>15: r126:DF=[r112:DI+0x28]
>17: r127:DF=[r112:DI+0x30]
>19: r128:DF=[r112:DI+0x38]
>14: [r112:DI]=r125:DF
>16: [r112:DI+0x8]=r126:DF
>18: [r112:DI+0x10]=r127:DF
>20: [r112:DI+0x18]=r128:DF
>21: r129:DF=[r112:DI]
>22: r130:DF=[r112:DI+0x8]
>23: r131:DF=[r112:DI+0x10]
>24: r132:DF=[r112:DI+0x18]

This is much worse though?  The expansion with memcpy used V2DI, which
typically is close to 2x faster than DFmode accesses.

Or are you trying to avoid small reads of large stores here?  Those
aren't so bad, large reads of small stores is the nastiness we need to
avoid.

The code we have now does

   15: [r112:DI]=r125:V2DI
...
   17: r127:DF=[r112:DI]
   18: r128:DF=[r112:DI+0x8]

Can you make this optimised to not use a memory temporary at all, just
immediately assign from r125 to r127 and r128?

> Method2: One way may be enhancing CSE to make it able to treat one large
> memory slot as two(or more) combined slots: 
>13: r125:V2DI#0=[r112:DI+0x20]
>13': r125:V2DI#8=[r112:DI+0x28]
>15: [r112:DI]#0=r125:V2DI#0
>15': [r112:DI]#8=r125:V2DI#8
> 
> This may seems more hack in CSE.

The current CSE pass is the pass most in need of a full rewrite, and
has been for many years.  It does a lot of things, important things
that we should not lose, but it does a pretty bad job of CSE.

> Method3: For some record type, use "PARALLEL:BLK" instead "MEM:BLK".

:BLK can never be optimised well.  It always has to live in memory, by
definition.


Segher


Re: [RFC] propagation leap over memory copy for struct

2022-10-31 Thread Jeff Law via Gcc-patches



On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:

Hi,

We know that for struct variable assignment, a memory copy may be used.
And for memcpy, we may load and store as many bytes as possible at one time.
But it may not be best here:
1. Before/after a struct variable assignment, the variable may be operated
on, and it is hard for some optimizations to leap over the memcpy.  Then
some struct operations may be sub-optimal, like the issue in PR65421.
2. The size of a struct is mostly constant, so the memcpy would be expanded.
Using small sizes to load/store and executing in parallel may not be slower
than using large sizes to load/store (though more registers may be used for
the smaller pieces).


In PR65421, For source code as below:
t.c
#define FN 4
typedef struct { double a[FN]; } A;

A foo (const A *a) { return *a; }
A bar (const A a) { return a; }


So the first question in my mind is can we do better at the gimple 
phase?  For the second case in particular can't we just "return a" 
rather than copying a into <retval> then returning <retval>?  This feels 
a lot like the return value optimization from C++.  I'm not sure if it 
applies to the first case or not, it's been a long time since I looked 
at NRV optimizations, but it might be worth poking around in there a bit 
(tree-nrv.cc).



But even so, these kinds of things are still bound to happen, so it's 
probably worth thinking about if we can do better in RTL as well.



The first thing that comes to my mind is to annotate memcpy calls that 
are structure assignments.  The idea here is that we may want to expand 
a memcpy differently in those cases.   Changing how we expand an opaque 
memcpy call is unlikely to be beneficial in most cases.  But changing 
how we expand a structure copy may be beneficial by exposing the 
underlying field values.   This would roughly correspond to your method #1.


Or instead of changing how we expand, teach the optimizers about these 
annotated memcpy calls -- they're just a a copy of each field.   That's 
how CSE and the propagators could treat them. After some point we'd 
lower them in the usual ways, but at least early in the RTL pipeline we 
could keep them as annotated memcpy calls.  This roughly corresponds to 
your second suggestion.



jeff





[RFC] propagation leap over memory copy for struct

2022-10-30 Thread Jiufu Guo via Gcc-patches
Hi,

We know that for struct variable assignment, a memory copy may be used.
And for memcpy, we may load and store as many bytes as possible at one time.
But it may not be best here:
1. Before/after a struct variable assignment, the variable may be operated
on, and it is hard for some optimizations to leap over the memcpy.  Then
some struct operations may be sub-optimal, like the issue in PR65421.
2. The size of a struct is mostly constant, so the memcpy would be expanded.
Using small sizes to load/store and executing in parallel may not be slower
than using large sizes to load/store (though more registers may be used for
the smaller pieces).


In PR65421, For source code as below:
t.c
#define FN 4
typedef struct { double a[FN]; } A;

A foo (const A *a) { return *a; }
A bar (const A a) { return a; }
///

If FN<=2, the size of "A" fits into TImode, and this code can be optimized 
(by subreg/cse/fwprop/cprop) as:
---
foo:
.LFB0:
.cfi_startproc
blr

bar:
.LFB1:
.cfi_startproc
lfd 2,8(3)
lfd 1,0(3)
blr

If the size of "A" is larger than any INT mode size, RTL insns would be 
generated as:
   13: r125:V2DI=[r112:DI+0x20]
   14: r126:V2DI=[r112:DI+0x30]
   15: [r112:DI]=r125:V2DI
   16: [r112:DI+0x10]=r126:V2DI  /// memcpy for assignment: D.3338 = arg;
   17: r127:DF=[r112:DI]
   18: r128:DF=[r112:DI+0x8]
   19: r129:DF=[r112:DI+0x10]
   20: r130:DF=[r112:DI+0x18]


I'm thinking about ways to improve this.
Method1: One way may be changing the memory copy by referencing the type 
of the struct if the size of struct is not too big. And generate insns 
like the below:
   13: r125:DF=[r112:DI+0x20]
   15: r126:DF=[r112:DI+0x28]
   17: r127:DF=[r112:DI+0x30]
   19: r128:DF=[r112:DI+0x38]
   14: [r112:DI]=r125:DF
   16: [r112:DI+0x8]=r126:DF
   18: [r112:DI+0x10]=r127:DF
   20: [r112:DI+0x18]=r128:DF
   21: r129:DF=[r112:DI]
   22: r130:DF=[r112:DI+0x8]
   23: r131:DF=[r112:DI+0x10]
   24: r132:DF=[r112:DI+0x18]

Then passes (cse, prop, dse, ...) could help to optimize the code.
Concerns with this method: we may not want to do this if the number of
fields is too large.  And the types/modes of each load/store may depend
on the platform and may not be the same as the types of the fields of
the struct.  For example:
For "struct {double a[3]; long long l;}" on ppc64le, DImode may be
better for assignments of the parameter.


Method2: One way may be enhancing CSE to make it able to treat one large
memory slot as two(or more) combined slots: 
   13: r125:V2DI#0=[r112:DI+0x20]
   13': r125:V2DI#8=[r112:DI+0x28]
   15: [r112:DI]#0=r125:V2DI#0
   15': [r112:DI]#8=r125:V2DI#8

This may seems more hack in CSE.


Method3: For some record type, use "PARALLEL:BLK" instead "MEM:BLK".
To do this, "moving" between "PARALLEL<->PARALLEL" and "PARALLEL<->MEM" 
may need to be enhanced.  This method may require more effort to make
it work for corner/unknown cases.

I'm wondering which would be more flexible for handling this issue?
Thanks for any comments and suggestions!

BR,
Jeff(Jiufu)