Re: [RFC] propagation leap over memory copy for struct

On Wed, 9 Nov 2022, Jiufu Guo wrote:

> Hi,
>
> Richard Biener writes:
>
> > On Mon, Oct 31, 2022 at 11:14 PM Jeff Law via Gcc-patches wrote:
> >>
> >> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
> >> > Hi,
> >> >
> >> > We know that for struct variable assignment, a memory copy may be used.
> >> > And for memcpy, we may load and store as many bytes as possible at one
> >> > time. But that may not be best here:
> >> > 1. Before/after a struct variable assignment, the variable may be
> >> > operated on, and it is hard for some optimizations to leap over the
> >> > memcpy. Then some struct operations may be sub-optimal, like the
> >> > issue in PR65421.
> >> > 2. The size of a struct is mostly constant, so the memcpy would be
> >> > expanded. Using a small size to load/store and executing in parallel
> >> > may be no slower than using a large size to load/store. (Sure, more
> >> > registers may be used for the smaller accesses.)
> >> >
> >> > In PR65421, for source code as below:
> >> > t.c
> >> > #define FN 4
> >> > typedef struct { double a[FN]; } A;
> >> >
> >> > A foo (const A *a) { return *a; }
> >> > A bar (const A a) { return a; }
> >>
> >> So the first question in my mind is: can we do better at the gimple
> >> phase? For the second case in particular, can't we just "return a"
> >> rather than copying a and then returning? This feels a lot like the
> >> return value optimization from C++. I'm not sure if it applies to the
> >> first case or not; it's been a long time since I looked at NRV
> >> optimizations, but it might be worth poking around in there a bit
> >> (tree-nrv.cc).
> >>
> >> But even so, these kinds of things are still bound to happen, so it's
> >> probably worth thinking about whether we can do better in RTL as well.
> >>
> >> The first thing that comes to my mind is to annotate memcpy calls that
> >> are structure assignments. The idea here is that we may want to expand
> >> a memcpy differently in those cases. Changing how we expand an opaque
> >> memcpy call is unlikely to be beneficial in most cases. But changing
> >> how we expand a structure copy may be beneficial by exposing the
> >> underlying field values. This would roughly correspond to your
> >> method #1.
> >>
> >> Or instead of changing how we expand, teach the optimizers about these
> >> annotated memcpy calls -- they're just a copy of each field. That's
> >> how CSE and the propagators could treat them. After some point we'd
> >> lower them in the usual ways, but at least early in the RTL pipeline we
> >> could keep them as annotated memcpy calls. This roughly corresponds to
> >> your second suggestion.
> >
> > In the end it depends on the access patterns, so some analysis like SRA
> > performs would be nice. The issue with expanding memcpy on GIMPLE is
> > that we currently cannot express 'rep; movb;' or other target-specific
> > sequences from the cpymem-like optabs on GIMPLE, and recovering those
> > from piecewise copies on RTL is going to be difficult.
>
> Actually, it is a special memcpy. It is generated during expansion of
> the struct assignment (expand_assignment/store_expr/emit_block_move).
> We may introduce a function block_move_for_record for struct types, and
> this function could be a target hook to generate specific sequences.
> For example:
> r125:DF=[r112:DI+0x20]
> r126:DF=[r112:DI+0x28]
> [r112:DI]=r125:DF
> [r112:DI+0x8]=r126:DF
>
> After expanding, the following passes (cse/prop/dse/...) could optimize
> the insn sequences, e.g.
> "[r112:DI+0x20]=f1;r125:DF=[r112:DI+0x20];[r112:DI]=r125:DF;r129:DF=[r112:DI]"
> ==> "r129:DF=f1"
>
> And if the small read/write insns still occur in late passes, e.g.
> combine, we would recover the insns to a better sequence:
> r125:DF=[r112:DI+0x20];r126:DF=[r112:DI+0x28]
> ==>
> r155:V2DI=[r112:DI+0x20];
>
> Any comments? Thanks!

As said, the best copying decomposition depends on the follow-up uses and
the argument-passing ABI, which is why I suggested performing an SRA-like
analysis which collects the access patterns, and using that to drive the
heuristic expanding this special memcpy.

Richard.
Re: [RFC] propagation leap over memory copy for struct

Richard Biener writes:

> [...]
>
> I think for cases like this it might be a good idea to perform SRA-like
> analysis at RTL expansion time, when we know how parameters arrive (in
> pieces), and take that knowledge into account when assigning the RTL to
> a decl. The same applies for the return ABI. Since we rely on RTL to
> elide copies to/from return/argument registers/slots, we have to assign
> "layout compatible" registers to the corresponding auto vars.

Thanks for pointing out this! This looks like a kind of SRA, especially
for parm and return value. As you pointed out, there is
Re: [RFC] propagation leap over memory copy for struct

On Tue, 1 Nov 2022, Jiufu Guo wrote:

> [...]
>
> Currently, after early gimple optimization, the struct member access may
> still need to be in memory (if the mode of the struct is BLK).
> For example:
>
> _Bool foo (const A a) { return a.a[0] > 1.0; }
>
> The optimized gimple would be:
> _1 = a.a[0];
> _3 = _1 > 1.0e+0;
> return _3;
>
> During expansion to RTL, parm 'a' is first stored to memory from the arg
> regs, and "a.a[0]" is also read from memory. It may be better to use
> "f1" for "a.a[0]" here.
>
> Maybe method 3 is similar to your idea: using "parallel:BLK {DF;DF;DF;DF}"
> for the struct (BLK may be changed), and using 4 DF registers to access
> the structure in the expand pass.

I think for cases like this it might be a good idea to perform SRA-like
analysis at RTL expansion time, when we know how parameters arrive (in
pieces), and take that knowledge into account when assigning the RTL to a
decl. The same applies for the return ABI. Since we rely on RTL to elide
copies to/from return/argument registers/slots, we have to assign "layout
compatible" registers to the corresponding auto vars.

> Thanks again for your kind and helpful comments!
>
> BR,
> Jeff(Jiufu)

--
Richard Biener
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald,
Re: [RFC] propagation leap over memory copy for struct

On Mon, Oct 31, 2022 at 11:14 PM Jeff Law via Gcc-patches wrote:
>
> [...]
>
> The first thing that comes to my mind is to annotate memcpy calls that
> are structure assignments. The idea here is that we may want to expand
> a memcpy differently in those cases. Changing how we expand an opaque
> memcpy call is unlikely to be beneficial in most cases. But changing
> how we expand a structure copy may be beneficial by exposing the
> underlying field values. This would roughly correspond to your method #1.
>
> Or instead of changing how we expand, teach the optimizers about these
> annotated memcpy calls -- they're just a copy of each field. That's
> how CSE and the propagators could treat them. After some point we'd
> lower them in the usual ways, but at least early in the RTL pipeline we
> could keep them as annotated memcpy calls. This roughly corresponds to
> your second suggestion.

In the end it depends on the access patterns, so some analysis like SRA
performs would be nice. The issue with expanding memcpy on GIMPLE is that
we currently cannot express 'rep; movb;' or other target-specific
sequences from the cpymem-like optabs on GIMPLE, and recovering those
from piecewise copies on RTL is going to be difficult.
Re: [RFC] propgation leap over memory copy for struct
Segher Boessenkool writes: > On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote: >> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote: >> >We know that for struct variable assignment, memory copy may be used. >> >And for memcpy, we may load and store more bytes as possible at one time. >> >While it may be not best here: > >> So the first question in my mind is can we do better at the gimple >> phase? For the second case in particular can't we just "return a" >> rather than copying a into then returning ? This feels >> a lot like the return value optimization from C++. I'm not sure if it >> applies to the first case or not, it's been a long time since I looked >> at NRV optimizations, but it might be worth poking around in there a bit >> (tree-nrv.cc). > > If it is a bigger struct you end up with quite a lot of stuff in > registers. GCC will eventually put that all in memory so it will work > out fine in the end, but you are likely to get inefficient code. Yes. We may need to use memory to save regiters for big struct. Small struct may be practical to use registers. We may leverage the idea that: some type of small struct are passing to function through registers. > > OTOH, 8 bytes isn't as big as we would want these days, is it? So it > would be useful to put smaller temportaries, say 32 bytes and smaller, > in registers instead of in memory. I think you mean: we should try to registers to avoid memory accesing, and using registers would be ok for more bytes memcpy(32bytes). Great sugguestion, thanks a lot! Like below idea: [r100:TI, r101:TI] = src; //Or r100:OI/OO = src; dest = [r100:TI, r101:TI]; Currently, for 8bytes structure, we are using TImode for it. And subreg/fwprop/cse passes are able to optimize it as expected. Two concerns here: larger int modes(OI/OO/..) may be not introduced yet; I'm not sure if current infrastructure supports to use two more registers for one structure. 
> >> But even so, these kinds of things are still bound to happen, so it's
> >> probably worth thinking about if we can do better in RTL as well.
>
> Always. It is a mistake to think that having better high-level
> optimisations means that you don't need good low-level optimisations
> anymore: in fact deficiencies there become more glaringly apparent if
> the early pipeline opts become better :-)

Understood, thanks :)

> >> The first thing that comes to my mind is to annotate memcpy calls that
> >> are structure assignments. The idea here is that we may want to expand
> >> a memcpy differently in those cases. Changing how we expand an opaque
> >> memcpy call is unlikely to be beneficial in most cases. But changing
> >> how we expand a structure copy may be beneficial by exposing the
> >> underlying field values. This would roughly correspond to your method
> >> #1.
> >>
> >> Or instead of changing how we expand, teach the optimizers about these
> >> annotated memcpy calls -- they're just a copy of each field. That's
> >> how CSE and the propagators could treat them. After some point we'd
> >> lower them in the usual ways, but at least early in the RTL pipeline we
> >> could keep them as annotated memcpy calls. This roughly corresponds to
> >> your second suggestion.
>
> Ideally this won't ever make it as far as RTL, if the structures do not
> need to go via memory. All high-level optimisations should have been
> done earlier, and hopefully it was not expand itself that forced stuff
> into memory! :-/

Currently, after early gimple optimization, struct member accesses may still need to go through memory (if the mode of the struct is BLK). For example:

  _Bool foo (const A a) { return a.a[0] > 1.0; }

The optimized gimple would be:

  _1 = a.a[0];
  _3 = _1 > 1.0e+0;
  return _3;

During expansion to RTL, parameter 'a' is first stored to memory from the argument registers, and "a.a[0]" is also read back from memory. It may be better to use "f1" for "a.a[0]" here.
Maybe method3 is similar to your idea: using "parallel:BLK {DF;DF;DF;DF}" for the struct (BLK may be changed), and using 4 DF registers to access the structure in the expand pass.

Thanks again for your kind and helpful comments!

BR, Jeff(Jiufu)

> Segher
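The `_Bool foo` fragment quoted above can be made into a self-contained testcase. This is a plain C sketch of the source-level semantics under discussion (the struct type is the FN=4 variant used elsewhere in the thread); it says nothing about the RTL GCC actually emits:

```c
/* By-value struct parameter (BLK mode on most targets) whose first
   field is read back.  Today expand spills the argument registers to
   stack memory before reading a.a[0]; the suggestion in the thread is
   to use the incoming FP register ("f1" on powerpc) directly. */
#define FN 4
typedef struct { double a[FN]; } A;

_Bool foo (const A a) { return a.a[0] > 1.0; }
```

Compiling this at -O2 and inspecting the assembly should show the store/reload of the argument registers that the thread wants to eliminate.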
Re: [RFC] propagation leap over memory copy for struct
Jeff Law writes:

> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
>> Hi,
>>
>> We know that for struct variable assignment, memory copy may be used.
>> And for memcpy, we may load and store as many bytes as possible at one time.
>> While it may be not best here:
>> 1. Before/after a struct variable assignment, the variable may be operated on.
>> And it is hard for some optimizations to leap over memcpy. Then some struct
>> operations may be sub-optimal. Like the issue in PR65421.
>> 2. The size of a struct is mostly constant, so the memcpy would be expanded.
>> Using a small size to load/store and executing in parallel may not be slower
>> than using a large size to load/store. (Sure, more registers may be used for
>> smaller sizes.)
>>
>> In PR65421, for source code as below:
>> t.c
>> #define FN 4
>> typedef struct { double a[FN]; } A;
>>
>> A foo (const A *a) { return *a; }
>> A bar (const A a) { return a; }
>
> So the first question in my mind is can we do better at the gimple
> phase? For the second case in particular can't we just "return a"
> rather than copying a into <retval> then returning <retval>? This
> feels a lot like the return value optimization from C++. I'm not sure
> if it applies to the first case or not, it's been a long time since I
> looked at NRV optimizations, but it might be worth poking around in
> there a bit (tree-nrv.cc).

Thanks for pointing out this idea! Currently the optimized gimple looks like:

  D.3334 = a;
  return D.3334;

and

  D.3336 = *a_2(D);
  return D.3336;

It may be better to have "return a;" and "return *a;".

If the code looks like:

  typedef struct { double a[3]; long l; } A; // mixed types
  A foo (const A a) { return a; }
  A bar (const A *a) { return *a; }

the current optimized gimple looks like:

  <retval> = a;
  return <retval>;

and

  <retval> = *a_2(D);
  return <retval>;

"return a;" and "return *a;" may work here too.

> But even so, these kinds of things are still bound to happen, so it's
> probably worth thinking about if we can do better in RTL as well.

Yeap, thanks!
> The first thing that comes to my mind is to annotate memcpy calls that
> are structure assignments. The idea here is that we may want to
> expand a memcpy differently in those cases. Changing how we expand
> an opaque memcpy call is unlikely to be beneficial in most cases. But
> changing how we expand a structure copy may be beneficial by exposing
> the underlying field values. This would roughly correspond to your
> method #1.

Right. A general memcpy reads/writes the largest chunks it can at one time. Reading/writing individual small fields may only be beneficial for structure assignments.

> Or instead of changing how we expand, teach the optimizers about these
> annotated memcpy calls -- they're just a copy of each field.
> That's how CSE and the propagators could treat them. After some point
> we'd lower them in the usual ways, but at least early in the RTL
> pipeline we could keep them as annotated memcpy calls. This roughly
> corresponds to your second suggestion.

Thanks for your insights on this idea! Use the annotated memcpy for early optimizations, and treat it as a general memcpy in later passes.

Thanks again for your very helpful comments and suggestions!

BR, Jeff(Jiufu)

> jeff
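The mixed-type struct sketched in the reply above can be checked as plain C. This is only the source-level testcase (the field values in the checks below are illustrative), not a statement about the gimple GCC emits:

```c
/* Mixed-type variant from the discussion above: fields with different
   modes (DFmode doubles plus a long), so no single scalar mode covers
   the whole struct cleanly. */
typedef struct { double a[3]; long l; } A;

A foo (const A a) { return a; }   /* copy of a by-value argument  */
A bar (const A *a) { return *a; } /* copy through a const pointer */
```

Whether the NRV-style "return a" simplification applies should not depend on all fields having the same mode.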
Re: [RFC] propagation leap over memory copy for struct
Segher Boessenkool writes:

> Hi!
>
> On Mon, Oct 31, 2022 at 10:42:35AM +0800, Jiufu Guo wrote:
>> #define FN 4
>> typedef struct { double a[FN]; } A;
>>
>> A foo (const A *a) { return *a; }
>> A bar (const A a) { return a; }
>> ///
>>
>> If FN<=2, the size of "A" fits into TImode, and this code can be optimized
>> (by subreg/cse/fwprop/cprop) as:
>> ---
>> foo:
>> .LFB0:
>>   .cfi_startproc
>>   blr
>>
>> bar:
>> .LFB1:
>>   .cfi_startproc
>>   lfd 2,8(3)
>>   lfd 1,0(3)
>>   blr
>
> I think you swapped foo and bar here?

Oh, thanks!

>> If the size of "A" is larger than any INT mode size, RTL insns would be
>> generated as:
>> 13: r125:V2DI=[r112:DI+0x20]
>> 14: r126:V2DI=[r112:DI+0x30]
>> 15: [r112:DI]=r125:V2DI
>> 16: [r112:DI+0x10]=r126:V2DI   /// memcpy for assignment: D.3338 = arg;
>> 17: r127:DF=[r112:DI]
>> 18: r128:DF=[r112:DI+0x8]
>> 19: r129:DF=[r112:DI+0x10]
>> 20: r130:DF=[r112:DI+0x18]
>>
>> I'm thinking about ways to improve this.
>> Method1: One way may be changing the memory copy to reference the type
>> of the struct if the size of the struct is not too big, and generating insns
>> like the below:
>> 13: r125:DF=[r112:DI+0x20]
>> 15: r126:DF=[r112:DI+0x28]
>> 17: r127:DF=[r112:DI+0x30]
>> 19: r128:DF=[r112:DI+0x38]
>> 14: [r112:DI]=r125:DF
>> 16: [r112:DI+0x8]=r126:DF
>> 18: [r112:DI+0x10]=r127:DF
>> 20: [r112:DI+0x18]=r128:DF
>> 21: r129:DF=[r112:DI]
>> 22: r130:DF=[r112:DI+0x8]
>> 23: r131:DF=[r112:DI+0x10]
>> 24: r132:DF=[r112:DI+0x18]
>
> This is much worse though? The expansion with memcpy used V2DI, which
> typically is close to 2x faster than DFmode accesses.

Using V2DI helps access twice as many bytes at one time as DF/DI. But since those loads can be executed in parallel, using DF/DI may not be too bad.

> Or are you trying to avoid small reads of large stores here? Those
> aren't so bad; large reads of small stores is the nastiness we need to
> avoid.
Here, I want to use two DF loads, because optimizations like cse/fwprop/dse could then eliminate memory accesses of the same size.

> The code we have now does
>
> 15: [r112:DI]=r125:V2DI
> ...
> 17: r127:DF=[r112:DI]
> 18: r128:DF=[r112:DI+0x8]
>
> Can you make this optimised to not use a memory temporary at all, just
> immediately assign from r125 to r127 and r128?

r125 is not directly assigned to r127/r128, since 'insn 15' and 'insns 17/18' are expanded from different gimple stmts:

  D.3331 = a;      ==> 'insn 15' is generated for the struct assignment here.
  return D.3331;   ==> 'insns 17/18' are prepared for the return registers.

I'm trying to eliminate those memory temporaries, but have not found a good way yet. Methods 1-3 are the ideas I'm trying in order to delete those temporaries.

>> Method2: One way may be enhancing CSE to make it able to treat one large
>> memory slot as two (or more) combined slots:
>> 13: r125:V2DI#0=[r112:DI+0x20]
>> 13': r125:V2DI#8=[r112:DI+0x28]
>> 15: [r112:DI]#0=r125:V2DI#0
>> 15': [r112:DI]#8=r125:V2DI#8
>>
>> This may seem more of a hack in CSE.
>
> The current CSE pass is, of all our passes, the one most in need of a
> full rewrite, and has been for many, many years. It does a lot of
> things, important things that we should not lose, but it does a pretty
> bad job of CSE.
>
>> Method3: For some record types, use "PARALLEL:BLK" instead of "MEM:BLK".
>
> :BLK can never be optimised well. It always has to live in memory, by
> definition.

Thanks for your suggestions!

BR, Jeff (Jiufu)

> Segher
Re: [RFC] propagation leap over memory copy for struct
On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote:
> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
> >We know that for struct variable assignment, memory copy may be used.
> >And for memcpy, we may load and store as many bytes as possible at one time.
> >While it may be not best here:
> So the first question in my mind is can we do better at the gimple
> phase? For the second case in particular can't we just "return a"
> rather than copying a into <retval> then returning <retval>? This feels
> a lot like the return value optimization from C++. I'm not sure if it
> applies to the first case or not, it's been a long time since I looked
> at NRV optimizations, but it might be worth poking around in there a bit
> (tree-nrv.cc).

If it is a bigger struct you end up with quite a lot of stuff in registers. GCC will eventually put that all in memory so it will work out fine in the end, but you are likely to get inefficient code.

OTOH, 8 bytes isn't as big as we would want these days, is it? So it would be useful to put smaller temporaries, say 32 bytes and smaller, in registers instead of in memory.

> But even so, these kinds of things are still bound to happen, so it's
> probably worth thinking about if we can do better in RTL as well.

Always. It is a mistake to think that having better high-level optimisations means that you don't need good low-level optimisations anymore: in fact deficiencies there become more glaringly apparent if the early pipeline opts become better :-)

> The first thing that comes to my mind is to annotate memcpy calls that
> are structure assignments. The idea here is that we may want to expand
> a memcpy differently in those cases. Changing how we expand an opaque
> memcpy call is unlikely to be beneficial in most cases. But changing
> how we expand a structure copy may be beneficial by exposing the
> underlying field values. This would roughly correspond to your method
> #1.
> Or instead of changing how we expand, teach the optimizers about these
> annotated memcpy calls -- they're just a copy of each field. That's
> how CSE and the propagators could treat them. After some point we'd
> lower them in the usual ways, but at least early in the RTL pipeline we
> could keep them as annotated memcpy calls. This roughly corresponds to
> your second suggestion.

Ideally this won't ever make it as far as RTL, if the structures do not need to go via memory. All high-level optimisations should have been done earlier, and hopefully it was not expand itself that forced stuff into memory! :-/

Segher
Re: [RFC] propagation leap over memory copy for struct
Hi!

On Mon, Oct 31, 2022 at 10:42:35AM +0800, Jiufu Guo wrote:
> #define FN 4
> typedef struct { double a[FN]; } A;
>
> A foo (const A *a) { return *a; }
> A bar (const A a) { return a; }
> ///
>
> If FN<=2, the size of "A" fits into TImode, and this code can be optimized
> (by subreg/cse/fwprop/cprop) as:
> ---
> foo:
> .LFB0:
>   .cfi_startproc
>   blr
>
> bar:
> .LFB1:
>   .cfi_startproc
>   lfd 2,8(3)
>   lfd 1,0(3)
>   blr

I think you swapped foo and bar here?

> If the size of "A" is larger than any INT mode size, RTL insns would be
> generated as:
> 13: r125:V2DI=[r112:DI+0x20]
> 14: r126:V2DI=[r112:DI+0x30]
> 15: [r112:DI]=r125:V2DI
> 16: [r112:DI+0x10]=r126:V2DI   /// memcpy for assignment: D.3338 = arg;
> 17: r127:DF=[r112:DI]
> 18: r128:DF=[r112:DI+0x8]
> 19: r129:DF=[r112:DI+0x10]
> 20: r130:DF=[r112:DI+0x18]
>
> I'm thinking about ways to improve this.
> Method1: One way may be changing the memory copy to reference the type
> of the struct if the size of the struct is not too big, and generating insns
> like the below:
> 13: r125:DF=[r112:DI+0x20]
> 15: r126:DF=[r112:DI+0x28]
> 17: r127:DF=[r112:DI+0x30]
> 19: r128:DF=[r112:DI+0x38]
> 14: [r112:DI]=r125:DF
> 16: [r112:DI+0x8]=r126:DF
> 18: [r112:DI+0x10]=r127:DF
> 20: [r112:DI+0x18]=r128:DF
> 21: r129:DF=[r112:DI]
> 22: r130:DF=[r112:DI+0x8]
> 23: r131:DF=[r112:DI+0x10]
> 24: r132:DF=[r112:DI+0x18]

This is much worse though? The expansion with memcpy used V2DI, which typically is close to 2x faster than DFmode accesses.

Or are you trying to avoid small reads of large stores here? Those aren't so bad; large reads of small stores is the nastiness we need to avoid.

The code we have now does

  15: [r112:DI]=r125:V2DI
  ...
  17: r127:DF=[r112:DI]
  18: r128:DF=[r112:DI+0x8]

Can you make this optimised to not use a memory temporary at all, just immediately assign from r125 to r127 and r128?
> Method2: One way may be enhancing CSE to make it able to treat one large
> memory slot as two (or more) combined slots:
> 13: r125:V2DI#0=[r112:DI+0x20]
> 13': r125:V2DI#8=[r112:DI+0x28]
> 15: [r112:DI]#0=r125:V2DI#0
> 15': [r112:DI]#8=r125:V2DI#8
>
> This may seem more of a hack in CSE.

The current CSE pass is, of all our passes, the one most in need of a full rewrite, and has been for many, many years. It does a lot of things, important things that we should not lose, but it does a pretty bad job of CSE.

> Method3: For some record types, use "PARALLEL:BLK" instead of "MEM:BLK".

:BLK can never be optimised well. It always has to live in memory, by definition.

Segher
Re: [RFC] propagation leap over memory copy for struct
On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:

Hi,

We know that for struct variable assignment, memory copy may be used. And for memcpy, we may load and store as many bytes as possible at one time. While it may be not best here:

1. Before/after a struct variable assignment, the variable may be operated on. And it is hard for some optimizations to leap over memcpy. Then some struct operations may be sub-optimal. Like the issue in PR65421.

2. The size of a struct is mostly constant, so the memcpy would be expanded. Using a small size to load/store and executing in parallel may not be slower than using a large size to load/store. (Sure, more registers may be used for smaller sizes.)

In PR65421, for source code as below:

t.c
#define FN 4
typedef struct { double a[FN]; } A;

A foo (const A *a) { return *a; }
A bar (const A a) { return a; }

So the first question in my mind is can we do better at the gimple phase? For the second case in particular can't we just "return a" rather than copying a into <retval> then returning <retval>? This feels a lot like the return value optimization from C++. I'm not sure if it applies to the first case or not, it's been a long time since I looked at NRV optimizations, but it might be worth poking around in there a bit (tree-nrv.cc).

But even so, these kinds of things are still bound to happen, so it's probably worth thinking about if we can do better in RTL as well.

The first thing that comes to my mind is to annotate memcpy calls that are structure assignments. The idea here is that we may want to expand a memcpy differently in those cases. Changing how we expand an opaque memcpy call is unlikely to be beneficial in most cases. But changing how we expand a structure copy may be beneficial by exposing the underlying field values. This would roughly correspond to your method #1.

Or instead of changing how we expand, teach the optimizers about these annotated memcpy calls -- they're just a copy of each field. That's how CSE and the propagators could treat them.
After some point we'd lower them in the usual ways, but at least early in the RTL pipeline we could keep them as annotated memcpy calls. This roughly corresponds to your second suggestion. jeff
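A plain C sketch (not GCC internals; the function names are made up for illustration) of the two expansion strategies being contrasted: an opaque block copy that the optimizers treat as one memcpy call, versus a per-field copy whose individual loads and stores CSE and the propagators can see through:

```c
#include <string.h>

typedef struct { double a[4]; } A;

/* Opaque copy: one memcpy call, the fields are invisible to the
   optimizers until the call is lowered. */
static void copy_block (A *dst, const A *src)
{
  memcpy (dst, src, sizeof (A));
}

/* Field-wise copy: each element is a separate load/store that later
   passes can forward into users of dst. */
static void copy_fields (A *dst, const A *src)
{
  for (int i = 0; i < 4; i++)
    dst->a[i] = src->a[i];
}

/* Both strategies must agree byte-for-byte; only the visibility of the
   per-field values to later passes differs. */
static int copies_agree (const A *src)
{
  A d1, d2;
  copy_block (&d1, src);
  copy_fields (&d2, src);
  return memcmp (&d1, &d2, sizeof (A)) == 0;
}
```

The annotated-memcpy idea would keep the first form's compact expansion while letting early RTL passes reason about it like the second.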
[RFC] propagation leap over memory copy for struct
Hi,

We know that for struct variable assignment, memory copy may be used. And for memcpy, we may load and store as many bytes as possible at one time. While it may be not best here:

1. Before/after a struct variable assignment, the variable may be operated on. And it is hard for some optimizations to leap over memcpy. Then some struct operations may be sub-optimal. Like the issue in PR65421.

2. The size of a struct is mostly constant, so the memcpy would be expanded. Using a small size to load/store and executing in parallel may not be slower than using a large size to load/store. (Sure, more registers may be used for smaller sizes.)

In PR65421, for source code as below:

t.c
#define FN 4
typedef struct { double a[FN]; } A;

A foo (const A *a) { return *a; }
A bar (const A a) { return a; }
///

If FN<=2, the size of "A" fits into TImode, and this code can be optimized (by subreg/cse/fwprop/cprop) as:
---
foo:
.LFB0:
  .cfi_startproc
  blr

bar:
.LFB1:
  .cfi_startproc
  lfd 2,8(3)
  lfd 1,0(3)
  blr

If the size of "A" is larger than any INT mode size, RTL insns would be generated as:

13: r125:V2DI=[r112:DI+0x20]
14: r126:V2DI=[r112:DI+0x30]
15: [r112:DI]=r125:V2DI
16: [r112:DI+0x10]=r126:V2DI   /// memcpy for assignment: D.3338 = arg;
17: r127:DF=[r112:DI]
18: r128:DF=[r112:DI+0x8]
19: r129:DF=[r112:DI+0x10]
20: r130:DF=[r112:DI+0x18]

I'm thinking about ways to improve this.

Method1: One way may be changing the memory copy to reference the type of the struct if the size of the struct is not too big, and generating insns like the below:

13: r125:DF=[r112:DI+0x20]
15: r126:DF=[r112:DI+0x28]
17: r127:DF=[r112:DI+0x30]
19: r128:DF=[r112:DI+0x38]
14: [r112:DI]=r125:DF
16: [r112:DI+0x8]=r126:DF
18: [r112:DI+0x10]=r127:DF
20: [r112:DI+0x18]=r128:DF
21: r129:DF=[r112:DI]
22: r130:DF=[r112:DI+0x8]
23: r131:DF=[r112:DI+0x10]
24: r132:DF=[r112:DI+0x18]

Then passes (cse, prop, dse, ...) could help to optimize the code. Concerns of the method: we may not need to do this if the number of fields is too large.
And the types/modes of each load/store may depend on the platform and not be the same as the types of the fields of the struct. For example, for "struct {double a[3]; long long l;}" on ppc64le, DImode may be better for assignments of the parameter.

Method2: One way may be enhancing CSE to make it able to treat one large memory slot as two (or more) combined slots:

13: r125:V2DI#0=[r112:DI+0x20]
13': r125:V2DI#8=[r112:DI+0x28]
15: [r112:DI]#0=r125:V2DI#0
15': [r112:DI]#8=r125:V2DI#8

This may seem more of a hack in CSE.

Method3: For some record types, use "PARALLEL:BLK" instead of "MEM:BLK". To do this, "moving" between "PARALLEL<->PARALLEL" and "PARALLEL<->MEM" may need to be enhanced. This method may require more effort to make it work for corner/unknown cases.

I'm wondering which would be more flexible to handle this issue?

Thanks for any comments and suggestions!

BR, Jeff(Jiufu)
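For reference, the t.c testcase from PR65421 can be made self-contained. This only pins down the source-level semantics that any of the three methods must preserve; it says nothing about the RTL generated:

```c
/* t.c from PR65421, kept verbatim apart from the comments.  foo copies
   through a pointer, bar copies a by-value argument; both are plain
   struct copies that currently expand via a memcpy-style block move. */
#define FN 4
typedef struct { double a[FN]; } A;

A foo (const A *a) { return *a; }
A bar (const A a) { return a; }
```

With FN=4 the struct is 32 bytes, larger than TImode, which is what triggers the memory-temporary expansion shown earlier in the thread.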