Re: [PATCH v2 09/11] aarch64: Rewrite non-writeback ldp/stp patterns
Alex Coplan writes: > On 12/12/2023 15:58, Richard Sandiford wrote: >> Alex Coplan writes: >> > Hi, >> > >> > This is a v2 version which addresses feedback from Richard's review >> > here: >> > >> > https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637648.html >> > >> > I'll reply inline to address specific comments. >> > >> > Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk? >> > >> > Thanks, >> > Alex >> > >> > -- >8 -- >> > >> > This patch overhauls the load/store pair patterns with two main goals: >> > >> > 1. Fixing a correctness issue (the current patterns are not RA-friendly). >> > 2. Allowing more flexibility in which operand modes are supported, and >> > which >> >combinations of modes are allowed in the two arms of the load/store >> > pair, >> >while reducing the number of patterns required both in the source and in >> >the generated code. >> > >> > The correctness issue (1) is due to the fact that the current patterns have >> > two independent memory operands tied together only by a predicate on the >> > insns. >> > Since LRA only looks at the constraints, one of the memory operands can get >> > reloaded without the other one being changed, leading to the insn becoming >> > unrecognizable after reload. >> > >> > We fix this issue by changing the patterns such that they only ever have >> > one >> > memory operand representing the entire pair. For the store case, we use an >> > unspec to logically concatenate the register operands before storing them. >> > For the load case, we use unspecs to extract the "lanes" from the pair mem, >> > with the second occurrence of the mem matched using a match_dup (such that >> > there >> > is still really only one memory operand as far as the RA is concerned). >> > >> > In terms of the modes used for the pair memory operands, we canonicalize >> > these to V2x4QImode, V2x8QImode, and V2x16QImode. These modes have not >> > only the correct size but also correct alignment requirement for a >> > memory operand representing an entire load/store pair. Unlike the other >> > two, V2x4QImode didn't previously exist, so had to be added with the >> > patch. >> > >> > As with the previous patch generalizing the writeback patterns, this >> > patch aims to be flexible in the combinations of modes supported by the >> > patterns without requiring a large number of generated patterns by using >> > distinct mode iterators. >> > >> > The new scheme means we only need a single (generated) pattern for each >> > load/store operation of a given operand size. For the 4-byte and 8-byte >> > operand cases, we use the GPI iterator to synthesize the two patterns. >> > The 16-byte case is implemented as a separate pattern in the source (due >> > to only having a single possible alternative). >> > >> > Since the UNSPEC patterns can't be interpreted by the dwarf2cfi code, >> > we add REG_CFA_OFFSET notes to the store pair insns emitted by >> > aarch64_save_callee_saves, so that correct CFI information can still be >> > generated. Furthermore, we now unconditionally generate these CFA >> > notes on frame-related insns emitted by aarch64_save_callee_saves. >> > This is done in case that the load/store pair pass forms these into >> > pairs, in which case the CFA notes would be needed. >> > >> > We also adjust the ldp/stp peepholes to generate the new form. This is >> > done by switching the generation to use the >> > aarch64_gen_{load,store}_pair interface, making it easier to change the >> > form in the future if needed. (Likewise, the upcoming aarch64 >> > load/store pair pass also makes use of this interface). >> > >> > This patch also adds an "ldpstp" attribute to the non-writeback >> > load/store pair patterns, which is used by the post-RA load/store pair >> > pass to identify existing patterns and see if they can be promoted to >> > writeback variants. >> > >> > One potential concern with using unspecs for the patterns is that it can >> > block >> > optimization by the generic RTL passes. This patch series tries to >> > mitigate >> > this in two ways: >> > 1. The pre-RA load/store pair pass runs very late in the pre-RA pipeline. >> > 2. A later patch in the series adjusts the aarch64 mem{cpy,set} expansion >> > to >> > emit individual loads/stores instead of ldp/stp. These should then be >> > formed back into load/store pairs much later in the RTL pipeline by the >> > new load/store pair pass. >> > >> > gcc/ChangeLog: >> > >> > * config/aarch64/aarch64-ldpstp.md: Abstract ldp/stp >> > representation from peepholes, allowing use of new form. >> > * config/aarch64/aarch64-modes.def (V2x4QImode): Define. >> > * config/aarch64/aarch64-protos.h >> > (aarch64_finish_ldpstp_peephole): Declare. >> > (aarch64_swap_ldrstr_operands): Delete declaration. >> > (aarch64_gen_load_pair): Adjust parameters. >> > (aarch64_gen_store_pair): Likewise. >> > * config/aarch64/aarch64
Re: [PATCH v2 09/11] aarch64: Rewrite non-writeback ldp/stp patterns
On 12/12/2023 15:58, Richard Sandiford wrote: > Alex Coplan writes: > > Hi, > > > > This is a v2 version which addresses feedback from Richard's review > > here: > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637648.html > > > > I'll reply inline to address specific comments. > > > > Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk? > > > > Thanks, > > Alex > > > > -- >8 -- > > > > This patch overhauls the load/store pair patterns with two main goals: > > > > 1. Fixing a correctness issue (the current patterns are not RA-friendly). > > 2. Allowing more flexibility in which operand modes are supported, and which > >combinations of modes are allowed in the two arms of the load/store pair, > >while reducing the number of patterns required both in the source and in > >the generated code. > > > > The correctness issue (1) is due to the fact that the current patterns have > > two independent memory operands tied together only by a predicate on the > > insns. > > Since LRA only looks at the constraints, one of the memory operands can get > > reloaded without the other one being changed, leading to the insn becoming > > unrecognizable after reload. > > > > We fix this issue by changing the patterns such that they only ever have one > > memory operand representing the entire pair. For the store case, we use an > > unspec to logically concatenate the register operands before storing them. > > For the load case, we use unspecs to extract the "lanes" from the pair mem, > > with the second occurrence of the mem matched using a match_dup (such that > > there > > is still really only one memory operand as far as the RA is concerned). > > > > In terms of the modes used for the pair memory operands, we canonicalize > > these to V2x4QImode, V2x8QImode, and V2x16QImode. These modes have not > > only the correct size but also correct alignment requirement for a > > memory operand representing an entire load/store pair. Unlike the other > > two, V2x4QImode didn't previously exist, so had to be added with the > > patch. > > > > As with the previous patch generalizing the writeback patterns, this > > patch aims to be flexible in the combinations of modes supported by the > > patterns without requiring a large number of generated patterns by using > > distinct mode iterators. > > > > The new scheme means we only need a single (generated) pattern for each > > load/store operation of a given operand size. For the 4-byte and 8-byte > > operand cases, we use the GPI iterator to synthesize the two patterns. > > The 16-byte case is implemented as a separate pattern in the source (due > > to only having a single possible alternative). > > > > Since the UNSPEC patterns can't be interpreted by the dwarf2cfi code, > > we add REG_CFA_OFFSET notes to the store pair insns emitted by > > aarch64_save_callee_saves, so that correct CFI information can still be > > generated. Furthermore, we now unconditionally generate these CFA > > notes on frame-related insns emitted by aarch64_save_callee_saves. > > This is done in case that the load/store pair pass forms these into > > pairs, in which case the CFA notes would be needed. > > > > We also adjust the ldp/stp peepholes to generate the new form. This is > > done by switching the generation to use the > > aarch64_gen_{load,store}_pair interface, making it easier to change the > > form in the future if needed. (Likewise, the upcoming aarch64 > > load/store pair pass also makes use of this interface). > > > > This patch also adds an "ldpstp" attribute to the non-writeback > > load/store pair patterns, which is used by the post-RA load/store pair > > pass to identify existing patterns and see if they can be promoted to > > writeback variants. > > > > One potential concern with using unspecs for the patterns is that it can > > block > > optimization by the generic RTL passes. This patch series tries to mitigate > > this in two ways: > > 1. The pre-RA load/store pair pass runs very late in the pre-RA pipeline. > > 2. A later patch in the series adjusts the aarch64 mem{cpy,set} expansion > > to > > emit individual loads/stores instead of ldp/stp. These should then be > > formed back into load/store pairs much later in the RTL pipeline by the > > new load/store pair pass. > > > > gcc/ChangeLog: > > > > * config/aarch64/aarch64-ldpstp.md: Abstract ldp/stp > > representation from peepholes, allowing use of new form. > > * config/aarch64/aarch64-modes.def (V2x4QImode): Define. > > * config/aarch64/aarch64-protos.h > > (aarch64_finish_ldpstp_peephole): Declare. > > (aarch64_swap_ldrstr_operands): Delete declaration. > > (aarch64_gen_load_pair): Adjust parameters. > > (aarch64_gen_store_pair): Likewise. > > * config/aarch64/aarch64-simd.md (load_pair): > > Delete. > > (vec_store_pair): Delete. > > (load_pair): Delete. > > (vec_store_pair): Del
Re: [PATCH v2 09/11] aarch64: Rewrite non-writeback ldp/stp patterns
Alex Coplan writes: > Hi, > > This is a v2 version which addresses feedback from Richard's review > here: > > https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637648.html > > I'll reply inline to address specific comments. > > Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk? > > Thanks, > Alex > > -- >8 -- > > This patch overhauls the load/store pair patterns with two main goals: > > 1. Fixing a correctness issue (the current patterns are not RA-friendly). > 2. Allowing more flexibility in which operand modes are supported, and which >combinations of modes are allowed in the two arms of the load/store pair, >while reducing the number of patterns required both in the source and in >the generated code. > > The correctness issue (1) is due to the fact that the current patterns have > two independent memory operands tied together only by a predicate on the > insns. > Since LRA only looks at the constraints, one of the memory operands can get > reloaded without the other one being changed, leading to the insn becoming > unrecognizable after reload. > > We fix this issue by changing the patterns such that they only ever have one > memory operand representing the entire pair. For the store case, we use an > unspec to logically concatenate the register operands before storing them. > For the load case, we use unspecs to extract the "lanes" from the pair mem, > with the second occurrence of the mem matched using a match_dup (such that > there > is still really only one memory operand as far as the RA is concerned). > > In terms of the modes used for the pair memory operands, we canonicalize > these to V2x4QImode, V2x8QImode, and V2x16QImode. These modes have not > only the correct size but also correct alignment requirement for a > memory operand representing an entire load/store pair. Unlike the other > two, V2x4QImode didn't previously exist, so had to be added with the > patch. > > As with the previous patch generalizing the writeback patterns, this > patch aims to be flexible in the combinations of modes supported by the > patterns without requiring a large number of generated patterns by using > distinct mode iterators. > > The new scheme means we only need a single (generated) pattern for each > load/store operation of a given operand size. For the 4-byte and 8-byte > operand cases, we use the GPI iterator to synthesize the two patterns. > The 16-byte case is implemented as a separate pattern in the source (due > to only having a single possible alternative). > > Since the UNSPEC patterns can't be interpreted by the dwarf2cfi code, > we add REG_CFA_OFFSET notes to the store pair insns emitted by > aarch64_save_callee_saves, so that correct CFI information can still be > generated. Furthermore, we now unconditionally generate these CFA > notes on frame-related insns emitted by aarch64_save_callee_saves. > This is done in case that the load/store pair pass forms these into > pairs, in which case the CFA notes would be needed. > > We also adjust the ldp/stp peepholes to generate the new form. This is > done by switching the generation to use the > aarch64_gen_{load,store}_pair interface, making it easier to change the > form in the future if needed. (Likewise, the upcoming aarch64 > load/store pair pass also makes use of this interface). > > This patch also adds an "ldpstp" attribute to the non-writeback > load/store pair patterns, which is used by the post-RA load/store pair > pass to identify existing patterns and see if they can be promoted to > writeback variants. > > One potential concern with using unspecs for the patterns is that it can block > optimization by the generic RTL passes. This patch series tries to mitigate > this in two ways: > 1. The pre-RA load/store pair pass runs very late in the pre-RA pipeline. > 2. A later patch in the series adjusts the aarch64 mem{cpy,set} expansion to > emit individual loads/stores instead of ldp/stp. These should then be > formed back into load/store pairs much later in the RTL pipeline by the > new load/store pair pass. > > gcc/ChangeLog: > > * config/aarch64/aarch64-ldpstp.md: Abstract ldp/stp > representation from peepholes, allowing use of new form. > * config/aarch64/aarch64-modes.def (V2x4QImode): Define. > * config/aarch64/aarch64-protos.h > (aarch64_finish_ldpstp_peephole): Declare. > (aarch64_swap_ldrstr_operands): Delete declaration. > (aarch64_gen_load_pair): Adjust parameters. > (aarch64_gen_store_pair): Likewise. > * config/aarch64/aarch64-simd.md (load_pair): > Delete. > (vec_store_pair): Delete. > (load_pair): Delete. > (vec_store_pair): Delete. > * config/aarch64/aarch64.cc (aarch64_pair_mode_for_mode): New. > (aarch64_gen_store_pair): Adjust to use new unspec form of stp. > Drop second mem from parameters. > (aarch64_gen_load_pair): Likewise. > (aarch64
[PATCH v2 09/11] aarch64: Rewrite non-writeback ldp/stp patterns
Hi, This is a v2 version which addresses feedback from Richard's review here: https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637648.html I'll reply inline to address specific comments. Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk? Thanks, Alex -- >8 -- This patch overhauls the load/store pair patterns with two main goals: 1. Fixing a correctness issue (the current patterns are not RA-friendly). 2. Allowing more flexibility in which operand modes are supported, and which combinations of modes are allowed in the two arms of the load/store pair, while reducing the number of patterns required both in the source and in the generated code. The correctness issue (1) is due to the fact that the current patterns have two independent memory operands tied together only by a predicate on the insns. Since LRA only looks at the constraints, one of the memory operands can get reloaded without the other one being changed, leading to the insn becoming unrecognizable after reload. We fix this issue by changing the patterns such that they only ever have one memory operand representing the entire pair. For the store case, we use an unspec to logically concatenate the register operands before storing them. For the load case, we use unspecs to extract the "lanes" from the pair mem, with the second occurrence of the mem matched using a match_dup (such that there is still really only one memory operand as far as the RA is concerned). In terms of the modes used for the pair memory operands, we canonicalize these to V2x4QImode, V2x8QImode, and V2x16QImode. These modes have not only the correct size but also correct alignment requirement for a memory operand representing an entire load/store pair. Unlike the other two, V2x4QImode didn't previously exist, so had to be added with the patch. As with the previous patch generalizing the writeback patterns, this patch aims to be flexible in the combinations of modes supported by the patterns without requiring a large number of generated patterns by using distinct mode iterators. The new scheme means we only need a single (generated) pattern for each load/store operation of a given operand size. For the 4-byte and 8-byte operand cases, we use the GPI iterator to synthesize the two patterns. The 16-byte case is implemented as a separate pattern in the source (due to only having a single possible alternative). Since the UNSPEC patterns can't be interpreted by the dwarf2cfi code, we add REG_CFA_OFFSET notes to the store pair insns emitted by aarch64_save_callee_saves, so that correct CFI information can still be generated. Furthermore, we now unconditionally generate these CFA notes on frame-related insns emitted by aarch64_save_callee_saves. This is done in case that the load/store pair pass forms these into pairs, in which case the CFA notes would be needed. We also adjust the ldp/stp peepholes to generate the new form. This is done by switching the generation to use the aarch64_gen_{load,store}_pair interface, making it easier to change the form in the future if needed. (Likewise, the upcoming aarch64 load/store pair pass also makes use of this interface). This patch also adds an "ldpstp" attribute to the non-writeback load/store pair patterns, which is used by the post-RA load/store pair pass to identify existing patterns and see if they can be promoted to writeback variants. One potential concern with using unspecs for the patterns is that it can block optimization by the generic RTL passes. This patch series tries to mitigate this in two ways: 1. The pre-RA load/store pair pass runs very late in the pre-RA pipeline. 2. A later patch in the series adjusts the aarch64 mem{cpy,set} expansion to emit individual loads/stores instead of ldp/stp. These should then be formed back into load/store pairs much later in the RTL pipeline by the new load/store pair pass. gcc/ChangeLog: * config/aarch64/aarch64-ldpstp.md: Abstract ldp/stp representation from peepholes, allowing use of new form. * config/aarch64/aarch64-modes.def (V2x4QImode): Define. * config/aarch64/aarch64-protos.h (aarch64_finish_ldpstp_peephole): Declare. (aarch64_swap_ldrstr_operands): Delete declaration. (aarch64_gen_load_pair): Adjust parameters. (aarch64_gen_store_pair): Likewise. * config/aarch64/aarch64-simd.md (load_pair): Delete. (vec_store_pair): Delete. (load_pair): Delete. (vec_store_pair): Delete. * config/aarch64/aarch64.cc (aarch64_pair_mode_for_mode): New. (aarch64_gen_store_pair): Adjust to use new unspec form of stp. Drop second mem from parameters. (aarch64_gen_load_pair): Likewise. (aarch64_pair_mem_from_base): New. (aarch64_save_callee_saves): Emit REG_CFA_OFFSET notes for frame-related saves. Adjust call to aarch64_gen_store_pair (aarch64_restore_callee_saves): Adjust calls