Re: [PATCH v2 09/11] aarch64: Rewrite non-writeback ldp/stp patterns

2023-12-13 Thread Richard Sandiford

Re: [PATCH v2 09/11] aarch64: Rewrite non-writeback ldp/stp patterns

2023-12-13 Thread Alex Coplan

Re: [PATCH v2 09/11] aarch64: Rewrite non-writeback ldp/stp patterns

2023-12-12 Thread Richard Sandiford

[PATCH v2 09/11] aarch64: Rewrite non-writeback ldp/stp patterns

2023-12-05 Thread Alex Coplan
Hi,

This is a v2 version which addresses feedback from Richard's review
here:

https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637648.html

I'll reply inline to address specific comments.

Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

-- >8 --

This patch overhauls the load/store pair patterns with two main goals:

1. Fixing a correctness issue (the current patterns are not RA-friendly).
2. Allowing more flexibility in which operand modes are supported, and which
   combinations of modes are allowed in the two arms of the load/store pair,
   while reducing the number of patterns required both in the source and in
   the generated code.

The correctness issue (1) is due to the fact that the current patterns have
two independent memory operands tied together only by a predicate on the insns.
Since LRA only looks at the constraints, one of the memory operands can get
reloaded without the other one being changed, leading to the insn becoming
unrecognizable after reload.
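For illustration, the pre-patch patterns have roughly the following shape (a simplified sketch; the predicate, constraint, and condition here are only indicative of the real patterns):

```
;; Sketch of the old-style pattern: two independent mems, tied
;; together only by the insn's C condition, which LRA never re-checks.
(define_insn "load_pair_sw"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (match_operand:SI 1 "aarch64_mem_pair_operand" "Ump"))
   (set (match_operand:SI 2 "register_operand" "=r")
        (match_operand:SI 3 "memory_operand" "m"))]
  "rtx_equal_p (XEXP (operands[3], 0),
                plus_constant (Pmode,
                               XEXP (operands[1], 0),
                               GET_MODE_SIZE (SImode)))"
  "ldp\t%w0, %w2, %z1"
)
```

Nothing in the constraints ties operand 3 to operand 1, so reload is free to replace one mem without updating the other, at which point the insn condition no longer holds.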

We fix this issue by changing the patterns such that they only ever have one
memory operand representing the entire pair.  For the store case, we use an
unspec to logically concatenate the register operands before storing them.
For the load case, we use unspecs to extract the "lanes" from the pair mem,
with the second occurrence of the mem matched using a match_dup (such that there
is still really only one memory operand as far as the RA is concerned).
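Concretely, the new forms look roughly like this (a simplified sketch: the unspec names, predicates, and constraints are illustrative rather than exact, and the real patterns are generated via mode iterators):

```
;; Store pair: a single pair-mode mem, with the two register operands
;; logically concatenated by an unspec on the RHS.
(define_insn "*store_pair_sw"
  [(set (match_operand:V2x4QI 0 "aarch64_mem_pair_lanes_operand" "=Umn")
        (unspec:V2x4QI
          [(match_operand:SI 1 "register_operand" "r")
           (match_operand:SI 2 "register_operand" "r")]
          UNSPEC_STP))]
  ""
  "stp\t%w1, %w2, %y0"
)

;; Load pair: the second lane extraction re-matches the same mem with
;; match_dup, so the RA sees exactly one memory operand.
(define_insn "*load_pair_sw"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (unspec:SI
          [(match_operand:V2x4QI 1 "aarch64_mem_pair_lanes_operand" "Umn")]
          UNSPEC_LDP_FST))
   (set (match_operand:SI 2 "register_operand" "=r")
        (unspec:SI [(match_dup 1)] UNSPEC_LDP_SND))]
  ""
  "ldp\t%w0, %w2, %y1"
)
```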

In terms of the modes used for the pair memory operands, we canonicalize
these to V2x4QImode, V2x8QImode, and V2x16QImode.  These modes have not
only the correct size but also the correct alignment requirement for a
memory operand representing an entire load/store pair.  Unlike the other
two, V2x4QImode didn't previously exist, so had to be added with the
patch.
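The canonicalization amounts to a simple size-based mapping; a sketch of what aarch64_pair_mode_for_mode might look like (the exact signature and body are an assumption here, inferred from the description above):

```c
/* Sketch: canonical mode for a mem spanning a pair of operands of
   mode MODE (4-, 8-, or 16-byte operands only).  */
static machine_mode
aarch64_pair_mode_for_mode (machine_mode mode)
{
  switch (GET_MODE_SIZE (mode).to_constant ())
    {
    case 4:  return V2x4QImode;   /* e.g. SImode/SFmode pairs.  */
    case 8:  return V2x8QImode;   /* e.g. DImode/DFmode pairs.  */
    case 16: return V2x16QImode;  /* e.g. TImode/V4SImode pairs.  */
    default:
      gcc_unreachable ();
    }
}
```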

As with the previous patch generalizing the writeback patterns, this
patch aims to be flexible in the combinations of modes supported by the
patterns without requiring a large number of generated patterns by using
distinct mode iterators.

The new scheme means we only need a single (generated) pattern for each
load/store operation of a given operand size.  For the 4-byte and 8-byte
operand cases, we use the GPI iterator to synthesize the two patterns.
The 16-byte case is implemented as a separate pattern in the source (due
to only having a single possible alternative).

Since the UNSPEC patterns can't be interpreted by the dwarf2cfi code,
we add REG_CFA_OFFSET notes to the store pair insns emitted by
aarch64_save_callee_saves, so that correct CFI information can still be
generated.  Furthermore, we now unconditionally generate these CFA
notes on frame-related insns emitted by aarch64_save_callee_saves.
This is done in case the load/store pair pass later forms these saves
into pairs, since the CFA notes would then be needed.
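In terms of mechanism, attaching the notes might look something like the following (a sketch only; the surrounding logic in aarch64_save_callee_saves is more involved, and mem1/reg1 etc. are placeholder names):

```c
/* Sketch: dwarf2cfi can't see through the unspec store-pair RTL, so
   record each component store explicitly as a REG_CFA_OFFSET note on
   the frame-related insn.  */
add_reg_note (insn, REG_CFA_OFFSET, gen_rtx_SET (mem1, reg1));
add_reg_note (insn, REG_CFA_OFFSET, gen_rtx_SET (mem2, reg2));
RTX_FRAME_RELATED_P (insn) = 1;
```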

We also adjust the ldp/stp peepholes to generate the new form.  This is
done by switching the generation to use the
aarch64_gen_{load,store}_pair interface, making it easier to change the
form in the future if needed.  (The upcoming aarch64 load/store pair
pass likewise makes use of this interface.)

This patch also adds an "ldpstp" attribute to the non-writeback
load/store pair patterns, which is used by the post-RA load/store pair
pass to identify existing patterns and see if they can be promoted to
writeback variants.

One potential concern with using unspecs for the patterns is that it can block
optimization by the generic RTL passes.  This patch series tries to mitigate
this in two ways:
 1. The pre-RA load/store pair pass runs very late in the pre-RA pipeline.
 2. A later patch in the series adjusts the aarch64 mem{cpy,set} expansion to
emit individual loads/stores instead of ldp/stp.  These should then be
formed back into load/store pairs much later in the RTL pipeline by the
new load/store pair pass.

gcc/ChangeLog:

* config/aarch64/aarch64-ldpstp.md: Abstract ldp/stp
representation from peepholes, allowing use of new form.
* config/aarch64/aarch64-modes.def (V2x4QImode): Define.
* config/aarch64/aarch64-protos.h
(aarch64_finish_ldpstp_peephole): Declare.
(aarch64_swap_ldrstr_operands): Delete declaration.
(aarch64_gen_load_pair): Adjust parameters.
(aarch64_gen_store_pair): Likewise.
* config/aarch64/aarch64-simd.md (load_pair):
Delete.
(vec_store_pair): Delete.
(load_pair): Delete.
(vec_store_pair): Delete.
* config/aarch64/aarch64.cc (aarch64_pair_mode_for_mode): New.
(aarch64_gen_store_pair): Adjust to use new unspec form of stp.
Drop second mem from parameters.
(aarch64_gen_load_pair): Likewise.
(aarch64_pair_mem_from_base): New.
(aarch64_save_callee_saves): Emit REG_CFA_OFFSET notes for
frame-related saves.  Adjust call to aarch64_gen_store_pair.
(aarch64_restore_callee_saves): Adjust calls