Re: [Patch, aarch64, middle-end] v3: Move pair_fusion pass from aarch64 to middle-end

2024-05-22 Thread Alex Coplan
Hi Ajit,

You need to remove the header dependencies that are no longer required
for aarch64-ldp-fusion.o in t-aarch64 (not forgetting to update the
ChangeLog).  A few other minor nits below.

LGTM with those changes, but you'll need Richard S to approve.

Thanks a lot for doing this.

On 22/05/2024 00:16, Ajit Agarwal wrote:
> Hello Alex/Richard:
> 
> All comments are addressed.
> 
> Move pair fusion pass from aarch64-ldp-fusion.cc to middle-end
> to support multiple targets.
> 
> Common infrastructure of load store pair fusion is divided into
> target independent and target dependent code.
> 
> Target independent code is structured in the following files.
> gcc/pair-fusion.h
> gcc/pair-fusion.cc
> 
> Target independent code is generic code with pure virtual
> functions to interface between target independent and dependent
> code.
> 
> Bootstrapped and regtested on aarch64-linux-gnu.
> 
> Thanks & Regards
> Ajit
> 
> 
> 
> aarch64, middle-end: Move pair_fusion pass from aarch64 to middle-end
> 
> Move pair fusion pass from aarch64-ldp-fusion.cc to middle-end
> to support multiple targets.
> 
> Common infrastructure of load store pair fusion is divided into
> target independent and target dependent code.
> 
> Target independent code is structured in the following files.
> gcc/pair-fusion.h
> gcc/pair-fusion.cc
> 
> Target independent code is generic code with pure virtual
> functions to interface between target independent and dependent
> code.
> 
> 2024-05-22  Ajit Kumar Agarwal  
> 
> gcc/ChangeLog:
> 
>   * pair-fusion.h: Generic header code for load store pair fusion
>   that can be shared across different architectures.
>   * pair-fusion.cc: Generic source code implementation for
>   load store pair fusion that can be shared across different 
> architectures.
>   * Makefile.in: Add new object file pair-fusion.o.
>   * config/aarch64/aarch64-ldp-fusion.cc: Delete generic code and move it
>   to pair-fusion.cc in the middle-end.
>   * config/aarch64/t-aarch64: Add header file dependency on pair-fusion.h.
> ---
>  gcc/Makefile.in  |1 +
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 3298 +-
>  gcc/config/aarch64/t-aarch64 |2 +-
>  gcc/pair-fusion.cc   | 3013 
>  gcc/pair-fusion.h|  193 ++
>  5 files changed, 3286 insertions(+), 3221 deletions(-)
>  create mode 100644 gcc/pair-fusion.cc
>  create mode 100644 gcc/pair-fusion.h
> 
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index a7f15694c34..643342f623d 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1563,6 +1563,7 @@ OBJS = \
>   ipa-strub.o \
>   ipa.o \
>   ira.o \
> + pair-fusion.o \
>   ira-build.o \
>   ira-costs.o \
>   ira-conflicts.o \
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 085366cdf68..0af927231d3 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc

> diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
> index 78713558e7d..bdada08be70 100644
> --- a/gcc/config/aarch64/t-aarch64
> +++ b/gcc/config/aarch64/t-aarch64
> @@ -203,7 +203,7 @@ aarch64-early-ra.o: 
> $(srcdir)/config/aarch64/aarch64-early-ra.cc \
>  aarch64-ldp-fusion.o: $(srcdir)/config/aarch64/aarch64-ldp-fusion.cc \
>  $(CONFIG_H) $(SYSTEM_H) $(CORETYPES_H) $(BACKEND_H) $(RTL_H) $(DF_H) \
>  $(RTL_SSA_H) cfgcleanup.h tree-pass.h ordered-hash-map.h tree-dfa.h \
> -fold-const.h tree-hash-traits.h print-tree.h
> +fold-const.h tree-hash-traits.h print-tree.h pair-fusion.h

So now you also need to remove the deps on the includes removed in the latest
version of the patch.

>   $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
>   $(srcdir)/config/aarch64/aarch64-ldp-fusion.cc
>  
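
For illustration, after trimming, the rule might end up looking roughly like this. This is a hypothetical sketch: the dependency list has to mirror whatever the final version of aarch64-ldp-fusion.cc actually includes (the shorter list suggested in the v2 review below), so treat it as the shape of the fix, not the definitive set:

```make
# Hypothetical trimmed rule -- the prerequisites must match the final
# #include list in aarch64-ldp-fusion.cc; shown here only as a sketch.
aarch64-ldp-fusion.o: $(srcdir)/config/aarch64/aarch64-ldp-fusion.cc \
    $(CONFIG_H) $(SYSTEM_H) $(CORETYPES_H) $(BACKEND_H) $(RTL_H) \
    memmodel.h emit-rtl.h $(TM_P_H) rtl-iter.h tree-pass.h \
    insn-attr.h pair-fusion.h
	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
		$(srcdir)/config/aarch64/aarch64-ldp-fusion.cc
```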
> diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc
> new file mode 100644
> index 000..827b88cf2fc
> --- /dev/null
> +++ b/gcc/pair-fusion.cc
> @@ -0,0 +1,3013 @@
> +// Pass to fuse adjacent loads/stores into paired memory accesses.
> +// Copyright (C) 2024 Free Software Foundation, Inc.
> +//
> +// This file is part of GCC.
> +//
> +// GCC is free software; you can redistribute it and/or modify it
> +// under the terms of the GNU General Public License as published by
> +// the Free Software Foundation; either version 3, or (at your option)
> +// any later version.
> +//
> +// GCC is distributed in the hope that it will be useful, but
> +// WITHOUT ANY WARRANTY; without even the implied warranty of
> +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +// General Public License for more details.
> +//
> +// You should have received a copy of the GNU General Public License
> +// along with GCC; see the file COPYING3.  If not see
> +// <http://www.gnu.org/licenses/>.
> +
> +#define 

Re: [Patch, aarch64, middle-end] v2: Move pair_fusion pass from aarch64 to middle-end

2024-05-21 Thread Alex Coplan
Hi Ajit,

I've left some more comments below.  It's getting there now, thanks for
your patience.

On 21/05/2024 20:32, Ajit Agarwal wrote:
> Hello Alex/Richard:
> 
> All comments are addressed.
> 
> Move pair fusion pass from aarch64-ldp-fusion.cc to middle-end
> to support multiple targets.
> 
> Common infrastructure of load store pair fusion is divided into
> target independent and target dependent code.
> 
> Target independent code is structured in the following files.
> gcc/pair-fusion.h
> gcc/pair-fusion.cc
> 
> Target independent code is generic code with pure virtual
> functions to interface between target independent and dependent
> code.
> 
> Bootstrapped and regtested on aarch64-linux-gnu.
> 
> Thanks & Regards
> Ajit
> 
> 
> aarch64, middle-end: Move pair_fusion pass from aarch64 to middle-end
> 
> Move pair fusion pass from aarch64-ldp-fusion.cc to middle-end
> to support multiple targets.
> 
> Common infrastructure of load store pair fusion is divided into
> target independent and target dependent code.
> 
> Target independent code is structured in the following files.
> gcc/pair-fusion.h
> gcc/pair-fusion.cc
> 
> Target independent code is generic code with pure virtual
> functions to interface between target independent and dependent
> code.
> 
> 2024-05-21  Ajit Kumar Agarwal  
> 
> gcc/ChangeLog:
> 
>   * pair-fusion.h: Generic header code for load store pair fusion
>   that can be shared across different architectures.
>   * pair-fusion.cc: Generic source code implementation for
>   load store pair fusion that can be shared across different 
> architectures.
>   * Makefile.in: Add new object file pair-fusion.o.
>   * config/aarch64/aarch64-ldp-fusion.cc: Delete generic code and move it
>   to pair-fusion.cc in the middle-end.
>   * config/aarch64/t-aarch64: Add header file dependency pair-fusion.h.

insert "on" after dependency.

> ---
>  gcc/Makefile.in  |1 +
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 3282 +-
>  gcc/config/aarch64/t-aarch64 |2 +-
>  gcc/pair-fusion.cc   | 3013 
>  gcc/pair-fusion.h|  189 ++
>  5 files changed, 3280 insertions(+), 3207 deletions(-)
>  create mode 100644 gcc/pair-fusion.cc
>  create mode 100644 gcc/pair-fusion.h
> 
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index a7f15694c34..643342f623d 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1563,6 +1563,7 @@ OBJS = \
>   ipa-strub.o \
>   ipa.o \
>   ira.o \
> + pair-fusion.o \
>   ira-build.o \
>   ira-costs.o \
>   ira-conflicts.o \
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 085366cdf68..612f62060bc 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -40,262 +40,13 @@
>  
>  using namespace rtl_ssa;

I think we should drop this, since the public interface and remaining
backend code in this file is independent of RTL-SSA.  I think you should
also drop the include of "rtl-ssa.h" from this file.  These two
changes will force you to get the header file (pair-fusion.h) right.

With these changes we can also significantly thin out the include list
in this file.  The current set of includes is:

#define INCLUDE_ALGORITHM
#define INCLUDE_FUNCTIONAL
#define INCLUDE_LIST
#define INCLUDE_TYPE_TRAITS
#include "config.h"
#include "system.h"
#include "coretypes.h"
#include "backend.h"
#include "rtl.h"
#include "df.h"
#include "rtl-iter.h"
#include "rtl-ssa.h"
#include "cfgcleanup.h"
#include "tree-pass.h"
#include "ordered-hash-map.h"
#include "tree-dfa.h"
#include "fold-const.h"
#include "tree-hash-traits.h"
#include "print-tree.h"
#include "insn-attr.h"

I think instead the following should be enough for this file:

#include "config.h"
#include "system.h"
#include "coretypes.h"
#include "backend.h"
#include "rtl.h"
#include "memmodel.h"
#include "emit-rtl.h"
#include "tm_p.h"
#include "rtl-iter.h"
#include "tree-pass.h"
#include "insn-attr.h"
#include "pair-fusion.h"

>  
> +#include "pair-fusion.h"
> +
>  static constexpr HOST_WIDE_INT LDP_IMM_BITS = 7;
>  static constexpr HOST_WIDE_INT LDP_IMM_SIGN_BIT = (1 << (LDP_IMM_BITS - 1));
>  static constexpr HOST_WIDE_INT LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1;
>  static constexpr HOST_WIDE_INT LDP_MIN_IMM = -LDP_MAX_IMM - 1;
>  
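
As an aside, those four constants encode the signed 7-bit immediate field used by LDP/STP, so the range works out to [-64, 63]. The arithmetic can be sanity-checked in isolation (using int64_t in place of HOST_WIDE_INT):

```cpp
#include <cstdint>

// Mirrors the quoted constants: a signed N-bit field holds values in
// [-(1 << (N-1)), (1 << (N-1)) - 1], so for N = 7 that is [-64, 63].
constexpr int64_t LDP_IMM_BITS = 7;
constexpr int64_t LDP_IMM_SIGN_BIT = int64_t (1) << (LDP_IMM_BITS - 1);
constexpr int64_t LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1;
constexpr int64_t LDP_MIN_IMM = -LDP_MAX_IMM - 1;

static_assert (LDP_MAX_IMM == 63, "upper bound of signed 7-bit field");
static_assert (LDP_MIN_IMM == -64, "lower bound of signed 7-bit field");
```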

> diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
> index 78713558e7d..bdada08be70 100644
> --- a/gcc/config/aarch64/t-aarch64
> +++ b/gcc/config/aarch64/t-aarch64
> @@ -203,7 +203,7 @@ aarch64-early-ra.o: 
> $(srcdir)/config/aarch64/aarch64-early-ra.cc \
>  aarch64-ldp-fusion.o: $(srcdir)/config/aarch64/aarch64-ldp-fusion.cc \
>  $(CONFIG_H) $(SYSTEM_H) $(CORETYPES_H) $(BACKEND_H) $(RTL_H) $(DF_H) \
>  $(RTL_SSA_H) cfgcleanup.h tree-pass.h ordered-hash-map.h tree-dfa.h \
> -

Re: [Patch, aarch64, middle-end] Move pair_fusion pass from aarch64 to middle-end

2024-05-21 Thread Alex Coplan
On 20/05/2024 21:50, Ajit Agarwal wrote:
> Hello Alex/Richard:
> 
> Move pair fusion pass from aarch64-ldp-fusion.cc to middle-end
> to support multiple targets.
> 
> Common infrastructure of load store pair fusion is divided into
> target independent and target dependent code.
> 
> Target independent code is structured in the following files.
> gcc/pair-fusion.h
> gcc/pair-fusion.cc
> 
> Target independent code is generic code with pure virtual
> functions to interface between target independent and dependent
> code.
> 
> Bootstrapped and regtested on aarch64-linux-gnu.
> 
> Thanks & Regards
> Ajit
> 
> aarch64, middle-end: Move pair_fusion pass from aarch64 to middle-end
> 
> Move pair fusion pass from aarch64-ldp-fusion.cc to middle-end
> to support multiple targets.
> 
> Common infrastructure of load store pair fusion is divided into
> target independent and target dependent code.
> 
> Target independent code is structured in the following files.
> gcc/pair-fusion.h
> gcc/pair-fusion.cc
> 
> Target independent code is generic code with pure virtual
> functions to interface between target independent and dependent
> code.
> 
> 2024-05-20  Ajit Kumar Agarwal  
> 
> gcc/ChangeLog:
> 
>   * pair-fusion.h: Generic header code for load store fusion
>   that can be shared across different architectures.
>   * pair-fusion.cc: Generic source code implementation for
>   load store fusion that can be shared across different architectures.
>   * Makefile.in: Add new executable pair-fusion.o
>   * config/aarch64/aarch64-ldp-fusion.cc: Target specific
>   code for load store fusion of aarch64.

Apologies for missing this in the last review but you'll also need to
update gcc/config/aarch64/t-aarch64 to add a dependency on pair-fusion.h
for aarch64-ldp-fusion.o.

Thanks,
Alex

> ---
>  gcc/Makefile.in  |1 +
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 3303 +-
>  gcc/pair-fusion.cc   | 2852 +++
>  gcc/pair-fusion.h|  340 +++
>  4 files changed, 3268 insertions(+), 3228 deletions(-)
>  create mode 100644 gcc/pair-fusion.cc
>  create mode 100644 gcc/pair-fusion.h



Re: [Patch, aarch64, middle-end] Move pair_fusion pass from aarch64 to middle-end

2024-05-21 Thread Alex Coplan
On 21/05/2024 16:02, Ajit Agarwal wrote:
> Hello Alex:
> 
> On 21/05/24 1:16 am, Alex Coplan wrote:
> > On 20/05/2024 18:44, Alex Coplan wrote:
> >> Hi Ajit,
> >>
> >> On 20/05/2024 21:50, Ajit Agarwal wrote:
> >>> Hello Alex/Richard:
> >>>
> >>> Move pair fusion pass from aarch64-ldp-fusion.cc to middle-end
> >>> to support multiple targets.
> >>>
> >>> Common infrastructure of load store pair fusion is divided into
> >>> target independent and target dependent code.
> >>>
> >>> Target independent code is structured in the following files.
> >>> gcc/pair-fusion.h
> >>> gcc/pair-fusion.cc
> >>>
> >>> Target independent code is generic code with pure virtual
> >>> functions to interface between target independent and dependent
> >>> code.
> >>>
> >>> Bootstrapped and regtested on aarch64-linux-gnu.
> >>>
> >>> Thanks & Regards
> >>> Ajit
> >>>
> >>> aarch64, middle-end: Move pair_fusion pass from aarch64 to middle-end
> >>>
> >>> Move pair fusion pass from aarch64-ldp-fusion.cc to middle-end
> >>> to support multiple targets.
> >>>
> >>> Common infrastructure of load store pair fusion is divided into
> >>> target independent and target dependent code.
> >>>
> >>> Target independent code is structured in the following files.
> >>> gcc/pair-fusion.h
> >>> gcc/pair-fusion.cc
> >>>
> >>> Target independent code is generic code with pure virtual
> >>> functions to interface between target independent and dependent
> >>> code.
> >>>
> >>> 2024-05-20  Ajit Kumar Agarwal  
> >>>
> >>> gcc/ChangeLog:
> >>>
> >>>   * pair-fusion.h: Generic header code for load store fusion
> >>
> >> Insert "pair" before fusion?
> 
> Addressed in v1 of the patch.
> >>
> >>>   that can be shared across different architectures.
> >>>   * pair-fusion.cc: Generic source code implementation for
> >>>   load store fusion that can be shared across different architectures.
> >>
> >> Likewise.
> Addressed in v1 of the patch.
> >>
> >>>   * Makefile.in: Add new executable pair-fusion.o
> >>
> >> It's not an executable but an object file.
> >>
> >>>   * config/aarch64/aarch64-ldp-fusion.cc: Target specific
> >>>   code for load store fusion of aarch64.
> >>
> >> I guess this should say something like: "Delete generic code and move it
> >> to pair-fusion.cc in the middle-end."
> >>
> >> I've left some comments below on the header file.  The rest of the patch
> >> looks pretty good to me.  I tried diffing the original contents of
> >> aarch64-ldp-fusion.cc with pair-fusion.cc, and that looks as expected.
> >>
> > 
> > 
> > 
> >>> diff --git a/gcc/pair-fusion.h b/gcc/pair-fusion.h
> >>> new file mode 100644
> >>> index 000..00f6d3e149a
> >>> --- /dev/null
> >>> +++ b/gcc/pair-fusion.h
> >>> @@ -0,0 +1,340 @@
> >>> +// Pair Mem fusion generic header file.
> >>> +// Copyright (C) 2024 Free Software Foundation, Inc.
> >>> +//
> >>> +// This file is part of GCC.
> >>> +//
> >>> +// GCC is free software; you can redistribute it and/or modify it
> >>> +// under the terms of the GNU General Public License as published by
> >>> +// the Free Software Foundation; either version 3, or (at your option)
> >>> +// any later version.
> >>> +//
> >>> +// GCC is distributed in the hope that it will be useful, but
> >>> +// WITHOUT ANY WARRANTY; without even the implied warranty of
> >>> +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >>> +// General Public License for more details.
> >>> +//
> >>> +// You should have received a copy of the GNU General Public License
> >>> +// along with GCC; see the file COPYING3.  If not see
> >>> +// <http://www.gnu.org/licenses/>.
> >>> +
> >>> +#define INCLUDE_ALGORITHM
> >>> +#define INCLUDE_FUNCTIONAL
> >>> +#define INCLUDE_LIST
> >>> +#define INCLUDE_TYPE_TRAITS
> >>> +#include "config.h"
>

Re: [Patch, aarch64, middle-end] Move pair_fusion pass from aarch64 to middle-end

2024-05-20 Thread Alex Coplan
On 20/05/2024 18:44, Alex Coplan wrote:
> Hi Ajit,
> 
> On 20/05/2024 21:50, Ajit Agarwal wrote:
> > Hello Alex/Richard:
> > 
> > Move pair fusion pass from aarch64-ldp-fusion.cc to middle-end
> > to support multiple targets.
> > 
> > Common infrastructure of load store pair fusion is divided into
> > target independent and target dependent code.
> > 
> > Target independent code is structured in the following files.
> > gcc/pair-fusion.h
> > gcc/pair-fusion.cc
> > 
> > Target independent code is generic code with pure virtual
> > functions to interface between target independent and dependent
> > code.
> > 
> > Bootstrapped and regtested on aarch64-linux-gnu.
> > 
> > Thanks & Regards
> > Ajit
> > 
> > aarch64, middle-end: Move pair_fusion pass from aarch64 to middle-end
> > 
> > Move pair fusion pass from aarch64-ldp-fusion.cc to middle-end
> > to support multiple targets.
> > 
> > Common infrastructure of load store pair fusion is divided into
> > target independent and target dependent code.
> > 
> > Target independent code is structured in the following files.
> > gcc/pair-fusion.h
> > gcc/pair-fusion.cc
> > 
> > Target independent code is generic code with pure virtual
> > functions to interface between target independent and dependent
> > code.
> > 
> > 2024-05-20  Ajit Kumar Agarwal  
> > 
> > gcc/ChangeLog:
> > 
> > * pair-fusion.h: Generic header code for load store fusion
> 
> Insert "pair" before fusion?
> 
> > that can be shared across different architectures.
> > * pair-fusion.cc: Generic source code implementation for
> > load store fusion that can be shared across different architectures.
> 
> Likewise.
> 
> > * Makefile.in: Add new executable pair-fusion.o
> 
> It's not an executable but an object file.
> 
> > * config/aarch64/aarch64-ldp-fusion.cc: Target specific
> > code for load store fusion of aarch64.
> 
> I guess this should say something like: "Delete generic code and move it
> to pair-fusion.cc in the middle-end."
> 
> I've left some comments below on the header file.  The rest of the patch
> looks pretty good to me.  I tried diffing the original contents of
> aarch64-ldp-fusion.cc with pair-fusion.cc, and that looks as expected.
> 



> > diff --git a/gcc/pair-fusion.h b/gcc/pair-fusion.h
> > new file mode 100644
> > index 000..00f6d3e149a
> > --- /dev/null
> > +++ b/gcc/pair-fusion.h
> > @@ -0,0 +1,340 @@
> > +// Pair Mem fusion generic header file.
> > +// Copyright (C) 2024 Free Software Foundation, Inc.
> > +//
> > +// This file is part of GCC.
> > +//
> > +// GCC is free software; you can redistribute it and/or modify it
> > +// under the terms of the GNU General Public License as published by
> > +// the Free Software Foundation; either version 3, or (at your option)
> > +// any later version.
> > +//
> > +// GCC is distributed in the hope that it will be useful, but
> > +// WITHOUT ANY WARRANTY; without even the implied warranty of
> > +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +// General Public License for more details.
> > +//
> > +// You should have received a copy of the GNU General Public License
> > +// along with GCC; see the file COPYING3.  If not see
> > +// <http://www.gnu.org/licenses/>.
> > +
> > +#define INCLUDE_ALGORITHM
> > +#define INCLUDE_FUNCTIONAL
> > +#define INCLUDE_LIST
> > +#define INCLUDE_TYPE_TRAITS
> > +#include "config.h"
> > +#include "system.h"
> > +#include "coretypes.h"
> > +#include "backend.h"
> > +#include "rtl.h"
> > +#include "df.h"
> > +#include "rtl-iter.h"
> > +#include "rtl-ssa.h"
> 
> I'm not sure how desirable this is, but you might be able to
> forward-declare RTL-SSA types like this:
> 
> class def_info;
> class insn_info;
> class insn_range_info;
> 
> thus removing the need to include the header here, since the interface
> only refers to these types by pointer or reference.
> 
> Richard: please say if you'd prefer keeping the include.
> 
> > +#include "cfgcleanup.h"
> > +#include "tree-pass.h"
> > +#include "ordered-hash-map.h"
> > +#include "tree-dfa.h"
> > +#include "fold-const.h"
> > +#include "tree-hash-traits.h"
> > +#include "print-
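
The forward-declaration technique suggested above is standard C++: an interface that only refers to a type by pointer or reference needs just a declaration, not the full definition. A minimal self-contained illustration, with toy stand-in types rather than the actual RTL-SSA classes:

```cpp
// "Header" side: a forward declaration is enough, because the
// interface only ever refers to insn_info by pointer or reference.
class insn_info;  // stands in for the RTL-SSA type; no heavy #include

struct toy_walker_iface
{
  virtual insn_info *insn () const = 0;
  virtual ~toy_walker_iface () = default;
};

// "Implementation" side: only here is the full definition required.
class insn_info
{
public:
  explicit insn_info (int uid) : m_uid (uid) {}
  int uid () const { return m_uid; }
private:
  int m_uid;
};

struct toy_walker : toy_walker_iface
{
  explicit toy_walker (insn_info *i) : m_insn (i) {}
  insn_info *insn () const override { return m_insn; }
private:
  insn_info *m_insn;
};
```

Note the limitation: if the header ever needs the size of the type (a by-value member or a call through the object), the forward declaration no longer suffices and the include has to come back, which is presumably why the choice is left to Richard.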

Re: [Patch, aarch64] v6: Preparatory patch to place target independent and,dependent changed code in one file

2024-05-17 Thread Alex Coplan
Hi Ajit,

On 17/05/2024 18:05, Ajit Agarwal wrote:
> Hello Alex:
> 
> On 16/05/24 10:21 pm, Alex Coplan wrote:
> > Hi Ajit,
> > 
> > Thanks a lot for working through the review feedback.
> > 
> 
> Thanks a lot for reviewing the code and approving the patch.

To be clear, I didn't approve the patch because I can't, I just said
that it looks good to me.  You need an AArch64 maintainer (probably
Richard S) to approve it.

> 
> > The patch LGTM with the two minor suggested changes below.  I can't
> > approve the patch, though, so you'll need an OK from Richard S.
> > 
> > Also, I'm not sure if it makes sense to apply the patch in isolation, it
> > might make more sense to only apply it in series with follow-up patches to:
> >  - Finish renaming any bits of the generic code that need renaming (I
> >guess we'll want to rename at least ldp_bb_info to something else,
> >probably there are other bits too).
> >  - Move the generic parts out of gcc/config/aarch64 to a .cc file in the
> >middle-end.
> >
> 
> Addressed in separate patch sent.

Hmm, that doesn't look right.  You sent a single patch here:
https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652028.html
which looks to squash the work you've done in this patch together with
the move.

What I expect to see is a patch series, as follows:

[PATCH 1/3] aarch64: Split generic code from aarch64 code in ldp fusion
[PATCH 2/3] aarch64: Further renaming of generic code
[PATCH 3/3] aarch64, middle-end: Move pair_fusion pass from aarch64 to 
middle-end

where 1/3 is exactly the patch that I reviewed above with the two
(minor) requested changes (plus any changes requested by Richard), 2/3
(optionally) does further renaming to use generic terminology in the
generic code where needed/desired, and 3/3 does a straight cut/paste
move of code into pair-fusion.h and pair-fusion.cc, with no other
changes (save for perhaps a Makefile change and adding an include in
aarch64-ldp-fusion.cc).

Arguably you could split this even further and do the move of the
pair_fusion class to the new header in a separate patch prior to the
final move.

N.B. (IMO) the patches should be presented like this both for review and
(if approved) when committing.

Richard S may have further suggestions on how to split the patches /
make them more tractable to review, I think this is the bare minimum
that is needed though.

Hope that makes sense.

Thanks,
Alex

>  
> > I'll let Richard S make the final judgement on that.  I don't really
> > mind either way.
> 
> Sure.
> 
> Thanks & Regards
> Ajit
> > 
> > On 15/05/2024 15:06, Ajit Agarwal wrote:
> >> Hello Alex/Richard:
> >>
> >> All review comments are addressed.
> >>
> >> Common infrastructure of load store pair fusion is divided into target
> >> independent and target dependent changed code.
> >>
> >> Target independent code is generic code with pure virtual functions
> >> to interface between target independent and dependent code.
> >>
> >> Target dependent code is the implementation of pure virtual function for
> >> aarch64 target and the call to target independent code.
> >>
> >> Bootstrapped and regtested on aarch64-linux-gnu.
> >>
> >> Thanks & Regards
> >> Ajit
> >>
> >> aarch64: Preparatory patch to place target independent and
> >> dependent changed code in one file
> >>
> >> Common infrastructure of load store pair fusion is divided into target
> >> independent and target dependent changed code.
> >>
> >> Target independent code is generic code with pure virtual functions
> >> to interface between target independent and dependent code.
> >>
> >> Target dependent code is the implementation of pure virtual function for
> >> aarch64 target and the call to target independent code.
> >>
> >> 2024-05-15  Ajit Kumar Agarwal  
> >>
> >> gcc/ChangeLog:
> >>
> >>* config/aarch64/aarch64-ldp-fusion.cc: Place target
> >>independent and dependent changed code.
> >> ---
> >>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 533 +++
> >>  1 file changed, 357 insertions(+), 176 deletions(-)
> >>
> >> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> >> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >> index 1d9caeab05d..429e532ea3b 100644
> >> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >> @@ -138,6 +138,225 @@ struct alt_base
> >>poly_int64 offset;
> >>  };

Re: [Patch, aarch64] v6: Preparatory patch to place target independent and,dependent changed code in one file

2024-05-16 Thread Alex Coplan
Hi Ajit,

Thanks a lot for working through the review feedback.

The patch LGTM with the two minor suggested changes below.  I can't
approve the patch, though, so you'll need an OK from Richard S.

Also, I'm not sure if it makes sense to apply the patch in isolation, it
might make more sense to only apply it in series with follow-up patches to:
 - Finish renaming any bits of the generic code that need renaming (I
   guess we'll want to rename at least ldp_bb_info to something else,
   probably there are other bits too).
 - Move the generic parts out of gcc/config/aarch64 to a .cc file in the
   middle-end.

I'll let Richard S make the final judgement on that.  I don't really
mind either way.

On 15/05/2024 15:06, Ajit Agarwal wrote:
> Hello Alex/Richard:
> 
> All review comments are addressed.
> 
> Common infrastructure of load store pair fusion is divided into target
> independent and target dependent changed code.
> 
> Target independent code is generic code with pure virtual functions
> to interface between target independent and dependent code.
> 
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
> 
> Bootstrapped and regtested on aarch64-linux-gnu.
> 
> Thanks & Regards
> Ajit
> 
> aarch64: Preparatory patch to place target independent and
> dependent changed code in one file
> 
> Common infrastructure of load store pair fusion is divided into target
> independent and target dependent changed code.
> 
> Target independent code is generic code with pure virtual functions
> to interface between target independent and dependent code.
> 
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
> 
> 2024-05-15  Ajit Kumar Agarwal  
> 
> gcc/ChangeLog:
> 
>   * config/aarch64/aarch64-ldp-fusion.cc: Place target
>   independent and dependent changed code.
> ---
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 533 +++
>  1 file changed, 357 insertions(+), 176 deletions(-)
> 
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 1d9caeab05d..429e532ea3b 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -138,6 +138,225 @@ struct alt_base
>poly_int64 offset;
>  };
>  
> +// Virtual base class for load/store walkers used in alias analysis.
> +struct alias_walker
> +{
> +  virtual bool conflict_p (int ) const = 0;
> +  virtual insn_info *insn () const = 0;
> +  virtual bool valid () const = 0;
> +  virtual void advance () = 0;
> +};
> +
> +// When querying handle_writeback_opportunities, this enum is used to
> +// qualify which opportunities we are asking about.
> +enum class writeback {
> +  // Only those writeback opportunities that arise from existing
> +  // auto-increment accesses.
> +  EXISTING,

Very minor nit: I think an extra blank line here would be nice for readability
now that the enumerators have comments above.

> +  // All writeback opportunities including those that involve folding
> +  // base register updates into a non-writeback pair.
> +  ALL
> +};
> +

Can we have a block comment here which describes the purpose of the
class and how it fits together with the target?  Something like the
following would do:

// This class can be overridden by targets to give a pass that fuses
// adjacent loads and stores into load/store pair instructions.
//
// The target can override the various virtual functions to customize
// the behaviour of the pass as appropriate for the target.

> +struct pair_fusion {
> +  pair_fusion ()
> +  {
> +calculate_dominance_info (CDI_DOMINATORS);
> +df_analyze ();
> +crtl->ssa = new rtl_ssa::function_info (cfun);
> +  };
> +
> +  // Given:
> +  // - an rtx REG_OP, the non-memory operand in a load/store insn,
> +  // - a machine_mode MEM_MODE, the mode of the MEM in that insn, and
> +  // - a boolean LOAD_P (true iff the insn is a load), then:
> +  // return true if the access should be considered an FP/SIMD access.
> +  // Such accesses are segregated from GPR accesses, since we only want
> +  // to form pairs for accesses that use the same register file.
> +  virtual bool fpsimd_op_p (rtx, machine_mode, bool)
> +  {
> +return false;
> +  }
> +
> +  // Return true if we should consider forming pairs from memory
> +  // accesses with operand mode MODE at this stage in compilation.
> +  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
> +
> +  // Return true iff REG_OP is a suitable register operand for a paired
> +  // memory access, where LOAD_P is true if we're asking about loads and
> +  // false for stores.  MODE gives the mode of the operand.
> +  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
> +   machine_mode mode) = 0;
> +
> +  // Return alias check limit.
> +  // This is needed to avoid 
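
The overall shape under review here — a target-independent pass class whose pure virtual hooks each backend overrides — can be sketched in miniature. Toy names and trivial logic below, loosely modeled on the hooks quoted above; this illustrates the interface pattern only, not the actual GCC classes:

```cpp
// Generic side: the driver logic only ever calls virtual hooks, so it
// can live in target-independent code.
struct toy_pair_fusion
{
  // Hook with a safe default: targets with no FP/SIMD register-file
  // segregation simply inherit it.
  virtual bool fpsimd_op_p (int /*reg*/) const { return false; }

  // Pure virtual hook: each target decides which access sizes are
  // candidates for pairing.
  virtual bool pair_operand_mode_ok_p (int size) const = 0;

  // Generic driver, expressed entirely in terms of the hooks.
  bool candidate_p (int size) const { return pair_operand_mode_ok_p (size); }

  virtual ~toy_pair_fusion () = default;
};

// Target side: an aarch64-style implementation of the hooks.
struct toy_aarch64_fusion : toy_pair_fusion
{
  bool pair_operand_mode_ok_p (int size) const override
  {
    // Toy policy: pretend only 4- and 8-byte accesses form pairs.
    return size == 4 || size == 8;
  }
};
```

The review comments above apply to the real class in the same way: the generic half documents the contract of each hook, and the target half supplies only policy.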

Re: [Patch, aarch64] v4: Preparatory patch to place target independent and,dependent changed code in one file

2024-05-14 Thread Alex Coplan
Hi Ajit,

Please can you pay careful attention to the review comments?

In particular, you have ignored my comment about changing the access of
member functions in ldp_bb_info several times now (on at least three
patch reviews).

Likewise on multiple occasions you've only partially implemented a piece
of review feedback (e.g. applying the "override" keyword to virtual
overrides).

That all makes it rather tiresome to review your patches.

Also, I realise I should have mentioned this on a previous revision of
this patch, but I thought we previously agreed (with Richard S) to split
out the renaming in existing code (e.g. ldp/stp -> "paired access" and
so on) to a separate patch?  That would make this easier to review.

On 14/05/2024 15:08, Ajit Agarwal wrote:
> Hello Alex/Richard:
> 
> All comments are addressed.
> 
> Common infrastructure of load store pair fusion is divided into target
> independent and target dependent changed code.
> 
> Target independent code is generic code with pure virtual functions
> to interface between target independent and dependent code.
> 
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
> 
> Bootstrapped on aarch64-linux-gnu.
> 
> Thanks & Regards
> Ajit
> 
> 
> 
> aarch64: Preparatory patch to place target independent and
> dependent changed code in one file
> 
> Common infrastructure of load store pair fusion is divided into target
> independent and target dependent changed code.
> 
> Target independent code is generic code with pure virtual functions
> to interface between target independent and dependent code.
> 
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
> 
> 2024-05-14  Ajit Kumar Agarwal  
> 
> gcc/ChangeLog:
> 
>   * config/aarch64/aarch64-ldp-fusion.cc: Place target
>   independent and dependent changed code.
> ---
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 526 +++
>  1 file changed, 345 insertions(+), 181 deletions(-)
> 
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 1d9caeab05d..e6af4b0570a 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -138,6 +138,210 @@ struct alt_base
>poly_int64 offset;
>  };
>  
> +// Virtual base class for load/store walkers used in alias analysis.
> +struct alias_walker
> +{
> +  virtual bool conflict_p (int &budget) const = 0;
> +  virtual insn_info *insn () const = 0;
> +  virtual bool valid () const = 0;
> +  virtual void advance () = 0;
> +};
> +
> +// This is used in handle_writeback_opportunities describing
> +// ALL if aarch64_ldp_writeback > 1 otherwise check
> +// EXISTING if aarch64_ldp_writeback.

Since this enum belongs to the generic interface, it's best if it is
described in general terms, i.e. the comment shouldn't refer to the
aarch64 param.

How about:

// When querying handle_writeback_opportunities, this enum is used to
// qualify which opportunities we are asking about.

then above the EXISTING enumerator, you could say:

  // Only those writeback opportunities that arise from existing
  // auto-increment accesses.

and for ALL, you could say:

  // All writeback opportunities including those that involve folding
  // base register updates into a non-writeback pair.

> +enum class writeback {
> +  ALL,
> +  EXISTING
> +};

Also, sorry for the very minor nit, but I think it is more logical if we
flip the order of the enumerators here, i.e. EXISTING should come first.

> +
> +struct pair_fusion {
> +  pair_fusion ()
> +  {
> +calculate_dominance_info (CDI_DOMINATORS);
> +df_analyze ();
> +crtl->ssa = new rtl_ssa::function_info (cfun);
> +  };
> +
> +  // Given:
> +  // - an rtx REG_OP, the non-memory operand in a load/store insn,
> +  // - a machine_mode MEM_MODE, the mode of the MEM in that insn, and
> +  // - a boolean LOAD_P (true iff the insn is a load), then:
> +  // return true if the access should be considered an FP/SIMD access.
> +  // Such accesses are segregated from GPR accesses, since we only want
> +  // to form pairs for accesses that use the same register file.
> +  virtual bool fpsimd_op_p (rtx, machine_mode, bool)
> +  {
> +return false;
> +  }
> +
> +  // Return true if we should consider forming ldp/stp insns from memory

Replace "ldp/stp insns" with "pairs" here, since this is the generic
interface.

> +  // accesses with operand mode MODE at this stage in compilation.
> +  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
> +
> +  // Return true iff REG_OP is a suitable register operand for a paired
> +  // memory access, where LOAD_P is true if we're asking about loads and
> +  // false for stores.  MODE gives the mode of the operand.
> +  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
> +				      machine_mode mode) = 0;

Re: [Patch, aarch64] v3: Preparatory patch to place target independent and dependent changed code in one file

2024-05-13 Thread Alex Coplan
Hi Ajit,

Why did you send three mails for this revision of the patch?  If you're
going to send a new revision of the patch you should increment the
version number and outline the changes / reasons for the new revision.

Mostly the comments below are just style nits and things you missed from
the last review(s) (please try not to miss so many in the future).

On 09/05/2024 17:06, Ajit Agarwal wrote:
> Hello Alex/Richard:
> 
> All review comments are addressed.
> 
> Common infrastructure of load store pair fusion is divided into target
> independent and target dependent changed code.
> 
> Target independent code is the Generic code with pure virtual function
> to interface between target independent and dependent code.
> 
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
> 
> Bootstrapped on aarch64-linux-gnu.
> 
> Thanks & Regards
> Ajit
> 
> 
> 
> aarch64: Preparatory patch to place target independent and
> dependent changed code in one file
> 
> Common infrastructure of load store pair fusion is divided into target
> independent and target dependent changed code.
> 
> Target independent code is the Generic code with pure virtual function
> to interface between target independent and dependent code.
> 
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
> 
> 2024-05-09  Ajit Kumar Agarwal  
> 
> gcc/ChangeLog:
> 
>   * config/aarch64/aarch64-ldp-fusion.cc: Place target
>   independent and dependent changed code.
> ---
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 542 +++
>  1 file changed, 363 insertions(+), 179 deletions(-)
> 
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 1d9caeab05d..217790e111a 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -138,6 +138,224 @@ struct alt_base
>poly_int64 offset;
>  };
>  
> +// Virtual base class for load/store walkers used in alias analysis.
> +struct alias_walker
> +{
> +  virtual bool conflict_p (int &budget) const = 0;
> +  virtual insn_info *insn () const = 0;
> +  virtual bool valid () const = 0;
> +  virtual void advance () = 0;
> +};
> +
> +enum class writeback{

You missed a nit here.  Space before '{'.

> +  ALL,
> +  EXISTING
> +};

You also missed adding comments for the enum, please see the review for v2:
https://gcc.gnu.org/pipermail/gcc-patches/2024-May/651074.html

> +
> +struct pair_fusion {
> +  pair_fusion ()
> +  {
> +calculate_dominance_info (CDI_DOMINATORS);
> +df_analyze ();
> +crtl->ssa = new rtl_ssa::function_info (cfun);
> +  };
> +
> +  // Given:
> +  // - an rtx REG_OP, the non-memory operand in a load/store insn,
> +  // - a machine_mode MEM_MODE, the mode of the MEM in that insn, and
> +  // - a boolean LOAD_P (true iff the insn is a load), then:
> +  // return true if the access should be considered an FP/SIMD access.
> +  // Such accesses are segregated from GPR accesses, since we only want
> +  // to form pairs for accesses that use the same register file.
> +  virtual bool fpsimd_op_p (rtx, machine_mode, bool)
> +  {
> +return false;
> +  }
> +
> +  // Return true if we should consider forming ldp/stp insns from memory
> +  // accesses with operand mode MODE at this stage in compilation.
> +  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
> +
> +  // Return true iff REG_OP is a suitable register operand for a paired
> +  // memory access, where LOAD_P is true if we're asking about loads and
> +  // false for stores.  MEM_MODE gives the mode of the operand.
> +  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
> +   machine_mode mode) = 0;

The comment needs updating since we changed the name of the last param,
i.e. s/MEM_MODE/MODE/.

> +
> +  // Return alias check limit.
> +  // This is needed to avoid unbounded quadratic behaviour when
> +  // performing alias analysis.
> +  virtual int pair_mem_alias_check_limit () = 0;
> +
> +  // Returns true if we should try to handle writeback opportunities
> +  // (not whether there are any).
> +  virtual bool handle_writeback_opportunities (enum writeback which) = 0 ;

Heh, the bit in parens from the v2 review probably doesn't need to go
into the comment here.

Also you should describe WHICH in the comment.

> +
> +  // Given BASE_MEM, the mem from the lower candidate access for a pair,
> +  // and LOAD_P (true if the access is a load), check if we should proceed
> +  // to form the pair given the target's code generation policy on
> +  // paired accesses.
> +  virtual bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
> + machine_mode mode) = 0;

The name of the first param needs updating in the prototype, i.e.
s/first_mem/base_mem/.  I think you missed the bit about 

Re: [PATCH, aarch64] v2: Preparatory patch to place target independent and dependent changed code in one file

2024-05-08 Thread Alex Coplan
Hi Ajit,

Sorry for the long delay in reviewing this.

This is really getting there now.  I've left a few more comments below.

Apart from minor style things, the main remaining issues are mostly
around comments.  It's important to have good clear comments for
functions with the parameters (and return value, if any) clearly
described.  See https://www.gnu.org/prep/standards/standards.html#Comments

Note that this now needs a little rebasing, too.

On 21/04/2024 13:22, Ajit Agarwal wrote:
> Hello Alex/Richard:
> 
> All review comments are addressed and changes are made to transform_for_base
> function as per consensus.
> 
> Common infrastructure of load store pair fusion is divided into target
> independent and target dependent changed code.
> 
> Target independent code is the Generic code with pure virtual function
> to interface between target independent and dependent code.
> 
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
> 
> Bootstrapped on aarch64-linux-gnu.
> 
> Thanks & Regards
> Ajit
> 
> 
> 
> aarch64: Preparatory patch to place target independent and
> dependent changed code in one file
> 
> Common infrastructure of load store pair fusion is divided into target
> independent and target dependent changed code.
> 
> Target independent code is the Generic code with pure virtual function
> to interface between target independent and dependent code.
> 
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
> 
> 2024-04-21  Ajit Kumar Agarwal  
> 
> gcc/ChangeLog:
> 
>   * config/aarch64/aarch64-ldp-fusion.cc: Place target
>   independent and dependent changed code
> ---
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 484 +++
>  1 file changed, 325 insertions(+), 159 deletions(-)
> 
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 365dcf48b22..83a917e1d20 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -138,6 +138,189 @@ struct alt_base
>poly_int64 offset;
>  };
>  
> +// Virtual base class for load/store walkers used in alias analysis.
> +struct alias_walker
> +{
> +  virtual bool conflict_p (int &budget) const = 0;
> +  virtual insn_info *insn () const = 0;
> +  virtual bool valid () const = 0;
> +  virtual void advance () = 0;
> +};
> +
> +// Forward declaration to be used inside the aarch64_pair_fusion class.
> +bool ldp_operand_mode_ok_p (machine_mode mode);
> +rtx aarch64_destructure_load_pair (rtx regs[2], rtx pattern);
> +rtx aarch64_destructure_store_pair (rtx regs[2], rtx pattern);
> +rtx aarch64_gen_writeback_pair (rtx wb_effect, rtx pair_mem, rtx regs[2],
> + bool load_p);

I don't think we want to change the linkage of these, they should be kept
static.

> +enum class writeback{

Nit: space before '{'

> +  WRITEBACK_PAIR_P,
> +  WRITEBACK
> +};

We're going to want some more descriptive names here.  How about
EXISTING and ALL?  Note that the WRITEBACK_ prefix isn't necessary as
you're using an enum class, so uses of the enumerators need to be
prefixed with writeback:: anyway.  A comment describing the usage of the
enum as well as comments above the enumerators describing their
interpretation would be good.

> +
> +struct pair_fusion {
> +

Nit: excess blank line.

> +  pair_fusion ()
> +  {
> +calculate_dominance_info (CDI_DOMINATORS);
> +df_analyze ();
> +crtl->ssa = new rtl_ssa::function_info (cfun);
> +  };

Can we have one blank line between the virtual functions, please?  I
think that would be more readable now that there are comments above each
of them.

> +  // Return true if GPR is FP or SIMD accesses, passed
> +  // with GPR reg_op rtx, machine mode and load_p.

It's slightly awkward trying to document this without the parameter
names, but I can see that you're omitting them to avoid unused parameter
warnings.  One option would be to introduce names in the comment as you
go.  How about this instead:

// Given:
// - an rtx REG_OP, the non-memory operand in a load/store insn,
// - a machine_mode MEM_MODE, the mode of the MEM in that insn, and
// - a boolean LOAD_P (true iff the insn is a load), then:
// return true if the access should be considered an FP/SIMD access.
// Such accesses are segregated from GPR accesses, since we only want to
// form pairs for accesses that use the same register file.

> +  virtual bool fpsimd_op_p (rtx, machine_mode, bool)
> +  {
> +return false;
> +  }
> +  // Return true if pair operand mode is ok. Passed with
> +  // machine mode.

Could you use something closer to the comment that is already above
ldp_operand_mode_ok_p?  The purpose of this predicate is really to test
the following: "is it a good idea (for optimization) to form paired
accesses with this operand mode at this stage in compilation?"

> + 

Re: [PATCH v2] aarch64: Preserve mem info on change of base for ldp/stp [PR114674]

2024-05-07 Thread Alex Coplan
On 12/04/2024 12:13, Richard Sandiford wrote:
> Alex Coplan  writes:
> > This is a v2 because I accidentally sent a WIP version of the patch last
> > time round which used replace_equiv_address instead of
> > replace_equiv_address_nv; that caused some ICEs (pointed out by the
> > Linaro CI) since pair addressing modes aren't a subset of the addresses
> > that are accepted by memory_operand for a given mode.
> >
> > This patch should otherwise be identical to v1.  Bootstrapped/regtested
> > on aarch64-linux-gnu (indeed this is the patch I actually tested last
> > time), is this version also OK for GCC 15?
> 
> OK, thanks.  Sorry for missing this in the first review.

Now pushed to trunk, thanks.

Alex

> 
> Richard
> 
> > Thanks,
> > Alex
> >
> > --- >8 ---
> >
> > The ldp/stp fusion pass can change the base of an access so that the two
> > accesses end up using a common base register.  So far we have been using
> > adjust_address_nv to do this, but this means that we don't preserve
> > other properties of the mem we're replacing.  It seems better to use
> > replace_equiv_address_nv, as this will preserve e.g. the MEM_ALIGN of the
> > mem whose address we're changing.
> >
> > The PR shows that by adjusting the other mem we lose alignment
> > information about the original access and therefore end up rejecting an
> > otherwise viable pair when --param=aarch64-stp-policy=aligned is passed.
> > This patch fixes that by using replace_equiv_address_nv instead.
> >
> > Notably this is the same approach as taken by
> > aarch64_check_consecutive_mems when a change of base is required, so
> > this at least makes things more consistent between the ldp fusion pass
> > and the peepholes.
> >
> > gcc/ChangeLog:
> >
> > PR target/114674
> > * config/aarch64/aarch64-ldp-fusion.cc (ldp_bb_info::fuse_pair):
> > Use replace_equiv_address_nv on a change of base instead of
> > adjust_address_nv on the other access.
> >
> > gcc/testsuite/ChangeLog:
> >
> > PR target/114674
> > * gcc.target/aarch64/pr114674.c: New test.
> >
> > diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> > index 365dcf48b22..d07d79df06c 100644
> > --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> > +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> > @@ -1730,11 +1730,11 @@ ldp_bb_info::fuse_pair (bool load_p,
> > adjust_amt *= -1;
> >  
> >rtx change_reg = XEXP (change_pat, !load_p);
> > -  machine_mode mode_for_mem = GET_MODE (change_mem);
> >rtx effective_base = drop_writeback (base_mem);
> > -  rtx new_mem = adjust_address_nv (effective_base,
> > -				   mode_for_mem,
> > -				   adjust_amt);
> > +  rtx adjusted_addr = plus_constant (Pmode,
> > +				     XEXP (effective_base, 0),
> > +				     adjust_amt);
> > +  rtx new_mem = replace_equiv_address_nv (change_mem, adjusted_addr);
> >rtx new_set = load_p
> > ? gen_rtx_SET (change_reg, new_mem)
> > : gen_rtx_SET (new_mem, change_reg);
> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr114674.c b/gcc/testsuite/gcc.target/aarch64/pr114674.c
> > new file mode 100644
> > index 000..944784fd008
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/pr114674.c
> > @@ -0,0 +1,17 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 --param=aarch64-stp-policy=aligned" } */
> > +typedef struct {
> > +   unsigned int f1;
> > +   unsigned int f2;
> > +} test_struct;
> > +
> > +static test_struct ts = {
> > +   123, 456
> > +};
> > +
> > +void foo(void)
> > +{
> > +   ts.f2 = 36969 * (ts.f2 & 65535) + (ts.f1 >> 16);
> > +   ts.f1 = 18000 * (ts.f2 & 65535) + (ts.f2 >> 16);
> > +}
> > +/* { dg-final { scan-assembler-times "stp" 1 } } */


[PATCH] aarch64: Fix typo in aarch64-ldp-fusion.cc:combine_reg_notes [PR114936]

2024-05-03 Thread Alex Coplan
This fixes a typo in combine_reg_notes in the load/store pair fusion
pass.  As it stands, the calls to filter_notes store any
REG_FRAME_RELATED_EXPR to fr_expr with the following association:

 - i2 -> fr_expr[0]
 - i1 -> fr_expr[1]

but then the checks inside the following if statement expect the
opposite (more natural) association, i.e.:

 - i2 -> fr_expr[1]
 - i1 -> fr_expr[0]

this patch fixes the oversight by swapping the fr_expr indices in the
calls to filter_notes.

In hindsight it would probably have been less confusing / error-prone to
have combine_reg_notes take an array of two insns, then we wouldn't have
to mix 1-based and 0-based indexing as well as remembering to call
filter_notes in reverse program order.  This however is a minimal fix
for backporting purposes.

Many thanks to Matthew for spotting this typo and pointing it out to me.

Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk and the 14
branch after the 14.1 release?

Thanks,
Alex

gcc/ChangeLog:

PR target/114936
* config/aarch64/aarch64-ldp-fusion.cc (combine_reg_notes):
Ensure insn iN has its REG_FRAME_RELATED_EXPR (if any) stored in
FR_EXPR[N-1], thus matching the correspondence expected by the
copy_rtx calls.
diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 0bc225dae7b..12ef305d8d3 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -1085,9 +1085,9 @@ combine_reg_notes (insn_info *i1, insn_info *i2, bool load_p)
   bool found_eh_region = false;
   rtx result = NULL_RTX;
   result = filter_notes (REG_NOTES (i2->rtl ()), result,
-			 &found_eh_region, fr_expr);
-  result = filter_notes (REG_NOTES (i1->rtl ()), result,
 			 &found_eh_region, fr_expr + 1);
+  result = filter_notes (REG_NOTES (i1->rtl ()), result,
+			 &found_eh_region, fr_expr);
 
   if (!load_p)
 {


Re: [PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file

2024-05-03 Thread Alex Coplan
On 22/04/2024 13:01, Ajit Agarwal wrote:
> Hello Alex:
> 
> On 14/04/24 10:29 pm, Ajit Agarwal wrote:
> > Hello Alex:
> > 
> > On 12/04/24 11:02 pm, Ajit Agarwal wrote:
> >> Hello Alex:
> >>
> >> On 12/04/24 8:15 pm, Alex Coplan wrote:
> >>> On 12/04/2024 20:02, Ajit Agarwal wrote:
> >>>> Hello Alex:
> >>>>
> >>>> On 11/04/24 7:55 pm, Alex Coplan wrote:
> >>>>> On 10/04/2024 23:48, Ajit Agarwal wrote:
> >>>>>> Hello Alex:
> >>>>>>
> >>>>>> On 10/04/24 7:52 pm, Alex Coplan wrote:
> >>>>>>> Hi Ajit,
> >>>>>>>
> >>>>>>> On 10/04/2024 15:31, Ajit Agarwal wrote:
> >>>>>>>> Hello Alex:
> >>>>>>>>
> >>>>>>>> On 10/04/24 1:42 pm, Alex Coplan wrote:
> >>>>>>>>> Hi Ajit,
> >>>>>>>>>
> >>>>>>>>> On 09/04/2024 20:59, Ajit Agarwal wrote:
> >>>>>>>>>> Hello Alex:
> >>>>>>>>>>
> >>>>>>>>>> On 09/04/24 8:39 pm, Alex Coplan wrote:
> >>>>>>>>>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
> >>>>>>>>>>>> Hello Alex:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
> >>>>>>>>>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
> >>>>>>>>>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
> >>>>>>>>>>>>>>>> Hello Alex/Richard:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> All review comments are incorporated.
> >>>>>>> 
> >>>>>>>>>>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
> >>>>>>>>>>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, 
> >>>>>>>>>>>>>>>> call
> >>>>>>>>>>>>>>>>  // merge_pairs.
> >>>>>>>>>>>>>>>>  void
> >>>>>>>>>>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>>>>>>>>>> -				 access_group &group)
> >>>>>>>>>>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>>>>>>>>>> +					 access_group &group)
> >>>>>>>>>>>>>>>>  {
> >>>>>>>>>>>>>>>>const auto lfs = decode_lfs (encoded_lfs);
> >>>>>>>>>>>>>>>>const unsigned access_size = lfs.size;
> >>>>>>>>>>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int 
> >>>>>>>>>>>>>>>> encoded_lfs,
> >>>>>>>>>>>>>>>> access.cand_insns,
> >>>>>>>>>>>>>>>> lfs.load_p,
> >>>>>>>>>>>>>>>> access_size);
> >>>>>>>>>>>>>>>> -  skip_next = access.cand_insns.empty ();
> >>>>>>>>>>>>>>>> +  skip_next = bb_state->cand_insns_empty_p 
> >>>>>>>>>>>>>>>> (access.cand_insns);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> As above, why is this needed?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> For rs6000 we want to return always true. as load store pair
> >>>>

cfgrtl: Fix MEM_EXPR update in duplicate_insn_chain [PR114924]

2024-05-02 Thread Alex Coplan
Hi,

The PR shows that when cfgrtl.cc:duplicate_insn_chain attempts to
update the MR_DEPENDENCE_CLIQUE information for a MEM_EXPR we can end up
accidentally dropping (e.g.) an ARRAY_REF from the MEM_EXPR and end up
replacing it with the underlying MEM_REF.  This leads to an
inconsistency in the MEM_EXPR information, and could lead to wrong code.

While the walk down to the MEM_REF is necessary to update
MR_DEPENDENCE_CLIQUE, we should use the outer tree expression for the
MEM_EXPR.  This patch does that.

Bootstrapped/regtested on aarch64-linux-gnu, no regressions.  OK for
trunk?  What about backports?

Thanks,
Alex

gcc/ChangeLog:

PR rtl-optimization/114924
* cfgrtl.cc (duplicate_insn_chain): When updating MEM_EXPRs,
don't strip (e.g.) ARRAY_REFs from the final MEM_EXPR.
diff --git a/gcc/cfgrtl.cc b/gcc/cfgrtl.cc
index 304c429c99b..a5dc3512159 100644
--- a/gcc/cfgrtl.cc
+++ b/gcc/cfgrtl.cc
@@ -4432,12 +4432,13 @@ duplicate_insn_chain (rtx_insn *from, rtx_insn *to,
   since MEM_EXPR is shared so make a copy and
   walk to the subtree again.  */
tree new_expr = unshare_expr (MEM_EXPR (*iter));
+   tree orig_new_expr = new_expr;
if (TREE_CODE (new_expr) == WITH_SIZE_EXPR)
  new_expr = TREE_OPERAND (new_expr, 0);
while (handled_component_p (new_expr))
  new_expr = TREE_OPERAND (new_expr, 0);
MR_DEPENDENCE_CLIQUE (new_expr) = newc;
-	   set_mem_expr (const_cast <rtx> (*iter), new_expr);
+	   set_mem_expr (const_cast <rtx> (*iter), orig_new_expr);
  }
  }
}


Re: [PATCH] wwwdocs: Add note to changes.html for __has_{feature,extension}

2024-04-26 Thread Alex Coplan
On 26/04/2024 09:14, Marek Polacek wrote:
> On Fri, Apr 26, 2024 at 11:12:54AM +0100, Alex Coplan wrote:
> > On 17/04/2024 11:41, Marek Polacek wrote:
> > > On Mon, Apr 15, 2024 at 11:13:27AM +0100, Alex Coplan wrote:
> > > > On 04/04/2024 11:00, Alex Coplan wrote:
> > > > > Hi,
> > > > > 
> > > > > This adds a note to the GCC 14 release notes mentioning support for
> > > > > __has_{feature,extension} (PR60512).
> > > > > 
> > > > > OK to commit?
> > > > 
> > > > Ping.  Is this changes.html patch OK?  I guess it needs a review from 
> > > > C++
> > > > maintainers since it adds to the C++ section.
> > > > 
> > > > > 
> > > > > Thanks,
> > > > > Alex
> > > > 
> > > > > diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
> > > > > index 9fd224c1..facead8d 100644
> > > > > --- a/htdocs/gcc-14/changes.html
> > > > > +++ b/htdocs/gcc-14/changes.html
> > > > > @@ -242,6 +242,12 @@ a work-in-progress.
> > > > >constinit and optimized dynamic 
> > > > > initialization
> > > > >  
> > > > >
> > > > > +  The Clang language extensions __has_feature and
> > > > > +__has_extension have been implemented in GCC.  These
> > > > > +are available from C, C++, and Objective-C(++).
> > > 
> > > Since the extension is for the whole c-family, not just C++, I think it
> > > belongs to a "C family" section.  See e.g. 
> > > <https://gcc.gnu.org/gcc-13/changes.html>.
> > 
> > Thanks, I agree that makes more sense.  How about this version instead then:
> 
> Thanks, I think you can go ahead with this.

Great, I've pushed that to wwwdocs.

>  
> > diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
> > index fce0fb44..42353955 100644
> > --- a/htdocs/gcc-14/changes.html
> > +++ b/htdocs/gcc-14/changes.html
> > @@ -303,7 +303,15 @@ a work-in-progress.
> >Further clean up and improvements to the GNAT code.
> >  
> >  
> > -
> > +C family
> > +
> > +  The Clang language extensions __has_feature and
> > +__has_extension have been implemented in GCC.  These
> > +are available from C, C++, and Objective-C(++).
> > +This is primarily intended to aid the portability of code written
> > +against Clang.
> > +  
> > +
> >  
> >  C
> 
> Marek
> 


Re: [PATCH] wwwdocs: Add note to changes.html for __has_{feature,extension}

2024-04-26 Thread Alex Coplan
On 17/04/2024 11:41, Marek Polacek wrote:
> On Mon, Apr 15, 2024 at 11:13:27AM +0100, Alex Coplan wrote:
> > On 04/04/2024 11:00, Alex Coplan wrote:
> > > Hi,
> > > 
> > > This adds a note to the GCC 14 release notes mentioning support for
> > > __has_{feature,extension} (PR60512).
> > > 
> > > OK to commit?
> > 
> > Ping.  Is this changes.html patch OK?  I guess it needs a review from C++
> > maintainers since it adds to the C++ section.
> > 
> > > 
> > > Thanks,
> > > Alex
> > 
> > > diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
> > > index 9fd224c1..facead8d 100644
> > > --- a/htdocs/gcc-14/changes.html
> > > +++ b/htdocs/gcc-14/changes.html
> > > @@ -242,6 +242,12 @@ a work-in-progress.
> > >constinit and optimized dynamic initialization
> > >  
> > >
> > > +  The Clang language extensions __has_feature and
> > > +__has_extension have been implemented in GCC.  These
> > > +are available from C, C++, and Objective-C(++).
> 
> Since the extension is for the whole c-family, not just C++, I think it
> belongs to a "C family" section.  See e.g. 
> <https://gcc.gnu.org/gcc-13/changes.html>.

Thanks, I agree that makes more sense.  How about this version instead then:

diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
index fce0fb44..42353955 100644
--- a/htdocs/gcc-14/changes.html
+++ b/htdocs/gcc-14/changes.html
@@ -303,7 +303,15 @@ a work-in-progress.
   Further clean up and improvements to the GNAT code.
 
 
-
+C family
+
+  The Clang language extensions __has_feature and
+__has_extension have been implemented in GCC.  These
+are available from C, C++, and Objective-C(++).
+This is primarily intended to aid the portability of code written
+against Clang.
+  
+
 
 C
 
Alex

> 
> Marek
> 


Re: [PATCH] wwwdocs: Add note to changes.html for __has_{feature,extension}

2024-04-15 Thread Alex Coplan
On 04/04/2024 11:00, Alex Coplan wrote:
> Hi,
> 
> This adds a note to the GCC 14 release notes mentioning support for
> __has_{feature,extension} (PR60512).
> 
> OK to commit?

Ping.  Is this changes.html patch OK?  I guess it needs a review from C++
maintainers since it adds to the C++ section.

> 
> Thanks,
> Alex

> diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
> index 9fd224c1..facead8d 100644
> --- a/htdocs/gcc-14/changes.html
> +++ b/htdocs/gcc-14/changes.html
> @@ -242,6 +242,12 @@ a work-in-progress.
>constinit and optimized dynamic initialization
>  
>
> +  The Clang language extensions __has_feature and
> +__has_extension have been implemented in GCC.  These
> +are available from C, C++, and Objective-C(++).
> +This is primarily intended to aid the portability of code written
> +against Clang.
> +  
>  
>  
>  Runtime Library (libstdc++)



Re: [PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file

2024-04-12 Thread Alex Coplan
On 12/04/2024 20:02, Ajit Agarwal wrote:
> Hello Alex:
> 
> On 11/04/24 7:55 pm, Alex Coplan wrote:
> > On 10/04/2024 23:48, Ajit Agarwal wrote:
> >> Hello Alex:
> >>
> >> On 10/04/24 7:52 pm, Alex Coplan wrote:
> >>> Hi Ajit,
> >>>
> >>> On 10/04/2024 15:31, Ajit Agarwal wrote:
> >>>> Hello Alex:
> >>>>
> >>>> On 10/04/24 1:42 pm, Alex Coplan wrote:
> >>>>> Hi Ajit,
> >>>>>
> >>>>> On 09/04/2024 20:59, Ajit Agarwal wrote:
> >>>>>> Hello Alex:
> >>>>>>
> >>>>>> On 09/04/24 8:39 pm, Alex Coplan wrote:
> >>>>>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
> >>>>>>>> Hello Alex:
> >>>>>>>>
> >>>>>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
> >>>>>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
> >>>>>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
> >>>>>>>>>>>> Hello Alex/Richard:
> >>>>>>>>>>>>
> >>>>>>>>>>>> All review comments are incorporated.
> >>> 
> >>>>>>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
> >>>>>>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
> >>>>>>>>>>>>  // merge_pairs.
> >>>>>>>>>>>>  void
> >>>>>>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>>>>>> -				 access_group &group)
> >>>>>>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>>>>>> +					 access_group &group)
> >>>>>>>>>>>>  {
> >>>>>>>>>>>>const auto lfs = decode_lfs (encoded_lfs);
> >>>>>>>>>>>>const unsigned access_size = lfs.size;
> >>>>>>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int 
> >>>>>>>>>>>> encoded_lfs,
> >>>>>>>>>>>> access.cand_insns,
> >>>>>>>>>>>> lfs.load_p,
> >>>>>>>>>>>> access_size);
> >>>>>>>>>>>> -  skip_next = access.cand_insns.empty ();
> >>>>>>>>>>>> +  skip_next = bb_state->cand_insns_empty_p 
> >>>>>>>>>>>> (access.cand_insns);
> >>>>>>>>>>>
> >>>>>>>>>>> As above, why is this needed?
> >>>>>>>>>>
> >>>>>>>>>> For rs6000 we want to return always true. as load store pair
> >>>>>>>>>> that are to be merged with 8/16 16/32 32/64 is occurring for rs6000.
> >>>>>>>>>> And we want load store pair to 8/16 32/64. Thats why we want
> >>>>>>>>>> to generate always true for rs6000 to skip pairs as above.
> >>>>>>>>>
> >>>>>>>>> Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 
> >>>>>>>>> you have
> >>>>>>>>> load/store pair instructions where the two arms of the access are 
> >>>>>>>>> storing
> >>>>>>>>> operands of different sizes?  Or something else?
> >>>>>>>>>
> >>>>>>>>> As it stands the logic is to skip the next iteration only if we
> >>>>>>>>> exhausted all the candidate insns for the current access.  In the 
> >>>>>>>>> case
> >>>>>>>>> that we didn't exhaust all such candidates, then the idea is that 
> >>>>>>>>> when
> >>>>>>&

[PATCH v2] aarch64: Preserve mem info on change of base for ldp/stp [PR114674]

2024-04-12 Thread Alex Coplan
This is a v2 because I accidentally sent a WIP version of the patch last
time round which used replace_equiv_address instead of
replace_equiv_address_nv; that caused some ICEs (pointed out by the
Linaro CI) since pair addressing modes aren't a subset of the addresses
that are accepted by memory_operand for a given mode.

This patch should otherwise be identical to v1.  Bootstrapped/regtested
on aarch64-linux-gnu (indeed this is the patch I actually tested last
time), is this version also OK for GCC 15?

Thanks,
Alex

--- >8 ---

The ldp/stp fusion pass can change the base of an access so that the two
accesses end up using a common base register.  So far we have been using
adjust_address_nv to do this, but this means that we don't preserve
other properties of the mem we're replacing.  It seems better to use
replace_equiv_address_nv, as this will preserve e.g. the MEM_ALIGN of the
mem whose address we're changing.

The PR shows that by adjusting the other mem we lose alignment
information about the original access and therefore end up rejecting an
otherwise viable pair when --param=aarch64-stp-policy=aligned is passed.
This patch fixes that by using replace_equiv_address_nv instead.

Notably this is the same approach as taken by
aarch64_check_consecutive_mems when a change of base is required, so
this at least makes things more consistent between the ldp fusion pass
and the peepholes.
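To make the distinction concrete, here is a toy C++ model of the two approaches; the struct and function names are invented for illustration and are not GCC's actual RTL API. The `adjust_address`-style helper rebuilds the mem and recomputes attributes conservatively, while the `replace_equiv_address`-style helper carries the original mem's attributes over and substitutes only the address:

```cpp
#include <cassert>

// Toy model of a MEM rtx: an address plus attached attributes.
// (Illustrative only; GCC's real rtx/MEM machinery is far richer.)
struct mem_model
{
  long addr;
  unsigned align;   // stands in for MEM_ALIGN
};

// adjust_address-style: a fresh mem is built at the new address and its
// attributes are recomputed conservatively, so alignment info can be lost.
inline mem_model
adjust_address_model (const mem_model &base, long offset)
{
  return { base.addr + offset, /*align=*/8 };
}

// replace_equiv_address-style: keep the original mem's attributes and
// substitute only the address.
inline mem_model
replace_equiv_address_model (const mem_model &orig, long new_addr)
{
  mem_model m = orig;
  m.addr = new_addr;
  return m;
}
```

In this model, adjusting the base mem yields an access whose recorded alignment has dropped to the conservative default, whereas replacing the address on the original access keeps its known alignment — which is exactly the information `--param=aarch64-stp-policy=aligned` needs.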

gcc/ChangeLog:

PR target/114674
* config/aarch64/aarch64-ldp-fusion.cc (ldp_bb_info::fuse_pair):
Use replace_equiv_address_nv on a change of base instead of
adjust_address_nv on the other access.

gcc/testsuite/ChangeLog:

PR target/114674
* gcc.target/aarch64/pr114674.c: New test.
diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 365dcf48b22..d07d79df06c 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -1730,11 +1730,11 @@ ldp_bb_info::fuse_pair (bool load_p,
adjust_amt *= -1;
 
   rtx change_reg = XEXP (change_pat, !load_p);
-  machine_mode mode_for_mem = GET_MODE (change_mem);
   rtx effective_base = drop_writeback (base_mem);
-  rtx new_mem = adjust_address_nv (effective_base,
-  mode_for_mem,
-  adjust_amt);
+  rtx adjusted_addr = plus_constant (Pmode,
+XEXP (effective_base, 0),
+adjust_amt);
+  rtx new_mem = replace_equiv_address_nv (change_mem, adjusted_addr);
   rtx new_set = load_p
? gen_rtx_SET (change_reg, new_mem)
: gen_rtx_SET (new_mem, change_reg);
diff --git a/gcc/testsuite/gcc.target/aarch64/pr114674.c b/gcc/testsuite/gcc.target/aarch64/pr114674.c
new file mode 100644
index 000..944784fd008
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr114674.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 --param=aarch64-stp-policy=aligned" } */
+typedef struct {
+   unsigned int f1;
+   unsigned int f2;
+} test_struct;
+
+static test_struct ts = {
+   123, 456
+};
+
+void foo(void)
+{
+   ts.f2 = 36969 * (ts.f2 & 65535) + (ts.f1 >> 16);
+   ts.f1 = 18000 * (ts.f2 & 65535) + (ts.f2 >> 16);
+}
+/* { dg-final { scan-assembler-times "stp" 1 } } */


Re: [PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file

2024-04-11 Thread Alex Coplan
On 10/04/2024 23:48, Ajit Agarwal wrote:
> Hello Alex:
> 
> On 10/04/24 7:52 pm, Alex Coplan wrote:
> > Hi Ajit,
> > 
> > On 10/04/2024 15:31, Ajit Agarwal wrote:
> >> Hello Alex:
> >>
> >> On 10/04/24 1:42 pm, Alex Coplan wrote:
> >>> Hi Ajit,
> >>>
> >>> On 09/04/2024 20:59, Ajit Agarwal wrote:
> >>>> Hello Alex:
> >>>>
> >>>> On 09/04/24 8:39 pm, Alex Coplan wrote:
> >>>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
> >>>>>> Hello Alex:
> >>>>>>
> >>>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
> >>>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
> >>>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
> >>>>>>>>>> Hello Alex/Richard:
> >>>>>>>>>>
> >>>>>>>>>> All review comments are incorporated.
> > 
> >>>>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t 
> >>>>>>>>>> _list,
> >>>>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
> >>>>>>>>>>  // merge_pairs.
> >>>>>>>>>>  void
> >>>>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>>>> -   access_group )
> >>>>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>>>> +   access_group )
> >>>>>>>>>>  {
> >>>>>>>>>>const auto lfs = decode_lfs (encoded_lfs);
> >>>>>>>>>>const unsigned access_size = lfs.size;
> >>>>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int 
> >>>>>>>>>> encoded_lfs,
> >>>>>>>>>>   access.cand_insns,
> >>>>>>>>>>   lfs.load_p,
> >>>>>>>>>>   access_size);
> >>>>>>>>>> -skip_next = access.cand_insns.empty ();
> >>>>>>>>>> +skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
> >>>>>>>>>
> >>>>>>>>> As above, why is this needed?
> >>>>>>>>
> >>>>>>>> For rs6000 we want to return always true, as load store pairs
> >>>>>>>> that are to be merged with 8/16 16/32 32/64 is occurring for rs6000.
> >>>>>>>> And we want load store pair to 8/16 32/64. That's why we want
> >>>>>>>> to generate always true for rs6000 to skip pairs as above.
> >>>>>>>
> >>>>>>> Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 
> >>>>>>> you have
> >>>>>>> load/store pair instructions where the two arms of the access are 
> >>>>>>> storing
> >>>>>>> operands of different sizes?  Or something else?
> >>>>>>>
> >>>>>>> As it stands the logic is to skip the next iteration only if we
> >>>>>>> exhausted all the candidate insns for the current access.  In the case
> >>>>>>> that we didn't exhaust all such candidates, then the idea is that when
> >>>>>>> access becomes prev_access, we can attempt to use those candidates as
> >>>>>>> the "left-hand side" of a pair in the next iteration since we failed 
> >>>>>>> to
> >>>>>>> use them as the "right-hand side" of a pair in the current iteration.
> >>>>>>> I don't see why you wouldn't want that behaviour.  Please can you
> >>>>>>> explain?
> >>>>>>>
> >>>>>>
> >>>>>> In merge_pair we get the 2 load candidates, one load from 0 offset and
> >>>>>> other load is from 16th offset. Then in next iteration we get load
> >>>>>> from 16th offset and other load from 32 offset

[PATCH] aarch64: Preserve mem info on change of base for ldp/stp [PR114674]

2024-04-11 Thread Alex Coplan
Hi,

The ldp/stp fusion pass can change the base of an access so that the two
accesses end up using a common base register.  So far we have been using
adjust_address_nv to do this, but this means that we don't preserve
other properties of the mem we're replacing.  It seems better to use
replace_equiv_address_nv, as this will preserve e.g. the MEM_ALIGN of the
mem whose address we're changing.

The PR shows that by adjusting the other mem we lose alignment
information about the original access and therefore end up rejecting an
otherwise viable pair when --param=aarch64-stp-policy=aligned is passed.
This patch fixes that by using replace_equiv_address_nv instead.

Notably this is the same approach as taken by
aarch64_check_consecutive_mems when a change of base is required, so
this at least makes things more consistent between the ldp fusion pass
and the peepholes.

Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk when stage 1
opens for GCC 15?

Thanks,
Alex


gcc/ChangeLog:

PR target/114674
* config/aarch64/aarch64-ldp-fusion.cc (ldp_bb_info::fuse_pair):
Use replace_equiv_address_nv on a change of base instead of
adjust_address_nv on the other access.

gcc/testsuite/ChangeLog:

PR target/114674
* gcc.target/aarch64/pr114674.c: New test.
diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 365dcf48b22..4258a560c48 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -1730,11 +1730,11 @@ ldp_bb_info::fuse_pair (bool load_p,
adjust_amt *= -1;
 
   rtx change_reg = XEXP (change_pat, !load_p);
-  machine_mode mode_for_mem = GET_MODE (change_mem);
   rtx effective_base = drop_writeback (base_mem);
-  rtx new_mem = adjust_address_nv (effective_base,
-  mode_for_mem,
-  adjust_amt);
+  rtx adjusted_addr = plus_constant (Pmode,
+XEXP (effective_base, 0),
+adjust_amt);
+  rtx new_mem = replace_equiv_address (change_mem, adjusted_addr);
   rtx new_set = load_p
? gen_rtx_SET (change_reg, new_mem)
: gen_rtx_SET (new_mem, change_reg);
diff --git a/gcc/testsuite/gcc.target/aarch64/pr114674.c b/gcc/testsuite/gcc.target/aarch64/pr114674.c
new file mode 100644
index 000..944784fd008
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr114674.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 --param=aarch64-stp-policy=aligned" } */
+typedef struct {
+   unsigned int f1;
+   unsigned int f2;
+} test_struct;
+
+static test_struct ts = {
+   123, 456
+};
+
+void foo(void)
+{
+   ts.f2 = 36969 * (ts.f2 & 65535) + (ts.f1 >> 16);
+   ts.f1 = 18000 * (ts.f2 & 65535) + (ts.f2 >> 16);
+}
+/* { dg-final { scan-assembler-times "stp" 1 } } */


Re: [PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file

2024-04-10 Thread Alex Coplan
Hi Ajit,

On 10/04/2024 15:31, Ajit Agarwal wrote:
> Hello Alex:
> 
> On 10/04/24 1:42 pm, Alex Coplan wrote:
> > Hi Ajit,
> > 
> > On 09/04/2024 20:59, Ajit Agarwal wrote:
> >> Hello Alex:
> >>
> >> On 09/04/24 8:39 pm, Alex Coplan wrote:
> >>> On 09/04/2024 20:01, Ajit Agarwal wrote:
> >>>> Hello Alex:
> >>>>
> >>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
> >>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
> >>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
> >>>>>>>> Hello Alex/Richard:
> >>>>>>>>
> >>>>>>>> All review comments are incorporated.

> >>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t 
> >>>>>>>> _list,
> >>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
> >>>>>>>>  // merge_pairs.
> >>>>>>>>  void
> >>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>> - access_group )
> >>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>> + access_group )
> >>>>>>>>  {
> >>>>>>>>const auto lfs = decode_lfs (encoded_lfs);
> >>>>>>>>const unsigned access_size = lfs.size;
> >>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int 
> >>>>>>>> encoded_lfs,
> >>>>>>>> access.cand_insns,
> >>>>>>>> lfs.load_p,
> >>>>>>>> access_size);
> >>>>>>>> -  skip_next = access.cand_insns.empty ();
> >>>>>>>> +  skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
> >>>>>>>
> >>>>>>> As above, why is this needed?
> >>>>>>
> >>>>>> For rs6000 we want to return always true, as load store pairs
> >>>>>> that are to be merged with 8/16 16/32 32/64 is occurring for rs6000.
> >>>>>> And we want load store pair to 8/16 32/64. That's why we want
> >>>>>> to generate always true for rs6000 to skip pairs as above.
> >>>>>
> >>>>> Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 you 
> >>>>> have
> >>>>> load/store pair instructions where the two arms of the access are 
> >>>>> storing
> >>>>> operands of different sizes?  Or something else?
> >>>>>
> >>>>> As it stands the logic is to skip the next iteration only if we
> >>>>> exhausted all the candidate insns for the current access.  In the case
> >>>>> that we didn't exhaust all such candidates, then the idea is that when
> >>>>> access becomes prev_access, we can attempt to use those candidates as
> >>>>> the "left-hand side" of a pair in the next iteration since we failed to
> >>>>> use them as the "right-hand side" of a pair in the current iteration.
> >>>>> I don't see why you wouldn't want that behaviour.  Please can you
> >>>>> explain?
> >>>>>
> >>>>
> >>>> In merge_pair we get the 2 load candidates, one load from 0 offset and
> >>>> other load is from 16th offset. Then in next iteration we get load
> >>>> from 16th offset and other load from 32 offset. In next iteration
> >>>> we get load from 32 offset and other load from 48 offset.
> >>>>
> >>>> For example:
> >>>>
> >>>> Currently we get the load candidates as follows.
> >>>>
> >>>> pairs:
> >>>>
> >>>> load from 0th offset.
> >>>> load from 16th offset.
> >>>>
> >>>> next pairs:
> >>>>
> >>>> load from 16th offset.
> >>>> load from 32nd offset.
> >>>>
> >>>> next pairs:
> >>>>
> >>>> load from 32nd offset
> >

Re: [PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file

2024-04-10 Thread Alex Coplan
Hi Ajit,

On 09/04/2024 20:59, Ajit Agarwal wrote:
> Hello Alex:
> 
> On 09/04/24 8:39 pm, Alex Coplan wrote:
> > On 09/04/2024 20:01, Ajit Agarwal wrote:
> >> Hello Alex:
> >>
> >> On 09/04/24 7:29 pm, Alex Coplan wrote:
> >>> On 09/04/2024 17:30, Ajit Agarwal wrote:
> >>>>
> >>>>
> >>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
> >>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
> >>>>>> Hello Alex/Richard:
> >>>>>>
> >>>>>> All review comments are incorporated.
> >>>>>
> >>>>> Thanks, I was kind-of expecting you to also send the renaming patch as a
> >>>>> preparatory patch as we discussed.
> >>>>>
> >>>>> Sorry for another meta comment, but: I think the reason that the Linaro
> >>>>> CI isn't running tests on your patches is actually because you're
> >>>>> sending 1/3 of a series but not sending the rest of the series.
> >>>>>
> >>>>> So please can you either send this as an individual preparatory patch
> >>>>> (not marked as a series) or if you're going to send a series (e.g. with
> >>>>> a preparatory rename patch as 1/2 and this as 2/2) then send the entire
> >>>>> series when you make updates.  That way the CI should test your patches,
> >>>>> which would be helpful.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>  
> >>>>>>
> >>>>>> Common infrastructure of load store pair fusion is divided into target
> >>>>>> independent and target dependent changed code.
> >>>>>>
> >>>>>> Target independent code is the Generic code with pure virtual function
> >>>>>> to interface between target independent and dependent code.
> >>>>>>
> >>>>>> Target dependent code is the implementation of pure virtual function 
> >>>>>> for
> >>>>>> aarch64 target and the call to target independent code.
> >>>>>>
> >>>>>> Thanks & Regards
> >>>>>> Ajit
> >>>>>>
> >>>>>>
> >>>>>> aarch64: Place target independent and dependent changed code in one 
> >>>>>> file
> >>>>>>
> >>>>>> Common infrastructure of load store pair fusion is divided into target
> >>>>>> independent and target dependent changed code.
> >>>>>>
> >>>>>> Target independent code is the Generic code with pure virtual function
> >>>>>> to interface between target independent and dependent code.
> >>>>>>
> >>>>>> Target dependent code is the implementation of pure virtual function 
> >>>>>> for
> >>>>>> aarch64 target and the call to target independent code.
> >>>>>>
> >>>>>> 2024-04-06  Ajit Kumar Agarwal  
> >>>>>>
> >>>>>> gcc/ChangeLog:
> >>>>>>
> >>>>>>* config/aarch64/aarch64-ldp-fusion.cc: Place target
> >>>>>>independent and dependent changed code.
> >>>>>
> >>>>> You're going to need a proper ChangeLog eventually, but I guess there's
> >>>>> no need for that right now.
> >>>>>
> >>>>>> ---
> >>>>>>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++
> >>>>>>  1 file changed, 249 insertions(+), 122 deletions(-)
> >>>>>>
> >>>>>> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> >>>>>> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >>>>>> index 22ed95eb743..cb21b514ef7 100644
> >>>>>> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >>>>>> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >>>>>> @@ -138,8 +138,122 @@ struct alt_base
> >>>>>>poly_int64 offset;
> >>>>>>  };
> >>>>>>  
> >>>>>> +// Virtual base class for load/store walkers used in alias analysis.
> >>>>>> +struct alias_walker
> >>>>>> +{
> >>>>>> +  virtual bool confl

Re: [PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file

2024-04-09 Thread Alex Coplan
On 09/04/2024 20:01, Ajit Agarwal wrote:
> Hello Alex:
> 
> On 09/04/24 7:29 pm, Alex Coplan wrote:
> > On 09/04/2024 17:30, Ajit Agarwal wrote:
> >>
> >>
> >> On 05/04/24 10:03 pm, Alex Coplan wrote:
> >>> On 05/04/2024 13:53, Ajit Agarwal wrote:
> >>>> Hello Alex/Richard:
> >>>>
> >>>> All review comments are incorporated.
> >>>
> >>> Thanks, I was kind-of expecting you to also send the renaming patch as a
> >>> preparatory patch as we discussed.
> >>>
> >>> Sorry for another meta comment, but: I think the reason that the Linaro
> >>> CI isn't running tests on your patches is actually because you're
> >>> sending 1/3 of a series but not sending the rest of the series.
> >>>
> >>> So please can you either send this as an individual preparatory patch
> >>> (not marked as a series) or if you're going to send a series (e.g. with
> >>> a preparatory rename patch as 1/2 and this as 2/2) then send the entire
> >>> series when you make updates.  That way the CI should test your patches,
> >>> which would be helpful.
> >>>
> >>
> >> Addressed.
> >>  
> >>>>
> >>>> Common infrastructure of load store pair fusion is divided into target
> >>>> independent and target dependent changed code.
> >>>>
> >>>> Target independent code is the Generic code with pure virtual function
> >>>> to interface between target independent and dependent code.
> >>>>
> >>>> Target dependent code is the implementation of pure virtual function for
> >>>> aarch64 target and the call to target independent code.
> >>>>
> >>>> Thanks & Regards
> >>>> Ajit
> >>>>
> >>>>
> >>>> aarch64: Place target independent and dependent changed code in one file
> >>>>
> >>>> Common infrastructure of load store pair fusion is divided into target
> >>>> independent and target dependent changed code.
> >>>>
> >>>> Target independent code is the Generic code with pure virtual function
> >>>> to interface between target independent and dependent code.
> >>>>
> >>>> Target dependent code is the implementation of pure virtual function for
> >>>> aarch64 target and the call to target independent code.
> >>>>
> >>>> 2024-04-06  Ajit Kumar Agarwal  
> >>>>
> >>>> gcc/ChangeLog:
> >>>>
> >>>>  * config/aarch64/aarch64-ldp-fusion.cc: Place target
> >>>>  independent and dependent changed code.
> >>>
> >>> You're going to need a proper ChangeLog eventually, but I guess there's
> >>> no need for that right now.
> >>>
> >>>> ---
> >>>>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++
> >>>>  1 file changed, 249 insertions(+), 122 deletions(-)
> >>>>
> >>>> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> >>>> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >>>> index 22ed95eb743..cb21b514ef7 100644
> >>>> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >>>> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >>>> @@ -138,8 +138,122 @@ struct alt_base
> >>>>poly_int64 offset;
> >>>>  };
> >>>>  
> >>>> +// Virtual base class for load/store walkers used in alias analysis.
> >>>> +struct alias_walker
> >>>> +{
> >>>> +  virtual bool conflict_p (int ) const = 0;
> >>>> +  virtual insn_info *insn () const = 0;
> >>>> +  virtual bool valid () const  = 0;
> >>>
> >>> Heh, looking at this made me realise there is a whitespace bug here in
> >>> the existing code (double space after const).  Sorry about that!  I'll
> >>> push an obvious fix for that.
> >>>
> >>>> +  virtual void advance () = 0;
> >>>> +};
> >>>> +
> >>>> +struct pair_fusion {
> >>>> +
> >>>> +  pair_fusion () {};
> >>>
> >>> This ctor looks pointless at the moment.  Perhaps instead we could put
> >>> the contents of ldp_fusion_init in here and then delete that function?
> >>>
> >>
>

Re: [PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file

2024-04-09 Thread Alex Coplan
On 09/04/2024 17:30, Ajit Agarwal wrote:
> 
> 
> On 05/04/24 10:03 pm, Alex Coplan wrote:
> > On 05/04/2024 13:53, Ajit Agarwal wrote:
> >> Hello Alex/Richard:
> >>
> >> All review comments are incorporated.
> > 
> > Thanks, I was kind-of expecting you to also send the renaming patch as a
> > preparatory patch as we discussed.
> > 
> > Sorry for another meta comment, but: I think the reason that the Linaro
> > CI isn't running tests on your patches is actually because you're
> > sending 1/3 of a series but not sending the rest of the series.
> > 
> > So please can you either send this as an individual preparatory patch
> > (not marked as a series) or if you're going to send a series (e.g. with
> > a preparatory rename patch as 1/2 and this as 2/2) then send the entire
> > series when you make updates.  That way the CI should test your patches,
> > which would be helpful.
> >
> 
> Addressed.
>  
> >>
> >> Common infrastructure of load store pair fusion is divided into target
> >> independent and target dependent changed code.
> >>
> >> Target independent code is the Generic code with pure virtual function
> >> to interface between target independent and dependent code.
> >>
> >> Target dependent code is the implementation of pure virtual function for
> >> aarch64 target and the call to target independent code.
> >>
> >> Thanks & Regards
> >> Ajit
> >>
> >>
> >> aarch64: Place target independent and dependent changed code in one file
> >>
> >> Common infrastructure of load store pair fusion is divided into target
> >> independent and target dependent changed code.
> >>
> >> Target independent code is the Generic code with pure virtual function
> >> to interface between target independent and dependent code.
> >>
> >> Target dependent code is the implementation of pure virtual function for
> >> aarch64 target and the call to target independent code.
> >>
> >> 2024-04-06  Ajit Kumar Agarwal  
> >>
> >> gcc/ChangeLog:
> >>
> >>* config/aarch64/aarch64-ldp-fusion.cc: Place target
> >>independent and dependent changed code.
> > 
> > You're going to need a proper ChangeLog eventually, but I guess there's
> > no need for that right now.
> > 
> >> ---
> >>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++
> >>  1 file changed, 249 insertions(+), 122 deletions(-)
> >>
> >> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> >> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >> index 22ed95eb743..cb21b514ef7 100644
> >> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >> @@ -138,8 +138,122 @@ struct alt_base
> >>poly_int64 offset;
> >>  };
> >>  
> >> +// Virtual base class for load/store walkers used in alias analysis.
> >> +struct alias_walker
> >> +{
> >> +  virtual bool conflict_p (int ) const = 0;
> >> +  virtual insn_info *insn () const = 0;
> >> +  virtual bool valid () const  = 0;
> > 
> > Heh, looking at this made me realise there is a whitespace bug here in
> > the existing code (double space after const).  Sorry about that!  I'll
> > push an obvious fix for that.
> > 
> >> +  virtual void advance () = 0;
> >> +};
> >> +
> >> +struct pair_fusion {
> >> +
> >> +  pair_fusion () {};
> > 
> > This ctor looks pointless at the moment.  Perhaps instead we could put
> > the contents of ldp_fusion_init in here and then delete that function?
> > 
> 
> Addressed.
> 
> >> +  virtual bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
> >> + bool load_p) = 0;
> > 
> > Please can we have comments above each of these virtual functions
> > describing any parameters, what the purpose of the hook is, and the
> > interpretation of the return value?  This will serve as the
> > documentation for other targets that want to make use of the pass.
> > 
> > It might make sense to have a default-false implementation for
> > fpsimd_op_p, especially if you don't want to make use of this bit for
> > rs6000.
> >
> 
> Addressed.
>  
> >> +
> >> +  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
> >> +  virtual bool pair_trailing_writeback_p () = 0;
> > 
> >>>>>>> Sorry for the run-around

[PATCH][committed] aarch64: Fix whitespace in aarch64-ldp-fusion.cc:alias_walker

2024-04-05 Thread Alex Coplan
I spotted this whitespace error during the review of
https://gcc.gnu.org/pipermail/gcc-patches/2024-April/648846.html.

Pushing as obvious after testing on aarch64-linux-gnu.

Thanks,
Alex

gcc/ChangeLog:

* config/aarch64/aarch64-ldp-fusion.cc (struct alias_walker):
Fix double space after const qualifier on valid ().
diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 22ed95eb743..365dcf48b22 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -2138,7 +2138,7 @@ struct alias_walker
 {
   virtual bool conflict_p (int ) const = 0;
   virtual insn_info *insn () const = 0;
-  virtual bool valid () const  = 0;
+  virtual bool valid () const = 0;
   virtual void advance () = 0;
 };
 


Re: [PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file

2024-04-05 Thread Alex Coplan
On 05/04/2024 13:53, Ajit Agarwal wrote:
> Hello Alex/Richard:
> 
> All review comments are incorporated.

Thanks, I was kind-of expecting you to also send the renaming patch as a
preparatory patch as we discussed.

Sorry for another meta comment, but: I think the reason that the Linaro
CI isn't running tests on your patches is actually because you're
sending 1/3 of a series but not sending the rest of the series.

So please can you either send this as an individual preparatory patch
(not marked as a series) or if you're going to send a series (e.g. with
a preparatory rename patch as 1/2 and this as 2/2) then send the entire
series when you make updates.  That way the CI should test your patches,
which would be helpful.

> 
> Common infrastructure of load store pair fusion is divided into target
> independent and target dependent changed code.
> 
> Target independent code is the Generic code with pure virtual function
> to interface between target independent and dependent code.
> 
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
> 
> Thanks & Regards
> Ajit
> 
> 
> aarch64: Place target independent and dependent changed code in one file
> 
> Common infrastructure of load store pair fusion is divided into target
> independent and target dependent changed code.
> 
> Target independent code is the Generic code with pure virtual function
> to interface between target independent and dependent code.
> 
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
> 
> 2024-04-06  Ajit Kumar Agarwal  
> 
> gcc/ChangeLog:
> 
>   * config/aarch64/aarch64-ldp-fusion.cc: Place target
>   independent and dependent changed code.

You're going to need a proper ChangeLog eventually, but I guess there's
no need for that right now.

> ---
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++
>  1 file changed, 249 insertions(+), 122 deletions(-)
> 
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 22ed95eb743..cb21b514ef7 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -138,8 +138,122 @@ struct alt_base
>poly_int64 offset;
>  };
>  
> +// Virtual base class for load/store walkers used in alias analysis.
> +struct alias_walker
> +{
> +  virtual bool conflict_p (int ) const = 0;
> +  virtual insn_info *insn () const = 0;
> +  virtual bool valid () const  = 0;

Heh, looking at this made me realise there is a whitespace bug here in
the existing code (double space after const).  Sorry about that!  I'll
push an obvious fix for that.

> +  virtual void advance () = 0;
> +};
> +
> +struct pair_fusion {
> +
> +  pair_fusion () {};

This ctor looks pointless at the moment.  Perhaps instead we could put
the contents of ldp_fusion_init in here and then delete that function?

> +  virtual bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
> +bool load_p) = 0;

Please can we have comments above each of these virtual functions
describing any parameters, what the purpose of the hook is, and the
interpretation of the return value?  This will serve as the
documentation for other targets that want to make use of the pass.

It might make sense to have a default-false implementation for
fpsimd_op_p, especially if you don't want to make use of this bit for
rs6000.

> +
> +  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
> +  virtual bool pair_trailing_writeback_p () = 0;

Sorry for the run-around, but: I think this and
handle_writeback_opportunities () should be the same function, either
returning an enum or taking an enum and returning a boolean.

At a minimum they should have more similar sounding names.

> +  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
> +   machine_mode mem_mode) = 0;
> +  virtual int pair_mem_alias_check_limit () = 0;
> +  virtual bool handle_writeback_opportunities () = 0 ;
> +  virtual bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
> + machine_mode mode) = 0;
> +  virtual rtx gen_mem_pair (rtx *pats,  rtx writeback,

Nit: excess whitespace after pats,

> + bool load_p) = 0;
> +  virtual bool pair_mem_promote_writeback_p (rtx pat) = 0;
> +  virtual bool track_load_p () = 0;
> +  virtual bool track_store_p () = 0;

I think it would probably make more sense for these two to have
default-true implementations rather than being pure virtual functions.

Also, sorry for the bikeshedding, but please can we keep the plural
names (so track_loads_p and track_stores_p).

> +  virtual bool cand_insns_empty_p (std::list ) = 0;

Why does this need to be virtualised?  I would it expect it to
just be insns.empty () on all targets.

> +  virtual bool pair_mem_in_range_p 
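The shape of interface the review is steering toward can be sketched as below. This is an illustration of the suggestions (default-false `fpsimd_op_p`, default-true plural `track_loads_p`/`track_stores_p`, and `cand_insns_empty_p` demoted to a plain `insns.empty ()` check), not the committed GCC interface; the `aarch64_model` override is hypothetical:

```cpp
#include <cassert>
#include <list>

// Sketch of the hook layout suggested in the review.
struct pair_fusion_model
{
  // Pure virtual: every target must answer this for itself.
  virtual bool pair_operand_mode_ok_p (int mode) = 0;

  // Default-false: targets like rs6000 that don't need an FP/SIMD
  // distinction simply don't override it.
  virtual bool fpsimd_op_p (int /*reg_op*/, int /*mem_mode*/, bool /*load_p*/)
  { return false; }

  // Default-true, with the plural names the reviewer asked for.
  virtual bool track_loads_p () { return true; }
  virtual bool track_stores_p () { return true; }

  virtual ~pair_fusion_model () {}
};

// Not a hook at all: insns.empty () is the same on every target.
static bool
cand_insns_empty_p (const std::list<int> &insns)
{
  return insns.empty ();
}

// A hypothetical target only overrides what it must.
struct aarch64_model : pair_fusion_model
{
  bool pair_operand_mode_ok_p (int mode) override { return mode != 0; }
};
```

The point of the defaults is that a new target's subclass stays small: it implements the genuinely target-specific predicates and inherits sensible behaviour for the rest.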

[PATCH] wwwdocs: Add note to changes.html for __has_{feature,extension}

2024-04-04 Thread Alex Coplan
Hi,

This adds a note to the GCC 14 release notes mentioning support for
__has_{feature,extension} (PR60512).

OK to commit?

Thanks,
Alex
diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
index 9fd224c1..facead8d 100644
--- a/htdocs/gcc-14/changes.html
+++ b/htdocs/gcc-14/changes.html
@@ -242,6 +242,12 @@ a work-in-progress.
   constinit and optimized dynamic initialization
 
   
+  The Clang language extensions __has_feature and
+__has_extension have been implemented in GCC.  These
+are available from C, C++, and Objective-C(++).
+This is primarily intended to aid the portability of code written
+against Clang.
+  
 
 
 Runtime Library (libstdc++)
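Since these macros are new in GCC 14, portable code should guard against older compilers that predefine neither; the fallback-macro pattern below is the usual idiom (the `address_sanitizer` feature name is one example query, and `asan_enabled` is an invented helper for illustration):

```cpp
#include <cassert>

// Portable use of the Clang-style feature-test macros now also provided
// by GCC 14.  Older compilers lack them, so supply no-op fallbacks.
#ifndef __has_feature
#  define __has_feature(x) 0
#endif
#ifndef __has_extension
#  define __has_extension(x) __has_feature (x)
#endif

#if __has_feature (address_sanitizer)
#  define ASAN_ENABLED 1
#else
#  define ASAN_ENABLED 0
#endif

// Report whether the translation unit was built with ASan instrumentation.
int
asan_enabled (void)
{
  return ASAN_ENABLED;
}
```

With the fallbacks in place the same source compiles unchanged on Clang, GCC 14, and older GCC releases, degrading gracefully to "feature absent" on the latter.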


Re: [PATCH V3 0/2] aarch64: Place target independent and dependent changed code in one file.

2024-04-03 Thread Alex Coplan
On 23/02/2024 16:41, Ajit Agarwal wrote:
> Hello Richard/Alex/Segher:

Hi Ajit,

Sorry for the delay and thanks for working on this.

Generally this looks like the right sort of approach (IMO) but I've left
some comments below.

I'll start with a meta comment: in the subject line you have marked this
as 0/2, but usually 0/n is reserved for the cover letter of a patch
series and wouldn't contain an actual patch.  I think this might have
confused the Linaro CI suitably such that it didn't run regression tests
on the patch.

> 
> This patch adds the changed code for target independent and
> dependent code for load store fusion.
> 
> Common infrastructure of load store pair fusion is
> divided into target independent and target dependent
> changed code.
> 
> Target independent code is the Generic code with
> pure virtual function to interface betwwen target
> independent and dependent code.
> 
> Target dependent code is the implementation of pure
> virtual function for aarch64 target and the call
> to target independent code.
> 
> Bootstrapped for aarch64-linux-gnu.
> 
> Thanks & Regards
> Ajit
> 
> aarch64: Place target independent and dependent changed code in one file.
> 
> Common infrastructure of load store pair fusion is
> divided into target independent and target dependent
> changed code.
> 
> Target independent code is the Generic code with
> pure virtual function to interface betwwen target
> independent and dependent code.
> 
> Target dependent code is the implementation of pure
> virtual function for aarch64 target and the call
> to target independent code.
> 
> 2024-02-23  Ajit Kumar Agarwal  
> 
> gcc/ChangeLog:
> 
>   * config/aarch64/aarch64-ldp-fusion.cc: Place target
>   independent and dependent changed code.
> ---
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 437 ---
>  1 file changed, 305 insertions(+), 132 deletions(-)
> 
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 22ed95eb743..2ef22ff1e96 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -40,10 +40,10 @@
>  
>  using namespace rtl_ssa;
>  
> -static constexpr HOST_WIDE_INT LDP_IMM_BITS = 7;
> -static constexpr HOST_WIDE_INT LDP_IMM_SIGN_BIT = (1 << (LDP_IMM_BITS - 1));
> -static constexpr HOST_WIDE_INT LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1;
> -static constexpr HOST_WIDE_INT LDP_MIN_IMM = -LDP_MAX_IMM - 1;
> +static constexpr HOST_WIDE_INT PAIR_MEM_IMM_BITS = 7;
> +static constexpr HOST_WIDE_INT PAIR_MEM_IMM_SIGN_BIT = (1 << (PAIR_MEM_IMM_BITS - 1));
> +static constexpr HOST_WIDE_INT PAIR_MEM_MAX_IMM = PAIR_MEM_IMM_SIGN_BIT - 1;
> +static constexpr HOST_WIDE_INT PAIR_MEM_MIN_IMM = -PAIR_MEM_MAX_IMM - 1;

These constants shouldn't be renamed: they are specific to aarch64 so the
original names should be preserved in this file.

I expect we want to introduce virtual functions to validate an offset
for a paired access.  The aarch64 code could then implement it by
comparing the offset against LDP_{MIN,MAX}_IMM, and the generic code
could use that hook to replace the code that queries those constants
directly (i.e. in find_trailing_add and get_viable_bases).

>  
>  // We pack these fields (load_p, fpsimd_p, and size) into an integer
>  // (LFS) which we use as part of the key into the main hash tables.
> @@ -138,8 +138,18 @@ struct alt_base
>poly_int64 offset;
>  };
>  
> +// Virtual base class for load/store walkers used in alias analysis.
> +struct alias_walker
> +{
> +  virtual bool conflict_p (int ) const = 0;
> +  virtual insn_info *insn () const = 0;
> +  virtual bool valid () const = 0;
> +  virtual void advance () = 0;
> +};
> +
> +
>  // State used by the pass for a given basic block.
> -struct ldp_bb_info
> +struct pair_fusion

As a comment on the high-level design, I think we want a generic class
for the overall pass, not just for the BB-specific structure.

That is because naturally we want the ldp_fusion_bb function itself to
be a member of such a class, so that it can access virtual functions to
query the target e.g. about the load/store pair policy, and whether to
try and promote writeback pairs.

If we keep all of the virtual functions in such an outer class, then we
can keep the ldp_fusion_bb class generic (not needing an override for
each target) and that inner class can perhaps be given a pointer or
reference to the outer class when it is instantiated in ldp_fusion_bb.

>  {
>using def_hash = nofree_ptr_hash;
>using expr_key_t = pair_hash>;
> @@ -161,13 +171,13 @@ struct ldp_bb_info
>static const size_t obstack_alignment = sizeof (void *);
>bb_info *m_bb;
>  
> -  ldp_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
> +  pair_fusion (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
>{
>  obstack_specify_allocation (_obstack, OBSTACK_CHUNK_SIZE,
>   obstack_alignment, obstack_chunk_alloc,
>  

Re: [PATCH 0/1 V2] Target independent code for common infrastructure of load,store fusion for rs6000 and aarch64 target.

2024-02-15 Thread Alex Coplan
On 15/02/2024 22:38, Ajit Agarwal wrote:
> Hello Alex:
> 
> On 15/02/24 10:12 pm, Alex Coplan wrote:
> > On 15/02/2024 21:24, Ajit Agarwal wrote:
> >> Hello Richard:
> >>
> >> As per your suggestion I have divided the patch into target independent
> >> and target dependent for aarch64 target. I kept aarch64-ldp-fusion same
> >> and did not change that.
> > 
> > I'm not sure this was what Richard suggested doing, though.
> > He said (from
> > https://gcc.gnu.org/pipermail/gcc-patches/2024-February/645545.html):
> > 
> >> Maybe one way of making the review easier would be to split the aarch64
> >> pass into the "target-dependent" and "target-independent" pieces
> >> in-place, i.e. keeping everything within aarch64-ldp-fusion.cc, and then
> >> (as separate patches) move the target-independent pieces outside
> >> config/aarch64.
> > 
> > but this adds the target-independent parts separately instead of
> > splitting it out within config/aarch64 (which I agree should make the
> > review easier).
> 
> I am sorry I didn't follow. Can you kindly elaborate on this.

So IIUC Richard was suggesting splitting into target-independent and
target-dependent pieces within aarch64-ldp-fusion.cc as a first step,
i.e. you introduce the abstractions (virtual functions) needed within
that file.  That should hopefully be a relatively small diff.

Then in a separate patch you can move the target-independent parts out of
config/aarch64.

Does that make sense?

Thanks,
Alex

> 
> Thanks & Regards
> Ajit
> > 
> > Thanks,
> > Alex
> > 
> >>
> >> Common infrastructure of load store pair fusion is divided into
> >> target independent and target dependent code for rs6000 and aarch64
> >> target.
> >>
> >> Target independent code is structured in the following files.
> >> gcc/pair-fusion-base.h
> >> gcc/pair-fusion-common.cc
> >> gcc/pair-fusion.cc
> >>
> >> Target independent code is the Generic code with pure virtual
> >> function to interface between target independent and dependent
> >> code.
> >>
> >> Thanks & Regards
> >> Ajit
> >>
> >> Target independent code for common infrastructure of load
> >> store fusion for rs6000 and aarch64 target.
> >>
> >> Common infrastructure of load store pair fusion is divided into
> >> target independent and target dependent code for rs6000 and aarch64
> >> target.
> >>
> >> Target independent code is structured in the following files.
> >> gcc/pair-fusion-base.h
> >> gcc/pair-fusion-common.cc
> >> gcc/pair-fusion.cc
> >>
> >> Target independent code is the Generic code with pure virtual
> >> function to interface between target independent and dependent
> >> code.
> >>
> >> 2024-02-15  Ajit Kumar Agarwal  
> >>
> >> gcc/ChangeLog:
> >>
> >>* pair-fusion-base.h: Generic header code for load store fusion
> >>that can be shared across different architectures.
> >>* pair-fusion-common.cc: Generic source code for load store
> >>fusion that can be shared across different architectures.
> >>* pair-fusion.cc: Generic implementation of pair_fusion class
> >>defined in pair-fusion-base.h
> >>* Makefile.in: Add new executable pair-fusion.o and
> >>pair-fusion-common.o.
> >> ---
> >>  gcc/Makefile.in   |2 +
> >>  gcc/pair-fusion-base.h|  586 ++
> >>  gcc/pair-fusion-common.cc | 1202 
> >>  gcc/pair-fusion.cc| 1225 +
> >>  4 files changed, 3015 insertions(+)
> >>  create mode 100644 gcc/pair-fusion-base.h
> >>  create mode 100644 gcc/pair-fusion-common.cc
> >>  create mode 100644 gcc/pair-fusion.cc
> >>
> >> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> >> index a74761b7ab3..df5061ddfe7 100644
> >> --- a/gcc/Makefile.in
> >> +++ b/gcc/Makefile.in
> >> @@ -1563,6 +1563,8 @@ OBJS = \
> >>ipa-strub.o \
> >>ipa.o \
> >>ira.o \
> >> +  pair-fusion-common.o \
> >> +  pair-fusion.o \
> >>ira-build.o \
> >>ira-costs.o \
> >>ira-conflicts.o \
> >> diff --git a/gcc/pair-fusion-base.h b/gcc/pair-fusion-base.h
> >> new file mode 100644
> >> index 000..fdaf4fd743d
> >> --- /dev/null
> 

Re: [PATCH 0/1 V2] Target independent code for common infrastructure of load,store fusion for rs6000 and aarch64 target.

2024-02-15 Thread Alex Coplan
On 15/02/2024 21:24, Ajit Agarwal wrote:
> Hello Richard:
> 
> As per your suggestion I have divided the patch into target independent
> and target dependent for aarch64 target. I kept aarch64-ldp-fusion same
> and did not change that.

I'm not sure this was what Richard suggested doing, though.
He said (from
https://gcc.gnu.org/pipermail/gcc-patches/2024-February/645545.html):

> Maybe one way of making the review easier would be to split the aarch64
> pass into the "target-dependent" and "target-independent" pieces
> in-place, i.e. keeping everything within aarch64-ldp-fusion.cc, and then
> (as separate patches) move the target-independent pieces outside
> config/aarch64.

but this adds the target-independent parts separately instead of
splitting it out within config/aarch64 (which I agree should make the
review easier).

Thanks,
Alex

> 
> Common infrastructure of load store pair fusion is divided into
> target independent and target dependent code for rs6000 and aarch64
> target.
> 
> Target independent code is structured in the following files.
> gcc/pair-fusion-base.h
> gcc/pair-fusion-common.cc
> gcc/pair-fusion.cc
> 
> Target independent code is the Generic code with pure virtual
> function to interface between target independent and dependent
> code.
> 
> Thanks & Regards
> Ajit
> 
> Target independent code for common infrastructure of load
> store fusion for rs6000 and aarch64 target.
> 
> Common infrastructure of load store pair fusion is divided into
> target independent and target dependent code for rs6000 and aarch64
> target.
> 
> Target independent code is structured in the following files.
> gcc/pair-fusion-base.h
> gcc/pair-fusion-common.cc
> gcc/pair-fusion.cc
> 
> Target independent code is the Generic code with pure virtual
> function to interface between target independent and dependent
> code.
> 
> 2024-02-15  Ajit Kumar Agarwal  
> 
> gcc/ChangeLog:
> 
>   * pair-fusion-base.h: Generic header code for load store fusion
>   that can be shared across different architectures.
>   * pair-fusion-common.cc: Generic source code for load store
>   fusion that can be shared across different architectures.
>   * pair-fusion.cc: Generic implementation of pair_fusion class
>   defined in pair-fusion-base.h
>   * Makefile.in: Add new executable pair-fusion.o and
>   pair-fusion-common.o.
> ---
>  gcc/Makefile.in   |2 +
>  gcc/pair-fusion-base.h|  586 ++
>  gcc/pair-fusion-common.cc | 1202 
>  gcc/pair-fusion.cc| 1225 +
>  4 files changed, 3015 insertions(+)
>  create mode 100644 gcc/pair-fusion-base.h
>  create mode 100644 gcc/pair-fusion-common.cc
>  create mode 100644 gcc/pair-fusion.cc
> 
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index a74761b7ab3..df5061ddfe7 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1563,6 +1563,8 @@ OBJS = \
>   ipa-strub.o \
>   ipa.o \
>   ira.o \
> + pair-fusion-common.o \
> + pair-fusion.o \
>   ira-build.o \
>   ira-costs.o \
>   ira-conflicts.o \
> diff --git a/gcc/pair-fusion-base.h b/gcc/pair-fusion-base.h
> new file mode 100644
> index 000..fdaf4fd743d
> --- /dev/null
> +++ b/gcc/pair-fusion-base.h
> @@ -0,0 +1,586 @@
> +// Generic code for Pair MEM  fusion optimization pass.
> +// Copyright (C) 2023-2024 Free Software Foundation, Inc.
> +//
> +// This file is part of GCC.
> +//
> +// GCC is free software; you can redistribute it and/or modify it
> +// under the terms of the GNU General Public License as published by
> +// the Free Software Foundation; either version 3, or (at your option)
> +// any later version.
> +//
> +// GCC is distributed in the hope that it will be useful, but
> +// WITHOUT ANY WARRANTY; without even the implied warranty of
> +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +// General Public License for more details.
> +//
> +// You should have received a copy of the GNU General Public License
> +// along with GCC; see the file COPYING3.  If not see
> +// .
> +
> +#ifndef GCC_PAIR_FUSION_H
> +#define GCC_PAIR_FUSION_H
> +#define INCLUDE_ALGORITHM
> +#define INCLUDE_FUNCTIONAL
> +#define INCLUDE_LIST
> +#define INCLUDE_TYPE_TRAITS
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "df.h"
> +#include "rtl-iter.h"
> +#include "rtl-ssa.h"
> +#include "cfgcleanup.h"
> +#include "tree-pass.h"
> +#include "ordered-hash-map.h"
> +#include "tree-dfa.h"
> +#include "fold-const.h"
> +#include "tree-hash-traits.h"
> +#include "print-tree.h"
> +#include "insn-attr.h"
> +using namespace rtl_ssa;
> +// We pack these fields (load_p, fpsimd_p, and size) into an integer
> +// (LFS) which we use as part of the key into the main hash tables.
> +//
> +// The idea is that we group candidates together only if they agree on
> 

Re: [PATCH][GCC 12] aarch64: Avoid out-of-range shrink-wrapped saves [PR111677]

2024-02-15 Thread Alex Coplan
On 14/02/2024 11:18, Richard Sandiford wrote:
> Alex Coplan  writes:
> > This is a backport of the GCC 13 fix for PR111677 to the GCC 12 branch.
> > The only part of the patch that isn't a straight cherry-pick is due to
> > the TX iterator lacking TDmode for GCC 12, so this version adjusts
> > TX_V16QI accordingly.
> >
> > Bootstrapped/regtested on aarch64-linux-gnu, the only changes in the
> > testsuite I saw were in
> > gcc/testsuite/c-c++-common/hwasan/large-aligned-1.c where the dg-output
> > "READ of size 4 [...]" check appears to be flaky on the GCC 12 branch
> > since libhwasan gained the short granule tag feature, I've requested a
> > backport of the following patch (committed as
> > r13-100-g3771486daa1e904ceae6f3e135b28e58af33849f) which should fix that
> > (independent) issue for GCC 12:
> > https://gcc.gnu.org/pipermail/gcc-patches/2024-February/645278.html
> >
> > OK for the GCC 12 branch?
> 
> OK, thanks.

Thanks.  The patch cherry-picks cleanly on the GCC 11 branch, and
bootstraps/regtests OK there.  Is it OK for GCC 11 too, even though the
issue is latent there (at least for the testcase in the patch)?

Alex

> 
> Richard
> 
> > Thanks,
> > Alex
> >
> > -- >8 --
> >
> > The PR shows us ICEing due to an unrecognizable TFmode save emitted by
> > aarch64_process_components.  The problem is that for T{I,F,D}mode we
> > conservatively require mems to be in range for x-register ldp/stp.  That
> > is because (at least for TImode) it can be allocated to both GPRs and
> > FPRs, and in the GPR case that is an x-reg ldp/stp, and the FPR case is
> > a q-register load/store.
> >
> > As Richard pointed out in the PR, aarch64_get_separate_components
> > already checks that the offsets are suitable for a single load, so we
> > just need to choose a mode in aarch64_reg_save_mode that gives the full
> > q-register range.  In this patch, we choose V16QImode as an alternative
> > 16-byte "bag-of-bits" mode that doesn't have the artificial range
> > restrictions imposed on T{I,F,D}mode.
> >
> > Unlike for GCC 14 we need additional handling in the load/store pair
> > code as various cases are not expecting to see V16QImode (particularly
> > the writeback patterns, but also aarch64_gen_load_pair).
> >
> > gcc/ChangeLog:
> >
> > PR target/111677
> > * config/aarch64/aarch64.cc (aarch64_reg_save_mode): Use
> > V16QImode for the full 16-byte FPR saves in the vector PCS case.
> > (aarch64_gen_storewb_pair): Handle V16QImode.
> > (aarch64_gen_loadwb_pair): Likewise.
> > (aarch64_gen_load_pair): Likewise.
> > * config/aarch64/aarch64.md (loadwb_pair_):
> > Rename to ...
> > (loadwb_pair_): ... this, extending to
> > V16QImode.
> > (storewb_pair_): Rename to ...
> > (storewb_pair_): ... this, extending to
> > V16QImode.
> > * config/aarch64/iterators.md (TX_V16QI): New.
> >
> > gcc/testsuite/ChangeLog:
> >
> > PR target/111677
> > * gcc.target/aarch64/torture/pr111677.c: New test.
> >
> > (cherry picked from commit 2bd8264a131ee1215d3bc6181722f9d30f5569c3)
> > ---
> >  gcc/config/aarch64/aarch64.cc | 13 ++-
> >  gcc/config/aarch64/aarch64.md | 35 ++-
> >  gcc/config/aarch64/iterators.md   |  3 ++
> >  .../gcc.target/aarch64/torture/pr111677.c | 28 +++
> >  4 files changed, 61 insertions(+), 18 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/torture/pr111677.c
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 3bccd96a23d..2bbba323770 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -4135,7 +4135,7 @@ aarch64_reg_save_mode (unsigned int regno)
> >case ARM_PCS_SIMD:
> > /* The vector PCS saves the low 128 bits (which is the full
> >register on non-SVE targets).  */
> > -   return TFmode;
> > +   return V16QImode;
> >  
> >case ARM_PCS_SVE:
> > /* Use vectors of DImode for registers that need frame
> > @@ -8602,6 +8602,10 @@ aarch64_gen_storewb_pair (machine_mode mode, rtx 
> > base, rtx reg, rtx reg2,
> >return gen_storewb_pairtf_di (base, base, reg, reg2,
> > GEN_INT (-adjustment),
> > GEN_INT (UNITS_PER_VREG - adjustment));
> > +case E_V16QImode:
> > +  return gen_storewb_pairv16qi_di (base

[PATCH][GCC 12] aarch64: Avoid out-of-range shrink-wrapped saves [PR111677]

2024-02-12 Thread Alex Coplan
This is a backport of the GCC 13 fix for PR111677 to the GCC 12 branch.
The only part of the patch that isn't a straight cherry-pick is due to
the TX iterator lacking TDmode for GCC 12, so this version adjusts
TX_V16QI accordingly.

Bootstrapped/regtested on aarch64-linux-gnu, the only changes in the
testsuite I saw were in
gcc/testsuite/c-c++-common/hwasan/large-aligned-1.c where the dg-output
"READ of size 4 [...]" check appears to be flaky on the GCC 12 branch
since libhwasan gained the short granule tag feature, I've requested a
backport of the following patch (committed as
r13-100-g3771486daa1e904ceae6f3e135b28e58af33849f) which should fix that
(independent) issue for GCC 12:
https://gcc.gnu.org/pipermail/gcc-patches/2024-February/645278.html

OK for the GCC 12 branch?

Thanks,
Alex

-- >8 --

The PR shows us ICEing due to an unrecognizable TFmode save emitted by
aarch64_process_components.  The problem is that for T{I,F,D}mode we
conservatively require mems to be in range for x-register ldp/stp.  That
is because (at least for TImode) it can be allocated to both GPRs and
FPRs, and in the GPR case that is an x-reg ldp/stp, and the FPR case is
a q-register load/store.

As Richard pointed out in the PR, aarch64_get_separate_components
already checks that the offsets are suitable for a single load, so we
just need to choose a mode in aarch64_reg_save_mode that gives the full
q-register range.  In this patch, we choose V16QImode as an alternative
16-byte "bag-of-bits" mode that doesn't have the artificial range
restrictions imposed on T{I,F,D}mode.

Unlike for GCC 14 we need additional handling in the load/store pair
code as various cases are not expecting to see V16QImode (particularly
the writeback patterns, but also aarch64_gen_load_pair).

gcc/ChangeLog:

PR target/111677
* config/aarch64/aarch64.cc (aarch64_reg_save_mode): Use
V16QImode for the full 16-byte FPR saves in the vector PCS case.
(aarch64_gen_storewb_pair): Handle V16QImode.
(aarch64_gen_loadwb_pair): Likewise.
(aarch64_gen_load_pair): Likewise.
* config/aarch64/aarch64.md (loadwb_pair_):
Rename to ...
(loadwb_pair_): ... this, extending to
V16QImode.
(storewb_pair_): Rename to ...
(storewb_pair_): ... this, extending to
V16QImode.
* config/aarch64/iterators.md (TX_V16QI): New.

gcc/testsuite/ChangeLog:

PR target/111677
* gcc.target/aarch64/torture/pr111677.c: New test.

(cherry picked from commit 2bd8264a131ee1215d3bc6181722f9d30f5569c3)
---
 gcc/config/aarch64/aarch64.cc | 13 ++-
 gcc/config/aarch64/aarch64.md | 35 ++-
 gcc/config/aarch64/iterators.md   |  3 ++
 .../gcc.target/aarch64/torture/pr111677.c | 28 +++
 4 files changed, 61 insertions(+), 18 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/torture/pr111677.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 3bccd96a23d..2bbba323770 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -4135,7 +4135,7 @@ aarch64_reg_save_mode (unsigned int regno)
   case ARM_PCS_SIMD:
 	/* The vector PCS saves the low 128 bits (which is the full
 	   register on non-SVE targets).  */
-	return TFmode;
+	return V16QImode;
 
   case ARM_PCS_SVE:
 	/* Use vectors of DImode for registers that need frame
@@ -8602,6 +8602,10 @@ aarch64_gen_storewb_pair (machine_mode mode, rtx base, rtx reg, rtx reg2,
   return gen_storewb_pairtf_di (base, base, reg, reg2,
 GEN_INT (-adjustment),
 GEN_INT (UNITS_PER_VREG - adjustment));
+case E_V16QImode:
+  return gen_storewb_pairv16qi_di (base, base, reg, reg2,
+   GEN_INT (-adjustment),
+   GEN_INT (UNITS_PER_VREG - adjustment));
 default:
   gcc_unreachable ();
 }
@@ -8647,6 +8651,10 @@ aarch64_gen_loadwb_pair (machine_mode mode, rtx base, rtx reg, rtx reg2,
 case E_TFmode:
   return gen_loadwb_pairtf_di (base, base, reg, reg2, GEN_INT (adjustment),
    GEN_INT (UNITS_PER_VREG));
+case E_V16QImode:
+  return gen_loadwb_pairv16qi_di (base, base, reg, reg2,
+  GEN_INT (adjustment),
+  GEN_INT (UNITS_PER_VREG));
 default:
   gcc_unreachable ();
 }
@@ -8730,6 +8738,9 @@ aarch64_gen_load_pair (machine_mode mode, rtx reg1, rtx mem1, rtx reg2,
 case E_V4SImode:
   return gen_load_pairv4siv4si (reg1, mem1, reg2, mem2);
 
+case E_V16QImode:
+  return gen_load_pairv16qiv16qi (reg1, mem1, reg2, mem2);
+
 default:
   gcc_unreachable ();
 }
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index fb100bdf6b3..99f185718c9 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1874,17 +1874,18 @@ (define_insn "loadwb_pair_"
   [(set_attr "type" "neon_load1_2reg")]
 )
 
-(define_insn "loadwb_pair_"
+(define_insn 

Re: [PATCH][PUSHED] hwasan: support new dg-output format.

2024-02-09 Thread Alex Coplan
Hi,

On 04/05/2022 09:59, Martin Liška wrote:
> Supports change in libsanitizer where it newly reports:
> READ of size 4 at 0xc3d4 tags: 02/01(00) (ptr/mem) in thread T0
> 
> So the 'tags' contains now 3 entries compared to 2 entries.
> 
> gcc/testsuite/ChangeLog:
> 
>   * c-c++-common/hwasan/alloca-outside-caught.c: Update dg-output.
>   * c-c++-common/hwasan/heap-overflow.c: Likewise.
>   * c-c++-common/hwasan/hwasan-thread-access-parent.c: Likewise.
>   * c-c++-common/hwasan/large-aligned-1.c: Likewise.

I noticed the above test (large-aligned-1.c) failing on the GCC 12
branch due to the change in output format mentioned above.  This patch
(committed as r13-100-g3771486daa1e904ceae6f3e135b28e58af33849f) seems
to apply cleanly on the GCC 12 branch too, is it OK to backport to GCC 12?

Thanks,
Alex

>   * c-c++-common/hwasan/stack-tagging-basic-1.c: Likewise.
> ---
>  gcc/testsuite/c-c++-common/hwasan/alloca-outside-caught.c   | 2 +-
>  gcc/testsuite/c-c++-common/hwasan/heap-overflow.c   | 2 +-
>  gcc/testsuite/c-c++-common/hwasan/hwasan-thread-access-parent.c | 2 +-
>  gcc/testsuite/c-c++-common/hwasan/large-aligned-1.c | 2 +-
>  gcc/testsuite/c-c++-common/hwasan/stack-tagging-basic-1.c   | 2 +-
>  5 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/gcc/testsuite/c-c++-common/hwasan/alloca-outside-caught.c 
> b/gcc/testsuite/c-c++-common/hwasan/alloca-outside-caught.c
> index 60d7a9a874f..6f3825bee7c 100644
> --- a/gcc/testsuite/c-c++-common/hwasan/alloca-outside-caught.c
> +++ b/gcc/testsuite/c-c++-common/hwasan/alloca-outside-caught.c
> @@ -20,6 +20,6 @@ main ()
>  }
>  
>  /* { dg-output "HWAddressSanitizer: tag-mismatch on address 0x\[0-9a-f\]*.*" 
> } */
> -/* { dg-output "READ of size 4 at 0x\[0-9a-f\]* tags: 
> \[\[:xdigit:\]\]\[\[:xdigit:\]\]/00 \\(ptr/mem\\) in thread T0.*" } */
> +/* { dg-output "READ of size 4 at 0x\[0-9a-f\]* tags: 
> \[\[:xdigit:\]\]\[\[:xdigit:\]\]/00.* \\(ptr/mem\\) in thread T0.*" } */
>  /* { dg-output "Address 0x\[0-9a-f\]* is located in stack of thread T0.*" } 
> */
>  /* { dg-output "SUMMARY: HWAddressSanitizer: tag-mismatch \[^\n\]*.*" } */
> diff --git a/gcc/testsuite/c-c++-common/hwasan/heap-overflow.c 
> b/gcc/testsuite/c-c++-common/hwasan/heap-overflow.c
> index 137466800de..bddb38c81f1 100644
> --- a/gcc/testsuite/c-c++-common/hwasan/heap-overflow.c
> +++ b/gcc/testsuite/c-c++-common/hwasan/heap-overflow.c
> @@ -23,7 +23,7 @@ int main(int argc, char **argv) {
>  }
>  
>  /* { dg-output "HWAddressSanitizer: tag-mismatch on address 0x\[0-9a-f\]*.*" 
> } */
> -/* { dg-output "READ of size 1 at 0x\[0-9a-f\]* tags: 
> \[\[:xdigit:\]\]\[\[:xdigit:\]\]/\[\[:xdigit:\]\]\[\[:xdigit:\]\] 
> \\(ptr/mem\\) in thread T0.*" } */
> +/* { dg-output "READ of size 1 at 0x\[0-9a-f\]* tags: 
> \[\[:xdigit:\]\]\[\[:xdigit:\]\]/\[\[:xdigit:\]\]\[\[:xdigit:\]\].* 
> \\(ptr/mem\\) in thread T0.*" } */
>  /* { dg-output "located 0 bytes to the right of 10-byte region.*" } */
>  /* { dg-output "allocated here:.*" } */
>  /* { dg-output "#1 0x\[0-9a-f\]+ +in _*main \[^\n\r]*heap-overflow.c:18" } */
> diff --git a/gcc/testsuite/c-c++-common/hwasan/hwasan-thread-access-parent.c 
> b/gcc/testsuite/c-c++-common/hwasan/hwasan-thread-access-parent.c
> index 828909d3b3b..eca27c8cd2c 100644
> --- a/gcc/testsuite/c-c++-common/hwasan/hwasan-thread-access-parent.c
> +++ b/gcc/testsuite/c-c++-common/hwasan/hwasan-thread-access-parent.c
> @@ -46,6 +46,6 @@ main (int argc, char **argv)
>  }
>  
>  /* { dg-output "HWAddressSanitizer: tag-mismatch on address 0x\[0-9a-f\]*.*" 
> } */
> -/* { dg-output "READ of size 4 at 0x\[0-9a-f\]* tags: 
> 00/\[\[:xdigit:\]\]\[\[:xdigit:\]\] \\(ptr/mem\\) in thread T1.*" } */
> +/* { dg-output "READ of size 4 at 0x\[0-9a-f\]* tags: 
> 00/\[\[:xdigit:\]\]\[\[:xdigit:\]\].* \\(ptr/mem\\) in thread T1.*" } */
>  /* { dg-output "Address 0x\[0-9a-f\]* is located in stack of thread T0.*" } 
> */
>  /* { dg-output "SUMMARY: HWAddressSanitizer: tag-mismatch \[^\n\]*.*" } */
> diff --git a/gcc/testsuite/c-c++-common/hwasan/large-aligned-1.c 
> b/gcc/testsuite/c-c++-common/hwasan/large-aligned-1.c
> index 1aa13032396..6158ba4bdbc 100644
> --- a/gcc/testsuite/c-c++-common/hwasan/large-aligned-1.c
> +++ b/gcc/testsuite/c-c++-common/hwasan/large-aligned-1.c
> @@ -9,6 +9,6 @@
>  /* { dg-output "HWAddressSanitizer: tag-mismatch on address 0x\[0-9a-f\]*.*" 
> } */
>  /* NOTE: This assumes the current tagging mechanism (one at a time from the
> base and large aligned variables being handled first).  */
> -/* { dg-output "READ of size 4 at 0x\[0-9a-f\]* tags: 
> \[\[:xdigit:\]\]\[\[:xdigit:\]\]/\[\[:xdigit:\]\]\[\[:xdigit:\]\] 
> \\(ptr/mem\\) in thread T0.*" } */
> +/* { dg-output "READ of size 4 at 0x\[0-9a-f\]* tags: 
> \[\[:xdigit:\]\]\[\[:xdigit:\]\]/\[\[:xdigit:\]\]\[\[:xdigit:\]\].* 
> \\(ptr/mem\\) in thread T0.*" } */
>  /* { dg-output "Address 0x\[0-9a-f\]* is located in stack of thread T0.*" } 
> 

Re: [PATCH] c++: Don't advertise cxx_constexpr_string_builtins [PR113658]

2024-02-02 Thread Alex Coplan
On 02/02/2024 09:34, Marek Polacek wrote:
> On Fri, Feb 02, 2024 at 10:27:23AM +0000, Alex Coplan wrote:
> > Bootstrapped/regtested on x86_64-apple-darwin, OK for trunk?
> > 
> > Thanks,
> > Alex
> > 
> > -- >8 --
> > 
> > When __has_feature was introduced for GCC 14, I included the feature
> > cxx_constexpr_string_builtins, since of the relevant string builtins
> > that GCC implements, it seems to support constexpr evaluation of those
> > builtins.
> > 
> > However, as the PR shows, GCC doesn't implement the full list of
> > builtins in the clang documentation.  After enumerating the builtins,
> > the clang docs [1] say:
> > 
> > > Support for constant expression evaluation for the above builtins can
> > > be detected with __has_feature(cxx_constexpr_string_builtins).
> > 
> > and a strict reading of this would suggest we can't really support
> > constexpr evaluation of a builtin if we don't implement the builtin in
> > the first place.
> > 
> > So the conservatively correct thing to do seems to be to stop
> > advertising the feature altogether to avoid failing to build code which
> > assumes the presence of this feature implies the presence of all the
> > builtins listed in the clang documentation.
> > 
> > [1] : https://clang.llvm.org/docs/LanguageExtensions.html#string-builtins
> > 
> > gcc/cp/ChangeLog:
> > 
> > PR c++/113658
> > * cp-objcp-common.cc (cp_feature_table): Remove entry for
> > cxx_constexpr_string_builtins.
> > 
> > gcc/testsuite/ChangeLog:
> > 
> > PR c++/113658
> > * g++.dg/ext/pr113658.C: New test.
> 
> > diff --git a/gcc/cp/cp-objcp-common.cc b/gcc/cp/cp-objcp-common.cc
> > index f06edf04ef0..85dde0459fa 100644
> > --- a/gcc/cp/cp-objcp-common.cc
> > +++ b/gcc/cp/cp-objcp-common.cc
> > @@ -110,7 +110,6 @@ static constexpr cp_feature_info cp_feature_table[] =
> >{ "cxx_alignof", cxx11 },
> >{ "cxx_attributes", cxx11 },
> >{ "cxx_constexpr", cxx11 },
> > -  { "cxx_constexpr_string_builtins", cxx11 },
> >{ "cxx_decltype", cxx11 },
> >{ "cxx_decltype_incomplete_return_types", cxx11 },
> >{ "cxx_default_function_template_args", cxx11 },
> > diff --git a/gcc/testsuite/g++.dg/ext/pr113658.C 
> > b/gcc/testsuite/g++.dg/ext/pr113658.C
> > new file mode 100644
> > index 000..f4a34888f28
> > --- /dev/null
> > +++ b/gcc/testsuite/g++.dg/ext/pr113658.C
> 
> Might be better to name this has-feature2.C
> 
> > @@ -0,0 +1,13 @@
> 
> Please include
> // PR c++/113658

Can do.

> 
> > +// { dg-do compile }
> > +// { dg-options "" }
> 
> Why dg-options ""?  It doesn't seem to have any purpose here.

That is to disable -pedantic-errors which IIRC is added by default in
the testsuite options.

With -pedantic-errors we would have __has_extension behaving like
__has_feature, and I wanted to check in the test that this doesn't get
reported as a feature or extension.

Incidentally it also means we don't have to provide a dummy declaration,
with -pedantic-errors we would get a warning about an empty TU which
would make the test fail.

Thanks,
Alex

> 
> > +// PR113658: we shouldn't declare support for 
> > cxx_constexpr_string_builtins as
> > +// GCC is missing some of the builtins that clang implements.
> > +
> > +#if __has_feature (cxx_constexpr_string_builtins)
> > +#error
> > +#endif
> > +
> > +#if __has_extension (cxx_constexpr_string_builtins)
> > +#error
> > +#endif
> 
> 
> Marek
> 


[PATCH] c++: Don't advertise cxx_constexpr_string_builtins [PR113658]

2024-02-02 Thread Alex Coplan
Bootstrapped/regtested on x86_64-apple-darwin, OK for trunk?

Thanks,
Alex

-- >8 --

When __has_feature was introduced for GCC 14, I included the feature
cxx_constexpr_string_builtins, since of the relevant string builtins
that GCC implements, it seems to support constexpr evaluation of those
builtins.

However, as the PR shows, GCC doesn't implement the full list of
builtins in the clang documentation.  After enumerating the builtins,
the clang docs [1] say:

> Support for constant expression evaluation for the above builtins can
> be detected with __has_feature(cxx_constexpr_string_builtins).

and a strict reading of this would suggest we can't really support
constexpr evaluation of a builtin if we don't implement the builtin in
the first place.

So the conservatively correct thing to do seems to be to stop
advertising the feature altogether to avoid failing to build code which
assumes the presence of this feature implies the presence of all the
builtins listed in the clang documentation.

[1] : https://clang.llvm.org/docs/LanguageExtensions.html#string-builtins

gcc/cp/ChangeLog:

PR c++/113658
* cp-objcp-common.cc (cp_feature_table): Remove entry for
cxx_constexpr_string_builtins.

gcc/testsuite/ChangeLog:

PR c++/113658
* g++.dg/ext/pr113658.C: New test.
diff --git a/gcc/cp/cp-objcp-common.cc b/gcc/cp/cp-objcp-common.cc
index f06edf04ef0..85dde0459fa 100644
--- a/gcc/cp/cp-objcp-common.cc
+++ b/gcc/cp/cp-objcp-common.cc
@@ -110,7 +110,6 @@ static constexpr cp_feature_info cp_feature_table[] =
   { "cxx_alignof", cxx11 },
   { "cxx_attributes", cxx11 },
   { "cxx_constexpr", cxx11 },
-  { "cxx_constexpr_string_builtins", cxx11 },
   { "cxx_decltype", cxx11 },
   { "cxx_decltype_incomplete_return_types", cxx11 },
   { "cxx_default_function_template_args", cxx11 },
diff --git a/gcc/testsuite/g++.dg/ext/pr113658.C 
b/gcc/testsuite/g++.dg/ext/pr113658.C
new file mode 100644
index 000..f4a34888f28
--- /dev/null
+++ b/gcc/testsuite/g++.dg/ext/pr113658.C
@@ -0,0 +1,13 @@
+// { dg-do compile }
+// { dg-options "" }
+
+// PR113658: we shouldn't declare support for cxx_constexpr_string_builtins as
+// GCC is missing some of the builtins that clang implements.
+
+#if __has_feature (cxx_constexpr_string_builtins)
+#error
+#endif
+
+#if __has_extension (cxx_constexpr_string_builtins)
+#error
+#endif


Re: [PATCH v2] c++: avoid -Wdangling-reference for std::span-like classes [PR110358]

2024-02-01 Thread Alex Coplan
On 31/01/2024 15:53, Marek Polacek wrote:
> On Wed, Jan 31, 2024 at 07:44:41PM +0000, Alex Coplan wrote:
> > Hi Marek,
> > 
> > On 30/01/2024 13:15, Marek Polacek wrote:
> > > On Thu, Jan 25, 2024 at 10:13:10PM -0500, Jason Merrill wrote:
> > > > On 1/25/24 20:36, Marek Polacek wrote:
> > > > > Better version:
> > > > > 
> > > > > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
> > > > > 
> > > > > -- >8 --
> > > > > Real-world experience shows that -Wdangling-reference triggers for
> > > > > user-defined std::span-like classes a lot.  We can easily avoid that
> > > > > by considering classes like
> > > > > 
> > > > >  template <typename T>
> > > > >  struct Span {
> > > > >T* data_;
> > > > >std::size len_;
> > > > >  };
> > > > > 
> > > > > to be std::span-like, and not warning for them.  Unlike the previous
> > > > > patch, this one considers a non-union class template that has a 
> > > > > pointer
> > > > > data member and a trivial destructor as std::span-like.
> > > > > 
> > > > >   PR c++/110358
> > > > >   PR c++/109640
> > > > > 
> > > > > gcc/cp/ChangeLog:
> > > > > 
> > > > >   * call.cc (reference_like_class_p): Don't warn for 
> > > > > std::span-like
> > > > >   classes.
> > > > > 
> > > > > gcc/ChangeLog:
> > > > > 
> > > > >   * doc/invoke.texi: Update -Wdangling-reference description.
> > > > > 
> > > > > gcc/testsuite/ChangeLog:
> > > > > 
> > > > >   * g++.dg/warn/Wdangling-reference18.C: New test.
> > > > >   * g++.dg/warn/Wdangling-reference19.C: New test.
> > > > >   * g++.dg/warn/Wdangling-reference20.C: New test.
> > > > > ---
> > > > >   gcc/cp/call.cc| 18 
> > > > >   gcc/doc/invoke.texi   | 14 +++
> > > > >   .../g++.dg/warn/Wdangling-reference18.C   | 24 +++
> > > > >   .../g++.dg/warn/Wdangling-reference19.C   | 25 +++
> > > > >   .../g++.dg/warn/Wdangling-reference20.C   | 42 
> > > > > +++
> > > > >   5 files changed, 123 insertions(+)
> > > > >   create mode 100644 gcc/testsuite/g++.dg/warn/Wdangling-reference18.C
> > > > >   create mode 100644 gcc/testsuite/g++.dg/warn/Wdangling-reference19.C
> > > > >   create mode 100644 gcc/testsuite/g++.dg/warn/Wdangling-reference20.C
> > > > > 
> > > > > diff --git a/gcc/cp/call.cc b/gcc/cp/call.cc
> > > > > index 9de0d77c423..afd3e1ff024 100644
> > > > > --- a/gcc/cp/call.cc
> > > > > +++ b/gcc/cp/call.cc
> > > > > @@ -14082,6 +14082,24 @@ reference_like_class_p (tree ctype)
> > > > >   return true;
> > > > >   }
> > > > > +  /* Avoid warning if CTYPE looks like std::span: it's a class 
> > > > > template,
> > > > > + has a T* member, and a trivial destructor.  For example,
> > > > > +
> > > > > +  template <typename T>
> > > > > +  struct Span {
> > > > > + T* data_;
> > > > > + std::size len_;
> > > > > +  };
> > > > > +
> > > > > + is considered std::span-like.  */
> > > > > +  if (NON_UNION_CLASS_TYPE_P (ctype)
> > > > > +  && CLASSTYPE_TEMPLATE_INSTANTIATION (ctype)
> > > > > +  && TYPE_HAS_TRIVIAL_DESTRUCTOR (ctype))
> > > > > +for (tree field = next_aggregate_field (TYPE_FIELDS (ctype));
> > > > > +  field; field = next_aggregate_field (DECL_CHAIN (field)))
> > > > > +  if (TYPE_PTR_P (TREE_TYPE (field)))
> > > > > + return true;
> > > > > +
> > > > > /* Some classes, such as std::tuple, have the reference member in 
> > > > > its
> > > > >(non-direct) base class.  */
> > > > > if (dfs_walk_once (TYPE_BINFO (ctype), 
> > > > > class_has_reference_member_p_r,
> > > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi

Re: [PATCH v2] c++: avoid -Wdangling-reference for std::span-like classes [PR110358]

2024-01-31 Thread Alex Coplan
Hi Marek,

On 30/01/2024 13:15, Marek Polacek wrote:
> On Thu, Jan 25, 2024 at 10:13:10PM -0500, Jason Merrill wrote:
> > On 1/25/24 20:36, Marek Polacek wrote:
> > > Better version:
> > > 
> > > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
> > > 
> > > -- >8 --
> > > Real-world experience shows that -Wdangling-reference triggers for
> > > user-defined std::span-like classes a lot.  We can easily avoid that
> > > by considering classes like
> > > 
> > >  template <typename T>
> > >  struct Span {
> > >T* data_;
> > >std::size len_;
> > >  };
> > > 
> > > to be std::span-like, and not warning for them.  Unlike the previous
> > > patch, this one considers a non-union class template that has a pointer
> > > data member and a trivial destructor as std::span-like.
> > > 
> > >   PR c++/110358
> > >   PR c++/109640
> > > 
> > > gcc/cp/ChangeLog:
> > > 
> > >   * call.cc (reference_like_class_p): Don't warn for std::span-like
> > >   classes.
> > > 
> > > gcc/ChangeLog:
> > > 
> > >   * doc/invoke.texi: Update -Wdangling-reference description.
> > > 
> > > gcc/testsuite/ChangeLog:
> > > 
> > >   * g++.dg/warn/Wdangling-reference18.C: New test.
> > >   * g++.dg/warn/Wdangling-reference19.C: New test.
> > >   * g++.dg/warn/Wdangling-reference20.C: New test.
> > > ---
> > >   gcc/cp/call.cc| 18 
> > >   gcc/doc/invoke.texi   | 14 +++
> > >   .../g++.dg/warn/Wdangling-reference18.C   | 24 +++
> > >   .../g++.dg/warn/Wdangling-reference19.C   | 25 +++
> > >   .../g++.dg/warn/Wdangling-reference20.C   | 42 +++
> > >   5 files changed, 123 insertions(+)
> > >   create mode 100644 gcc/testsuite/g++.dg/warn/Wdangling-reference18.C
> > >   create mode 100644 gcc/testsuite/g++.dg/warn/Wdangling-reference19.C
> > >   create mode 100644 gcc/testsuite/g++.dg/warn/Wdangling-reference20.C
> > > 
> > > diff --git a/gcc/cp/call.cc b/gcc/cp/call.cc
> > > index 9de0d77c423..afd3e1ff024 100644
> > > --- a/gcc/cp/call.cc
> > > +++ b/gcc/cp/call.cc
> > > @@ -14082,6 +14082,24 @@ reference_like_class_p (tree ctype)
> > >   return true;
> > >   }
> > > +  /* Avoid warning if CTYPE looks like std::span: it's a class template,
> > > + has a T* member, and a trivial destructor.  For example,
> > > +
> > > +  template <typename T>
> > > +  struct Span {
> > > + T* data_;
> > > + std::size len_;
> > > +  };
> > > +
> > > + is considered std::span-like.  */
> > > +  if (NON_UNION_CLASS_TYPE_P (ctype)
> > > +  && CLASSTYPE_TEMPLATE_INSTANTIATION (ctype)
> > > +  && TYPE_HAS_TRIVIAL_DESTRUCTOR (ctype))
> > > +for (tree field = next_aggregate_field (TYPE_FIELDS (ctype));
> > > +  field; field = next_aggregate_field (DECL_CHAIN (field)))
> > > +  if (TYPE_PTR_P (TREE_TYPE (field)))
> > > + return true;
> > > +
> > > /* Some classes, such as std::tuple, have the reference member in its
> > >(non-direct) base class.  */
> > > if (dfs_walk_once (TYPE_BINFO (ctype), class_has_reference_member_p_r,
> > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > index 6ec56493e59..e0ff18a86f5 100644
> > > --- a/gcc/doc/invoke.texi
> > > +++ b/gcc/doc/invoke.texi
> > > @@ -3916,6 +3916,20 @@ where @code{std::minmax} returns 
> > > @code{std::pair}, and
> > >   both references dangle after the end of the full expression that 
> > > contains
> > >   the call to @code{std::minmax}.
> > > +The warning does not warn for @code{std::span}-like classes.  We consider
> > > +classes of the form:
> > > +
> > > +@smallexample
> > > +template <typename T>
> > > +struct Span @{
> > > +  T* data_;
> > > +  std::size len_;
> > > +@};
> > > +@end smallexample
> > > +
> > > +as @code{std::span}-like; that is, the class is a non-union class 
> > > template
> > > +that has a pointer data member and a trivial destructor.
> > > +
> > >   This warning is enabled by @option{-Wall}.
> > >   @opindex Wdelete-non-virtual-dtor
> > > diff --git a/gcc/testsuite/g++.dg/warn/Wdangling-reference18.C 
> > > b/gcc/testsuite/g++.dg/warn/Wdangling-reference18.C
> > > new file mode 100644
> > > index 000..e088c177769
> > > --- /dev/null
> > > +++ b/gcc/testsuite/g++.dg/warn/Wdangling-reference18.C
> > > @@ -0,0 +1,24 @@
> > > +// PR c++/110358
> > > +// { dg-do compile { target c++11 } }
> > > +// { dg-options "-Wdangling-reference" }
> > > +// Don't warn for std::span-like classes.
> > > +
> > > +template <typename T>
> > > +struct Span {
> > > +T* data_;
> > > +int len_;
> > > +
> > > +[[nodiscard]] constexpr auto operator[](int n) const noexcept -> T& 
> > > { return data_[n]; }
> > > +[[nodiscard]] constexpr auto front() const noexcept -> T& { return 
> > > data_[0]; }
> > > +[[nodiscard]] constexpr auto back() const noexcept -> T& { return 
> > > data_[len_ - 1]; }
> > > +};
> > > +
> > > +auto get() -> Span<int>;
> > > +
> > > +auto f() -> int {
> > > +int const& a = get().front(); // { 

[PATCH][GCC 13] aarch64: Avoid out-of-range shrink-wrapped saves [PR111677]

2024-01-30 Thread Alex Coplan
Bootstrapped/regtested on aarch64-linux-gnu, OK for the 13 branch after
a week of the trunk fix being in?  OK for the other active branches if
the same changes test cleanly there?

GCC 14 patch for reference:
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/61.html

Thanks,
Alex

-- >8 --

The PR shows us ICEing due to an unrecognizable TFmode save emitted by
aarch64_process_components.  The problem is that for T{I,F,D}mode we
conservatively require mems to be in range for x-register ldp/stp.  That
is because (at least for TImode) it can be allocated to both GPRs and
FPRs, and in the GPR case that is an x-reg ldp/stp, and the FPR case is
a q-register load/store.

As Richard pointed out in the PR, aarch64_get_separate_components
already checks that the offsets are suitable for a single load, so we
just need to choose a mode in aarch64_reg_save_mode that gives the full
q-register range.  In this patch, we choose V16QImode as an alternative
16-byte "bag-of-bits" mode that doesn't have the artificial range
restrictions imposed on T{I,F,D}mode.

For T{F,D}mode in GCC 15 I think we could consider relaxing the
restriction imposed in aarch64_classify_address, as AFAIK T{F,D}mode can
only be allocated to FPRs (unlike TImode).  But such a change seems too
invasive to consider for GCC 14 at this stage (let alone backports).

Unlike for GCC 14 we need additional handling in the load/store pair
code as various cases are not expecting to see V16QImode (particularly
the writeback patterns, but also aarch64_gen_load_pair).

gcc/ChangeLog:

PR target/111677
* config/aarch64/aarch64.cc (aarch64_reg_save_mode): Use
V16QImode for the full 16-byte FPR saves in the vector PCS case.
(aarch64_gen_storewb_pair): Handle V16QImode.
(aarch64_gen_loadwb_pair): Likewise.
(aarch64_gen_load_pair): Likewise.
* config/aarch64/aarch64.md (loadwb_pair<TX:mode>_<P:mode>):
Rename to ...
(loadwb_pair<TX_V16QI:mode>_<P:mode>): ... this, extending to
V16QImode.
(storewb_pair<TX:mode>_<P:mode>): Rename to ...
(storewb_pair<TX_V16QI:mode>_<P:mode>): ... this, extending to
V16QImode.
* config/aarch64/iterators.md (TX_V16QI): New.

gcc/testsuite/ChangeLog:

PR target/111677
* gcc.target/aarch64/torture/pr111677.c: New test.
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 02515d4683a..f546c48ae2d 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -4074,7 +4074,7 @@ aarch64_reg_save_mode (unsigned int regno)
   case ARM_PCS_SIMD:
/* The vector PCS saves the low 128 bits (which is the full
   register on non-SVE targets).  */
-   return TFmode;
+   return V16QImode;
 
   case ARM_PCS_SVE:
/* Use vectors of DImode for registers that need frame
@@ -8863,6 +8863,10 @@ aarch64_gen_storewb_pair (machine_mode mode, rtx base, 
rtx reg, rtx reg2,
   return gen_storewb_pairtf_di (base, base, reg, reg2,
GEN_INT (-adjustment),
GEN_INT (UNITS_PER_VREG - adjustment));
+case E_V16QImode:
+  return gen_storewb_pairv16qi_di (base, base, reg, reg2,
+  GEN_INT (-adjustment),
+  GEN_INT (UNITS_PER_VREG - adjustment));
 default:
   gcc_unreachable ();
 }
@@ -8908,6 +8912,10 @@ aarch64_gen_loadwb_pair (machine_mode mode, rtx base, 
rtx reg, rtx reg2,
 case E_TFmode:
   return gen_loadwb_pairtf_di (base, base, reg, reg2, GEN_INT (adjustment),
   GEN_INT (UNITS_PER_VREG));
+case E_V16QImode:
+  return gen_loadwb_pairv16qi_di (base, base, reg, reg2,
+ GEN_INT (adjustment),
+ GEN_INT (UNITS_PER_VREG));
 default:
   gcc_unreachable ();
 }
@@ -8991,6 +8999,9 @@ aarch64_gen_load_pair (machine_mode mode, rtx reg1, rtx 
mem1, rtx reg2,
 case E_V4SImode:
   return gen_load_pairv4siv4si (reg1, mem1, reg2, mem2);
 
+case E_V16QImode:
+  return gen_load_pairv16qiv16qi (reg1, mem1, reg2, mem2);
+
 default:
   gcc_unreachable ();
 }
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 50239d72fc0..922cc987595 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1896,17 +1896,18 @@ (define_insn "loadwb_pair_"
   [(set_attr "type" "neon_load1_2reg")]
 )
 
-(define_insn "loadwb_pair_"
+(define_insn "loadwb_pair_"
   [(parallel
 [(set (match_operand:P 0 "register_operand" "=k")
-  (plus:P (match_operand:P 1 "register_operand" "0")
-  (match_operand:P 4 "aarch64_mem_pair_offset" "n")))
- (set (match_operand:TX 2 "register_operand" "=w")
-  (mem:TX (match_dup 1)))
- (set (match_operand:TX 3 "register_operand" "=w")
-  (mem:TX (plus:P (match_dup 1)
+ (plus:P (match_operand:P 1 "register_operand" 

[PATCH] aarch64: Avoid out-of-range shrink-wrapped saves [PR111677]

2024-01-30 Thread Alex Coplan
Hi,

The PR shows us ICEing due to an unrecognizable TFmode save emitted by
aarch64_process_components.  The problem is that for T{I,F,D}mode we
conservatively require mems to be in range for x-register ldp/stp.  That
is because (at least for TImode) it can be allocated to both GPRs and
FPRs, and in the GPR case that is an x-reg ldp/stp, and the FPR case is
a q-register load/store.

As Richard pointed out in the PR, aarch64_get_separate_components
already checks that the offsets are suitable for a single load, so we
just need to choose a mode in aarch64_reg_save_mode that gives the full
q-register range.  In this patch, we choose V16QImode as an alternative
16-byte "bag-of-bits" mode that doesn't have the artificial range
restrictions imposed on T{I,F,D}mode.

For T{F,D}mode in GCC 15 I think we could consider relaxing the
restriction imposed in aarch64_classify_address, as AFAIK T{F,D}mode can
only be allocated to FPRs (unlike TImode).  But such a change seems too
invasive to consider for GCC 14 at this stage (let alone backports).

Fortunately the new flexible load/store pair patterns in GCC 14 allow
this mode change to work without further changes.  The backports are
more involved as we need to adjust the load/store pair handling to cater
for V16QImode in a few places.

Note that for the testcase we are relying on the torture options to add
-funroll-loops at -O3 which is necessary to trigger the ICE on trunk
(but not on the 13 branch).

Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

PR target/111677
* config/aarch64/aarch64.cc (aarch64_reg_save_mode): Use
V16QImode for the full 16-byte FPR saves in the vector PCS case.

gcc/testsuite/ChangeLog:

PR target/111677
* gcc.target/aarch64/torture/pr111677.c: New test.
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index a37d47b243e..4556b8dd504 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -2361,7 +2361,7 @@ aarch64_reg_save_mode (unsigned int regno)
   case ARM_PCS_SIMD:
/* The vector PCS saves the low 128 bits (which is the full
   register on non-SVE targets).  */
-   return TFmode;
+   return V16QImode;
 
   case ARM_PCS_SVE:
/* Use vectors of DImode for registers that need frame
diff --git a/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c 
b/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c
new file mode 100644
index 000..6bb640c42c0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target fopenmp } */
+/* { dg-options "-ffast-math -fstack-protector-strong -fopenmp" } */
+typedef struct {
+  long size_z;
+  int width;
+} dt_bilateral_t;
+typedef float dt_aligned_pixel_t[4];
+#pragma omp declare simd
+void dt_bilateral_splat(dt_bilateral_t *b) {
+  float *buf;
+  long offsets[8];
+  for (; b;) {
+int firstrow;
+for (int j = firstrow; j; j++)
+  for (int i; i < b->width; i++) {
+dt_aligned_pixel_t contrib;
+for (int k = 0; k < 4; k++)
+  buf[offsets[k]] += contrib[k];
+  }
+float *dest;
+for (int j = (long)b; j; j++) {
+  float *src = (float *)b->size_z;
+  for (int i = 0; i < (long)b; i++)
+dest[i] += src[i];
+}
+  }
+}


[PATCH] aarch64: Ensure iterator validity when updating debug uses [PR113616]

2024-01-29 Thread Alex Coplan
Hi,

The fix for PR113089 introduced range-based for loops over the
debug_insn_uses of an RTL-SSA set_info, but in the case that we reset a
debug insn, the use would get removed from the use list, and thus we
would end up using an invalidated iterator in the next iteration of the
loop.  In practice this means we end up terminating the loop
prematurely, and hence ICE as in PR113089 since there are debug uses
that we failed to fix up.

This patch fixes that by introducing a general mechanism to avoid this
sort of problem.  We introduce a safe_iterator to iterator-utils.h which
wraps an iterator, and also holds the end iterator value.  It then
pre-computes the next iterator value at all iterations, so it doesn't
matter if the original iterator got invalidated during the loop body, we
can still move safely to the next iteration.

We introduce an iterate_safely helper which effectively adapts a
container such as iterator_range into a container of safe_iterators over
the original iterator type.

We then use iterate_safely around all loops over debug_insn_uses () in
the aarch64 ldp/stp pass to fix PR113616.  While doing this, I
remembered that cleanup_tombstones () had the same problem.  I
previously worked around this locally by manually maintaining the next
nondebug insn, so this patch also refactors that loop to use the new
iterate_safely helper.

While doing that I noticed that a couple of cases in cleanup_tombstones
could be converted from using dyn_cast <set_info *> to as_a <set_info *>,
which should be safe because there are no clobbers of mem in RTL-SSA, so
all defs of memory should be set_infos.

Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

PR target/113616
* config/aarch64/aarch64-ldp-fusion.cc (fixup_debug_uses_trailing_add):
Use iterate_safely when iterating over debug uses.
(fixup_debug_uses): Likewise.
(ldp_bb_info::cleanup_tombstones): Use iterate_safely to iterate
over nondebug insns instead of manually maintaining the next insn.
* iterator-utils.h (class safe_iterator): New.
(iterate_safely): New.

gcc/testsuite/ChangeLog:

PR target/113616
* gcc.c-torture/compile/pr113616.c: New test.
diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 932a6398ae3..22ed95eb743 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -1480,7 +1480,7 @@ fixup_debug_uses_trailing_add (obstack_watermark &attempt,
   def_info *def = defs[0];
 
   if (auto set = safe_dyn_cast <set_info *> (def->prev_def ()))
-for (auto use : set->debug_insn_uses ())
+for (auto use : iterate_safely (set->debug_insn_uses ()))
   if (*use->insn () > *pair_dst)
// DEF is getting re-ordered above USE, fix up USE accordingly.
fixup_debug_use (attempt, use, def, base, wb_offset);
@@ -1544,13 +1544,16 @@ fixup_debug_uses (obstack_watermark &attempt,
   auto def = memory_access (insns[0]->defs ());
   auto last_def = memory_access (insns[1]->defs ());
   for (; def != last_def; def = def->next_def ())
-   for (auto use : as_a <set_info *> (def)->debug_insn_uses ())
- {
-   if (dump_file)
- fprintf (dump_file, "  i%d: resetting debug use of mem\n",
-  use->insn ()->uid ());
-   reset_debug_use (use);
- }
+   {
+ auto set = as_a <set_info *> (def);
+ for (auto use : iterate_safely (set->debug_insn_uses ()))
+   {
+ if (dump_file)
+   fprintf (dump_file, "  i%d: resetting debug use of mem\n",
+use->insn ()->uid ());
+ reset_debug_use (use);
+   }
+   }
 }
 
   // Now let's take care of register uses, starting with debug uses
@@ -1577,7 +1580,7 @@ fixup_debug_uses (obstack_watermark &attempt,
 
   // Now that we've characterized the defs involved, go through the
   // debug uses and determine how to update them (if needed).
-  for (auto use : set->debug_insn_uses ())
+  for (auto use : iterate_safely (set->debug_insn_uses ()))
{
  if (*pair_dst < *use->insn () && defs[1])
// We're re-ordering defs[1] above a previous use of the
@@ -1609,7 +1612,7 @@ fixup_debug_uses (obstack_watermark &attempt,
 
   // We have a def in insns[1] which isn't def'd by the first insn.
   // Look to the previous def and see if it has any debug uses.
-  for (auto use : prev_set->debug_insn_uses ())
+  for (auto use : iterate_safely (prev_set->debug_insn_uses ()))
if (*pair_dst < *use->insn ())
  // We're ordering DEF above a previous use of the same register.
  update_debug_use (use, def, writeback_pat);
@@ -1622,7 +1625,8 @@ fixup_debug_uses (obstack_watermark &attempt,
   // second writeback def which need re-parenting: do that.
   auto def = find_access (insns[1]->defs (), base_regno);
   gcc_assert (def);
-  for (auto use : as_a 

Re: [PATCH] aarch64: Fix undefinedness while testing the J constraint [PR100204]

2024-01-26 Thread Alex Coplan
On 25/01/2024 11:57, Andrew Pinski wrote:
> The J constraint can invoke undefined behavior due to it taking the
> negative of the ival if ival was HWI_MIN. The fix is simple as casting
> to `unsigned HOST_WIDE_INT` before doing the negative of it. This
> does that.

Thanks for doing this.

> 
> Committed as obvious after build/test for aarch64-linux-gnu.
> 
> gcc/ChangeLog:
> 
>   PR target/100204
>   * config/aarch64/constraints.md (J): Cast to `unsigned HOST_WIDE_INT`
>   before taking the negative of it.
> 
> Signed-off-by: Andrew Pinski 
> ---
>  gcc/config/aarch64/constraints.md | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gcc/config/aarch64/constraints.md 
> b/gcc/config/aarch64/constraints.md
> index 8566befd727..a2569cea510 100644
> --- a/gcc/config/aarch64/constraints.md
> +++ b/gcc/config/aarch64/constraints.md
> @@ -118,7 +118,7 @@ (define_constraint "Uat"
>  (define_constraint "J"
>   "A constant that can be used with a SUB operation (once negated)."
>   (and (match_code "const_int")
> -  (match_test "aarch64_uimm12_shift (-ival)")))
> +  (match_test "aarch64_uimm12_shift (- (unsigned HOST_WIDE_INT) ival)")))

Sorry for the nitpick, but: I don't think we want a space after the unary -
here (at least according to https://gcc.gnu.org/codingconventions.html).

Alex

>  
>  ;; We can't use the mode of a CONST_INT to determine the context in
>  ;; which it is being used, so we must have a separate constraint for
> -- 
> 2.39.3
> 


Re: [PATCH V2] rs6000: New pass for replacement of adjacent loads fusion (lxv).

2024-01-24 Thread Alex Coplan
Hi Ajit,

On 21/01/2024 19:57, Ajit Agarwal wrote:
> 
> Hello All:
> 
> New pass to replace adjacent memory addresses lxv with lxvp.
> Added common infrastructure for load store fusion for
> different targets.

Thanks for this, it would be nice to see the load/store pair pass
generalized to multiple targets.

I assume you are targeting GCC 15 for this, as we are in stage 4 at
the moment?

> 
> Common routines are refactored in fusion-common.h.
> 
> AARCH64 load/store fusion pass is not changed with the 
> common infrastructure.

I think any patch to generalize the load/store pair fusion pass should
update the aarch64 code at the same time to use the generic
infrastructure, instead of duplicating the code.

As a general comment, I think we should move as much of the code as
possible to target-independent code, with only the bits that are truly
target-specific (e.g. deciding which modes to allow for a load/store
pair operand) in target code.

In terms of structuring the interface between generic code and target
code, I think it would be pragmatic to use a class with (in some cases,
pure) virtual functions that can be overriden by targets to implement
any target-specific behaviour.

IMO the generic class should be implemented in its own .cc instead of
using a header-only approach.  The target code would then define a
derived class which overrides the virtual functions (where necessary)
declared in the generic class, and then instantiate the derived class to
create a target-customized instance of the pass.

A more traditional GCC approach would be to use optabs and target hooks
to customize the behaviour of the pass to handle target-specific
aspects, but:
 - Target hooks are quite heavyweight, and we'd potentially have to add
   quite a few hooks just for one pass that (at least initially) will
   only be used by a couple of targets.
 - Using classes allows both sides to easily maintain their own state
   and share that state where appropriate.

Nit on naming: I understand you want to move away from ldp_fusion, but
how about pair_fusion or mem_pair_fusion instead of just "fusion" as a
base name?  IMO just "fusion" isn't very clear as to what the pass is
trying to achieve.

In general the code could do with a lot more commentary to explain the
rationale for various things / explain the high-level intent of the
code.

Unfortunately I'm not familiar with the DF framework (I've only really
worked with RTL-SSA for the aarch64 pass), so I haven't commented on the
use of that framework, but it would be nice if what you're trying to do
could be done using RTL-SSA instead of using DF directly.

Hopefully Richard S can chime in on those aspects.

My main concerns with the patch at the moment (apart from the code
duplication) is that it looks like:

 - The patch removes alias analysis from try_fuse_pair, which is unsafe.
 - The patch tries to make its own RTL changes inside
   rs6000_gen_load_pair, but it should let fuse_pair make those changes
   using RTL-SSA instead.

I've left some more specific (but still mostly high-level) comments below.

> 
> For AARCH64 architectures just include "fusion-common.h"
> and target dependent code can be added to that.
> 
> 
> Alex/Richard:
> 
> If you would like me to add for AARCH64 I can do that for AARCH64.
> 
> If you would like to do that is fine with me.
> 
> Bootstrapped and regtested with powerpc64-linux-gnu.
> 
> Improvement in performance is seen with Spec 2017 spec FP benchmarks.
> 
> Thanks & Regards
> Ajit
> 
> rs6000: New  pass for replacement of adjacent lxv with lxvp.

Are you looking to handle stores eventually, out of interest?  Looking
at rs6000-vecload-opt.cc:fusion_bb it looks like you're just handling
loads at the moment.

> 
> New pass to replace adjacent memory addresses lxv with lxvp.
> Added common infrastructure for load store fusion for
> different targets.
> 
> Common routines are refactored in fusion-common.h.

I've just done a very quick scan through this file as it mostly just
looks to be idential to existing code in aarch64-ldp-fusion.cc.

> 
> 2024-01-21  Ajit Kumar Agarwal  
> 
> gcc/ChangeLog:
> 
>   * config/rs6000/rs6000-passes.def: New vecload pass
>   before pass_early_remat.
>   * config/rs6000/rs6000-vecload-opt.cc: Add new pass.
>   * config.gcc: Add new executable.
>   * config/rs6000/rs6000-protos.h: Add new prototype for vecload
>   pass.
>   * config/rs6000/rs6000.cc: Add new prototype for vecload pass.
>   * config/rs6000/t-rs6000: Add new rule.
>   * fusion-common.h: Add common infrastructure for load store
>   fusion that can be shared across different architectures.
>   * emit-rtl.cc: Modify assert code.
> 
> gcc/testsuite/ChangeLog:
> 
>   * g++.target/powerpc/vecload.C: New test.
>   * g++.target/powerpc/vecload1.C: New test.
>   * gcc.target/powerpc/mma-builtin-1.c: Modify test.
> ---
>  gcc/config.gcc|4 +-
>  

Re: [PATCH] aarch64: Re-enable ldp/stp fusion pass

2024-01-24 Thread Alex Coplan
On 24/01/2024 09:15, Kyrylo Tkachov wrote:
> Hi Alex,
> 
> > -Original Message-
> > From: Alex Coplan 
> > Sent: Wednesday, January 24, 2024 8:34 AM
> > To: gcc-patches@gcc.gnu.org
> > Cc: Richard Earnshaw ; Richard Sandiford
> > ; Kyrylo Tkachov ;
> > Jakub Jelinek 
> > Subject: [PATCH] aarch64: Re-enable ldp/stp fusion pass
> > 
> > Hi,
> > 
> > Since, to the best of my knowledge, all reported regressions related to
> > the ldp/stp fusion pass have now been fixed, and PGO+LTO bootstrap with
> > --enable-languages=all is working again with the passes enabled, this
> > patch turns the passes back on by default, as agreed with Jakub here:
> > 
> > https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642478.html
> > 
> > Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?
> > 
> 
> If we were super-pedantic about the GCC rules we could say that this is a 
> revert of 8ed77a2356c3562f96c64f968e7529065c128c6a and therefore:
> "Similarly, no outside approval is needed to revert a patch that you checked 
> in." 
> But that would go against the spirit of the rule.

Heh, definitely seems against the spirit of the rule.

> Anyway, this is ok. Thanks for working through the regressions so diligently.

Thanks! Pushed as g:da9647e98aa289ba3aba41cf5bbe14d0f5f27e77.

I'll keep an eye on gcc-bugs for any further fallout.

Alex

> Kyrill
> 
> > Thanks,
> > Alex
> > 
> > gcc/ChangeLog:
> > 
> > * config/aarch64/aarch64.opt (-mearly-ldp-fusion): Set default
> > to 1.
> > (-mlate-ldp-fusion): Likewise.


[PATCH] aarch64: Re-enable ldp/stp fusion pass

2024-01-24 Thread Alex Coplan
Hi,

Since, to the best of my knowledge, all reported regressions related to
the ldp/stp fusion pass have now been fixed, and PGO+LTO bootstrap with
--enable-languages=all is working again with the passes enabled, this
patch turns the passes back on by default, as agreed with Jakub here:

https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642478.html

Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

* config/aarch64/aarch64.opt (-mearly-ldp-fusion): Set default
to 1.
(-mlate-ldp-fusion): Likewise.
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index c495cb34fbf..ceed5cdb201 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -290,12 +290,12 @@ Target Var(aarch64_track_speculation)
 Generate code to track when the CPU might be speculating incorrectly.
 
 mearly-ldp-fusion
-Target Var(flag_aarch64_early_ldp_fusion) Optimization Init(0)
+Target Var(flag_aarch64_early_ldp_fusion) Optimization Init(1)
 Enable the copy of the AArch64 load/store pair fusion pass that runs before
 register allocation.
 
 mlate-ldp-fusion
-Target Var(flag_aarch64_late_ldp_fusion) Optimization Init(0)
+Target Var(flag_aarch64_late_ldp_fusion) Optimization Init(1)
 Enable the copy of the AArch64 load/store pair fusion pass that runs after
 register allocation.
 


Re: [PATCH 4/4] aarch64: Fix up uses of mem following stp insert [PR113070]

2024-01-23 Thread Alex Coplan
On 22/01/2024 21:50, Alex Coplan wrote:
> On 22/01/2024 15:59, Richard Sandiford wrote:
> > Alex Coplan  writes:
> > > As the PR shows (specifically #c7) we are missing updating uses of mem
> > > when inserting an stp in the aarch64 load/store pair fusion pass.  This
> > > patch fixes that.
> > >
> > > RTL-SSA has a simple view of memory and by default doesn't allow stores
> > > to be re-ordered w.r.t. other stores.  In the ldp fusion pass, we do our
> > > own alias analysis and so can re-order stores over other accesses when
> > > we deem this is safe.  If neither store can be re-purposed (moved into
> > > the required position to form the stp while respecting the RTL-SSA
> > > constraints), then we turn both the candidate stores into "tombstone"
> > > insns (logically delete them) and insert a new stp insn.
> > >
> > > As it stands, we implement the insert case separately (after dealing
> > > with the candidate stores) in fuse_pair by inserting into the middle of
> > > the vector of changes.  This is OK when we only have to insert one
> > > change, but with this fix we would need to insert the change for the new
> > > stp plus multiple changes to fix up uses of mem (note the number of
> > > fix-ups is naturally bounded by the alias limit param to prevent
> > > quadratic behaviour).  If we kept the code structured as is and inserted
> > > into the middle of the vector, that would lead to repeated moving of
> > > elements in the vector which seems inefficient.  The structure of the
> > > code would also be a little unwieldy.
> > >
> > > To improve on that situation, this patch introduces a helper class,
> > > stp_change_builder, which implements a state machine that helps to build
> > > the required changes directly in program order.  That state machine is
> > > reponsible for deciding what changes need to be made in what order, and
> > > the code in fuse_pair then simply follows those steps.
> > >
> > > Together with the fix in the previous patch for installing new defs
> > > correctly in RTL-SSA, this fixes PR113070.
> > >
> > > We take the opportunity to rename the function decide_stp_strategy to
> > > try_repurpose_store, as that seems more descriptive of what it actually
> > > does, since stp_change_builder is now responsible for the overall change
> > > strategy.
> > >
> > > Bootstrapped/regtested as a series with/without the passes enabled on
> > > aarch64-linux-gnu, OK for trunk?
> > >
> > > Thanks,
> > > Alex
> > >
> > > gcc/ChangeLog:
> > >
> > >   PR target/113070
> > >   * config/aarch64/aarch64-ldp-fusion.cc (struct stp_change_builder): New.
> > >   (decide_stp_strategy): Rename to ...
> > >   (try_repurpose_store): ... this.
> > >   (ldp_bb_info::fuse_pair): Refactor to use stp_change_builder to
> > >   construct stp changes.  Fix up uses when inserting new stp insns.
> > > ---
> > >  gcc/config/aarch64/aarch64-ldp-fusion.cc | 248 ++-
> > >  1 file changed, 194 insertions(+), 54 deletions(-)
> > >
> > > diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> > > b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> > > index 689a8c884bd..703cfb1228c 100644
> > > --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> > > +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> > > @@ -844,11 +844,138 @@ def_upwards_move_range (def_info *def)
> > >return range;
> > >  }
> > >  
> > > +// Class that implements a state machine for building the changes needed to form
> > > +// a store pair instruction.  This allows us to easily build the changes in
> > > +// program order, as required by rtl-ssa.
> > > +struct stp_change_builder
> > > +{
> > > +  enum class state
> > > +  {
> > > +FIRST,
> > > +INSERT,
> > > +FIXUP_USE,
> > > +LAST,
> > > +DONE
> > > +  };
> > > +
> > > +  enum class action
> > > +  {
> > > +TOMBSTONE,
> > > +CHANGE,
> > > +INSERT,
> > > +FIXUP_USE
> > > +  };
> > > +
> > > +  struct change
> > > +  {
> > > +action type;
> > > +insn_info *insn;
> > > +  };
> > > +
> > > +  bool done () const { return m_state == state::DONE; }
> > > +
> > > +  st

Re: [PATCH 3/3] aarch64: Fix up debug uses in ldp/stp pass [PR113089]

2024-01-22 Thread Alex Coplan
On 22/01/2024 17:09, Richard Sandiford wrote:
> Sorry for the earlier review comment about debug insns.  I hadn't
> looked far enough into the queue to see this patch.
> 
> Alex Coplan  writes:
> > As the PR shows, we were missing code to update debug uses in the
> > load/store pair fusion pass.  This patch fixes that.
> >
> > Note that this patch depends on the following patch to create new uses
> > in RTL-SSA, submitted as part of the fixes for PR113070:
> > https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642919.html
> >
> > The patch tries to give a complete treatment of the debug uses that will
> > be affected by the changes we make, and in particular makes an effort to
> > preserve debug info where possible, e.g. when re-ordering an update of
> > a base register by a constant over a debug use of that register.  When
> > re-ordering loads over a debug use of a transfer register, we reset the
> > debug insn.  Likewise when re-ordering stores over debug uses of mem.
> >
> > While doing this I noticed that try_promote_writeback used a strange
> > choice of move_range for the pair insn, in that it chose the previous
> > nondebug insn instead of the insn itself.  Since the insn is being
> > changed, these move ranges are equivalent (at least in terms of nondebug
> > insn placement as far as RTL-SSA is concerned), but I think it is more
> > natural to choose the pair insn itself.  This is needed to avoid
> > incorrectly updating some debug uses.
> >
> > Notes on testing:
> >  - The series was bootstrapped/regtested on top of the fixes for
> >PR113070 and PR113356.  It seemed to make more sense to test with
> >correct use/def info, and as mentioned above, this patch depends on
> >one of the PR113070 patches.
> >  - I also ran the testsuite with -g -funroll-loops -mearly-ldp-fusion
> >-mlate-ldp-fusion to try and flush out more issues, and worked
> >through some examples where writeback updates were triggered to
> >make sure it was doing the right thing.
> >  - The patches also survived an LTO+PGO bootstrap with
> >--enable-languages=all (with the passes enabled).
> >
> > Bootstrapped/regtested as a series on aarch64-linux-gnu (with/without
> > the pass enabled).  OK for trunk?
> >
> > Thanks,
> > Alex
> >
> > gcc/ChangeLog:
> >
> > PR target/113089
> > * config/aarch64/aarch64-ldp-fusion.cc (reset_debug_use): New.
> > (fixup_debug_use): New.
> > (fixup_debug_uses_trailing_add): New.
> > (fixup_debug_uses): New. Use it ...
> > (ldp_bb_info::fuse_pair): ... here.
> > (try_promote_writeback): Call fixup_debug_uses_trailing_add to
> > fix up debug uses of the base register that are affected by
> > folding in the trailing add insn.
> >
> > gcc/testsuite/ChangeLog:
> >
> > PR target/113089
> > * gcc.c-torture/compile/pr113089.c: New test.
> > ---
> >  gcc/config/aarch64/aarch64-ldp-fusion.cc  | 332 +-
> >  .../gcc.c-torture/compile/pr113089.c  |  26 ++
> >  2 files changed, 351 insertions(+), 7 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.c-torture/compile/pr113089.c
> >
> > diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> > b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> > index 4d7fd72c6b1..fd0278e7acf 100644
> > --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> > +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> > @@ -1342,6 +1342,309 @@ ldp_bb_info::track_tombstone (int uid)
> >  gcc_unreachable (); // Bit should have changed.
> >  }
> >  
> > +// Reset the debug insn containing USE (the debug insn has been
> > +// optimized away).
> > +static void
> > +reset_debug_use (use_info *use)
> > +{
> > +  auto use_insn = use->insn ();
> > +  auto use_rtl = use_insn->rtl ();
> > +  insn_change change (use_insn);
> > +  change.new_uses = {};
> > +  INSN_VAR_LOCATION_LOC (use_rtl) = gen_rtx_UNKNOWN_VAR_LOC ();
> > +  crtl->ssa->change_insn (change);
> > +}
> > +
> > +// USE is a debug use that needs updating because DEF (a def of the same
> > +// register) is being re-ordered over it.  If BASE is non-null, then DEF
> > +// is an update of the register BASE by a constant, given by WB_OFFSET,
> > +// and we can preserve debug info by accounting for the change in side
> > +// effects.
> > +static void
> > > +fixup_debug_use (obstack_watermark &attempt,
> > +use_info *use,
> > +def_info *def,
>

Re: [PATCH 4/4] aarch64: Fix up uses of mem following stp insert [PR113070]

2024-01-22 Thread Alex Coplan
On 22/01/2024 15:59, Richard Sandiford wrote:
> Alex Coplan  writes:
> > As the PR shows (specifically #c7) we are missing updating uses of mem
> > when inserting an stp in the aarch64 load/store pair fusion pass.  This
> > patch fixes that.
> >
> > RTL-SSA has a simple view of memory and by default doesn't allow stores
> > to be re-ordered w.r.t. other stores.  In the ldp fusion pass, we do our
> > own alias analysis and so can re-order stores over other accesses when
> > we deem this is safe.  If neither store can be re-purposed (moved into
> > the required position to form the stp while respecting the RTL-SSA
> > constraints), then we turn both the candidate stores into "tombstone"
> > insns (logically delete them) and insert a new stp insn.
> >
> > As it stands, we implement the insert case separately (after dealing
> > with the candidate stores) in fuse_pair by inserting into the middle of
> > the vector of changes.  This is OK when we only have to insert one
> > change, but with this fix we would need to insert the change for the new
> > stp plus multiple changes to fix up uses of mem (note the number of
> > fix-ups is naturally bounded by the alias limit param to prevent
> > quadratic behaviour).  If we kept the code structured as is and inserted
> > into the middle of the vector, that would lead to repeated moving of
> > elements in the vector which seems inefficient.  The structure of the
> > code would also be a little unwieldy.
> >
> > To improve on that situation, this patch introduces a helper class,
> > stp_change_builder, which implements a state machine that helps to build
> > the required changes directly in program order.  That state machine is
> > responsible for deciding what changes need to be made in what order, and
> > the code in fuse_pair then simply follows those steps.
> >
> > Together with the fix in the previous patch for installing new defs
> > correctly in RTL-SSA, this fixes PR113070.
> >
> > We take the opportunity to rename the function decide_stp_strategy to
> > try_repurpose_store, as that seems more descriptive of what it actually
> > does, since stp_change_builder is now responsible for the overall change
> > strategy.
> >
> > Bootstrapped/regtested as a series with/without the passes enabled on
> > aarch64-linux-gnu, OK for trunk?
> >
> > Thanks,
> > Alex
> >
> > gcc/ChangeLog:
> >
> > PR target/113070
> > * config/aarch64/aarch64-ldp-fusion.cc (struct stp_change_builder): New.
> > (decide_stp_strategy): Rename to ...
> > (try_repurpose_store): ... this.
> > (ldp_bb_info::fuse_pair): Refactor to use stp_change_builder to
> > construct stp changes.  Fix up uses when inserting new stp insns.
> > ---
> >  gcc/config/aarch64/aarch64-ldp-fusion.cc | 248 ++-
> >  1 file changed, 194 insertions(+), 54 deletions(-)
> >
> > diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> > b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> > index 689a8c884bd..703cfb1228c 100644
> > --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> > +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> > @@ -844,11 +844,138 @@ def_upwards_move_range (def_info *def)
> >return range;
> >  }
> >  
> > > +// Class that implements a state machine for building the changes needed to form
> > +// a store pair instruction.  This allows us to easily build the changes in
> > +// program order, as required by rtl-ssa.
> > +struct stp_change_builder
> > +{
> > +  enum class state
> > +  {
> > +FIRST,
> > +INSERT,
> > +FIXUP_USE,
> > +LAST,
> > +DONE
> > +  };
> > +
> > +  enum class action
> > +  {
> > +TOMBSTONE,
> > +CHANGE,
> > +INSERT,
> > +FIXUP_USE
> > +  };
> > +
> > +  struct change
> > +  {
> > +action type;
> > +insn_info *insn;
> > +  };
> > +
> > +  bool done () const { return m_state == state::DONE; }
> > +
> > +  stp_change_builder (insn_info *insns[2],
> > + insn_info *repurpose,
> > + insn_info *dest)
> > +: m_state (state::FIRST), m_insns { insns[0], insns[1] },
> > +  m_repurpose (repurpose), m_dest (dest), m_use (nullptr) {}
> 
> Just to make sure I understand: is it the case that
> 
>   *insns[0] <= *dest <= *insns[1]
> 
> ?

Yes, that is my understanding.  I thought about asserting it somewhere in
stp_change_build

Re: [PATCH 3/4] rtl-ssa: Ensure new defs get inserted [PR113070]

2024-01-22 Thread Alex Coplan
On 22/01/2024 13:49, Richard Sandiford wrote:
> Alex Coplan  writes:
> > In r14-5820-ga49befbd2c783e751dc2110b544fe540eb7e33eb I added support to
> > RTL-SSA for inserting new insns, which included support for users
> > creating new defs.
> >
> > However, I missed that apply_changes_to_insn needed updating to ensure
> > that the new defs actually got inserted into the main def chain.  This
> > meant that when the aarch64 ldp/stp pass inserted a new stp insn, the
> > stp would just get skipped over during subsequent alias analysis, as its
> > def never got inserted into the memory def chain.  This (unsurprisingly)
> > led to wrong code.
> >
> > This patch fixes the issue by ensuring new user-created defs get
> > inserted.  I would have preferred to have used a flag internal to the
> > defs instead of a separate data structure to keep track of them, but since
> > machine_mode increased to 16 bits we're already at 64 bits in access_info,
> > and we can't really reuse m_is_temp as the logic in finalize_new_accesses
> > requires it to get cleared.
> >
> > Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?
> >
> > Thanks,
> > Alex
> >
> > gcc/ChangeLog:
> >
> > PR target/113070
> > * rtl-ssa.h: Include hash-set.h.
> > * rtl-ssa/changes.cc (function_info::finalize_new_accesses): Add
> > new_sets parameter and use it to keep track of new user-created sets.
> > (function_info::apply_changes_to_insn): Also call add_def on new sets.
> > (function_info::change_insns): Add hash_set to keep track of new
> > user-created defs.  Plumb it through.
> > * rtl-ssa/functions.h: Add hash_set parameter to finalize_new_accesses 
> > and
> > apply_changes_to_insn.
> > ---
> >  gcc/rtl-ssa.h   |  1 +
> >  gcc/rtl-ssa/changes.cc  | 28 +---
> >  gcc/rtl-ssa/functions.h |  6 --
> >  3 files changed, 26 insertions(+), 9 deletions(-)
> >
> > diff --git a/gcc/rtl-ssa.h b/gcc/rtl-ssa.h
> > index f0cf656f5ac..17337639ae8 100644
> > --- a/gcc/rtl-ssa.h
> > +++ b/gcc/rtl-ssa.h
> > @@ -50,6 +50,7 @@
> >  #include "mux-utils.h"
> >  #include "rtlanal.h"
> >  #include "cfgbuild.h"
> > +#include "hash-set.h"
> >  
> >  // Provides the global crtl->ssa.
> >  #include "memmodel.h"
> > diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc
> > index ce51d6ccd8d..6119ec3535b 100644
> > --- a/gcc/rtl-ssa/changes.cc
> > +++ b/gcc/rtl-ssa/changes.cc
> > @@ -429,7 +429,8 @@ update_insn_in_place (insn_change &change)
> >  // POS gives the final position of INSN, which hasn't yet been moved into
> >  // place.
> 
> The new parameter should be documented.  How about:
> 
>   // place.  NEW_SETS contains the new set_infos that are being added as part
>   // of this change (as opposed to being moved or repurposed from existing
>   // instructions).

That comment looks appropriate for apply_changes_to_insn, where NEW_SETS has
already been populated, but doesn't seem accurate for finalize_new_accesses.
How about:

  // place.  Keep track of any newly-created set_infos being added as
  // part of this change by adding them to NEW_SETS.

for finalize_new_accesses?  OK with that change (and using your suggestion for
apply_changes_to_insn)?

Thanks,
Alex

> 
> 
> >  void
> > -function_info::finalize_new_accesses (insn_change &change, insn_info *pos)
> > +function_info::finalize_new_accesses (insn_change &change, insn_info *pos,
> > +				      hash_set<def_info *> &new_sets)
> >  {
> >insn_info *insn = change.insn ();
> >  
> > @@ -465,6 +466,12 @@ function_info::finalize_new_accesses (insn_change &change, insn_info *pos)
> > // later in case we see a second write to the same resource.
> > def_info *perm_def = allocate<set_info> (change.insn (),
> >  def->resource ());
> > +
> > +   // Keep track of the new set so we remember to add it to the
> > +   // def chain later.
> > +   if (new_sets.add (perm_def))
> > + gcc_unreachable (); // We shouldn't see duplicates here.
> > +
> > def->set_last_def (perm_def);
> > def = perm_def;
> >   }
> > @@ -647,7 +654,8 @@ function_info::finalize_new_accesses (insn_change &change, insn_info *pos)
> >  // Copy information from CHANGE to its underlying insn_info, given that
> >  // the insn_info has already been placed appropriately

Re: [PATCH 2/4] rtl-ssa: Support for creating new uses [PR113070]

2024-01-22 Thread Alex Coplan
On 22/01/2024 13:45, Richard Sandiford wrote:
> Alex Coplan  writes:
> > This exposes an interface for users to create new uses in RTL-SSA.
> > This is needed for updating uses after inserting a new store pair insn
> > in the aarch64 load/store pair fusion pass.
> >
> > gcc/ChangeLog:
> >
> > PR target/113070
> > * rtl-ssa/accesses.cc (function_info::create_use): New.
> > * rtl-ssa/changes.cc (function_info::finalize_new_accesses):
> > Handle temporary uses, ensure new uses end up referring to
> > permanent defs.
> > * rtl-ssa/functions.h (function_info::create_use): Declare.
> > ---
> >  gcc/rtl-ssa/accesses.cc | 10 ++
> >  gcc/rtl-ssa/changes.cc  | 24 +++-
> >  gcc/rtl-ssa/functions.h |  5 +
> >  3 files changed, 34 insertions(+), 5 deletions(-)
> >
> > diff --git a/gcc/rtl-ssa/accesses.cc b/gcc/rtl-ssa/accesses.cc
> > index ce4a8b8dc00..3f1304fc5bf 100644
> > --- a/gcc/rtl-ssa/accesses.cc
> > +++ b/gcc/rtl-ssa/accesses.cc
> > @@ -1466,6 +1466,16 @@ function_info::create_set (obstack_watermark &watermark,
> >return set;
> >  }
> >  
> > +use_info *
> > +function_info::create_use (obstack_watermark &watermark,
> > +  insn_info *insn,
> > +  set_info *set)
> > +{
> > +  auto use = change_alloc<use_info> (watermark, insn, set->resource (),
> > set);
> > +  use->m_is_temp = true;
> > +  return use;
> > +}
> > +
> >  // Return true if ACCESS1 can represent ACCESS2 and if ACCESS2 can
> >  // represent ACCESS1.
> >  static bool
> > diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc
> > index e538b637848..ce51d6ccd8d 100644
> > --- a/gcc/rtl-ssa/changes.cc
> > +++ b/gcc/rtl-ssa/changes.cc
> > @@ -538,7 +539,9 @@ function_info::finalize_new_accesses (insn_change &change, insn_info *pos)
> >unsigned int i = 0;
> >for (use_info *use : change.new_uses)
> >  {
> > -  if (!use->m_has_been_superceded)
> > +  if (use->m_is_temp)
> > +   use->m_has_been_superceded = true;
> > +  else if (!use->m_has_been_superceded)
> > {
> 
> Is this part necessary for correctness, or is it just a compile-time
> optimisation?  We already have temporary uses via make_uses_available,
> and in principle, it's possible to reuse the uses for multiple changes
> within the same group.  E.g. when replacing A with B in multiple
> instructions, it's OK for the associated insn changes to refer to
> A's uses directly, or to uses created for A by make_uses_available.
> 
> So IMO it'd be better to drop this hunk if we can.

Yeah, I agree it's just a compile-time optimisation and shouldn't be
needed for correctness.  I initially thought it might save on memory,
but IIUC the memory allocated with allocate_temp will get freed when we
return from finalize_new_accesses anyway.

So I'll drop that hunk and re-test the series, thanks.

Alex

> 
> >   use = allocate_temp<use_info> (insn, use->resource (), use->def ());
> >   use->m_has_been_superceded = true;
> > @@ -609,15 +611,27 @@ function_info::finalize_new_accesses (insn_change &change, insn_info *pos)
> >   m_temp_uses[i] = use = allocate<use_info> (*use);
> >   use->m_is_temp = false;
> >   set_info *def = use->def ();
> > - // Handle cases in which the value was previously not used
> > - // within the block.
> > - if (def && def->m_is_temp)
> > + if (!def || !def->m_is_temp)
> > +   continue;
> > +
> > + if (auto phi = dyn_cast<phi_info *> (def))
> > {
> > - phi_info *phi = as_a<phi_info *> (def);
> > + // Handle cases in which the value was previously not used
> > + // within the block.
> >   gcc_assert (phi->is_degenerate ());
> >   phi = create_degenerate_phi (phi->ebb (), phi->input_value (0));
> >   use->set_def (phi);
> > }
> > + else
> > +   {
> > + // The temporary def may also be a set added with this change, in
> > + // which case the permanent set is stored in the last_def link,
> > + // and we need to update the use to refer to the permanent set.
> > + gcc_assert (is_a<set_info *> (def));
> > + auto perm_set = as_a<set_info *> (def->last_def ());
> > + gcc_assert (!perm_set->is_temporary ());
> > + use->set_def (perm_set);
> > +   }
> > }
> >  }
> >  
> > diff --git a/gcc/rtl-ssa/functions.h b/gcc/rtl-ssa/functions.h
> > in

[PATCH] aarch64: Don't assert recog success in ldp/stp pass [PR113114]

2024-01-19 Thread Alex Coplan
Hi,

The PR shows two different cases where try_promote_writeback produces an
RTL pattern which isn't recognized.  Currently this leads to an ICE, as
we assert recog success, but I think it's better just to back out of the
changes gracefully if recog fails (as we do in the main fuse_pair case).

In theory since we check the ranges here recog shouldn't fail (which is
why I had the assert in the first place), but the PR shows an edge case
in the patterns where if we form a pre-writeback pair where the
writeback offset is exactly -S, where S is the size in bytes of one
transfer register, we fail to match the expected pattern as the patterns
look explicitly for plus operands in the mems.  I think fixing this
would require adding at least four new special-case patterns to
aarch64.md for what doesn't seem to be a particularly useful variant of
the insns.  Even if we were to do that, I think it would be GCC 15
material, and it's better to just punt for GCC 14.

The ILP32 case in the PR is a bit different, as that shows us trying to
combine a pair with DImode base register operands in the mems together
with an SImode trailing update of the base register.  This leads to us
forming an RTL pattern which references the base register in both SImode
and DImode, which also fails to recog.  Again, I think it's best just to
take the missed optimization for now.  If we really want to make this
(try_promote_writeback) work for ILP32, we can try to do it for GCC 15.

Bootstrapped/regtested on aarch64-linux-gnu (with/without passes
enabled), OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

PR target/113114
* config/aarch64/aarch64-ldp-fusion.cc (try_promote_writeback):
Don't assert recog success, just punt if the writeback pair
isn't recognized.

gcc/testsuite/ChangeLog:

PR target/113114
* gcc.c-torture/compile/pr113114.c: New test.
* gcc.target/aarch64/pr113114.c: New test.
diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 689a8c884bd..19142153f41 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -2672,7 +2672,15 @@ try_promote_writeback (insn_info *insn)
   for (unsigned i = 0; i < ARRAY_SIZE (changes); i++)
    gcc_assert (rtl_ssa::restrict_movement_ignoring (*changes[i], is_changing));
 
-  gcc_assert (rtl_ssa::recog_ignoring (attempt, pair_change, is_changing));
+  if (!rtl_ssa::recog_ignoring (attempt, pair_change, is_changing))
+{
+  if (dump_file)
+   fprintf (dump_file, "i%d: recog failed on wb pair, bailing out\n",
+insn->uid ());
+  cancel_changes (0);
+  return;
+}
+
   gcc_assert (crtl->ssa->verify_insn_changes (changes));
   confirm_change_group ();
   crtl->ssa->change_insns (changes);
diff --git a/gcc/testsuite/gcc.c-torture/compile/pr113114.c 
b/gcc/testsuite/gcc.c-torture/compile/pr113114.c
new file mode 100644
index 000..978e594eb3d
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/compile/pr113114.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-funroll-loops" } */
+float val[128];
+float x;
+void bar() {
+  int i = 55;
+  for (; i >= 0; --i)
+x += val[i];
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/pr113114.c 
b/gcc/testsuite/gcc.target/aarch64/pr113114.c
new file mode 100644
index 000..5b0383c2435
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr113114.c
@@ -0,0 +1,7 @@
+/* { dg-do compile } */
+/* { dg-options "-mabi=ilp32 -O -mearly-ldp-fusion -mlate-ldp-fusion" } */
+void foo_n(double *a) {
+  int i = 1;
+  for (; i < (int)foo_n; i++)
+a[i] = a[i - 1] + a[i + 1] * a[i];
+}


[PATCH 3/3] aarch64: Fix up debug uses in ldp/stp pass [PR113089]

2024-01-19 Thread Alex Coplan
As the PR shows, we were missing code to update debug uses in the
load/store pair fusion pass.  This patch fixes that.

Note that this patch depends on the following patch to create new uses
in RTL-SSA, submitted as part of the fixes for PR113070:
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642919.html

The patch tries to give a complete treatment of the debug uses that will
be affected by the changes we make, and in particular makes an effort to
preserve debug info where possible, e.g. when re-ordering an update of
a base register by a constant over a debug use of that register.  When
re-ordering loads over a debug use of a transfer register, we reset the
debug insn.  Likewise when re-ordering stores over debug uses of mem.

While doing this I noticed that try_promote_writeback used a strange
choice of move_range for the pair insn, in that it chose the previous
nondebug insn instead of the insn itself.  Since the insn is being
changed, these move ranges are equivalent (at least in terms of nondebug
insn placement as far as RTL-SSA is concerned), but I think it is more
natural to choose the pair insn itself.  This is needed to avoid
incorrectly updating some debug uses.

Notes on testing:
 - The series was bootstrapped/regtested on top of the fixes for
   PR113070 and PR113356.  It seemed to make more sense to test with
   correct use/def info, and as mentioned above, this patch depends on
   one of the PR113070 patches.
 - I also ran the testsuite with -g -funroll-loops -mearly-ldp-fusion
   -mlate-ldp-fusion to try and flush out more issues, and worked
   through some examples where writeback updates were triggered to
   make sure it was doing the right thing.
 - The patches also survived an LTO+PGO bootstrap with
   --enable-languages=all (with the passes enabled).

Bootstrapped/regtested as a series on aarch64-linux-gnu (with/without
the pass enabled).  OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

PR target/113089
* config/aarch64/aarch64-ldp-fusion.cc (reset_debug_use): New.
(fixup_debug_use): New.
(fixup_debug_uses_trailing_add): New.
(fixup_debug_uses): New. Use it ...
(ldp_bb_info::fuse_pair): ... here.
(try_promote_writeback): Call fixup_debug_uses_trailing_add to
fix up debug uses of the base register that are affected by
folding in the trailing add insn.

gcc/testsuite/ChangeLog:

PR target/113089
* gcc.c-torture/compile/pr113089.c: New test.
---
 gcc/config/aarch64/aarch64-ldp-fusion.cc  | 332 +-
 .../gcc.c-torture/compile/pr113089.c  |  26 ++
 2 files changed, 351 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gcc.c-torture/compile/pr113089.c

diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 4d7fd72c6b1..fd0278e7acf 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -1342,6 +1342,309 @@ ldp_bb_info::track_tombstone (int uid)
 gcc_unreachable (); // Bit should have changed.
 }
 
+// Reset the debug insn containing USE (the debug insn has been
+// optimized away).
+static void
+reset_debug_use (use_info *use)
+{
+  auto use_insn = use->insn ();
+  auto use_rtl = use_insn->rtl ();
+  insn_change change (use_insn);
+  change.new_uses = {};
+  INSN_VAR_LOCATION_LOC (use_rtl) = gen_rtx_UNKNOWN_VAR_LOC ();
+  crtl->ssa->change_insn (change);
+}
+
+// USE is a debug use that needs updating because DEF (a def of the same
+// register) is being re-ordered over it.  If BASE is non-null, then DEF
+// is an update of the register BASE by a constant, given by WB_OFFSET,
+// and we can preserve debug info by accounting for the change in side
+// effects.
+static void
+fixup_debug_use (obstack_watermark &attempt,
+		 use_info *use,
+		 def_info *def,
+		 rtx base,
+		 poly_int64 wb_offset)
+{
+  auto use_insn = use->insn ();
+  if (base)
+{
+  auto use_rtl = use_insn->rtl ();
+  insn_change change (use_insn);
+
+  gcc_checking_assert (REG_P (base) && use->regno () == REGNO (base));
+  change.new_uses = check_remove_regno_access (attempt,
+		   change.new_uses,
+		   use->regno ());
+
+  // The effect of the writeback is to add WB_OFFSET to BASE.  If
+  // we're re-ordering DEF below USE, then we update USE by adding
+  // WB_OFFSET to it.  Otherwise, if we're re-ordering DEF above
+  // USE, we update USE by undoing the effect of the writeback
+  // (subtracting WB_OFFSET).
+  use_info *new_use;
+  if (*def->insn () > *use_insn)
+	{
+	  // We now need USE_INSN to consume DEF.  Create a new use of DEF.
+	  //
+	  // N.B. this means until we call change_insns for the main change
+	  // group we will temporarily have a debug use consuming a def that
+	  // comes after it, but RTL-SSA doesn't currently support updating
+	  // debug insns as part of the main change group (together with
+	  // nondebug changes), so 

[PATCH 2/3] aarch64: Re-parent trailing nondebug base reg uses [PR113089]

2024-01-19 Thread Alex Coplan
While working on PR113089, I realised we were missing code to re-parent
trailing nondebug uses of the base register in the case of cancelling
writeback in the load/store pair pass.  This patch fixes that.

Bootstrapped/regtested as a series on aarch64-linux-gnu (with/without
the pass enabled), OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

PR target/113089
* config/aarch64/aarch64-ldp-fusion.cc (ldp_bb_info::fuse_pair):
Update trailing nondebug uses of the base register in the case
of cancelling writeback.
---
 gcc/config/aarch64/aarch64-ldp-fusion.cc | 24 
 1 file changed, 24 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 70b75c668ce..4d7fd72c6b1 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -1693,6 +1693,30 @@ ldp_bb_info::fuse_pair (bool load_p,
 
   if (trailing_add)
 changes.safe_push (make_delete (trailing_add));
+  else if ((writeback & 2) && !writeback_effect)
+{
+  // The second insn initially had writeback but now the pair does not,
+  // need to update any nondebug uses of the base register def in the
+  // second insn.  We'll take care of debug uses later.
+  auto def = find_access (insns[1]->defs (), base_regno);
+  gcc_assert (def);
+  auto set = dyn_cast<set_info *> (def);
+  if (set && set->has_nondebug_uses ())
+	{
+	  auto orig_use = find_access (insns[0]->uses (), base_regno);
+	  for (auto use : set->nondebug_insn_uses ())
+	{
+	  auto change = make_change (use->insn ());
+	  change->new_uses = check_remove_regno_access (attempt,
+			change->new_uses,
+			base_regno);
+	  change->new_uses = insert_access (attempt,
+		orig_use,
+		change->new_uses);
+	  changes.safe_push (change);
+	}
+	}
+}
 
   auto is_changing = insn_is_changing (changes);
   for (unsigned i = 0; i < changes.length (); i++)


[PATCH 1/3] rtl-ssa: Provide easier access to debug uses [PR113089]

2024-01-19 Thread Alex Coplan
This patch adds some accessors to set_info and use_info to make it
easier to get at and iterate through uses in debug insns.

It is used by the aarch64 load/store pair fusion pass in a subsequent
patch to fix PR113089, i.e. to update debug uses in the pass.

Bootstrapped/regtested as a series on aarch64-linux-gnu (with/without
the load/store pair pass enabled), OK for trunk?

gcc/ChangeLog:

PR target/113089
* rtl-ssa/accesses.h (use_info::next_debug_insn_use): New.
(debug_insn_use_iterator): New.
(set_info::first_debug_insn_use): New.
(set_info::debug_insn_uses): New.
* rtl-ssa/member-fns.inl (use_info::next_debug_insn_use): New.
(set_info::first_debug_insn_use): New.
(set_info::debug_insn_uses): New.
---
 gcc/rtl-ssa/accesses.h | 13 +
 gcc/rtl-ssa/member-fns.inl | 29 +
 2 files changed, 42 insertions(+)

diff --git a/gcc/rtl-ssa/accesses.h b/gcc/rtl-ssa/accesses.h
index 6a3ecd32848..c57b8a8b7b5 100644
--- a/gcc/rtl-ssa/accesses.h
+++ b/gcc/rtl-ssa/accesses.h
@@ -357,6 +357,10 @@ public:
   //next_use () && next_use ()->is_in_any_insn () ? next_use () : nullptr
   use_info *next_any_insn_use () const;
 
+  // Return the next use by a debug instruction, or null if none.
+  // This is only valid if is_in_debug_insn ().
+  use_info *next_debug_insn_use () const;
+
   // Return the previous use by a phi node in the list, or null if none.
   //
   // This is only valid if is_in_phi ().  It is equivalent to:
@@ -458,6 +462,8 @@ using reverse_use_iterator = list_iterator<use_info, &use_info::prev_use>;
 // of use in the same definition.
 using nondebug_insn_use_iterator
   = list_iterator<use_info, &use_info::next_nondebug_insn_use>;
+using debug_insn_use_iterator
+  = list_iterator<use_info, &use_info::next_debug_insn_use>;
 using any_insn_use_iterator
   = list_iterator<use_info, &use_info::next_any_insn_use>;
 using phi_use_iterator = list_iterator<use_info, &use_info::next_phi_use>;
@@ -680,6 +686,10 @@ public:
   use_info *first_nondebug_insn_use () const;
   use_info *last_nondebug_insn_use () const;
 
+  // Return the first use of the set by debug instructions, or null if
+  // there is no such use.
+  use_info *first_debug_insn_use () const;
+
   // Return the first use of the set by any kind of instruction, or null
   // if there are no such uses.  The uses are in the order described above.
   use_info *first_any_insn_use () const;
@@ -731,6 +741,9 @@ public:
   // List the uses of the set by nondebug instructions, in reverse postorder.
   iterator_range<nondebug_insn_use_iterator> nondebug_insn_uses () const;
 
+  // List the uses of the set by debug instructions, in reverse postorder.
+  iterator_range<debug_insn_use_iterator> debug_insn_uses () const;
+
   // Return nondebug_insn_uses () in reverse order.
   iterator_range<reverse_use_iterator> reverse_nondebug_insn_uses () const;
 
diff --git a/gcc/rtl-ssa/member-fns.inl b/gcc/rtl-ssa/member-fns.inl
index 8e1c17ced95..e4825ad2a18 100644
--- a/gcc/rtl-ssa/member-fns.inl
+++ b/gcc/rtl-ssa/member-fns.inl
@@ -119,6 +119,15 @@ use_info::next_any_insn_use () const
   return nullptr;
 }
 
+inline use_info *
+use_info::next_debug_insn_use () const
+{
+  if (auto use = next_use ())
+    if (use->is_in_debug_insn ())
+      return use;
+  return nullptr;
+}
+
 inline use_info *
 use_info::prev_phi_use () const
 {
@@ -212,6 +221,20 @@ set_info::last_nondebug_insn_use () const
   return nullptr;
 }
 
+inline use_info *
+set_info::first_debug_insn_use () const
+{
+  use_info *use;
+  if (has_nondebug_insn_uses ())
+    use = last_nondebug_insn_use ()->next_use ();
+  else
+    use = first_use ();
+
+  if (use && use->is_in_debug_insn ())
+    return use;
+  return nullptr;
+}
+
 inline use_info *
 set_info::first_any_insn_use () const
 {
@@ -310,6 +333,12 @@ set_info::nondebug_insn_uses () const
   return { first_nondebug_insn_use (), nullptr };
 }
 
+inline iterator_range<debug_insn_use_iterator>
+set_info::debug_insn_uses () const
+{
+  return { first_debug_insn_use (), nullptr };
+}
+
 inline iterator_range<reverse_use_iterator>
 set_info::reverse_nondebug_insn_uses () const
 {

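The lookup above relies on an rtl-ssa invariant: within a set's use list, uses by nondebug instructions come first, followed by uses by debug instructions. A self-contained toy model of that partitioned-list lookup (the struct and function names here are illustrative stand-ins, not the real rtl-ssa API):

```cpp
#include <cassert>

// Minimal model of rtl-ssa's partitioned use list: all uses by nondebug
// instructions come first, followed by all uses by debug instructions.
struct use
{
  bool is_debug;
  use *next = nullptr;
};

// Analogue of set_info::first_debug_insn_use: the first debug use, if any,
// is the element immediately after the last nondebug use (or the head of
// the list when there are no nondebug uses).
use *
first_debug_use (use *first_use, use *last_nondebug_use)
{
  use *u = last_nondebug_use ? last_nondebug_use->next : first_use;
  return (u && u->is_debug) ? u : nullptr;
}

// Analogue of use_info::next_debug_insn_use: because the list is
// partitioned, once the next element is not a debug use there are no
// more debug uses.
use *
next_debug_use (use *u)
{
  use *n = u->next;
  return (n && n->is_debug) ? n : nullptr;
}
```

This is why neither function needs to scan the whole list: both are O(1) given the partition invariant.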

Re: [PATCH 1/4] rtl-ssa: Run finalize_new_accesses forwards [PR113070]

2024-01-17 Thread Alex Coplan
On 17/01/2024 07:42, Jeff Law wrote:
> 
> 
> On 1/13/24 08:43, Alex Coplan wrote:
> > The next patch in this series exposes an interface for creating new uses
> > in RTL-SSA.  The intent is that new user-created uses can consume new
> > user-created defs in the same change group.  This is so that we can
> > correctly update uses of memory when inserting a new store pair insn in
> > the aarch64 load/store pair fusion pass (the affected uses need to
> > consume the new store pair insn).
> > 
> > As it stands, finalize_new_accesses is called as part of the backwards
> > insn placement loop within change_insns, but if we want new uses to be
> > able to depend on new defs in the same change group, we need
> > finalize_new_accesses to be called on earlier insns first.  This is so
> > that when we process temporary uses and turn them into permanent uses,
> > we can follow the last_def link on the temporary def to ensure we end up
> > with a permanent use consuming a permanent def.
> > 
> > Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?
> > 
> > Thanks,
> > Alex
> > 
> > gcc/ChangeLog:
> > 
> > PR target/113070
> > * rtl-ssa/changes.cc (function_info::change_insns): Split out the call
> > to finalize_new_accesses from the backwards placement loop, run it
> > forwards in a separate loop.
> So just to be explicit -- given this is adjusting the rtl-ssa
> infrastructure, I was going to let Richard S. own the review side -- he
> knows that code better than I.

Yeah, that's fine, thanks.  Richard is away this week but back on Monday, so
hopefully he can take a look at it then.

Alex

> 
> Jeff


Re: [PATCH] aarch64: Fix aarch64_ldp_reg_operand predicate not to allow all subreg [PR113221]

2024-01-17 Thread Alex Coplan
Hi Andrew,

On 16/01/2024 19:29, Andrew Pinski wrote:
> So the problem here is that aarch64_ldp_reg_operand will allow all subregs, 
> even a subreg of a lo_sum.
> When LRA tries to fix that up, all things break. So the fix is to change the 
> check to only
> allow reg and subreg of regs.

Thanks a lot for tracking this down, I really appreciate having some help with
the bug-fixing.  Sorry for not getting to it sooner myself, I'm working on
PR113089 which ended up taking longer than expected to fix.

> 
> Note the tendency here is to use register_operand but that checks the mode of 
> the register
> but we need to allow a mismatch modes for this predicate for now.

Yeah, due to the design of the patterns using special predicates we need
to allow a mode mismatch with the contextual mode.

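As an illustration of the tightened check, here is a toy model of the predicate's intent (the `rtx_code` enum and `rtx_node` struct below are simplified stand-ins, not GCC's real rtl representation):

```cpp
#include <cassert>

// Toy rtl: just enough structure to model the predicate fix, which
// should accept a REG, or a SUBREG whose inner operand is a REG, but
// reject a SUBREG wrapping anything else (e.g. a LO_SUM), which LRA
// cannot sensibly reload.
enum rtx_code { REG, SUBREG, LO_SUM };

struct rtx_node
{
  rtx_code code;
  const rtx_node *inner = nullptr;  // models SUBREG_REG for subregs
};

// Model of the tightened aarch64_ldp_reg_operand code check
// (mode checks omitted).
bool
ldp_reg_operand_p (const rtx_node &op)
{
  if (op.code == REG)
    return true;
  if (op.code == SUBREG && op.inner && op.inner->code == REG)
    return true;
  return false;
}
```
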
The patch broadly LGTM (although I can't approve), but I've left a
couple of minor comments below.

> 
> Built and tested for aarch64-linux-gnu with no regressions
> (Also tested with the LD/ST pair pass back on).
> 
>   PR target/113221
> 
> gcc/ChangeLog:
> 
>   * config/aarch64/predicates.md (aarch64_ldp_reg_operand): For subreg,
>   only allow REG operands instead of allowing all.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.c-torture/compile/pr113221-1.c: New test.
> 
> Signed-off-by: Andrew Pinski 
> ---
>  gcc/config/aarch64/predicates.md |  8 +++-
>  gcc/testsuite/gcc.c-torture/compile/pr113221-1.c | 12 
>  2 files changed, 19 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.c-torture/compile/pr113221-1.c
> 
> diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
> index 8a204e48bb5..256268517d8 100644
> --- a/gcc/config/aarch64/predicates.md
> +++ b/gcc/config/aarch64/predicates.md
> @@ -313,7 +313,13 @@ (define_predicate "pmode_plus_operator"
>  
>  (define_special_predicate "aarch64_ldp_reg_operand"
>(and
> -(match_code "reg,subreg")
> +(ior
> +  (match_code "reg")
> +  (and
> +   (match_code "subreg")
> +   (match_test "GET_CODE (SUBREG_REG (op)) == REG")

This could be just REG_P (SUBREG_REG (op)) in the match_test.

> +  )
> +)

I think it would be more in keeping with the style in the rest of the file to
have the closing parens on the same line as the SUBREG_REG match_test.

>  (match_test "aarch64_ldpstp_operand_mode_p (GET_MODE (op))")
>  (ior
>(match_test "mode == VOIDmode")
> diff --git a/gcc/testsuite/gcc.c-torture/compile/pr113221-1.c b/gcc/testsuite/gcc.c-torture/compile/pr113221-1.c
> new file mode 100644
> index 000..152a510786e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.c-torture/compile/pr113221-1.c
> @@ -0,0 +1,12 @@
> +/* { dg-options "-fno-move-loop-invariants -funroll-all-loops" } */

Does this need to be dg-additional-options?  Naively I would expect the
dg-options clause to override the torture options (and potentially any
options provided in RUNTESTFLAGS, e.g. to re-enable the ldp/stp pass).

Thanks again for the patch, and apologies for the oversight on my part: I'd
missed that register_operand also checks the code inside the subreg.

Alex

> +/* PR target/113221 */
> +/* This used to ICE after the `load/store pair fusion pass` was added
> +   due to the predicate aarch64_ldp_reg_operand allowing too much. */
> +
> +
> +void bar();
> +void foo(int* b) {
> +  for (;;)
> +*b++ = (long)bar;
> +}
> +
> -- 
> 2.39.3
> 


Re: [PATCH 4/4] aarch64: Fix up uses of mem following stp insert [PR113070]

2024-01-15 Thread Alex Coplan
On 13/01/2024 15:46, Alex Coplan wrote:
> As the PR shows (specifically #c7) we are missing updating uses of mem
> when inserting an stp in the aarch64 load/store pair fusion pass.  This
> patch fixes that.
> 
> RTL-SSA has a simple view of memory and by default doesn't allow stores
> to be re-ordered w.r.t. other stores.  In the ldp fusion pass, we do our
> own alias analysis and so can re-order stores over other accesses when
> we deem this is safe.  If neither store can be re-purposed (moved into
> the required position to form the stp while respecting the RTL-SSA
> constraints), then we turn both the candidate stores into "tombstone"
> insns (logically delete them) and insert a new stp insn.
> 
> As it stands, we implement the insert case separately (after dealing
> with the candidate stores) in fuse_pair by inserting into the middle of
> the vector of changes.  This is OK when we only have to insert one
> change, but with this fix we would need to insert the change for the new
> stp plus multiple changes to fix up uses of mem (note the number of
> fix-ups is naturally bounded by the alias limit param to prevent
> quadratic behaviour).  If we kept the code structured as is and inserted
> into the middle of the vector, that would lead to repeated moving of
> elements in the vector which seems inefficient.  The structure of the
> code would also be a little unwieldy.
> 
> To improve on that situation, this patch introduces a helper class,
> stp_change_builder, which implements a state machine that helps to build
> the required changes directly in program order.  That state machine is
> responsible for deciding what changes need to be made in what order, and
> the code in fuse_pair then simply follows those steps.
> 
> Together with the fix in the previous patch for installing new defs
> correctly in RTL-SSA, this fixes PR113070.
> 
> We take the opportunity to rename the function decide_stp_strategy to
> try_repurpose_store, as that seems more descriptive of what it actually
> does, since stp_change_builder is now responsible for the overall change
> strategy.
> 
> Bootstrapped/regtested as a series with/without the passes enabled on
> aarch64-linux-gnu, OK for trunk?
> 
> Thanks,
> Alex
> 
> gcc/ChangeLog:
> 
>   PR target/113070
>   * config/aarch64/aarch64-ldp-fusion.cc (struct stp_change_builder): New.
>   (decide_stp_strategy): Rename to ...
>   (try_repurpose_store): ... this.
>   (ldp_bb_info::fuse_pair): Refactor to use stp_change_builder to
>   construct stp changes.  Fix up uses when inserting new stp insns.
> ---
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 248 ++-
>  1 file changed, 194 insertions(+), 54 deletions(-)
> 

> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 689a8c884bd..703cfb1228c 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -844,11 +844,138 @@ def_upwards_move_range (def_info *def)
>return range;
>  }
>  
> +// Class that implements a state machine for building the changes needed to 
> form
> +// a store pair instruction.  This allows us to easily build the changes in
> +// program order, as required by rtl-ssa.
> +struct stp_change_builder
> +{
> +  enum class state
> +  {
> +FIRST,
> +INSERT,
> +FIXUP_USE,
> +LAST,
> +DONE
> +  };
> +
> +  enum class action
> +  {
> +TOMBSTONE,
> +CHANGE,
> +INSERT,
> +FIXUP_USE
> +  };
> +
> +  struct change
> +  {
> +action type;
> +insn_info *insn;
> +  };
> +
> +  bool done () const { return m_state == state::DONE; }
> +
> +  stp_change_builder (insn_info *insns[2],
> +   insn_info *repurpose,
> +   insn_info *dest)
> +: m_state (state::FIRST), m_insns { insns[0], insns[1] },
> +  m_repurpose (repurpose), m_dest (dest), m_use (nullptr) {}
> +
> +  change get_change () const
> +  {
> +switch (m_state)
> +  {
> +  case state::FIRST:
> + return {
> +   m_insns[0] == m_repurpose ? action::CHANGE : action::TOMBSTONE,
> +   m_insns[0]
> + };
> +  case state::LAST:
> + return {
> +   m_insns[1] == m_repurpose ? action::CHANGE : action::TOMBSTONE,
> +   m_insns[1]
> + };
> +  case state::INSERT:
> + return { action::INSERT, m_dest };
> +  case state::FIXUP_USE:
> + return { action::FIXUP_USE, m_use->insn () };
> +  case state::DONE:
> + break;
> +  }
> +
> +gcc_unreachable ();
> +  }
> +
> +  // Transition to the next state.

[PATCH] aarch64: Don't record hazards against paired insns [PR113356]

2024-01-15 Thread Alex Coplan
Hi,

For the testcase in the PR, we try to pair insns where the first has
writeback and the second uses the updated base register.  This causes us
to record a hazard against the second insn, thus narrowing the move
range away from the end of the BB.

However, it isn't meaningful to record hazards against the other insn
in the pair, as this doesn't change which pairs can be formed, and also
doesn't change where the pair is formed (from the perspective of
nondebug insns).

To see why this is the case, consider the two cases:

 - Suppose we are finding hazards for insns[0].  If we record a hazard
   against insns[1], then range.last becomes
   insns[1]->prev_nondebug_insn (), but note that this is equivalent to
   inserting after insns[1] (since insns[1] is being changed).
 - Now consider finding hazards for insns[1].  Suppose we record
   insns[0] as a hazard.  Then we set range.first = insns[0], which is a
   no-op.

As such, it seems better to never record hazards against the other insn
in the pair, as we check whether the insns themselves are suitable for
combination separately (e.g. for ldp checking that they use distinct
transfer registers).  Avoiding unnecessarily narrowing the move range
avoids unnecessarily re-ordering over debug insns.

This should also mean that we can only narrow the move range away from
the end of the BB in the case that we record a hazard for insns[0]
against insns[1]->prev_nondebug_insn () or earlier.  This means that for
the non-call-exceptions case, either the move range includes insns[1],
or we reject the pair (thus the assert tripped in the PR should always
hold).

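The effect of switching to strict comparisons can be sketched with insns as integer program points (a toy model; the function names and `std::optional` interface are illustrative, not the pass's real API):

```cpp
#include <cassert>
#include <optional>

// Toy model of the hazard-recording fix.  A hazard clips the pair's move
// range; after the fix, a hazard that is the *other* insn of the pair is
// no longer recorded, since pairing-compatibility is checked separately.

// Was: record h if *h <= *insns[1].  Now strict: h == insns[1] is ignored.
std::optional<int>
hazard_for_first (std::optional<int> first_hazard_after_0, int insn1)
{
  if (first_hazard_after_0 && *first_hazard_after_0 < insn1)
    return first_hazard_after_0;
  return std::nullopt;
}

// Was: record h if *h >= *insns[0].  Now strict: h == insns[0] is ignored.
std::optional<int>
hazard_for_second (std::optional<int> latest_hazard_before_1, int insn0)
{
  if (latest_hazard_before_1 && *latest_hazard_before_1 > insn0)
    return latest_hazard_before_1;
  return std::nullopt;
}
```

With the old non-strict comparisons, a writeback pair whose second insn used the updated base would record that second insn as a hazard and needlessly narrow the move range.
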
Bootstrapped/regtested on aarch64-linux-gnu with/without ldp passes
enabled on top of the PR113070 fixes, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

PR target/113356
* config/aarch64/aarch64-ldp-fusion.cc (ldp_bb_info::try_fuse_pair):
Don't record hazards against the opposite insn in the pair.

gcc/testsuite/ChangeLog:

PR target/113356
* gcc.target/aarch64/pr113356.C: New test.
diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 703cfb1228c..6834560c5fb 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -2216,11 +2216,11 @@ ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
     ignore[j] = &XEXP (cand_mems[j], 0);
 
   insn_info *h = first_hazard_after (insns[0], ignore[0]);
-  if (h && *h <= *insns[1])
+  if (h && *h < *insns[1])
cand.hazards[0] = h;
 
   h = latest_hazard_before (insns[1], ignore[1]);
-  if (h && *h >= *insns[0])
+  if (h && *h > *insns[0])
cand.hazards[1] = h;
 
   if (!cand.viable ())
diff --git a/gcc/testsuite/gcc.target/aarch64/pr113356.C b/gcc/testsuite/gcc.target/aarch64/pr113356.C
new file mode 100644
index 000..0de17a54a53
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr113356.C
@@ -0,0 +1,8 @@
+// { dg-do compile }
+// { dg-options "-Os -fnon-call-exceptions -mearly-ldp-fusion -fno-lifetime-dse -fno-forward-propagate" }
+struct Class1 {
+  virtual ~Class1() {}
+  unsigned Field1;
+};
+struct Class4 : virtual Class1 {};
+int main() { Class4 var1; }


[PATCH 4/4] aarch64: Fix up uses of mem following stp insert [PR113070]

2024-01-13 Thread Alex Coplan
As the PR shows (specifically #c7) we are missing updating uses of mem
when inserting an stp in the aarch64 load/store pair fusion pass.  This
patch fixes that.

RTL-SSA has a simple view of memory and by default doesn't allow stores
to be re-ordered w.r.t. other stores.  In the ldp fusion pass, we do our
own alias analysis and so can re-order stores over other accesses when
we deem this is safe.  If neither store can be re-purposed (moved into
the required position to form the stp while respecting the RTL-SSA
constraints), then we turn both the candidate stores into "tombstone"
insns (logically delete them) and insert a new stp insn.

As it stands, we implement the insert case separately (after dealing
with the candidate stores) in fuse_pair by inserting into the middle of
the vector of changes.  This is OK when we only have to insert one
change, but with this fix we would need to insert the change for the new
stp plus multiple changes to fix up uses of mem (note the number of
fix-ups is naturally bounded by the alias limit param to prevent
quadratic behaviour).  If we kept the code structured as is and inserted
into the middle of the vector, that would lead to repeated moving of
elements in the vector which seems inefficient.  The structure of the
code would also be a little unwieldy.

To improve on that situation, this patch introduces a helper class,
stp_change_builder, which implements a state machine that helps to build
the required changes directly in program order.  That state machine is
responsible for deciding what changes need to be made in what order, and
the code in fuse_pair then simply follows those steps.

Together with the fix in the previous patch for installing new defs
correctly in RTL-SSA, this fixes PR113070.

We take the opportunity to rename the function decide_stp_strategy to
try_repurpose_store, as that seems more descriptive of what it actually
does, since stp_change_builder is now responsible for the overall change
strategy.

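The archived message cuts off before the fuse_pair driver loop is visible, so here is a simplified, self-contained model of how a builder like this is driven, covering only the insert case (no insn re-purposed). The types, the string-valued change descriptions, and the `fixups` count are illustrative, not the pass's real interface:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified model of stp_change_builder: emit changes in program order:
// first insn, inserted stp, zero or more use fix-ups, last insn.
struct builder
{
  enum class state { FIRST, INSERT, FIXUP_USE, LAST, DONE };

  state m_state = state::FIRST;
  int m_fixups;  // number of uses of memory that must be fixed up

  explicit builder (int fixups) : m_fixups (fixups) {}

  bool done () const { return m_state == state::DONE; }

  // Describe the change required at the current step.
  std::string get_change () const
  {
    switch (m_state)
      {
      case state::FIRST: return "tombstone-first";
      case state::INSERT: return "insert-stp";
      case state::FIXUP_USE: return "fixup-use";
      case state::LAST: return "tombstone-last";
      default: return "";
      }
  }

  // Transition to the next state.
  void advance ()
  {
    switch (m_state)
      {
      case state::FIRST:
	m_state = state::INSERT;
	break;
      case state::INSERT:
	m_state = m_fixups ? state::FIXUP_USE : state::LAST;
	break;
      case state::FIXUP_USE:
	if (--m_fixups == 0)
	  m_state = state::LAST;
	break;
      case state::LAST:
	m_state = state::DONE;
	break;
      case state::DONE:
	break;
      }
  }
};

// Driver loop in the shape the cover text describes: the caller simply
// follows the builder's steps, collecting changes in program order.
std::vector<std::string>
build_changes (int fixups)
{
  builder b (fixups);
  std::vector<std::string> changes;
  while (!b.done ())
    {
      changes.push_back (b.get_change ());
      b.advance ();
    }
  return changes;
}
```

Because the steps come out already ordered, the caller appends each change to its vector instead of inserting into the middle, which is the efficiency point made above.
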
Bootstrapped/regtested as a series with/without the passes enabled on
aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

PR target/113070
* config/aarch64/aarch64-ldp-fusion.cc (struct stp_change_builder): New.
(decide_stp_strategy): Rename to ...
(try_repurpose_store): ... this.
(ldp_bb_info::fuse_pair): Refactor to use stp_change_builder to
construct stp changes.  Fix up uses when inserting new stp insns.
---
 gcc/config/aarch64/aarch64-ldp-fusion.cc | 248 ++-
 1 file changed, 194 insertions(+), 54 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 689a8c884bd..703cfb1228c 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -844,11 +844,138 @@ def_upwards_move_range (def_info *def)
   return range;
 }
 
+// Class that implements a state machine for building the changes needed to form
+// a store pair instruction.  This allows us to easily build the changes in
+// program order, as required by rtl-ssa.
+struct stp_change_builder
+{
+  enum class state
+  {
+    FIRST,
+    INSERT,
+    FIXUP_USE,
+    LAST,
+    DONE
+  };
+
+  enum class action
+  {
+    TOMBSTONE,
+    CHANGE,
+    INSERT,
+    FIXUP_USE
+  };
+
+  struct change
+  {
+    action type;
+    insn_info *insn;
+  };
+
+  bool done () const { return m_state == state::DONE; }
+
+  stp_change_builder (insn_info *insns[2],
+		  insn_info *repurpose,
+		  insn_info *dest)
+    : m_state (state::FIRST), m_insns { insns[0], insns[1] },
+      m_repurpose (repurpose), m_dest (dest), m_use (nullptr) {}
+
+  change get_change () const
+  {
+    switch (m_state)
+      {
+      case state::FIRST:
+	return {
+	  m_insns[0] == m_repurpose ? action::CHANGE : action::TOMBSTONE,
+	  m_insns[0]
+	};
+      case state::LAST:
+	return {
+	  m_insns[1] == m_repurpose ? action::CHANGE : action::TOMBSTONE,
+	  m_insns[1]
+	};
+      case state::INSERT:
+	return { action::INSERT, m_dest };
+      case state::FIXUP_USE:
+	return { action::FIXUP_USE, m_use->insn () };
+      case state::DONE:
+	break;
+      }
+
+    gcc_unreachable ();
+  }
+
+  // Transition to the next state.
+  void advance ()
+  {
+    switch (m_state)
+      {
+      case state::FIRST:
+	if (m_repurpose)
+	  m_state = state::LAST;
+	else
+	  m_state = state::INSERT;
+	break;
+      case state::INSERT:
+      {
+	def_info *def = memory_access (m_insns[0]->defs ());
+	while (*def->next_def ()->insn () <= *m_dest)
+	  def = def->next_def ();
+
+	// Now we know DEF feeds the insertion point for the new stp.
+	// Look for any uses of DEF that will consume the new stp.
+	gcc_assert (*def->insn () <= *m_dest
+		    && *def->next_def ()->insn () > *m_dest);
+
+	if (auto set = dyn_cast<set_info *> (def))
+	  for (auto use : set->nondebug_insn_uses ())
+	    if (*use->insn () > *m_dest)
+	      {
+		m_use = use;
+		break;
+	      }
+
+	if (m_use)
+	  m_state = 

[PATCH 3/4] rtl-ssa: Ensure new defs get inserted [PR113070]

2024-01-13 Thread Alex Coplan
In r14-5820-ga49befbd2c783e751dc2110b544fe540eb7e33eb I added support to
RTL-SSA for inserting new insns, which included support for users
creating new defs.

However, I missed that apply_changes_to_insn needed updating to ensure
that the new defs actually got inserted into the main def chain.  This
meant that when the aarch64 ldp/stp pass inserted a new stp insn, the
stp would just get skipped over during subsequent alias analysis, as its
def never got inserted into the memory def chain.  This (unsurprisingly)
led to wrong code.

This patch fixes the issue by ensuring new user-created defs get
inserted.  I would have preferred to have used a flag internal to the
defs instead of a separate data structure to keep track of them, but since
machine_mode increased to 16 bits we're already at 64 bits in access_info,
and we can't really reuse m_is_temp as the logic in finalize_new_accesses
requires it to get cleared.

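The way the side table is consumed can be sketched in isolation. After this patch, apply_changes_to_insn adds clobbers (except call clobbers) as before, plus any set recorded as newly created during finalization. A toy model (the `def_info` type here is a stand-in, not the real rtl-ssa class):

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Toy stand-in for rtl-ssa's def hierarchy: a def is either a set or a
// clobber, and clobbers may be call clobbers.
struct def_info
{
  enum kind { SET, CLOBBER } k;
  bool call_clobber = false;
};

// Model of the def filter in apply_changes_to_insn after the fix: add
// non-call clobbers (as before), and also any set present in the side
// table of newly created sets.
int
count_defs_to_add (const std::vector<def_info *> &new_defs,
		   const std::unordered_set<def_info *> &new_sets)
{
  int n = 0;
  for (def_info *d : new_defs)
    if ((d->k == def_info::CLOBBER && !d->call_clobber)
	|| (d->k == def_info::SET && new_sets.count (d)))
      ++n;
  return n;
}
```

Before the fix, the `new_sets` arm did not exist, so a user-created set (like the new stp's def of memory) was silently skipped and never entered the def chain.
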
Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

PR target/113070
* rtl-ssa.h: Include hash-set.h.
* rtl-ssa/changes.cc (function_info::finalize_new_accesses): Add
new_sets parameter and use it to keep track of new user-created sets.
(function_info::apply_changes_to_insn): Also call add_def on new sets.
(function_info::change_insns): Add hash_set to keep track of new
user-created defs.  Plumb it through.
* rtl-ssa/functions.h: Add hash_set parameter to finalize_new_accesses
and apply_changes_to_insn.
---
 gcc/rtl-ssa.h   |  1 +
 gcc/rtl-ssa/changes.cc  | 28 +---
 gcc/rtl-ssa/functions.h |  6 --
 3 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/gcc/rtl-ssa.h b/gcc/rtl-ssa.h
index f0cf656f5ac..17337639ae8 100644
--- a/gcc/rtl-ssa.h
+++ b/gcc/rtl-ssa.h
@@ -50,6 +50,7 @@
 #include "mux-utils.h"
 #include "rtlanal.h"
 #include "cfgbuild.h"
+#include "hash-set.h"
 
 // Provides the global crtl->ssa.
 #include "memmodel.h"
diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc
index ce51d6ccd8d..6119ec3535b 100644
--- a/gcc/rtl-ssa/changes.cc
+++ b/gcc/rtl-ssa/changes.cc
@@ -429,7 +429,8 @@ update_insn_in_place (insn_change &change)
 // POS gives the final position of INSN, which hasn't yet been moved into
 // place.
 void
-function_info::finalize_new_accesses (insn_change &change, insn_info *pos)
+function_info::finalize_new_accesses (insn_change &change, insn_info *pos,
+				      hash_set<def_info *> &new_sets)
 {
   insn_info *insn = change.insn ();
 
@@ -465,6 +466,12 @@ function_info::finalize_new_accesses (insn_change &change, insn_info *pos)
 		// later in case we see a second write to the same resource.
 		def_info *perm_def = allocate<set_info> (change.insn (),
 			 def->resource ());
+
+		// Keep track of the new set so we remember to add it to the
+		// def chain later.
+		if (new_sets.add (perm_def))
+		  gcc_unreachable (); // We shouldn't see duplicates here.
+
 		def->set_last_def (perm_def);
 		def = perm_def;
 	  }
@@ -647,7 +654,8 @@ function_info::finalize_new_accesses (insn_change &change, insn_info *pos)
 // Copy information from CHANGE to its underlying insn_info, given that
 // the insn_info has already been placed appropriately.
 void
-function_info::apply_changes_to_insn (insn_change &change)
+function_info::apply_changes_to_insn (insn_change &change,
+				      hash_set<def_info *> &new_sets)
 {
   insn_info *insn = change.insn ();
   if (change.is_deletion ())
@@ -659,10 +667,11 @@ function_info::apply_changes_to_insn (insn_change &change)
   // Copy the cost.
   insn->set_cost (change.new_cost);
 
-  // Add all clobbers.  Sets and call clobbers never move relative to
-  // other definitions, so are OK as-is.
+  // Add all clobbers and newly-created sets.  Existing sets and call
+  // clobbers never move relative to other definitions, so are OK as-is.
   for (def_info *def : change.new_defs)
-    if (is_a<clobber_info *> (def) && !def->is_call_clobber ())
+    if ((is_a<clobber_info *> (def) && !def->is_call_clobber ())
+	|| (is_a<set_info *> (def) && new_sets.contains (def)))
   add_def (def);
 
   // Add all uses, now that their position is final.
@@ -793,6 +802,10 @@ function_info::change_insns (array_slice<insn_change *> changes)
   placeholders[i] = placeholder;
 }
 
+  // We need to keep track of newly-added sets as these need adding to
+  // the def chain later.
+  hash_set<def_info *> new_sets;
+
   // Finalize the new list of accesses for each change.  Don't install them yet,
   // so that we still have access to the old lists below.
   //
@@ -806,7 +819,8 @@ function_info::change_insns (array_slice<insn_change *> changes)
       insn_change &change = *changes[i];
       insn_info *placeholder = placeholders[i];
       if (!change.is_deletion ())
 	finalize_new_accesses (change,
-			       placeholder ? placeholder : change.insn ());
+			       placeholder ? placeholder : change.insn (),
+			       new_sets);
 }
 
   // Remove all definitions that are no longer needed.  After the above,
@@ -861,7 +875,7 @@ function_info::change_insns (array_slice changes)
 
   // Apply the changes to the 

[PATCH 2/4] rtl-ssa: Support for creating new uses [PR113070]

2024-01-13 Thread Alex Coplan
This exposes an interface for users to create new uses in RTL-SSA.
This is needed for updating uses after inserting a new store pair insn
in the aarch64 load/store pair fusion pass.

gcc/ChangeLog:

PR target/113070
* rtl-ssa/accesses.cc (function_info::create_use): New.
* rtl-ssa/changes.cc (function_info::finalize_new_accesses):
Handle temporary uses, ensure new uses end up referring to
permanent defs.
* rtl-ssa/functions.h (function_info::create_use): Declare.
---
 gcc/rtl-ssa/accesses.cc | 10 ++
 gcc/rtl-ssa/changes.cc  | 24 +++-
 gcc/rtl-ssa/functions.h |  5 +
 3 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/gcc/rtl-ssa/accesses.cc b/gcc/rtl-ssa/accesses.cc
index ce4a8b8dc00..3f1304fc5bf 100644
--- a/gcc/rtl-ssa/accesses.cc
+++ b/gcc/rtl-ssa/accesses.cc
@@ -1466,6 +1466,16 @@ function_info::create_set (obstack_watermark &watermark,
   return set;
 }
 
+use_info *
+function_info::create_use (obstack_watermark &watermark,
+			   insn_info *insn,
+			   set_info *set)
+{
+  auto use = change_alloc<use_info> (watermark, insn, set->resource (), set);
+  use->m_is_temp = true;
+  return use;
+}
+
 // Return true if ACCESS1 can represent ACCESS2 and if ACCESS2 can
 // represent ACCESS1.
 static bool
diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc
index e538b637848..ce51d6ccd8d 100644
--- a/gcc/rtl-ssa/changes.cc
+++ b/gcc/rtl-ssa/changes.cc
@@ -538,7 +538,9 @@ function_info::finalize_new_accesses (insn_change &change, insn_info *pos)
   unsigned int i = 0;
   for (use_info *use : change.new_uses)
     {
-      if (!use->m_has_been_superceded)
+      if (use->m_is_temp)
+	use->m_has_been_superceded = true;
+      else if (!use->m_has_been_superceded)
 	{
 	  use = allocate_temp<use_info> (insn, use->resource (), use->def ());
 	  use->m_has_been_superceded = true;
@@ -609,15 +611,27 @@ function_info::finalize_new_accesses (insn_change &change, insn_info *pos)
 	  m_temp_uses[i] = use = allocate<use_info> (*use);
 	  use->m_is_temp = false;
 	  set_info *def = use->def ();
-	  // Handle cases in which the value was previously not used
-	  // within the block.
-	  if (def && def->m_is_temp)
+	  if (!def || !def->m_is_temp)
+	    continue;
+
+	  if (auto phi = dyn_cast<phi_info *> (def))
 	    {
-	      phi_info *phi = as_a<phi_info *> (def);
+	      // Handle cases in which the value was previously not used
+	      // within the block.
 	      gcc_assert (phi->is_degenerate ());
 	      phi = create_degenerate_phi (phi->ebb (), phi->input_value (0));
 	      use->set_def (phi);
 	    }
+	  else
+	    {
+	      // The temporary def may also be a set added with this change, in
+	      // which case the permanent set is stored in the last_def link,
+	      // and we need to update the use to refer to the permanent set.
+	      gcc_assert (is_a<set_info *> (def));
+	      auto perm_set = as_a<set_info *> (def->last_def ());
+	      gcc_assert (!perm_set->is_temporary ());
+	      use->set_def (perm_set);
+	    }
 	}
 }
 
diff --git a/gcc/rtl-ssa/functions.h b/gcc/rtl-ssa/functions.h
index 58d0b50ea83..962180e27d6 100644
--- a/gcc/rtl-ssa/functions.h
+++ b/gcc/rtl-ssa/functions.h
@@ -73,6 +73,11 @@ public:
 			insn_info *insn,
 			resource_info resource);
 
+  // Create a temporary use.
+  use_info *create_use (obstack_watermark &watermark,
+			insn_info *insn,
+			set_info *set);
+
   // Create a temporary insn with code INSN_CODE and pattern PAT.
   insn_info *create_insn (obstack_watermark &watermark,
 			  rtx_code insn_code,


[PATCH 1/4] rtl-ssa: Run finalize_new_accesses forwards [PR113070]

2024-01-13 Thread Alex Coplan
The next patch in this series exposes an interface for creating new uses
in RTL-SSA.  The intent is that new user-created uses can consume new
user-created defs in the same change group.  This is so that we can
correctly update uses of memory when inserting a new store pair insn in
the aarch64 load/store pair fusion pass (the affected uses need to
consume the new store pair insn).

As it stands, finalize_new_accesses is called as part of the backwards
insn placement loop within change_insns, but if we want new uses to be
able to depend on new defs in the same change group, we need
finalize_new_accesses to be called on earlier insns first.  This is so
that when we process temporary uses and turn them into permanent uses,
we can follow the last_def link on the temporary def to ensure we end up
with a permanent use consuming a permanent def.

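A toy model of the ordering requirement (the `def`/`use` structs and the `last_def` field here are simplified stand-ins for the rtl-ssa internals described above):

```cpp
#include <cassert>

// Toy model of why finalization must run forwards: finalizing a
// temporary def allocates its permanent counterpart and records it via
// a last_def-style link; a later temporary use can then be redirected
// to the permanent def, but only if the def's change was finalized first.
struct def
{
  bool is_temp;
  def *last_def = nullptr;  // temp def -> its permanent replacement
};

struct use
{
  def *d;
};

// Finalize a def: allocate the permanent def and link it from the
// temporary one.
def *
finalize_def (def *temp)
{
  def *perm = new def{false};
  temp->last_def = perm;
  return perm;
}

// Finalize a use: follow the link so the use ends up referring to the
// permanent def.  Fails if the def has not been finalized yet, which is
// exactly what happens under the old backwards processing order.
bool
finalize_use (use *u)
{
  if (u->d && u->d->is_temp)
    {
      if (!u->d->last_def)
	return false;  // def not finalized yet: the backwards-order bug
      u->d = u->d->last_def;
    }
  return true;
}
```

Running the def's change first (the forwards order this patch introduces) makes the use's redirection trivially succeed.
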
Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

PR target/113070
* rtl-ssa/changes.cc (function_info::change_insns): Split out the call
to finalize_new_accesses from the backwards placement loop, run it
forwards in a separate loop.
---
 gcc/rtl-ssa/changes.cc | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc
index 2fac45ae885..e538b637848 100644
--- a/gcc/rtl-ssa/changes.cc
+++ b/gcc/rtl-ssa/changes.cc
@@ -775,15 +775,26 @@ function_info::change_insns (array_slice<insn_change *> changes)
 	  placeholder = add_placeholder_after (after);
 	  following_insn = placeholder;
 	}
-
-	  // Finalize the new list of accesses for the change.  Don't install
-	  // them yet, so that we still have access to the old lists below.
-	  finalize_new_accesses (change,
-				 placeholder ? placeholder : insn);
 	}
   placeholders[i] = placeholder;
 }
 
+  // Finalize the new list of accesses for each change.  Don't install them yet,
+  // so that we still have access to the old lists below.
+  //
+  // Note that we do this forwards instead of in the backwards loop above so
+  // that any new defs being inserted are processed before new uses of those
+  // defs, so that the (initially) temporary uses referring to temporary defs
+  // can be easily updated to become permanent uses referring to permanent defs.
+  for (unsigned i = 0; i < changes.size (); i++)
+    {
+      insn_change &change = *changes[i];
+      insn_info *placeholder = placeholders[i];
+      if (!change.is_deletion ())
+	finalize_new_accesses (change,
+			       placeholder ? placeholder : change.insn ());
+    }
+
   // Remove all definitions that are no longer needed.  After the above,
   // the only uses of such definitions should be dead phis and now-redundant
   // live-out uses.


[PATCH 0/4] aarch64, rtl-ssa: Fix wrong code in ldp fusion pass [PR113070]

2024-01-13 Thread Alex Coplan
This patch series restores PGO+LTO bootstrap on aarch64 (with the ldp
passes enabled) and fixes wrong code (leading to a segfault) seen in
cactuBSSN_r from SPEC CPU 2017 with PGO+LTO enabled.

For an example showing what goes wrong, see:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113070#c7

In the case that we insert a new stp insn (as opposed to re-purposing an
existing store) RTL-SSA fails to properly insert the newly-created def
of memory into the def chain and the ldp/stp pass fails to update uses
of memory immediately following an stp insn.  This can lead to alias
analysis going wrong as it ends up incorrectly skipping over the stp
insn when analysing subsequent load pair candidates.

Bootstrapped/regtested as a series with/without the passes enabled on
aarch64-linux-gnu (1/4 also tested independently and no regressions).

OK for trunk?

Thanks,
Alex

Alex Coplan (4):
  rtl-ssa: Run finalize_new_accesses forwards [PR113070]
  rtl-ssa: Support for creating new uses [PR113070]
  rtl-ssa: Ensure new defs get inserted [PR113070]
  aarch64: Fix up uses of mem following stp insert [PR113070]

 gcc/config/aarch64/aarch64-ldp-fusion.cc | 248 ++-
 gcc/rtl-ssa.h|   1 +
 gcc/rtl-ssa/accesses.cc  |  10 +
 gcc/rtl-ssa/changes.cc   |  71 +--
 gcc/rtl-ssa/functions.h  |  11 +-
 5 files changed, 269 insertions(+), 72 deletions(-)



[PATCH v3] aarch64: Fix dwarf2cfi ICEs due to recent CFI note changes [PR113077]

2024-01-10 Thread Alex Coplan
This is a v3 which addresses shortcomings of the v2 patch.  v2 was
posted here:
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642448.html

The main issue in v2 is that we were using the final (transformed)
patterns in combine_reg_notes instead of the initial patterns (thanks
Richard S for pointing that out off-list).

For frame-related insns, it seems better to use the initial patterns, as
we may have changed base away from the stack pointer, but any
frame-related single stores without a CFI note should initially have the
stack pointer as a base (and we want the CFI notes to be expressed in
terms of the stack pointer, even if we changed the base for the stp).

So that we don't have to worry about the writeback case (which seems
unlikely to ever happen anyway for frame-related insns) we punt on pairs
where there is any writeback together with a frame-related insn, and
also punt in find_trailing_add if either of the insns are frame-related.

I considered punting on frame-related insns altogether but it is useful
(at least) for the pass to merge SVE vector saves with
-msve-vector-bits=128.

Bootstrapped/regtested on aarch64-linux-gnu with/without the ldp/stp
passes enabled, OK for trunk?

Thanks,
Alex

-- >8 --

In r14-6604-gd7ee988c491cde43d04fe25f2b3dbad9d85ded45 we changed the CFI notes
attached to callee saves (in aarch64_save_callee_saves).  That patch changed
the ldp/stp representation to use unspecs instead of PARALLEL moves.  This meant
that we needed to attach CFI notes to all frame-related pair saves such that
dwarf2cfi could still emit the appropriate CFI (it cannot interpret the unspecs
directly).  The patch also attached REG_CFA_OFFSET notes to individual saves so
that the ldp/stp pass could easily preserve them when forming stps.

In that change I chose to use REG_CFA_OFFSET, but as the PR shows, that
choice was problematic in that REG_CFA_OFFSET requires the attached
store to be expressed in terms of the current CFA register at all times.
This means that even scheduling of frame-related insns can break this
invariant, leading to ICEs in dwarf2cfi.

The old behaviour (before that change) allowed dwarf2cfi to interpret the RTL
directly for sp-relative saves.  This change restores that behaviour by using
REG_FRAME_RELATED_EXPR instead of REG_CFA_OFFSET.  REG_FRAME_RELATED_EXPR
effectively just gives a different pattern for dwarf2cfi to look at instead of
the main insn pattern.  That allows us to attach the old-style PARALLEL move
representation in a REG_FRAME_RELATED_EXPR note and means we are free to always
express the save addresses in terms of the stack pointer.

Since the ldp/stp fusion pass can combine frame-related stores, this patch also
updates it to preserve REG_FRAME_RELATED_EXPR notes, and additionally gives it
the ability to synthesize those notes when combining sp-relative saves into an
stp (the latter always needs a note due to the unspec representation, the former
does not).
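For context, a sketch of the kind of source code that produces such frame-related pair saves (illustrative only; the exact codegen depends on target, flags, and register allocation). When values are live across calls, they are kept in callee-saved registers, and the prologue saves of those registers are the sp-relative stores the pass may fuse into a frame-related stp:

```c
/* 'noinline' keeps the calls opaque so a and b stay live across them;
   on aarch64 at -O2 they then occupy callee-saved registers whose
   prologue saves are sp-relative stores -- stp-fusion candidates that
   need the synthesized CFI note described above.  */
long __attribute__ ((noinline)) helper (long x) { return x + 1; }

long f (long a, long b)
{
  long x = helper (a);
  long y = helper (b);
  return x + y;
}
```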

gcc/ChangeLog:

PR target/113077
* config/aarch64/aarch64-ldp-fusion.cc (filter_notes): Add
fr_expr param to extract REG_FRAME_RELATED_EXPR notes.
(combine_reg_notes): Handle REG_FRAME_RELATED_EXPR notes, and
synthesize these if needed.  Update caller ...
(ldp_bb_info::fuse_pair): ... here.
(ldp_bb_info::try_fuse_pair): Punt if either insn has writeback
and either insn is frame-related.
(find_trailing_add): Punt on frame-related insns.
* config/aarch64/aarch64.cc (aarch64_save_callee_saves): Use
REG_FRAME_RELATED_EXPR instead of REG_CFA_OFFSET.

gcc/testsuite/ChangeLog:

PR target/113077
* gcc.target/aarch64/pr113077.c: New test.
diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 2fe1b1d4d84..689a8c884bd 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -904,9 +904,11 @@ aarch64_operand_mode_for_pair_mode (machine_mode mode)
 // Go through the reg notes rooted at NOTE, dropping those that we should drop,
 // and preserving those that we want to keep by prepending them to (and
 // returning) RESULT.  EH_REGION is used to make sure we have at most one
-// REG_EH_REGION note in the resulting list.
+// REG_EH_REGION note in the resulting list.  FR_EXPR is used to return any
+// REG_FRAME_RELATED_EXPR note we find, as these can need special handling in
+// combine_reg_notes.
 static rtx
-filter_notes (rtx note, rtx result, bool *eh_region)
+filter_notes (rtx note, rtx result, bool *eh_region, rtx *fr_expr)
 {
   for (; note; note = XEXP (note, 1))
 {
@@ -940,6 +942,10 @@ filter_notes (rtx note, rtx result, bool *eh_region)
   copy_rtx (XEXP (note, 0)),
   result);
  break;
+   case REG_FRAME_RELATED_EXPR:
+ gcc_assert (!*fr_expr);
+ *fr_expr = copy_rtx (XEXP (note, 0));
+ break;
default:
  

[PATCH] aarch64: Make ldp/stp pass off by default

2024-01-10 Thread Alex Coplan
As discussed on IRC, this makes the aarch64 ldp/stp pass off by default.  This
should stabilize the trunk and give some time to address the P1 regressions.

Sorry for the breakage.

Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?

Alex

gcc/ChangeLog:

* config/aarch64/aarch64.opt (-mearly-ldp-fusion): Set default
to 0.
(-mlate-ldp-fusion): Likewise.
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index ceed5cdb201..c495cb34fbf 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -290,12 +290,12 @@ Target Var(aarch64_track_speculation)
 Generate code to track when the CPU might be speculating incorrectly.
 
 mearly-ldp-fusion
-Target Var(flag_aarch64_early_ldp_fusion) Optimization Init(1)
+Target Var(flag_aarch64_early_ldp_fusion) Optimization Init(0)
 Enable the copy of the AArch64 load/store pair fusion pass that runs before
 register allocation.
 
 mlate-ldp-fusion
-Target Var(flag_aarch64_late_ldp_fusion) Optimization Init(1)
+Target Var(flag_aarch64_late_ldp_fusion) Optimization Init(0)
 Enable the copy of the AArch64 load/store pair fusion pass that runs after
 register allocation.
 


[PATCH v2] aarch64: Fix dwarf2cfi ICEs due to recent CFI note changes [PR113077]

2024-01-10 Thread Alex Coplan
This is a v2 which addresses feedback from v1, posted here:
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642313.html

Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

-- >8 --

In r14-6604-gd7ee988c491cde43d04fe25f2b3dbad9d85ded45 we changed the CFI notes
attached to callee saves (in aarch64_save_callee_saves).  That patch changed
the ldp/stp representation to use unspecs instead of PARALLEL moves.  This meant
that we needed to attach CFI notes to all frame-related pair saves such that
dwarf2cfi could still emit the appropriate CFI (it cannot interpret the unspecs
directly).  The patch also attached REG_CFA_OFFSET notes to individual saves so
that the ldp/stp pass could easily preserve them when forming stps.

In that change I chose to use REG_CFA_OFFSET, but as the PR shows, that
choice was problematic in that REG_CFA_OFFSET requires the attached
store to be expressed in terms of the current CFA register at all times.
This means that even scheduling of frame-related insns can break this
invariant, leading to ICEs in dwarf2cfi.

The old behaviour (before that change) allowed dwarf2cfi to interpret the RTL
directly for sp-relative saves.  This change restores that behaviour by using
REG_FRAME_RELATED_EXPR instead of REG_CFA_OFFSET.  REG_FRAME_RELATED_EXPR
effectively just gives a different pattern for dwarf2cfi to look at instead of
the main insn pattern.  That allows us to attach the old-style PARALLEL move
representation in a REG_FRAME_RELATED_EXPR note and means we are free to always
express the save addresses in terms of the stack pointer.

Since the ldp/stp fusion pass can combine frame-related stores, this patch also
updates it to preserve REG_FRAME_RELATED_EXPR notes, and additionally gives it
the ability to synthesize those notes when combining sp-relative saves into an
stp (the latter always needs a note due to the unspec representation, the former
does not).

gcc/ChangeLog:

PR target/113077
* config/aarch64/aarch64-ldp-fusion.cc (filter_notes): Add fr_expr
param to extract REG_FRAME_RELATED_EXPR notes.
(combine_reg_notes): Handle REG_FRAME_RELATED_EXPR notes, and
synthesize these if needed.  Update caller ...
(ldp_bb_info::fuse_pair): ... here.
* config/aarch64/aarch64.cc (aarch64_save_callee_saves): Use
REG_FRAME_RELATED_EXPR instead of REG_CFA_OFFSET.

gcc/testsuite/ChangeLog:

PR target/113077
* gcc.target/aarch64/pr113077.c: New test.
diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 2fe1b1d4d84..324d28797da 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -904,9 +904,11 @@ aarch64_operand_mode_for_pair_mode (machine_mode mode)
 // Go through the reg notes rooted at NOTE, dropping those that we should drop,
 // and preserving those that we want to keep by prepending them to (and
 // returning) RESULT.  EH_REGION is used to make sure we have at most one
-// REG_EH_REGION note in the resulting list.
+// REG_EH_REGION note in the resulting list.  FR_EXPR is used to return any
+// REG_FRAME_RELATED_EXPR note we find, as these can need special handling in
+// combine_reg_notes.
 static rtx
-filter_notes (rtx note, rtx result, bool *eh_region)
+filter_notes (rtx note, rtx result, bool *eh_region, rtx *fr_expr)
 {
   for (; note; note = XEXP (note, 1))
 {
@@ -940,6 +942,10 @@ filter_notes (rtx note, rtx result, bool *eh_region)
   copy_rtx (XEXP (note, 0)),
   result);
  break;
+   case REG_FRAME_RELATED_EXPR:
+ gcc_assert (!*fr_expr);
+ *fr_expr = copy_rtx (XEXP (note, 0));
+ break;
default:
  // Unexpected REG_NOTE kind.
  gcc_unreachable ();
@@ -951,13 +957,52 @@ filter_notes (rtx note, rtx result, bool *eh_region)
 
 // Return the notes that should be attached to a combination of I1 and I2, where
 // *I1 < *I2.
+//
+// LOAD_P is true for loads, REVERSED is true if the insns in program order are
+// not in offset order, and PATS gives the final RTL patterns for the accesses.
 static rtx
-combine_reg_notes (insn_info *i1, insn_info *i2)
+combine_reg_notes (insn_info *i1, insn_info *i2, bool load_p, bool reversed,
+  rtx pats[2])
 {
+  // Temporary storage for REG_FRAME_RELATED_EXPR notes.
+  rtx fr_expr[2] = {};
+
   bool found_eh_region = false;
   rtx result = NULL_RTX;
-  result = filter_notes (REG_NOTES (i2->rtl ()), result, &found_eh_region);
-  return filter_notes (REG_NOTES (i1->rtl ()), result, &found_eh_region);
+  result = filter_notes (REG_NOTES (i2->rtl ()), result,
+			 &found_eh_region, fr_expr);
+  result = filter_notes (REG_NOTES (i1->rtl ()), result,
+			 &found_eh_region, fr_expr + 1);
+
+  if (!load_p)
+{
+  // Simple frame-related sp-relative saves don't need CFI notes, but when
+  // 

[PATCH] aarch64: Fix dwarf2cfi ICEs due to recent CFI note changes [PR113077]

2024-01-09 Thread Alex Coplan
Hi,

In r14-6604-gd7ee988c491cde43d04fe25f2b3dbad9d85ded45 we changed the CFI notes
attached to callee saves (in aarch64_save_callee_saves).  That patch changed
the ldp/stp representation to use unspecs instead of PARALLEL moves.  This meant
that we needed to attach CFI notes to all frame-related pair saves such that
dwarf2cfi could still emit the appropriate CFI (it cannot interpret the unspecs
directly).  The patch also attached REG_CFA_OFFSET notes to individual saves so
that the ldp/stp pass could easily preserve them when forming stps.

In that change I chose to use REG_CFA_OFFSET, but as the PR shows, that
choice was problematic in that REG_CFA_OFFSET requires the attached
store to be expressed in terms of the current CFA register at all times.
This means that even scheduling of frame-related insns can break this
invariant, leading to ICEs in dwarf2cfi.

The old behaviour (before that change) allowed dwarf2cfi to interpret the RTL
directly for sp-relative saves.  This change restores that behaviour by using
REG_FRAME_RELATED_EXPR instead of REG_CFA_OFFSET.  REG_FRAME_RELATED_EXPR
effectively just gives a different pattern for dwarf2cfi to look at instead of
the main insn pattern.  That allows us to attach the old-style PARALLEL move
representation in a REG_FRAME_RELATED_EXPR note and means we are free to always
express the save addresses in terms of the stack pointer.

Since the ldp/stp fusion pass can combine frame-related stores, this patch also
updates it to preserve REG_FRAME_RELATED_EXPR notes, and additionally gives it
the ability to synthesize those notes when combining sp-relative saves into an
stp (the latter always needs a note due to the unspec representation, the former
does not).

Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

PR target/113077
* config/aarch64/aarch64-ldp-fusion.cc (filter_notes): Add fr_expr
param to extract REG_FRAME_RELATED_EXPR notes.
(combine_reg_notes): Handle REG_FRAME_RELATED_EXPR notes, and
synthesize these if needed.  Update caller ...
(ldp_bb_info::fuse_pair): ... here.
* config/aarch64/aarch64.cc (aarch64_save_callee_saves): Use
REG_FRAME_RELATED_EXPR instead of REG_CFA_OFFSET.

gcc/testsuite/ChangeLog:

PR target/113077
* gcc.target/aarch64/pr113077.c: New test.
diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 2fe1b1d4d84..00bc8b749c8 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -904,9 +904,11 @@ aarch64_operand_mode_for_pair_mode (machine_mode mode)
 // Go through the reg notes rooted at NOTE, dropping those that we should drop,
 // and preserving those that we want to keep by prepending them to (and
 // returning) RESULT.  EH_REGION is used to make sure we have at most one
-// REG_EH_REGION note in the resulting list.
+// REG_EH_REGION note in the resulting list.  FR_EXPR is used to return any
+// REG_FRAME_RELATED_EXPR note we find, as these can need special handling in
+// combine_reg_notes.
 static rtx
-filter_notes (rtx note, rtx result, bool *eh_region)
+filter_notes (rtx note, rtx result, bool *eh_region, rtx *fr_expr)
 {
   for (; note; note = XEXP (note, 1))
 {
@@ -940,6 +942,10 @@ filter_notes (rtx note, rtx result, bool *eh_region)
   copy_rtx (XEXP (note, 0)),
   result);
  break;
+   case REG_FRAME_RELATED_EXPR:
+ gcc_assert (!*fr_expr);
+ *fr_expr = copy_rtx (XEXP (note, 0));
+ break;
default:
  // Unexpected REG_NOTE kind.
  gcc_unreachable ();
@@ -951,13 +957,65 @@ filter_notes (rtx note, rtx result, bool *eh_region)
 
 // Return the notes that should be attached to a combination of I1 and I2, where
 // *I1 < *I2.
+//
+// LOAD_P is true for loads, REVERSED is true if the insns in
+// program order are not in offset order, BASE_REGNO is the chosen base
+// register number for the pair, and PATS gives the final RTL patterns for the
+// accesses.
 static rtx
-combine_reg_notes (insn_info *i1, insn_info *i2)
+combine_reg_notes (insn_info *i1, insn_info *i2,
+  bool load_p, bool reversed,
+  int base_regno, rtx pats[2])
 {
+  // Temporary storage for REG_FRAME_RELATED_EXPR notes.
+  rtx fr_expr[2] = {};
+
   bool found_eh_region = false;
   rtx result = NULL_RTX;
-  result = filter_notes (REG_NOTES (i2->rtl ()), result, &found_eh_region);
-  return filter_notes (REG_NOTES (i1->rtl ()), result, &found_eh_region);
+  result = filter_notes (REG_NOTES (i2->rtl ()), result,
+			 &found_eh_region, fr_expr);
+  result = filter_notes (REG_NOTES (i1->rtl ()), result,
+			 &found_eh_region, fr_expr + 1);
+
+  if (!load_p)
+{
+  // Frame-related saves must either be sp-based or must already have
+  // a REG_FRAME_RELATED_EXPR note.
+

[PATCH] aarch64: Further fix for throwing insns in ldp/stp pass [PR113217]

2024-01-05 Thread Alex Coplan
As the PR shows, the fix in
r14-6916-g057dc349021660c40699fb5c98fd9cac8e168653 was not complete.
That fix was enough to stop us trying to move throwing accesses above
nondebug insns, but due to this code in try_fuse_pair:

  // Placement strategy: push loads down and pull stores up, this should
  // help register pressure by reducing live ranges.
  if (load_p)
range.first = range.last;
  else
range.last = range.first;

we would still try to move stores up above any debug insns that occurred
immediately after the previous nondebug insn.  This patch fixes that by
narrowing the move range in the case that the second access is throwing
to exactly the range of that insn.

Note that we still need the fix to latest_hazard_before mentioned above
so as to ensure we select a suitable base and reject pairs if it isn't
viable to form the pair at the end of the BB.

Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

PR target/113217
* config/aarch64/aarch64-ldp-fusion.cc
(ldp_bb_info::try_fuse_pair): If the second access can throw,
narrow the move range to exactly that insn.

gcc/testsuite/ChangeLog:

PR target/113217
* g++.dg/pr113217.C: New test.
diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 25f9b2d01c5..2fe1b1d4d84 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -2195,6 +2195,15 @@ ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
   if (base->hazards[0])
 range.last = base->hazards[0]->prev_nondebug_insn ();
 
+  // If the second insn can throw, narrow the move range to exactly that insn.
+  // This prevents us trying to move the second insn from the end of the BB.
+  if (cfun->can_throw_non_call_exceptions
+  && find_reg_note (insns[1]->rtl (), REG_EH_REGION, NULL_RTX))
+{
+  gcc_assert (range.includes (insns[1]));
+  range = insn_range_info (insns[1]);
+}
+
   // Placement strategy: push loads down and pull stores up, this should
   // help register pressure by reducing live ranges.
   if (load_p)
diff --git a/gcc/testsuite/g++.dg/pr113217.C b/gcc/testsuite/g++.dg/pr113217.C
new file mode 100644
index 000..ec861543930
--- /dev/null
+++ b/gcc/testsuite/g++.dg/pr113217.C
@@ -0,0 +1,15 @@
+// { dg-do compile }
+// { dg-options "-O -g -fnon-call-exceptions" }
+struct _Vector_base {
+  int _M_end_of_storage;
+};
+struct vector : _Vector_base {
+  vector() : _Vector_base() {}
+  ~vector();
+};
+struct LoadGraph {
+  LoadGraph();
+  vector colors;
+  vector data_block;
+};
+LoadGraph::LoadGraph() {}


[PATCH] aarch64: Prevent moving throwing accesses in ldp/stp pass [PR113093]

2023-12-20 Thread Alex Coplan
As the PR shows, there was nothing to prevent the ldp/stp pass from
trying to move throwing insns, which lead to an RTL verification
failure.

This patch fixes that.

Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

PR target/113093
* config/aarch64/aarch64-ldp-fusion.cc (latest_hazard_before):
If the insn is throwing, record the previous insn as a hazard to
prevent moving it from the end of the BB.

gcc/testsuite/ChangeLog:

PR target/113093
* gcc.dg/pr113093.c: New test.
diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 0e2c299a0bf..59db70e9cd0 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -618,6 +618,13 @@ latest_hazard_before (insn_info *insn, rtx *ignore,
 {
   insn_info *result = nullptr;
 
+  // If the insn can throw then it is at the end of a BB and we can't
+  // move it, model this by recording a hazard in the previous insn
+  // which will prevent moving the insn up.
+  if (cfun->can_throw_non_call_exceptions
+  && find_reg_note (insn->rtl (), REG_EH_REGION, NULL_RTX))
+return insn->prev_nondebug_insn ();
+
   // Return true if we registered the hazard.
   auto hazard = [&](insn_info *h) -> bool
 {
diff --git a/gcc/testsuite/gcc.dg/pr113093.c b/gcc/testsuite/gcc.dg/pr113093.c
new file mode 100644
index 000..af2a334b45d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr113093.c
@@ -0,0 +1,4 @@
+/* { dg-do compile } */
+/* { dg-options "-Os -fharden-control-flow-redundancy -fnon-call-exceptions" } */
+_Complex long *c;
+void init() { *c = 1.0; }
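
A note on why this testcase exercises the pass (sketch, author's own reading of the patch): the store to `*c` writes 16 bytes, which aarch64 splits into two 8-byte stores -- natural stp-fusion candidates. Under -fnon-call-exceptions the dereference may throw, so the insn must end its basic block and cannot be moved. A plain-C version of the same access pattern (using the GCC `_Complex` integer and `__real__`/`__imag__` extensions):

```c
/* _Complex long is a GCC extension (complex integer type).  The store
   through c is 16 bytes wide; on aarch64 it becomes two 8-byte stores
   that the ldp/stp pass would consider fusing.  */
_Complex long value;
_Complex long *c = &value;

void init (void) { *c = 1; }  /* real part 1, imaginary part 0 */
```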


[PATCH v2] aarch64: Validate register operands early in ldp fusion pass [PR113062]

2023-12-20 Thread Alex Coplan
This is a v2 addressing Richard's feedback, v1 was posted here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640957.html

Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

-- >8 --

We were missing validation of the candidate register operands in the
ldp/stp pass.  I was relying on recog rejecting such cases when we
formed the final pair insn, but the testcase shows that with
-fharden-conditionals we attempt to combine two insns with asm_operands,
both containing mem rtxes.  This then trips the assert:

gcc_assert (change->new_uses.is_valid ());

in the stp case as we aren't expecting to have (distinct) uses of mem in
the candidate stores.

While doing this I noticed that it seems more natural to have the
initial definition of mem_size closer to its first use in track_access,
so I moved that down.

gcc/ChangeLog:

PR target/113062
* config/aarch64/aarch64-ldp-fusion.cc
(ldp_bb_info::track_access): Punt on accesses with invalid
register operands, move definition of mem_size closer to its
first use.

gcc/testsuite/ChangeLog:

PR target/113062
* gcc.dg/pr113062.c: New test.
diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 327ba4e417d..0e2c299a0bf 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -458,11 +458,14 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
   if (!ldp_operand_mode_ok_p (mem_mode))
 return;
 
-  // Note ldp_operand_mode_ok_p already rejected VL modes.
-  const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
-
   rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
 
+  // Ignore the access if the register operand isn't suitable for ldp/stp.
+  if (load_p
+  ? !aarch64_ldp_reg_operand (reg_op, mem_mode)
+  : !aarch64_stp_reg_operand (reg_op, mem_mode))
+return;
+
   // We want to segregate FP/SIMD accesses from GPR accesses.
   //
   // Before RA, we use the modes, noting that stores of constant zero
@@ -474,6 +477,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
 : (GET_MODE_CLASS (mem_mode) != MODE_INT
&& (load_p || !aarch64_const_zero_rtx_p (reg_op)));
 
+  // Note ldp_operand_mode_ok_p already rejected VL modes.
+  const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
   const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
 
   if (track_via_mem_expr (insn, mem, lfs))
diff --git a/gcc/testsuite/gcc.dg/pr113062.c b/gcc/testsuite/gcc.dg/pr113062.c
new file mode 100644
index 000..5667c17b0f6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr113062.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+/* { dg-options "-Oz -fharden-conditional-branches" } */
+long double foo;
+double bar;
+void abort();
+void check() {
+  if (foo == bar)
+abort();
+}
+


Re: [PATCH] aarch64: Validate register operands early in ldp fusion pass [PR113062]

2023-12-19 Thread Alex Coplan
On 19/12/2023 13:38, Richard Sandiford wrote:
> Alex Coplan  writes:
> > On 19/12/2023 10:15, Richard Sandiford wrote:
> >> Alex Coplan  writes:
> >> > We were missing validation of the candidate register operands in the
> >> > ldp/stp pass.  I was relying on recog rejecting such cases when we
> >> > formed the final pair insn, but the testcase shows that with
> >> > -fharden-conditionals we attempt to combine two insns with asm_operands,
> >> > both containing mem rtxes.  This then trips the assert:
> >> >
> >> > gcc_assert (change->new_uses.is_valid ());
> >> >
> >> > in the stp case as we aren't expecting to have (distinct) uses of mem in
> >> > the candidate stores.
> >> >
> >> > Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?
> >> >
> >> > Thanks,
> >> > Alex
> >> >
> >> > gcc/ChangeLog:
> >> >
> >> >  PR target/113062
> >> >  * config/aarch64/aarch64-ldp-fusion.cc
> >> >  (ldp_bb_info::track_access): Punt on accesses with invalid
> >> >  register operands.
> >> >
> >> > gcc/testsuite/ChangeLog:
> >> >
> >> >  PR target/113062
> >> >  * gcc.dg/pr113062.c: New test.
> >> >
> >> > diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >> > index 327ba4e417d..273db8c582f 100644
> >> > --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >> > +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >> > @@ -476,6 +476,12 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >> >  
> >> >const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
> >> >  
> >> > +  // Ignore the access if the register operand isn't suitable for ldp/stp.
> >> > +  if (!REG_P (reg_op)
> >> > +  && !SUBREG_P (reg_op)
> >> > +  && (load_p || !aarch64_const_zero_rtx_p (reg_op)))
> >> > +return;
> >> > +
> >> 
> >> It might be more natural to test this before:
> >> 
> >>   // We want to segregate FP/SIMD accesses from GPR accesses.
> >>   //
> >>   // Before RA, we use the modes, noting that stores of constant zero
> >>   // operands use GPRs (even in non-integer modes).  After RA, we use
> >>   // the hard register numbers.
> >>   const bool fpsimd_op_p
> >> = reload_completed
> >> ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
> >> : (GET_MODE_CLASS (mem_mode) != MODE_INT
> >>&& (load_p || !aarch64_const_zero_rtx_p (reg_op)));
> >> 
> >> so that that code is running with a pre-checked operand.
> >
> > Yeah, I agree that seems a bit more natural, I'll move the check up.
> >
> >> 
> >> Also, how about using the predicates instead:
> >> 
> >>   if (load_p
> >>   ? !aarch64_ldp_reg_operand (reg_op, VOIDmode)
> >>   : !aarch64_stp_reg_operand (reg_op, VOIDmode))
> >> return;
> >
> > I thought about doing that, but it seems that we'd effectively just be
> > re-doing the mode check we did above by calling ldp_operand_mode_ok_p
> > (assuming generic RTL rules hold), so it seems a bit wasteful to call
> > the predicates.  Given that this function is called on every (single
> > set) memory access in a function, I wonder if we should prefer the
> > inline check?
> 
> How about passing mem_mode to the predicates and making the
> above do the mode check as well?  That feels like it would scale
> well to extending forms (when implemented, and with the mode then
> specifically being the mode of the SET_SRC, so that it "agrees"
> with reg_op).

Yes, that sounds better to me, it makes the code more defensive as well
(we're actually getting some extra checking from the predicate if we do
that).

I'll respin / re-test the patch and do that.

Thanks,
Alex

> 
> Richard


Re: [PATCH] aarch64: Validate register operands early in ldp fusion pass [PR113062]

2023-12-19 Thread Alex Coplan
On 19/12/2023 10:15, Richard Sandiford wrote:
> Alex Coplan  writes:
> > We were missing validation of the candidate register operands in the
> > ldp/stp pass.  I was relying on recog rejecting such cases when we
> > formed the final pair insn, but the testcase shows that with
> > -fharden-conditionals we attempt to combine two insns with asm_operands,
> > both containing mem rtxes.  This then trips the assert:
> >
> > gcc_assert (change->new_uses.is_valid ());
> >
> > in the stp case as we aren't expecting to have (distinct) uses of mem in
> > the candidate stores.
> >
> > Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?
> >
> > Thanks,
> > Alex
> >
> > gcc/ChangeLog:
> >
> > PR target/113062
> > * config/aarch64/aarch64-ldp-fusion.cc
> > (ldp_bb_info::track_access): Punt on accesses with invalid
> > register operands.
> >
> > gcc/testsuite/ChangeLog:
> >
> > PR target/113062
> > * gcc.dg/pr113062.c: New test.
> >
> > diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> > index 327ba4e417d..273db8c582f 100644
> > --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> > +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> > @@ -476,6 +476,12 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >  
> >const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
> >  
> > +  // Ignore the access if the register operand isn't suitable for ldp/stp.
> > +  if (!REG_P (reg_op)
> > +  && !SUBREG_P (reg_op)
> > +  && (load_p || !aarch64_const_zero_rtx_p (reg_op)))
> > +return;
> > +
> 
> It might be more natural to test this before:
> 
>   // We want to segregate FP/SIMD accesses from GPR accesses.
>   //
>   // Before RA, we use the modes, noting that stores of constant zero
>   // operands use GPRs (even in non-integer modes).  After RA, we use
>   // the hard register numbers.
>   const bool fpsimd_op_p
> = reload_completed
> ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
> : (GET_MODE_CLASS (mem_mode) != MODE_INT
>&& (load_p || !aarch64_const_zero_rtx_p (reg_op)));
> 
> so that that code is running with a pre-checked operand.

Yeah, I agree that seems a bit more natural, I'll move the check up.

> 
> Also, how about using the predicates instead:
> 
>   if (load_p
>   ? !aarch64_ldp_reg_operand (reg_op, VOIDmode)
>   : !aarch64_stp_reg_operand (reg_op, VOIDmode))
> return;

I thought about doing that, but it seems that we'd effectively just be
re-doing the mode check we did above by calling ldp_operand_mode_ok_p
(assuming generic RTL rules hold), so it seems a bit wasteful to call
the predicates.  Given that this function is called on every (single
set) memory access in a function, I wonder if we should prefer the
inline check?

Thanks,
Alex

> 
> OK with those changes, or without if you prefer.
> 
> Thanks,
> Richard
> 
> >if (track_via_mem_expr (insn, mem, lfs))
> >  return;
> >  
> > diff --git a/gcc/testsuite/gcc.dg/pr113062.c b/gcc/testsuite/gcc.dg/pr113062.c
> > new file mode 100644
> > index 000..5667c17b0f6
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/pr113062.c
> > @@ -0,0 +1,10 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-Oz -fharden-conditional-branches" } */
> > +long double foo;
> > +double bar;
> > +void abort();
> > +void check() {
> > +  if (foo == bar)
> > +abort();
> > +}
> > +


[PATCH] aarch64: Validate register operands early in ldp fusion pass [PR113062]

2023-12-19 Thread Alex Coplan
We were missing validation of the candidate register operands in the
ldp/stp pass.  I was relying on recog rejecting such cases when we
formed the final pair insn, but the testcase shows that with
-fharden-conditionals we attempt to combine two insns with asm_operands,
both containing mem rtxes.  This then trips the assert:

gcc_assert (change->new_uses.is_valid ());

in the stp case as we aren't expecting to have (distinct) uses of mem in
the candidate stores.

Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

PR target/113062
* config/aarch64/aarch64-ldp-fusion.cc
(ldp_bb_info::track_access): Punt on accesses with invalid
register operands.

gcc/testsuite/ChangeLog:

PR target/113062
* gcc.dg/pr113062.c: New test.
diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 327ba4e417d..273db8c582f 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -476,6 +476,12 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
 
   const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
 
+  // Ignore the access if the register operand isn't suitable for ldp/stp.
+  if (!REG_P (reg_op)
+  && !SUBREG_P (reg_op)
+  && (load_p || !aarch64_const_zero_rtx_p (reg_op)))
+return;
+
   if (track_via_mem_expr (insn, mem, lfs))
 return;
 
diff --git a/gcc/testsuite/gcc.dg/pr113062.c b/gcc/testsuite/gcc.dg/pr113062.c
new file mode 100644
index 000..5667c17b0f6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr113062.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+/* { dg-options "-Oz -fharden-conditional-branches" } */
+long double foo;
+double bar;
+void abort();
+void check() {
+  if (foo == bar)
+abort();
+}
+


[PATCH] aarch64: Fix parens in aarch64_stp_reg_operand [PR113061]

2023-12-18 Thread Alex Coplan
In r14-6603-gfcdd2757c76bf925115b8e1ba4318d6366dd6f09 I messed up the
parentheses in aarch64_stp_reg_operand, the indentation shows the
intended nesting of the conditions.

This patch fixes that.

This fixes PR113061 which shows IRA substituting (const_int 1) into a
writeback stp pattern as a result (and LRA failing to reload the
constant).

Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

PR target/113061
* config/aarch64/predicates.md (aarch64_stp_reg_operand): Fix
parentheses to match intent.

gcc/testsuite/ChangeLog:

PR target/113061
* gfortran.dg/PR113061.f90: New test.
diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
index 9e6231691c0..510d4d2eaca 100644
--- a/gcc/config/aarch64/predicates.md
+++ b/gcc/config/aarch64/predicates.md
@@ -323,7 +323,7 @@ (define_special_predicate "aarch64_ldp_reg_operand"
 (define_special_predicate "aarch64_stp_reg_operand"
   (ior (match_operand 0 "aarch64_ldp_reg_operand")
(and (match_code "const_int,const,const_vector,const_double")
-   (match_test "aarch64_const_zero_rtx_p (op)"))
+   (match_test "aarch64_const_zero_rtx_p (op)")
(ior
  (match_test "GET_MODE (op) == VOIDmode")
  (and
@@ -331,7 +331,7 @@ (define_special_predicate "aarch64_stp_reg_operand"
(ior
  (match_test "mode == VOIDmode")
  (match_test "known_eq (GET_MODE_SIZE (mode),
-GET_MODE_SIZE (GET_MODE (op)))"))
+GET_MODE_SIZE (GET_MODE (op)))")))
 
 ;; Used for storing two 64-bit values in an AdvSIMD register using an STP
 ;; as a 128-bit vec_concat.
diff --git a/gcc/testsuite/gfortran.dg/PR113061.f90 
b/gcc/testsuite/gfortran.dg/PR113061.f90
new file mode 100644
index 000..989bc385c76
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/PR113061.f90
@@ -0,0 +1,12 @@
+! { dg-do compile }
+! { dg-options "-fno-move-loop-invariants -Oz" }
+module module_foo
+  use iso_c_binding
+  contains
+  subroutine foo(a) bind(c)
+type(c_ptr)  a(..)
+select rank(a)
+end select
+call bar
+  end
+end


Re: [PATCH v4 10/11] aarch64: Add new load/store pair fusion pass

2023-12-15 Thread Alex Coplan
On 15/12/2023 15:34, Richard Sandiford wrote:
> Alex Coplan  writes:
> > This is a v6 of the aarch64 load/store pair fusion pass, which
> > addresses the feedback from Richard's last review here:
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640539.html
> >
> > In particular this version implements the suggested changes which
> > greatly simplify the double list walk.
> >
> > Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?
> >
> > Thanks,
> > Alex
> >
> > -- >8 --
> >
> > This adds a new aarch64-specific RTL-SSA pass dedicated to forming load
> > and store pairs (LDPs and STPs).
> >
> > As a motivating example for the kind of thing this improves, take the
> > following testcase:
> >
> > extern double c[20];
> >
> > double f(double x)
> > {
> >   double y = x*x;
> >   y += c[16];
> >   y += c[17];
> >   y += c[18];
> >   y += c[19];
> >   return y;
> > }
> >
> > for which we currently generate (at -O2):
> >
> > f:
> >         adrp    x0, c
> >         add     x0, x0, :lo12:c
> >         ldp     d31, d29, [x0, 128]
> >         ldr     d30, [x0, 144]
> >         fmadd   d0, d0, d0, d31
> >         ldr     d31, [x0, 152]
> >         fadd    d0, d0, d29
> >         fadd    d0, d0, d30
> >         fadd    d0, d0, d31
> >         ret
> >
> > but with the pass, we generate:
> >
> > f:
> > .LFB0:
> >         adrp    x0, c
> >         add     x0, x0, :lo12:c
> >         ldp     d31, d29, [x0, 128]
> >         fmadd   d0, d0, d0, d31
> >         ldp     d30, d31, [x0, 144]
> >         fadd    d0, d0, d29
> >         fadd    d0, d0, d30
> >         fadd    d0, d0, d31
> >         ret
> >
> > The pass is local (only considers a BB at a time).  In theory, it should
> > be possible to extend it to run over EBBs, at least in the case of pure
> > (MEM_READONLY_P) loads, but this is left for future work.
> >
> > The pass works by identifying two kinds of bases: tree decls obtained
> > via MEM_EXPR, and RTL register bases in the form of RTL-SSA def_infos.
> > If a candidate memory access has a MEM_EXPR base, then we track it via
> > this base, and otherwise if it is of a simple reg + offset form, we track
> > it via the RTL-SSA def_info for the register.
> >
> > For each BB, for a given kind of base, we build up a hash table mapping
> > the base to an access_group.  The access_group data structure holds a
> > list of accesses at each offset relative to the same base.  It uses a
> > splay tree to support efficient insertion (while walking the bb), and
> > the nodes are chained using a linked list to support efficient
> > iteration (while doing the transformation).
> >
> > For each base, we then iterate over the access_group to identify
> > adjacent accesses, and try to form load/store pairs for those insns that
> > access adjacent memory.
> >
> > The pass is currently run twice, both before and after register
> > allocation.  The first copy of the pass is run late in the pre-RA RTL
> > pipeline, immediately after sched1, since it was found that sched1 was
> > increasing register pressure when the pass was run before.  The second
> > copy of the pass runs immediately before peephole2, so as to get any
> > opportunities that the existing ldp/stp peepholes can handle.
> >
> > There are some cases that we punt on before RA, e.g.
> > accesses relative to eliminable regs (such as the soft frame pointer).
> > We do this since we can't know the elimination offset before RA, and we
> > want to avoid the RA reloading the offset (due to being out of ldp/stp
> > immediate range) as this can generate worse code.
> >
> > The post-RA copy of the pass is there to pick up the crumbs that were
> > left behind / things we punted on in the pre-RA pass.  Among other
> > things, it's needed to handle accesses relative to the stack pointer.
> > It can also handle code that didn't exist at the time the pre-RA pass
> > was run (spill code, prologue/epilogue code).
> >
> > This is an initial implementation, and there are (among other possible
> > improvements) the following notable caveats / missing features that are
> > left for future work, but could give further improvements:
> >
> >  - Moving accesses between BBs within an EBB, see above.
> >  - Out-of-range opportunities: currently the pass refuses to form pairs
> >if there isn't a suitable base regis

[PATCH v4 10/11] aarch64: Add new load/store pair fusion pass

2023-12-15 Thread Alex Coplan
This is a v6 of the aarch64 load/store pair fusion pass, which
addresses the feedback from Richard's last review here:

https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640539.html

In particular this version implements the suggested changes which
greatly simplify the double list walk.

Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

-- >8 --

This adds a new aarch64-specific RTL-SSA pass dedicated to forming load
and store pairs (LDPs and STPs).

As a motivating example for the kind of thing this improves, take the
following testcase:

extern double c[20];

double f(double x)
{
  double y = x*x;
  y += c[16];
  y += c[17];
  y += c[18];
  y += c[19];
  return y;
}

for which we currently generate (at -O2):

f:
        adrp    x0, c
        add     x0, x0, :lo12:c
        ldp     d31, d29, [x0, 128]
        ldr     d30, [x0, 144]
        fmadd   d0, d0, d0, d31
        ldr     d31, [x0, 152]
        fadd    d0, d0, d29
        fadd    d0, d0, d30
        fadd    d0, d0, d31
        ret

but with the pass, we generate:

f:
.LFB0:
        adrp    x0, c
        add     x0, x0, :lo12:c
        ldp     d31, d29, [x0, 128]
        fmadd   d0, d0, d0, d31
        ldp     d30, d31, [x0, 144]
        fadd    d0, d0, d29
        fadd    d0, d0, d30
        fadd    d0, d0, d31
        ret

The pass is local (only considers a BB at a time).  In theory, it should
be possible to extend it to run over EBBs, at least in the case of pure
(MEM_READONLY_P) loads, but this is left for future work.

The pass works by identifying two kinds of bases: tree decls obtained
via MEM_EXPR, and RTL register bases in the form of RTL-SSA def_infos.
If a candidate memory access has a MEM_EXPR base, then we track it via
this base, and otherwise if it is of a simple reg + offset form, we track
it via the RTL-SSA def_info for the register.

For each BB, for a given kind of base, we build up a hash table mapping
the base to an access_group.  The access_group data structure holds a
list of accesses at each offset relative to the same base.  It uses a
splay tree to support efficient insertion (while walking the bb), and
the nodes are chained using a linked list to support efficient
iteration (while doing the transformation).

For each base, we then iterate over the access_group to identify
adjacent accesses, and try to form load/store pairs for those insns that
access adjacent memory.
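As an illustrative host-side sketch of this grouping-and-pairing idea (not GCC code; names are invented here, and a std::map stands in for the pass's splay tree, giving sorted per-base iteration over offsets):

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <utility>
#include <vector>

// One tracked memory access: the insn it came from, its byte offset
// relative to the (shared) base, and the access size in bytes.
struct Access { int insn_uid; int64_t offset; int size; };

// All accesses in a BB sharing one base, keyed and sorted by offset
// (the real pass uses a splay tree chained with a linked list).
using AccessGroup = std::map<int64_t, std::vector<Access>>;

// Walk consecutive offsets and report pairs whose accesses are
// adjacent in memory, i.e. the second starts where the first ends.
static std::vector<std::pair<Access, Access>>
find_adjacent (const AccessGroup &group)
{
  std::vector<std::pair<Access, Access>> pairs;
  for (auto it = group.begin (); it != group.end (); ++it)
    {
      auto next = std::next (it);
      if (next == group.end ())
	break;
      const Access &a = it->second.front ();
      const Access &b = next->second.front ();
      if (a.offset + a.size == b.offset)
	pairs.push_back ({a, b});
    }
  return pairs;
}
```

For the motivating testcase above, the four loads of `c[16..19]` would land in one group at offsets 128, 136, 144, 152, and each consecutive pair is adjacent and thus a pairing candidate.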

The pass is currently run twice, both before and after register
allocation.  The first copy of the pass is run late in the pre-RA RTL
pipeline, immediately after sched1, since it was found that sched1 was
increasing register pressure when the pass was run before.  The second
copy of the pass runs immediately before peephole2, so as to get any
opportunities that the existing ldp/stp peepholes can handle.

There are some cases that we punt on before RA, e.g.
accesses relative to eliminable regs (such as the soft frame pointer).
We do this since we can't know the elimination offset before RA, and we
want to avoid the RA reloading the offset (due to being out of ldp/stp
immediate range) as this can generate worse code.

The post-RA copy of the pass is there to pick up the crumbs that were
left behind / things we punted on in the pre-RA pass.  Among other
things, it's needed to handle accesses relative to the stack pointer.
It can also handle code that didn't exist at the time the pre-RA pass
was run (spill code, prologue/epilogue code).

This is an initial implementation, and there are (among other possible
improvements) the following notable caveats / missing features that are
left for future work, but could give further improvements:

 - Moving accesses between BBs within an EBB, see above.
 - Out-of-range opportunities: currently the pass refuses to form pairs
   if there isn't a suitable base register with an immediate in range
   for ldp/stp, but it can be profitable to emit anchor addresses in the
   case that there are four or more out-of-range nearby accesses that can
   be formed into pairs.  This is handled by the current ldp/stp
   peepholes, so it would be good to support this in the future.
 - Discovery: currently we prioritize MEM_EXPR bases over RTL bases, which can
   lead to us missing opportunities in the case that two accesses have distinct
   MEM_EXPR bases (i.e. different DECLs) but they are still adjacent in memory
   (e.g. adjacent variables on the stack).  I hope to address this for GCC 15,
   hopefully getting to the point where we can remove the ldp/stp peepholes and
   scheduling hooks.  Furthermore it would be nice to make the pass aware of
   section anchors (adding these as a third kind of base) allowing merging
   accesses to adjacent variables within the same section.

gcc/ChangeLog:

* config.gcc: Add aarch64-ldp-fusion.o to extra_objs for aarch64.
* config/aarch64/aarch64-passes.def: Add copies of pass_ldp_fusion
before and after RA.
* 

[PATCH] doc: Document AArch64-specific asm operand modifiers

2023-12-14 Thread Alex Coplan
Hi,

As it stands, GCC doesn't document any public AArch64-specific operand
modifiers for use in inline asm.  This patch fixes that by documenting
an initial set of public AArch64-specific operand modifiers.

Tested with make html and checking the output looks OK in a browser.

OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

* doc/extend.texi: Document AArch64 Operand Modifiers.
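
As a concrete (hypothetical) usage sketch of the modifiers this patch documents: the function and variable names below are invented for illustration, and the asm itself only assembles on AArch64, so the sketch falls back to plain C on other hosts.

```cpp
#include <cassert>
#include <cstdint>

// Demonstrate the 'x' modifier: print the operand as a 64-bit
// x-register (AArch64-only asm, plain-C fallback elsewhere).
static uint64_t add_u64 (uint64_t a, uint64_t b)
{
#if defined(__aarch64__)
  uint64_t r;
  __asm__ ("add %x0, %x1, %x2" : "=r" (r) : "r" (a), "r" (b));
  return r;
#else
  return a + b;
#endif
}

// Demonstrate the 'w' modifier: print the operand as a 32-bit
// w-register.
static uint32_t add_u32 (uint32_t a, uint32_t b)
{
#if defined(__aarch64__)
  uint32_t r;
  __asm__ ("add %w0, %w1, %w2" : "=r" (r) : "r" (a), "r" (b));
  return r;
#else
  return a + b;
#endif
}
```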
diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index e8b5e771f7a..6ade36759ee 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -11723,6 +11723,31 @@ operand as if it were a memory reference.
 @tab @code{%l0}
 @end multitable
 
+@anchor{aarch64Operandmodifiers}
+@subsubsection AArch64 Operand Modifiers
+
+The following table shows the modifiers supported by AArch64 and their effects:
+
+@multitable @columnfractions .10 .90
+@headitem Modifier @tab Description
+@item @code{w} @tab Print a 32-bit general-purpose register name or, given a
+constant zero operand, the 32-bit zero register (@code{wzr}).
+@item @code{x} @tab Print a 64-bit general-purpose register name or, given a
+constant zero operand, the 64-bit zero register (@code{xzr}).
+@item @code{b} @tab Print an FP/SIMD register name with a @code{b} (byte, 8-bit)
+prefix.
+@item @code{h} @tab Print an FP/SIMD register name with an @code{h} (halfword,
+16-bit) prefix.
+@item @code{s} @tab Print an FP/SIMD register name with an @code{s} (single
+word, 32-bit) prefix.
+@item @code{d} @tab Print an FP/SIMD register name with a @code{d} (doubleword,
+64-bit) prefix.
+@item @code{q} @tab Print an FP/SIMD register name with a @code{q} (quadword,
+128-bit) prefix.
+@item @code{Z} @tab Print an FP/SIMD register name as an SVE register (i.e. with
+a @code{z} prefix).  This is a no-op for SVE register operands.
+@end multitable
+
 @anchor{x86Operandmodifiers}
 @subsubsection x86 Operand Modifiers
 


[PATCH 2/2] aarch64: Handle autoinc addresses in ld1rq splitter [PR112906]

2023-12-13 Thread Alex Coplan
This patch uses the new force_reload_address routine added by the
previous patch to fix PR112906.

Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

PR target/112906
* config/aarch64/aarch64-sve.md (@aarch64_vec_duplicate_vq<mode>_le):
Use force_reload_address to reload addresses that aren't suitable for
ld1rq in the pre-RA splitter.

gcc/testsuite/ChangeLog:

PR target/112906
* gcc.target/aarch64/sve/acle/general/pr112906.c: New test.
diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index fdd14d15096..319bc01cae9 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -2690,10 +2690,7 @@ (define_insn_and_split "@aarch64_vec_duplicate_vq<mode>_le"
   {
 if (can_create_pseudo_p ()
 && !aarch64_sve_ld1rq_operand (operands[1], <MODE>mode))
-  {
-   rtx addr = force_reg (Pmode, XEXP (operands[1], 0));
-   operands[1] = replace_equiv_address (operands[1], addr);
-  }
+  operands[1] = force_reload_address (operands[1]);
 if (GET_CODE (operands[2]) == SCRATCH)
   operands[2] = gen_reg_rtx (VNx16BImode);
 emit_move_insn (operands[2], CONSTM1_RTX (VNx16BImode));
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr112906.c 
b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr112906.c
new file mode 100644
index 000..69b653f1a71
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr112906.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2" } */
+#include <arm_sve.h>
+unsigned c;
+long d;
+void f() {
+  unsigned char *b;
+  svbool_t x = svptrue_b8();
+  svuint32_t g;
+  svuint8_t h, i;
+  d = 0;
+  for (; (unsigned *)d < &c; d += 16) {
+    h = svld1rq(x, &b[d]);
+    g = svdot_lane(g, i, h, 3);
+  }
+  svst1_vnum(x, &c, 8, g);
+}


[PATCH 1/2] emit-rtl, lra: Move lra's emit_inc to emit-rtl.cc

2023-12-13 Thread Alex Coplan
Hi,

In PR112906 we ICE because we try to use force_reg to reload an
auto-increment address, but force_reg can't do this.

With the aim of fixing the PR by supporting reloading arbitrary
addresses in pre-RA splitters, this patch generalizes
lra-constraints.cc:emit_inc and makes it available to the rest of the
compiler by moving the generalized version to emit-rtl.cc.

We observe that the separate IN parameter to LRA's emit_inc is
redundant, since the function is static and is only (statically) called
once in lra-constraints.cc, with in == value.  As such, we drop the IN
parameter and simplify the code accordingly.

We wrap the emit_inc code in a virtual class to allow LRA to override
how reload pseudos are created, thereby preserving the existing LRA
behaviour as much as possible.

We then add a second (higher-level) routine to emit-rtl.cc,
force_reload_address, which can reload arbitrary addresses.  This uses
the generalized emit_inc code to handle the RTX_AUTOINC case.  The
second patch in this series uses force_reload_address to fix PR112906.

Since we intend to call address_reload_context::emit_autoinc from within
splitters, and the code lifted from LRA calls recog, we have to avoid
clobbering recog_data.  We do this by introducing a new RAII class for
saving/restoring recog_data on the stack.
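
The RAII idiom can be sketched independently of GCC's actual types (all names below are illustrative stand-ins, not the real recog_data or recog_data_saver):

```cpp
#include <cassert>

// Stand-in for the global recog_data the real class protects.
struct FakeRecogData { int n_operands; };
static FakeRecogData recog_data_g = { 0 };

// RAII saver: the constructor snapshots the global and the
// destructor restores it, so code that calls recog (clobbering
// the global) can't corrupt its caller's state.
class recog_data_saver_sketch
{
  FakeRecogData m_saved;
public:
  recog_data_saver_sketch () : m_saved (recog_data_g) {}
  ~recog_data_saver_sketch () { recog_data_g = m_saved; }
};
```

The destructor runs on every exit path, including early returns from the reload code, which is exactly why a stack-based RAII object is preferable to explicit save/restore calls here.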

Bootstrapped/regtested on aarch64-linux-gnu, bootstrapped on
x86_64-linux-gnu, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

PR target/112906
* emit-rtl.cc (address_reload_context::emit_autoinc): New.
(force_reload_address): New.
* emit-rtl.h (struct address_reload_context): Declare.
(force_reload_address): Declare.
* lra-constraints.cc (class lra_autoinc_reload_context): New.
(emit_inc): Drop IN parameter, invoke
code moved to emit-rtl.cc:address_reload_context::emit_autoinc.
(curr_insn_transform): Drop redundant IN parameter in call to
emit_inc.
* recog.h (class recog_data_saver): New.
diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc
index 4a7e420e7c0..ce7b98bf006 100644
--- a/gcc/emit-rtl.cc
+++ b/gcc/emit-rtl.cc
@@ -58,6 +58,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "rtl-iter.h"
 #include "stor-layout.h"
 #include "opts.h"
+#include "optabs.h"
 #include "predict.h"
 #include "rtx-vector-builder.h"
 #include "gimple.h"
@@ -2576,6 +2577,140 @@ replace_equiv_address_nv (rtx memref, rtx addr, bool 
inplace)
   return change_address_1 (memref, VOIDmode, addr, 0, inplace);
 }
 
+
+/* Emit insns to reload VALUE into a new register.  VALUE is an
+   auto-increment or auto-decrement RTX whose operand is a register or
+   memory location; so reloading involves incrementing that location.
+
+   INC_AMOUNT is the number to increment or decrement by (always
+   positive and ignored for POST_MODIFY/PRE_MODIFY).
+
+   Return a pseudo containing the result.  */
+rtx
+address_reload_context::emit_autoinc (rtx value, poly_int64 inc_amount)
+{
+  /* Since we're going to call recog, and might be called within recog,
+ we need to ensure we save and restore recog_data.  */
+  recog_data_saver recog_save;
+
+  /* REG or MEM to be copied and incremented.  */
+  rtx incloc = XEXP (value, 0);
+
+  const rtx_code code = GET_CODE (value);
+  const bool post_p
+= code == POST_DEC || code == POST_INC || code == POST_MODIFY;
+
+  bool plus_p = true;
+  rtx inc;
+  if (code == PRE_MODIFY || code == POST_MODIFY)
+{
+  gcc_assert (GET_CODE (XEXP (value, 1)) == PLUS
+ || GET_CODE (XEXP (value, 1)) == MINUS);
+  gcc_assert (rtx_equal_p (XEXP (XEXP (value, 1), 0), XEXP (value, 0)));
+  plus_p = GET_CODE (XEXP (value, 1)) == PLUS;
+  inc = XEXP (XEXP (value, 1), 1);
+}
+  else
+{
+  if (code == PRE_DEC || code == POST_DEC)
+   inc_amount = -inc_amount;
+
+  inc = gen_int_mode (inc_amount, GET_MODE (value));
+}
+
+  rtx result;
+  if (!post_p && REG_P (incloc))
+result = incloc;
+  else
+{
+  result = get_reload_reg ();
+  /* First copy the location to the result register.  */
+  emit_insn (gen_move_insn (result, incloc));
+}
+
+  /* See if we can directly increment INCLOC.  */
+  rtx_insn *last = get_last_insn ();
+  rtx_insn *add_insn = emit_insn (plus_p
+ ? gen_add2_insn (incloc, inc)
+ : gen_sub2_insn (incloc, inc));
+  const int icode = recog_memoized (add_insn);
+  if (icode >= 0)
+{
+  if (!post_p && result != incloc)
+   emit_insn (gen_move_insn (result, incloc));
+  return result;
+}
+  delete_insns_since (last);
+
+  /* If couldn't do the increment directly, must increment in RESULT.
+ The way we do this depends on whether this is pre- or
+ post-increment.  For pre-increment, copy INCLOC to the reload
+ register, increment it there, then save back.  */
+  if (!post_p)
+{
+  if (incloc != result)
+   emit_insn (gen_move_insn (result, 

Re: [PATCH v2 09/11] aarch64: Rewrite non-writeback ldp/stp patterns

2023-12-13 Thread Alex Coplan
On 12/12/2023 15:58, Richard Sandiford wrote:
> Alex Coplan  writes:
> > Hi,
> >
> > This is a v2 version which addresses feedback from Richard's review
> > here:
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637648.html
> >
> > I'll reply inline to address specific comments.
> >
> > Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?
> >
> > Thanks,
> > Alex
> >
> > -- >8 --
> >
> > This patch overhauls the load/store pair patterns with two main goals:
> >
> > 1. Fixing a correctness issue (the current patterns are not RA-friendly).
> > 2. Allowing more flexibility in which operand modes are supported, and which
> >combinations of modes are allowed in the two arms of the load/store pair,
> >while reducing the number of patterns required both in the source and in
> >the generated code.
> >
> > The correctness issue (1) is due to the fact that the current patterns have
> > two independent memory operands tied together only by a predicate on the 
> > insns.
> > Since LRA only looks at the constraints, one of the memory operands can get
> > reloaded without the other one being changed, leading to the insn becoming
> > unrecognizable after reload.
> >
> > We fix this issue by changing the patterns such that they only ever have one
> > memory operand representing the entire pair.  For the store case, we use an
> > unspec to logically concatenate the register operands before storing them.
> > For the load case, we use unspecs to extract the "lanes" from the pair mem,
> > with the second occurrence of the mem matched using a match_dup (such that 
> > there
> > is still really only one memory operand as far as the RA is concerned).
> >
> > In terms of the modes used for the pair memory operands, we canonicalize
> > these to V2x4QImode, V2x8QImode, and V2x16QImode.  These modes have not
> > only the correct size but also correct alignment requirement for a
> > memory operand representing an entire load/store pair.  Unlike the other
> > two, V2x4QImode didn't previously exist, so had to be added with the
> > patch.
> >
> > As with the previous patch generalizing the writeback patterns, this
> > patch aims to be flexible in the combinations of modes supported by the
> > patterns without requiring a large number of generated patterns by using
> > distinct mode iterators.
> >
> > The new scheme means we only need a single (generated) pattern for each
> > load/store operation of a given operand size.  For the 4-byte and 8-byte
> > operand cases, we use the GPI iterator to synthesize the two patterns.
> > The 16-byte case is implemented as a separate pattern in the source (due
> > to only having a single possible alternative).
> >
> > Since the UNSPEC patterns can't be interpreted by the dwarf2cfi code,
> > we add REG_CFA_OFFSET notes to the store pair insns emitted by
> > aarch64_save_callee_saves, so that correct CFI information can still be
> > generated.  Furthermore, we now unconditionally generate these CFA
> > notes on frame-related insns emitted by aarch64_save_callee_saves.
> > This is done in case that the load/store pair pass forms these into
> > pairs, in which case the CFA notes would be needed.
> >
> > We also adjust the ldp/stp peepholes to generate the new form.  This is
> > done by switching the generation to use the
> > aarch64_gen_{load,store}_pair interface, making it easier to change the
> > form in the future if needed.  (Likewise, the upcoming aarch64
> > load/store pair pass also makes use of this interface).
> >
> > This patch also adds an "ldpstp" attribute to the non-writeback
> > load/store pair patterns, which is used by the post-RA load/store pair
> > pass to identify existing patterns and see if they can be promoted to
> > writeback variants.
> >
> > One potential concern with using unspecs for the patterns is that it can 
> > block
> > optimization by the generic RTL passes.  This patch series tries to mitigate
> > this in two ways:
> >  1. The pre-RA load/store pair pass runs very late in the pre-RA pipeline.
> >  2. A later patch in the series adjusts the aarch64 mem{cpy,set} expansion 
> > to
> > emit individual loads/stores instead of ldp/stp.  These should then be
> > formed back into load/store pairs much later in the RTL pipeline by the
> > new load/store pair pass.
> >
> > gcc/ChangeLog:
> >
> > * config/aarch64/aarch64-ldpstp.md: Abstract ldp/stp
> >

[PATCH v3 10/11] aarch64: Add new load/store pair fusion pass

2023-12-07 Thread Alex Coplan
Hi,

This is a v5 of the aarch64 load/store pair fusion pass,
rebased on top of the SME changes. v4 is here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/639404.html

There are no changes to the pass itself since v4, this is just a rebase.

Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

-- >8 --

This adds a new aarch64-specific RTL-SSA pass dedicated to forming load
and store pairs (LDPs and STPs).

As a motivating example for the kind of thing this improves, take the
following testcase:

extern double c[20];

double f(double x)
{
  double y = x*x;
  y += c[16];
  y += c[17];
  y += c[18];
  y += c[19];
  return y;
}

for which we currently generate (at -O2):

f:
        adrp    x0, c
        add     x0, x0, :lo12:c
        ldp     d31, d29, [x0, 128]
        ldr     d30, [x0, 144]
        fmadd   d0, d0, d0, d31
        ldr     d31, [x0, 152]
        fadd    d0, d0, d29
        fadd    d0, d0, d30
        fadd    d0, d0, d31
        ret

but with the pass, we generate:

f:
.LFB0:
        adrp    x0, c
        add     x0, x0, :lo12:c
        ldp     d31, d29, [x0, 128]
        fmadd   d0, d0, d0, d31
        ldp     d30, d31, [x0, 144]
        fadd    d0, d0, d29
        fadd    d0, d0, d30
        fadd    d0, d0, d31
        ret

The pass is local (only considers a BB at a time).  In theory, it should
be possible to extend it to run over EBBs, at least in the case of pure
(MEM_READONLY_P) loads, but this is left for future work.

The pass works by identifying two kinds of bases: tree decls obtained
via MEM_EXPR, and RTL register bases in the form of RTL-SSA def_infos.
If a candidate memory access has a MEM_EXPR base, then we track it via
this base, and otherwise if it is of a simple reg + offset form, we track
it via the RTL-SSA def_info for the register.

For each BB, for a given kind of base, we build up a hash table mapping
the base to an access_group.  The access_group data structure holds a
list of accesses at each offset relative to the same base.  It uses a
splay tree to support efficient insertion (while walking the bb), and
the nodes are chained using a linked list to support efficient
iteration (while doing the transformation).

For each base, we then iterate over the access_group to identify
adjacent accesses, and try to form load/store pairs for those insns that
access adjacent memory.

The pass is currently run twice, both before and after register
allocation.  The first copy of the pass is run late in the pre-RA RTL
pipeline, immediately after sched1, since it was found that sched1 was
increasing register pressure when the pass was run before.  The second
copy of the pass runs immediately before peephole2, so as to get any
opportunities that the existing ldp/stp peepholes can handle.

There are some cases that we punt on before RA, e.g.
accesses relative to eliminable regs (such as the soft frame pointer).
We do this since we can't know the elimination offset before RA, and we
want to avoid the RA reloading the offset (due to being out of ldp/stp
immediate range) as this can generate worse code.

The post-RA copy of the pass is there to pick up the crumbs that were
left behind / things we punted on in the pre-RA pass.  Among other
things, it's needed to handle accesses relative to the stack pointer
(see the previous patch in the series for an example).  It can also
handle code that didn't exist at the time the pre-RA pass was run (spill
code, prologue/epilogue code).

This is an initial implementation, and there are (among other possible
improvements) the following notable caveats / missing features that are
left for future work, but could give further improvements:

 - Moving accesses between BBs within an EBB, see above.
 - Out-of-range opportunities: currently the pass refuses to form pairs
   if there isn't a suitable base register with an immediate in range
   for ldp/stp, but it can be profitable to emit anchor addresses in the
   case that there are four or more out-of-range nearby accesses that can
   be formed into pairs.  This is handled by the current ldp/stp
   peepholes, so it would be good to support this in the future.
 - Discovery: currently we prioritize MEM_EXPR bases over RTL bases, which can
   lead to us missing opportunities in the case that two accesses have distinct
   MEM_EXPR bases (i.e. different DECLs) but they are still adjacent in memory
   (e.g. adjacent variables on the stack).  I hope to address this for GCC 15,
   hopefully getting to the point where we can remove the ldp/stp peepholes and
   scheduling hooks.  Furthermore it would be nice to make the pass aware of
   section anchors (adding these as a third kind of base) allowing merging
   accesses to adjacent variables within the same section.

gcc/ChangeLog:

* config.gcc: Add aarch64-ldp-fusion.o to extra_objs for aarch64.
* config/aarch64/aarch64-passes.def: Add copies of pass_ldp_fusion
before and after RA.
* 

[PATCH v3 09/11] aarch64: Rewrite non-writeback ldp/stp patterns

2023-12-07 Thread Alex Coplan
Hi,

This is a v3, rebased on top of the SME changes.  v2 is here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/639361.html

Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

-- >8 --

This patch overhauls the load/store pair patterns with two main goals:

1. Fixing a correctness issue (the current patterns are not RA-friendly).
2. Allowing more flexibility in which operand modes are supported, and which
   combinations of modes are allowed in the two arms of the load/store pair,
   while reducing the number of patterns required both in the source and in
   the generated code.

The correctness issue (1) is due to the fact that the current patterns have
two independent memory operands tied together only by a predicate on the insns.
Since LRA only looks at the constraints, one of the memory operands can get
reloaded without the other one being changed, leading to the insn becoming
unrecognizable after reload.

We fix this issue by changing the patterns such that they only ever have one
memory operand representing the entire pair.  For the store case, we use an
unspec to logically concatenate the register operands before storing them.
For the load case, we use unspecs to extract the "lanes" from the pair mem,
with the second occurrence of the mem matched using a match_dup (such that there
is still really only one memory operand as far as the RA is concerned).

In terms of the modes used for the pair memory operands, we canonicalize
these to V2x4QImode, V2x8QImode, and V2x16QImode.  These modes have not
only the correct size but also correct alignment requirement for a
memory operand representing an entire load/store pair.  Unlike the other
two, V2x4QImode didn't previously exist, so had to be added with the
patch.

As with the previous patch generalizing the writeback patterns, this
patch aims to be flexible in the combinations of modes supported by the
patterns without requiring a large number of generated patterns by using
distinct mode iterators.

The new scheme means we only need a single (generated) pattern for each
load/store operation of a given operand size.  For the 4-byte and 8-byte
operand cases, we use the GPI iterator to synthesize the two patterns.
The 16-byte case is implemented as a separate pattern in the source (due
to only having a single possible alternative).

Since the UNSPEC patterns can't be interpreted by the dwarf2cfi code,
we add REG_CFA_OFFSET notes to the store pair insns emitted by
aarch64_save_callee_saves, so that correct CFI information can still be
generated.  Furthermore, we now unconditionally generate these CFA
notes on frame-related insns emitted by aarch64_save_callee_saves.
This is done in case the load/store pair pass later forms these into
pairs, in which case the CFA notes would be needed.

We also adjust the ldp/stp peepholes to generate the new form.  This is
done by switching the generation to use the
aarch64_gen_{load,store}_pair interface, making it easier to change the
form in the future if needed.  (Likewise, the upcoming aarch64
load/store pair pass also makes use of this interface).

This patch also adds an "ldpstp" attribute to the non-writeback
load/store pair patterns, which is used by the post-RA load/store pair
pass to identify existing patterns and see if they can be promoted to
writeback variants.

One potential concern with using unspecs for the patterns is that it can block
optimization by the generic RTL passes.  This patch series tries to mitigate
this in two ways:
 1. The pre-RA load/store pair pass runs very late in the pre-RA pipeline.
 2. A later patch in the series adjusts the aarch64 mem{cpy,set} expansion to
emit individual loads/stores instead of ldp/stp.  These should then be
formed back into load/store pairs much later in the RTL pipeline by the
new load/store pair pass.

gcc/ChangeLog:

* config/aarch64/aarch64-ldpstp.md: Abstract ldp/stp
representation from peepholes, allowing use of new form.
* config/aarch64/aarch64-modes.def (V2x4QImode): Define.
* config/aarch64/aarch64-protos.h
(aarch64_finish_ldpstp_peephole): Declare.
(aarch64_swap_ldrstr_operands): Delete declaration.
(aarch64_gen_load_pair): Adjust parameters.
(aarch64_gen_store_pair): Likewise.
* config/aarch64/aarch64-simd.md (load_pair):
Delete.
(vec_store_pair): Delete.
(load_pair): Delete.
(vec_store_pair): Delete.
* config/aarch64/aarch64.cc 
(aarch64_sme_mode_switch_regs::emit_mem_128_moves):
Use aarch64_gen_{load,store}_pair instead of emitting parallel
directly.
(aarch64_gen_store_pair): Adjust to use new unspec form of stp.
Drop second mem from parameters.
(aarch64_gen_load_pair): Likewise.
(aarch64_pair_mode_for_mode): New.
(aarch64_pair_mem_from_base): New.
(aarch64_save_callee_saves): Emit REG_CFA_OFFSET notes for

[PATCH v3 08/11] aarch64: Generalize writeback ldp/stp patterns

2023-12-07 Thread Alex Coplan
Hi,

This is a v3 patch which is rebased on top of the SME changes.
Otherwise it is the same as v2, posted here:

https://gcc.gnu.org/pipermail/gcc-patches/2023-December/639367.html

Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

-- >8 --

Thus far the writeback forms of ldp/stp have been exclusively used in
prologue and epilogue code for saving/restoring of registers to/from the
stack.

As such, forms of ldp/stp that weren't needed for prologue/epilogue code
weren't supported by the aarch64 backend.  This patch generalizes the
load/store pair writeback patterns to allow:

 - Base registers other than the stack pointer.
 - Modes that weren't previously supported.
 - Combinations of distinct modes provided they have the same size.
 - Pre/post variants that weren't previously needed in prologue/epilogue
   code.

We make quite some effort to avoid a combinatorial explosion in the
number of patterns generated (and those in the source) by making
extensive use of special predicates.

An updated version of the upcoming ldp/stp pass can generate the
writeback forms, so this patch is motivated by that.

This patch doesn't add zero-extending or sign-extending forms of the
writeback patterns; that is left for future work.

gcc/ChangeLog:

* config/aarch64/aarch64-protos.h (aarch64_ldpstp_operand_mode_p): 
Declare.
* config/aarch64/aarch64.cc (aarch64_gen_storewb_pair): Build RTL
directly instead of invoking named pattern.
(aarch64_gen_loadwb_pair): Likewise.
(aarch64_ldpstp_operand_mode_p): New.
* config/aarch64/aarch64.md (loadwb_pair_): Replace 
with
...
(*loadwb_post_pair_): ... this. Generalize as described
in cover letter.
(loadwb_pair_): Delete (superseded by the
above).
(*loadwb_post_pair_16): New.
(*loadwb_pre_pair_): New.
(loadwb_pair_): Delete.
(*loadwb_pre_pair_16): New.
(storewb_pair_): Replace with ...
(*storewb_pre_pair_): ... this.  Generalize as
described in cover letter.
(*storewb_pre_pair_16): New.
(storewb_pair_): Delete.
(*storewb_post_pair_): New.
(storewb_pair_): Delete.
(*storewb_post_pair_16): New.
* config/aarch64/predicates.md (aarch64_mem_pair_operator): New.
(pmode_plus_operator): New.
(aarch64_ldp_reg_operand): New.
(aarch64_stp_reg_operand): New.
diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 42f7bfad5cb..ee0f0a18541 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -1041,6 +1041,7 @@ bool aarch64_operands_ok_for_ldpstp (rtx *, bool, 
machine_mode);
 bool aarch64_operands_adjust_ok_for_ldpstp (rtx *, bool, machine_mode);
 bool aarch64_mem_ok_with_ldpstp_policy_model (rtx, bool, machine_mode);
 void aarch64_swap_ldrstr_operands (rtx *, bool);
+bool aarch64_ldpstp_operand_mode_p (machine_mode);
 
 extern void aarch64_asm_output_pool_epilogue (FILE *, const char *,
  tree, HOST_WIDE_INT);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index d870973dcd6..baa2b6ca3f7 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8097,23 +8097,15 @@ static rtx
 aarch64_gen_storewb_pair (machine_mode mode, rtx base, rtx reg, rtx reg2,
  HOST_WIDE_INT adjustment)
 {
-  switch (mode)
-{
-case E_DImode:
-  return gen_storewb_pairdi_di (base, base, reg, reg2,
-   GEN_INT (-adjustment),
-   GEN_INT (UNITS_PER_WORD - adjustment));
-case E_DFmode:
-  return gen_storewb_pairdf_di (base, base, reg, reg2,
-   GEN_INT (-adjustment),
-   GEN_INT (UNITS_PER_WORD - adjustment));
-case E_TFmode:
-  return gen_storewb_pairtf_di (base, base, reg, reg2,
-   GEN_INT (-adjustment),
-   GEN_INT (UNITS_PER_VREG - adjustment));
-default:
-  gcc_unreachable ();
-}
+  rtx new_base = plus_constant (Pmode, base, -adjustment);
+  rtx mem = gen_frame_mem (mode, new_base);
+  rtx mem2 = adjust_address_nv (mem, mode, GET_MODE_SIZE (mode));
+
+  return gen_rtx_PARALLEL (VOIDmode,
+  gen_rtvec (3,
+ gen_rtx_SET (base, new_base),
+ gen_rtx_SET (mem, reg),
+ gen_rtx_SET (mem2, reg2)));
 }
 
 /* Push registers numbered REGNO1 and REGNO2 to the stack, adjusting the
@@ -8145,20 +8137,15 @@ static rtx
 aarch64_gen_loadwb_pair (machine_mode mode, rtx base, rtx reg, rtx reg2,
 HOST_WIDE_INT adjustment)
 {
-  switch (mode)
-{
-case E_DImode:
-  return gen_loadwb_pairdi_di (base, base, 

[PATCH v2 10/11] aarch64: Add new load/store pair fusion pass.

2023-12-05 Thread Alex Coplan
Hi,

This is a v4 of the aarch64 load/store pair fusion pass.
This addresses feedback from the review of v3 here:

https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637756.html

I've attached the incremental change in reply to the review above.

Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

-- >8 --

This adds a new aarch64-specific RTL-SSA pass dedicated to forming load
and store pairs (LDPs and STPs).

As a motivating example for the kind of thing this improves, take the
following testcase:

extern double c[20];

double f(double x)
{
  double y = x*x;
  y += c[16];
  y += c[17];
  y += c[18];
  y += c[19];
  return y;
}

for which we currently generate (at -O2):

f:
adrp    x0, c
add x0, x0, :lo12:c
ldp d31, d29, [x0, 128]
ldr d30, [x0, 144]
fmadd   d0, d0, d0, d31
ldr d31, [x0, 152]
fadd    d0, d0, d29
fadd    d0, d0, d30
fadd    d0, d0, d31
ret

but with the pass, we generate:

f:
.LFB0:
adrp    x0, c
add x0, x0, :lo12:c
ldp d31, d29, [x0, 128]
fmadd   d0, d0, d0, d31
ldp d30, d31, [x0, 144]
fadd    d0, d0, d29
fadd    d0, d0, d30
fadd    d0, d0, d31
ret

The pass is local (only considers a BB at a time).  In theory, it should
be possible to extend it to run over EBBs, at least in the case of pure
(MEM_READONLY_P) loads, but this is left for future work.

The pass works by identifying two kinds of bases: tree decls obtained
via MEM_EXPR, and RTL register bases in the form of RTL-SSA def_infos.
If a candidate memory access has a MEM_EXPR base, then we track it via
this base; otherwise, if it is of a simple reg + constant-offset form,
we track it via the RTL-SSA def_info for the register.

For each BB, for a given kind of base, we build up a hash table mapping
the base to an access_group.  The access_group data structure holds a
list of accesses at each offset relative to the same base.  It uses a
splay tree to support efficient insertion (while walking the bb), and
the nodes are chained using a linked list to support efficient
iteration (while doing the transformation).

For each base, we then iterate over the access_group to identify
adjacent accesses, and try to form load/store pairs for those insns that
access adjacent memory.

The pass is currently run twice, both before and after register
allocation.  The first copy of the pass is run late in the pre-RA RTL
pipeline, immediately after sched1, since it was found that sched1 was
increasing register pressure when the pass was run before.  The second
copy of the pass runs immediately before peephole2, so as to get any
opportunities that the existing ldp/stp peepholes can handle.

There are some cases that we punt on before RA, e.g.
accesses relative to eliminable regs (such as the soft frame pointer).
We do this since we can't know the elimination offset before RA, and we
want to avoid the RA reloading the offset (due to being out of ldp/stp
immediate range) as this can generate worse code.

The post-RA copy of the pass is there to pick up the crumbs that were
left behind / things we punted on in the pre-RA pass.  Among other
things, it's needed to handle accesses relative to the stack pointer
(see the previous patch in the series for an example).  It can also
handle code that didn't exist at the time the pre-RA pass was run (spill
code, prologue/epilogue code).

This is an initial implementation, and there are (among other possible
improvements) the following notable caveats / missing features that are
left for future work, but could give further improvements:

 - Moving accesses between BBs within an EBB, see above.
 - Out-of-range opportunities: currently the pass refuses to form pairs
   if there isn't a suitable base register with an immediate in range
   for ldp/stp, but it can be profitable to emit anchor addresses in the
   case that there are four or more out-of-range nearby accesses that can
   be formed into pairs.  This is handled by the current ldp/stp
   peepholes, so it would be good to support this in the future.
 - Discovery: currently we prioritize MEM_EXPR bases over RTL bases, which can
   lead to us missing opportunities in the case that two accesses have distinct
   MEM_EXPR bases (i.e. different DECLs) but they are still adjacent in memory
   (e.g. adjacent variables on the stack).  I hope to address this for GCC 15,
   hopefully getting to the point where we can remove the ldp/stp peepholes and
   scheduling hooks.  Furthermore it would be nice to make the pass aware of
   section anchors (adding these as a third kind of base) allowing merging
   accesses to adjacent variables within the same section.

gcc/ChangeLog:

* config.gcc: Add aarch64-ldp-fusion.o to extra_objs for aarch64.
* config/aarch64/aarch64-passes.def: Add copies of pass_ldp_fusion
before and after RA.
* 

[PATCH v2 08/11] aarch64: Generalize writeback ldp/stp patterns

2023-12-05 Thread Alex Coplan
Hi,

This is a v2 patch which implements the requested changes from the
previous review here:

https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637642.html

The patch was pre-approved with those changes, but this patch
additionally makes use of the new aarch64_const_zero_rtx_p predicate in
aarch64_stp_reg_operand (added in v2 of the earlier patch to
aarch64_print_operand in this series).

Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

-- >8 --

Thus far the writeback forms of ldp/stp have been exclusively used in
prologue and epilogue code for saving/restoring of registers to/from the
stack.

As such, forms of ldp/stp that weren't needed for prologue/epilogue code
weren't supported by the aarch64 backend.  This patch generalizes the
load/store pair writeback patterns to allow:

 - Base registers other than the stack pointer.
 - Modes that weren't previously supported.
 - Combinations of distinct modes provided they have the same size.
 - Pre/post variants that weren't previously needed in prologue/epilogue
   code.

We make quite some effort to avoid a combinatorial explosion in the
number of patterns generated (and those in the source) by making
extensive use of special predicates.

An updated version of the upcoming ldp/stp pass can generate the
writeback forms, so this patch is motivated by that.

This patch doesn't add zero-extending or sign-extending forms of the
writeback patterns; that is left for future work.

gcc/ChangeLog:

* config/aarch64/aarch64-protos.h (aarch64_ldpstp_operand_mode_p): 
Declare.
* config/aarch64/aarch64.cc (aarch64_gen_storewb_pair): Build RTL
directly instead of invoking named pattern.
(aarch64_gen_loadwb_pair): Likewise.
(aarch64_ldpstp_operand_mode_p): New.
* config/aarch64/aarch64.md (loadwb_pair_): Replace 
with
...
(*loadwb_post_pair_): ... this. Generalize as described
in cover letter.
(loadwb_pair_): Delete (superseded by the
above).
(*loadwb_post_pair_16): New.
(*loadwb_pre_pair_): New.
(loadwb_pair_): Delete.
(*loadwb_pre_pair_16): New.
(storewb_pair_): Replace with ...
(*storewb_pre_pair_): ... this.  Generalize as
described in cover letter.
(*storewb_pre_pair_16): New.
(storewb_pair_): Delete.
(*storewb_post_pair_): New.
(storewb_pair_): Delete.
(*storewb_post_pair_16): New.
* config/aarch64/predicates.md (aarch64_mem_pair_operator): New.
(pmode_plus_operator): New.
(aarch64_ldp_reg_operand): New.
(aarch64_stp_reg_operand): New.
diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 27fc6ccf098..376b4984be6 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -1023,6 +1023,7 @@ bool aarch64_operands_ok_for_ldpstp (rtx *, bool, 
machine_mode);
 bool aarch64_operands_adjust_ok_for_ldpstp (rtx *, bool, machine_mode);
 bool aarch64_mem_ok_with_ldpstp_policy_model (rtx, bool, machine_mode);
 void aarch64_swap_ldrstr_operands (rtx *, bool);
+bool aarch64_ldpstp_operand_mode_p (machine_mode);
 
 extern void aarch64_asm_output_pool_epilogue (FILE *, const char *,
  tree, HOST_WIDE_INT);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 461449847ff..8faaa748a05 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -6601,23 +6601,15 @@ static rtx
 aarch64_gen_storewb_pair (machine_mode mode, rtx base, rtx reg, rtx reg2,
  HOST_WIDE_INT adjustment)
 {
-  switch (mode)
-{
-case E_DImode:
-  return gen_storewb_pairdi_di (base, base, reg, reg2,
-   GEN_INT (-adjustment),
-   GEN_INT (UNITS_PER_WORD - adjustment));
-case E_DFmode:
-  return gen_storewb_pairdf_di (base, base, reg, reg2,
-   GEN_INT (-adjustment),
-   GEN_INT (UNITS_PER_WORD - adjustment));
-case E_TFmode:
-  return gen_storewb_pairtf_di (base, base, reg, reg2,
-   GEN_INT (-adjustment),
-   GEN_INT (UNITS_PER_VREG - adjustment));
-default:
-  gcc_unreachable ();
-}
+  rtx new_base = plus_constant (Pmode, base, -adjustment);
+  rtx mem = gen_frame_mem (mode, new_base);
+  rtx mem2 = adjust_address_nv (mem, mode, GET_MODE_SIZE (mode));
+
+  return gen_rtx_PARALLEL (VOIDmode,
+  gen_rtvec (3,
+ gen_rtx_SET (base, new_base),
+ gen_rtx_SET (mem, reg),
+ gen_rtx_SET (mem2, reg2)));
 }
 
 /* Push registers numbered REGNO1 and REGNO2 to the stack, adjusting the
@@ -6649,20 +6641,15 @@ static rtx
 

Re: [PATCH 09/11] aarch64: Rewrite non-writeback ldp/stp patterns

2023-12-05 Thread Alex Coplan
Thanks for the review, I've posted a v2 here which addresses this feedback:
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/639361.html

On 21/11/2023 16:04, Richard Sandiford wrote:
> Alex Coplan  writes:
> > This patch overhauls the load/store pair patterns with two main goals:
> >
> > 1. Fixing a correctness issue (the current patterns are not RA-friendly).
> > 2. Allowing more flexibility in which operand modes are supported, and which
> >combinations of modes are allowed in the two arms of the load/store pair,
> >while reducing the number of patterns required both in the source and in
> >the generated code.
> >
> > The correctness issue (1) is due to the fact that the current patterns have
> > two independent memory operands tied together only by a predicate on the 
> > insns.
> > Since LRA only looks at the constraints, one of the memory operands can get
> > reloaded without the other one being changed, leading to the insn becoming
> > unrecognizable after reload.
> >
> > We fix this issue by changing the patterns such that they only ever have one
> > memory operand representing the entire pair.  For the store case, we use an
> > unspec to logically concatenate the register operands before storing them.
> > For the load case, we use unspecs to extract the "lanes" from the pair mem,
> > with the second occurrence of the mem matched using a match_dup (such that 
> > there
> > is still really only one memory operand as far as the RA is concerned).
> >
> > In terms of the modes used for the pair memory operands, we canonicalize
> > these to V2x4QImode, V2x8QImode, and V2x16QImode.  These modes have not
> > only the correct size but also correct alignment requirement for a
> > memory operand representing an entire load/store pair.  Unlike the other
> > two, V2x4QImode didn't previously exist, so had to be added with the
> > patch.
> >
> > As with the previous patch generalizing the writeback patterns, this
> > patch aims to be flexible in the combinations of modes supported by the
> > patterns without requiring a large number of generated patterns by using
> > distinct mode iterators.
> >
> > The new scheme means we only need a single (generated) pattern for each
> > load/store operation of a given operand size.  For the 4-byte and 8-byte
> > operand cases, we use the GPI iterator to synthesize the two patterns.
> > The 16-byte case is implemented as a separate pattern in the source (due
> > to only having a single possible alternative).
> >
> > Since the UNSPEC patterns can't be interpreted by the dwarf2cfi code,
> > we add REG_CFA_OFFSET notes to the store pair insns emitted by
> > aarch64_save_callee_saves, so that correct CFI information can still be
> > generated.  Furthermore, we now unconditionally generate these CFA
> > notes on frame-related insns emitted by aarch64_save_callee_saves.
> > This is done in case that the load/store pair pass forms these into
> > pairs, in which case the CFA notes would be needed.
> >
> > We also adjust the ldp/stp peepholes to generate the new form.  This is
> > done by switching the generation to use the
> > aarch64_gen_{load,store}_pair interface, making it easier to change the
> > form in the future if needed.  (Likewise, the upcoming aarch64
> > load/store pair pass also makes use of this interface).
> >
> > This patch also adds an "ldpstp" attribute to the non-writeback
> > load/store pair patterns, which is used by the post-RA load/store pair
> > pass to identify existing patterns and see if they can be promoted to
> > writeback variants.
> >
> > One potential concern with using unspecs for the patterns is that it can 
> > block
> > optimization by the generic RTL passes.  This patch series tries to mitigate
> > this in two ways:
> >  1. The pre-RA load/store pair pass runs very late in the pre-RA pipeline.
> >  2. A later patch in the series adjusts the aarch64 mem{cpy,set} expansion 
> > to
> > emit individual loads/stores instead of ldp/stp.  These should then be
> > formed back into load/store pairs much later in the RTL pipeline by the
> > new load/store pair pass.
> >
> > Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?
> >
> > Thanks,
> > Alex
> >
> > gcc/ChangeLog:
> >
> > * config/aarch64/aarch64-ldpstp.md: Abstract ldp/stp
> > representation from peepholes, allowing use of new form.
> > * config/aarch64/aarch64-modes.def (V2x4QImode): Define.
> > * config/aarch64/aarch64-protos.h
>

[PATCH v2 09/11] aarch64: Rewrite non-writeback ldp/stp patterns

2023-12-05 Thread Alex Coplan
Hi,

This is a v2 version which addresses feedback from Richard's review
here:

https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637648.html

I'll reply inline to address specific comments.

Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

-- >8 --

This patch overhauls the load/store pair patterns with two main goals:

1. Fixing a correctness issue (the current patterns are not RA-friendly).
2. Allowing more flexibility in which operand modes are supported, and which
   combinations of modes are allowed in the two arms of the load/store pair,
   while reducing the number of patterns required both in the source and in
   the generated code.

The correctness issue (1) is due to the fact that the current patterns have
two independent memory operands tied together only by a predicate on the insns.
Since LRA only looks at the constraints, one of the memory operands can get
reloaded without the other one being changed, leading to the insn becoming
unrecognizable after reload.

We fix this issue by changing the patterns such that they only ever have one
memory operand representing the entire pair.  For the store case, we use an
unspec to logically concatenate the register operands before storing them.
For the load case, we use unspecs to extract the "lanes" from the pair mem,
with the second occurrence of the mem matched using a match_dup (such that there
is still really only one memory operand as far as the RA is concerned).

In terms of the modes used for the pair memory operands, we canonicalize
these to V2x4QImode, V2x8QImode, and V2x16QImode.  These modes have not
only the correct size but also correct alignment requirement for a
memory operand representing an entire load/store pair.  Unlike the other
two, V2x4QImode didn't previously exist, so had to be added with the
patch.

As with the previous patch generalizing the writeback patterns, this
patch aims to be flexible in the combinations of modes supported by the
patterns without requiring a large number of generated patterns by using
distinct mode iterators.

The new scheme means we only need a single (generated) pattern for each
load/store operation of a given operand size.  For the 4-byte and 8-byte
operand cases, we use the GPI iterator to synthesize the two patterns.
The 16-byte case is implemented as a separate pattern in the source (due
to only having a single possible alternative).

Since the UNSPEC patterns can't be interpreted by the dwarf2cfi code,
we add REG_CFA_OFFSET notes to the store pair insns emitted by
aarch64_save_callee_saves, so that correct CFI information can still be
generated.  Furthermore, we now unconditionally generate these CFA
notes on frame-related insns emitted by aarch64_save_callee_saves.
This is done in case the load/store pair pass later forms these into
pairs, in which case the CFA notes would be needed.

We also adjust the ldp/stp peepholes to generate the new form.  This is
done by switching the generation to use the
aarch64_gen_{load,store}_pair interface, making it easier to change the
form in the future if needed.  (Likewise, the upcoming aarch64
load/store pair pass also makes use of this interface).

This patch also adds an "ldpstp" attribute to the non-writeback
load/store pair patterns, which is used by the post-RA load/store pair
pass to identify existing patterns and see if they can be promoted to
writeback variants.

One potential concern with using unspecs for the patterns is that it can block
optimization by the generic RTL passes.  This patch series tries to mitigate
this in two ways:
 1. The pre-RA load/store pair pass runs very late in the pre-RA pipeline.
 2. A later patch in the series adjusts the aarch64 mem{cpy,set} expansion to
emit individual loads/stores instead of ldp/stp.  These should then be
formed back into load/store pairs much later in the RTL pipeline by the
new load/store pair pass.

gcc/ChangeLog:

* config/aarch64/aarch64-ldpstp.md: Abstract ldp/stp
representation from peepholes, allowing use of new form.
* config/aarch64/aarch64-modes.def (V2x4QImode): Define.
* config/aarch64/aarch64-protos.h
(aarch64_finish_ldpstp_peephole): Declare.
(aarch64_swap_ldrstr_operands): Delete declaration.
(aarch64_gen_load_pair): Adjust parameters.
(aarch64_gen_store_pair): Likewise.
* config/aarch64/aarch64-simd.md (load_pair):
Delete.
(vec_store_pair): Delete.
(load_pair): Delete.
(vec_store_pair): Delete.
* config/aarch64/aarch64.cc (aarch64_pair_mode_for_mode): New.
(aarch64_gen_store_pair): Adjust to use new unspec form of stp.
Drop second mem from parameters.
(aarch64_gen_load_pair): Likewise.
(aarch64_pair_mem_from_base): New.
(aarch64_save_callee_saves): Emit REG_CFA_OFFSET notes for
frame-related saves.  Adjust call to aarch64_gen_store_pair
(aarch64_restore_callee_saves): Adjust 

[PATCH v2 06/11] aarch64: Fix up aarch64_print_operand xzr/wzr case

2023-12-05 Thread Alex Coplan
Hi,

This is a v2 of:

https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637612.html

v1 was approved as-is, but this version pulls out the test into a helper
function which is used by later patches in the series.

Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

-- >8 --

This adjusts aarch64_print_operand to recognize zero rtxes in modes other than
VOIDmode.  This allows us to use xzr/wzr for zero vectors, for example.

We extract the test into a helper function, aarch64_const_zero_rtx_p, since this
predicate is needed by later patches.

gcc/ChangeLog:

* config/aarch64/aarch64-protos.h (aarch64_const_zero_rtx_p): New.
* config/aarch64/aarch64.cc (aarch64_const_zero_rtx_p): New.
Use it ...
(aarch64_print_operand): ... here.  Recognize CONST0_RTXes in
modes other than VOIDmode.
diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index d2718cc87b3..27fc6ccf098 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -773,6 +773,7 @@ bool aarch64_expand_cpymem (rtx *);
 bool aarch64_expand_setmem (rtx *);
 bool aarch64_float_const_zero_rtx_p (rtx);
 bool aarch64_float_const_rtx_p (rtx);
+bool aarch64_const_zero_rtx_p (rtx);
 bool aarch64_function_arg_regno_p (unsigned);
 bool aarch64_fusion_enabled_p (enum aarch64_fusion_pairs);
 bool aarch64_gen_cpymemqi (rtx *);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index fca64daf2a0..a35c6bbe335 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -9095,6 +9095,15 @@ aarch64_float_const_zero_rtx_p (rtx x)
   return real_equal (CONST_DOUBLE_REAL_VALUE (x), &dconst0);
 }
 
+/* Return true if X is any kind of constant zero rtx.  */
+
+bool
+aarch64_const_zero_rtx_p (rtx x)
+{
+  return x == CONST0_RTX (GET_MODE (x))
+|| (CONST_DOUBLE_P (x) && aarch64_float_const_zero_rtx_p (x));
+}
+
 /* Return TRUE if rtx X is immediate constant that fits in a single
MOVI immediate operation.  */
 bool
@@ -9977,8 +9986,7 @@ aarch64_print_operand (FILE *f, rtx x, int code)
 
 case 'w':
 case 'x':
-  if (x == const0_rtx
- || (CONST_DOUBLE_P (x) && aarch64_float_const_zero_rtx_p (x)))
+  if (aarch64_const_zero_rtx_p (x))
{
  asm_fprintf (f, "%czr", code);
  break;


Re: [PATCH v5] c-family: Implement __has_feature and __has_extension [PR60512]

2023-11-28 Thread Alex Coplan
On 28/11/2023 17:03, Thomas Schwinge wrote:
> Hi!
> 
> On 2023-11-17T14:50:45+0000, Alex Coplan  wrote:
> > --- a/gcc/cp/cp-objcp-common.cc
> > +++ b/gcc/cp/cp-objcp-common.cc
> 
> > +/* Table of features for __has_{feature,extension}.  */
> > +
> > +static constexpr cp_feature_info cp_feature_table[] =
> > +{
> > +  { "cxx_exceptions", &flag_exceptions },
> > +  { "cxx_rtti", &flag_rtti },
> > +  { "cxx_access_control_sfinae", { cxx11, cxx98 } },
> 
> Here we see that 'cxx_exceptions', 'cxx_rtti' are dependent on
> '-fexceptions', '-frtti'.  Certain GCC configurations may decide to
> default to '-fno-exceptions' and/or '-fno-rtti'...
> 
> > --- /dev/null
> > +++ b/gcc/testsuite/g++.dg/ext/has-feature.C
> > @@ -0,0 +1,206 @@
> > +// { dg-do compile }
> > +// { dg-options "" }
> > +
> > +#define FEAT(x) (__has_feature(x) && __has_extension(x))
> > +#define CXX11 (__cplusplus >= 201103L)
> > +#define CXX14 (__cplusplus >= 201402L)
> > +
> > +#if !FEAT(cxx_exceptions) || !FEAT(cxx_rtti)
> > +#error
> > +#endif
> 
> ..., but here, they are assumed available unconditionally.  OK to push
> "Adjust 'g++.dg/ext/has-feature.C' for default-'-fno-exceptions', '-fno-rtti' 
> configurations",
> see attached?

LGTM, but I can't approve the patch.

Sorry for the breakage and thanks for the fix.

Alex

> 
> 
> Grüße
>  Thomas
> 
> 
> -
> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
> München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
> Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
> München, HRB 106955

> From 89482e73066fcd6da5dbc93402e77e28f948a96c Mon Sep 17 00:00:00 2001
> From: Thomas Schwinge 
> Date: Tue, 28 Nov 2023 15:57:09 +0100
> Subject: [PATCH] Adjust 'g++.dg/ext/has-feature.C' for
>  default-'-fno-exceptions', '-fno-rtti' configurations
> 
> ..., where you currently get:
> 
> FAIL: g++.dg/ext/has-feature.C  -std=gnu++98 (test for excess errors)
> [...]
> 
> Minor fix-up for recent commit 06280a906cb3dc80cf5e07cf3335b758848d488d
> "c-family: Implement __has_feature and __has_extension [PR60512]".
> 
>   gcc/testsuite/
>   * g++.dg/ext/has-feature.C: Adjust for default-'-fno-exceptions',
>   '-fno-rtti' configurations.
> ---
>  gcc/testsuite/g++.dg/ext/has-feature.C | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/gcc/testsuite/g++.dg/ext/has-feature.C 
> b/gcc/testsuite/g++.dg/ext/has-feature.C
> index 52191b78fd6..bcfe82469ae 100644
> --- a/gcc/testsuite/g++.dg/ext/has-feature.C
> +++ b/gcc/testsuite/g++.dg/ext/has-feature.C
> @@ -5,7 +5,11 @@
>  #define CXX11 (__cplusplus >= 201103L)
>  #define CXX14 (__cplusplus >= 201402L)
>  
> -#if !FEAT(cxx_exceptions) || !FEAT(cxx_rtti)
> +#if FEAT(cxx_exceptions) != !!__cpp_exceptions
> +#error
> +#endif
> +
> +#if FEAT(cxx_rtti) != !!__cpp_rtti
>  #error
>  #endif
>  
> -- 
> 2.34.1
> 



Re: [PATCH] c++: Fix up __has_extension (cxx_init_captures)

2023-11-28 Thread Alex Coplan
On 28/11/2023 09:22, Jakub Jelinek wrote:
> On Mon, Nov 27, 2023 at 10:58:04AM +0000, Alex Coplan wrote:
> > Many thanks both for the reviews, this is now pushed (with Jason's
> > above changes implemented) as g:06280a906cb3dc80cf5e07cf3335b758848d488d.
> 
> The new test FAILs everywhere with GXX_TESTSUITE_STDS=98,11,14,17,20,2b
> I'm normally using for testing.
> FAIL: g++.dg/ext/has-feature.C  -std=gnu++11 (test for excess errors)
> Excess errors:
> /home/jakub/src/gcc/gcc/testsuite/g++.dg/ext/has-feature.C:185:2: error: 
> #error 
> 
> This is on
> #if __has_extension (cxx_init_captures) != CXX11
> #error
> #endif
> Comparing the values with clang++ on godbolt and with what is actually
> implemented:
> void foo () { auto a = [b = 3]() { return b; }; }
> both clang++ and GCC implement init captures as extension already in C++11
> (and obviously not in C++98 because lambdas aren't implemented there),
> unless -pedantic-errors/-Werror=pedantic, so I think we should change
> the FE to match the test rather than the other way around.
> 
> Tested on x86_64-linux with
> GXX_TESTSUITE_STDS=98,11,14,17,20,23,26 make check-g++ 
> RUNTESTFLAGS="--target_board=unix\{-m32,-m64\} dg.exp='has-feature.C'"
> Ok for trunk?
> 
> Making __has_extension return __has_feature for -pedantic-errors and not
> for -Werror=pedantic is just weird, but as that is what clang++ implements
> and this is for compatibility with it, I can live with it (but perhaps
> we should mention it in the documentation).  Note, the warnings/errors
> can be changed using pragmas inside of the source, so whether one can
> use an extension or not depends on where in the code it is (__extension__
> to the rescue if it can be specified around it).
> I wonder if the has-feature.C test shouldn't be #included in other 2 tests,
> one where -pedantic-errors would be in dg-options and through some macro
> tell the file that __has_extension will behave like __has_feature, and
> another with -Werror=pedantic to document that the option doesn't change
> it.
> 
> 2023-11-28  Jakub Jelinek  
> 
>   * cp-objcp-common.cc (cp_feature_table): Evaluate
>   __has_extension (cxx_init_captures) to 1 even for -std=c++11.
> 
> --- gcc/cp/cp-objcp-common.cc.jj  2023-11-27 17:34:25.0 +0100
> +++ gcc/cp/cp-objcp-common.cc 2023-11-28 08:55:18.868419864 +0100
> @@ -145,7 +145,7 @@ static constexpr cp_feature_info cp_feat
>{ "cxx_contextual_conversions", { cxx14, cxx98 } },
>{ "cxx_decltype_auto", cxx14 },
>{ "cxx_aggregate_nsdmi", cxx14 },
> -  { "cxx_init_captures", cxx14 },
> +  { "cxx_init_captures", { cxx14, cxx11 } },

FWIW it looks like this is what I had in the original RFC here:

https://gcc.gnu.org/pipermail/gcc-patches/2023-May/617878.html

but Jason suggested we be more conservative about what we advertise as
extensions in his review here:

https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618232.html

so it looks like I just missed updating the test when making that change,
and I think it would be better to update the test.

Thanks,
Alex

>{ "cxx_generic_lambdas", cxx14 },
>{ "cxx_relaxed_constexpr", cxx14 },
>{ "cxx_return_type_deduction", cxx14 },
> 
> 
>   Jakub
> 


Re: [PATCH v5] c-family: Implement __has_feature and __has_extension [PR60512]

2023-11-27 Thread Alex Coplan
On 23/11/2023 12:41, Marek Polacek wrote:
> On Mon, Nov 20, 2023 at 05:29:58PM -0500, Jason Merrill wrote:
> > On 11/17/23 09:50, Alex Coplan wrote:
> > > Hi,
> > > 
> > > This is a v5 patch to address Marek's feedback here:
> > > https://gcc.gnu.org/pipermail/gcc-patches/2023-November/635157.html
> > > 
> > > I also implemented Jason's suggestion to use constexpr for the tables
> > > from this review:
> > > https://gcc.gnu.org/pipermail/gcc-patches/2023-October/634484.html
> > > 
> > > I'll attach the incremental change in reply to Marek's review to make
> > > things easier to compare.
> > > 
> > > Bootstrapped/regtested on aarch64-linux-gnu.  Bootstrap/regtest on
> > > x86_64-apple-darwin in progress (on top of this libsanitizer fix:
> > > https://github.com/llvm/llvm-project/issues/72639).
> > > 
> > > OK for trunk if testing passes?
> > 
> > > --- a/gcc/c-family/c-common.h
> > > +/* Implemented in c/c-objc-common.cc.  */
> > > +extern void c_register_features ();
> > 
> > I think this declaration should go in c-objc-common.h, though the C
> > maintainers might prefer c-lang.h or c-tree.h.
> > 
> > > +/* Implemented in cp/cp-objcp-common.cc.  */
> > > +extern void cp_register_features ();
> > 
> > And this one in cp-objc-common.h.
> > 
> > With that change the patch is OK on Friday if Marek doesn't have any other
> > notes.
> 
> v5 looks good to me.  Thanks,

Many thanks both for the reviews, this is now pushed (with Jason's
above changes implemented) as g:06280a906cb3dc80cf5e07cf3335b758848d488d.

Alex

> 
> Marek
> 


Re: [PATCH 02/11] rtl-ssa: Add some helpers for removing accesses

2023-11-23 Thread Alex Coplan
On 21/11/2023 16:49, Richard Sandiford wrote:
> Richard Sandiford  writes:
> > Alex Coplan  writes:
> >> This adds some helpers to access-utils.h for removing accesses from an
> >> access_array.  This is needed by the upcoming aarch64 load/store pair
> >> fusion pass.
> >>
> >> Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?
> >>
> >> gcc/ChangeLog:
> >>
> >>* rtl-ssa/access-utils.h (filter_accesses): New.
> >>(remove_regno_access): New.
> >>(check_remove_regno_access): New.
> >> ---
> >>  gcc/rtl-ssa/access-utils.h | 42 ++
> >>  1 file changed, 42 insertions(+)
> >>
> >> diff --git a/gcc/rtl-ssa/access-utils.h b/gcc/rtl-ssa/access-utils.h
> >> index f078625babf..31259d742d9 100644
> >> --- a/gcc/rtl-ssa/access-utils.h
> >> +++ b/gcc/rtl-ssa/access-utils.h
> >> @@ -78,6 +78,48 @@ drop_memory_access (T accesses)
> >>return T (arr.begin (), accesses.size () - 1);
> >>  }
> >>  
> >> +// Filter ACCESSES to return an access_array of only those accesses that
> >> +// satisfy PREDICATE.  Allocate the new array above WATERMARK.
> >> +template<typename T, typename FilterPredicate>
> >> +inline T
> >> +filter_accesses (obstack_watermark &watermark,
> >> +   T accesses,
> >> +   FilterPredicate predicate)
> >> +{
> >> +  access_array_builder builder (watermark);
> >> +  builder.reserve (accesses.size ());
> >> +  auto it = accesses.begin ();
> >> +  auto end = accesses.end ();
> >> +  for (; it != end; it++)
> >> +if (predicate (*it))
> >> +  builder.quick_push (*it);
> >
> > It looks like the last five lines could be simplified to:
> >
> >   for (access_info *access : accesses)
> > if (!predicate (access))
> >   builder.quick_push (access);

So I implemented these changes, but I found that I had to use auto
instead of access_info * for the type of access in the for loop.

That allows callers to use the most specific/derived type for the
parameter in the predicate (e.g. use `use_info *` for an array of uses).

Is it OK with that change?  I've attached a revised patch.
Bootstrapped/regtested on aarch64-linux-gnu.

Thanks,
Alex

> 
> Oops, I meant:
> 
>  if (predicate (access))
> 
> of course :)
diff --git a/gcc/rtl-ssa/access-utils.h b/gcc/rtl-ssa/access-utils.h
index f078625babf..9a62addfd2a 100644
--- a/gcc/rtl-ssa/access-utils.h
+++ b/gcc/rtl-ssa/access-utils.h
@@ -78,6 +78,46 @@ drop_memory_access (T accesses)
   return T (arr.begin (), accesses.size () - 1);
 }
 
+// Filter ACCESSES to return an access_array of only those accesses that
+// satisfy PREDICATE.  Allocate the new array above WATERMARK.
+template<typename T, typename FilterPredicate>
+inline T
+filter_accesses (obstack_watermark &watermark,
+		 T accesses,
+		 FilterPredicate predicate)
+{
+  access_array_builder builder (watermark);
+  builder.reserve (accesses.size ());
+  for (auto access : accesses)
+if (predicate (access))
+  builder.quick_push (access);
+  return T (builder.finish ());
+}
+
+// Given an array of ACCESSES, remove any access with regno REGNO.
+// Allocate the new access array above WM.
+template<typename T>
+inline T
+remove_regno_access (obstack_watermark &watermark,
+		     T accesses, unsigned int regno)
+{
+  using Access = decltype (accesses[0]);
+  auto pred = [regno](Access a) { return a->regno () != regno; };
+  return filter_accesses (watermark, accesses, pred);
+}
+
+// As above, but additionally check that we actually did remove an access.
+template<typename T>
+inline T
+check_remove_regno_access (obstack_watermark &watermark,
+			   T accesses, unsigned regno)
+{
+  auto orig_size = accesses.size ();
+  auto result = remove_regno_access (watermark, accesses, regno);
+  gcc_assert (result.size () < orig_size);
+  return result;
+}
+
 // If sorted array ACCESSES includes a reference to REGNO, return the
 // access, otherwise return null.
template<typename T>
diff --git a/gcc/rtl-ssa/accesses.cc b/gcc/rtl-ssa/accesses.cc
index 76d70fd8bd3..9ec0e6be071 100644
--- a/gcc/rtl-ssa/accesses.cc
+++ b/gcc/rtl-ssa/accesses.cc
@@ -1597,16 +1597,14 @@ access_array
rtl_ssa::remove_note_accesses_base (obstack_watermark &watermark,
				    access_array accesses)
 {
+  auto predicate = [](access_info *a) {
+return !a->only_occurs_in_notes ();
+  };
+
   for (access_info *access : accesses)
 if (access->only_occurs_in_notes ())
-  {
-   access_array_builder builder (watermark);
-   builder.reserve (accesses.size ());
-   for (access_info *access2 : accesses)
- if (!access2->only_occurs_in_notes ())
-   builder.quick_push (access2);
-   return builder.finish ();
-  }
+  return filter_accesses (watermark, accesses, predicate);
+
   return accesses;
 }
 


[PATCH v2 1/11] rtl-ssa: Support for inserting new insns

2023-11-23 Thread Alex Coplan
Hi,

This is a v2, original patch is here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637606.html

This addresses review feedback and:
 - Fixes a bug in the previous version in
   function_info::finalize_new_accesses; we should now correctly handle
   the case where properties.refs () has two writes to a resource and we're
   adding a new (temporary) set for that resource.
 - Drops some handling for new uses which isn't needed now that RTL-SSA can
   infer uses of mem (since g:505f1202e3a1a1aecd0df10d0f1620df6fea4ab5).

Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

-- >8 --

The upcoming aarch64 load pair pass needs to form store pairs, and can
re-order stores over loads when alias analysis determines this is safe.
In the case that both mem defs have uses in the RTL-SSA IR, and both
stores require re-ordering over their uses, we represent that as
(tentative) deletion of the original store insns and creation of a new
insn, to prevent requiring repeated re-parenting of uses during the
pass.  We then update all mem uses that require re-parenting in one go
at the end of the pass.

To support this, RTL-SSA needs to handle inserting new insns (rather
than just changing existing ones), so this patch adds support for that.

New insns (and new accesses) are temporaries, allocated above a temporary
obstack_watermark, such that the user can easily back out of a change without
awkward bookkeeping.

gcc/ChangeLog:

* rtl-ssa/accesses.cc (function_info::create_set): New.
* rtl-ssa/accesses.h (access_info::is_temporary): New.
* rtl-ssa/changes.cc (move_insn): Handle new (temporary) insns.
(function_info::finalize_new_accesses): Handle new/temporary
user-created accesses.
(function_info::apply_changes_to_insn): Ensure m_is_temp flag
on new insns gets cleared.
(function_info::change_insns): Handle new/temporary insns.
(function_info::create_insn): New.
* rtl-ssa/changes.h (class insn_change): Make function_info a
friend class.
* rtl-ssa/functions.h (function_info): Declare new entry points:
create_set, create_insn.  Declare new change_alloc helper.
* rtl-ssa/insns.cc (insn_info::print_full): Identify temporary insns in
dump.
* rtl-ssa/insns.h (insn_info): Add new m_is_temp flag and accompanying
is_temporary accessor.
* rtl-ssa/internals.inl (insn_info::insn_info): Initialize m_is_temp to
false.
* rtl-ssa/member-fns.inl (function_info::change_alloc): New.
* rtl-ssa/movement.h (restrict_movement_for_defs_ignoring): Add
handling for temporary defs.
diff --git a/gcc/rtl-ssa/accesses.cc b/gcc/rtl-ssa/accesses.cc
index 510545a8bad..76d70fd8bd3 100644
--- a/gcc/rtl-ssa/accesses.cc
+++ b/gcc/rtl-ssa/accesses.cc
@@ -1456,6 +1456,16 @@ function_info::make_uses_available (obstack_watermark &watermark,
   return use_array (new_uses, num_uses);
 }
 
+set_info *
+function_info::create_set (obstack_watermark &watermark,
+			   insn_info *insn,
+			   resource_info resource)
+{
+  auto set = change_alloc<set_info> (watermark, insn, resource);
+  set->m_is_temp = true;
+  return set;
+}
+
 // Return true if ACCESS1 can represent ACCESS2 and if ACCESS2 can
 // represent ACCESS1.
 static bool
diff --git a/gcc/rtl-ssa/accesses.h b/gcc/rtl-ssa/accesses.h
index fce31d46717..7e7a90ece97 100644
--- a/gcc/rtl-ssa/accesses.h
+++ b/gcc/rtl-ssa/accesses.h
@@ -204,6 +204,10 @@ public:
   // in the main instruction pattern.
   bool only_occurs_in_notes () const { return m_only_occurs_in_notes; }
 
+  // Return true if this is a temporary access, e.g. one created for
+  // an insn that is about to be inserted.
+  bool is_temporary () const { return m_is_temp; }
+
 protected:
   access_info (resource_info, access_kind);
 
diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc
index aab532b9f26..2f2d12d5f30 100644
--- a/gcc/rtl-ssa/changes.cc
+++ b/gcc/rtl-ssa/changes.cc
@@ -394,14 +394,20 @@ move_insn (insn_change &change, insn_info *after)
   // At the moment we don't support moving instructions between EBBs,
   // but this would be worth adding if it's useful.
   insn_info *insn = change.insn ();
-  gcc_assert (after->ebb () == insn->ebb ());
+
   bb_info *bb = after->bb ();
   basic_block cfg_bb = bb->cfg_bb ();
 
-  if (insn->bb () != bb)
-    // Force DF to mark the old block as dirty.
-    df_insn_delete (rtl);
-  ::remove_insn (rtl);
+  if (!insn->is_temporary ())
+{
+  gcc_assert (after->ebb () == insn->ebb ());
+
+  if (insn->bb () != bb)
+   // Force DF to mark the old block as dirty.
+   df_insn_delete (rtl);
+  ::remove_insn (rtl);
+}
+
   ::add_insn_after (rtl, after_rtl, cfg_bb);
 }
 
@@ -437,12 +443,33 @@ function_info::finalize_new_accesses (insn_change &change, insn_info *pos)
   {
def_info *def = find_access (change.new_defs, ref.regno);
gcc_assert (def);
+
+  

Re: [PATCH 01/11] rtl-ssa: Support for inserting new insns

2023-11-22 Thread Alex Coplan
On 21/11/2023 11:51, Richard Sandiford wrote:
> Alex Coplan  writes:
> > N.B. this is just a rebased (but otherwise unchanged) version of the
> > same patch already posted here:
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-October/633348.html
> >
> > this is the only unreviewed dependency from the previous series, so it
> > seemed easier just to re-post it (not least to appease the pre-commit
> > CI).
> >
> > -- >8 --
> >
> > The upcoming aarch64 load pair pass needs to form store pairs, and can
> > re-order stores over loads when alias analysis determines this is safe.
> > In the case that both mem defs have uses in the RTL-SSA IR, and both
> > stores require re-ordering over their uses, we represent that as
> > (tentative) deletion of the original store insns and creation of a new
> > insn, to prevent requiring repeated re-parenting of uses during the
> > pass.  We then update all mem uses that require re-parenting in one go
> > at the end of the pass.
> >
> > To support this, RTL-SSA needs to handle inserting new insns (rather
> > than just changing existing ones), so this patch adds support for that.
> >
> > New insns (and new accesses) are temporaries, allocated above a temporary
> > obstack_watermark, such that the user can easily back out of a change 
> > without
> > awkward bookkeeping.
> >
> > Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?
> >
> > gcc/ChangeLog:
> >
> > * rtl-ssa/accesses.cc (function_info::create_set): New.
> > * rtl-ssa/accesses.h (access_info::is_temporary): New.
> > * rtl-ssa/changes.cc (move_insn): Handle new (temporary) insns.
> > (function_info::finalize_new_accesses): Handle new/temporary
> > user-created accesses.
> > (function_info::apply_changes_to_insn): Ensure m_is_temp flag
> > on new insns gets cleared.
> > (function_info::change_insns): Handle new/temporary insns.
> > (function_info::create_insn): New.
> > * rtl-ssa/changes.h (class insn_change): Make function_info a
> > friend class.
> > * rtl-ssa/functions.h (function_info): Declare new entry points:
> > create_set, create_insn.  Declare new change_alloc helper.
> > * rtl-ssa/insns.cc (insn_info::print_full): Identify temporary 
> > insns in
> > dump.
> > * rtl-ssa/insns.h (insn_info): Add new m_is_temp flag and 
> > accompanying
> > is_temporary accessor.
> > * rtl-ssa/internals.inl (insn_info::insn_info): Initialize 
> > m_is_temp to
> > false.
> > * rtl-ssa/member-fns.inl (function_info::change_alloc): New.
> > * rtl-ssa/movement.h (restrict_movement_for_defs_ignoring): Add
> > handling for temporary defs.
> 
> Looks good, but there were a couple of things I didn't understand:

Thanks for the review.

> 
> > ---
> >  gcc/rtl-ssa/accesses.cc| 10 ++
> >  gcc/rtl-ssa/accesses.h |  4 +++
> >  gcc/rtl-ssa/changes.cc | 74 +++---
> >  gcc/rtl-ssa/changes.h  |  2 ++
> >  gcc/rtl-ssa/functions.h| 14 
> >  gcc/rtl-ssa/insns.cc   |  5 +++
> >  gcc/rtl-ssa/insns.h|  7 +++-
> >  gcc/rtl-ssa/internals.inl  |  1 +
> >  gcc/rtl-ssa/member-fns.inl | 12 +++
> >  gcc/rtl-ssa/movement.h |  8 -
> >  10 files changed, 123 insertions(+), 14 deletions(-)
> >
> > diff --git a/gcc/rtl-ssa/accesses.cc b/gcc/rtl-ssa/accesses.cc
> > index 510545a8bad..76d70fd8bd3 100644
> > --- a/gcc/rtl-ssa/accesses.cc
> > +++ b/gcc/rtl-ssa/accesses.cc
> > @@ -1456,6 +1456,16 @@ function_info::make_uses_available (obstack_watermark &watermark,
> >return use_array (new_uses, num_uses);
> >  }
> >  
> > +set_info *
> > +function_info::create_set (obstack_watermark &watermark,
> > +			   insn_info *insn,
> > +			   resource_info resource)
> > +{
> > +  auto set = change_alloc<set_info> (watermark, insn, resource);
> > +  set->m_is_temp = true;
> > +  return set;
> > +}
> > +
> >  // Return true if ACCESS1 can represent ACCESS2 and if ACCESS2 can
> >  // represent ACCESS1.
> >  static bool
> > diff --git a/gcc/rtl-ssa/accesses.h b/gcc/rtl-ssa/accesses.h
> > index fce31d46717..7e7a90ece97 100644
> > --- a/gcc/rtl-ssa/accesses.h
> > +++ b/gcc/rtl-ssa/accesses.h
> > @@ -204,6 +204,10 @@ public:
> >// in the main instruction pattern.
> >bool onl

Re: [PATCH v4] c-family: Implement __has_feature and __has_extension [PR60512]

2023-11-17 Thread Alex Coplan
On 03/11/2023 12:19, Marek Polacek wrote:
> On Wed, Sep 27, 2023 at 03:27:30PM +0100, Alex Coplan wrote:
> > Hi,
> > 
> > This is a v4 patch to address Jason's feedback here:
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630911.html
> > 
> > w.r.t. v3 it just removes a comment now that some uncertainty around
> > cxx_binary_literals has been resolved, and updates the documentation as
> > suggested to point to the Clang docs.
> > 
> > --
> > 
> > This patch implements clang's __has_feature and __has_extension in GCC.
> > Currently the patch aims to implement all documented features (and some
> > undocumented ones) following the documentation at
> > https://clang.llvm.org/docs/LanguageExtensions.html with the exception
> > of the legacy features for C++ type traits.  These are omitted, since as
> > the clang documentation notes, __has_builtin is the correct "modern" way
> > to query for these (which GCC already implements).
> > 
> > Bootstrapped/regtested on aarch64-linux-gnu, bootstrapped on
> > x86_64-apple-darwin, darwin regtest in progress.  OK for trunk if
> > testing passes?
> 
> Thanks for the patch.  I only have a few minor comments.

Thanks a lot for the detailed review.  Please see the incremental change
from v4 to v5 attached (which addresses your comments).  The full v5
patch is posted here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637028.html

> 
> > diff --git a/gcc/c-family/c-common.cc b/gcc/c-family/c-common.cc
> > index aae57260097..1210953d33a 100644
> > --- a/gcc/c-family/c-common.cc
> > +++ b/gcc/c-family/c-common.cc
> > @@ -311,6 +311,43 @@ const struct fname_var_t fname_vars[] =
> >{NULL, 0, 0},
> >  };
> >  
> > +/* Flags to restrict availability of generic features that
> > +   are known to __has_{feature,extension}.  */
> > +
> > +enum
> > +{
> > +  HF_FLAG_EXT = 1, /* Available only as an extension.  */
> > +  HF_FLAG_SANITIZE = 2, /* Availability depends on sanitizer flags.  */
> > +};
> 
> Why not have a new HF_FLAG_ = 0 here and use it below...

Sure, I've used HF_FLAG_NONE for this in the updated patch.

> 
> > +/* Info for generic features which can be queried through
> > +   __has_{feature,extension}.  */
> > +
> > +struct hf_feature_info
> > +{
> > +  const char *ident;
> > +  unsigned flags;
> > +  unsigned mask;
> 
> Not enum sanitize_code for mask?

I initially intended the mask field to have a flexible interpretation
depending on the value of flags, i.e. it's only interpreted as
enum sanitize_code if flags has HF_FLAG_SANITIZE set.  Of course, at
the moment the mask field happens to only be used for sanitizer flags.

So personally I'd lean towards keeping the type as is with the view to
allowing re-purposing in the future, but happy to change it if you feel
strongly.

> 
> > +};
> > +
> > +/* Table of generic features which can be queried through
> > +   __has_{feature,extension}.  */
> > +
> > +static const hf_feature_info has_feature_table[] =
> > +{
> > +  { "address_sanitizer",   HF_FLAG_SANITIZE, SANITIZE_ADDRESS },
> > +  { "thread_sanitizer",HF_FLAG_SANITIZE, SANITIZE_THREAD },
> > +  { "leak_sanitizer",  HF_FLAG_SANITIZE, SANITIZE_LEAK },
> > +  { "hwaddress_sanitizer", HF_FLAG_SANITIZE, SANITIZE_HWADDRESS },
> > +  { "undefined_behavior_sanitizer", HF_FLAG_SANITIZE, SANITIZE_UNDEFINED },
> > +  { "attribute_deprecated_with_message",  0, 0 },
> > +  { "attribute_unavailable_with_message", 0, 0 },
> > +  { "enumerator_attributes", 0, 0 },
> > +  { "tls", 0, 0 },
> 
> ...here?  Might be more obvious what it means then.
> 
> > +  { "gnu_asm_goto_with_outputs", HF_FLAG_EXT, 0 },
> > +  { "gnu_asm_goto_with_outputs_full",HF_FLAG_EXT, 0 }
> > +};
> > +
> >  /* Global visibility options.  */
> >  struct visibility_flags visibility_options;
> >  
> > @@ -9808,4 +9845,63 @@ c_strict_flex_array_level_of (tree array_field)
> >return strict_flex_array_level;
> >  }
> >  
> > +/* Map from identifiers to booleans.  Value is true for features, and
> > +   false for extensions.  Used to implement __has_{feature,extension}.  */
> > +
> > > +using feature_map_t = hash_map <tree, bool>;
> > +static feature_map_t *feature_map = nullptr;
> 
> You don't need " = nullptr" here.


[PATCH v5] c-family: Implement __has_feature and __has_extension [PR60512]

2023-11-17 Thread Alex Coplan
Hi,

This is a v5 patch to address Marek's feedback here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/635157.html

I also implemented Jason's suggestion to use constexpr for the tables
from this review:
https://gcc.gnu.org/pipermail/gcc-patches/2023-October/634484.html

I'll attach the incremental change in reply to Marek's review to make
things easier to compare.

Bootstrapped/regtested on aarch64-linux-gnu.  Bootstrap/regtest on
x86_64-apple-darwin in progress (on top of this libsanitizer fix:
https://github.com/llvm/llvm-project/issues/72639).

OK for trunk if testing passes?

Thanks,
Alex

-- >8 --

This patch implements clang's __has_feature and __has_extension in GCC.
Currently the patch aims to implement all documented features (and some
undocumented ones) following the documentation at
https://clang.llvm.org/docs/LanguageExtensions.html with the exception
of the legacy features for C++ type traits.  These are omitted, since as
the clang documentation notes, __has_builtin is the correct "modern" way
to query for these (which GCC already implements).

gcc/c-family/ChangeLog:

PR c++/60512
* c-common.cc (struct hf_feature_info): New.
(c_common_register_feature): New.
(init_has_feature): New.
(has_feature_p): New.
* c-common.h (c_common_has_feature): New.
(c_family_register_lang_features): New.
(c_common_register_feature): New.
(has_feature_p): New.
(c_register_features): New.
(cp_register_features): New.
* c-lex.cc (init_c_lex): Plumb through has_feature callback.
(c_common_has_builtin): Generalize and move common part ...
(c_common_lex_availability_macro): ... here.
(c_common_has_feature): New.
* c-ppoutput.cc (init_pp_output): Plumb through has_feature.

gcc/c/ChangeLog:

PR c++/60512
* c-lang.cc (c_family_register_lang_features): New.
* c-objc-common.cc (struct c_feature_info): New.
(c_register_features): New.

gcc/cp/ChangeLog:

PR c++/60512
* cp-lang.cc (c_family_register_lang_features): New.
* cp-objcp-common.cc (struct cp_feature_selector): New.
(cp_feature_selector::has_feature): New.
(struct cp_feature_info): New.
(cp_register_features): New.

gcc/ChangeLog:

PR c++/60512
* doc/cpp.texi: Document __has_{feature,extension}.

gcc/objc/ChangeLog:

PR c++/60512
* objc-act.cc (struct objc_feature_info): New.
(objc_nonfragile_abi_p): New.
(objc_common_register_features): New.
* objc-act.h (objc_common_register_features): New.
* objc-lang.cc (c_family_register_lang_features): New.

gcc/objcp/ChangeLog:

PR c++/60512
* objcp-lang.cc (c_family_register_lang_features): New.

libcpp/ChangeLog:

PR c++/60512
* include/cpplib.h (struct cpp_callbacks): Add has_feature.
(enum cpp_builtin_type): Add BT_HAS_{FEATURE,EXTENSION}.
* init.cc: Add __has_{feature,extension}.
* macro.cc (_cpp_builtin_macro_text): Handle
BT_HAS_{FEATURE,EXTENSION}.

gcc/testsuite/ChangeLog:

PR c++/60512
* c-c++-common/has-feature-common.c: New test.
* c-c++-common/has-feature-pedantic.c: New test.
* g++.dg/ext/has-feature.C: New test.
* gcc.dg/asan/has-feature-asan.c: New test.
* gcc.dg/has-feature.c: New test.
* gcc.dg/ubsan/has-feature-ubsan.c: New test.
* obj-c++.dg/has-feature.mm: New test.
* objc.dg/has-feature.m: New test.
diff --git a/gcc/c-family/c-common.cc b/gcc/c-family/c-common.cc
index 0ea0c4f4bef..f270fa2f5b5 100644
--- a/gcc/c-family/c-common.cc
+++ b/gcc/c-family/c-common.cc
@@ -311,6 +311,44 @@ const struct fname_var_t fname_vars[] =
   {NULL, 0, 0},
 };
 
+/* Flags to restrict availability of generic features that
+   are known to __has_{feature,extension}.  */
+
+enum
+{
+  HF_FLAG_NONE = 0,
+  HF_FLAG_EXT = 1, /* Available only as an extension.  */
+  HF_FLAG_SANITIZE = 2, /* Availability depends on sanitizer flags.  */
+};
+
+/* Info for generic features which can be queried through
+   __has_{feature,extension}.  */
+
+struct hf_feature_info
+{
+  const char *ident;
+  unsigned flags;
+  unsigned mask;
+};
+
+/* Table of generic features which can be queried through
+   __has_{feature,extension}.  */
+
+static constexpr hf_feature_info has_feature_table[] =
+{
+  { "address_sanitizer",   HF_FLAG_SANITIZE, SANITIZE_ADDRESS },
+  { "thread_sanitizer",HF_FLAG_SANITIZE, SANITIZE_THREAD },
+  { "leak_sanitizer",  HF_FLAG_SANITIZE, SANITIZE_LEAK },
+  { "hwaddress_sanitizer", HF_FLAG_SANITIZE, SANITIZE_HWADDRESS },
+  { "undefined_behavior_sanitizer", HF_FLAG_SANITIZE, SANITIZE_UNDEFINED },
+  { "attribute_deprecated_with_message",  HF_FLAG_NONE, 0 },
+  { "attribute_unavailable_with_message", HF_FLAG_NONE, 0 },
+  { "enumerator_attributes", 

[PATCH 11/11] aarch64: Use individual loads/stores for mem{cpy,set} expansion

2023-11-16 Thread Alex Coplan
This patch adjusts the mem{cpy,set} expansion in the aarch64 backend to use
individual loads/stores instead of ldp/stp at expand time.  The idea is to rely
on the ldp fusion pass to fuse the accesses together later in the RTL pipeline.

The earlier parts of the RTL pipeline should be able to do a better job with the
individual (non-paired) accesses, especially given that an earlier patch in this
series moves the pair representation to use unspecs.

Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

* config/aarch64/aarch64.cc
(aarch64_copy_one_block_and_progress_pointers): Emit individual
accesses instead of load/store pairs.
(aarch64_set_one_block_and_progress_pointer): Likewise.
---
 gcc/config/aarch64/aarch64.cc | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 1f6094bf1bc..315ba7119c0 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25457,9 +25457,12 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
   /* "Cast" the pointers to the correct mode.  */
   *src = adjust_address (*src, mode, 0);
   *dst = adjust_address (*dst, mode, 0);
-  /* Emit the memcpy.  */
-  emit_insn (aarch64_gen_load_pair (reg1, reg2, *src));
-  emit_insn (aarch64_gen_store_pair (*dst, reg1, reg2));
+  /* Emit the memcpy.  The load/store pair pass should form
+	 a load/store pair from these moves.  */
+  emit_move_insn (reg1, *src);
+  emit_move_insn (reg2, aarch64_progress_pointer (*src));
+  emit_move_insn (*dst, reg1);
+  emit_move_insn (aarch64_progress_pointer (*dst), reg2);
   /* Move the pointers forward.  */
   *src = aarch64_move_pointer (*src, 32);
   *dst = aarch64_move_pointer (*dst, 32);
@@ -25638,7 +25641,8 @@ aarch64_set_one_block_and_progress_pointer (rtx src, rtx *dst,
   /* "Cast" the *dst to the correct mode.  */
   *dst = adjust_address (*dst, mode, 0);
   /* Emit the memset.  */
-  emit_insn (aarch64_gen_store_pair (*dst, src, src));
+  emit_move_insn (*dst, src);
+  emit_move_insn (aarch64_progress_pointer (*dst), src);
 
   /* Move the pointers forward.  */
   *dst = aarch64_move_pointer (*dst, 32);


[PATCH 10/11] aarch64: Add new load/store pair fusion pass.

2023-11-16 Thread Alex Coplan
This is a v3 of the aarch64 load/store pair fusion pass.
v2 was posted here:
 - https://gcc.gnu.org/pipermail/gcc-patches/2023-October/633601.html

The main changes since v2 are as follows:

We now handle writeback opportunities as well.  E.g. for this testcase:

void foo (long *p, long *q, long x, long y)
{
  do {
*(p++) = x;
*(p++) = y;
  } while (p < q);
}

with the patch, we generate:

foo:
.LFB0:
.align  3
.L2:
stp x2, x3, [x0], 16
cmp x0, x1
bcc .L2
ret

instead of:

foo:
.LFB0:
.align  3
.L2:
str x2, [x0], 16
str x3, [x0, -8]
cmp x0, x1
bcc .L2
ret

i.e. the pass is now capable of finding load/store pair opportunities even in
the case that one or more of the initial candidate accesses uses writeback 
addressing.
We do this by adding a notion of canonicalizing RTL bases.  When we see a
writeback access, we record that the new base def is equivalent to the original
def plus some offset.  When tracking accesses, we then canonicalize to track
each access relative to the earliest equivalent base in the basic block.

This allows us to spot that accesses are adjacent even though they don't share
the same RTL-SSA base def.

Furthermore, we also add some extra logic to opportunistically fold in trailing
destructive updates of the base register used for a load/store pair.  E.g. for

void post_add (long *p, long *q, long x, long y)
{
  do {
p[0] = x;
p[1] = y;
p += 2;
  } while (p < q);
}

the auto-inc-dec pass doesn't currently form any writeback accesses, and we
generate:

post_add:
.LFB0:
.align  3
.L2:
add x0, x0, 16
stp x2, x3, [x0, -16]
cmp x0, x1
bcc .L2
ret

but with the updated pass, we now get:

post_add:
.LFB0:
.align  3
.L2:
stp x2, x3, [x0], 16
cmp x0, x1
bcc .L2
ret

Other notable changes to the pass since the last version include:
 - We switch to using the aarch64_gen_{load,store}_pair interface
   for forming the (non-writeback) pairs, allowing use of the new
   load/store pair representation added by the earlier patch.
 - The various updates to the load/store pair patterns mean that
   we no longer need to do mode canonicalization / mode unification
   in the pass, as the patterns allow arbitrary combinations of suitable modes
   of the same size.  So we remove the logic to do this (including the
   param to control the strategy).
 - Fix up classification of zero operands to make sure that these are always
   treated as GPR operands for pair discovery purposes.  This avoids us
   pairing zero operands with FPRs in the pre-RA pass, which used to lead to
   undesirable codegen involving cross-file moves.
 - We also remove the try_adjust_address logic from the previous iteration of
   the pass.  Since we validate all ldp/stp offsets in the pass, this only
   meant that we lost opportunities in the case that a given mem fails to
   adjust in its original mode.

Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

* config.gcc: Add aarch64-ldp-fusion.o to extra_objs for aarch64; add
aarch64-ldp-fusion.cc to target_gtfiles.
* config/aarch64/aarch64-passes.def: Add copies of pass_ldp_fusion
before and after RA.
* config/aarch64/aarch64-protos.h (make_pass_ldp_fusion): Declare.
* config/aarch64/aarch64.opt (-mearly-ldp-fusion): New.
(-mlate-ldp-fusion): New.
(--param=aarch64-ldp-alias-check-limit): New.
(--param=aarch64-ldp-writeback): New.
* config/aarch64/t-aarch64: Add rule for aarch64-ldp-fusion.o.
* config/aarch64/aarch64-ldp-fusion.cc: New file.
---
 gcc/config.gcc   |4 +-
 gcc/config/aarch64/aarch64-ldp-fusion.cc | 2727 ++
 gcc/config/aarch64/aarch64-passes.def|2 +
 gcc/config/aarch64/aarch64-protos.h  |1 +
 gcc/config/aarch64/aarch64.opt   |   23 +
 gcc/config/aarch64/t-aarch64 |7 +
 6 files changed, 2762 insertions(+), 2 deletions(-)
 create mode 100644 gcc/config/aarch64/aarch64-ldp-fusion.cc

diff --git a/gcc/config.gcc b/gcc/config.gcc
index c1460ca354e..8b7f6b20309 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -349,8 +349,8 @@ aarch64*-*-*)
 	c_target_objs="aarch64-c.o"
 	cxx_target_objs="aarch64-c.o"
 	d_target_objs="aarch64-d.o"
-	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
-	target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
+	extra_objs="aarch64-builtins.o aarch-common.o 

[PATCH 09/11] aarch64: Rewrite non-writeback ldp/stp patterns

2023-11-16 Thread Alex Coplan
This patch overhauls the load/store pair patterns with two main goals:

1. Fixing a correctness issue (the current patterns are not RA-friendly).
2. Allowing more flexibility in which operand modes are supported, and which
   combinations of modes are allowed in the two arms of the load/store pair,
   while reducing the number of patterns required both in the source and in
   the generated code.

The correctness issue (1) is due to the fact that the current patterns have
two independent memory operands tied together only by a predicate on the insns.
Since LRA only looks at the constraints, one of the memory operands can get
reloaded without the other one being changed, leading to the insn becoming
unrecognizable after reload.

We fix this issue by changing the patterns such that they only ever have one
memory operand representing the entire pair.  For the store case, we use an
unspec to logically concatenate the register operands before storing them.
For the load case, we use unspecs to extract the "lanes" from the pair mem,
with the second occurrence of the mem matched using a match_dup (such that there
is still really only one memory operand as far as the RA is concerned).
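As a rough sketch of the new shape (the pattern, unspec, predicate and constraint names below are invented for illustration and are not the patch's actual text):

```lisp
;; Illustrative sketch only -- names are made up for exposition.
;; Store pair: one pair-mode mem, two regs concatenated via an unspec.
(define_insn "*store_pair_sketch"
  [(set (match_operand:V2x8QI 0 "memory_operand" "=m")
	(unspec:V2x8QI [(match_operand:DI 1 "register_operand" "r")
			(match_operand:DI 2 "register_operand" "r")]
		       UNSPEC_STP_SKETCH))]
  ""
  "stp\t%x1, %x2, %0")

;; Load pair: the "lanes" are extracted from the single pair mem; the
;; second occurrence of the mem is a match_dup, so as far as the RA is
;; concerned there is only one memory operand to reload.
(define_insn "*load_pair_sketch"
  [(set (match_operand:DI 0 "register_operand" "=r")
	(unspec:DI [(match_operand:V2x8QI 1 "memory_operand" "m")]
		   UNSPEC_LDP_FST_SKETCH))
   (set (match_operand:DI 2 "register_operand" "=r")
	(unspec:DI [(match_dup 1)] UNSPEC_LDP_SND_SKETCH))]
  ""
  "ldp\t%x0, %x2, %1")
```

Because both accesses flow through the one pair-mode operand, LRA can no longer reload half of the pair independently, which is the correctness fix described above.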

In terms of the modes used for the pair memory operands, we canonicalize
these to V2x4QImode, V2x8QImode, and V2x16QImode.  These modes have not
only the correct size but also correct alignment requirement for a
memory operand representing an entire load/store pair.  Unlike the other
two, V2x4QImode didn't previously exist, so had to be added with the
patch.

As with the previous patch generalizing the writeback patterns, this
patch aims to be flexible in the combinations of modes supported by the
patterns without requiring a large number of generated patterns by using
distinct mode iterators.

The new scheme means we only need a single (generated) pattern for each
load/store operation of a given operand size.  For the 4-byte and 8-byte
operand cases, we use the GPI iterator to synthesize the two patterns.
The 16-byte case is implemented as a separate pattern in the source (due
to only having a single possible alternative).

Since the UNSPEC patterns can't be interpreted by the dwarf2cfi code,
we add REG_CFA_OFFSET notes to the store pair insns emitted by
aarch64_save_callee_saves, so that correct CFI information can still be
generated.  Furthermore, we now unconditionally generate these CFA
notes on frame-related insns emitted by aarch64_save_callee_saves.
This is done in case the load/store pair pass later forms these insns
into pairs, in which case the CFA notes would be needed.
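In GCC-internals terms, attaching such a note looks roughly like the following fragment (a sketch using the real RTX_FRAME_RELATED_P/add_reg_note helpers, but with illustrative variable names; it is not the patch's exact code):

```c
/* Sketch: mark a store-pair insn frame-related and describe each
   component save as a plain SET, so dwarf2cfi does not need to
   interpret the unspec-based store-pair pattern itself.  */
RTX_FRAME_RELATED_P (insn) = 1;
add_reg_note (insn, REG_CFA_OFFSET, gen_rtx_SET (mem1, reg1));
add_reg_note (insn, REG_CFA_OFFSET, gen_rtx_SET (mem2, reg2));
```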

We also adjust the ldp/stp peepholes to generate the new form.  This is
done by switching the generation to use the
aarch64_gen_{load,store}_pair interface, making it easier to change the
form in the future if needed.  (Likewise, the upcoming aarch64
load/store pair pass also makes use of this interface).

This patch also adds an "ldpstp" attribute to the non-writeback
load/store pair patterns, which is used by the post-RA load/store pair
pass to identify existing patterns and see if they can be promoted to
writeback variants.

One potential concern with using unspecs for the patterns is that it can block
optimization by the generic RTL passes.  This patch series tries to mitigate
this in two ways:
 1. The pre-RA load/store pair pass runs very late in the pre-RA pipeline.
 2. A later patch in the series adjusts the aarch64 mem{cpy,set} expansion to
emit individual loads/stores instead of ldp/stp.  These should then be
formed back into load/store pairs much later in the RTL pipeline by the
new load/store pair pass.

Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

* config/aarch64/aarch64-ldpstp.md: Abstract ldp/stp
representation from peepholes, allowing use of the new form.
* config/aarch64/aarch64-modes.def (V2x4QImode): Define.
* config/aarch64/aarch64-protos.h
(aarch64_finish_ldpstp_peephole): Declare.
(aarch64_swap_ldrstr_operands): Delete declaration.
(aarch64_gen_load_pair): Declare.
(aarch64_gen_store_pair): Declare.
* config/aarch64/aarch64-simd.md (load_pair):
Delete.
(vec_store_pair): Delete.
(load_pair): Delete.
(vec_store_pair): Delete.
* config/aarch64/aarch64.cc (aarch64_pair_mode_for_mode): New.
(aarch64_gen_store_pair): Adjust to use new unspec form of stp.
Drop second mem from parameters.
(aarch64_gen_load_pair): Likewise.
(aarch64_pair_mem_from_base): New.
(aarch64_save_callee_saves): Emit REG_CFA_OFFSET notes for
frame-related saves.  Adjust call to aarch64_gen_store_pair.
(aarch64_restore_callee_saves): Adjust calls to
aarch64_gen_load_pair to account for change in interface.
(aarch64_process_components): Likewise.
(aarch64_classify_address): Handle 32-byte pair mems in
LDP_STP_N case.

[PATCH 08/11] aarch64: Generalize writeback ldp/stp patterns

2023-11-16 Thread Alex Coplan
Thus far the writeback forms of ldp/stp have been exclusively used in
prologue and epilogue code for saving/restoring of registers to/from the
stack.

As such, forms of ldp/stp that weren't needed for prologue/epilogue code
weren't supported by the aarch64 backend.  This patch generalizes the
load/store pair writeback patterns to allow:

 - Base registers other than the stack pointer.
 - Modes that weren't previously supported.
 - Combinations of distinct modes provided they have the same size.
 - Pre/post variants that weren't previously needed in prologue/epilogue
   code.
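For reference, the generalized pre/post-index forms correspond to AArch64 assembly like the following (note the base register need not be the stack pointer):

```asm
stp x0, x1, [x2, #-16]!   // pre-index:  x2 -= 16 first, then store at [x2]
ldp x0, x1, [x2], #16     // post-index: load from [x2], then x2 += 16
```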

We make quite some effort to avoid a combinatorial explosion in the
number of patterns generated (and those in the source) by making
extensive use of special predicates.

An updated version of the upcoming ldp/stp pass can generate the
writeback forms, so this patch is motivated by that.

This patch doesn't add zero-extending or sign-extending forms of the
writeback patterns; that is left for future work.

Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?

gcc/ChangeLog:

* config/aarch64/aarch64-protos.h (aarch64_ldpstp_operand_mode_p): Declare.
* config/aarch64/aarch64.cc (aarch64_gen_storewb_pair): Build RTL
directly instead of invoking named pattern.
(aarch64_gen_loadwb_pair): Likewise.
(aarch64_ldpstp_operand_mode_p): New.
* config/aarch64/aarch64.md (loadwb_pair_): Replace with ...
(*loadwb_post_pair_): ... this.  Generalize as described in cover letter.
(loadwb_pair_): Delete (superseded by the above).
(*loadwb_post_pair_16): New.
(*loadwb_pre_pair_): New.
(loadwb_pair_): Delete.
(*loadwb_pre_pair_16): New.
(storewb_pair_): Replace with ...
(*storewb_pre_pair_): ... this.  Generalize as described in cover letter.
(*storewb_pre_pair_16): New.
(storewb_pair_): Delete.
(*storewb_post_pair_): New.
(storewb_pair_): Delete.
(*storewb_post_pair_16): New.
* config/aarch64/predicates.md (aarch64_mem_pair_operator): New.
(pmode_plus_operator): New.
(aarch64_ldp_reg_operand): New.
(aarch64_stp_reg_operand): New.
---
 gcc/config/aarch64/aarch64-protos.h |   1 +
 gcc/config/aarch64/aarch64.cc   |  60 +++---
 gcc/config/aarch64/aarch64.md   | 284 
 gcc/config/aarch64/predicates.md|  38 
 4 files changed, 271 insertions(+), 112 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 36d6c688bc8..e463fd5c817 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -1023,6 +1023,7 @@ bool aarch64_operands_ok_for_ldpstp (rtx *, bool, machine_mode);
 bool aarch64_operands_adjust_ok_for_ldpstp (rtx *, bool, machine_mode);
 bool aarch64_mem_ok_with_ldpstp_policy_model (rtx, bool, machine_mode);
 void aarch64_swap_ldrstr_operands (rtx *, bool);
+bool aarch64_ldpstp_operand_mode_p (machine_mode);
 
 extern void aarch64_asm_output_pool_epilogue (FILE *, const char *,
 	  tree, HOST_WIDE_INT);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 4820fac67a1..ccf081d2a16 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8977,23 +8977,15 @@ static rtx
 aarch64_gen_storewb_pair (machine_mode mode, rtx base, rtx reg, rtx reg2,
 			  HOST_WIDE_INT adjustment)
 {
-  switch (mode)
-{
-case E_DImode:
-  return gen_storewb_pairdi_di (base, base, reg, reg2,
-GEN_INT (-adjustment),
-GEN_INT (UNITS_PER_WORD - adjustment));
-case E_DFmode:
-  return gen_storewb_pairdf_di (base, base, reg, reg2,
-GEN_INT (-adjustment),
-GEN_INT (UNITS_PER_WORD - adjustment));
-case E_TFmode:
-  return gen_storewb_pairtf_di (base, base, reg, reg2,
-GEN_INT (-adjustment),
-GEN_INT (UNITS_PER_VREG - adjustment));
-default:
-  gcc_unreachable ();
-}
+  rtx new_base = plus_constant (Pmode, base, -adjustment);
+  rtx mem = gen_frame_mem (mode, new_base);
+  rtx mem2 = adjust_address_nv (mem, mode, GET_MODE_SIZE (mode));
+
+  return gen_rtx_PARALLEL (VOIDmode,
+			   gen_rtvec (3,
+  gen_rtx_SET (base, new_base),
+  gen_rtx_SET (mem, reg),
+  gen_rtx_SET (mem2, reg2)));
 }
 
 /* Push registers numbered REGNO1 and REGNO2 to the stack, adjusting the
@@ -9025,20 +9017,15 @@ static rtx
 aarch64_gen_loadwb_pair (machine_mode mode, rtx base, rtx reg, rtx reg2,
 			 HOST_WIDE_INT adjustment)
 {
-  switch (mode)
-{
-case E_DImode:
-  return gen_loadwb_pairdi_di (base, base, reg, reg2, GEN_INT (adjustment),
-   GEN_INT (UNITS_PER_WORD));
-case E_DFmode:
-  return gen_loadwb_pairdf_di (base, base, reg, reg2, GEN_INT (adjustment),
-   GEN_INT (UNITS_PER_WORD));
-case E_TFmode:
-  return gen_loadwb_pairtf_di (base, base, 

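For a concrete picture of what the new aarch64_gen_storewb_pair above builds, here is the resulting RTL for a DImode pair pushed with adjustment 16 (schematic only; the register choices are illustrative):

```lisp
;; Base update plus the two component stores, all in one PARALLEL.
(parallel
 [(set (reg:DI sp) (plus:DI (reg:DI sp) (const_int -16)))          ; writeback
  (set (mem:DI (plus:DI (reg:DI sp) (const_int -16))) (reg:DI x19)) ; first save
  (set (mem:DI (plus:DI (reg:DI sp) (const_int -8)))  (reg:DI x20))]) ; second save
```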