Re: [PATCH v3 1/2] arm: Zero/Sign extends for CMSE security on Armv8-M.baseline [PR115253]

2024-06-12 Thread Richard Sandiford
"Richard Earnshaw (lists)"  writes:
> On 10/06/2024 15:04, Torbjörn SVENSSON wrote:
>> Properly handle zero and sign extension for Armv8-M.baseline as
>> Cortex-M23 can have the security extension active.
>> Currently, there is an internal compiler error on Cortex-M23 for the
>> epilog processing of sign extension.
>> 
>> This patch addresses the following CVE-2024-0151 for Armv8-M.baseline.
>> 
>> gcc/ChangeLog:
>> 
>>  PR target/115253
>>  * config/arm/arm.cc (cmse_nonsecure_call_inline_register_clear):
>>  Sign extend for Thumb1.
>>  (thumb1_expand_prologue): Add zero/sign extend.
>> 
>> Signed-off-by: Torbjörn SVENSSON 
>> Co-authored-by: Yvan ROUX 
>> ---
>>  gcc/config/arm/arm.cc | 71 ++-
>>  1 file changed, 63 insertions(+), 8 deletions(-)
>> 
>> diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
>> index ea0c963a4d6..e7b4caf1083 100644
>> --- a/gcc/config/arm/arm.cc
>> +++ b/gcc/config/arm/arm.cc
>> [...]
>> +&& known_ge (GET_MODE_SIZE (TYPE_MODE (ret_type)), 2))
>
> You can use known_eq here.  We'll never have any value other than 2, given 
> the known_le (4) above and anyway it doesn't make sense to call extendhisi 
> with any other size.

BTW, I'm surprised we need known_* in arm-specific code.  Is it actually
needed?  Or is this just a conditioned response? ;)  

Richard



Re: [PATCH] rtlanal: Correct cost regularization in pattern_cost

2024-06-12 Thread Richard Sandiford
Richard Biener  writes:
> On Fri, May 10, 2024 at 4:25 AM HAO CHEN GUI  wrote:
>>
>> Hi,
>>The cost returned from set_src_cost might be zero. Zero for
>> pattern_cost means unknown cost. So the regularization converts the zero
>> to COSTS_N_INSNS (1).
>>
>>// pattern_cost
>>cost = set_src_cost (SET_SRC (set), GET_MODE (SET_DEST (set)), speed);
>>return cost > 0 ? cost : COSTS_N_INSNS (1);
>>
>>But if set_src_cost returns a value less than COSTS_N_INSNS (1), it's
>> untouched and just returned by pattern_cost. Thus a "zero" from set_src_cost
>> ends up with a higher cost than a "one" from set_src_cost.
>>
>>   For instance, i386 returns cost "one" for zero_extend op.
>> //ix86_rtx_costs
>> case ZERO_EXTEND:
>>   /* The zero extensions is often completely free on x86_64, so make
>>  it as cheap as possible.  */
>>   if (TARGET_64BIT && mode == DImode
>>   && GET_MODE (XEXP (x, 0)) == SImode)
>> *total = 1;
>>
>>   This patch fixes the problem by converting all costs which are less than
>> COSTS_N_INSNS (1) to COSTS_N_INSNS (1).
>>
>>   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
>> regressions. Is it OK for the trunk?
>
> But if targets return sth < COSTS_N_INSNS (1) but > 0 this is now no
> longer meaningful.  So shouldn't it instead be
>
>   return cost > 0 ? cost : 1;
>
> ?  Alternatively returning fractions of COSTS_N_INSNS (1) from set_src_cost
> is invalid and thus the target is at fault (I do think that making zero the
> unknown value is quite bad since that makes it impossible to have zero
> as cost represented).

I agree zero is an unfortunate choice.  No-op moves should really have
zero cost, without having to be special-cased by callers.  And it came
as a surprise to me that we had this rule.

But like Segher says, it seems to have been around for a long time
(since 2004 by the looks of it, r0-59417).  Which just goes to show,
every day is a learning day. :)

IMO it would be nice to change it.  But then it would be even nicer
to get rid of pattern_cost and move everything to insn_cost.  And that's
going to be a lot of work to do together.

Maybe a compromise would be to open-code pattern_cost into insn_cost
and change the return value for insn_cost only?  That would still mean
auditing all current uses of insn_cost and all current target definitions
of the insn_cost hook, but at least it would be isolated from the work
of removing pattern_cost.
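For concreteness, the clamp being proposed amounts to something like this
in pattern_cost (illustrative only; Richard B.'s alternative would instead
return 1 for the unknown case, so that sub-insn fractions stay meaningful):

  cost = set_src_cost (SET_SRC (set), GET_MODE (SET_DEST (set)), speed);
  return cost < COSTS_N_INSNS (1) ? COSTS_N_INSNS (1) : cost;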

Thanks,
Richard


Re: [PATCH] aarch64: Add vector popcount besides QImode [PR113859]

2024-06-12 Thread Richard Sandiford
Pengxuan Zheng  writes:
> This patch improves GCC’s vectorization of __builtin_popcount for aarch64 
> target
> by adding popcount patterns for vector modes besides QImode, i.e., HImode,
> SImode and DImode.
>
> With this patch, we now generate the following for HImode:
>   cnt v1.16b, v.16b
>   uaddlp  v2.8h, v1.16b
>
> For SImode, we generate:
>   cnt v1.16b, v.16b
>   uaddlp  v2.8h, v1.16b
>   uaddlp  v3.4s, v2.8h
>
> For V2DI, we generate:
>   cnt v1.16b, v.16b
>   uaddlp  v2.8h, v1.16b
>   uaddlp  v3.4s, v2.8h
>   uaddlp  v4.2d, v3.4s
>
> gcc/ChangeLog:
>
>   PR target/113859
>   * config/aarch64/aarch64-simd.md (popcount<mode>2): New define_expand.
>
> gcc/testsuite/ChangeLog:
>
>   PR target/113859
>   * gcc.target/aarch64/popcnt-vec.c: New test.
>
> Signed-off-by: Pengxuan Zheng 
> ---
>  gcc/config/aarch64/aarch64-simd.md| 40 
>  gcc/testsuite/gcc.target/aarch64/popcnt-vec.c | 48 +++
>  2 files changed, 88 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/popcnt-vec.c
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index f8bb973a278..093c32ee8ff 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -3540,6 +3540,46 @@ (define_insn "popcount<mode>2"
>[(set_attr "type" "neon_cnt")]
>  )
>  
> +(define_expand "popcount<mode>2"
> +  [(set (match_operand:VQN 0 "register_operand" "=w")
> +(popcount:VQN (match_operand:VQN 1 "register_operand" "w")))]
> +  "TARGET_SIMD"
> +  {
> +rtx v = gen_reg_rtx (V16QImode);
> +rtx v1 = gen_reg_rtx (V16QImode);
> +emit_move_insn (v, gen_lowpart (V16QImode, operands[1]));
> +emit_insn (gen_popcountv16qi2 (v1, v));
> +if (<MODE>mode == V8HImode)
> +  {
> +/* For V8HI, we generate:
> +cnt v1.16b, v.16b
> +uaddlp  v2.8h, v1.16b */
> +emit_insn (gen_aarch64_uaddlpv16qi (operands[0], v1));
> +DONE;
> +  }
> +rtx v2 = gen_reg_rtx (V8HImode);
> +emit_insn (gen_aarch64_uaddlpv16qi (v2, v1));
> +if (<MODE>mode == V4SImode)
> +  {
> +/* For V4SI, we generate:
> +cnt v1.16b, v.16b
> +uaddlp  v2.8h, v1.16b
> +uaddlp  v3.4s, v2.8h */
> +emit_insn (gen_aarch64_uaddlpv8hi (operands[0], v2));
> +DONE;
> +  }
> +/* For V2DI, we generate:
> +cnt v1.16b, v.16b
> +uaddlp  v2.8h, v1.16b
> +uaddlp  v3.4s, v2.8h
> +uaddlp  v4.2d, v3.4s */
> +rtx v3 = gen_reg_rtx (V4SImode);
> +emit_insn (gen_aarch64_uaddlpv8hi (v3, v2));
> +emit_insn (gen_aarch64_uaddlpv4si (operands[0], v3));
> +DONE;
> +  }
> +)
> +

Could you add support for V4HI and V2SI at the same time?

I think it's possible to handle all 5 modes iteratively, like so:

(define_expand "popcount2"
  [(set (match_operand:VDQHSD 0 "register_operand")
(popcount:VDQHSD (match_operand:VDQHSD 1 "register_operand")))]
  "TARGET_SIMD"
{
  /* Generate a byte popcount.  */
  machine_mode mode = <bits> == 64 ? V8QImode : V16QImode;
  rtx tmp = gen_reg_rtx (mode);
  auto icode = optab_handler (popcount_optab, mode);
  emit_insn (GEN_FCN (icode) (tmp, gen_lowpart (mode, operands[1])));

  /* Use a sequence of UADDLPs to accumulate the counts.  Each step doubles
 the element size and halves the number of elements.  */
  do
{
  auto icode = code_for_aarch64_addlp (ZERO_EXTEND, GET_MODE (tmp));
  mode = insn_data[icode].operand[0].mode;
  rtx dest = mode == <MODE>mode ? operands[0] : gen_reg_rtx (mode);
  emit_insn (GEN_FCN (icode) (dest, tmp));
  tmp = dest;
}
  while (mode != <MODE>mode);
  DONE;
})

(only lightly tested).  This requires changing:

(define_expand "aarch64_addlp"

to:

(define_expand "@aarch64_addlp"

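The '@' is what makes the code_for_* call in the sketch above available:
genemit then emits helpers that take the iterator values as arguments,
along the lines of (illustrative only):

  insn_code icode = code_for_aarch64_addlp (ZERO_EXTEND, V16QImode);
  emit_insn (GEN_FCN (icode) (dest, src));   /* dest/src: placeholder rtxes */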
Thanks,
Richard

>  ;; 'across lanes' max and min ops.
>  
>  ;; Template for outputting a scalar, so we can create __builtins which can be
> diff --git a/gcc/testsuite/gcc.target/aarch64/popcnt-vec.c 
> b/gcc/testsuite/gcc.target/aarch64/popcnt-vec.c
> new file mode 100644
> index 000..4c9a1b95990
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/popcnt-vec.c
> @@ -0,0 +1,48 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +/* This function should produce cnt v.16b. */
> +void
> +bar (unsigned char *__restrict b, unsigned char *__restrict d)
> +{
> +  for (int i = 0; i < 1024; i++)
> +d[i] = __builtin_popcount (b[i]);
> +}
> +
> +/* This function should produce cnt v.16b and uaddlp (Add Long Pairwise). */
> +void
> +bar1 (unsigned short *__restrict b, unsigned short *__restrict d)
> +{
> +  for (int i = 0; i < 1024; i++)
> +d[i] = __builtin_popcount (b[i]);
> +}
> +
> +/* This function should produce cnt v.16b and 2 uaddlp (Add Long Pairwise). 
> */
> +void
> +bar2 (unsigned int *__restrict b, unsigned int *__restrict d)
> +{
> +  for (int i = 0; i < 1024; i++)
> +d[i] = __builtin_popcount 

Re: [PATCH] tree-optimization/115385 - handle more gaps with peeling of a single iteration

2024-06-12 Thread Richard Sandiford
Richard Biener  writes:
> On Wed, 12 Jun 2024, Richard Biener wrote:
>
>> On Tue, 11 Jun 2024, Richard Sandiford wrote:
>> 
>> > Don't think it makes any difference, but:
>> > 
>> > Richard Biener  writes:
>> > > @@ -2151,7 +2151,16 @@ get_group_load_store_type (vec_info *vinfo, 
>> > > stmt_vec_info stmt_info,
>> > >   access excess elements.
>> > >   ???  Enhancements include peeling multiple 
>> > > iterations
>> > >   or using masked loads with a static mask.  */
>> > > -  || (group_size * cvf) % cnunits + group_size - gap < 
>> > > cnunits))
>> > > +  || ((group_size * cvf) % cnunits + group_size - gap < 
>> > > cnunits
>> > > +  /* But peeling a single scalar iteration is 
>> > > enough if
>> > > + we can use the next power-of-two sized partial
>> > > + access.  */
>> > > +  && ((cremain = (group_size * cvf - gap) % 
>> > > cnunits), true
>> > 
>> > ...this might be less surprising as:
>> > 
>> >  && ((cremain = (group_size * cvf - gap) % cnunits, true)
>> > 
in terms of how the &&s line up.
>> 
>> Yeah - I'll fix before pushing.
>
> The aarch64 CI shows that a few testcases no longer use SVE
> (gcc.target/aarch64/sve/slp_perm_{4,7,8}.c) because peeling
> for gaps is deemed isufficient.  Formerly we had
>
>   if (loop_vinfo
>   && *memory_access_type == VMAT_CONTIGUOUS
>   && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
>   && !multiple_p (group_size * LOOP_VINFO_VECT_FACTOR 
> (loop_vinfo),
>   nunits))
> {
>   unsigned HOST_WIDE_INT cnunits, cvf;
>   if (!can_overrun_p
>   || !nunits.is_constant ()
>   || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant 
> ()
>   /* Peeling for gaps assumes that a single scalar 
> iteration
>  is enough to make sure the last vector iteration 
> doesn't
>  access excess elements.
>  ???  Enhancements include peeling multiple iterations
>  or using masked loads with a static mask.  */
>   || (group_size * cvf) % cnunits + group_size - gap < 
> cnunits)
> {
>   if (dump_enabled_p ())
> dump_printf_loc (MSG_MISSED_OPTIMIZATION, 
> vect_location,
>  "peeling for gaps insufficient for "
>  "access\n");
>
> and in all cases multiple_p (group_size * LOOP_VINFO_VECT_FACTOR, nunits)
> is true so we didn't check for whether peeling one iteration is
> sufficient.  But after the refactoring the outer checks merely
> indicate there's overrun (which is there already because gap != 0).
>
> That is, we never verified, for the "regular" gap case, whether peeling
> for a single iteration is sufficient.  But now of course we run into
> the inner check which will always trigger if earlier checks didn't
> work out to set overrun_p to false.
>
> For slp_perm_8.c we have a group_size of two, nunits is {16, 16}
> and VF is {8, 8} and gap is one.  Given we know the
> multiple_p we know that (group_size * cvf) % cnunits is zero,
> so what remains is group_size - gap < nunits but 1 is probably
> always less than {16, 16}.

I thought the idea was that the size of the gap was immaterial
for VMAT_CONTIGUOUS, on the assumption that it would never be
bigger than a page.  That is, any gap loaded by the final
unpeeled iteration would belong to the same page as the non-gap
data from either the same vector iteration or the subsequent
peeled scalar iteration.

Will have to think more about this if that doesn't affect the
rest of the message, but FWIW...

> The new logic I added in the later patch that peeling a single
> iteration is OK when we use a smaller, rounded-up to power-of-two
> sized access is
>
>   || ((group_size * cvf) % cnunits + group_size - gap < 
> cnunits
>   /* But peeling a single scalar iteration is enough 
> if
>  we can use the next power-of-two sized partial
>  access.  */
>   && (cremain = (group_size * cvf - gap) % cnunits, 
> true)
>

Re: [PATCH] aarch64: Use bitreverse rtl code instead of unspec [PR115176]

2024-06-12 Thread Richard Sandiford
Andrew Pinski  writes:
> Bitreverse rtl code was added with r14-1586-g6160572f8d243c. So let's
> use it instead of an unspec. This is just a small cleanup but it does
> have one small fix with respect to rtx costs which didn't handle vector modes
> correctly for the UNSPEC and now it does.
> This is part of the first step in adding the __builtin_bitreverse builtins,
> though it is independent of that work.

Nice cleanup.

> Bootstrapped and tested on aarch64-linux-gnu with no regressions.
>
> gcc/ChangeLog:
>
>   PR target/115176
>   * config/aarch64/aarch64-simd.md (aarch64_rbit<mode>): Use
>   bitreverse instead of unspec.
>   * config/aarch64/aarch64-sve-builtins-base.cc (svrbit): Convert over to 
> using
>   rtx_code_function instead of unspec_based_function.
>   * config/aarch64/aarch64-sve.md: Update comment where RBIT is included.
>   * config/aarch64/aarch64.cc (aarch64_rtx_costs): Handle BITREVERSE like 
> BSWAP.
>   Remove UNSPEC_RBIT support.
>   * config/aarch64/aarch64.md (unspec): Remove UNSPEC_RBIT.
>   (aarch64_rbit): Use bitreverse instead of unspec.
>   * config/aarch64/iterators.md (SVE_INT_UNARY): Add bitreverse.
>   (optab): Likewise.
>   (sve_int_op): Likewise.
>   (SVE_INT_UNARY): Remove UNSPEC_RBIT.
>   (optab): Likewise.
>   (sve_int_op): Likewise.
>   (min_elem_bits): Likewise.
>
> Signed-off-by: Andrew Pinski 
> ---
>  gcc/config/aarch64/aarch64-simd.md  |  3 +--
>  gcc/config/aarch64/aarch64-sve-builtins-base.cc |  2 +-
>  gcc/config/aarch64/aarch64-sve.md   |  2 +-
>  gcc/config/aarch64/aarch64.cc   | 10 ++
>  gcc/config/aarch64/aarch64.md   |  3 +--
>  gcc/config/aarch64/iterators.md | 10 +-
>  6 files changed, 11 insertions(+), 19 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index f644bd1731e..0bb39091a38 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -377,8 +377,7 @@ (define_insn "bswap<mode>2"
>  
>  (define_insn "aarch64_rbit<mode>"
>[(set (match_operand:VB 0 "register_operand" "=w")
> - (unspec:VB [(match_operand:VB 1 "register_operand" "w")]
> -UNSPEC_RBIT))]
> + (bitreverse:VB (match_operand:VB 1 "register_operand" "w")))]
>"TARGET_SIMD"
>"rbit\\t%0., %1."
>[(set_attr "type" "neon_rbit")]
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> index 0d2edf3f19e..dea2f6e6bfc 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> @@ -3186,7 +3186,7 @@ FUNCTION (svqincp, svqdecp_svqincp_impl, (SS_PLUS, 
> US_PLUS))
>  FUNCTION (svqincw, svqinc_bhwd_impl, (SImode))
>  FUNCTION (svqincw_pat, svqinc_bhwd_impl, (SImode))
>  FUNCTION (svqsub, rtx_code_function, (SS_MINUS, US_MINUS, -1))
> -FUNCTION (svrbit, unspec_based_function, (UNSPEC_RBIT, UNSPEC_RBIT, -1))
> +FUNCTION (svrbit, rtx_code_function, (BITREVERSE, BITREVERSE, -1))
>  FUNCTION (svrdffr, svrdffr_impl,)
>  FUNCTION (svrecpe, unspec_based_function, (-1, UNSPEC_URECPE, UNSPEC_FRECPE))
>  FUNCTION (svrecps, unspec_based_function, (-1, -1, UNSPEC_FRECPS))
> diff --git a/gcc/config/aarch64/aarch64-sve.md 
> b/gcc/config/aarch64/aarch64-sve.md
> index d69db34016a..5331e7121d5 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -3083,6 +3083,7 @@ (define_expand "vec_extract"
>  ;; - CLS (= clrsb)
>  ;; - CLZ
>  ;; - CNT (= popcount)
> +;; - RBIT (= bitreverse)
>  ;; - NEG
>  ;; - NOT
>  ;; -
> @@ -3171,7 +3172,6 @@ (define_insn "*cond__any"
>  ;;  [INT] General unary arithmetic corresponding to unspecs
>  ;; -
>  ;; Includes
> -;; - RBIT
>  ;; - REVB
>  ;; - REVH
>  ;; - REVW
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 13191ec8e34..0e9d7b1ec0f 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -14690,6 +14690,7 @@ cost_plus:
>   return true;
>}
>  
> +case BITREVERSE:
>  case BSWAP:
>*cost = COSTS_N_INSNS (1);
>  
> @@ -15339,16 +15340,9 @@ cost_plus:
>  
>return false;
>  }
> -
> -  if (XINT (x, 1) == UNSPEC_RBIT)
> -{
> -  if (speed)
> -*cost += extra_cost->alu.rev;
> -
> -  return false;
> -}
>break;
>  
> +

The extra blank line here seems unnecessary.

OK otherwise, thanks.

Richard

>  case TRUNCATE:
>  
>/* Decompose muldi3_highpart.  */
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index dd88fd891b5..69167ab0c04 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> 

Re: [PATCH v2 0/4] Libatomic: Cleanup ifunc selector and aliasing

2024-06-12 Thread Richard Sandiford
Victor Do Nascimento  writes:
> Changes in V2:
>
> As explained in patch v2 1/4, it has become clear that the current
> approach of querying assembler support for newer architectural
> extensions at compile time is undesirable both from a maintainability
> as well as a consistency standpoint - Different compiled versions of
> Libatomic may have different features depending on the machine on
> which they were built.
>
> These issues make for difficult testing as the explosion in number of
> `#ifdef' guards makes maintenance error-prone and the dependence on
> binutils version means that, as well as deploying changes for testing
> in a variety of target configurations, testing must also involve
> compiling the library on an increasing number of host configurations,
> meaning that the chance of bugs going undetected increases (as was
> proved in the pre-commit CI which, due to the use of an older version
> of Binutils, picked up on a runtime-error that had hitherto gone
> unnoticed).
>
> We therefore do away with the use of all assembly instructions
> dependent on Binutils 2.42, choosing to replace them with `.inst's
> instead.  This eliminates the latent bug picked up by CI and will
> ensure consistent builds of Libatomic across all versions of Binutils.

Nice!  Thanks for doing this.  It seems much cleaner and more flexible
than the current approach.

Thanks also for the clear organisation of the series.

OK for trunk.  (For the record, I didn't hand-check the encodings of the
.insts ...)

Richard

> ---
>
> The recent introduction of the optional LSE128 and RCPC3 architectural
> extensions to AArch64 has further led to the increased flexibility of
> atomic support in the architecture, with many extensions providing
> support for distinct atomic operations, each with different potential
> applications in mind.
>
> This has led to maintenance difficulties in Libatomic, in particular
> regarding the way the ifunc selector is generated via a series of
> macro expansions at compile-time.
>
> Until now, irrespective of the atomic operation in question, all atomic
> functions for a particular operand size were expected to have the same
> number of ifunc alternatives, meaning that a one-size-fits-all
> approach could reasonably be taken for the selector.
>
> This meant that if, hypothetically, for a particular architecture and
> operand size one particular atomic operation was to have 3 different
> implementations associated with different extensions, libatomic would
> likewise be required to present three ifunc alternatives for all other
> atomic functions.
>
> The consequence of this design choice was the unnecessary use of
> function aliasing and the unwieldy code which resulted from this.
>
> This patch series attempts to remediate this issue by making the
> preprocessor macros defining the number of ifunc alternatives and
> their respective selection functions dependent on the file importing
> the ifunc selector-generating framework.
>
> all files are given `LAT_' macros, defined at the beginning
> and undef'd at the end of the file.  It is these macros that are
> subsequently used to fine-tune the behaviors of `libatomic_i.h' and
> `host-config.h'.
>
> In particular, the definition of the `IFUNC_NCOND(N)' and
> `IFUNC_COND_' macros in host-config.h can now be guarded behind
> these new file-specific macros, which ultimately control what the
> `GEN_SELECTOR(X)' macro in `libatomic_i.h' expands to.  As both of
> these headers are imported once per file implementing some atomic
> operation, fine-tuned control is now possible.
>
> Regtested with both `--enable-gnu-indirect-function' and
> `--disable-gnu-indirect-function' configurations on armv9.4-a target
> with LRCPC3 and LSE128 support and without.
>
> Victor Do Nascimento (4):
>   Libatomic: AArch64: Convert all lse128 assembly to .insn directives
>   Libatomic: Define per-file identifier macros
>   Libatomic: Make ifunc selector behavior contingent on importing file
>   Libatomic: Clean up AArch64 `atomic_16.S' implementation file
>
>  libatomic/acinclude.m4   |  18 -
>  libatomic/auto-config.h.in   |   3 -
>  libatomic/cas_n.c|   2 +
>  libatomic/config/linux/aarch64/atomic_16.S   | 511 +--
>  libatomic/config/linux/aarch64/host-config.h |  35 +-
>  libatomic/configure  |  43 --
>  libatomic/configure.ac   |   3 -
>  libatomic/exch_n.c   |   2 +
>  libatomic/fadd_n.c   |   2 +
>  libatomic/fand_n.c   |   2 +
>  libatomic/fence.c|   2 +
>  libatomic/fenv.c |   2 +
>  libatomic/fior_n.c   |   2 +
>  libatomic/flag.c |   2 +
>  libatomic/fnand_n.c  |   2 +
>  libatomic/fop_n.c|   2 +
>  libatomic/fsub_n.c  

Re: [PATCH-1v3] fwprop: Replace rtx_cost with insn_cost in try_fwprop_subst_pattern [PR113325]

2024-06-12 Thread Richard Sandiford
HAO CHEN GUI  writes:
> Hi,
>   This patch replaces rtx_cost with insn_cost in forward propagation.
> In the PR, one constant vector should be propagated and replace a
> pseudo in a store insn if we know it's a duplicated constant vector.
> It reduces the insn cost but not rtx cost. In this case, the cost is
> determined by destination operand (memory or pseudo). Unfortunately,
> rtx cost can't help.
>
>   The test case is added in the second rs6000 specific patch.
>
>   Compared to previous version, the main changes are:
> 1. Invoke change_is_worthwhile to judge if the cost is reduced and
> the replacement is worthwhile.
> 2. Invalidate recog data before getting the insn cost for the new
> rtl as insn cost might call extract_constrain_insn_cached and
> extract_insn_cached to cache the recog data. The cached data is
> invalid for the new rtl and causes an ICE.
> 3. Check if the insn cost of new rtl is zero which means unknown
> cost. The replacement should be rejected at this situation.
>
> Previous version
> https://gcc.gnu.org/pipermail/gcc-patches/2024-May/651233.html
>
>   The patch causes a regression case on i386 as the pattern cost
> regularization has a bug. Please refer to the patch and discussion here.
> https://gcc.gnu.org/pipermail/gcc-patches/2024-May/651363.html
>
>   Bootstrapped and tested on powerpc64-linux BE and LE with no
> regressions. Is it OK for the trunk?
>
> ChangeLog
> fwprop: invoke change_is_worthwhile to judge if a replacement is worthwhile
>
> gcc/
>   * fwprop.cc (try_fwprop_subst_pattern): Invoke change_is_worthwhile
>   to judge if a replacement is worthwhile.
>   * rtl-ssa/changes.cc (rtl_ssa::changes_are_worthwhile): Invalidate
>   recog data before getting the insn cost for the new rtl.  Check if
>   the insn cost of new rtl is unknown and fail the replacement.
>
> patch.diff
> diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
> index de543923b92..975de0eec7f 100644
> --- a/gcc/fwprop.cc
> +++ b/gcc/fwprop.cc
> @@ -471,29 +471,19 @@ try_fwprop_subst_pattern (obstack_watermark &attempt, 
> insn_change &use_change,
>redo_changes (0);
>  }
>
> -  /* ??? In theory, it should be better to use insn costs rather than
> - set_src_costs here.  That would involve replacing this code with
> - change_is_worthwhile.  */
>bool ok = recog (attempt, use_change);
> -  if (ok && !prop.changed_mem_p () && !use_insn->is_asm ())
> -if (rtx use_set = single_set (use_rtl))
> -  {
> - bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (use_rtl));
> - temporarily_undo_changes (0);
> - auto old_cost = set_src_cost (SET_SRC (use_set),
> -   GET_MODE (SET_DEST (use_set)), speed);
> - redo_changes (0);
> - auto new_cost = set_src_cost (SET_SRC (use_set),
> -   GET_MODE (SET_DEST (use_set)), speed);
> - if (new_cost > old_cost
> - || (new_cost == old_cost && !prop.likely_profitable_p ()))
> -   {
> - if (dump_file)
> -   fprintf (dump_file, "change not profitable"
> -" (cost %d -> cost %d)\n", old_cost, new_cost);
> - ok = false;
> -   }
> -  }
> +  if (ok && !prop.changed_mem_p () && !use_insn->is_asm ()
> +  && single_set (use_rtl))
> +{
> +  if (!change_is_worthwhile (use_change, false)
> +   || (!prop.likely_profitable_p ()
> +   && !change_is_worthwhile (use_change, true)))
> + {
> +   if (dump_file)
> + fprintf (dump_file, "change not profitable");
> +   ok = false;
> + }
> +}

It should only be necessary to call change_is_worthwhile once,
with strict == !prop.likely_profitable_p ()

So something like:

  bool ok = recog (attempt, use_change);
  if (ok && !prop.changed_mem_p () && !use_insn->is_asm ())
{
  bool strict_p = !prop.likely_profitable_p ();
  if (!change_is_worthwhile (use_change, strict_p))
{
  if (dump_file)
fprintf (dump_file, "change not profitable");
  ok = false;
}
}

> diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc
> index 11639e81bb7..9bad6c2070c 100644
> --- a/gcc/rtl-ssa/changes.cc
> +++ b/gcc/rtl-ssa/changes.cc
> @@ -185,7 +185,18 @@ rtl_ssa::changes_are_worthwhile (array_slice<insn_change *const> changes,
> * change->old_cost ());
>if (!change->is_deletion ())
>   {
> +   /* Invalidate recog data as insn_cost may call
> +  extract_insn_cached.  */
> +   INSN_CODE (change->rtl ()) = -1;

The:

  bool ok = recog (attempt, use_change);

should leave INSN_CODE set to the result of the successful recog.
Why isn't that true in the example you hit?

I wondered whether we might be trying to cost a NOOP_MOVE_INSN_CODE,
since I couldn't see anything in the current code to stop that.
But if so, that's a bug.  NOOP_MOVE_INSN_CODE should have zero cost,
and shouldn't go through insn_cost.
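Something like the following is the shape I have in mind (sketch only;
"insn" and "speed" stand for change->rtl () and the speed flag here):

  /* Treat recognised no-op moves as free instead of asking insn_cost.  */
  int cost = (INSN_CODE (insn) == NOOP_MOVE_INSN_CODE
	      ? 0
	      : insn_cost (insn, speed));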

Thanks,
Richard

> change->new_cost = 

Re: [PATCH] Improve code generation of strided SLP loads

2024-06-11 Thread Richard Sandiford
Richard Biener  writes:
> This avoids falling back to elementwise accesses for strided SLP
> loads when the group size is not a multiple of the vector element
> size.  Instead we can use a smaller vector or integer type for the load.
>
> For stores we can do the same though restrictions on stores we handle
> and the fact that store-merging covers up makes this mostly effective
> for cost modeling which shows for gcc.target/i386/vect-strided-3.c
> which we now vectorize with V4SI vectors rather than just V2SI ones.
>
> For all of this there's still the opportunity to use non-uniform
> accesses, say for a 6-element group with a VF of two do
> V4SI, { V2SI, V2SI }, V4SI.  But that's for a possible followup.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, textually
> this depends on the gap improvement series so I'll push only
> after those.  Target independent testing is difficult, strided
> accesses are difficult for VLA - I suppose they should go
> through gather/scatter but we have to be able to construct the
> offset vector there.

Yeah, agreed.  And I suppose for tests like these, which load
consecutive pairs of 32-bit elements, we'd want to generate a gather
of 64-bit elements.  So there'd be a similar accretion process,
but only if it applies regularly across the whole vector.
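For example, for vect-strided-2.c the scalar form of that idea is just
(illustrative only, not a suggested testcase):

  #include <stdint.h>
  #include <string.h>

  /* Same semantics as foo in vect-strided-2.c, but with each adjacent pair
     of 32-bit elements treated as one 64-bit value, which is the shape a
     64-bit-element gather/scatter would use.  */
  void foo_pairs (int *__restrict a, int *b, int s)
  {
    for (int i = 0; i < 1024; ++i)
      {
	uint64_t pair;				     /* b[s*i+0..1] */
	memcpy (&pair, b + s * i, sizeof pair);
	memcpy (a + 4 * i, &pair, sizeof pair);      /* a[4*i+0..1] */
	memcpy (a + 4 * i + 2, &pair, sizeof pair);  /* a[4*i+2..3] */
      }
  }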

Richard

>
> Richard.
>
>   * gcc.target/i386/vect-strided-1.c: New testcase.
>   * gcc.target/i386/vect-strided-2.c: Likewise.
>   * gcc.target/i386/vect-strided-3.c: Likewise.
>   * gcc.target/i386/vect-strided-4.c: Likewise.
> ---
>  .../gcc.target/i386/vect-strided-1.c  |  24 +
>  .../gcc.target/i386/vect-strided-2.c  |  17 +++
>  .../gcc.target/i386/vect-strided-3.c  |  20 
>  .../gcc.target/i386/vect-strided-4.c  |  20 
>  gcc/tree-vect-stmts.cc| 100 --
>  5 files changed, 127 insertions(+), 54 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/vect-strided-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/vect-strided-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/vect-strided-3.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/vect-strided-4.c
>
> diff --git a/gcc/testsuite/gcc.target/i386/vect-strided-1.c 
> b/gcc/testsuite/gcc.target/i386/vect-strided-1.c
> new file mode 100644
> index 000..db4a06711f1
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vect-strided-1.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse2 -mno-avx" } */
> +
> +void foo (int * __restrict a, int *b, int s)
> +{
> +  for (int i = 0; i < 1024; ++i)
> +{
> +  a[8*i+0] = b[s*i+0];
> +  a[8*i+1] = b[s*i+1];
> +  a[8*i+2] = b[s*i+2];
> +  a[8*i+3] = b[s*i+3];
> +  a[8*i+4] = b[s*i+4];
> +  a[8*i+5] = b[s*i+5];
> +  a[8*i+6] = b[s*i+4];
> +  a[8*i+7] = b[s*i+5];
> +}
> +}
> +
> +/* Three two-element loads, two four-element stores.  On ia32 we elide
> +   a permute and perform a redundant load.  */
> +/* { dg-final { scan-assembler-times "movq" 2 } } */
> +/* { dg-final { scan-assembler-times "movhps" 2 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "movhps" 1 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "movups" 2 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/vect-strided-2.c 
> b/gcc/testsuite/gcc.target/i386/vect-strided-2.c
> new file mode 100644
> index 000..6fd64e28cf0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vect-strided-2.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse2 -mno-avx" } */
> +
> +void foo (int * __restrict a, int *b, int s)
> +{
> +  for (int i = 0; i < 1024; ++i)
> +{
> +  a[4*i+0] = b[s*i+0];
> +  a[4*i+1] = b[s*i+1];
> +  a[4*i+2] = b[s*i+0];
> +  a[4*i+3] = b[s*i+1];
> +}
> +}
> +
> +/* One two-element load, one four-element store.  */
> +/* { dg-final { scan-assembler-times "movq" 1 } } */
> +/* { dg-final { scan-assembler-times "movups" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/vect-strided-3.c 
> b/gcc/testsuite/gcc.target/i386/vect-strided-3.c
> new file mode 100644
> index 000..b462701a0b2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vect-strided-3.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse2 -mno-avx -fno-tree-slp-vectorize" } */
> +
> +void foo (int * __restrict a, int *b, int s)
> +{
> +  if (s >= 6)
> +for (int i = 0; i < 1024; ++i)
> +  {
> + a[s*i+0] = b[4*i+0];
> + a[s*i+1] = b[4*i+1];
> + a[s*i+2] = b[4*i+2];
> + a[s*i+3] = b[4*i+3];
> + a[s*i+4] = b[4*i+0];
> + a[s*i+5] = b[4*i+1];
> +  }
> +}
> +
> +/* While the vectorizer generates 6 uint64 stores.  */
> +/* { dg-final { scan-assembler-times "movq" 4 } } */
> +/* { dg-final { scan-assembler-times "movhps" 2 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/vect-strided-4.c 
> b/gcc/testsuite/gcc.target/i386/vect-strided-4.c
> 

Re: [PATCH] tree-optimization/115385 - handle more gaps with peeling of a single iteration

2024-06-11 Thread Richard Sandiford
Don't think it makes any difference, but:

Richard Biener  writes:
> @@ -2151,7 +2151,16 @@ get_group_load_store_type (vec_info *vinfo, 
> stmt_vec_info stmt_info,
>access excess elements.
>???  Enhancements include peeling multiple iterations
>or using masked loads with a static mask.  */
> -   || (group_size * cvf) % cnunits + group_size - gap < cnunits))
> +   || ((group_size * cvf) % cnunits + group_size - gap < cnunits
> +   /* But peeling a single scalar iteration is enough if
> +  we can use the next power-of-two sized partial
> +  access.  */
> +   && ((cremain = (group_size * cvf - gap) % cnunits), true

...this might be less surprising as:

  && ((cremain = (group_size * cvf - gap) % cnunits, true)

in terms of how the &&s line up.

Thanks,
Richard

> +   && ((cpart_size = (1 << ceil_log2 (cremain)))
> +   != cnunits)
> +   && vector_vector_composition_type
> +(vectype, cnunits / cpart_size,
> + _vtype) == NULL_TREE
>   {
> if (dump_enabled_p ())
>   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> @@ -11599,6 +11608,27 @@ vectorizable_load (vec_info *vinfo,
> gcc_assert (new_vtype
> || LOOP_VINFO_PEELING_FOR_GAPS
>  (loop_vinfo));
> + /* But still reduce the access size to the next
> +required power-of-two so peeling a single
> +scalar iteration is sufficient.  */
> + unsigned HOST_WIDE_INT cremain;
> + if (remain.is_constant (&cremain))
> +   {
> + unsigned HOST_WIDE_INT cpart_size
> +   = 1 << ceil_log2 (cremain);
> + if (known_gt (nunits, cpart_size)
> + && constant_multiple_p (nunits, cpart_size,
> + &num))
> +   {
> + tree ptype;
> + new_vtype
> +   = vector_vector_composition_type (vectype,
> + num,
> + );
> + if (new_vtype)
> +   ltype = ptype;
> +   }
> +   }
> }
> }
>   tree offset


Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

2024-06-11 Thread Richard Sandiford
Ajit Agarwal  writes:
> Hello Richard:
>
> On 11/06/24 9:41 pm, Richard Sandiford wrote:
>> Ajit Agarwal  writes:
>>>>> Thanks a lot. Can I know what should we be doing with neg (fma)
>>>>> correctness failures with load fusion.
>>>>
>>>> I think it would involve:
>>>>
>>>> - describing lxvp and stxvp as unspec patterns, as I mentioned
>>>>   in the previous reply
>>>>
>>>> - making plain movoo split loads and stores into individual
>>>>   lxv and stxvs.  (Or, alternative, it could use lxvp and stxvp,
>>>>   but internally swap the registers after load and before store.)
>>>>   That is, movoo should load the lower-numbered register from the
>>>>   lower address and the higher-numbered register from the higher
>>>>   address, and likewise for stores.
>>>>
>>>
>>> Would you mind elaborating the above.
>> 
>> I think movoo should use rs6000_split_multireg_move for all alternatives,
>> like movxo does.  movoo should split into 2 V1TI loads/stores and movxo
>> should split into 4 V1TI loads/stores.  lxvp and stxvp would be
>> independent patterns of the form:
>> 
>>   (set ...
>>(unspec [...] UNSPEC_FOO))
>> 
>> ---
>> 
>
> In load fusion pass I generate the above pattern for adjacent merge
> pairs.
>
>> rs6000_split_multireg_move has:
>> 
>>   /* The __vector_pair and __vector_quad modes are multi-register
>>  modes, so if we have to load or store the registers, we have to be
>>  careful to properly swap them if we're in little endian mode
>>  below.  This means the last register gets the first memory
>>  location.  We also need to be careful of using the right register
>>  numbers if we are splitting XO to OO.  */
>> 
>> But I don't see how this can work reliably if we allow the kind of
>> subregs that you want to create here.  The register order is the opposite
>> from the one that GCC expects.
>> 
>> This is more a question for the PowerPC maintainers though.
>>
>
> I generated the above unspec pattern and modified the movoo pattern to
> accept it.  It then goes through rs6000_split_multireg_move, which
> splits it into 2 V1TI loads and generates consecutive loads with
> sequential registers.  In the load_fusion pass I generate the subreg
> uses of the load result, subreg (reg OO R) 16 and subreg (reg OO R) 0.
>
> But it doesn't generate the lxvp instruction.  If I keep the above
> unspec pattern and write a separate pattern in the md file to generate
> lxvp instead of the normal movoo, then it won't go through
> rs6000_split_multireg_move.

I don't understand the last bit, sorry.  Under the scheme I described,
lxvp should be generated only through an unspec (and no other way).
Same for stxvp.  The fusion pass should generate those unspecs.

If the fusion pass has generated the code correctly, the lxvp unspec
will remain throughout compilation, unless all uses of it are later
deleted as dead.

The movoo rtl pattern should continue to be:

  [(set (match_operand:OO 0 "nonimmediate_operand" "=wa,ZwO,wa")
(match_operand:OO 1 "input_operand" "ZwO,wa,wa"))]

But movoo should generate individual loads, stores and moves.  By design,
it should never generate lxvp or stxvp.

This means that, if a fused load is spilled, the sequence will be
something like:

  lxvp ...   // original fused load (unspec)
  ...
  stxv ...   // store one half to the stack (split from movoo)
  stxv ...   // store the other half to the stack (split from movoo)

Then insns that use the pair will load whichever half they need
from the stack.

I realise that isn't great, but it should at least be correct.

Thanks,
Richard


Re: [PATCH] ifcvt: Clarify if_info.original_cost.

2024-06-11 Thread Richard Sandiford
Robin Dapp  writes:
>> I was looking at the code in more detail and just wanted to check.
>> We have:
>> 
>>   int last_needs_comparison = -1;
>> 
>>   bool ok = noce_convert_multiple_sets_1
>> (if_info, &need_no_cmov, &rewired_src, &targets, &temporaries,
>>  &unmodified_insns, &last_needs_comparison);
>>   if (!ok)
>>   return false;
>> 
>>   /* If there are insns that overwrite part of the initial
>>  comparison, we can still omit creating temporaries for
>>  the last of them.
>>  As the second try will always create a less expensive,
>>  valid sequence, we do not need to compare and can discard
>>  the first one.  */
>>   if (last_needs_comparison != -1)
>> {
>>   end_sequence ();
>>   start_sequence ();
>>   ok = noce_convert_multiple_sets_1
>>  (if_info, &need_no_cmov, &rewired_src, &targets, &temporaries,
>>   &unmodified_insns, &last_needs_comparison);
>>   /* Actually we should not fail anymore if we reached here,
>>   but better still check.  */
>>   if (!ok)
>>return false;
>> }
>> 
>> But noce_convert_multiple_sets_1 ends with:
>> 
>>   /* Even if we did not actually need the comparison, we want to make sure
>>  to try a second time in order to get rid of the temporaries.  */
>>   if (*last_needs_comparison == -1)
>> *last_needs_comparison = 0;
>> 
>> 
>>   return true;
>> 
>> AFAICT that means that the first attempt is always redundant.
>> 
>> Have I missed something?
>
> (I might not have fully gotten the question)
>
> The idea is that the first attempt goes through all insns and sets
> *last_need_comparison to the insn number that either
> - used the condition/comparison by preferring seq1 or
> - used the condition as a side-effect insn when creating a CC-using
>   insn in seq2.
> (And we only know that after actually creating the sequences). 
>
> The second attempt then improves on the first one by skipping
> any temporary destination registers after the last insn that required
> the condition (even though its target overlaps with the condition
> registers).  This is true for all cmovs that only use the CC
> (instead of the condition).  Essentially, we know that all following
> cmovs can be created via the CC which is not overwritten.
>
> So, even when we never used the condition because of all CC-using
> cmovs we would skip the temporary targets in the second attempt.
> But we can't know that all we ever needed is the CC comparison
> before actually creating the sequences in the first attempt.

Hmm, ok.  The bit that confused me most was:

  if (last_needs_comparison != -1)
{
  end_sequence ();
  start_sequence ();
  ...
}

which implied that the second attempt was made conditionally.
It seems like it's always used and is an inherent part of the
algorithm.

If the problem is tracking liveness, wouldn't it be better to
iterate over the "then" block in reverse order?  We would start
with the liveness set for the join block and update as we move
backwards through the "then" block.  This liveness set would
tell us whether the current instruction needs to preserve a
particular register.  That should make it possible to do the
transformation in one step, and so avoid the risk that the
second attempt does something that is unexpectedly different
from the first attempt.
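Something like this is the shape I mean (very rough sketch, not a worked
patch; the df calls are the ones I have in mind but the details are
unchecked):

  auto_bitmap live;
  bitmap_copy (live, df_get_live_in (join_bb));
  for (rtx_insn *insn = BB_END (then_bb);
       insn && insn != PREV_INSN (BB_HEAD (then_bb));
       insn = PREV_INSN (insn))
    if (NONDEBUG_INSN_P (insn))
      {
	rtx set = single_set (insn);
	if (set && REG_P (SET_DEST (set)))
	  {
	    bool live_after = bitmap_bit_p (live, REGNO (SET_DEST (set)));
	    /* live_after says whether this destination must be preserved,
	       i.e. whether a temporary is really needed for it.  */
	  }
	df_simulate_one_insn_backwards (then_bb, insn, live);
      }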

FWIW, the reason for asking was that it seemed safer to pass
use_cond_earliest back from noce_convert_multiple_sets_1
to noce_convert_multiple_sets, as another parameter,
and then do the adjustment around noce_convert_multiple_sets's
call to targetm.noce_conversion_profitable_p.  That would avoid
the need for a new if_info field, which in turn would make it
less likely that stale information is carried over from one attempt
to the next (e.g. if other ifcvt techniques end up using the same
field in future).

Thanks,
Richard


Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

2024-06-11 Thread Richard Sandiford
Ajit Agarwal  writes:
>>> Thanks a lot. Can I know what should we be doing with neg (fma)
>>> correctness failures with load fusion.
>> 
>> I think it would involve:
>> 
>> - describing lxvp and stxvp as unspec patterns, as I mentioned
>>   in the previous reply
>> 
>> - making plain movoo split loads and stores into individual
>>   lxv and stxvs.  (Or, alternative, it could use lxvp and stxvp,
>>   but internally swap the registers after load and before store.)
>>   That is, movoo should load the lower-numbered register from the
>>   lower address and the higher-numbered register from the higher
>>   address, and likewise for stores.
>> 
>
> Would you mind elaborating the above.

I think movoo should use rs6000_split_multireg_move for all alternatives,
like movxo does.  movoo should split into 2 V1TI loads/stores and movxo
should split into 4 V1TI loads/stores.  lxvp and stxvp would be
independent patterns of the form:

  (set ...
   (unspec [...] UNSPEC_FOO))

---

rs6000_split_multireg_move has:

  /* The __vector_pair and __vector_quad modes are multi-register
 modes, so if we have to load or store the registers, we have to be
 careful to properly swap them if we're in little endian mode
 below.  This means the last register gets the first memory
 location.  We also need to be careful of using the right register
 numbers if we are splitting XO to OO.  */

But I don't see how this can work reliably if we allow the kind of
subregs that you want to create here.  The register order is the opposite
from the one that GCC expects.

This is more a question for the PowerPC maintainers though.

And this is one of the (admittedly many) times when I wish GCC's
subreg model was more like LLVM's. :)

Thanks,
Richard


Re: [PATCH] ifcvt: Clarify if_info.original_cost.

2024-06-11 Thread Richard Sandiford
Robin Dapp  writes:
> The attached v3 tracks the use of cond_earliest as you suggested
> and adds its cost in default_noce_conversion_profitable_p.
>
> Bootstrapped and regtested on x86 and p10, aarch64 still
> running.  Regtested on riscv64.
>
> Regards
>  Robin
>
> Before noce_find_if_block processes a block it sets up an if_info
> structure that holds the original costs.  At that point the costs of
> the then/else blocks have not been added so we only care about the
> "if" cost.
>
> The code originally used BRANCH_COST for that but was then changed
> to COST_N_INSNS (2) - a compare and a jump.
>
> This patch computes the jump costs via
>   insn_cost (if_info.jump, ...)
> under the assumption that the target takes BRANCH_COST into account
> when costing a jump instruction.
>
> In noce_convert_multiple_sets, we keep track of the need for the initial
> CC comparison.  If we needed it for the generated sequence we add its
> cost in default_noce_conversion_profitable_p.

I was looking at the code in more detail and just wanted to check.
We have:

  int last_needs_comparison = -1;

  bool ok = noce_convert_multiple_sets_1
(if_info, &need_no_cmov, &rewired_src, &targets, &temporaries,
 &unmodified_insns, &last_needs_comparison);
  if (!ok)
  return false;

  /* If there are insns that overwrite part of the initial
 comparison, we can still omit creating temporaries for
 the last of them.
 As the second try will always create a less expensive,
 valid sequence, we do not need to compare and can discard
 the first one.  */
  if (last_needs_comparison != -1)
{
  end_sequence ();
  start_sequence ();
  ok = noce_convert_multiple_sets_1
(if_info, &need_no_cmov, &rewired_src, &targets, &temporaries,
 &unmodified_insns, &last_needs_comparison);
  /* Actually we should not fail anymore if we reached here,
 but better still check.  */
  if (!ok)
  return false;
}

But noce_convert_multiple_sets_1 ends with:

  /* Even if we did not actually need the comparison, we want to make sure
 to try a second time in order to get rid of the temporaries.  */
  if (*last_needs_comparison == -1)
*last_needs_comparison = 0;


  return true;

AFAICT that means that the first attempt is always redundant.

Have I missed something?

I don't know if this was something that Manolis's patches addressed.

Thanks,
Richard

>
> gcc/ChangeLog:
>
>   * ifcvt.cc (default_noce_conversion_profitable_p):  Add cost of
>   CC comparison.
>   (noce_convert_multiple_sets_1): Set use_cond_earliest.
>   (noce_process_if_block): Just use original cost.
>   (noce_find_if_block): Use insn_cost (jump_insn).
>   * ifcvt.h (struct noce_if_info): Add use_cond_earliest.
> ---
>  gcc/ifcvt.cc | 37 ++---
>  gcc/ifcvt.h  |  3 +++
>  2 files changed, 25 insertions(+), 15 deletions(-)
>
> diff --git a/gcc/ifcvt.cc b/gcc/ifcvt.cc
> index 58ed42673e5..9b408eeb313 100644
> --- a/gcc/ifcvt.cc
> +++ b/gcc/ifcvt.cc
> @@ -814,7 +814,16 @@ default_noce_conversion_profitable_p (rtx_insn *seq,
>/* Cost up the new sequence.  */
>unsigned int cost = seq_cost (seq, speed_p);
>  
> -  if (cost <= if_info->original_cost)
> +  /* If the created sequence does not use cond_earliest (but the jump
> + does) add its cost to the original_cost here.  */
> +  unsigned int cost_adjust = 0;
> +
> +  if (if_info->jump != if_info->cond_earliest
> +  && !if_info->use_cond_earliest)
> +cost_adjust = insn_cost (if_info->cond_earliest,
> +  if_info->speed_p);
> +
> +  if (cost <= if_info->original_cost + cost_adjust)
>  return true;
>  
>/* When compiling for size, we can make a reasonably accurately guess
> @@ -3780,6 +3789,7 @@ noce_convert_multiple_sets_1 (struct noce_if_info 
> *if_info,
> temp_dest = temp_dest2;
> if (!second_try && read_comparison)
>   *last_needs_comparison = count;
> +   if_info->use_cond_earliest = true;
>   }
>else
>   {
> @@ -3931,16 +3941,13 @@ noce_process_if_block (struct noce_if_info *if_info)
>   to calculate a value for x.
>   ??? For future expansion, further expand the "multiple X" rules.  */
>  
> -  /* First look for multiple SETS.  The original costs already include
> - a base cost of COSTS_N_INSNS (2): one instruction for the compare
> - (which we will be needing either way) and one instruction for the
> - branch.  When comparing costs we want to use the branch instruction
> - cost and the sets vs. the cmovs generated here.  Therefore subtract
> - the costs of the compare before checking.
> - ??? Actually, instead of the branch instruction costs we might want
> - to use COSTS_N_INSNS (BRANCH_COST ()) as in other places.  */
> +  /* First look for multiple SETS.
> + The original costs already include costs for the jump insn as well
> + as for a CC comparison if there is any.
> + If a target re-uses the existing CC comparison we keep track of that
> + and add 

Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

2024-06-11 Thread Richard Sandiford
Ajit Agarwal  writes:
> On 11/06/24 7:07 pm, Richard Sandiford wrote:
>> Ajit Agarwal  writes:
>>> Hello Richard:
>>> On 11/06/24 6:12 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal  writes:
>>>>> Hello Richard:
>>>>>
>>>>> On 11/06/24 5:15 pm, Richard Sandiford wrote:
>>>>>> Ajit Agarwal  writes:
>>>>>>> Hello Richard:
>>>>>>> On 11/06/24 4:56 pm, Ajit Agarwal wrote:
>>>>>>>> Hello Richard:
>>>>>>>>
>>>>>>>> On 11/06/24 4:36 pm, Richard Sandiford wrote:
>>>>>>>>> Ajit Agarwal  writes:
>>>>>>>>>>>>>> After LRA reload:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 
>>>>>>>>>>>>>> vect__302.545 ] [240])
>>>>>>>>>>>>>> (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] 
>>>>>>>>>>>>>> [1285])
>>>>>>>>>>>>>> (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) 
>>>>>>>>>>>>>> "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>>>>>>>>>>  (nil))
>>>>>>>>>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 
>>>>>>>>>>>>>> vect__302.545 ] [240])
>>>>>>>>>>>>>> (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>>>>>>>>> (reg:V2DF 44 12 [3119])
>>>>>>>>>>>>>> (neg:V2DF (reg:V2DF 51 19 [orig:240 
>>>>>>>>>>>>>> vect__302.545 ] [240]) {*vsx_nfmsv2df4}
>>>>>>>>>>>>>>  (nil))
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 
>>>>>>>>>>>>>> vect__302.545 ] [905])
>>>>>>>>>>>>>> (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>>>>>>>>>>>> (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>>>>>>>>>>>>> (neg:V2DF (reg:V2DF 47 15 [5266]) 
>>>>>>>>>>>>>> {*vsx_nfmsv2df4}
>>>>>>>>>>>>>>  (nil))
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the above allocated code it assign registers 51 and 47 and 
>>>>>>>>>>>>>> they are not sequential.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The reload for 2412 looks valid.  What was the original pre-reload
>>>>>>>>>>>>> version of insn 2473?  Also, what happened to insn 2472?  Was it 
>>>>>>>>>>>>> deleted?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> This is preload version of 2473:
>>>>>>>>>>>>
>>>>>>>>>>>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>>>>>>>>>>>> (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>>>>>>>>>>>> (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
>>>>>>>>>>>> (neg:V2DF (subreg:V2DF (reg:OO 2572 [ 
>>>>>>>>>>>> vect__300.543_236 ]) 0) {*vsx_nfmsv2df4}
>>>>>>>>>>>>  (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>>>>>>>>>>>> (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ])

Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

2024-06-11 Thread Richard Sandiford
Ajit Agarwal  writes:
> Hello Richard:
> On 11/06/24 6:12 pm, Richard Sandiford wrote:
>> Ajit Agarwal  writes:
>>> Hello Richard:
>>>
>>> On 11/06/24 5:15 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal  writes:
>>>>> Hello Richard:
>>>>> On 11/06/24 4:56 pm, Ajit Agarwal wrote:
>>>>>> Hello Richard:
>>>>>>
>>>>>> On 11/06/24 4:36 pm, Richard Sandiford wrote:
>>>>>>> Ajit Agarwal  writes:
>>>>>>>>>>>> After LRA reload:
>>>>>>>>>>>>
>>>>>>>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 
>>>>>>>>>>>> vect__302.545 ] [240])
>>>>>>>>>>>> (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] 
>>>>>>>>>>>> [1285])
>>>>>>>>>>>> (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) 
>>>>>>>>>>>> "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>>>>>>>>  (nil))
>>>>>>>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 
>>>>>>>>>>>> vect__302.545 ] [240])
>>>>>>>>>>>> (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>>>>>>> (reg:V2DF 44 12 [3119])
>>>>>>>>>>>> (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 
>>>>>>>>>>>> ] [240]) {*vsx_nfmsv2df4}
>>>>>>>>>>>>  (nil))
>>>>>>>>>>>>
>>>>>>>>>>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 
>>>>>>>>>>>> vect__302.545 ] [905])
>>>>>>>>>>>> (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>>>>>>>>>> (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>>>>>>>>>>> (neg:V2DF (reg:V2DF 47 15 [5266]) 
>>>>>>>>>>>> {*vsx_nfmsv2df4}
>>>>>>>>>>>>  (nil))
>>>>>>>>>>>>
>>>>>>>>>>>> In the above allocated code it assign registers 51 and 47 and they 
>>>>>>>>>>>> are not sequential.
>>>>>>>>>>>
>>>>>>>>>>> The reload for 2412 looks valid.  What was the original pre-reload
>>>>>>>>>>> version of insn 2473?  Also, what happened to insn 2472?  Was it 
>>>>>>>>>>> deleted?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This is preload version of 2473:
>>>>>>>>>>
>>>>>>>>>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>>>>>>>>>> (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>>>>>>>>>> (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
>>>>>>>>>> (neg:V2DF (subreg:V2DF (reg:OO 2572 [ 
>>>>>>>>>> vect__300.543_236 ]) 0) {*vsx_nfmsv2df4}
>>>>>>>>>>  (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>>>>>>>>>> (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ])
>>>>>>>>>> (nil
>>>>>>>>>>
>>>>>>>>>> insn 2472 is replaced with 9299 after reload.
>>>>>>>>>
>>>>>>>>> You'd have to check the dumps to be sure, but I think 9299 is instead
>>>>>>>>> generated as an input reload of 2412, rather than being a replacement
>>>>>>>>> of insn 2472.  T
>>>>>>>>
>>>>>>>> Yes it is 

Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

2024-06-11 Thread Richard Sandiford
Ajit Agarwal  writes:
> Hello Richard:
>
> On 11/06/24 5:15 pm, Richard Sandiford wrote:
>> Ajit Agarwal  writes:
>>> Hello Richard:
>>> On 11/06/24 4:56 pm, Ajit Agarwal wrote:
>>>> Hello Richard:
>>>>
>>>> On 11/06/24 4:36 pm, Richard Sandiford wrote:
>>>>> Ajit Agarwal  writes:
>>>>>>>>>> After LRA reload:
>>>>>>>>>>
>>>>>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 
>>>>>>>>>> vect__302.545 ] [240])
>>>>>>>>>> (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>>>>>> (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) 
>>>>>>>>>> "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>>>>>>  (nil))
>>>>>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 
>>>>>>>>>> vect__302.545 ] [240])
>>>>>>>>>> (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>>>>> (reg:V2DF 44 12 [3119])
>>>>>>>>>> (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>>>>>>>>> [240]) {*vsx_nfmsv2df4}
>>>>>>>>>>  (nil))
>>>>>>>>>>
>>>>>>>>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 
>>>>>>>>>> ] [905])
>>>>>>>>>> (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>>>>>>>> (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>>>>>>>>> (neg:V2DF (reg:V2DF 47 15 [5266]) 
>>>>>>>>>> {*vsx_nfmsv2df4}
>>>>>>>>>>  (nil))
>>>>>>>>>>
>>>>>>>>>> In the above allocated code it assign registers 51 and 47 and they 
>>>>>>>>>> are not sequential.
>>>>>>>>>
>>>>>>>>> The reload for 2412 looks valid.  What was the original pre-reload
>>>>>>>>> version of insn 2473?  Also, what happened to insn 2472?  Was it 
>>>>>>>>> deleted?
>>>>>>>>>
>>>>>>>>
>>>>>>>> This is preload version of 2473:
>>>>>>>>
>>>>>>>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>>>>>>>> (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>>>>>>>> (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
>>>>>>>> (neg:V2DF (subreg:V2DF (reg:OO 2572 [ 
>>>>>>>> vect__300.543_236 ]) 0) {*vsx_nfmsv2df4}
>>>>>>>>  (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>>>>>>>> (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ])
>>>>>>>> (nil
>>>>>>>>
>>>>>>>> insn 2472 is replaced with 9299 after reload.
>>>>>>>
>>>>>>> You'd have to check the dumps to be sure, but I think 9299 is instead
>>>>>>> generated as an input reload of 2412, rather than being a replacement
>>>>>>> of insn 2472.  T
>>>>>>
>>>>>> Yes it is generated for 2412. The predecessor of 2412 is load from
>>>>>> plus offset as in 2412 we have subreg:V2DF (reg OO 2572) 16).
>>>>>>
>>>>>> This is not correct as we are not generating lxvp and it is 
>>>>>> normal load lxv.
>>>>>> As normal load is generated in predecessor insn of 2412 with
>>>>>> plus constant offset it breaks the correctness.
>>>>>
>>>>> Not using lxvp is a deliberate choice though.
>>>>>
>>>>> If a (reg:OO R) has been spilled, there's no requirement for LRA
>>>

Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

2024-06-11 Thread Richard Sandiford
Ajit Agarwal  writes:
> Hello Richard:
> On 11/06/24 4:56 pm, Ajit Agarwal wrote:
>> Hello Richard:
>> 
>> On 11/06/24 4:36 pm, Richard Sandiford wrote:
>>> Ajit Agarwal  writes:
>>>>>>>> After LRA reload:
>>>>>>>>
>>>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 
>>>>>>>> ] [240])
>>>>>>>> (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>>>> (const_int 16 [0x10])) [1 MEM  
>>>>>>>> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 
>>>>>>>> 1190 {vsx_movv2df_64bit}
>>>>>>>>  (nil))
>>>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 
>>>>>>>> ] [240])
>>>>>>>> (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM >>>>>>> real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>>> (reg:V2DF 44 12 [3119])
>>>>>>>> (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>>>>>>> [240]) {*vsx_nfmsv2df4}
>>>>>>>>  (nil))
>>>>>>>>
>>>>>>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] 
>>>>>>>> [905])
>>>>>>>> (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>>>>>> (reg:V2DF 38 6 [orig:2561 MEM  
>>>>>>>> [(real(kind=8) *)_4050] ] [2561])
>>>>>>>> (neg:V2DF (reg:V2DF 47 15 [5266]) {*vsx_nfmsv2df4}
>>>>>>>>  (nil))
>>>>>>>>
>>>>>>>> In the above allocated code it assign registers 51 and 47 and they are 
>>>>>>>> not sequential.
>>>>>>>
>>>>>>> The reload for 2412 looks valid.  What was the original pre-reload
>>>>>>> version of insn 2473?  Also, what happened to insn 2472?  Was it 
>>>>>>> deleted?
>>>>>>>
>>>>>>
>>>>>> This is preload version of 2473:
>>>>>>
>>>>>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>>>>>> (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>>>>>> (subreg:V2DF (reg:OO 2561 [ MEM  
>>>>>> [(real(kind=8) *)_4050] ]) 0)
>>>>>> (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 
>>>>>> ]) 0) {*vsx_nfmsv2df4}
>>>>>>  (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>>>>>> (expr_list:REG_DEAD (reg:OO 2561 [ MEM  
>>>>>> [(real(kind=8) *)_4050] ])
>>>>>> (nil
>>>>>>
>>>>>> insn 2472 is replaced with 9299 after reload.
>>>>>
>>>>> You'd have to check the dumps to be sure, but I think 9299 is instead
>>>>> generated as an input reload of 2412, rather than being a replacement
>>>>> of insn 2472.  T
>>>>
>>>> Yes it is generated for 2412. The predecessor of 2412 is load from
>>>> plus offset as in 2412 we have subreg:V2DF (reg OO 2572) 16).
>>>>
>>>> This is not correct as we are not generating lxvp and it is 
>>>> normal load lxv.
>>>> As normal load is generated in predecessor insn of 2412 with
>>>> plus constant offset it breaks the correctness.
>>>
>>> Not using lxvp is a deliberate choice though.
>>>
>>> If a (reg:OO R) has been spilled, there's no requirement for LRA
>>> to load both halves of R when only one half is needed.  LRA just
>>> loads what it needs into whichever registers happen to be free.
>>>
>>> If the reload of R instead used lxvp, LRA would be forced to free
>>> up another register for the other half of R, even though that value
>>> would never be used.
>>>
>> 
>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
>> will be from plus offset 16 instead it should be loaded value 
>> from zero offset. As in load fusion pass we are replacing
>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
>> is from plus 16 offsets and thats why its breaking the correctness.
>> 
>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
>> and loaded value is from 16 offset instead its loading from zero
>> offset and thats why we are breaking the correctness.
>> 
>
> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
> will be from plus offset 16 instead it should be loaded value 
> from zero offset. As in load fusion pass we are replacing
> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
> is from plus 16 offsets instead it should load from zero offset.
> Thats why its breaking the correctness.
>  
> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
> and loaded value is from 16 offset instead its loading from zero
> offset and thats why we are breaking the correctness.

I don't understand, sorry.  (subreg:V2DI (reg:OO R) 0) is always

(a) the first hard register in (reg:OO R), when the whole of R
is stored in hard registers
(b) at address offset 0 from the start of (reg:OO R), when R is
spilled to memory

Similarly, (subreg:V2DI (reg:OO R) 16) is always

(c) the second hard register in (reg:OO R), when the whole of R
is stored in hard registers
(d) at address offset 16 from the start of (reg:OO R), when R is
spilled to memory
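
For concreteness, a sketch of those four cases (the register numbers and
the stack offset below are made up for illustration, not taken from the
dumps above):

  ;; whole of (reg:OO R) allocated to the hard register pair vs32/vs33:
  ;;   (subreg:V2DF (reg:OO R) 0)   is vs32
  ;;   (subreg:V2DF (reg:OO R) 16)  is vs33
  ;;
  ;; (reg:OO R) spilled to a 32-byte stack slot at sp+64:
  ;;   (subreg:V2DF (reg:OO R) 0)   reloads from (mem:V2DF (plus:DI sp (const_int 64)))
  ;;   (subreg:V2DF (reg:OO R) 16)  reloads from (mem:V2DF (plus:DI sp (const_int 80)))

Either way, offset 16 always refers to the same 16 bytes of R; a reload
only changes whether those bytes come from a register or from the spill
slot.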

Richard


Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

2024-06-11 Thread Richard Sandiford
Ajit Agarwal  writes:
> After LRA reload:
>
> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
> [240])
> (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
> (const_int 16 [0x10])) [1 MEM  
> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 
> {vsx_movv2df_64bit}
>  (nil))
> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
> [240])
> (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM  
> [(real(kind=8) *)_4050]+16 ])
> (reg:V2DF 44 12 [3119])
> (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
> [240]) {*vsx_nfmsv2df4}
>  (nil))
>
> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] 
> [905])
> (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
> (reg:V2DF 38 6 [orig:2561 MEM  
> [(real(kind=8) *)_4050] ] [2561])
> (neg:V2DF (reg:V2DF 47 15 [5266]) {*vsx_nfmsv2df4}
>  (nil))
>
> In the above allocated code it assign registers 51 and 47 and they are 
> not sequential.

 The reload for 2412 looks valid.  What was the original pre-reload
 version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?

>>>
>>> This is preload version of 2473:
>>>
>>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>>> (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>>> (subreg:V2DF (reg:OO 2561 [ MEM  
>>> [(real(kind=8) *)_4050] ]) 0)
>>> (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 
>>> 0) {*vsx_nfmsv2df4}
>>>  (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>>> (expr_list:REG_DEAD (reg:OO 2561 [ MEM  
>>> [(real(kind=8) *)_4050] ])
>>> (nil
>>>
>>> insn 2472 is replaced with 9299 after reload.
>> 
>> You'd have to check the dumps to be sure, but I think 9299 is instead
>> generated as an input reload of 2412, rather than being a replacement
>> of insn 2472.  T
>
> Yes it is generated for 2412. The predecessor of 2412 is load from
> plus offset as in 2412 we have subreg:V2DF (reg OO 2572) 16).
>
> This is not correct as we are not generating lxvp and it is 
> normal load lxv.
> As normal load is generated in predecessor insn of 2412 with
> plus constant offset it breaks the correctness.

Not using lxvp is a deliberate choice though.

If a (reg:OO R) has been spilled, there's no requirement for LRA
to load both halves of R when only one half is needed.  LRA just
loads what it needs into whichever registers happen to be free.

If the reload of R instead used lxvp, LRA would be forced to free
up another register for the other half of R, even though that value
would never be used.

>> That is, LRA needs to reload (subreg:V2DF (reg:OO 2572) 16)
>> from memory for insn 2412.  It can use the destination of insn 2412 (r51)
>> as a temporary to do that.  It doesn't need to load the other half of
>> reg:OO 2572 for this instruction.  That in itself looks ok.
>> 
>> So it looks like the problem is specific to insn 2473.  Perhaps LRA
>> thinks that r47 already contains the low half of (reg:OO 2572),
>> left behind by some previous instruction not shown above?
>> If LRA is wrong about that -- if r47 doesn't already contain the
>> low half of (reg:OO 2572) -- then there's a bug somewhere.
>> But we need to track down and fix the bug rather than try to sidestep
>> it in the fusion pass.
>> 
>
> Similarly for 2473 normal load with 0 offset are generated in predecessor
> insn as we are generating subreg:V2DF (reg OO 2572) 0 in 2473. As we are not
> generating lxvp this is not correct and breaks the code.

That too sounds ok, for the reasons above.

> Above code is valid if we are generating lxvp that generates
> sequential registers, but we are not geneating lxvp and normal
> load is generated and this breaks the code.

I think you said earlier that the code is miscompiled (fails at
runtime).  If that's due to an RA issue, then presumably there is
an instruction that, after RA, is reading the wrong value.  In other
words, there's presumably a register input somewhere that has the wrong
contents.  Have you isolated which instruction and register that is?

Thanks,
Richard


Re: [PATCH] Add SLP_TREE_MEMORY_ACCESS_TYPE

2024-06-11 Thread Richard Sandiford
Richard Biener  writes:
> It turns out target costing code looks at STMT_VINFO_MEMORY_ACCESS_TYPE
> to identify operations from (emulated) gathers for example.  This
> doesn't work for SLP loads since we do not set STMT_VINFO_MEMORY_ACCESS_TYPE
> there as the vectorization strathegy might differ between different
> stmt uses.  It seems we got away with setting it for stores though.
> The following adds a memory_access_type field to slp_tree and sets it
> from load and store vectorization code.  All the costing doesn't record
> the SLP node (that was only done selectively for some corner case).  The
> costing is really in need of a big overhaul, the following just massages
> the two relevant ops to fix gcc.dg/target/pr88531-2[bc].c FAILs when
> switching on SLP for non-grouped stores.  In particular currently
> we either have a SLP node or a stmt_info in the cost hook but not both.

Yeah, agree that costing needs an overhaul, and that it only makes sense
to do that once the final shape of the SLP-only form is clearer.

> So the following is a hack(?).  Other targets look possibly affected as
> well.  I do want to postpone rewriting all of the costing to after
> all-SLP.

Yeah, AArch64 uses STMT_VINFO_MEMORY_ACCESS_TYPE.  I think there it'd
be better to wait for a bit (for the reasons above), rather than
propagate the intermediate stage through the current code.  If this
ends up being the state for GCC 15, we can adjust the costs later
in stage 1.

> Any comments?

LGTM FWIW.

Thanks,
Richard

>
>   * tree-vectorizer.h (_slp_tree::memory_access_type): Add.
>   (SLP_TREE_MEMORY_ACCESS_TYPE): New.
>   (record_stmt_cost): Add another overload.
>   * tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize
>   memory_access_type.
>   * tree-vect-stmts.cc (vectorizable_store): Set
>   SLP_TREE_MEMORY_ACCESS_TYPE.
>   (vectorizable_load): Likewise.  Also record the SLP node
>   when costing emulated gather offset decompose and vector
>   composition.
>   * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): Also
>   recognize SLP emulated gather/scatter.
> ---
>  gcc/config/i386/i386.cc |  22 ++---
>  gcc/tree-vect-slp.cc|   1 +
>  gcc/tree-vect-stmts.cc  |  16 +--
>  gcc/tree-vectorizer.h   | 102 
>  4 files changed, 91 insertions(+), 50 deletions(-)
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 4126ab24a79..32ecf31d8d1 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -25150,13 +25150,21 @@ ix86_vector_costs::add_stmt_cost (int count, 
> vect_cost_for_stmt kind,
>   (AGU and load ports).  Try to account for this by scaling the
>   construction cost by the number of elements involved.  */
>if ((kind == vec_construct || kind == vec_to_scalar)
> -  && stmt_info
> -  && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
> -   || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
> -  && ((STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
> -&& (TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info)))
> -!= INTEGER_CST))
> -   || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER))
> +  && ((stmt_info
> +&& (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
> +|| STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
> +&& ((STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
> + && (TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info)))
> + != INTEGER_CST))
> +|| (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info)
> +== VMAT_GATHER_SCATTER)))
> +   || (node
> +   && ((SLP_TREE_MEMORY_ACCESS_TYPE (node) == VMAT_ELEMENTWISE
> +   && (TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF
> + (SLP_TREE_REPRESENTATIVE (node
> +   != INTEGER_CST))
> +   || (SLP_TREE_MEMORY_ACCESS_TYPE (node)
> +   == VMAT_GATHER_SCATTER)
>  {
>stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index e1e47b786c2..c359e8a0bbc 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -122,6 +122,7 @@ _slp_tree::_slp_tree ()
>SLP_TREE_CODE (this) = ERROR_MARK;
>SLP_TREE_VECTYPE (this) = NULL_TREE;
>SLP_TREE_REPRESENTATIVE (this) = NULL;
> +  SLP_TREE_MEMORY_ACCESS_TYPE (this) = VMAT_INVARIANT;
>SLP_TREE_REF_COUNT (this) = 1;
>this->failed = NULL;
>this->max_nunits = 1;
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index bd7dd149d11..8049c458136 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -8316,6 +8316,8 @@ vectorizable_store (vec_info *vinfo,
>if (costing_p) /* transformation not required.  */
> 

Re: [PATCH] ifcvt: Clarify if_info.original_cost.

2024-06-10 Thread Richard Sandiford
Robin Dapp  writes:
>> Is there any way we can avoid using pattern_cost here?  Using it means
>> that we can make use of targetm.insn_cost for the jump but circumvent
>> it for the condition, giving a bit of a mixed metric.
>> 
>> (I realise there are existing calls to pattern_cost in ifcvt.cc,
>> but if possible I think we should try to avoid adding more.)
>
> Yes, I believe there is.  In addition, what I did with
> if_info->cond wasn't what I intended to do.
>
> The whole point of the exercise is that noce_convert_multiple_sets
> can re-use the CC comparison that is already present (because it
> is used in the jump pattern).  Therefore I want to split costs
> into a jump part and a CC-setting part so the final costing
> decision for multiple sets can be:
>
>  insn_cost (jump) + n * insn_cost (set)
> vs
>  n * insn_cost ("cmov")
>
> Still, the original costs should be:
>  insn_cost (set_cc) + insn_cost (jump)
> and with the split we can just remove insn_cost (set_cc) before
> the multiple-set cost comparison and re-add it afterwards.
>
> For non-CC targets this is not necessary.
>
> So what I'd hope is better is to use
> insn_cost (if_info.earliest_cond)
> which is indeed the CC-set/comparison if it exists.

I agree that's probably good enough in practice.  It doesn't cope
with things like:

/* Handle sequences like:

   (set op0 (xor X Y))
   ...(eq|ne op0 (const_int 0))...

   in which case:

   (eq op0 (const_int 0)) reduces to (eq X Y)
   (ne op0 (const_int 0)) reduces to (ne X Y)

   This is the form used by MIPS16, for example.  */

but then neither does the current code.  But...

> The attached v2 was bootstrapped and regtested on x86, aarch64 and
> power10 and regtested on riscv64.
>
> Regards
>  Robin
>
> gcc/ChangeLog:
>
>   * ifcvt.cc (noce_process_if_block): Subtract condition pattern
>   cost if applicable.
>   (noce_find_if_block): Use insn_cost and pattern_cost for
>   original cost.
> ---
>  gcc/ifcvt.cc | 31 ---
>  1 file changed, 16 insertions(+), 15 deletions(-)
>
> diff --git a/gcc/ifcvt.cc b/gcc/ifcvt.cc
> index 58ed42673e5..ebb838fd82c 100644
> --- a/gcc/ifcvt.cc
> +++ b/gcc/ifcvt.cc
> @@ -3931,16 +3931,16 @@ noce_process_if_block (struct noce_if_info *if_info)
>   to calculate a value for x.
>   ??? For future expansion, further expand the "multiple X" rules.  */
>  
> -  /* First look for multiple SETS.  The original costs already include
> - a base cost of COSTS_N_INSNS (2): one instruction for the compare
> - (which we will be needing either way) and one instruction for the
> - branch.  When comparing costs we want to use the branch instruction
> - cost and the sets vs. the cmovs generated here.  Therefore subtract
> - the costs of the compare before checking.
> - ??? Actually, instead of the branch instruction costs we might want
> - to use COSTS_N_INSNS (BRANCH_COST ()) as in other places.  */
> -
> -  unsigned potential_cost = if_info->original_cost - COSTS_N_INSNS (1);
> +  /* First look for multiple SETS.
> + The original costs already include costs for the jump insn as well
> + as for a CC comparison if there is any.
> + We want to allow the backend to re-use the existing CC comparison
> + and therefore don't consider it for the cost comparison (as it is
> + then needed for both the jump as well as the cmov sequence).  */
> +
> +  unsigned potential_cost = if_info->original_cost;
> +  if (if_info->cond_earliest && if_info->jump != if_info->cond_earliest)
> +potential_cost -= insn_cost (if_info->cond_earliest, if_info->speed_p);
>unsigned old_cost = if_info->original_cost;
>if (!else_bb
>&& HAVE_conditional_move

...why do we do the adjustment here?  Doesn't noce_convert_multiple_sets_1
know for certain (or at least with more certainty) whether any of the
new instructions use the old CC result?  It seems like we could record
that and do the adjustment around the call to
targetm.noce_conversion_profitable_p.
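
I.e. something along these lines (a rough sketch only; the used_cc_cmp
flag and where it gets set are assumptions, not existing code):

  /* Rough sketch: have noce_convert_multiple_sets_1 set USED_CC_CMP
     when the generated sequence reuses the existing CC comparison,
     then exclude that comparison's cost only while querying the
     target hook.  */
  unsigned int saved_cost = if_info->original_cost;
  if (used_cc_cmp
      && if_info->cond_earliest
      && if_info->jump != if_info->cond_earliest)
    if_info->original_cost -= insn_cost (if_info->cond_earliest,
                                         if_info->speed_p);
  bool profitable = targetm.noce_conversion_profitable_p (seq, if_info);
  if_info->original_cost = saved_cost;
  if (!profitable)
    return false;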

> @@ -4703,11 +4703,12 @@ noce_find_if_block (basic_block test_bb, edge 
> then_edge, edge else_edge,
>  = targetm.max_noce_ifcvt_seq_cost (then_edge);
>/* We'll add in the cost of THEN_BB and ELSE_BB later, when we check
>   that they are valid to transform.  We can't easily get back to the insn
> - for COND (and it may not exist if we had to canonicalize to get COND),
> - and jump_insns are always given a cost of 1 by seq_cost, so treat
> - both instructions as having cost COSTS_N_INSNS (1).  */
> -  if_info.original_cost = COSTS_N_INSNS (2);
> -
> + for COND (and it may not exist if we had to canonicalize to get COND).
> + jump insn that is costed via insn_cost.  It is assumed that the
^^
Looks like this part of the comment got a bit garbled.

Thanks,
Richard

> + costs of a jump insn are 

Re: [PATCH v3 6/6] aarch64: Add DLL import/export to AArch64 target

2024-06-10 Thread Richard Sandiford
Thanks for the update.  Parts 1-5 look good to me.  Some minor comments
below about part 6:

Evgeny Karpov  writes:
> This patch reuses the MinGW implementation to enable DLL import/export
> functionality for the aarch64-w64-mingw32 target. It also modifies
> environment configurations for MinGW.
>
> gcc/ChangeLog:
>
>   * config.gcc: Add winnt-dll.o, which contains the DLL
>   import/export implementation.
>   * config/aarch64/aarch64.cc (aarch64_legitimize_pe_coff_symbol):
>   Add a conditional function that reuses the MinGW implementation
>   for COFF and does nothing otherwise.
>   (aarch64_load_symref_appropriately): Add dllimport
>   implementation.
>   (aarch64_expand_call): Likewise.
>   (aarch64_legitimize_address): Likewise.
>   * config/aarch64/cygming.h (SYMBOL_FLAG_DLLIMPORT): Modify MinGW
>   environment to support DLL import/export.
>   (SYMBOL_FLAG_DLLEXPORT): Likewise.
>   (SYMBOL_REF_DLLIMPORT_P): Likewise.
>   (SYMBOL_FLAG_STUBVAR): Likewise.
>   (SYMBOL_REF_STUBVAR_P): Likewise.
>   (TARGET_VALID_DLLIMPORT_ATTRIBUTE_P): Likewise.
>   (TARGET_ASM_FILE_END): Likewise.
>   (SUB_TARGET_RECORD_STUB): Likewise.
>   (GOT_ALIAS_SET): Likewise.
>   (PE_COFF_EXTERN_DECL_SHOULD_BE_LEGITIMIZED): Likewise.
>   (HAVE_64BIT_POINTERS): Likewise.
> ---
>  gcc/config.gcc|  4 +++-
>  gcc/config/aarch64/aarch64.cc | 37 +++
>  gcc/config/aarch64/cygming.h  | 26 ++--
>  3 files changed, 64 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index d053b98efa8..331285b7b6d 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -1276,10 +1276,12 @@ aarch64-*-mingw*)
>   tm_file="${tm_file} mingw/mingw32.h"
>   tm_file="${tm_file} mingw/mingw-stdint.h"
>   tm_file="${tm_file} mingw/winnt.h"
> + tm_file="${tm_file} mingw/winnt-dll.h"
>   tmake_file="${tmake_file} aarch64/t-aarch64"
>   target_gtfiles="$target_gtfiles \$(srcdir)/config/mingw/winnt.cc"
> + target_gtfiles="$target_gtfiles \$(srcdir)/config/mingw/winnt-dll.cc"
>   extra_options="${extra_options} mingw/cygming.opt mingw/mingw.opt"
> - extra_objs="${extra_objs} winnt.o"
> + extra_objs="${extra_objs} winnt.o winnt-dll.o"
>   c_target_objs="${c_target_objs} msformat-c.o"
>   d_target_objs="${d_target_objs} winnt-d.o"
>   tmake_file="${tmake_file} mingw/t-cygming"
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 3418e57218f..5706b9aeb6b 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -860,6 +860,10 @@ static const attribute_spec aarch64_gnu_attributes[] =
>{ "Advanced SIMD type", 1, 1, false, true,  false, true,  NULL, NULL },
>{ "SVE type",3, 3, false, true,  false, true,  NULL, NULL 
> },
>{ "SVE sizeless type",  0, 0, false, true,  false, true,  NULL, NULL },
> +#if TARGET_DLLIMPORT_DECL_ATTRIBUTES
> +  { "dllimport", 0, 0, false, false, false, false, handle_dll_attribute, 
> NULL },
> +  { "dllexport", 0, 0, false, false, false, false, handle_dll_attribute, 
> NULL },
> +#endif
>  #ifdef SUBTARGET_ATTRIBUTE_TABLE
>SUBTARGET_ATTRIBUTE_TABLE
>  #endif
> @@ -2819,6 +2823,15 @@ tls_symbolic_operand_type (rtx addr)
>return tls_kind;
>  }
>  
> +rtx aarch64_legitimize_pe_coff_symbol (rtx addr, bool inreg)
> +{
> +#if TARGET_PECOFF
> +  return legitimize_pe_coff_symbol (addr, inreg);
> +#else
> +  return NULL_RTX;
> +#endif
> +}
> +

I wondered whether we should try to abstract this behind
SUBTARGET_* stuff, e.g.:

  SUBTARGET_LEGITIMIZE_ADDRESS(ADDR) (the inreg==true case)
  SUBTARGET_LEGITIMIZE_CALLEE(ADDR)  (the inreg==false case)

But I don't think it falls out naturally with the way GCC's code is
organised.  I agree having direct references to PECOFF is probably the
least worst option under the circumstances.

Since there is no AArch64-specific handling, I think it'd be
better to have:

#if !TARGET_PECOFF
rtx legitimize_pe_coff_symbol (rtx, bool) { return NULL_RTX; }
#endif

This avoids warning about unused arguments in the !TARGET_PECOFF case.

>  /* We'll allow lo_sum's in addresses in our legitimate addresses
> so that combine would take care of combining addresses where
> necessary, but for generation purposes, we'll generate the address
> @@ -2865,6 +2878,17 @@ static void
>  aarch64_load_symref_appropriately (rtx dest, rtx imm,
>  enum aarch64_symbol_type type)
>  {
> +  /* If legitimize returns a value
> + copy it directly to the destination and return.  */

I don't think the comment really adds anything.

> +
> +  rtx tmp = aarch64_legitimize_pe_coff_symbol (imm, true);
> +

Sorry for pushing personal preference, but I think it's slightly
easier to read without this blank line (following the style used
later in aarch64_legitimize_address).

> +  if (tmp)
> +   

Re: [PATCH v2] vect: Merge loop mask and cond_op mask in fold-left, reduction [PR115382].

2024-06-10 Thread Richard Sandiford
Robin Dapp  writes:
>> Actually, as Richard mentioned in the PR, it would probably be better
>> to use prepare_vec_mask instead.  It should work in this context too
>> and would avoid redundant double masking.
>
> Attached is v2 that uses prepare_vec_mask.
>
> Regtested on riscv64 and armv8.8-a+sve via qemu.
> Bootstrap and regtest running on x86 and aarch64.
>
> Regards
>  Robin
>
>
> Currently we discard the cond-op mask when the loop is fully masked
> which causes wrong code in
> gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c
> when compiled with
> -O3 -march=cascadelake --param vect-partial-vector-usage=2.
>
> This patch ANDs both masks.
>
> gcc/ChangeLog:
>
>   PR tree-optimization/115382
>
>   * tree-vect-loop.cc (vectorize_fold_left_reduction): Merge loop
>   mask and cond-op mask.
> ---
>  gcc/tree-vect-loop.cc  | 10 +-
>  gcc/tree-vect-stmts.cc |  2 +-
>  gcc/tree-vectorizer.h  |  2 ++
>  3 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 028692614bb..c9b037b8daf 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -7215,7 +7215,15 @@ vectorize_fold_left_reduction (loop_vec_info 
> loop_vinfo,
>tree len = NULL_TREE;
>tree bias = NULL_TREE;
>if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> - mask = vect_get_loop_mask (loop_vinfo, gsi, masks, vec_num, vectype_in, 
> i);
> + {
> +   tree loop_mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
> +vec_num, vectype_in, i);
> +   if (is_cond_op)
> + mask = prepare_vec_mask (loop_vinfo, TREE_TYPE (loop_mask),
> +  loop_mask, vec_opmask[i], gsi);
> +   else
> + mask = loop_mask;
> + }
>else if (is_cond_op)
>   mask = vec_opmask[i];
>if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 5098b7fab6a..124a3462753 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -1643,7 +1643,7 @@ check_load_store_for_partial_vectors (loop_vec_info 
> loop_vinfo, tree vectype,
> MASK_TYPE is the type of both masks.  If new statements are needed,
> insert them before GSI.  */
>  
> -static tree
> +tree
>  prepare_vec_mask (loop_vec_info loop_vinfo, tree mask_type, tree loop_mask,
> tree vec_mask, gimple_stmt_iterator *gsi)
>  {
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 97ec9c341e7..1f87c6c8ca2 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -2508,6 +2508,8 @@ extern void vect_free_slp_tree (slp_tree);
>  extern bool compatible_calls_p (gcall *, gcall *);
>  extern int vect_slp_child_index_for_operand (const gimple *, int op, bool);
>  
> +extern tree prepare_vec_mask (loop_vec_info, tree, tree, tree, 
> gimple_stmt_iterator *);

Nit: long line.

OK with that fixed, thanks.

Richard

> +
>  /* In tree-vect-patterns.cc.  */
>  extern void
>  vect_mark_pattern_stmts (vec_info *, stmt_vec_info, gimple *, tree);


Re: [PATCH] internal-fn: Force to reg if operand doesn't match.

2024-06-10 Thread Richard Sandiford
Richard Biener  writes:
> On Mon, Jun 10, 2024 at 9:35 AM Robin Dapp  wrote:
>>
>> Hi,
>>
>> despite looking good on cfarm185 and Linaro's pre-commit CI
>> gcc-15-638-g7ca35f2e430 now appears to have caused several
>> regressions on arm-eabi cortex-m55 as found by Linaro's CI:
>>
>> https://linaro.atlassian.net/browse/GNU-1252
>>
>> I'm assuming this target is not tested as regularly and thus
>> the failures went unnoticed until now.
>>
>> So it looks like we do need the insn_operand_matches after all?
>
> But why does expand_vec_cond_optab_fn get away without?
> (note we want to get rid of that variant)
>
> Almost no other expander checks this either, though some
> can_* functions validate.  It's not exactly clear to me whether
> we are just lucky and really always need to validate or whether
> it's a bug in the target?

Sounds like a bug in the target (although I'm not sure from a quick
glance what it would be).

expand_insn is responsible for making sure that operands satisfy
predicates.  We shouldn't need to enforce the predicates beforehand.
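
I.e. the usual shape is simply (a minimal sketch, not the exact
expand_vec_cond_mask_optab_fn code):

  class expand_operand ops[4];
  create_output_operand (&ops[0], target, mode);
  create_input_operand (&ops[1], rtx_op1, mode);
  create_input_operand (&ops[2], rtx_op2, mode);
  create_input_operand (&ops[3], mask, mask_mode);
  expand_insn (icode, 4, ops);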

Thanks,
Richard

>
> Richard.
>
>> This patch only forces to register if the respective operands
>> do not already match.
>>
>> Bootstrap and regtest on aarch64 and x86 in progress.
>> Regtested on riscv64.
>>
>> Regards
>>  Robin
>>
>> gcc/ChangeLog:
>>
>> * internal-fn.cc (expand_vec_cond_mask_optab_fn): Only force to
>> reg if operand does not already match.
>> ---
>>  gcc/internal-fn.cc | 6 ++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
>> index 4948b48bde8..fa85fa69f5a 100644
>> --- a/gcc/internal-fn.cc
>> +++ b/gcc/internal-fn.cc
>> @@ -3162,7 +3162,13 @@ expand_vec_cond_mask_optab_fn (internal_fn, gcall 
>> *stmt, convert_optab optab)
>>gcc_assert (icode != CODE_FOR_nothing);
>>
>>mask = expand_normal (op0);
>> +  if (!insn_operand_matches (icode, 3, mask))
>> +mask = force_reg (mask_mode, mask);
>> +
>>rtx_op1 = expand_normal (op1);
>> +  if (!insn_operand_matches (icode, 1, rtx_op1))
>> +rtx_op1 = force_reg (mode, rtx_op1);
>> +
>>rtx_op2 = expand_normal (op2);
>>
>>rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
>> --
>> 2.45.1


Re: [PATCH] aarch64: Add fcsel to cmov integer and csel to float cmov [PR98477]

2024-06-10 Thread Richard Sandiford
Andrew Pinski  writes:
> This patch adds an alternative to the integer cmov and one to floating
> point cmov so we avoid in some more moving
>
>   PR target/98477
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.md (*cmov_insn[GPI]): Add 'w'
>   alternative.
>   (*cmov_insn[GPF]): Add 'r' alternative.
>   * config/aarch64/iterators.md (wv): New mode attr.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/csel_1.c: New test.
>   * gcc.target/aarch64/fcsel_2.c: New test.

This seems a bit dangerous while PR114766 remains unresolved.
The problem (AIUI) is that adding r and w alternatives to the
same pattern means that, when computing the cost of each register
class, r and w are equally cheap for each operand in isolation,
without the interdependencies being modelled.  (This is because
class preferences are calculated on a per-register basis.)

E.g. if:

- insn I1 is op0 = fn(op1, op2)
- I1 provides r and w alternatives
- other uses of op0 and op1 prefer r
- other uses of op2 prefer w

then I1 will not influence the costs of op0, op1 or op2, and so
I1 effectively will provide a zero-cost cross-over between r and w.

There again, we already have other patterns with this problem.  And I
think we do need to make it work somehow.  It's just a question of
whether we should pause adding more "r or w" alternatives until the
problem is fixed, or whether we should treat adding more alternatives
and fixing the RA problem as parallel work.

> Signed-off-by: Andrew Pinski 
> ---
>  gcc/config/aarch64/aarch64.md  | 13 +++
>  gcc/config/aarch64/iterators.md|  4 
>  gcc/testsuite/gcc.target/aarch64/csel_1.c  | 27 ++
>  gcc/testsuite/gcc.target/aarch64/fcsel_2.c | 20 
>  4 files changed, 59 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/csel_1.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/fcsel_2.c
>
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 2bdd443e71d..a6cedd0f1b8 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -4404,6 +4404,7 @@ (define_insn "*cmov_insn"
>   [ r, Ui1 , rZ  ; csel] csinc\t%0, %4, zr, %M1
>   [ r, UsM , UsM ; mov_imm ] mov\t%0, -1
>   [ r, Ui1 , Ui1 ; mov_imm ] mov\t%0, 1
> + [ w, w   , w   ; fcsel   ] fcsel\t%0, %3, %4, 
> %m1
>}
>  )
>  
> @@ -4464,15 +4465,17 @@ (define_insn "*cmovdi_insn_uxtw"
>  )
>  
>  (define_insn "*cmov_insn"
> -  [(set (match_operand:GPF 0 "register_operand" "=w")
> +  [(set (match_operand:GPF 0 "register_operand" "=r,w")
>   (if_then_else:GPF
>(match_operator 1 "aarch64_comparison_operator"
> [(match_operand 2 "cc_register" "") (const_int 0)])
> -  (match_operand:GPF 3 "register_operand" "w")
> -  (match_operand:GPF 4 "register_operand" "w")))]
> +  (match_operand:GPF 3 "register_operand" "r,w")
> +  (match_operand:GPF 4 "register_operand" "r,w")))]
>"TARGET_FLOAT"
> -  "fcsel\\t%0, %3, %4, %m1"
> -  [(set_attr "type" "fcsel")]
> +  "@
> +   csel\t%0, %3, %4, %m1
> +   fcsel\\t%0, %3, %4, %m1"
> +  [(set_attr "type" "fcsel,csel")]
>  )

I think we should use the new syntax for all new insns with more than
one alternative.
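
I.e. something like this (constraints taken from the patch; the output
templates are abbreviated and would still need the right mode-dependent
print modifiers):

  {@ [ cons: =0 , 3 , 4 ; attrs: type ]
     [ r        , r , r ; csel  ] csel\t%0, %3, %4, %m1
     [ w        , w , w ; fcsel ] fcsel\t%0, %3, %4, %m1
  }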

Thanks,
Richard

>  
>  (define_expand "movcc"
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 99cde46f1ba..42303f2ec02 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -1147,6 +1147,10 @@ (define_mode_attr e [(CCFP "") (CCFPE "e")])
>  ;; 32-bit version and "%x0" in the 64-bit version.
>  (define_mode_attr w [(QI "w") (HI "w") (SI "w") (DI "x") (SF "s") (DF "d")])
>  
> +;; For cmov template to be used with fscel instruction
> +(define_mode_attr wv [(QI "s") (HI "s") (SI "s") (DI "d") (SF "s") (DF "d")])
> +
> +
>  ;; The size of access, in bytes.
>  (define_mode_attr ldst_sz [(SI "4") (DI "8")])
>  ;; Likewise for load/store pair.
> diff --git a/gcc/testsuite/gcc.target/aarch64/csel_1.c 
> b/gcc/testsuite/gcc.target/aarch64/csel_1.c
> new file mode 100644
> index 000..5848e5be2ff
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/csel_1.c
> @@ -0,0 +1,27 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fno-ssa-phiopt" } */
> +/* PR target/98477 */
> +
> +/* We should be able to produce csel followed by a store
> +   and not move between the GPRs and simd registers. */
> +/* Note -fno-ssa-phiopt is needed, otherwise the tree level
> +   does the VCE after the cmov which allowed to use the csel
> +   instruction. */
> +_Static_assert (sizeof(long long) == sizeof(double));
> +void
> +foo (int a, double *b, long long c, long long d)
> +{
> +  double ct;
> +  double dt;
> +  __builtin_memcpy(, , sizeof(long long));
> +  __builtin_memcpy(, , sizeof(long long));
> +  double t = a ? ct : dt;
> +  *b = t;
> +}
> +
> +/* { 

Re: [PATCH] aarch64: Add vector floating point trunc pattern

2024-06-10 Thread Richard Sandiford
Pengxuan Zheng  writes:
> This patch is a follow-up of r15-1079-g230d62a2cdd16c to add vector floating
> point trunc pattern for V2DF->V2SF and V4SF->V4HF conversions by renaming the
> existing aarch64_float_truncate_lo_ pattern to the 
> standard
> optab one, i.e., trunc2. This allows the vectorizer
> to vectorize certain floating point narrowing operations for the aarch64 
> target.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-builtins.cc (VAR1): Remap float_truncate_lo_
>   builtin codes to standard optab ones.
>   * config/aarch64/aarch64-simd.md 
> (aarch64_float_truncate_lo_):
>   Rename to...
>   (trunc2): ... This.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/trunc-vec.c: New test.

OK, thanks.

Richard

> Signed-off-by: Pengxuan Zheng 
> ---
>  gcc/config/aarch64/aarch64-builtins.cc   |  7 +++
>  gcc/config/aarch64/aarch64-simd.md   |  6 +++---
>  gcc/testsuite/gcc.target/aarch64/trunc-vec.c | 21 
>  3 files changed, 31 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/trunc-vec.c
>
> diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
> b/gcc/config/aarch64/aarch64-builtins.cc
> index 25189888d17..d589e59defc 100644
> --- a/gcc/config/aarch64/aarch64-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-builtins.cc
> @@ -543,6 +543,13 @@ BUILTIN_VDQ_BHSI (uhadd, uavg, _floor, 0)
>  VAR1 (float_extend_lo_, extend, v2sf, v2df)
>  VAR1 (float_extend_lo_, extend, v4hf, v4sf)
>  
> +/* __builtin_aarch64_float_truncate_lo_ should be expanded through the
> +   standard optabs CODE_FOR_trunc2. */
> +constexpr insn_code CODE_FOR_aarch64_float_truncate_lo_v4hf
> += CODE_FOR_truncv4sfv4hf2;
> +constexpr insn_code CODE_FOR_aarch64_float_truncate_lo_v2sf
> += CODE_FOR_truncv2dfv2sf2;
> +
>  #undef VAR1
>  #define VAR1(T, N, MAP, FLAG, A) \
>{#N #A, UP (A), CF##MAP (N, A), 0, TYPES_##T, FLAG_##FLAG},
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index c5e2c9f00d0..f644bd1731e 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -3197,7 +3197,7 @@ (define_expand "aarch64_float_trunc_rodd_hi_v4sf"
>  }
>  )
>  
> -(define_insn "aarch64_float_truncate_lo_"
> +(define_insn "trunc2"
>[(set (match_operand:VDF 0 "register_operand" "=w")
>(float_truncate:VDF
>   (match_operand: 1 "register_operand" "w")))]
> @@ -3256,7 +3256,7 @@ (define_expand "vec_pack_trunc_v2df"
>  int lo = BYTES_BIG_ENDIAN ? 2 : 1;
>  int hi = BYTES_BIG_ENDIAN ? 1 : 2;
>  
> -emit_insn (gen_aarch64_float_truncate_lo_v2sf (tmp, operands[lo]));
> +emit_insn (gen_truncv2dfv2sf2 (tmp, operands[lo]));
>  emit_insn (gen_aarch64_float_truncate_hi_v4sf (operands[0],
>  tmp, operands[hi]));
>  DONE;
> @@ -3272,7 +3272,7 @@ (define_expand "vec_pack_trunc_df"
>{
>  rtx tmp = gen_reg_rtx (V2SFmode);
>  emit_insn (gen_aarch64_vec_concatdf (tmp, operands[1], operands[2]));
> -emit_insn (gen_aarch64_float_truncate_lo_v2sf (operands[0], tmp));
> +emit_insn (gen_truncv2dfv2sf2 (operands[0], tmp));
>  DONE;
>}
>  )
> diff --git a/gcc/testsuite/gcc.target/aarch64/trunc-vec.c 
> b/gcc/testsuite/gcc.target/aarch64/trunc-vec.c
> new file mode 100644
> index 000..05e8af7912d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/trunc-vec.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +/* { dg-final { scan-assembler-times {fcvtn\tv[0-9]+.2s, v[0-9]+.2d} 1 } } */
> +void
> +f (double *__restrict a, float *__restrict b)
> +{
> +  b[0] = a[0];
> +  b[1] = a[1];
> +}
> +
> +/* { dg-final { scan-assembler-times {fcvtn\tv[0-9]+.4h, v[0-9]+.4s} 1 } } */
> +void
> +f1 (float *__restrict a, _Float16 *__restrict b)
> +{
> +
> +  b[0] = a[0];
> +  b[1] = a[1];
> +  b[2] = a[2];
> +  b[3] = a[3];
> +}


Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

2024-06-10 Thread Richard Sandiford
Ajit Agarwal  writes:
> On 10/06/24 3:20 pm, Richard Sandiford wrote:
>> Ajit Agarwal  writes:
>>> On 10/06/24 2:52 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal  writes:
>>>>> On 10/06/24 2:12 pm, Richard Sandiford wrote:
>>>>>> Ajit Agarwal  writes:
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> + rtx set = single_set (insn);
>>>>>>>>>>>>>>> + if (set == NULL_RTX)
>>>>>>>>>>>>>>> +   return false;
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> + rtx op0 = SET_SRC (set);
>>>>>>>>>>>>>>> + rtx_code code = GET_CODE (op0);
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> + // This check is added as register pairs are not 
>>>>>>>>>>>>>>> generated
>>>>>>>>>>>>>>> + // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>>>>>>>>> + //  (reg2)
>>>>>>>>>>>>>>> + //  (neg:V2DF (reg3)))
>>>>>>>>>>>>>>> + if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>>>>>>>>> +   return false;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What's special about (neg (fma ...))?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am not sure why register allocator fails allocating register 
>>>>>>>>>>>>> pairs with
>>>>>>>>>>>>> NEG Unary operation with fma operand. I have not debugged 
>>>>>>>>>>>>> register allocator why the NEG
>>>>>>>>>>>>> Unary operation with fma operand. 
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 
>>>>>>>>>>> bits are
>>>>>>>>>>> set correctly. 
>>>>>>>>>>> IRA marked them spill candidates as spill priority is zero.
>>>>>>>>>>>
>>>>>>>>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>>>>>>>>
>>>>>>>>>> I think this is just restating the symptom though.  I suppose the 
>>>>>>>>>> same
>>>>>>>>>> kind of questions apply here too: what was the instruction before the
>>>>>>>>>> pass runs, what was the instruction after the pass runs, and why is
>>>>>>>>>> the rtl change incorrect (by the meaning above)?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Original case where we dont do load fusion, spill happens, in that
>>>>>>>>> case we dont require sequential register pairs to be generated for 2 
>>>>>>>>> loads
>>>>>>>>> for. Hence it worked.
>>>>>>>>>
>>>>>>>>> rtl change is correct and there is no error.
>>>>>>>>>
>>>>>>>>> for load fusion spill happens and we dont generate sequential 
>>>>>>>>> register pairs
>>>>>>>>> because pf spill candidate and lxvp gives incorrect results as 
>>>>>>>>> sequential register
>>>>>>>>> pairs are required for lxvp.
>>>>>>>>
>>>>>>>> Can you go into more detail?  How is the lxvp represented?  And how do
>>>>>>>> we end up not getting a sequential register pair?  What does the rtl
>>>>>>>> look like (before and after things have gone wrong)?
>>>>>>>>
>>>>>>>> It seems like either the rtl is not describing the result of the fusion
>>>>>>>> correctly or there is some problem in the .md description of lxvp.
>>

Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

2024-06-10 Thread Richard Sandiford
Ajit Agarwal  writes:
> Hello Richard:
>
> On 10/06/24 2:52 pm, Richard Sandiford wrote:
>> Ajit Agarwal  writes:
>>> On 10/06/24 2:12 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal  writes:
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +   rtx set = single_set (insn);
>>>>>>>>>>>>> +   if (set == NULL_RTX)
>>>>>>>>>>>>> + return false;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +   rtx op0 = SET_SRC (set);
>>>>>>>>>>>>> +   rtx_code code = GET_CODE (op0);
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +   // This check is added as register pairs are not generated
>>>>>>>>>>>>> +   // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>>>>>>> +   //  (reg2)
>>>>>>>>>>>>> +   //  (neg:V2DF (reg3)))
>>>>>>>>>>>>> +   if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>>>>>>> + return false;
>>>>>>>>>>>>
>>>>>>>>>>>> What's special about (neg (fma ...))?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I am not sure why register allocator fails allocating register 
>>>>>>>>>>> pairs with
>>>>>>>>>>> NEG Unary operation with fma operand. I have not debugged register 
>>>>>>>>>>> allocator why the NEG
>>>>>>>>>>> Unary operation with fma operand. 
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 
>>>>>>>>> bits are
>>>>>>>>> set correctly. 
>>>>>>>>> IRA marked them spill candidates as spill priority is zero.
>>>>>>>>>
>>>>>>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>>>>>>
>>>>>>>> I think this is just restating the symptom though.  I suppose the same
>>>>>>>> kind of questions apply here too: what was the instruction before the
>>>>>>>> pass runs, what was the instruction after the pass runs, and why is
>>>>>>>> the rtl change incorrect (by the meaning above)?
>>>>>>>>
>>>>>>>
>>>>>>> Original case where we dont do load fusion, spill happens, in that
>>>>>>> case we dont require sequential register pairs to be generated for 2 
>>>>>>> loads
>>>>>>> for. Hence it worked.
>>>>>>>
>>>>>>> rtl change is correct and there is no error.
>>>>>>>
>>>>>>> for load fusion spill happens and we dont generate sequential register 
>>>>>>> pairs
>>>>>>> because pf spill candidate and lxvp gives incorrect results as 
>>>>>>> sequential register
>>>>>>> pairs are required for lxvp.
>>>>>>
>>>>>> Can you go into more detail?  How is the lxvp represented?  And how do
>>>>>> we end up not getting a sequential register pair?  What does the rtl
>>>>>> look like (before and after things have gone wrong)?
>>>>>>
>>>>>> It seems like either the rtl is not describing the result of the fusion
>>>>>> correctly or there is some problem in the .md description of lxvp.
>>>>>>
>>>>>
>>>>> After fusion pass:
>>>>>
>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>>>> [240])
>>>>> (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>> (const_int 16 [0x10])) [1 MEM  
>>>>> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 
>>>>> {vsx_movv2df_64bit}
>>>>>  (nil))
>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>>>> [240])
>>>>> 

Re: [PATCH] vect: Merge loop mask and cond_op mask in fold-left, reduction.

2024-06-10 Thread Richard Sandiford
Richard Sandiford  writes:
> Robin Dapp  writes:
>> Hi,
>>
>> currently we discard the cond-op mask when the loop is fully masked
>> which causes wrong code in
>> gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c
>> when compiled with
>> -O3 -march=cascadelake --param vect-partial-vector-usage=2.
>>
>> This patch ANDs both masks instead.
>>
>> Bootstrapped and regtested on x86, aarch64 and power10.
>> Regtested on riscv64 and armv8.8-a+sve via qemu.
>>
>> Regards
>>  Robin
>>
>> gcc/ChangeLog:
>>
>>  * tree-vect-loop.cc (vectorize_fold_left_reduction): Merge loop
>>  mask and cond-op mask.
>
> OK, thanks.

Actually, as Richard mentioned in the PR, it would probably be better
to use prepare_vec_mask instead.  It should work in this context too
and would avoid redundant double masking.

Thanks,
Richard

>
>> ---
>>  gcc/tree-vect-loop.cc | 16 +++-
>>  1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
>> index 028692614bb..f9bf6a45611 100644
>> --- a/gcc/tree-vect-loop.cc
>> +++ b/gcc/tree-vect-loop.cc
>> @@ -7215,7 +7215,21 @@ vectorize_fold_left_reduction (loop_vec_info 
>> loop_vinfo,
>>tree len = NULL_TREE;
>>tree bias = NULL_TREE;
>>if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
>> -mask = vect_get_loop_mask (loop_vinfo, gsi, masks, vec_num, vectype_in, 
>> i);
>> +{
>> +  tree mask_loop = vect_get_loop_mask (loop_vinfo, gsi, masks,
>> +   vec_num, vectype_in, i);
>> +  if (is_cond_op)
>> +{
>> +  /* Merge the loop mask and the cond_op mask.  */
>> +  mask = make_ssa_name (TREE_TYPE (mask_loop));
>> +  gassign *and_stmt = gimple_build_assign (mask, BIT_AND_EXPR,
>> +   mask_loop,
>> +   vec_opmask[i]);
>> +  gsi_insert_before (gsi, and_stmt, GSI_SAME_STMT);
>> +}
>> +  else
>> +mask = mask_loop;
>> +}
>>else if (is_cond_op)
>>  mask = vec_opmask[i];
>>if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))


Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

2024-06-10 Thread Richard Sandiford
Ajit Agarwal  writes:
> On 10/06/24 2:12 pm, Richard Sandiford wrote:
>> Ajit Agarwal  writes:
>>>>>>>>>>> +
>>>>>>>>>>> + rtx set = single_set (insn);
>>>>>>>>>>> + if (set == NULL_RTX)
>>>>>>>>>>> +   return false;
>>>>>>>>>>> +
>>>>>>>>>>> + rtx op0 = SET_SRC (set);
>>>>>>>>>>> + rtx_code code = GET_CODE (op0);
>>>>>>>>>>> +
>>>>>>>>>>> + // This check is added as register pairs are not generated
>>>>>>>>>>> + // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>>>>> + //  (reg2)
>>>>>>>>>>> + //  (neg:V2DF (reg3)))
>>>>>>>>>>> + if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>>>>> +   return false;
>>>>>>>>>>
>>>>>>>>>> What's special about (neg (fma ...))?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am not sure why register allocator fails allocating register pairs 
>>>>>>>>> with
>>>>>>>>> NEG Unary operation with fma operand. I have not debugged register 
>>>>>>>>> allocator why the NEG
>>>>>>>>> Unary operation with fma operand. 
>>>>>>>>
>>>>>>>
>>>>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits 
>>>>>>> are
>>>>>>> set correctly. 
>>>>>>> IRA marked them spill candidates as spill priority is zero.
>>>>>>>
>>>>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>>>>
>>>>>> I think this is just restating the symptom though.  I suppose the same
>>>>>> kind of questions apply here too: what was the instruction before the
>>>>>> pass runs, what was the instruction after the pass runs, and why is
>>>>>> the rtl change incorrect (by the meaning above)?
>>>>>>
>>>>>
>>>>> Original case where we dont do load fusion, spill happens, in that
>>>>> case we dont require sequential register pairs to be generated for 2 loads
>>>>> for. Hence it worked.
>>>>>
>>>>> rtl change is correct and there is no error.
>>>>>
>>>>> for load fusion spill happens and we dont generate sequential register 
>>>>> pairs
>>>>> because pf spill candidate and lxvp gives incorrect results as sequential 
>>>>> register
>>>>> pairs are required for lxvp.
>>>>
>>>> Can you go into more detail?  How is the lxvp represented?  And how do
>>>> we end up not getting a sequential register pair?  What does the rtl
>>>> look like (before and after things have gone wrong)?
>>>>
>>>> It seems like either the rtl is not describing the result of the fusion
>>>> correctly or there is some problem in the .md description of lxvp.
>>>>
>>>
>>> After fusion pass:
>>>
>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>> [240])
>>> (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>> (const_int 16 [0x10])) [1 MEM  
>>> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 
>>> {vsx_movv2df_64bit}
>>>  (nil))
>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>> [240])
>>> (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM  
>>> [(real(kind=8) *)_4050]+16 ])
>>> (reg:V2DF 44 12 [3119])
>>> (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
>>> [240]) {*vsx_nfmsv2df4}
>>>  (nil))
>>>
>>> In LRA reload.
>>>
>>> (insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
>>> (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM 
>>>  [(real(kind=8) *)_4188]+0 S16 A64])) 
>>> "shell_lam.fppized.f":238:72 2187 {*movoo}
>>>  (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtm

Re: [PATCH] vect: Merge loop mask and cond_op mask in fold-left, reduction.

2024-06-10 Thread Richard Sandiford
Robin Dapp  writes:
> Hi,
>
> currently we discard the cond-op mask when the loop is fully masked
> which causes wrong code in
> gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c
> when compiled with
> -O3 -march=cascadelake --param vect-partial-vector-usage=2.
>
> This patch ANDs both masks instead.
>
> Bootstrapped and regtested on x86, aarch64 and power10.
> Regtested on riscv64 and armv8.8-a+sve via qemu.
>
> Regards
>  Robin
>
> gcc/ChangeLog:
>
>   * tree-vect-loop.cc (vectorize_fold_left_reduction): Merge loop
>   mask and cond-op mask.

OK, thanks.

Richard

> ---
>  gcc/tree-vect-loop.cc | 16 +++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 028692614bb..f9bf6a45611 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -7215,7 +7215,21 @@ vectorize_fold_left_reduction (loop_vec_info 
> loop_vinfo,
>tree len = NULL_TREE;
>tree bias = NULL_TREE;
>if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> - mask = vect_get_loop_mask (loop_vinfo, gsi, masks, vec_num, vectype_in, 
> i);
> + {
> +   tree mask_loop = vect_get_loop_mask (loop_vinfo, gsi, masks,
> +vec_num, vectype_in, i);
> +   if (is_cond_op)
> + {
> +   /* Merge the loop mask and the cond_op mask.  */
> +   mask = make_ssa_name (TREE_TYPE (mask_loop));
> +   gassign *and_stmt = gimple_build_assign (mask, BIT_AND_EXPR,
> +mask_loop,
> +vec_opmask[i]);
> +   gsi_insert_before (gsi, and_stmt, GSI_SAME_STMT);
> + }
> +   else
> + mask = mask_loop;
> + }
>else if (is_cond_op)
>   mask = vec_opmask[i];
>if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))


Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

2024-06-10 Thread Richard Sandiford
Ajit Agarwal  writes:
> +
> +   rtx set = single_set (insn);
> +   if (set == NULL_RTX)
> + return false;
> +
> +   rtx op0 = SET_SRC (set);
> +   rtx_code code = GET_CODE (op0);
> +
> +   // This check is added as register pairs are not generated
> +   // by RA for neg:V2DF (fma: V2DF (reg1)
> +   //  (reg2)
> +   //  (neg:V2DF (reg3)))
> +   if (GET_RTX_CLASS (code) == RTX_UNARY)
> + return false;

 What's special about (neg (fma ...))?

>>>
>>> I am not sure why register allocator fails allocating register pairs 
>>> with
>>> NEG Unary operation with fma operand. I have not debugged register 
>>> allocator why the NEG
>>> Unary operation with fma operand. 
>>
>
> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits 
> are
> set correctly. 
> IRA marked them spill candidates as spill priority is zero.
>
> Due to this LRA reload pass couldn't allocate register pairs.

 I think this is just restating the symptom though.  I suppose the same
 kind of questions apply here too: what was the instruction before the
 pass runs, what was the instruction after the pass runs, and why is
 the rtl change incorrect (by the meaning above)?

>>>
>>> Original case where we dont do load fusion, spill happens, in that
>>> case we dont require sequential register pairs to be generated for 2 loads
>>> for. Hence it worked.
>>>
>>> rtl change is correct and there is no error.
>>>
>>> for load fusion spill happens and we dont generate sequential register pairs
>>> because pf spill candidate and lxvp gives incorrect results as sequential 
>>> register
>>> pairs are required for lxvp.
>> 
>> Can you go into more detail?  How is the lxvp represented?  And how do
>> we end up not getting a sequential register pair?  What does the rtl
>> look like (before and after things have gone wrong)?
>> 
>> It seems like either the rtl is not describing the result of the fusion
>> correctly or there is some problem in the .md description of lxvp.
>> 
>
> After fusion pass:
>
> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
> (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
> (const_int 16 [0x10])) [1 MEM  
> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 
> {vsx_movv2df_64bit}
>  (nil))
> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
> (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM  
> [(real(kind=8) *)_4050]+16 ])
> (reg:V2DF 44 12 [3119])
> (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] 
> [240]) {*vsx_nfmsv2df4}
>  (nil))
>
> In LRA reload.
>
> (insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
> (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM  real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])) 
> "shell_lam.fppized.f":238:72 2187 {*movoo}
>  (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) 
> [1 MEM  [(real(kind=8) *)_4188]+0 S16 A64])
> (nil)))
> (insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ])
> (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM  real(kind=8)> [(real(kind=8) *)_4050] ]) 16)
> (reg:V2DF 4283 [3119])
> (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 
> 16)  {*vsx_nfmsv2df4}
>  (nil))
>
>
> In LRA reload sequential registers are not generated as r2572 is splled and 
> move to spill location
> in stack and subsequent uses loads from stack. Hence sequential registers 
> pairs are not generated.
>
> lxvp vsx0, 0(r1).
>
> It loads from from r1+0 into vsx0 and vsx1 and appropriate uses use 
> sequential register pairs.
>
> Without load fusion since 2 loads exists and 2 loads need not require 
> sequential registers
> hence it worked but with load fusion and using lxvp it requires sequential 
> register pairs.

Do you mean that this is a performance regression?  I.e. the fact that
lxvp requires sequential registers causes extra spilling, due to having
less allocation freedom?

Or is it a correctness problem?  If so, what is it?  Nothing in the rtl
above looks wrong in principle (although I've no idea if the REG_EQUIV
is correct in this context).  What does the allocated code look like,
and why is it wrong?

If (reg:OO 2561) is spilled and then one half of it is used, only that half
needs to be loaded from the spill slot.  E.g. if (reg:OO 2561) is reloaded
for insn 2412 on its own, only the second half of the register needs to be
loaded from memory.

Richard



Re: [RFC/RFA] [PATCH 06/12] aarch64: Implement new expander for efficient CRC computation

2024-06-08 Thread Richard Sandiford
Mariam Arutunian  writes:
> This patch introduces two new expanders for the aarch64 backend,
> dedicated to generate optimized code for CRC computations.
> The new expanders are designed to leverage specific hardware capabilities
> to achieve faster CRC calculations,
> particularly using the pmul or crc32 instructions when supported by the
> target architecture.

Thanks for porting this to aarch64!

> Expander 1: Bit-Forward CRC (crc4)
> For targets that support pmul instruction (TARGET_AES),
> the expander will generate code that uses the pmul (crypto_pmulldi)
> instruction for CRC computation.
>
> Expander 2: Bit-Reversed CRC (crc_rev4)
> The expander first checks if the target supports the CRC32 instruction set
> (TARGET_CRC32)
> and the polynomial in use is 0x1EDC6F41 (iSCSI). If the conditions are met,
> it emits calls to the corresponding crc32 instruction (crc32b, crc32h,
> crc32w, or crc32x depending on the data size).
> If the target does not support crc32 but supports pmul, it then uses the
> pmul (crypto_pmulldi) instruction for bit-reversed CRC computation.
>
> Otherwise table-based CRC is generated.
>
>   gcc/config/aarch64/
>
> * aarch64-protos.h (aarch64_expand_crc_using_clmul): New extern
> function declaration.
> (aarch64_expand_reversed_crc_using_clmul):  Likewise.
> * aarch64.cc (aarch64_expand_crc_using_clmul): New function.
> (aarch64_expand_reversed_crc_using_clmul):  Likewise.
> * aarch64.md (UNSPEC_CRC, UNSPEC_CRC_REV):  New unspecs.
> (crc_rev4): New expander for reversed CRC.
> (crc4): New expander for reversed CRC.
> * iterators.md (crc_data_type): New mode attribute.
>
>   gcc/testsuite/gcc.target/aarch64/
>
> * crc-1-pmul.c: Likewise.
> * crc-10-pmul.c: Likewise.
> * crc-12-pmul.c: Likewise.
> * crc-13-pmul.c: Likewise.
> * crc-14-pmul.c: Likewise.
> * crc-17-pmul.c: Likewise.
> * crc-18-pmul.c: Likewise.
> * crc-21-pmul.c: Likewise.
> * crc-22-pmul.c: Likewise.
> * crc-23-pmul.c: Likewise.
> * crc-4-pmul.c: Likewise.
> * crc-5-pmul.c: Likewise.
> * crc-6-pmul.c: Likewise.
> * crc-7-pmul.c: Likewise.
> * crc-8-pmul.c: Likewise.
> * crc-9-pmul.c: Likewise.
> * crc-CCIT-data16-pmul.c: Likewise.
> * crc-CCIT-data8-pmul.c: Likewise.
> * crc-coremark-16bitdata-pmul.c: Likewise.
> * crc-crc32-data16.c: New test.
> * crc-crc32-data32.c: Likewise.
> * crc-crc32-data8.c: Likewise.
>
> Signed-off-by: Mariam Arutunian 
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index 1d3f94c813e..167e1140f0d 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -1117,5 +1117,8 @@ extern void mingw_pe_encode_section_info (tree, rtx, 
> int);
>  
>  bool aarch64_optimize_mode_switching (aarch64_mode_entity);
>  void aarch64_restore_za (rtx);
> +void aarch64_expand_crc_using_clmul (rtx *);
> +void aarch64_expand_reversed_crc_using_clmul (rtx *);
> +
>  
>  #endif /* GCC_AARCH64_PROTOS_H */
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index ee12d8897a8..05cd0296d38 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -30265,6 +30265,135 @@ aarch64_retrieve_sysreg (const char *regname, bool 
> write_p, bool is128op)
>return sysreg->encoding;
>  }
>  
> +/* Generate assembly to calculate CRC
> +   using carry-less multiplication instruction.
> +   OPERANDS[1] is input CRC,
> +   OPERANDS[2] is data (message),
> +   OPERANDS[3] is the polynomial without the leading 1.  */
> +
> +void
> +aarch64_expand_crc_using_clmul (rtx *operands)

This should probably be pmul rather than clmul.

> +{
> +  /* Check and keep arguments.  */
> +  gcc_assert (!CONST_INT_P (operands[0]));
> +  gcc_assert (CONST_INT_P (operands[3]));
> +  rtx crc = operands[1];
> +  rtx data = operands[2];
> +  rtx polynomial = operands[3];
> +
> +  unsigned HOST_WIDE_INT
> +  crc_size = GET_MODE_BITSIZE (GET_MODE (operands[0])).to_constant ();
> +  gcc_assert (crc_size <= 32);
> +  unsigned HOST_WIDE_INT
> +  data_size = GET_MODE_BITSIZE (GET_MODE (data)).to_constant ();

We could instead make the interface:

void
aarch64_expand_crc_using_pmul (scalar_mode crc_mode, scalar_mode data_mode,
   rtx *operands)

so that the lines above don't need the to_constant.  This should "just
work" on the .md file side, since the modes being passed are naturally
scalar_mode.

I think it'd be worth asserting also that data_size <= crc_size.
(Although we could handle any MAX (data_size, crc_size) <= 32
with some adjustment.)

> +
> +  /* Calculate the quotient.  */
> +  unsigned HOST_WIDE_INT
> +  q = gf2n_poly_long_div_quotient (UINTVAL (polynomial), crc_size + 1);
> +
> +  /* CRC calculation's main part.  */
> +  if (crc_size > data_size)
> +crc = expand_shift (RSHIFT_EXPR, DImode, crc, crc_size - data_size,
> + 

Re: [RFC/RFA] [PATCH 03/12] RISC-V: Add CRC expander to generate faster CRC.

2024-06-08 Thread Richard Sandiford
Thanks a lot for doing this!  It's a really nice series.

Just had a comment on the long division helper:

Mariam Arutunian  writes:
> +/* Return the quotient of polynomial long division of x^2N by POLYNOMIAL
> +   in GF (2^N).  */

It looks like there might be an off-by-one discrepancy between the comment
and the code.  The comment suggests that N is the degree of the polynomial
(crc_size), whereas the callers seem to pass crc_size + 1.  This doesn't
matter in practice since...

> +
> +unsigned HOST_WIDE_INT
> +gf2n_poly_long_div_quotient (unsigned HOST_WIDE_INT polynomial, size_t n)
> +{
> +  vec x2n;
> +  vec pol, q;
> +  /* Create vector of bits, for the polynomial.  */
> +  pol.create (n + 1);
> +  for (size_t i = 0; i < n; i++)
> +{
> +  pol.quick_push (polynomial & 1);
> +  polynomial >>= 1;
> +}
> +  pol.quick_push (1);
> +
> +  /* Create vector for x^2n polynomial.  */
> +  x2n.create (2 * n - 1);
> +  for (size_t i = 0; i < 2 * (n - 1); i++)
> +x2n.safe_push (0);
> +  x2n.safe_push (1);

...this compensates by setting the dividend to x^(2N-2).  And although
the first loop reads crc_size+1 bits from polynomial before adding the
implicit leading 1, only the low crc_size elements of poly affect the
result.

If we do pass crc_size as N, a simpler way of writing the routine might be:

{
  /* The result has degree N, so needs N + 1 bits.  */
  gcc_assert (n < 64);

  /* Perform a division step for the x^2N coefficient.  At this point the
 quotient and remainder have N implicit trailing zeros.  */
  unsigned HOST_WIDE_INT quotient = 1;
  unsigned HOST_WIDE_INT remainder = polynomial;

  /* Process the coefficients for x^(2N-1) down to x^N, with each step
 reducing the number of implicit trailing zeros by one.  */
  for (unsigned int i = 0; i < n; ++i)
{
  bool coeff = remainder & (HOST_WIDE_INT_1U << (n - 1));
  quotient = (quotient << 1) | coeff;
  remainder = (remainder << 1) ^ (coeff ? polynomial : 0);
}
  return quotient;
}

I realise there are many ways of writing this out there though,
so that's just a suggestion.  (And only lightly tested.)

FWIW, we could easily extend the interface to work on wide_ints if we
ever need it for N>63.  

Thanks,
Richard

> +
> +  q.create (n);
> +  for (size_t i = 0; i < n; i++)
> +q.quick_push (0);
> +
> +  /* Calculate the quotient of x^2n/polynomial.  */
> +  for (int i = n - 1; i >= 0; i--)
> +{
> +  int d = x2n[i + n - 1];
> +  if (d == 0)
> + continue;
> +  for (int j = i + n - 1; j >= i; j--)
> + x2n[j] ^= (pol[j - i]);
> +  q[i] = 1;
> +}
> +
> +  /* Get the number from the vector of 0/1s.  */
> +  unsigned HOST_WIDE_INT quotient = 0;
> +  for (size_t i = 0; i < q.length (); i++)
> +{
> +  quotient <<= 1;
> +  quotient = quotient | q[q.length () - i - 1];
> +}
> +  return quotient;
> +}


Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

2024-06-07 Thread Richard Sandiford
Ajit Agarwal  writes:
>>> +
>>> +  df_ref use;
>>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (info->rtl ());
>>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>>> +{
>>> +  struct df_link *def_link = DF_REF_CHAIN (use);
>>> +
>>> +  if (!def_link || !def_link->ref
>>> + || DF_REF_IS_ARTIFICIAL (def_link->ref))
>>> +   continue;
>>> +
>>> +  while (def_link && def_link->ref)
>>> +   {
>>> + rtx_insn *insn = DF_REF_INSN (def_link->ref);
>>> + if (GET_CODE (PATTERN (insn)) == PARALLEL)
>>> +   return false;
>>
>> Why do you need to skip PARALLELs?
>>
>
> vec_select with parallel give failures final.cc "can't split-up with 
> subreg 128 (reg OO"
> Thats why I have added this.

 But in (vec_select ... (parallel ...)), the parallel won't be the 
 PATTERN (insn).  It'll instead be a suboperand of the vec_select.

 Here too it's important to understand why the final.cc failure occurs
 and what the correct fix is.

>>>
>>> subreg with vec_select operand already exists before fusion pass.
>>> We overwrite them with subreg 128 bits from 256 OO mode operand.
>> 
>> But why is that wrong?  What was the full rtl of the subreg before the
>> pass runs, what did the subreg look like after the pass, and why is the
>> change not correct?
>> 
>> In general, there are two main ways that an rtl change can be incorrect:
>> 
>> (1) The new rtl isn't well-formed (such as (subreg (subreg X A) B)).
>> In this case, the new rtl makes no inherent sense when viewed
>> in isolation: it isn't necessary to see the old rtl to tell that
>> the new rtl is wrong.
>> 
>> (2) The new rtl is well-formed (i.e. makes inherent sense when viewed in
>> isolation) but it does not have the same semantics as the old rtl.
>> In other words, the new rtl is describing a different operation
>> from the old rtl.
>> 
>> I think we need to talk about it in those terms, rather than where
>> the eventual ICE occurs.
>> 
> Before the fusion.
> old rtl looks like this:
>
> (vec_select:HI (subreg:V8HI (reg:V16QI 125 [ vect__29.38 ]) 0)
>
> After the fusion
> new rtl looks like this:
>
> (vec_select:HI (subreg:V16QI (reg:OO 125 [ vect__29.38 ]) 16)
>
> new rtl is not well formed.
>
> Thats why its failing.
>
> reg:v16QI 125 is the destination of the load that needs to be fused.

This indicates that there's a bug in the substitution code.

It's probably better to create a fresh OO register, rather than
change an existing 128-bit register to 256 bits.  If we do that,
and if reg:V16QI 125 is the destination of the second load
(which I assume it is from the 16 offset in the subreg),
then the new RTL should be:

  (vec_select:HI (subreg:V8HI (reg:OO NEW_REG) 16) ...)

It's possible to get this by using insn_propagation to replace
(reg:V16QI 125) with (subreg:V16QI (reg:OO NEW_REG) 16).
insn_propagation should then take care of the rest.

There are no existing rtl-ssa routines for handling new registers
though.  (The idea was to add things as the need arose.)
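
A rough sketch of that substitution step (illustrative only: the
variable names, the direct use of OOmode, and the failure handling are
assumptions rather than code from the patch):

  /* new_reg is a fresh 256-bit pseudo; old_reg is the original
     (reg:V16QI 125); insn is an insn that still refers to old_reg.  */
  rtx new_reg = gen_reg_rtx (OOmode);
  rtx half = gen_rtx_SUBREG (V16QImode, new_reg, 16);
  insn_propagation prop (insn, old_reg, half);
  if (!prop.apply_to_pattern (&PATTERN (insn)))
    /* The substitution failed; give up on fusing this pair.  */
    return false;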

>>> Due to this in final.cc we couldnt splt at line 2807 and bails
>>> out fatal_insn.
>>>
>>> Currently we dont support already existing subreg vector operand
>>> to generate register pairs.
>>> We should bail out from fusion pass in this case.
>>> +
>>> + rtx set = single_set (insn);
>>> + if (set == NULL_RTX)
>>> +   return false;
>>> +
>>> + rtx op0 = SET_SRC (set);
>>> + rtx_code code = GET_CODE (op0);
>>> +
>>> + // This check is added as register pairs are not generated
>>> + // by RA for neg:V2DF (fma: V2DF (reg1)
>>> + //  (reg2)
>>> + //  (neg:V2DF (reg3)))
>>> + if (GET_RTX_CLASS (code) == RTX_UNARY)
>>> +   return false;
>>
>> What's special about (neg (fma ...))?
>>
>
> I am not sure why register allocator fails allocating register pairs with
> NEG Unary operation with fma operand. I have not debugged register 
> allocator why the NEG
> Unary operation with fma operand. 

>>>
>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits are
>>> set correctly. 
>>> IRA marked them spill candidates as spill priority is zero.
>>>
>>> Due to this LRA reload pass couldn't allocate register pairs.
>> 
>> I think this is just restating the symptom though.  I suppose the same
>> kind of questions apply here too: what was the instruction before the
>> pass runs, what was the instruction after the pass runs, and why is
>> the rtl change incorrect (by the meaning above)?
>> 
>
> Original case where we dont do load fusion, spill happens, in that
> case we dont require sequential register pairs to be generated for 2 loads
> for. Hence it worked.
>
> rtl change is correct and there is no error.
>

Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

2024-06-06 Thread Richard Sandiford
Ajit Agarwal  writes:
> On 06/06/24 8:03 pm, Richard Sandiford wrote:
>> Ajit Agarwal  writes:
>>> On 06/06/24 2:28 pm, Richard Sandiford wrote:
>>>> Hi,
>>>>
>>>> Just some comments on the fuseable_load_p part, since that's what
>>>> we were discussing last time.
>>>>
>>>> It looks like this now relies on:
>>>>
>>>> Ajit Agarwal  writes:
>>>>> +  /* We use DF data flow because we change location rtx
>>>>> +  which is easier to find and modify.
>>>>> +  We use mix of rtl-ssa def-use and DF data flow
>>>>> +  where it is easier.  */
>>>>> +  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
>>>>> +  df_analyze ();
>>>>> +  df_set_flags (DF_DEFER_INSN_RESCAN);
>>>>
>>>> But please don't do this!  For one thing, building DU/UD chains
>>>> as well as rtl-ssa is really expensive in terms of compile time.
>>>> But more importantly, modifications need to happen via rtl-ssa
>>>> to ensure that the IL is kept up-to-date.  If we don't do that,
>>>> later fuse attempts will be based on stale data and so could
>>>> generate incorrect code.
>>>>
>>>
>>> Sure I have made changes to use only rtl-ssa and not to use
>>> UD/DU chains. I will send the changes in separate subsequent
>>> patch.
>> 
>> Thanks.  Before you send the patch though:
>> 
>>>>> +// Check whether load can be fusable or not.
>>>>> +// Return true if fuseable otherwise false.
>>>>> +bool
>>>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>>>>> +{
>>>>> +  for (auto def : info->defs())
>>>>> +{
>>>>> +  auto set = dyn_cast (def);
>>>>> +  for (auto use1 : set->nondebug_insn_uses ())
>>>>> + use1->set_is_live_out_use (true);
>>>>> +}
>>>>
>>>> What was the reason for adding this loop?
>>>>
>>>
>>> The purpose of adding is to avoid assert failure in 
>>> gcc/rtl-ssa/changes.cc:252
>> 
>> That assert is making sure that we don't delete a definition of a
>> register (or memory) while a real insn still uses it.  If the assert
>> is firing then something has gone wrong.
>> 
>> Live-out uses are a particular kind of use that occur at the end of
>> basic blocks.  It's incorrect to mark normal insn uses as live-out.
>> 
>> When an assert fails, it's important to understand why the failure
>> occurs, rather than brute-force the assert condition to true.
>> 
>
> The above assert failure occurs when there is a debug insn and its
> use is not live-out.

Uses in debug insns are never live-out uses.

It sounds like the bug is that we're failing to update all debug uses of
the original register.  We need to do that, or "reset" the debug insn if
substitution fails for some reason.

See fixup_debug_uses for what the target-independent part of the pass
does for debug insns that are affected by movement.  Hopefully the
update needed here will be simpler than that.
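
One way to do the "reset" part (a sketch; debug_insn stands for the
affected debug insn) is simply to wipe its location:

  /* Make the debug insn's value unknown rather than leave a stale
     reference to the old register.  */
  INSN_VAR_LOCATION_LOC (debug_insn) = gen_rtx_UNKNOWN_VAR_LOC ();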

>>>>> [...]
>>>>> +
>>>>> +  rtx addr = XEXP (SET_SRC (body), 0);
>>>>> +
>>>>> +  if (GET_CODE (addr) == PLUS
>>>>> +  && XEXP (addr, 1) && CONST_INT_P (XEXP (addr, 1)))
>>>>> +{
>>>>> +  if (INTVAL (XEXP (addr, 1)) == -16)
>>>>> + return false;
>>>>> +  }
>>>>
>>>> What's special about -16?
>>>>
>>>
>>> The tests like libgomp/for-8 fails with fused load with offset -16 and 0.
>>> Thats why I have added this check.
>> 
>> But why does it fail though?  It sounds like the testcase is pointing
>> out a problem in the pass (or perhaps elsewhere).  It's important that
>> we try to understand and fix the underlying problem.
>> 
>
> This check is not required anymore and will remove from subsequent patches.

OK, great.

>>>>> +
>>>>> +  df_ref use;
>>>>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (info->rtl ());
>>>>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>>>>> +{
>>>>> +  struct df_link *def_link = DF_REF_CHAIN (use);
>>>>> +
>>>>> +  if (!def_link || !def_link->ref
>>>>> +   || DF_REF_IS_ARTIFICIAL (def_link->ref))
>>>>> + continue;
>>>>>

Re: [PATCH V2] aarch64: Add missing ACLE macro for NEON-SVE Bridge

2024-06-06 Thread Richard Sandiford
Richard Ball  writes:
> v2: Change macro definition following internal discussion.
>
> __ARM_NEON_SVE_BRIDGE was missed in the original patch and is
> added by this patch.
>
> Ok for trunk and a backport into gcc-14?

Yes, thanks.

Richard

> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-c.cc (aarch64_define_unconditional_macros):
>   Add missing __ARM_NEON_SVE_BRIDGE.
>
> On 6/6/24 13:20, Richard Sandiford wrote:
>> Richard Ball  writes:
>>> __ARM_NEON_SVE_BRIDGE was missed in the original patch and is
>>> added by this patch.
>>>
>>> Ok for trunk and a backport into gcc-14?
>>>
>>> gcc/ChangeLog:
>>>
>>> * config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins):
>>> Add missing __ARM_NEON_SVE_BRIDGE.
>> 
>> After this patch was posted, there was some internal discussion
>> involving LLVM & GNU devs about what this kind of macro means, now that
>> we have FMV.  The feeling was that __ARM_NEON_SVE_BRIDGE should just
>> indicate whether the compiler provides the file, not whether AdvSIMD
>> & SVE are enabled.  I think we should therefore add this to
>> aarch64_define_unconditional_macros instead.
>> 
>> Sorry for the slow review.  I was waiting for the outcome of that
>> discussion before replying.
>> 
>> Thanks,
>> Richard
>> 
>>> diff --git a/gcc/config/aarch64/aarch64-c.cc 
>>> b/gcc/config/aarch64/aarch64-c.cc
>>> index 
>>> fe1a20e4e546a68e5f7eddff3bbb0d3e831fbd9b..1121be118cf8d05e3736ad4ee75568ff7cb92bfd
>>>  100644
>>> --- a/gcc/config/aarch64/aarch64-c.cc
>>> +++ b/gcc/config/aarch64/aarch64-c.cc
>>> @@ -260,6 +260,7 @@ aarch64_update_cpp_builtins (cpp_reader *pfile)
>>>aarch64_def_or_undef (TARGET_SME_I16I64, "__ARM_FEATURE_SME_I16I64", 
>>> pfile);
>>>aarch64_def_or_undef (TARGET_SME_F64F64, "__ARM_FEATURE_SME_F64F64", 
>>> pfile);
>>>aarch64_def_or_undef (TARGET_SME2, "__ARM_FEATURE_SME2", pfile);
>>> +  aarch64_def_or_undef (TARGET_SVE, "__ARM_NEON_SVE_BRIDGE", pfile);
>>>  
>>>/* Not for ACLE, but required to keep "float.h" correct if we switch
>>>   target between implementations that do or do not support ARMv8.2-A
>
> diff --git a/gcc/config/aarch64/aarch64-c.cc b/gcc/config/aarch64/aarch64-c.cc
> index 
> fe1a20e4e546a68e5f7eddff3bbb0d3e831fbd9b..d042e5fbd8c562df2e4538b51b960c194d2ca2c9
>  100644
> --- a/gcc/config/aarch64/aarch64-c.cc
> +++ b/gcc/config/aarch64/aarch64-c.cc
> @@ -75,6 +75,7 @@ aarch64_define_unconditional_macros (cpp_reader *pfile)
>  
>builtin_define ("__ARM_STATE_ZA");
>builtin_define ("__ARM_STATE_ZT0");
> +  builtin_define ("__ARM_NEON_SVE_BRIDGE");
>  
>/* Define keyword attributes like __arm_streaming as macros that expand
>   to the associated [[...]] attribute.  Use __extension__ in the attribute


Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

2024-06-06 Thread Richard Sandiford
Ajit Agarwal  writes:
> On 06/06/24 2:28 pm, Richard Sandiford wrote:
>> Hi,
>> 
>> Just some comments on the fuseable_load_p part, since that's what
>> we were discussing last time.
>> 
>> It looks like this now relies on:
>> 
>> Ajit Agarwal  writes:
>>> +  /* We use DF data flow because we change location rtx
>>> +which is easier to find and modify.
>>> +We use mix of rtl-ssa def-use and DF data flow
>>> +where it is easier.  */
>>> +  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
>>> +  df_analyze ();
>>> +  df_set_flags (DF_DEFER_INSN_RESCAN);
>> 
>> But please don't do this!  For one thing, building DU/UD chains
>> as well as rtl-ssa is really expensive in terms of compile time.
>> But more importantly, modifications need to happen via rtl-ssa
>> to ensure that the IL is kept up-to-date.  If we don't do that,
>> later fuse attempts will be based on stale data and so could
>> generate incorrect code.
>> 
>
> Sure I have made changes to use only rtl-ssa and not to use
> UD/DU chains. I will send the changes in separate subsequent
> patch.

Thanks.  Before you send the patch though:

>>> +// Check whether load can be fusable or not.
>>> +// Return true if fuseable otherwise false.
>>> +bool
>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>>> +{
>>> +  for (auto def : info->defs())
>>> +{
>>> +  auto set = dyn_cast (def);
>>> +  for (auto use1 : set->nondebug_insn_uses ())
>>> +   use1->set_is_live_out_use (true);
>>> +}
>> 
>> What was the reason for adding this loop?
>>
>
> The purpose of adding is to avoid assert failure in gcc/rtl-ssa/changes.cc:252

That assert is making sure that we don't delete a definition of a
register (or memory) while a real insn still uses it.  If the assert
is firing then something has gone wrong.

Live-out uses are a particular kind of use that occur at the end of
basic blocks.  It's incorrect to mark normal insn uses as live-out.

When an assert fails, it's important to understand why the failure
occurs, rather than brute-force the assert condition to true.

>>> [...]
>>> +
>>> +  rtx addr = XEXP (SET_SRC (body), 0);
>>> +
>>> +  if (GET_CODE (addr) == PLUS
>>> +  && XEXP (addr, 1) && CONST_INT_P (XEXP (addr, 1)))
>>> +{
>>> +  if (INTVAL (XEXP (addr, 1)) == -16)
>>> +   return false;
>>> +  }
>> 
>> What's special about -16?
>> 
>
> The tests like libgomp/for-8 fails with fused load with offset -16 and 0.
> Thats why I have added this check.

But why does it fail though?  It sounds like the testcase is pointing
out a problem in the pass (or perhaps elsewhere).  It's important that
we try to understand and fix the underlying problem.

>>> +
>>> +  df_ref use;
>>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (info->rtl ());
>>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>>> +{
>>> +  struct df_link *def_link = DF_REF_CHAIN (use);
>>> +
>>> +  if (!def_link || !def_link->ref
>>> + || DF_REF_IS_ARTIFICIAL (def_link->ref))
>>> +   continue;
>>> +
>>> +  while (def_link && def_link->ref)
>>> +   {
>>> + rtx_insn *insn = DF_REF_INSN (def_link->ref);
>>> + if (GET_CODE (PATTERN (insn)) == PARALLEL)
>>> +   return false;
>> 
>> Why do you need to skip PARALLELs?
>>
>
> vec_select with parallel give failures final.cc "can't split-up with subreg 
> 128 (reg OO"
> Thats why I have added this.

But in (vec_select ... (parallel ...)), the parallel won't be the 
PATTERN (insn).  It'll instead be a suboperand of the vec_select.

Here too it's important to understand why the final.cc failure occurs
and what the correct fix is.

>>> +
>>> + rtx set = single_set (insn);
>>> + if (set == NULL_RTX)
>>> +   return false;
>>> +
>>> + rtx op0 = SET_SRC (set);
>>> + rtx_code code = GET_CODE (op0);
>>> +
>>> + // This check is added as register pairs are not generated
>>> + // by RA for neg:V2DF (fma: V2DF (reg1)
>>> + //  (reg2)
>>> + //  (neg:V2DF (reg3)))
>>> + if (GET_RTX_CLASS (code) == RTX_UNARY)
>>> +   return false;
>> 
>> What's special about (neg (fma ...))?
>>
>
> I am not sure why register allocator fails allocating register p

Re: [PATCH] aarch64: Add fix_truncv4sfv4hi2 pattern [PR113882]

2024-06-06 Thread Richard Sandiford
Pengxuan Zheng  writes:
> This patch adds the fix_truncv4sfv4hi2 (V4SF->V4HI) pattern which is 
> implemented
> using fix_truncv4sfv4si2 (V4SF->V4SI) and then truncv4siv4hi2 (V4SI->V4HI).
>
>   PR target/113882
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-simd.md (fix_truncv4sfv4hi2): New pattern.

Could we handle this by extending the target-independent code instead?
Richard mentioned in comment 1 that the current set of intermediate
conversions is hard-coded, but it didn't sound like he was implying that
the set shouldn't change.

Thanks,
Richard

> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/fix_trunc2.c: New test.
>
> Signed-off-by: Pengxuan Zheng 
> ---
>  gcc/config/aarch64/aarch64-simd.md| 13 +
>  gcc/testsuite/gcc.target/aarch64/fix_trunc2.c | 14 ++
>  2 files changed, 27 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/fix_trunc2.c
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 868f4486218..096f7b56a27 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -3032,6 +3032,19 @@ (define_expand 
> "2"
>"TARGET_SIMD"
>{})
>  
> +
> +(define_expand "fix_truncv4sfv4hi2"
> +  [(match_operand:V4HI 0 "register_operand")
> +   (match_operand:V4SF 1 "register_operand")]
> +  "TARGET_SIMD"
> +  {
> +rtx tmp = gen_reg_rtx (V4SImode);
> +emit_insn (gen_fix_truncv4sfv4si2 (tmp, operands[1]));
> +emit_insn (gen_truncv4siv4hi2 (operands[0], tmp));
> +DONE;
> +  }
> +)
> +
>  (define_expand "ftrunc2"
>[(set (match_operand:VHSDF 0 "register_operand")
>   (unspec:VHSDF [(match_operand:VHSDF 1 "register_operand")]
> diff --git a/gcc/testsuite/gcc.target/aarch64/fix_trunc2.c 
> b/gcc/testsuite/gcc.target/aarch64/fix_trunc2.c
> new file mode 100644
> index 000..57cc00913a3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/fix_trunc2.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +void
> +f (short *__restrict a, float *__restrict b)
> +{
> +  a[0] = b[0];
> +  a[1] = b[1];
> +  a[2] = b[2];
> +  a[3] = b[3];
> +}
> +
> +/* { dg-final { scan-assembler-times {fcvtzs\tv[0-9]+.4s, v[0-9]+.4s} 1 } } 
> */
> +/* { dg-final { scan-assembler-times {xtn\tv[0-9]+.4h, v[0-9]+.4s} 1 } } */


Re: [PATCH] aarch64: Add missing ACLE macro for NEON-SVE Bridge

2024-06-06 Thread Richard Sandiford
Richard Ball  writes:
> __ARM_NEON_SVE_BRIDGE was missed in the original patch and is
> added by this patch.
>
> Ok for trunk and a backport into gcc-14?
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins):
>   Add missing __ARM_NEON_SVE_BRIDGE.

After this patch was posted, there was some internal discussion
involving LLVM & GNU devs about what this kind of macro means, now that
we have FMV.  The feeling was that __ARM_NEON_SVE_BRIDGE should just
indicate whether the compiler provides the file, not whether AdvSIMD
& SVE are enabled.  I think we should therefore add this to
aarch64_define_unconditional_macros instead.

Sorry for the slow review.  I was waiting for the outcome of that
discussion before replying.

Thanks,
Richard

> diff --git a/gcc/config/aarch64/aarch64-c.cc b/gcc/config/aarch64/aarch64-c.cc
> index 
> fe1a20e4e546a68e5f7eddff3bbb0d3e831fbd9b..1121be118cf8d05e3736ad4ee75568ff7cb92bfd
>  100644
> --- a/gcc/config/aarch64/aarch64-c.cc
> +++ b/gcc/config/aarch64/aarch64-c.cc
> @@ -260,6 +260,7 @@ aarch64_update_cpp_builtins (cpp_reader *pfile)
>aarch64_def_or_undef (TARGET_SME_I16I64, "__ARM_FEATURE_SME_I16I64", 
> pfile);
>aarch64_def_or_undef (TARGET_SME_F64F64, "__ARM_FEATURE_SME_F64F64", 
> pfile);
>aarch64_def_or_undef (TARGET_SME2, "__ARM_FEATURE_SME2", pfile);
> +  aarch64_def_or_undef (TARGET_SVE, "__ARM_NEON_SVE_BRIDGE", pfile);
>  
>/* Not for ACLE, but required to keep "float.h" correct if we switch
>   target between implementations that do or do not support ARMv8.2-A


Re: [PATCH]AArch64: correct constraint on Upl early clobber alternatives

2024-06-06 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> I made an oversight in the previous patch, where I added a ?Upa
> alternative to the Upl cases.  This causes it to create the tie
> between the larger register file rather than the constrained one.
>
> This fixes the affected patterns.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> Build SPECCPU 2017 and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-sve.md (@aarch64_pred_cmp,
>   *cmp_cc, *cmp_ptest,
>   @aarch64_pred_cmp_wide,
>   *aarch64_pred_cmp_wide_cc,
>   *aarch64_pred_cmp_wide_ptest): Fix Upl tie alternative.
>   * config/aarch64/aarch64-sve2.md (@aarch64_pred_): Fix
>   Upl tie alternative.

OK, thanks.  For the record...

> ---
> diff --git a/gcc/config/aarch64/aarch64-sve.md 
> b/gcc/config/aarch64/aarch64-sve.md
> index 
> d902bce62fde88b6d85f8d71f305e7fc76a4d34e..d69db34016a55b4324faa129a3ac1f47227ba776
>  100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -8134,13 +8134,13 @@ (define_insn "@aarch64_pred_cmp"
> UNSPEC_PRED_Z))
> (clobber (reg:CC_NZC CC_REGNUM))]
>"TARGET_SVE"
> -  {@ [ cons: =0 , 1   , 3 , 4; attrs: pred_clobber ]
> - [  , Upl , w , ; yes ] 
> cmp\t%0., %1/z, %3., #%4
> - [ ?Upa , 0Upl, w , ; yes ] ^
> - [ Upa  , Upl , w , ; no  ] ^
> - [  , Upl , w , w; yes ] 
> cmp\t%0., %1/z, %3., %4.
> - [ ?Upa , 0Upl, w , w; yes ] ^
> - [ Upa  , Upl , w , w; no  ] ^
> +  {@ [ cons: =0 , 1  , 3 , 4; attrs: pred_clobber ]
> + [  , Upl, w , ; yes ] 
> cmp\t%0., %1/z, %3., #%4
> + [ ?Upl , 0  , w , ; yes ] ^
> + [ Upa  , Upl, w , ; no  ] ^
> + [  , Upl, w , w; yes ] 
> cmp\t%0., %1/z, %3., %4.
> + [ ?Upl , 0  , w , w; yes ] ^
> + [ Upa  , Upl, w , w; no  ] ^
>}
>  )
>  
> @@ -8170,13 +8170,13 @@ (define_insn_and_rewrite "*cmp_cc"
> UNSPEC_PRED_Z))]
>"TARGET_SVE
> && aarch64_sve_same_pred_for_ptest_p ([4], [6])"
> -  {@ [ cons: =0 , 1, 2 , 3; attrs: pred_clobber ]
> - [  ,  Upl , w , ; yes ] 
> cmp\t%0., %1/z, %2., #%3
> - [ ?Upa ,  0Upl, w , ; yes ] ^
> - [ Upa  ,  Upl , w , ; no  ] ^
> - [  ,  Upl , w , w; yes ] 
> cmp\t%0., %1/z, %2., %3.
> - [ ?Upa ,  0Upl, w , w; yes ] ^
> - [ Upa  ,  Upl , w , w; no  ] ^
> +  {@ [ cons: =0 , 1   , 2 , 3; attrs: pred_clobber ]
> + [  ,  Upl, w , ; yes ] 
> cmp\t%0., %1/z, %2., #%3
> + [ ?Upl ,  0  , w , ; yes ] ^
> + [ Upa  ,  Upl, w , ; no  ] ^
> + [  ,  Upl, w , w; yes ] 
> cmp\t%0., %1/z, %2., %3.
> + [ ?Upl ,  0  , w , w; yes ] ^
> + [ Upa  ,  Upl, w , w; no  ] ^
>}
>"&& !rtx_equal_p (operands[4], operands[6])"
>{
> @@ -8205,12 +8205,12 @@ (define_insn_and_rewrite "*cmp_ptest"
>"TARGET_SVE
> && aarch64_sve_same_pred_for_ptest_p ([4], [6])"
>{@ [ cons: =0, 1, 2 , 3; attrs: pred_clobber ]
> - [ ,  Upl , w , ; yes ] 
> cmp\t%0., %1/z, %2., #%3
> - [ ?Upa,  0Upl, w , ; yes ] ^
> - [ Upa ,  Upl , w , ; no  ] ^
> - [ ,  Upl , w , w; yes ] 
> cmp\t%0., %1/z, %2., %3.
> - [ ?Upa,  0Upl, w , w; yes ] ^
> - [ Upa ,  Upl , w , w; no  ] ^
> + [ ,  Upl, w , ; yes ] 
> cmp\t%0., %1/z, %2., #%3
> + [ ?Upl,  0  , w , ; yes ] ^
> + [ Upa ,  Upl, w , ; no  ] ^
> + [ ,  Upl, w , w; yes ] 
> cmp\t%0., %1/z, %2., %3.
> + [ ?Upl,  0  , w , w; yes ] ^
> + [ Upa ,  Upl, w , w; no  ] ^
>}
>"&& !rtx_equal_p (operands[4], operands[6])"
>{
> @@ -8263,10 +8263,10 @@ (define_insn "@aarch64_pred_cmp_wide"
> UNSPEC_PRED_Z))
> (clobber (reg:CC_NZC CC_REGNUM))]
>"TARGET_SVE"
> -  {@ [ cons: =0, 1, 2, 3, 4; attrs: pred_clobber ]
> - [ ,  Upl ,  , w, w; yes ] 
> cmp\t%0., %1/z, %3., %4.d
> - [ ?Upa,  0Upl,  , w, w; yes ] ^
> - [ Upa ,  Upl ,  , w, w; no  ] ^
> +  {@ [ cons: =0, 1   , 2, 3, 4; attrs: pred_clobber ]

Re: [PATCH v2] aarch64: Add vector floating point extend pattern [PR113880, PR113869]

2024-06-06 Thread Richard Sandiford
Pengxuan Zheng  writes:
> This patch adds vector floating point extend pattern for V2SF->V2DF and
> V4HF->V4SF conversions by renaming the existing 
> aarch64_float_extend_lo_
> pattern to the standard optab one, i.e., extend2. This allows the
> vectorizer to vectorize certain floating point widening operations for the
> aarch64 target.
>
>   PR target/113880
>   PR target/113869
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-builtins.cc (VAR1): Remap float_extend_lo_
>   builtin codes to standard optab ones.
>   * config/aarch64/aarch64-simd.md (aarch64_float_extend_lo_): 
> Rename
>   to...
>   (extend2): ... This.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/extend-vec.c: New test.

OK, thanks, and sorry for the slow review.

Richard

> Signed-off-by: Pengxuan Zheng 
> ---
>  gcc/config/aarch64/aarch64-builtins.cc|  9 
>  gcc/config/aarch64/aarch64-simd.md|  2 +-
>  gcc/testsuite/gcc.target/aarch64/extend-vec.c | 21 +++
>  3 files changed, 31 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/extend-vec.c
>
> diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
> b/gcc/config/aarch64/aarch64-builtins.cc
> index f8eeccb554d..25189888d17 100644
> --- a/gcc/config/aarch64/aarch64-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-builtins.cc
> @@ -534,6 +534,15 @@ BUILTIN_VDQ_BHSI (urhadd, uavg, _ceil, 0)
>  BUILTIN_VDQ_BHSI (shadd, avg, _floor, 0)
>  BUILTIN_VDQ_BHSI (uhadd, uavg, _floor, 0)
>  
> +/* The builtins below should be expanded through the standard optabs
> +   CODE_FOR_extend2. */
> +#undef VAR1
> +#define VAR1(F,T,N,M) \
> +  constexpr insn_code CODE_FOR_aarch64_##F##M = CODE_FOR_##T##N##M##2;
> +
> +VAR1 (float_extend_lo_, extend, v2sf, v2df)
> +VAR1 (float_extend_lo_, extend, v4hf, v4sf)
> +
>  #undef VAR1
>  #define VAR1(T, N, MAP, FLAG, A) \
>{#N #A, UP (A), CF##MAP (N, A), 0, TYPES_##T, FLAG_##FLAG},
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 868f4486218..c5e2c9f00d0 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -3132,7 +3132,7 @@
>  DONE;
>}
>  )
> -(define_insn "aarch64_float_extend_lo_"
> +(define_insn "extend2"
>[(set (match_operand: 0 "register_operand" "=w")
>   (float_extend:
> (match_operand:VDF 1 "register_operand" "w")))]
> diff --git a/gcc/testsuite/gcc.target/aarch64/extend-vec.c 
> b/gcc/testsuite/gcc.target/aarch64/extend-vec.c
> new file mode 100644
> index 000..f6241d5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/extend-vec.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +/* { dg-final { scan-assembler-times {fcvtl\tv[0-9]+.2d, v[0-9]+.2s} 1 } } */
> +void
> +f (float *__restrict a, double *__restrict b)
> +{
> +  b[0] = a[0];
> +  b[1] = a[1];
> +}
> +
> +/* { dg-final { scan-assembler-times {fcvtl\tv[0-9]+.4s, v[0-9]+.4h} 1 } } */
> +void
> +f1 (_Float16 *__restrict a, float *__restrict b)
> +{
> +
> +  b[0] = a[0];
> +  b[1] = a[1];
> +  b[2] = a[2];
> +  b[3] = a[3];
> +}


Re: [PATCH v2 1/2] driver: Use -as/ld/objcopy as final fallback instead of native ones for cross

2024-06-06 Thread Richard Sandiford
YunQiang Su  writes:
> YunQiang Su  wrote on Wed, May 29, 2024 at 10:02:
>>
>> Richard Sandiford  wrote on Wed, May 29, 2024 at 05:28:
>> >
>> > YunQiang Su  writes:
>> > > If `find_a_program` cannot find `as/ld/objcopy` and we are a cross 
>> > > toolchain,
>> > > the final fallback is `as/ld` of system.  In fact, we can have a try with
>> > > -as/ld/objcopy before fallback to native as/ld/objcopy.
>> > >
>> > > This patch is derivatived from Debian's patch:
>> > >   gcc-search-prefixed-as-ld.diff
>> >
>> > I'm probably making you repeat a previous discussion, sorry, but could
>> > you describe the use case in more detail?  The current approach to
>> > handling cross toolchains has been used for many years.  Presumably
>> > this patch is supporting a different way of organising things,
>> > but I wasn't sure from the description what it was.
>> >
>> > AIUI, we currently assume that cross as, ld and objcopy will be
>> > installed under those names in $prefix/$target_alias/bin (aka 
>> > $tooldir/bin).
>> > E.g.:
>> >
>> >bin/aarch64-elf-as = aarch64-elf/bin/as
>> >
>> > GCC should then find as in aarch64-elf/bin.
>> >
>> > Is that not true in your case?
>> >
>>
>> Yes. This patch is only about the final fallback. I mean aarch64-elf/bin/as
>> still has higher priority than bin/aarch64-elf-as.
>>
>> In the current code, we find gas with:
>> /prefix/aarch64-elf/bin/as > $PATH/as
>>
>> And this patch a new one between them:
>> /prefix/aarch64-elf/bin/as > $PATH/aarch64-elf-as > $PATH/as
>>
>> > To be clear, I'm not saying the patch is wrong.  I'm just trying to
>> > understand why the patch is needed.
>> >
>>
>> Yes. If gcc is configured correctly, it is not so useful.
>> In some case for some lazy user, it may be useful,
>> for example, the binutils installed into different prefix with libc etc.
>>
>> For example, binutils is installed into /usr/aarch64-elf/bin, while
>> libc is installed into /usr/local/aarch64-elf/.
>>
>
> Any idea about it? Is it a use case making sense?

Yeah, I think it makes sense.  GCC and binutils are separate packages.
Users could cherry-pick a GCC installation and a separate binutils
installation rather than bundling them together into a single
toolchain.  And not everyone will have permission to change $tooldir.

So I agree we should support searching the user's path for an
as/ld/etc. based on the tool prefix.  Unfortunately, I don't think
I understand the code & constraints well enough to do a review.

In particular, it seems unfortunate that we need to do a trial
subcommand invocation before committing to the prefixed name.
And, if we continue to search for "as" in the user's path as a fallback,
it's not 100% obvious that "${triple}-as" later in the path should trump
"as" earlier in the path.

In some ways, it seems more consistent to do the replacement without
first doing a trial invocation.  But I don't know whether that would
break existing use cases.  (To be clear, I wouldn't feel comfortable
approving a patch to do that without buy-in from other maintainers.)

Thanks,
Richard


Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

2024-06-06 Thread Richard Sandiford
Hi,

Just some comments on the fuseable_load_p part, since that's what
we were discussing last time.

It looks like this now relies on:

Ajit Agarwal  writes:
> +  /* We use DF data flow because we change location rtx
> +  which is easier to find and modify.
> +  We use mix of rtl-ssa def-use and DF data flow
> +  where it is easier.  */
> +  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
> +  df_analyze ();
> +  df_set_flags (DF_DEFER_INSN_RESCAN);

But please don't do this!  For one thing, building DU/UD chains
as well as rtl-ssa is really expensive in terms of compile time.
But more importantly, modifications need to happen via rtl-ssa
to ensure that the IL is kept up-to-date.  If we don't do that,
later fuse attempts will be based on stale data and so could
generate incorrect code.

> +// Check whether load can be fusable or not.
> +// Return true if fuseable otherwise false.
> +bool
> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
> +{
> +  for (auto def : info->defs())
> +{
> +  auto set = dyn_cast (def);
> +  for (auto use1 : set->nondebug_insn_uses ())
> + use1->set_is_live_out_use (true);
> +}

What was the reason for adding this loop?

> +
> +  rtx_insn *rtl_insn = info ->rtl ();
> +  rtx body = PATTERN (rtl_insn);
> +  rtx dest_exp = SET_DEST (body);
> +
> +  if (REG_P (dest_exp) &&
> +  (DF_REG_DEF_COUNT (REGNO (dest_exp)) > 1

The rtl-ssa way of checking this is:

  crtl->ssa->is_single_dominating_def (...)

> +   || DF_REG_EQ_USE_COUNT (REGNO (dest_exp)) > 0))
> +return  false;

Why are uses in notes a problem?  In the worst case, we should just be
able to remove the note instead.
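
(Concretely, dropping such a note could look something like this -- just
a sketch, with use_insn standing for an insn whose note mentions the
register:)

  rtx note = find_reg_note (use_insn, REG_EQUAL, NULL_RTX);
  if (!note)
    note = find_reg_note (use_insn, REG_EQUIV, NULL_RTX);
  if (note && reg_overlap_mentioned_p (dest_exp, XEXP (note, 0)))
    remove_note (use_insn, note);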

> +
> +  rtx addr = XEXP (SET_SRC (body), 0);
> +
> +  if (GET_CODE (addr) == PLUS
> +  && XEXP (addr, 1) && CONST_INT_P (XEXP (addr, 1)))
> +{
> +  if (INTVAL (XEXP (addr, 1)) == -16)
> + return false;
> +  }

What's special about -16?

> +
> +  df_ref use;
> +  df_insn_info *insn_info = DF_INSN_INFO_GET (info->rtl ());
> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
> +{
> +  struct df_link *def_link = DF_REF_CHAIN (use);
> +
> +  if (!def_link || !def_link->ref
> +   || DF_REF_IS_ARTIFICIAL (def_link->ref))
> + continue;
> +
> +  while (def_link && def_link->ref)
> + {
> +   rtx_insn *insn = DF_REF_INSN (def_link->ref);
> +   if (GET_CODE (PATTERN (insn)) == PARALLEL)
> + return false;

Why do you need to skip PARALLELs?

> +
> +   rtx set = single_set (insn);
> +   if (set == NULL_RTX)
> + return false;
> +
> +   rtx op0 = SET_SRC (set);
> +   rtx_code code = GET_CODE (op0);
> +
> +   // This check is added as register pairs are not generated
> +   // by RA for neg:V2DF (fma: V2DF (reg1)
> +   //  (reg2)
> +   //  (neg:V2DF (reg3)))
> +   if (GET_RTX_CLASS (code) == RTX_UNARY)
> + return false;

What's special about (neg (fma ...))?

> +
> +   def_link = def_link->next;
> + }
> + }
> +  return true;
> +}

Thanks,
Richard


Re: [PATCH] expmed: TRUNCATE value1 if needed in store_bit_field_using_insv

2024-06-05 Thread Richard Sandiford
YunQiang Su  writes:
> Richard Sandiford  wrote on Wed, Jun 5, 2024 at 23:20:
>>
>> YunQiang Su  writes:
>> Richard Sandiford  wrote on Wed, Jun 5, 2024 at 22:14:
>> >>
>> >> YunQiang Su  writes:
>> >> > PR target/113179.
>> >> >
>> >> > In `store_bit_field_using_insv`, we just use SUBREG if value_mode
>> >> >>= op_mode, while in some ports, a sign_extend will be needed,
>> >> > such as MIPS64:
>> >> >   If either GPR rs or GPR rt does not contain sign-extended 32-bit
>> >> >   values (bits 63..31 equal), then the result of the operation is
>> >> >   UNPREDICTABLE.
>> >> >
>> >> > The problem happens for the code like:
>> >> >   struct xx {
>> >> > int a:4;
>> >> > int b:24;
>> >> > int c:3;
>> >> > int d:1;
>> >> >   };
>> >> >
>> >> >   void xx (struct xx *a, long long b) {
>> >> > a->d = b;
>> >> >   }
>> >> >
>> >> > In the above code, the hard register contains `b`, may be note well
>> >> > sign-extended.
>> >> >
>> >> > gcc/
>> >> >   PR target/113179
>> >> >   * expmed.c(store_bit_field_using_insv): TRUNCATE value1 if
>> >> >   needed.
>> >> >
>> >> > gcc/testsuite
>> >> >   PR target/113179
>> >> >   * gcc.target/mips/pr113179.c: New tests.
>> >> > ---
>> >> >  gcc/expmed.cc| 12 +---
>> >> >  gcc/testsuite/gcc.target/mips/pr113179.c | 18 ++
>> >> >  2 files changed, 27 insertions(+), 3 deletions(-)
>> >> >  create mode 100644 gcc/testsuite/gcc.target/mips/pr113179.c
>> >> >
>> >> > diff --git a/gcc/expmed.cc b/gcc/expmed.cc
>> >> > index 4ec035e4843..6a582593da8 100644
>> >> > --- a/gcc/expmed.cc
>> >> > +++ b/gcc/expmed.cc
>> >> > @@ -704,9 +704,15 @@ store_bit_field_using_insv (const extraction_insn 
>> >> > *insv, rtx op0,
>> >> >   }
>> >> > else
>> >> >   {
>> >> > -   tmp = gen_lowpart_if_possible (op_mode, value1);
>> >> > -   if (! tmp)
>> >> > - tmp = gen_lowpart (op_mode, force_reg (value_mode, 
>> >> > value1));
>> >> > +   if (targetm.mode_rep_extended (op_mode, value_mode))
>> >> > + tmp = simplify_gen_unary (TRUNCATE, op_mode,
>> >> > +   value1, value_mode);
>> >> > +   else
>> >> > + {
>> >> > +   tmp = gen_lowpart_if_possible (op_mode, value1);
>> >> > +   if (! tmp)
>> >> > + tmp = gen_lowpart (op_mode, force_reg (value_mode, 
>> >> > value1));
>> >> > + }
>> >> >   }
>> >> > value1 = tmp;
>> >> >   }
>> >>
>> >> I notice this patch is already applied.  Was it approved?  I didn't
>> >> see an approval in my feed or in the archives.
>> >>
>> >
>> > Sorry. I was supposed that it only effects MIPS targets since only MIPS 
>> > defines
>> >   targetm.mode_rep_extended
>> >
>> >> Although it might not make any difference on current targets,
>> >> I think the conditional should logically be based on
>> >> TRULY_NOOP_TRUNCATION(_MODES_P) rather than targetm.mode_rep_extended.
>> >>
>> >> TRULY_NOOP_TRUNCATION is a correctness question: can I use subregs
>> >> to do this truncation?  targetm.mode_rep_extended is instead an
>> >> optimisation question: can I assume that a particular extension is free?
>> >> Here we're trying to avoid a subreg for correctness, rather than trying
>> >> to take advantage of a cheap extension.
>> >>
>> >> So I think the code should be:
>> >>
>> >>   if (GET_MODE_SIZE (value_mode) < GET_MODE_SIZE (op_mode))
>> >> {
>> >>   tmp = simplify_subreg (op_mode, value1, value_mode, 0);
>> >>   if (! tmp)
>> >> tmp = simplify_gen_subreg (op

Re: [PATCH] expmed: TRUNCATE value1 if needed in store_bit_field_using_insv

2024-06-05 Thread Richard Sandiford
YunQiang Su  writes:
> Richard Sandiford  wrote on Wed, Jun 5, 2024 at 22:14:
>>
>> YunQiang Su  writes:
>> > PR target/113179.
>> >
>> > In `store_bit_field_using_insv`, we just use SUBREG if value_mode
>> >>= op_mode, while in some ports, a sign_extend will be needed,
>> > such as MIPS64:
>> >   If either GPR rs or GPR rt does not contain sign-extended 32-bit
>> >   values (bits 63..31 equal), then the result of the operation is
>> >   UNPREDICTABLE.
>> >
>> > The problem happens for the code like:
>> >   struct xx {
>> > int a:4;
>> > int b:24;
>> > int c:3;
>> > int d:1;
>> >   };
>> >
>> >   void xx (struct xx *a, long long b) {
>> > a->d = b;
>> >   }
>> >
>> > In the above code, the hard register contains `b`, may be note well
>> > sign-extended.
>> >
>> > gcc/
>> >   PR target/113179
>> >   * expmed.c(store_bit_field_using_insv): TRUNCATE value1 if
>> >   needed.
>> >
>> > gcc/testsuite
>> >   PR target/113179
>> >   * gcc.target/mips/pr113179.c: New tests.
>> > ---
>> >  gcc/expmed.cc| 12 +---
>> >  gcc/testsuite/gcc.target/mips/pr113179.c | 18 ++
>> >  2 files changed, 27 insertions(+), 3 deletions(-)
>> >  create mode 100644 gcc/testsuite/gcc.target/mips/pr113179.c
>> >
>> > diff --git a/gcc/expmed.cc b/gcc/expmed.cc
>> > index 4ec035e4843..6a582593da8 100644
>> > --- a/gcc/expmed.cc
>> > +++ b/gcc/expmed.cc
>> > @@ -704,9 +704,15 @@ store_bit_field_using_insv (const extraction_insn 
>> > *insv, rtx op0,
>> >   }
>> > else
>> >   {
>> > -   tmp = gen_lowpart_if_possible (op_mode, value1);
>> > -   if (! tmp)
>> > - tmp = gen_lowpart (op_mode, force_reg (value_mode, value1));
>> > +   if (targetm.mode_rep_extended (op_mode, value_mode))
>> > + tmp = simplify_gen_unary (TRUNCATE, op_mode,
>> > +   value1, value_mode);
>> > +   else
>> > + {
>> > +   tmp = gen_lowpart_if_possible (op_mode, value1);
>> > +   if (! tmp)
>> > + tmp = gen_lowpart (op_mode, force_reg (value_mode, 
>> > value1));
>> > + }
>> >   }
>> > value1 = tmp;
>> >   }
>>
>> I notice this patch is already applied.  Was it approved?  I didn't
>> see an approval in my feed or in the archives.
>>
>
> Sorry. I was supposed that it only effects MIPS targets since only MIPS 
> defines
>   targetm.mode_rep_extended
>
>> Although it might not make any difference on current targets,
>> I think the conditional should logically be based on
>> TRULY_NOOP_TRUNCATION(_MODES_P) rather than targetm.mode_rep_extended.
>>
>> TRULY_NOOP_TRUNCATION is a correctness question: can I use subregs
>> to do this truncation?  targetm.mode_rep_extended is instead an
>> optimisation question: can I assume that a particular extension is free?
>> Here we're trying to avoid a subreg for correctness, rather than trying
>> to take advantage of a cheap extension.
>>
>> So I think the code should be:
>>
>>   if (GET_MODE_SIZE (value_mode) < GET_MODE_SIZE (op_mode))
>> {
>>   tmp = simplify_subreg (op_mode, value1, value_mode, 0);
>>   if (! tmp)
>> tmp = simplify_gen_subreg (op_mode,
>>force_reg (value_mode, value1),
>>value_mode, 0);
>> }
>>   else if (GET_MODE_SIZE (op_mode) < GET_MODE_SIZE (value_mode)
>>&& !TRULY_NOOP_TRUNCATION_MODES_P (op_mode, value_mode))
>
> In fact I don't think so. For other targets besides MIPS, it is safe even
> !TRULY_NOOP_TRUNCATION_MODES_P (op_mode, value_mode)
> as INS instruction may use the low part of a register safely.

Not sure what you mean.  The change above is a no-op for targets other
than MIPS and SH5 (now removed).  But I think it's the correct way of
representing the restriction on MIPS and SH5.

>
> It is only not true on MIPS ISA documents as
>  If either GPR rs or GPR rt does not contain sign-extended 32-bit
>  values (bits 63..31 equal), then the result of the operation is
>  UNPREDICTABLE.

Right.  The reason that TRULY_NOOP_TRUNCATION_MODES_P (SImode, DImode)
is false for MIPS isn't that a subreg is impossible on MIPS targets.
It's that the MIPS port needs to ensure that SImode values are stored
in sign-extended form in order to avoid the problem above.  So a
truncation needs to be a sign-extension.

> It is very annoying. I haven't noticed a similar problem on any other
> ISA documents.
> In fact I don't know any real MIPS hardware that is "UNPREDICTABLE" in
> this case.

I think the Broadcom SB-1 took advantage of the freedom.

Thanks,
Richard


Re: [PATCH] expmed: TRUNCATE value1 if needed in store_bit_field_using_insv

2024-06-05 Thread Richard Sandiford
YunQiang Su  writes:
> PR target/113179.
>
> In `store_bit_field_using_insv`, we just use SUBREG if value_mode
>>= op_mode, while in some ports, a sign_extend will be needed,
> such as MIPS64:
>   If either GPR rs or GPR rt does not contain sign-extended 32-bit
>   values (bits 63..31 equal), then the result of the operation is
>   UNPREDICTABLE.
>
> The problem happens for the code like:
>   struct xx {
> int a:4;
> int b:24;
> int c:3;
> int d:1;
>   };
>
>   void xx (struct xx *a, long long b) {
> a->d = b;
>   }
>
> In the above code, the hard register contains `b`, may be note well
> sign-extended.
>
> gcc/
>   PR target/113179
>   * expmed.c(store_bit_field_using_insv): TRUNCATE value1 if
>   needed.
>
> gcc/testsuite
>   PR target/113179
>   * gcc.target/mips/pr113179.c: New tests.
> ---
>  gcc/expmed.cc| 12 +---
>  gcc/testsuite/gcc.target/mips/pr113179.c | 18 ++
>  2 files changed, 27 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/mips/pr113179.c
>
> diff --git a/gcc/expmed.cc b/gcc/expmed.cc
> index 4ec035e4843..6a582593da8 100644
> --- a/gcc/expmed.cc
> +++ b/gcc/expmed.cc
> @@ -704,9 +704,15 @@ store_bit_field_using_insv (const extraction_insn *insv, 
> rtx op0,
>   }
> else
>   {
> -   tmp = gen_lowpart_if_possible (op_mode, value1);
> -   if (! tmp)
> - tmp = gen_lowpart (op_mode, force_reg (value_mode, value1));
> +   if (targetm.mode_rep_extended (op_mode, value_mode))
> + tmp = simplify_gen_unary (TRUNCATE, op_mode,
> +   value1, value_mode);
> +   else
> + {
> +   tmp = gen_lowpart_if_possible (op_mode, value1);
> +   if (! tmp)
> + tmp = gen_lowpart (op_mode, force_reg (value_mode, value1));
> + }
>   }
> value1 = tmp;
>   }

I notice this patch is already applied.  Was it approved?  I didn't
see an approval in my feed or in the archives.

Although it might not make any difference on current targets,
I think the conditional should logically be based on
TRULY_NOOP_TRUNCATION(_MODES_P) rather than targetm.mode_rep_extended.

TRULY_NOOP_TRUNCATION is a correctness question: can I use subregs
to do this truncation?  targetm.mode_rep_extended is instead an
optimisation question: can I assume that a particular extension is free?
Here we're trying to avoid a subreg for correctness, rather than trying
to take advantage of a cheap extension.

So I think the code should be:

  if (GET_MODE_SIZE (value_mode) < GET_MODE_SIZE (op_mode))
{
  tmp = simplify_subreg (op_mode, value1, value_mode, 0);
  if (! tmp)
tmp = simplify_gen_subreg (op_mode,
   force_reg (value_mode, value1),
   value_mode, 0);
}
  else if (GET_MODE_SIZE (op_mode) < GET_MODE_SIZE (value_mode)
   && !TRULY_NOOP_TRUNCATION_MODES_P (op_mode, value_mode))
tmp = simplify_gen_unary (TRUNCATE, op_mode, value1, value_mode);
  else
{
  tmp = gen_lowpart_if_possible (op_mode, value1);
  if (! tmp)
tmp = gen_lowpart (op_mode, force_reg (value_mode, value1));
}

(also includes unnesting of the else).  Could you try changing the code
to do that and push the change if it works?

IMO the patch (in that form) is OK for backports after it has had a
couple of weeks on trunk.

Thanks,
Richard

> diff --git a/gcc/testsuite/gcc.target/mips/pr113179.c 
> b/gcc/testsuite/gcc.target/mips/pr113179.c
> new file mode 100644
> index 000..f32c5a16765
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/mips/pr113179.c
> @@ -0,0 +1,18 @@
> +/* Check if the operand of INS is sign-extended on MIPS64.  */
> +/* { dg-options "-mips64r2 -mabi=64" } */
> +/* { dg-skip-if "code quality test" { *-*-* } { "-O0" } { "" } } */
> +
> +struct xx {
> +int a:1;
> +int b:24;
> +int c:6;
> +int d:1;
> +};
> +
> +long long xx (struct xx *a, long long b) {
> +a->d = b;
> +return b+1;
> +}
> +
> +/* { dg-final { scan-assembler "\tsll\t\\\$3,\\\$5,0" } } */
> +/* { dg-final { scan-assembler "\tdaddiu\t\\\$2,\\\$5,1" } } */


Re: [PATCH] AArch64: Fix cpu features initialization [PR115342]

2024-06-05 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Hi Richard,
>
>>> Essentially anything covered by HWCAP doesn't need an explicit check. So I 
>>> kept
>>> the LS64 and PREDRES checks since they don't have a HWCAP allocated (I'm not
>>> entirely convinced we need these, let alone having 3 individual bits for 
>>> LS64, but
>>> that's something for the ACLE spec to sort out). The goal here is to fix 
>>> all obvious
>>> bugs so one can use FMV as intended.
>>
>> Didn't we take the opposite approach for libatomic though?
>
> We started the work before LSE128/RCPC3 HWCAPs were added, so there was no
> alternative at the time. Checking both means a higher QoI, but once most 
> distros
> use modern kernels, the CPUID checks become unnecessary and will be removed.
>
>> I suppose one difference is that the libatomic code is gating a
>> choice between a well-defined, curated set of routines, whereas the
>> libgcc code is providing a general user-facing feature.  So maybe
>> libgcc should be more conservative for that reason?
>
> Indeed. Using HWCAP means it's trivially correct and working identically 
> between GCC
> and LLVM.
>
> I don't rule out adding extra CPUID checks for some features. However unlike 
> libatomic,
> the selected features are very user visible, so we would need to specify for 
> which features
> this is both useful and correct, and make sure GCC and LLVM behave in the 
> same way.

Thanks, makes sense.  On that basis, the patch is OK for trunk and GCC
14 branch with the previously discussed changes to the changelog &
commit message.

Richard


Re: [PATCH] [RFC] lower SLP load permutation to interleaving

2024-06-05 Thread Richard Sandiford
Richard Biener  writes:
> On Tue, 4 Jun 2024, Richard Sandiford wrote:
>
>> Richard Biener  writes:
>> > The following emulates classical interleaving for SLP load permutes
>> > that we are unlikely handling natively.  This is to handle cases
>> > where interleaving (or load/store-lanes) is the optimal choice for
>> > vectorizing even when we are doing that within SLP.  An example
>> > would be
>> >
>> > void foo (int * __restrict a, int * b)
>> > {
>> >   for (int i = 0; i < 16; ++i)
>> > {
>> >   a[4*i + 0] = b[4*i + 0] * 3;
>> >   a[4*i + 1] = b[4*i + 1] + 3;
>> >   a[4*i + 2] = (b[4*i + 2] * 3 + 3);
>> >   a[4*i + 3] = b[4*i + 3] * 3;
>> > }
>> > }
>> >
>> > where currently the SLP store is merging four single-lane SLP
>> > sub-graphs but none of the loads in it can be code-generated
>> > with V4SImode vectors and a VF of four as the permutes would need
>> > three vectors.
>> 
>> Nice!
>> 
>> > The patch introduces a lowering phase after SLP discovery but
>> > before SLP pattern recognition or permute optimization that
>> > analyzes all loads from the same dataref group and creates an
>> > interleaving scheme starting from an unpermuted load.
>> >
>> > What can be handled is quite restrictive, matching only a subset
>> > of the non-SLP interleaving cases (the power-of-two group size
>> > ones, in addition only cases without gaps).  The interleaving
>> > vectorization in addition can handle size 3 and 5 - but I am not
>> > sure if it's possible to do that in a VL agnostic way.  It
>> > should be still possible to set up the SLP graph in a way that
>> > a load-lane could be matched from SLP pattern recognition.
>> 
>> Yeah, I don't think it would be possible to decompose a 3- or
>> 5-lane grouped load into a series of VLA 2-input permutes.
>> But (as I think you're saying) it seems like a load-3-lanes would just
>> be a load with a LANE_PERMUTATION of N, N+3, N+6, N+9, ... for lane N.
>> Is that right?
>
> Yes, that's how it looks without this patch.  I think we'd need
> a load node loading N, N+1, N+2, ... and then permute nodes
> with N, N+3, ... and N+1, N+4, ... and N+2, N+5 ... so we generate
> one .LOAD_LANES from the load node and the permutes pick up the
> correct vector defs?  I'm not sure yet how classification and
> code generation would work for this.
>
> The store side is already on trunk with the single SLP store node
> getting lanes via permutes.
>
> It might be we want a load/store node with N inputs/outputs as the
> best representation and use lane_permutation to indicate the
> input (for stores) and output (for loads) "permute".
>
>> > As said gaps are currently not handled - for SLP we have a
>> > representational issue that SLP_TREE_SCALAR_STMTS for "gap lanes"
>> > would need to be filled in some way (even if we just push NULL).
>> >
>> > The patch misses multi-level even/odd handling as well as CSEing
>> > intermediate generated permutes.  Both is quite straight-forward
>> > to add, but eventually there's a better or more general strategy
>> > for lowering?  The main goal of the patch is to avoid falling
>> > back to non-SLP for cases the interleaving code handles.
>> 
>> Does the multi-level thing including examples like:
>> 
>> int a[2 * 16];
>> int b[8 * 16];
>> void f()
>> {
>>   for (int i = 0; i < 16; ++i)
>> {
>>   a[i * 2 + 0] += b[i * 8 + 0] + b[i * 8 + 1] + b[i * 8 + 2] + b[i * 8 + 
>> 3];
>>   a[i * 2 + 1] += b[i * 8 + 4] + b[i * 8 + 5] + b[i * 8 + 6] + b[i * 8 + 
>> 7];
>> }
>> }
>> 
>> ?  For that we generate:
>> 
>>   _45 = VEC_PERM_EXPR ;
>>   _44 = VEC_PERM_EXPR ;
>>   _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
>>   _49 = VEC_PERM_EXPR ;
>>   _48 = VEC_PERM_EXPR ;
>>   _47 = VEC_PERM_EXPR <_49, _48, { 1, 3, 5, 7 }>;
>>   _53 = VEC_PERM_EXPR ;
>>   _52 = VEC_PERM_EXPR ;
>>   _51 = VEC_PERM_EXPR <_53, _52, { 1, 3, 5, 7 }>;
>>   _54 = VEC_PERM_EXPR <_49, _48, { 0, 2, 4, 6 }>;
>> 
>> (two even level 1, one even level 2, one odd level 1), whereas
>> preferring 2xeven + 2xodd would avoid the third set of first-level
>> permutes:
>> 
>>   _45 = VEC_PERM_EXPR ;
>>   _44 = VEC_PERM_EXPR ;
>>   _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
>>   _49 = VEC_PERM_EXPR ;
>>

Re: [PATCH v4 1/3] [RFC] ifcvt: handle sequences that clobber flags in noce_convert_multiple_sets

2024-06-05 Thread Richard Sandiford
Sorry for the slow review.

Manolis Tsamis  writes:
> This is an extension of what was done in PR106590.
>
> Currently if a sequence generated in noce_convert_multiple_sets clobbers the
> condition rtx (cc_cmp or rev_cc_cmp) then only seq1 is used afterwards
> (sequences that emit the comparison itself). Since this applies only from the
> next iteration it assumes that the sequences generated (in particular seq2)
> doesn't clobber the condition rtx itself before using it in the if_then_else,
> which is only true in specific cases (currently only register/subregister 
> moves
> are allowed).
>
> This patch changes this so it also tests if seq2 clobbers cc_cmp/rev_cc_cmp in
> the current iteration. This makes it possible to include arithmetic operations
> in noce_convert_multiple_sets.
>
> It also makes the code that checks whether the condition is used outside of 
> the
> if_then_else emitted more robust.
>
> gcc/ChangeLog:
>
>   * ifcvt.cc (check_for_cc_cmp_clobbers): Use modified_in_p instead.
>   (noce_convert_multiple_sets_1): Don't use seq2 if it clobbers cc_cmp.
>   Refactor the code that sets read_comparison.
>
> Signed-off-by: Manolis Tsamis 
> ---
>
> (no changes since v1)
>
>  gcc/ifcvt.cc | 106 ---
>  1 file changed, 59 insertions(+), 47 deletions(-)
>
> diff --git a/gcc/ifcvt.cc b/gcc/ifcvt.cc
> index 58ed42673e5..763a25f816e 100644
> --- a/gcc/ifcvt.cc
> +++ b/gcc/ifcvt.cc
> @@ -3592,20 +3592,6 @@ noce_convert_multiple_sets (struct noce_if_info 
> *if_info)
>return true;
>  }
>  
> -/* Helper function for noce_convert_multiple_sets_1.  If store to
> -   DEST can affect P[0] or P[1], clear P[0].  Called via note_stores.  */
> -
> -static void
> -check_for_cc_cmp_clobbers (rtx dest, const_rtx, void *p0)
> -{
> -  rtx *p = (rtx *) p0;
> -  if (p[0] == NULL_RTX)
> -return;
> -  if (reg_overlap_mentioned_p (dest, p[0])
> -  || (p[1] && reg_overlap_mentioned_p (dest, p[1])))
> -p[0] = NULL_RTX;
> -}
> -
>  /* This goes through all relevant insns of IF_INFO->then_bb and tries to
> create conditional moves.  In case a simple move sufficis the insn
> should be listed in NEED_NO_CMOV.  The rewired-src cases should be
> @@ -3731,36 +3717,67 @@ noce_convert_multiple_sets_1 (struct noce_if_info 
> *if_info,
>creating an additional compare for each.  If successful, costing
>is easier and this sequence is usually preferred.  */
>if (cc_cmp)
> - seq2 = try_emit_cmove_seq (if_info, temp, cond,
> -new_val, old_val, need_cmov,
> -, _dest2, cc_cmp, rev_cc_cmp);
> + {
> +   seq2 = try_emit_cmove_seq (if_info, temp, cond,
> +  new_val, old_val, need_cmov,
> +  , _dest2, cc_cmp, rev_cc_cmp);
> +
> +   /* The if_then_else in SEQ2 may be affected when cc_cmp/rev_cc_cmp is
> +  clobbered.  We can't safely use the sequence in this case.  */
> +   if (seq2 && (modified_in_p (cc_cmp, seq2)
> +   || (rev_cc_cmp && modified_in_p (rev_cc_cmp, seq2
> + seq2 = NULL;

It looks like this still has the problem that I mentioned in the
previous round: that modified_in_p only checks the first instruction
in seq2, not the whole sequence.  Or is that the intention?

Thanks,
Richard

> + }
>  
>/* The backend might have created a sequence that uses the
> -  condition.  Check this.  */
> +  condition as a value.  Check this.  */
> +
> +  /* We cannot handle anything more complex than a reg or constant.  */
> +  if (!REG_P (XEXP (cond, 0)) && !CONSTANT_P (XEXP (cond, 0)))
> + read_comparison = true;
> +
> +  if (!REG_P (XEXP (cond, 1)) && !CONSTANT_P (XEXP (cond, 1)))
> + read_comparison = true;
> +
>rtx_insn *walk = seq2;
> -  while (walk)
> +  int if_then_else_count = 0;
> +  while (walk && !read_comparison)
>   {
> -   rtx set = single_set (walk);
> +   rtx exprs_to_check[2];
> +   unsigned int exprs_count = 0;
>  
> -   if (!set || !SET_SRC (set))
> +   rtx set = single_set (walk);
> +   if (set && XEXP (set, 1)
> +   && GET_CODE (XEXP (set, 1)) == IF_THEN_ELSE)
>   {
> -   walk = NEXT_INSN (walk);
> -   continue;
> +   /* We assume that this is the cmove created by the backend that
> +  naturally uses the condition.  */
> +   exprs_to_check[exprs_count++] = XEXP (XEXP (set, 1), 1);
> +   exprs_to_check[exprs_count++] = XEXP (XEXP (set, 1), 2);
> +   if_then_else_count++;
>   }
> +   else if (NONDEBUG_INSN_P (walk))
> + exprs_to_check[exprs_count++] = PATTERN (walk);
>  
> -   rtx src = SET_SRC (set);
> +   /* Bail if we get more than one if_then_else because the assumption
> +  above may be incorrect.  */
> +   if (if_then_else_count > 1)
> + {
> 

Re: [PATCH] libgcc/aarch64: also provide AT_HWCAP2 fallback

2024-06-05 Thread Richard Sandiford
Jan Beulich  writes:
> Much like AT_HWCAP is already provided in case the platform headers
> don't have the value (yet).
>
> libgcc/
>
>   * config/aarch64/cpuinfo.c: Provide AT_HWCAP2.

OK for trunk and GCC 14.

Thanks,
Richard

> ---
> Observed as build failure with 14.1.0, so may want backporting there.
>
> --- a/libgcc/config/aarch64/cpuinfo.c
> +++ b/libgcc/config/aarch64/cpuinfo.c
> @@ -146,6 +146,9 @@ struct {
>  #define HWCAP_PACG (1UL << 31)
>  #endif
>  
> +#ifndef AT_HWCAP2
> +#define AT_HWCAP2 26
> +#endif
>  #ifndef HWCAP2_DCPODP
>  #define HWCAP2_DCPODP (1 << 0)
>  #endif


Re: [PATCH 2/3] [APX CCMP] Adjust strategy for selecting ccmp candidates

2024-06-05 Thread Richard Sandiford
Hongyu Wang  writes:
> CC'd Richard for ccmp part as previously it is added only for aarch64.
> The original logic will not interrupted since if
> aarch64_gen_ccmp_first succeeded, aarch64_gen_ccmp_next will also
> success, the cmp/fcmp and ccmp/fccmp supports all GPI/GPF, and the
> prepare_operand will fixup the input that cmp supports but ccmp not,
> so ret/ret2 will all be valid when comparing cost.
> Thanks in advance.

Sorry for the slow review.

> Hongyu Wang  于2024年5月15日周三 16:22写道:
>>
>> For general ccmp scenario, the tree sequence is like
>>
>> _1 = (a < b)
>> _2 = (c < d)
>> _3 = _1 & _2
>>
>> current ccmp expanding will try to swap compare order for _1 and _2,
>> compare the cost/cost2 between compare _1 and _2 first, then return the
>> sequence with lower cost.
>>
>> For x86 ccmp, we don't support FP compare as ccmp operand, but we
>> support fp comi + int ccmp sequence. With current cost comparison
>> model, the fp comi + int ccmp can never be generated since it doesn't
>> check whether expand_ccmp_next returns available result and the rtl
>> cost for the empty ccmp sequence is always smaller.
>>
>> Check the expand_ccmp_next result ret and ret2, returns the valid one
>> before cost comparison.
>>
>> gcc/ChangeLog:
>>
>> * ccmp.cc (expand_ccmp_expr_1): Check ret and ret2 of
>> expand_ccmp_next, returns the valid one first before
>> comparing cost.
>> ---
>>  gcc/ccmp.cc | 12 +++-
>>  1 file changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/gcc/ccmp.cc b/gcc/ccmp.cc
>> index 7cb525addf4..4b424220068 100644
>> --- a/gcc/ccmp.cc
>> +++ b/gcc/ccmp.cc
>> @@ -247,7 +247,17 @@ expand_ccmp_expr_1 (gimple *g, rtx_insn **prep_seq, 
>> rtx_insn **gen_seq)
>>   cost2 = seq_cost (prep_seq_2, speed_p);
>>   cost2 += seq_cost (gen_seq_2, speed_p);
>> }
>> - if (cost2 < cost1)
>> +
>> + /* For x86 target the ccmp does not support fp operands, but
>> +have fcomi insn that can produce eflags and then do int
>> +ccmp. So if one of the op is fp compare, ret1 or ret2 can
>> +fail, and the cost of the corresponding empty seq will
>> +always be smaller, then the NULL sequence will be returned.
>> +Add check for ret and ret2, returns the available one if
>> +the other is NULL.  */

I think the more fundamental point is that the cost of a failed
expansion isn't meaningful.  So how about:

  /* It's possible that one expansion succeeds and the other fails.
 For example, x86 has int ccmp but not fp ccmp, and so a combined
 fp and int comparison must be ordered such that the fp comparison
 happens first.  The costs are not meaningful for failed
 expansions.  */

>> + if ((!ret && ret2)
>> + || (!(ret && !ret2)
>> + && cost2 < cost1))

I think this simplifies to:

  if (ret2 && (!ret1 || cost2 < cost1))

OK with those changes, thanks.

Richard

>> {
>>   *prep_seq = prep_seq_2;
>>   *gen_seq = gen_seq_2;
>> --
>> 2.31.1
>>


Re: [PATCH-1v2] fwprop: Replace rtx_cost with insn_cost in try_fwprop_subst_pattern [PR113325]

2024-06-05 Thread Richard Sandiford
HAO CHEN GUI  writes:
> Hi,
>   This patch replaces rtx_cost with insn_cost in forward propagation.
> In the PR, one constant vector should be propagated and replace a
> pseudo in a store insn if we know it's a duplicated constant vector.
> It reduces the insn cost but not rtx cost. In this case, the cost is
> determined by destination operand (memory or pseudo). Unfortunately,
> rtx cost can't help.
>
>   The test case is added in the second target specific patch.
> https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643995.html
>
>   Compared to previous version, the main change is not to do
> substitution if either new or old insn cost is zero. The zero means
> the cost is unknown.
>
>  Previous version
> https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643994.html
>
>   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
> regressions. Is it OK for the trunk?
>
> ChangeLog
> fwprop: Replace set_src_cost with insn_cost in try_fwprop_subst_pattern
>
> gcc/
>   * fwprop.cc (try_fwprop_subst_pattern): Replace set_src_cost with
>   insn_cost.

Thanks for doing this.  It's definitely the right direction, but:

> patch.diff
> diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
> index cb6fd6700ca..184a22678b7 100644
> --- a/gcc/fwprop.cc
> +++ b/gcc/fwprop.cc
> @@ -470,21 +470,19 @@ try_fwprop_subst_pattern (obstack_watermark , 
> insn_change _change,
>redo_changes (0);
>  }
>
> -  /* ??? In theory, it should be better to use insn costs rather than
> - set_src_costs here.  That would involve replacing this code with
> - change_is_worthwhile.  */

...as hinted at in the comment, rtl-ssa already has a routine for
insn_cost-based calculations.  It has two (supposed) advantages:
it caches the old costs, and it takes execution frequency into
account when optimising for speed.

The comment is out of date though.  The name of the routine is
changes_are_worthwhile rather than change_is_worthwhile.  Could you
try using that instead?

Richard

>bool ok = recog (attempt, use_change);
>if (ok && !prop.changed_mem_p () && !use_insn->is_asm ())
> -if (rtx use_set = single_set (use_rtl))
> +if (single_set (use_rtl))
>{
>   bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (use_rtl));
> + auto new_cost = insn_cost (use_rtl, speed);
>   temporarily_undo_changes (0);
> - auto old_cost = set_src_cost (SET_SRC (use_set),
> -   GET_MODE (SET_DEST (use_set)), speed);
> + /* Invalide recog data.  */
> + INSN_CODE (use_rtl) = -1;
> + auto old_cost = insn_cost (use_rtl, speed);
>   redo_changes (0);
> - auto new_cost = set_src_cost (SET_SRC (use_set),
> -   GET_MODE (SET_DEST (use_set)), speed);
> - if (new_cost > old_cost
> + if (new_cost == 0 || old_cost == 0
> + || new_cost > old_cost
>   || (new_cost == old_cost && !prop.likely_profitable_p ()))
> {
>   if (dump_file)


Re: [PATCH v1 0/6] Add DLL import/export implementation to AArch64

2024-06-05 Thread Richard Sandiford
Evgeny Karpov  writes:
> Richard and Uros, could you please review the changes for v2?
> Additionally, we have detected an issue with GCC GC in winnt-dll.cc. The fix 
> will be included in v2.

Would it be possible to have a more "purposeful" name than
CMODEL_IS_NOT_LARGE_OR_MEDIUM_PIC?  What's the property of
large and medium PIC that needs to be handled differently?

It'd be good to have the macro be a positive test rather than a negative
test, so that we don't end up with !IS_NOT_FOO when testing for FOO.

Otherwise it looks good to me.

I never fully reviewed 1/6 or 6/6, sorry.  My main comment there is
that it would be good to avoid including config/mingw/winnt.h and
config/mingw/winnt-dll.h in config/aarch64/aarch64-protos.h (or in
other common AArch64 code).  It's OK for common AArch64 code to
have hooks that can be filled in by OS-specific headers, but there
shouldn't be OS-specific includes or code in the common files themselves.

>>> -ix86_handle_selectany_attribute (tree *node, tree name, tree, int,
>>> +mingw_handle_selectany_attribute (tree *node, tree name, tree, int,
>>>  bool *no_add_attrs)
>
>> please reindent the parameters for the new name length.
>
> Richard, could you please clarify how it should be done?

The "bool" on the second line should be directly under the "tree"
on the first line (so one extra space before "bool").
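
I.e. (re-indented here with spaces, since the new name is one character
longer than the old one):

  mingw_handle_selectany_attribute (tree *node, tree name, tree, int,
                                    bool *no_add_attrs)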

Thanks,
Richard


> Thanks!
>
> Regards,
> Evgeny
>
>
> ---
>  gcc/config/aarch64/cygming.h   |  6 +
>  gcc/config/i386/cygming.h  |  6 +
>  gcc/config/i386/i386-expand.cc |  6 +++--
>  gcc/config/i386/i386-expand.h  |  2 --
>  gcc/config/i386/i386.cc| 42 ++
>  gcc/config/i386/i386.h |  2 ++
>  gcc/config/mingw/winnt-dll.cc  |  8 ++-
>  gcc/config/mingw/winnt-dll.h   |  2 +-
>  8 files changed, 33 insertions(+), 41 deletions(-)
>
> diff --git a/gcc/config/aarch64/cygming.h b/gcc/config/aarch64/cygming.h
> index 4beebf9e093..0ff475754e0 100644
> --- a/gcc/config/aarch64/cygming.h
> +++ b/gcc/config/aarch64/cygming.h
> @@ -183,4 +183,10 @@ still needed for compilation.  */
>  #undef MAX_OFILE_ALIGNMENT
>  #define MAX_OFILE_ALIGNMENT (8192 * 8)
>  
> +#define CMODEL_IS_NOT_LARGE_OR_MEDIUM_PIC 0
> +
> +#define HAVE_64BIT_POINTERS 1
> +
> +#define GOT_ALIAS_SET mingw_GOT_alias_set ()
> +
>  #endif
> diff --git a/gcc/config/i386/cygming.h b/gcc/config/i386/cygming.h
> index ee01e6bb6ce..cd240533dbc 100644
> --- a/gcc/config/i386/cygming.h
> +++ b/gcc/config/i386/cygming.h
> @@ -469,3 +469,9 @@ do {  \
>  #ifndef HAVE_GAS_ALIGNED_COMM
>  # define HAVE_GAS_ALIGNED_COMM 0
>  #endif
> +
> +#define CMODEL_IS_NOT_LARGE_OR_MEDIUM_PIC ix86_cmodel != CM_LARGE_PIC && 
> ix86_cmodel != CM_MEDIUM_PIC
> +
> +#define HAVE_64BIT_POINTERS TARGET_64BIT_DEFAULT
> +
> +#define GOT_ALIAS_SET mingw_GOT_alias_set ()
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index fb460e30d0a..267d0ba257b 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -408,11 +408,12 @@ ix86_expand_move (machine_mode mode, rtx operands[])
>: UNSPEC_GOT));
> op1 = gen_rtx_CONST (Pmode, op1);
> op1 = gen_const_mem (Pmode, op1);
> -   set_mem_alias_set (op1, ix86_GOT_alias_set ());
> +   set_mem_alias_set (op1, GOT_ALIAS_SET);
>   }
>else
>   {
> -   tmp = ix86_legitimize_pe_coff_symbol (op1, addend != NULL_RTX);
> +#if TARGET_PECOFF
> +   tmp = legitimize_pe_coff_symbol (op1, addend != NULL_RTX);
> if (tmp)
>   {
> op1 = tmp;
> @@ -424,6 +425,7 @@ ix86_expand_move (machine_mode mode, rtx operands[])
> op1 = operands[1];
> break;
>   }
> +#endif
>   }
>  
>if (addend)
> diff --git a/gcc/config/i386/i386-expand.h b/gcc/config/i386/i386-expand.h
> index a8c20993954..5e02df1706d 100644
> --- a/gcc/config/i386/i386-expand.h
> +++ b/gcc/config/i386/i386-expand.h
> @@ -34,9 +34,7 @@ struct expand_vec_perm_d
>  };
>  
>  rtx legitimize_tls_address (rtx x, enum tls_model model, bool for_mov);
> -alias_set_type ix86_GOT_alias_set (void);
>  rtx legitimize_pic_address (rtx orig, rtx reg);
> -rtx ix86_legitimize_pe_coff_symbol (rtx addr, bool inreg);
>  
>  bool insn_defines_reg (unsigned int regno1, unsigned int regno2,
>  rtx_insn *insn);
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 66845b30446..ee3a59ed498 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -11807,30 +11807,6 @@ constant_address_p (rtx x)
>  }
>  
>
>  
> -#if TARGET_PECOFF
> -rtx ix86_legitimize_pe_coff_symbol (rtx addr, bool inreg)
> -{
> -  return legitimize_pe_coff_symbol (addr, inreg);
> -}
> -
> -alias_set_type
> -ix86_GOT_alias_set (void)
> -{
> -  return mingw_GOT_alias_set ();
> -}
> -#else
> -rtx ix86_legitimize_pe_coff_symbol 

Re: [PATCH] [RFC] lower SLP load permutation to interleaving

2024-06-04 Thread Richard Sandiford
Richard Biener  writes:
> The following emulates classical interleaving for SLP load permutes
> that we are unlikely handling natively.  This is to handle cases
> where interleaving (or load/store-lanes) is the optimal choice for
> vectorizing even when we are doing that within SLP.  An example
> would be
>
> void foo (int * __restrict a, int * b)
> {
>   for (int i = 0; i < 16; ++i)
> {
>   a[4*i + 0] = b[4*i + 0] * 3;
>   a[4*i + 1] = b[4*i + 1] + 3;
>   a[4*i + 2] = (b[4*i + 2] * 3 + 3);
>   a[4*i + 3] = b[4*i + 3] * 3;
> }
> }
>
> where currently the SLP store is merging four single-lane SLP
> sub-graphs but none of the loads in it can be code-generated
> with V4SImode vectors and a VF of four as the permutes would need
> three vectors.

Nice!

> The patch introduces a lowering phase after SLP discovery but
> before SLP pattern recognition or permute optimization that
> analyzes all loads from the same dataref group and creates an
> interleaving scheme starting from an unpermuted load.
>
> What can be handled is quite restrictive, matching only a subset
> of the non-SLP interleaving cases (the power-of-two group size
> ones, in addition only cases without gaps).  The interleaving
> vectorization in addition can handle size 3 and 5 - but I am not
> sure if it's possible to do that in a VL agnostic way.  It
> should be still possible to set up the SLP graph in a way that
> a load-lane could be matched from SLP pattern recognition.

Yeah, I don't think it would be possible to decompose a 3- or
5-lane grouped load into a series of VLA 2-input permutes.
But (as I think you're saying) it seems like a load-3-lanes would just
be a load with a LANE_PERMUTATION of N, N+3, N+6, N+9, ... for lane N.
Is that right?
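
For the avoidance of doubt, the kind of group I have in mind is (array
names made up, purely for illustration):

  void g (int *__restrict x, int *__restrict y, int *__restrict z,
          int *__restrict b, int n)
  {
    for (int i = 0; i < n; ++i)
      {
        x[i] = b[3 * i + 0];
        y[i] = b[3 * i + 1];
        z[i] = b[3 * i + 2];
      }
  }

where the load feeding x would be lane 0 (permutation 0, 3, 6, ...),
the one feeding y lane 1, and the one feeding z lane 2.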

> As said gaps are currently not handled - for SLP we have a
> representational issue that SLP_TREE_SCALAR_STMTS for "gap lanes"
> would need to be filled in some way (even if we just push NULL).
>
> The patch misses multi-level even/odd handling as well as CSEing
> intermediate generated permutes.  Both is quite straight-forward
> to add, but eventually there's a better or more general strategy
> for lowering?  The main goal of the patch is to avoid falling
> back to non-SLP for cases the interleaving code handles.

Does the multi-level thing include examples like:

int a[2 * 16];
int b[8 * 16];
void f()
{
  for (int i = 0; i < 16; ++i)
{
  a[i * 2 + 0] += b[i * 8 + 0] + b[i * 8 + 1] + b[i * 8 + 2] + b[i * 8 + 3];
  a[i * 2 + 1] += b[i * 8 + 4] + b[i * 8 + 5] + b[i * 8 + 6] + b[i * 8 + 7];
}
}

?  For that we generate:

  _45 = VEC_PERM_EXPR ;
  _44 = VEC_PERM_EXPR ;
  _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
  _49 = VEC_PERM_EXPR ;
  _48 = VEC_PERM_EXPR ;
  _47 = VEC_PERM_EXPR <_49, _48, { 1, 3, 5, 7 }>;
  _53 = VEC_PERM_EXPR ;
  _52 = VEC_PERM_EXPR ;
  _51 = VEC_PERM_EXPR <_53, _52, { 1, 3, 5, 7 }>;
  _54 = VEC_PERM_EXPR <_49, _48, { 0, 2, 4, 6 }>;

(two even level 1, one even level 2, one odd level 1), whereas
preferring 2xeven + 2xodd would avoid the third set of first-level
permutes:

  _45 = VEC_PERM_EXPR ;
  _44 = VEC_PERM_EXPR ;
  _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
  _49 = VEC_PERM_EXPR ;
  _48 = VEC_PERM_EXPR ;
  _47 = VEC_PERM_EXPR <_49, _48, { 1, 3, 5, 7 }>;
  _51 = VEC_PERM_EXPR <_45, _44, { 0, 2, 4, 6 }>;
  _54 = VEC_PERM_EXPR <_49, _48, { 0, 2, 4, 6 }>;

> Comments and suggestions welcome, esp. what representation
> you'd think is suitable for SLP pattern matching to
> load/store-lane and how to represent that?  Maybe this lowering
> should happen directly in vect_lower_load_permutations?

If the load-lanes representation is as simple as above, it sounds like
it could be deferred to pattern matching.  Not sure what the result
would look like though.  It would be nice if (at least for costing
purposes) we could have a single node for all lanes of the load-lanes,
rather than create a separate node for each lane and rely on later CSE.
(Or do we already have a good representation for this?  It's been too
long, sorry.)

Bit of trivia below:

> Thanks,
> Richard.
>
>   * tree-vect-slp.cc (vllp_cmp): New function.
>   (vect_lower_load_permutations): Likewise.
>   (vect_analyze_slp): Call it.
> ---
>  gcc/tree-vect-slp.cc | 279 +++
>  1 file changed, 279 insertions(+)
>
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 7e3d0107b4e..766b773452f 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -3839,6 +3839,279 @@ vect_analyze_slp_instance (vec_info *vinfo,
>return res;
>  }
>  
> +/* qsort comparator ordering SLP load nodes.  */
> +
> +static int
> +vllp_cmp (const void *a_, const void *b_)
> +{
> +  const slp_tree a = *(const slp_tree *)a_;
> +  const slp_tree b = *(const slp_tree *)b_;
> +  stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (a)[0];
> +  stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (b)[0];
> +  if (STMT_VINFO_GROUPED_ACCESS (a0)
> +  

Re: PATCH] AArch64: Fix cpu features initialization [PR115342]

2024-06-04 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Hi Richard,
>
> I've reworded the commit message a bit:
>
> The CPU features initialization code uses CPUID registers (rather than
> HWCAP).  The equality comparisons it uses are incorrect: for example FEAT_SVE
> is not set if SVE2 is available.  Using HWCAPs for these is both simpler and
> correct.  The initialization must also be done atomically to avoid multiple
> threads causing corruption due to non-atomic RMW accesses to the global.

Thanks, sounds good.

>> What criteria did you use for choosing whether to keep or remove
>> the system register checks?
>
> Essentially anything covered by HWCAP doesn't need an explicit check. So I 
> kept
> the LS64 and PREDRES checks since they don't have a HWCAP allocated (I'm not
> entirely convinced we need these, let alone having 3 individual bits for 
> LS64, but
> that's something for the ACLE spec to sort out). The goal here is to fix all 
> obvious
> bugs so one can use FMV as intended.

Didn't we take the opposite approach for libatomic though?

/* LSE128 atomic support encoded in ID_AA64ISAR0_EL1.Atomic,
   bits[23:20].  The expected value is 0b0011.  Check that.  */

#define AT_FEAT_FIELD(isar0)(((isar0) >> 20) & 15)

static inline bool
has_lse128 (unsigned long hwcap, const __ifunc_arg_t *features)
{
  if (hwcap & _IFUNC_ARG_HWCAP
  && features->_hwcap2 & HWCAP2_LSE128)
return true;
  /* A 0 HWCAP2_LSE128 bit may be just as much a sign of missing HWCAP2 bit
 support in older kernels as it is of CPU feature absence.  Try fallback
 method to guarantee LSE128 is not implemented.

 In the absence of HWCAP_CPUID, we are unable to check for LSE128.
 If feature check available, check LSE2 prerequisite before proceeding.  */
  if (!(hwcap & HWCAP_CPUID) || !(hwcap & HWCAP_USCAT))
 return false;

  unsigned long isar0;
  asm volatile ("mrs %0, ID_AA64ISAR0_EL1" : "=r" (isar0));
  if (AT_FEAT_FIELD (isar0) >= 3)
return true;
  return false;
}

I suppose one difference is that the libatomic code is gating a
choice between a well-defined, curated set of routines, whereas the
libgcc code is providing a general user-facing feature.  So maybe
libgcc should be more conservative for that reason?

Thanks,
Richard


Re: PATCH] AArch64: Fix cpu features initialization [PR115342]

2024-06-04 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Fix CPU features initialization.  Use HWCAP rather than explicit accesses
> to CPUID registers.  Perform the initialization atomically to avoid multi-
> threading issues.

Please describe the problem that the patch is fixing.  I think the
PR description would make a better commit message:

---
The CPU features initialization code uses CPUID registers. It uses
incorrect comparisons so that for example SVE is not set if SVE2 is
available. Using HWCAPs for these is both simpler and works
correctly. The initialization must also be done atomically so as to avoid
multiple threads causing corruption due to non-atomic RMW of the global.
---

What criteria did you use for choosing whether to keep or remove
the system register checks?

> Passes regress, OK for commit and backport?
>
> libgcc:
> PR target/115342
> * config/aarch64/cpuinfo.c (__init_cpu_features_constructor):
> Use HWCAP where possible.  Use atomic write for initialization.

It'd be good to mention the fix for the FEAT_PREDRES system register check
as well.

> (__init_cpu_features_resolver): Use atomic load for correct
> initialization.
> (__init_cpu_features): Likewise.

Thanks,
Richard

>
> ---
>
> diff --git a/libgcc/config/aarch64/cpuinfo.c b/libgcc/config/aarch64/cpuinfo.c
> index 
> 4b94fca869507145ec690c825f637abbc82a3493..544c5516133ec3a554d1222de2ea9d5e6d4c27a9
>  100644
> --- a/libgcc/config/aarch64/cpuinfo.c
> +++ b/libgcc/config/aarch64/cpuinfo.c
> @@ -227,14 +227,22 @@ struct {
>  #ifndef HWCAP2_SVE_EBF16
>  #define HWCAP2_SVE_EBF16 (1UL << 33)
>  #endif
> +#ifndef HWCAP2_SME2
> +#define HWCAP2_SME2 (1UL << 37)
> +#endif
> +#ifndef HWCAP2_LRCPC3
> +#define HWCAP2_LRCPC3(1UL << 46)
> +#endif
>  
>  static void
> -__init_cpu_features_constructor(unsigned long hwcap,
> - const __ifunc_arg_t *arg) {
> -#define setCPUFeature(F) __aarch64_cpu_features.features |= 1ULL << F
> +__init_cpu_features_constructor (unsigned long hwcap,
> +  const __ifunc_arg_t *arg)
> +{
> +  unsigned long feat = 0;
> +#define setCPUFeature(F) feat |= 1UL << F
>  #define getCPUFeature(id, ftr) __asm__("mrs %0, " #id : "=r"(ftr))
>  #define extractBits(val, start, number) \
> -  (val & ((1ULL << number) - 1ULL) << start) >> start
> +  (val & ((1UL << number) - 1UL) << start) >> start
>unsigned long hwcap2 = 0;
>if (hwcap & _IFUNC_ARG_HWCAP)
>  hwcap2 = arg->_hwcap2;
> @@ -244,26 +252,20 @@ __init_cpu_features_constructor(unsigned long hwcap,
>  setCPUFeature(FEAT_PMULL);
>if (hwcap & HWCAP_FLAGM)
>  setCPUFeature(FEAT_FLAGM);
> -  if (hwcap2 & HWCAP2_FLAGM2) {
> -setCPUFeature(FEAT_FLAGM);
> +  if (hwcap2 & HWCAP2_FLAGM2)
>  setCPUFeature(FEAT_FLAGM2);
> -  }
> -  if (hwcap & HWCAP_SM3 && hwcap & HWCAP_SM4)
> +  if (hwcap & HWCAP_SM4)
>  setCPUFeature(FEAT_SM4);
>if (hwcap & HWCAP_ASIMDDP)
>  setCPUFeature(FEAT_DOTPROD);
>if (hwcap & HWCAP_ASIMDFHM)
>  setCPUFeature(FEAT_FP16FML);
> -  if (hwcap & HWCAP_FPHP) {
> +  if (hwcap & HWCAP_FPHP)
>  setCPUFeature(FEAT_FP16);
> -setCPUFeature(FEAT_FP);
> -  }
>if (hwcap & HWCAP_DIT)
>  setCPUFeature(FEAT_DIT);
>if (hwcap & HWCAP_ASIMDRDM)
>  setCPUFeature(FEAT_RDM);
> -  if (hwcap & HWCAP_ILRCPC)
> -setCPUFeature(FEAT_RCPC2);
>if (hwcap & HWCAP_AES)
>  setCPUFeature(FEAT_AES);
>if (hwcap & HWCAP_SHA1)
> @@ -277,22 +279,21 @@ __init_cpu_features_constructor(unsigned long hwcap,
>if (hwcap & HWCAP_SB)
>  setCPUFeature(FEAT_SB);
>if (hwcap & HWCAP_SSBS)
> -setCPUFeature(FEAT_SSBS2);
> -  if (hwcap2 & HWCAP2_MTE) {
> -setCPUFeature(FEAT_MEMTAG);
> -setCPUFeature(FEAT_MEMTAG2);
> -  }
> -  if (hwcap2 & HWCAP2_MTE3) {
> -setCPUFeature(FEAT_MEMTAG);
> -setCPUFeature(FEAT_MEMTAG2);
> +{
> +  setCPUFeature(FEAT_SSBS);
> +  setCPUFeature(FEAT_SSBS2);
> +}
> +  if (hwcap2 & HWCAP2_MTE)
> +{
> +  setCPUFeature(FEAT_MEMTAG);
> +  setCPUFeature(FEAT_MEMTAG2);
> +}
> +  if (hwcap2 & HWCAP2_MTE3)
>  setCPUFeature(FEAT_MEMTAG3);
> -  }
>if (hwcap2 & HWCAP2_SVEAES)
>  setCPUFeature(FEAT_SVE_AES);
> -  if (hwcap2 & HWCAP2_SVEPMULL) {
> -setCPUFeature(FEAT_SVE_AES);
> +  if (hwcap2 & HWCAP2_SVEPMULL)
>  setCPUFeature(FEAT_SVE_PMULL128);
> -  }
>if (hwcap2 & HWCAP2_SVEBITPERM)
>  setCPUFeature(FEAT_SVE_BITPERM);
>if (hwcap2 & HWCAP2_SVESHA3)
> @@ -329,108 +330,76 @@ __init_cpu_features_constructor(unsigned long hwcap,
>  setCPUFeature(FEAT_WFXT);
>if (hwcap2 & HWCAP2_SME)
>  setCPUFeature(FEAT_SME);
> +  if (hwcap2 & HWCAP2_SME2)
> +setCPUFeature(FEAT_SME2);
>if (hwcap2 & HWCAP2_SME_I16I64)
>  setCPUFeature(FEAT_SME_I64);
>if (hwcap2 & HWCAP2_SME_F64F64)
>  

Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion

2024-06-03 Thread Richard Sandiford
Ajit Agarwal  writes:
> Hello Richard:
>
> On 03/06/24 7:47 pm, Richard Sandiford wrote:
>> Ajit Agarwal  writes:
>>> On 03/06/24 5:03 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal  writes:
>>>>>> [...]
>>>>>> If it is intentional, what distinguishes things like vperm and xxinsertw
>>>>>> (and all other unspecs) from plain addition?
>>>>>>
>>>>>>   [(set (match_operand:VSX_F 0 "vsx_register_operand" "=wa")
>>>>>> (plus:VSX_F (match_operand:VSX_F 1 "vsx_register_operand" "wa")
>>>>>>  (match_operand:VSX_F 2 "vsx_register_operand" "wa")))]
>>>>>>
>>>>>
>>>>> Plain addition are not supported currently.
>>>>> We have not seen many cases with plain addition and this patch
>>>>> will not accept plain addition.
>>>>>
>>>>>  
>>>>>> This is why the intention behind the patch is important.  As it stands,
>>>>>> it isn't clear what criteria the patch is using to distinguish "valid"
>>>>>> fuse candidates from "invalid" ones.
>>>>>>
>>>>>
>>>>> Intention behind this patch all variants of UNSPEC instructions are
>>>>> supported and uses without UNSPEC are not supported in this patch.
>>>>
>>>> But why make the distinction this way though?  UNSPEC is a very
>>>> GCC-specific concept.  Whether something is an UNSPEC or some other
>>>> RTL code depends largely on historical accident.  E.g. we have specific
>>>> codes for VEC_SELECT, VEC_MERGE, and VEC_DUPLICATE, but don't have one
>>>> for VEC_PERM (even though VEC_PERM_EXPR exists in gimple).
>>>>
>>>> It seems unlikely that GCC's choice about whether to represent something
>>>> as an UNSPEC or as another RTL code lines up neatly with the kind of
>>>> codegen decisions that a good assembly programmer would make.
>>>>
>>>> I suppose another way of asking is to turn this around and say: what
>>>> kind of uses are you trying to exclude?  Presumably things are worse
>>>> if you remove this function override.  But what makes them worse?
>>>> What kind of uses cause the regression?
>>>>
>>>
>>> Uses of fused load where load with low address uses are modified with load 
>>> with high address uses.
>>>
>>> Similarly load with high address uses are modified with load low address
>>> uses.
>> 
>> It sounds like something is going wrong with the subreg updates.
>> Can you give an example of where this occurs?  For instance...
>> 
>>> This is the semantics of lxvp instructions which can occur through
>>> UNSPEC uses otherwise it breaks the functionality and seen failure
>>> in almost all vect regressions and SPEC benchmarks.
>> 
>> ...could you take one of the simpler vect regressions, show the before
>> and after RTL, and why the transformation is wrong?
>
> Before the change:
>
> (insn 32 30 103 5 (set (reg:V16QI 127 [ _32 ])
> (mem:V16QI (reg:DI 130 [ ivtmp.37 ]) [1 MEM  int> [(short unsigned int *)_55]+0 S16 A128])) {vsx_movv16qi_64bit}
>  (nil))
> (insn 103 32 135 5 (set (reg:V16QI 173 [ _32 ])
> (mem:V16QI (plus:DI (reg:DI 130 [ ivtmp.37 ])
> (const_int 16 [0x10])) [1 MEM  
> [(short unsigned int *)_55]+0 S16 A128])) {vsx_movv16qi_64bit}
>  (nil))
> (insn 135 103 34 5 (set (reg:DI 155)
> (plus:DI (reg:DI 130 [ ivtmp.37 ])
> (const_int 16 [0x10]))) 66 {*adddi3}
>  (nil))
> (insn 34 135 104 5 (set (reg:V16QI 143 [ _27 ])
> (unspec:V16QI [
> (reg:V16QI 127 [ _32 ]) repeated x2
> (reg:V16QI 152)
> ] UNSPEC_VPERM))  {altivec_vperm_v16qi_direct}
>  (expr_list:REG_DEAD (reg:V16QI 127 [ _32 ])
> (nil)))
> (insn 104 34 35 5 (set (reg:V16QI 174 [ _27 ])
> (unspec:V16QI [
> (reg:V16QI 173 [ _32 ]) repeated x2
> (reg:V16QI 152)
> ] UNSPEC_VPERM)) 
>  {altivec_vperm_v16qi_direct}
>
>
> After the change:
>
> (insn 103 30 135 5 (set (reg:OO 127 [ _32 ])
> (mem:OO (reg:DI 130 [ ivtmp.37 ]) [1 MEM  int> [(short unsigned int *)_55]+0 S16 A128])) {*movoo}
>  (nil))
> (insn 135 103 34 5 (set (reg:DI 155)
> (plus:DI (reg:DI 130 [ ivtmp.37 ])
> (const_int 16 [0x10]))) 66 {*adddi3}
>

Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion

2024-06-03 Thread Richard Sandiford
Ajit Agarwal  writes:
> On 03/06/24 5:03 pm, Richard Sandiford wrote:
>> Ajit Agarwal  writes:
>>>> [...]
>>>> If it is intentional, what distinguishes things like vperm and xxinsertw
>>>> (and all other unspecs) from plain addition?
>>>>
>>>>   [(set (match_operand:VSX_F 0 "vsx_register_operand" "=wa")
>>>> (plus:VSX_F (match_operand:VSX_F 1 "vsx_register_operand" "wa")
>>>>(match_operand:VSX_F 2 "vsx_register_operand" "wa")))]
>>>>
>>>
>>> Plain addition are not supported currently.
>>> We have not seen many cases with plain addition and this patch
>>> will not accept plain addition.
>>>
>>>  
>>>> This is why the intention behind the patch is important.  As it stands,
>>>> it isn't clear what criteria the patch is using to distinguish "valid"
>>>> fuse candidates from "invalid" ones.
>>>>
>>>
>>> Intention behind this patch all variants of UNSPEC instructions are
>>> supported and uses without UNSPEC are not supported in this patch.
>> 
>> But why make the distinction this way though?  UNSPEC is a very
>> GCC-specific concept.  Whether something is an UNSPEC or some other
>> RTL code depends largely on historical accident.  E.g. we have specific
>> codes for VEC_SELECT, VEC_MERGE, and VEC_DUPLICATE, but don't have one
>> for VEC_PERM (even though VEC_PERM_EXPR exists in gimple).
>> 
>> It seems unlikely that GCC's choice about whether to represent something
>> as an UNSPEC or as another RTL code lines up neatly with the kind of
>> codegen decisions that a good assembly programmer would make.
>> 
>> I suppose another way of asking is to turn this around and say: what
>> kind of uses are you trying to exclude?  Presumably things are worse
>> if you remove this function override.  But what makes them worse?
>> What kind of uses cause the regression?
>> 
>
> Uses of fused load where load with low address uses are modified with load 
> with high address uses.
>
> Similarly load with high address uses are modified with load low address
> uses.

It sounds like something is going wrong with the subreg updates.
Can you give an example of where this occurs?  For instance...

> This is the semantics of lxvp instructions which can occur through
> UNSPEC uses otherwise it breaks the functionality and seen failure
> in almost all vect regressions and SPEC benchmarks.

...could you take one of the simpler vect regressions, show the before
and after RTL, and why the transformation is wrong?

Thanks,
Richard


Re: [PATCH] ifcvt: Clarify if_info.original_cost.

2024-06-03 Thread Richard Sandiford
Robin Dapp  writes:
> Hi,
>
> before noce_find_if_block processes a block it sets up an if_info
> structure that holds the original costs.  At that point the costs of
> the then/else blocks have not been added so we only care about the
> "if" cost.
>
> The code originally used BRANCH_COST for that but was then changed
> to COSTS_N_INSNS (2) - a compare and a jump.
> This patch computes the jump costs via
>   insn_cost (if_info.jump, ...)
> which is supposed to incorporate the branch costs and, in case of a CC
> comparison,
>   pattern_cost (if_info.cond, ...)
> which is supposed to account for the CC creation.
>
> For compare_and_jump patterns insn_cost should have already computed
> the right cost.
>
> Does this "split" make sense, generally?
>
> Bootstrapped and regtested on x86, aarch64 and power10.  Regtested
> on riscv.
>
> Regards
>  Robin
>
> gcc/ChangeLog:
>
>   * ifcvt.cc (noce_process_if_block): Subtract condition pattern
>   cost if applicable.
>   (noce_find_if_block): Use insn_cost and pattern_cost for
>   original cost.
> ---
>  gcc/ifcvt.cc | 16 ++--
>  1 file changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/ifcvt.cc b/gcc/ifcvt.cc
> index 58ed42673e5..305b9faed38 100644
> --- a/gcc/ifcvt.cc
> +++ b/gcc/ifcvt.cc
> @@ -3940,7 +3940,9 @@ noce_process_if_block (struct noce_if_info *if_info)
>   ??? Actually, instead of the branch instruction costs we might want
>   to use COSTS_N_INSNS (BRANCH_COST ()) as in other places.  */
>  
> -  unsigned potential_cost = if_info->original_cost - COSTS_N_INSNS (1);
> +  unsigned potential_cost = if_info->original_cost;
> +  if (cc_in_cond (if_info->cond))
> +potential_cost -= pattern_cost (if_info->cond, if_info->speed_p);
>unsigned old_cost = if_info->original_cost;
>if (!else_bb
>&& HAVE_conditional_move
> @@ -4703,11 +4705,13 @@ noce_find_if_block (basic_block test_bb, edge 
> then_edge, edge else_edge,
>  = targetm.max_noce_ifcvt_seq_cost (then_edge);
>/* We'll add in the cost of THEN_BB and ELSE_BB later, when we check
>   that they are valid to transform.  We can't easily get back to the insn
> - for COND (and it may not exist if we had to canonicalize to get COND),
> - and jump_insns are always given a cost of 1 by seq_cost, so treat
> - both instructions as having cost COSTS_N_INSNS (1).  */
> -  if_info.original_cost = COSTS_N_INSNS (2);
> -
> + for COND (and it may not exist if we had to canonicalize to get COND).
> + Here we assume one CC compare insn (if the target uses CC) and one
> + jump insn that is costed via insn_cost.  It is assumed that the
> + costs of a jump insn are dependent on the branch costs.  */
> +  if (cc_in_cond (if_info.cond))
> +if_info.original_cost = pattern_cost (if_info.cond, if_info.speed_p);
> +  if_info.original_cost += insn_cost (if_info.jump, if_info.speed_p);
>  
>/* Do the real work.  */

Is there any way we can avoid using pattern_cost here?  Using it means
that we can make use of targetm.insn_cost for the jump but circumvent
it for the condition, giving a bit of a mixed metric.

(I realise there are existing calls to pattern_cost in ifcvt.cc,
but if possible I think we should try to avoid adding more.)

Thanks,
Richard


Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion

2024-06-03 Thread Richard Sandiford
Ajit Agarwal  writes:
>> [...]
>> If it is intentional, what distinguishes things like vperm and xxinsertw
>> (and all other unspecs) from plain addition?
>> 
>>   [(set (match_operand:VSX_F 0 "vsx_register_operand" "=wa")
>> (plus:VSX_F (match_operand:VSX_F 1 "vsx_register_operand" "wa")
>>  (match_operand:VSX_F 2 "vsx_register_operand" "wa")))]
>>
>
> Plain addition are not supported currently.
> We have not seen many cases with plain addition and this patch
> will not accept plain addition.
>
>  
>> This is why the intention behind the patch is important.  As it stands,
>> it isn't clear what criteria the patch is using to distinguish "valid"
>> fuse candidates from "invalid" ones.
>>
>
> The intention behind this patch is that all variants of UNSPEC instructions
> are supported and uses without UNSPEC are not supported in this patch.

But why make the distinction this way though?  UNSPEC is a very
GCC-specific concept.  Whether something is an UNSPEC or some other
RTL code depends largely on historical accident.  E.g. we have specific
codes for VEC_SELECT, VEC_MERGE, and VEC_DUPLICATE, but don't have one
for VEC_PERM (even though VEC_PERM_EXPR exists in gimple).

It seems unlikely that GCC's choice about whether to represent something
as an UNSPEC or as another RTL code lines up neatly with the kind of
codegen decisions that a good assembly programmer would make.

I suppose another way of asking is to turn this around and say: what
kind of uses are you trying to exclude?  Presumably things are worse
if you remove this function override.  But what makes them worse?
What kind of uses cause the regression?

>>> [...]
>>> +  // Given insn_info pair I1 and I2, return true if offsets are in 
>>> order.
>>> +  virtual bool should_handle_unordered_insns (rtl_ssa::insn_info *i1,
>>> + rtl_ssa::insn_info *i2) = 
>>> 0;
>>> +
>>
>> This name seems a bit misleading.  The function is used in:
>>
>> @@ -2401,6 +2405,9 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, 
>> unsigned access_size,
>>reversed = true;
>>  }
>>  
>> +  if (!m_pass->should_handle_unordered_insns (i1, i2))
>> +return false;
>> +
>>rtx cand_mems[2];
>>rtx reg_ops[2];
>>rtx pats[2];
>>
>> and so it acts as a general opt-out.  The insns aren't known to be 
>> unordered.
>>
>> It looks like the rs6000 override requires the original insns to be
>> in offset order.  Could you say why that's necessary?  (Both in email
>> and as a comment in the code.)
>>
>
> Yes rs6000 requires the original load insns to be in offset order.
> Some regression tests like vect-outer-4f fail if we do load pair
> fusion when the load offsets are not in offset order, as this breaks lxvp
> semantics.

 How does it break the semantics though?  In principle, the generic code
 only fuses if it has "proved" that the loads can happen in either order.
 So it shouldn't matter which order the hardware does things in.

 Could you give an example of the kind of situation that you want
 to avoid, and why it generates the wrong result?

>>>
>>> (insn 31 62 32 2 (set (reg:V16QI 177 [ MEM  
>>> [(short unsigned int *)vectp.62_36 + 64B] ])
>>> (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>> (const_int 64 [0x40])) [1 MEM >> int> [(short unsigned int *)vectp.62_36 + 64B]+0 S16 A16]))  
>>> {vsx_movv16qi_64bit}
>>>  (nil))
>>> (insn 32 31 16 2 (set (reg:V16QI 178 [ MEM  
>>> [(short unsigned int *)vectp.62_36 + 80B] ])
>>> (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>> (const_int 80 [0x50])) [1 MEM >> int> [(short unsigned int *)vectp.62_36 + 80B]+0 S16 A16]))  
>>> {vsx_movv16qi_64bit}
>>>  (nil))
>>> (insn 16 32 21 2 (set (reg:V16QI 159 [ MEM  
>>> [(short unsigned int *)vectp.62_36 + 16B] ])
>>> (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>> (const_int 16 [0x10])) [1 MEM >> int> [(short unsigned int *)vectp.62_36 + 16B]+0 S16 A16]))  
>>> {vsx_movv16qi_64bit}
>>>  (nil))
>>> (insn 21 16 22 2 (set (reg:V16QI 165 [ MEM  
>>> [(short unsigned int *)vectp.62_36 + 32B] ])
>>> (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>> (const_int 32 [0x20])) [1 MEM >> int> [(short unsigned int *)vectp.62_36 + 32B]+0 S16 A16])) 
>>> {vsx_movv16qi_64bit}
>>>  (nil))
>>> (insn 22 21 37 2 (set (reg:V16QI 166 [ MEM  
>>> [(short unsigned int *)vectp.62_36 + 48B] ])
>>> (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>> (const_int 48 [0x30])) [1 MEM >> int> [(short unsigned int *)vectp.62_36 + 48B]+0 S16 A16])) 
>>> {vsx_movv16qi_64bit}
>>>  (nil))
>>>
>>> insn 22 and insn 31 is merged in the failure case and breaks the code.
>> 
>> What specifically goes wrong though?  This is just a sequence of loads
>> from 

Re: [PATCH 36/52] aarch64: New hook implementation aarch64_c_mode_for_floating_type

2024-06-03 Thread Richard Sandiford
Kewen Lin  writes:
> This is to remove macros {FLOAT,{,LONG_}DOUBLE}_TYPE_SIZE
> defines in aarch64 port, and add new port specific hook
> implementation aarch64_c_mode_for_floating_type.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (aarch64_c_mode_for_floating_type):
>   New function.
>   (TARGET_C_MODE_FOR_FLOATING_TYPE): New macro.
>   * config/aarch64/aarch64.h (FLOAT_TYPE_SIZE): Remove.
>   (DOUBLE_TYPE_SIZE): Likewise.
>   (LONG_DOUBLE_TYPE_SIZE): Likewise.

OK, thanks.

Richard

> ---
>  gcc/config/aarch64/aarch64.cc | 15 +++
>  gcc/config/aarch64/aarch64.h  |  6 --
>  2 files changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index ee12d8897a8..1c1c278fd69 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -28427,6 +28427,18 @@ aarch64_bitint_type_info (int n, struct bitint_info 
> *info)
>return true;
>  }
>  
> +/* Implement TARGET_C_MODE_FOR_FLOATING_TYPE.  Return TFmode for
> +   TI_LONG_DOUBLE_TYPE which is for long double type, go with the default
> +   one for the others.  */
> +
> +static machine_mode
> +aarch64_c_mode_for_floating_type (enum tree_index ti)
> +{
> +  if (ti == TI_LONG_DOUBLE_TYPE)
> +return TFmode;
> +  return default_mode_for_floating_type (ti);
> +}
> +
>  /* Implement TARGET_SCHED_CAN_SPECULATE_INSN.  Return true if INSN can be
> scheduled for speculative execution.  Reject the long-running division
> and square-root instructions.  */
> @@ -30554,6 +30566,9 @@ aarch64_run_selftests (void)
>  #undef TARGET_C_BITINT_TYPE_INFO
>  #define TARGET_C_BITINT_TYPE_INFO aarch64_bitint_type_info
>  
> +#undef TARGET_C_MODE_FOR_FLOATING_TYPE
> +#define TARGET_C_MODE_FOR_FLOATING_TYPE aarch64_c_mode_for_floating_type
> +
>  #undef  TARGET_EXPAND_BUILTIN
>  #define TARGET_EXPAND_BUILTIN aarch64_expand_builtin
>  
> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> index bbf11faaf4b..2064c23f961 100644
> --- a/gcc/config/aarch64/aarch64.h
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -96,12 +96,6 @@
>  
>  #define LONG_LONG_TYPE_SIZE  64
>  
> -#define FLOAT_TYPE_SIZE  32
> -
> -#define DOUBLE_TYPE_SIZE 64
> -
> -#define LONG_DOUBLE_TYPE_SIZE128
> -
>  /* This value is the amount of bytes a caller is allowed to drop the stack
> before probing has to be done for stack clash protection.  */
>  #define STACK_CLASH_CALLER_GUARD 1024


Re: [PATCH] aarch64: adjust enum writeback after rename

2024-06-03 Thread Richard Sandiford
Marc Poulhiès  writes:
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-ldp-fusion.cc (struct aarch64_pair_fusion):
>   Use new type name.
> ---
> My previous change fixed the generic code, but I forgot to adjust the 
> overload in aarch64.
>
> I don't have an aarch64 setup to check it fixes the build, but will set it up 
> later. Unless it's
> OK to apply it as it's easy enough.

I've tested locally and it builds correctly, so ok now.

Thanks,
Richard

> Marc
>
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 0af927231d3..b255dcbe73c 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -75,9 +75,9 @@ struct aarch64_pair_fusion : public pair_fusion
>  return aarch64_ldp_alias_check_limit;
>}
>  
> -  bool should_handle_writeback (enum writeback which) override final
> +  bool should_handle_writeback (writeback_type which) override final
>{
> -if (which == writeback::ALL)
> +if (which == writeback_type::ALL)
>return aarch64_ldp_writeback > 1;
>  else
>return aarch64_ldp_writeback;


Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion

2024-06-03 Thread Richard Sandiford
Ajit Agarwal  writes:
> Hello Richard:
> On 31/05/24 8:08 pm, Richard Sandiford wrote:
>> Ajit Agarwal  writes:
>>> On 31/05/24 3:23 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal  writes:
>>>>> Hello All:
>>>>>
>>>>> Common infrastructure using generic code for pair mem fusion of different
>>>>> targets.
>>>>>
>>>>> rs6000 target specific specific code implements virtual functions defined
>>>>> by generic code.
>>>>>
>>>>> Code is implemented with pure virtual functions to interface with target
>>>>> code.
>>>>>
>>>>> Target specific code are added in rs6000-mem-fusion.cc and additional 
>>>>> virtual
>>>>> function implementation required for rs6000 are added in 
>>>>> aarch64-ldp-fusion.cc.
>>>>>
>>>>> Bootstrapped and regtested for aarch64-linux-gnu and powerpc64-linux-gnu.
>>>>>
>>>>> Thanks & Regards
>>>>> Ajit
>>>>>
>>>>>
>>>>> aarch64, rs6000, middle-end: Add implementation for different targets for 
>>>>> pair mem fusion
>>>>>
>>>>> Common infrastructure using generic code for pair mem fusion of different
>>>>> targets.
>>>>>
>>>>> rs6000 target specific specific code implements virtual functions defined
>>>>> by generic code.
>>>>>
>>>>> Code is implemented with pure virtual functions to interface with target
>>>>> code.
>>>>>
>>>>> Target specific code are added in rs6000-mem-fusion.cc and additional 
>>>>> virtual
>>>>> function implementation required for rs6000 are added in 
>>>>> aarch64-ldp-fusion.cc.
>>>>>
>>>>> 2024-05-31  Ajit Kumar Agarwal  
>>>>>
>>>>> gcc/ChangeLog:
>>>>>
>>>>>   * config/aarch64/aarch64-ldp-fusion.cc: Add target specific
>>>>>   implementation of additional virtual functions added in pair_fusion
>>>>>   struct.
>>>>>   * config/rs6000/rs6000-passes.def: New mem fusion pass
>>>>>   before pass_early_remat.
>>>>>   * config/rs6000/rs6000-mem-fusion.cc: Add new pass.
>>>>>   Add target specific implementation using pure virtual
>>>>>   functions.
>>>>>   * config.gcc: Add new object file.
>>>>>   * config/rs6000/rs6000-protos.h: Add new prototype for mem
>>>>>   fusion pass.
>>>>>   * config/rs6000/t-rs6000: Add new rule.
>>>>>   * rtl-ssa/accesses.h: Moved set_is_live_out_use as public
>>>>>   from private.
>>>>>
>>>>> gcc/testsuite/ChangeLog:
>>>>>
>>>>>   * g++.target/powerpc/me-fusion.C: New test.
>>>>>   * g++.target/powerpc/mem-fusion-1.C: New test.
>>>>>   * gcc.target/powerpc/mma-builtin-1.c: Modify test.
>>>>> ---
>>>>
>>>> This isn't a complete review, just some initial questions & comments
>>>> about selected parts.
>>>>
>>>>> [...]
>>>>> +/* Check whether load can be fusable or not.
>>>>> +   Return true if dependent use is UNSPEC otherwise false.  */
>>>>> +bool
>>>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>>>>> +{
>>>>> +  rtx_insn *insn = info->rtl ();
>>>>> +
>>>>> +  for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1))
>>>>> +if (REG_NOTE_KIND (note) == REG_EQUAL
>>>>> + || REG_NOTE_KIND (note) == REG_EQUIV)
>>>>> +  return false;
>>>>
>>>> It's unusual to punt on an optimisation because of a REG_EQUAL/EQUIV
>>>> note.  What's the reason for doing this?  Are you trying to avoid
>>>> fusing pairs before reload that are equivalent to a MEM (i.e. have
>>>> a natural spill slot)?  I think Alex hit a similar situation.
>>>>
>>>
>>> We have used the above check because of some SPEC benchmarks failing with
>>> with MEM pairs having REG_EQUAL/EQUIV notes.
>>>
>>> By adding the checks the benchmarks passes and also it improves the
>>> performance.
>>>
>>> This checks were added during initial implementation of pair fusion
>>> pass.
>>>
>>> I wil

Re: [PATCH] pair-fusion: fix for older GCC

2024-06-03 Thread Richard Sandiford
Marc Poulhiès  writes:
> Older GCCs fail with:
>
>   .../gcc/pair-fusion.cc: In member function ‘bool 
> pair_fusion_bb_info::fuse_pair(bool, unsigned int, int, rtl_ssa::insn_info*, 
> rtl_ssa::in
>   sn_info*, base_cand&, const rtl_ssa::insn_range_info&)’:
>   .../gcc/pair-fusion.cc:1790:40: error: ‘writeback’ is not a class, 
> namespace, or enumeration
>  if (m_pass->should_handle_writeback (writeback::ALL)
>
> Renaming the enum type works around the name conflict with the local
> variable and also prevents future similar conflicts.
>
> gcc/ChangeLog:
>
>   * pair-fusion.h (enum class writeback): Rename to...
>   (enum class writeback_type): ...this.
>   (struct pair_fusion): Adjust type name after renaming.
>   * pair-fusion.cc (pair_fusion_bb_info::track_access): Likewise.
>   (pair_fusion_bb_info::fuse_pair): Likewise.
>   (pair_fusion::process_block): Likewise.

OK, thanks, and sorry for missing this during the review.

Richard

> ---
> Patch discussed in 
> https://inbox.sourceware.org/gcc-patches/mptwmn93njq@arm.com/
>
> Tested on x86_64-linux-gnu. OK for master?
>
>  gcc/pair-fusion.cc | 6 +++---
>  gcc/pair-fusion.h  | 4 ++--
>  2 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc
> index 9f897ac04e2..26b2284ed37 100644
> --- a/gcc/pair-fusion.cc
> +++ b/gcc/pair-fusion.cc
> @@ -426,7 +426,7 @@ pair_fusion_bb_info::track_access (insn_info *insn, bool 
> load_p, rtx mem)
>  return;
>  
>// Ignore writeback accesses if the hook says to do so.
> -  if (!m_pass->should_handle_writeback (writeback::EXISTING)
> +  if (!m_pass->should_handle_writeback (writeback_type::EXISTING)
>&& GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
>  return;
>  
> @@ -1787,7 +1787,7 @@ pair_fusion_bb_info::fuse_pair (bool load_p,
>// update of the base register and try and fold it in to make this into a
>// writeback pair.
>insn_info *trailing_add = nullptr;
> -  if (m_pass->should_handle_writeback (writeback::ALL)
> +  if (m_pass->should_handle_writeback (writeback_type::ALL)
>&& !writeback_effect
>&& (!load_p || (!refers_to_regno_p (base_regno, base_regno + 1,
>XEXP (pats[0], 0), nullptr)
> @@ -2996,7 +2996,7 @@ void pair_fusion::process_block (bb_info *bb)
>rtx pat = PATTERN (rti);
>bool load_p;
>if (reload_completed
> -   && should_handle_writeback (writeback::ALL)
> +   && should_handle_writeback (writeback_type::ALL)
> && pair_mem_insn_p (rti, load_p))
>   try_promote_writeback (insn, load_p);
>  
> diff --git a/gcc/pair-fusion.h b/gcc/pair-fusion.h
> index 2a38dc8f743..45e4edceecb 100644
> --- a/gcc/pair-fusion.h
> +++ b/gcc/pair-fusion.h
> @@ -75,7 +75,7 @@ struct alias_walker;
>  
>  // When querying should_handle_writeback, this enum is used to
>  // qualify which opportunities we are asking about.
> -enum class writeback {
> +enum class writeback_type {
>// Only those writeback opportunities that arise from existing
>// auto-increment accesses.
>EXISTING,
> @@ -123,7 +123,7 @@ struct pair_fusion {
>// Return true if we should try to handle writeback opportunities.
>// WHICH determines the kinds of writeback opportunities the caller
>// is asking about.
> -  virtual bool should_handle_writeback (enum writeback which) = 0;
> +  virtual bool should_handle_writeback (writeback_type which) = 0;
>  
>// Given BASE_MEM, the mem from the lower candidate access for a pair,
>// and LOAD_P (true if the access is a load), check if we should proceed


Re: [Patch, aarch64, middle-end] v4: Move pair_fusion pass from aarch64 to middle-end

2024-05-31 Thread Richard Sandiford
Marc Poulhiès  writes:
> Hello,
>
> I can't bootstrap using gcc 5.5 since this change. It fails with:
>
> .../gcc/pair-fusion.cc: In member function ‘bool 
> pair_fusion_bb_info::fuse_pair(bool, unsigned int, int, rtl_ssa::insn_info*, 
> rtl_ssa::in
> sn_info*, base_cand&, const rtl_ssa::insn_range_info&)’:
> .../gcc/pair-fusion.cc:1790:40: error: ‘writeback’ is not a class, namespace, 
> or enumeration
>if (m_pass->should_handle_writeback (writeback::ALL)
> ^
> Is it possible that C++11 enum classes are not correctly supported in
> older GCC?

Looks to be due to an overloading of "writeback", which is also a local
variable in that function.

One fix would be to rename the type to "writeback_type".
FWIW, the "enum"s in "enum writeback" can also be removed,
so it'd be s/enum writeback/writeback_type/.

Richard


Re: [PATCH] AArch64: Add ACLE MOPS support

2024-05-31 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Hi Richard,
>
>> I think this should be in a push_options/pop_options block, as for other
>> intrinsics that require certain features.
>
> But then the intrinsic would always be defined, which is contrary to what the
> ACLE spec demands - it would not give a compilation error at the callsite
> but give assembler errors (potentially in different functions after inlining).

Inlining will fail with an error if the callsite doesn't have the right
features.  E.g.: https://godbolt.org/z/7zz59PhTE

The error message isn't great, but it is at least an error. :)
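
To make that concrete, I was thinking of something like the below (the
"+mops+memtag" feature string is a guess on my part, so please take it
as a sketch rather than the exact incantation):

  #pragma GCC push_options
  #pragma GCC target ("+mops+memtag")

  __extension__ extern __inline void *
  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
  __arm_mops_memset_tag (void *__ptr, int __val, size_t __size)
  {
    void *__ptr2 = __ptr;
    __asm volatile ("setgp\t[%0]!, %1!, %x2\n\t"
                    "setgm\t[%0]!, %1!, %x2\n\t"
                    "setge\t[%0]!, %1!, %x2"
                    : "+r" (__ptr2), "+r" (__size)
                    : "rZ" (__val) : "cc", "memory");
    return __ptr;
  }

  #pragma GCC pop_options

A caller whose target doesn't include the features then gets a hard error
at the (attempted) inlining point, as in the godbolt link above, rather
than an assembler error in whatever function the code ends up in.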

>> What was the reason for using an inline asm rather than a builtin?
>> Feels a bit old school. :)  Using a builtin should mean that the
>> RTL optimisers see the extent of the write.
>
> Given this intrinsic will be used very rarely, if ever, it does not make sense
> to provide anything more than the basic functionality.

But a lot of effort went into making the old inline asm ACLE
implementations use builtins instead.  It even seems to have been
a complete transition.  (Although we still have:

/* Start of temporary inline asm implementations.  */
...
/* End of temporary inline asm.  */

heh.)

So this feels like a regression in terms of implementation methodology.

I won't object if another maintainer approves the function in this form,
but I'd only be comfortable approving a builtin version.

Thanks,
Richard


Re: [PATCH] AArch64: Add ACLE MOPS support

2024-05-31 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Add __ARM_FEATURE_MOPS predefine.  Add support for ACLE __arm_mops_memset_tag.
>
> Passes regress, OK for commit?
>
> gcc:
> * config/aaarch64/aarch64-c.cc (aarch64_update_cpp_builtins):
> Add __ARM_FEATURE_MOPS predefine.
> * config/aarch64/arm_acle.h: Add __arm_mops_memset_tag().
>
> gcc/testsuite:
> * gcc.target/aarch64/acle/memtag_5.c: Add new test.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64-c.cc b/gcc/config/aarch64/aarch64-c.cc
> index 
> fe1a20e4e546a68e5f7eddff3bbb0d3e831fbd9b..884a7ba5d10b58fbe182a765041cf80bdaec9615
>  100644
> --- a/gcc/config/aarch64/aarch64-c.cc
> +++ b/gcc/config/aarch64/aarch64-c.cc
> @@ -260,6 +260,7 @@ aarch64_update_cpp_builtins (cpp_reader *pfile)
>aarch64_def_or_undef (TARGET_SME_I16I64, "__ARM_FEATURE_SME_I16I64", 
> pfile);
>aarch64_def_or_undef (TARGET_SME_F64F64, "__ARM_FEATURE_SME_F64F64", 
> pfile);
>aarch64_def_or_undef (TARGET_SME2, "__ARM_FEATURE_SME2", pfile);
> +  aarch64_def_or_undef (TARGET_MOPS, "__ARM_FEATURE_MOPS", pfile);
>  
>/* Not for ACLE, but required to keep "float.h" correct if we switch
>   target between implementations that do or do not support ARMv8.2-A
> diff --git a/gcc/config/aarch64/arm_acle.h b/gcc/config/aarch64/arm_acle.h
> index 
> 2aa681090fa205449cf1ac63151565f960716189..22ee4b211a55ca6537a1d9e3bf4dad09585071fb
>  100644
> --- a/gcc/config/aarch64/arm_acle.h
> +++ b/gcc/config/aarch64/arm_acle.h
> @@ -344,6 +344,21 @@ __rndrrs (uint64_t *__res)
>  
>  #pragma GCC pop_options
>  
> +#if defined (__ARM_FEATURE_MOPS) && defined (__ARM_FEATURE_MEMORY_TAGGING)
> +__extension__ extern __inline void *
> +__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
> +__arm_mops_memset_tag (void *__ptr, int __val, size_t __size)
> +{
> +  void *__ptr2 = __ptr;
> +  __asm volatile ("setgp\t[%0]!, %1!, %x2\n\t"
> +   "setgm\t[%0]!, %1!, %x2\n\t"
> +   "setge\t[%0]!, %1!, %x2"
> +   : "+r" (__ptr2), "+r" (__size)
> +   : "rZ" (__val) : "cc", "memory");
> +  return __ptr;
> +}
> +#endif
> +

I think this should be in a push_options/pop_options block, as for other
intrinsics that require certain features.
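
I.e. something like the following shape (the exact feature string below
is only a guess for illustration; the body is as in the patch):

  #pragma GCC push_options
  #pragma GCC target "+mops+memtag"

  __extension__ extern __inline void *
  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
  __arm_mops_memset_tag (void *__ptr, int __val, size_t __size)
  {
    void *__ptr2 = __ptr;
    __asm volatile ("setgp\t[%0]!, %1!, %x2\n\t"
                    "setgm\t[%0]!, %1!, %x2\n\t"
                    "setge\t[%0]!, %1!, %x2"
                    : "+r" (__ptr2), "+r" (__size)
                    : "rZ" (__val) : "cc", "memory");
    return __ptr;
  }

  #pragma GCC pop_options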

What was the reason for using an inline asm rather than a builtin?
Feels a bit old school. :)  Using a builtin should mean that the
RTL optimisers see the extent of the write.

Thanks,
Richard

>  #define __arm_rsr(__regname) \
>__builtin_aarch64_rsr (__regname)
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/acle/memtag_5.c 
> b/gcc/testsuite/gcc.target/aarch64/acle/memtag_5.c
> new file mode 100644
> index 
> ..79ba1eb39d7c6d577fbe98a3285f8cc618428823
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/acle/memtag_5.c
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=armv8.8-a+memtag -O2" } */
> +/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
> +
> +#include "arm_acle.h"
> +
> +#ifndef __ARM_FEATURE_MOPS
> +# error __ARM_FEATURE_MOPS not defined!
> +#endif
> +
> +/*
> +** set_tag:
> +**   mov (x[0-9]+), x0
> +**   setgp   \[\1\]\!, x1\!, xzr
> +**   setgm   \[\1\]\!, x1\!, xzr
> +**   setge   \[\1\]\!, x1\!, xzr
> +**   ret
> +*/
> +void *set_tag (void *p, size_t size)
> +{
> +  return __arm_mops_memset_tag (p, 0, size);
> +}


Re: [PATCH] testsuite: Improve check-function-bodies

2024-05-31 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Improve check-function-bodies by allowing single-character function names.
> Also skip '#' comments which may be emitted from inline assembler.
>
> Passes regress, OK for commit?
>
> gcc/testsuite:
> * lib/scanasm.exp (configure_check-function-bodies): Allow single-char
> function names.  Skip '#' comments.
>
> ---
>
> diff --git a/gcc/testsuite/lib/scanasm.exp b/gcc/testsuite/lib/scanasm.exp
> index 
> 6cf9997240deec274a191103d21690d80e34ba95..0e461ef260b7a6fee5a9c60d0571e46468f752c0
>  100644
> --- a/gcc/testsuite/lib/scanasm.exp
> +++ b/gcc/testsuite/lib/scanasm.exp
> @@ -869,15 +869,15 @@ proc configure_check-function-bodies { config } {
>  # Regexp for the start of a function definition (name in \1).
>  if { [istarget nvptx*-*-*] } {
>   set up_config(start) {
> - {^// BEGIN(?: GLOBAL|) FUNCTION DEF: ([a-zA-Z_]\S+)$}
> + {^// BEGIN(?: GLOBAL|) FUNCTION DEF: ([a-zA-Z_]\S*)$}
>   }
>  } elseif { [istarget *-*-darwin*] } {
>   set up_config(start) {
> - {^_([a-zA-Z_]\S+):$}
> + {^_([a-zA-Z_]\S*):$}
>   {^LFB[0-9]+:}
>   }
>  } else {
> - set up_config(start) {{^([a-zA-Z_]\S+):$}}
> + set up_config(start) {{^([a-zA-Z_]\S*):$}}
>  }
>  
>  # Regexp for the end of a function definition.

This part is ok, thanks.

> @@ -899,9 +899,9 @@ proc configure_check-function-bodies { config } {
>  } else {
>   # Skip lines beginning with labels ('.L[...]:') or other directives
>   # ('.align', '.cfi_startproc', '.quad [...]', '.text', etc.), '//' or
> - # '@' comments ('-fverbose-asm' or ARM-style, for example), or empty
> - # lines.
> - set up_config(fluff) {^\s*(?:\.|//|@|$)}
> + # '@' or '#' comments ('-fverbose-asm' or ARM-style, for example), or
> + # empty lines.
> + set up_config(fluff) {^\s*(?:\.|//|@|#|$)}
>  }
>  
>  # Regexp for expected output lines prefix.

I think this should be done separately.  It looks like at least
gcc.target/riscv/target-attr-06.c relies on the current behaviour.

Richard


Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion

2024-05-31 Thread Richard Sandiford
Ajit Agarwal  writes:
> On 31/05/24 3:23 pm, Richard Sandiford wrote:
>> Ajit Agarwal  writes:
>>> Hello All:
>>>
>>> Common infrastructure using generic code for pair mem fusion of different
>>> targets.
>>>
>>> rs6000 target specific code implements virtual functions defined
>>> by generic code.
>>>
>>> Code is implemented with pure virtual functions to interface with target
>>> code.
>>>
>>> Target specific code are added in rs6000-mem-fusion.cc and additional 
>>> virtual
>>> function implementation required for rs6000 are added in 
>>> aarch64-ldp-fusion.cc.
>>>
>>> Bootstrapped and regtested for aarch64-linux-gnu and powerpc64-linux-gnu.
>>>
>>> Thanks & Regards
>>> Ajit
>>>
>>>
>>> aarch64, rs6000, middle-end: Add implementation for different targets for 
>>> pair mem fusion
>>>
>>> Common infrastructure using generic code for pair mem fusion of different
>>> targets.
>>>
>>> rs6000 target specific code implements virtual functions defined
>>> by generic code.
>>>
>>> Code is implemented with pure virtual functions to interface with target
>>> code.
>>>
>>> Target specific code are added in rs6000-mem-fusion.cc and additional 
>>> virtual
>>> function implementation required for rs6000 are added in 
>>> aarch64-ldp-fusion.cc.
>>>
>>> 2024-05-31  Ajit Kumar Agarwal  
>>>
>>> gcc/ChangeLog:
>>>
>>> * config/aarch64/aarch64-ldp-fusion.cc: Add target specific
>>> implementation of additional virtual functions added in pair_fusion
>>> struct.
>>> * config/rs6000/rs6000-passes.def: New mem fusion pass
>>> before pass_early_remat.
>>> * config/rs6000/rs6000-mem-fusion.cc: Add new pass.
>>> Add target specific implementation using pure virtual
>>> functions.
>>> * config.gcc: Add new object file.
>>> * config/rs6000/rs6000-protos.h: Add new prototype for mem
>>> fusion pass.
>>> * config/rs6000/t-rs6000: Add new rule.
>>> * rtl-ssa/accesses.h: Moved set_is_live_out_use as public
>>> from private.
>>>
>>> gcc/testsuite/ChangeLog:
>>>
>>> * g++.target/powerpc/me-fusion.C: New test.
>>> * g++.target/powerpc/mem-fusion-1.C: New test.
>>> * gcc.target/powerpc/mma-builtin-1.c: Modify test.
>>> ---
>> 
>> This isn't a complete review, just some initial questions & comments
>> about selected parts.
>> 
>>> [...]
>>> +/* Check whether load can be fusable or not.
>>> +   Return true if dependent use is UNSPEC otherwise false.  */
>>> +bool
>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>>> +{
>>> +  rtx_insn *insn = info->rtl ();
>>> +
>>> +  for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1))
>>> +if (REG_NOTE_KIND (note) == REG_EQUAL
>>> +   || REG_NOTE_KIND (note) == REG_EQUIV)
>>> +  return false;
>> 
>> It's unusual to punt on an optimisation because of a REG_EQUAL/EQUIV
>> note.  What's the reason for doing this?  Are you trying to avoid
>> fusing pairs before reload that are equivalent to a MEM (i.e. have
>> a natural spill slot)?  I think Alex hit a similar situation.
>> 
>
> We have used the above check because of some SPEC benchmarks failing with
> with MEM pairs having REG_EQUAL/EQUIV notes.
>
> By adding the checks the benchmarks passes and also it improves the
> performance.
>
> This checks were added during initial implementation of pair fusion
> pass.
>
> I will investigate further if this check is still required or not.

Thanks.  If it does affect SPEC results, it would be good to look
at the underlying reason, as a justification for the check.

AIUI, the case Alex hit was due to the way that the RA recognises:

  (set (reg R) (mem address-of-a-stack-variable))
REG_EQUIV: (mem address-of-a-stack-variable)

where the REG_EQUIV is either explicit or detected by the RA.
If R needs to be spilled, it can then be spilled to its existing
location on the stack.  And if R needs to be spilled in the
instruction above (because of register pressure before the first
use of R), the RA is able to delete the instruction.

But if that is the reason, the condition should be restricted
to cases in which the note is a memory.
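
I.e. something like (fragment only, as a sketch of the narrower check):

  for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1))
    if ((REG_NOTE_KIND (note) == REG_EQUAL
         || REG_NOTE_KIND (note) == REG_EQUIV)
        /* Only punt when the equivalent value is a MEM, i.e. when the
           register has a natural spill slot.  */
        && MEM_P (XEXP (note, 0)))
      return false;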

I think Alex had tried something similar and found that it wasn't
always effective.

> [...

Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion

2024-05-31 Thread Richard Sandiford
Reviewing my review :)

Richard Sandiford  writes:
>> +
>> +  for (auto def : info->defs ())
>> +{
>> +  auto set = dyn_cast (def);
>> +  if (set && set->has_any_uses ())
>> +{
>> +  for (auto use : set->all_uses())
>
> Nit: has_any_uses isn't necessary: the inner loop will simply do nothing
> in that case.  Also, we can/should restrict the scan to non-debug uses.
>
> This can then be:
>
>   for (auto def : info->defs ())
> if (auto set = dyn_cast (def))
>   for (auto use : set->nondebug_insn_uses())

I forgot the space before "()" in the line above.

>
>> +{
>> +  if (use->insn ()->is_artificial ())
>> +return false;
>> +
>> +   insn_info *info = use->insn ();
>> +
>> +   if (info
>> +   && info->rtl ()
>
> This test shouldn't be necessary.
>
>> +   && info->is_real ())
>> +  {
>> +rtx_insn *rtl_insn = info->rtl ();
>> +rtx set = single_set (rtl_insn);
>> +
>> +if (set == NULL_RTX)
>> +  return false;
>> +
>> +rtx op0 = SET_SRC (set);
>> +if (GET_CODE (op0) != UNSPEC)
>> +  return false;
> [...]
> Also, using single_set means that the function still lets through
> parallels of two sets in which the sources are unspecs.  Is that
> intentional?

I got this wrong, sorry.  You return false for non-single_set,
so that particular problem doesn't arise.  But why do we want to
reject uses of registers that are set by parallel sets?

Thanks,
Richard


Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion

2024-05-31 Thread Richard Sandiford
Ajit Agarwal  writes:
> Hello All:
>
> Common infrastructure using generic code for pair mem fusion of different
> targets.
>
> rs6000 target specific code implements virtual functions defined
> by generic code.
>
> Code is implemented with pure virtual functions to interface with target
> code.
>
> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
> function implementation required for rs6000 are added in 
> aarch64-ldp-fusion.cc.
>
> Bootstrapped and regtested for aarch64-linux-gnu and powerpc64-linux-gnu.
>
> Thanks & Regards
> Ajit
>
>
> aarch64, rs6000, middle-end: Add implementation for different targets for 
> pair mem fusion
>
> Common infrastructure using generic code for pair mem fusion of different
> targets.
>
> rs6000 target specific code implements virtual functions defined
> by generic code.
>
> Code is implemented with pure virtual functions to interface with target
> code.
>
> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
> function implementation required for rs6000 are added in 
> aarch64-ldp-fusion.cc.
>
> 2024-05-31  Ajit Kumar Agarwal  
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-ldp-fusion.cc: Add target specific
>   implementation of additional virtual functions added in pair_fusion
>   struct.
>   * config/rs6000/rs6000-passes.def: New mem fusion pass
>   before pass_early_remat.
>   * config/rs6000/rs6000-mem-fusion.cc: Add new pass.
>   Add target specific implementation using pure virtual
>   functions.
>   * config.gcc: Add new object file.
>   * config/rs6000/rs6000-protos.h: Add new prototype for mem
>   fusion pass.
>   * config/rs6000/t-rs6000: Add new rule.
>   * rtl-ssa/accesses.h: Moved set_is_live_out_use as public
>   from private.
>
> gcc/testsuite/ChangeLog:
>
>   * g++.target/powerpc/me-fusion.C: New test.
>   * g++.target/powerpc/mem-fusion-1.C: New test.
>   * gcc.target/powerpc/mma-builtin-1.c: Modify test.
> ---

This isn't a complete review, just some initial questions & comments
about selected parts.

> [...]
> +/* Check whether load can be fusable or not.
> +   Return true if dependent use is UNSPEC otherwise false.  */
> +bool
> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
> +{
> +  rtx_insn *insn = info->rtl ();
> +
> +  for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1))
> +if (REG_NOTE_KIND (note) == REG_EQUAL
> + || REG_NOTE_KIND (note) == REG_EQUIV)
> +  return false;

It's unusual to punt on an optimisation because of a REG_EQUAL/EQUIV
note.  What's the reason for doing this?  Are you trying to avoid
fusing pairs before reload that are equivalent to a MEM (i.e. have
a natural spill slot)?  I think Alex hit a similar situation.

> +
> +  for (auto def : info->defs ())
> +{
> +  auto set = dyn_cast (def);
> +  if (set && set->has_any_uses ())
> + {
> +   for (auto use : set->all_uses())

Nit: has_any_uses isn't necessary: the inner loop will simply do nothing
in that case.  Also, we can/should restrict the scan to non-debug uses.

This can then be:

  for (auto def : info->defs ())
if (auto set = dyn_cast (def))
  for (auto use : set->nondebug_insn_uses())

> + {
> +   if (use->insn ()->is_artificial ())
> + return false;
> +
> +insn_info *info = use->insn ();
> +
> +if (info
> +&& info->rtl ()

This test shouldn't be necessary.

> +&& info->is_real ())
> +   {
> + rtx_insn *rtl_insn = info->rtl ();
> + rtx set = single_set (rtl_insn);
> +
> + if (set == NULL_RTX)
> +   return false;
> +
> + rtx op0 = SET_SRC (set);
> + if (GET_CODE (op0) != UNSPEC)
> +   return false;

What's the motivation for rejecting unspecs?  It's unusual to treat
all unspecs as a distinct group.

Also, using single_set means that the function still lets through
parallels of two sets in which the sources are unspecs.  Is that
intentional?

The reasons behind things like the REG_EQUAL/EQUIV and UNSPEC decisions
need to be described in comments, so that other people coming to this
code later can understand the motivation.  The same thing applies to
other decisions in the patch.

> +   }
> +   }
> +   }
> +}
> +  return true;
> +}
> [...]
> diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc
> index 9f897ac04e2..2dbe9f854ef 100644
> --- a/gcc/pair-fusion.cc
> +++ b/gcc/pair-fusion.cc
> @@ -312,7 +312,7 @@ static int
>  encode_lfs (lfs_fields fields)
>  {
>int size_log2 = exact_log2 (fields.size);
> -  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 4);
> +  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 6);
>return ((int)fields.load_p << 3)
>  | ((int)fields.fpsimd_p << 2)
>  | (size_log2 - 2);

The point of the assert 

Re: [PATCH 2/4] resource.cc: Replace calls to find_basic_block with cfgrtl BLOCK_FOR_INSN

2024-05-31 Thread Richard Sandiford
Hans-Peter Nilsson  writes:
> [...]
> (Not-so-)fun fact: add_insn_after takes a bb parameter which
> reorg.cc always passes as NULL.  But - the argument is
> *always ignored* and the bb in the "after" insn is used.
> I traced that ignored parameter as far as
> r0-81421-g6fb5fa3cbc0d78 "Merge dataflow branch into
> mainline" when it was added.  I *guess* it's an artifact
> left over from some idea explored on that branch.  Ripe for
> obvious cleanup by removal everywhere.

Heh.  I wondered whether there'd be some direct callers of
add_insn_after_nobb that relied on the block *not* being updated
for some reason, but thankfully not.  The only two callers seem
to be add_insn_after and emit_note_after.  But then emit_note_after
handles notes slightly differently from add_insn_after, even though
logically, emitting an existing note should work in the same way
as emitting a new note.

So yeah, like you say, ripe for cleanup :)

Richard


Re: [PATCH 01/11] OpenMP/PolyInt: Pass poly-int structures by address to OMP libs.

2024-05-31 Thread Richard Sandiford
Jakub Jelinek  writes:
> On Fri, May 31, 2024 at 08:45:54AM +0100, Richard Sandiford wrote:
>> > When you say same way, do you mean the way SVE ABI defines the rules for 
>> > SVE types?
>> 
>> No, sorry, I meant that if the choice isn't purely local to a source
>> code function, the condition should be something like sizeless_type_p
>> (suitably abstracted) rather than POLY_INT_CST_P.  That way, the "ABI"
>> stays the same regardless of -msve-vector-bits.
>
> There is no ABI, it is how the caller and indirect callee communicate,
> but both parts are compiled with the same compiler, so it can choose
> differently based on different compiler version etc.
> It is effectively simplified:
> struct whatever { ... };
> void callee (void *x) { struct whatever *w = *x; use *w; }
> void caller (void) { struct whatever w; fill in w; ABI_call (callee, ); }
> (plus in some cases the callee can also update values and propagate that
> back to caller).
> In any case, it is a similar "ABI" to e.g. tree-nested.cc communication
> between caller and nested callee, how exactly are the variables laid out
> in a struct depends on compiler version and whatever it decides, same
> compiler then emits both sides.

Ah, ok, thanks.  In that case I guess POLY_INT_CST_P should be
safe/correct after all.

Richard


Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion

2024-05-31 Thread Richard Sandiford
Segher Boessenkool  writes:
> Hi!
>
> On Fri, May 31, 2024 at 01:21:44AM +0530, Ajit Agarwal wrote:
>> Code is implemented with pure virtual functions to interface with target
>> code.
>
> It's not a pure function.  A pure function -- by definition -- has no
> side effects.  These things have side effects.
>
> What you mean is this is *an implementation* for C++ functions without
> a generic implementation.  An obfuscation some people (like me) would
> say.  But please call things what they are!  So not "pure function".
> That has a meaning, and this isn't it.

"pure virtual function" is an established term.  The "pure" modifies
"virtual", not "function".

The description is correct because the patch adds pure virtual functions
to the base class and expects the derived class to override and implement
them.
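
For the avoidance of doubt, this is the usual C++ shape (paraphrased
here, not the exact declarations from the patch):

  class insn_info;

  struct pair_fusion
  {
    // The "= 0" is what makes the virtual function pure.
    virtual bool fuseable_load_p (insn_info *) = 0;
    virtual ~pair_fusion () {}
  };

  struct rs6000_pair_fusion : pair_fusion
  {
    bool fuseable_load_p (insn_info *) override { return true; }
  };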

>>  * config/aarch64/aarch64-ldp-fusion.cc: Add target specific
>>  implementation of additional virtual functions added in pair_fusion
>>  struct.
>
> This does not belong in this patch.  Do not send "rs6000" patches that
> touch anything outside of config/rs6000/ and similar, certainly not in
> config/something-else/!
>
> This would be WAY easier to review (read: AT ALL POSSIBLE) if you
> included some detailed rationale and design document.

Please don't shout.

I don't think this kind of aggressive review is helpful to the project.

Richard


Re: [PATCH 01/11] OpenMP/PolyInt: Pass poly-int structures by address to OMP libs.

2024-05-31 Thread Richard Sandiford
Tejas Belagod  writes:
> On 5/30/24 6:28 PM, Richard Sandiford wrote:
>> Tejas Belagod  writes:
>>> Currently poly-int type structures are passed by value to OpenMP runtime
>>> functions for shared clauses etc.  This patch improves on this by passing
>>> around poly-int structures by address to avoid copy-overhead.
>>>
>>> gcc/ChangeLog
>>> * omp-low.c (use_pointer_for_field): Use pointer if the OMP data
>>> structure's field type is a poly-int.
>>> ---
>>>   gcc/omp-low.cc | 3 ++-
>>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
>>> index 1a65229cc37..b15607f4ef5 100644
>>> --- a/gcc/omp-low.cc
>>> +++ b/gcc/omp-low.cc
>>> @@ -466,7 +466,8 @@ static bool
>>>   use_pointer_for_field (tree decl, omp_context *shared_ctx)
>>>   {
>>> if (AGGREGATE_TYPE_P (TREE_TYPE (decl))
>>> -  || TYPE_ATOMIC (TREE_TYPE (decl)))
>>> +  || TYPE_ATOMIC (TREE_TYPE (decl))
>>> +  || POLY_INT_CST_P (DECL_SIZE (decl)))
>>>   return true;
>>>   
>>> /* We can only use copy-in/copy-out semantics for shared variables
>> 
>
> Thanks for the reviews.
>
>> Realise this is also true of my original patch, but:
>> 
>> I suppose a question here is whether this function is only ever used for
>> local interfaces between code generated by the same source code function,
>> or whether it's ABI in a more general sense.  
>
> I'm not a 100% sure, but AFAICS, 'use_pointer_for_field' seems to be 
> used only for local interface between source and generated functions. I 
> don't see any backend hooks into this or backend hooking into this 
> function for general ABI. Ofcourse, I'm not the expert on OMP lowering, 
> so it would be great to get an expert opinion on this.
>
>> If the latter, I suppose
>> we should make sure to handle ACLE types the same way regardless of
>> whether the SVE vector size is known.
>> 
>
> When you say same way, do you mean the way SVE ABI defines the rules for 
> SVE types?

No, sorry, I meant that if the choice isn't purely local to a source
code function, the condition should be something like sizeless_type_p
(suitably abstracted) rather than POLY_INT_CST_P.  That way, the "ABI"
stays the same regardless of -msve-vector-bits.

Thanks,
Richard


Re: [PATCH 4/4]AArch64: enable new predicate tuning for Neoverse cores.

2024-05-30 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> This enables the new tuning flag for Neoverse V1, Neoverse V2 and Neoverse N2.
> It is kept off for generic codegen.
>
> Note the reason for the +sve even though they are in aarch64-sve.exp is if the
> testsuite is run with a forced SVE off option, e.g. -march=armv8-a+nosve then
> the intrinsics end up being disabled because the -march is preferred over the
> -mcpu even though the -mcpu comes later.
>
> This prevents the tests from failing in such runs.

IMO we should just skip aarch64-sve.exp if the options explicitly disable
SVE.  But that's separate work.  I'll try it once this patch is in.

> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/tuning_models/neoversen2.h (neoversen2_tunings): Add
>   AARCH64_EXTRA_TUNE_AVOID_PRED_RMW.
>   * config/aarch64/tuning_models/neoversev1.h (neoversev1_tunings): Add
>   AARCH64_EXTRA_TUNE_AVOID_PRED_RMW.
>   * config/aarch64/tuning_models/neoversev2.h (neoversev2_tunings): Add
>   AARCH64_EXTRA_TUNE_AVOID_PRED_RMW.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/sve/pred_clobber_1.c: New test.
>   * gcc.target/aarch64/sve/pred_clobber_2.c: New test.
>   * gcc.target/aarch64/sve/pred_clobber_3.c: New test.
>   * gcc.target/aarch64/sve/pred_clobber_4.c: New test.
>
> ---
> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
> b/gcc/config/aarch64/tuning_models/neoversen2.h
> index 
> 7e799bbe762fe862e31befed50e54040a7fd1f2f..be9a48ac3adc097f967c217fe09dcac194d7d14f
>  100644
> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
> @@ -236,7 +236,8 @@ static const struct tune_params neoversen2_tunings =
>(AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> -   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
> +   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> +   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags.  */
>_prefetch_tune,
>AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>AARCH64_LDP_STP_POLICY_ALWAYS /* stp_policy_model.  */
> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
> b/gcc/config/aarch64/tuning_models/neoversev1.h
> index 
> 9363f2ad98a5279cc99f2f9b1509ba921d582e84..0fc41ce6a41b3135fa06d2bda1f517fdf4f8dbcf
>  100644
> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
> @@ -227,7 +227,8 @@ static const struct tune_params neoversev1_tunings =
>(AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND), /* tune_flags.  */
> +   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> +   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags.  */
>_prefetch_tune,
>AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>AARCH64_LDP_STP_POLICY_ALWAYS/* stp_policy_model.  */
> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
> b/gcc/config/aarch64/tuning_models/neoversev2.h
> index 
> bc01ed767c9b690504eb98456402df5d9d64eee3..f76e4ef358f7dfb9c7d7b470ea7240eaa2120f8e
>  100644
> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
> @@ -236,7 +236,8 @@ static const struct tune_params neoversev2_tunings =
>(AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> -   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
> +   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> +   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags.  */
>_prefetch_tune,
>AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>AARCH64_LDP_STP_POLICY_ALWAYS /* stp_policy_model.  */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pred_clobber_1.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/pred_clobber_1.c
> new file mode 100644
> index 
> ..934a00a38531c5fd4139d99ff33414904b2c104f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pred_clobber_1.c
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mcpu=neoverse-n2" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +
> +#pragma GCC target "+sve"
> +
> +#include 
> +
> +extern void use(svbool_t);
> +
> +/*
> +** foo:
> +**   ...
> +**   ptrue   p([1-9][0-9]?).b, all

Might be better to make this p([1-3]), so that we disallow any registers
that would cause a spill.

OK with that change, thanks.

Richard

> +**   cmplo   p0.h, p\1/z, z0.h, z[0-9]+.h
> +**   ...
> +*/
> +void foo (svuint16_t a, uint16_t b)
> +{
> +svbool_t p0 = svcmplt_n_u16 (svptrue_b16 (), a, b);
> +

Re: [PATCH 3/4]AArch64: add new alternative with early clobber to patterns

2024-05-30 Thread Richard Sandiford
Tamar Christina  writes:
> [...]
> @@ -6651,8 +6661,10 @@ (define_insn "and3"
>   (and:PRED_ALL (match_operand:PRED_ALL 1 "register_operand")
> (match_operand:PRED_ALL 2 "register_operand")))]
>"TARGET_SVE"
> -  {@ [ cons: =0, 1  , 2   ]
> - [ Upa , Upa, Upa ] and\t%0.b, %1/z, %2.b, %2.b
> +  {@ [ cons: =0, 1  , 2  ; attrs: pred_clobber ]
> + [ , Upa, Upa; yes ] and\t%0.b, %1/z, %2.b, %2.b
> + [ ?Upa, 0  , Upa; yes ] ^
> + [ Upa , Upa, Upa; no  ] ^

I think this ought to be:

> +  {@ [ cons: =0, 1  ,  2   ; attrs: pred_clobber ]
> + [ , Upa,  Upa ; yes ] and\t%0.b, %1/z, %2.b, 
> %2.b
> + [ ?Upa, 0Upa, 0Upa; yes ] ^
> + [ Upa , Upa,  Upa ; no  ] ^

so that operand 2 can be tied to operand 0 in the worst case.  Similarly:

>}
>  )
>  
> @@ -6679,8 +6691,10 @@ (define_insn "@aarch64_pred__z"
>   (match_operand:PRED_ALL 3 "register_operand"))
> (match_operand:PRED_ALL 1 "register_operand")))]
>"TARGET_SVE"
> -  {@ [ cons: =0, 1  , 2  , 3   ]
> - [ Upa , Upa, Upa, Upa ] \t%0.b, %1/z, %2.b, %3.b
> +  {@ [ cons: =0, 1  , 2  , 3  ; attrs: pred_clobber ]
> + [ , Upa, Upa, Upa; yes ] \t%0.b, %1/z, 
> %2.b, %3.b
> + [ ?Upa, 0  , Upa, Upa; yes ] ^
> + [ Upa , Upa, Upa, Upa; no  ] ^
>}
>  )

this would be:

  {@ [ cons: =0, 1   , 2   , 3   ; attrs: pred_clobber ]
 [ , Upa , Upa , Upa ; yes ] \t%0.b, %1/z, 
%2.b, %3.b
 [ ?Upa, 0Upa, 0Upa, 0Upa; yes ] ^
 [ Upa , Upa , Upa,  Upa ; no  ] ^
  }

Same idea for the rest.

I tried this on:

--
#include 

void use (svbool_t, svbool_t, svbool_t);

void
f1 (svbool_t p0, svbool_t p1, svbool_t p2, int n, svbool_t *ptr)
{
  while (n--)
p2 = svand_z (p0, p1, p2);
  *ptr = p2;
}

void
f2 (svbool_t p0, svbool_t p1, svbool_t p2, svbool_t *ptr)
{
  *ptr = svand_z (p0, p1, p2);
}

void
f3 (svbool_t p0, svbool_t p1, svbool_t p2, svbool_t *ptr)
{
  use (svand_z (p0, p1, p2), p1, p2);
}

void
f4 (svbool_t p0, svbool_t p1, svbool_t p2, svbool_t *ptr)
{
  use (p0, svand_z (p0, p1, p2), p2);
}

void
f5 (svbool_t p0, svbool_t p1, svbool_t p2, svbool_t *ptr)
{
  use (p0, p1, svand_z (p0, p1, p2));
}
--

and it seemed to produce the right output:

--
f1:
cbz w0, .L2
sub w0, w0, #1
.p2align 5,,15
.L3:
and p2.b, p0/z, p1.b, p2.b
sub w0, w0, #1
cmn w0, #1
bne .L3
.L2:
str p2, [x1]
ret

f2:
and p3.b, p0/z, p1.b, p2.b
str p3, [x0]
ret

f3:
and p0.b, p0/z, p1.b, p2.b
b   use

f4:
and p1.b, p0/z, p1.b, p2.b
b   use

f5:
and p2.b, p0/z, p1.b, p2.b
b   use
--

(with that coming directly from RA, rather than being cleaned
up later)

> [...]
> @@ -10046,8 +10104,10 @@ (define_insn_and_rewrite "*aarch64_brkn_cc"
>  (match_dup 3)]
> UNSPEC_BRKN))]
>"TARGET_SVE"
> -  {@ [ cons: =0, 1  , 2  , 3 ]
> - [ Upa , Upa, Upa, 0 ] brkns\t%0.b, %1/z, %2.b, %0.b
> +  {@ [ cons: =0, 1  , 2  , 3; attrs: pred_clobber ]
> + [ , Upa, Upa, 0; yes ] brkns\t%0.b, %1/z, %2.b, 
> %0.b
> + [ ?Upa, 0  , Upa, 0; yes ] ^
> + [ Upa , Upa, Upa, 0; no  ] ^
>}
>"&& (operands[4] != CONST0_RTX (VNx16BImode)
> || operands[5] != CONST0_RTX (VNx16BImode))"

Probably best to leave this out.  All alternatives require operand 3
to match operand 0.  So operands 1 and 2 will only match operand 0
if they're the same as operand 3.  In that case it'd be better to
allow the sharing rather than force the same value to be stored
in two registers.

That is, if op1 != op3 && op2 != op3 then we get what we want
naturally, regardless of tuning.

The same thing would apply to the BRKN instances of :

> @@ -10020,8 +10076,10 @@ (define_insn "@aarch64_brk"
>  (match_operand:VNx16BI 3 "register_operand")]
> SVE_BRK_BINARY))]
>"TARGET_SVE"
> -  {@ [ cons: =0, 1  , 2  , 3 ]
> - [ Upa , Upa, Upa,  ] brk\t%0.b, %1/z, %2.b, 
> %.b
> +  {@ [ cons: =0,  1 , 2  , 3; attrs: pred_clobber ]
> + [ , Upa, Upa, ; yes ] 
> brk\t%0.b, %1/z, %2.b, %.b
> + [ ?Upa, 0  , Upa, ; yes ] ^
> + [ Upa , Upa, Upa, ; no  ] ^
>}
>  )

but I think we should keep this factoring/abstraction and just add
the extra alternatives 

Re: [PATCH 2/4]AArch64: add new tuning param and attribute for enabling conditional early clobber

2024-05-30 Thread Richard Sandiford
Tamar Christina  writes:
>> -Original Message-
>> From: Tamar Christina 
>> Sent: Wednesday, May 22, 2024 10:29 AM
>> To: Richard Sandiford 
>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> ; Marcus Shawcroft
>> ; ktkac...@gcc.gnu.org
>> Subject: RE: [PATCH 2/4]AArch64: add new tuning param and attribute for
>> enabling conditional early clobber
>>
>> >
>> > Sorry for the bike-shedding, but how about something like "avoid_pred_rmw"?
>> > (I'm open to other suggestions.)  Just looking for something that describes
>> > either the architecture or the end result that we want to achieve.
>> > And preferable something fairly short :)
>> >
>> > avoid_* would be consistent with the existing "avoid_cross_loop_fma".
>> >
>> > > +
>> > >  #undef AARCH64_EXTRA_TUNING_OPTION
>> > > diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
>> > > index
>> >
>> bbf11faaf4b4340956094a983f8b0dc2649b2d27..76a18dd511f40ebb58ed12d5
>> > 6b46c74084ba7c3c 100644
>> > > --- a/gcc/config/aarch64/aarch64.h
>> > > +++ b/gcc/config/aarch64/aarch64.h
>> > > @@ -495,6 +495,11 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE =
>> > AARCH64_FL_SM_OFF;
>> > >  enabled through +gcs.  */
>> > >  #define TARGET_GCS (AARCH64_ISA_GCS)
>> > >
>> > > +/*  Prefer different predicate registers for the output of a predicated 
>> > > operation
>> > over
>> > > +re-using an existing input predicate.  */
>> > > +#define TARGET_SVE_PRED_CLOBBER (TARGET_SVE \
>> > > +  && (aarch64_tune_params.extra_tuning_flags \
>> > > +  &
>> > AARCH64_EXTRA_TUNE_EARLY_CLOBBER_SVE_PRED_DEST))
>> > >
>> > >  /* Standard register usage.  */
>> > >
>> > > diff --git a/gcc/config/aarch64/aarch64.md 
>> > > b/gcc/config/aarch64/aarch64.md
>> > > index
>> >
>> dbde066f7478bec51a8703b017ea553aa98be309..1ecd1a2812969504bd5114a
>> > 53473b478c5ddba82 100644
>> > > --- a/gcc/config/aarch64/aarch64.md
>> > > +++ b/gcc/config/aarch64/aarch64.md
>> > > @@ -445,6 +445,10 @@ (define_enum_attr "arch" "arches" (const_string
>> > "any"))
>> > >  ;; target-independent code.
>> > >  (define_attr "is_call" "no,yes" (const_string "no"))
>> > >
>> > > +;; Indicates whether we want to enable the pattern with an optional 
>> > > early
>> > > +;; clobber for SVE predicates.
>> > > +(define_attr "pred_clobber" "no,yes" (const_string "no"))
>> > > +
>> > >  ;; [For compatibility with Arm in pipeline models]
>> > >  ;; Attribute that specifies whether or not the instruction touches fp
>> > >  ;; registers.
>> > > @@ -461,7 +465,8 @@ (define_attr "fp" "no,yes"
>> > >  (define_attr "arch_enabled" "no,yes"
>> > >(if_then_else
>> > >  (ior
>> > > - (eq_attr "arch" "any")
>> > > + (and (eq_attr "arch" "any")
>> > > +  (eq_attr "pred_clobber" "no"))
>> > >
>> > >   (and (eq_attr "arch" "rcpc8_4")
>> > >(match_test "AARCH64_ISA_RCPC8_4"))
>> > > @@ -488,7 +493,10 @@ (define_attr "arch_enabled" "no,yes"
>> > >(match_test "TARGET_SVE"))
>> > >
>> > >   (and (eq_attr "arch" "sme")
>> > > -  (match_test "TARGET_SME")))
>> > > +  (match_test "TARGET_SME"))
>> > > +
>> > > + (and (eq_attr "pred_clobber" "yes")
>> > > +  (match_test "TARGET_SVE_PRED_CLOBBER")))
>> >
>> > IMO it'd be better to handle pred_clobber separately from arch, as a new
>> > top-level AND:
>> >
>> >   (and
>> > (ior
>> >   (eq_attr "pred_clobber" "no")
>> >   (match_test "!TARGET_..."))
>> > (ior
>> >   ...existing arch tests...))
>> >
>>
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>

Re: [PATCH 00/11] AArch64/OpenMP: Test SVE ACLE types with various OpenMP constructs.

2024-05-30 Thread Richard Sandiford
Tejas Belagod  writes:
> Note: This patch series is based on Richard's initial patch
>   https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606741.html
> and Jakub's suggestion
>   https://gcc.gnu.org/pipermail/gcc-patches/2023-February/611892.html
>
> The following patch series handles various scenarios with OpenMP and SVE 
> types.
> The starting point for the series follows a suggestion from Jakub to cover 
> all 
> the possible scenarios that could arise when OMP constructs/clauses etc are 
> used with SVE ACLE types. Here are a few instances that this patch series 
> tests
> and in some cases fixes the expected output.  This patch series does not 
> follow
> a formal definition or a spec of how OMP interacts with SVE ACLE types, so 
> it's 
> more of a proposed behaviour.  Comments and discussion welcome.

Thanks for doing this.  I've left some comments on individual patches,
but generally the series looks good from my limited ability to evaluate it.
Hopefully Jakub can say whether this catches all the cases that matter.

Richard

> This list is not exhaustive, but covers most scenarios of how SVE ACLE types
> ought to interact with OMP constructs/clauses.
>
> 1. Poly-int structures that represent variable-sized objects and OMP runtime.
>
> Currently poly-int type structures are passed by value to OpenMP runtime
> functions for shared clauses etc.  This patch improves on this by passing
> around poly-int structures by address to avoid copy-overhead.
>
> 2. SVE ACLE types in OMP Shared clauses.
>
> We test the behaviour where SVE ACLE type objects are shared in the following
> methods into an OMP region:
>   a. Explicit Shared clause on SVE ACLE type objects.
>   b. Implicit shared clause.
>   c. Implicit shared with default clause.
>   d. SVE ACLE types in the presence of predetermined (static) shared objects.
>
> The associated tests ensure that all such shared objects are passed by address
> into the OMP runtime.  There are runtime tests to verify the functional
> correctness of the change.
>
> 3. Offloading and SVE ACLE types.
>
> The target clause in OpenMP is used to offload loop kernels to accelerator
> peripherals.  target's 'map' clause is used to move data from and to the
> accelerator.  When the data is an SVE type, it may not be suitable because of
> various reasons i.e. the two SVE targets may not agree on vector size or
> some targets don't support variable vector size.  This makes SVE unsuitable
> for use in OMP's 'map' clause.  We diagnose all such cases and issue errors
> where appropriate.  The cases we cover in this patch are:
>
>   a. Implicitly-mapped SVE ACLE types in OMP target regions are diagnosed.
>   b. Explicitly-mapped SVE ACLE types in OMP target regions using map clause
>  are diagnosed.
>   c. Explicitly-mapped SVE ACLE types of various directions - to, from,
> tofrom
>  in the map clause are diagnosed.
>   d. target enter and exit data clauses with map on SVE ACLE types are 
>  diagnosed.
>   e. target data map with alloc on SVE ACLE types are diagnosed.
>   f. target update from clause on SVE ACLE types are diagnosed.
>   g. target private firstprivate with SVE ACLE types are diagnosed.
>   h. All combinations of target with work-sharing constructs like parallel,
>  loop, simd, teams, distribute etc are also diagnosed when SVE ACLE types
>  are involved.
>
> 3. Lastprivate and SVE ACLE types.
>
> Various OpenMP lastprivate clause scenarios with SVE object types are 
> diagnosed.  Worksharing constructs like sections, for, distribute bind to an
> implicit outer parallel region in whose scope SVE ACLE types are declared and 
> are therefore default private.  The lastprivate clause list with SVE ACLE type
> object items are diagnosed in this scenario.
>
> 4. Threadprivate on SVE ACLE type objects.
>
> We ensure threadprivate SVE ACLE type objects are supported. We also ensure
> copyin clause is also supported.
>
> 5. User-Defined Reductions on SVE ACLE types.
>
> We define a reduction using OMP declare reduction using SVE ACLE intrinsics 
> and
> ensure its functional correctness with various work-sharing constructs like
> for, simd, parallel, task, taskloop.
>
> 6. Uniform and Aligned Clause with SVE ACLE
>
> We ensure the uniform clause's functional correctness with simd construct and
> associated SVE ACLE intrinsics in the simd region.  There is no direct
> interaction between uniform and SVE ACLE type objects, but we ensure the 
> uniform
> clause applies correctly to a region where SVE ACLE intrinsics are present.
> Similarly for the aligned clause.
>
> 7. Linear clause and SVE ACLE type.
>
> We diagnose if a linear clause list item has SVE ACLE type objects present.
> Its doesn't mean much if the linear clause is applied to SVE ACLE types.
>
> 8. Depend clause and SVE ACLE objects.
>
> We test for functional correctness many combinations of dependency of shared
> SVE ACLE type objects in parallel regions.  We test if in, out dependencies 
> 

Re: [PATCH 01/11] OpenMP/PolyInt: Pass poly-int structures by address to OMP libs.

2024-05-30 Thread Richard Sandiford
Tejas Belagod  writes:
> Currently poly-int type structures are passed by value to OpenMP runtime
> functions for shared clauses etc.  This patch improves on this by passing
> around poly-int structures by address to avoid copy-overhead.
>
> gcc/ChangeLog
>   * omp-low.c (use_pointer_for_field): Use pointer if the OMP data
>   structure's field type is a poly-int.
> ---
>  gcc/omp-low.cc | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
> index 1a65229cc37..b15607f4ef5 100644
> --- a/gcc/omp-low.cc
> +++ b/gcc/omp-low.cc
> @@ -466,7 +466,8 @@ static bool
>  use_pointer_for_field (tree decl, omp_context *shared_ctx)
>  {
>if (AGGREGATE_TYPE_P (TREE_TYPE (decl))
> -  || TYPE_ATOMIC (TREE_TYPE (decl)))
> +  || TYPE_ATOMIC (TREE_TYPE (decl))
> +  || POLY_INT_CST_P (DECL_SIZE (decl)))
>  return true;
>  
>/* We can only use copy-in/copy-out semantics for shared variables

Realise this is also true of my original patch, but:

I suppose a question here is whether this function is only ever used for
local interfaces between code generated by the same source code function,
or whether it's ABI in a more general sense.  If the latter, I suppose
we should make sure to handle ACLE types the same way regardless of
whether the SVE vector size is known.

(At the moment, the vector size is fixed for a TU, not just a function,
but we should probably plan for relaxing that in future.)

Thanks,
Richard


Re: [PATCH 03/11] AArch64: Diagnose OpenMP offloading when SVE types involved.

2024-05-30 Thread Richard Sandiford
Tejas Belagod  writes:
> The target clause in OpenMP is used to offload loop kernels to accelerator
> peripherals.  target's 'map' clause is used to move data from and to the
> accelerator.  When the data is an SVE type, it may not be suitable because of
> various reasons i.e. the two SVE targets may not agree on vector size or
> some targets don't support variable vector size.  This makes SVE unsuitable
> for use in OMP's 'map' clause.  This patch diagnoses all such cases and issues
> an error where SVE types are not suitable.
>
> Co-authored-by: Andrea Corallo 
>
> gcc/ChangeLog:
>
>   * target.h (type_context_kind): Add new context kinds for target 
> clauses.
>   * config/aarch64/aarch64-sve-builtins.cc (verify_type_context): Diagnose
>   SVE types for a given OpenMP context.
>   * gimplify.cc (omp_notice_variable):  Diagnose implicitly-mapped SVE
>   objects in OpenMP regions.
>   (gimplify_scan_omp_clauses): Diagnose SVE types for various target
>   clauses.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/sve/omp/offload-1.c: New test.
>   * gcc.target/aarch64/sve/omp/offload-2.c: Likewise.
>   * gcc.target/aarch64/sve/omp/offload-parallel-loop.c: Likewise.
>   * gcc.target/aarch64/sve/omp/offload-parallel.c: Likewise.
>   * gcc.target/aarch64/sve/omp/offload-simd.c: Likewise.
>   * gcc.target/aarch64/sve/omp/offload-teams-distribute-simd.c: Likewise.
>   * gcc.target/aarch64/sve/omp/offload-teams-distribute.c: Likewise.
>   * gcc.target/aarch64/sve/omp/offload-teams-loop.c: Likewise.
>   * gcc.target/aarch64/sve/omp/offload-teams.c: Likewise.
>   * gcc.target/aarch64/sve/omp/target-device.c: Likewise.
>   * gcc.target/aarch64/sve/omp/target-link.c: Likewise.
> ---
>  gcc/config/aarch64/aarch64-sve-builtins.cc|  31 +++
>  gcc/gimplify.cc   |  34 ++-
>  gcc/target.h  |  19 +-
>  .../gcc.target/aarch64/sve/omp/offload-1.c| 237 ++
>  .../gcc.target/aarch64/sve/omp/offload-2.c| 198 +++
>  .../aarch64/sve/omp/offload-parallel-loop.c   | 236 +
>  .../aarch64/sve/omp/offload-parallel.c| 195 ++
>  .../gcc.target/aarch64/sve/omp/offload-simd.c | 236 +
>  .../sve/omp/offload-teams-distribute-simd.c   | 237 ++
>  .../sve/omp/offload-teams-distribute.c| 236 +
>  .../aarch64/sve/omp/offload-teams-loop.c  | 237 ++
>  .../aarch64/sve/omp/offload-teams.c   | 195 ++
>  .../aarch64/sve/omp/target-device.c   |  97 +++
>  .../gcc.target/aarch64/sve/omp/target-link.c  |  48 
>  14 files changed, 2234 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/omp/offload-1.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/omp/offload-2.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/sve/omp/offload-parallel-loop.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/sve/omp/offload-parallel.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/omp/offload-simd.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/sve/omp/offload-teams-distribute-simd.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/sve/omp/offload-teams-distribute.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/sve/omp/offload-teams-loop.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/omp/offload-teams.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/omp/target-device.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/omp/target-link.c
>
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins.cc
> index f3983a123e3..ee1064c3bb7 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
> @@ -5000,6 +5000,29 @@ bool
>  verify_type_context (location_t loc, type_context_kind context,
>const_tree type, bool silent_p)
>  {
> +  if (aarch64_sve::builtin_type_p (type)
> +  || (POINTER_TYPE_P (type)
> +   && aarch64_sve::builtin_type_p (TREE_TYPE (type

Could you say in more detail why we check for zero or one levels
of pointer indirection but not for more?

Also, was there a reason for checking builtin_type_p rather than
sizeless_type_p?  Things like svbool_t remain sizeless even for
-msve-vector-bits=128 etc., so sizeless_type_p would still cover
that case.  But arm_sve_vector_bits makes it possible to define
fixed-length vector types that are treated for ABI & ACLE purposes
like SVE types.  I don't think those should be treated differently
from normal vectors by omp, since the size is fixed by the attribute
(and types with different attributes are distinct).
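
E.g. (with a matching -msve-vector-bits=256; the typedef name is just
for illustration):

  #include <arm_sve.h>

  /* Fixed 256-bit size, but treated for ABI & ACLE purposes like an
     SVE type.  */
  typedef svint32_t fixed_int32_t
    __attribute__ ((arm_sve_vector_bits (256)));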

Thanks,
Richard

> +switch (context)
> +{
> +  case TCTX_OMP_MAP:
> + error_at (loc, "SVE type %qT not allowed in 

Re: [PATCH 02/11] AArch64: Add test cases for SVE types in OpenMP shared clause.

2024-05-30 Thread Richard Sandiford
Tejas Belagod  writes:
> This patch tests various shared clauses with SVE types.  It also adds a test
> scaffold to run OpenMP tests under the gcc.target testsuite.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/sve/omp/aarch64-sve-omp.exp: New scaffold.

Hopefully Jakub can comment on whether we should test this in the
GCC testsuite or libgomp testsuite.

On the test:

> [...]
> +int
> +main ()
> +{
> +  svint32_t x = svindex_s32 (0 ,1);
> +  svint32_t y = svindex_s32 (8, 1);
> +  svint32_t a, b;
> +  svbool_t p;
> +
> +  /* Implicit shared.  */
> +  a = foo (x, y, p);
> +  b = implicit_shared_default (x, y, p);

It looks like p is used uninitialised here.  Can you check locally
that using svptrue_b8 () (or whatever) as an initialiser allows the
test to pass while svpfalse_b () causes it to fail?
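
I.e. locally changing the declaration in main to something like:

  svbool_t p = svptrue_b8 ();    /* all-true: test expected to pass */
  /* svbool_t p = svpfalse_b ();    all-false: test expected to fail */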

Thanks,
Richard

> +  compare_vec (a, b);
> +
> +  /* Explicit shared.  */
> +  a = foo (x ,y, p);
> +  b = explicit_shared (x, y, p);
> +  compare_vec (a, b);
> +
> +  /* Implicit shared with no default clause.  */
> +  a = foo (x ,y, p);
> +  b = implicit_shared_no_default (x, y, p);
> +  compare_vec (a, b);
> +
> +  /* Mix shared.  */
> +  a = foo (x ,y, p);
> +  b = mix_shared (y, p);
> +  compare_vec (a, b);
> +
> +  /* Predetermined shared.  */
> +  predetermined_shared_static (true);
> +  predetermined_shared_static (false);
> +
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "value-expr: \*.omp_data_i->a" 10 
> "ompexp" } } */


Re: [Patch, aarch64, middle-end\ v4: Move pair_fusion pass from aarch64 to middle-end

2024-05-30 Thread Richard Sandiford
Thanks for the update.  Some comments below, but looks very close
to ready.

Ajit Agarwal  writes:
> diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc
> new file mode 100644
> index 000..060fd95
> --- /dev/null
> +++ b/gcc/pair-fusion.cc
> @@ -0,0 +1,3012 @@
> +// Pass to fuse adjacent loads/stores into paired memory accesses.
> +// Copyright (C) 2024 Free Software Foundation, Inc.

This should probably be 2023-2024, since it's based on code
contributed in 2023.

> +//
> +// This file is part of GCC.
> +//
> +// GCC is free software; you can redistribute it and/or modify it
> +// under the terms of the GNU General Public License as published by
> +// the Free Software Foundation; either version 3, or (at your option)
> +// any later version.
> +//
> +// GCC is distributed in the hope that it will be useful, but
> +// WITHOUT ANY WARRANTY; without even the implied warranty of
> +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +// General Public License for more details.
> +//
> +// You should have received a copy of the GNU General Public License
> +// along with GCC; see the file COPYING3.  If not see
> +// .
> +
> +#define INCLUDE_ALGORITHM
> +#define INCLUDE_FUNCTIONAL
> +#define INCLUDE_LIST
> +#define INCLUDE_TYPE_TRAITS
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "df.h"
> +#include "rtl-iter.h"
> +#include "rtl-ssa.h"
> +#include "cfgcleanup.h"
> +#include "tree-pass.h"
> +#include "ordered-hash-map.h"
> +#include "tree-dfa.h"
> +#include "fold-const.h"
> +#include "tree-hash-traits.h"
> +#include "print-tree.h"
> +#include "pair-fusion.h"
> +
> +using namespace rtl_ssa;
> +
> +// We pack these fields (load_p, fpsimd_p, and size) into an integer
> +// (LFS) which we use as part of the key into the main hash tables.
> +//
> +// The idea is that we group candidates together only if they agree on
> +// the fields below.  Candidates that disagree on any of these
> +// properties shouldn't be merged together.
> +struct lfs_fields
> +{
> +  bool load_p;
> +  bool fpsimd_p;
> +  unsigned size;
> +};
> +
> +using insn_list_t = std::list;
> +
> +// Information about the accesses at a given offset from a particular
> +// base.  Stored in an access_group, see below.
> +struct access_record
> +{
> +  poly_int64 offset;
> +  std::list cand_insns;
> +  std::list::iterator place;
> +
> +  access_record (poly_int64 off) : offset (off) {}
> +};
> +
> +// A group of accesses where adjacent accesses could be ldp/stp
> +// candidates.  The splay tree supports efficient insertion,
> +// while the list supports efficient iteration.
> +struct access_group
> +{
> +  splay_tree tree;
> +  std::list list;
> +
> +  template
> +  inline void track (Alloc node_alloc, poly_int64 offset, insn_info *insn);
> +};
> +
> +// Test if this base candidate is viable according to HAZARDS.
> +bool base_cand::viable () const

Formating nit, should be:

bool
base_cand::viable () const

> +{
> +  return !hazards[0] || !hazards[1] || (*hazards[0] > *hazards[1]);
> +}
> [...]
> +void
> +pair_fusion_bb_info::transform ()
> +{
> +  traverse_base_map (expr_map);
> +  traverse_base_map (def_map);
> +}
> +
> +// the base register which we can fold in to make this pair use
> +// a writeback addressing mode.

The first line of this comment is missing.  It should be:

// Given an existing pair insn INSN, look for a trailing update of

> [...]
> diff --git a/gcc/pair-fusion.h b/gcc/pair-fusion.h
> new file mode 100644
> index 000..f295fdbdb8f
> --- /dev/null
> +++ b/gcc/pair-fusion.h
> @@ -0,0 +1,195 @@
> +// Pass to fuse adjacent loads/stores into paired memory accesses.
> +//
> +// This file contains the definition of the virtual base class which is
> +// overriden by targets that make use of the pass.
> +//
> +// Copyright (C) 2024 Free Software Foundation, Inc.

2023-2024 here too

> +//
> +// This file is part of GCC.
> +//
> +// GCC is free software; you can redistribute it and/or modify it
> +// under the terms of the GNU General Public License as published by
> +// the Free Software Foundation; either version 3, or (at your option)
> +// any later version.
> +//
> +// GCC is distributed in the hope that it will be useful, but
> +// WITHOUT ANY WARRANTY; without even the implied warranty of
> +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +// General Public License for more details.
> +//
> +// You should have received a copy of the GNU General Public License
> +// along with GCC; see the file COPYING3.  If not see
> +// .
> +
> +namespace rtl_ssa {
> +  class def_info;
> +  class insn_info;
> +  class insn_range_info;
> +  class bb_info;
> +}
> +
> +// Information about a potential base candidate, used in try_fuse_pair.
> +// There may be zero, one, or two viable RTL bases for a given pair.
> +struct base_cand
> +{
> +  // DEF is the 

Re: [PATCH] aarch64: testsuite: Explicitly add -mlittle-endian to vget_low_2.c

2024-05-30 Thread Richard Sandiford
Pengxuan Zheng  writes:
> vget_low_2.c is a test case for little-endian, but we missed the 
> -mlittle-endian
> flag in r15-697-ga2e4fe5a53cf75.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/vget_low_2.c: Add -mlittle-endian.

Ok, thanks.

If you'd like write access, please follow the instructions on
https://gcc.gnu.org/gitwrite.html (I'll sponsor).

Richard

> Signed-off-by: Pengxuan Zheng 
> ---
>  gcc/testsuite/gcc.target/aarch64/vget_low_2.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/vget_low_2.c 
> b/gcc/testsuite/gcc.target/aarch64/vget_low_2.c
> index 44414e1c043..93e9e664ee9 100644
> --- a/gcc/testsuite/gcc.target/aarch64/vget_low_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/vget_low_2.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O3 -fdump-tree-optimized" } */
> +/* { dg-options "-O3 -fdump-tree-optimized -mlittle-endian" } */
>  
>  #include 


Re: [PATCH] aarch64: Add vector floating point extend patterns [PR113880, PR113869]

2024-05-30 Thread Richard Sandiford
Pengxuan Zheng  writes:
> This patch improves vectorization of certain floating point widening 
> operations
> for the aarch64 target by adding vector floating point extend patterns for
> V2SF->V2DF and V4HF->V4SF conversions.
>
>   PR target/113880
>   PR target/113869
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-simd.md (extend2): New expand.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/extend-vec.c: New test.
>
> Signed-off-by: Pengxuan Zheng 

Thanks for doing this.  Could we instead rename
aarch64_float_extend_lo_ to extend2 and
use something similar to:

---
/* The builtins below should be expanded through the standard optabs
   CODE_FOR_[u]avg3_[floor,ceil].  However the mapping scheme in
   aarch64-simd-builtins.def does not easily allow us to have a pre-mode
   ("uavg") and post-mode string ("_ceil") in the CODE_FOR_* construction.
   So the builtins use a name that is natural for AArch64 instructions
   e.g. "aarch64_srhadd" and we re-map these to the optab-related
   CODE_FOR_ here.  */
#undef VAR1
#define VAR1(F,T1,T2,I,M) \
constexpr insn_code CODE_FOR_aarch64_##F##M = CODE_FOR_##T1##M##3##T2;

BUILTIN_VDQ_BHSI (srhadd, avg, _ceil, 0)
BUILTIN_VDQ_BHSI (urhadd, uavg, _ceil, 0)
BUILTIN_VDQ_BHSI (shadd, avg, _floor, 0)
BUILTIN_VDQ_BHSI (uhadd, uavg, _floor, 0)

#undef VAR1
---

(from aarch64-builtins.cc) to handle the intrinsics?  The idea is
to try to avoid adding new patterns just to satisfy the internal
naming convention.

Richard

> ---
>  gcc/config/aarch64/aarch64-simd.md|  7 +++
>  gcc/testsuite/gcc.target/aarch64/extend-vec.c | 21 +++
>  2 files changed, 28 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/extend-vec.c
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 868f4486218..8febb411d06 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -3141,6 +3141,13 @@ (define_insn "aarch64_float_extend_lo_"
>[(set_attr "type" "neon_fp_cvt_widen_s")]
>  )
>  
> +(define_expand "extend2"
> +  [(set (match_operand: 0 "register_operand" "=w")
> +(float_extend:
> +  (match_operand:VDF 1 "register_operand" "w")))]
> +  "TARGET_SIMD"
> +)
> +
>  ;; Float narrowing operations.
>  
>  (define_insn "aarch64_float_trunc_rodd_df"
> diff --git a/gcc/testsuite/gcc.target/aarch64/extend-vec.c 
> b/gcc/testsuite/gcc.target/aarch64/extend-vec.c
> new file mode 100644
> index 000..f6241d5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/extend-vec.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +/* { dg-final { scan-assembler-times {fcvtl\tv[0-9]+.2d, v[0-9]+.2s} 1 } } */
> +void
> +f (float *__restrict a, double *__restrict b)
> +{
> +  b[0] = a[0];
> +  b[1] = a[1];
> +}
> +
> +/* { dg-final { scan-assembler-times {fcvtl\tv[0-9]+.4s, v[0-9]+.4h} 1 } } */
> +void
> +f1 (_Float16 *__restrict a, float *__restrict b)
> +{
> +
> +  b[0] = a[0];
> +  b[1] = a[1];
> +  b[2] = a[2];
> +  b[3] = a[3];
> +}


[PATCH] ira: Fix go_through_subreg offset calculation [PR115281]

2024-05-30 Thread Richard Sandiford
go_through_subreg used:

  else if (!can_div_trunc_p (SUBREG_BYTE (x),
 REGMODE_NATURAL_SIZE (GET_MODE (x)), offset))

to calculate the register offset for a pseudo subreg x.  In the blessed
days before poly-int, this was:

*offset = (SUBREG_BYTE (x) / REGMODE_NATURAL_SIZE (GET_MODE (x)));

But I think this is testing the wrong natural size.  If we exclude
paradoxical subregs (which will get an offset of zero regardless),
it's the inner register that is being split, so it should be the
inner register's natural size that we use.

This matters in the testcase because we have an SFmode lowpart
subreg into the last of three variable-sized vectors.  The
SUBREG_BYTE is therefore equal to the size of two variable-sized
vectors.  Dividing by the vector size gives a register offset of 2,
as expected, but dividing by the size of a scalar FPR would give
a variable offset.
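
Concretely (writing the SVE vector size as 16 + 16x bytes, with x the
number of extra 128-bit blocks, and c for whatever fixed value
REGMODE_NATURAL_SIZE gives for SFmode):

  SUBREG_BYTE             = 2 * (16 + 16x) = 32 + 32x
  (32 + 32x) / (16 + 16x) = 2          -- constant, as wanted
  (32 + 32x) / c                       -- no constant quotient, so the
                                          can_div_trunc_p call cannot succeed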

I think something similar could happen for fixed-size targets if
REGMODE_NATURAL_SIZE is different for vectors and integers (say).

Tested on aarch64-linux-gnu & x86_64-linux-gnu.  OK to install?

Richard


gcc/
PR rtl-optimization/115281
* ira-conflicts.cc (go_through_subreg): Use the natural size of
the inner mode rather than the outer mode.

gcc/testsuite/
PR rtl-optimization/115281
* gfortran.dg/pr115281.f90: New test.
---
 gcc/ira-conflicts.cc   |  3 +-
 gcc/testsuite/gfortran.dg/pr115281.f90 | 39 ++
 2 files changed, 41 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gfortran.dg/pr115281.f90

diff --git a/gcc/ira-conflicts.cc b/gcc/ira-conflicts.cc
index 83274c53330..15ac42d8848 100644
--- a/gcc/ira-conflicts.cc
+++ b/gcc/ira-conflicts.cc
@@ -227,8 +227,9 @@ go_through_subreg (rtx x, int *offset)
   if (REGNO (reg) < FIRST_PSEUDO_REGISTER)
 *offset = subreg_regno_offset (REGNO (reg), GET_MODE (reg),
   SUBREG_BYTE (x), GET_MODE (x));
+  /* The offset is always 0 for paradoxical subregs.  */
   else if (!can_div_trunc_p (SUBREG_BYTE (x),
-REGMODE_NATURAL_SIZE (GET_MODE (x)), offset))
+REGMODE_NATURAL_SIZE (GET_MODE (reg)), offset))
 /* Checked by validate_subreg.  We must know at compile time which
inner hard registers are being accessed.  */
 gcc_unreachable ();
diff --git a/gcc/testsuite/gfortran.dg/pr115281.f90 
b/gcc/testsuite/gfortran.dg/pr115281.f90
new file mode 100644
index 000..80aa822e745
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/pr115281.f90
@@ -0,0 +1,39 @@
+! { dg-options "-O3" }
+! { dg-additional-options "-mcpu=neoverse-v1" { target aarch64*-*-* } }
+
+SUBROUTINE fn0(ma, mb, nt)
+  CHARACTER ca
+  REAL r0(ma)
+  INTEGER i0(mb)
+  REAL r1(3,mb)
+  REAL r2(3,mb)
+  REAL r3(3,3)
+  zero=0.0
+  do na = 1, nt
+ nt = i0(na)
+ do l = 1, 3
+r1 (l, na) =   r0 (nt)
+r2(l, na) = zero
+ enddo
+  enddo
+  if (ca  .ne.'z') then
+ do j = 1, 3
+do i = 1, 3
+   r4  = zero
+enddo
+ enddo
+ do na = 1, nt
+do k =  1, 3
+   do l = 1, 3
+  do m = 1, 3
+ r3 = r4 * v
+  enddo
+   enddo
+enddo
+ do i = 1, 3
+   do k = 1, ifn (r3)
+   enddo
+enddo
+ enddo
+ endif
+END
-- 
2.25.1



[PATCH] aarch64: Split aarch64_combinev16qi before RA [PR115258]

2024-05-29 Thread Richard Sandiford
Two-vector TBL instructions are fed by an aarch64_combinev16qi, whose
purpose is to put the two input data vectors into consecutive registers.
This aarch64_combinev16qi was then split after reload into individual
moves (from the first input to the first half of the output, and from
the second input to the second half of the output).

In the worst case, the RA might allocate things so that the destination
of the aarch64_combinev16qi is the second input followed by the first
input.  In that case, the split form of aarch64_combinev16qi uses three
eors to swap the registers around.
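
For example (register numbers purely illustrative): if the destination
pair ends up as v0:v1 with the first input in v1 and the second input
in v0, the split can't use two simple moves and instead has to emit an
EOR swap along the lines of:

        eor     v0.16b, v0.16b, v1.16b
        eor     v1.16b, v1.16b, v0.16b
        eor     v0.16b, v0.16b, v1.16b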

This PR is about a test where this worst case occurred.  And given the
insn description, that allocation doesn't seem unreasonable.

early-ra should (hopefully) mean that we're now better at allocating
subregs of vector registers.  The upcoming RA subreg patches should
improve things further.  The best fix for the PR therefore seems
to be to split the combination before RA, so that the RA can see
the underlying moves.

Perhaps it even makes sense to do this at expand time, avoiding the need
for aarch64_combinev16qi entirely.  That deserves more experimentation
though.

Tested on aarch64-linux-gnu & pushed.

Richard


gcc/
PR target/115258
* config/aarch64/aarch64-simd.md (aarch64_combinev16qi): Allow
the split before reload.
* config/aarch64/aarch64.cc (aarch64_split_combinev16qi): Generalize
into a form that handles pseudo registers.

gcc/testsuite/
PR target/115258
* gcc.target/aarch64/pr115258.c: New test.
---
 gcc/config/aarch64/aarch64-simd.md  |  2 +-
 gcc/config/aarch64/aarch64.cc   | 29 ++---
 gcc/testsuite/gcc.target/aarch64/pr115258.c | 19 ++
 3 files changed, 34 insertions(+), 16 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr115258.c

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index c311888e4bd..868f4486218 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -8474,7 +8474,7 @@ (define_insn_and_split "aarch64_combinev16qi"
UNSPEC_CONCAT))]
   "TARGET_SIMD"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(const_int 0)]
 {
   aarch64_split_combinev16qi (operands);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index ee12d8897a8..13191ec8e34 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25333,27 +25333,26 @@ aarch64_output_sve_ptrues (rtx const_unspec)
 void
 aarch64_split_combinev16qi (rtx operands[3])
 {
-  unsigned int dest = REGNO (operands[0]);
-  unsigned int src1 = REGNO (operands[1]);
-  unsigned int src2 = REGNO (operands[2]);
   machine_mode halfmode = GET_MODE (operands[1]);
-  unsigned int halfregs = REG_NREGS (operands[1]);
-  rtx destlo, desthi;
 
   gcc_assert (halfmode == V16QImode);
 
-  if (src1 == dest && src2 == dest + halfregs)
+  rtx destlo = simplify_gen_subreg (halfmode, operands[0],
+   GET_MODE (operands[0]), 0);
+  rtx desthi = simplify_gen_subreg (halfmode, operands[0],
+   GET_MODE (operands[0]),
+   GET_MODE_SIZE (halfmode));
+
+  bool skiplo = rtx_equal_p (destlo, operands[1]);
+  bool skiphi = rtx_equal_p (desthi, operands[2]);
+
+  if (skiplo && skiphi)
 {
   /* No-op move.  Can't split to nothing; emit something.  */
   emit_note (NOTE_INSN_DELETED);
   return;
 }
 
-  /* Preserve register attributes for variable tracking.  */
-  destlo = gen_rtx_REG_offset (operands[0], halfmode, dest, 0);
-  desthi = gen_rtx_REG_offset (operands[0], halfmode, dest + halfregs,
-  GET_MODE_SIZE (halfmode));
-
   /* Special case of reversed high/low parts.  */
   if (reg_overlap_mentioned_p (operands[2], destlo)
   && reg_overlap_mentioned_p (operands[1], desthi))
@@ -25366,16 +25365,16 @@ aarch64_split_combinev16qi (rtx operands[3])
 {
   /* Try to avoid unnecessary moves if part of the result
 is in the right place already.  */
-  if (src1 != dest)
+  if (!skiplo)
emit_move_insn (destlo, operands[1]);
-  if (src2 != dest + halfregs)
+  if (!skiphi)
emit_move_insn (desthi, operands[2]);
 }
   else
 {
-  if (src2 != dest + halfregs)
+  if (!skiphi)
emit_move_insn (desthi, operands[2]);
-  if (src1 != dest)
+  if (!skiplo)
emit_move_insn (destlo, operands[1]);
 }
 }
diff --git a/gcc/testsuite/gcc.target/aarch64/pr115258.c 
b/gcc/testsuite/gcc.target/aarch64/pr115258.c
new file mode 100644
index 000..9a489d4604c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr115258.c
@@ -0,0 +1,19 @@
+/* { dg-options "-O2" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+/*
+** fun:
+** (ldr|adrp)  [^\n]+
+** (ldr|adrp)  [^\n]+
+** (ldr|adrp)  

Re: [PATCH] tree-optimization/115252 - enhance peeling for gaps avoidance

2024-05-29 Thread Richard Sandiford
Richard Biener  writes:
> Code generation for contiguous load vectorization can already deal
> with generalized avoidance of loading from a gap.  The following
> extends detection of peeling for gaps requirement with that,
> gets rid of the old special casing of a half load and makes sure
> when we do access the gap we have peeling for gaps enabled.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> This is the first patch in a series to improve peeling for gaps,
> it turned out into an improvement for code rather than just doing
> the (delayed from stage3) removal of the "old" half-vector codepath.
>
> I'll wait for the pre-CI testing for pushing so you also have time
> for some comments.

LGTM FWIW (some trivia below).

Out of interest, how far are we off being able to load:

a[i*8+0]
a[i*8+1]
a[i*8+3]
a[i*8+4]

as two half vectors?  It doesn't look like we're quite there yet,
but I might have misread.

It would be nice if we could eventually integrate the overrun_p checks
with the vectorizable_load code that the code is trying to predict.
E.g. we could run through the vectorizable_load code during the
analysis phase and record overruns, similarly to Kewen's costing
patches.  As it stands, it seems difficult to make sure that the two
checks are exactly in sync, especially when the structure is so
different.

> Richard.
>
>   PR tree-optimization/115252
>   * tree-vect-stmts.cc (get_group_load_store_type): Enhance
>   detecting the number of cases where we can avoid accessing a gap
>   during code generation.
>   (vectorizable_load): Remove old half-vector peeling for gap
>   avoidance which is now redundant.  Add gap-aligned case where
>   it's OK to access the gap.  Add assert that we have peeling for
>   gaps enabled when we access a gap.
>
>   * gcc.dg/vect/slp-gap-1.c: New testcase.
> ---
>  gcc/testsuite/gcc.dg/vect/slp-gap-1.c | 18 +
>  gcc/tree-vect-stmts.cc| 58 +--
>  2 files changed, 46 insertions(+), 30 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/slp-gap-1.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-gap-1.c 
> b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> new file mode 100644
> index 000..36463ca22c5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3" } */
> +
> +typedef unsigned char uint8_t;
> +typedef short int16_t;
> +void pixel_sub_wxh(int16_t * __restrict diff, uint8_t *pix1, uint8_t *pix2) {
> +  for (int y = 0; y < 4; y++) {
> +for (int x = 0; x < 4; x++)
> +  diff[x + y * 4] = pix1[x] - pix2[x];
> +pix1 += 16;
> +pix2 += 32;
> +  }
> +}
> +
> +/* We can vectorize this without peeling for gaps and thus without epilogue,
> +   but the only thing we can reliably scan is the zero-padding trick for the
> +   partial loads.  */
> +/* { dg-final { scan-tree-dump-times "\{_\[0-9\]\+, 0" 6 "vect" { target 
> vect64 } } } */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index a01099d3456..b26cc74f417 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -2072,16 +2072,22 @@ get_group_load_store_type (vec_info *vinfo, 
> stmt_vec_info stmt_info,
> dr_alignment_support alss;
> int misalign = dr_misalignment (first_dr_info, vectype);
> tree half_vtype;
> +   poly_uint64 remain;
> +   unsigned HOST_WIDE_INT tem, num;
> if (overrun_p
> && !masked_p
> && (((alss = vect_supportable_dr_alignment (vinfo, first_dr_info,
> vectype, misalign)))
>  == dr_aligned
> || alss == dr_unaligned_supported)
> -   && known_eq (nunits, (group_size - gap) * 2)
> -   && known_eq (nunits, group_size)
> -   && (vector_vector_composition_type (vectype, 2, _vtype)
> -   != NULL_TREE))
> +   && can_div_trunc_p (group_size
> +   * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - gap,
> +   nunits, , )
> +   && (known_eq (remain, 0u)
> +   || (constant_multiple_p (nunits, remain, )
> +   && (vector_vector_composition_type (vectype, num,
> +   _vtype)
> +   != NULL_TREE
>   overrun_p = false;

Might be worth renaming half_vtype now that it isn't necessarily
a strict half.

>  
> if (overrun_p && !can_overrun_p)
> @@ -11533,33 +11539,14 @@ vectorizable_load (vec_info *vinfo,
>   unsigned HOST_WIDE_INT gap = DR_GROUP_GAP (first_stmt_info);
>   unsigned int vect_align
> = vect_known_alignment_in_bytes (first_dr_info, vectype);
> - unsigned int scalar_dr_size
> -   = vect_get_scalar_dr_size (first_dr_info);
> 

Re: [PATCH 1/5] Do single-lane SLP discovery for reductions

2024-05-29 Thread Richard Sandiford
Richard Biener  writes:
> On Fri, 24 May 2024, Richard Biener wrote:
>
>> This is the second merge proposed from the SLP vectorizer branch.
>> I have again managed without adding and using --param vect-single-lane-slp
>> but instead this provides always enabled functionality.
>> 
>> This makes us use SLP reductions (a group of reductions) for the
>> case where the group size is one.  This basically means we try
>> to use SLP for all reductions.
>> 
>> I've kept the series close to changes how they are on the branch
>> but in the end I'll squash it, having separate commits for review
>> eventually helps identifying common issues we will run into.  In
>> particular we lack full SLP support for several reduction kinds
>> and the branch has more enabling patches than in this series.
>> For example 4/5 makes sure we use shifts and direct opcode
>> reductions in the reduction epilog for SLP reductions but doesn't
>> bother to try covering the general case but enables it only
>> for the single-element group case to avoid regressions
>> in gcc.dg/vect/reduc-{mul,or}_[12].c testcases.
>> 
>> Bootstrapped and tested on x86_64-unknown-linux-gnu, I've also
>> successfully built SPEC CPU 2017.  This posting should trigger
>> arm & riscv pre-checkin CI.
>> 
>> There's one ICE in gcc.target/i386/pr51235.c I discovered late
>> that I will investigate and address after the weekend.
>
> I've fixed this now.
>
> On aarch64 and arm there's
>
> FAIL: gcc.dg/vect/slp-reduc-3.c scan-tree-dump-times vect "VEC_PERM_EXPR" 
> 0
>
> which is a testism, I _think_ due to a bogus vect_load_lanes check
> in that line.  The code is as expected not using a SLP reduction of
> two lanes due to the widen-sum pattern used.  It might be that we
> somehow fail to use load-lanes when vectorizing the load with SLP
> which means that for SLP reductions we fail to consider
> load-lanes as override.  I think we should leave this FAIL, we need to
> work to get load-lanes vectorization from SLP anyway.  To fix this
> the load-permutation followup I have in the works will be necessary.

Sounds good to me FWIW.

> I also see
>
> FAIL: gcc.target/aarch64/sve/dot_1.c scan-assembler-times \\twhilelo\\t 8
> FAIL: gcc.target/aarch64/sve/reduc_4.c scan-assembler-not \\tfadd\\t
> FAIL: gcc.target/aarch64/sve/sad_1.c scan-assembler-times 
> \\tudot\\tz[0-9]+\\.s, z[0-9]+\\.b, z[0-9]+\\.b\\n 2
>
> but scan-assemblers are not my favorite.  For example dot_1.c has
> twice as many whilelo, but I'm not sure what goes wrong.
>
> There are quite some regressions reported for RISC-V, I looked at the
> ICEs and fixed them but I did not investigate any of the assembly
> scanning FAILs.
>
> I'll re-spin the series with the fixes tomorrow.
> If anybody wants to point out something I should investigate please
> speak up.

Thanks for checking the aarch64 results.  I'll look at the three SVE
failures once the patch is in.  Many of the tests are supposed to ensure
that we generate correct code for a given set of choices.  Sometimes
it's necessary to update the flags to retain the same set of choices,
e.g. due to costing changes or general vectoriser improvements.

That is, the point of these tests isn't necessarily to make sure that we
get the "best" SVE code for the source -- especially since there isn't
really an abstract, objective "best" that applies to all targets.
The tests are instead recognising that we have multiple techniques for
doing some things, and are trying to make sure that each of those
techniques works individually.

I realise that kind of test isn't popular with everyone.  The quid
pro quo is that we (AArch64 folks) get to look at the tests when
failures show up :)

Richard

>
> Thanks,
> Richard.
>
>> This change should be more straight-forward than the previous one,
>> still comments are of course welcome.  After pushed I will followup
>> with changes to enable single-lane SLP reductions for various
>> COND_EXPR reductions as well as double-reduction support and
>> in-order reduction support (also all restricted to single-lane
>> for the moment).
>>
>> Thanks,
>> Richard.
>> 
>> --
>> 
>> The following performs single-lane SLP discovery for reductions.
>> This exposes a latent issue with reduction SLP in outer loop
>> vectorization and makes gcc.dg/vect/vect-outer-4[fgkl].c FAIL
>> execution.
>> 
>>  * tree-vect-slp.cc (vect_build_slp_tree_2): Only multi-lane
>>  discoveries are reduction chains and need special backedge
>>  treatment.
>>  (vect_analyze_slp): Fall back to single-lane SLP discovery
>>  for reductions.  Make sure to try single-lane SLP reduction
>>  for all reductions as fallback.
>> ---
>>  gcc/tree-vect-slp.cc | 71 +---
>>  1 file changed, 54 insertions(+), 17 deletions(-)
>> 
>> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
>> index c7ed520b629..73cc69d85ce 100644
>> --- a/gcc/tree-vect-slp.cc
>> +++ b/gcc/tree-vect-slp.cc
>> @@ -1907,7 +1907,8 @@ vect_build_slp_tree_2 

Re: [PATCHv3] Optab: add isfinite_optab for __builtin_isfinite

2024-05-28 Thread Richard Sandiford
HAO CHEN GUI  writes:
> Hi,
>   This patch adds an optab for __builtin_isfinite. The finite check can be
> implemented on rs6000 by a single instruction. It needs an optab to be
> expanded to the certain sequence of instructions.
>
>   The subsequent patches will implement the expand on rs6000.
>
>   Compared to previous version, the main change is to specify acceptable
> modes for the optab.
> https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652170.html
>
>   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
> regressions. Is this OK for trunk?
>
> Thanks
> Gui Haochen
>
> ChangeLog
> optab: Add isfinite_optab for isfinite builtin
>
> gcc/
>   * builtins.cc (interclass_mathfn_icode): Set optab to isfinite_optab
>   for isfinite builtin.
>   * optabs.def (isfinite_optab): New.
>   * doc/md.texi (isfinite): Document.
>
>
> patch.diff
> diff --git a/gcc/builtins.cc b/gcc/builtins.cc
> index f8d94c4b435..b8432f84020 100644
> --- a/gcc/builtins.cc
> +++ b/gcc/builtins.cc
> @@ -2459,8 +2459,9 @@ interclass_mathfn_icode (tree arg, tree fndecl)
>errno_set = true; builtin_optab = ilogb_optab; break;
>  CASE_FLT_FN (BUILT_IN_ISINF):
>builtin_optab = isinf_optab; break;
> -case BUILT_IN_ISNORMAL:
>  case BUILT_IN_ISFINITE:
> +  builtin_optab = isfinite_optab; break;
> +case BUILT_IN_ISNORMAL:
>  CASE_FLT_FN (BUILT_IN_FINITE):
>  case BUILT_IN_FINITED32:
>  case BUILT_IN_FINITED64:
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 5730bda80dc..67407fad37d 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -8557,6 +8557,15 @@ operand 2, greater than operand 2 or is unordered with 
> operand 2.
>
>  This pattern is not allowed to @code{FAIL}.
>
> +@cindex @code{isfinite@var{m}2} instruction pattern
> +@item @samp{isfinite@var{m}2}
> +Set operand 0 to nonzero if operand 1 is a finite @code{SFmode},
> +@code{DFmode}, or @code{TFmode} floating point number and to 0
> +otherwise.

This has probably already been discussed, sorry, but how about defining
the optab to return a strict 0/1 result, rather than just zero/nonzero?
I realise that's stricter than the underlying math.h routines, but it
would in principle avoid the need to expand extra instructions in
a setcc-like operation.
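
(That is, the documentation would promise something like "set operand 0
to 1 if ... and to 0 otherwise", so that a boolean use such as, for
instance:

  _Bool ok = __builtin_isfinite (x);

could take the pattern's result as-is, rather than having to normalise
an arbitrary nonzero value with a separate comparison.)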

Richard

> +
> +If this pattern @code{FAIL}, a call to the library function
> +@code{isfinite} is used.
> +
>  @end table
>
>  @end ifset
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index ad14f9328b9..dcd77315c2a 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -352,6 +352,7 @@ OPTAB_D (fmod_optab, "fmod$a3")
>  OPTAB_D (hypot_optab, "hypot$a3")
>  OPTAB_D (ilogb_optab, "ilogb$a2")
>  OPTAB_D (isinf_optab, "isinf$a2")
> +OPTAB_D (isfinite_optab, "isfinite$a2")
>  OPTAB_D (issignaling_optab, "issignaling$a2")
>  OPTAB_D (ldexp_optab, "ldexp$a3")
>  OPTAB_D (log10_optab, "log10$a2")


Re: [PATCH v3] tree-ssa-pre.c/115214(ICE in find_or_generate_expression, at tree-ssa-pre.c:2780): Return NULL_TREE when deal special cases.

2024-05-28 Thread Richard Sandiford
Richard Biener  writes:
> On Mon, May 27, 2024 at 9:48 AM Jiawei  wrote:
>>
>> Return NULL_TREE when genop3 equal EXACT_DIV_EXPR.
>> https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652641.html
>>
>> version log v3: remove additional POLY_INT_CST check.
>> https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652795.html
>
> OK.
>
> Thanks,
> Richard.
>
>> gcc/ChangeLog:
>>
>> * tree-ssa-pre.cc (create_component_ref_by_pieces_1): New conditions.
>>
>> gcc/testsuite/ChangeLog:
>>
>> * gcc.target/riscv/rvv/vsetvl/pr115214.c: New test.
>>
>> ---
>>  .../gcc.target/riscv/rvv/vsetvl/pr115214.c| 52 +++
>>  gcc/tree-ssa-pre.cc   | 10 ++--
>>  2 files changed, 59 insertions(+), 3 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr115214.c
>>
>> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr115214.c 
>> b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr115214.c
>> new file mode 100644
>> index 000..fce2e9da766
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr115214.c
>> @@ -0,0 +1,52 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-mrvv-vector-bits=scalable -march=rv64gcv -mabi=lp64d -O3 
>> -w" } */
>> +/* { dg-skip-if "" { *-*-* } { "-flto" } } */
>> +
>> +#include 
>> +
>> +static inline __attribute__(()) int vaddq_f32();
>> +static inline __attribute__(()) int vload_tillz_f32(int nlane) {
>> +  vint32m1_t __trans_tmp_9;
>> +  {
>> +int __trans_tmp_0 = nlane;
>> +{
>> +  vint64m1_t __trans_tmp_1;
>> +  vint64m1_t __trans_tmp_2;
>> +  vint64m1_t __trans_tmp_3;
>> +  vint64m1_t __trans_tmp_4;
>> +  if (__trans_tmp_0 == 1) {
>> +{
>> +  __trans_tmp_3 =
>> +  __riscv_vslideup_vx_i64m1(__trans_tmp_1, __trans_tmp_2, 1, 2);
>> +}
>> +__trans_tmp_4 = __trans_tmp_2;
>> +  }
>> +  __trans_tmp_4 = __trans_tmp_3;
>> +  __trans_tmp_9 = __riscv_vreinterpret_v_i64m1_i32m1(__trans_tmp_3);
>> +}
>> +  }
>> +  return vaddq_f32(__trans_tmp_9); /* { dg-error {RVV type 'vint32m1_t' 
>> cannot be passed to an unprototyped function} } */
>> +}
>> +
>> +char CFLOAT_add_args[3];
>> +const int *CFLOAT_add_steps;
>> +const int CFLOAT_steps;
>> +
>> +__attribute__(()) void CFLOAT_add() {
>> +  char *b_src0 = _add_args[0], *b_src1 = _add_args[1],
>> +   *b_dst = _add_args[2];
>> +  const float *src1 = (float *)b_src1;
>> +  float *dst = (float *)b_dst;
>> +  const int ssrc1 = CFLOAT_add_steps[1] / sizeof(float);
>> +  const int sdst = CFLOAT_add_steps[2] / sizeof(float);
>> +  const int hstep = 4 / 2;
>> +  vfloat32m1x2_t a;
>> +  int len = 255;
>> +  for (; len > 0; len -= hstep, src1 += 4, dst += 4) {
>> +int b = vload_tillz_f32(len);
>> +int r = vaddq_f32(a.__val[0], b); /* { dg-error {RVV type 
>> '__rvv_float32m1_t' cannot be passed to an unprototyped function} } */
>> +  }
>> +  for (; len > 0; --len, b_src0 += CFLOAT_steps,
>> +  b_src1 += CFLOAT_add_steps[1], b_dst += 
>> CFLOAT_add_steps[2])
>> +;
>> +}
>> diff --git a/gcc/tree-ssa-pre.cc b/gcc/tree-ssa-pre.cc
>> index 75217f5cde1..5cf1968bc26 100644
>> --- a/gcc/tree-ssa-pre.cc
>> +++ b/gcc/tree-ssa-pre.cc
>> @@ -2685,11 +2685,15 @@ create_component_ref_by_pieces_1 (basic_block block, 
>> vn_reference_t ref,
>>here as the element alignment may be not visible.  See
>>PR43783.  Simply drop the element size for constant
>>sizes.  */
>> -   if (TREE_CODE (genop3) == INTEGER_CST
>> +   if ((TREE_CODE (genop3) == INTEGER_CST
>> && TREE_CODE (TYPE_SIZE_UNIT (elmt_type)) == INTEGER_CST
>> && wi::eq_p (wi::to_offset (TYPE_SIZE_UNIT (elmt_type)),
>> -(wi::to_offset (genop3)
>> - * vn_ref_op_align_unit (currop
>> +(wi::to_offset (genop3) * vn_ref_op_align_unit 
>> (currop

Sorry for the nits, but the original formatting was correct here.
The new one instead goes over 80 columns.

>> + || (TREE_CODE (genop3) == EXACT_DIV_EXPR
>> +   && TREE_CODE (TREE_OPERAND (genop3, 1)) == INTEGER_CST
>> +   && operand_equal_p (TREE_OPERAND (genop3, 0), TYPE_SIZE_UNIT 
>> (elmt_type))

Similarly this line is too long.

Thanks for fixing this.

Richard

>> +   && wi::eq_p (wi::to_offset (TREE_OPERAND (genop3, 1)),
>> +vn_ref_op_align_unit (currop
>>   genop3 = NULL_TREE;
>> else
>>   {
>> --
>> 2.25.1
>>


Re: [PATCH v2 1/2] driver: Use -as/ld/objcopy as final fallback instead of native ones for cross

2024-05-28 Thread Richard Sandiford
YunQiang Su  writes:
> If `find_a_program` cannot find `as/ld/objcopy` and we are a cross toolchain,
> the final fallback is `as/ld` of system.  In fact, we can have a try with
> -as/ld/objcopy before fallback to native as/ld/objcopy.
>
> This patch is derivatived from Debian's patch:
>   gcc-search-prefixed-as-ld.diff

I'm probably making you repeat a previous discussion, sorry, but could
you describe the use case in more detail?  The current approach to
handling cross toolchains has been used for many years.  Presumably
this patch is supporting a different way of organising things,
but I wasn't sure from the description what it was.

AIUI, we currently assume that cross as, ld and objcopy will be
installed under those names in $prefix/$target_alias/bin (aka $tooldir/bin).
E.g.:

   bin/aarch64-elf-as = aarch64-elf/bin/as

GCC should then find as in aarch64-elf/bin.

Is that not true in your case?

To be clear, I'm not saying the patch is wrong.  I'm just trying to
understand why the patch is needed.

Thanks,
Richard

>
> gcc
>   * gcc.cc(execute): Looks for -as/ld/objcopy before fallback
>   to native as/ld/objcopy.
> ---
>  gcc/gcc.cc | 20 
>  1 file changed, 20 insertions(+)
>
> diff --git a/gcc/gcc.cc b/gcc/gcc.cc
> index 830a4700a87..3dc6348d761 100644
> --- a/gcc/gcc.cc
> +++ b/gcc/gcc.cc
> @@ -3293,6 +3293,26 @@ execute (void)
>string = find_a_program(commands[0].prog);
>if (string)
>   commands[0].argv[0] = string;
> +  else if (*cross_compile != '0'
> + && !strcmp (commands[0].argv[0], commands[0].prog)
> + && (!strcmp (commands[0].prog, "as")
> + || !strcmp (commands[0].prog, "ld")
> + || !strcmp (commands[0].prog, "objcopy")))
> + {
> +   string = concat (DEFAULT_REAL_TARGET_MACHINE, "-",
> + commands[0].prog, NULL);
> +   const char *string_args[] = {string, "--version", NULL};
> +   int exit_status = 0;
> +   int err = 0;
> +   const char *errmsg = pex_one (PEX_SEARCH, string,
> +   CONST_CAST (char **, string_args), string,
> +   NULL, NULL, _status, );
> +   if (errmsg == NULL && exit_status == 0 && err == 0)
> + {
> +   commands[0].argv[0] = string;
> +   commands[0].prog = string;
> + }
> + }
>  }
>  
>for (n_commands = 1, i = 0; argbuf.iterate (i, ); i++)


Re: [PATCH] attribs: Fix and refactor diag_attr_exclusions

2024-05-28 Thread Richard Sandiford
Andrew Carlotti  writes:
> The existing implementation of this function was convoluted, and had
> multiple control flow errors that became apparent to me while reading
> the code:
>
> 1. The initial early return only checked the properties of the first
> exclusion in the list, when these properties could be different for
> subsequent exclusions.
>
> 2. excl was not reset within the outer loop, so the inner loop body
> would only execute during the first iteration of the outer loop.  This
> effectively meant that the value of attrs[1] was ignored.
>
> 3. The function called itself recursively twice, with both last_decl and
> TREE_TYPE (last_decl) as parameters. The second recursive call should
> have been redundant, since attrs[1] = TREE_TYPE (last_decl) during the
> first recursive call.

Thanks for doing this.  Agree with the above.

> This patch eliminated the early return, and combines the checks with
> those present within the inner loop.  It also fixes the inner loop
> initialisation, and modifies the outer loop to iterate over nodes
> instead of their attributes. This latter change allows the recursion to
> be eliminated, by extending the new nodes array to include last_decl
> (and its type) as well.
>
> This patch provides an alternative fix for PR114634, although I wasn't
> aware of that issue until rebasing on top of Jakub's fix.
>
> I am not aware of any other compiler bugs resulting from these issues.
> However, if the exclusions for target_clones were listed in the opposite
> order, then it would have broken detection of the always_inline
> exclusion on aarch64 (where TARGET_HAS_FMV_TARGET_ATTRIBUTE is false).
>
> Is this ok for master?
>
> gcc/ChangeLog:
>
>   * attribs.cc (diag_attr_exclusions): Fix and refactor.
>
>
> diff --git a/gcc/attribs.cc b/gcc/attribs.cc
> index 
> 3ab0b0fd87a4404a593b2de365ea5226e31fe24a..431dd4255e68e92dd8d10bbb21ea079e50811faa
>  100644
> --- a/gcc/attribs.cc
> +++ b/gcc/attribs.cc
> @@ -433,84 +433,69 @@ get_attribute_namespace (const_tree attr)
> or a TYPE.  */
>  
>  static bool
> -diag_attr_exclusions (tree last_decl, tree node, tree attrname,
> +diag_attr_exclusions (tree last_decl, tree base_node, tree attrname,
> const attribute_spec *spec)
>  {
> -  const attribute_spec::exclusions *excl = spec->exclude;
>  
> -  tree_code code = TREE_CODE (node);
> +  /* BASE_NODE is either the current decl to which the attribute is being
> + applied, or its type.  For the former, consider the attributes on both 
> the
> + decl and its type.  Check both LAST_DECL and its type as well.  */
>  
> -  if ((code == FUNCTION_DECL && !excl->function
> -   && (!excl->type || !spec->affects_type_identity))
> -  || (code == VAR_DECL && !excl->variable
> -   && (!excl->type || !spec->affects_type_identity))
> -  || (((code == TYPE_DECL || RECORD_OR_UNION_TYPE_P (node)) && 
> !excl->type)))
> -return false;
> +  tree nodes[4] = { NULL_TREE, NULL_TREE, NULL_TREE, NULL_TREE };
>  
> -  /* True if an attribute that's mutually exclusive with ATTRNAME
> - has been found.  */
> -  bool found = false;
> +  nodes[0] = base_node;
> +  if (DECL_P (base_node))
> +  nodes[1] = (TREE_TYPE (base_node));

Nit: too much indentation.

> -  if (last_decl && last_decl != node && TREE_TYPE (last_decl) != node)
> +  if (last_decl)
>  {
> -  /* Check both the last DECL and its type for conflicts with
> -  the attribute being added to the current decl or type.  */
> -  found |= diag_attr_exclusions (last_decl, last_decl, attrname, spec);
> -  tree decl_type = TREE_TYPE (last_decl);
> -  found |= diag_attr_exclusions (last_decl, decl_type, attrname, spec);
> +  nodes[2] = last_decl;
> +  if (DECL_P (last_decl))
> +   nodes[3] = TREE_TYPE (last_decl);
>  }
>  
> -  /* NODE is either the current DECL to which the attribute is being
> - applied or its TYPE.  For the former, consider the attributes on
> - both the DECL and its type.  */
> -  tree attrs[2];
> -
> -  if (DECL_P (node))
> -{
> -  attrs[0] = DECL_ATTRIBUTES (node);
> -  if (TREE_TYPE (node))
> - attrs[1] = TYPE_ATTRIBUTES (TREE_TYPE (node));
> -  else
> - /* TREE_TYPE can be NULL e.g. while processing attributes on
> -enumerators.  */
> - attrs[1] = NULL_TREE;
> -}
> -  else
> -{
> -  attrs[0] = TYPE_ATTRIBUTES (node);
> -  attrs[1] = NULL_TREE;
> -}
> +  /* True if an attribute that's mutually exclusive with ATTRNAME
> + has been found.  */
> +  bool found = false;
>  
>/* Iterate over the mutually exclusive attribute names and verify
>   that the symbol doesn't contain it.  */
> -  for (unsigned i = 0; i != ARRAY_SIZE (attrs); ++i)
> +  for (unsigned i = 0; i != ARRAY_SIZE (nodes); ++i)
>  {
> -  if (!attrs[i])
> +  tree node = nodes[i];
> +
> +  if (!node)
>   continue;
>  
> -  for ( ; excl->name; ++excl)
> +  tree attr;
> +  if DECL_P 

[PATCH] vect: Fix access size alignment assumption [PR115192]

2024-05-24 Thread Richard Sandiford
create_intersect_range_checks checks whether two access ranges
a and b are alias-free using something equivalent to:

  end_a <= start_b || end_b <= start_a

It has two ways of doing this: a "vanilla" way that calculates
the exact exclusive end pointers, and another way that uses the
last inclusive aligned pointers (and changes the comparisons
accordingly).  The comment for the latter is:

  /* Calculate the minimum alignment shared by all four pointers,
 then arrange for this alignment to be subtracted from the
 exclusive maximum values to get inclusive maximum values.
 This "- min_align" is cumulative with a "+ access_size"
 in the calculation of the maximum values.  In the best
 (and common) case, the two cancel each other out, leaving
 us with an inclusive bound based only on seg_len.  In the
 worst case we're simply adding a smaller number than before.

The problem is that the associated code implicitly assumed that the
access size was a multiple of the pointer alignment, and so the
alignment could be carried over to the exclusive end pointer.
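
As an illustration (with invented numbers): if both starts are 16-byte
aligned but the accesses are only 4 bytes wide, and min_align is taken
to be 16, then the inclusive maximum is end_a - 16 and "max_a < start_b"
only guarantees end_a <= start_b + 15.  When end_a is itself 16-byte
aligned that implies end_a <= start_b, but here end_a is only 4-byte
aligned, so the segments can overlap even though the check says they
don't.  Capping min_align by the alignment of the access sizes (4 in
this example) restores the implication.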

The testcase started failing after g:9fa5b473b5b8e289b6542
because that commit improved the alignment information for
the accesses.

Tested on aarch64-linux-gnu & x86_64-linux-gnu.  OK for trunk
and backports?

Richard


gcc/
PR tree-optimization/115192
* tree-data-ref.cc (create_intersect_range_checks): Take the
alignment of the access sizes into account.

gcc/testsuite/
PR tree-optimization/115192
* gcc.dg/vect/pr115192.c: New test.
---
 gcc/testsuite/gcc.dg/vect/pr115192.c | 28 
 gcc/tree-data-ref.cc |  5 -
 2 files changed, 32 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr115192.c

diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c 
b/gcc/testsuite/gcc.dg/vect/pr115192.c
new file mode 100644
index 000..923d377c1bb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr115192.c
@@ -0,0 +1,28 @@
+#include "tree-vect.h"
+
+int data[4 * 16 * 16] __attribute__((aligned(16)));
+
+__attribute__((noipa)) void
+foo (__SIZE_TYPE__ n)
+{
+  for (__SIZE_TYPE__ i = 1; i < n; ++i)
+{
+  data[i * n * 4] = data[(i - 1) * n * 4] + 1;
+  data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2;
+}
+}
+
+int
+main ()
+{
+  check_vect ();
+
+  data[0] = 10;
+  data[1] = 20;
+
+  foo (3);
+
+  if (data[24] != 12 || data[25] != 24)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc
index db15ddb43de..7c4049faf34 100644
--- a/gcc/tree-data-ref.cc
+++ b/gcc/tree-data-ref.cc
@@ -73,6 +73,7 @@ along with GCC; see the file COPYING3.  If not see
 
 */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -2640,7 +2641,9 @@ create_intersect_range_checks (class loop *loop, tree 
*cond_expr,
 Because the maximum values are inclusive, there is an alias
 if the maximum value of one segment is equal to the minimum
 value of the other.  */
-  min_align = MIN (dr_a.align, dr_b.align);
+  min_align = std::min (dr_a.align, dr_b.align);
+  min_align = std::min (min_align, known_alignment (dr_a.access_size));
+  min_align = std::min (min_align, known_alignment (dr_b.access_size));
   cmp_code = LT_EXPR;
 }
 
-- 
2.25.1



Re: [PATCH] aarch64: Fold vget_high_* intrinsics to BIT_FIELD_REF [PR102171]

2024-05-22 Thread Richard Sandiford
Pengxuan Zheng  writes:
> This patch is a follow-up of r15-697-ga2e4fe5a53cf75 to also fold vget_high_*
> intrinsics to BIT_FILED_REF and remove the vget_high_* definitions from
> arm_neon.h to use the new intrinsics framework.
>
>   PR target/102171
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-builtins.cc (AARCH64_SIMD_VGET_HIGH_BUILTINS):
>   New macro to create definitions for all vget_high intrinsics.
>   (VGET_HIGH_BUILTIN): Likewise.
>   (enum aarch64_builtins): Add vget_high function codes.
>   (AARCH64_SIMD_VGET_LOW_BUILTINS): Delete duplicate macro.
>   (aarch64_general_fold_builtin): Fold vget_high calls.
>   * config/aarch64/aarch64-simd-builtins.def: Delete vget_high builtins.
>   * config/aarch64/aarch64-simd.md (aarch64_get_high): Delete.
>   (aarch64_vget_hi_halfv8bf): Likewise.
>   * config/aarch64/arm_neon.h (__attribute__): Delete.
>   (vget_high_f16): Likewise.
>   (vget_high_f32): Likewise.
>   (vget_high_f64): Likewise.
>   (vget_high_p8): Likewise.
>   (vget_high_p16): Likewise.
>   (vget_high_p64): Likewise.
>   (vget_high_s8): Likewise.
>   (vget_high_s16): Likewise.
>   (vget_high_s32): Likewise.
>   (vget_high_s64): Likewise.
>   (vget_high_u8): Likewise.
>   (vget_high_u16): Likewise.
>   (vget_high_u32): Likewise.
>   (vget_high_u64): Likewise.
>   (vget_high_bf16): Likewise.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/vget_high_2.c: New test.
>   * gcc.target/aarch64/vget_high_2_be.c: New test.

OK, thanks.

Richard

> Signed-off-by: Pengxuan Zheng 
> ---
>  gcc/config/aarch64/aarch64-builtins.cc|  59 +++---
>  gcc/config/aarch64/aarch64-simd-builtins.def  |   6 -
>  gcc/config/aarch64/aarch64-simd.md|  22 
>  gcc/config/aarch64/arm_neon.h | 105 --
>  .../gcc.target/aarch64/vget_high_2.c  |  30 +
>  .../gcc.target/aarch64/vget_high_2_be.c   |  31 ++
>  6 files changed, 104 insertions(+), 149 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/vget_high_2.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/vget_high_2_be.c
>
> diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
> b/gcc/config/aarch64/aarch64-builtins.cc
> index 11b888016ed..f8eeccb554d 100644
> --- a/gcc/config/aarch64/aarch64-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-builtins.cc
> @@ -675,6 +675,23 @@ static aarch64_simd_builtin_datum 
> aarch64_simd_builtin_data[] = {
>VGET_LOW_BUILTIN(u64) \
>VGET_LOW_BUILTIN(bf16)
>  
> +#define AARCH64_SIMD_VGET_HIGH_BUILTINS \
> +  VGET_HIGH_BUILTIN(f16) \
> +  VGET_HIGH_BUILTIN(f32) \
> +  VGET_HIGH_BUILTIN(f64) \
> +  VGET_HIGH_BUILTIN(p8) \
> +  VGET_HIGH_BUILTIN(p16) \
> +  VGET_HIGH_BUILTIN(p64) \
> +  VGET_HIGH_BUILTIN(s8) \
> +  VGET_HIGH_BUILTIN(s16) \
> +  VGET_HIGH_BUILTIN(s32) \
> +  VGET_HIGH_BUILTIN(s64) \
> +  VGET_HIGH_BUILTIN(u8) \
> +  VGET_HIGH_BUILTIN(u16) \
> +  VGET_HIGH_BUILTIN(u32) \
> +  VGET_HIGH_BUILTIN(u64) \
> +  VGET_HIGH_BUILTIN(bf16)
> +
>  typedef struct
>  {
>const char *name;
> @@ -717,6 +734,9 @@ typedef struct
>  #define VGET_LOW_BUILTIN(A) \
>AARCH64_SIMD_BUILTIN_VGET_LOW_##A,
>  
> +#define VGET_HIGH_BUILTIN(A) \
> +  AARCH64_SIMD_BUILTIN_VGET_HIGH_##A,
> +
>  #undef VAR1
>  #define VAR1(T, N, MAP, FLAG, A) \
>AARCH64_SIMD_BUILTIN_##T##_##N##A,
> @@ -753,6 +773,7 @@ enum aarch64_builtins
>/* SIMD intrinsic builtins.  */
>AARCH64_SIMD_VREINTERPRET_BUILTINS
>AARCH64_SIMD_VGET_LOW_BUILTINS
> +  AARCH64_SIMD_VGET_HIGH_BUILTINS
>/* ARMv8.3-A Pointer Authentication Builtins.  */
>AARCH64_PAUTH_BUILTIN_AUTIA1716,
>AARCH64_PAUTH_BUILTIN_PACIA1716,
> @@ -855,26 +876,21 @@ static aarch64_fcmla_laneq_builtin_datum 
> aarch64_fcmla_lane_builtin_data[] = {
> false \
>},
>  
> -#define AARCH64_SIMD_VGET_LOW_BUILTINS \
> -  VGET_LOW_BUILTIN(f16) \
> -  VGET_LOW_BUILTIN(f32) \
> -  VGET_LOW_BUILTIN(f64) \
> -  VGET_LOW_BUILTIN(p8) \
> -  VGET_LOW_BUILTIN(p16) \
> -  VGET_LOW_BUILTIN(p64) \
> -  VGET_LOW_BUILTIN(s8) \
> -  VGET_LOW_BUILTIN(s16) \
> -  VGET_LOW_BUILTIN(s32) \
> -  VGET_LOW_BUILTIN(s64) \
> -  VGET_LOW_BUILTIN(u8) \
> -  VGET_LOW_BUILTIN(u16) \
> -  VGET_LOW_BUILTIN(u32) \
> -  VGET_LOW_BUILTIN(u64) \
> -  VGET_LOW_BUILTIN(bf16)
> +#undef VGET_HIGH_BUILTIN
> +#define VGET_HIGH_BUILTIN(A) \
> +  {"vget_high_" #A, \
> +   AARCH64_SIMD_BUILTIN_VGET_HIGH_##A, \
> +   2, \
> +   { SIMD_INTR_MODE(A, d), SIMD_INTR_MODE(A, q) }, \
> +   { SIMD_INTR_QUAL(A), SIMD_INTR_QUAL(A) }, \
> +   FLAG_AUTO_FP, \
> +   false \
> +  },
>  
>  static const aarch64_simd_intrinsic_datum aarch64_simd_intrinsic_data[] = {
>AARCH64_SIMD_VREINTERPRET_BUILTINS
>AARCH64_SIMD_VGET_LOW_BUILTINS
> +  AARCH64_SIMD_VGET_HIGH_BUILTINS
>  };
>  
>  
> @@ -3270,6 +3286,10 @@ aarch64_fold_builtin_lane_check (tree arg0, tree arg1, 
> tree arg2)
>  #define 

Re: [PATCH v1 5/6] Adjust DLL import/export implementation for AArch64

2024-05-22 Thread Richard Sandiford
Evgeny Karpov  writes:
> The DLL import/export mingw implementation, originally from ix86, requires
> minor adjustments to be compatible with AArch64.
>
> gcc/ChangeLog:
>
>   * config/mingw/mingw32.h (defined): Use the correct DllMainCRTStartup
>   entry function.
>   * config/mingw/winnt-dll.cc (defined): Exclude ix86-related code.
> ---
>  gcc/config/mingw/mingw32.h| 2 +-
>  gcc/config/mingw/winnt-dll.cc | 4 
>  2 files changed, 5 insertions(+), 1 deletion(-)

Could we provide some abstractions here, rather than testing
CPU-specific macros directly?  E.g.:

>
> diff --git a/gcc/config/mingw/mingw32.h b/gcc/config/mingw/mingw32.h
> index 08f1b5f0696..efe777051b4 100644
> --- a/gcc/config/mingw/mingw32.h
> +++ b/gcc/config/mingw/mingw32.h
> @@ -79,7 +79,7 @@ along with GCC; see the file COPYING3.  If not see
>  #endif
>  
>  #undef SUB_LINK_ENTRY
> -#if TARGET_64BIT_DEFAULT
> +#if TARGET_64BIT_DEFAULT || defined (TARGET_AARCH64_MS_ABI)

it looks like this is equivalent to something like "HAVE_64BIT_POINTERS"
or something, which aarch64 could define to 1 and x86 could define
to TARGET_64BIT_DEFAULT.

The name is just a suggestion, based on not really knowing what the
macro selects.  Please use whatever makes most sense :)
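
Concretely, I was imagining something like this (name and placement
purely illustrative):

  /* config/aarch64/cygming.h */
  #define HAVE_64BIT_POINTERS 1

  /* config/i386/cygming.h */
  #define HAVE_64BIT_POINTERS TARGET_64BIT_DEFAULT

  /* config/mingw/mingw32.h */
  #if HAVE_64BIT_POINTERS
  #define SUB_LINK_ENTRY SUB_LINK_ENTRY64
  #else
  #define SUB_LINK_ENTRY SUB_LINK_ENTRY32
  #endif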

>  #define SUB_LINK_ENTRY SUB_LINK_ENTRY64
>  #else
>  #define SUB_LINK_ENTRY SUB_LINK_ENTRY32
> diff --git a/gcc/config/mingw/winnt-dll.cc b/gcc/config/mingw/winnt-dll.cc
> index 349ade6f5c0..294361fab4c 100644
> --- a/gcc/config/mingw/winnt-dll.cc
> +++ b/gcc/config/mingw/winnt-dll.cc
> @@ -206,9 +206,13 @@ legitimize_pe_coff_symbol (rtx addr, bool inreg)
>   }
>  }
>  
> +#if !defined (TARGET_AARCH64_MS_ABI)
> +
>if (ix86_cmodel != CM_LARGE_PIC && ix86_cmodel != CM_MEDIUM_PIC)
>  return NULL_RTX;
>  
> +#endif
> +

Similarly here, it feels like there is a concept underlying this check.
Could we just use:

  if (!NEW_MACRO)
return NULL_RTX;

with NEW_MACRO describing the underlying property that is common to
medium x86 PIC, large x86 PIC, and aarch64.

Thanks,
Richard

>if (GET_CODE (addr) == SYMBOL_REF
>&& !is_imported_p (addr)
>&& SYMBOL_REF_EXTERNAL_P (addr)


Re: [PATCH v1 4/6] aarch64: Add selectany attribute handling

2024-05-22 Thread Richard Sandiford
Evgeny Karpov  writes:
> This patch extends the aarch64 attributes list with the selectany
> attribute for the aarch64-w64-mingw32 target and reuses the mingw
> implementation to handle it.
>
>   * config/aarch64/aarch64.cc:
>   Extend the aarch64 attributes list.
>   * config/aarch64/cygming.h (SUBTARGET_ATTRIBUTE_TABLE):
>   Define the selectany attribute.

Now that TARGET_ATTRIBUTE_TABLE is an array, it should in principle
be possible to define the attribute in winnt.cc and so avoid
duplicating the table entry.  That'd be a separate clean-up though.
I agree that for this series we should stick with the current approach.
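
(Roughly, and with details still to be checked, the clean-up I have in
mind would be for winnt.cc to provide something like:

  static const attribute_spec mingw_attributes[] =
  {
    { "selectany", 0, 0, true, false, false, false,
      mingw_handle_selectany_attribute, NULL }
  };

  const scoped_attribute_specs mingw_attribute_table =
  {
    "gnu", { mingw_attributes }
  };

with targets then listing &mingw_attribute_table in their
TARGET_ATTRIBUTE_TABLE arrays instead of repeating the entry.)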

So the patch is ok, thanks.

Richard

> ---
>  gcc/config/aarch64/aarch64.cc | 5 -
>  gcc/config/aarch64/cygming.h  | 3 +++
>  2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index c763a8a6298..19205927430 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -855,7 +855,10 @@ static const attribute_spec aarch64_gnu_attributes[] =
> NULL },
>{ "Advanced SIMD type", 1, 1, false, true,  false, true,  NULL, NULL },
>{ "SVE type",3, 3, false, true,  false, true,  NULL, NULL 
> },
> -  { "SVE sizeless type",  0, 0, false, true,  false, true,  NULL, NULL }
> +  { "SVE sizeless type",  0, 0, false, true,  false, true,  NULL, NULL },
> +#ifdef SUBTARGET_ATTRIBUTE_TABLE
> +  SUBTARGET_ATTRIBUTE_TABLE
> +#endif
>  };
>  
>  static const scoped_attribute_specs aarch64_gnu_attribute_table =
> diff --git a/gcc/config/aarch64/cygming.h b/gcc/config/aarch64/cygming.h
> index 0d048879311..76623153080 100644
> --- a/gcc/config/aarch64/cygming.h
> +++ b/gcc/config/aarch64/cygming.h
> @@ -154,6 +154,9 @@ still needed for compilation.  */
>  flag_stack_check = STATIC_BUILTIN_STACK_CHECK;   \
>} while (0)
>  
> +#define SUBTARGET_ATTRIBUTE_TABLE \
> +  { "selectany", 0, 0, true, false, false, false, \
> +mingw_handle_selectany_attribute, NULL }
>  
>  #define SUPPORTS_ONE_ONLY 1


Re: [PATCH v1 3/6] Rename functions for reuse in AArch64

2024-05-22 Thread Richard Sandiford
Evgeny Karpov  writes:
> This patch renames functions related to dllimport/dllexport
> and selectany functionality. These functions will be reused
> in the aarch64-w64-mingw32 target.
>
> gcc/ChangeLog:
>
>   * config/i386/cygming.h (mingw_pe_record_stub):
>   Rename functions in mingw folder which will be reused for
>   aarch64.
>   (TARGET_ASM_FILE_END): Update to new target-independent name.
>   (SUBTARGET_ATTRIBUTE_TABLE): Likewise.
>   (TARGET_VALID_DLLIMPORT_ATTRIBUTE_P): Likewise.
>   (SUB_TARGET_RECORD_STUB): Likewise.
>   * config/i386/i386-protos.h (ix86_handle_selectany_attribute): Likewise.
>   (mingw_handle_selectany_attribute): Likewise.
>   (i386_pe_valid_dllimport_attribute_p): Likewise.
>   (mingw_pe_valid_dllimport_attribute_p): Likewise.
>   (i386_pe_file_end): Likewise.
>   (mingw_pe_file_end): Likewise.
>   (i386_pe_record_stub): Likewise.
>   (mingw_pe_record_stub): Likewise.
>   * config/mingw/winnt.cc (ix86_handle_selectany_attribute): Likewise.
>   (mingw_handle_selectany_attribute): Likewise.
>   (i386_pe_valid_dllimport_attribute_p): Likewise.
>   (mingw_pe_valid_dllimport_attribute_p): Likewise.
>   (i386_pe_record_stub): Likewise.
>   (mingw_pe_record_stub): Likewise.
>   (i386_pe_file_end): Likewise.
>   (mingw_pe_file_end): Likewise.
>   * config/mingw/winnt.h (mingw_handle_selectany_attribute):
>   Declate functionality that will be reused by multiple targets.
>   (mingw_pe_file_end): Likewise.
>   (mingw_pe_record_stub): Likewise.
>   (mingw_pe_valid_dllimport_attribute_p): Likewise.

Ok, but...

> [...]
> diff --git a/gcc/config/mingw/winnt.cc b/gcc/config/mingw/winnt.cc
> index 9901576ade0..a0b5950be2e 100644
> --- a/gcc/config/mingw/winnt.cc
> +++ b/gcc/config/mingw/winnt.cc
> @@ -71,7 +71,7 @@ ix86_handle_shared_attribute (tree *node, tree name, tree, 
> int,
>  /* Handle a "selectany" attribute;
> arguments as in struct attribute_spec.handler.  */
>  tree
> -ix86_handle_selectany_attribute (tree *node, tree name, tree, int,
> +mingw_handle_selectany_attribute (tree *node, tree name, tree, int,
>bool *no_add_attrs)

please reindent the parameters for the new name length.

Thanks,
Richard


Re: [PATCH 4/4] Testsuite updates

2024-05-22 Thread Richard Sandiford
Richard Biener  writes:
> On Tue, 21 May 2024, Richard Biener wrote:
>
>> The gcc.dg/vect/slp-12a.c case is interesting as we currently split
>> the 8 store group into lanes 0-5 which we SLP with an unroll factor
>> of two (on x86-64 with SSE) and the remaining two lanes are using
>> interleaving vectorization with a final unroll factor of four.  Thus
>> we're using hybrid SLP within a single store group.  After the change
>> we discover the same 0-5 lane SLP part as well as two single-lane
>> parts feeding the full store group.  But that results in a load
>> permutation that isn't supported (I have WIP patchs to rectify that).
>> So we end up cancelling SLP and vectorizing the whole loop with
>> interleaving which is IMO good and results in better code.
>> 
>> This is similar for gcc.target/i386/pr52252-atom.c where interleaving
>> generates much better code than hybrid SLP.  I'm unsure how to update
>> the testcase though.
>> 
>> gcc.dg/vect/slp-21.c runs into similar situations.  Note that when
>> when analyzing SLP operations we discard an instance we currently
>> force the full loop to have no SLP because hybrid detection is
>> broken.  It's probably not worth fixing this at this moment.
>> 
>> For gcc.dg/vect/pr97428.c we are not splitting the 16 store group
>> into two but merge the two 8 lane loads into one before doing the
>> store and thus have only a single SLP instance.  A similar situation
>> happens in gcc.dg/vect/slp-11c.c but the branches feeding the
>> single SLP store only have a single lane.  Likewise for
>> gcc.dg/vect/vect-complex-5.c and gcc.dg/vect/vect-gather-2.c.
>> 
>> gcc.dg/vect/slp-cond-1.c has an additional SLP vectorization
>> with a SLP store group of size two but two single-lane branches.
>> 
>> gcc.target/i386/pr98928.c ICEs in SLP permute optimization
>> because we don't expect a constant and internal branch to be
>> merged with a permute node in
>> vect_optimize_slp_pass::change_vec_perm_layout:4859 (the only
>> permutes merging two SLP nodes are two-operator nodes right now).
>> This still requires fixing.
>> 
>> The whole series has been bootstrapped and tested on 
>> x86_64-unknown-linux-gnu with the gcc.target/i386/pr98928.c FAIL
>> unfixed.
>> 
>> Comments welcome (and hello ARM CI), RISC-V and other arch
>> testing appreciated.  Unless there are comments to the contrary
>> I plan to push patch 1 and 2 tomorrow.
>
> RISC-V CI didn't trigger (not sure what magic is required).  Both
> ARM and AARCH64 show that the "Vectorizing stmts using SLP" are a bit
> fragile because we sometimes cancel SLP becuase we want to use
> load/store-lanes.
>
> I have locally scrapped the SLP scanning for gcc.dg/vect/slp-21.c where
> it doesn't really matter (and if we are finished with all-SLP it will
> matter nowhere).  I've conditionalized the outcome based on
> vect_load_lanes for gcc.dg/vect/slp-11c.c and
> gcc.dg/vect/slp-cond-1.c
>
> On AARCH64 additionally gcc.target/aarch64/sve/mask_struct_store_4.c
> ICEs, I have a fix for that.
>
> gcc.target/aarch64/pr99873_2.c FAILs because with a single
> SLP store group merged from two two-lane load groups we cancel
> the SLP and want to use load/store-lanes.  I'll leave this
> FAILing or shall I XFAIL it?

Yeah, agree it's probably worth leaving it FAILing for now, since it
is something we should try to fix for GCC 15.

Thanks,
Richard

>
> Thanks,
> Richard.
>
>> Thanks,
>> Richard.
>> 
>>  * gcc.dg/vect/pr97428.c: Expect a single store SLP group.
>>  * gcc.dg/vect/slp-11c.c: Likewise.
>>  * gcc.dg/vect/vect-complex-5.c: Likewise.
>>  * gcc.dg/vect/slp-12a.c: Do not expect SLP.
>>  * gcc.dg/vect/slp-21.c: Likewise.
>>  * gcc.dg/vect/slp-cond-1.c: Expect one more SLP.
>>  * gcc.dg/vect/vect-gather-2.c: Expect SLP to be used.
>>  * gcc.target/i386/pr52252-atom.c: XFAIL test for palignr.
>> ---
>>  gcc/testsuite/gcc.dg/vect/pr97428.c  |  2 +-
>>  gcc/testsuite/gcc.dg/vect/slp-11c.c  |  5 +++--
>>  gcc/testsuite/gcc.dg/vect/slp-12a.c  |  6 +-
>>  gcc/testsuite/gcc.dg/vect/slp-21.c   | 19 +--
>>  gcc/testsuite/gcc.dg/vect/slp-cond-1.c   |  2 +-
>>  gcc/testsuite/gcc.dg/vect/vect-complex-5.c   |  2 +-
>>  gcc/testsuite/gcc.dg/vect/vect-gather-2.c|  1 -
>>  gcc/testsuite/gcc.target/i386/pr52252-atom.c |  3 ++-
>>  8 files changed, 18 insertions(+), 22 deletions(-)
>> 
>> diff --git a/gcc/testsuite/gcc.dg/vect/pr97428.c 
>> b/gcc/testsuite/gcc.dg/vect/pr97428.c
>> index 60dd984cfd3..3cc9976c00c 100644
>> --- a/gcc/testsuite/gcc.dg/vect/pr97428.c
>> +++ b/gcc/testsuite/gcc.dg/vect/pr97428.c
>> @@ -44,5 +44,5 @@ void foo_i2(dcmlx4_t dst[], const dcmlx_t src[], int n)
>>  /* { dg-final { scan-tree-dump "Detected interleaving store of size 16" 
>> "vect" } } */
>>  /* We're not able to peel & apply re-aligning to make accesses well-aligned 
>> for !vect_hw_misalign,
>> but we could by peeling the stores for alignment and applying 
>> re-aligning loads.  

Re: [PATCH 3/4]AArch64: add new alternative with early clobber to patterns

2024-05-22 Thread Richard Sandiford
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Sandiford 
>> Sent: Wednesday, May 22, 2024 10:48 AM
>> To: Tamar Christina 
>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> ; Marcus Shawcroft
>> ; ktkac...@gcc.gnu.org
>> Subject: Re: [PATCH 3/4]AArch64: add new alternative with early clobber to
>> patterns
>> 
>> Tamar Christina  writes:
>> > Hi All,
>> >
>> > This patch adds new alternatives to the patterns which are affected.  The 
>> > new
>> > alternatives with the conditional early clobbers are added before the 
>> > normal
>> > ones in order for LRA to prefer them in the event that we have enough free
>> > registers to accommodate them.
>> >
>> > In case register pressure is too high the normal alternatives will be 
>> > preferred
>> > before a reload is considered as we rather have the tie than a spill.
>> >
>> > Tests are in the next patch.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>> >
>> > Ok for master?
>> >
>> > Thanks,
>> > Tamar
>> >
>> > gcc/ChangeLog:
>> >
>> >* config/aarch64/aarch64-sve.md (and3,
>> >@aarch64_pred__z, *3_cc,
>> >*3_ptest, aarch64_pred__z,
>> >*3_cc, *3_ptest,
>> >aarch64_pred__z, *3_cc,
>> >*3_ptest, @aarch64_pred_cmp,
>> >*cmp_cc, *cmp_ptest,
>> >@aarch64_pred_cmp_wide,
>> >*aarch64_pred_cmp_wide_cc,
>> >*aarch64_pred_cmp_wide_ptest,
>> @aarch64_brk,
>> >*aarch64_brk_cc, *aarch64_brk_ptest,
>> >@aarch64_brk, *aarch64_brkn_cc, *aarch64_brkn_ptest,
>> >*aarch64_brk_cc, *aarch64_brk_ptest,
>> >aarch64_rdffr_z, *aarch64_rdffr_z_ptest, *aarch64_rdffr_ptest,
>> >*aarch64_rdffr_z_cc, *aarch64_rdffr_cc): Add new early clobber
>> >alternative.
>> >* config/aarch64/aarch64-sve2.md
>> >(@aarch64_pred_): Likewise.
>> >
>> > ---
>> > diff --git a/gcc/config/aarch64/aarch64-sve.md 
>> > b/gcc/config/aarch64/aarch64-
>> sve.md
>> > index
>> e3085c0c636f1317409bbf3b5fbaf5342a2df1f6..8fdc1bc3cd43acfcd675a18350c
>> 297428c85fe46 100644
>> > --- a/gcc/config/aarch64/aarch64-sve.md
>> > +++ b/gcc/config/aarch64/aarch64-sve.md
>> > @@ -1161,8 +1161,10 @@ (define_insn "aarch64_rdffr_z"
>> >  (reg:VNx16BI FFRT_REGNUM)
>> >  (match_operand:VNx16BI 1 "register_operand")))]
>> >"TARGET_SVE && TARGET_NON_STREAMING"
>> > -  {@ [ cons: =0, 1   ]
>> > - [ Upa , Upa ] rdffr\t%0.b, %1/z
>> > +  {@ [ cons: =0, 1  ; attrs: pred_clobber ]
>> > + [ , Upa; yes ] rdffr\t%0.b, %1/z
>> > + [ ?Upa, Upa; yes ] ^
>> > + [ Upa , Upa; *   ] ^
>> >}
>> >  )
>> 
>> Sorry for not explaining it very well, but in the previous review I 
>> suggested:
>> 
>> > The gather-like approach would be something like:
>> >
>> >  [  , Upl , w , ; yes ]
>> cmp\t%0., %1/z, %3., #%4
>> >  [ ?Upl , 0   , w , ; yes ] ^
>> >  [ Upa  , Upl , w , ; no  ] ^
>> >  [  , Upl , w , w; yes ] 
>> > cmp\t%0., %1/z,
>> %3., %4.
>> >  [ ?Upl , 0   , w , w; yes ] ^
>> >  [ Upa  , Upl , w , w; no  ] ^
>> >
>> > with:
>> >
>> >   (define_attr "pred_clobber" "any,no,yes" (const_string "any"))
>> 
>> (with emphasis on the last line).  What I didn't say explicitly is
>> that "no" should require !TARGET_SVE_PRED_CLOBBER.
>> 
>> The premise of that review was that we shouldn't enable things like:
>> 
>>  [ Upa  , Upl , w , w; no  ] ^
>> 
>> for TARGET_SVE_PRED_CLOBBER since it contradicts the earlyclobber
>> alternative.  So we should enable either the pred_clobber=yes
>> alternatives or the pred_clobber=no alternatives, but not both.
>> 
>> The default "any" is then for other non-predicate instructions that
>> don't care about TARGET_SVE_PRED_CLOBBER either way.
>> 
>> In contrast, this patch makes pred_clobber=yes enable the alternatives
>
