Re: [PATCH]AArch64: Take into account when VF is higher than known scalar iters

2024-09-20 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> Consider low overhead loops like:
>
> void
> foo (char *restrict a, int *restrict b, int *restrict c, int n)
> {
>   for (int i = 0; i < 9; i++)
> {
>   int res = c[i];
>   int t = b[i];
>   if (a[i] != 0)
> res = t;
>   c[i] = res;
> }
> }
>
> For such loops we use latency-only costing, since the loop bound is known and
> small.
>
> The current costing however does not consider the case where niters < VF.
>
> So when comparing the scalar vs vector costs it doesn't keep in mind that the
> scalar code can't perform VF iterations.  This makes it overestimate the cost
> for the scalar loop and we incorrectly vectorize.
>
> This patch takes the minimum of the VF and niters in such cases.
> Before the patch we generate:
>
>  note:  Original vector body cost = 46
>  note:  Vector loop iterates at most 1 times
>  note:  Scalar issue estimate:
>  note:    load operations = 2
>  note:    store operations = 1
>  note:    general operations = 1
>  note:    reduction latency = 0
>  note:    estimated min cycles per iteration = 1.00
>  note:    estimated cycles per vector iteration (for VF 32) = 32.00
>  note:  SVE issue estimate:
>  note:    load operations = 5
>  note:    store operations = 4
>  note:    general operations = 11
>  note:    predicate operations = 12
>  note:    reduction latency = 0
>  note:    estimated min cycles per iteration without predication = 5.50
>  note:    estimated min cycles per iteration for predication = 12.00
>  note:    estimated min cycles per iteration = 12.00
>  note:  Low iteration count, so using pure latency costs
>  note:  Cost model analysis:
>
> vs after:
>
>  note:  Original vector body cost = 46
>  note:  Known loop bounds, capping VF to 9 for analysis
>  note:  Vector loop iterates at most 1 times
>  note:  Scalar issue estimate:
>  note:    load operations = 2
>  note:    store operations = 1
>  note:    general operations = 1
>  note:    reduction latency = 0
>  note:    estimated min cycles per iteration = 1.00
>  note:    estimated cycles per vector iteration (for VF 9) = 9.00
>  note:  SVE issue estimate:
>  note:    load operations = 5
>  note:    store operations = 4
>  note:    general operations = 11
>  note:    predicate operations = 12
>  note:    reduction latency = 0
>  note:    estimated min cycles per iteration without predication = 5.50
>  note:    estimated min cycles per iteration for predication = 12.00
>  note:    estimated min cycles per iteration = 12.00
>  note:  Increasing body cost to 1472 because the scalar code could issue 
> within the limit imposed by predicate operations
>  note:  Low iteration count, so using pure latency costs
>  note:  Cost model analysis:
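The "capping VF to 9" note in the dump above corresponds to something like the
following in aarch64's adjust_body_cost (a minimal sketch only: the
estimated_vf variable name is an assumption, while the vectoriser macros are
real):

  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
    {
      unsigned HOST_WIDE_INT niters = LOOP_VINFO_INT_NITERS (loop_vinfo);
      if (niters < estimated_vf)
        {
          /* The scalar loop can never run ESTIMATED_VF iterations, so
             compare against NITERS instead.  */
          estimated_vf = niters;
          if (dump_enabled_p ())
            dump_printf_loc (MSG_NOTE, vect_location,
                             "Known loop bounds, capping VF to %d for"
                             " analysis\n", (int) estimated_vf);
        }
    }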
>
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues; also fixes the
> upcoming vectorization regression on exchange2 in SPEC CPU 2017.

Sorry, I'd forgotten about the earlier internal conversation.
I went back over that and:

> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (adjust_body_cost):
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/sve/asrdiv_4.c: Update bounds.
>   * gcc.target/aarch64/sve/cond_asrd_2.c: Likewise.
>   * gcc.target/aarch64/sve/cond_uxt_6.c: Likewise.
>   * gcc.target/aarch64/sve/cond_uxt_7.c: Likewise.
>   * gcc.target/aarch64/sve/cond_uxt_8.c: Likewise.
>   * gcc.target/aarch64/sve/miniloop_1.c: Likewise.
>   * gcc.target/aarch64/sve/spill_6.c: Likewise.
>   * gcc.target/aarch64/sve/sve_iters_low_1.c: New test.
>   * gcc.target/aarch64/sve/sve_iters_low_2.c: New test.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> 6ccf08d1cc0a1aecfc72f95b105ace2c00b1a51d..afb58fd88795a26064c8c74f337324e3ecebc389
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -17565,6 +17565,20 @@ adjust_body_cost (loop_vec_info loop_vinfo,
>  dump_printf_loc (MSG_NOTE, vect_location,
>"Original vector body cost = %d\n", body_cost);
>  
> +  /* If the iteration count is known and low we use latency only calculation,
> + however if the iteration count is lower than VF then the estimate for the
> + scalar loops will be too high.  Cap it at NITERS.  */

...I don't think this is related to latency costs per se (which was
what was confusing me).  It's instead that we want the port-specific:

  /* If the scalar version of the loop could issue at least as
 quickly as the predicate parts of the SVE loop, make the SVE loop
 prohibitively expensive.  In this case vectorization is adding an
 overhead that the original scalar code didn't have.

 This is mostly intended to detect cases in which WHILELOs dominate
 for very tight loops, which is something that normal latency-based
 costs would not model.  Adding this kind of cliffedge would be
 to

Re: [PATCH]AArch64: Take into account when VF is higher than known scalar iters

2024-09-20 Thread Richard Sandiford
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Sandiford 
>> Sent: Friday, September 20, 2024 3:48 PM
>> To: Tamar Christina 
>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> ; Marcus Shawcroft
>> ; ktkac...@gcc.gnu.org
>> Subject: Re: [PATCH]AArch64: Take into account when VF is higher than known
>> scalar iters
>> 
>> Tamar Christina  writes:
>> >>
>> >> So my gut instinct is that we should instead tweak the condition for
>> >> using latency costs, but I'll need to think about it more when I get
>> >> back from holiday.
>> >>
>> >
>> > I think that's a separate problem.  From first principles it should already
>> > be very wrong to compare the scalar loop to an iteration count it will
>> > *NEVER* reach.  So I don't understand why that would ever be valid.
>> 
>> But I don't think we're doing that, or at least, not as the final result.
>> Instead, we first calculate the minimum number of vector iterations for
>> which the vector loop is sometimes profitable.  If this is N, then we're
>> saying that the vector code is better than the scalar code for N*VF
>> iterations.  Like you say, this part ignores whether N*VF is actually
>> achievable.  But then:
>> 
>>/* Now that we know the minimum number of vector iterations,
>>   find the minimum niters for which the scalar cost is larger:
>> 
>>   SIC * niters > VIC * vniters + VOC - SOC
>> 
>>   We know that the minimum niters is no more than
>>   vniters * VF + NPEEL, but it might be (and often is) less
>>   than that if a partial vector iteration is cheaper than the
>>   equivalent scalar code.  */
>>int threshold = (vec_inside_cost * min_vec_niters
>> + vec_outside_cost
>> - scalar_outside_cost);
>>if (threshold <= 0)
>>  min_profitable_iters = 1;
>>else
>>  min_profitable_iters = threshold / scalar_single_iter_cost + 1;
>> 
>> calculates which number of iterations in the range [(N-1)*VF + 1, N*VF]
>> is the first to be profitable.  This is specifically taking partial
>> iterations into account and includes the N==1 case.  The lower niters is,
>> the easier it is for the scalar code to win.
>> 
>> This is what is printed as:
>> 
>>   Calculated minimum iters for profitability: 7
>> 
>> So we think that vectorisation should be rejected if the loop count
>> is <= 6, but accepted if it's >= 7.
>
> This 7 is the vector iteration count.
>
>   epilogue iterations: 0
>   Minimum number of vector iterations: 1
>   Calculated minimum iters for profitability: 7
> /app/example.c:4:21: note:  Runtime profitability threshold = 7
> /app/example.c:4:21: note:  Static estimate profitability threshold = 7
>
> Which says the vector code has to iterate at least 7 iteration for it to be 
> profitable.

It doesn't though:

>   Minimum number of vector iterations: 1

This is in vector iterations but:

>   Calculated minimum iters for profitability: 7

This is in scalar iterations.  (Yes, it would be nice if the dump line
was more explicit. :))

This is why, if we change the loop count to 7 rather than 9:

  for (int i = 0; i < 7; i++)

we still get:

/tmp/foo.c:4:21: note:  Cost model analysis:
  Vector inside of loop cost: 20
  Vector prologue cost: 6
  Vector epilogue cost: 0
  Scalar iteration cost: 4
  Scalar outside cost: 0
  Vector outside cost: 6
  prologue iterations: 0
  epilogue iterations: 0
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 7
/tmp/foo.c:4:21: note:  Runtime profitability threshold = 7
/tmp/foo.c:4:21: note:  Static estimate profitability threshold = 7
/tmp/foo.c:4:21: note:  * Analysis succeeded with vector mode VNx4SI

But if we change it to 6:

  for (int i = 0; i < 6; i++)

we get:

/tmp/foo.c:4:21: note:  Cost model analysis:
  Vector inside of loop cost: 20
  Vector prologue cost: 6
  Vector epilogue cost: 0
  Scalar iteration cost: 4
  Scalar outside cost: 0
  Vector outside cost: 6
  prologue iterations: 0
  epilogue iterations: 0
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 7
/tmp/foo.c:4:21: note:  Runtime profitability threshold = 7
/tmp/foo.c:4:21: note:  Static estimate profitability threshold = 7
/tmp/foo.c:4:21: missed:  not vectorized: vectorization not profitable.
/tmp/foo.c:4:21: note:  not vectorized: iteration count smaller than user 
specified loop bound parameter or minimum profitable iterations (whichever is 
more conservative).
/tmp/foo.c:4:21: missed:  Loop costings not worthwhile.
/tmp/foo.c:4:21: note:  * Analysis failed with vector mode VNx4SI
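For concreteness, plugging the costs from the dumps above into the quoted
formula reproduces that threshold (a minimal check, nothing new):

  int vec_inside_cost = 20;         /* VIC, "Vector inside of loop cost" */
  int vec_outside_cost = 6;         /* VOC, "Vector outside cost" */
  int scalar_outside_cost = 0;      /* SOC, "Scalar outside cost" */
  int scalar_single_iter_cost = 4;  /* SIC, "Scalar iteration cost" */
  int min_vec_niters = 1;           /* "Minimum number of vector iterations" */

  int threshold = (vec_inside_cost * min_vec_niters
                   + vec_outside_cost
                   - scalar_outside_cost);   /* 20 * 1 + 6 - 0 = 26 */
  int min_profitable_iters = threshold / scalar_single_iter_cost + 1;
  /* 26 / 4 + 1 = 7, and the division by SIC is what makes the result a
     count of scalar iterations.  */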

Thanks,
Richard


Re: [PATCH v3] aarch64: Add fp8 scalar types

2024-09-20 Thread Richard Sandiford
Claudio Bantaloukas  writes:
> The ACLE defines a new scalar type, __mfp8. This is an opaque 8-bit type that
> can only be used by fp8 intrinsics. Additionally, the mfloat8_t type is made
> available in arm_neon.h and arm_sve.h as an alias of the same.
>
> This implementation uses an unsigned INTEGER_TYPE, with precision 8 to
> represent __mfp8. Conversions to int and other types are disabled via the
> TARGET_INVALID_CONVERSION hook.
> Additionally, operations that are typically available to integer types are
> disabled via TARGET_INVALID_UNARY_OP and TARGET_INVALID_BINARY_OP hooks.
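For illustration, the restrictions described above mean that code like the
following behaves as sketched in the comments (assuming the fp8 support from
this patch and an AArch64 target):

  #include <arm_neon.h>

  mfloat8_t copy (mfloat8_t x) { return x; }                /* plain copies are fine */
  const mfloat8_t *addr (const mfloat8_t *p) { return p; }  /* & and pointers are allowed */
  /* int convert (mfloat8_t x) { return (int) x; }             rejected via TARGET_INVALID_CONVERSION */
  /* mfloat8_t add (mfloat8_t a, mfloat8_t b) { return a + b; }  rejected via TARGET_INVALID_BINARY_OP */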
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-builtins.cc (aarch64_mfp8_type_node): Add node
>   for __mfp8 type.
>   (aarch64_mfp8_ptr_type_node): Add node for __mfp8 pointer type.
>   (aarch64_init_fp8_types): New function to initialise fp8 types and
>   register with language backends.
>   * config/aarch64/aarch64.cc (aarch64_mangle_type): Add ABI mangling for
>   new type.
>   (aarch64_invalid_conversion): Add function implementing
>   TARGET_INVALID_CONVERSION hook that blocks conversion to and from the
>   __mfp8 type.
>   (aarch64_invalid_unary_op): Add function implementing the
>   TARGET_INVALID_UNARY_OP hook that blocks operations on __mfp8 other than &.
>   (aarch64_invalid_binary_op): Extend the TARGET_INVALID_BINARY_OP hook to
>   disallow operations on the __mfp8 type.
>   (TARGET_INVALID_CONVERSION): Add define.
>   (TARGET_INVALID_UNARY_OP): Likewise.
>   * config/aarch64/aarch64.h (aarch64_mfp8_type_node): Add node for __mfp8
>   type.
>   (aarch64_mfp8_ptr_type_node): Add node for __mfp8 pointer type.
>   * config/aarch64/arm_private_fp8.h (mfloat8_t): Add typedef.
>
> gcc/testsuite/ChangeLog:
>
>   * g++.target/aarch64/fp8_mangling.C: New tests exercising mangling.
>   * g++.target/aarch64/fp8_scalar_typecheck_2.C: New tests in C++.
>   * gcc.target/aarch64/fp8_scalar_1.c: New tests in C.
>   * gcc.target/aarch64/fp8_scalar_typecheck_1.c: Likewise.
> ---
> Hi, 
> Is this ok for master? I do not have commit rights yet, if ok, can someone 
> commit it on my behalf?

LGTM, thanks.  Give it a day or so to see if Kyrill has any comments,
but otherwise ok for trunk.  (Please ask someone else to push it though,
since I won't be around next week.)

Richard


Re: [PATCH]AArch64: Take into account when VF is higher than known scalar iters

2024-09-20 Thread Richard Sandiford
Tamar Christina  writes:
>> 
>> So my gut instinct is that we should instead tweak the condition for
>> using latency costs, but I'll need to think about it more when I get
>> back from holiday.
>> 
>
> I think that's a separate problem.  From first principles it should already
> be very wrong to compare the scalar loop to an iteration count it will
> *NEVER* reach.  So I don't understand why that would ever be valid.

But I don't think we're doing that, or at least, not as the final result.
Instead, we first calculate the minimum number of vector iterations for
which the vector loop is sometimes profitable.  If this is N, then we're
saying that the vector code is better than the scalar code for N*VF
iterations.  Like you say, this part ignores whether N*VF is actually
achievable.  But then:

  /* Now that we know the minimum number of vector iterations,
 find the minimum niters for which the scalar cost is larger:

 SIC * niters > VIC * vniters + VOC - SOC

 We know that the minimum niters is no more than
 vniters * VF + NPEEL, but it might be (and often is) less
 than that if a partial vector iteration is cheaper than the
 equivalent scalar code.  */
  int threshold = (vec_inside_cost * min_vec_niters
   + vec_outside_cost
   - scalar_outside_cost);
  if (threshold <= 0)
min_profitable_iters = 1;
  else
min_profitable_iters = threshold / scalar_single_iter_cost + 1;

calculates which number of iterations in the range [(N-1)*VF + 1, N*VF]
is the first to be profitable.  This is specifically taking partial
iterations into account and includes the N==1 case.  The lower niters is,
the easier it is for the scalar code to win.

This is what is printed as:

  Calculated minimum iters for profitability: 7

So we think that vectorisation should be rejected if the loop count
is <= 6, but accepted if it's >= 7.

So I think the costing framework is set up to handle niters

Re: [PATCH 3/4][rtl]: simplify boolean vector EQ and NE comparisons

2024-09-20 Thread Richard Sandiford
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Sandiford 
>> Sent: Friday, September 20, 2024 2:10 PM
>> To: Tamar Christina 
>> Cc: gcc-patches@gcc.gnu.org; nd 
>> Subject: Re: [PATCH 3/4][rtl]: simplify boolean vector EQ and NE comparisons
>> 
>> Tamar Christina  writes:
>> >> For variable-sized vectors, I suppose the question is whether the
>> >> first unequal element is found in the minimum vector length, or whether
>> >> it only occurs for larger lengths.  In the former case we can fold at
>> >> compile time, but in the latter case we can't.
>> >>
>> >> So we probably do want the loop for variable-length vectors, up to
>> >> constant_lower_bound (CONST_VECTOR_NUNITS (...)).
>> >>
>> >
>> > Doesn't operand_equal already do this?  It looks like the VLA handling
>> > in same_vector_encodings_p rejects vectors that are not the same size,
>> > which should be good enough for this, since I'm after strict equality, no?
>> 
>> But what I meant is that for VLA vectors, compile-time equality is
>> a tristate value: yes, no, or maybe.
>> 
>> E.g.:
>> 
>>{ 0, 0, 0, 0, 0, 0, 0, 0, ... }
>> 
>> is not equal to
>> 
>>{ 0, 0, 1, 1, 1, 1, 1, 1, ... }
>> 
>> if the runtime VL gives more than 2 elements, but they are equal if
>> the runtime VL gives 2 elements.  In this case, we can't fold EQ to
>> false at compile time if the minimum length is 2 elements, but we can
>> if the minimum length is 4 elements.
>> 
>> Similarly:
>> 
>>{ 0, 0, 1, 1, 1, 1, 1, 1, ... }
>> 
>> is only conditionally not equal to:
>> 
>>{ 0, 0, 1, 1, 2, 2, 3, 3, ... }
>> 
>> It isn't the case that every encoded value has to be present in every
>> runtime vector.  E.g. the series { 0, 1, 2, ... } exists for VNx2DI
>> (for INDEX Z0.D, #0, #1), even though there is never a "2" element for
>> the minimum vector length.
>
> Ah ok... so if I understand correctly, VLA series aren't capped by the VL
> (e.g. representable values) in RTL but represent the base + step only.
> So the series for a VNx2DI and a VNx4SI are the same, but what the
> usable bits are is determined by the mode/VL?

Right.  For VLA vectors, the vector constant encoding represents an
infinite series and the runtime VL decides how many elements to take
from the series.  It's a little more complex than just base + step,
since we allow a leading fixed-length sequence of "arbitrary" values
followed by a series of "regular" values.

But yeah, the "regular" part of the series consists of interleaved linear
series.  And some runtime VL might only take the leading "arbitrary"
elements, without digging in to the "regular" part.  Or they might
take the base elements without getting as far as using the steps.
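For example, the { 0, 1, 2, ... } series mentioned above is encoded as a
single pattern with three elements per pattern (a sketch using the generic
vector-builder interface, in an SVE-enabled aarch64 context):

  /* Build the VNx2DI constant { 0, 1, 2, ... }, i.e. the INDEX Z0.D, #0, #1
     series: base0, base1 and a third element that fixes the step.  */
  rtx_vector_builder builder (VNx2DImode, 1, 3);
  builder.quick_push (const0_rtx);
  builder.quick_push (const1_rtx);
  builder.quick_push (const2_rtx);
  rtx series = builder.build ();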

Thanks,
Richard

>
> That's really not how I thought they were represented but get why you
> want a loop now...
>
> Tamar.


Re: [PATCH]AArch64: Take into account when VF is higher than known scalar iters

2024-09-20 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> Consider low overhead loops like:
>
> void
> foo (char *restrict a, int *restrict b, int *restrict c, int n)
> {
>   for (int i = 0; i < 9; i++)
> {
>   int res = c[i];
>   int t = b[i];
>   if (a[i] != 0)
> res = t;
>   c[i] = res;
> }
> }

Eek.

> For such loops we use latency-only costing, since the loop bound is known and
> small.
>
> The current costing however does not consider the case where niters < VF.
>
> So when comparing the scalar vs vector costs it doesn't keep in mind that the
> scalar code can't perform VF iterations.  This makes it overestimate the cost
> for the scalar loop and we incorrectly vectorize.

I don't think that in itself is the reason though.  niters < VF isn't
particularly exceptional for SVE.  The vector code is then costed as
performing 1 VF-element iteration plus the prologue (i.e. one full
vector iteration), whereas the scalar code is costed as doing niters
scalar iterations plus the prologue.  The lower the niters, the more
we should favour scalar code.

I suspect instead it's a combination of:

- The latency for the vector costs are too optimistic, especially given
  all the predicate overhead in the loop above, and especially for
  default tuning.  For default tuning I see:

  Vector inside of loop cost: 20
  Vector prologue cost: 6
  Vector epilogue cost: 0
  Scalar iteration cost: 4
  Scalar outside cost: 0
  Vector outside cost: 6
  prologue iterations: 0
  epilogue iterations: 0
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 7

  and grep -c '\(z[0-9]\|p[0-9]*\.\)' shows there are indeed 26
  SVE instructions.  But assuming a 1 cycle latency for every vector
  operation seems too idealistic (especially for loads).  It might be
  ok for "generic" (the "architectural intent" option), but not for
  the new architecture-level default tuning targets.

  I realise that isn't the whole story though, since -mcpu=neoverse-v2
  also vectorises.

- Having a threshold for latency costs based on niters is too simplistic,
  since the size of the loop matters too.  I think the intention was
  to capture cases where loops are so small that they are in practice
  always issued with surrounding code, so that the throughput of the
  loop itself is kind of irrelevant.  But this loop is big enough that
  throughput makes sense.

So my gut instinct is that we should instead tweak the condition for
using latency costs, but I'll need to think about it more when I get
back from holiday.

Another thing about the expansion is that we generate 4 .S instructions
for the int elements, even though it should be possible for later gimple
passes to fold the last one away at compile time.  That would lessen
the horror to some extent, but maybe not enough to make it an actual win.

Specifically:

uqdecw  w0, all, mul #3                   // this is always 0
whilelo p7.s, wzr, w0                     // this is always a pfalse
ld1w    z28.s, p7/z, [x1, #3, mul vl]     // this always loads zero
and     p6.b, p15/z, p7.b, p7.b           // this is always a pfalse
st1w    z28.s, p6, [x2, #3, mul vl]       // this is a no-op

One thing I remember considering in the past, but apparently never actually
did, was rejecting any vectorisation in which some vector instructions
are provably redundant (as above).  But perhaps in the end it seemed
that we should just be able to optimise them away later instead.

Thanks,
Richard

> This patch takes the minimum of the VF and niters in such cases.
> Before the patch we generate:
>
>  note:  Original vector body cost = 46
>  note:  Vector loop iterates at most 1 times
>  note:  Scalar issue estimate:
>  note:    load operations = 2
>  note:    store operations = 1
>  note:    general operations = 1
>  note:    reduction latency = 0
>  note:    estimated min cycles per iteration = 1.00
>  note:    estimated cycles per vector iteration (for VF 32) = 32.00
>  note:  SVE issue estimate:
>  note:    load operations = 5
>  note:    store operations = 4
>  note:    general operations = 11
>  note:    predicate operations = 12
>  note:    reduction latency = 0
>  note:    estimated min cycles per iteration without predication = 5.50
>  note:    estimated min cycles per iteration for predication = 12.00
>  note:    estimated min cycles per iteration = 12.00
>  note:  Low iteration count, so using pure latency costs
>  note:  Cost model analysis:
>
> vs after:
>
>  note:  Original vector body cost = 46
>  note:  Known loop bounds, capping VF to 9 for analysis
>  note:  Vector loop iterates at most 1 times
>  note:  Scalar issue estimate:
>  note:    load operations = 2
>  note:    store operations = 1
>  note:    general operations = 1
>  note:    reduction latency = 0
>  note:    estimated min cycles per iteration = 1.00
>  note:    estimated cycles per vector iteration (for VF 9) = 9.00
>  note:  SVE issue estimate:

Re: [PATCH 3/4][rtl]: simplify boolean vector EQ and NE comparisons

2024-09-20 Thread Richard Sandiford
Tamar Christina  writes:
>> For variable-sized vectors, I suppose the question is whether the
>> first unequal element is found in the minimum vector length, or whether
>> it only occurs for larger lengths.  In the former case we can fold at
>> compile time, but in the latter case we can't.
>> 
>> So we probably do want the loop for variable-length vectors, up to
>> constant_lower_bound (CONST_VECTOR_NUNITS (...)).
>> 
>
> Doesn't operand_equal already do this?  It looks like the VLA handling
> in same_vector_encodings_p rejects vectors that are not the same size,
> which should be good enough for this, since I'm after strict equality, no?

But what I meant is that for VLA vectors, compile-time equality is
a tristate value: yes, no, or maybe.

E.g.:

   { 0, 0, 0, 0, 0, 0, 0, 0, ... }

is not equal to

   { 0, 0, 1, 1, 1, 1, 1, 1, ... }

if the runtime VL gives more than 2 elements, but they are equal if
the runtime VL gives 2 elements.  In this case, we can't fold EQ to
false at compile time if the minimum length is 2 elements, but we can
if the minimum length is 4 elements.

Similarly:

   { 0, 0, 1, 1, 1, 1, 1, 1, ... }

is only conditionally not equal to:

   { 0, 0, 1, 1, 2, 2, 3, 3, ... }

It isn't the case that every encoded value has to be present in every
runtime vector.  E.g. the series { 0, 1, 2, ... } exists for VNx2DI
(for INDEX Z0.D, #0, #1), even though there is never a "2" element for
the minimum vector length.
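In code terms, the fold therefore has to limit itself to the elements that
are guaranteed to exist for every runtime VL (a sketch only, using the
op0/op1 names from the patch below):

  /* Look for an inequality within the minimum number of elements.  */
  unsigned HOST_WIDE_INT min_nunits
    = constant_lower_bound (CONST_VECTOR_NUNITS (op0));
  bool known_ne = false;
  for (unsigned HOST_WIDE_INT i = 0; i < min_nunits; ++i)
    if (!rtx_equal_p (CONST_VECTOR_ELT (op0, i), CONST_VECTOR_ELT (op1, i)))
      {
        /* Safe to fold EQ to false and NE to true.  */
        known_ne = true;
        break;
      }
  /* If instead the whole encodings are identical, the vectors are equal for
     every VL; otherwise the answer is only "maybe" and no fold is possible.  */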

Thanks,
Richard

>
>
> Bootstrapped and regtested on aarch64-none-linux-gnu, arm-none-linux-gnueabihf,
> x86_64-pc-linux-gnu -m32, -m64 with no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * simplify-rtx.cc (simplify_context::simplify_unary_operation): Try
>   simplifying operand.
>   (simplify_const_relational_operation): Simplify vector EQ and NE.
>   (test_vector_int_const_compare): New.
>   (test_vector_ops): Use it.
>
>
> -- inline copy of patch --
>
> diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
> index 
> a20a61c5dddbc80b23a9489d925a2c31b2163458..8ba5864efb33ffa5d1ced99f6a7d0c73e12560d5
>  100644
> --- a/gcc/simplify-rtx.cc
> +++ b/gcc/simplify-rtx.cc
> @@ -6354,6 +6354,19 @@ simplify_const_relational_operation (enum rtx_code 
> code,
>   return 0;
>  }
>  
> +  /* Check if the operands are a vector EQ or NE comparison.  */
> +  if (VECTOR_MODE_P (mode)
> +  && INTEGRAL_MODE_P (mode)
> +  && GET_CODE (op0) == CONST_VECTOR
> +  && GET_CODE (op1) == CONST_VECTOR
> +  && (code == EQ || code == NE))
> +{
> +  if (rtx_equal_p (op0, op1))
> + return code == EQ ? const_true_rtx : const0_rtx;
> +  else
> + return code == NE ? const_true_rtx : const0_rtx;
> +}
> +
>/* We can't simplify MODE_CC values since we don't know what the
>   actual comparison is.  */
>if (GET_MODE_CLASS (GET_MODE (op0)) == MODE_CC)
> @@ -8797,6 +8810,34 @@ test_vector_subregs (machine_mode inner_mode)
>test_vector_subregs_stepped (inner_mode);
>  }
>  
> +/* Verify vector constant comparisons for EQ and NE.  */
> +
> +static void
> +test_vector_int_const_compare (machine_mode mode)
> +{
> +  rtx zeros = CONST0_RTX (mode);
> +  rtx minusone = CONSTM1_RTX (mode);
> +  rtx series_0_1 = gen_const_vec_series (mode, const0_rtx, const1_rtx);
> +  ASSERT_RTX_EQ (const0_rtx,
> +  simplify_const_relational_operation (EQ, mode, zeros,
> +   CONST1_RTX (mode)));
> +  ASSERT_RTX_EQ (const_true_rtx,
> +  simplify_const_relational_operation (EQ, mode, zeros,
> +   CONST0_RTX (mode)));
> +  ASSERT_RTX_EQ (const_true_rtx,
> +  simplify_const_relational_operation (EQ, mode, minusone,
> +   CONSTM1_RTX (mode)));
> +  ASSERT_RTX_EQ (const_true_rtx,
> +  simplify_const_relational_operation (NE, mode, zeros,
> +   CONST1_RTX (mode)));
> +  ASSERT_RTX_EQ (const_true_rtx,
> +  simplify_const_relational_operation (NE, mode, zeros,
> +   series_0_1));
> +  ASSERT_RTX_EQ (const0_rtx,
> +  simplify_const_relational_operation (EQ, mode, zeros,
> +   series_0_1));
> +}
> +
>  /* Verify some simplifications involving vectors.  */
>  
>  static void
> @@ -8814,6 +8855,7 @@ test_vector_ops ()
>   {
> test_vector_ops_series (mode, scalar_reg);
> test_vector_subregs (mode);
> +   test_vector_int_const_compare (mode);
>   }
> test_vec_merge (mode);
>   }


Re: [PATCH v2] AArch64: Fix copysign patterns

2024-09-20 Thread Richard Sandiford
Wilco Dijkstra  writes:
> v2: Add more testcase fixes.
>
> The current copysign pattern has a mismatch in the predicates and constraints 
> -
> operand[2] is a register_operand but also has an alternative X which allows 
> any
> operand.  Since it is a floating point operation, having an integer 
> alternative
> makes no sense.  Change the expander to always use the vector variant of 
> copysign
> which results in better code.  Add a SVE bitmask move immediate alternative to
> the aarch64_simd_mov patterns so we emit a single move when SVE is available.
>
> Passes bootstrap and regress, OK for commit?
>
> gcc:
> * config/aarch64/aarch64.md (copysign3): Defer to AdvSIMD 
> copysign.
> (copysign3_insn): Remove pattern.
> * config/aarch64/aarch64-simd.md (aarch64_simd_mov): Add 
> SVE movimm
> alternative.
> (aarch64_simd_mov): Likewise.  Remove redundant V2DI 
> check.
> (copysign3): Make global.
> (ior3): Move Neon immediate alternative before 
> the SVE one.   
>
> testsuite:
> * gcc.target/aarch64/copysign_3.c: New test.
> * gcc.target/aarch64/copysign_4.c: New test.
> * gcc.target/aarch64/fneg-abs_2.c: Allow .2s and .4s.
> * gcc.target/aarch64/sve/fneg-abs_1.c: Fixup test.
> * gcc.target/aarch64/sve/fneg-abs_2.c: Likewise.

This seems to be doing several things at once.  Could you split it up?

E.g. I think the change to the move patterns should be a separate patch,
with its own tests.  Rather than add new alternatives, I think we should
expand the definition of what a "valid" immediate is for TARGET_SIMD vectors
when SVE is enabled, like Pengxuan did in g:a92f54f580c3.

If you still need the removal of "mode == V2DImode" after that change,
could you send that separately too, with its own explanation and test cases?

Could you explain why you flipped the order of the alternatives in:
>
> @@ -648,7 +649,7 @@ (define_insn 
> "aarch64_dot_lane<
>[(set_attr "type" "neon_dot")]
>  )
>  
> -(define_expand "copysign3"
> +(define_expand "@copysign3"
>[(match_operand:VHSDF 0 "register_operand")
> (match_operand:VHSDF 1 "register_operand")
> (match_operand:VHSDF 2 "nonmemory_operand")]
> @@ -1138,10 +1139,8 @@ (define_insn "ior3"
>"TARGET_SIMD"
>{@ [ cons: =0 , 1 , 2; attrs: arch ]
>   [ w, w , w  ; simd  ] orr\t%0., %1., 
> %2.
> - [ w, 0 , vsl; sve   ] orr\t%Z0., %Z0., #%2
> - [ w, 0 , Do ; simd  ] \
> -   << aarch64_output_simd_mov_immediate (operands[2], , \
> -  AARCH64_CHECK_ORR);
> + [ w, 0 , Do ; simd  ] << aarch64_output_simd_mov_immediate 
> (operands[2], , AARCH64_CHECK_ORR);
> + [ w, 0 , vsl; sve   ] orr\t%Z0., %Z0., %2
>}
>[(set_attr "type" "neon_logic")]

?  I'm not opposed, just wasn't sure why it was useful.

(The original formatting was arguably more correct, since it kept within
the 80-character limit.)

>  )
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 
> c54b29cd64b9e0dc6c6d12735049386ccedc5408..e9b148e59abf81cee53cb0dd846af9a62bbad294
>  100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -7218,20 +7218,11 @@ (define_expand "lrint2"
>  }
>  )
>  
> -;; For copysign (x, y), we want to generate:
> +;; For copysignf (x, y), we want to generate:
>  ;;
> -;;   LDR d2, #(1 << 63)
> -;;   BSL v2.8b, [y], [x]
> +;;   moviv31.4s, 0x80, lsl 24
> +;;   bit v0.16b, v1.16b, v31.16b
>  ;;
> -;; or another, equivalent, sequence using one of BSL/BIT/BIF.  Because
> -;; we expect these operations to nearly always operate on
> -;; floating-point values, we do not want the operation to be
> -;; simplified into a bit-field insert operation that operates on the
> -;; integer side, since typically that would involve three inter-bank
> -;; register copies.  As we do not expect copysign to be followed by
> -;; other logical operations on the result, it seems preferable to keep
> -;; this as an unspec operation, rather than exposing the underlying
> -;; logic to the compiler.
>  
>  (define_expand "copysign3"
>[(match_operand:GPF 0 "register_operand")
> @@ -7239,57 +7230,22 @@ (define_expand "copysign3"
> (match_operand:GPF 2 "nonmemory_operand")]
>"TARGET_SIMD"
>  {
> -  rtx signbit_const = GEN_INT (HOST_WIDE_INT_M1U
> -<< (GET_MODE_BITSIZE (mode) - 1));
> -  /* copysign (x, -1) should instead be expanded as orr with the sign
> - bit.  */
> -  rtx op2_elt = unwrap_const_vec_duplicate (operands[2]);
> -  if (GET_CODE (op2_elt) == CONST_DOUBLE
> -  && real_isneg (CONST_DOUBLE_REAL_VALUE (op2_elt)))
> -{
> -  rtx v_bitmask
> - = force_reg (V2mode,
> -  gen_const_vec_duplicate (V2mode,
> -   signbit_const));
> -
> -  emit_insn (gen_iorv23 (
> - lowpart_subreg (V2mod

Re: [PATCH v2 6/9] aarch64: Use symbols without offset to prevent relocation issues

2024-09-19 Thread Richard Sandiford
Richard Sandiford  writes:
> Evgeny Karpov  writes:
>> + {
>> +   rtx const_int = imm;
>> +   const_int = XEXP (const_int, 0);
>> +   XEXP (const_int, 1) = GEN_INT(const_offset % (1 << 20));
>
> CONST_INTs are shared objects, so we can't modify their value in-place.

Gah, sorry, I misread.  The patch was only modifying the PLUS, which should
be valid.  My comment below still stands though.

> It might be easier to pass base and const_offset from the caller
> (aarch64_expand_mov_immediate).  We are then guaranteed that the
> offset is constant and don't need to worry about the SVE case.
> The new SYM+OFF expression can be calculated using plus_constant.
>
> I think it'd be worth asserting that the offset fits in 32 bits,
> since if by some bug the offset is larger, we'd generate silent
> wrong code (in the sense that the compiler would truncate the offset
> before the assembler sees it).
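A minimal sketch of that suggestion (base and const_offset are assumed to be
passed down from aarch64_expand_mov_immediate; the offset split mirrors the
workaround quoted elsewhere in the thread):

  gcc_assert (IN_RANGE (const_offset, -(HOST_WIDE_INT_1 << 31),
                        (HOST_WIDE_INT_1 << 31) - 1));
  HOST_WIDE_INT lo_off = const_offset % (1 << 20);
  HOST_WIDE_INT hi_off = const_offset - lo_off;
  /* SYM + LO_OFF, built with plus_constant rather than by modifying a
     shared rtx in place.  */
  rtx imm_lo = plus_constant (mode, base, lo_off);
  emit_move_insn (tmp_reg, gen_rtx_HIGH (mode, imm_lo));
  /* Let the normal addition patterns handle the remaining offset instead
     of a dedicated add_hioffset insn.  */
  if (hi_off != 0)
    emit_insn (gen_add3_insn (tmp_reg, tmp_reg, gen_int_mode (hi_off, mode)));
  emit_insn (gen_add_losym (dest, tmp_reg, imm_lo));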


Re: [PATCH v2 0/9] SMALL code model fixes, optimization fixes, LTO and minimal C++ enablement

2024-09-19 Thread Richard Sandiford
Evgeny Karpov  writes:
> Hello,
>
> Thank you for reviewing v1!
>
> v2 Changes:
> - Add extra comments and extend patch descriptions.
> - Extract libstdc++ changes to a separate patch.
> - Minor style refactoring based on the reviews.
> - Unify mingw_pe_declare_type for functions and objects.

Thanks for the update.  Aside from the points raised in the discussion
about patches 5, 6, and 9 (and taking into account what you said about
patch 7), the series looks good.

Thanks,
Richard

>
> Regards,
> Evgeny
>
> Evgeny Karpov (9):
>   Support weak references
>   aarch64: Add debugging information
>   aarch64: Add minimal C++ support
>   aarch64: Exclude symbols using GOT from code models
>   aarch64: Multiple adjustments to support the SMALL code model
> correctly
>   aarch64: Use symbols without offset to prevent relocation issues
>   aarch64: Disable the anchors
>   Add LTO support
>   aarch64: Handle alignment when it is bigger than BIGGEST_ALIGNMENT
>
>  gcc/config.gcc                    |  1 +
>  gcc/config/aarch64/aarch64-coff.h | 32 +++---
>  gcc/config/aarch64/aarch64.cc     | 43 ---
>  gcc/config/aarch64/cygming.h      | 69 +--
>  gcc/config/i386/cygming.h         | 16 +++
>  gcc/config/i386/i386-protos.h     |  2 -
>  gcc/config/mingw/winnt-dll.cc     |  4 +-
>  gcc/config/mingw/winnt.cc         | 33 ++-
>  gcc/config/mingw/winnt.h          |  7 ++--
>  libiberty/simple-object-coff.c    |  4 +-
>  10 files changed, 158 insertions(+), 53 deletions(-)


Re: [PATCH v2 9/9] aarch64: Handle alignment when it is bigger than BIGGEST_ALIGNMENT

2024-09-19 Thread Richard Sandiford
Evgeny Karpov  writes:
> In some cases, the alignment can be bigger than BIGGEST_ALIGNMENT.
>
> The issue was detected while building FFmpeg.
> It creates structures, most likely for AVX optimization.
>
> For instance:
> float __attribute__((aligned (32))) large_aligned_array[3];
>
> BIGGEST_ALIGNMENT could be up to 512 bits on x64.
> This patch has been added to cover this case without needing to
> change the FFmpeg code.

What goes wrong if we don't do this?  I'm not sure from the description
whether it's a correctness fix, a performance fix, or whether it's about
avoiding wasted space.

> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-coff.h (ASM_OUTPUT_ALIGNED_LOCAL):
>   Change alignment.
> ---
>  gcc/config/aarch64/aarch64-coff.h | 10 ++
>  1 file changed, 10 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64-coff.h 
> b/gcc/config/aarch64/aarch64-coff.h
> index 17f346fe540..bf8e30b9c08 100644
> --- a/gcc/config/aarch64/aarch64-coff.h
> +++ b/gcc/config/aarch64/aarch64-coff.h
> @@ -58,6 +58,16 @@
>assemble_name ((FILE), (NAME)),\
>fprintf ((FILE), ",%lu\n", (ROUNDED)))
>  
> +#define ASM_OUTPUT_ALIGNED_LOCAL(FILE, NAME, SIZE, ALIGNMENT)  \
> +  { \
> +unsigned HOST_WIDE_INT rounded = MAX ((SIZE), 1); \
> +unsigned HOST_WIDE_INT alignment = MAX ((ALIGNMENT), BIGGEST_ALIGNMENT); 
> \
> +rounded += (alignment / BITS_PER_UNIT) - 1; \
> +rounded = (rounded / (alignment / BITS_PER_UNIT) \
> +  * (alignment / BITS_PER_UNIT)); \

There's a ROUND_UP macro that could be used here.

Thanks,
Richard


> +ASM_OUTPUT_LOCAL (FILE, NAME, SIZE, rounded); \
> +  }
> +
>  #define ASM_OUTPUT_SKIP(STREAM, NBYTES)  \
>fprintf (STREAM, "\t.space\t%d  // skip\n", (int) (NBYTES))
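A minimal sketch of the ROUND_UP version (same behaviour as the macro quoted
above; it assumes the byte alignment is a power of two, which
BIGGEST_ALIGNMENT guarantees):

  #define ASM_OUTPUT_ALIGNED_LOCAL(FILE, NAME, SIZE, ALIGNMENT)          \
    {                                                                    \
      unsigned HOST_WIDE_INT size = MAX ((SIZE), 1);                     \
      unsigned HOST_WIDE_INT align_bytes                                 \
        = MAX ((ALIGNMENT), BIGGEST_ALIGNMENT) / BITS_PER_UNIT;          \
      ASM_OUTPUT_LOCAL (FILE, NAME, SIZE, ROUND_UP (size, align_bytes)); \
    }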


Re: [PATCH v2 5/9] aarch64: Multiple adjustments to support the SMALL code model correctly

2024-09-19 Thread Richard Sandiford
Evgeny Karpov  writes:
> LOCAL_LABEL_PREFIX has been changed to help the assembly
> compiler recognize local labels. Emitting locals has been
> replaced with the .lcomm directive to declare uninitialized
> data without defining an exact section. Functions and objects
> were missing declarations. Binutils was not able to distinguish
> static from external, or an object from a function.
> mingw_pe_declare_object_type has been added to have type
> information for relocation on AArch64, which is not the case
> for ix86.
>
> This fix relies on changes in binutils.
> aarch64: Relocation fixes and LTO
> https://sourceware.org/pipermail/binutils/2024-August/136481.html
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-coff.h (LOCAL_LABEL_PREFIX):
>   Use "." as the local label prefix.
>   (ASM_OUTPUT_ALIGNED_LOCAL): Remove.
>   (ASM_OUTPUT_LOCAL): New.
>   * config/aarch64/cygming.h (ASM_OUTPUT_EXTERNAL_LIBCALL):
>   Update.
>   (ASM_DECLARE_OBJECT_NAME): New.
>   (ASM_DECLARE_FUNCTION_NAME): New.
>   * config/i386/cygming.h (ASM_DECLARE_COLD_FUNCTION_NAME):
>   Update.
>   (ASM_OUTPUT_EXTERNAL_LIBCALL): Update.
>   * config/mingw/winnt.cc (mingw_pe_declare_function_type):
>   Rename into ...
>   (mingw_pe_declare_type): ... this.
>   (i386_pe_start_function): Update.
>   * config/mingw/winnt.h (mingw_pe_declare_function_type):
>   Rename into ...
>   (mingw_pe_declare_type): ... this.
> ---
>  gcc/config/aarch64/aarch64-coff.h | 22 ++
>  gcc/config/aarch64/cygming.h  | 18 +-
>  gcc/config/i386/cygming.h |  8 
>  gcc/config/mingw/winnt.cc | 18 +-
>  gcc/config/mingw/winnt.h  |  3 +--
>  5 files changed, 37 insertions(+), 32 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-coff.h 
> b/gcc/config/aarch64/aarch64-coff.h
> index 81fd9954f75..17f346fe540 100644
> --- a/gcc/config/aarch64/aarch64-coff.h
> +++ b/gcc/config/aarch64/aarch64-coff.h
> @@ -20,9 +20,8 @@
>  #ifndef GCC_AARCH64_COFF_H
>  #define GCC_AARCH64_COFF_H
>  
> -#ifndef LOCAL_LABEL_PREFIX
> -# define LOCAL_LABEL_PREFIX  ""
> -#endif
> +#undef LOCAL_LABEL_PREFIX
> +#define LOCAL_LABEL_PREFIX  "."
>  
>  /* Using long long breaks -ansi and -std=c90, so these will need to be
> made conditional for an LLP64 ABI.  */
> @@ -54,19 +53,10 @@
>  }
>  #endif
>  
> -/* Output a local common block.  /bin/as can't do this, so hack a
> -   `.space' into the bss segment.  Note that this is *bad* practice,
> -   which is guaranteed NOT to work since it doesn't define STATIC
> -   COMMON space but merely STATIC BSS space.  */
> -#ifndef ASM_OUTPUT_ALIGNED_LOCAL
> -# define ASM_OUTPUT_ALIGNED_LOCAL(STREAM, NAME, SIZE, ALIGN) \
> -{
> \
> -  switch_to_section (bss_section);   
> \
> -  ASM_OUTPUT_ALIGN (STREAM, floor_log2 (ALIGN / BITS_PER_UNIT)); \
> -  ASM_OUTPUT_LABEL (STREAM, NAME);   
> \
> -  fprintf (STREAM, "\t.space\t%d\n", (int)(SIZE));   
> \
> -}
> -#endif
> +#define ASM_OUTPUT_LOCAL(FILE, NAME, SIZE, ROUNDED)  \
> +( fputs (".lcomm ", (FILE)), \
> +  assemble_name ((FILE), (NAME)),\
> +  fprintf ((FILE), ",%lu\n", (ROUNDED)))

I'd expect this to be:

  "," HOST_WIDE_INT_PRINT_DEC "\n"

rather than ",%lu\n".  "long" generally shouldn't be used in GCC code,
since it's such an ambiguous type.

LGTM otherwise.

Thanks,
Richard


Re: [PATCH] SVE intrinsics: Fold svmul with all-zero operands to zero vector

2024-09-19 Thread Richard Sandiford
Jennifer Schmitz  writes:
>> On 18 Sep 2024, at 20:33, Richard Sandiford  
>> wrote:
>> 
>> 
>> 
>> Jennifer Schmitz  writes:
>>> From 05e010a4ad5ef8df082b3e03b253aad85e2a270c Mon Sep 17 00:00:00 2001
>>> From: Jennifer Schmitz 
>>> Date: Tue, 17 Sep 2024 00:15:38 -0700
>>> Subject: [PATCH] SVE intrinsics: Fold svmul with all-zero operands to zero
>>> vector
>>> 
>>> As recently implemented for svdiv, this patch folds svmul to a zero
>>> vector if one of the operands is a zero vector. This transformation is
>>> applied if at least one of the following conditions is met:
>>> - the first operand is all zeros or
>>> - the second operand is all zeros, and the predicate is ptrue or the
>>> predication is _x or _z.
>>> 
>>> In contrast to constant folding, which was implemented in a previous
>>> patch, this transformation is applied as soon as one of the operands is
>>> a zero vector, while the other operand can be a variable.
>>> 
>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
>>> regression.
>>> OK for mainline?
>>> 
>>> Signed-off-by: Jennifer Schmitz 
>> 
>> OK, thanks.
>> 
>> If you're planning any more work in this area, I think the next logical
>> step would be to extend the current folds to all predication types,
>> before going on to support other mul/div cases or other operations.
>> 
>> In principle, the mul and div cases correspond to:
>> 
>>  if (integer_zerop (op1) || integer_zerop (op2))
>>return f.fold_active_lanes_to (build_zero_cst (TREE_TYPE (f.lhs)));
>> 
>> It would then be up to fold_active_lanes_to(X) to work out how to apply
>> predication to X.  The general case would be:
>> 
>>  - For x predication and unpredicated operations, fold to X.
>> 
>>  - For m and z, calculate a vector that supplies the values of inactive
>>lanes (the first vector argument for m and a zero vector from z).
>> 
>>- If X is equal to the inactive lanes vector, fold directly to X.
>> 
>>- Otherwise fold to VEC_COND_EXPR 
> Dear Richard,
> I pushed it to trunk with 08aba2dd8c9390b6131cca0aac069f97eeddc9d2.
> Thank you also for the good suggestion, I will do that.  Over the last few days,
> I have been working on a patch that folds multiplication by powers of 2 to
> left-shifts (svlsl), similar to what was done for division.  As I see it, that is
> independent of what you proposed, because it is a change of the function
> type.  Can I submit it for review before starting on the patch you suggested?

Sure!  I agree the power-of-two fold is independent.  I was just worried
about building up technical debt if we added more fold-to-constant cases.
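For reference, the power-of-two fold Jennifer describes amounts to the
following at the source level (a sketch using ACLE intrinsics, for an
SVE-enabled target):

  #include <arm_sve.h>

  /* svmul by a power of two ...  */
  svint32_t before (svbool_t pg, svint32_t x) { return svmul_n_s32_x (pg, x, 4); }
  /* ... becomes an svlsl by its base-2 logarithm.  */
  svint32_t after (svbool_t pg, svint32_t x) { return svlsl_n_s32_x (pg, x, 2); }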

Thanks,
Richard


Re: [PATCH v2 6/9] aarch64: Use symbols without offset to prevent relocation issues

2024-09-19 Thread Richard Sandiford
Evgeny Karpov  writes:
> The current binutils implementation does not support offset up to 4GB in
> IMAGE_REL_ARM64_PAGEBASE_REL21 relocation and is limited to 1MB.
> This is related to differences in ELF and COFF relocation records.
> There are ways to fix this. This work on relocation change will be extracted 
> to
> a separate binutils patch series and discussion.
>
> To unblock the current patch series, the IMAGE_REL_ARM64_PAGEBASE_REL21
> relocation will remain unchanged, and the workaround below will be applied to
> bypass the 1MB offset limitation.
>
> Regards,
> Evgeny
>
>
> The patch will be replaced by this change.

Seems like a reasonable workarond to me FWIW, but some comments on the
implementation below:

>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 03362a975c0..5f17936df1f 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -2896,7 +2896,30 @@ aarch64_load_symref_appropriately (rtx dest, rtx imm,
> if (can_create_pseudo_p ())
>   tmp_reg = gen_reg_rtx (mode);
>
> -   emit_move_insn (tmp_reg, gen_rtx_HIGH (mode, copy_rtx (imm)));
> +   do
> + {
> +   if (TARGET_PECOFF)
> + {
> +   poly_int64 offset;
> +   HOST_WIDE_INT const_offset;
> +   strip_offset (imm, &offset);
> +
> +   if (offset.is_constant (&const_offset)
> +   && abs(const_offset) >= 1 << 20)

abs_hwi (const_offset) (since const_offset has HOST_WIDE_INT type).

> + {
> +   rtx const_int = imm;
> +   const_int = XEXP (const_int, 0);
> +   XEXP (const_int, 1) = GEN_INT(const_offset % (1 << 20));

CONST_INTs are shared objects, so we can't modify their value in-place.

It might be easier to pass base and const_offset from the caller
(aarch64_expand_mov_immediate).  We are then guaranteed that the
offset is constant and don't need to worry about the SVE case.
The new SYM+OFF expression can be calculated using plus_constant.

I think it'd be worth asserting that the offset fits in 32 bits,
since if by some bug the offset is larger, we'd generate silent
wrong code (in the sense that the compiler would truncate the offset
before the assembler sees it).

> +
> +   emit_move_insn (tmp_reg, gen_rtx_HIGH (mode, 
> copy_rtx(imm)));
> +   emit_insn (gen_add_hioffset (tmp_reg, 
> GEN_INT(const_offset)));

I think the normal addition patterns can handle this, if we pass the
result of the ~0xf calculation.  There should be no need for a
dedicated pattern.

> +   break;
> + }
> + }
> +
> + emit_move_insn (tmp_reg, gen_rtx_HIGH (mode, copy_rtx (imm)));
> + } while(0);

I think it'd be clearer to duplicate the gen_add_losym and avoid the
do...while(0)

Thanks,
Richard

> +
> emit_insn (gen_add_losym (dest, tmp_reg, imm));
> return;
>}
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 665a333903c..072110f93e7 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -7405,6 +7405,13 @@
>DONE;
>  })
>
> +(define_insn "add_hioffset"
> +  [(match_operand 0 "register_operand")
> +   (match_operand 1 "const_int_operand")]
> +  ""
> +  "add %0, %0, (%1 & ~0xf) >> 12, lsl #12"
> +)
> +
>  (define_insn "add_losym_"
>[(set (match_operand:P 0 "register_operand" "=r")
> (lo_sum:P (match_operand:P 1 "register_operand" "r")


Re: [PATCH v2 6/9] aarch64: Use symbols without offset to prevent relocation issues

2024-09-19 Thread Richard Sandiford
Martin Storsjö  writes:
> On Thu, 12 Sep 2024, Evgeny Karpov wrote:
>
>> The current binutils implementation does not support offsets up to 4GB in
>> IMAGE_REL_ARM64_PAGEBASE_REL21 relocation and is limited to 1MB.
>> This is related to differences in ELF and COFF relocation records.
>
> Yes, I agree.
>
> But I would not consider this a limitation of the binutils implementation, 
> this is a limitation of the object file format. It can't be worked around 
> by inventing your own custom relocations, but should instead worked around 
> on the code generation side, to avoid needing such large offsets.
>
> This approach is one such, quite valid. Another one is to generate extra 
> symbols to allow addressing anything with a smaller offset.

Maybe this is my ELF bias showing, but: generating extra X=Y+OFF
symbols isn't generally valid for ELF when Y is a global symbol, since
interposition rules, comdat, weak symbols, and various other reasons,
could mean that the local definition of Y isn't the one that gets used.
Does COFF cope with that in some other way?  If not, I would have
expected that there would need to be a fallback path that didn't
involve defining extra symbols.

Thanks,
Richard


Re: [PATCH] aarch64: Improve scalar mode popcount expansion by using SVE [PR113860]

2024-09-19 Thread Richard Sandiford
Pengxuan Zheng  writes:
> This is similar to the recent improvements to the Advanced SIMD popcount
> expansion by using SVE. We can utilize SVE to generate more efficient code for
> scalar mode popcount too.
>
>   PR target/113860
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-simd.md (popcount2): Update pattern to
>   also support V1DI mode.
>   * config/aarch64/aarch64.md (popcount2): Add TARGET_SVE support.
>   * config/aarch64/iterators.md (VDQHSD_V1DI): New mode iterator.
>   (SVE_VDQ_I): Add V1DI.
>   (bitsize): Likewise.
>   (VPRED): Likewise.
>   (VEC_POP_MODE): New mode attribute.
>   (vec_pop_mode): Likewise.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/popcnt11.c: New test.

Sorry for the slow review of this.  The main reason for putting it off
was the use of V1DI, which always makes me nervous.

In particular:

> @@ -2284,7 +2286,7 @@ (define_mode_attr VPRED [(VNx16QI "VNx16BI") (VNx8QI 
> "VNx8BI")
>(VNx8DI "VNx2BI") (VNx8DF "VNx2BI")
>(V8QI "VNx8BI") (V16QI "VNx16BI")
>(V4HI "VNx4BI") (V8HI "VNx8BI") (V2SI "VNx2BI")
> -  (V4SI "VNx4BI") (V2DI "VNx2BI")])
> +  (V4SI "VNx4BI") (V2DI "VNx2BI") (V1DI "VNx2BI")])
>  

it seems odd to have a predicate mode that contains more elements than
the associated single-vector data mode.

The patch also extends the non-SVE SIMD popcount pattern for V1DI,
but it doesn't look like that path works.  E.g. try the following
with -march=armv8-a -fgimple -O2:

__Uint64x1_t __GIMPLE
foo (__Uint64x1_t x)
{
  __Uint64x1_t z;

  z = .POPCOUNT (x);
  return z;
}

Thanks,
Richard


>  ;; ...and again in lower case.
>  (define_mode_attr vpred [(VNx16QI "vnx16bi") (VNx8QI "vnx8bi")
> @@ -2318,6 +2320,14 @@ (define_mode_attr VDOUBLE [(VNx16QI "VNx32QI")
>  (VNx4SI "VNx8SI") (VNx4SF "VNx8SF")
>  (VNx2DI "VNx4DI") (VNx2DF "VNx4DF")])
>  
> +;; The Advanced SIMD modes of popcount corresponding to scalar modes.
> +(define_mode_attr VEC_POP_MODE [(QI "V8QI") (HI "V4HI")
> + (SI "V2SI") (DI "V1DI")])
> +
> +;; ...and again in lower case.
> +(define_mode_attr vec_pop_mode [(QI "v8qi") (HI "v4hi")
> + (SI "v2si") (DI "v1di")])
> +
>  ;; On AArch64 the By element instruction doesn't have a 2S variant.
>  ;; However because the instruction always selects a pair of values
>  ;; The normal 3SAME instruction can be used here instead.
> diff --git a/gcc/testsuite/gcc.target/aarch64/popcnt11.c 
> b/gcc/testsuite/gcc.target/aarch64/popcnt11.c
> new file mode 100644
> index 000..595b2f9eb93
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/popcnt11.c
> @@ -0,0 +1,58 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=armv8.2-a+sve" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +/*
> +** f_qi:
> +**   ldr b([0-9]+), \[x0\]
> +**   cnt v\1.8b, v\1.8b
> +**   smovw0, v\1.b\[0\]
> +**   ret
> +*/
> +unsigned
> +f_qi (unsigned char *a)
> +{
> +  return __builtin_popcountg (a[0]);
> +}
> +
> +/*
> +** f_hi:
> +**   ldr h([0-9]+), \[x0\]
> +**   ptrue   (p[0-7]).b, all
> +**   cnt z\1.h, \2/m, z\1.h
> +**   smovw0, v\1.h\[0\]
> +**   ret
> +*/
> +unsigned
> +f_hi (unsigned short *a)
> +{
> +  return __builtin_popcountg (a[0]);
> +}
> +
> +/*
> +** f_si:
> +**   ldr s([0-9]+), \[x0\]
> +**   ptrue   (p[0-7]).b, all
> +**   cnt z\1.s, \2/m, z\1.s
> +**   umovx0, v\1.d\[0\]
> +**   ret
> +*/
> +unsigned
> +f_si (unsigned int *a)
> +{
> +  return __builtin_popcountg (a[0]);
> +}
> +
> +/*
> +** f_di:
> +**   ldr d([0-9]+), \[x0\]
> +**   ptrue   (p[0-7])\.b, all
> +**   cnt z\1\.d, \2/m, z\1\.d
> +**   fmovx0, d\1
> +**   ret
> +*/
> +unsigned
> +f_di (unsigned long *a)
> +{
> +  return __builtin_popcountg (a[0]);
> +}


Re: [PATCH] SVE intrinsics: Fold svmul with all-zero operands to zero vector

2024-09-18 Thread Richard Sandiford
Jennifer Schmitz  writes:
> From 05e010a4ad5ef8df082b3e03b253aad85e2a270c Mon Sep 17 00:00:00 2001
> From: Jennifer Schmitz 
> Date: Tue, 17 Sep 2024 00:15:38 -0700
> Subject: [PATCH] SVE intrinsics: Fold svmul with all-zero operands to zero
>  vector
>
> As recently implemented for svdiv, this patch folds svmul to a zero
> vector if one of the operands is a zero vector. This transformation is
> applied if at least one of the following conditions is met:
> - the first operand is all zeros or
> - the second operand is all zeros, and the predicate is ptrue or the
> predication is _x or _z.
>
> In contrast to constant folding, which was implemented in a previous
> patch, this transformation is applied as soon as one of the operands is
> a zero vector, while the other operand can be a variable.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz 

OK, thanks.

If you're planning any more work in this area, I think the next logical
step would be to extend the current folds to all predication types,
before going on to support other mul/div cases or other operations.

In principle, the mul and div cases correspond to:

  if (integer_zerop (op1) || integer_zerop (op2))
return f.fold_active_lanes_to (build_zero_cst (TREE_TYPE (f.lhs)));

It would then be up to fold_active_lanes_to(X) to work out how to apply
predication to X.  The general case would be:

  - For x predication and unpredicated operations, fold to X.

  - For m and z, calculate a vector that supplies the values of inactive
lanes (the first vector argument for m and a zero vector from z).

- If X is equal to the inactive lanes vector, fold directly to X.

- Otherwise fold to VEC_COND_EXPR 
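A rough sketch of what such a fold_active_lanes_to could look like (lhs, call
and pred are existing members of the folder classes; everything else here is
illustrative rather than the actual implementation):

  gimple *
  gimple_folder::fold_active_lanes_to (tree x)
  {
    /* For _x predication and unpredicated operations, every lane is
       active, so the result is simply X.  */
    if (pred == PRED_x || pred == PRED_none)
      return gimple_build_assign (lhs, x);

    /* For _m the inactive lanes keep the first vector argument (argument 1,
       after the governing predicate); for _z they become zero.  */
    tree inactive = (pred == PRED_m
                     ? gimple_call_arg (call, 1)
                     : build_zero_cst (TREE_TYPE (lhs)));
    if (operand_equal_p (x, inactive, 0))
      return gimple_build_assign (lhs, x);

    /* Otherwise select between X and the inactive values lane by lane.
       Converting the svbool_t governing predicate to the data vector's
       boolean type is omitted from this sketch.  */
    tree vec_pred = NULL_TREE;   /* placeholder for the converted predicate */
    return gimple_build_assign (lhs, VEC_COND_EXPR, vec_pred, x, inactive);
  }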

Richard


Re: [RFC PATCH] Enable vectorization for unknown tripcount in very cheap cost model but disable epilog vectorization.

2024-09-18 Thread Richard Sandiford
Richard Biener  writes:
> On Thu, Sep 12, 2024 at 4:50 PM Hongtao Liu  wrote:
>>
>> On Wed, Sep 11, 2024 at 4:21 PM Hongtao Liu  wrote:
>> >
>> > On Wed, Sep 11, 2024 at 4:04 PM Richard Biener
>> >  wrote:
>> > >
>> > > On Wed, Sep 11, 2024 at 4:17 AM liuhongt  wrote:
>> > > >
>> > > > GCC12 enables vectorization for O2 with very cheap cost model which is 
>> > > > restricted
>> > > > to constant tripcount. The vectorization capacity is very limited w/ 
>> > > > consideration
>> > > > of codesize impact.
>> > > >
>> > > > The patch extends the very cheap cost model a little bit to support 
>> > > > variable tripcount.
>> > > > But still disable peeling for gaps/alignment, runtime aliasing 
>> > > > checking and epilogue
>> > > > vectorization with the consideration of codesize.
>> > > >
>> > > > So there're at most 2 versions of loop for O2 vectorization, one 
>> > > > vectorized main loop
>> > > > , one scalar/remainder loop.
>> > > >
>> > > > .i.e.
>> > > >
>> > > > void
>> > > > foo1 (int* __restrict a, int* b, int* c, int n)
>> > > > {
>> > > >  for (int i = 0; i != n; i++)
>> > > >   a[i] = b[i] + c[i];
>> > > > }
>> > > >
>> > > > with -O2 -march=x86-64-v3, will be vectorized to
>> > > >
>> > > > .L10:
>> > > > vmovdqu (%r8,%rax), %ymm0
>> > > > vpaddd  (%rsi,%rax), %ymm0, %ymm0
>> > > > vmovdqu %ymm0, (%rdi,%rax)
>> > > > addq$32, %rax
>> > > > cmpq%rdx, %rax
>> > > > jne .L10
>> > > > movl%ecx, %eax
>> > > > andl$-8, %eax
>> > > > cmpl%eax, %ecx
>> > > > je  .L21
>> > > > vzeroupper
>> > > > .L12:
>> > > > movl(%r8,%rax,4), %edx
>> > > > addl(%rsi,%rax,4), %edx
>> > > > movl%edx, (%rdi,%rax,4)
>> > > > addq$1, %rax
>> > > > cmpl%eax, %ecx
>> > > > jne .L12
>> > > >
>> > > > As measured with SPEC2017 on EMR, the patch(N-Iter) improves 
>> > > > performance by 4.11%
>> > > > with extra 2.8% codeisze, and cheap cost model improve performance by 
>> > > > 5.74% with
>> > > > extra 8.88% codesize. The details are as below
>> > >
>> > > I'm confused by this, is the N-Iter numbers ontop of the cheap cost
>> > > model numbers?
>> > No, it's N-iter vs base(very cheap cost model), and cheap vs base.
>> > >
>> > > > Performance measured with -march=x86-64-v3 -O2 on EMR
>> > > >
>> > > > N-Iter  cheap cost model
>> > > > 500.perlbench_r -0.12%  -0.12%
>> > > > 502.gcc_r   0.44%   -0.11%
>> > > > 505.mcf_r   0.17%   4.46%
>> > > > 520.omnetpp_r   0.28%   -0.27%
>> > > > 523.xalancbmk_r 0.00%   5.93%
>> > > > 525.x264_r  -0.09%  23.53%
>> > > > 531.deepsjeng_r 0.19%   0.00%
>> > > > 541.leela_r 0.22%   0.00%
>> > > > 548.exchange2_r -11.54% -22.34%
>> > > > 557.xz_r0.74%   0.49%
>> > > > GEOMEAN INT -1.04%  0.60%
>> > > >
>> > > > 503.bwaves_r3.13%   4.72%
>> > > > 507.cactuBSSN_r 1.17%   0.29%
>> > > > 508.namd_r  0.39%   6.87%
>> > > > 510.parest_r3.14%   8.52%
>> > > > 511.povray_r0.10%   -0.20%
>> > > > 519.lbm_r   -0.68%  10.14%
>> > > > 521.wrf_r   68.20%  76.73%
>> > >
>> > > So this seems to regress as well?
>> > Niter increases performance less than the cheap cost model, that's
>> > expected, it is not a regression.
>> > >
>> > > > 526.blender_r   0.12%   0.12%
>> > > > 527.cam4_r  19.67%  23.21%
>> > > > 538.imagick_r   0.12%   0.24%
>> > > > 544.nab_r   0.63%   0.53%
>> > > > 549.fotonik3d_r 14.44%  9.43%
>> > > > 554.roms_r  12.39%  0.00%
>> > > > GEOMEAN FP  8.26%   9.41%
>> > > > GEOMEAN ALL 4.11%   5.74%
>>
>> I've tested the patch on aarch64, it shows similar improvement with
>> little codesize increasement.
>> I haven't tested it on other backends, but I think it would have
>> similar good improvements
>
> I think overall this is expected since a constant niter dividable by
> the VF isn't a common situation.  So the question is mostly whether
> we want to pay the size penalty or not.
>
> Looking only at docs the proposed change would make the very-cheap
> cost model nearly(?) equivalent to the cheap one so maybe the answer
> is to default to cheap rather than very-cheap?  One difference seems to
> be that cheap allows alias versioning.

I remember seeing cases in the past where we could generate an
excessive number of alias checks.  The cost model didn't account
for them very well, since the checks often became a fixed overhead
for all paths (both scalar and vector), especially if the checks
were fully if-converted, with one branch at the end.  The relevant
comparison is then between the original pre-vectorisation scalar code
and the code with alias checks, rather than between post-vectorisation
scalar code and p

Re: [PATCH] libcpp: Add -Wtrailing-blanks warning

2024-09-18 Thread Richard Sandiford
Jeff Law  writes:
> On 9/18/24 10:11 AM, Jakub Jelinek wrote:
>> Hi!
>> 
>> Trailing blanks is something even git diff diagnoses; while it is a coding
>> style issue, if it is so common that git diff diagnoses it, I think it could
>> be useful to various projects to check that at compile time.

Nice!  Thanks for doing this.

>> Dunno if it should be included in -Wextra, currently it isn't, and due to
>> tons of trailing whitespace in our sources, haven't enabled it for when
>> building gcc itself either.
>> 
>> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>> 
>> Note, git diff also diagnoses indentation with a tab following a space; I
>> wonder if we couldn't have trivial warning options where one would simply
>> ask for checking of indentation with no tabs (just spaces) vs. indentation
>> with tabs followed by spaces (but never tab-width or more spaces in the
>> indentation).  I think that would be easy to do also on the libcpp side.
>> Checking exactly how much something should be indented requires syntax
>> analysis (at least some limited form) and can consider the column of the
>> first token on a line, but what the exact indentation blanks were is
>> something only libcpp knows.
>> 
>> 2024-09-18  Jakub Jelinek  
>> 
>> libcpp/
>>  * include/cpplib.h (struct cpp_options): Add cpp_warn_trailing_blanks
>>  member.
>>  (enum cpp_warning_reason): Add CPP_W_TRAILING_BLANKS.
>>  * internal.h (struct _cpp_line_note): Document 'B' line note.
>>  * lex.cc (_cpp_clean_line): Add 'B' line note for trailing blanks
>>  except for trailing whitespace after backslash.  Formatting fix.
>>  (_cpp_process_line_notes): Emit -Wtrailing-blanks diagnostics.
>>  Formatting fixes.
>>  (lex_raw_string): Clear type on 'B' notes.
>> gcc/
>>  * doc/invoke.texi (Wtrailing-blanks): Document.
>> gcc/c-family/
>>  * c.opt (Wtrailing-blanks): New option.
>> gcc/testsuite/
>>  * c-c++-common/cpp/Wtrailing-blanks.c: New test.
> I'd tend to think we want this and would want to (at the appropriate
> time) turn it on for our builds.  Better to catch this nit early rather
> than at commit/push time, IMHO.

+1  I'd much rather learn about this kind of error before the code reaches
a review tool :)

From a quick check, it doesn't look like Clang has this, so there is no
existing name to follow.

Richard


Re: [PATCH v2 0/4] dwarf2: add hooks for architecture-specific CFIs

2024-09-18 Thread Richard Sandiford
Matthieu Longo  writes:
> Architecture-specific CFI directives are currently declared and processed
> among other architecture-independent CFI directives in gcc/dwarf2* files.
> This approach creates confusion, specifically in the case of DWARF
> instructions in the vendor space that use the same instruction code.
> Such a clash currently happens between DW_CFA_GNU_window_save (used on
> SPARC) and DW_CFA_AARCH64_negate_ra_state (used on AArch64), both having
> the same instruction code 0x2d.  AArch64 compilers then generate a SPARC
> CFI directive (.cfi_window_save) instead of .cfi_negate_ra_state, contrary
> to what is expected in [1].

The series LGTM, thanks.  OK for trunk if there are no objections
before Monday.

Richard

> 1. Rename REG_CFA_TOGGLE_RA_MANGLE to REG_CFA_NEGATE_RA_STATE
>
> This patch renames:
> - dwarf2out_frame_debug_cfa_toggle_ra_mangle to 
> dwarf2out_frame_debug_cfa_negate_ra_state,
> - REG_CFA_TOGGLE_RA_MANGLE to REG_CFA_NEGATE_RA_STATE,
> as the naming was misleading.
> The word "toggle" suggested a binary state, whereas this register stores the
> mangling state (which can have more than 2 states) for the return address on
> AArch64.
>
> 2. dwarf2: add hooks for architecture-specific CFIs
>
> This refactoring does not completely solve the problem, but improves the
> situation by moving some of the processing of those directives (more
> specifically their output in the assembly) to the backend via 2 target hooks:
> - DW_CFI_OPRND1_DESC: parse the first operand of the directive (if any).
> - OUTPUT_CFI_DIRECTIVE: output the CFI directive as a string.
> Only AArch64's and SPARC's backends are impacted.
>
> 3. aarch64 testsuite: explain expectations for pr94515*
> PR94515's tests in the AArch64 G++ testsuite were lacking documentation.
> They are now thoroughly documented.
>
> 4. dwarf2: store the RA state in CFI row
>
> On AArch64, the RA state informs the unwinder whether, and how, the return
> address is mangled. This information is encoded as a boolean in the CFI
> row. This binary approach prevents expressing more complex
> configurations, as is the case with PAuth_LR introduced in Armv9.5-A.
> This patch addresses this limitation by replacing the boolean with an enum.
>
>
> References:
> [1] DWARF for the Arm 64-bit Architecture (AArch64) --> 
> https://github.com/ARM-software/abi-aa/blob/main/aadwarf64/aadwarf64.rst
>
> ### Diff between revisions 1 & 2
>
> patch 1: no change.
> patch 2:
>   - fix issue with default hook signature for new hooks.
>   - code style formatting.
>   - change targetm.dw_cfi_oprnd1_desc signature to match the one of 
> dw_cfi_oprnd2_desc.
> patch 3:
>   - add comments suggested by Richard Sandiford in test case 
> gcc/testsuite/g++.target
> /aarch64/pr94515-2.C.
> patch 4:
>   - code style formatting.
>
> ## Testing
>
> Built for target aarch64-unknown-linux-gnu and ran GCC's & G++'s testsuites 
> for AArch64.
> Built GCC stage 1 for target sparc64-unknown-linux-gnu.
>
>
> Ok for master? I don't have commit access so I need someone to commit on my 
> behalf.
>
> Regards,
> Matthieu.
>
> Matthieu Longo (4):
>   Rename REG_CFA_TOGGLE_RA_MANGLE to REG_CFA_NEGATE_RA_STATE
>   dwarf2: add hooks for architecture-specific CFIs
>   aarch64 testsuite: explain expectations for pr94515* tests
>   dwarf2: store the RA state in CFI row
>
>  gcc/combine-stack-adj.cc |  2 +-
>  gcc/config/aarch64/aarch64.cc| 37 +++-
>  gcc/config/sparc/sparc.cc| 35 
>  gcc/coretypes.h  |  6 ++
>  gcc/doc/tm.texi  | 16 +-
>  gcc/doc/tm.texi.in   |  5 +-
>  gcc/dwarf2cfi.cc | 59 ++--
>  gcc/dwarf2out.cc | 13 +++--
>  gcc/dwarf2out.h  | 11 ++--
>  gcc/hooks.cc | 14 +
>  gcc/hooks.h  |  3 +
>  gcc/reg-notes.def|  8 +--
>  gcc/target.def   | 20 +++
>  gcc/testsuite/g++.target/aarch64/pr94515-1.C | 14 -
>  gcc/testsuite/g++.target/aarch64/pr94515-2.C | 41 +++---
>  include/dwarf2.h |  5 ++
>  libffi/include/ffi_cfi.h |  2 +
>  libgcc/config/aarch64/aarch64-asm.h  |  4 +-
>  libitm/config/aarch64/sjlj.S | 10 ++--
>  19 files changed, 239 insertions(+), 66 deletions(-)


Re: [PATCH v2 0/4][libgcc] store signing key and signing method in DWARF _Unwind_FrameState

2024-09-18 Thread Richard Sandiford
Matthieu Longo  writes:
> This patch series is only a refactoring of the existing implementation of
> PAuth and return-address signing. The existing behavior is preserved.

The series LGTM, thanks.  OK for trunk if there are no objections
before Monday.

Richard

>
> 1. aarch64: store signing key and signing method in DWARF _Unwind_FrameState
>
> _Unwind_FrameState already contains various pieces of CIE and FDE
> information (see the attributes below the comment "The information we care
> about from the CIE/FDE" in libgcc/unwind-dw2.h).
> The patch aims at moving the information from the DWARF CIE (signing key
> stored in the augmentation string) and FDE (the signing method used) into
> _Unwind_FrameState alongside the already-stored CIE and FDE information.
> Note: this information has to be saved in frame_state_reg_info instead of
> _Unwind_FrameState as it needs to be savable by DW_CFA_remember_state and
> restorable by DW_CFA_restore_state, which both rely on the "prev" attribute.
> This new information in _Unwind_FrameState simplifies the look-up of the
> signing key when the return address is demangled. It also allows future
> signing methods to be added easily.
> _Unwind_FrameState is not a part of the public API of libunwind, so the
> change is backward compatible.
>
> A new architecture-specific handler MD_ARCH_EXTENSION_FRAME_INIT allows
> resetting values in the frame state and unwind context if needed by the
> architecture extension before changing the frame state to the caller context.
> A new architecture-specific handler MD_ARCH_EXTENSION_CIE_AUG_HANDLER
> isolates the architecture-specific augmentation strings in the AArch64
> backend, and allows other architectures to reuse augmentation strings that
> would have clashed with AArch64 DWARF extensions.
> The aarch64_demangle_return_addr, DW_CFA_AARCH64_negate_ra_state and
> DW_CFA_val_expression cases in libgcc/unwind-dw2-execute_cfa.h were
> documented to clarify where the value of the RA state register is stored
> (FS and CONTEXT respectively).
>
> 2. aarch64: skip copy of RA state register into target context
>
> The RA state register is local to a frame, so it should not be copied to the 
> target frame during the context installation.
> This patch adds a new backend handler that checks whether a register needs to 
> be skipped or not before its installation.
>
> 3. libgcc: hide CIE and FDE data for DWARF architecture extensions behind a 
> handler.
>
> This patch provides a new handler MD_ARCH_FRAME_STATE_T to hide an 
> architecture-specific structure containing CIE and FDE data related to DWARF 
> architecture extensions.
> Hiding the architecture-specific attributes behind a handler has the 
> following benefits:
> 1. isolating this data from the generic data in _Unwind_FrameState
> 2. avoiding casts to custom types.
> 3. preserving typing information when debugging with GDB, and so 
> facilitating their printing.
>
> This approach required adding a new header md-unwind-def.h, included at the
> top of libgcc/unwind-dw2.h and redirecting to the corresponding architecture
> header via a symbolic link.
> An obvious drawback is the increase in complexity with macros and headers.
> It also caused a split of architecture definitions between md-unwind-def.h
> (type definitions used in unwind-dw2.h) and md-unwind.h (local type
> definitions and handler implementations).
> The naming of md-unwind.h with a .h extension is a bit misleading as the file
> is only included in the middle of unwind-dw2.c. Changing this naming would
> require modifying other backends, which I preferred to abstain from.
> Overall the benefits are worth the added complexity from my perspective.
>
> 4. libgcc: update configure (regenerated by autoreconf)
>
> Regenerate the build files.
> Note: This patch should be squashed with the previous one before merging.
>
> ### Diff between revisions 1 & 2
>
> 1: code style formatting + remove resetting of the local frame register in 
> the context as recommended by Richard Sandiford.
> 2: add a new handler as recommended by Richard Sandiford to skip the local 
> frame registers before installation in the target context.
> 3: code style formatting.
> 4: no change.
>
> ## Testing
>
> Those changes were tested by covering the 3 following cases:
> - backtracing.
> - exception handling in a C++ program.
> - gcc/testsuite/gcc.target/aarch64/pr104689.c: pac-ret with unusual DWARF [1]
>
> Regression tested on aarch64-unknown-linux-gnu, and no regression found.
>
> [1]: https://gcc.gnu.org/pipermail/gcc-patches/2022-May/594414.html
>
> Ok for master? I don't have commit access so I need someone to commit on my

Re: [PATCH v9 2/2] aarch64: Add codegen support for AdvSIMD faminmax

2024-09-18 Thread Richard Sandiford
 writes:
> The AArch64 FEAT_FAMINMAX extension is optional from Armv9.2-a and
> mandatory from Armv9.5-a. It introduces instructions for computing the
> floating point absolute maximum and minimum of the two vectors
> element-wise.
>
> This patch adds code generation support for famax and famin in terms of
> existing RTL operators.
>
> famax/famin is equivalent to first taking abs of the operands and then
> taking smax/smin on the results of abs.
>
>   famax/famin (a, b) = smax/smin (abs (a), abs (b))
>
> This fusion of operators is only possible when -march=armv9-a+faminmax
> flags are passed. We also need to pass -ffast-math flag; if we don't,
> then a statement like
>
>   c[i] = __builtin_fmaxf16 (a[i], b[i]);
>
> is RTL expanded to UNSPEC_FMAXNM instead of smax (likewise for smin).
>
> This code generation is only available on -O2 or -O3 as that is when
> auto-vectorization is enabled.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-simd.md
>   (*aarch64_faminmax_fused): Instruction pattern for faminmax
>   codegen.
>   * config/aarch64/iterators.md: Attribute for faminmax codegen.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/simd/faminmax-codegen-no-flag.c: New test.
>   * gcc.target/aarch64/simd/faminmax-codegen.c: New test.
>   * gcc.target/aarch64/simd/faminmax-no-codegen.c: New test.
> ---
>  gcc/config/aarch64/aarch64-simd.md|   9 +
>  gcc/config/aarch64/iterators.md   |   3 +
>  .../aarch64/simd/faminmax-codegen-no-flag.c   | 217 ++
>  .../aarch64/simd/faminmax-codegen.c   | 197 +
>  .../aarch64/simd/faminmax-no-codegen.c| 267 ++
>  5 files changed, 693 insertions(+)
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/simd/faminmax-codegen-no-flag.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/faminmax-codegen.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/simd/faminmax-no-codegen.c
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 67f0fe26f93..2a44aa3fcc3 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -9920,3 +9920,12 @@
>"TARGET_FAMINMAX"
>"\t%0., %1., %2."
>  )
> +
> +(define_insn "*aarch64_faminmax_fused"
> +  [(set (match_operand:VHSDF 0 "register_operand" "=w")
> + (FMAXMIN:VHSDF
> +   (abs:VHSDF (match_operand:VHSDF 1 "register_operand" "w"))
> +   (abs:VHSDF (match_operand:VHSDF 2 "register_operand" "w"]
> +  "TARGET_FAMINMAX"
> +  "\t%0., %1., %2."
> +)
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 17ac5e073aa..c2fcd18306e 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -4472,3 +4472,6 @@
>  
>  (define_int_attr faminmax_uns_op
>[(UNSPEC_FAMAX "famax") (UNSPEC_FAMIN "famin")])
> +
> +(define_code_attr faminmax_op
> +  [(smax "famax") (smin "famin")])
> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/faminmax-codegen-no-flag.c 
> b/gcc/testsuite/gcc.target/aarch64/simd/faminmax-codegen-no-flag.c
> new file mode 100644
> index 000..d77f5a5d19f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/simd/faminmax-codegen-no-flag.c
> @@ -0,0 +1,217 @@
> +/* { dg-do assemble} */

As Kyrill mentioned, there should be a space after "assemble":

/* { dg-do assemble } */

Same for the other tests.  (And for patch 1 -- sorry, forgot
to check there.)

OK with that change, thanks -- no need for another review.

Richard

> +/* { dg-additional-options "-O3 -ffast-math -march=armv9-a" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +
> +#include "arm_neon.h"
> +
> +#pragma GCC target "+nosve"
> +
> +/*
> +** test_vamax_f16:
> +**   fabsv1.4h, v1.4h
> +**   fabsv0.4h, v0.4h
> +**   fmaxnm  v0.4h, v0.4h, v1.4h
> +**   ret
> +*/
> +float16x4_t
> +test_vamax_f16 (float16x4_t a, float16x4_t b)
> +{
> +  int i;
> +  float16x4_t c;
> +
> +  for (i = 0; i < 4; ++i) {
> +a[i] = __builtin_fabsf16 (a[i]);
> +b[i] = __builtin_fabsf16 (b[i]);
> +c[i] = __builtin_fmaxf16 (a[i], b[i]);
> +  }
> +  return c;
> +}
> +
> +/*
> +** test_vamaxq_f16:
> +**   fabsv1.8h, v1.8h
> +**   fabsv0.8h, v0.8h
> +**   fmaxnm  v0.8h, v0.8h, v1.8h
> +**   ret
> +*/
> +float16x8_t
> +test_vamaxq_f16 (float16x8_t a, float16x8_t b)
> +{
> +  int i;
> +  float16x8_t c;
> +
> +  for (i = 0; i < 8; ++i) {
> +a[i] = __builtin_fabsf16 (a[i]);
> +b[i] = __builtin_fabsf16 (b[i]);
> +c[i] = __builtin_fmaxf16 (a[i], b[i]);
> +  }
> +  return c;
> +}
> +
> +/*
> +** test_vamax_f32:
> +**   fabsv1.2s, v1.2s
> +**   fabsv0.2s, v0.2s
> +**   fmaxnm  v0.2s, v0.2s, v1.2s
> +**   ret
> +*/
> +float32x2_t
> +test_vamax_f32 (float32x2_t a, float32x2_t b)
> +{
> +  int i;
> +  float32x2_t c;
> +
> +  for (i = 0; i < 2; ++i) {
> +a[i] = __builtin_fabsf32 (a[i]);
> +b[i] = __builtin_fabsf32 (b[i]);
> +   

Re: [PATCH v9 1/2] aarch64: Add AdvSIMD faminmax intrinsics

2024-09-18 Thread Richard Sandiford
 writes:
> The AArch64 FEAT_FAMINMAX extension is optional from Armv9.2-a and
> mandatory from Armv9.5-a. It introduces instructions for computing the
> floating point absolute maximum and minimum of the two vectors element-wise.
>
> This patch introduces AdvSIMD faminmax intrinsics. The intrinsics of
> this extension are implemented as the following builtin functions:
> * vamax_f16
> * vamaxq_f16
> * vamax_f32
> * vamaxq_f32
> * vamaxq_f64
> * vamin_f16
> * vaminq_f16
> * vamin_f32
> * vaminq_f32
> * vaminq_f64
>
> We are defining a new way to add AArch64 AdvSIMD intrinsics by listing
> all the intrinsics in a .def file and then using that .def file to
> initialise various data structures. This would lead to more concise code
> and easier addition of the new AdvSIMD intrinsics in future.
>
> The faminmax intrinsics are defined using the new approach.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-builtins.cc
>   (ENTRY): Macro to parse the contents of
>   aarch64-simd-pragma-builtins.def.
>   (ENTRY_VHSDF): Macro to parse the contents of
>   aarch64-simd-pragma-builtins.def.
>   (enum aarch64_builtins): New enum values for faminmax builtins
>   via aarch64-simd-pragma-builtins.def.
>   (enum class aarch64_builtin_signatures): Enum class to specify
>   the number of operands a builtin will take.
>   (struct aarch64_pragma_builtins_data): Struct to hold data from
>   aarch64-simd-pragma-builtins.def.
>   (aarch64_fntype): New function to define function types of
>   intrinsics given an object of type aarch64_pragma_builtins_data.
>   (aarch64_init_pragma_builtins): New function to define pragma
>   builtins.
>   (aarch64_get_pragma_builtin): New function to get a row of
>   aarch64_pragma_builtins, given code.
>   (handle_arm_neon_h): Modify to call
>   aarch64_init_pragma_builtins.
>   (aarch64_general_check_builtin_call): Modify to check whether
>   required flag is being used for pragma builtins.
>   (aarch64_expand_pragma_builtin): New function to emit
>   instructions of pragma_builtin.
>   (aarch64_general_expand_builtin): Modify to call
>   aarch64_expand_pragma_builtin.
>   * config/aarch64/aarch64-option-extensions.def
>   (AARCH64_OPT_EXTENSION): Introduce new flag for this extension.
>   * config/aarch64/aarch64-simd.md
>   (@aarch64_): Instruction pattern for
>   faminmax intrinsics.
>   * config/aarch64/aarch64.h
>   (TARGET_FAMINMAX): Introduce new flag for this extension.
>   * config/aarch64/iterators.md: New iterators and unspecs.
>   * doc/invoke.texi: Document extension in AArch64 Options.
>   * config/aarch64/aarch64-simd-pragma-builtins.def: New file to
>   list pragma builtins.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/simd/faminmax-builtins-no-flag.c: New test.
>   * gcc.target/aarch64/simd/faminmax-builtins.c: New test.

LGTM.  OK, thanks.

It looks from MAINTAINERS that you don't currently have commit access.
Could you follow the procedure on https://gcc.gnu.org/gitwrite.html ,
listing me as sponsor?  Or if you already have an account for binutils,
I think the process is to email overseers directly.

Thanks for adding the new builtins infrastructure.

Richard

> ---
>  gcc/config/aarch64/aarch64-builtins.cc| 119 ++
>  .../aarch64/aarch64-option-extensions.def |   2 +
>  .../aarch64/aarch64-simd-pragma-builtins.def  |  23 
>  gcc/config/aarch64/aarch64-simd.md|  10 ++
>  gcc/config/aarch64/aarch64.h  |   4 +
>  gcc/config/aarch64/iterators.md   |   9 ++
>  gcc/doc/invoke.texi   |   2 +
>  .../aarch64/simd/faminmax-builtins-no-flag.c  |  10 ++
>  .../aarch64/simd/faminmax-builtins.c  | 115 +
>  9 files changed, 294 insertions(+)
>  create mode 100644 gcc/config/aarch64/aarch64-simd-pragma-builtins.def
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/simd/faminmax-builtins-no-flag.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/faminmax-builtins.c
>
> diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
> b/gcc/config/aarch64/aarch64-builtins.cc
> index eb878b933fe..6266bea3b39 100644
> --- a/gcc/config/aarch64/aarch64-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-builtins.cc
> @@ -757,6 +757,18 @@ typedef struct
>  #define VAR1(T, N, MAP, FLAG, A) \
>AARCH64_SIMD_BUILTIN_##T##_##N##A,
>  
> +#undef ENTRY
> +#define ENTRY(N, S, M, U, F) \
> +  AARCH64_##N,
> +
> +#undef ENTRY_VHSDF
> +#define ENTRY_VHSDF(NAME, SIGNATURE, UNSPEC, EXTENSIONS) \
> +  AARCH64_##NAME##_f16, \
> +  AARCH64_##NAME##q_f16, \
> +  AARCH64_##NAME##_f32, \
> +  AARCH64_##NAME##q_f32, \
> +  AARCH64_##NAME##q_f64,
> +
>  enum aarch64_builtins
>  {
>AARCH64_BUILTIN_MIN,
> @@ -829,6 +841,10 @@ enum aarch64_builtins
>AARCH64_RBIT,
>AARCH64_RBITL,
>AARCH64_RBITLL,
> +  /* Pragma builtins.  *

Re: [PATCH] contrib: Set check-params-in-docs.py to skip tables of values of a param

2024-09-18 Thread Richard Sandiford
Filip Kastl  writes:
> Hi,
>
> Here is the part of invoke.texi that currently confuses
> check-params-in-docs.py:
>
> @item aarch64-autovec-preference
> Force an ISA selection strategy for auto-vectorization.
> @table @samp
> @item default
> Use the default heuristics.
> @item asimd-only
> Use only Advanced SIMD for auto-vectorization.
> @item sve-only
> Use only SVE for auto-vectorization.
> @item prefer-asimd
> Use both Advanced SIMD and SVE.  Prefer Advanced SIMD when the costs are
> deemed equal.
> @item prefer-sve 
> Use both Advanced SIMD and SVE.  Prefer SVE when the costs are deemed equal.
> @end table
>
> The script reports
>
> Extra:
> {'default', 'asimd-only', 'sve-only', 'prefer-asimd', 'prefer-sve'}
>
> Is the patch ok to be pushed?
>
> Cheers,
> Filip Kastl
>
>
> -- 8< --
>
>
> Currently check-params-in-docs.py reports extra params being listed in
> invoke.texi.  However, those aren't actual params but items in a table of
> possible values of the aarch64-autovec-preference param.
>
> This patch changes check-params-in-docs.py to ignore similar tables.
>
> contrib/ChangeLog:
>
>   * check-params-in-docs.py: Skip tables of values of a param.
>
> Signed-off-by: Filip Kastl 
> ---
>  contrib/check-params-in-docs.py | 11 +++
>  1 file changed, 11 insertions(+)
>
> diff --git a/contrib/check-params-in-docs.py b/contrib/check-params-in-docs.py
> index ccdb8d72169..8574842a4e7 100755
> --- a/contrib/check-params-in-docs.py
> +++ b/contrib/check-params-in-docs.py
> @@ -66,7 +66,18 @@ texi = takewhile(lambda x: '@node Instrumentation Options' 
> not in x, texi)
>  texi = list(texi)[1:]
>  
>  texi_params = []
> +skip = False
>  for line in texi:
> +# Skip @table @samp sections of manual where values of a param are 
> usually
> +# listed
> +if skip:
> +if line.startswith('@end table'):
> +skip = False
> +continue
> +elif line.startswith('@table @samp'):
> +skip = True
> +continue
> +

OK, thanks.  I think this makes:

# Skip digits
texi_params = [x for x in texi_params if not x[0].isdigit()]

redundant -- your patch is a more general fix that would handle
both cases.  So I think it would be good to remove that too, but the
patch is ok either way.
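
For concreteness, the resulting loop might then look roughly like this
(just a sketch combining your hunk with that removal; names as in the
script):

skip = False
for line in texi:
    # Skip @table @samp ... @end table blocks, which list values of a param.
    if skip:
        if line.startswith('@end table'):
            skip = False
        continue
    elif line.startswith('@table @samp'):
        skip = True
        continue

    for token in ('@item ', '@itemx '):
        if line.startswith(token):
            texi_params.append(line[len(token):])

# The separate "skip digits" filtering of texi_params would then be dropped.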

Thanks again for cleaning this up.

Richard

>  for token in ('@item ', '@itemx '):
>  if line.startswith(token):
>  texi_params.append(line[len(token):])


Re: [patch, reload] Add target macro RELOAD_ELIMINABLE_REGS (PR116326)

2024-09-18 Thread Richard Sandiford
Georg-Johann Lay  writes:
> diff --git a/gcc/reload1.cc b/gcc/reload1.cc
> index 2e059b09970..b0ae64e10b2 100644
> --- a/gcc/reload1.cc
> +++ b/gcc/reload1.cc
> @@ -283,7 +283,13 @@ static const struct elim_table_1
>const int to;
>  } reg_eliminate_1[] =
>  
> +  // Reload and LRA don't agree on how a multi-register frame pointer
> +  // is represented for elimination.  See avr.h for a use case.

Sorry for the formatting nit, but: we should continue to use /* ... */
comments in files that currently use them.

OK with that change, thanks.

Richard

> +#ifdef RELOAD_ELIMINABLE_REGS
> +  RELOAD_ELIMINABLE_REGS;
> +#else
>ELIMINABLE_REGS;
> +#endif
>  
>  #define NUM_ELIMINABLE_REGS ARRAY_SIZE (reg_eliminate_1)
>  


Re: [RFC PATCH] Allow functions with target_clones attribute to be inlined

2024-09-18 Thread Richard Sandiford
Yangyu Chen  writes:
> I recently found that target_clones functions cannot inline even when
> the caller has exactly the same target. However, if we only use target
> attributes in C++ and let the compiler generate IFUNC for us, the
> functions with the same target will be inlined.
>
> For example, the following code compiled for an x86-64 target with -O3 will
> generate IFUNCs for foo and bar and inline foo into bar:
>
> ```cpp
> __attribute__((target("default")))
> int foo(int *arr) {
> int sum = 0;
> for (int i=0;i<16;i++) sum += arr[i];
> return sum;
> }
>
> __attribute__((target("avx2")))
> int foo(int *arr) {
> int sum = 0;
> for (int i=0;i<16;i++) sum += arr[i];
> return sum;
> }
>
> __attribute__((target("default")))
> int bar(int *arr) {
> return foo(arr);
> }
>
> __attribute__((target("avx2")))
> int bar(int *arr) {
> return foo(arr);
> }
> ```
>
> However, if we use target_clones attribute, the target_clones functions
> will not be inlined:
>
> ```cpp
> __attribute__((target_clones("default","avx2")))
> int foo(int *arr) {
> int sum = 0;
> for (int i=0;i<16;i++) sum += arr[i];
> return sum;
> }
>
> __attribute__((target_clones("default","avx2")))
> int bar(int *arr) {
> return foo(arr);
> }
> ```
>
> This behavior may negatively impact performance since the target_clones
> functions are not inlined. And since we don't jump to the target_clones
> functions through the PLT but use the same target as the caller's target,
> I think it's better to allow the target_clones functions to be inlined.
>
> gcc/ada/ChangeLog:
>
> * gcc-interface/utils.cc (handle_target_clones_attribute):
> Allow functions with target_clones attribute to be inlined.
>
> gcc/c-family/ChangeLog:
>
> * c-attribs.cc (handle_target_clones_attribute):
> Allow functions with target_clones attribute to be inlined.
>
> gcc/d/ChangeLog:
>
> * d-attribs.cc (d_handle_target_clones_attribute):
> Allow functions with target_clones attribute to be inlined.

What I'm about to say applies to both sequences above, but:

Before inlining avx2 foo into avx2 bar, don't we need to be careful about
making sure that foo would still pick the avx2 version if called normally?
E.g. if foo had an avx512 version, direct calls to foo would presumably
pick that on avx512 targets, but still pick the avx2 version of bar.
It would then seem strange for the avx2 version of bar to inline the
avx2 version of foo, both for performance and ODR reasons.
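
To make that concrete, here is a hypothetical variation of the example
above (mine, not from the patch):

__attribute__((target_clones("default","avx2","avx512f")))
int foo (int *arr)
{
  int sum = 0;
  for (int i = 0; i < 16; i++)
    sum += arr[i];
  return sum;
}

__attribute__((target_clones("default","avx2")))
int bar (int *arr)
{
  /* A normal call here goes through foo's resolver and so would pick
     foo's avx512f clone on an AVX-512 host.  If the avx2 clone of bar
     instead inlines the avx2 clone of foo, the inlined and non-inlined
     behaviours diverge -- the performance and ODR concern above.  */
  return foo (arr);
}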

Thanks,
Richard

>
> Signed-off-by: Yangyu Chen 
> ---
>  gcc/ada/gcc-interface/utils.cc | 5 +
>  gcc/c-family/c-attribs.cc  | 3 ---
>  gcc/d/d-attribs.cc | 5 -
>  3 files changed, 1 insertion(+), 12 deletions(-)
>
> diff --git a/gcc/ada/gcc-interface/utils.cc b/gcc/ada/gcc-interface/utils.cc
> index 60f36b1e50d..d010b684177 100644
> --- a/gcc/ada/gcc-interface/utils.cc
> +++ b/gcc/ada/gcc-interface/utils.cc
> @@ -7299,10 +7299,7 @@ handle_target_clones_attribute (tree *node, tree name, 
> tree ARG_UNUSED (args),
> int ARG_UNUSED (flags), bool *no_add_attrs)
>  {
>/* Ensure we have a function type.  */
> -  if (TREE_CODE (*node) == FUNCTION_DECL)
> -/* Do not inline functions with multiple clone targets.  */
> -DECL_UNINLINABLE (*node) = 1;
> -  else
> +  if (TREE_CODE (*node) != FUNCTION_DECL)
>  {
>warning (OPT_Wattributes, "%qE attribute ignored", name);
>*no_add_attrs = true;
> diff --git a/gcc/c-family/c-attribs.cc b/gcc/c-family/c-attribs.cc
> index 4dd2eecbea5..f8759bb1908 100644
> --- a/gcc/c-family/c-attribs.cc
> +++ b/gcc/c-family/c-attribs.cc
> @@ -6105,9 +6105,6 @@ handle_target_clones_attribute (tree *node, tree name, 
> tree ARG_UNUSED (args),
>  "single % attribute is ignored");
> *no_add_attrs = true;
>   }
> -  else
> -  /* Do not inline functions with multiple clone targets.  */
> - DECL_UNINLINABLE (*node) = 1;
>  }
>else
>  {
> diff --git a/gcc/d/d-attribs.cc b/gcc/d/d-attribs.cc
> index 0f7ca10e017..9f67415adb1 100644
> --- a/gcc/d/d-attribs.cc
> +++ b/gcc/d/d-attribs.cc
> @@ -788,11 +788,6 @@ d_handle_target_clones_attribute (tree *node, tree name, 
> tree, int,
>warning (OPT_Wattributes, "%qE attribute ignored", name);
>*no_add_attrs = true;
>  }
> -  else
> -{
> -  /* Do not inline functions with multiple clone targets.  */
> -  DECL_UNINLINABLE (*node) = 1;
> -}
>  
>return NULL_TREE;
>  }


Re: [PATCH v8 2/2] aarch64: Add codegen support for AdvSIMD faminmax

2024-09-17 Thread Richard Sandiford
 writes:
> The AArch64 FEAT_FAMINMAX extension is optional from Armv9.2-a and
> mandatory from Armv9.5-a. It introduces instructions for computing the
> floating point absolute maximum and minimum of the two vectors
> element-wise.
>
> This patch adds code generation support for famax and famin in terms of
> existing RTL operators.
>
> famax/famin is equivalent to first taking abs of the operands and then
> taking smax/smin on the results of abs.
>
>   famax/famin (a, b) = smax/smin (abs (a), abs (b))
>
> This fusion of operators is only possible when -march=armv9-a+faminmax
> flags are passed. We also need to pass -ffast-math flag; if we don't,
> then a statement like
>
>   c[i] = __builtin_fmaxf16 (a[i], b[i]);
>
> is RTL expanded to UNSPEC_FMAXNM instead of smax (likewise for smin).
>
> This code generation is only available on -O2 or -O3 as that is when
> auto-vectorization is enabled.

The comment in 1/2 about not having a specific neon_fp_aminmax
scheduling type would apply here too.  It looks good otherwise,
but: following on from the SVE review, could you also add tests
to make sure that we don't fold vmax/vmaxnm/vmin/vminnm calls with
vabs calls?  I think the code handles it correctly, but it would
be worth having a test to "defend" the behavior.
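
Something analogous to the tests suggested in the SVE review would do,
e.g. (a rough sketch only; exact dg directives and scan patterns up to you):

#include "arm_neon.h"

/* Expect fabs + fmaxnm/fmax here, not famax: famax has slightly different
   behaviour, so the intrinsic combinations must not be fused.  */
float32x4_t
foo (float32x4_t x, float32x4_t y)
{
  return vmaxnmq_f32 (vabsq_f32 (x), vabsq_f32 (y));
}

float32x4_t
bar (float32x4_t x, float32x4_t y)
{
  return vmaxq_f32 (vabsq_f32 (x), vabsq_f32 (y));
}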

Thanks,
Richard

>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-simd.md
>   (*aarch64_faminmax_fused): Instruction pattern for faminmax
>   codegen.
>   * config/aarch64/iterators.md: Attribute for faminmax codegen.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/simd/faminmax-codegen-no-flag.c: New test.
>   * gcc.target/aarch64/simd/faminmax-codegen.c: New test.
> ---
>  gcc/config/aarch64/aarch64-simd.md|  10 +
>  gcc/config/aarch64/iterators.md   |   3 +
>  .../aarch64/simd/faminmax-codegen-no-flag.c   | 217 ++
>  .../aarch64/simd/faminmax-codegen.c   | 197 
>  4 files changed, 427 insertions(+)
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/simd/faminmax-codegen-no-flag.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/faminmax-codegen.c
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 7542c81ed91..8973cade488 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -9921,3 +9921,13 @@
>"\t%0., %1., %2."
>[(set_attr "type" "neon_fp_aminmax")]
>  )
> +
> +(define_insn "*aarch64_faminmax_fused"
> +  [(set (match_operand:VHSDF 0 "register_operand" "=w")
> + (FMAXMIN:VHSDF
> +   (abs:VHSDF (match_operand:VHSDF 1 "register_operand" "w"))
> +   (abs:VHSDF (match_operand:VHSDF 2 "register_operand" "w"]
> +  "TARGET_FAMINMAX"
> +  "\t%0., %1., %2."
> +  [(set_attr "type" "neon_fp_aminmax")]
> +)
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 17ac5e073aa..c2fcd18306e 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -4472,3 +4472,6 @@
>  
>  (define_int_attr faminmax_uns_op
>[(UNSPEC_FAMAX "famax") (UNSPEC_FAMIN "famin")])
> +
> +(define_code_attr faminmax_op
> +  [(smax "famax") (smin "famin")])
> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/faminmax-codegen-no-flag.c 
> b/gcc/testsuite/gcc.target/aarch64/simd/faminmax-codegen-no-flag.c
> new file mode 100644
> index 000..d77f5a5d19f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/simd/faminmax-codegen-no-flag.c
> @@ -0,0 +1,217 @@
> +/* { dg-do assemble} */
> +/* { dg-additional-options "-O3 -ffast-math -march=armv9-a" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +
> +#include "arm_neon.h"
> +
> +#pragma GCC target "+nosve"
> +
> +/*
> +** test_vamax_f16:
> +**   fabsv1.4h, v1.4h
> +**   fabsv0.4h, v0.4h
> +**   fmaxnm  v0.4h, v0.4h, v1.4h
> +**   ret
> +*/
> +float16x4_t
> +test_vamax_f16 (float16x4_t a, float16x4_t b)
> +{
> +  int i;
> +  float16x4_t c;
> +
> +  for (i = 0; i < 4; ++i) {
> +a[i] = __builtin_fabsf16 (a[i]);
> +b[i] = __builtin_fabsf16 (b[i]);
> +c[i] = __builtin_fmaxf16 (a[i], b[i]);
> +  }
> +  return c;
> +}
> +
> +/*
> +** test_vamaxq_f16:
> +**   fabsv1.8h, v1.8h
> +**   fabsv0.8h, v0.8h
> +**   fmaxnm  v0.8h, v0.8h, v1.8h
> +**   ret
> +*/
> +float16x8_t
> +test_vamaxq_f16 (float16x8_t a, float16x8_t b)
> +{
> +  int i;
> +  float16x8_t c;
> +
> +  for (i = 0; i < 8; ++i) {
> +a[i] = __builtin_fabsf16 (a[i]);
> +b[i] = __builtin_fabsf16 (b[i]);
> +c[i] = __builtin_fmaxf16 (a[i], b[i]);
> +  }
> +  return c;
> +}
> +
> +/*
> +** test_vamax_f32:
> +**   fabsv1.2s, v1.2s
> +**   fabsv0.2s, v0.2s
> +**   fmaxnm  v0.2s, v0.2s, v1.2s
> +**   ret
> +*/
> +float32x2_t
> +test_vamax_f32 (float32x2_t a, float32x2_t b)
> +{
> +  int i;
> +  float32x2_t c;
> +
> +  for (i = 0; i < 2; ++i) {
> +a[i] = __builtin_fabsf32 (a[i]);
> +b[i] = __builtin_fabsf32 (b[i]);
> +c[i]

Re: [PATCH v8 1/2] aarch64: Add AdvSIMD faminmax intrinsics

2024-09-17 Thread Richard Sandiford
 writes:
> The AArch64 FEAT_FAMINMAX extension is optional from Armv9.2-a and
> mandatory from Armv9.5-a. It introduces instructions for computing the
> floating point absolute maximum and minimum of the two vectors element-wise.
>
> This patch introduces AdvSIMD faminmax intrinsics. The intrinsics of
> this extension are implemented as the following builtin functions:
> * vamax_f16
> * vamaxq_f16
> * vamax_f32
> * vamaxq_f32
> * vamaxq_f64
> * vamin_f16
> * vaminq_f16
> * vamin_f32
> * vaminq_f32
> * vaminq_f64
>
> We are defining a new way to add AArch64 AdvSIMD intrinsics by listing
> all the intrinsics in a .def file and then using that .def file to
> initialise various data structures. This would lead to more concise code
> and easier addition of the new AdvSIMD intrinsics in future.
>
> The faminmax intrinsics are defined using the new approach.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-builtins.cc
>   (ENTRY): Macro to parse the contents of
>   aarch64-simd-pragma-builtins.def.
>   (enum aarch64_builtins): New enum values for faminmax builtins
>   via aarch64-simd-pragma-builtins.def.
>   (enum aarch64_builtin_signatures): Enum to specify the
>   number of operands a builtin will take.
>   (ENTRY_VHSDF): Macro to parse the contents of
>   aarch64-simd-pragma-builtins.def.
>   (struct aarch64_pragma_builtins_data): Struct to hold data from
>   aarch64-simd-pragma-builtins.def.
>   (aarch64_fntype): New function to define function types of
>   intrinsics given an object of type aarch64_pragma_builtins_data.
>   (aarch64_init_pragma_builtins): New function to define pragma
>   builtins.
>   (aarch64_get_pragma_builtin): New function to get a row of
>   aarch64_pragma_builtins, given code.
>   (handle_arm_neon_h): Modify to call
>   aarch64_init_pragma_builtins.
>   (aarch64_general_check_builtin_call): Modify to check whether
>   required flag is being used for pragma builtins.
>   (aarch64_expand_pragma_builtin): New function to emit
>   instructions of pragma_builtin.
>   (aarch64_general_expand_builtin): Modify to call
>   aarch64_expand_pragma_builtin.
>   * config/aarch64/aarch64-option-extensions.def
>   (AARCH64_OPT_EXTENSION): Introduce new flag for this extension.
>   * config/aarch64/aarch64-simd.md
>   (@aarch64_): Instruction pattern for
>   faminmax intrinsics.
>   * config/aarch64/aarch64.h
>   (TARGET_FAMINMAX): Introduce new flag for this extension.
>   * config/aarch64/iterators.md: New iterators and unspecs.
>   * config/arm/types.md: Introduce neon_fp_aminmax attributes.
>   * doc/invoke.texi: Document extension in AArch64 Options.
>   * config/aarch64/aarch64-simd-pragma-builtins.def: New file to
>   list pragma builtins.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/simd/faminmax-builtins-no-flag.c: New test.
>   * gcc.target/aarch64/simd/faminmax-builtins.c: New test.
> ---
>  gcc/config/aarch64/aarch64-builtins.cc| 123 ++
>  .../aarch64/aarch64-option-extensions.def |   2 +
>  .../aarch64/aarch64-simd-pragma-builtins.def  |  23 
>  gcc/config/aarch64/aarch64-simd.md|  11 ++
>  gcc/config/aarch64/aarch64.h  |   4 +
>  gcc/config/aarch64/iterators.md   |   9 ++
>  gcc/config/arm/types.md   |   5 +
>  gcc/doc/invoke.texi   |   2 +
>  .../aarch64/simd/faminmax-builtins-no-flag.c  |  10 ++
>  .../aarch64/simd/faminmax-builtins.c  | 115 
>  10 files changed, 304 insertions(+)
>  create mode 100644 gcc/config/aarch64/aarch64-simd-pragma-builtins.def
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/simd/faminmax-builtins-no-flag.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/faminmax-builtins.c
>
> diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
> b/gcc/config/aarch64/aarch64-builtins.cc
> index eb878b933fe..6e64ae86c52 100644
> --- a/gcc/config/aarch64/aarch64-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-builtins.cc
> @@ -757,6 +757,18 @@ typedef struct
>  #define VAR1(T, N, MAP, FLAG, A) \
>AARCH64_SIMD_BUILTIN_##T##_##N##A,
>  
> +#undef ENTRY
> +#define ENTRY(N, S, M, U, F) \
> +  AARCH64_##N,
> +
> +#undef ENTRY_VHSDF
> +#define ENTRY_VHSDF(NAME, SIGNATURE, UNSPEC, EXTENSIONS) \
> +  AARCH64_##NAME##_f16, \
> +  AARCH64_##NAME##q_f16, \
> +  AARCH64_##NAME##_f32, \
> +  AARCH64_##NAME##q_f32, \
> +  AARCH64_##NAME##q_f64,
> +
>  enum aarch64_builtins
>  {
>AARCH64_BUILTIN_MIN,
> @@ -829,6 +841,10 @@ enum aarch64_builtins
>AARCH64_RBIT,
>AARCH64_RBITL,
>AARCH64_RBITLL,
> +  /* Pragma builtins.  */
> +  AARCH64_PRAGMA_BUILTIN_START,
> +#include "aarch64-simd-pragma-builtins.def"
> +  AARCH64_PRAGMA_BUILTIN_END,
>/* System register builtins.  */
>AARCH64_RSR,
>AARCH64_RSRP,
> @@ -947,6 +963,7 @@ const char *aarch64_

Re: [PATCH 2/2] aarch64: Add codegen support for SVE2 faminmax

2024-09-17 Thread Richard Sandiford
Kyrylo Tkachov  writes:
> Hi Saurabh,
>
>> On 13 Sep 2024, at 11:06, saurabh@arm.com wrote:
>> 
>> External email: Use caution opening links or attachments
>> 
>> 
>> The AArch64 FEAT_FAMINMAX extension is optional from Armv9.2-a and
>> mandatory from Armv9.5-a. It introduces instructions for computing the
>> floating point absolute maximum and minimum of the two vectors
>> element-wise.
>> 
>> This patch adds code generation for famax and famin in terms of existing
>> unspecs. With this patch:
>> 1. famax can be expressed as taking fmax/fmaxnm of the two operands and
>>  then taking absolute value of their result.
>> 2. famin can be expressed as taking fmin/fminnm of the two operands and
>>  then taking absolute value of their result.
>> 
>> This fusion of operators is only possible when
>> -march=armv9-a+faminmax+sve flags are passed.
>> 
>> This code generation is only available on -O2 or -O3 as that is when
>> auto-vectorization is enabled.
>> 
>> gcc/ChangeLog:
>> 
>>   * config/aarch64/aarch64-sve.md
>>   (*aarch64_pred_faminmax_fused): Instruction pattern for faminmax
>>   codegen.
>>   * config/aarch64/iterators.md: Attribute for faminmax codegen.
>> 
>> gcc/testsuite/ChangeLog:
>> 
>>   * gcc.target/aarch64/sve/faminmax.c: New test.
>> ---
>> gcc/config/aarch64/aarch64-sve.md | 29 +++
>> gcc/config/aarch64/iterators.md   |  6 ++
>> .../gcc.target/aarch64/sve/faminmax.c | 85 +++
>> 3 files changed, 120 insertions(+)
>> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/faminmax.c
>> 
>
> diff --git a/gcc/config/aarch64/aarch64-sve.md 
> b/gcc/config/aarch64/aarch64-sve.md
> index a5cd42be9d5..feb6438efde 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -1,3 +1,32 @@
> return "sel\t%0., %3, %2., %1.";
>   }
> )
>
> A slight tangent, maybe more of a question for Richard, but should we be
> putting these extensions into aarch64-sve2.md or aarch64-sve.md?
> It looks like the architecture had a major extension with SVE2 over SVE, so
> it made sense to create aarch64-sve2.md, but now the incremental improvements
> can be considered an extension to either?

Yeah, good question. :)  I guess the sve/sve2 split doesn't make much
sense any more.  But while we have it, new patterns that are specific
to SVE2+ should probably go in aarch64-sve2.md.

> +;; -
> +;; -- [FP] Absolute maximum and minimum
> +;; -
> +;; Includes:
> +;; - FAMAX
> +;; - FAMIN
> +;; -
> +
> +;; Predicated floating-point absolute maximum and minimum.
> +(define_insn "*aarch64_pred_faminmax_fused"
> +  [(set (match_operand:SVE_FULL_F 0 "register_operand" "=w")
> + (unspec:SVE_FULL_F
> +   [(match_operand: 1 "register_operand" "Upl")
> +(match_operand:SI 4 "aarch64_sve_gp_strictness" "w")
> +(unspec:SVE_FULL_F
> +  [(match_operand 5)
> +   (const_int SVE_RELAXED_GP)
> +   (match_operand:SVE_FULL_F 2 "register_operand" "w")]
> +  UNSPEC_COND_FABS)
> +(unspec:SVE_FULL_F
> +  [(match_operand 6)
> +   (const_int SVE_RELAXED_GP)
> +   (match_operand:SVE_FULL_F 3 "register_operand" "w")]
> +  UNSPEC_COND_FABS)]
> +   SVE_COND_FP_MAXMIN))]
> +  "TARGET_SVE_FAMINMAX"
> +  "\t%0., %1/m, %0., %3.”
>
> This output pattern is missing operand 2.

Yeah.  We should use the same formulation as elsewhere to support:

- operand 2 tied to operand 0
- operand 3 tied to operand 0 (through commutativity)
- all three are separate register (using movprfx)

@aarch64_pred_abd is an example of a similar commutative operation.

I don't think this distinguishes between fmax(nm)s that came from
intrinsics and fmaxnms that came from the smax optab.  The former
can't be optimised, since famax has slightly different behaviour.
The latter can, because smax on a float is inherently somewhat fuzzy.

I think we should also have tests that something like:

#include 

svfloat32_t foo(svfloat32_t x, svfloat32_t y) {
svbool_t pg = svptrue_b8();
return svmax_x(pg, svabs_x(pg, x), svabs_x(pg, y));
}

and

#include 

svfloat32_t foo(svfloat32_t x, svfloat32_t y) {
svbool_t pg = svptrue_b8();
return svmaxnm_x(pg, svabs_x(pg, x), svabs_x(pg, y));
}

are not optimised to famax even when famax is available.  This can be
done using scan-assemblers for the three individual instructions and
a scan-assembler-not for famax.

As for how to fix that: I think we'll need to use UNSPEC_COND_SMAX
and UNSPEC_COND_SMIN for "smax" and "smin" (even for floating-point
modes), rather than the current UNSPEC_COND_FMAXNM and UNSPEC_COND_FMINNM.
Code that wants to generate UNSPEC_COND_FMAXNM or UNSPEC_COND_FMINNM
directly can do it via the

Re: [PATCH v2 2/2] aarch64: Improve part-variable vector initialization with SVE INDEX instruction [PR113328]

2024-09-17 Thread Richard Sandiford
Pengxuan Zheng  writes:
> We can still use SVE's INDEX instruction to construct vectors even if not all
> elements are constants. For example, { 0, x, 2, 3 } can be constructed by
> first using "INDEX #0, #1" to generate { 0, 1, 2, 3 }, and then setting the
> elements which are not constant separately.
>
>   PR target/113328
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (aarch64_expand_vector_init_fallback):
>   Improve part-variable vector generation with SVE's INDEX if TARGET_SVE
>   is available.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/sve/acle/general/dupq_1.c: Update test to use
>   check-function-bodies.
>   * gcc.target/aarch64/sve/acle/general/dupq_2.c: Likewise.
>   * gcc.target/aarch64/sve/acle/general/dupq_3.c: Likewise.
>   * gcc.target/aarch64/sve/acle/general/dupq_4.c: Likewise.
>   * gcc.target/aarch64/sve/vec_init_4.c: New test.
>   * gcc.target/aarch64/sve/vec_init_5.c: New test.
>
> Signed-off-by: Pengxuan Zheng 
> ---
>  gcc/config/aarch64/aarch64.cc | 81 ++-
>  .../aarch64/sve/acle/general/dupq_1.c | 18 -
>  .../aarch64/sve/acle/general/dupq_2.c | 18 -
>  .../aarch64/sve/acle/general/dupq_3.c | 18 -
>  .../aarch64/sve/acle/general/dupq_4.c | 18 -
>  .../gcc.target/aarch64/sve/vec_init_4.c   | 47 +++
>  .../gcc.target/aarch64/sve/vec_init_5.c   | 12 +++
>  7 files changed, 199 insertions(+), 13 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/vec_init_4.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/vec_init_5.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 6b3ca57d0eb..7305a5c6375 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -23942,12 +23942,91 @@ aarch64_expand_vector_init_fallback (rtx target, 
> rtx vals)
>if (n_var != n_elts)
>  {
>rtx copy = copy_rtx (vals);
> +  bool is_index_seq = false;
> +
> +  /* If at least half of the elements of the vector are constants and all
> +  these constant elements form a linear sequence of the form { B, B + S,
> +  B + 2 * S, B + 3 * S, ... }, we can generate the vector with SVE's
> +  INDEX instruction if SVE is available and then set the elements which
> +  are not constant separately.  More precisely, each constant element I
> +  has to be B + I * S where B and S must be valid immediate operand for
> +  an SVE INDEX instruction.
> +
> +  For example, { X, 1, 2, 3} is a vector satisfying these conditions and
> +  we can generate a vector of all constants (i.e., { 0, 1, 2, 3 }) first
> +  and then set the first element of the vector to X.  */
> +
> +  if (TARGET_SVE && GET_MODE_CLASS (mode) == MODE_VECTOR_INT
> +   && n_var <= n_elts / 2)
> + {
> +   int const_idx = -1;
> +   HOST_WIDE_INT const_val = 0;
> +   int base = 16;
> +   int step = 16;
> +
> +   for (int i = 0; i < n_elts; ++i)
> + {
> +   rtx x = XVECEXP (vals, 0, i);
> +
> +   if (!CONST_INT_P (x))
> + continue;
> +
> +   if (const_idx == -1)
> + {
> +   const_idx = i;
> +   const_val = INTVAL (x);
> + }
> +   else
> + {
> +   if ((INTVAL (x) - const_val) % (i - const_idx) == 0)
> + {
> +   HOST_WIDE_INT s
> +   = (INTVAL (x) - const_val) / (i - const_idx);
> +   if (s >= -16 && s <= 15)
> + {
> +   int b = const_val - s * const_idx;
> +   if (b >= -16 && b <= 15)
> + {
> +   base = b;
> +   step = s;
> + }
> + }
> + }
> +   break;
> + }
> + }
> +
> +   if (base != 16
> +   && (!CONST_INT_P (v0)
> +   || (CONST_INT_P (v0) && INTVAL (v0) == base)))
> + {
> +   if (!CONST_INT_P (v0))
> + XVECEXP (copy, 0, 0) = GEN_INT (base);
> +
> +   is_index_seq = true;
> +   for (int i = 1; i < n_elts; ++i)
> + {
> +   rtx x = XVECEXP (copy, 0, i);
> +
> +   if (CONST_INT_P (x))
> + {
> +   if (INTVAL (x) != base + i * step)
> + {
> +   is_index_seq = false;
> +   break;
> + }
> + }
> +   else
> + XVECEXP (copy, 0, i) = GEN_INT (base + i * step);
> + }
> + }
> + }

This seems a bit more complex than I was hoping for, although the
complexity is probably justified.

Seeing how awkward it is to do this using current interfaces, I think
I'd instea

Re: [PATCH] SVE intrinsics: Fold svdiv with all-zero operands to zero vector

2024-09-16 Thread Richard Sandiford
Jennifer Schmitz  writes:
> From 287073b46e3eb9e7786f95c1d4bd90208db51f75 Mon Sep 17 00:00:00 2001
> From: Jennifer Schmitz 
> Date: Mon, 2 Sep 2024 06:46:57 -0700
> Subject: [PATCH] SVE intrinsics: Fold svdiv with all-zero operands to zero
>  vector
>
> This patch folds svdiv where one of the operands is all-zeros to a zero
> vector, if one of the following conditions holds:
> - the dividend is all zeros or
> - the divisor is all zeros, and the predicate is ptrue or the predication
> is _x or _z.
> This case was not covered by the recent patch that implemented constant
> folding, because that covered only cases where both operands are
> constant vectors. Here, the operation is folded as soon as one of the operands
> is a constant zero vector.
> Folding of division by 0 to return 0 is in accordance with
> the semantics of sdiv and udiv.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz 
>
> gcc/
>   * config/aarch64/aarch64-sve-builtins-base.cc (svdiv_impl::fold):
>   Add folding of all-zero operands to zero vector.
>
> gcc/testsuite/
>   * gcc.target/aarch64/sve/fold_div_zero.c: New test.
>   * gcc.target/aarch64/sve/const_fold_div_1.c: Adjust expected
>   outcome.

OK, thanks.

Richard


Re: [PATCH 1/2] aarch64: Add SVE2 faminmax intrinsics

2024-09-16 Thread Richard Sandiford
 writes:
> @@ -3082,6 +3084,8 @@
>  (define_int_iterator SVE_COND_ICVTF [UNSPEC_COND_SCVTF UNSPEC_COND_UCVTF])
>  
>  (define_int_iterator SVE_COND_FP_BINARY [UNSPEC_COND_FADD
> +  (UNSPEC_COND_FAMAX 
> "TARGET_SVE_FAMINMAX")
> +  (UNSPEC_COND_FAMIN 
> "TARGET_SVE_FAMINMAX")
>UNSPEC_COND_FDIV
>UNSPEC_COND_FMAX
>UNSPEC_COND_FMAXNM

Sorry for the trivia, but now that the lines exceed 80 characters,
could you reformat the whole define_int_iterator into the same style
as SVE_BFLOAT_TERNARY_LONG, with the "[" on a separate line, indented
by two spaces?  Similarly later.
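
i.e. something along these lines (just a sketch based on the entries
visible in the quoted hunk, with the remaining entries elided):

(define_int_iterator SVE_COND_FP_BINARY
  [UNSPEC_COND_FADD
   (UNSPEC_COND_FAMAX "TARGET_SVE_FAMINMAX")
   (UNSPEC_COND_FAMIN "TARGET_SVE_FAMINMAX")
   UNSPEC_COND_FDIV
   UNSPEC_COND_FMAX
   UNSPEC_COND_FMAXNM
   ...])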

> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/amax_f16.c 
> b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/amax_f16.c
> new file mode 100644
> index 000..2646f29e60c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/amax_f16.c
> @@ -0,0 +1,155 @@
> +/* { dg-additional-options "-O3 -march=armv9-a+sve+faminmax" } */
> +/* { dg-final { check-function-bodies "**" "" "-DCHECK_ASM" } } */

For all the tests, could you add:

  #pragma GCC target "+faminmax"

instead of using an -march line?

These tests are run with multiple options, so we also shouldn't use -O3
(or dg-options generally).

Looks good otherwise, and other than the stray file that Kyrill mentioned.

Thanks,
Richard


Re: [PATCH 1/2] aarch64: Improve vector constant generation using SVE INDEX instruction [PR113328]

2024-09-16 Thread Richard Sandiford
"Pengxuan Zheng (QUIC)"  writes:
>> On Thu, Sep 12, 2024 at 2:53 AM Pengxuan Zheng
>>  wrote:
>> >
>> > SVE's INDEX instruction can be used to populate vectors by values
>> > starting from "base" and incremented by "step" for each subsequent
>> > value. We can take advantage of it to generate vector constants if
>> > TARGET_SVE is available and the base and step values are within [-16, 15].
>> 
>> Are there multiplication by or addition of scalar immediate instructions to
>> enhance this with two-instruction sequences?
>
> No, Richard, I can't think of any equivalent two-instruction sequences.

There are some.  E.g.:

 { 16, 17, 18, 19, ... }

could be:

index   z0.b, #0, #1
add z0.b, z0.b, #16

or, alternatively:

mov w0, #16
index   z0.b, w0, #1

But these cases are less obviously a win, so I think it's ok to handle
single instructions only for now.

The patch is ok for trunk, thanks, but:

>> > @@ -22991,7 +22991,7 @@ aarch64_simd_valid_immediate (rtx op,
>> simd_immediate_info *info,
>> >if (CONST_VECTOR_P (op)
>> >&& CONST_VECTOR_DUPLICATE_P (op))
>> >  n_elts = CONST_VECTOR_NPATTERNS (op);
>> > -  else if ((vec_flags & VEC_SVE_DATA)
>> > +  else if (which == AARCH64_CHECK_MOV && TARGET_SVE
>> >&& const_vec_series_p (op, &base, &step))

...the convention is to have one && condition per line if the whole
expression doesn't fit on a single line:

  else if (which == AARCH64_CHECK_MOV
   && TARGET_SVE
   && const_vec_series_p (op, &base, &step))

Richard


Re: [PATCH] aarch64: Emit ADD X, Y, Y instead of SHL X, Y, #1 for SVE instructions

2024-09-12 Thread Richard Sandiford
Richard Biener  writes:
> On Thu, Sep 12, 2024 at 2:35 PM Richard Sandiford
>  wrote:
>>
>> Soumya AR  writes:
>> > On Neoverse V2, SVE ADD instructions have a throughput of 4, while shift
>> > instructions like SHL have a throughput of 2. We can lean on that to emit 
>> > code
>> > like:
>> >  add  z31.b, z31.b, z31.b
>> > instead of:
>> >  lsl  z31.b, z31.b, #1
>> >
>> > The implementation of this change for SVE vectors is similar to a prior 
>> > patch
>> > <https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659958.html> that 
>> > adds
>> > the above functionality for Neon vectors.
>> >
>> > Here, the machine descriptor pattern is split up to separately accommodate 
>> > left
>> > and right shifts, so we can specifically emit an add for all left shifts 
>> > by 1.
>>
>> Thanks for doing this.
>
> I do wonder whether our scheduling infrastructure has the ability to "mutate"
> instructions in cases like here if either adds or shifts exceed their
> available resources
> but there is a resource readily available in an alternate instruction form?

Yeah, that sounds like a useful feature in general.  But in this particular
case, the shift resources are a subset of the addition resources, so there
should never be a specific advantage to using shifts.

Thanks,
Richard

> Richard.
>
>> > The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
>> > regression.
>> > OK for mainline?
>> >
>> > Signed-off-by: Soumya AR 
>> >
>> > gcc/ChangeLog:
>> >
>> >   * config/aarch64/aarch64-sve.md (*post_ra_v3): Split 
>> > pattern to
>> >   accommodate left and right shifts separately.
>> >   (*post_ra_v_ashl3): Matches left shifts with additional 
>> > constraint to
>> >   check for shifts by 1.
>> >   (*post_ra_v_3): Matches right shifts.
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> >   * gcc.target/aarch64/sve/acle/asm/lsl_s16.c: Updated instances of 
>> > lsl-1 with
>> >   corresponding add
>> >   * gcc.target/aarch64/sve/acle/asm/lsl_s32.c: Likewise.
>> >   * gcc.target/aarch64/sve/acle/asm/lsl_s64.c: Likewise.
>> >   * gcc.target/aarch64/sve/acle/asm/lsl_s8.c: Likewise.
>> >   * gcc.target/aarch64/sve/acle/asm/lsl_u16.c: Likewise.
>> >   * gcc.target/aarch64/sve/acle/asm/lsl_u32.c: Likewise.
>> >   * gcc.target/aarch64/sve/acle/asm/lsl_u64.c: Likewise.
>> >   * gcc.target/aarch64/sve/acle/asm/lsl_u8.c: Likewise.
>> >   * gcc.target/aarch64/sve/acle/asm/lsl_wide_s16.c: Likewise.
>> >   * gcc.target/aarch64/sve/acle/asm/lsl_wide_s32.c: Likewise.
>> >   * gcc.target/aarch64/sve/acle/asm/lsl_wide_s8.c: Likewise.
>> >   * gcc.target/aarch64/sve/acle/asm/lsl_wide_u16.c: Likewise.
>> >   * gcc.target/aarch64/sve/acle/asm/lsl_wide_u32.c: Likewise.
>> >   * gcc.target/aarch64/sve/acle/asm/lsl_wide_u8.c: Likewise.
>> >   * gcc.target/aarch64/sve/adr_1.c: Likewise.
>> >   * gcc.target/aarch64/sve/adr_6.c: Likewise.
>> >   * gcc.target/aarch64/sve/cond_mla_7.c: Likewise.
>> >   * gcc.target/aarch64/sve/cond_mla_8.c: Likewise.
>> >   * gcc.target/aarch64/sve/shift_2.c: Likewise.
>> >   * gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather Likewise.
>> >   * gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_u64.c: Likewise.
>> >   * gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_s64.c: Likewise.
>> >   * gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_u64.c: Likewise.
>> >   * gcc.target/aarch64/sve2/acle/asm/rshl_s16.c: Likewise.
>> >   * gcc.target/aarch64/sve2/acle/asm/rshl_s32.c: Likewise.
>> >   * gcc.target/aarch64/sve2/acle/asm/rshl_s64.c: Likewise.
>> >   * gcc.target/aarch64/sve2/acle/asm/rshl_s8.c: Likewise.
>> >   * gcc.target/aarch64/sve2/acle/asm/rshl_u16.c: Likewise.
>> >   * gcc.target/aarch64/sve2/acle/asm/rshl_u32.c: Likewise.
>> >   * gcc.target/aarch64/sve2/acle/asm/rshl_u64.c: Likewise.
>> >   * gcc.target/aarch64/sve2/acle/asm/rshl_u8.c: Likewise.
>> >   * gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_s64.c: Likewise.
>> >   * gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_u64.c: Likewise.
>> >   * gcc.target/aarch64/sve/sve_shl_add.c: New test.
>> >
>> > From 94e9cbee44d42c60e94fe89e6ce57526206c13aa Mon Sep 17 00:00:00

Re: [PATCH] aarch64: Emit ADD X, Y, Y instead of SHL X, Y, #1 for SVE instructions

2024-09-12 Thread Richard Sandiford
Soumya AR  writes:
> On Neoverse V2, SVE ADD instructions have a throughput of 4, while shift
> instructions like SHL have a throughput of 2. We can lean on that to emit code
> like:
>  add  z31.b, z31.b, z31.b
> instead of:
>  lsl  z31.b, z31.b, #1
>
> The implementation of this change for SVE vectors is similar to a prior patch
> <https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659958.html> that adds
> the above functionality for Neon vectors.
>
> Here, the machine descriptor pattern is split up to separately accommodate 
> left
> and right shifts, so we can specifically emit an add for all left shifts by 
> 1. 

Thanks for doing this.

> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Soumya AR 
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-sve.md (*post_ra_v3): Split 
> pattern to
>   accommodate left and right shifts separately.
>   (*post_ra_v_ashl3): Matches left shifts with additional 
> constraint to
>   check for shifts by 1.
>   (*post_ra_v_3): Matches right shifts.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/sve/acle/asm/lsl_s16.c: Updated instances of lsl-1 
> with
>   corresponding add
>   * gcc.target/aarch64/sve/acle/asm/lsl_s32.c: Likewise. 
>   * gcc.target/aarch64/sve/acle/asm/lsl_s64.c: Likewise.
>   * gcc.target/aarch64/sve/acle/asm/lsl_s8.c: Likewise.
>   * gcc.target/aarch64/sve/acle/asm/lsl_u16.c: Likewise.
>   * gcc.target/aarch64/sve/acle/asm/lsl_u32.c: Likewise.
>   * gcc.target/aarch64/sve/acle/asm/lsl_u64.c: Likewise.
>   * gcc.target/aarch64/sve/acle/asm/lsl_u8.c: Likewise.
>   * gcc.target/aarch64/sve/acle/asm/lsl_wide_s16.c: Likewise.
>   * gcc.target/aarch64/sve/acle/asm/lsl_wide_s32.c: Likewise.
>   * gcc.target/aarch64/sve/acle/asm/lsl_wide_s8.c: Likewise.
>   * gcc.target/aarch64/sve/acle/asm/lsl_wide_u16.c: Likewise.
>   * gcc.target/aarch64/sve/acle/asm/lsl_wide_u32.c: Likewise.
>   * gcc.target/aarch64/sve/acle/asm/lsl_wide_u8.c: Likewise.
>   * gcc.target/aarch64/sve/adr_1.c: Likewise.
>   * gcc.target/aarch64/sve/adr_6.c: Likewise.
>   * gcc.target/aarch64/sve/cond_mla_7.c: Likewise.
>   * gcc.target/aarch64/sve/cond_mla_8.c: Likewise.
>   * gcc.target/aarch64/sve/shift_2.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_u64.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_s64.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_u64.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/rshl_s16.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/rshl_s32.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/rshl_s64.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/rshl_s8.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/rshl_u16.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/rshl_u32.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/rshl_u64.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/rshl_u8.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_s64.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_u64.c: Likewise.
>   * gcc.target/aarch64/sve/sve_shl_add.c: New test.
>
> From 94e9cbee44d42c60e94fe89e6ce57526206c13aa Mon Sep 17 00:00:00 2001
> From: Soumya AR 
> Date: Tue, 10 Sep 2024 14:18:44 +0530
> Subject: [PATCH] aarch64: Emit ADD X, Y, Y instead of SHL X, Y, #1 for SVE
>  instructions.
>
> On Neoverse V2, SVE ADD instructions have a throughput of 4, while shift
> instructions like SHL have a throughput of 2. We can lean on that to emit code
> like:
>  add  z31.b, z31.b, z31.b
> instead of:
>  lsl  z31.b, z31.b, #1
>
> The implementation of this change for SVE vectors is similar to a prior patch
>  that adds
> the above functionality for Neon vectors.
>
> Here, the machine descriptor pattern is split up to separately accommodate 
> left
> and right shifts, so we can specifically emit an add for all left shifts by 1.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Soumya AR 
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-sve.md (*post_ra_v3): Split 
> pattern to
>   accommodate left and right shifts separately.
>   (*post_ra_v_ashl3): Matches left shifts with additional 
> constraint to
>   check for shifts by 1.
>   (*post_ra_v_3): Matches right shifts.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/sve/acle/asm/lsl_s16.c: Updated instances of lsl-1 
> with
>   corresponding add
>   * gcc.target/aarch64/sve/acle/asm/lsl_s32.c: Likewise.
>   * gcc.target/aarch64/sve/acle/asm/lsl_s64.c: Likewise.
>   * gcc.target/aarch64/sve/acle/asm/lsl_s8.c: Likewise.
>   * gcc.target/a

Re: [PATCH] vect: Do not try to duplicate_and_interleave one-element mode.

2024-09-09 Thread Richard Sandiford
Richard Biener  writes:
>> Am 06.09.2024 um 16:05 schrieb Robin Dapp :
>> 
>> Hi,
>> 
>> PR112694 shows that we try to create sub-vectors of single-element
>> vectors because can_duplicate_and_interleave_p returns true.
>
> Can we avoid querying the function?  CCing Richard who should know more about 
> this.
>
> Richard 
>
>> The problem resurfaced in PR116611.
>> 
>> This patch makes can_duplicate_and_interleave_p return false
>> if count / nvectors == 0 and removes the corresponding check in the riscv
>> backend.
>> 
>> This partially gets rid of the FAIL in slp-19a.c.  At least when built
>> with cost model we don't have LOAD_LANES anymore.  Without cost model,
>> as in the test suite, we choose a different path and still end up with
>> LOAD_LANES.

Could you walk me through the failure in more detail?  It sounds
like can_duplicate_and_interleave_p eventually gets to the point of
subdividing the original elements, instead of either combining consecutive
elements (the best case) or leaving them as-is (the expected fallback
for SVE).  That is, it sounds like those two attempts fail in this case,
but an attempt to subdivide the elements succeeds.  Is that right?
And if so, why does that happen?

Thanks,
Richard

>> 
>> Bootstrapped and regtested on x86 and power10, regtested on
>> rv64gcv_zvfh_zvbb.  Still waiting for the aarch64 results.
>> 
>> Regards
>> Robin
>> 
>> gcc/ChangeLog:
>> 
>>PR target/112694
>>PR target/116611.
>> 
>>* config/riscv/riscv-v.cc (expand_vec_perm_const): Remove early
>>return.
>>* tree-vect-slp.cc (can_duplicate_and_interleave_p): Return
>>false when we cannot create sub-elements.
>> ---
>> gcc/config/riscv/riscv-v.cc | 9 -
>> gcc/tree-vect-slp.cc| 4 
>> 2 files changed, 4 insertions(+), 9 deletions(-)
>> 
>> diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
>> index 9b6c3a21e2d..5c5ed63d22e 100644
>> --- a/gcc/config/riscv/riscv-v.cc
>> +++ b/gcc/config/riscv/riscv-v.cc
>> @@ -3709,15 +3709,6 @@ expand_vec_perm_const (machine_mode vmode, 
>> machine_mode op_mode, rtx target,
>>  mask to do the iteration loop control. Just disable it directly.  */
>>   if (GET_MODE_CLASS (vmode) == MODE_VECTOR_BOOL)
>> return false;
>> -  /* FIXME: Explicitly disable VLA interleave SLP vectorization when we
>> - may encounter ICE for poly size (1, 1) vectors in loop vectorizer.
>> - Ideally, middle-end loop vectorizer should be able to disable it
>> - itself, We can remove the codes here when middle-end code is able
>> - to disable VLA SLP vectorization for poly size (1, 1) VF.  */
>> -  if (!BYTES_PER_RISCV_VECTOR.is_constant ()
>> -  && maybe_lt (BYTES_PER_RISCV_VECTOR * TARGET_MAX_LMUL,
>> -   poly_int64 (16, 16)))
>> -return false;
>> 
>>   struct expand_vec_perm_d d;
>> 
>> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
>> index 3d2973698e2..17b59870c69 100644
>> --- a/gcc/tree-vect-slp.cc
>> +++ b/gcc/tree-vect-slp.cc
>> @@ -434,6 +434,10 @@ can_duplicate_and_interleave_p (vec_info *vinfo, 
>> unsigned int count,
>>   unsigned int nvectors = 1;
>>   for (;;)
>> {
>> +  /* We need to be able to fuse COUNT / NVECTORS elements together,
>> + so no point in continuing if there are none.  */
>> +  if (nvectors > count)
>> +return false;
>>   scalar_int_mode int_mode;
>>   poly_int64 elt_bits = elt_bytes * BITS_PER_UNIT;
>>   if (int_mode_for_size (elt_bits, 1).exists (&int_mode))
>> --
>> 2.46.0
>> 
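
To make the effect of the new guard concrete, here is a rough control-flow
model of the loop (a sketch only, not the real function, which also queries
the target's integer and vector modes): with a single-element group
(count == 1), the first failed attempt doubles nvectors to 2 and the guard
then bails out instead of trying to build sub-element vectors.

  bool
  can_fuse_group_p (unsigned int count)
  {
    for (unsigned int nvectors = 1; ; nvectors *= 2)
      {
        /* New guard: we need to fuse COUNT / NVECTORS elements together,
           so there is no point in continuing if there are none.  */
        if (nvectors > count)
          return false;
        /* ... otherwise try to find an integer mode that holds
           COUNT / NVECTORS of the original elements, and check whether
           the target supports the resulting vector modes ...  */
      }
  }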


Re: [PATCH 4/4]AArch64: Define VECTOR_STORE_FLAG_VALUE.

2024-09-09 Thread Richard Sandiford
Kyrylo Tkachov  writes:
>> On 9 Sep 2024, at 11:06, Tamar Christina  wrote:
>> 
>> External email: Use caution opening links or attachments
>> 
>> 
>>> -Original Message-
>>> From: Richard Sandiford 
>>> Sent: Monday, September 9, 2024 9:29 AM
>>> To: Tamar Christina 
>>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>>> ; Marcus Shawcroft
>>> ; ktkac...@gcc.gnu.org
>>> Subject: Re: [PATCH 4/4]AArch64: Define VECTOR_STORE_FLAG_VALUE.
>>> 
>>> Tamar Christina  writes:
>>>> Hi All,
>>>> 
>>>> This defines VECTOR_STORE_FLAG_VALUE to CONST1_RTX for AArch64
>>>> so we simplify vector comparisons in AArch64.
>>>> 
>>>> With this enabled
>>>> 
>>>> res:
>>>>moviv0.4s, 0
>>>>cmeqv0.4s, v0.4s, v0.4s
>>>>ret
>>>> 
>>>> is simplified to:
>>>> 
>>>> res:
>>>>mvniv0.4s, 0
>>>>ret
>>>> 
>>>> NOTE: I don't really like the testcase as it depends on an
>>>> uninitialised value to hide the constant from GIMPLE.
>>>> 
>>>> Happy to go with something else if there are any suggestions.
>>>> I thought about an RTL testcase, but those seem painful.
>>> 
>>> Like you say, I think an RTL testcase would be better.  Could you use
>>> the attached (for gcc.dg/rtl/aarch64)?
>>> 
>> 
>> Thanks, do you have any tips for writing these? If there a way to dump a
>> skeleton like with the gimple tests?

TBH I just take one I wrote earlier that looks relatively close,
then adapt it.  They're not something I could easily write from scratch.

> As a tangent, I wonder if the RTL dump logic can be extended to have a 
> dump-for-rtl-testcase mode, under the reasoning that creating RTL test cases 
> for ICE fixes is a common action. It could even be used in the EMERGENCY DUMP 
> case when dumping during an ICE.

Yeah, perhaps.  But I think there is an advantage in writing the test
by hand, to make sure that everything in the test is necessary and
relatively future-proof.

Often, the reason for using an RTL testcase is that it isn't easy
to create a minimal RTL reproducer from gimple.  Most EMERGENCY DUMPs
from non-RTL testcases are therefore likely to contain stuff that isn't
directly relevant.

Thanks,
Richard



Re: [PATCH] SVE intrinsics: Fold svdiv with all-zero operands to zero vector

2024-09-09 Thread Richard Sandiford
Jennifer Schmitz  writes:
> This patch folds svdiv where one of the operands is all-zeros to a zero
> vector, if the predicate is ptrue or the predication is _x or _z.
> This case was not covered by the recent patch that implemented constant
> folding, because that covered only cases where both operands are
> constant vectors. Here, the operation is folded as soon as one of the operands
> is a constant zero vector.
> Folding of division by 0 to return 0 is in accordance with
> the semantics of sdiv and udiv.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz 
>
> gcc/
>   * config/aarch64/aarch64-sve-builtins-base.cc (svdiv_impl::fold):
>   Add folding of all-zero operands to zero vector.
>
> gcc/testsuite/
>   * gcc.target/aarch64/sve/fold_div_zero.c: New test.
>   * gcc.target/aarch64/sve/const_fold_div_1.c: Adjust expected
>   outcome.
>
> From 1d50cc57cd3bbe19a48b7bbb543ea331cbd9a6f6 Mon Sep 17 00:00:00 2001
> From: Jennifer Schmitz 
> Date: Mon, 2 Sep 2024 06:46:57 -0700
> Subject: [PATCH] SVE intrinsics: Fold svdiv with all-zero operands to zero
>  vector
>
> This patch folds svdiv where one of the operands is all-zeros to a zero
> vector, if the predicate is ptrue or the predication is _x or _z.
> This case was not covered by the recent patch that implemented constant
> folding, because that covered only cases where both operands are
> constant vectors. Here, the operation is folded as soon as one of the operands
> is a constant zero vector.
> Folding of division by 0 to return 0 is in accordance with
> the semantics of sdiv and udiv.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz 
>
> gcc/
>   * config/aarch64/aarch64-sve-builtins-base.cc (svdiv_impl::fold):
>   Add folding of all-zero operands to zero vector.
>
> gcc/testsuite/
>   * gcc.target/aarch64/sve/fold_div_zero.c: New test.
>   * gcc.target/aarch64/sve/const_fold_div_1.c: Adjust expected
>   outcome.
> ---
>  .../aarch64/aarch64-sve-builtins-base.cc  |  38 +-
>  .../gcc.target/aarch64/sve/const_fold_div_1.c |  12 +-
>  .../gcc.target/aarch64/sve/fold_div_zero.c| 369 ++
>  3 files changed, 402 insertions(+), 17 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/fold_div_zero.c
>
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> index 6c94d144dc9..3ec9ebbf6ef 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> @@ -758,30 +758,50 @@ public:
>  if (auto *res = f.fold_const_binary (TRUNC_DIV_EXPR))
>return res;
>  
> -/* If the divisor is a uniform power of 2, fold to a shift
> -   instruction.  */
> +tree pg = gimple_call_arg (f.call, 0);
> +tree op1 = gimple_call_arg (f.call, 1);
>  tree op2 = gimple_call_arg (f.call, 2);
> -tree divisor_cst = uniform_integer_cst_p (op2);
> +bool pred_fold = f.pred != PRED_m
> +  || is_ptrue (pg, f.type_suffix (0).element_bytes);
>  
> -if (!divisor_cst || !integer_pow2p (divisor_cst))
> +/* If the dividend is all zeros, fold to zero vector.  */
> +tree op1_cst = uniform_integer_cst_p (op1);
> +if (op1_cst && pred_fold && integer_zerop (op1_cst))
> +  return gimple_build_assign (f.lhs, op1);

This fold is ok for all predication types, since _m merges with
the first input.  There's also no need to apply uniform_integer_cst_p
manually, since integer_zerop handles vectors too.  So I think this can be:

/* If the dividend is all zeros, fold to zero vector.  */
if (integer_zerop (op1))
  return gimple_build_assign (f.lhs, op1);

(The new _m cases would need tests though!)
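
As an illustration of what the fold buys us at the source level (an example
added here, not from the patch; the function name is made up), a call with a
constant all-zeros dividend should now collapse to a zero vector at gimple
time:

  #include <arm_sve.h>

  svint32_t
  div_zero_dividend (svbool_t pg, svint32_t x)
  {
    /* Dividend is a constant zero vector, so the whole call is expected
       to fold to a zero vector for _x/_z/ptrue (and, per the comment
       above, for _m too).  */
    return svdiv_s32_x (pg, svdup_n_s32 (0), x);
  }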

> +
> +/* If the divisor is all zeros, fold to zero vector.  */
> +tree op2_cst = uniform_integer_cst_p (op2);
> +if (!op2_cst)
>return NULL;
>  
> +if (pred_fold && integer_zerop (op2_cst))
> +  {
> + gimple_seq stmts = NULL;
> + tree op2_vec = f.force_vector (stmts, TREE_TYPE (op1), op2);
> + gsi_insert_seq_before (f.gsi, stmts, GSI_SAME_STMT);
> + return gimple_build_assign (f.lhs, op2_vec);
> +  }

This would be simpler as:

if (integer_zerop (op2_cst)
&& (f.pred != PRED_m
|| is_ptrue (pg, f.type_suffix (0).element_bytes)))
  return gimple_build_assign (f.lhs, build_zero_cst (TREE_TYPE (f.lhs)));

(I've dropped the pred_fold variable, since it is only valid for
things that fold to zero.  For everything else we'd need == PRED_x
instead.)

> +
> +/* If the divisor is a uniform power of 2, fold to a shift
> +   instruction.  */
> +if (!integer_pow2p (op2_cst))
> +  return NULL;
>  tree new_divisor;
>  gcall *call;

Very minor nit, but: given the line spacing in the funct

Re: [PATCH 4/4]AArch64: Define VECTOR_STORE_FLAG_VALUE.

2024-09-09 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> This defines VECTOR_STORE_FLAG_VALUE to CONST1_RTX for AArch64
> so we simplify vector comparisons in AArch64.
>
> With this enabled
>
> res:
> moviv0.4s, 0
> cmeqv0.4s, v0.4s, v0.4s
> ret
>
> is simplified to:
>
> res:
> mvniv0.4s, 0
> ret
>
> NOTE: I don't really like the testcase as it depends on an
> uninitialised value to hide the constant from GIMPLE.
>
> Happy to go with something else if there are any suggestions.
> I thought about an RTL testcase, but those seem painful.

Like you say, I think an RTL testcase would be better.  Could you use
the attached (for gcc.dg/rtl/aarch64)?

> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.h (VECTOR_STORE_FLAG_VALUE): New.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/vector-cmp-rtl-elim.c: New test.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> index 
> 2dfb999bea53414498a2355bb30db938f6b94100..b99f69103ab7e1d44e5e41ee89fb9a74450c57ca
>  100644
> --- a/gcc/config/aarch64/aarch64.h
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -156,6 +156,8 @@
>  
>  #define PCC_BITFIELD_TYPE_MATTERS1
>  
> +#define VECTOR_STORE_FLAG_VALUE(MODE) CONST1_RTX (GET_MODE_INNER (MODE))
> +

I think it'd be useful to capture the reasons we discussed internally
for preferring this choice.

/* Use the same RTL truth representation for vector elements as we do
   for scalars.  This maintains the property that a comparison like
   eq:V4SI is a composition of 4 individual eq:SIs, just like plus:V4SI
   is a composition of 4 individual plus:SIs.

   This means that Advanced SIMD comparisons are represented in RTL as
   (neg (op ...)).  */

OK with those changes, thanks.

Richard

>  #ifndef USED_FOR_TARGET
>  
>  /* Define an enum of all features (ISA modes, architectures and extensions).
> diff --git a/gcc/testsuite/gcc.target/aarch64/vector-cmp-rtl-elim.c 
> b/gcc/testsuite/gcc.target/aarch64/vector-cmp-rtl-elim.c
> new file mode 100644
> index 
> ..d67baa216d8332a26bdc64350402b77d87379f28
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vector-cmp-rtl-elim.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#include 
> +
> +/*
> +** res:
> +**   mvniv0.4s, 0
> +**   ret
> +*/
> +uint32x4_t res ()
> +{
> +  uint32x4_t a;
> +  uint32x4_t b = {0, 0, 0, 0};
> +  return vceqq_u32 (a, b);
> +}
> +

/* { dg-do compile { target aarch64-*-* } } */
/* { dg-additional-options "-O2" } */
/* { dg-final { check-function-bodies "**" "" "" } } */

/*
** foo:
**	mvni	v0.4s, 0
**	ret
*/
__Uint32x4_t __RTL (startwith ("vregs")) foo (void)
{
(function "foo"
  (insn-chain
(block 2
  (edge-from entry (flags "FALLTHRU"))
  (cnote 1 [bb 2] NOTE_INSN_BASIC_BLOCK)
  (cnote 2 NOTE_INSN_FUNCTION_BEG)
  (cinsn 3 (set (reg:V4SI <0>) (const_vector:V4SI [(const_int 0) (const_int 0) (const_int 0) (const_int 0)])))
  (cinsn 4 (set (reg:V4SI <1>) (reg:V4SI <0>)))
  (cinsn 5 (set (reg:V4SI <2>)
		(neg:V4SI (eq:V4SI (reg:V4SI <0>) (reg:V4SI <1>)))))
  (cinsn 6 (set (reg:V4SI v0) (reg:V4SI <2>)))
  (edge-to exit (flags "FALLTHRU"))
)
  )
  (crtl (return_rtx (reg/i:V4SI v0)))
)
}


Re: [PATCH 3/4][rtl]: simplify boolean vector EQ and NE comparisons

2024-09-06 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> This adds vector constant simplification for EQ and NE.  This is useful since
> the vectorizer generates a lot more vector compares now, in particular NE and 
> EQ
> and so these help us optimize cases where the values were not known at GIMPLE
> but instead only at RTL.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu, arm-none-linux-gnueabihf,
> x86_64-pc-linux-gnu -m32, -m64 and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * simplify-rtx.cc (simplify_context::simplify_unary_operation): Try
>   simplifying operand.
>   (simplify_const_relational_operation): Simplify vector EQ and NE.
>   (test_vector_int_const_compare): New.
>   (test_vector_int_const_compare_ops): New.
>   (simplify_rtx_cc_tests): Use them.
>
> ---
>
> diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
> index 
> a20a61c5dddbc80b23a9489d925a2c31b2163458..7e83e80246b70c81c388e77967f645d171efe983
>  100644
> --- a/gcc/simplify-rtx.cc
> +++ b/gcc/simplify-rtx.cc
> @@ -886,6 +886,10 @@ simplify_context::simplify_unary_operation (rtx_code 
> code, machine_mode mode,
>  
>trueop = avoid_constant_pool_reference (op);
>  
> +  /* If the operand is not a reg or constant try simplifying it first.  */
> +  if (rtx tmp_op = simplify_rtx (op))
> +op = tmp_op;
> +

We shouldn't need to do this.  The assumption is that the operands are
already simplified.

Which caller required this?

>tem = simplify_const_unary_operation (code, mode, trueop, op_mode);
>if (tem)
>  return tem;
> @@ -6354,6 +6358,35 @@ simplify_const_relational_operation (enum rtx_code 
> code,
>   return 0;
>  }
>  
> +  /* Check if the operands are a vector EQ or NE comparison.  */
> +  if (VECTOR_MODE_P (mode)
> +  && INTEGRAL_MODE_P (mode)
> +  && GET_CODE (op0) == CONST_VECTOR
> +  && GET_CODE (op1) == CONST_VECTOR
> +  && (code == EQ || code == NE))
> +{
> +  if (rtx_equal_p (op0, op1))
> + return code == EQ ? const_true_rtx : const0_rtx;
> +
> +  unsigned int npatterns0, npatterns1;
> +  if (CONST_VECTOR_NUNITS (op0).is_constant (&npatterns0)
> +   && CONST_VECTOR_NUNITS (op1).is_constant (&npatterns1))
> + {
> +   if (npatterns0 != npatterns1)
> + return code == EQ ? const0_rtx : const_true_rtx;

This looks like a typing error.  The operands have to have the same
number of elements.  But...

> +
> +   for (unsigned i = 0; i < npatterns0; i++)
> + {
> +   rtx val0 = CONST_VECTOR_ELT (op0, i);
> +   rtx val1 = CONST_VECTOR_ELT (op1, i);
> +   if (!rtx_equal_p (val0, val1))
> + return code == EQ ? const0_rtx : const_true_rtx;
> + }
> +
> +   return code == EQ ? const_true_rtx : const0_rtx;
> + }

...when is this loop needed?  For constant-sized vectors, isn't the
result always rtx_equal_p for EQ and !rtx_equal_p for NE?  If we have
equal vectors for which rtx_equal_p returns false then that should be
fixed.

For variable-sized vectors, I suppose the question is whether the
first unequal element is found in the minimum vector length, or whether
it only occurs for larger lengths.  In the former case we can fold at
compile time, but in the latter case we can't.

So we probably do want the loop for variable-length vectors, up to
constant_lower_bound (CONST_VECTOR_NUNITS (...)).
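
For illustration, one possible shape of that logic (a rough sketch with an
invented helper name, not a final patch): identical constant vectors fold
immediately, and otherwise we only fold to "unequal" if a mismatch shows up
within the minimum, compile-time-known number of elements.

  static rtx
  fold_const_vector_eq_ne (rtx_code code, rtx op0, rtx op1)
  {
    if (rtx_equal_p (op0, op1))
      return code == EQ ? const_true_rtx : const0_rtx;
    unsigned HOST_WIDE_INT n
      = constant_lower_bound (CONST_VECTOR_NUNITS (op0));
    for (unsigned HOST_WIDE_INT i = 0; i < n; ++i)
      if (!rtx_equal_p (CONST_VECTOR_ELT (op0, i), CONST_VECTOR_ELT (op1, i)))
	return code == EQ ? const0_rtx : const_true_rtx;
    /* For variable-length vectors the first mismatch might only occur
       beyond the minimum length, so we cannot tell at compile time.  */
    return NULL_RTX;
  }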

> +}
> +
>/* We can't simplify MODE_CC values since we don't know what the
>   actual comparison is.  */
>if (GET_MODE_CLASS (GET_MODE (op0)) == MODE_CC)
> @@ -8820,6 +8853,55 @@ test_vector_ops ()
>  }
>  }
>  
> +/* Verify vector constant comparisons for EQ and NE.  */
> +
> +static void
> +test_vector_int_const_compare (machine_mode mode)
> +{
> +  rtx zeros = CONST0_RTX (mode);
> +  rtx minusone = CONSTM1_RTX (mode);
> +  rtx series_0_1 = gen_const_vec_series (mode, const0_rtx, const1_rtx);
> +  ASSERT_RTX_EQ (const0_rtx,
> +  simplify_const_relational_operation (EQ, mode, zeros,
> +   CONST1_RTX (mode)));
> +  ASSERT_RTX_EQ (const_true_rtx,
> +  simplify_const_relational_operation (EQ, mode, zeros,
> +   CONST0_RTX (mode)));
> +  ASSERT_RTX_EQ (const_true_rtx,
> +  simplify_const_relational_operation (EQ, mode, minusone,
> +   CONSTM1_RTX (mode)));
> +  ASSERT_RTX_EQ (const_true_rtx,
> +  simplify_const_relational_operation (NE, mode, zeros,
> +   CONST1_RTX (mode)));
> +  ASSERT_RTX_EQ (const_true_rtx,
> +  simplify_const_relational_operation (NE, mode, zeros,
> +   series_0_1));
> +  ASSERT_RTX_EQ (const0_rtx,
> +  simplify_const_relational_operation (EQ, mode, zeros,
> +   series_0_1));

Re: [PATCH] aarch64: Use is_attribute_namespace_p and get_attribute_name inside aarch64_lookup_shared_state_flags [PR116598]

2024-09-06 Thread Richard Sandiford
Andrew Pinski  writes:
> The code in aarch64_lookup_shared_state_flags assumed that all C++11
> attributes on the function type had a namespace associated with them.
> But with the addition of reproducible/unsequenced, this is no longer true.
>
> This fixes the issue by using is_attribute_namespace_p instead of manually 
> figuring out
> the namespace is named "arm" and uses get_attribute_name instead of manually 
> grabbing
> the attribute name.
>
> Built and tested for aarch64-linux-gnu.
>
> gcc/ChangeLog:
>
>   PR target/116598
>   * config/aarch64/aarch64.cc (aarch64_lookup_shared_state_flags): Use
>   is_attribute_namespace_p and get_attribute_name instead of manually 
> grabbing
>   the namespace and name of the attribute.

OK, thanks.

Richard

> Signed-off-by: Andrew Pinski 
> ---
>  gcc/config/aarch64/aarch64.cc | 8 ++--
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 27e24ba70ab..6a3f1a23a9f 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -594,14 +594,10 @@ aarch64_lookup_shared_state_flags (tree attrs, const 
> char *state_name)
>  {
>for (tree attr = attrs; attr; attr = TREE_CHAIN (attr))
>  {
> -  if (!cxx11_attribute_p (attr))
> +  if (!is_attribute_namespace_p ("arm", attr))
>   continue;
>  
> -  auto ns = IDENTIFIER_POINTER (TREE_PURPOSE (TREE_PURPOSE (attr)));
> -  if (strcmp (ns, "arm") != 0)
> - continue;
> -
> -  auto attr_name = IDENTIFIER_POINTER (TREE_VALUE (TREE_PURPOSE (attr)));
> +  auto attr_name = IDENTIFIER_POINTER (get_attribute_name (attr));
>auto flags = aarch64_attribute_shared_state_flags (attr_name);
>if (!flags)
>   continue;


Re: [PATCH v1 4/9] aarch64: Exclude symbols using GOT from code models

2024-09-06 Thread Richard Sandiford
Evgeny Karpov  writes:
> Monday, September 2, 2024 5:00 PM
> Richard Sandiford  wrote:
>
>> I think we should instead patch the callers that are using
>> aarch64_symbol_binds_local_p for GOT decisions.  The function itself
>> is checking for a more general property (and one that could be useful
>> in other contexts).
>
> The patch has been refactored to address the review. Thanks!
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index e4df70ddedc..8dc10efa629 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -20988,7 +20988,7 @@ aarch64_classify_symbol (rtx x, HOST_WIDE_INT offset)
>   /* With -fPIC non-local symbols use the GOT.  For orthogonality
>  always use the GOT for extern weak symbols.  */
>   if ((flag_pic || SYMBOL_REF_WEAK (x))
> - && !aarch64_symbol_binds_local_p (x))
> + && !aarch64_symbol_binds_local_p (x) && !TARGET_PECOFF)
> return SYMBOL_TINY_GOT;
>
>   /* When we retrieve symbol + offset address, we have to make sure
> @@ -21010,7 +21010,7 @@ aarch64_classify_symbol (rtx x, HOST_WIDE_INT offset)
> case AARCH64_CMODEL_SMALL_PIC:
> case AARCH64_CMODEL_SMALL:
>   if ((flag_pic || SYMBOL_REF_WEAK (x))
> - && !aarch64_symbol_binds_local_p (x))
> + && !aarch64_symbol_binds_local_p (x) && !TARGET_PECOFF)
> return aarch64_cmodel == AARCH64_CMODEL_SMALL_SPIC
> ? SYMBOL_SMALL_GOT_28K : SYMBOL_SMALL_GOT_4G;

Sorry for the nits, but: the GCC convention is to put each && on a separate
line when the && chain spans multiple lines.  And I think it makes sense
to test TARGET_PECOFF first:

 if (!TARGET_PECOFF
 && (flag_pic || SYMBOL_REF_WEAK (x))
 && !aarch64_symbol_binds_local_p (x))

Thanks,
Richard


Re: [PATCH v3 1/2] [APX CFCMOV] Support APX CFCMOV in if_convert pass

2024-09-06 Thread Richard Sandiford
"Kong, Lingling"  writes:
> Hi,
>
> This version has added a new optab named 'cfmovcc'. The new optab is used
> in the middle end to expand to cfcmov. And simplified my patch by trying to
> generate the conditional faulting movcc in noce_try_cmove_arith function.
>
> All the changes passed bootstrap & regtest x86-64-pc-linux-gnu.
> We also tested spec with SDE and passed the runtime test.
>
> Ok for trunk?
>
>
> The APX CFCMOV[1] feature implements conditional faulting, which means
> that if the comparison is false, all memory faults are suppressed when
> loading or storing a memory operand.  This lets a conditional move use
> a memory operand that may trap or fault.
>
> The middle end currently does not support a conditional move if it knows
> that a load from A or B could trap or fault.  To enable CFCMOV, we added
> a new optab named cfmovcc.
>
> A fault-suppressing conditional store to memory does not move any
> arithmetic calculations.  For a conditional load, only the case of one
> potentially trapping memory operand and one non-trapping, non-memory
> operand is supported for now.

Sorry if this is going over old ground (I haven't read the earlier
versions yet), but: instead of adding a new optab, could we treat
CFCMOV as a scalar instance of maskload_optab?  Robin is working on
adding an "else" value for when the condition/mask is false.  After
that, it would seem to be a pretty close match to CFCMOV.

One reason for preferring maskload is that it makes the load an
explicit part of the interface.  We could then potentially use
it in gimple too, not just expand.

Thanks,
Richard
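
For illustration, a small C example of the kind of conditional load at stake
(not from the patch; names are invented): today's if-conversion refuses to
turn this into a plain conditional move because the load may trap, which is
exactly the case a conditionally faulting move, or a masked load with an
"else" value, would make safe.

  int
  select_load (int cond, int *p, int fallback)
  {
    int x;
    if (cond)
      x = *p;        /* may fault if !cond and p is not dereferenceable */
    else
      x = fallback;
    return x;
  }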

>
>
> [1].https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html
>
> gcc/ChangeLog:
>
>* doc/md.texi: Add cfmovcc insn pattern explanation.
>* ifcvt.cc (can_use_cmove_load_mem_notrap): New func
>for conditional faulting movcc for load.
>(can_use_cmove_store_mem_notrap): New func for conditional
>faulting movcc for store.
>(can_use_cfmovcc):  New func for conditional faulting.
>(noce_try_cmove_arith): Try to convert to conditional faulting
>movcc.
>(noce_process_if_block): Ditto.
>* optabs.cc (emit_conditional_move): Handle cfmovcc.
>(emit_conditional_move_1): Ditto.
>* optabs.def (OPTAB_D): New optab.
> ---
> gcc/doc/md.texi |  10 
> gcc/ifcvt.cc| 119 
> gcc/optabs.cc   |  14 +-
> gcc/optabs.def  |   1 +
> 4 files changed, 132 insertions(+), 12 deletions(-)
>
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index a9259112251..5f563787c49 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -8591,6 +8591,16 @@ Return 1 if operand 1 is a normal floating point 
> number and 0
> otherwise.  @var{m} is a scalar floating point mode.  Operand 0
> has mode @code{SImode}, and operand 1 has mode @var{m}.
> +@cindex @code{cfmov@var{mode}cc} instruction pattern
> +@item @samp{cfmov@var{mode}cc}
> +Similar to @samp{mov@var{mode}cc} but for conditional faulting.
> +If the comparison is false, all memory faults are suppressed
> +when loading or storing a memory operand.
> +
> +Conditionally move operand 2 or operand 3 into operand 0 according
> +to the comparison in operand 1.  If the comparison is true, operand 2
> +is moved into operand 0, otherwise operand 3 is moved.
> +
> @end table
>  @end ifset
> diff --git a/gcc/ifcvt.cc b/gcc/ifcvt.cc
> index 6487574c514..59845390607 100644
> --- a/gcc/ifcvt.cc
> +++ b/gcc/ifcvt.cc
> @@ -778,6 +778,9 @@ static bool noce_try_store_flag_mask (struct noce_if_info 
> *);
> static rtx noce_emit_cmove (struct noce_if_info *, rtx, enum rtx_code, rtx,
> rtx, rtx, rtx, rtx = NULL, 
> rtx = NULL);
> static bool noce_try_cmove (struct noce_if_info *);
> +static bool can_use_cmove_load_mem_notrap (rtx, rtx);
> +static bool can_use_cmove_store_mem_notrap (rtx, rtx, rtx, bool);
> +static bool can_use_cfmovcc (struct noce_if_info *);
> static bool noce_try_cmove_arith (struct noce_if_info *);
> static rtx noce_get_alt_condition (struct noce_if_info *, rtx, rtx_insn **);
> static bool noce_try_minmax (struct noce_if_info *);
> @@ -2132,6 +2135,69 @@ noce_emit_bb (rtx last_insn, basic_block bb, bool 
> simple)
>return true;
> }
> +/* Return TRUE if we could convert "if (test) x = *a; else x = b;"
> +   or "if (test) x = a; else x = *b;" to conditional faulting movcc,
> +   i.e. x86 cfcmov, especially when loading a or b may cause memory faults.  */
> +
> +static bool
> +can_use_cmove_load_mem_notrap (rtx a, rtx b)
> +{
> +  /* Just handle a conditional move from one trap MEM + other non_trap,
> + non mem cases.  */
> +  if (!(MEM_P (a) ^ MEM_P (b)))
> +  return false;
> +  bool a_trap = may_trap_or_fault_p (a);
> +  bool b_trap = may_trap_or_fault_p (b);
> +
> +  if (!(a_trap ^ b_trap))
> +

Re: [PATCH] [AARCH64] adjust gcc.target/aarch64/sve/mask_gather_load_7.c

2024-09-05 Thread Richard Sandiford
Richard Biener  writes:
> The following adjusts the scan-assembler to also allow predicate
> registers p8-15 to be used for the destination of the compares.
> I see that code generation with a pending vectorizer patch (the
> only assembler change is different predicate register allocation).

Oops, yes, I should have realised that 0-7 was overly constrained.

> Tested on aarch64.
>
> OK for trunk?

OK, thanks.

Richard

>
> Thanks,
> Richard.
>
>   * gcc.target/aarch64/sve/mask_gather_load_7.c: Allow
>   p8-15 to be used for the destination of the compares.
> ---
>  gcc/testsuite/gcc.target/aarch64/sve/mask_gather_load_7.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/mask_gather_load_7.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/mask_gather_load_7.c
> index c31fae308a5..7812ae7c928 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/mask_gather_load_7.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/mask_gather_load_7.c
> @@ -41,13 +41,13 @@
>  TEST_ALL (TEST_LOOP)
>  
>  /* { dg-final { scan-assembler-times {\tld1h\tz[0-9]+\.h, p[0-7]/z, 
> \[x[0-9]+, x[0-9]+, lsl 1\]\n} 36 } } */
> -/* { dg-final { scan-assembler-times {\tcmpeq\tp[0-7]\.h, p[0-7]/z, 
> z[0-9]+\.h, z[0-9]+\.h\n} 12 } } */
> -/* { dg-final { scan-assembler-times {\tfcmeq\tp[0-7]\.h, p[0-7]/z, 
> z[0-9]+\.h, z[0-9]+\.h\n} 6 } } */
> +/* { dg-final { scan-assembler-times {\tcmpeq\tp[0-9]+\.h, p[0-7]/z, 
> z[0-9]+\.h, z[0-9]+\.h\n} 12 } } */
> +/* { dg-final { scan-assembler-times {\tfcmeq\tp[0-9]+\.h, p[0-7]/z, 
> z[0-9]+\.h, z[0-9]+\.h\n} 6 } } */
>  /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, 
> \[x[0-9]+, z[0-9]+\.s, sxtw 2\]\n} 18 } } */
>  /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, 
> \[x[0-9]+, z[0-9]+\.s, uxtw 2\]\n} 18 } } */
>  
>  /* Also used for the TEST32 indices.  */
>  /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, 
> \[x[0-9]+, x[0-9]+, lsl 2\]\n} 72 } } */
> -/* { dg-final { scan-assembler-times {\tcmpeq\tp[0-7]\.s, p[0-7]/z, 
> z[0-9]+\.s, z[0-9]+\.s\n} 12 } } */
> -/* { dg-final { scan-assembler-times {\tfcmeq\tp[0-7]\.s, p[0-7]/z, 
> z[0-9]+\.s, z[0-9]+\.s\n} 6 } } */
> +/* { dg-final { scan-assembler-times {\tcmpeq\tp[0-9]+\.s, p[0-7]/z, 
> z[0-9]+\.s, z[0-9]+\.s\n} 12 } } */
> +/* { dg-final { scan-assembler-times {\tfcmeq\tp[0-9]+\.s, p[0-7]/z, 
> z[0-9]+\.s, z[0-9]+\.s\n} 6 } } */
>  /* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
> \[x[0-9]+, z[0-9]+\.d, lsl 3\]\n} 36 } } */


Re: [PATCH] libsanitizer: On aarch64 use hint #34 in prologue of libsanitizer functions

2024-09-05 Thread Richard Sandiford
Jakub Jelinek  writes:
> Hi!
>
> When gcc is built with -mbranch-protection=standard, running sanitized
> programs doesn't work properly on bti enabled kernels.
>
> This has been fixed upstream with
> https://github.com/llvm/llvm-project/pull/84061
>
> The following patch cherry picks that from upstream, ok for trunk/14.3?

Yes, thanks!

Richard

> For trunk we should eventually do a full merge from upstream, but I'm hoping
> they will first fix up the _BitInt libubsan support mess.
>
> 2024-09-05  Jakub Jelinek  
>
>   * sanitizer_common/sanitizer_asm.h: Cherry-pick llvm-project revision
>   1c792d24e0a228ad49cc004a1c26bbd7cd87f030.
>   * interception/interception.h: Likewise.
>
> --- libsanitizer/sanitizer_common/sanitizer_asm.h
> +++ libsanitizer/sanitizer_common/sanitizer_asm.h
> @@ -42,6 +42,16 @@
>  # define CFI_RESTORE(reg)
>  #endif
>  
> +#if defined(__aarch64__) && defined(__ARM_FEATURE_BTI_DEFAULT)
> +# define ASM_STARTPROC CFI_STARTPROC; hint #34
> +# define C_ASM_STARTPROC SANITIZER_STRINGIFY(CFI_STARTPROC) "\nhint #34"
> +#else
> +# define ASM_STARTPROC CFI_STARTPROC
> +# define C_ASM_STARTPROC SANITIZER_STRINGIFY(CFI_STARTPROC)
> +#endif
> +#define ASM_ENDPROC CFI_ENDPROC
> +#define C_ASM_ENDPROC SANITIZER_STRINGIFY(CFI_ENDPROC)
> +
>  #if defined(__x86_64__) || defined(__i386__) || defined(__sparc__)
>  # define ASM_TAIL_CALL jmp
>  #elif defined(__arm__) || defined(__aarch64__) || defined(__mips__) || \
> @@ -114,9 +124,9 @@
>   .globl __interceptor_trampoline_##name; 
>   \
>   ASM_TYPE_FUNCTION(__interceptor_trampoline_##name); 
>   \
>   __interceptor_trampoline_##name:
>   \
> - CFI_STARTPROC;  
>   \
> + ASM_STARTPROC;  
>   \
>   ASM_TAIL_CALL ASM_PREEMPTIBLE_SYM(__interceptor_##name);
>   \
> - CFI_ENDPROC;
>   \
> + ASM_ENDPROC;
>   \
>   ASM_SIZE(__interceptor_trampoline_##name)
>  #  define ASM_INTERCEPTOR_TRAMPOLINE_SUPPORT 1
>  # endif  // Architecture supports interceptor trampoline
> --- libsanitizer/interception/interception.h
> +++ libsanitizer/interception/interception.h
> @@ -204,11 +204,11 @@ const interpose_substitution substitution_##func_name[] 
> \
> ".type  " SANITIZER_STRINGIFY(TRAMPOLINE(func)) ", "  
>   \
>   ASM_TYPE_FUNCTION_STR "\n"  
>   \
> SANITIZER_STRINGIFY(TRAMPOLINE(func)) ":\n"   
>   \
> -   SANITIZER_STRINGIFY(CFI_STARTPROC) "\n"   
>   \
> +   C_ASM_STARTPROC "\n"  
>   \
> C_ASM_TAIL_CALL(SANITIZER_STRINGIFY(TRAMPOLINE(func)),
>   \
> "__interceptor_"  
>   \
>   SANITIZER_STRINGIFY(ASM_PREEMPTIBLE_SYM(func))) 
> "\n"  \
> -   SANITIZER_STRINGIFY(CFI_ENDPROC) "\n" 
>   \
> +   C_ASM_ENDPROC "\n"
>   \
> ".size  " SANITIZER_STRINGIFY(TRAMPOLINE(func)) ", "  
>   \
>  ".-" SANITIZER_STRINGIFY(TRAMPOLINE(func)) "\n"  
>   \
>   );
>
>   Jakub


Re: [PATCH v1 0/9] SMALL code model fixes, optimization fixes, LTO and minimal C++ enablement

2024-09-02 Thread Richard Sandiford
Evgeny Karpov  writes:
> The main focus of this work is resolving issues in the aarch64-w64-mingw32
> target related to enabling optimization with -O2, -O3, and LTO, fixing
> relocations, and adding debugging information.
>
> This patch series is supported by the binutils patch series.
> https://sourceware.org/pipermail/binutils/2024-August/136481.html

Thanks for submitting this.  I've responded with minor comments to
some individual patches, but the rest (1, 2, 7, and 8) look good to
me as-is.

On patch 1, do you have a reference for how AArch64 and x86 handle weak
references for MinGW?  The code looks good, but I didn't really follow
why it was doing what it was doing.

Richard

>
> The patches included in this series implement and resolve the
> following areas:
>
> 1. Support weak references.
> 2. Add debugging information.
> 3. Add minimal C++ support.
> 4. Exclude symbols using GOT from code models.
> 5. Resolve relocation and small code model issues. Before the change,
> the code model could not address more than 1MB due to relocation
> issues. After the fix, the SMALL code model can address 4GB properly.
> 6. Fix optimization issues, mostly related to relocation issues.
> 7. Add LTO support.
> 8. Support large alignment.
> 9. Extend the number of tests from 309k to 605k by adding minimal C++
> to the target, and improve the pass rate from 72% to 89% with fixes
> in the patch series targeting armv8 without optional extensions 
> such as SVE.
>
> What is next?
> 1. The patch series 4. Partial or full SEH implementation, variadic
> functions fixes, call stack support for debugging, bug fixing, and
> improving unit tests pass rate.
> 2. The patch series 5. The rest of SEH, C++ exceptions, Boost
> library unit testing, bug fixing, and improving unit tests pass rate.
> 3. In terms of the contribution window for GCC 15, it would be great
> to have the armv8-a architecture target in the best quality that can
> be achieved. A patch series extending the architecture target from
> armv8-a to "armv8.2-a + sve" might also be prepared before the GCC 15
> release. However, it is not yet known whether it will land in the
> GCC 15 release or in the next development cycle.
>
> Thank you for your review!
>
> Prepared, refactored and validated by
> Radek Barton  and
> Evgeny Karpov 
>
> Contributor: Zac Walker 
>
> Special thanks to Linaro, especially Christophe Lyon, 
> for the internal review!
>
> Regards,
> Evgeny


Re: [PATCH v1 9/9] aarch64: Handle alignment when it is bigger than BIGGEST_ALIGNMENT

2024-09-02 Thread Richard Sandiford
Evgeny Karpov  writes:
> In some cases, the alignment can be bigger than BIGGEST_ALIGNMENT.
> The patch handles these cases.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-coff.h (ASM_OUTPUT_ALIGNED_LOCAL):
>   Change alignment.

Can you go into more detail?  What kind of testcase requires this?

Thanks,
Richard
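
One guess at the kind of input that would exercise this (purely illustrative,
not taken from the submission): a static, uninitialized object whose requested
alignment exceeds BIGGEST_ALIGNMENT (128 bits on AArch64) and which is
therefore emitted through ASM_OUTPUT_ALIGNED_LOCAL.

  /* Hypothetical testcase sketch.  */
  static char buf[128] __attribute__ ((aligned (64)));

  char *
  get_buf (void)
  {
    return buf;
  }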

> ---
>  gcc/config/aarch64/aarch64-coff.h | 10 ++
>  1 file changed, 10 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64-coff.h 
> b/gcc/config/aarch64/aarch64-coff.h
> index 77c09df82e4..131145171a0 100644
> --- a/gcc/config/aarch64/aarch64-coff.h
> +++ b/gcc/config/aarch64/aarch64-coff.h
> @@ -58,6 +58,16 @@
>assemble_name ((FILE), (NAME)),\
>fprintf ((FILE), ",%u\n", (int)(ROUNDED)))
>  
> +#define ASM_OUTPUT_ALIGNED_LOCAL(FILE, NAME, SIZE, ALIGNMENT)  \
> +  { \
> +unsigned HOST_WIDE_INT rounded = MAX ((SIZE), 1); \
> +unsigned HOST_WIDE_INT alignment = MAX ((ALIGNMENT), BIGGEST_ALIGNMENT); 
> \
> +rounded += (alignment / BITS_PER_UNIT) - 1; \
> +rounded = (rounded / (alignment / BITS_PER_UNIT) \
> +  * (alignment / BITS_PER_UNIT)); \
> +ASM_OUTPUT_LOCAL (FILE, NAME, SIZE, rounded); \
> +  }
> +
>  #define ASM_OUTPUT_SKIP(STREAM, NBYTES)  \
>fprintf (STREAM, "\t.space\t%d  // skip\n", (int) (NBYTES))


Re: [PATCH v1 6/9] aarch64: Use symbols without offset to prevent relocation issues

2024-09-02 Thread Richard Sandiford
Evgeny Karpov  writes:
> aarch64.cc has been updated to prevent emitting "symbol + offset"
> for SYMBOL_SMALL_ABSOLUTE for the PECOFF target. "symbol + offset"
> cannot be used in relocations for aarch64-w64-mingw32 due to
> relocation requirements.
> Instead, it will adjust the address by an offset with the
> "add" instruction.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (aarch64_load_symref_and_add_offset):
>   New.
>   (aarch64_expand_mov_immediate): Use
>   aarch64_load_symref_and_add_offset.
> ---
>  gcc/config/aarch64/aarch64.cc | 24 
>  1 file changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 7865484860b..1d88814f28d 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -4887,6 +4887,17 @@ aarch64_split_add_offset (scalar_int_mode mode, rtx 
> dest, rtx src,
> temp1, temp2, 0, false);
>  }
>  
> +static void
> +aarch64_load_symref_and_add_offset (scalar_int_mode mode, rtx dest, rtx src,
> + poly_int64 offset)

Nit: there should be a function comment, and the second line should be
indented under "scalar_int_mode".  Maybe:

/* Emit:

  TMP = SRC
  DEST = TMP + OFFSET

   given that SRC is a symbolic constant and DEST is a register.  Both SRC
   and DEST have mode MODE.  */

static void
aarch64_load_symref_and_add_offset (scalar_int_mode mode, rtx dest, rtx src,
poly_int64 offset)

OK with that change, thanks.

Richard

> +{
> +  gcc_assert (can_create_pseudo_p ());
> +  src = aarch64_force_temporary (mode, dest, src);
> +  aarch64_add_offset (mode, dest, src, offset,
> +   NULL_RTX, NULL_RTX, 0, false);
> +}
> +
> +
>  /* Add DELTA to the stack pointer, marking the instructions frame-related.
> TEMP1 is available as a temporary if nonnull.  FORCE_ISA_MODE is as
> for aarch64_add_offset.  EMIT_MOVE_IMM is false if TEMP1 already
> @@ -6054,10 +6065,8 @@ aarch64_expand_mov_immediate (rtx dest, rtx imm)
>   case SYMBOL_TINY_TLSIE:
> if (const_offset != 0)
>   {
> -   gcc_assert(can_create_pseudo_p ());
> -   base = aarch64_force_temporary (int_mode, dest, base);
> -   aarch64_add_offset (int_mode, dest, base, const_offset,
> -   NULL_RTX, NULL_RTX, 0, false);
> +   aarch64_load_symref_and_add_offset (int_mode, dest, base,
> +   const_offset);
> return;
>   }
> /* FALLTHRU */
> @@ -6068,6 +6077,13 @@ aarch64_expand_mov_immediate (rtx dest, rtx imm)
>   case SYMBOL_TLSLE24:
>   case SYMBOL_TLSLE32:
>   case SYMBOL_TLSLE48:
> +   if (TARGET_PECOFF && const_offset != 0)
> + {
> +   aarch64_load_symref_and_add_offset (int_mode, dest, base,
> +   const_offset);
> +   return;
> + }
> +
> aarch64_load_symref_appropriately (dest, imm, sty);
> return;


Re: [PATCH v1 5/9] aarch64: Multiple adjustments to support the SMALL code model correctly

2024-09-02 Thread Richard Sandiford
Evgeny Karpov  writes:
> LOCAL_LABEL_PREFIX has been changed to help the assembly
> compiler recognize local labels. Emitting locals has been
> replaced with the .lcomm directive to declare uninitialized
> data without defining an exact section. Functions and objects
> were missing declarations. Binutils was not able to distinguish
> static from external, or an object from a function.
> mingw_pe_declare_object_type has been added to have type
> information for relocation on AArch64, which is not the case
> for ix86.
>
> This fix relies on changes in binutils.
> aarch64: Relocation fixes and LTO
> https://sourceware.org/pipermail/binutils/2024-August/136481.html
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-coff.h (LOCAL_LABEL_PREFIX):
>   Use "." as the local label prefix.
>   (ASM_OUTPUT_ALIGNED_LOCAL): Remove.
>   (ASM_OUTPUT_LOCAL): New.
>   * config/aarch64/cygming.h (ASM_DECLARE_OBJECT_NAME):
>   New.
>   (ASM_DECLARE_FUNCTION_NAME): New.
>   * config/mingw/winnt.cc (mingw_pe_declare_object_type):
>   New.
>   * config/mingw/winnt.h (mingw_pe_declare_object_type):
>   New.
> ---
>  gcc/config/aarch64/aarch64-coff.h | 22 ++
>  gcc/config/aarch64/cygming.h  | 12 
>  gcc/config/mingw/winnt.cc | 10 ++
>  gcc/config/mingw/winnt.h  |  2 ++
>  4 files changed, 30 insertions(+), 16 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-coff.h 
> b/gcc/config/aarch64/aarch64-coff.h
> index 81fd9954f75..77c09df82e4 100644
> --- a/gcc/config/aarch64/aarch64-coff.h
> +++ b/gcc/config/aarch64/aarch64-coff.h
> @@ -20,9 +20,8 @@
>  #ifndef GCC_AARCH64_COFF_H
>  #define GCC_AARCH64_COFF_H
>  
> -#ifndef LOCAL_LABEL_PREFIX
> -# define LOCAL_LABEL_PREFIX  ""
> -#endif
> +#undef LOCAL_LABEL_PREFIX
> +#define LOCAL_LABEL_PREFIX  "."
>  
>  /* Using long long breaks -ansi and -std=c90, so these will need to be
> made conditional for an LLP64 ABI.  */
> @@ -54,19 +53,10 @@
>  }
>  #endif
>  
> -/* Output a local common block.  /bin/as can't do this, so hack a
> -   `.space' into the bss segment.  Note that this is *bad* practice,
> -   which is guaranteed NOT to work since it doesn't define STATIC
> -   COMMON space but merely STATIC BSS space.  */
> -#ifndef ASM_OUTPUT_ALIGNED_LOCAL
> -# define ASM_OUTPUT_ALIGNED_LOCAL(STREAM, NAME, SIZE, ALIGN) \
> -{
> \
> -  switch_to_section (bss_section);   
> \
> -  ASM_OUTPUT_ALIGN (STREAM, floor_log2 (ALIGN / BITS_PER_UNIT)); \
> -  ASM_OUTPUT_LABEL (STREAM, NAME);   
> \
> -  fprintf (STREAM, "\t.space\t%d\n", (int)(SIZE));   
> \
> -}
> -#endif
> +#define ASM_OUTPUT_LOCAL(FILE, NAME, SIZE, ROUNDED)  \
> +( fputs (".lcomm ", (FILE)), \
> +  assemble_name ((FILE), (NAME)),\
> +  fprintf ((FILE), ",%u\n", (int)(ROUNDED)))

I realise this is pre-existing, bue the last line should probably be:

  fprintf ((FILE), "," HOST_WIDE_INT_PRINT_UNSIGNED "\n", (ROUNDED)))

to avoid silent truncation.  (Even if the format only supports 32-bit
code and data, it's better for out-of-bounds values to be flagged by
the assembler rather than silently truncated.)

>  #define ASM_OUTPUT_SKIP(STREAM, NBYTES)  \
>fprintf (STREAM, "\t.space\t%d  // skip\n", (int) (NBYTES))
> diff --git a/gcc/config/aarch64/cygming.h b/gcc/config/aarch64/cygming.h
> index e4ceab82b9e..d3c6f550b68 100644
> --- a/gcc/config/aarch64/cygming.h
> +++ b/gcc/config/aarch64/cygming.h
> @@ -213,6 +213,18 @@ still needed for compilation.  */
>  
>  #define SUPPORTS_ONE_ONLY 1
>  
> +#undef ASM_DECLARE_OBJECT_NAME
> +#define ASM_DECLARE_OBJECT_NAME(STREAM, NAME, DECL)  \
> +  mingw_pe_declare_object_type (STREAM, NAME, TREE_PUBLIC (DECL)); \
> +  ASM_OUTPUT_LABEL ((STREAM), (NAME))
> +
> +
> +#undef ASM_DECLARE_FUNCTION_NAME
> +#define ASM_DECLARE_FUNCTION_NAME(STR, NAME, DECL)   \
> +  mingw_pe_declare_function_type (STR, NAME, TREE_PUBLIC (DECL)); \
> +  aarch64_declare_function_name (STR, NAME, DECL)
> +
> +

These two should probaly either be wrapped in:

  do { ... ] while (0)

or use comma expressions (as for the .lcomm printing above).

Using "STREAM" rather than "STR" in ASM_DECLARE_FUNCTION_NAME
would be more consistent with ASM_DECLARE_OBJECT_NAME.
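
For concreteness, one way of writing the suggested do/while wrapping (just a
sketch of the review comment, using the same names as the patch):

  #undef ASM_DECLARE_FUNCTION_NAME
  #define ASM_DECLARE_FUNCTION_NAME(STREAM, NAME, DECL)		\
    do								\
      {								\
	mingw_pe_declare_function_type (STREAM, NAME,		\
					TREE_PUBLIC (DECL));	\
	aarch64_declare_function_name (STREAM, NAME, DECL);	\
      }								\
    while (0)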

>  /* Define this to be nonzero if static stack checking is supported.  */
>  #define STACK_CHECK_STATIC_BUILTIN 1
>  
> diff --git a/gcc/config/mingw/winnt.cc b/gcc/config/mingw/winnt.cc
> index 1e2ec53e841..64157b09644 100644
> --- a/gcc/config/mingw/winnt.cc
> +++ b/gcc/config/mingw/winnt.cc
> @@ -581,6 +581,16 @@ i386_pe_asm_output_aligned_decl_common (FILE *stream, 
> tree decl,
> function, and PUB is nonzero if the function is globally
> visible.  */
>  
> +void
> +mingw_pe_declare_object_type (FILE *file, const char *na

Re: [PATCH v1 4/9] aarch64: Exclude symbols using GOT from code models

2024-09-02 Thread Richard Sandiford
Evgeny Karpov  writes:
> Symbols using GOT are not supported by the aarch64-w64-mingw32
> target and should be excluded from the code models.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (aarch64_symbol_binds_local_p):
>   Disable GOT for PECOFF target.

I think we should instead patch the callers that are using
aarch64_symbol_binds_local_p for GOT decisions.  The function itself
is checking for a more general property (and one that could be useful
in other contexts).

Thanks,
Richard

> ---
>  gcc/config/aarch64/aarch64.cc | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index e4df70ddedc..7865484860b 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -20885,6 +20885,9 @@ aarch64_tlsdesc_abi_id ()
>  static bool
>  aarch64_symbol_binds_local_p (const_rtx x)
>  {
> +  if (TARGET_PECOFF)
> +return true;
> +
>return (SYMBOL_REF_DECL (x)
> ? targetm.binds_local_p (SYMBOL_REF_DECL (x))
> : SYMBOL_REF_LOCAL_P (x));


Re: [PATCH 3/3] SVE intrinsics: Fold constant operands for svmul.

2024-09-02 Thread Richard Sandiford
Jennifer Schmitz  writes:
>> Does:
>> 
>>  svmul_n_u32_x (pg, svindex_u32 (4, 1), 2);
>> 
>> get optimised to a single INDEX of #8, #2?  It would be good to test
>> things like that if so.
>> 
>> Looks good to me otherwise.
>> 
>> Thanks,
>> Richard
> I tested this test case, but - unlike svdup - svindex is currently not folded 
> in gimple and the test case still compiles to using index and multiplication 
> instructions. Implementing the gimple_folder for svindex could be a follow-up 
> patch. In the meantime, I added a svdupq test as for svdiv.
> Best, Jennifer

Bah, I should have checked.  But yeah, folding constant svindex operations
would be a nice thing to add.

The patch is OK for trunk, thanks.

Richard
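
For the record, the example discussed above in source form (illustrative only;
the function name is invented).  Since svindex_u32 (4, 1) is {4, 5, 6, ...},
multiplying every lane by 2 gives {8, 10, 12, ...}, i.e. svindex_u32 (8, 2),
so a future svindex gimple fold would let the whole expression become a
single INDEX #8, #2.

  #include <arm_sve.h>

  svuint32_t
  scaled_index (svbool_t pg)
  {
    return svmul_n_u32_x (pg, svindex_u32 (4, 1), 2);
  }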


Re: [PATCH 2/3] SVE intrinsics: Fold constant operands for svdiv.

2024-09-02 Thread Richard Sandiford
Jennifer Schmitz  writes:
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins.cc
> index 5ca9ec32b69..60350e08372 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
> @@ -1132,6 +1132,30 @@ report_not_enum (location_t location, tree fndecl, 
> unsigned int argno,
>   " a valid %qT value", actual, argno + 1, fndecl, enumtype);
>  }
>  
> +/* Try to fold constant arguments arg1 and arg2 using the given tree_code.
> +   Operations are not treated as overflowing.  */

Sorry for the nit, but: the convention is to put argument names in caps,
so ARG1 and ARG2.

> +static tree
> +aarch64_const_binop (enum tree_code code, tree arg1, tree arg2)
> +{
> +  if (poly_int_tree_p (arg1) && poly_int_tree_p (arg2))
> +{
> +  poly_wide_int poly_res;
> +  tree type = TREE_TYPE (arg1);
> +  signop sign = TYPE_SIGN (type);
> +  wi::overflow_type overflow = wi::OVF_NONE;
> +
> +  /* Return 0 for division by 0.  */

Maybe add ", like SDIV and UDIV do", to make it clearer where this has
come from.

> +  if (code == TRUNC_DIV_EXPR && integer_zerop (arg2))
> + return arg2;
> +
> +  if (!poly_int_binop (poly_res, code, arg1, arg2, sign, &overflow))
> + return NULL_TREE;
> +  return force_fit_type (type, poly_res, false,
> +  TREE_OVERFLOW (arg1) | TREE_OVERFLOW (arg2));
> +}
> +  return NULL_TREE;
> +}
> +
>  /* Return a hash code for a function_instance.  */
>  hashval_t
>  function_instance::hash () const
> @@ -3616,6 +3640,23 @@ gimple_folder::fold ()
>return base->fold (*this);
>  }
>  
> +/* Try to fold constant integer operands.  */

Maybe:

/* Try to fold the call to a constant, given that, for integers, the call
   is roughly equivalent to binary operation CODE.  aarch64_const_binop
   handles any differences between CODE and the intrinsic.  */

> +gimple *
> +gimple_folder::fold_const_binary (enum tree_code code)
> +{
> +  gcc_assert (gimple_call_num_args (call) == 3);
> +  tree pg = gimple_call_arg (call, 0);
> +  tree op1 = gimple_call_arg (call, 1);
> +  tree op2 = gimple_call_arg (call, 2);
> +
> +  if (type_suffix (0).integer_p
> +  && (pred == PRED_x || is_ptrue (pg, type_suffix (0).element_bytes)))
> +if (tree res = vector_const_binop (code, op1, op2, aarch64_const_binop))
> +  return gimple_build_assign (lhs, res);
> +
> +  return NULL;
> +}
> +
>  function_expander::function_expander (const function_instance &instance,
> tree fndecl, tree call_expr_in,
> rtx possible_target_in)
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.h 
> b/gcc/config/aarch64/aarch64-sve-builtins.h
> index 9ab6f202c30..22e9a815039 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins.h
> +++ b/gcc/config/aarch64/aarch64-sve-builtins.h
> @@ -639,6 +639,8 @@ public:
>  
>gimple *fold ();
>  
> +  gimple *fold_const_binary (enum tree_code);
> +

Sorry for being so picky, but could you put this above "fold" instead?
fold is the top-level entry point, so I think it should either come
first or last.

OK for trunk with those changes, thanks.

Richard

>/* Where to insert extra statements that feed the final replacement.  */
>gimple_stmt_iterator *gsi;
>  


Re: [PATCH 1/3] SVE intrinsics: Fold constant operands.

2024-09-02 Thread Richard Sandiford
Jennifer Schmitz  writes:
> @@ -1365,6 +1368,91 @@ simplify_const_binop (tree_code code, tree op, tree 
> other_op,
>return NULL_TREE;
>  }
>  
> +/* Combine two (vector) constants ARG1 and ARG2 element-wise under operation
> +   CODE using function ELT_CONST_BINOP to produce a new (vector) constant.
> +   We assume ARG1 and ARG2 have the same data type, or at least
> +   are the same kind of constant and the same machine mode.  Return zero if
> +   combining the constants is not allowed in the current operating mode.  */

How about:

/* If ARG1 and ARG2 are constants, and if performing CODE on them would
   be an elementwise vector operation, try to fold the operation to a
   constant vector, using ELT_CONST_BINOP to fold each element.  Return
   the folded value on success, otherwise return null.  */

The reason for suggesting this is that the caller doesn't need to check
whether ARG1 or ARG2 are constant or whether the operation is a vector
operation.  The routine does those checks itself.

I also don't think we need to specify the assumptions on well-typed
vector operations, since those apply generally.

OK for trunk with that change, thanks.

Richard
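
A small worked example of the "equal differences" property the comment relies
on (illustrative only; plain scalar code rather than the VECTOR_CST encoding):

  #include <cassert>

  int
  main ()
  {
    int a[] = {1, 2, 3};      /* stepped: a[2] - a[1] == a[1] - a[0] */
    int b[] = {10, 20, 30};   /* stepped likewise */
    int s[3];
    for (int i = 0; i < 3; ++i)
      s[i] = a[i] + b[i];
    /* The elementwise sum is stepped too, which is why PLUS_EXPR and
       MINUS_EXPR can operate directly on the encoded elements.  */
    assert (s[2] - s[1] == s[1] - s[0]);
    return 0;
  }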

> +tree
> +vector_const_binop (tree_code code, tree arg1, tree arg2,
> + tree (*elt_const_binop) (enum tree_code, tree, tree))
> +{
> +  if (TREE_CODE (arg1) == VECTOR_CST && TREE_CODE (arg2) == VECTOR_CST
> +  && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)),
> +TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg2
> +{
> +  tree type = TREE_TYPE (arg1);
> +  bool step_ok_p;
> +  if (VECTOR_CST_STEPPED_P (arg1)
> +   && VECTOR_CST_STEPPED_P (arg2))
> +  /* We can operate directly on the encoding if:
> +
> +  a3 - a2 == a2 - a1 && b3 - b2 == b2 - b1
> +  implies
> +  (a3 op b3) - (a2 op b2) == (a2 op b2) - (a1 op b1)
> +
> +  Addition and subtraction are the supported operators
> +  for which this is true.  */
> + step_ok_p = (code == PLUS_EXPR || code == MINUS_EXPR);
> +  else if (VECTOR_CST_STEPPED_P (arg1))
> +  /* We can operate directly on stepped encodings if:
> +
> +  a3 - a2 == a2 - a1
> +  implies:
> +  (a3 op c) - (a2 op c) == (a2 op c) - (a1 op c)
> +
> +  which is true if (x -> x op c) distributes over addition.  */
> + step_ok_p = distributes_over_addition_p (code, 1);
> +  else
> +  /* Similarly in reverse.  */
> + step_ok_p = distributes_over_addition_p (code, 2);
> +  tree_vector_builder elts;
> +  if (!elts.new_binary_operation (type, arg1, arg2, step_ok_p))
> + return NULL_TREE;
> +  unsigned int count = elts.encoded_nelts ();
> +  for (unsigned int i = 0; i < count; ++i)
> + {
> +   tree elem1 = VECTOR_CST_ELT (arg1, i);
> +   tree elem2 = VECTOR_CST_ELT (arg2, i);
> +
> +   tree elt = elt_const_binop (code, elem1, elem2);
> +
> +  /* It is possible that const_binop cannot handle the given
> +  code and return NULL_TREE */
> +   if (elt == NULL_TREE)
> + return NULL_TREE;
> +   elts.quick_push (elt);
> + }
> +
> +  return elts.build ();
> +}
> +
> +  if (TREE_CODE (arg1) == VECTOR_CST
> +  && TREE_CODE (arg2) == INTEGER_CST)
> +{
> +  tree type = TREE_TYPE (arg1);
> +  bool step_ok_p = distributes_over_addition_p (code, 1);
> +  tree_vector_builder elts;
> +  if (!elts.new_unary_operation (type, arg1, step_ok_p))
> + return NULL_TREE;
> +  unsigned int count = elts.encoded_nelts ();
> +  for (unsigned int i = 0; i < count; ++i)
> + {
> +   tree elem1 = VECTOR_CST_ELT (arg1, i);
> +
> +   tree elt = elt_const_binop (code, elem1, arg2);
> +
> +   /* It is possible that const_binop cannot handle the given
> +  code and return NULL_TREE.  */
> +   if (elt == NULL_TREE)
> + return NULL_TREE;
> +   elts.quick_push (elt);
> + }
> +
> +  return elts.build ();
> +}
> +  return NULL_TREE;
> +}
>  
>  /* Combine two constants ARG1 and ARG2 under operation CODE to produce a new
> constant.  We assume ARG1 and ARG2 have the same data type, or at least
> @@ -1677,83 +1765,7 @@ const_binop (enum tree_code code, tree arg1, tree arg2)
>&& (simplified = simplify_const_binop (code, arg2, arg1, 1)))
>  return simplified;
>  
> -  if (TREE_CODE (arg1) == VECTOR_CST
> -  && TREE_CODE (arg2) == VECTOR_CST
> -  && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)),
> -TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg2
> -{
> -  tree type = TREE_TYPE (arg1);
> -  bool step_ok_p;
> -  if (VECTOR_CST_STEPPED_P (arg1)
> -   && VECTOR_CST_STEPPED_P (arg2))
> - /* We can operate directly on the encoding if:
> -
> -   a3 - a2 == a2 - a1 && b3 - b2 == b2 - b1
> - implies
> -   (a3 op b3) - (a2 op b2) == (a2 op b2) - (a1 op b1)
> -
> -Addition and subtraction are the supported operators
> -for which t

Re: [committed] hppa: Fix handling of unscaled index addresses on HP-UX

2024-08-30 Thread Richard Sandiford
Jeff Law  writes:
> On 8/29/24 10:13 AM, John David Anglin wrote:
>> Tested on hppa-unknown-linux-gnu and hppa64-hp-hpux11.11.
>> Committed to trunk.
>> 
>> Dave
>> ---
>> 
>> hppa: Fix handling of unscaled index addresses on HP-UX
>> 
>> The PA-RISC architecture uses the top two bits of memory pointers
>> to select space registers.  The space register ID is ored with the
>> pointer offset to compute the global virtual address for an access.
>> 
>> The new late combine passes broke gcc on HP-UX.  One of these passes
>> runs after reload.  The existing code assumed no unscaled index
>> instructions would be created after reload as the REG_POINTER flag
>> is not reliable after reload.  The new pass sometimes interchanged
>> the base and index registers, causing these instructions to fault
>> when the wrong space register was selected.
> Implicit space register selection based on the index rather than the 
> effective address.   A thorn in my side for years working on the PA.
>
>> 
>> The flag is not propagated when a constant is added to a pointer.
> IIRC one of the problems in this space was Ada would tend to create 
> pointers outside an object's bounds, then use an offset to bring the 
> effective address back into the object.  And tail merging would tend to 
> do things like see a path with a + b and another with b + a as addresses 
> and assume it could commonize the paths.  All kinds of dragons in here. 
> I tried damn hard to support unscaled indexing, but it was always quite 
> painful.
>
>
>> Thus, I came to the conclusion that the best approach was to
>> disable use of unscaled index addresses on HP-UX.  I don't think
>> this impacts performance significantly.  Code size might get
>> slightly larger but we get some or more back from having the late
>> combine passes.
> I can certainly live with this as I don't have to use HPUX anymore ;-) 
> Presumably Linux has a flat address model with the space registers 
> holding the same value.  That's what we did with our BSD and Mach ports 
> to the PA and was what PRO was recommending to members back in the day...
>
> It may be worth noting that if anyone still has mn10300 silicon it had 
> similar properties.  Though it may have been limited to the first 
> generation, I vaguely recall a Matsushita engineer I worked with 
> indicating they considered it a silicon bug.

Can't remember if I've mentioned this before, but FWIW: as part
of the Morello port, it was vital for correctness that we could
distinguish bases from indices.  We ended up adding a new mode class
for pointers (used only on targets that need them).  We also added
pointer_plus and pointer_diff rtx codes, along the lines of the
tree/gimple operations (again only used on targets that need them).
We added abstractions that were supposed to make it easy for
target-independent code to do the right thing.

For real Morello, the pointer capabilities were 128 bits, and so they
were easy to tell apart from indices.  However, for testing purposes,
we also had a "fake capability" mode that used the Morello-style
representation for normal AArch64 code, with 64-bit pointers.
It seemed to work pretty well.  Something like the fake capability
approach might be useful for targets like PA.

Morello was an experimental architecture, so there are no current
plans to submit support for trunk.  It's available in
vendors/ARM/heads/morello for anyone who's curious, though.

(As always, there was quite a bit of learning by doing, so we'd need
to go back and clean the branch up if we did ever submit parts to trunk.)

Richard


Re: [PATCH v5] Target-independent store forwarding avoidance.

2024-08-30 Thread Richard Sandiford
Manolis Tsamis  writes:
>> > I could have some help with that, because after the new changes a
>> > subreg related ICE also happens within store_bit_field when a DI ->
>> > V4SI case is hit. Afaik store_bit_field should just return NULL if it
>> > can't handle something so I don't really know how to address this ICE
>> > in the new version.
>>
>> I don't think store_bit_field is expected to fail.  But yeah, if you have
>> a testcase, I can take a look.
>>
>
> Yes, that was my initial reaction too. I don't yet have a reduced
> testcase, this now only happens with SPEC2017 gcc_r when compiled with
> `-g -O3 -march=armv8.2-a -flto=32 -favoid-store-forwarding` (doesn't
> reproduce without flto). I have also attached the work-in-progress
> patch that implements your changes up to that point and which can be
> used to reproduce this, at least till we have a smaller testcase.
>
> The case that leads to that ICE is this:
>
> Store forwarding detected:
> From: (insn 1700 1714 1701 132 (set (mem/j/c:SI (plus:DI (reg/f:DI 64 sfp)
> (const_int -160 [0xff60])) [0 +0 S4 A128])
> (reg:SI 617)) "real.c":158:9 69 {*movsi_aarch64}
>  (expr_list:REG_DEAD (reg:SI 617)
> (nil)))
> From: (insn 1695 1698 1715 132 (set (mem/c:TI (plus:DI (reg/f:DI 64 sfp)
> (const_int -160 [0xff60])) [0 MEM  [(void
> *)&r]+0 S16 A128])
> (const_int 0 [0])) "real.c":157:3 75 {*movti_aarch64}
>  (nil))
> To: (insn 1709 1703 1696 132 (set (reg:V4SI 622 [ r ])
> (mem/c:V4SI (plus:DI (reg/f:DI 64 sfp)
> (const_int -160 [0xff60])) [51 r+0 S16
> A128])) "builtins.c":9619:13 1270 {*aarch64_simd_movv4si}
>  (nil))
>
> Which is somewhat suspicious; we'll have to first make sure it's not an
> avoid-store-forwarding bug.

Does 1695 come before 1700?  If so, it looks like it should be valid.

> In any case the ICE is
>
> builtins.c:6684:1: internal compiler error: in simplify_subreg, at
> simplify-rtx.cc:7680
>  6684 | }
>   | ^
> 0xc41ecb simplify_context::simplify_subreg(machine_mode, rtx_def*,
> machine_mode, poly_int<2u, unsigned long>)
> /home/mtsamis/src/gcc/gcc4/gcc/simplify-rtx.cc:7680
> 0xc42d0f simplify_context::simplify_gen_subreg(machine_mode, rtx_def*,
> machine_mode, poly_int<2u, unsigned long>)
> /home/mtsamis/src/gcc/gcc4/gcc/simplify-rtx.cc:8007
> 0x8afb4b simplify_gen_subreg(machine_mode, rtx_def*, machine_mode,
> poly_int<2u, unsigned long>)
> /home/mtsamis/src/gcc/gcc4/gcc/rtl.h:3562
> 0x8afb4b force_subreg(machine_mode, rtx_def*, machine_mode,
> poly_int<2u, unsigned long>)
> /home/mtsamis/src/gcc/gcc4/gcc/explow.cc:755
> 0x8b9ef3 store_bit_field_1
> /home/mtsamis/src/gcc/gcc4/gcc/expmed.cc:810
> 0x8ba327 store_bit_field(rtx_def*, poly_int<2u, unsigned long>,
> poly_int<2u, unsigned long>, poly_int<2u, unsigned long>, poly_int<2u,
> unsigned long>, machine_mode, rtx_def*, bool, bool)
> /home/mtsamis/src/gcc/gcc4/gcc/expmed.cc:1214
> 0xc53c7f generate_bit_insert_sequence
> /home/mtsamis/src/gcc/gcc4/gcc/avoid-store-forwarding.cc:122
> 0xc53c7f process_store_forwarding
> /home/mtsamis/src/gcc/gcc4/gcc/avoid-store-forwarding.cc:231
> 0xc53c7f avoid_store_forwarding
> /home/mtsamis/src/gcc/gcc4/gcc/avoid-store-forwarding.cc:504
> 0xc546eb execute
> /home/mtsamis/src/gcc/gcc4/gcc/avoid-store-forwarding.cc:571

OK, thanks, I'll have a go at reproducing, but it sounds like it might
be tricky. :)

>> > So, although I don't like it and it complicates things, I had to add
>> > recog_memoized pretty much everywhere and also make sure to only emit
>> > if all of these are successful...
>>
>> Do you have testcases?  emit_move_insn's job is to make the move
>> legitimate, so it seems like the checks might be papering over an issue
>> elsewhere.
>>
> I had a quick look and couldn't find an existing testcase. I'll have
> to recheck more thoroughly in order to create a testcase (if possible
> at all).
> I remember that in one case the rtx provided was zero_extend and there
> was no instruction for that so recog would fail.

Ah, ok.  That might have been the bug then.  emit_move_insn should only
be used for moving one object to another, where an "object" can be a
constant, memory or register (including subregs of registers).
The advantage of using it is that it works out how to make the move
valid, given target limitations.

If the instruction can be more general than that then we should just use
gen_rtx_SET, and recog the result.
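
Roughly (an untested sketch, with dest and src standing for whatever
the pass has built up):

  start_sequence ();
  rtx_insn *insn = emit_insn (gen_rtx_SET (dest, src));
  if (recog_memoized (insn) < 0)
    {
      /* The target has no pattern for this operation, so give up on
         this candidate (however failure is reported in context).  */
      end_sequence ();
      return false;
    }
  rtx_insn *seq = get_insns ();
  end_sequence ();
  emit_insn (seq);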

Thanks,
Richard


Re: [PATCH 3/3] SVE intrinsics: Fold constant operands for svmul.

2024-08-30 Thread Richard Sandiford
Jennifer Schmitz  writes:
> This patch implements constant folding for svmul. If the predicate is
> ptrue or predication is _x, it uses vector_const_binop with
> aarch64_const_binop as callback and tree_code MULT_EXPR to fold constant
> integer operands.
> Tests were added to check the produced assembly for different
> predicates, signed and unsigned integers, and the svmul_n_* case.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz 
>
> gcc/
>   * config/aarch64/aarch64-sve-builtins-base.cc (svmul_impl::fold):
>   Implement and try constant folding.
>
> gcc/testsuite/
>   * gcc.target/aarch64/sve/const_fold_mul_1.c: New test.
>
> From 648d7bfe4f5dbab734e8823f82b289aa381aafb9 Mon Sep 17 00:00:00 2001
> From: Jennifer Schmitz 
> Date: Thu, 29 Aug 2024 05:12:53 -0700
> Subject: [PATCH 3/3] SVE intrinsics: Fold constant operands for svmul.
>
> This patch implements constant folding for svmul. If the predicate is
> ptrue or predication is _x, it uses vector_const_binop with
> aarch64_const_binop as callback and tree_code MULT_EXPR to fold constant
> integer operands.
> Tests were added to check the produced assembly for different
> predicates, signed and unsigned integers, and the svmul_n_* case.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz 
>
> gcc/
>   * config/aarch64/aarch64-sve-builtins-base.cc (svmul_impl::fold):
>   Implement and try constant folding.
>
> gcc/testsuite/
>   * gcc.target/aarch64/sve/const_fold_mul_1.c: New test.
> ---
>  .../aarch64/aarch64-sve-builtins-base.cc  |  29 +-
>  .../gcc.target/aarch64/sve/const_fold_mul_1.c | 292 ++
>  2 files changed, 320 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/const_fold_mul_1.c
>
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> index 617c7fc87e5..0136fa2fef6 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> @@ -2008,6 +2008,33 @@ public:
>}
>  };
>  
> +class svmul_impl : public rtx_code_function
> +{
> +public:
> +  CONSTEXPR svmul_impl ()
> +: rtx_code_function (MULT, MULT, UNSPEC_COND_FMUL) {}
> +
> +  gimple *
> +  fold (gimple_folder &f) const override
> +  {
> +tree pg = gimple_call_arg (f.call, 0);
> +int step = f.type_suffix (0).element_bytes;
> +
> +/* Try to fold constant integer operands.  */
> +if (f.type_suffix (0).integer_p
> + && (f.pred == PRED_x || is_ptrue (pg, step)))
> +  {
> + tree op1 = gimple_call_arg (f.call, 1);
> + tree op2 = gimple_call_arg (f.call, 2);
> + if (tree res = vector_const_binop (MULT_EXPR, op1, op2,
> +aarch64_const_binop))
> +   return gimple_build_assign (f.lhs, res);
> +  }
> +
> +return NULL;

With the change suggested for 2/3, this would be just:

  return f.fold_const_binary (MULT_EXPR);

> +  }
> +};
> +
>  class svnand_impl : public function_base
>  {
>  public:
> @@ -3192,7 +3219,7 @@ FUNCTION (svmls_lane, svmls_lane_impl,)
>  FUNCTION (svmmla, svmmla_impl,)
>  FUNCTION (svmov, svmov_impl,)
>  FUNCTION (svmsb, svmsb_impl,)
> -FUNCTION (svmul, rtx_code_function, (MULT, MULT, UNSPEC_COND_FMUL))
> +FUNCTION (svmul, svmul_impl,)
>  FUNCTION (svmul_lane, CODE_FOR_MODE0 (aarch64_mul_lane),)
>  FUNCTION (svmulh, unspec_based_function, (UNSPEC_SMUL_HIGHPART,
> UNSPEC_UMUL_HIGHPART, -1))
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/const_fold_mul_1.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/const_fold_mul_1.c
> new file mode 100644
> index 000..2ab5cfc1b46
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/const_fold_mul_1.c
> @@ -0,0 +1,292 @@
> +/* { dg-final { check-function-bodies "**" "" } } */
> +/* { dg-options "-O2" } */
> +
> +#include "arm_sve.h"
> +
> +/*
> +** s64_x_pg:
> +**   mov z[0-9]+\.d, #15
> +**   ret
> +*/
> +svint64_t s64_x_pg (svbool_t pg)
> +{
> +  return svmul_x (pg, svdup_s64 (5), svdup_s64 (3));
> +}
> +
> +/*
> +** s64_x_pg_0:
> +**   mov z[0-9]+\.b, #0
> +**   ret
> +*/
> +svint64_t s64_x_pg_0 (svbool_t pg)
> +{
> +  return svmul_x (pg, svdup_s64 (0), svdup_s64 (3));
> +}
> +
> +/*
> +** s64_z_pg:
> +**   mov z[0-9]+\.d, p[0-7]/z, #15
> +**   ret
> +*/
> +svint64_t s64_z_pg (svbool_t pg)
> +{
> +  return svmul_z (pg, svdup_s64 (5), svdup_s64 (3));
> +}
> +
> +/*
> +** s64_z_pg_0:
> +**   mov z[0-9]+\.d, p[0-7]/z, #0
> +**   ret
> +*/
> +svint64_t s64_z_pg_0 (svbool_t pg)
> +{
> +  return svmul_z (pg, svdup_s64 (0), svdup_s64 (3));
> +}
> +
> +/*
> +** s64_m_pg:
> +**   mov (z[0-9]+\.d), #3
> +**   mov (z[0-9]+\.d), #5
> +**   mul \2, p[0-7]/m, \2, \1
> +**   ret
> +*/
> +svint64_t s64_m_pg (svbool_t

Re: [PATCH 2/3] SVE intrinsics: Fold constant operands for svdiv.

2024-08-30 Thread Richard Sandiford
Jennifer Schmitz  writes:
> This patch implements constant folding for svdiv. If the predicate is
> ptrue or predication is _x, it uses vector_const_binop with
> aarch64_const_binop as callback and tree_code TRUNC_DIV_EXPR to fold constant
> integer operands.
> In aarch64_const_binop, a case was added for TRUNC_DIV_EXPR to return 0
> for division by 0, as defined in the semantics for svdiv.
> Tests were added to check the produced assembly for different
> predicates, signed and unsigned integers, and the svdiv_n_* case.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz 
>
> gcc/
>   * config/aarch64/aarch64-sve-builtins-base.cc (svdiv_impl::fold):
>   Try constant folding.
>   * config/aarch64/aarch64-sve-builtins.cc (aarch64_const_binop):
>   Add special case for division by 0.
>
> gcc/testsuite/
>   * gcc.target/aarch64/sve/const_fold_div_1.c: New test.
>
> From 92583732da28f6eb4a8db484fa3b24d55a7265e6 Mon Sep 17 00:00:00 2001
> From: Jennifer Schmitz 
> Date: Thu, 29 Aug 2024 05:04:51 -0700
> Subject: [PATCH 2/3] SVE intrinsics: Fold constant operands for svdiv.
>
> This patch implements constant folding for svdiv. If the predicate is
> ptrue or predication is _x, it uses vector_const_binop with
> aarch64_const_binop as callback and tree_code TRUNC_DIV_EXPR to fold constant
> integer operands.
> In aarch64_const_binop, a case was added for TRUNC_DIV_EXPR to return 0
> for division by 0, as defined in the semantics for svdiv.
> Tests were added to check the produced assembly for different
> predicates, signed and unsigned integers, and the svdiv_n_* case.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz 
>
> gcc/
>   * config/aarch64/aarch64-sve-builtins-base.cc (svdiv_impl::fold):
>   Try constant folding.
>   * config/aarch64/aarch64-sve-builtins.cc (aarch64_const_binop):
>   Add special case for division by 0.
>
> gcc/testsuite/
>   * gcc.target/aarch64/sve/const_fold_div_1.c: New test.
> ---
>  .../aarch64/aarch64-sve-builtins-base.cc  |  19 +-
>  gcc/config/aarch64/aarch64-sve-builtins.cc|   4 +
>  .../gcc.target/aarch64/sve/const_fold_div_1.c | 336 ++
>  3 files changed, 356 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/const_fold_div_1.c
>
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> index d55bee0b72f..617c7fc87e5 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> @@ -755,8 +755,21 @@ public:
>gimple *
>fold (gimple_folder &f) const override
>{
> -tree divisor = gimple_call_arg (f.call, 2);
> -tree divisor_cst = uniform_integer_cst_p (divisor);
> +tree pg = gimple_call_arg (f.call, 0);
> +tree op1 = gimple_call_arg (f.call, 1);
> +tree op2 = gimple_call_arg (f.call, 2);
> +
> +/* Try to fold constant integer operands.  */
> +if (f.type_suffix (0).integer_p
> + && (f.pred == PRED_x
> + || is_ptrue (pg, f.type_suffix (0).element_bytes)))
> +  if (tree res = vector_const_binop (TRUNC_DIV_EXPR, op1, op2,
> +  aarch64_const_binop))
> + return gimple_build_assign (f.lhs, res);

To reduce cut-&-paste, it'd be good to put this in a helper:

  gimple *gimple_folder::fold_const_binary (tree_code code);

that does the outermost "if" above for "code" rather than TRUNC_DIV_EXPR.
It could return null on failure.  Then the caller can just be:

  if (auto *res = f.fold_const_binary (TRUNC_DIV_EXPR))
return res;

This could go right at the top of the function, since it doesn't rely
on any of the local variables above.
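
(Untested, and modulo the exact member names, the helper itself would
just be the integer-constant part of the code above:)

  gimple *
  gimple_folder::fold_const_binary (tree_code code)
  {
    tree pg = gimple_call_arg (call, 0);
    if (type_suffix (0).integer_p
        && (pred == PRED_x || is_ptrue (pg, type_suffix (0).element_bytes)))
      {
        tree op1 = gimple_call_arg (call, 1);
        tree op2 = gimple_call_arg (call, 2);
        if (tree res = vector_const_binop (code, op1, op2,
                                           aarch64_const_binop))
          return gimple_build_assign (lhs, res);
      }
    return NULL;
  }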

> +
> +/* If the divisor is a uniform power of 2, fold to a shift
> +   instruction.  */
> +tree divisor_cst = uniform_integer_cst_p (op2);
>  
>  if (!divisor_cst || !integer_pow2p (divisor_cst))
>return NULL;
> @@ -770,7 +783,7 @@ public:
>   shapes::binary_uint_opt_n, MODE_n,
>   f.type_suffix_ids, GROUP_none, f.pred);
>   call = f.redirect_call (instance);
> - tree d = INTEGRAL_TYPE_P (TREE_TYPE (divisor)) ? divisor : divisor_cst;
> + tree d = INTEGRAL_TYPE_P (TREE_TYPE (op2)) ? op2 : divisor_cst;
>   new_divisor = wide_int_to_tree (TREE_TYPE (d), tree_log2 (d));
>}
>  else
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins.cc
> index 315d5ac4177..c1b28ebfe4e 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
> @@ -3444,6 +3444,10 @@ aarch64_const_binop (enum tree_code code, tree arg1, 
> tree arg2)
>signop sign = TYPE_SIGN (type);
>wi::overfl

Re: [PATCH 1/3] SVE intrinsics: Fold constant operands.

2024-08-30 Thread Richard Sandiford
Jennifer Schmitz  writes:
> This patch implements constant folding of binary operations for SVE intrinsics
> by calling the constant-folding mechanism of the middle-end for a given
> tree_code.
> In fold-const.cc, the code for folding vector constants was moved from
> const_binop to a new function vector_const_binop. This function takes a
> function pointer as argument specifying how to fold the vector elements.
> The code for folding operations where the first operand is a vector
> constant and the second argument is an integer constant was also moved
> into vector_const_binop to fold binary SVE intrinsics where the second
> operand is an integer (_n).
> In the aarch64 backend, the new function aarch64_const_binop was
> created, which - in contrast to int_const_binop - does not treat operations as
> overflowing. This function is passed as callback to vector_const_binop
> during gimple folding in intrinsic implementations.
> Because aarch64_const_binop calls poly_int_binop, the latter was made public.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz 
>
> gcc/
>   * config/aarch64/aarch64-sve-builtins.cc (aarch64_const_binop):
>   New function to fold binary SVE intrinsics without overflow.
>   * config/aarch64/aarch64-sve-builtins.h: Declare aarch64_const_binop.
>   * fold-const.h: Declare vector_const_binop.
>   * fold-const.cc (const_binop): Remove cases for vector constants.
>   (vector_const_binop): New function that folds vector constants
>   element-wise.
>   (int_const_binop): Remove call to wide_int_binop.
>   (poly_int_binop): Add call to wide_int_binop.
>
> From 2a773d8289b5ec5ab2f2e0d03cbaa35b48bc44b2 Mon Sep 17 00:00:00 2001
> From: Jennifer Schmitz 
> Date: Thu, 29 Aug 2024 04:35:49 -0700
> Subject: [PATCH 1/3] SVE intrinsics: Fold constant operands.
>
> This patch implements constant folding of binary operations for SVE intrinsics
> by calling the constant-folding mechanism of the middle-end for a given
> tree_code.
> In fold-const.cc, the code for folding vector constants was moved from
> const_binop to a new function vector_const_binop. This function takes a
> function pointer as argument specifying how to fold the vector elements.
> The code for folding operations where the first operand is a vector
> constant and the second argument is an integer constant was also moved
> into vector_const_binop to fold binary SVE intrinsics where the second
> operand is an integer (_n).
> In the aarch64 backend, the new function aarch64_const_binop was
> created, which - in contrast to int_const_binop - does not treat operations as
> overflowing. This function is passed as callback to vector_const_binop
> during gimple folding in intrinsic implementations.
> Because aarch64_const_binop calls poly_int_binop, the latter was made public.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz 
>
> gcc/
>   * config/aarch64/aarch64-sve-builtins.cc (aarch64_const_binop):
>   New function to fold binary SVE intrinsics without overflow.
>   * config/aarch64/aarch64-sve-builtins.h: Declare aarch64_const_binop.
>   * fold-const.h: Declare vector_const_binop.
>   * fold-const.cc (const_binop): Remove cases for vector constants.
>   (vector_const_binop): New function that folds vector constants
>   element-wise.
>   (int_const_binop): Remove call to wide_int_binop.
>   (poly_int_binop): Add call to wide_int_binop.
> ---
>  gcc/config/aarch64/aarch64-sve-builtins.cc |  20 +++
>  gcc/config/aarch64/aarch64-sve-builtins.h  |   1 +
>  gcc/fold-const.cc  | 192 +++--
>  gcc/fold-const.h   |   5 +
>  4 files changed, 131 insertions(+), 87 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins.cc
> index 5ca9ec32b69..315d5ac4177 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
> @@ -3432,6 +3432,26 @@ is_ptrue (tree v, unsigned int step)
> && vector_cst_all_same (v, step));
>  }
>  
> +/* Try to fold constant arguments arg1 and arg2 using the given tree_code.
> +   Operations are not treated as overflowing.  */
> +tree
> +aarch64_const_binop (enum tree_code code, tree arg1, tree arg2)
> +{
> +  if (poly_int_tree_p (arg1) && poly_int_tree_p (arg2))
> +{
> +  poly_wide_int poly_res;
> +  tree type = TREE_TYPE (arg1);
> +  signop sign = TYPE_SIGN (type);
> +  wi::overflow_type overflow = wi::OVF_NONE;
> +
> +  if (!poly_int_binop (poly_res, code, arg1, arg2, sign, &overflow))
> +return NULL_TREE;
> +  return force_fit_type (type, poly_res, false,
> + TREE_OVERFLOW (arg1) | TREE_OVERFLOW (arg2));
> +}
> +  return NULL_TREE;
> +}
> +

Re: [PATCH v7 1/2] aarch64: Add AdvSIMD faminmax intrinsics

2024-08-30 Thread Richard Sandiford
 writes:
> The AArch64 FEAT_FAMINMAX extension is optional from Armv9.2-a and
> mandatory from Armv9.5-a. It introduces instructions for computing the
> floating point absolute maximum and minimum of the two vectors element-wise.
>
> This patch introduces AdvSIMD faminmax intrinsics. The intrinsics of
> this extension are implemented as the following builtin functions:
> * vamax_f16
> * vamaxq_f16
> * vamax_f32
> * vamaxq_f32
> * vamaxq_f64
> * vamin_f16
> * vaminq_f16
> * vamin_f32
> * vaminq_f32
> * vaminq_f64
>
> We are defining a new way to add AArch64 AdvSIMD intrinsics by listing
> all the intrinsics in a .def file and then using that .def file to
> initialise various data structures. This would lead to more concise code
> and easier addition of the new AdvSIMD intrinsics in future.
>
> The faminmax intrinsics are defined using the new approach.

Thanks, this is looking almost ready from my POV, but some comments below.

>
> gcc/ChangeLog:
>
> * config/aarch64/aarch64-builtins.cc
>   (ENTRY): Macro to parse the contents of
> aarch64-simd-pragma-builtins.def.
>   (enum aarch64_builtins): New enum values for faminmax builtins
> via aarch64-simd-pragma-builtins.def.
>   (struct aarch64_pragma_builtins_data): Struct to hold data from
> aarch64-simd-pragma-builtins.def.
> (aarch64_init_pragma_builtins): New function to define pragma 
> builtins.
>   (aarch64_get_pragma_builtin): New function to get a row of
> aarch64_pragma_builtins, given code.
> (handle_arm_neon_h): Modify to call
> aarch64_init_pragma_builtins.
> (aarch64_general_check_builtin_call): Modify to check whether
> required flag is being used for pragma builtins.
> (aarch64_expand_pragma_builtin): New function to emit
> instructions of pragma builtins.
> (aarch64_general_expand_builtin): Modify to call
> aarch64_expand_pragma_builtin.
> * config/aarch64/aarch64-option-extensions.def
>   (AARCH64_OPT_EXTENSION): Introduce new flag for this
> extension.
> * config/aarch64/aarch64-simd.md
>   (@aarch64_): Instruction pattern for
> faminmax intrinsics.
> * config/aarch64/aarch64.h
>   (TARGET_FAMINMAX): Introduce new flag for this extension.
> * config/aarch64/iterators.md: New iterators and unspecs.
> * config/arm/types.md: Introduce neon_fp_aminmax attributes.
> * doc/invoke.texi: Document extension in AArch64 Options.
>   * config/aarch64/aarch64-simd-pragma-builtins.def: New file to
> list pragma builtins.

(It looks like contrib/gcc-changelog/git_check_commit.py is going
to complain about the formatting at commit time, due to the mixture
of spaces & tabs.)

>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/aarch64/simd/faminmax-builtins-no-flag.c: New test.
> * gcc.target/aarch64/simd/faminmax-builtins.c: New test.
> ---
>  gcc/config/aarch64/aarch64-builtins.cc|  84 +
>  .../aarch64/aarch64-option-extensions.def |   2 +
>  .../aarch64/aarch64-simd-pragma-builtins.def  |  31 +
>  gcc/config/aarch64/aarch64-simd.md|  11 ++
>  gcc/config/aarch64/aarch64.h  |   4 +
>  gcc/config/aarch64/iterators.md   |   9 ++
>  gcc/config/arm/types.md   |   6 +
>  gcc/doc/invoke.texi   |   2 +
>  .../aarch64/simd/faminmax-builtins-no-flag.c  |  10 ++
>  .../aarch64/simd/faminmax-builtins.c  | 115 ++
>  10 files changed, 274 insertions(+)
>  create mode 100644 gcc/config/aarch64/aarch64-simd-pragma-builtins.def
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/simd/faminmax-builtins-no-flag.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/faminmax-builtins.c
>
> diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
> b/gcc/config/aarch64/aarch64-builtins.cc
> index eb878b933fe..a4905dd0aae 100644
> --- a/gcc/config/aarch64/aarch64-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-builtins.cc
> @@ -757,6 +757,10 @@ typedef struct
>  #define VAR1(T, N, MAP, FLAG, A) \
>AARCH64_SIMD_BUILTIN_##T##_##N##A,
>  
> +#undef ENTRY
> +#define ENTRY(N, M, U, F) \
> +  AARCH64_##N,
> +
>  enum aarch64_builtins
>  {
>AARCH64_BUILTIN_MIN,
> @@ -829,6 +833,10 @@ enum aarch64_builtins
>AARCH64_RBIT,
>AARCH64_RBITL,
>AARCH64_RBITLL,
> +  /* Pragma builtins.  */
> +  AARCH64_PRAGMA_BUILTIN_START,
> +#include "aarch64-simd-pragma-builtins.def"
> +  AARCH64_PRAGMA_BUILTIN_END,
>/* System register builtins.  */
>AARCH64_RSR,
>AARCH64_RSRP,
> @@ -947,6 +955,7 @@ const char *aarch64_scalar_builtin_types[] = {
>  
>  extern GTY(()) aarch64_simd_type_info aarch64_simd_types[];
>  
> +#undef ENTRY
>  #define ENTRY(E, M, Q, G)  \
>{E, "__" #E, #G "__" #E, NULL_TREE, NULL_TREE, E_##M##mode, qualifier_##Q},
>  struct aarch64_simd_type_info aarch64_simd_types [] = {
> @@ -1547,6 +1556,50 @@ aarch64_init_simd_builtin_functions (bool 
> called_from_pragma)
>   

Re: [PATCH] [RFC] Add constant_multiple_if_divisor_nonzero_p

2024-08-30 Thread Richard Sandiford
Richard Biener  writes:
> With recent SLP vectorization patches I see RISC-V divison by zero
> for gfortran.dg/matmul_10.f90 and others in get_group_load_store_type
> which does
>
>   && can_div_trunc_p (group_size
>   * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - gap,
>   nunits, &tem, &remain)
>   && (known_eq (remain, 0u)
>   || (constant_multiple_p (nunits, remain, &num)
>   && (vector_vector_composition_type (vectype, num,
>   &half_vtype)
>   != NULL_TREE
> overrun_p = false;
>
> where for [2, 2] / [0, 2] the condition doesn't reflect what we
> are trying to test - that, when remain is zero or, when non-zero,
> nunits is a multiple of remain, we can avoid touching a gap via
> loading smaller pieces and vector composition.  Changing known_eq
> to maybe_eq wouldn't be correct I think since for example if it
> were [0, 4] then we cannot load smaller parts.  Note we know
> that remain is in [0, nunits - 1] given we compute it as
> X % nunits so maybe we can indeed use maybe_eq (remain, 0u) here
> in this particular case?

I don't think that would be safe.  In general, any case that can
happen for VLA is a case that could happen for VLS, if the compiler
had more information.  So I don't think it's safe to ignore nonzero
remainders for VLA if we wouldn't for VLS.  (E.g. in principle, that
range would allow nunits - 2 even for VLA, which might be zero for
the lowest vector length, but nonzero for others.)
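
Concretely, with nunits = [2, 2] (i.e. 2 + 2x) and remain = nunits - 2,
remain is [0, 2]: zero when x = 0 but 2, 4, ... for larger vector
lengths, so maybe_eq (remain, 0u) would hold even though the remainder
is nonzero at run time for those lengths.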

> The following adds constant_multiple_if_divisor_nonzero_p to
> express the intent of no interest in the 'multiple' value for
> the case B is zero.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> Is the reasoning that maybe_eq is "safe" sound in this particular
> case?  I think we should avoid adding 
> constant_multiple_if_divisor_nonzero_p if not absolutely necessary.

Yeah, seeing it written out, maybe it is a bit too confusing.  And when
we talked about this on irc yesterday, I was assuming that we could cope
with either possibility (zero remainder or extra vectors) dynamically.
I now realise that we need to distinguish between them statically instead.

On that basis, maybe we should just protect the constant_multiple_p test
with known_ne (remainder, 0).
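
I.e. (keeping the structure of the current condition, untested):

      && (known_eq (remain, 0u)
          || (known_ne (remain, 0u)
              && constant_multiple_p (nunits, remain, &num)
              && (vector_vector_composition_type (vectype, num,
                                                  &half_vtype)
                  != NULL_TREE)))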

Thanks, and sorry for suggesting a dead end.

Richard

>
> Thanks,
> Richard.
>
>   * poly-int.h (constant_multiple_if_divisor_nonzero_p): New.
>   * tree-vect-stmts.cc (get_group_load_store_type): Use it.
> ---
>  gcc/poly-int.h | 45 ++
>  gcc/tree-vect-stmts.cc |  3 ++-
>  2 files changed, 47 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/poly-int.h b/gcc/poly-int.h
> index 94708165961..29337513eb8 100644
> --- a/gcc/poly-int.h
> +++ b/gcc/poly-int.h
> @@ -1986,6 +1986,51 @@ constant_multiple_p (const poly_int &a,
>return true;
>  }
>  
> +/* Return true if A is a constant multiple of B, storing the
> +   multiple in *MULTIPLE if so.  The case of B being zero is
> +   excluded but it shall not be known equal to zero.  */
> +
> +template
> +inline bool
> +constant_multiple_if_divisor_nonzero_p (const poly_int &a,
> + const poly_int &b, Cm *multiple)
> +{
> +  typedef POLY_CAST (Ca, Cb) NCa;
> +  typedef POLY_CAST (Cb, Ca) NCb;
> +  typedef POLY_INT_TYPE (Ca) ICa;
> +  typedef POLY_INT_TYPE (Cb) ICb;
> +  typedef POLY_BINARY_COEFF (Ca, Cb) C;
> +
> +  C r = 0;
> +  if (b.coeffs[0] != ICb (0))
> +{
> +  if (NCa (a.coeffs[0]) % NCb (b.coeffs[0]) != 0)
> + return false;
> +  r = NCa (a.coeffs[0]) / NCb (b.coeffs[0]);
> +}
> +  for (unsigned int i = 1; i < N; ++i)
> +if (b.coeffs[i] == ICb (0))
> +  {
> + if (a.coeffs[i] != ICa (0))
> +   return false;
> +  }
> +else
> +  {
> + if (NCa (a.coeffs[i]) % NCb (b.coeffs[i]) != 0)
> +   return false;
> + if (r == 0)
> +   r = NCa (a.coeffs[i]) / NCb (b.coeffs[i]);
> + else if (NCa (a.coeffs[i]) / NCb (b.coeffs[i]) != r)
> +   return false;
> +  }
> +
> +  if (r == 0)
> +return false;
> +
> +  *multiple = r;
> +  return true;
> +}
> +
>  /* Return true if A is a constant multiple of B.  */
>  
>  template
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index f8bb637342e..2641833b3aa 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -2135,7 +2135,8 @@ get_group_load_store_type (vec_info *vinfo, 
> stmt_vec_info stmt_info,
> * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - gap,
> nunits, &tem, &remain)
> && (known_eq (remain, 0u)
> -   || (constant_multiple_p (nunits, remain, &num)
> +   || (constant_multiple_if_divisor_nonzero_p (nunits, remain,
> +

Re: [PATCH v5] Target-independent store forwarding avoidance.

2024-08-29 Thread Richard Sandiford
Manolis Tsamis  writes:
> On Fri, Aug 16, 2024 at 5:33 PM Richard Sandiford
>  wrote:
>> Manolis Tsamis  writes:
>> > +}
>> > +
>> > +  virtual unsigned int execute (function *) override;
>> > +}; // class pass_rtl_avoid_store_forwarding
>> > +
>> > +static unsigned int stats_sf_detected = 0;
>> > +static unsigned int stats_sf_avoided = 0;
>>
>> Could you instead structure the code as a class with these as member
>> variables?  I realise it's a bit over-the-top for just 2 variables,
>> but the pass might be expanded in future.
>>
> I could do that but it would require making most of the functions
> member functions (as these are incremented in
> process_store_forwarding)

Right.

> and by looking at other passes/GCC code this is like never done.  So I
> felt that it would be a bit unconventional and I haven't implemented
> it yet. Thoughts about that?

Most passes were written while the codebase was still C.  I think newer
passes have generally been written in the way I described.  Where possible,
it'd be better for new code to avoid using globals to communicate
information between functions, to reduce the work involved in any future
parallelisation effort.  (There have been tentative steps towards reducing
existing uses of globals in the past.)

Using classes also allows pass-level RAII.  I realise that doesn't matter
for the current code, but like I say, we should consider future extensions
too.
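
As a very rough sketch of what I mean (names invented, untested):

  class store_forwarding_analyzer
  {
  public:
    void process_bb (basic_block bb);
    void dump_stats (FILE *file);

  private:
    unsigned int stats_sf_detected = 0;
    unsigned int stats_sf_avoided = 0;
  };

  unsigned int
  pass_rtl_avoid_store_forwarding::execute (function *fn)
  {
    store_forwarding_analyzer analyzer;
    basic_block bb;
    FOR_EACH_BB_FN (bb, fn)
      analyzer.process_bb (bb);
    if (dump_file)
      analyzer.dump_stats (dump_file);
    return 0;
  }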

>> > +
>> > +  FOR_EACH_VEC_ELT (stores, i, it)
>> > + {
>> > +   fprintf (dump_file, "From: ");
>> > +   print_rtl_single (dump_file, it->store_insn);
>> > + }
>> > +
>> > +  fprintf (dump_file, "To: ");
>> > +  print_rtl_single (dump_file, load_insn);
>> > +
>> > +  if (load_elim)
>> > + fprintf (dump_file, "(Load elimination candidate)\n");
>> > +}
>> > +
>> > +  rtx dest;
>> > +  if (load_elim)
>> > +dest = gen_reg_rtx (load_inner_mode);
>> > +  else
>> > +dest = SET_DEST (load);
>> > +
>> > +  int move_to_front = -1;
>> > +  int total_cost = 0;
>> > +
>> > +  /* Check if we can emit bit insert instructions for all forwarded 
>> > stores.  */
>> > +  FOR_EACH_VEC_ELT (stores, i, it)
>> > +{
>> > +  it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem));
>> > +  rtx_insn *insns = NULL;
>> > +
>> > +  /* If we're eliminating the load then find the store with zero 
>> > offset
>> > +  and use it as the base register to avoid a bit insert if possible.  
>> > */
>> > +  if (load_elim && it->offset == 0
>> > +   && validate_subreg (GET_MODE (dest), GET_MODE (it->mov_reg),
>> > +   it->mov_reg, 0))
>> > + {
>> > +   start_sequence ();
>> > +
>> > +   /* We can use a paradoxical subreg to force this to a wider mode, 
>> > as
>> > +  the only use will be inserting the bits (i.e., we don't care 
>> > about
>> > +  the value of the higher bits).  */
>> > +   rtx ext0 = gen_rtx_SUBREG (GET_MODE (dest), it->mov_reg, 0);
>>
>> IMO it'd be safer to call lowpart_subreg and check whether the result
>> is null.  This would also remove the need to check validate_subreg
>> directly.
>>
> I did change this to lowpart_subreg, but note that validate_subreg
> looks to still be necessary (for AArch64). Otherwise I get various
> ICEs, e.g. when a SF -> TI case is hit, which was the original reason
> for validate_subreg.

Do you have a testcase?

Direct uses of gen_rtx_SUBREG are generally suspicious.  We should be
using higher-level generators in most cases.
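
For the hunk quoted above, an untested sketch of that shape would be:

  rtx ext0 = lowpart_subreg (GET_MODE (dest), it->mov_reg,
                             GET_MODE (it->mov_reg));
  if (ext0)
    {
      start_sequence ();
      rtx_insn *move0 = emit_move_insn (dest, ext0);
      if (recog_memoized (move0) >= 0)
        {
          insns = get_insns ();
          move_to_front = (int) i;
        }
      end_sequence ();
    }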

> I could have some help with that, because after the new changes a
> subreg related ICE also happens within store_bit_field when a DI ->
> V4SI case is hit. Afaik store_bit_field should just return NULL if it
> can't handle something so I don't really know how to address this ICE
> in the new version.

I don't think store_bit_field is expected to fail.  But yeah, if you have
a testcase, I can take a look.

>> > +   rtx_insn *move0 = emit_move_insn (dest, ext0);
>> > +   if (recog_memoized (move0) >= 0)
>> > + {
>> > +   insns = get_insns ();
>> > +   move_to_front = (int) i;
>> > + }
>>
>> Here too I'd expect emit_move_in

Re: [PING] [PATCH V3 09/10] c6x: Adjust dot-product backend patterns

2024-08-29 Thread Richard Sandiford
Victor Do Nascimento  writes:
> Hello,
>
> Gentle reminder for this simple renaming patch :)

OK, thanks.

Richard

> Thanks,
> Victor
>
> On 8/15/24 09:44, Victor Do Nascimento wrote:
>> Following the migration of the dot_prod optab from a direct to a
>> conversion-type optab, ensure all back-end patterns incorporate the
>> second machine mode into pattern names.
>> 
>> gcc/ChangeLog:
>> 
>>  * config/c6x/c6x.md (sdot_prodv2hi): Renamed to...
>>  (sdot_prodsiv2hi): ...this.
>> ---
>>   gcc/config/c6x/c6x.md | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/gcc/config/c6x/c6x.md b/gcc/config/c6x/c6x.md
>> index 5964dd69d0d..ea9ffe8b4e1 100644
>> --- a/gcc/config/c6x/c6x.md
>> +++ b/gcc/config/c6x/c6x.md
>> @@ -3082,7 +3082,7 @@ (define_insn "v2hi3"
>>   ;; Widening vector multiply and dot product.
>>   ;; See c6x-mult.md.in for the define_insn patterns
>>   
>> -(define_expand "sdot_prodv2hi"
>> +(define_expand "sdot_prodsiv2hi"
>> [(match_operand:SI 0 "register_operand" "")
>>  (match_operand:V2HI 1 "register_operand" "")
>>  (match_operand:V2HI 2 "register_operand" "")


Re: [PING] [PATCH V3 06/10] arc: Adjust dot-product backend patterns

2024-08-29 Thread Richard Sandiford
Victor Do Nascimento  writes:
> Hello,
>
> Gentle reminder for this simple renaming patch :)

OK, thanks.

Richard

> Thanks,
> Victor
>
> On 8/15/24 09:44, Victor Do Nascimento wrote:
>> Following the migration of the dot_prod optab from a direct to a
>> conversion-type optab, ensure all back-end patterns incorporate the
>> second machine mode into pattern names.
>> 
>> gcc/ChangeLog:
>> 
>>  * config/arc/simdext.md (sdot_prodv2hi): Renamed to...
>>  (sdot_prodsiv2hi): ...this.
>>  (udot_prodv2hi): Renamed to...
>>  (udot_prodsiv2hi): ...this.
>>  (sdot_prodv4hi): Renamed to...
>>  (sdot_prodv2siv4hi): ...this.
>>  (udot_prodv4hi): Renamed to...
>>  (udot_prodv2siv4hi): ...this.
>> ---
>>   gcc/config/arc/simdext.md | 8 
>>   1 file changed, 4 insertions(+), 4 deletions(-)
>> 
>> diff --git a/gcc/config/arc/simdext.md b/gcc/config/arc/simdext.md
>> index 4e51a237c3a..0696f0abb70 100644
>> --- a/gcc/config/arc/simdext.md
>> +++ b/gcc/config/arc/simdext.md
>> @@ -1643,7 +1643,7 @@ (define_insn "dmpyh"
>>   
>>   ;; We can use dmac as well here.  To be investigated which version
>>   ;; brings more.
>> -(define_expand "sdot_prodv2hi"
>> +(define_expand "sdot_prodsiv2hi"
>> [(match_operand:SI 0 "register_operand" "")
>>  (match_operand:V2HI 1 "register_operand" "")
>>  (match_operand:V2HI 2 "register_operand" "")
>> @@ -1656,7 +1656,7 @@ (define_expand "sdot_prodv2hi"
>>DONE;
>>   })
>>   
>> -(define_expand "udot_prodv2hi"
>> +(define_expand "udot_prodsiv2hi"
>> [(match_operand:SI 0 "register_operand" "")
>>  (match_operand:V2HI 1 "register_operand" "")
>>  (match_operand:V2HI 2 "register_operand" "")
>> @@ -1669,7 +1669,7 @@ (define_expand "udot_prodv2hi"
>>DONE;
>>   })
>>   
>> -(define_expand "sdot_prodv4hi"
>> +(define_expand "sdot_prodv2siv4hi"
>> [(match_operand:V2SI 0 "register_operand" "")
>>  (match_operand:V4HI 1 "register_operand" "")
>>  (match_operand:V4HI 2 "register_operand" "")
>> @@ -1688,7 +1688,7 @@ (define_expand "sdot_prodv4hi"
>>DONE;
>>   })
>>   
>> -(define_expand "udot_prodv4hi"
>> +(define_expand "udot_prodv2siv4hi"
>> [(match_operand:V2SI 0 "register_operand" "")
>>  (match_operand:V4HI 1 "register_operand" "")
>>  (match_operand:V4HI 2 "register_operand" "")


Re: [PATCH v6 1/2] aarch64: Add AdvSIMD faminmax intrinsics

2024-08-29 Thread Richard Sandiford
Kyrylo Tkachov  writes:
> Hi Saurabh,
>
>> On 29 Aug 2024, at 09:51, saurabh@arm.com wrote:
>> 
>> External email: Use caution opening links or attachments
>> 
>> 
>> The AArch64 FEAT_FAMINMAX extension is optional from Armv9.2-a and
>> mandatory from Armv9.5-a. It introduces instructions for computing the
>> floating point absolute maximum and minimum of the two vectors element-wise.
>> 
>> This patch introduces AdvSIMD faminmax intrinsics. The intrinsics of
>> this extension are implemented as the following builtin functions:
>> * vamax_f16
>> * vamaxq_f16
>> * vamax_f32
>> * vamaxq_f32
>> * vamaxq_f64
>> * vamin_f16
>> * vaminq_f16
>> * vamin_f32
>> * vaminq_f32
>> * vaminq_f64
>> 
>> We are defining a new way to add AArch64 AdvSIMD intrinsics by listing
>> all the intrinsics in a .def file and then using that .def file to
>> initialise various data structures. This would lead to more concise code
>> and easier addition of the new AdvSIMD intrinsics in future.
>> 
>> The faminmax intrinsics are defined using the new approach
>> 
>> gcc/ChangeLog:
>> 
>>* config/aarch64/aarch64-builtins.cc
>>(ENTRY): Macro to parse the contents of
>> aarch64-simd-pragma-builtins.def.
>>(enum aarch64_builtins): New enum values for faminmax builtins
>> via aarch64-simd-pragma-builtins.def.
>>(aarch64_init_pragma_builtins): New function to define pragma 
>> builtins.
>>(handle_arm_neon_h): Modify to call
>> aarch64_init_pragma_builtins.
>>(aarch64_general_check_builtin_call): Modify to check whether
>> required flag is being used for pragma builtins.
>>(aarch64_expand_pragma_builtin): New function to emit
>> instructions of pragma builtins.
>>(aarch64_general_expand_builtin): Modify to call
>> aarch64_expand_pragma_builtin.
>>* config/aarch64/aarch64-option-extensions.def
>>(AARCH64_OPT_EXTENSION): Introduce new flag for this
>> extension.
>>* config/aarch64/aarch64-simd.md
>>(@aarch64_): Instruction pattern for
>> faminmax intrinsics.
>>* config/aarch64/aarch64.h
>>(TARGET_FAMINMAX): Introduce new flag for this extension.
>>* config/aarch64/iterators.md: New iterators and unspecs.
>>* config/arm/types.md: Introduce neon_fp_aminmax attributes.
>>* doc/invoke.texi: Document extension in AArch64 Options.
>>* config/aarch64/aarch64-simd-pragma-builtins.def: New file to
>>  list pragma builtins.
>> 
>> gcc/testsuite/ChangeLog:
>> 
>>* gcc.target/aarch64/simd/faminmax-builtins-no-flag.c: New test.
>>* gcc.target/aarch64/simd/faminmax-builtins.c: New test.
>
> Sorry for the back-and-forth, but I just realized, why can’t we reuse the 
> existing aarch64_init_simd_intrinsics code in aarch64-builtins.cc 
> ?
> It seems that it already handles most of the registration code, except that 
> it doesn’t handle checking of arch extensions.
> I think we should aim to refactor this a bit so that we can use that 
> functionality.
> But I appreciate that this would extend the scope of this patch a bit too 
> much.
> So I’m okay with this going in now, but it would be good to clean this area 
> somewhat in a separate patch so that we can rely on just 
> aarch64-simd-builtins.def, potentially augmented with extension information.

Yeah, I agree we should consolidate this more.  I think the .def file
is a good long-term direction though, and it would be good to move the
existing builtins over to that.

As it stands, neither approach is a superset of the other.  The new
approach has the architecture requirements and unspec number, but the
old approach makes it easier to handle simplifications.  Neither approach
makes it particularly easy to encode the type signature.

So one of the tasks for the follow-on would be to find a good way of
expressing the type signature and a good way of handling things like
vget_high simplification (maybe via callbacks?).
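
To make that concrete: one option (purely hypothetical; the argument
list and names below are invented) would be to let the ENTRY lines
carry the signature directly, e.g.:

  /* Hypothetical extension of the current ENTRY macro.  */
  ENTRY (vamaxq_f32, float32x4_t, (float32x4_t, float32x4_t),
         UNSPEC_FAMAX, FAMINMAX)

with any folding/simplification hook registered separately, perhaps as
a callback keyed off the enum value.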

On the patch itself:

>> diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
>> b/gcc/config/aarch64/aarch64-builtins.cc
>> index eb878b933fe..61df394b881 100644
>> --- a/gcc/config/aarch64/aarch64-builtins.cc
>> +++ b/gcc/config/aarch64/aarch64-builtins.cc
>> @@ -757,6 +757,10 @@ typedef struct
>>  #define VAR1(T, N, MAP, FLAG, A) \
>>AARCH64_SIMD_BUILTIN_##T##_##N##A,
>>  
>> +#undef ENTRY
>> +#define ENTRY(N, M, U, F) \
>> +  AARCH64_##N,
>> +
>>  enum aarch64_builtins
>>  {
>>AARCH64_BUILTIN_MIN,
>> @@ -829,6 +833,10 @@ enum aarch64_builtins
>>AARCH64_RBIT,
>>AARCH64_RBITL,
>>AARCH64_RBITLL,
>> +  /* Pragma builtins.  */
>> +  AARCH64_PRAGMA_BUILTIN_START,
>> +#include "aarch64-simd-pragma-builtins.def"
>> +  AARCH64_PRAGMA_BUILTIN_END,
>>/* System register builtins.  */
>>AARCH64_RSR,
>>AARCH64_RSRP,
>> @@ -947,6 +955,7 @@ const char *aarch64_scalar_builtin_types[] = {
>>  
>>  extern GTY(()) aarch64_simd_type_info aarch64_simd_types[];
>>  
>> +#undef ENTRY
>>  #

[PATCH] Allow subregs around constant displacements [PR116516]

2024-08-28 Thread Richard Sandiford
This patch fixes a regression introduced by g:708ee71808ea61758e73.
x86_64 allows addresses of the form:

  (zero_extend:DI (subreg:SI (symbol_ref:DI "foo") 0))

Before the previous patch, a lax SUBREG check meant that we would
treat the subreg as a base and reload it into a base register.
But that wasn't what the target was expecting.  Instead we should
treat "foo" as a constant displacement, to match:

leal foo, 

After the patch, we recognised that "foo" isn't a base register,
but ICEd on it rather than handling it as a displacement.

With or without the recent patches, if the address had instead been:

  (zero_extend:DI
(subreg:SI (plus:DI (reg:DI R) (symbol_ref:DI "foo") 0)))

then we would have treated "foo" as the displacement and R as the base
or index, as expected.  The problem was that the code that does this was
rejecting all subregs of objects, rather than just subregs of variable
objects.

Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?

Richard


gcc/
PR middle-end/116516
* rtlanal.cc (strip_address_mutations): Allow subregs around
constant displacements.

gcc/testsuite/
PR middle-end/116516
* gcc.c-torture/compile/pr116516.c: New test.
---
 gcc/rtlanal.cc| 28 ---
 .../gcc.c-torture/compile/pr116516.c  | 10 +++
 2 files changed, 34 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.c-torture/compile/pr116516.c

diff --git a/gcc/rtlanal.cc b/gcc/rtlanal.cc
index 8afbb32f220..cb0c0c0d719 100644
--- a/gcc/rtlanal.cc
+++ b/gcc/rtlanal.cc
@@ -6467,10 +6467,30 @@ strip_address_mutations (rtx *loc, enum rtx_code 
*outer_code)
/* (and ... (const_int -X)) is used to align to X bytes.  */
loc = &XEXP (*loc, 0);
   else if (code == SUBREG
-   && !OBJECT_P (SUBREG_REG (*loc))
-   && subreg_lowpart_p (*loc))
-   /* (subreg (operator ...) ...) inside and is used for mode
-  conversion too.  */
+  && (!OBJECT_P (SUBREG_REG (*loc))
+  || CONSTANT_P (SUBREG_REG (*loc)))
+  && subreg_lowpart_p (*loc))
+   /* (subreg (operator ...) ...) inside AND is used for mode
+  conversion too.  It is also used for load-address operations
+  in which an extension can be done for free, such as:
+
+(zero_extend:DI
+  (subreg:SI (plus:DI (reg:DI R) (symbol_ref:DI "foo") 0)))
+
+  The latter usage also covers subregs of plain "displacements",
+  such as:
+
+(zero_extend:DI (subreg:SI (symbol_ref:DI "foo") 0))
+
+  The inner address should then be the symbol_ref, not the subreg,
+  similarly to the plus case above.
+
+  In contrast, the subreg in:
+
+(zero_extend:DI (subreg:SI (reg:DI R) 0))
+
+  should be treated as the base, since it should be replaced by
+  an SImode hard register during register allocation.  */
loc = &SUBREG_REG (*loc);
   else
return loc;
diff --git a/gcc/testsuite/gcc.c-torture/compile/pr116516.c 
b/gcc/testsuite/gcc.c-torture/compile/pr116516.c
new file mode 100644
index 000..c423ebfef5c
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/compile/pr116516.c
@@ -0,0 +1,10 @@
+extern void my_func (int);
+typedef struct {
+  int var;
+} info_t;
+extern void *_data_offs;
+void test()
+{
+  info_t *info = (info_t *) ((void *)((void *)1) + ((unsigned 
int)&_data_offs));
+  my_func(info->var == 0);
+}
-- 
2.25.1



[PATCH] aarch64: Assume zero gather/scatter set-up cost for -mtune=generic

2024-08-28 Thread Richard Sandiford
generic_vector_cost is not currently used by any SVE target
by default; it has to be specifically selected by -mtune=generic.
Its SVE costing has historically been somewhat idealised, since
it predated any actual SVE cores.  This seems like a useful
tradition to continue, at least for testing purposes.

The ideal case is that gathers and scatters do not induce a specific
one-off overhead.  This patch therefore sets the gather/scatter init
costs to zero.

This patch is necessary to switch -mtune=generic over to the
"new" vector costs.

Tested on aarch64-linux-gnu.  Pushed as previously agreed with
Tamar and Kyrill.

Richard


gcc/
* config/aarch64/tuning_models/generic.h (generic_sve_vector_cost):
Set gather_load_x32_init_cost and gather_load_x64_init_cost to 0.
---
 gcc/config/aarch64/tuning_models/generic.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/tuning_models/generic.h 
b/gcc/config/aarch64/tuning_models/generic.h
index 101969bdbb9..ee2f3ff4266 100644
--- a/gcc/config/aarch64/tuning_models/generic.h
+++ b/gcc/config/aarch64/tuning_models/generic.h
@@ -105,8 +105,8 @@ static const sve_vec_cost generic_sve_vector_cost =
   2, /* fadda_f64_cost  */
   4, /* gather_load_x32_cost  */
   2, /* gather_load_x64_cost  */
-  12, /* gather_load_x32_init_cost  */
-  4, /* gather_load_x64_init_cost  */
+  0, /* gather_load_x32_init_cost  */
+  0, /* gather_load_x64_init_cost  */
   1 /* scatter_store_elt_cost  */
 };
 
-- 
2.25.1



[pushed] aarch64: Fix gather x32/x64 selection

2024-08-28 Thread Richard Sandiford
The SVE gather and scatter costs are classified based on whether
they do 4 loads per 128 bits (x32) or 2 loads per 128 bits (x64).
The number after the "x" refers to the number of bits in each
"container".

However, the test for which to use was based on the element size
rather than the container size.  This meant that we'd use the
overly conservative x32 costs for VNx2SI gathers.  VNx2SI gathers
are really .D gathers in which the upper half of each extension
result is ignored.

This patch is necessary to switch -mtune=generic over to the
"new" vector costs.

Tested on aarch64-linux-gnu.  Pushed as previously agreed with
Tamar and Kyrill.

Richard


gcc/
* config/aarch64/aarch64.cc (aarch64_detect_vector_stmt_subtype)
(aarch64_vector_costs::add_stmt_cost): Use the x64 cost rather
than x32 cost for all VNx2 modes.
---
 gcc/config/aarch64/aarch64.cc | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 40dacfcf2e7..033ea61d3a8 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16819,7 +16819,8 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, 
vect_cost_for_stmt kind,
   && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
 {
   unsigned int nunits = vect_nunits_for_cost (vectype);
-  if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
+  /* Test for VNx2 modes, which have 64-bit containers.  */
+  if (known_eq (GET_MODE_NUNITS (TYPE_MODE (vectype)), aarch64_sve_vg))
return { sve_costs->gather_load_x64_cost, nunits };
   return { sve_costs->gather_load_x32_cost, nunits };
 }
@@ -17309,7 +17310,9 @@ aarch64_vector_costs::add_stmt_cost (int count, 
vect_cost_for_stmt kind,
  const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve;
  if (sve_costs)
{
- if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
+ /* Test for VNx2 modes, which have 64-bit containers.  */
+ if (known_eq (GET_MODE_NUNITS (TYPE_MODE (vectype)),
+   aarch64_sve_vg))
m_sve_gather_scatter_init_cost
  += sve_costs->gather_load_x64_init_cost;
  else
-- 
2.25.1



[pushed] aarch64: Add a test for zeroing <64bits>x2_t structures

2024-08-28 Thread Richard Sandiford
g:8d6c6fbc5271dde433998c09407b30e2cf195420 improved the code
generated for functions like:

  void test_s8 (int8x8x2_t *ptr) { *ptr = (int8x8x2_t) {}; }

Previously we would load zero from the constant pool, whereas
now we just use "stp xzr, xzr".  This patch adds a test for
this improvement.

Tested on aarch64-linux-gnu & pushed.

Richard


gcc/testsuite/
* gcc.target/aarch64/struct_zero.c: New test.
---
 .../gcc.target/aarch64/struct_zero.c  | 21 +++
 1 file changed, 21 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/struct_zero.c

diff --git a/gcc/testsuite/gcc.target/aarch64/struct_zero.c 
b/gcc/testsuite/gcc.target/aarch64/struct_zero.c
new file mode 100644
index 000..13f7236a4d2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/struct_zero.c
@@ -0,0 +1,21 @@
+/* { dg-options "-O2" } */
+
+#include 
+
+void test_s8 (int8x8x2_t *ptr) { *ptr = (int8x8x2_t) {}; }
+void test_u8 (uint8x8x2_t *ptr) { *ptr = (uint8x8x2_t) {}; }
+void test_p8 (poly8x8x2_t *ptr) { *ptr = (poly8x8x2_t) {}; }
+void test_s16 (int16x4x2_t *ptr) { *ptr = (int16x4x2_t) {}; }
+void test_u16 (uint16x4x2_t *ptr) { *ptr = (uint16x4x2_t) {}; }
+void test_p16 (poly16x4x2_t *ptr) { *ptr = (poly16x4x2_t) {}; }
+void test_bf16 (bfloat16x4x2_t *ptr) { *ptr = (bfloat16x4x2_t) {}; }
+void test_f16 (float16x4x2_t *ptr) { *ptr = (float16x4x2_t) {}; }
+void test_s32 (int32x2x2_t *ptr) { *ptr = (int32x2x2_t) {}; }
+void test_u32 (uint32x2x2_t *ptr) { *ptr = (uint32x2x2_t) {}; }
+void test_f32 (float32x2x2_t *ptr) { *ptr = (float32x2x2_t) {}; }
+void test_s64 (int64x1x2_t *ptr) { *ptr = (int64x1x2_t) {}; }
+void test_u64 (uint64x1x2_t *ptr) { *ptr = (uint64x1x2_t) {}; }
+void test_p64 (poly64x1x2_t *ptr) { *ptr = (poly64x1x2_t) {}; }
+void test_f64 (float64x1x2_t *ptr) { *ptr = (float64x1x2_t) {}; }
+
+/* { dg-final { scan-assembler-times {\tstp\txzr, xzr, \[x0\]\n} 15 } } */
-- 
2.25.1



[PATCH] Rename gimple_asm_input_p to gimple_asm_basic_p

2024-08-28 Thread Richard Sandiford
Following on from the earlier tree rename, this patch renames
gimple_asm_input_p to gimple_asm_basic_p, and similarly for
related names.

Tested on aarch64-linux-gnu.  OK to install?

Richard


gcc/
* doc/gimple.texi (gimple_asm_basic_p): Document.
(gimple_asm_set_basic): Likewise.
* gimple.h (GF_ASM_INPUT): Rename to...
(GF_ASM_BASIC): ...this.
(gimple_asm_set_input): Rename to...
(gimple_asm_set_basic): ...this.
(gimple_asm_input_p): Rename to...
(gimple_asm_basic_p): ...this.
* cfgexpand.cc (expand_asm_stmt): Update after above renaming.
* gimple.cc (gimple_asm_clobbers_memory_p): Likewise.
* gimplify.cc (gimplify_asm_expr): Likewise.
* ipa-icf-gimple.cc (func_checker::compare_gimple_asm): Likewise.
* tree-cfg.cc (stmt_can_terminate_bb_p): Likewise.
---
 gcc/cfgexpand.cc  |  2 +-
 gcc/doc/gimple.texi   |  9 +
 gcc/gimple.cc |  2 +-
 gcc/gimple.h  | 19 ++-
 gcc/gimplify.cc   |  2 +-
 gcc/ipa-icf-gimple.cc |  2 +-
 gcc/tree-cfg.cc   |  2 +-
 7 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
index 13f8c08d295..f32cf1b20c9 100644
--- a/gcc/cfgexpand.cc
+++ b/gcc/cfgexpand.cc
@@ -3121,7 +3121,7 @@ expand_asm_stmt (gasm *stmt)
 
   location_t locus = gimple_location (stmt);
 
-  if (gimple_asm_input_p (stmt))
+  if (gimple_asm_basic_p (stmt))
 {
   const char *s = gimple_asm_string (stmt);
   tree string = build_string (strlen (s), s);
diff --git a/gcc/doc/gimple.texi b/gcc/doc/gimple.texi
index 5f241b1c64f..d8aaca26049 100644
--- a/gcc/doc/gimple.texi
+++ b/gcc/doc/gimple.texi
@@ -1112,6 +1112,15 @@ Return the string representing the assembly instruction 
in
 @code{GIMPLE_ASM} @code{G}.
 @end deftypefn
 
+@deftypefn {GIMPLE function} bool gimple_asm_basic_p (const gasm *g)
+Return true if @code{G} is a basic asm rather than an extended asm.
+@end deftypefn
+
+@deftypefn {GIMPLE function} void gimple_asm_set_basic (gasm *g, bool basic_p)
+Mark asm statement @code{G} as a basic asm or an extended asm based on
+@code{BASIC_P}.
+@end deftypefn
+
 @deftypefn {GIMPLE function} bool gimple_asm_volatile_p (const gasm *g)
 Return true if @code{G} is an asm statement marked volatile.
 @end deftypefn
diff --git a/gcc/gimple.cc b/gcc/gimple.cc
index a9f968cb038..6e28cf291e1 100644
--- a/gcc/gimple.cc
+++ b/gcc/gimple.cc
@@ -2944,7 +2944,7 @@ gimple_asm_clobbers_memory_p (const gasm *stmt)
 }
 
   /* Non-empty basic ASM implicitly clobbers memory.  */
-  if (gimple_asm_input_p (stmt) && strlen (gimple_asm_string (stmt)) != 0)
+  if (gimple_asm_basic_p (stmt) && strlen (gimple_asm_string (stmt)) != 0)
 return true;
 
   return false;
diff --git a/gcc/gimple.h b/gcc/gimple.h
index bd315ffc2dd..ee986eaf153 100644
--- a/gcc/gimple.h
+++ b/gcc/gimple.h
@@ -135,7 +135,7 @@ enum gimple_rhs_class
 
Keep this list sorted.  */
 enum gf_mask {
-GF_ASM_INPUT   = 1 << 0,
+GF_ASM_BASIC   = 1 << 0,
 GF_ASM_VOLATILE= 1 << 1,
 GF_ASM_INLINE  = 1 << 2,
 GF_CALL_FROM_THUNK = 1 << 0,
@@ -4227,24 +4227,25 @@ gimple_asm_set_inline (gasm *asm_stmt, bool inline_p)
 }
 
 
-/* If INPUT_P is true, mark asm ASM_STMT as an ASM_INPUT.  */
+/* Mark whether asm ASM_STMT is a basic asm or an extended asm, based on
+   BASIC_P.  */
 
 inline void
-gimple_asm_set_input (gasm *asm_stmt, bool input_p)
+gimple_asm_set_basic (gasm *asm_stmt, bool basic_p)
 {
-  if (input_p)
-asm_stmt->subcode |= GF_ASM_INPUT;
+  if (basic_p)
+asm_stmt->subcode |= GF_ASM_BASIC;
   else
-asm_stmt->subcode &= ~GF_ASM_INPUT;
+asm_stmt->subcode &= ~GF_ASM_BASIC;
 }
 
 
-/* Return true if asm ASM_STMT is an ASM_INPUT.  */
+/* Return true if asm ASM_STMT is a basic asm rather than an extended asm.  */
 
 inline bool
-gimple_asm_input_p (const gasm *asm_stmt)
+gimple_asm_basic_p (const gasm *asm_stmt)
 {
-  return (asm_stmt->subcode & GF_ASM_INPUT) != 0;
+  return (asm_stmt->subcode & GF_ASM_BASIC) != 0;
 }
 
 
diff --git a/gcc/gimplify.cc b/gcc/gimplify.cc
index 3c84cfce5f7..f87c22edf5c 100644
--- a/gcc/gimplify.cc
+++ b/gcc/gimplify.cc
@@ -7340,7 +7340,7 @@ gimplify_asm_expr (tree *expr_p, gimple_seq *pre_p, 
gimple_seq *post_p)
   ASM_VOLATILE_P (expr)
   || noutputs == 0
   || labels);
-  gimple_asm_set_input (stmt, ASM_BASIC_P (expr));
+  gimple_asm_set_basic (stmt, ASM_BASIC_P (expr));
   gimple_asm_set_inline (stmt, ASM_INLINE_P (expr));
 
   gimplify_seq_add_stmt (pre_p, stmt);
diff --git a/gcc/ipa-icf-gimple.cc b/gcc/ipa-icf-gimple.cc
index 4c3174b68b6..31f6cdbda06 100644
--- a/gcc/ipa-icf-gimple.cc
+++ b/gcc/ipa-icf-gimple.cc
@@ -986,7 +986,7 @@ func_checker::compare_gimple_asm (const gasm *g1, const 
gasm *g2)
   if (gimple_asm_volatile_p (g1) != gimple_asm_

[PATCH] Rename ASM_INPUT_P to ASM_BASIC_P

2024-08-28 Thread Richard Sandiford
ASM_INPUT_P is so named because it causes the eventual rtl insn
pattern to be a top-level ASM_INPUT rather than an ASM_OPERANDS.
However, this name has caused confusion, partly due to earlier
documentation.  The name also sounds related to ASM_INPUTS but
is for a different piece of state.

This patch renames it to ASM_BASIC_P, with the inverse meaning
an extended asm.  ("Basic asm" is the term used in extend.texi.)
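
For readers less familiar with the terminology, here is a minimal C sketch
(not part of the patch; the function names and the AArch64-style mov are
purely illustrative) of the two kinds of asm statement the flag separates.
ASM_BASIC_P would be set only for the first:

  void
  basic_example (void)
  {
    /* Basic asm: an instruction string with no operands or clobbers.  */
    asm ("nop");
  }

  int
  extended_example (int x)
  {
    int y;
    /* Extended asm: has operands, so ASM_BASIC_P must be false.  */
    asm ("mov %0, %1" : "=r" (y) : "r" (x));
    return y;
  }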

Tested on aarch64-linux-gnu (all languages, as usual, but that's
especially relevant here).  OK to install?

Richard


gcc/
* doc/generic.texi (ASM_BASIC_P): Document.
* tree.h (ASM_INPUT_P): Rename to...
(ASM_BASIC_P): ...this.
(ASM_VOLATILE_P, ASM_INLINE_P): Reindent.
* gimplify.cc (gimplify_asm_expr): Update after above renaming.
* tree-core.h (tree_base): Likewise.

gcc/c/
* c-typeck.cc (build_asm_expr): Rename ASM_INPUT_P to ASM_BASIC_P.

gcc/cp/
* pt.cc (tsubst_stmt): Rename ASM_INPUT_P to ASM_BASIC_P.
* parser.cc (cp_parser_asm_definition): Likewise.

gcc/d/
* toir.cc (IRVisitor): Rename ASM_INPUT_P to ASM_BASIC_P.

gcc/jit/
* jit-playback.cc (playback::block::add_extended_asm):  Rename
ASM_INPUT_P to ASM_BASIC_P.

gcc/m2/
* gm2-gcc/m2block.cc (flush_pending_note): Rename ASM_INPUT_P
to ASM_BASIC_P.
* gm2-gcc/m2statement.cc (m2statement_BuildAsm): Likewise.
---
 gcc/c/c-typeck.cc |  2 +-
 gcc/cp/parser.cc  |  2 +-
 gcc/cp/pt.cc  |  2 +-
 gcc/d/toir.cc |  5 ++---
 gcc/doc/generic.texi  | 16 +++-
 gcc/gimplify.cc   |  2 +-
 gcc/jit/jit-playback.cc   |  2 +-
 gcc/m2/gm2-gcc/m2block.cc |  2 +-
 gcc/m2/gm2-gcc/m2statement.cc |  2 +-
 gcc/tree-core.h   |  2 +-
 gcc/tree.h|  6 +++---
 11 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/gcc/c/c-typeck.cc b/gcc/c/c-typeck.cc
index 094e41fa202..58b2724b39e 100644
--- a/gcc/c/c-typeck.cc
+++ b/gcc/c/c-typeck.cc
@@ -11672,7 +11672,7 @@ build_asm_expr (location_t loc, tree string, tree 
outputs, tree inputs,
 
   /* asm statements without outputs, including simple ones, are treated
  as volatile.  */
-  ASM_INPUT_P (args) = simple;
+  ASM_BASIC_P (args) = simple;
   ASM_VOLATILE_P (args) = (noutputs == 0);
   ASM_INLINE_P (args) = is_inline;
 
diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc
index 28ebf2beb60..58dd81d02f0 100644
--- a/gcc/cp/parser.cc
+++ b/gcc/cp/parser.cc
@@ -23126,7 +23126,7 @@ cp_parser_asm_definition (cp_parser* parser)
  if (TREE_CODE (temp) == CLEANUP_POINT_EXPR)
temp = TREE_OPERAND (temp, 0);
 
- ASM_INPUT_P (temp) = 1;
+ ASM_BASIC_P (temp) = 1;
}
}
   else
diff --git a/gcc/cp/pt.cc b/gcc/cp/pt.cc
index 24a6241d3a5..9ab635027e1 100644
--- a/gcc/cp/pt.cc
+++ b/gcc/cp/pt.cc
@@ -18930,7 +18930,7 @@ tsubst_stmt (tree t, tree args, tsubst_flags_t 
complain, tree in_decl)
tree asm_expr = tmp;
if (TREE_CODE (asm_expr) == CLEANUP_POINT_EXPR)
  asm_expr = TREE_OPERAND (asm_expr, 0);
-   ASM_INPUT_P (asm_expr) = ASM_INPUT_P (t);
+   ASM_BASIC_P (asm_expr) = ASM_BASIC_P (t);
   }
   break;
 
diff --git a/gcc/d/toir.cc b/gcc/d/toir.cc
index 9f5531ce5cd..a6848f2ffa2 100644
--- a/gcc/d/toir.cc
+++ b/gcc/d/toir.cc
@@ -1491,10 +1491,9 @@ public:
   outputs, inputs, clobbers, labels);
 SET_EXPR_LOCATION (exp, make_location_t (s->loc));
 
-/* If the extended syntax was not used, mark the ASM_EXPR as being an
-   ASM_INPUT expression instead of an ASM_OPERAND with no operands.  */
+/* Record whether the basic rather than extended syntax was used.  */
 if (s->args == NULL && s->clobbers == NULL)
-  ASM_INPUT_P (exp) = 1;
+  ASM_BASIC_P (exp) = 1;
 
 /* All asm statements are assumed to have a side effect.  As a future
optimization, this could be unset when building in release mode.  */
diff --git a/gcc/doc/generic.texi b/gcc/doc/generic.texi
index c596b7d44b2..3de394fd6e0 100644
--- a/gcc/doc/generic.texi
+++ b/gcc/doc/generic.texi
@@ -2095,11 +2095,17 @@ asm ("fsinx %1,%0" : "=f" (result) : "f" (angle));
 @end smallexample
 The first string is the @code{ASM_STRING}, containing the instruction
 template.  The next two strings are the output and inputs, respectively;
-this statement has no clobbers.  As this example indicates, ``plain''
-assembly statements are merely a special case of extended assembly
-statements; they have no cv-qualifiers, outputs, inputs, or clobbers.
-All of the strings will be @code{NUL}-terminated, and will contain no
-embedded @code{NUL}-characters.
+this statement has no clobbers.  All of the strings will be
+@code{NUL}-terminated, and will contain no embedded @code{NUL}-characters.
+
+@code{ASM_BASIC_P} indicates whether the assembly statement used the
+``basic'' syntax (as for the fir

Re: [PATCH] Tweak documentation of ASM_INPUT_P

2024-08-28 Thread Richard Sandiford
Richard Biener  writes:
> On Tue, Aug 27, 2024 at 7:17 PM Richard Sandiford
>  wrote:
>>
>> The documentation of ASM_INPUT_P implied that the flag has no
>> effect on ASM_EXPRs that have operands (and which therefore must be
>> extended asms).  In fact we require ASM_INPUT_P to be false for all
>> extended asms.
>
> Quite a bad name for a flag give there's also ASM_INPUTS ...

Yeah, true.  I'll post a couple of follow-on patches to rename it

>> Tested on aarch64-linux-gnu.  OK to install?
>
> OK.  I notice that generic.texi while talking about basic asms, does not
> mention ASM_INPUT_P.

...and to fix this too.

Thanks for the review, now pushed.

Richard

>
> Richard.
>
>> Richard
>>
>>
>> gcc/
>> * tree.h (ASM_INPUT_P): Fix documentation.
>> ---
>>  gcc/tree.h | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/gcc/tree.h b/gcc/tree.h
>> index 5dcbb2fb5dd..c501019717f 100644
>> --- a/gcc/tree.h
>> +++ b/gcc/tree.h
>> @@ -1428,8 +1428,9 @@ class auto_suppress_location_wrappers
>>  #define ASM_INPUTS(NODE)TREE_OPERAND (ASM_EXPR_CHECK (NODE), 2)
>>  #define ASM_CLOBBERS(NODE)  TREE_OPERAND (ASM_EXPR_CHECK (NODE), 3)
>>  #define ASM_LABELS(NODE)   TREE_OPERAND (ASM_EXPR_CHECK (NODE), 4)
>> -/* Nonzero if we want to create an ASM_INPUT instead of an
>> -   ASM_OPERAND with no operands.  */
>> +/* Nonzero if the asm is a basic asm, zero if it is an extended asm.
>> +   Basic asms use a plain ASM_INPUT insn pattern whereas extended asms
>> +   use an ASM_OPERANDS insn pattern.  */
>>  #define ASM_INPUT_P(NODE) (ASM_EXPR_CHECK (NODE)->base.static_flag)
>>  #define ASM_VOLATILE_P(NODE) (ASM_EXPR_CHECK (NODE)->base.public_flag)
>>  /* Nonzero if we want to consider this asm as minimum length and cost
>> --
>> 2.25.1
>>


[PATCH] Make some smallest_int_mode_for_size calls cope with failure

2024-08-28 Thread Richard Sandiford
smallest_int_mode_for_size now returns an optional mode rather
than aborting on failure.  This patch adjusts a couple of callers
so that they fail gracefully when no mode exists.

There should be no behavioural change, since anything that triggers
the new return paths would previously have aborted.  I just think
this is how the code would have been written if the option had been
available earlier.
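
For illustration, the caller-side idiom the patch moves to looks like this
(a sketch only; BITS stands for whatever size the caller has to hand):

  scalar_int_mode mode;
  if (!smallest_int_mode_for_size (bits).exists (&mode))
    return NULL_RTX;  /* no integer mode is wide enough: bail out */

rather than the previous:

  scalar_int_mode mode
    = smallest_int_mode_for_size (bits).require ();  /* would abort if no mode existed */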

Tested on aarch64-linux-gnu.  OK to install?

Richard


gcc/
* dse.cc (find_shift_sequence): Allow smallest_int_mode_for_size
to fail.
* optabs.cc (expand_twoval_binop_libfunc): Likewise.
---
 gcc/dse.cc| 16 
 gcc/optabs.cc |  6 --
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/gcc/dse.cc b/gcc/dse.cc
index c3feff06f86..75825a44cb9 100644
--- a/gcc/dse.cc
+++ b/gcc/dse.cc
@@ -1717,12 +1717,12 @@ dump_insn_info (const char * start, insn_info_t 
insn_info)
line up, we need to extract the value from lower part of the rhs of
the store, shift it, and then put it into a form that can be shoved
into the read_insn.  This function generates a right SHIFT of a
-   value that is at least ACCESS_SIZE bytes wide of READ_MODE.  The
+   value that is at least ACCESS_BYTES bytes wide of READ_MODE.  The
shift sequence is returned or NULL if we failed to find a
shift.  */
 
 static rtx
-find_shift_sequence (poly_int64 access_size,
+find_shift_sequence (poly_int64 access_bytes,
 store_info *store_info,
 machine_mode read_mode,
 poly_int64 shift, bool speed, bool require_cst)
@@ -1734,11 +1734,11 @@ find_shift_sequence (poly_int64 access_size,
   /* If a constant was stored into memory, try to simplify it here,
  otherwise the cost of the shift might preclude this optimization
  e.g. at -Os, even when no actual shift will be needed.  */
+  auto access_bits = access_bytes * BITS_PER_UNIT;
   if (store_info->const_rhs
-  && known_le (access_size, GET_MODE_SIZE (MAX_MODE_INT)))
+  && known_le (access_bytes, GET_MODE_SIZE (MAX_MODE_INT))
+  && smallest_int_mode_for_size (access_bits).exists (&new_mode))
 {
-  auto new_mode = smallest_int_mode_for_size
-   (access_size * BITS_PER_UNIT).require ();
   auto byte = subreg_lowpart_offset (new_mode, store_mode);
   rtx ret
= simplify_subreg (new_mode, store_info->const_rhs, store_mode, byte);
@@ -1810,7 +1810,7 @@ find_shift_sequence (poly_int64 access_size,
}
}
 
-  if (maybe_lt (GET_MODE_SIZE (new_mode), access_size))
+  if (maybe_lt (GET_MODE_SIZE (new_mode), access_bytes))
continue;
 
   new_reg = gen_reg_rtx (new_mode);
@@ -1839,8 +1839,8 @@ find_shift_sequence (poly_int64 access_size,
 of the arguments and could be precomputed.  It may
 not be worth doing so.  We could precompute if
 worthwhile or at least cache the results.  The result
-technically depends on both SHIFT and ACCESS_SIZE,
-but in practice the answer will depend only on ACCESS_SIZE.  */
+technically depends on both SHIFT and ACCESS_BYTES,
+but in practice the answer will depend only on ACCESS_BYTES.  */
 
   if (cost > COSTS_N_INSNS (1))
continue;
diff --git a/gcc/optabs.cc b/gcc/optabs.cc
index ded9cc3d947..2bcb3f7b47a 100644
--- a/gcc/optabs.cc
+++ b/gcc/optabs.cc
@@ -2551,8 +2551,10 @@ expand_twoval_binop_libfunc (optab binoptab, rtx op0, 
rtx op1,
 
   /* The value returned by the library function will have twice as
  many bits as the nominal MODE.  */
-  libval_mode
-= smallest_int_mode_for_size (2 * GET_MODE_BITSIZE (mode)).require ();
+  auto return_size = 2 * GET_MODE_BITSIZE (mode);
+  if (!smallest_int_mode_for_size (return_size).exists (&libval_mode))
+return false;
+
   start_sequence ();
   libval = emit_library_call_value (libfunc, NULL_RTX, LCT_CONST,
libval_mode,
-- 
2.25.1



Re: [PATCH v2 2/5] testsuite: Add scan-ltrans-rtl* for use in dg-final [PR116140]

2024-08-28 Thread Richard Sandiford
Alex Coplan  writes:
> On 28/08/2024 11:53, Richard Sandiford wrote:
>> Alex Coplan  writes:
>> > Hi,
>> >
>> > This is a v2 of:
>> > https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659966.html
>> > which is rebased on top of Richard S's patch to reduce the cut-and-paste in
>> > scanltranstree.exp (thanks again for doing that).
>> >
>> > Tested on aarch64-linux-gnu, OK for trunk?
>> >
>> > Thanks,
>> > Alex
>> >
>> > -- >8 --
>> >
>> > This extends the scan-ltrans-tree* helpers to create RTL variants.  This
>> > is needed to check the behaviour of an RTL pass under LTO.
>> >
>> > gcc/ChangeLog:
>> >
>> >PR libstdc++/116140
>> >* doc/sourcebuild.texi: Document ltrans-rtl value of kind for
>> >scan--dump*.
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> >PR libstdc++/116140
>> >* lib/scanltranstree.exp (scan-ltrans-rtl-dump): New.
>> >(scan-ltrans-rtl-dump-not): New.
>> >(scan-ltrans-rtl-dump-dem): New.
>> >(scan-ltrans-rtl-dump-dem-not): New.
>> >(scan-ltrans-rtl-dump-times): New.
>> 
>> The patch only contains the gcc/testsuite changes, but those are ok
>> for trunk, thanks.
>
> Gah, sorry -- those got lost in the rebase.  Is it OK to commit this
> together with the doc changes included as per the previous patch?

Yeah, thanks.

Richard


Re: [PATCH v2 2/5] testsuite: Add scan-ltrans-rtl* for use in dg-final [PR116140]

2024-08-28 Thread Richard Sandiford
Alex Coplan  writes:
> Hi,
>
> This is a v2 of:
> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659966.html
> which is rebased on top of Richard S's patch to reduce the cut-and-paste in
> scanltranstree.exp (thanks again for doing that).
>
> Tested on aarch64-linux-gnu, OK for trunk?
>
> Thanks,
> Alex
>
> -- >8 --
>
> This extends the scan-ltrans-tree* helpers to create RTL variants.  This
> is needed to check the behaviour of an RTL pass under LTO.
>
> gcc/ChangeLog:
>
>   PR libstdc++/116140
>   * doc/sourcebuild.texi: Document ltrans-rtl value of kind for
>   scan--dump*.
>
> gcc/testsuite/ChangeLog:
>
>   PR libstdc++/116140
>   * lib/scanltranstree.exp (scan-ltrans-rtl-dump): New.
>   (scan-ltrans-rtl-dump-not): New.
>   (scan-ltrans-rtl-dump-dem): New.
>   (scan-ltrans-rtl-dump-dem-not): New.
>   (scan-ltrans-rtl-dump-times): New.

The patch only contains the gcc/testsuite changes, but those are ok
for trunk, thanks.

Richard

> ---
>  gcc/testsuite/lib/scanltranstree.exp | 80 +---
>  1 file changed, 37 insertions(+), 43 deletions(-)
>
> diff --git a/gcc/testsuite/lib/scanltranstree.exp 
> b/gcc/testsuite/lib/scanltranstree.exp
> index bc6e02dc369..a7d4de3765f 100644
> --- a/gcc/testsuite/lib/scanltranstree.exp
> +++ b/gcc/testsuite/lib/scanltranstree.exp
> @@ -19,50 +19,44 @@
>  
>  load_lib scandump.exp
>  
> -# The first item in the list is an LTO equivalent of the second item
> -# in the list; see the documentation of the second item for details.
> -foreach { name scan type suffix } {
> -scan-ltrans-tree-dump scan-dump ltrans-tree t
> -scan-ltrans-tree-dump-not scan-dump-not ltrans-tree t
> -scan-ltrans-tree-dump-dem scan-dump-dem ltrans-tree t
> -scan-ltrans-tree-dump-dem-not scan-dump-dem-not ltrans-tree t
> -} {
> -eval [string map [list @NAME@ $name \
> -@SCAN@ $scan \
> -@TYPE@ $type \
> -@SUFFIX@ $suffix] {
> -proc @NAME@ { args } {
> - if { [llength $args] < 2 } {
> - error "@NAME@: too few arguments"
> - return
> - }
> - if { [llength $args] > 3 } {
> - error "@NAME@: too many arguments"
> - return
> +# Define scan-ltrans-{tree,rtl}-dump{,-not,-dem,-dem-not}.  These are LTO
> +# variants of the corresponding functions without -ltrans in the name.
> +foreach ir { tree rtl } {
> +foreach modifier { {} -not -dem -dem-not } {
> + eval [string map [list @NAME@ scan-ltrans-$ir-dump$modifier \
> +@SCAN@ scan$modifier \
> +@TYPE@ ltrans-$ir \
> +@SUFFIX@ [string index $ir 0]] {
> + proc @NAME@ { args } {
> + if { [llength $args] < 2 } {
> + error "@NAME@: too few arguments"
> + return
> + }
> + if { [llength $args] > 3 } {
> + error "@NAME@: too many arguments"
> + return
> + }
> + if { [llength $args] >= 3 } {
> + @SCAN@ @TYPE@ [lindex $args 0] \
> + "\[0-9\]\[0-9\]\[0-9\]@SUFFIX@.[lindex $args 1]" \
> + ".ltrans0.ltrans" \
> + [lindex $args 2]
> + } else {
> + @SCAN@ @TYPE@ [lindex $args 0] \
> + "\[0-9\]\[0-9\]\[0-9\]@SUFFIX@.[lindex $args 1]" \
> + ".ltrans0.ltrans"
> + }
>   }
> - if { [llength $args] >= 3 } {
> - @SCAN@ @TYPE@ [lindex $args 0] \
> - "\[0-9\]\[0-9\]\[0-9\]@SUFFIX@.[lindex $args 1]" \
> - ".ltrans0.ltrans" \
> - [lindex $args 2]
> - } else {
> - @SCAN@ @TYPE@ [lindex $args 0] \
> - "\[0-9\]\[0-9\]\[0-9\]@SUFFIX@.[lindex $args 1]" \
> - ".ltrans0.ltrans"
> - }
> -}
> -}]
> + }]
> +}
>  }
>  
> -# The first item in the list is an LTO equivalent of the second item
> -# in the list; see the documentation of the second item for details.
> -foreach { name scan type suffix } {
> -scan-ltrans-tree-dump-times scan-dump-times ltrans-tree t
> -} {
> -eval [string map [list @NAME@ $name \
> -@SCAN@ $scan \
> -@TYPE@ $type \
> -@SUFFIX@ $suffix] {
> +# Define scan-ltrans-{tree,rtl}-dump-times.  These are LTO variants of the
> +# corresponding functions without -ltrans in the name.
> +foreach ir { tree rtl } {
> +eval [string map [list @NAME@ scan-ltrans-$ir-dump-times \
> +@TYPE@ ltrans-$ir \
> +@SUFFIX@ [string index $ir 0]] {
>   proc @NAME@ { args } {
>   if { [llength $args] < 3 } {
>   error "@NAME@: too few arguments"
> @@ -73,11 +67,11 @@ foreach { name scan type suf

Re: [RFC][PATCH] AArch64: Remove AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS

2024-08-28 Thread Richard Sandiford
Tamar Christina  writes:
>>  The patch was bootstrapped and regtested on aarch64-linux-gnu:
>>  No problems bootstrapping, but several test files (in aarch64-sve.exp:
>>  gather_load_extend_X.c
>>  where X is 1 to 4, strided_load_2.c, strided_store_2.c) fail because of 
>>  small
>>  differences
>>  in codegen that make some of the scan-assembler-times tests fail.
>> 
>>  Kyrill suggested to add a -fvect-cost-model=unlimited flag to these 
>>  tests and
>> >> add
>>  some
>> >>>
>> >>> I don't personally like unlimited here as unlimited means just vectorize 
>> >>> at any
>> >>> cost.  This means that costing between modes are also disabled. A lot of 
>> >>> these
>> >>> testcases are intended to test not just that we vectorize but that we 
>> >>> vectorize
>> >>> with efficient code.
>> >>>
>> >>> I'd prefer to use -fvect-cost-model=dynamic if that fixes the testcases.
>> >>
>> >> Yeah, I don't think -fvect-cost-model=unlimited would work for the
>> >> gather_load_extend_X.c tests, since we require the cost model to decide
>> >> when to use extending loads vs loading a full vector and unpacking.
>>
>> I had suggested using -fvect-cost-model=unlimited here because I thought 
>> these
>> tests wanted to test the capability of GCC to detect and generate the 
>> particular SVE
>> feature (gather load extends) for all supported data types regardless of
>> profitability.
>
> I think the problem only specifically for the load_extend tests, because 
> unlimited also
> disables SVE vs SVE comparisons wrt to changes to VF, so not just Adv. SIMD 
> vs SVE.
> That is while it would generate a gather, it may choose to instead of gather 
> from
> B -> D, do B -> S and then extend the results.  Because this has a higher VF.
>
> Without the cross SVE mode comparisons it wouldn't know that the extensions 
> would
> actually slow it down.

Right.

>> If the tests are intended to also make the right profitability decisions for 
>> the generic
>> tuning model then I agree using -fvect-cost-model=unlimited is not 
>> appropriate
>> here, though I do think that it’d be useful to fix the backend vector cost 
>> model
>> hooks to honor -fvect-cost-model=unlimited and not limit generation of
>> gather/scatter in that case. What it should do for the SVE vs Neon decisions 
>> is an
>> open question.

FTR, most of the tests are instead testing specific code generation
strategies, regardless of whether those strategies are good for
particular real cores.  So forcing the choice is ok in principle.
It's just that -fvect-cost-model=unlimited wouldn't do that in this case.

>> >> [...tries patch...]
>> >>
>> >> It seems like three main things are contributing to the difference:
>> >>
>> >> 1. we now cost 0 for a scalar prologue extension from a loaded value
>> >> 2. we now cost gathers & scatters based on gather_load_x32/x64_cost
>> >> 3. we now apply a large one-off cost for gathers (Tamar's recent change)
>> >>
>> >> (1) is expected.
>> >>
>> >> (2) partly looks like a latent bug.  We're using the x32 cost for
>> >> VNx2QI->VNx2SI, even though those are really .B->.D loads.
>> >>
>> >> @@ -16819,7 +16811,7 @@ aarch64_detect_vector_stmt_subtype (vec_info
>> >> *vinfo, vect_cost_for_stmt kind,
>> >>   && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) ==
>> >> VMAT_GATHER_SCATTER)
>> >> {
>> >>   unsigned int nunits = vect_nunits_for_cost (vectype);
>> >> -  if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
>> >> +  if (known_eq (GET_MODE_NUNITS (TYPE_MODE (vectype)),
>> >> aarch64_sve_vg))
>> >>  return { sve_costs->gather_load_x64_cost, nunits };
>> >>   return { sve_costs->gather_load_x32_cost, nunits };
>> >> }
>> >>
>> >> fixes that.
>>
>> Would you be willing to test and push that to trunk to get it out of the way?

Sure.

>> >> (3) is interesting.  generic_vector_cost isn't used by default for any
>> >> SVE CPU, or any -march=...+sve.  So the question is: should we treat it
>> >> as "architectural intent"/somewhat idealised?  Or should we try to make
>> >> it produce good code for existing SVE cores, in which case it would
>> >> overlap quite a lot with generic_armv8_a and generic_armv9_a.
>>
>> >>
>> >> If the former, we could set gather_load_x32_init_cost and
>> >> gather_load_x64_init_cost to 0 for generic_sve_vector_cost
>> >> (and nothing else, so that generic_armv*_a are unaffected).
>>
>> I don’t have strong opinions on this point but if Tamar is concerned about
>> deviating too much from the known-good Advanced SIMD generic tuning we have
>> now then we should aim for minimal codegen changes in that?
>
> No I'm fine with Richard's suggestion.  The only way you'd be able to get 
> this model
> to fire is with -march=xxx+sve -mtune=generic, at which point I'm fine with
> assuming
> gathers have no initial overhead.

OK, great.  Should I do this as well?  I suppose it's a separate patch
from removing the "new cost model" flag.

Richard


[PATCH] Tweak documentation of ASM_INPUT_P

2024-08-27 Thread Richard Sandiford
The documentation of ASM_INPUT_P implied that the flag has no
effect on ASM_EXPRs that have operands (and which therefore must be
extended asms).  In fact we require ASM_INPUT_P to be false for all
extended asms.

Tested on aarch64-linux-gnu.  OK to install?

Richard


gcc/
* tree.h (ASM_INPUT_P): Fix documentation.
---
 gcc/tree.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/gcc/tree.h b/gcc/tree.h
index 5dcbb2fb5dd..c501019717f 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -1428,8 +1428,9 @@ class auto_suppress_location_wrappers
 #define ASM_INPUTS(NODE)TREE_OPERAND (ASM_EXPR_CHECK (NODE), 2)
 #define ASM_CLOBBERS(NODE)  TREE_OPERAND (ASM_EXPR_CHECK (NODE), 3)
 #define ASM_LABELS(NODE)   TREE_OPERAND (ASM_EXPR_CHECK (NODE), 4)
-/* Nonzero if we want to create an ASM_INPUT instead of an
-   ASM_OPERAND with no operands.  */
+/* Nonzero if the asm is a basic asm, zero if it is an extended asm.
+   Basic asms use a plain ASM_INPUT insn pattern whereas extended asms
+   use an ASM_OPERANDS insn pattern.  */
 #define ASM_INPUT_P(NODE) (ASM_EXPR_CHECK (NODE)->base.static_flag)
 #define ASM_VOLATILE_P(NODE) (ASM_EXPR_CHECK (NODE)->base.public_flag)
 /* Nonzero if we want to consider this asm as minimum length and cost
-- 
2.25.1



Re: [PATCH v4] Extend check-function-bodies to allow label and directives

2024-08-27 Thread Richard Sandiford
Andreas Schwab  writes:
> On Aug 27 2024, H.J. Lu wrote:
>
>> diff --git a/gcc/testsuite/gcc.target/i386/pr116174.c 
>> b/gcc/testsuite/gcc.target/i386/pr116174.c
>> index 8877d0b51af..686aeb9ff31 100644
>> --- a/gcc/testsuite/gcc.target/i386/pr116174.c
>> +++ b/gcc/testsuite/gcc.target/i386/pr116174.c
>> @@ -1,6 +1,20 @@
>>  /* { dg-do compile { target *-*-linux* } } */
>> -/* { dg-options "-O2 -fcf-protection=branch" } */
>> +/* { dg-options "-O2 -g0 -fcf-protection=branch" } */
>> +/* Keep labels and directives ('.p2align', '.cfi_startproc').
>> +/* { dg-final { check-function-bodies "**" "" "" { target "*-*-*" } 
>> {^\t?\.}  } } */
>
> This has a "nested" comment (line 3 missing comment end).

The patch is ok with that fixed.

Thanks,
Richard


Re: [PATCH v2] Extend check-function-bodies to allow label and directives

2024-08-27 Thread Richard Sandiford
"H.J. Lu"  writes:
>> >   append function_regexp ")"
>> >   } elseif { [string equal $line "..."] } {
>> >   append function_regexp ".*"
>> > + } elseif { [regexp "^.L.*" $line] } {
>>
>> {^\.L} would be more precise than "^.L.*".
>
> I tried  {^\.L}.  It didn't work.  I used "^.L" in v3.

Why didn't it work though?  "^.L.*" matches "ALL" as well as ".L".

Richard


Re: [RFC][PATCH] AArch64: Remove AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS

2024-08-27 Thread Richard Sandiford
Tamar Christina  writes:
> Hi Jennifer,
>
>> -Original Message-
>> From: Jennifer Schmitz 
>> Sent: Friday, August 23, 2024 1:07 PM
>> To: gcc-patches@gcc.gnu.org
>> Cc: Richard Sandiford ; Kyrylo Tkachov
>> 
>> Subject: [RFC][PATCH] AArch64: Remove
>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>
>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>> tunable and
>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
>> default.

Thanks for doing this.  This has been on my TODO list ever since the
tunable was added.

The history is that these "new" costs were originally added in stage 4
of GCC 11 for Neoverse V1.  Since the costs were added so late, it wasn't
appropriate to change the behaviour for any other core.  All the new code
was therefore gated on this option.

The new costs do two main things:

(1) use throughput-based calculations where available, including to choose
between Advanced SIMD and SVE

(2) try to make the latency-based costs more precise, by looking more closely
at the provided stmt_info

Old cost models won't be affected by (1) either way, since they don't
provide any throughput information.  But they should in principle benefit
from (2).  So...

>> To that end, the function aarch64_use_new_vector_costs_p and its uses were
>> removed. Additionally, guards were added prevent nullpointer dereferences of
>> fields in cpu_vector_cost.
>>
>
> I'm not against this change, but it does mean that we now switch old Adv. SIMD
> cost models as well to the new throughput based cost models.  That means that
> -mcpu=generic now behaves differently, and -mcpu=neoverse-n1 and I think
> some distros explicitly use this (I believe yocto for instance does).

...it shouldn't mean that we start using throughput-based models for
cortexa53 etc., since there's no associated issue info.

> Have we validated that the old generic cost model still behaves sensibly with 
> this change?
>
>> The patch was bootstrapped and regtested on aarch64-linux-gnu:
>> No problems bootstrapping, but several test files (in aarch64-sve.exp:
>> gather_load_extend_X.c
>> where X is 1 to 4, strided_load_2.c, strided_store_2.c) fail because of small
>> differences
>> in codegen that make some of the scan-assembler-times tests fail.
>>
>> Kyrill suggested to add a -fvect-cost-model=unlimited flag to these tests 
>> and add
>> some
>
> I don't personally like unlimited here as unlimited means just vectorize at 
> any
> cost.  This means that costing between modes are also disabled. A lot of these
> testcases are intended to test not just that we vectorize but that we 
> vectorize
> with efficient code.
>
> I'd prefer to use -fvect-cost-model=dynamic if that fixes the testcases.

Yeah, I don't think -fvect-cost-model=unlimited would work for the
gather_load_extend_X.c tests, since we require the cost model to decide
when to use extending loads vs loading a full vector and unpacking.

[...tries patch...]

It seems like three main things are contributing to the difference:

1. we now cost 0 for a scalar prologue extension from a loaded value
2. we now cost gathers & scatters based on gather_load_x32/x64_cost
3. we now apply a large one-off cost for gathers (Tamar's recent change)

(1) is expected.

(2) partly looks like a latent bug.  We're using the x32 cost for
VNx2QI->VNx2SI, even though those are really .B->.D loads.

@@ -16819,7 +16811,7 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, 
vect_cost_for_stmt kind,
   && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
 {
   unsigned int nunits = vect_nunits_for_cost (vectype);
-  if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
+  if (known_eq (GET_MODE_NUNITS (TYPE_MODE (vectype)), aarch64_sve_vg))
return { sve_costs->gather_load_x64_cost, nunits };
   return { sve_costs->gather_load_x32_cost, nunits };
 }

fixes that.

(3) is interesting.  generic_vector_cost isn't used by default for any
SVE CPU, or any -march=...+sve.  So the question is: should we treat it
as "architectural intent"/somewhat idealised?  Or should we try to make
it produce good code for existing SVE cores, in which case it would
overlap quite a lot with generic_armv8_a and generic_armv9_a.

If the former, we could set gather_load_x32_init_cost and
gather_load_x64_init_cost to 0 for generic_sve_vector_cost
(and nothing else, so that generic_armv*_a are unaffected).

On the patch:

> @@ -16733,7 +16723,8 @@ aarch64_in_loop_reduction_latency (vec_info *vinfo, 
> stmt_vec_info stmt_info,
>  {
>const cpu_v

Re: [PATCH v2] Extend check-function-bodies to allow label and directives

2024-08-27 Thread Richard Sandiford
"H.J. Lu"  writes:
> As PR target/116174 shown, we may need to verify labels and the directive
> order.  Extend check-function-bodies to support matched output lines to
> allow label and directives.
>
> gcc/
>
>   * doc/sourcebuild.texi (check-function-bodies): Add an optional
>   argument for matched output lines.
>
> gcc/testsuite/
>
>   * gcc.target/i386/pr116174.c: Use check-function-bodies.
>   * lib/scanasm.exp (parse_function_bodies): Append the line if
>   $up_config(matched) matches the line.
>   (check-function-bodies): Add an argument for matched.  Set
>   up_config(matched) to $matched.  Append the expected line without
>   $config(line_prefix) to function_regexp if it starts with ".L".
>
> Signed-off-by: H.J. Lu 
> ---
>  gcc/doc/sourcebuild.texi |  6 +-
>  gcc/testsuite/gcc.target/i386/pr116174.c | 18 +++---
>  gcc/testsuite/lib/scanasm.exp| 14 --
>  3 files changed, 32 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi
> index 1a31f00fb65..f7128f445cf 100644
> --- a/gcc/doc/sourcebuild.texi
> +++ b/gcc/doc/sourcebuild.texi
> @@ -3530,7 +3530,7 @@ assembly output.
>  Passes if @var{symbol} is not defined as a hidden symbol in the test's
>  assembly output.
>  
> -@item check-function-bodies @var{prefix} @var{terminator} [@var{options} [@{ 
> target/xfail @var{selector} @}]]
> +@item check-function-bodies @var{prefix} @var{terminator} [@var{options} [@{ 
> target/xfail @var{selector} @} [@var{matched}]]]
>  Looks through the source file for comments that give the expected assembly
>  output for selected functions.  Each line of expected output starts with the
>  prefix string @var{prefix} and the expected output for a function as a whole
> @@ -3544,6 +3544,10 @@ command line.  This can help if a source file is 
> compiled both with
>  and without optimization, since it is rarely useful to check the full
>  function body for unoptimized code.
>  
> +@var{matched}, if specified, is a regular expression which matches a
> +line of the function body.  If @var{matched} isn't specified, lines
> +beginning with labels, directives and comments are ignored.
> +

How about instead splitting:

  Depending on the configuration (see
  @code{configure_check-function-bodies} in
  @file{gcc/testsuite/lib/scanasm.exp}), the test may discard from the
  compiler's assembly output directives such as @code{.cfi_startproc},
  local label definitions such as @code{.LFB0}, and more.
  It then matches the result against the expected
  output for a function as a single regular expression.  This means that
  later lines can use backslashes to refer back to @samp{(@dots{})}
  captures on earlier lines.  For example:

into two paragraphs at "If then", and describing the new behaviour at
the end of the first paragraph:


Depending on the configuration (see
@code{configure_check-function-bodies} in
@file{gcc/testsuite/lib/scanasm.exp}), the test may discard from the
compiler's assembly output directives such as @code{.cfi_startproc},
local label definitions such as @code{.LFB0}, and more.  This behavior
can be overridden using the optional @var{matched} argument, which
specifies a regexp for lines that should not be discarded in this way.

The test then matches the result against the expected
output for a function as a single regular expression.  This means that
later lines can use backslashes to refer back to @samp{(@dots{})}
captures on earlier lines.  For example:


>  The first line of the expected output for a function @var{fn} has the form:
>  
>  @smallexample
> diff --git a/gcc/testsuite/gcc.target/i386/pr116174.c 
> b/gcc/testsuite/gcc.target/i386/pr116174.c
> index 8877d0b51af..91ec3288786 100644
> --- a/gcc/testsuite/gcc.target/i386/pr116174.c
> +++ b/gcc/testsuite/gcc.target/i386/pr116174.c
> @@ -1,6 +1,20 @@
>  /* { dg-do compile { target *-*-linux* } } */
> -/* { dg-options "-O2 -fcf-protection=branch" } */
> +/* { dg-options "-O2 -g0 -fcf-protection=branch" } */
> +/* Keep labels and directives ('.p2align', '.cfi_startproc').
> +/* { dg-final { check-function-bodies "**" "" "" { target "*-*-*" } {.*} } } 
> */

This works, but matches everything.  Maybe {^\t?\.} would be more precise.
The current version is fine too though, if you think it will work for all
assembly dialects.

>  
> +/*
> +**foo:
> +**.LFB0:
> +**   .cfi_startproc
> +** (
> +**   endbr64
> +**   .p2align 5
> +** |
> +**   endbr32
> +** )
> +**...
> +*/
>  char *
>  foo (char *dest, const char *src)
>  {
> @@ -8,5 +22,3 @@ foo (char *dest, const char *src)
>  /* nothing */;
>return --dest;
>  }
> -
> -/* { dg-final { scan-assembler "\t\.cfi_startproc\n\tendbr(32|64)\n" } } */
> diff --git a/gcc/testsuite/lib/scanasm.exp b/gcc/testsuite/lib/scanasm.exp
> index 42

Re: [PATCH] Extend check-function-bodies to cover directives

2024-08-22 Thread Richard Sandiford
"H.J. Lu"  writes:
> As PR target/116174 shown, we may need to verify the directive order.
> Extend check-function-bodies to cover directives.
>
>   * gcc.target/i386/pr116174.c: Use check-function-bodies.
>   * lib/scanasm.exp (configure_check-function-bodies): Add an
>   argument for fluff.  Set up_config(fluff) to $fluff if not
>   empty.
>   (check-function-bodies): Add an optional argument for fluff and
>   pass it to configure_check-function-bodies.

Looks like a useful feature, but how about instead making the extra
argument specify things that *should* be matched, rather than things
that shouldn't?  That argument would then take precedence over the
current fluff regexp.

That might be easier to maintain, since it wouldn't need to repeat the
knowledge currently in configure_check-function-bodies.  And it should
cope more easily with multiple assembly dialects.

The documentation in doc/sourcebuild.texi would need to be updated as well.
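
To make that concrete, such an argument could be used along these lines
(a sketch only: the regexp and the expected body are illustrative, and
match the direction later revisions of the patch took):

  /* { dg-final { check-function-bodies "**" "" "" { target *-*-* } {^\t?\.} } } */
  /*
  **foo:
  **.LFB0:
  **	.cfi_startproc
  **	...
  */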

Thanks,
Richard

>
> Signed-off-by: H.J. Lu 
> ---
>  gcc/testsuite/gcc.target/i386/pr116174.c | 16 ++--
>  gcc/testsuite/lib/scanasm.exp| 17 -
>  2 files changed, 26 insertions(+), 7 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/i386/pr116174.c 
> b/gcc/testsuite/gcc.target/i386/pr116174.c
> index 8877d0b51af..75c62964d97 100644
> --- a/gcc/testsuite/gcc.target/i386/pr116174.c
> +++ b/gcc/testsuite/gcc.target/i386/pr116174.c
> @@ -1,6 +1,20 @@
>  /* { dg-do compile { target *-*-linux* } } */
>  /* { dg-options "-O2 -fcf-protection=branch" } */
> +/* Keep directives ('.p2align', '.cfi_startproc').
> +/* { dg-final { check-function-bodies "**" "" "" { target "*-*-*" } 
> {^\s*(?://|$)} } } */
>  
> +/*
> +**foo:
> +**...
> +**   .cfi_startproc
> +** (
> +**   endbr64
> +** |
> +**   endbr32
> +** )
> +**   .p2align 5
> +**...
> +*/
>  char *
>  foo (char *dest, const char *src)
>  {
> @@ -8,5 +22,3 @@ foo (char *dest, const char *src)
>  /* nothing */;
>return --dest;
>  }
> -
> -/* { dg-final { scan-assembler "\t\.cfi_startproc\n\tendbr(32|64)\n" } } */
> diff --git a/gcc/testsuite/lib/scanasm.exp b/gcc/testsuite/lib/scanasm.exp
> index 42c719c512c..5165284608f 100644
> --- a/gcc/testsuite/lib/scanasm.exp
> +++ b/gcc/testsuite/lib/scanasm.exp
> @@ -863,7 +863,7 @@ proc scan-lto-assembler { args } {
>  
>  # Set up CONFIG for check-function-bodies.
>  
> -proc configure_check-function-bodies { config } {
> +proc configure_check-function-bodies { config fluff } {
>  upvar $config up_config
>  
>  # Regexp for the start of a function definition (name in \1).
> @@ -890,7 +890,9 @@ proc configure_check-function-bodies { config } {
>  }
>  
>  # Regexp for lines that aren't interesting.
> -if { [istarget nvptx*-*-*] } {
> +if {$fluff ne ""} then {
> + set up_config(fluff) $fluff
> +} elseif { [istarget nvptx*-*-*] } {
>   # Skip lines beginning with '//' comments ('-fverbose-asm', for
>   # example).
>   set up_config(fluff) {^\s*(?://)}
> @@ -982,7 +984,7 @@ proc check_function_body { functions name body_regexp } {
>  
>  # Check the implementations of functions against expected output.  Used as:
>  #
> -# { dg-do { check-function-bodies PREFIX TERMINATOR[ OPTION[ SELECTOR]] } }
> +# { dg-do { check-function-bodies PREFIX TERMINATOR[ OPTION[ SELECTOR 
> [FLUFF]]] } }
>  #
>  # See sourcebuild.texi for details.
>  
> @@ -990,7 +992,7 @@ proc check-function-bodies { args } {
>  if { [llength $args] < 2 } {
>   error "too few arguments to check-function-bodies"
>  }
> -if { [llength $args] > 4 } {
> +if { [llength $args] > 5 } {
>   error "too many arguments to check-function-bodies"
>  }
>  
> @@ -1029,6 +1031,11 @@ proc check-function-bodies { args } {
>   }
>  }
>  
> +set fluff ""
> +if { [llength $args] >= 5 } {
> + set fluff [lindex $args 4]
> +}
> +
>  set testcase [testname-for-summary]
>  # The name might include a list of options; extract the file name.
>  set filename [lindex $testcase 0]
> @@ -1048,7 +1055,7 @@ proc check-function-bodies { args } {
>  # (name in \1).  This may be different from '$config(start)'.
>  set start_expected {^(\S+):$}
>  
> -configure_check-function-bodies config
> +configure_check-function-bodies config $fluff
>  set have_bodies 0
>  if { [is_remote host] } {
>   remote_upload host "$filename"


Re: final: go down ASHIFT in walk_alter_subreg

2024-08-22 Thread Richard Sandiford
Michael Matz  writes:
> when experimenting with m68k plus LRA one of the
> changes in the backend is to accept ASHIFTs (not only
> MULT) as scale code for address indices.  When then not
> turning on LRA but using reload those addresses are
> presented to it which chokes on them.  While reload is
> going away the change to make them work doesn't really hurt
> (and generally seems useful, as MULT and ASHIFT really are
> no different).  So just add it.
>
>   PR target/116413
>   * final.cc (walk_alter_subreg): Recurse on ASHIFT.
> ---
>  gcc/final.cc | 1 +
>  1 file changed, 1 insertion(+)
> ---
>
> Regstrapped on x86-64-linux.  Okay?

OK, thanks.  If things are working correctly, this should only
trigger in LEAs, and MULT should only occur in MEMs.

Richard

>
> diff --git a/gcc/final.cc b/gcc/final.cc
> index eb9e065d9f0..5d911586de5 100644
> --- a/gcc/final.cc
> +++ b/gcc/final.cc
> @@ -3146,6 +3146,7 @@ walk_alter_subreg (rtx *xp, bool *changed)
>  case PLUS:
>  case MULT:
>  case AND:
> +case ASHIFT:
>XEXP (x, 0) = walk_alter_subreg (&XEXP (x, 0), changed);
>XEXP (x, 1) = walk_alter_subreg (&XEXP (x, 1), changed);
>break;


Re: Re-compute TYPE_MODE and DECL_MODE while streaming in for accelerator

2024-08-22 Thread Richard Sandiford
Prathamesh Kulkarni  writes:
>> -Original Message-
>> From: Richard Biener 
>> Sent: Wednesday, August 21, 2024 5:09 PM
>> To: Prathamesh Kulkarni 
>> Cc: Richard Sandiford ; Thomas Schwinge
>> ; gcc-patches@gcc.gnu.org
>> Subject: RE: Re-compute TYPE_MODE and DECL_MODE while streaming in for
>> accelerator
>> 
>> 
>> 
>> On Wed, 21 Aug 2024, Prathamesh Kulkarni wrote:
>> 
>> >
>> >
>> > > -Original Message-
>> > > From: Richard Biener 
>> > > Sent: Tuesday, August 20, 2024 10:36 AM
>> > > To: Richard Sandiford 
>> > > Cc: Prathamesh Kulkarni ; Thomas Schwinge
>> > > ; gcc-patches@gcc.gnu.org
>> > > Subject: Re: Re-compute TYPE_MODE and DECL_MODE while streaming in
>> > > for accelerator
>> > >
>> > > External email: Use caution opening links or attachments
>> > >
>> > >
>> > > > On 19.08.2024 at 20:56, Richard Sandiford wrote:
>> > > >
>> > > > Prathamesh Kulkarni  writes:
>> > > >> diff --git a/gcc/lto-streamer-in.cc b/gcc/lto-streamer-in.cc
>> > > >> index
>> > > >> cbf6041fd68..0420183faf8 100644
>> > > >> --- a/gcc/lto-streamer-in.cc
>> > > >> +++ b/gcc/lto-streamer-in.cc
>> > > >> @@ -44,6 +44,7 @@ along with GCC; see the file COPYING3.  If
>> not
>> > > see
>> > > >> #include "debug.h"
>> > > >> #include "alloc-pool.h"
>> > > >> #include "toplev.h"
>> > > >> +#include "stor-layout.h"
>> > > >>
>> > > >> /* Allocator used to hold string slot entries for line map
>> > > streaming.
>> > > >> */ static struct object_allocator
>> > > >> *string_slot_allocator; @@ -1752,6 +1753,17 @@ lto_read_tree_1
>> > > (class lto_input_block *ib, class data_in *data_in, tree expr)
>> > > >> with -g1, see for example PR113488.  */
>> > > >>   else if (DECL_P (expr) && DECL_ABSTRACT_ORIGIN (expr) ==
>> > > expr)
>> > > >>DECL_ABSTRACT_ORIGIN (expr) = NULL_TREE;
>> > > >> +
>> > > >> +#ifdef ACCEL_COMPILER
>> > > >> +  /* For decl with aggregate type, host streams out
>> VOIDmode.
>> > > >> + Compute the correct DECL_MODE by calling relayout_decl.
>> */
>> > > >> +  if ((VAR_P (expr)
>> > > >> +   || TREE_CODE (expr) == PARM_DECL
>> > > >> +   || TREE_CODE (expr) == FIELD_DECL)
>> > > >> +  && AGGREGATE_TYPE_P (TREE_TYPE (expr))
>> > > >> +  && DECL_MODE (expr) == VOIDmode)
>> > > >> +relayout_decl (expr);
>> > > >> +#endif
>> > > >
>> > > > Genuine question, but: is relayout_decl safe in this context?
>> It
>> > > does
>> > > > a lot more than just reset the mode.  It also applies the target
>> > > ABI's
>> > > > preferences wrt alignment, padding, and so on, rather than
>> > > preserving
>> > > > those of the host's.
>> > >
>> > > It would be better to just recompute the mode here.
>> > Hi,
>> > The attached patch sets DECL_MODE (expr) to TYPE_MODE (TREE_TYPE
>> (expr)) in lto_read_tree_1 instead of calling relayout_decl (expr).
>> > I checked layout_decl_type does the same thing for setting decl
>> mode,
>> > except for bit fields. Since bit-fields cannot have aggregate type,
>> I am assuming setting DECL_MODE (expr) to TYPE_MODE (TREE_TYPE (expr))
>> would be OK in this case ?
>> 
>> Yep, that should work.
> Thanks, I have committed the patch in:
> https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=792adb8d222d0d1d16b182871e105f47823b8e72
>
> after verifying it passes bootstrap+test on aarch64-linux-gnu,
> and libgomp testing (without GPU) for aarch64->nvptx and x86_64->nvptx.
>> 
>> > Sorry if this sounds like a silly ques -- Why would it be unsafe to
>> > call relayout_decl for variables that are mapped to accelerator even
>> > if it'd not preserve host's properties ? I assumed we want to assign
>> accel's ABI properties for mapped decls (mode being one of them), or
>> am I misunderstanding ?
>> 
>> Structure layout need not be compatible but we are preserving that of
>> the host instead of re-layouting in target context.  Likewise type <->
>> mode mapping doesn't have to agree.
> Ah OK, thanks for clarifying. So IIUC, in future, we might need to change 
> that if
> (in theory), host's structure layout for a decl is incompatible with a 
> particular accel's ABI
> and will need to relayout in accel's context ?

If structures are ever used to communicate between the host and the
accelerator, they would need to be laid out as the host expects,
otherwise we'd get data corruption.  But maybe structures are never
used that way (it's not my area!).
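
For context, a minimal sketch of the kind of case in question, assuming
OpenMP offloading (the struct and the pragma are purely illustrative):
if the device compiler chose different padding or alignment for struct pt
than the host did, the mapped data would be misinterpreted.

  struct pt { char tag; double x; };  /* layout fixed by the host ABI */

  void
  bump (struct pt *p, int n)
  {
  #pragma omp target map(tofrom: p[0:n])
    for (int i = 0; i < n; i++)
      p[i].x += 1.0;
  }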

Richard


Re: [PATCH] optabs-query: Guard smallest_int_mode_for_size [PR115495].

2024-08-22 Thread Richard Sandiford
"Robin Dapp"  writes:
> diff --git a/gcc/machmode.h b/gcc/machmode.h
> index c31ec2f2ebc..b3307ad9342 100644
> --- a/gcc/machmode.h
> +++ b/gcc/machmode.h
> @@ -20,6 +20,7 @@ along with GCC; see the file COPYING3.  If not see
>  #ifndef HAVE_MACHINE_MODES
>  #define HAVE_MACHINE_MODES
>  
> +#include "coretypes.h"
>  typedef opt_mode opt_machine_mode;
>  
>  extern CONST_MODE_SIZE poly_uint16 mode_size[NUM_MACHINE_MODES];

Why's the include needed?  .ccs ought to include coretypes.h directly
(and get machmode.h that way, since coretypes.h include machmode.h).

> @@ -905,15 +906,15 @@ decimal_float_mode_for_size (unsigned int size)
>  (mode_for_size (size, MODE_DECIMAL_FLOAT, 0));
>  }
>  
> -extern machine_mode smallest_mode_for_size (poly_uint64, enum mode_class);
> +extern opt_machine_mode smallest_mode_for_size (poly_uint64, enum 
> mode_class);
>  
>  /* Find the narrowest integer mode that contains at least SIZE bits.
> Such a mode must exist.  */

The comment is no longer true.  Maybe:

/* Find the narrowest integer mode that contains at least SIZE bits,
   if such a mode exists.  */

Similarly for smallest_mode_for_size.

Otherwise it looks good.  Thanks for doing this.

When I had a quick look before sending my previous message, it seemed like
quite a few callers do have natural error paths for the case where no mode
exists.  But I agree it makes sense to do that separately.  I can follow
up with a patch for that.

Richard


[PATCH] Handle arithmetic on eliminated address indices [PR116413]

2024-08-22 Thread Richard Sandiford
This patch fixes gcc.c-torture/compile/opout.c for m68k with LRA
enabled.  The test has:

...
z (a, b)
{
  return (int) &a + (int) &b + (int) x + (int) z;
}

so it adds the address of two incoming arguments.  This ends up
being treated as an LEA in which the "index" is the incoming
argument pointer, which the LEA multiplies by 2.  The incoming
argument pointer is then eliminated, leading to:

(plus:SI (plus:SI (ashift:SI (plus:SI (reg/f:SI 24 %argptr)
(const_int -4 [0xfffc]))
(const_int 1 [0x1]))
(reg/f:SI 41 [ _6 ]))
(const_int 20 [0x14]))

In the address_info scheme, the innermost plus has to be treated
as the index "term", since that's the thing that's subject to
index_reg_class.

Tested on aarch64-linux-gnu, powerpc64le-linux-gnu, and
x86_64-linux-gnu.  Also tested by building at least one target
per CPU directory and checking that there were no asm changes or
new ICEs when compiling gcc.c-torture, gcc.dg, and g++.dg at -Os.
OK to install?

Richard


gcc/
PR middle-end/116413
* rtl.h (address_info): Update commentary.
* rtlanal.cc (valid_base_or_index_term_p): New function, split
out from...
(get_base_term, get_index_term): ...here.  Handle elimination PLUSes.
---
 gcc/rtl.h  | 14 --
 gcc/rtlanal.cc | 29 +
 2 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/gcc/rtl.h b/gcc/rtl.h
index 2370d608161..1ef6432fd9c 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -2225,11 +2225,21 @@ struct address_info {
reloading.
 
  - *BASE is a variable expression representing a base address.
-   It contains exactly one REG, SUBREG or MEM, pointed to by BASE_TERM.
+   It contains exactly one "term", pointed to by BASE_TERM.
+   This term can be one of the following:
+
+   (1) a REG, or a SUBREG of a REG
+   (2) an eliminated REG (a PLUS of (1) and a constant)
+   (3) a MEM, or a SUBREG of a MEM
+   (4) a SCRATCH
+
+   This term is the one that base_reg_class constrains.
 
  - *INDEX is a variable expression representing an index value.
It may be a scaled expression, such as a MULT.  It has exactly
-   one REG, SUBREG or MEM, pointed to by INDEX_TERM.
+   one "term", pointed to by INDEX_TERM.  The possible terms are
+   the same as for BASE.  This term is the one that index_reg_class
+   constrains.
 
  - *DISP is a constant, possibly mutated.  DISP_TERM points to the
unmutated RTX_CONST_OBJ.  */
diff --git a/gcc/rtlanal.cc b/gcc/rtlanal.cc
index 71207ee4f41..8afbb32f220 100644
--- a/gcc/rtlanal.cc
+++ b/gcc/rtlanal.cc
@@ -6494,6 +6494,25 @@ binary_scale_code_p (enum rtx_code code)
   || code == ROTATERT);
 }
 
+/* Return true if X appears to be a valid base or index term.  */
+static bool
+valid_base_or_index_term_p (rtx x)
+{
+  if (GET_CODE (x) == SCRATCH)
+return true;
+  /* Handle what appear to be eliminated forms of a register.  If we reach
+ here, the elimination occurs outside of the outermost PLUS tree,
+ and so the elimination offset cannot be treated as a displacement
+ of the main address.  Instead, we need to treat the whole PLUS as
+ the base or index term.  The address can only be made legitimate by
+ reloading the PLUS.  */
+  if (GET_CODE (x) == PLUS && CONST_SCALAR_INT_P (XEXP (x, 1)))
+x = XEXP (x, 0);
+  if (GET_CODE (x) == SUBREG)
+x = SUBREG_REG (x);
+  return REG_P (x) || MEM_P (x);
+}
+
 /* If *INNER can be interpreted as a base, return a pointer to the inner term
(see address_info).  Return null otherwise.  */
 
@@ -6502,10 +6521,7 @@ get_base_term (rtx *inner)
 {
   if (GET_CODE (*inner) == LO_SUM)
 inner = strip_address_mutations (&XEXP (*inner, 0));
-  if (REG_P (*inner)
-  || MEM_P (*inner)
-  || GET_CODE (*inner) == SUBREG
-  || GET_CODE (*inner) == SCRATCH)
+  if (valid_base_or_index_term_p (*inner))
 return inner;
   return 0;
 }
@@ -6519,10 +6535,7 @@ get_index_term (rtx *inner)
   /* At present, only constant scales are allowed.  */
   if (binary_scale_code_p (GET_CODE (*inner)) && CONSTANT_P (XEXP (*inner, 1)))
 inner = strip_address_mutations (&XEXP (*inner, 0));
-  if (REG_P (*inner)
-  || MEM_P (*inner)
-  || GET_CODE (*inner) == SUBREG
-  || GET_CODE (*inner) == SCRATCH)
+  if (valid_base_or_index_term_p (*inner))
 return inner;
   return 0;
 }
-- 
2.25.1


[PATCH] lra: Don't apply eliminations to allocated registers [PR116321]

2024-08-22 Thread Richard Sandiford
The sequence of events in this PR is that:

- the function has many addresses in which only a single hard base
  register is acceptable.  Let's call the hard register H.

- IRA allocates that register to one of the pseudo base registers.
  Let's call the pseudo register P.

- Some of the other addresses that require H occur when P is still live.

- LRA therefore has to spill P.

- When it reallocates P, LRA chooses to use FRAME_POINTER_REGNUM,
  which has been eliminated to the stack pointer.  (This is ok,
  since the frame register is free.)

- Spilling P causes LRA to reprocess the instruction that uses P.

- When reprocessing the address that has P as its base, LRA first
  applies the new allocation, to get FRAME_POINTER_REGNUM,
  and then applies the elimination, to get the stack pointer.

The last step seems wrong: the elimination should only apply to
pre-existing uses of FRAME_POINTER_REGNUM, not to uses that result
from allocating pseudos.  Applying both means that we get the wrong
register number, and therefore the wrong class.

The PR is about an existing testcase that fails with LRA on m68k.

Tested on aarch64-linux-gnu, powerpc64le-linux-gnu, and
x86_64-linux-gnu.  Also tested by building at least one target
per CPU directory and checking that there were no asm changes or
new ICEs when compiling gcc.c-torture, gcc.dg, and g++.dg at -Os.
OK to install?

Richard


gcc/
PR middle-end/116321
* lra-constraints.cc (get_hard_regno): Only apply eliminations
to existing hard registers.
(get_reg_class): Likewise.
---
 gcc/lra-constraints.cc | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
index 90cbe6c012b..fdcc07764a2 100644
--- a/gcc/lra-constraints.cc
+++ b/gcc/lra-constraints.cc
@@ -200,12 +200,13 @@ get_hard_regno (rtx x)
 reg = SUBREG_REG (x);
   if (! REG_P (reg))
 return -1;
-  if (! HARD_REGISTER_NUM_P (hard_regno = REGNO (reg)))
-hard_regno = lra_get_regno_hard_regno (hard_regno);
+  int regno = REGNO (reg);
+  if (HARD_REGISTER_NUM_P (regno))
+hard_regno = lra_get_elimination_hard_regno (regno);
+  else
+hard_regno = lra_get_regno_hard_regno (regno);
   if (hard_regno < 0)
 return -1;
-  if (HARD_REGISTER_NUM_P (REGNO (reg)))
-hard_regno = lra_get_elimination_hard_regno (hard_regno);
   if (SUBREG_P (x))
 hard_regno += subreg_regno_offset (hard_regno, GET_MODE (reg),
   SUBREG_BYTE (x),  GET_MODE (x));
@@ -221,13 +222,12 @@ get_reg_class (int regno)
 {
   int hard_regno;
 
-  if (! HARD_REGISTER_NUM_P (hard_regno = regno))
+  if (HARD_REGISTER_NUM_P (regno))
+hard_regno = lra_get_elimination_hard_regno (regno);
+  else
 hard_regno = lra_get_regno_hard_regno (regno);
   if (hard_regno >= 0)
-{
-  hard_regno = lra_get_elimination_hard_regno (hard_regno);
-  return REGNO_REG_CLASS (hard_regno);
-}
+return REGNO_REG_CLASS (hard_regno);
   if (regno >= new_regno_start)
 return lra_get_allocno_class (regno);
   return NO_REGS;
-- 
2.25.1



Re: [PATCH] optabs-query: Guard smallest_int_mode_for_size [PR115495].

2024-08-21 Thread Richard Sandiford
Richard Biener  writes:
> On Wed, Aug 21, 2024 at 8:37 AM Robin Dapp  wrote:
>>
>> Hi,
>>
>> in get_best_extraction_insn we use smallest_int_mode_for_size with
>> struct_bits as size argument.  In PR115495 struct_bits = 256 and we
>> don't have a mode for that.  This patch just bails for such cases.
>>
>> This does not happen on the current trunk anymore (so the test passes
>> unpatched) but we've seen it internally.  Does it still make sense
>> to install it (and backport to 14)?
>>
>> Bootstrapped and regtested on x86 and aarch64.  Regtested on rv64gcv.
>>
>> Regards
>>  Robin
>>
>> PR middle-end/115495
>>
>> gcc/ChangeLog:
>>
>> * optabs-query.cc (get_best_extraction_insn): Return if
>> smallest_int_mode_for_size might not find a mode.
>>
>> gcc/testsuite/ChangeLog:
>>
>> * gcc.target/riscv/rvv/autovec/pr115495.c: New test.
>> ---
>>  gcc/optabs-query.cc   | 4 
>>  gcc/testsuite/gcc.target/riscv/rvv/autovec/pr115495.c | 9 +
>>  2 files changed, 13 insertions(+)
>>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr115495.c
>>
>> diff --git a/gcc/optabs-query.cc b/gcc/optabs-query.cc
>> index 93c1d7b8485..dc2690e720f 100644
>> --- a/gcc/optabs-query.cc
>> +++ b/gcc/optabs-query.cc
>> @@ -208,6 +208,10 @@ get_best_extraction_insn (extraction_insn *insn,
>>   machine_mode field_mode)
>>  {
>>opt_scalar_int_mode mode_iter;
>> +
>> +  if (maybe_gt (struct_bits, GET_MODE_PRECISION (MAX_MODE_INT)))
>> +return false;
>> +
>>FOR_EACH_MODE_FROM (mode_iter, smallest_int_mode_for_size (struct_bits))
>
> I think we instead should change this iteration to use FOR_EACH_MODE_IN_CLASS
> (like smallest_mode_for_size does) and skip to small modes?

I can't remember whether we rely on the int_n stuff here.  (If we do though,
it'd only be in a limited way, since the loop only tries int_n for the first
size.)

An alternative would be to make smallest_int_mode_for_size return an
optional mode, which arguably it should be doing anyway.
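
Concretely, that alternative would amount to a declaration along the lines
of (sketch only, not an actual patch):

  opt_scalar_int_mode smallest_int_mode_for_size (poly_uint64 size);

with get_best_extraction_insn then doing something like:

  scalar_int_mode start_mode;
  if (!smallest_int_mode_for_size (struct_bits).exists (&start_mode))
    return false;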

Thanks,
Richard

>
>>  {
>>scalar_int_mode mode = mode_iter.require ();
>> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr115495.c 
>> b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr115495.c
>> new file mode 100644
>> index 000..bbf4d720f63
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr115495.c
>> @@ -0,0 +1,9 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3" } */
>> +
>> +extern short a[];
>> +short b;
>> +int main() {
>> +  for (char c = 0; c < 18; c += 1)
>> +a[c + 0] = b;
>> +}
>> --
>> 2.46.0
>>


Re: [PATCH v2] aarch64: Implement popcountti2 pattern [PR113042]

2024-08-21 Thread Richard Sandiford
Andrew Pinski  writes:
> When CSSC is not enabled, 128bit popcount can be implemented
> just via the vector (v16qi) cnt instruction followed by a reduction,
> like how the 64bit one is currently implemented instead of
> splitting into 2 64bit popcount.
>
> Changes since v1:
> * v2: Make operand 0 be DImode instead of TImode and simplify.
>
> Build and tested for aarch64-linux-gnu.
>
>   PR target/113042
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.md (popcountti2): New define_expand.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/popcnt10.c: New test.
>   * gcc.target/aarch64/popcnt9.c: New test.

OK, thanks.

Richard

>
> Signed-off-by: Andrew Pinski 
> ---
>  gcc/config/aarch64/aarch64.md   | 13 +++
>  gcc/testsuite/gcc.target/aarch64/popcnt10.c | 25 +
>  gcc/testsuite/gcc.target/aarch64/popcnt9.c  | 25 +
>  3 files changed, 63 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/popcnt10.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/popcnt9.c
>
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 12dcc16529a..c54b29cd64b 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -5378,6 +5378,19 @@ (define_expand "popcount2"
>  }
>  })
>  
> +(define_expand "popcountti2"
> +  [(match_operand:DI 0 "register_operand")
> +   (match_operand:TI 1 "register_operand")]
> +  "TARGET_SIMD && !TARGET_CSSC"
> +{
> +  rtx v = gen_reg_rtx (V16QImode);
> +  rtx v1 = gen_reg_rtx (V16QImode);
> +  emit_move_insn (v, gen_lowpart (V16QImode, operands[1]));
> +  emit_insn (gen_popcountv16qi2 (v1, v));
> +  emit_insn (gen_aarch64_zero_extenddi_reduc_plus_v16qi (operands[0], v1));
> +  DONE;
> +})
> +
>  (define_insn "clrsb2"
>[(set (match_operand:GPI 0 "register_operand" "=r")
>  (clrsb:GPI (match_operand:GPI 1 "register_operand" "r")))]
> diff --git a/gcc/testsuite/gcc.target/aarch64/popcnt10.c 
> b/gcc/testsuite/gcc.target/aarch64/popcnt10.c
> new file mode 100644
> index 000..4d01fc67022
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/popcnt10.c
> @@ -0,0 +1,25 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +/* PR target/113042 */
> +
> +#pragma GCC target "+cssc"
> +
> +/*
> +** h128:
> +**   ldp x([0-9]+), x([0-9]+), \[x0\]
> +**   cnt x([0-9]+), x([0-9]+)
> +**   cnt x([0-9]+), x([0-9]+)
> +**   add w0, w([0-9]+), w([0-9]+)
> +**   ret
> +*/
> +
> +
> +unsigned h128 (const unsigned __int128 *a) {
> +  return __builtin_popcountg (a[0]);
> +}
> +
> +/* popcount with CSSC should be split into 2 sections. */
> +/* { dg-final { scan-tree-dump-not "POPCOUNT " "optimized" } } */
> +/* { dg-final { scan-tree-dump-times " __builtin_popcount" 2 "optimized" } } 
> */
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/popcnt9.c 
> b/gcc/testsuite/gcc.target/aarch64/popcnt9.c
> new file mode 100644
> index 000..c778fc7f420
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/popcnt9.c
> @@ -0,0 +1,25 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +/* PR target/113042 */
> +
> +#pragma GCC target "+nocssc"
> +
> +/*
> +** h128:
> +**   ldr q([0-9]+), \[x0\]
> +**   cnt v([0-9]+).16b, v\1.16b
> +**   addv    b([0-9]+), v\2.16b
> +**   fmov    w0, s\3
> +**   ret
> +*/
> +
> +
> +unsigned h128 (const unsigned __int128 *a) {
> +   return __builtin_popcountg (a[0]);
> +}
> +
> +/* There should be only one POPCOUNT. */
> +/* { dg-final { scan-tree-dump-times "POPCOUNT " 1 "optimized" } } */
> +/* { dg-final { scan-tree-dump-not " __builtin_popcount"  "optimized" } } */
> +


Re: [PATCH] PR target/116365: Add user-friendly arguments to --param aarch64-autovec-preference=N

2024-08-21 Thread Richard Sandiford
Kyrylo Tkachov  writes:
>> On 20 Aug 2024, at 19:11, Richard Sandiford  wrote:
>> Jennifer Schmitz  writes:
>>> The param aarch64-autovec-preference=N is a useful tool for testing
>>> auto-vectorisation in GCC as it allows the user to force a particular
>>> strategy. So far, N could be a numerical value between 0 and 4.
>>> This patch adds more user-friendly names to distinguish the options.
>>> For backwards compatibility, the numerical values are retained, but are made
>>> aliases of the new user-readable strings.
>>> 
>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
>>> regression.
>>> Ok for mainline?
>> 
>> User-readable names are good, but just to check: who is the intended
>> user here?  Is it just to save developers/experimenters the effort of
>> having to look up the magic numbers (and yes, I have to do that each
>> time I use it :) )?  Or is it for more general users?
>> 
>> If this is something that is being recommended for general use,
>> then we should probably promote it from a --param to a full -m option.
>> 
>> The difference is that --params are intended to be developer options,
>> so they can be removed or changed without warning.  For this reason,
>> if we keep it as a --param, I think we should remove the old numeric
>> values and just go with the symbolic ones.
>> 
>> Instead, -m options are intended as user options and are more stable.
>> If an -m option becomes redundant later, we'd probably just turn it into
>> a no-op rather than remove it.
>
> I recommended to Jennifer to keep the old names as aliases because I had seen 
> some internal scripts that did some specialist analysis use them and I got 
> the impression there may be more such users.
> But those scripts can easily be updated and indeed this is a param that we 
> don’t promise to keep backwards-compatible.
> I’ve been in two minds over whether to keep the old options, I’m okay with 
> switching to just the enum values.
> I wouldn’t want to make this an -m* option as I don’t want to see this used 
> in production code to override what the compiler should be getting right on 
> its own.

Ah, ok, thanks.  In that case I agree we should just keep it as a --param.
I've a slight preference for dropping the numeric arguments, but definitely
no objection to keeping them if that's more convenient.

Richard

>
> Thanks,
> Kyrill
>
>
>> 
>>> 
>>> Signed-off-by: Jennifer Schmitz 
>>> 
>>> gcc/
>>>  PR target/116365
>>>  * config/aarch64/aarch64-opts.h
>>>  (enum aarch64_autovec_preference_enum): New enum.
>>>  * config/aarch64/aarch64.cc (aarch64_cmp_autovec_modes):
>>>  Change numerical to enum values.
>>>  (aarch64_autovectorize_vector_modes): Change numerical to enum
>>>  values.
>>>  (aarch64_vector_costs::record_potential_advsimd_unrolling):
>>>  Change numerical to enum values.
>>>  * config/aarch64/aarch64.opt: Change param type to enum.
>>>  * doc/invoke.texi: Update documentation.
>>> 
>>> gcc/testsuite/
>>>  PR target/116365
>>>  * gcc.target/aarch64/autovec_param_0.c: New test.
>>>  * gcc.target/aarch64/autovec_param_1.c: Likewise.
>>>  * gcc.target/aarch64/autovec_param_2.c: Likewise.
>>>  * gcc.target/aarch64/autovec_param_3.c: Likewise.
>>>  * gcc.target/aarch64/autovec_param_4.c: Likewise.
>>>  * gcc.target/aarch64/autovec_param_asimd-only.c: Likewise.
>>>  * gcc.target/aarch64/autovec_param_default.c: Likewise.
>>>  * gcc.target/aarch64/autovec_param_prefer-asimd.c: Likewise.
>>>  * gcc.target/aarch64/autovec_param_prefer-sve.c: Likewise.
>>>  * gcc.target/aarch64/autovec_param_sve-only.c: Likewise.
>>>  * gcc.target/aarch64/neoverse_v1_2.c: Update parameter value.
>>>  * gcc.target/aarch64/neoverse_v1_3.c: Likewise.
>>>  * gcc.target/aarch64/sve/cond_asrd_1.c: Likewise.
>>>  * gcc.target/aarch64/sve/cond_cnot_4.c: Likewise.
>>>  * gcc.target/aarch64/sve/cond_unary_5.c: Likewise.
>>>  * gcc.target/aarch64/sve/cond_uxt_5.c: Likewise.
>>>  * gcc.target/aarch64/sve/cond_xorsign_2.c: Likewise.
>>>  * gcc.target/aarch64/sve/pr98268-1.c: Likewise.
>>>  * gcc.target/aarch64/sve/pr98268-2.c: Likewise.
>>> 
>>> From 2e8693143f1c9c0668dea7bad77b3eadac6a4835 Mon Sep 17 00:00:00 2001
>>> From: Jennifer Schmitz 
>>> Date: Mon, 19 Aug 2024

Re: [RFC/RFA][PATCH v4 06/12] aarch64: Implement new expander for efficient CRC computation

2024-08-21 Thread Richard Sandiford
Mariam Arutunian  writes:
> This patch introduces two new expanders for the aarch64 backend,
> dedicated to generate optimized code for CRC computations.
> The new expanders are designed to leverage specific hardware capabilities
> to achieve faster CRC calculations,
> particularly using the crc32, crc32c and pmull instructions when supported
> by the target architecture.
>
> Expander 1: Bit-Forward CRC (crc4)
> For targets that support the pmull instruction (TARGET_AES),
> the expander will generate code that uses the pmull (crypto_pmulldi)
> instruction for CRC computation.
>
> Expander 2: Bit-Reversed CRC (crc_rev4)
> The expander first checks if the target supports the CRC32* instruction set
> (TARGET_CRC32)
> and the polynomial in use is 0x1EDC6F41 (iSCSI) or 0x04C11DB7 (HDLC). If
> the conditions are met,
> it emits calls to the corresponding crc32* instruction (depending on the
> data size and the polynomial).
> If the target does not support crc32* but supports pmull, it then uses the
> pmull (crypto_pmulldi) instruction for bit-reversed CRC computation.
> Otherwise table-based CRC is generated.
>
>   gcc/config/aarch64/
>
> * aarch64-protos.h (aarch64_expand_crc_using_pmull): New extern
> function declaration.
> (aarch64_expand_reversed_crc_using_pmull):  Likewise.
> * aarch64.cc (aarch64_expand_crc_using_pmull): New function.
> (aarch64_expand_reversed_crc_using_pmull):  Likewise.
> * aarch64.md (crc_rev4): New expander for
> reversed CRC.
> (crc4): New expander for bit-forward CRC.
> * iterators.md (crc_data_type): New mode attribute.
>
>   gcc/testsuite/gcc.target/aarch64/
>
> * crc-1-pmul.c: New test.
> * crc-10-pmul.c: Likewise.
> * crc-12-pmul.c: Likewise.
> * crc-13-pmul.c: Likewise.
> * crc-14-pmul.c: Likewise.
> * crc-17-pmul.c: Likewise.
> * crc-18-pmul.c: Likewise.
> * crc-21-pmul.c: Likewise.
> * crc-22-pmul.c: Likewise.
> * crc-23-pmul.c: Likewise.
> * crc-4-pmul.c: Likewise.
> * crc-5-pmul.c: Likewise.
> * crc-6-pmul.c: Likewise.
> * crc-7-pmul.c: Likewise.
> * crc-8-pmul.c: Likewise.
> * crc-9-pmul.c: Likewise.
> * crc-CCIT-data16-pmul.c: Likewise.
> * crc-CCIT-data8-pmul.c: Likewise.
> * crc-coremark-16bitdata-pmul.c: Likewise.
> * crc-crc32-data16.c: Likewise.
> * crc-crc32-data32.c: Likewise.
> * crc-crc32-data8.c: Likewise.
> * crc-crc32c-data16.c: Likewise.
> * crc-crc32c-data32.c: Likewise.
> * crc-crc32c-data8.c: Likewise.

OK for trunk once the prerequisites are approved.  Thanks for all your
work on this.

Which other parts of the series still need review?  I can try to help
out with the target-independent bits.  (That said, I'm not sure I'm the
best person to review the tree recognition pass, but I can have a go.)

Richard

>
> Signed-off-by: Mariam Arutunian 
> Co-authored-by: Richard Sandiford 
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index 42639e9efcf..469111e3b17 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -1112,5 +1112,8 @@ extern void aarch64_adjust_reg_alloc_order ();
>  
>  bool aarch64_optimize_mode_switching (aarch64_mode_entity);
>  void aarch64_restore_za (rtx);
> +void aarch64_expand_crc_using_pmull (scalar_mode, scalar_mode, rtx *);
> +void aarch64_expand_reversed_crc_using_pmull (scalar_mode, scalar_mode, rtx 
> *);
> +
>  
>  #endif /* GCC_AARCH64_PROTOS_H */
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 7f0cc47d0f0..0cb8f3e8090 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -30314,6 +30314,137 @@ aarch64_retrieve_sysreg (const char *regname, bool 
> write_p, bool is128op)
>return sysreg->encoding;
>  }
>  
> +/* Generate assembly to calculate CRC
> +   using carry-less multiplication instruction.
> +   OPERANDS[1] is input CRC,
> +   OPERANDS[2] is data (message),
> +   OPERANDS[3] is the polynomial without the leading 1.  */
> +
> +void
> +aarch64_expand_crc_using_pmull (scalar_mode crc_mode,
> + scalar_mode data_mode,
> + rtx *operands)
> +{
> +  /* Check and keep arguments.  */
> +  gcc_assert (!CONST_INT_P (operands[0]));
> +  gcc_assert (CONST_INT_P (operands[3]));
> +  rtx crc = operands[1];
> +  rtx data = operands[2];
> +  rtx polynomial = operands[3];
> +
> +  unsigned HOST_WIDE_INT crc_size = GET_MODE_BITSIZE (crc_mode);
> +  unsigned HOST_WIDE_INT data_size = GET_MODE_BITSIZE (data_mode);
> +  gcc_assert (crc_size <= 32);
> +  gcc_asser

Re: [PATCH 1/2] SVE intrinsics: Fold constant operands for svdiv

2024-08-21 Thread Richard Sandiford
Jennifer Schmitz  writes:
> thank you for the feedback. I would like to summarize what I understand from 
> your suggestions before I start revising to make sure we are on the same page:
>
> 1. The new setup for constant folding of SVE intrinsics for binary operations 
> where both operands are constant vectors looks like this:
>
> In gcc/fold-const.cc:
> NEW: vector_const_binop: Handles vector part of const_binop element-wise
> const_binop: For vector arguments, calls vector_const_binop with const_binop 
> as callback
> poly_int_binop: Is now public and -if necessary- we can implement missing 
> codes (e.g. TRUNC_DIV_EXPR)

Yeah.  And specifically: I think we can move:

  if (TREE_CODE (arg1) == INTEGER_CST && TREE_CODE (arg2) == INTEGER_CST)
{
  wide_int warg1 = wi::to_wide (arg1), res;
  wide_int warg2 = wi::to_wide (arg2, TYPE_PRECISION (type));
  if (!wide_int_binop (res, code, warg1, warg2, sign, &overflow))
return NULL_TREE;
  poly_res = res;
}

into poly_int_binop.  It shouldn't affect compile times on non-poly
targets too much, since poly_int_tree_p (arg1) just checks for
INTEGER_CST there.
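Roughly speaking, the result could look like this (sketch only; the
INTEGER_CST block is the one quoted above, the poly path is cut down to
a single case for brevity, and the precision handling is an assumption):

  static bool
  poly_int_binop (poly_wide_int &res, enum tree_code code,
                  const_tree arg1, const_tree arg2,
                  signop sign, wi::overflow_type *overflow)
  {
    /* Plain INTEGER_CSTs, moved in from const_binop.  */
    if (TREE_CODE (arg1) == INTEGER_CST && TREE_CODE (arg2) == INTEGER_CST)
      {
        wide_int warg1 = wi::to_wide (arg1), wres;
        wide_int warg2 = wi::to_wide (arg2, TYPE_PRECISION (TREE_TYPE (arg1)));
        if (!wide_int_binop (wres, code, warg1, warg2, sign, overflow))
          return false;
        res = wres;
        return true;
      }
    /* Existing poly handling, unchanged (only PLUS_EXPR shown here).  */
    switch (code)
      {
      case PLUS_EXPR:
        res = wi::add (wi::to_poly_wide (arg1),
                       wi::to_poly_wide (arg2), sign, overflow);
        return true;
      default:
        return false;
      }
  }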

> In aarch64 backend:
> NEW: aarch64_vector_const_binop: adapted from int_const_binop, but calls 
> poly_int_binop

Yes.  The main differences are that we shouldn't treat any operation
as overflowing, and that we can handle cases that are well-defined
for intrinsics but not for gimple.
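Concretely, I'd expect something of this shape to be enough (a sketch
under the assumptions above, i.e. that poly_int_binop becomes public;
the name and exact interface are placeholders):

  /* Like int_const_binop, but without treating overflow as an error,
     since the intrinsics define their own wrap-around semantics.  */
  static tree
  aarch64_vector_const_binop (tree_code code, tree arg1, tree arg2)
  {
    if (poly_int_tree_p (arg1) && poly_int_tree_p (arg2))
      {
        poly_wide_int poly_res;
        tree type = TREE_TYPE (arg1);
        signop sign = TYPE_SIGN (type);
        wi::overflow_type overflow = wi::OVF_NONE;
        if (poly_int_binop (poly_res, code, arg1, arg2, sign, &overflow))
          return force_fit_type (type, poly_res, false, false);
      }
    return NULL_TREE;
  }

Cases that are well-defined for the intrinsics but not for gimple
(things like INT_MIN / -1) would then be special-cased around this call.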

> intrinsic_impl::fold: calls vector_const_binop with 
> aarch64_vector_const_binop as callback

Yeah.

> 2. Folding where only one operand is constant (0/x, x/0, 0*x etc.) can be 
> handled individually in intrinsic_impl, but in separate patches. If there is 
> already code to check for uniform vectors (e.g. in the svdiv->svasrd case), 
> we try to share code.

Yeah.  And in particular, we should try to handle (and test) vector-scalar
_n intrinsics as well as vector-vector intrinsics.
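(For instance, alongside the vector-vector tests, something like the
following _n case, with the usual scan checks; just an illustration of
the kind of coverage I mean, not a finished test:)

  #include <arm_sve.h>

  /* Division by a constant scalar 0 via the _n form; the fold should
     handle this the same way as the vector-vector x / 0 case.  */
  svint32_t
  div_n_by_zero (svbool_t pg, svint32_t x)
  {
    return svdiv_n_s32_x (pg, x, 0);
  }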

> Does that cover what you proposed? Otherwise, please feel free to correct any 
> misunderstandings.

SGTM.

Thanks,
Richard


Re: [PATCH 2/2] aarch64: Implement popcountti2 pattern [PR113042]

2024-08-20 Thread Richard Sandiford
Richard Sandiford  writes:
> Andrew Pinski  writes:
>> When CSSC is not enabled, 128bit popcount can be implemented
>> just via the vector (v16qi) cnt instruction followed by a reduction,
>> like how the 64bit one is currently implemented instead of
>> splitting into 2 64bit popcount.
>>
>> Build and tested for aarch64-linux-gnu.
>>
>>  PR target/113042
>>
>> gcc/ChangeLog:
>>
>>  * config/aarch64/aarch64.md (popcountti2): New define_expand.
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * gcc.target/aarch64/popcnt10.c: New test.
>>  * gcc.target/aarch64/popcnt9.c: New test.
>
> OK if there are no other comments in the next 24 hours.

Sorry, only thought about it later, but:

>> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
>> index 12dcc16529a..73506e71f43 100644
>> --- a/gcc/config/aarch64/aarch64.md
>> +++ b/gcc/config/aarch64/aarch64.md
>> @@ -5378,6 +5378,22 @@ (define_expand "popcount2"
>>  }
>>  })
>>  
>> +(define_expand "popcountti2"
>> +  [(set (match_operand:TI 0 "register_operand")
>> +(popcount:TI (match_operand:TI 1 "register_operand")))]

Could you try making the output :DI instead of :TI?  I'd expect
internal-fn.cc to handle that correctly and extend the result to
128 bits where needed.

That would make the dummy popcount rtx malformed, so I suppose
the pattern should just be:

  [(match_operand:DI 0 "register_operand")
   (match_operand:TI 1 "register_operand")]

>> +  "TARGET_SIMD && !TARGET_CSSC"
>> +{
>> +  rtx v = gen_reg_rtx (V16QImode);
>> +  rtx v1 = gen_reg_rtx (V16QImode);
>> +  emit_move_insn (v, gen_lowpart (V16QImode, operands[1]));
>> +  emit_insn (gen_popcountv16qi2 (v1, v));
>> +  rtx out = gen_reg_rtx (DImode);
>> +  emit_insn (gen_aarch64_zero_extenddi_reduc_plus_v16qi (out, v1));

We could then use operands[0] directly as the output here.

Thanks,
Richard

>> +  out = convert_to_mode (TImode, out, true);
>> +  emit_move_insn (operands[0], out);
>> +  DONE;
>> +})


[PATCH] aarch64: Fix caller saves of VNx2QI [PR116238]

2024-08-20 Thread Richard Sandiford
The testcase contains a VNx2QImode pseudo that is live across a call
and that cannot be allocated a call-preserved register.  LRA quite
reasonably tried to save it before the call and restore it afterwards.
Unfortunately, the target told it to do that in SImode, even though
punning between SImode and VNx2QImode is disallowed by both
TARGET_CAN_CHANGE_MODE_CLASS and TARGET_MODES_TIEABLE_P.

The natural class to use for SImode is GENERAL_REGS, so this led
to an unsalvageable situation in which we had:

  (set (subreg:VNx2QI (reg:SI A) 0) (reg:VNx2QI B))

where A needed GENERAL_REGS and B needed FP_REGS.  We therefore ended
up in a reload loop.

The hooks above should ensure that this situation can never occur
for incoming subregs.  It only happened here because the target
explicitly forced it.

The decision to use SImode for modes smaller than 4 bytes dates
back to the beginning of the port, before 16-bit floating-point
modes existed.  I'm not sure whether promoting to SImode really
makes sense for any FPR, but that's a separate performance/QoI
discussion.  For now, this patch just disallows using SImode
when it is wrong for correctness reasons, since that should be
safer to backport.

Bootstrapped & regression-tested on aarch64-linux-gnu.  I'll leave
a day or so for comments before pushing.

Richard


gcc/
PR testsuite/116238
* config/aarch64/aarch64.cc (aarch64_hard_regno_caller_save_mode):
Only return SImode if we can convert to and from it.

gcc/testsuite/
PR testsuite/116238
* gcc.target/aarch64/sve/pr116238.c: New test.
---
 gcc/config/aarch64/aarch64.cc   |  7 ---
 gcc/testsuite/gcc.target/aarch64/sve/pr116238.c | 13 +
 2 files changed, 17 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/pr116238.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index bfd7bcdef7c..4e312c43576 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -2521,10 +2521,11 @@ aarch64_hard_regno_caller_save_mode (unsigned regno, 
unsigned,
  unnecessarily significant.  */
   if (PR_REGNUM_P (regno))
 return mode;
-  if (known_ge (GET_MODE_SIZE (mode), 4))
-return mode;
-  else
+  if (known_lt (GET_MODE_SIZE (mode), 4)
+  && REG_CAN_CHANGE_MODE_P (regno, mode, SImode)
+  && REG_CAN_CHANGE_MODE_P (regno, SImode, mode))
 return SImode;
+  return mode;
 }
 
 /* Return true if I's bits are consecutive ones from the MSB.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr116238.c 
b/gcc/testsuite/gcc.target/aarch64/sve/pr116238.c
new file mode 100644
index 000..fe66b198107
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr116238.c
@@ -0,0 +1,13 @@
+/* { dg-additional-options "-O2 -msve-vector-bits=128" } */
+
+void foo();
+typedef unsigned char v2qi __attribute__((vector_size(2)));
+void f(v2qi *ptr)
+{
+  v2qi x = *ptr;
+  asm volatile ("" :: "w" (x));
+  asm volatile ("" ::: "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15");
+  foo();
+  asm volatile ("" :: "w" (x));
+  *ptr = x;
+}
-- 
2.25.1



Re: [PATCH] PR target/116365: Add user-friendly arguments to --param aarch64-autovec-preference=N

2024-08-20 Thread Richard Sandiford
Jennifer Schmitz  writes:
> The param aarch64-autovec-preference=N is a useful tool for testing
> auto-vectorisation in GCC as it allows the user to force a particular
> strategy. So far, N could be a numerical value between 0 and 4.
> This patch adds more user-friendly names to distinguish the options.
> For backwards compatibility, the numerical values are retained, but are made
> aliases of the new user-readable strings.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> Ok for mainline?

User-readable names are good, but just to check: who is the intended
user here?  Is it just to save developers/experimenters the effort of
having to look up the magic numbers (and yes, I have to do that each
time I use it :) )?  Or is it for more general users?

If this is something that is being recommended for general use,
then we should probably promote it from a --param to a full -m option.

The difference is that --params are intended to be developer options,
so they can be removed or changed without warning.  For this reason,
if we keep it as a --param, I think we should remove the old numeric
values and just go with the symbolic ones.

Instead, -m options are intended as user options and are more stable.
If an -m option becomes redundant later, we'd probably just turn it into
a no-op rather than remove it.

>
> Signed-off-by: Jennifer Schmitz 
>
> gcc/
>   PR target/116365
>   * config/aarch64/aarch64-opts.h
>   (enum aarch64_autovec_preference_enum): New enum.
>   * config/aarch64/aarch64.cc (aarch64_cmp_autovec_modes):
>   Change numerical to enum values.
>   (aarch64_autovectorize_vector_modes): Change numerical to enum
>   values.
>   (aarch64_vector_costs::record_potential_advsimd_unrolling):
>   Change numerical to enum values.
>   * config/aarch64/aarch64.opt: Change param type to enum.
>   * doc/invoke.texi: Update documentation.
>
> gcc/testsuite/
>   PR target/116365
>   * gcc.target/aarch64/autovec_param_0.c: New test.
>   * gcc.target/aarch64/autovec_param_1.c: Likewise.
>   * gcc.target/aarch64/autovec_param_2.c: Likewise.
>   * gcc.target/aarch64/autovec_param_3.c: Likewise.
>   * gcc.target/aarch64/autovec_param_4.c: Likewise.
>   * gcc.target/aarch64/autovec_param_asimd-only.c: Likewise.
>   * gcc.target/aarch64/autovec_param_default.c: Likewise.
>   * gcc.target/aarch64/autovec_param_prefer-asimd.c: Likewise.
>   * gcc.target/aarch64/autovec_param_prefer-sve.c: Likewise.
>   * gcc.target/aarch64/autovec_param_sve-only.c: Likewise.
>   * gcc.target/aarch64/neoverse_v1_2.c: Update parameter value.
>   * gcc.target/aarch64/neoverse_v1_3.c: Likewise.
>   * gcc.target/aarch64/sve/cond_asrd_1.c: Likewise.
>   * gcc.target/aarch64/sve/cond_cnot_4.c: Likewise.
>   * gcc.target/aarch64/sve/cond_unary_5.c: Likewise.
>   * gcc.target/aarch64/sve/cond_uxt_5.c: Likewise.
>   * gcc.target/aarch64/sve/cond_xorsign_2.c: Likewise.
>   * gcc.target/aarch64/sve/pr98268-1.c: Likewise.
>   * gcc.target/aarch64/sve/pr98268-2.c: Likewise.
>
> From 2e8693143f1c9c0668dea7bad77b3eadac6a4835 Mon Sep 17 00:00:00 2001
> From: Jennifer Schmitz 
> Date: Mon, 19 Aug 2024 08:42:55 -0700
> Subject: [PATCH] PR target/116365: Add user-friendly arguments to --param
>  aarch64-autovec-preference=N
>
> The param aarch64-autovec-preference=N is a useful tool for testing
> auto-vectorisation in GCC as it allows the user to force a particular
> strategy. So far, N could be a numerical value between 0 and 4.
> This patch adds more user-friendly names to distinguish the options.
> For backwards compatibility, the numerical values are retained, but are made
> aliases of the new user-readable strings.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> Ok for mainline?
>
> Signed-off-by: Jennifer Schmitz 
>
> gcc/
>   PR target/116365
>   * config/aarch64/aarch64-opts.h
>   (enum aarch64_autovec_preference_enum): New enum.
>   * config/aarch64/aarch64.cc (aarch64_cmp_autovec_modes):
>   Change numerical to enum values.
>   (aarch64_autovectorize_vector_modes): Change numerical to enum
>   values.
>   (aarch64_vector_costs::record_potential_advsimd_unrolling):
>   Change numerical to enum values.
>   * config/aarch64/aarch64.opt: Change param type to enum.
>   * doc/invoke.texi: Update documentation.
>
> gcc/testsuite/
>   PR target/116365
>   * gcc.target/aarch64/autovec_param_0.c: New test.
>   * gcc.target/aarch64/autovec_param_1.c: Likewise.
>   * gcc.target/aarch64/autovec_param_2.c: Likewise.
>   * gcc.target/aarch64/autovec_param_3.c: Likewise.
>   * gcc.target/aarch64/autovec_param_4.c: Likewise.
>   * gcc.target/aarch64/autovec_param_asimd-only.c: Likewise.
>   * gcc.target/aarch64/autovec_param_default.c: Likewise.
>   * gcc.target/aarch64/autovec_param_prefer-asimd.

Re: [PATCH 2/2] aarch64: Implement popcountti2 pattern [PR113042]

2024-08-20 Thread Richard Sandiford
Andrew Pinski  writes:
> When CSSC is not enabled, 128bit popcount can be implemented
> just via the vector (v16qi) cnt instruction followed by a reduction,
> like how the 64bit one is currently implemented instead of
> splitting into 2 64bit popcount.
>
> Build and tested for aarch64-linux-gnu.
>
>   PR target/113042
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.md (popcountti2): New define_expand.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/popcnt10.c: New test.
>   * gcc.target/aarch64/popcnt9.c: New test.

OK if there are no other comments in the next 24 hours.

Thanks,
Richard

>
> Signed-off-by: Andrew Pinski 
> ---
>  gcc/config/aarch64/aarch64.md   | 16 +
>  gcc/testsuite/gcc.target/aarch64/popcnt10.c | 25 +
>  gcc/testsuite/gcc.target/aarch64/popcnt9.c  | 25 +
>  3 files changed, 66 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/popcnt10.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/popcnt9.c
>
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 12dcc16529a..73506e71f43 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -5378,6 +5378,22 @@ (define_expand "popcount2"
>  }
>  })
>  
> +(define_expand "popcountti2"
> +  [(set (match_operand:TI 0 "register_operand")
> + (popcount:TI (match_operand:TI 1 "register_operand")))]
> +  "TARGET_SIMD && !TARGET_CSSC"
> +{
> +  rtx v = gen_reg_rtx (V16QImode);
> +  rtx v1 = gen_reg_rtx (V16QImode);
> +  emit_move_insn (v, gen_lowpart (V16QImode, operands[1]));
> +  emit_insn (gen_popcountv16qi2 (v1, v));
> +  rtx out = gen_reg_rtx (DImode);
> +  emit_insn (gen_aarch64_zero_extenddi_reduc_plus_v16qi (out, v1));
> +  out = convert_to_mode (TImode, out, true);
> +  emit_move_insn (operands[0], out);
> +  DONE;
> +})
> +
>  (define_insn "clrsb2"
>[(set (match_operand:GPI 0 "register_operand" "=r")
>  (clrsb:GPI (match_operand:GPI 1 "register_operand" "r")))]
> diff --git a/gcc/testsuite/gcc.target/aarch64/popcnt10.c 
> b/gcc/testsuite/gcc.target/aarch64/popcnt10.c
> new file mode 100644
> index 000..4d01fc67022
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/popcnt10.c
> @@ -0,0 +1,25 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +/* PR target/113042 */
> +
> +#pragma GCC target "+cssc"
> +
> +/*
> +** h128:
> +**   ldp x([0-9]+), x([0-9]+), \[x0\]
> +**   cnt x([0-9]+), x([0-9]+)
> +**   cnt x([0-9]+), x([0-9]+)
> +**   add w0, w([0-9]+), w([0-9]+)
> +**   ret
> +*/
> +
> +
> +unsigned h128 (const unsigned __int128 *a) {
> +  return __builtin_popcountg (a[0]);
> +}
> +
> +/* popcount with CSSC should be split into 2 sections. */
> +/* { dg-final { scan-tree-dump-not "POPCOUNT " "optimized" } } */
> +/* { dg-final { scan-tree-dump-times " __builtin_popcount" 2 "optimized" } } 
> */
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/popcnt9.c 
> b/gcc/testsuite/gcc.target/aarch64/popcnt9.c
> new file mode 100644
> index 000..c778fc7f420
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/popcnt9.c
> @@ -0,0 +1,25 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +/* PR target/113042 */
> +
> +#pragma GCC target "+nocssc"
> +
> +/*
> +** h128:
> +**   ldr q([0-9]+), \[x0\]
> +**   cnt v([0-9]+).16b, v\1.16b
> +**   addvb([0-9]+), v\2.16b
> +**   addv    b([0-9]+), v\2.16b
> +**   fmov    w0, s\3
> +*/
> +
> +
> +unsigned h128 (const unsigned __int128 *a) {
> +   return __builtin_popcountg (a[0]);
> +}
> +
> +/* There should be only one POPCOUNT. */
> +/* { dg-final { scan-tree-dump-times "POPCOUNT " 1 "optimized" } } */
> +/* { dg-final { scan-tree-dump-not " __builtin_popcount"  "optimized" } } */
> +


Re: [PATCH 1/2] builtins: Don't expand bit query builtins for __int128_t if the target supports an optab for it

2024-08-20 Thread Richard Sandiford
Andrew Pinski  writes:
> On aarch64 (without CSSC instructions), since popcount is implemented using 
> the SIMD instruction cnt,
> instead of using two SIMD cnt (V8QI mode), it is better to use one 128bit cnt 
> (V16QI mode). And only one
> reduction addition instead of 2. Currently fold_builtin_bit_query will always 
> expand without checking
> if there was an optab for the type, so this changes that to check the optab 
> to see if we should expand
> or have the backend handle it.
>
> Bootstrapped and tested on x86_64-linux-gnu and built and tested for 
> aarch64-linux-gnu.
>
> gcc/ChangeLog:
>
>   * builtins.cc (fold_builtin_bit_query): Don't expand double
>   `unsigned long long` types if there is an optab entry for that
>   type.

OK.  The logic in the function seems a bit twisty (the same condition
is checked later), but all my attempts to improve it only made it worse.

Thanks,
Richard

>
> Signed-off-by: Andrew Pinski 
> ---
>  gcc/builtins.cc | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/builtins.cc b/gcc/builtins.cc
> index 0b902896ddd..b4d51eaeba5 100644
> --- a/gcc/builtins.cc
> +++ b/gcc/builtins.cc
> @@ -10185,7 +10185,9 @@ fold_builtin_bit_query (location_t loc, enum 
> built_in_function fcode,
>tree call = NULL_TREE, tem;
>if (TYPE_PRECISION (arg0_type) == MAX_FIXED_MODE_SIZE
>&& (TYPE_PRECISION (arg0_type)
> -   == 2 * TYPE_PRECISION (long_long_unsigned_type_node)))
> +   == 2 * TYPE_PRECISION (long_long_unsigned_type_node))
> +  /* If the target supports the optab, then don't do the expansion. */
> +  && !direct_internal_fn_supported_p (ifn, arg0_type, OPTIMIZE_FOR_BOTH))
>  {
>/* __int128 expansions using up to 2 long long builtins.  */
>arg0 = save_expr (arg0);


Re: [PATCH] Tweak base/index disambiguation in decompose_normal_address [PR116236]

2024-08-20 Thread Richard Sandiford
Richard Biener  writes:
> On Thu, Aug 15, 2024 at 4:57 PM Jeff Law  wrote:
>>
>>
>>
>> On 8/15/24 2:50 AM, Richard Sandiford wrote:
>> > The PR points out that, for an address like:
>> >
>> >(plus (zero_extend X) Y)
>> >
>> > decompose_normal_address doesn't establish a strong preference
>> > between treating X as the base or Y as the base.  As the comment
>> > in the patch says, zero_extend isn't enough on its own to assume
>> > an index, at least not on POINTERS_EXTEND_UNSIGNED targets.
>> > But in a construct like the one above, X and Y have different modes,
>> > and it seems reasonable to assume that the one with the expected
>> > address mode is the base.
>> >
>> > This matters on targets like m68k that support index extension
>> > and that require different classes for bases and indices.
>> >
>> > Tested on aarch64-linux-gnu & x86_64-linux-gnu.  Andreas also confirms
>> > that it fixes the m68k LRA problem.  OK to install?
>> >
>> > Richard
>> >
>> >
>> > gcc/
>> >   PR middle-end/116236
>> >   * rtlanal.cc (decompose_normal_address): Try to distinguish
>> >   bases and indices based on mode, before resorting to "baseness".
>> OK.  Thanks to everyone for chasing this down.  No idea where we sit
>> with the conversion of m68k to LRA but this looks like it'd be helpful
>> irrespective of that effort.
>
> I'll point out that this change merely adjusts heuristics and whether there's
> an underlying issue in the target or LRA remains to be seen?

The PR has a lot more discussion around this. :)  The historical interface
is that the target can request different register classes for base registers
and index registers, but the target doesn't get to choose what it considers
to be a base and what it considers to be an index.  This is instead
determined by target-independent code (like it is for tree-ssa-address.cc,
for example).

decompose_normal_address wasn't making the same choice between base
and index that reload made (or that tree-ssa-address.c would make).
Some inconsistencies like that are ok, if both interpretations are valid.
But IMO the old decompose_normal_address behaviour was clearly wrong for
this case.

So I think the patch is fixing a genuine bug, rather than papering over
a bug elsewhere.

I agree that in some ways it's not a very satisfactory situation,
since there's a fair bit of guesswork and inference going on (and is
in reload too).  But I don't think we can avoid that without changing
the interface.

Changing the interface would be great, but it's a daunting amount of work,
especially given that we have so many inactive ports in-tree that would each
need to be updated individually.

Thanks,
Richard


Re: FIXED_REGISTERS / ira_no_alloc_regs: aarch64 vs. risc-v (was Re: sched1 pathology on RISC-V : PR/114729)

2024-08-20 Thread Richard Sandiford
Vineet Gupta  writes:
> On 8/19/24 14:52, Richard Sandiford wrote:
>>> 2. On RISC-V sched1 is counter intuitively assuming HARD_FP is live due to 
>>> the weird interaction of DF infra (which always marks HARD_FP with
>>> artificial def) and ira_no_alloc_regs.
>> In general, it isn't possible to predict at this stage whether the hard
>> frame pointer will be needed, even for -fomit-frame-pointer.  The final
>> decision is made during LRA, which in the worst case has iterate through
>> several elimination attempts.
>>
>> So whatever we do here will be wrong for some cases.  In some ways,
>> assuming that the hard frame pointer will be needed is the conservative
>> option; if we instead assumed that it wasn't needed, we'd be more
>> willing to move code around to make use of that (supposed) extra register.
>
> Right, but we can tweak sched1 heuristics some more :-)
> What if just for model pressure calculations we ignore these artificial defs 
> and FP semantics from RA pov remain unchanged.

But my point was that:

- the sched1 code doesn't know whether FP will be needed
- assuming that the FP will be needed should lead to a higher
  pressure estimate
- a higher pressure estimate should lead to more caution
- in general, more caution leads to less spilling

The original report was about too much spilling.  Making sched1 assume
that it has more free registers is, in general, likely to increase the
amount of spilling rather than reduce it.

> As a crude hack I tried below (which doesn't do any sanity check for bitmap 
> validity)
>
> diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc
>
> -  EXECUTE_IF_SET_IN_BITMAP (live, 0, j, bi)
> +  EXECUTE_IF_AND_COMPL_IN_BITMAP (live, &df->regular_block_artificial_uses, 
> 0, j, bi)

If this helps the SPEC benchmark, then for the reasons we discussed
eariler with Jeff, I think it's likely just a coincidence, rather than
the right fix.

Thanks,
Richard


Re: [PATCHv4, expand] Add const0 move checking for CLEAR_BY_PIECES optabs

2024-08-19 Thread Richard Sandiford
HAO CHEN GUI  writes:
> Hi,
>   This patch adds const0 move checking for CLEAR_BY_PIECES. The original
> vec_duplicate handles duplicates of non-constant inputs. But 0 is a
> constant. So even a platform doesn't support vec_duplicate, it could
> still do clear by pieces if it supports const0 move by that mode.
>
>   Compared to the previous version, the main change is to set up a
> new function to generate const0 for certain modes and use the function
> as by_pieces_constfn for CLEAR_BY_PIECES.
> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660344.html
>
>   Bootstrapped and tested on powerpc64-linux BE and LE with no
> regressions.
>
>   On i386, it got several regressions. One issue is that the predicate of
> the V16QI move expander doesn't include const0, so V16QI mode can't be used
> for clear by pieces with the patch. The second issue is that the const0 is
> passed directly to the move expander with the patch. Originally it is
> forced into a pseudo and i386 can leverage the previous data to do
> optimization.

The patch looks good to me, but I suppose we'll need to decide what
to do about x86.

It's not obvious to me why movv16qi requires a nonimmediate_operand
source, especially since ix86_expand_vector_mode does have code to
cope with constant operand[1]s.  emit_move_insn_1 doesn't check the
predicates anyway, so the predicate will have little effect.

A workaround would be to check legitimate_constant_p instead of the
predicate, but I'm not sure that that should be necessary.
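i.e. something along these lines in by_pieces_mode_supported_p, instead
of the insn_operand_matches check in the patch (untested sketch):

  if (op == CLEAR_BY_PIECES
      && VECTOR_MODE_P (mode)
      && !targetm.legitimate_constant_p (mode, CONST0_RTX (mode)))
    return false;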

Has this already been discussed?  If not, we should loop in the x86
maintainers (but I didn't do that here in case it would be a repeat).

As far as the second issue goes, I suppose there are at least three
ways of handling shared constants:

(1) Force the zero into a register and leave later optimisations to
propagate the zero where profitable.

(2) Emit stores of zero and expect a later pass to share constants
where beneficial.

(3) Generate stores of zero and leave the target expanders to force
constants into registers on the fly if reuse seems plausibly
beneficial.

where (3) is a middle ground between (1) and (2).
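For concreteness, (1) would amount to something like the sketch below at
expansion time (the helper name and the piece-by-piece framing are made
up purely for illustration):

  /* Force the shared zero into a register once and store that register
     for each piece, leaving later passes to turn individual stores back
     into immediate stores where that is cheaper.  */
  static void
  clear_pieces_via_shared_reg (rtx to, unsigned HOST_WIDE_INT npieces,
                               fixed_size_mode mode)
  {
    rtx zero = force_reg (mode, CONST0_RTX (mode));
    unsigned HOST_WIDE_INT piece_size = GET_MODE_SIZE (mode);
    for (unsigned HOST_WIDE_INT i = 0; i < npieces; ++i)
      emit_move_insn (adjust_address (to, mode, i * piece_size), zero);
  }

If I've read the patch correctly, (2) is roughly what it does as posted,
and (3) would push the choice into the target's mov expanders.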

Thanks,
Richard

>   The patch also raises several regressions on aarch64. The V2x8QImode
> replaces TImode to do 16-byte clear by pieces as V2x8QImode move expand
> supports const0 and vector mode is preferable. I drafted a patch to
> address the issue. It will be sent for review in a separate email.
> Another problem is V8QImode replaces DImode to do 8-byte clear by pieces.
> It seems to cause different sequences of instructions but the actual
> instructions are the same.
>
> Thanks
> Gui Haochen
>
> ChangeLog
> expand: Add const0 move checking for CLEAR_BY_PIECES optabs
>
> vec_duplicate handles duplicates of non-constant inputs.  The 0 is a
> constant.  So even if a platform doesn't support vec_duplicate, it could
> still do clear by pieces if it supports a const0 move.  This patch adds
> the checking.
>
> gcc/
>   * expr.cc (by_pieces_mode_supported_p): Add const0 move checking
>   for CLEAR_BY_PIECES.
>   (set_zero): New.
>   (clear_by_pieces): Pass set_zero as by_pieces_constfn.
>
> patch.diff
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index ffbac513692..7199e0956f8 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -1014,14 +1014,20 @@ can_use_qi_vectors (by_pieces_operation op)
>  static bool
>  by_pieces_mode_supported_p (fixed_size_mode mode, by_pieces_operation op)
>  {
> -  if (optab_handler (mov_optab, mode) == CODE_FOR_nothing)
> +  enum insn_code icode = optab_handler (mov_optab, mode);
> +  if (icode == CODE_FOR_nothing)
>  return false;
>
> -  if ((op == SET_BY_PIECES || op == CLEAR_BY_PIECES)
> +  if (op == SET_BY_PIECES
>&& VECTOR_MODE_P (mode)
>&& optab_handler (vec_duplicate_optab, mode) == CODE_FOR_nothing)
>  return false;
>
> +  if (op == CLEAR_BY_PIECES
> +  && VECTOR_MODE_P (mode)
> +  && !insn_operand_matches (icode, 1, CONST0_RTX (mode)))
> +   return false;
> +
>if (op == COMPARE_BY_PIECES
>&& !can_compare_p (EQ, mode, ccp_jump))
>  return false;
> @@ -1840,16 +1846,20 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
>  return to;
>  }
>
> +static rtx
> +set_zero (void *, void *, HOST_WIDE_INT, fixed_size_mode mode)
> +{
> +  return CONST0_RTX (mode);
> +}
> +
>  void
>  clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
>  {
>if (len == 0)
>  return;
>
> -  /* Use builtin_memset_read_str to support vector mode broadcast.  */
> -  char c = 0;
> -  store_by_pieces_d data (to, builtin_memset_read_str, &c, len, align,
> -   CLEAR_BY_PIECES);
> +  /* Use set_zero to generate const0 of centain mode.  */
> +  store_by_pieces_d data (to, set_zero, NULL, len, align, CLEAR_BY_PIECES);
>data.run ();
>  }


Re: FIXED_REGISTERS / ira_no_alloc_regs: aarch64 vs. risc-v (was Re: sched1 pathology on RISC-V : PR/114729)

2024-08-19 Thread Richard Sandiford
Vineet Gupta  writes:
> On 8/6/24 17:36, Vineet Gupta wrote:
>> I'm currently pursuing a different trail which comes form observation
>> that initial model setup concludes that pressure is 28 so with 27
>> allocable regs we are bound to spill one.
>> More on that after I find something concrete.
>
> (caveat: I enabled -fomit-frame-pointer for both risc-v and aarch64)
>
> Observation:  So on risc-v, sched1's very first pressure dump starts off with 
> initial pressure 1
>
> ;;   ==
> ;;   -- basic block 2 from 6 to 174 -- before reload
> ;;   ==
> ;;    | idx insn | mpri hght dpth prio |          # model_record_pressures ()
> ;;    |   0    6 |    0    3    0    5 | r154=high(`j')  GR_REGS:[1,+1]
>  ^^^
>
> While on aarch64 is starts off with 0.
>
> ;;    |   0    6 |    0    3    0    6 | r122=high(`j') GENERAL_REGS:[0,+1] 
> FP_REGS:[0,+0] PR_LO_REGS:[0,+0] PR_HI_REGS:[0,+0]
>  ^^^
>
> This seems to be happening because of HARD_FP (reno 8)
>
> model_start_schedule ()
>    initiate_reg_pressure_info (df_get_live_in (bb))
>    EXECUTE_IF_SET_IN_BITMAP (live, 0, j, bi)
>            mark_regno_birth_or_death (.. j )
>           if ( ! TEST_HARD_REG_BIT (ira_no_alloc_regs, regno)))
>               bitmap_set_bit (live, regno)
>
> For RISC-V, the loop above executes for regno 2 (SP), 8 (HARD_FP), 64 (FP), 
> 65 (Arg).
>
> The DF infra (before reload) sets up artificial usage for HARD_FP : see 
> df_get_regular_block_artificial_uses () hence it shows up in df_get_live_in 
> (bb)
>
> On RISC-V, FIXED_REGISTERS omits FP and consequently ira_no_alloc_regs 
> doesn't include HARD_FP. This seems sensible (at least intuitive) since the
> register allocator is allowed to use HARD_FP (which due to -fomit-frame-pointer 
> becomes the first callee-saved reg S0).
>
> (gdb) p/x this_target_ira_int->x_no_unit_alloc_regs
> $1 = {elts = {0x1f, 0x}}    <-- bit 8 for HARD_FP not set
>
> On aarch64, HARD_FP regno 29 is marked as FIXED_REGISTERS thus is present in 
> ira_no_alloc_regs
>
> (gdb) p/x this_target_ira_int->x_no_unit_alloc_regs
> $1 = {elts = {0xa000, 0x0}}
>
> So I don't understand 2 things:
>
> 1. Why is aarch64 reserving away HARD_FP (at least from ira) when the user is 
> clearly saying -fomit-frame-pointer (it seems this remains even if exceptions,
> async unwind etc. are disabled)

That decision was made before my time, but I think it's because,
per the ABI, code is allowed to assume that r29 points to a valid
frame chain record at all times, and that following the chain up
the stack will not crash (assuming an uncorrupted stack, of course).
On AArch64, -fomit-frame-pointer says that it's ok to skip setting up
x29 (and the frame chain) for a function that doesn't inherently need
a frame pointer, but it doesn't mean that we can break backtracing for
functions further up the stack.

> 2. On RISC-V sched1 is counter intuitively assuming HARD_FP is live due to 
> the weird interaction of DF infra (which always marks HARD_FP with
> artificial def) and ira_no_alloc_regs.

In general, it isn't possible to predict at this stage whether the hard
frame pointer will be needed, even for -fomit-frame-pointer.  The final
decision is made during LRA, which in the worst case has to iterate through
several elimination attempts.

So whatever we do here will be wrong for some cases.  In some ways,
assuming that the hard frame pointer will be needed is the conservative
option; if we instead assumed that it wasn't needed, we'd be more
willing to move code around to make use of that (supposed) extra register.

Thanks,
Richard

