[Bug c++/115192] [11/12/13/14 regression] -O3 miscompilation on x86-64 (loops with vectors and scalars) since r11-6380
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115192 Richard Sandiford changed: What|Removed |Added Summary|[11/12/13/14/15 regression] |[11/12/13/14 regression] |-O3 miscompilation on |-O3 miscompilation on |x86-64 (loops with vectors |x86-64 (loops with vectors |and scalars) since r11-6380 |and scalars) since r11-6380 --- Comment #11 from Richard Sandiford --- Fixed on trunk, will backport if there is no reported fallout.
[Bug c++/115192] [11/12/13/14/15 regression] -O3 miscompilation on x86-64 (loops with vectors and scalars) since r11-6380
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115192 Richard Sandiford changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org
[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635 --- Comment #19 from Richard Sandiford --- (In reply to Richard Biener from comment #14) > Usually targets do have a limit on the actual length but I see > constant_upper_bound_with_limit doesn't query such. But it would > be a more appropriate way to say there might be an actual target limit here? The discussion has moved on, but FWIW: this was a deliberate choice. The thinking at the time was that VLA code should be truly “agnostic” and not hard-code an upper limit. Hard-coding a limit would be hard-coding an assumption that the architectural maximum would never increase in future. (The main counterargument was that any uses of the .B form of TBL would break down for >256-byte vectors. We hardly use such TBLs for autovec though, and could easily choose not to use them at all.) That decision is 8 or 9 years old at this point, so it might seem overly dogmatic now. Even so, I think we should have a strong reason to change tack. It shouldn't just be about trying to avoid poly_ints :)
[Bug rtl-optimization/114664] -fno-omit-frame-pointer causes an ICE during the build of the greenlet package
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114664 --- Comment #14 from Richard Sandiford --- Yeah, I think so.
[Bug rtl-optimization/114664] -fno-omit-frame-pointer causes an ICE during the build of the greenlet package
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114664 --- Comment #12 from Richard Sandiford --- (In reply to Peter Bergner from comment #11) > > > but how are users supposed to know whether > > > -fno-omit-frame-pointer is in effect or not? I've looked and there is no > > > pre-defined macro a user could check. > > That might be a useful thing to have, but if the programmer has no control > > over the build flags (i.e. cannot require/force -fomit-frame-pointer) then I > > think the asm has to take care to save and restore the frame pointer itself. > > > > Dropping "31" from the asm means that the asm must preserve the register. > > Things will go badly if the asm doesn't do that. > > So r31 which we use as our frame-pointer reg is a non-volatile/callee saved > register, so it must be saved, but I guess they (greenlet) cannot use the > method of mentioning it in the asm clobber list to perform that. I was thinking of just the asm in isolation, rather than its effect on the containing function's callee-save set. If you have: asm volatile ("…"); then GCC expects r31 after the asm to equal r31 before the asm, regardless of the -fomit-frame-pointer setting. If the asm is: asm volatile ("li r31,0"); (sorry, I've forgotten the actual Power asm :)) then things will go wrong if GCC tries to use r31 after the asm. If the asm clobbers r31 then it must mention it in the clobber list. As things stand, it's not possible to mention r31 in the clobber list if -fno-omit-frame-pointer. This means that the only option for code that wants to support -fno-omit-frame-pointer is to make the asm's contents preserve r31, using an explicit save and restore if necessary. And that kind-of makes sense. If an asm clobbers the frame pointer, that removes GCC's main option for restoring data after the asm. A lot of other clobbers would be handled by loading data relative to the frame pointer. If the frame pointer itself has gone then things get tricky.
[Bug rtl-optimization/114664] -fno-omit-frame-pointer causes an ICE during the build of the greenlet package
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114664 --- Comment #10 from Richard Sandiford --- (In reply to Peter Bergner from comment #7) > Then that would seem to indicate that mentioning the frame pointer reg in > the asm clobber list is an error Yeah, I agree it's an error. The PR says “ICE”, but is there an internal error? The “cannot be used in ‘asm’ here” is a normal user-facing error, albeit with bad error recovery, leading us to report the same thing multiple times. > but how are users supposed to know whether > -fno-omit-frame-pointer is in effect or not? I've looked and there is no > pre-defined macro a user could check. That might be a useful thing to have, but if the programmer has no control over the build flags (i.e. cannot require/force -fomit-frame-pointer) then I think the asm has to take care to save and restore the frame pointer itself. Dropping "31" from the asm means that the asm must preserve the register. Things will go badly if the asm doesn't do that.
[Bug target/114607] aarch64: Incorrect expansion of svsudot
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114607 --- Comment #2 from Richard Sandiford --- Fixed on trunk. I'll backport in a few weeks if there's no fallout.
[Bug target/114607] aarch64: Incorrect expansion of svsudot
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114607 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2024-04-05 Ever confirmed|0 |1
[Bug target/114607] New: aarch64: Incorrect expansion of svsudot
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114607 Bug ID: 114607 Summary: aarch64: Incorrect expansion of svsudot Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: wrong-code Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org Target Milestone: --- Target: aarch64*-*-* svsudot is supposed to expand to USDOT with the second and third arguments swapped. However, there is a thinko in the code that does the reversal, making it a no-op. Unfortunately, the tests simply accept the buggy form. :-( For example, gcc.target/aarch64/sve/acle/asm/sudot_s32.c contains: /* ** sudot_s32_tied1: ** usdot z0\.s, z2\.b, z4\.b ** ret */ TEST_TRIPLE_Z (sudot_s32_tied1, svint32_t, svint8_t, svuint8_t, z0 = svsudot_s32 (z0, z2, z4), z0 = svsudot (z0, z2, z4)) where the usdot z2 and z4 operands should be in the opposite order.
[Bug target/114603] aarch64: Invalid SVE cnot optimisation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114603 Richard Sandiford changed: What|Removed |Added Last reconfirmed||2024-04-05 Status|UNCONFIRMED |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #2 from Richard Sandiford --- Fixed on trunk so far, but I'll backport if possible.
[Bug target/114603] New: aarch64: Invalid SVE cnot optimisation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114603 Bug ID: 114603 Summary: aarch64: Invalid SVE cnot optimisation Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: wrong-code Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org Target Milestone: --- Target: aarch64*-*-* An overly lax condition on the cnot combine pattern means that we optimise: #include <arm_sve.h> svint32_t foo(svbool_t pg, svint32_t y) { return svsel(svcmpeq(pg, y, 0), svdup_s32(1), svdup_s32(0)); } to a single cnot: foo: cnot z0.s, p0/m, z0.s ret The result must be 0 for inactive elements of pg, whereas the above would leave the elements unchanged instead. This seems to have been around since the SVE ACLE was first added.
[Bug target/114577] Inefficient codegen for SVE/NEON bridge
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114577 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED CC||rsandifo at gcc dot gnu.org Status|UNCONFIRMED |RESOLVED --- Comment #2 from Richard Sandiford --- Fixed.
[Bug target/114521] [11 only] aarch64: wrong code with Neon ld1/st1x4 intrinsics gcc-11 and earlier
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114521 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #2 from Richard Sandiford --- Oops. I was going to upload a patch for the bug here, but it looks like I accidentally committed it while backporting PR97696 to GCC 11. The patch was g:daee0409d195d346562e423da783d5d1cf8ea175. I'm not sure what to do now. Perhaps we should leave it in?
[Bug rtl-optimization/114515] [14 Regression] Failure to use aarch64 lane forms after PR101523
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515 --- Comment #5 from Richard Sandiford --- For the record, the associated new testsuite failures are: FAIL: gcc.target/aarch64/ashltidisi.c scan-assembler-times asr 3 FAIL: gcc.target/aarch64/asimd-mull-elem.c scan-assembler-times \\s+fmul\\tv[0-9]+\\.4s, v[0-9]+\\.4s, v[0-9]+\\.s\\[0\\] 4 FAIL: gcc.target/aarch64/asimd-mull-elem.c scan-assembler-times \\s+mul\\tv[0-9]+\\.4s, v[0-9]+\\.4s, v[0-9]+\\.s\\[0\\] 4 FAIL: gcc.target/aarch64/ccmp_3.c scan-assembler-not \tcbnz\t FAIL: gcc.target/aarch64/pr100056.c scan-assembler-times \\t[us]bfiz\\tw[0-9]+, w[0-9]+, 11 2 FAIL: gcc.target/aarch64/pr100056.c scan-assembler-times \\tadd\\tw[0-9]+, w[0-9]+, w[0-9]+, uxtb\\n 2 FAIL: gcc.target/aarch64/pr108840.c scan-assembler-not and\\tw[0-9]+, w[0-9]+, 31 FAIL: gcc.target/aarch64/pr112105.c scan-assembler-not \\tdup\\t FAIL: gcc.target/aarch64/pr112105.c scan-assembler-times (?n)\\tfmul\\t.*v[0-9]+\\.s\\[0\\]\\n 2 FAIL: gcc.target/aarch64/rev16_2.c scan-assembler-times rev16\\tx[0-9]+ 2 FAIL: gcc.target/aarch64/vaddX_high_cost.c scan-assembler-not dup\\t FAIL: gcc.target/aarch64/vmul_element_cost.c scan-assembler-not dup\\t FAIL: gcc.target/aarch64/vmul_high_cost.c scan-assembler-not dup\\t FAIL: gcc.target/aarch64/vsubX_high_cost.c scan-assembler-not dup\\t FAIL: gcc.target/aarch64/sve/pr98119.c scan-assembler \\tand\\tx[0-9]+, x[0-9]+, #?-31\\n FAIL: gcc.target/aarch64/sve/pred-not-gen-1.c scan-assembler-not \\tbic\\t FAIL: gcc.target/aarch64/sve/pred-not-gen-1.c scan-assembler-times \\tnot\\tp[0-9]+\\.b, p[0-9]+/z, p[0-9]+\\.b\\n 1 FAIL: gcc.target/aarch64/sve/pred-not-gen-4.c scan-assembler-not \\tbic\\t FAIL: gcc.target/aarch64/sve/pred-not-gen-4.c scan-assembler-times \\tnot\\tp[0-9]+\\.b, p[0-9]+/z, p[0-9]+\\.b\\n 1 FAIL: gcc.target/aarch64/sve/var_stride_2.c scan-assembler-times \\tubfiz\\tx[0-9]+, x2, 10, 16\\n 1 FAIL: gcc.target/aarch64/sve/var_stride_2.c scan-assembler-times \\tubfiz\\tx[0-9]+, x3, 10, 16\\n 1 
FAIL: gcc.target/aarch64/sve/var_stride_4.c scan-assembler-times \\tsbfiz\\tx[0-9]+, x2, 10, 32\\n 1 FAIL: gcc.target/aarch64/sve/var_stride_4.c scan-assembler-times \\tsbfiz\\tx[0-9]+, x3, 10, 32\\n 1
[Bug rtl-optimization/114515] [14 Regression] Failure to use aarch64 lane forms after PR101523
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515 --- Comment #4 from Richard Sandiford --- (In reply to Richard Biener from comment #1) > Btw, why does forwprop not do this? Not 100% sure (I wasn't involved in choosing the current heuristics). But fwprop can propagate across blocks, so there is probably more risk of increasing register pressure.
[Bug rtl-optimization/114515] [14 Regression] Failure to use aarch64 lane forms after PR101523
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515 --- Comment #3 from Richard Sandiford --- In RTL terms, the dup is vec_duplicate. The combination is: Trying 10 -> 13: 10: r107:V4SF=vec_duplicate(r115:SF) REG_DEAD r115:SF 13: r110:V4SF=r111:V4SF*r107:V4SF REG_DEAD r111:V4SF Failed to match this instruction: (parallel [ (set (reg:V4SF 110 [ _2 ]) (mult:V4SF (vec_duplicate:V4SF (reg:SF 115)) (reg:V4SF 111 [ *ptr_6(D) ]))) (set (reg:V4SF 107) (vec_duplicate:V4SF (reg:SF 115))) ]) Failed to match this instruction: (parallel [ (set (reg:V4SF 110 [ _2 ]) (mult:V4SF (vec_duplicate:V4SF (reg:SF 115)) (reg:V4SF 111 [ *ptr_6(D) ]))) (set (reg:V4SF 107) (vec_duplicate:V4SF (reg:SF 115))) ]) Successfully matched this instruction: (set (reg:V4SF 107) (vec_duplicate:V4SF (reg:SF 115))) Successfully matched this instruction: (set (reg:V4SF 110 [ _2 ]) (mult:V4SF (vec_duplicate:V4SF (reg:SF 115)) (reg:V4SF 111 [ *ptr_6(D) ]))) allowing combination of insns 10 and 13 original costs 8 + 20 = 28 replacement costs 8 + 20 = 28 modifying insn i2 10: r107:V4SF=vec_duplicate(r115:SF) deferring rescan insn with uid = 10. modifying insn i3 13: r110:V4SF=vec_duplicate(r115:SF)*r111:V4SF REG_DEAD r115:SF REG_DEAD r111:V4SF deferring rescan insn with uid = 13.
[Bug rtl-optimization/114515] New: [14 Regression] Failure to use aarch64 lane forms after PR101523
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515 Bug ID: 114515 Summary: [14 Regression] Failure to use aarch64 lane forms after PR101523 Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org Target Milestone: --- The following test regressed on aarch64 after g:839bc42772ba7af66af3bd16efed4a69511312ae (the fix for PR101523): typedef float v4sf __attribute__((vector_size(16))); void f (v4sf *ptr, float f) { ptr[0] = ptr[0] * (v4sf) { f, f, f, f }; ptr[1] = ptr[1] * (v4sf) { f, f, f, f }; } Compiled with -O2, we previously generated: ldp q1, q31, [x0] fmul v1.4s, v1.4s, v0.s[0] fmul v31.4s, v31.4s, v0.s[0] stp q1, q31, [x0] ret Now we generate: ldp q1, q31, [x0] dup v0.4s, v0.s[0] fmul v1.4s, v1.4s, v0.4s fmul v31.4s, v31.4s, v0.4s stp q1, q31, [x0] ret with the extra dup. The patch is trying to avoid cases where i3 is canonicalised by contextual information provided by i2. But here we place a full copy of i2 into i3 (creating an instruction that is no more expensive). This is a benefit in its own right because the two instructions can then execute in parallel rather than serially. But it also means that, as here, we might be able to remove i2 with later combinations. Perhaps we could also check whether i3 still contains the destination of i2?
[Bug sanitizer/97696] ICE since ASAN_MARK does not handle poly_int sized variables
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97696 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #9 from Richard Sandiford --- Fixed on trunk and all active release branches.
[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302 --- Comment #5 from Richard Sandiford --- (In reply to Andrew Stubbs from comment #4) > Yes, that's what the simd-math-3* tests do. Ah, OK. > The simd-math-5* tests are explicitly supposed to be doing this in the > context of the autovectorizer. > > If these tests are being compiled as (newly) intended then we should change > the expected results. > > So, questions: > > 1. Are the new results actually correct? (So far I only know that being > different is expected.) I believe so. We now do the division in 32 bits, as in the original gimple. > 2. Is there some other testcase form that would exercise the previously > intended routines? It should be possible in languages that don't have C's integer promotion rules, if you're up for some Ada or Rust. > 3. Is the new behaviour configurable? I don't think the 16-bit shift bug > ever existed on GCN (in which "short" vectors actually have excess bits in > each lane, much like scalar registers do). Not AFAIK. The problem is that the gimple→gimple transformation changes the gimple-level semantics of the code. Shifts by out-of-range values are undefined rather than target-defined. (And in other cases that's useful, because it means we don't need to preserve whatever value the target happens to give for an out-of-range shift.)
[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302 --- Comment #3 from Richard Sandiford --- Ah, ok. If the main aim is to test the libgcc routines, it might be safer to use something like: typedef char v64qi __attribute__((vector_size(64))); v64qi f(v64qi x, v64qi y) { return x / y; } instead of relying on vectorisation.
[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302 --- Comment #1 from Richard Sandiford --- The decision to stop narrowing division was deliberate, see the comments in PR113281 for details. Is the purpose of the test to check vectorisation quality, or to check for the right ABI routines?
[Bug tree-optimization/114234] New: [14 Regression] verify_ssa failure with early-break vectorisation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114234 Bug ID: 114234 Summary: [14 Regression] verify_ssa failure with early-break vectorisation Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: ice-on-valid-code Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org Target Milestone: --- The following test ICEs with -Ofast on aarch64: void bar(); float foo (float x) { float a = 1; float b = x; long z = 200; for (;;) { float c = b - 1.0f; a *= c; z -= 1; if (z == 0) { bar (); break; } if (b <= 3.0f) break; b = c; } return a * b; } (reduced from wrf). The ICE is: foo.c:3:1: error: definition in block 15 does not dominate use in block 10 3 | foo (float x) | ^~~ for SSA_NAME: stmp_a_9.10_103 in statement: a_47 = PHI PHI argument stmp_a_9.10_103 for PHI node a_47 = PHI during GIMPLE pass: vect
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 Richard Sandiford changed: What|Removed |Added Attachment #57602|0 |1 is obsolete|| --- Comment #42 from Richard Sandiford --- Created attachment 57605 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57605&action=edit proof-of-concept patch to suppress peeling for gaps How about the attached? It records whether all accesses that require peeling for gaps could instead have used gathers, and only retries when that's true. It means that we retry for only 0.034% of calls to vect_analyze_loop_1 in a build of SPEC2017 with -mcpu=neoverse-v1 -Ofast -fomit-frame-pointer. The figures exclude wrf, which failed for me with: module_mp_gsfcgce.fppized.f90:852:23: 852 |REAL FUNCTION ggamma(X) | ^ Error: definition in block 18 does not dominate use in block 13 for SSA_NAME: stmp_pf_6.5657_140 in statement: pf_81 = PHI PHI argument stmp_pf_6.5657_140 for PHI node pf_81 = PHI during GIMPLE pass: vect module_mp_gsfcgce.fppized.f90:852:23: internal compiler error: verify_ssa failed Will look at that tomorrow.
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #41 from Richard Sandiford --- (In reply to Richard Biener from comment #40) > So I wonder if we can use "local costing" to decide a gather is always OK > compared to the alternative with peeling for gaps. On x86 gather tends > to be slow compared to open-coding it. Yeah, on SVE gathers are generally “enabling” instructions rather than something to use for their own sake. I suppose one problem is that we currently only try to use gathers for single-element groups. If we make a local decision to use gathers while keeping that restriction, we could end up using gathers “unnecessarily” while still needing to peel for gaps for (say) a two-element group. That is, it's only better to use gathers than contiguous loads if by doing that we avoid all need to peel for gaps (and if the cost of peeling for gaps was high enough to justify the cost of using gathers over consecutive loads). One of the things on the list to do (once everything is SLP!) is to support loads with gaps directly via predication, so that we never load elements that aren't needed. E.g. on SVE, a 64-bit predicate (PTRUE .D) can be used with a 32-bit load (LD1W .S) to load only even-indexed elements. So a single-element group with a group size of 2 could be done cheaply with just consecutive loads, without peeling for gaps.
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #39 from Richard Sandiford --- (In reply to Richard Sandiford from comment #38) > (In reply to Richard Biener from comment #37) > > Even more iteration looks bad. I do wonder why when gather can avoid > > peeling for GAPs using load-lanes cannot? > Like you say, we don't realise that all the loads from array3[i] form a > single group. Oops, sorry, I shouldn't have gone off memory. So yeah, it's array1[] where that happens, not array3[]. The reason we don't use load-lanes is that we don't have load-lane instructions for smaller elements in larger containers, so we're forced to use load-and-permute instead.
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #38 from Richard Sandiford --- (In reply to Richard Biener from comment #37) > Even more iteration looks bad. I do wonder why when gather can avoid > peeling for GAPs using load-lanes cannot? Like you say, we don't realise that all the loads from array3[i] form a single group. Note that we're not using load-lanes in either case, since the group size (8) is too big for that. But load-lanes and load-and-permute have the same restriction about when peeling for gaps is required. In contrast, gather loads only ever load data that they actually need. > Also for the stores we seem to use elementwise stores rather than store-lanes. What configuration are you trying? The original report was about SVE, so I was trying that. There we use a scatter store. > To me the most obvious thing to try optimizing in this testcase is DR > analysis. With -march=armv8.3-a I still see > > t.c:26:22: note: === vect_analyze_data_ref_accesses === > t.c:26:22: note: Detected single element interleaving array1[0][_8] step 4 > t.c:26:22: note: Detected single element interleaving array1[1][_8] step 4 > t.c:26:22: note: Detected single element interleaving array1[2][_8] step 4 > t.c:26:22: note: Detected single element interleaving array1[3][_8] step 4 > t.c:26:22: note: Detected single element interleaving array1[0][_1] step 4 > t.c:26:22: note: Detected single element interleaving array1[1][_1] step 4 > t.c:26:22: note: Detected single element interleaving array1[2][_1] step 4 > t.c:26:22: note: Detected single element interleaving array1[3][_1] step 4 > t.c:26:22: missed: not consecutive access array2[_4][_8] = _69; > t.c:26:22: note: using strided accesses > t.c:26:22: missed: not consecutive access array2[_4][_1] = _67; > t.c:26:22: note: using strided accesses > > so we don't figure > > Creating dr for array1[0][_1] > base_address: > offset from base address: (ssizetype) ((sizetype) (m_111 * 2) * 2) > constant offset from base 
address: 0 > step: 4 > base alignment: 16 > base misalignment: 0 > offset alignment: 4 > step alignment: 4 > base_object: array1 > Access function 0: {m_111 * 2, +, 2}_4 > Access function 1: 0 > Creating dr for array1[0][_8] > analyze_innermost: success. > base_address: > offset from base address: (ssizetype) ((sizetype) (m_111 * 2 + 1) * > 2) > constant offset from base address: 0 > step: 4 > base alignment: 16 > base misalignment: 0 > offset alignment: 2 > step alignment: 4 > base_object: array1 > Access function 0: {m_111 * 2 + 1, +, 2}_4 > Access function 1: 0 > > belong to the same group (but the access functions tell us it worked out). > Above we fail to split the + 1 to the constant offset. OK, but this is moving the question on to how we should optimise the testcase for Advanced SIMD rather than SVE, and how we should optimise the testcase in general, rather than simply recover what we could do before. (SVE is only enabled for -march=armv9-a and above, in case armv8.3-a was intended to enable SVE too.)
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #36 from Richard Sandiford --- Created attachment 57602 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57602&action=edit proof-of-concept patch to suppress peeling for gaps This patch does what I suggested in the previous comment: if the loop needs peeling for gaps, try again without that, and pick the better loop. It seems to restore the original style of code for SVE. A more polished version would be a bit smarter about when to retry. E.g. it's pointless if the main loop already operates on full vectors (i.e. if peeling 1 iteration is natural in any case). Perhaps the condition should be that either (a) the number of epilogue iterations is known to be equal to the VF of the main loop or (b) the target is known to support partial vectors for the loop's vector_mode. Any thoughts?
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #35 from Richard Sandiford --- Maybe I've misunderstood the flow of the ticket, but it looks to me like we do still correctly recognise the truncating scatter stores. And, on their own, we would be able to convert them into masked scatters. The reason for the epilogue is instead on the load side. There we have a non-strided grouped load, and currently we hard-code the assumption that it is better to use contiguous loads and permutes rather than gather loads where possible. So we have: /* As a last resort, trying using a gather load or scatter store. ??? Although the code can handle all group sizes correctly, it probably isn't a win to use separate strided accesses based on nearby locations. Or, even if it's a win over scalar code, it might not be a win over vectorizing at a lower VF, if that allows us to use contiguous accesses. */ if (*memory_access_type == VMAT_ELEMENTWISE && single_element_p && loop_vinfo && vect_use_strided_gather_scatters_p (stmt_info, loop_vinfo, masked_p, gs_info)) *memory_access_type = VMAT_GATHER_SCATTER; only after we've tried and failed to use load lanes or load+permute. If instead I change the order so that the code above is tried first, then we do use extending gather loads and truncating scatter stores as before, with no epilogue loop. So I suppose the question is: if we do prefer to use gathers over load+permute for some cases, how do we decide which to use? And can it be done a per-load basis, or should it instead be a per-loop decision? E.g., if we end up with a loop that needs peeling for gaps, perhaps we should try again and forbid peeling for gaps. Then, if that succeeds, see which loop gives the better overall cost. Of course, trying more things means more compile time…
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #33 from Richard Sandiford --- Can you give me a chance to look at it a bit when I'm back? This doesn't feel like the way to go to me.
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #31 from Richard Sandiford --- (In reply to Tamar Christina from comment #29) > This works fine for normal gather and scatters but doesn't work for widening > gathers and narrowing scatters which only the pattern seems to handle. I'm supposedly on holiday, so didn't see the IRC discussion, but: as I remember it, there is no narrowing or widening for IFN gathers or scatters as such, even for patterns. One vector's worth of offsets corresponds to one vector's worth of data. But the widths of the data elements and the offset elements can be different. Any sign or zero extension of a loaded vector, or any operation to double or halve the number of vectors, is done separately. I think it does make sense to stick to that, rather than (say) have IFNs that load two offset vectors into one data vector, or use one offset vector to load two data vectors. Supporting those combinations would mean that we have two different ways in which the offset elements and data elements have different widths. And it isn't really a natural fusion on SVE.
[Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #8 from Richard Sandiford --- The reason early_ra doesn't help with the original testcase is that early_ra punts on any non-move instruction that has a hard register destination. And it does that because it can't cope well with cases where hard-coded destinations force the wrong choice (unlike the proper allocators, which can change the destination where necessary). The restriction is needed to avoid regressing SVE ACLE tests.
[Bug sanitizer/97696] ICE since ASAN_MARK does not handle poly_int sized variables
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97696 --- Comment #3 from Richard Sandiford --- Created attachment 57520 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57520&action=edit Candidate patch The attached patch seems to fix it. I'm taking next week off, but I'll run the patch through proper testing when I get back.
[Bug sanitizer/97696] ICE since ASAN_MARK does not handle poly_int sized variables
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97696 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org
[Bug middle-end/113205] [14 Regression] internal compiler error: in backward_pass, at tree-vect-slp.cc:5346 since r14-3220
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113205 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #14 from Richard Sandiford --- Finally fixed.
[Bug middle-end/113205] [14 Regression] internal compiler error: in backward_pass, at tree-vect-slp.cc:5346 since r14-3220
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113205 --- Comment #12 from Richard Sandiford --- Created attachment 57511 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57511&action=edit Candidate patch Sorry for the very slow response on this. I'm testing the attached.
[Bug middle-end/113205] [14 Regression] internal compiler error: in backward_pass, at tree-vect-slp.cc:5346 since r14-3220
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113205 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED
[Bug target/112922] [14 Regression] 465.tonto from SPECFP 2006 fails train run on Aarch64-linux with -O2 and -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112922 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|UNCONFIRMED |RESOLVED --- Comment #3 from Richard Sandiford --- Assumed fixed by the patches for PR113295. Please reopen if not.
[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 Bug 26163 depends on bug 112922, which changed state. Bug 112922 Summary: [14 Regression] 465.tonto from SPECFP 2006 fails train run on Aarch64-linux with -O2 and -flto https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112922 What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED
[Bug target/113295] [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 when built with -Ofast -mcpu=native since g:2f46e3578d45ff060a0a329cb39d4f52878f9d5a
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #9 from Richard Sandiford --- Fixed.
[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 Bug 26163 depends on bug 113295, which changed state. Bug 113295 Summary: [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 when built with -Ofast -mcpu=native since g:2f46e3578d45ff060a0a329cb39d4f52878f9d5a https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295 What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED
[Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #9 from Richard Sandiford --- Fixed.
[Bug target/113295] [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 when built with -Ofast -mcpu=native since g:2f46e3578d45ff060a0a329cb39d4f52878f9d5a
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295 --- Comment #6 from Richard Sandiford --- For me the miscompilation is in jkdmem_, where we end up allocating the same registers to both arms of an fcsel. It sounds like it occurs elsewhere too. I have a candidate fix, but need to think a bit more about it.
[Bug preprocessor/114007] gcc chokes on __has_cpp_attribute(clang::unsafe_buffer_usage)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114007 --- Comment #14 from Richard Sandiford --- I might have misunderstood the suggestion and so be arguing against something that no-one is suggesting, but I think [[__extension__ …]] should accept the same things for all standard versions (C23, pre-C23, and GNU). It was intended to be something that header files and macros could use without needing to be sensitive to the user's choice of standard.
[Bug target/113995] ICE: in change_address_1, at emit-rtl.cc:2299 with [[arm::streaming_compatible]] and -march=armv9-a+sve -finstrument-functions -fstack-clash-protection
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113995 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #4 from Richard Sandiford --- Fixed.
[Bug target/113220] [aarch64] ICE Segmentation fault with r14-6178-g8d29b7aca15133
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113220 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #5 from Richard Sandiford --- Fixed
[Bug target/113295] [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 when built with -Ofast -mcpu=native since g:2f46e3578d45ff060a0a329cb39d4f52878f9d5a
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #5 from Richard Sandiford --- Mine. Could be the same as PR112922.
[Bug target/112922] [14 Regression] 465.tonto from SPECFP 2006 fails train run on Aarch64-linux with -O2 and -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112922 --- Comment #2 from Richard Sandiford --- I don't remember there being a deliberate bug fix in that patch, but there were some others later. I suppose the optimistic case is that this first went latent and then was fixed “properly” afterwards. But it could just be latent.
[Bug target/113778] ICE: RTL check: expected elt 3 type 'e' or 'u', have '0' (rtx code_label) in PATTERN, at rtl.h:1506 with -march=armv9-a+sme -fhardened
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113778 Richard Sandiford changed: What|Removed |Added Resolution|--- |DUPLICATE Status|ASSIGNED|RESOLVED --- Comment #4 from Richard Sandiford --- Confirmed that this is the same as PR113220. *** This bug has been marked as a duplicate of bug 113220 ***
[Bug target/113220] [aarch64] ICE Segmentation fault with r14-6178-g8d29b7aca15133
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113220 Richard Sandiford changed: What|Removed |Added CC||zsojka at seznam dot cz --- Comment #3 from Richard Sandiford --- *** Bug 113778 has been marked as a duplicate of this bug. ***
[Bug target/113805] sme2 vs -mtrack-speculation, ICE in extract_insn, at recog.cc:2812 with r14-6178-g8d29b7aca15133
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113805 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #3 from Richard Sandiford --- Fixed.
[Bug target/113995] ICE: in change_address_1, at emit-rtl.cc:2299 with [[arm::streaming_compatible]] and -march=armv9-a+sve -finstrument-functions -fstack-clash-protection
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113995 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #2 from Richard Sandiford --- Possibly a dup of some of the other PRs in this area.
[Bug middle-end/113988] during GIMPLE pass: bitintlower: internal compiler error: in lower_stmt, at gimple-lower-bitint.cc:5470
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #16 from Richard Sandiford --- I think it'd be worth splitting the mode selection out into a helper called bitwise_mode_for_size, to go alongside bitwise_mode_for_mode.
[Bug target/113778] ICE: RTL check: expected elt 3 type 'e' or 'u', have '0' (rtx code_label) in PATTERN, at rtl.h:1506 with -march=armv9-a+sme -fhardened
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113778 Richard Sandiford changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1 Last reconfirmed||2024-02-15 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #3 from Richard Sandiford --- Possibly the same as PR113220
[Bug target/113805] sme2 vs -mtrack-speculation, ICE in extract_insn, at recog.cc:2812 with r14-6178-g8d29b7aca15133
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113805 Richard Sandiford changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2024-02-12 Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #1 from Richard Sandiford --- Heh. I knew this was coming after seeing the previous -mtrack-speculation bug, but didn't have a chance to fix it pre-emptively last week.
[Bug target/113763] [14 Regression] build fails with clang++ host compiler because aarch64.cc uses C++14 constexpr.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113763 --- Comment #14 from Richard Sandiford --- AFAIK, the constructor shouldn't be necessary. (And without it, the whole thing would fit on one line.) LGTM (and preapproved) otherwise. Thanks for doing this.
[Bug target/113763] [14 Regression] build fails with clang++ host compiler because aarch64.cc uses C++14 constexpr.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113763 --- Comment #11 from Richard Sandiford --- Currently away so can't try it myself, but how about just using an ad-hoc structure instead?
[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #34 from Richard Sandiford --- (In reply to Richard Biener from comment #32) > Btw, AVX512 knotb will invert all 8 bits and there's no knot just affecting > the lowest 4 or 2 bits. > > It all feels like desaster waiting to happen ;) Yes :) > For example BIT_NOT_EXPR is RTL expanded like > > case BIT_NOT_EXPR: > op0 = expand_expr (treeop0, subtarget, > VOIDmode, EXPAND_NORMAL); > if (modifier == EXPAND_STACK_PARM) > target = 0; > /* In case we have to reduce the result to bitfield precision > for unsigned bitfield expand this as XOR with a proper constant > instead. */ > if (reduce_bit_field && TYPE_UNSIGNED (type)) > { > int_mode = SCALAR_INT_TYPE_MODE (type); > wide_int mask = wi::mask (TYPE_PRECISION (type), > false, GET_MODE_PRECISION (int_mode)); > > temp = expand_binop (int_mode, xor_optab, op0, > immed_wide_int_const (mask, int_mode), > target, 1, OPTAB_LIB_WIDEN); > > so we could, for VECTOR_BOOLEAN_TYPE_P with integer mode and > effective bit-precision set reduce_bit_field and fixup the fallout > (not sure why the above is only for TYPE_UNSIGNED). > > At least it feels similar and doing things the opposite for vectors > (fixing up at uses) would be odd? Do you know why we take this approach for integers? Is it for correctness? Or is it supposed to be more optimal? I can imagine that, for arithmetic types, there are going to be many more instances where upper bits matter (division, right shifts, MIN/MAX, etc.). So perhaps reducing every result is a good trade-off there. But there's an argument that it should be rare for the padding bits in a vector to matter, since very few things would look at the padding bits anyway. So perhaps the cost should be borne by the operations that need canonical integers. Not a strong opinion though, more just devil's advocate. There again, if e.g. the x86 API guarantees memcmp equality between two masks whose significant bits are equal, then we probably have no choice.
[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #30 from Richard Sandiford --- (In reply to Richard Biener from comment #29) > But that's just for CONSTRUCTORs, we got the VIEW_CONVERT_EXPR path for > VECTOR_CSTs. But yeah, that _might_ argue we should perform the same > masking for VECTOR_CST expansion as well, instead of trying to fixup > in do_compare_and_jump? But then how would ~ be implemented for things like 4-bit masks? If we use notqi2 then I assume the upper bits could be 1 rather than 0.
[Bug debug/113636] [14 Regression] internal compiler error: in dead_debug_global_find, at valtrack.cc:275 since r14-6290-g9f0f7d802482a8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113636 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #12 from Richard Sandiford --- Fixed. Thanks for the report and help with reproducing.
[Bug target/113623] [14 Regression] ICE in aarch64_pair_mem_from_base since r14-6605
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113623 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #8 from Richard Sandiford --- Fixed.
[Bug target/111677] [12/13 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677 --- Comment #21 from Richard Sandiford --- (In reply to Alex Coplan from comment #13) > The problem seems to be this code in aarch64_process_components: > > while (regno != last_regno) > { > bool frame_related_p = aarch64_emit_cfi_for_reg_p (regno); > machine_mode mode = aarch64_reg_save_mode (regno); > > rtx reg = gen_rtx_REG (mode, regno); > poly_int64 offset = frame.reg_offset[regno]; > if (frame_pointer_needed) > offset -= frame.bytes_below_hard_fp; > > rtx addr = plus_constant (Pmode, ptr_reg, offset); > rtx mem = gen_frame_mem (mode, addr); > > which emits a TFmode mem with offset 512, which is out of range for TFmode > (so we later ICE with an unrecognisable insn). Presumably this just needs > tweaking to emit a new base anchor in the case of large offsets like this. > It looks like the code in aarch64_save_callee_saves already does this. We shouldn't emit new anchor registers here, since unlike in the prologue, we don't have any guarantee that certain registers are free. aarch64_get_separate_components is supposed to vet shrink-wrappable offsets, but in this case the offset looks valid, since: str q22, [sp, #512] is a valid instruction. Perhaps the constraints are too narrow?
[Bug target/113623] [14 Regression] ICE in aarch64_pair_mem_from_base since r14-6605
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113623 Richard Sandiford changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #6 from Richard Sandiford --- Mine.
[Bug debug/113636] [14 Regression] internal compiler error: in dead_debug_global_find, at valtrack.cc:275 since r14-6290-g9f0f7d802482a8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113636 Richard Sandiford changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Last reconfirmed||2024-01-29 --- Comment #10 from Richard Sandiford --- Mine.
[Bug tree-optimization/113281] Wrong code due to vectorization of shift reduction and missing promotions since r14-3027
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281 --- Comment #24 from Richard Sandiford --- Fixed on trunk so far, but it's latent on branches. I'll see what the trunk fallout is like before asking about backports.
[Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613 Richard Sandiford changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #7 from Richard Sandiford --- early-ra does try to avoid reusing registers too soon, to increase scheduling freedom. But in this case I imagine it handles the two statements as separate regions. Should be fixable by carrying across a round-robin counter.
[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #18 from Richard Sandiford --- (In reply to Tamar Christina from comment #17) > Well the mid-end has generated the right precision. The type it generates is > vector(4) vexit_reduc_67; > so it does say it's a single bit boolean. > > Isn't this just an expand problem? That's what I meant. expand is using a QImode comparison to compare things with 4-bit precision, so I think the masking should happen at that point. How about doing the masking in do_compare_and_jump?
[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #16 from Richard Sandiford --- (In reply to Richard Biener from comment #15) > I think the problem is the cbranch pattern which looks at all of the > QImode mask - but of course it doesn't know it's really V4BImode it's > working on ... Yeah. Currently building an x86_64 toolchain to have a look, but I think whatever code uses a cbranch with a higher precision than the inputs should mask off the significant bits beforehand. cbranch can also be used for comparing two variable masks too. I suppose we could add a hook to say that padding bits of an integer mask must always be zero (and so it's expand's job to ensure that that holds for any mask operation). But it feels dangerously close to TRULY_NOOP_TRUNCATION for integers.
[Bug target/113550] data512_t initializers dereference a clobbered register
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113550 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #5 from Richard Sandiford --- Yeah, FWIW, I agree improving the define_split is probably best. Now fixed.
[Bug target/113485] [14 regression] ICE with -fno-guess-branch-probability on aarch64 starting with r14-7187-g74e3e839ab2d36
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113485 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #9 from Richard Sandiford --- Fixed.
[Bug target/113572] [14 Regression] aarch64: internal compiler error in aarch64_sve::vector_cst_all_same
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113572 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #7 from Richard Sandiford --- Fixed.
[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #13 from Richard Sandiford --- I don't think there's any principle that upper bits must be zero. How do we end up with a pattern that depends on that being the case?
[Bug tree-optimization/113281] [14 Regression] Wrong code due to vectorization of shift reduction and missing promotions since r14-3027
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #22 from Richard Sandiford --- Taking following discussion on irc.
[Bug target/113572] [14 Regression] aarch64: internal compiler error in aarch64_sve::vector_cst_all_same
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113572 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #5 from Richard Sandiford --- (In reply to Jakub Jelinek from comment #4) > So, if that part is right, I think we want to use VECTOR_CST_ELT instead of > VECTOR_CST_ENCODED_ELT, like: > --- gcc/config/aarch64/aarch64-sve-builtins.cc.jj 2024-01-12 > 13:47:20.815429012 +0100 > +++ gcc/config/aarch64/aarch64-sve-builtins.cc 2024-01-24 > 20:58:33.720677634 > +0100 > @@ -3474,7 +3474,7 @@ vector_cst_all_same (tree v, unsigned in >unsigned int nelts = lcm * VECTOR_CST_NELTS_PER_PATTERN (v); >tree first_el = VECTOR_CST_ENCODED_ELT (v, 0); >for (unsigned int i = 0; i < nelts; i += step) > -if (!operand_equal_p (VECTOR_CST_ENCODED_ELT (v, i), first_el, 0)) > +if (!operand_equal_p (VECTOR_CST_ELT (v, i), first_el, 0)) >return false; > >return true; > which fixes the ICE. Yeah, that's the correct fix. Sorry for missing it.
[Bug target/113485] [14 regression] ICE with -fno-guess-branch-probability on aarch64 starting with r14-7187-g74e3e839ab2d36
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113485 Richard Sandiford changed: What|Removed |Added Status|NEW |ASSIGNED --- Comment #7 from Richard Sandiford --- I suppose the ZIP1 patterns should just have 64-bit inputs, rather than going to the trouble of creating paradoxical subregs. > cfun->machine->advsimd_zero_insn use is plain wrong. As the RTL could be > removed fully from the RTL stream and then it will be GC'ed. But machine_function is a GTYed structure, so the reference itself should prevent GC. I don't think we should be in the practice of explicitly ggc_free()ing RTL, since callers don't generally know what other references there might be.
[Bug target/109929] profiledbootstrap failure on aarch64-linux-gnu with graphite optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109929 --- Comment #7 from Richard Sandiford --- Hmm, yeah, like you say, neither of those commits should have made a difference to whether bootstrap works. I guess the problem is just latent now.
[Bug rtl-optimization/111267] [14 Regression] Codegen regression from i386 argument passing changes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267 --- Comment #12 from Richard Sandiford --- I don't object to the patch, but for the record: the current heuristics go back a long way. Although I reworked the pass to use rtl-ssa a few years ago, I tried as far as possible to preserve the old heuristics (tested by making sure that there were no unexplained differences over a large set of targets). I wouldn't characterise the old heuristics as a logic error. Although I didn't write them, my understanding is that they were being deliberately conservative, in particular due to the risk of introducing excess register pressure. So this change seems potentially quite invasive for stage 4. Perhaps it'll work out — if so, great! But if there is some fallout, I think we should lean towards reverting the patch and revisiting in GCC 15.
[Bug target/113196] [14 Regression] Failure to use ushll{,2}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113196 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #3 from Richard Sandiford --- Fixed.
[Bug target/112989] [14 Regression] GC ICE with C++, `#include <arm_sve.h>` and `-fsanitize=address`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112989 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #15 from Richard Sandiford --- I didn't manage to reproduce the PR in the originally reported form, but FWIW, the patches mean that a gcc_unreachable placed just above the `return decl;` in simulate_builtin_function_decl no longer fires for arm_sve.h or arm_sme.h. Please reopen if there are still some lingering issues.
[Bug target/112989] [14 Regression] GC ICE with C++, `#include <arm_sve.h>` and `-fsanitize=address`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112989 --- Comment #12 from Richard Sandiford --- > another is try > #pragma GCC aarch64 "arm_sve.h" > after a couple of intentional declarations of the SVE builtins with > non-standard return/argument types and make sure that while it emits some > errors, it doesn't try to use ggc_freed decls in registered tables. FWIW, this is what the g*.target/aarch64/sve/acle/general*/func_redef_* tests are supposed to test (although not specifically targeting ggc_free).
[Bug target/112989] [14 Regression] GC ICE with C++, `#include <arm_sve.h>` and `-fsanitize=address`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112989 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #10 from Richard Sandiford --- Mine.
[Bug target/113270] [14 Regression] AArch64 ICEs in register_tuple_type since r14-6524
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113270 --- Comment #8 from Richard Sandiford --- Thanks for trying it, and sorry for not doing it myself. The patch LGTM FWIW, so preapproved if it passes testing (which I'm sure it will :))
[Bug target/113270] [14 Regression] AArch64 ICEs in register_tuple_type since r14-6524
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113270 --- Comment #6 from Richard Sandiford --- I think we want the patch in comment 3, but in addition, I then also needed to use the following for a similar SVE case: extern GTY(()) tree scalar_types[NUM_VECTOR_TYPES + 1]; tree scalar_types[NUM_VECTOR_TYPES + 1]; In this case that would mean adding: extern GTY(()) aarch64_simd_type_info aarch64_simd_types[]; just above the definition in aarch64-builtins.cc.
[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #6 from Richard Sandiford --- Fixed. Thanks for the report.
[Bug c++/68703] __attribute__((vector_size(N))) template member confusion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68703 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #11 from Richard Sandiford --- FWIW, the following adaptation of the original testcase still fails on trunk, but is accepted by Clang: template<int N = 4> struct D { using t = int __attribute__((vector_size(N * sizeof(int)))); t v; int f1() { return this->v[N-1]; } int f2() { return v[N-1]; } }; int main(int ac, char**) { D<> d = { { ac } }; return d.f1() + d.f2(); } Same with a typedef instead of "using". But that's probably just another instance of PR88600/PR58855.
[Bug target/113220] [aarch64] ICE Segmentation fault with r14-6178-g8d29b7aca15133
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113220 Richard Sandiford changed: What|Removed |Added CC|richard.sandiford at arm dot com |rsandifo at gcc dot gnu.org Last reconfirmed||2024-01-03 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Ever confirmed|0 |1 Status|UNCONFIRMED |ASSIGNED --- Comment #1 from Richard Sandiford --- Mine
[Bug target/113196] [14 Regression] Failure to use ushll{,2}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113196 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1 Last reconfirmed||2024-01-02 --- Comment #1 from Richard Sandiford --- Testing a patch that does that. I think it'll depend on late-combine to undo the split in cases where it isn't profitable.
[Bug target/113196] New: [14 Regression] Failure to use ushll{,2}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113196 Bug ID: 113196 Summary: [14 Regression] Failure to use ushll{,2} Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org CC: tnfchris at gcc dot gnu.org Target Milestone: --- Target: aarch64*-*-* For this testcase, adapted from the one for PR110625: int test(unsigned array[4][4]); int foo(unsigned short *a, unsigned long n) { unsigned array[4][4]; for (unsigned i = 0; i < 4; i++, a += 4) { array[i][0] = a[0] << 6; array[i][1] = a[1] << 6; array[i][2] = a[2] << 6; array[i][3] = a[3] << 6; } return test(array); } GCC now uses: mov x1, x0 stp x29, x30, [sp, -80]! movi v30.4s, 0 mov x29, sp ldp q0, q29, [x1] add x0, sp, 16 zip1 v1.8h, v0.8h, v30.8h zip1 v31.8h, v29.8h, v30.8h zip2 v0.8h, v0.8h, v30.8h zip2 v29.8h, v29.8h, v30.8h shl v1.4s, v1.4s, 6 shl v31.4s, v31.4s, 6 shl v0.4s, v0.4s, 6 shl v29.4s, v29.4s, 6 stp q1, q0, [sp, 16] stp q31, q29, [sp, 48] bl test(unsigned int (*) [4]) ldp x29, x30, [sp], 80 ret whereas previously it used USHLL{,2}: mov x1, x0 stp x29, x30, [sp, -80]! mov x29, sp ldp q1, q0, [x1] add x0, sp, 16 ushll v3.4s, v1.4h, 6 ushll v2.4s, v0.4h, 6 ushll2 v1.4s, v1.8h, 6 ushll2 v0.4s, v0.8h, 6 stp q3, q1, [sp, 16] stp q2, q0, [sp, 48] bl test(unsigned int (*) [4]) ldp x29, x30, [sp], 80 ret This changed with g:f26f92b534f9, which expanded zero-extensions to ZIPs. The patch included *ADDW patterns for the new representation, but it looks like there are several more that should be included for full coverage. AIUI, the point of lowering to ZIPs during expand was to allow the zero to be hoisted. An alternative might be to lower during split, but forcibly hoist the zero by inserting around the FUNCTION_BEG note. We could then cache the insn that does that for manual CSE. Godbolt link: https://godbolt.org/z/vzfnebMhb
[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104 Richard Sandiford changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2023-12-30 Ever confirmed|0 |1 CC||rsandifo at gcc dot gnu.org Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #4 from Richard Sandiford --- FWIW, we do get the desired code with -march=armv8-a+sve (even though the test doesn't use SVE). This is because of: /* Consider enabling VECT_COMPARE_COSTS for SVE, both so that we can compare SVE against Advanced SIMD and so that we can compare multiple SVE vectorization approaches against each other. There's not really any point doing this for Advanced SIMD only, since the first mode that works should always be the best. */ if (TARGET_SVE && aarch64_sve_compare_costs) flags |= VECT_COMPARE_COSTS; The testcase in this PR is a counterexample to the claim in the final sentence. I think the comment might predate significant support for mixed-sized Advanced SIMD vectorisation. If we enable SVE (or uncomment the "if" line), the costs are 13 units per vector iteration for 128-bit vectors and 4 units per vector iteration for 64-bit vectors (so 8 units per 128 bits on a parity basis). The 64-bit version is therefore seen as significantly cheaper and is chosen ahead of the 128-bit version. I think this PR is enough proof that we should enable VECT_COMPARE_COSTS even without SVE. Assigning to myself for that.
[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091 --- Comment #5 from Richard Sandiford --- > The issue here is that because the "outer" pattern consumes > patt_64 = (int) patt_63 it should have adjusted _2 = (int) _1 > stmt-to-vectorize > as being the outer pattern root stmt for all this logic to work correctly. I don't think it can though, at least not in general. The final pattern stmt has to compute the same value as the original scalar stmt.
[Bug target/113094] [14 Regression][aarch64] ICE in extract_constrain_insn, at recog.cc:2713 since r14-6290-g9f0f7d802482a8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113094 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #6 from Richard Sandiford --- Fixed.
[Bug target/112948] gcc/config/aarch64/aarch64-early-ra.cc:1953: possible cut'n'paste error ?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112948 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #3 from Richard Sandiford --- Fixed. Thanks for the report.
[Bug target/113094] [14 Regression][aarch64] ICE in extract_constrain_insn, at recog.cc:2713 since r14-6290-g9f0f7d802482a8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113094 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #4 from Richard Sandiford --- Testing a patch. We're doing spurious work on insns that are slated for deletion, but we can't simply delete them first because that would disrupt the main iteration. Easiest fix seems to be to replace them with NOTE_INSN_DELETED first, then iterate, then delete.
[Bug rtl-optimization/111702] [14 Regression] ICE: in insert_regs, at cse.cc:1114 with -O2 -fstack-protector-all -frounding-math
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111702 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED CC||rsandifo at gcc dot gnu.org --- Comment #5 from Richard Sandiford --- Fixed.
[Bug target/113027] New: aarch64 is missing vec_set and vec_extract for structure modes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113027 Bug ID: 113027 Summary: aarch64 is missing vec_set and vec_extract for structure modes Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org Target Milestone: --- The lack of vec_set and vec_extract optabs for structure modes means that the following testcase spills to the stack when compiled at -O2: #include <arm_neon.h> float64x2x2_t f1 (float64x2x2_t x) { x.val[0][1] += 1.0; return x; } float64x2x3_t f2 (float64x2x3_t x) { x.val[0][0] = x.val[1][1] + x.val[2][0]; return x; } float64x2x4_t f3 (float64x2x4_t x) { x.val[0][0] = x.val[1][1] + x.val[2][0] - x.val[3][1]; return x; } For example: f1: sub sp, sp, #32 fmov d31, 1.0e+0 st1 {v0.2d - v1.2d}, [sp] ldr d30, [sp, 8] fadd d31, d31, d30 str d31, [sp, 8] ld1 {v0.2d - v1.2d}, [sp] add sp, sp, 32 ret With the extra patterns, we instead get: f1: dup d31, v0.d[1] fmov d30, 1.0e+0 fadd d30, d31, d30 ins v0.d[1], v30.d[0] ret f2: dup d31, v1.d[1] fadd d31, d31, d2 ins v0.d[0], v31.d[0] ret f3: dup d31, v1.d[1] dup d30, v3.d[1] fadd d31, d31, d2 fsub d30, d31, d30 ins v0.d[0], v30.d[0] ret Fixing this might also make it possible to use structure modes for arrays (c.f. PR109543).
[Bug tree-optimization/109543] Avoid using BLKmode for unions with a non-BLKmode member when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109543 --- Comment #5 from Richard Sandiford --- I think the loop in compute_mode_layout needs to be smarter for unions. At the moment it's sensitive to field order, which doesn't make much conceptual sense. E.g. for the admittedly contrived example:

#include <arm_neon.h>

union u1 {
  int32x2x2_t x;
  __int128 y __attribute__((packed));
};

union u2 {
  __attribute__((packed)) __int128 y;
  int32x2x2_t x;
};

compiled with -mstrict-align, the loop produces V2x2SImode for union u1 (good!) but TImode for union u2 (requires too much alignment). That doesn't matter as things stand, since we don't accept unions with vector modes. But if we did, union u1 would be placed in registers and union u2 wouldn't.
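A hypothetical sketch of the order-insensitive selection the comment argues for, with made-up names (none of these are GCC internals): a candidate mode is viable only if it matches the union's size and does not require more alignment than the union guarantees. Because viability depends only on (size, alignment), an over-aligned mode like TImode is rejected no matter which field order produced the candidate list:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative candidate: a mode name with its size and the alignment
   it would require.  */
struct mode { const char *name; unsigned size, align; };

/* Pick a mode for the union purely from size/alignment constraints,
   so the result cannot depend on how the fields were declared.  */
static const struct mode *
pick_union_mode (const struct mode *cands, int n,
                 unsigned union_size, unsigned union_align)
{
  for (int i = 0; i < n; i++)
    if (cands[i].size == union_size && cands[i].align <= union_align)
      return &cands[i];
  return NULL;  /* no suitable mode: fall back to BLKmode */
}
```

With candidates modelling TImode (16 bytes, 16-byte alignment) and V2x2SImode (16 bytes, 8-byte alignment) for a union that only guarantees 8-byte alignment, this check selects V2x2SImode whichever candidate comes first, matching the behaviour the comment says union u2 should also get.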
[Bug middle-end/80283] [11/12/13/14 Regression] bad SIMD register allocation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80283 --- Comment #39 from Richard Sandiford --- (In reply to Andrew Pinski from comment #38) > For aarch64, the test from comment #11 is so much worse on the trunk than in > GCC 13.2.0. I've been working on a fix for that. I'm hoping to post it today.
[Bug target/112948] gcc/config/aarch64/aarch64-early-ra.cc:1953: possible cut'n'paste error ?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112948 Richard Sandiford changed: What|Removed |Added Last reconfirmed||2023-12-11 Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #1 from Richard Sandiford --- Gah.
[Bug target/112933] gcc.target/aarch64/sme2/acle-asm/read_za16_vg1x2.c fails on aarch64_be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112933 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #3 from Richard Sandiford --- Fixed.
[Bug target/112931] gcc.target/aarch64/sme2/acle-asm/write_za16_vg1x2.c ICEs on aarch64_be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112931 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #4 from Richard Sandiford --- Fixed.
[Bug target/112930] gcc.target/aarch64/sme/call_sm_switch_7.c ICEs on aarch64_be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112930 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #4 from Richard Sandiford --- Fixed.