[Bug rtl-optimization/114996] [15 Regression] [RISC-V] 2->2 combination no longer occurring

2024-06-18 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114996

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #6 from Richard Sandiford  ---
FWIW, late-combine also fixes this.  I'm in the process of getting the
submission ready (still going through multi-target testing).

[Bug target/115518] New: aarch64: Poor codegen for arm_neon_sve_bridge.h

2024-06-17 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115518

Bug ID: 115518
   Summary: aarch64: Poor codegen for arm_neon_sve_bridge.h
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: aarch64-sve
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*-*-*

With PR115464 fixed, the following testcase:

#include <arm_neon.h>
#include <arm_sve.h>
#include <arm_neon_sve_bridge.h>

svuint16_t
convolve4_4_x (uint16x8x2_t permute_tbl, svuint16_t a)
{
  return svset_neonq_u16 (a, permute_tbl.val[1]);
}

generates:

        mov     v0.16b, v1.16b
        ptrue   p3.h, vl8
        sel     z0.h, p3, z0.h, z2.h
        ret

The move is redundant: we should be able to use z1.h as input to the sel
instead.

[Bug target/115464] [14 Backport] ICE when building libaom on arm64 (neon sve bridge usage with tbl/perm)

2024-06-14 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115464

--- Comment #11 from Richard Sandiford  ---
Yeah, like I mentioned in the commit message, I'm in the process of rolling
this fix out to more places.  Was just testing the waters with the minimal fix
for comment 4.

But yeah, maybe more of it will need to be backported than I'd hoped.

[Bug target/115464] [14 Backport] ICE when building libaom on arm64 (neon sve bridge usage with tbl/perm)

2024-06-13 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115464

Richard Sandiford  changed:

   What|Removed |Added

  Known to work||15.0
  Known to fail||14.1.0
Summary|ICE when building libaom on |[14 Backport] ICE when
   |arm64 (neon sve bridge  |building libaom on arm64
   |usage with tbl/perm)|(neon sve bridge usage with
   ||tbl/perm)

--- Comment #9 from Richard Sandiford  ---
Fixed on trunk.  Will backport to GCC 14 if there is no fallout.

[Bug target/115464] ICE when building libaom on arm64 (neon sve bridge usage with tbl/perm)

2024-06-12 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115464

Richard Sandiford  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #5 from Richard Sandiford  ---
In this kind of situation, we should go through a fresh pseudo rather than try
to take the subreg directly.

[Bug rtl-optimization/115281] [14 Regression] aarch64 ICE in go_through_subreg after r14-5129

2024-06-11 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115281

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #5 from Richard Sandiford  ---
Fixed.

[Bug debug/100303] [11 Regression] -fcompare-debug failure (length) with -O -fno-dce -ftracer

2024-06-04 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100303

Richard Sandiford  changed:

   What|Removed |Added

  Known to work||11.4.1
 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #10 from Richard Sandiford  ---
Fixed.

[Bug rtl-optimization/108086] [11 Regression] internal compiler error: in set_accesses, at rtl-ssa/internals.inl:449

2024-06-04 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108086

Richard Sandiford  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED
  Known to work||11.4.1

--- Comment #23 from Richard Sandiford  ---
Fixed.

[Bug tree-optimization/110838] [14 Regression] wrong code on x365-3.5, -O3, sign extraction

2024-06-04 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110838
Bug 110838 depends on bug 113281, which changed state.

Bug 113281 Summary: [11 Regression] Latent wrong code due to vectorization of shift reduction and missing promotions since r9-1590
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2024-06-04 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 113281, which changed state.

Bug 113281 Summary: [11 Regression] Latent wrong code due to vectorization of shift reduction and missing promotions since r9-1590
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug tree-optimization/113281] [11 Regression] Latent wrong code due to vectorization of shift reduction and missing promotions since r9-1590

2024-06-04 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED
  Known to work||11.4.1

--- Comment #33 from Richard Sandiford  ---
Fixed.

[Bug c++/115192] [11 regression] -O3 miscompilation on x86-64 (loops with vectors and scalars) since r11-6380

2024-06-04 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115192

Richard Sandiford  changed:

   What|Removed |Added

  Known to work||11.4.1
 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #16 from Richard Sandiford  ---
Fixed.

[Bug rtl-optimization/115281] [14 Regression] aarch64 ICE in go_through_subreg after r14-5129

2024-05-30 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115281

Richard Sandiford  changed:

   What|Removed |Added

Summary|[14/15 Regression] aarch64  |[14 Regression] aarch64 ICE
   |ICE in go_through_subreg|in go_through_subreg after
   |after r14-5129  |r14-5129
  Known to work||15.0
  Known to fail|15.0|

--- Comment #3 from Richard Sandiford  ---
Fixed on trunk, will backport if there are no issues.

[Bug rtl-optimization/115281] [14/15 Regression] aarch64 ICE in go_through_subreg after r14-5129

2024-05-29 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115281

Richard Sandiford  changed:

   What|Removed |Added

   Target Milestone|--- |14.2
 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED
  Known to work||13.1.0
  Known to fail||14.1.0, 15.0
   Last reconfirmed||2024-05-29
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org

--- Comment #1 from Richard Sandiford  ---
Testing a patch.

[Bug rtl-optimization/115281] New: [14/15 Regression] aarch64 ICE in go_through_subreg after r14-5129

2024-05-29 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115281

Bug ID: 115281
   Summary: [14/15 Regression] aarch64 ICE in go_through_subreg after r14-5129
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
CC: avieira at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*-*-*

The following test ICEs with -O3 -mcpu=neoverse-v1 after r14-5129 (thanks to
Andre for the reproducer):

SUBROUTINE fn0(ma, mb, nt)
  CHARACTER ca
  REAL r0(ma)
  INTEGER i0(mb)
  REAL r1(3,mb)
  REAL r2(3,mb)
  REAL r3(3,3)
  zero=0.0
  do na = 1, nt
     nt = i0(na)
     do l = 1, 3
        r1 (l, na) =   r0 (nt)
        r2(l, na) = zero
     enddo
  enddo
  if (ca  .ne.'z') then
     do j = 1, 3
        do i = 1, 3
           r4  = zero
        enddo
     enddo
     do na = 1, nt
        do k =  1, 3
           do l = 1, 3
              do m = 1, 3
                 r3 = r4 * v
              enddo
           enddo
        enddo
        do i = 1, 3
           do k = 1, ifn (r3)
           enddo
        enddo
     enddo
  endif
END

The ICE is:

internal compiler error: in go_through_subreg, at ira-conflicts.cc:234
0x161647f go_through_subreg
gnu/src/gcc/gcc/ira-conflicts.cc:234
0x1616657 process_regs_for_copy
gnu/src/gcc/gcc/ira-conflicts.cc:270
0x1616fe8 process_reg_shuffles
gnu/src/gcc/gcc/ira-conflicts.cc:440
0x1617b1b add_insn_allocno_copies
gnu/src/gcc/gcc/ira-conflicts.cc:510
0x1617bcc add_copies
gnu/src/gcc/gcc/ira-conflicts.cc:527
0x1600bed ira_traverse_loop_tree(bool, ira_loop_tree_node*, void (*)(ira_loop_tree_node*), void (*)(ira_loop_tree_node*))
gnu/src/gcc/gcc/ira-build.cc:1802
0x1618b38 ira_build_conflicts()
gnu/src/gcc/gcc/ira-conflicts.cc:819
0x1605eb6 ira_build()
gnu/src/gcc/gcc/ira-build.cc:3508
0x15fafe9 ira
gnu/src/gcc/gcc/ira.cc:5793
0x15fba7f execute
gnu/src/gcc/gcc/ira.cc:6117

I think this is a mode mix-up in go_through_subreg: we should be testing the
natural mode size of the REG that we're trying to split, rather than the mode
of the result.  (But we probably also need to handle paradoxical subregs
separately.)

[Bug target/115258] [14 Regression] register swaps for vector perm in some cases after r14-6290

2024-05-29 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258

Richard Sandiford  changed:

   What|Removed |Added

Summary|[14/15 Regression] register |[14 Regression] register
   |swaps for vector perm in|swaps for vector perm in
   |some cases after r14-6290   |some cases after r14-6290

--- Comment #4 from Richard Sandiford  ---
Leaving open in case we do decide to backport.

[Bug target/115258] [14/15 Regression] register swaps for vector perm in some cases after r14-6290

2024-05-29 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258

Richard Sandiford  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org

--- Comment #2 from Richard Sandiford  ---
I agree with Andrew that this seems mostly to be luck.  On that basis, I'm not
sure (either way) whether we should backport the fix.

Now that we're hopefully getting better at tracking and allocating subregs, it
probably makes sense to allow the split before reload.  Doing that (and
adjusting the split accordingly) seems to fix the testcase.

[Bug c++/115192] [11/12/13/14 regression] -O3 miscompilation on x86-64 (loops with vectors and scalars) since r11-6380

2024-05-24 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115192

Richard Sandiford  changed:

   What|Removed |Added

Summary|[11/12/13/14/15 regression] |[11/12/13/14 regression]
   |-O3 miscompilation on   |-O3 miscompilation on
   |x86-64 (loops with vectors  |x86-64 (loops with vectors
   |and scalars) since r11-6380 |and scalars) since r11-6380

--- Comment #11 from Richard Sandiford  ---
Fixed on trunk, will backport if there is no reported fallout.

[Bug c++/115192] [11/12/13/14/15 regression] -O3 miscompilation on x86-64 (loops with vectors and scalars) since r11-6380

2024-05-23 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115192

Richard Sandiford  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org

[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis

2024-05-14 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635

--- Comment #19 from Richard Sandiford  ---
(In reply to Richard Biener from comment #14)
> Usually targets do have a limit on the actual length but I see
> constant_upper_bound_with_limit doesn't query such.  But it would
> be a more appropriate way to say there might be an actual target limit here?
The discussion has moved on, but FWIW: this was a deliberate choice.
The thinking at the time was that VLA code should be truly “agnostic”
and not hard-code an upper limit.  Hard-coding a limit would be hard-coding
an assumption that the architectural maximum would never increase in future.

(The main counterargument was that any uses of the .B form of TBL would
break down for >256-byte vectors.  We hardly use such TBLs for autovec
though, and could easily choose not to use them at all.)

That decision is 8 or 9 years old at this point, so it might seem overly
dogmatic now.  Even so, I think we should have a strong reason to change tack.
It shouldn't just be about trying to avoid poly_ints :)
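
To put a number on the TBL counterargument, here is a minimal C sketch. The helper name is hypothetical (it is not a GCC or ACLE function); the only assumption is that a .B index is 8 bits wide, so it can name byte lanes 0 to 255 and nothing beyond.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: can every byte lane of a vector that is
   VECTOR_BYTES bytes long be selected by an 8-bit (.B) TBL index?  */
static bool
byte_tbl_covers_vector (unsigned int vector_bytes)
{
  assert (vector_bytes > 0);
  /* The highest lane number is VECTOR_BYTES - 1; it must fit in uint8_t.  */
  return vector_bytes - 1 <= UINT8_MAX;
}
```

With this model, a 256-byte vector is still fully addressable, while a 512-byte vector would have lanes that no .B index can reach.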

[Bug rtl-optimization/114664] -fno-omit-frame-pointer causes an ICE during the build of the greenlet package

2024-04-10 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114664

--- Comment #14 from Richard Sandiford  ---
Yeah, I think so.

[Bug rtl-optimization/114664] -fno-omit-frame-pointer causes an ICE during the build of the greenlet package

2024-04-10 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114664

--- Comment #12 from Richard Sandiford  ---
(In reply to Peter Bergner from comment #11)
> > > but how are users supposed to know whether
> > > -fno-omit-frame-pointer is in effect or not?  I've looked and there is no
> > > pre-defined macro a user could check.
> > That might be a useful thing to have, but if the programmer has no control
> > over the build flags (i.e. cannot require/force -fomit-frame-pointer) then I
> > think the asm has to take care to save and restore the frame pointer itself.
> > 
> > Dropping "31" from the asm means that the asm must preserve the register. 
> > Things will go badly if the asm doesn't do that.
> 
> So r31 which we use as our frame-pointer reg is a non-volatile/callee saved
> register, so it must be saved, but I guess they (greenlet) cannot use the
> method of mentioning it in the asm clobber list to perform that.
I was thinking of just the asm in isolation, rather than its effect on the
containing function's callee-save set.

If you have:

  asm volatile ("…");

then GCC expects r31 after the asm to equal r31 before the asm, regardless of
the -fomit-frame-pointer setting.  If the asm is:

  asm volatile ("li r31,0");

(sorry, I've forgotten the actual Power asm :)) then things will go wrong if
GCC tries to use r31 after the asm.

If the asm clobbers r31 then it must mention it in the clobber list.  As things
stand, it's not possible to mention r31 in the clobber list if
-fno-omit-frame-pointer.  This means that the only option for code that wants
to support -fno-omit-frame-pointer is to make the asm's contents preserve r31,
using an explicit save and restore if necessary.

And that kind-of makes sense.  If an asm clobbers the frame pointer, that
removes GCC's main option for restoring data after the asm.  A lot of other
clobbers would be handled by loading data relative to the frame pointer.  If
the frame pointer itself has gone then things get tricky.
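
A rough sketch of what "preserve r31 inside the asm" could look like, with the caveat from above that I haven't checked the Power mnemonics against real code. Assumptions: GNU as accepts bare register numbers, the function name is made up, and the saved copy is kept in a scratch register rather than a memory operand, since a memory operand might itself be addressed relative to r31. On other targets the asm is compiled out so that the file still builds.

```c
#include <assert.h>

/* Sketch only: the asm temporarily zeroes r31 but restores it before
   finishing, so "31" does not need to appear in the clobber list.  */
static int
zero_r31_and_restore (void)
{
  int value = 42;
#if defined (__powerpc__) || defined (__powerpc64__)
  long saved;
  __asm__ volatile ("mr %0,31\n\t"  /* save r31 in a scratch register */
                    "li 31,0\n\t"   /* stand-in for code that changes r31 */
                    "mr 31,%0"      /* restore r31 before the asm ends */
                    : "=&r" (saved));
  (void) saved;
#endif
  /* Locals are still reachable because r31 is unchanged here.  */
  assert (value == 42);
  return value;
}
```

The early-clobber output keeps the scratch register distinct from any inputs; nothing here depends on whether -fomit-frame-pointer is in effect.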

[Bug rtl-optimization/114664] -fno-omit-frame-pointer causes an ICE during the build of the greenlet package

2024-04-10 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114664

--- Comment #10 from Richard Sandiford  ---
(In reply to Peter Bergner from comment #7)
> Then that would seem to indicate that mentioning the frame pointer reg in
> the asm clobber list is an error
Yeah, I agree it's an error.  The PR says “ICE”, but is there an internal
error?  The “cannot be used in ‘asm’ here” is a normal user-facing error,
albeit with bad error recovery, leading us to report the same thing multiple
times.

> but how are users supposed to know whether
> -fno-omit-frame-pointer is in effect or not?  I've looked and there is no
> pre-defined macro a user could check.
That might be a useful thing to have, but if the programmer has no control over
the build flags (i.e. cannot require/force -fomit-frame-pointer) then I think
the asm has to take care to save and restore the frame pointer itself.

Dropping "31" from the asm means that the asm must preserve the register. 
Things will go badly if the asm doesn't do that.

[Bug target/114607] aarch64: Incorrect expansion of svsudot

2024-04-08 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114607

--- Comment #2 from Richard Sandiford  ---
Fixed on trunk.  I'll backport in a few weeks if there's no fallout.

[Bug target/114607] aarch64: Incorrect expansion of svsudot

2024-04-05 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114607

Richard Sandiford  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2024-04-05
 Ever confirmed|0   |1

[Bug target/114607] New: aarch64: Incorrect expansion of svsudot

2024-04-05 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114607

Bug ID: 114607
   Summary: aarch64: Incorrect expansion of svsudot
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*-*-*

svsudot is supposed to expand to USDOT with the second and third arguments
swapped.  However, there is a thinko in the code that does the reversal, making
it a no-op.  Unfortunately, the tests simply accept the buggy form. :-(

For example, gcc.target/aarch64/sve/acle/asm/sudot_s32.c contains:

/*
** sudot_s32_tied1:
**  usdot   z0\.s, z2\.b, z4\.b
**  ret
*/
TEST_TRIPLE_Z (sudot_s32_tied1, svint32_t, svint8_t, svuint8_t,
   z0 = svsudot_s32 (z0, z2, z4),
   z0 = svsudot (z0, z2, z4))

where the usdot z2 and z4 operands should be in the opposite order.
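
To show why the missing swap matters, here is a scalar C model of a single 32-bit lane. It is only a sketch: the function names are illustrative, and it assumes USDOT multiplies unsigned bytes from its first multiplicand by signed bytes from its second, four at a time.

```c
#include <assert.h>
#include <stdint.h>

#define GROUP 4  /* bytes accumulated into each 32-bit lane */

/* Scalar model of one 32-bit lane of USDOT: unsigned bytes times signed
   bytes, summed into ACC.  */
static int32_t
usdot_lane (int32_t acc, const uint8_t *u, const int8_t *s)
{
  assert (u && s);
  for (int i = 0; i < GROUP; i++)
    acc += (int32_t) u[i] * (int32_t) s[i];
  return acc;
}

/* Intended svsudot behaviour: signed times unsigned, implemented by
   swapping the two multiplicands before using USDOT.  */
static int32_t
sudot_lane_swapped (int32_t acc, const int8_t *s, const uint8_t *u)
{
  return usdot_lane (acc, u, s);
}

/* Model of the buggy form: the operands reach USDOT unswapped, so the
   signed bytes are reinterpreted as unsigned and vice versa.  */
static int32_t
sudot_lane_unswapped (int32_t acc, const int8_t *s, const uint8_t *u)
{
  return usdot_lane (acc, (const uint8_t *) s, (const int8_t *) u);
}
```

For signed bytes of -1 and unsigned bytes of 2, the swapped form accumulates -8 per lane, whereas the unswapped form reads each -1 as 255 and accumulates 2040.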

[Bug target/114603] aarch64: Invalid SVE cnot optimisation

2024-04-05 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114603

Richard Sandiford  changed:

   What|Removed |Added

   Last reconfirmed||2024-04-05
 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #2 from Richard Sandiford  ---
Fixed on trunk so far, but I'll backport if possible.

[Bug target/114603] New: aarch64: Invalid SVE cnot optimisation

2024-04-05 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114603

Bug ID: 114603
   Summary: aarch64: Invalid SVE cnot optimisation
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*-*-*

An overly lax condition on the cnot combine pattern means that we optimise:

#include <arm_sve.h>

svint32_t foo(svbool_t pg, svint32_t y)
{
  return svsel(svcmpeq(pg, y, 0), svdup_s32(1), svdup_s32(0));
}

to a single cnot:

foo:
        cnot    z0.s, p0/m, z0.s
        ret

The result must be 0 for inactive elements of pg, whereas the above would leave
the elements unchanged instead.

This seems to have been around since the SVE ACLE was first added.
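
The difference can be modelled lane by lane in plain C. This is a sketch with made-up function names, assuming that a merging (/m) CNOT whose destination is tied to its input leaves inactive lanes holding the input value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* What the source asks for: 1 where the lane is active and y == 0,
   otherwise 0 (including every inactive lane).  */
static void
select_model (const bool *pg, const int32_t *y, int32_t *out, int n)
{
  assert (n >= 0);
  for (int i = 0; i < n; i++)
    out[i] = (pg[i] && y[i] == 0) ? 1 : 0;
}

/* What a merging CNOT with tied input does: active lanes become
   (y == 0), inactive lanes keep their old value, which here is y.  */
static void
cnot_merging_model (const bool *pg, const int32_t *y, int32_t *out, int n)
{
  assert (n >= 0);
  for (int i = 0; i < n; i++)
    out[i] = pg[i] ? (y[i] == 0) : y[i];
}
```

The two models agree on active lanes and differ on any inactive lane whose y value is nonzero.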

[Bug target/114577] Inefficient codegen for SVE/NEON bridge

2024-04-04 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114577

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 CC||rsandifo at gcc dot gnu.org
 Status|UNCONFIRMED |RESOLVED

--- Comment #2 from Richard Sandiford  ---
Fixed.

[Bug target/114521] [11 only] aarch64: wrong code with Neon ld1/st1x4 intrinsics gcc-11 and earlier

2024-03-28 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114521

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #2 from Richard Sandiford  ---
Oops.  I was going to upload a patch for the bug here, but it looks like I
accidentally committed it while backporting PR97696 to GCC 11.  The patch was
g:daee0409d195d346562e423da783d5d1cf8ea175.

I'm not sure what to do now.  Perhaps we should leave it in?

[Bug rtl-optimization/114515] [14 Regression] Failure to use aarch64 lane forms after PR101523

2024-03-28 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515

--- Comment #5 from Richard Sandiford  ---
For the record, the associated new testsuite failures are:

FAIL: gcc.target/aarch64/ashltidisi.c scan-assembler-times asr 3
FAIL: gcc.target/aarch64/asimd-mull-elem.c scan-assembler-times \\s+fmul\\tv[0-9]+\\.4s, v[0-9]+\\.4s, v[0-9]+\\.s\\[0\\] 4
FAIL: gcc.target/aarch64/asimd-mull-elem.c scan-assembler-times \\s+mul\\tv[0-9]+\\.4s, v[0-9]+\\.4s, v[0-9]+\\.s\\[0\\] 4
FAIL: gcc.target/aarch64/ccmp_3.c scan-assembler-not \tcbnz\t
FAIL: gcc.target/aarch64/pr100056.c scan-assembler-times \\t[us]bfiz\\tw[0-9]+, w[0-9]+, 11 2
FAIL: gcc.target/aarch64/pr100056.c scan-assembler-times \\tadd\\tw[0-9]+, w[0-9]+, w[0-9]+, uxtb\\n 2
FAIL: gcc.target/aarch64/pr108840.c scan-assembler-not and\\tw[0-9]+, w[0-9]+, 31
FAIL: gcc.target/aarch64/pr112105.c scan-assembler-not \\tdup\\t
FAIL: gcc.target/aarch64/pr112105.c scan-assembler-times (?n)\\tfmul\\t.*v[0-9]+\\.s\\[0\\]\\n 2
FAIL: gcc.target/aarch64/rev16_2.c scan-assembler-times rev16\\tx[0-9]+ 2
FAIL: gcc.target/aarch64/vaddX_high_cost.c scan-assembler-not dup\\t
FAIL: gcc.target/aarch64/vmul_element_cost.c scan-assembler-not dup\\t
FAIL: gcc.target/aarch64/vmul_high_cost.c scan-assembler-not dup\\t
FAIL: gcc.target/aarch64/vsubX_high_cost.c scan-assembler-not dup\\t
FAIL: gcc.target/aarch64/sve/pr98119.c scan-assembler \\tand\\tx[0-9]+, x[0-9]+, #?-31\\n
FAIL: gcc.target/aarch64/sve/pred-not-gen-1.c scan-assembler-not \\tbic\\t
FAIL: gcc.target/aarch64/sve/pred-not-gen-1.c scan-assembler-times \\tnot\\tp[0-9]+\\.b, p[0-9]+/z, p[0-9]+\\.b\\n 1
FAIL: gcc.target/aarch64/sve/pred-not-gen-4.c scan-assembler-not \\tbic\\t
FAIL: gcc.target/aarch64/sve/pred-not-gen-4.c scan-assembler-times \\tnot\\tp[0-9]+\\.b, p[0-9]+/z, p[0-9]+\\.b\\n 1
FAIL: gcc.target/aarch64/sve/var_stride_2.c scan-assembler-times \\tubfiz\\tx[0-9]+, x2, 10, 16\\n 1
FAIL: gcc.target/aarch64/sve/var_stride_2.c scan-assembler-times \\tubfiz\\tx[0-9]+, x3, 10, 16\\n 1
FAIL: gcc.target/aarch64/sve/var_stride_4.c scan-assembler-times \\tsbfiz\\tx[0-9]+, x2, 10, 32\\n 1
FAIL: gcc.target/aarch64/sve/var_stride_4.c scan-assembler-times \\tsbfiz\\tx[0-9]+, x3, 10, 32\\n 1

[Bug rtl-optimization/114515] [14 Regression] Failure to use aarch64 lane forms after PR101523

2024-03-28 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515

--- Comment #4 from Richard Sandiford  ---
(In reply to Richard Biener from comment #1)
> Btw, why does forwprop not do this?
Not 100% sure (I wasn't involved in choosing the current heuristics).  But
fwprop can propagate across blocks, so there is probably more risk of
increasing register pressure.

[Bug rtl-optimization/114515] [14 Regression] Failure to use aarch64 lane forms after PR101523

2024-03-28 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515

--- Comment #3 from Richard Sandiford  ---
In RTL terms, the dup is vec_duplicate.  The combination is:

Trying 10 -> 13:
   10: r107:V4SF=vec_duplicate(r115:SF)
      REG_DEAD r115:SF
   13: r110:V4SF=r111:V4SF*r107:V4SF
      REG_DEAD r111:V4SF
Failed to match this instruction:
(parallel [
        (set (reg:V4SF 110 [ _2 ])
            (mult:V4SF (vec_duplicate:V4SF (reg:SF 115))
                (reg:V4SF 111 [ *ptr_6(D) ])))
        (set (reg:V4SF 107)
            (vec_duplicate:V4SF (reg:SF 115)))
    ])
Failed to match this instruction:
(parallel [
        (set (reg:V4SF 110 [ _2 ])
            (mult:V4SF (vec_duplicate:V4SF (reg:SF 115))
                (reg:V4SF 111 [ *ptr_6(D) ])))
        (set (reg:V4SF 107)
            (vec_duplicate:V4SF (reg:SF 115)))
    ])
Successfully matched this instruction:
(set (reg:V4SF 107)
    (vec_duplicate:V4SF (reg:SF 115)))
Successfully matched this instruction:
(set (reg:V4SF 110 [ _2 ])
    (mult:V4SF (vec_duplicate:V4SF (reg:SF 115))
        (reg:V4SF 111 [ *ptr_6(D) ])))
allowing combination of insns 10 and 13
original costs 8 + 20 = 28
replacement costs 8 + 20 = 28
modifying insn i2    10: r107:V4SF=vec_duplicate(r115:SF)
deferring rescan insn with uid = 10.
modifying insn i3    13: r110:V4SF=vec_duplicate(r115:SF)*r111:V4SF
      REG_DEAD r115:SF
      REG_DEAD r111:V4SF
deferring rescan insn with uid = 13.

[Bug rtl-optimization/114515] New: [14 Regression] Failure to use aarch64 lane forms after PR101523

2024-03-28 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515

Bug ID: 114515
   Summary: [14 Regression] Failure to use aarch64 lane forms after PR101523
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---

The following test regressed on aarch64 after
g:839bc42772ba7af66af3bd16efed4a69511312ae (the fix for PR101523):

typedef float v4sf __attribute__((vector_size(16)));
void f (v4sf *ptr, float f)
{
  ptr[0] = ptr[0] * (v4sf) { f, f, f, f };
  ptr[1] = ptr[1] * (v4sf) { f, f, f, f };
}

Compiled with -O2, we previously generated:

        ldp     q1, q31, [x0]
        fmul    v1.4s, v1.4s, v0.s[0]
        fmul    v31.4s, v31.4s, v0.s[0]
        stp     q1, q31, [x0]
        ret

Now we generate:

        ldp     q1, q31, [x0]
        dup     v0.4s, v0.s[0]
        fmul    v1.4s, v1.4s, v0.4s
        fmul    v31.4s, v31.4s, v0.4s
        stp     q1, q31, [x0]
        ret

with the extra dup.

The patch is trying to avoid cases where i3 is canonicalised by contextual
information provided by i2.  But here we place a full copy of i2 into i3
(creating an instruction that is no more expensive).  This is a benefit in its
own right because the two instructions can then execute in parallel rather than
serially.  But it also means that, as here, we might be able to remove i2 with
later combinations.

Perhaps we could also check whether i3 still contains the destination of i2?

[Bug sanitizer/97696] ICE since ASAN_MARK does not handle poly_int sized varibales

2024-03-27 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97696

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #9 from Richard Sandiford  ---
Fixed on trunk and all active release branches.

[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]

2024-03-27 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302

--- Comment #5 from Richard Sandiford  ---
(In reply to Andrew Stubbs from comment #4)
> Yes, that's what the simd-math-3* tests do.
Ah, OK.

> The simd-math-5* tests are explicitly supposed to be doing this in the
> context of the autovectorizer.
> 
> If these tests are being compiled as (newly) intended then we should change
> the expected results.
> 
> So, questions:
> 
> 1. Are the new results actually correct? (So far I only know that being
> different is expected.)
I believe so.  We now do the division in 32 bits, as in the original gimple.

> 2. Is there some other testcase form that would exercise the previously
> intended routines?
It should be possible in languages that don't have C's integer
promotion rules, if you're up for some Ada or Rust.

> 3. Is the new behaviour configurable? I don't think the 16-bit shift bug
> ever existed on GCN (in which "short" vectors actually have excess bits in
> each lane, much like scalar registers do).
Not AFAIK.  The problem is that the gimple→gimple transformation changes
the gimple-level semantics of the code.  Shifts by out-of-range values
are undefined rather than target-defined.  (And in other cases that's useful,
because it means we don't need to preserve whatever value the target
happens to give for an out-of-range shift.)
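
On the point about the division now being done in 32 bits: a small C sketch of why that matches the source-level semantics. It assumes an 8-bit char and a 32-bit int, as on the targets discussed here; the function name is just for illustration.

```c
#include <assert.h>

/* Both operands are promoted to int before the division, so the quotient
   is computed in int precision and can exceed the range of signed char.  */
static int
divide_chars (signed char x, signed char y)
{
  assert (y != 0);
  return x / y;
}
```

For example, dividing -128 by -1 gives 128, which is not representable in a signed char, so the operation cannot simply be narrowed to 8 bits.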

[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]

2024-03-27 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302

--- Comment #3 from Richard Sandiford  ---
Ah, ok.  If the main aim is to test the libgcc routines, it might be safer to
use something like:

typedef char v64qi __attribute__((vector_size(64)));
v64qi f(v64qi x, v64qi y) { return x / y; }

instead of relying on vectorisation.

[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]

2024-03-27 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302

--- Comment #1 from Richard Sandiford  ---
The decision to stop narrowing division was deliberate, see the comments in
PR113281 for details.  Is the purpose of the test to check vectorisation
quality, or to check for the right ABI routines?

[Bug tree-optimization/114234] New: [14 Regression] verify_ssa failure with early-break vectorisation

2024-03-05 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114234

Bug ID: 114234
   Summary: [14 Regression] verify_ssa failure with early-break vectorisation
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---

The following test ICEs with -Ofast on aarch64:

void bar();
float
foo (float x)
{
  float a = 1;
  float b = x;
  long z = 200;
  for (;;)
    {
      float c = b - 1.0f;
      a *= c;
      z -= 1;
      if (z == 0)
        {
          bar ();
          break;
        }
      if (b <= 3.0f)
        break;
      b = c;
    }
  return a * b;
}

(reduced from wrf).  The ICE is:

foo.c:3:1: error: definition in block 15 does not dominate use in block 10
3 | foo (float x)
  | ^~~
for SSA_NAME: stmp_a_9.10_103 in statement:
a_47 = PHI 
PHI argument
stmp_a_9.10_103
for PHI node
a_47 = PHI 
during GIMPLE pass: vect

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7

2024-03-04 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

Richard Sandiford  changed:

   What|Removed |Added

  Attachment #57602|0   |1
is obsolete||

--- Comment #42 from Richard Sandiford  ---
Created attachment 57605
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57605&action=edit
proof-of-concept patch to suppress peeling for gaps

How about the attached?  It records whether all accesses that require peeling
for gaps could instead have used gathers, and only retries when that's true. 
It means that we retry for only 0.034% of calls to vect_analyze_loop_1 in a
build of SPEC2017 with -mcpu=neoverse-v1 -Ofast -fomit-frame-pointer.

The figures exclude wrf, which failed for me with:

module_mp_gsfcgce.fppized.f90:852:23:

  852 |REAL FUNCTION ggamma(X)
  |   ^
Error: definition in block 18 does not dominate use in block 13
for SSA_NAME: stmp_pf_6.5657_140 in statement:
pf_81 = PHI 
PHI argument
stmp_pf_6.5657_140
for PHI node
pf_81 = PHI 
during GIMPLE pass: vect
module_mp_gsfcgce.fppized.f90:852:23: internal compiler error: verify_ssa
failed

Will look at that tomorrow.

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7

2024-03-04 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #41 from Richard Sandiford  ---
(In reply to Richard Biener from comment #40)
> So I wonder if we can use "local costing" to decide a gather is always OK
> compared to the alternative with peeling for gaps.  On x86 gather tends
> to be slow compared to open-coding it.
Yeah, on SVE gathers are generally “enabling” instructions rather than
something to use for their own sake.

I suppose one problem is that we currently only try to use gathers for
single-element groups.  If we make a local decision to use gathers while
keeping that restriction, we could end up using gathers “unnecessarily” while
still needing to peel for gaps for (say) a two-element group.

That is, it's only better to use gathers than contiguous loads if by doing that
we avoid all need to peel for gaps (and if the cost of peeling for gaps was
high enough to justify the cost of using gathers over consecutive loads).

One of the things on the list to do (once everything is SLP!) is to support
loads with gaps directly via predication, so that we never load elements that
aren't needed.  E.g. on SVE, a 64-bit predicate (PTRUE .D) can be used with a
32-bit load (LD1W .S) to load only even-indexed elements.  So a single-element
group with a group size of 2 could be done cheaply with just consecutive loads,
without peeling for gaps.
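
To make the predication idea concrete, here is a small scalar model in plain C (not ACLE intrinsics and not vectorizer code).  It assumes a 128-bit vector length, and `model_ptrue_d` and `model_ld1w_s` are names invented for the sketch.  It relies on the SVE convention that a predicate has one bit per vector byte and that an element is controlled by the bit for its first byte:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VL_BYTES 16  /* Assumed vector length for the model: 128 bits.  */

/* Model of PTRUE Pd.D: one predicate bit per vector byte, with the bit
   set for the first byte of every 64-bit element.  */
void
model_ptrue_d (bool pred[VL_BYTES])
{
  for (size_t byte = 0; byte < VL_BYTES; ++byte)
    pred[byte] = (byte % 8 == 0);
}

/* Model of a zeroing LD1W Zt.S: a 32-bit element is loaded only if the
   predicate bit for its first byte is set.  Inactive elements are not
   read from memory and are set to zero.  */
void
model_ld1w_s (uint32_t dst[VL_BYTES / 4], const bool pred[VL_BYTES],
              const uint32_t *src)
{
  for (size_t i = 0; i < VL_BYTES / 4; ++i)
    dst[i] = pred[i * 4] ? src[i] : 0;
}
```

With this model, a source array of only three words can be loaded into a four-lane vector without touching a fourth word, since lane 3 is inactive.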

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7

2024-03-04 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #39 from Richard Sandiford  ---
(In reply to Richard Sandiford from comment #38)
> (In reply to Richard Biener from comment #37)
> > Even more iteration looks bad.  I do wonder why when gather can avoid
> > peeling for GAPs using load-lanes cannot?
> Like you say, we don't realise that all the loads from array3[i] form a
> single group.
Oops, sorry, I shouldn't have gone off memory.  So yeah, it's array1[] where
that happens, not array3[].  The reason we don't use load-lanes is that we
don't have load-lane instructions for smaller elements in larger containers, so
we're forced to use load-and-permute instead.

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7

2024-03-04 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #38 from Richard Sandiford  ---
(In reply to Richard Biener from comment #37)
> Even more iteration looks bad.  I do wonder why when gather can avoid
> peeling for GAPs using load-lanes cannot?
Like you say, we don't realise that all the loads from array3[i] form a single
group.

Note that we're not using load-lanes in either case, since the group size (8)
is too big for that.  But load-lanes and load-and-permute have the same
restriction about when peeling for gaps is required.

In contrast, gather loads only ever load data that they actually need.
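
A simplified way to see the difference (a model only, assuming every group is read in full by the contiguous form; the function names are invented for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Highest element index touched when N scalar iterations each use
   a[GROUP * i] and the vector loop loads whole groups contiguously
   before permuting.  Simplified model: every group is read in full.  */
size_t
last_index_contiguous (size_t n, size_t group)
{
  assert (n > 0 && group > 0);
  return group * n - 1;
}

/* Highest element index touched when the same accesses are done with a
   gather, which reads only the elements that are actually used.  */
size_t
last_index_gather (size_t n, size_t group)
{
  assert (n > 0 && group > 0);
  return group * (n - 1);
}
```

For n = 8 and group = 4, the contiguous form touches index 31 while only index 28 is needed.  That over-read in the final iteration is what peeling for gaps protects against.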

> Also for the stores we seem to use elementwise stores rather than store-lanes.
What configuration are you trying?  The original report was about SVE, so I was
trying that.  There we use a scatter store.

> To me the most obvious thing to try optimizing in this testcase is DR
> analysis.  With -march=armv8.3-a I still see
> 
> t.c:26:22: note:   === vect_analyze_data_ref_accesses ===
> t.c:26:22: note:   Detected single element interleaving array1[0][_8] step 4
> t.c:26:22: note:   Detected single element interleaving array1[1][_8] step 4
> t.c:26:22: note:   Detected single element interleaving array1[2][_8] step 4
> t.c:26:22: note:   Detected single element interleaving array1[3][_8] step 4
> t.c:26:22: note:   Detected single element interleaving array1[0][_1] step 4
> t.c:26:22: note:   Detected single element interleaving array1[1][_1] step 4
> t.c:26:22: note:   Detected single element interleaving array1[2][_1] step 4
> t.c:26:22: note:   Detected single element interleaving array1[3][_1] step 4
> t.c:26:22: missed:   not consecutive access array2[_4][_8] = _69;
> t.c:26:22: note:   using strided accesses
> t.c:26:22: missed:   not consecutive access array2[_4][_1] = _67;
> t.c:26:22: note:   using strided accesses
> 
> so we don't figure
> 
> Creating dr for array1[0][_1]
> base_address: 
> offset from base address: (ssizetype) ((sizetype) (m_111 * 2) * 2)
> constant offset from base address: 0
> step: 4
> base alignment: 16
> base misalignment: 0
> offset alignment: 4
> step alignment: 4
> base_object: array1
> Access function 0: {m_111 * 2, +, 2}_4
> Access function 1: 0
> Creating dr for array1[0][_8]
> analyze_innermost: success.
> base_address: 
> offset from base address: (ssizetype) ((sizetype) (m_111 * 2 + 1) *
> 2)
> constant offset from base address: 0
> step: 4
> base alignment: 16
> base misalignment: 0
> offset alignment: 2
> step alignment: 4
> base_object: array1
> Access function 0: {m_111 * 2 + 1, +, 2}_4
> Access function 1: 0
> 
> belong to the same group (but the access functions tell us it worked out).
> Above we fail to split the + 1 to the constant offset.
OK, but this is moving the question on to how we should optimise the testcase
for Advanced SIMD rather than SVE, and how we should optimise the testcase in
general, rather than simply recover what we could do before.  (SVE is only
enabled for -march=armv9-a and above, in case armv8.3-a was intended to enable
SVE too.)

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7

2024-03-04 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #36 from Richard Sandiford  ---
Created attachment 57602
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57602&action=edit
proof-of-concept patch to suppress peeling for gaps

This patch does what I suggested in the previous comment: if the loop needs
peeling for gaps, try again without that, and pick the better loop.  It seems
to restore the original style of code for SVE.
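
In outline, the retry looks something like the following.  This is a sketch with hypothetical names (`loop_plan`, `analyze_fn`, `choose_plan`) and made-up costs in `toy_analyze`, not the code in the attachment:

```c
#include <stdbool.h>

/* Hypothetical result of one analysis attempt.  */
struct loop_plan
{
  bool valid;            /* Analysis succeeded.  */
  bool peels_for_gaps;   /* Plan needs peeling for gaps.  */
  unsigned cost;         /* Estimated cost; lower is better.  */
};

/* Hypothetical analysis callback: analyse the loop with or without
   permission to peel for gaps.  */
typedef struct loop_plan (*analyze_fn) (bool allow_peeling_for_gaps);

/* Analyse once; if the result needs peeling for gaps, analyse again
   with peeling forbidden and keep whichever plan is cheaper.  */
struct loop_plan
choose_plan (analyze_fn analyze)
{
  struct loop_plan first = analyze (true);
  if (!first.valid || !first.peels_for_gaps)
    return first;

  struct loop_plan retry = analyze (false);
  if (retry.valid && retry.cost < first.cost)
    return retry;
  return first;
}

/* Toy analysis with made-up costs, for demonstration only.  */
struct loop_plan
toy_analyze (bool allow_peeling_for_gaps)
{
  struct loop_plan plan
    = { true, allow_peeling_for_gaps, allow_peeling_for_gaps ? 20u : 12u };
  return plan;
}
```

Here `choose_plan (toy_analyze)` returns the plan without peeling, since the made-up cost of 12 beats 20.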

A more polished version would be a bit smarter about when to retry.  E.g. it's
pointless if the main loop already operates on full vectors (i.e. if peeling 1
iteration is natural in any case).  Perhaps the condition should be that either
(a) the number of epilogue iterations is known to be equal to the VF of the
main loop or (b) the target is known to support partial vectors for the loop's
vector_mode.

Any thoughts?

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7

2024-03-04 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #35 from Richard Sandiford  ---
Maybe I've misunderstood the flow of the ticket, but it looks to me like we do
still correctly recognise the truncating scatter stores.  And, on their own, we
would be able to convert them into masked scatters.

The reason for the epilogue is instead on the load side.  There we have a
non-strided grouped load, and currently we hard-code the assumption that it is
better to use contiguous loads and permutes rather than gather loads where
possible.  So we have:

  /* As a last resort, trying using a gather load or scatter store.

     ??? Although the code can handle all group sizes correctly,
     it probably isn't a win to use separate strided accesses based
     on nearby locations.  Or, even if it's a win over scalar code,
     it might not be a win over vectorizing at a lower VF, if that
     allows us to use contiguous accesses.  */
  if (*memory_access_type == VMAT_ELEMENTWISE
      && single_element_p
      && loop_vinfo
      && vect_use_strided_gather_scatters_p (stmt_info, loop_vinfo,
                                             masked_p, gs_info))
    *memory_access_type = VMAT_GATHER_SCATTER;

only after we've tried and failed to use load lanes or load+permute.  If
instead I change the order so that the code above is tried first, then we do
use extending gather loads and truncating scatter stores as before, with no
epilogue loop.

So I suppose the question is: if we do prefer to use gathers over load+permute
for some cases, how do we decide which to use?  And can it be done on a per-load
basis, or should it instead be a per-loop decision?  E.g., if we end up with a
loop that needs peeling for gaps, perhaps we should try again and forbid
peeling for gaps.  Then, if that succeeds, see which loop gives the better
overall cost.

Of course, trying more things means more compile time…

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7

2024-03-01 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #33 from Richard Sandiford  ---
Can you give me a chance to look at it a bit when I'm back?  This doesn't feel
like the way to go to me.

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7

2024-02-29 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #31 from Richard Sandiford  ---
(In reply to Tamar Christina from comment #29)
> This works fine for normal gather and scatters but doesn't work for widening
> gathers and narrowing scatters which only the pattern seems to handle.
I'm supposedly on holiday, so didn't see the IRC discussion, but: as I remember
it, there is no narrowing or widening for IFN gathers or scatters as such, even
for patterns.  One vector's worth of offsets corresponds to one vector's worth
of data.  But the widths of the data elements and the offset elements can be
different.  Any sign or zero extension of a loaded vector, or any operation to
double or halve the number of vectors, is done separately.

I think it does make sense to stick to that, rather than (say) have IFNs that
load two offset vectors into one data vector, or use one offset vector to load
two data vectors.  Supporting those combinations would mean that we have two
different ways in which the offset elements and data elements have different
widths.  And it isn't really a natural fusion on SVE.
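
A scalar model of that split, with invented function names and element widths chosen only for illustration (32-bit offsets, 16-bit data):

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a gather: lane I loads one 16-bit element from BASE at the
   element offset given by lane I of OFFSETS.  The offsets are 32 bits
   wide and the data 16 bits wide, but there is exactly one offset per
   data element.  */
void
model_gather_u16 (uint16_t *dst, const uint16_t *base,
                  const int32_t *offsets, size_t lanes)
{
  for (size_t i = 0; i < lanes; ++i)
    dst[i] = base[offsets[i]];
}

/* Separate widening step, applied after the gather.  */
void
model_zero_extend (uint32_t *dst, const uint16_t *src, size_t lanes)
{
  for (size_t i = 0; i < lanes; ++i)
    dst[i] = src[i];
}
```

The gather itself never changes the lane count; any change of element width is a second, independent operation.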

[Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics

2024-02-27 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #8 from Richard Sandiford  ---
The reason early_ra doesn't help with the original testcase is that early_ra
punts on any non-move instruction that has a hard register destination.  And it
does that because it can't cope well with cases where hard-coded destinations
force the wrong choice (unlike the proper allocators, which can change the
destination where necessary).  The restriction is needed to avoid regressing
SVE ACLE tests.

[Bug sanitizer/97696] ICE since ASAN_MARK does not handle poly_int sized varibales

2024-02-24 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97696

--- Comment #3 from Richard Sandiford  ---
Created attachment 57520
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57520&action=edit
Candidate patch

The attached patch seems to fix it.  I'm taking next week off, but I'll run the
patch through proper testing when I get back.

[Bug sanitizer/97696] ICE since ASAN_MARK does not handle poly_int sized varibales

2024-02-24 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97696

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org
 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org

[Bug middle-end/113205] [14 Regression] internal compiler error: in backward_pass, at tree-vect-slp.cc:5346 since r14-3220

2024-02-24 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113205

Richard Sandiford  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #14 from Richard Sandiford  ---
Finally fixed.

[Bug middle-end/113205] [14 Regression] internal compiler error: in backward_pass, at tree-vect-slp.cc:5346 since r14-3220

2024-02-23 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113205

--- Comment #12 from Richard Sandiford  ---
Created attachment 57511
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57511&action=edit
Candidate patch

Sorry for the very slow response on this.  I'm testing the attached.

[Bug middle-end/113205] [14 Regression] internal compiler error: in backward_pass, at tree-vect-slp.cc:5346 since r14-3220

2024-02-23 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113205

Richard Sandiford  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org
 Status|NEW |ASSIGNED

[Bug target/112922] [14 Regression] 465.tonto from SPECFP 2006 fails train run on Aarch64-linux with -O2 and -flto

2024-02-23 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112922

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #3 from Richard Sandiford  ---
Assume fixed by the patches for PR113295.  Please reopen if not.

[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

2024-02-23 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
Bug 26163 depends on bug 112922, which changed state.

Bug 112922 Summary: [14 Regression] 465.tonto from SPECFP 2006 fails train run 
on Aarch64-linux with -O2 and -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112922

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

[Bug target/113295] [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 when built with -Ofast -mcpu=native since g:2f46e3578d45ff060a0a329cb39d4f52878f9d5a

2024-02-23 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #9 from Richard Sandiford  ---
Fixed.

[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

2024-02-23 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
Bug 26163 depends on bug 113295, which changed state.

Bug 113295 Summary: [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 
when built with -Ofast -mcpu=native since 
g:2f46e3578d45ff060a0a329cb39d4f52878f9d5a
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8

2024-02-23 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

Richard Sandiford  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #9 from Richard Sandiford  ---
Fixed.

[Bug target/113295] [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 when built with -Ofast -mcpu=native since g:2f46e3578d45ff060a0a329cb39d4f52878f9d5a

2024-02-21 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295

--- Comment #6 from Richard Sandiford  ---
For me the miscompilation is in jkdmem_, where we end up allocating the same
registers to both arms of an fcsel.  It sounds like it occurs elsewhere too.

I have a candidate fix, but need to think a bit more about it.

[Bug preprocessor/114007] gcc chokes on __has_cpp_attribute(clang::unsafe_buffer_usage)

2024-02-21 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114007

--- Comment #14 from Richard Sandiford  ---
I might have misunderstood the suggestion and so be arguing against something
that no-one is suggesting, but I think [[__extension__ …]] should accept the
same things for all standard versions (C23, pre-C23, and GNU).  It was intended
to be something that header files and macros could use without needing to be
sensitive to the user's choice of standard.

[Bug target/113995] ICE: in change_address_1, at emit-rtl.cc:2299 with [[arm::streaming_compatible]] and -march=armv9-a+sve -finstrument-functions -fstack-clash-protection

2024-02-21 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113995

Richard Sandiford  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Richard Sandiford  ---
Fixed.

[Bug target/113220] [aarch64] ICE Segmentation fault with r14-6178-g8d29b7aca15133

2024-02-21 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113220

Richard Sandiford  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Richard Sandiford  ---
Fixed

[Bug target/113295] [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 when built with -Ofast -mcpu=native since g:2f46e3578d45ff060a0a329cb39d4f52878f9d5a

2024-02-21 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295

Richard Sandiford  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #5 from Richard Sandiford  ---
Mine.  Could be the same as PR112922.

[Bug target/112922] [14 Regression] 465.tonto from SPECFP 2006 fails train run on Aarch64-linux with -O2 and -flto

2024-02-20 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112922

--- Comment #2 from Richard Sandiford  ---
I don't remember there being a deliberate bug fix in that patch,
but there were some others later.  I suppose the optimistic case
is that this first went latent and then was fixed “properly”
afterwards.  But it could just be latent.

[Bug target/113778] ICE: RTL check: expected elt 3 type 'e' or 'u', have '0' (rtx code_label) in PATTERN, at rtl.h:1506 with -march=armv9-a+sme -fhardened

2024-02-20 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113778

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |DUPLICATE
 Status|ASSIGNED|RESOLVED

--- Comment #4 from Richard Sandiford  ---
Confirmed that this is the same as PR113220.

*** This bug has been marked as a duplicate of bug 113220 ***

[Bug target/113220] [aarch64] ICE Segmentation fault with r14-6178-g8d29b7aca15133

2024-02-20 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113220

Richard Sandiford  changed:

   What|Removed |Added

 CC||zsojka at seznam dot cz

--- Comment #3 from Richard Sandiford  ---
*** Bug 113778 has been marked as a duplicate of this bug. ***

[Bug target/113805] sme2 vs -mtrack-speculation, ICE in extract_insn, at recog.cc:2812 with r14-6178-g8d29b7aca15133

2024-02-20 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113805

Richard Sandiford  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from Richard Sandiford  ---
Fixed.

[Bug target/113995] ICE: in change_address_1, at emit-rtl.cc:2299 with [[arm::streaming_compatible]] and -march=armv9-a+sve -finstrument-functions -fstack-clash-protection

2024-02-20 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113995

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #2 from Richard Sandiford  ---
Possibly a dup of some of the other PRs in this area.

[Bug middle-end/113988] during GIMPLE pass: bitintlower: internal compiler error: in lower_stmt, at gimple-lower-bitint.cc:5470

2024-02-20 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #16 from Richard Sandiford  ---
I think it'd be worth splitting the mode selection out
into a helper called bitwise_mode_for_size, to go alongside
bitwise_mode_for_mode.

[Bug target/113778] ICE: RTL check: expected elt 3 type 'e' or 'u', have '0' (rtx code_label) in PATTERN, at rtl.h:1506 with -march=armv9-a+sme -fhardened

2024-02-15 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113778

Richard Sandiford  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Last reconfirmed||2024-02-15
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org

--- Comment #3 from Richard Sandiford  ---
Possibly the same as PR113220

[Bug target/113805] sme2 vs -mtrack-speculation, ICE in extract_insn, at recog.cc:2812 with r14-6178-g8d29b7aca15133

2024-02-12 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113805

Richard Sandiford  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2024-02-12
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org

--- Comment #1 from Richard Sandiford  ---
Heh.  I knew this was coming after seeing the previous -mtrack-speculation bug,
but didn't have a chance to fix it pre-emptively last week.

[Bug target/113763] [14 Regression] build fails with clang++ host compiler because aarch64.cc uses C++14 constexpr.

2024-02-06 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113763

--- Comment #14 from Richard Sandiford  ---
AFAIK, the constructor shouldn't be necessary.  (And without it, the whole
thing would fit on one line.)  LGTM (and preapproved) otherwise.  Thanks for
doing this.

[Bug target/113763] [14 Regression] build fails with clang++ host compiler because aarch64.cc uses C++14 constexpr.

2024-02-06 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113763

--- Comment #11 from Richard Sandiford  ---
Currently away so can't try it myself, but how about just using an ad-hoc
structure instead?

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-01 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #34 from Richard Sandiford  ---
(In reply to Richard Biener from comment #32)
> Btw, AVX512 knotb will invert all 8 bits and there's no knot just affecting
> the lowest 4 or 2 bits.
> 
> It all feels like desaster waiting to happen ;)
Yes :)

> For example BIT_NOT_EXPR is RTL expanded like
> 
> case BIT_NOT_EXPR:
>   op0 = expand_expr (treeop0, subtarget,
>  VOIDmode, EXPAND_NORMAL);
>   if (modifier == EXPAND_STACK_PARM)
> target = 0;
>   /* In case we have to reduce the result to bitfield precision
>  for unsigned bitfield expand this as XOR with a proper constant
>  instead.  */
>   if (reduce_bit_field && TYPE_UNSIGNED (type))
> {
>   int_mode = SCALAR_INT_TYPE_MODE (type);
>   wide_int mask = wi::mask (TYPE_PRECISION (type),
> false, GET_MODE_PRECISION (int_mode));
>   
>   temp = expand_binop (int_mode, xor_optab, op0,
>immed_wide_int_const (mask, int_mode),
>target, 1, OPTAB_LIB_WIDEN);
> 
> so we could, for VECTOR_BOOLEAN_TYPE_P with integer mode and
> effective bit-precision set reduce_bit_field and fixup the fallout
> (not sure why the above is only for TYPE_UNSIGNED).
>
> At least it feels similar and doing things the opposite for vectors
> (fixing up at uses) would be odd?
Do you know why we take this approach for integers?  Is it for
correctness?  Or is it supposed to be more optimal?

I can imagine that, for arithmetic types, there are going to be many
more instances where upper bits matter (division, right shifts,
MIN/MAX, etc.).  So perhaps reducing every result is a good
trade-off there.

But there's an argument that it should be rare for the padding
bits in a vector to matter, since very few things would look at the
padding bits anyway.  So perhaps the cost should be borne by the
operations that need canonical integers.

Not a strong opinion though, more just devil's advocate.

There again, if e.g. the x86 API guarantees memcmp equality between
two masks whose significant bits are equal, then we probably have
no choice.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-30 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #30 from Richard Sandiford  ---
(In reply to Richard Biener from comment #29)
> But that's just for CONSTRUCTORs, we got the VIEW_CONVERT_EXPR path for
> VECTOR_CSTs.  But yeah, that _might_ argue we should perform the same
> masking for VECTOR_CST expansion as well, instead of trying to fixup
> in do_compare_and_jump?
But then how would ~ be implemented for things like 4-bit masks?
If we use notqi2 then I assume the upper bits could be 1 rather than 0.
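
A plain C illustration of the concern, treating a 4-lane mask as the low four bits of a byte (`not_qi` and `not_v4bi` are names invented for the sketch, not GCC patterns):

```c
#include <stdint.h>

#define MASK_BITS 4  /* Significant bits in the mask.  */

/* QImode-style NOT: inverts all 8 bits, including the padding bits.  */
uint8_t
not_qi (uint8_t mask)
{
  return (uint8_t) ~mask;
}

/* NOT followed by clearing the padding bits, so that the result is a
   canonical 4-bit mask.  */
uint8_t
not_v4bi (uint8_t mask)
{
  return (uint8_t) (~mask & ((1u << MASK_BITS) - 1));
}
```

Inverting the all-true mask 0x0f with `not_qi` gives 0xf0, which is nonzero as a byte even though all four lanes are false.  `not_v4bi` gives 0.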

[Bug debug/113636] [14 Regression] internal compiler error: in dead_debug_global_find, at valtrack.cc:275 since r14-6290-g9f0f7d802482a8

2024-01-30 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113636

Richard Sandiford  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #12 from Richard Sandiford  ---
Fixed.  Thanks for the report and help with reproducing.

[Bug target/113623] [14 Regression] ICE in aarch64_pair_mem_from_base since r14-6605

2024-01-30 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113623

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #8 from Richard Sandiford  ---
Fixed.

[Bug target/111677] [12/13 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes

2024-01-29 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

--- Comment #21 from Richard Sandiford  ---
(In reply to Alex Coplan from comment #13)
> The problem seems to be this code in aarch64_process_components:
> 
>   while (regno != last_regno)
> {
>   bool frame_related_p = aarch64_emit_cfi_for_reg_p (regno);
>   machine_mode mode = aarch64_reg_save_mode (regno);
> 
>   rtx reg = gen_rtx_REG (mode, regno);
>   poly_int64 offset = frame.reg_offset[regno];
>   if (frame_pointer_needed)
> offset -= frame.bytes_below_hard_fp;
> 
>   rtx addr = plus_constant (Pmode, ptr_reg, offset);
>   rtx mem = gen_frame_mem (mode, addr);
> 
> which emits a TFmode mem with offset 512, which is out of range for TFmode
> (so we later ICE with an unrecognisable insn).  Presumably this just needs
> tweaking to emit a new base anchor in the case of large offsets like this. 
> It looks like the code in aarch64_save_callee_saves already does this.
We shouldn't emit new anchor registers here, since unlike in the prologue,
we don't have any guarantee that certain registers are free.

aarch64_get_separate_components is supposed to vet shrink-wrappable
offsets, but in this case the offset looks valid, since:

str q22, [sp, #512]

is a valid instruction.  Perhaps the constraints are too narrow?
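
For reference, the immediate ranges involved can be written as simple checks.  The first is the unsigned scaled form of STR with a Q register.  The second is the signed scaled form used by LDP/STP of X registers.  That the TFmode check also requires the second range is an assumption on my part about why 512 is rejected, not something established in this PR, and the function names are invented:

```c
#include <stdbool.h>
#include <stdint.h>

/* STR Qt, [Xn, #imm]: unsigned 12-bit immediate scaled by 16 bytes,
   i.e. multiples of 16 in the range [0, 65520].  */
bool
str_q_offset_ok (int64_t offset)
{
  return offset >= 0 && offset % 16 == 0 && offset / 16 <= 4095;
}

/* LDP/STP of X registers: signed 7-bit immediate scaled by 8 bytes,
   i.e. multiples of 8 in the range [-512, 504].  */
bool
ldp_x_offset_ok (int64_t offset)
{
  return offset % 8 == 0 && offset / 8 >= -64 && offset / 8 <= 63;
}
```

An offset of 512 passes the first check but is just outside the second, whose upper limit is 504.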

[Bug target/113623] [14 Regression] ICE in aarch64_pair_mem_from_base since r14-6605

2024-01-29 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113623

Richard Sandiford  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org

--- Comment #6 from Richard Sandiford  ---
Mine.

[Bug debug/113636] [14 Regression] internal compiler error: in dead_debug_global_find, at valtrack.cc:275 since r14-6290-g9f0f7d802482a8

2024-01-29 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113636

Richard Sandiford  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org
   Last reconfirmed||2024-01-29

--- Comment #10 from Richard Sandiford  ---
Mine.

[Bug tree-optimization/113281] Wrong code due to vectorization of shift reduction and missing promotions since r14-3027

2024-01-29 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281

--- Comment #24 from Richard Sandiford  ---
Fixed on trunk so far, but it's latent on branches.  I'll see what
the trunk fallout is like before asking about backports.

[Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8

2024-01-26 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

Richard Sandiford  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org

--- Comment #7 from Richard Sandiford  ---
early-ra does try to avoid reusing registers too soon, to increase scheduling
freedom.  But in this case I imagine it handles the two statements as separate
regions.  Should be fixable by carrying across a round-robin counter.
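
The idea of carrying the counter across regions can be sketched as follows.  This is a toy model with a hypothetical four-register file and invented names (`rr_state`, `allocate_region`), not early-ra's actual data structures:

```c
#include <assert.h>

#define NUM_REGS 4  /* Hypothetical tiny register file.  */

/* Hypothetical allocator state: the next register to hand out.  */
struct rr_state
{
  unsigned next;
};

/* Allocate N_VALUES registers for one region in round-robin order,
   starting from wherever the previous region left off.  */
void
allocate_region (struct rr_state *state, unsigned n_values,
                 unsigned regs_out[])
{
  assert (n_values <= NUM_REGS);
  for (unsigned i = 0; i < n_values; ++i)
    {
      regs_out[i] = state->next;
      state->next = (state->next + 1) % NUM_REGS;
    }
}
```

If the same `rr_state` is passed to two consecutive single-value regions, they get registers 0 and 1.  If the state is reset between regions, both get register 0, which is the kind of early reuse that limits scheduling freedom.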

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #18 from Richard Sandiford  ---
(In reply to Tamar Christina from comment #17)
> Well the mid-end has generated the right precision. The type it generates is
> vector(4)  vexit_reduc_67;
> so it does say it's a single bit boolean.
> 
> Isn't this just an expand problem?
That's what I meant.  expand is using a QImode comparison to compare things
with 4-bit precision, so I think the masking should happen at that point.

How about doing the masking in do_compare_and_jump?
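
In plain C terms, the suggestion amounts to something like the following, where the precision is passed explicitly.  The helper names are invented, and this is only a model of the comparison, not of do_compare_and_jump itself:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Compare two masks held in bytes, looking at all 8 bits.  */
bool
masks_equal_raw (uint8_t a, uint8_t b)
{
  return a == b;
}

/* Compare two masks after reducing both to PRECISION significant bits,
   so that padding bits cannot affect the result.  */
bool
masks_equal_masked (uint8_t a, uint8_t b, unsigned precision)
{
  assert (precision >= 1 && precision <= 8);
  uint8_t significant = (uint8_t) ((1u << precision) - 1);
  return (a & significant) == (b & significant);
}
```

The values 0xf5 and 0x05 differ as bytes but are equal as 4-bit masks, so only the masked comparison treats them as the same.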

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #16 from Richard Sandiford  ---
(In reply to Richard Biener from comment #15)
> I think the problem is the cbranch pattern which looks at all of the
> QImode mask - but of course it doesn't know it's really V4BImode it's
> working on ...
Yeah.  Currently building an x86_64 toolchain to have a look, but I think
whatever code uses a cbranch with a higher precision than the inputs should
mask off the significant bits beforehand.  cbranch can also be used for
comparing two variable masks.

I suppose we could add a hook to say that padding bits of an integer mask must
always be zero (and so it's expand's job to ensure that that holds for any mask
operation).  But it feels dangerously close to TRULY_NOOP_TRUNCATION for
integers.

[Bug target/113550] data512_t initializers dereference a clobbered register

2024-01-25 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113550

Richard Sandiford  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org
 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #5 from Richard Sandiford  ---
Yeah, FWIW, I agree improving the define_split is probably best.

Now fixed.

[Bug target/113485] [14 regression] ICE with -fno-guess-branch-probability on aarch64 starting with r14-7187-g74e3e839ab2d36

2024-01-25 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113485

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #9 from Richard Sandiford  ---
Fixed.

[Bug target/113572] [14 Regression] aarch64: internal compiler error in aarch64_sve::vector_cst_all_same

2024-01-25 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113572

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #7 from Richard Sandiford  ---
Fixed.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #13 from Richard Sandiford  ---
I don't think there's any principle that upper bits must be zero.
How do we end up with a pattern that depends on that being the case?

[Bug tree-optimization/113281] [14 Regression] Wrong code due to vectorization of shift reduction and missing promotions since r14-3027

2024-01-24 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281

Richard Sandiford  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #22 from Richard Sandiford  ---
Taking, following discussion on IRC.

[Bug target/113572] [14 Regression] aarch64: internal compiler error in aarch64_sve::vector_cst_all_same

2024-01-24 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113572

Richard Sandiford  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #5 from Richard Sandiford  ---
(In reply to Jakub Jelinek from comment #4)
> So, if that part is right, I think we want to use VECTOR_CST_ELT instead of
> VECTOR_CST_ENCODED_ELT, like:
> --- gcc/config/aarch64/aarch64-sve-builtins.cc.jj 2024-01-12 13:47:20.815429012 +0100
> +++ gcc/config/aarch64/aarch64-sve-builtins.cc 2024-01-24 20:58:33.720677634 +0100
> @@ -3474,7 +3474,7 @@ vector_cst_all_same (tree v, unsigned in
>unsigned int nelts = lcm * VECTOR_CST_NELTS_PER_PATTERN (v);
>tree first_el = VECTOR_CST_ENCODED_ELT (v, 0);
>for (unsigned int i = 0; i < nelts; i += step)
> -if (!operand_equal_p (VECTOR_CST_ENCODED_ELT (v, i), first_el, 0))
> +if (!operand_equal_p (VECTOR_CST_ELT (v, i), first_el, 0))
>return false;
>  
>return true;
> which fixes the ICE.
Yeah, that's the correct fix.  Sorry for missing it.
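
For illustration only: a toy C++ model of the distinction between an encoded-element access and a full element access. It assumes a pattern-encoded constant in which only a short prefix of elements is stored and each pattern repeats its final stored value. The type and function names (EncodedVector, encoded_elt, elt, all_same) are hypothetical, and stepped encodings are deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a pattern-encoded vector constant (hypothetical type, not
// GCC's tree representation).  Only the non-stepped cases are modelled:
// one or two encoded elements per pattern, with the last one repeating.
struct EncodedVector {
    std::size_t npatterns;
    std::size_t nelts_per_pattern;  // 1 or 2 in this sketch
    std::vector<int> encoded;       // npatterns * nelts_per_pattern values

    // Analogue of a raw encoded access: valid only for encoded indices.
    int encoded_elt(std::size_t i) const {
        assert(i < encoded.size());
        return encoded[i];
    }

    // Analogue of a full element access: valid for any element index.
    int elt(std::size_t i) const {
        assert(npatterns > 0 && encoded.size() >= npatterns);
        if (i < encoded.size())
            return encoded[i];
        // Past the encoded prefix, each pattern repeats its final value.
        const std::size_t pattern = i % npatterns;
        return encoded[encoded.size() - npatterns + pattern];
    }
};

// Check whether elements 0, step, 2*step, ... below nelts all match element 0.
inline bool all_same(const EncodedVector &v, std::size_t nelts,
                     std::size_t step) {
    assert(step > 0);
    const int first = v.encoded_elt(0);
    for (std::size_t i = 0; i < nelts; i += step)
        if (v.elt(i) != first)
            return false;
    return true;
}
```

In this model, a loop whose index can run past the encoded prefix has to use the full accessor; calling encoded_elt with such an index would trip the assertion.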

[Bug target/113485] [14 regression] ICE with -fno-guess-branch-probability on aarch64 starting with r14-7187-g74e3e839ab2d36

2024-01-24 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113485

Richard Sandiford  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED

--- Comment #7 from Richard Sandiford  ---
I suppose the ZIP1 patterns should just have 64-bit inputs,
rather than going to the trouble of creating paradoxical subregs.

> cfun->machine->advsimd_zero_insn use is plain wrong. As the RTL could be 
> removed fully from the RTL stream and then it will be GC'ed.

But machine_function is a GTYed structure, so the reference itself should
prevent GC.  I don't think we should be in the practice of explicitly
ggc_free()ing RTL, since callers don't generally know what other references
there might be.
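
For illustration only: a toy C++ mark phase showing the reachability argument above, assuming a tracing collector that keeps every object reachable from a registered root. The Object type and the mark and survives functions are hypothetical; GCC's collector and GTY machinery are far more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy heap object for a tracing collector (hypothetical type).
struct Object {
    std::vector<Object *> refs;  // outgoing references
    bool marked = false;
};

// Mark everything reachable from `obj`.
inline void mark(Object *obj) {
    if (obj == nullptr || obj->marked)
        return;
    obj->marked = true;
    for (Object *ref : obj->refs)
        mark(ref);
}

// An object survives a collection if some root reaches it.
// Marks are not reset here; a real collector would clear them per cycle.
inline bool survives(const std::vector<Object *> &roots, Object *candidate) {
    for (Object *root : roots)
        mark(root);
    return candidate->marked;
}
```

If a root structure holds a reference to an instruction object, the object is marked even after it has been unlinked from every other list, whereas an object with no incoming reference from any root is not.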

[Bug target/109929] profiledbootstrap failure on aarch64-linux-gnu with graphite optimization

2024-01-22 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109929

--- Comment #7 from Richard Sandiford  ---
Hmm, yeah, like you say, neither of those commits should have made a difference
to whether bootstrap works.  I guess the problem is just latent now.

[Bug rtl-optimization/111267] [14 Regression] Codegen regression from i386 argument passing changes

2024-01-22 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267

--- Comment #12 from Richard Sandiford  ---
I don't object to the patch, but for the record: the current heuristics go back
a long way.  Although I reworked the pass to use rtl-ssa a few years ago, I
tried as far as possible to preserve the old heuristics (tested by making sure
that there were no unexplained differences over a large set of targets).

I wouldn't characterise the old heuristics as a logic error.  Although I didn't
write them, my understanding is that they were being deliberately conservative,
in particular due to the risk of introducing excess register pressure.

So this change seems potentially quite invasive for stage 4.  Perhaps it'll
work out — if so, great!  But if there is some fallout, I think we should lean
towards reverting the patch and revisiting in GCC 15.

[Bug target/113196] [14 Regression] Failure to use ushll{,2}

2024-01-12 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113196

Richard Sandiford  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from Richard Sandiford  ---
Fixed.

[Bug target/112989] [14 Regression] GC ICE with C++, `#include ` and `-fsanitize=address`

2024-01-12 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112989

Richard Sandiford  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #15 from Richard Sandiford  ---
I didn't manage to reproduce the PR in the originally reported form, but FWIW,
the patches mean that a gcc_unreachable above:

  return decl;

in simulate_builtin_function_decl no longer fires for arm_sve.h or arm_sme.h. 
Please reopen if there are still some lingering issues.

[Bug target/112989] [14 Regression] GC ICE with C++, `#include ` and `-fsanitize=address`

2024-01-10 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112989

--- Comment #12 from Richard Sandiford  ---
> another is try
> #pragma GCC aarch64 "arm_sve.h"
> after a couple of intentional declarations of the SVE builtins with
> non-standard return/argument types and make sure that while it emits some
> errors, it doesn't try to use ggc_freed decls in registered tables.
FWIW, this is what the g*.target/aarch64/sve/acle/general*/func_redef_*
tests are supposed to test (although not specifically targeting ggc_free).

[Bug target/112989] [14 Regression] GC ICE with C++, `#include ` and `-fsanitize=address`

2024-01-10 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112989

Richard Sandiford  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #10 from Richard Sandiford  ---
Mine.

[Bug target/113270] [14 Regression] AArch64 ICEs in register_tuple_type since r14-6524

2024-01-08 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113270

--- Comment #8 from Richard Sandiford  ---
Thanks for trying it, and sorry for not doing it myself.

The patch LGTM FWIW, so preapproved if it passes testing (which I'm sure it
will :))

[Bug target/113270] [14 Regression] AArch64 ICEs in register_tuple_type since r14-6524

2024-01-08 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113270

--- Comment #6 from Richard Sandiford  ---
I think we want the patch in comment 3, but in addition, I then also needed to
use the following for a similar SVE case:

extern GTY(()) tree scalar_types[NUM_VECTOR_TYPES + 1];
tree scalar_types[NUM_VECTOR_TYPES + 1];

In this case that would mean adding:

extern GTY(()) aarch64_simd_type_info aarch64_simd_types[];

just above the definition in aarch64-builtins.cc.

[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations

2024-01-05 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

Richard Sandiford  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Richard Sandiford  ---
Fixed.  Thanks for the report.
