[Bug rtl-optimization/114996] [15 Regression] [RISC-V] 2->2 combination no longer occurring
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114996 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #6 from Richard Sandiford --- FWIW, late-combine also fixes this. I'm in the process of getting the submission ready (still going through multi-target testing).
[Bug target/115518] New: aarch64: Poor codegen for arm_neon_sve_bridge.h
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115518 Bug ID: 115518 Summary: aarch64: Poor codegen for arm_neon_sve_bridge.h Product: gcc Version: 15.0 Status: UNCONFIRMED Keywords: aarch64-sve Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org Target Milestone: --- Target: aarch64*-*-* With PR115464 fixed, the following testcase:

#include <arm_neon.h>
#include <arm_sve.h>
#include <arm_neon_sve_bridge.h>

svuint16_t
convolve4_4_x (uint16x8x2_t permute_tbl, svuint16_t a)
{
  return svset_neonq_u16 (a, permute_tbl.val[1]);
}

generates:

        mov     v0.16b, v1.16b
        ptrue   p3.h, vl8
        sel     z0.h, p3, z0.h, z2.h
        ret

The move is redundant: we should be able to use z1.h as input to the sel instead.
[Bug target/115464] [14 Backport] ICE when building libaom on arm64 (neon sve bridge usage with tbl/perm)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115464 --- Comment #11 from Richard Sandiford --- Yeah, like I mentioned in the commit message, I'm in the process of rolling this fix out to more places. Was just testing the waters with the minimal fix for comment 4. But yeah, maybe more of it will need to be backported than I'd hoped.
[Bug target/115464] [14 Backport] ICE when building libaom on arm64 (neon sve bridge usage with tbl/perm)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115464 Richard Sandiford changed:

           What    |Removed                     |Added
----------------------------------------------------------------
     Known to work |                            |15.0
     Known to fail |                            |14.1.0
           Summary |ICE when building libaom on |[14 Backport] ICE when
                   |arm64 (neon sve bridge      |building libaom on arm64
                   |usage with tbl/perm)        |(neon sve bridge usage with
                   |                            |tbl/perm)

--- Comment #9 from Richard Sandiford --- Fixed on trunk. Will backport to GCC 14 if there is no fallout.
[Bug target/115464] ICE when building libaom on arm64 (neon sve bridge usage with tbl/perm)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115464 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #5 from Richard Sandiford --- In this kind of situation, we should go through a fresh pseudo rather than try to take the subreg directly.
[Bug rtl-optimization/115281] [14 Regression] aarch64 ICE in go_through_subreg after r14-5129
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115281 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #5 from Richard Sandiford --- Fixed.
[Bug debug/100303] [11 Regression] -fcompare-debug failure (length) with -O -fno-dce -ftracer
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100303 Richard Sandiford changed: What|Removed |Added Known to work||11.4.1 Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #10 from Richard Sandiford --- Fixed.
[Bug rtl-optimization/108086] [11 Regression] internal compiler error: in set_accesses, at rtl-ssa/internals.inl:449
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108086 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED Known to work||11.4.1 --- Comment #23 from Richard Sandiford --- Fixed.
[Bug tree-optimization/110838] [14 Regression] wrong code on x265-3.5, -O3, sign extraction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110838 Bug 110838 depends on bug 113281, which changed state. Bug 113281 Summary: [11 Regression] Latent wrong code due to vectorization of shift reduction and missing promotions since r9-1590 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281 What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED
[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 Bug 53947 depends on bug 113281, which changed state. Bug 113281 Summary: [11 Regression] Latent wrong code due to vectorization of shift reduction and missing promotions since r9-1590 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281 What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED
[Bug tree-optimization/113281] [11 Regression] Latent wrong code due to vectorization of shift reduction and missing promotions since r9-1590
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED Known to work||11.4.1 --- Comment #33 from Richard Sandiford --- Fixed.
[Bug c++/115192] [11 regression] -O3 miscompilation on x86-64 (loops with vectors and scalars) since r11-6380
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115192 Richard Sandiford changed: What|Removed |Added Known to work||11.4.1 Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #16 from Richard Sandiford --- Fixed.
[Bug rtl-optimization/115281] [14 Regression] aarch64 ICE in go_through_subreg after r14-5129
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115281 Richard Sandiford changed:

           What    |Removed                     |Added
----------------------------------------------------------------
           Summary |[14/15 Regression] aarch64  |[14 Regression] aarch64 ICE
                   |ICE in go_through_subreg    |in go_through_subreg after
                   |after r14-5129              |r14-5129
     Known to work |                            |15.0
     Known to fail |15.0                        |

--- Comment #3 from Richard Sandiford --- Fixed on trunk, will backport if there are no issues.
[Bug rtl-optimization/115281] [14/15 Regression] aarch64 ICE in go_through_subreg after r14-5129
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115281 Richard Sandiford changed: What|Removed |Added Target Milestone|--- |14.2 Ever confirmed|0 |1 Status|UNCONFIRMED |ASSIGNED Known to work||13.1.0 Known to fail||14.1.0, 15.0 Last reconfirmed||2024-05-29 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #1 from Richard Sandiford --- Testing a patch.
[Bug rtl-optimization/115281] New: [14/15 Regression] aarch64 ICE in go_through_subreg after r14-5129
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115281 Bug ID: 115281 Summary: [14/15 Regression] aarch64 ICE in go_through_subreg after r14-5129 Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: ice-on-valid-code Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org CC: avieira at gcc dot gnu.org Target Milestone: --- Target: aarch64*-*-* The following test ICEs with -O3 -mcpu=neoverse-v1 after r14-5129 (thanks to Andre for the reproducer):

      SUBROUTINE fn0(ma, mb, nt)
      CHARACTER ca
      REAL r0(ma)
      INTEGER i0(mb)
      REAL r1(3,mb)
      REAL r2(3,mb)
      REAL r3(3,3)
      zero=0.0
      do na = 1, nt
        nt = i0(na)
        do l = 1, 3
          r1 (l, na) = r0 (nt)
          r2(l, na) = zero
        enddo
      enddo
      if (ca .ne.'z') then
        do j = 1, 3
          do i = 1, 3
            r4 = zero
          enddo
        enddo
        do na = 1, nt
          do k = 1, 3
            do l = 1, 3
              do m = 1, 3
                r3 = r4 * v
              enddo
            enddo
          enddo
          do i = 1, 3
            do k = 1, ifn (r3)
            enddo
          enddo
        enddo
      endif
      END

The ICE is:

internal compiler error: in go_through_subreg, at ira-conflicts.cc:234
0x161647f go_through_subreg
        gnu/src/gcc/gcc/ira-conflicts.cc:234
0x1616657 process_regs_for_copy
        gnu/src/gcc/gcc/ira-conflicts.cc:270
0x1616fe8 process_reg_shuffles
        gnu/src/gcc/gcc/ira-conflicts.cc:440
0x1617b1b add_insn_allocno_copies
        gnu/src/gcc/gcc/ira-conflicts.cc:510
0x1617bcc add_copies
        gnu/src/gcc/gcc/ira-conflicts.cc:527
0x1600bed ira_traverse_loop_tree(bool, ira_loop_tree_node*, void (*)(ira_loop_tree_node*), void (*)(ira_loop_tree_node*))
        gnu/src/gcc/gcc/ira-build.cc:1802
0x1618b38 ira_build_conflicts()
        gnu/src/gcc/gcc/ira-conflicts.cc:819
0x1605eb6 ira_build()
        gnu/src/gcc/gcc/ira-build.cc:3508
0x15fafe9 ira
        gnu/src/gcc/gcc/ira.cc:5793
0x15fba7f execute
        gnu/src/gcc/gcc/ira.cc:6117

I think this is a mode mix-up in go_through_subreg: we should be testing the natural mode size of the REG that we're trying to split, rather than the mode of the result. (But we probably also need to handle paradoxical subregs separately.)
[Bug target/115258] [14 Regression] register swaps for vector perm in some cases after r14-6290
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258 Richard Sandiford changed:

           What    |Removed                     |Added
----------------------------------------------------------------
           Summary |[14/15 Regression] register |[14 Regression] register
                   |swaps for vector perm in    |swaps for vector perm in
                   |some cases after r14-6290   |some cases after r14-6290

--- Comment #4 from Richard Sandiford --- Leaving open in case we do decide to backport.
[Bug target/115258] [14/15 Regression] register swaps for vector perm in some cases after r14-6290
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258 Richard Sandiford changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #2 from Richard Sandiford --- I agree with Andrew that this seems mostly to be luck. On that basis, I'm not sure (either way) whether we should backport the fix. Now that we're hopefully getting better at tracking and allocating subregs, it probably makes sense to allow the split before reload. Doing that (and adjusting the split accordingly) seems to fix the testcase.
[Bug c++/115192] [11/12/13/14 regression] -O3 miscompilation on x86-64 (loops with vectors and scalars) since r11-6380
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115192 Richard Sandiford changed:

           What    |Removed                     |Added
----------------------------------------------------------------
           Summary |[11/12/13/14/15 regression] |[11/12/13/14 regression]
                   |-O3 miscompilation on       |-O3 miscompilation on
                   |x86-64 (loops with vectors  |x86-64 (loops with vectors
                   |and scalars) since r11-6380 |and scalars) since r11-6380

--- Comment #11 from Richard Sandiford --- Fixed on trunk, will backport if there is no reported fallout.
[Bug c++/115192] [11/12/13/14/15 regression] -O3 miscompilation on x86-64 (loops with vectors and scalars) since r11-6380
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115192 Richard Sandiford changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org
[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635 --- Comment #19 from Richard Sandiford --- (In reply to Richard Biener from comment #14)
> Usually targets do have a limit on the actual length but I see
> constant_upper_bound_with_limit doesn't query such. But it would
> be a more appropriate way to say there might be an actual target limit here?
The discussion has moved on, but FWIW: this was a deliberate choice. The thinking at the time was that VLA code should be truly “agnostic” and not hard-code an upper limit. Hard-coding a limit would be hard-coding an assumption that the architectural maximum would never increase in future. (The main counterargument was that any uses of the .B form of TBL would break down for >256-byte vectors. We hardly use such TBLs for autovec though, and could easily choose not to use them at all.) That decision is 8 or 9 years old at this point, so it might seem overly dogmatic now. Even so, I think we should have a strong reason to change tack. It shouldn't just be about trying to avoid poly_ints :)
[Bug rtl-optimization/114664] -fno-omit-frame-pointer causes an ICE during the build of the greenlet package
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114664 --- Comment #14 from Richard Sandiford --- Yeah, I think so.
[Bug rtl-optimization/114664] -fno-omit-frame-pointer causes an ICE during the build of the greenlet package
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114664 --- Comment #12 from Richard Sandiford --- (In reply to Peter Bergner from comment #11)
> > > but how are users supposed to know whether
> > > -fno-omit-frame-pointer is in effect or not? I've looked and there is no
> > > pre-defined macro a user could check.
> > That might be a useful thing to have, but if the programmer has no control
> > over the build flags (i.e. cannot require/force -fomit-frame-pointer) then I
> > think the asm has to take care to save and restore the frame pointer itself.
> >
> > Dropping "31" from the asm means that the asm must preserve the register.
> > Things will go badly if the asm doesn't do that.
> So r31 which we use as our frame-pointer reg is a non-volatile/callee saved
> register, so it must be saved, but I guess they (greenlet) cannot use the
> method of mentioning it in the asm clobber list to perform that.
I was thinking of just the asm in isolation, rather than its effect on the containing function's callee-save set. If you have:

  asm volatile ("…");

then GCC expects r31 after the asm to equal r31 before the asm, regardless of the -fomit-frame-pointer setting. If the asm is:

  asm volatile ("li r31,0");

(sorry, I've forgotten the actual Power asm :)) then things will go wrong if GCC tries to use r31 after the asm. If the asm clobbers r31 then it must mention it in the clobber list. As things stand, it's not possible to mention r31 in the clobber list if -fno-omit-frame-pointer. This means that the only option for code that wants to support -fno-omit-frame-pointer is to make the asm's contents preserve r31, using an explicit save and restore if necessary. And that kind-of makes sense. If an asm clobbers the frame pointer, that removes GCC's main option for restoring data after the asm. A lot of other clobbers would be handled by loading data relative to the frame pointer. If the frame pointer itself has gone then things get tricky.
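The save-and-restore approach can be sketched as follows. This is a hypothetical, untested illustration only: the exact mnemonics, the use of r1 as the stack pointer, and the assumption that scratch space below the stack pointer is usable are all assumptions about the Power ABI, not a vetted recipe.

```c
/* Sketch only: an asm that clobbers r31 internally but preserves it,
   so that it stays correct whether or not -fno-omit-frame-pointer is
   in effect.  The offset and mnemonics are illustrative assumptions.  */
asm volatile ("std 31,-8(1)\n\t"   /* save r31 below the stack pointer */
              "li 31,0\n\t"        /* ...body that clobbers r31... */
              "ld 31,-8(1)"        /* restore r31 before GCC uses it again */
              ::: "memory");
```

The point is that r31 never appears in the clobber list, so the asm is accepted under -fno-omit-frame-pointer, and GCC's assumption that r31 is unchanged across the asm remains true.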
[Bug rtl-optimization/114664] -fno-omit-frame-pointer causes an ICE during the build of the greenlet package
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114664 --- Comment #10 from Richard Sandiford --- (In reply to Peter Bergner from comment #7)
> Then that would seem to indicate that mentioning the frame pointer reg in
> the asm clobber list is an error
Yeah, I agree it's an error. The PR says “ICE”, but is there an internal error? The “cannot be used in ‘asm’ here” is a normal user-facing error, albeit with bad error recovery, leading us to report the same thing multiple times.

> but how are users supposed to know whether
> -fno-omit-frame-pointer is in effect or not? I've looked and there is no
> pre-defined macro a user could check.
That might be a useful thing to have, but if the programmer has no control over the build flags (i.e. cannot require/force -fomit-frame-pointer) then I think the asm has to take care to save and restore the frame pointer itself. Dropping "31" from the asm means that the asm must preserve the register. Things will go badly if the asm doesn't do that.
[Bug target/114607] aarch64: Incorrect expansion of svsudot
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114607 --- Comment #2 from Richard Sandiford --- Fixed on trunk. I'll backport in a few weeks if there's no fallout.
[Bug target/114607] aarch64: Incorrect expansion of svsudot
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114607 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2024-04-05 Ever confirmed|0 |1
[Bug target/114607] New: aarch64: Incorrect expansion of svsudot
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114607 Bug ID: 114607 Summary: aarch64: Incorrect expansion of svsudot Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: wrong-code Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org Target Milestone: --- Target: aarch64*-*-* svsudot is supposed to expand to USDOT with the second and third arguments swapped. However, there is a thinko in the code that does the reversal, making it a no-op. Unfortunately, the tests simply accept the buggy form. :-( For example, gcc.target/aarch64/sve/acle/asm/sudot_s32.c contains:

/*
** sudot_s32_tied1:
**	usdot	z0\.s, z2\.b, z4\.b
**	ret
*/
TEST_TRIPLE_Z (sudot_s32_tied1, svint32_t, svint8_t, svuint8_t,
	       z0 = svsudot_s32 (z0, z2, z4),
	       z0 = svsudot (z0, z2, z4))

where the usdot z2 and z4 operands should be in the opposite order.
[Bug target/114603] aarch64: Invalid SVE cnot optimisation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114603 Richard Sandiford changed: What|Removed |Added Last reconfirmed||2024-04-05 Status|UNCONFIRMED |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #2 from Richard Sandiford --- Fix on trunk so far, but I'll backport if possible.
[Bug target/114603] New: aarch64: Invalid SVE cnot optimisation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114603 Bug ID: 114603 Summary: aarch64: Invalid SVE cnot optimisation Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: wrong-code Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org Target Milestone: --- Target: aarch64*-*-* An overly lax condition on the cnot combine pattern means that we optimise:

#include <arm_sve.h>

svint32_t foo(svbool_t pg, svint32_t y) {
  return svsel(svcmpeq(pg, y, 0), svdup_s32(1), svdup_s32(0));
}

to a single cnot:

foo:
        cnot    z0.s, p0/m, z0.s
        ret

The result must be 0 for inactive elements of pg, whereas the above would leave the elements unchanged instead. This seems to have been around since the SVE ACLE was first added.
[Bug target/114577] Inefficient codegen for SVE/NEON bridge
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114577 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED CC||rsandifo at gcc dot gnu.org Status|UNCONFIRMED |RESOLVED --- Comment #2 from Richard Sandiford --- Fixed.
[Bug target/114521] [11 only] aarch64: wrong code with Neon ld1/st1x4 intrinsics gcc-11 and earlier
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114521 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #2 from Richard Sandiford --- Oops. I was going to upload a patch for the bug here, but it looks like I accidentally committed it while backporting PR97696 to GCC 11. The patch was g:daee0409d195d346562e423da783d5d1cf8ea175. I'm not sure what to do now. Perhaps we should leave it in?
[Bug rtl-optimization/114515] [14 Regression] Failure to use aarch64 lane forms after PR101523
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515 --- Comment #5 from Richard Sandiford --- For the record, the associated new testsuite failures are:

FAIL: gcc.target/aarch64/ashltidisi.c scan-assembler-times asr 3
FAIL: gcc.target/aarch64/asimd-mull-elem.c scan-assembler-times \\s+fmul\\tv[0-9]+\\.4s, v[0-9]+\\.4s, v[0-9]+\\.s\\[0\\] 4
FAIL: gcc.target/aarch64/asimd-mull-elem.c scan-assembler-times \\s+mul\\tv[0-9]+\\.4s, v[0-9]+\\.4s, v[0-9]+\\.s\\[0\\] 4
FAIL: gcc.target/aarch64/ccmp_3.c scan-assembler-not \tcbnz\t
FAIL: gcc.target/aarch64/pr100056.c scan-assembler-times \\t[us]bfiz\\tw[0-9]+, w[0-9]+, 11 2
FAIL: gcc.target/aarch64/pr100056.c scan-assembler-times \\tadd\\tw[0-9]+, w[0-9]+, w[0-9]+, uxtb\\n 2
FAIL: gcc.target/aarch64/pr108840.c scan-assembler-not and\\tw[0-9]+, w[0-9]+, 31
FAIL: gcc.target/aarch64/pr112105.c scan-assembler-not \\tdup\\t
FAIL: gcc.target/aarch64/pr112105.c scan-assembler-times (?n)\\tfmul\\t.*v[0-9]+\\.s\\[0\\]\\n 2
FAIL: gcc.target/aarch64/rev16_2.c scan-assembler-times rev16\\tx[0-9]+ 2
FAIL: gcc.target/aarch64/vaddX_high_cost.c scan-assembler-not dup\\t
FAIL: gcc.target/aarch64/vmul_element_cost.c scan-assembler-not dup\\t
FAIL: gcc.target/aarch64/vmul_high_cost.c scan-assembler-not dup\\t
FAIL: gcc.target/aarch64/vsubX_high_cost.c scan-assembler-not dup\\t
FAIL: gcc.target/aarch64/sve/pr98119.c scan-assembler \\tand\\tx[0-9]+, x[0-9]+, #?-31\\n
FAIL: gcc.target/aarch64/sve/pred-not-gen-1.c scan-assembler-not \\tbic\\t
FAIL: gcc.target/aarch64/sve/pred-not-gen-1.c scan-assembler-times \\tnot\\tp[0-9]+\\.b, p[0-9]+/z, p[0-9]+\\.b\\n 1
FAIL: gcc.target/aarch64/sve/pred-not-gen-4.c scan-assembler-not \\tbic\\t
FAIL: gcc.target/aarch64/sve/pred-not-gen-4.c scan-assembler-times \\tnot\\tp[0-9]+\\.b, p[0-9]+/z, p[0-9]+\\.b\\n 1
FAIL: gcc.target/aarch64/sve/var_stride_2.c scan-assembler-times \\tubfiz\\tx[0-9]+, x2, 10, 16\\n 1
FAIL: gcc.target/aarch64/sve/var_stride_2.c scan-assembler-times \\tubfiz\\tx[0-9]+, x3, 10, 16\\n 1
FAIL: gcc.target/aarch64/sve/var_stride_4.c scan-assembler-times \\tsbfiz\\tx[0-9]+, x2, 10, 32\\n 1
FAIL: gcc.target/aarch64/sve/var_stride_4.c scan-assembler-times \\tsbfiz\\tx[0-9]+, x3, 10, 32\\n 1
[Bug rtl-optimization/114515] [14 Regression] Failure to use aarch64 lane forms after PR101523
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515 --- Comment #4 from Richard Sandiford --- (In reply to Richard Biener from comment #1)
> Btw, why does forwprop not do this?
Not 100% sure (I wasn't involved in choosing the current heuristics). But fwprop can propagate across blocks, so there is probably more risk of increasing register pressure.
[Bug rtl-optimization/114515] [14 Regression] Failure to use aarch64 lane forms after PR101523
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515 --- Comment #3 from Richard Sandiford --- In RTL terms, the dup is vec_duplicate. The combination is:

Trying 10 -> 13:
   10: r107:V4SF=vec_duplicate(r115:SF)
      REG_DEAD r115:SF
   13: r110:V4SF=r111:V4SF*r107:V4SF
      REG_DEAD r111:V4SF
Failed to match this instruction:
(parallel [
        (set (reg:V4SF 110 [ _2 ])
            (mult:V4SF (vec_duplicate:V4SF (reg:SF 115))
                (reg:V4SF 111 [ *ptr_6(D) ])))
        (set (reg:V4SF 107)
            (vec_duplicate:V4SF (reg:SF 115)))
    ])
Failed to match this instruction:
(parallel [
        (set (reg:V4SF 110 [ _2 ])
            (mult:V4SF (vec_duplicate:V4SF (reg:SF 115))
                (reg:V4SF 111 [ *ptr_6(D) ])))
        (set (reg:V4SF 107)
            (vec_duplicate:V4SF (reg:SF 115)))
    ])
Successfully matched this instruction:
(set (reg:V4SF 107)
    (vec_duplicate:V4SF (reg:SF 115)))
Successfully matched this instruction:
(set (reg:V4SF 110 [ _2 ])
    (mult:V4SF (vec_duplicate:V4SF (reg:SF 115))
        (reg:V4SF 111 [ *ptr_6(D) ])))
allowing combination of insns 10 and 13
original costs 8 + 20 = 28
replacement costs 8 + 20 = 28
modifying insn i2    10: r107:V4SF=vec_duplicate(r115:SF)
deferring rescan insn with uid = 10.
modifying insn i3    13: r110:V4SF=vec_duplicate(r115:SF)*r111:V4SF
      REG_DEAD r115:SF
      REG_DEAD r111:V4SF
deferring rescan insn with uid = 13.
[Bug rtl-optimization/114515] New: [14 Regression] Failure to use aarch64 lane forms after PR101523
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515 Bug ID: 114515 Summary: [14 Regression] Failure to use aarch64 lane forms after PR101523 Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org Target Milestone: --- The following test regressed on aarch64 after g:839bc42772ba7af66af3bd16efed4a69511312ae (the fix for PR101523):

typedef float v4sf __attribute__((vector_size(16)));

void
f (v4sf *ptr, float f)
{
  ptr[0] = ptr[0] * (v4sf) { f, f, f, f };
  ptr[1] = ptr[1] * (v4sf) { f, f, f, f };
}

Compiled with -O2, we previously generated:

        ldp     q1, q31, [x0]
        fmul    v1.4s, v1.4s, v0.s[0]
        fmul    v31.4s, v31.4s, v0.s[0]
        stp     q1, q31, [x0]
        ret

Now we generate:

        ldp     q1, q31, [x0]
        dup     v0.4s, v0.s[0]
        fmul    v1.4s, v1.4s, v0.4s
        fmul    v31.4s, v31.4s, v0.4s
        stp     q1, q31, [x0]
        ret

with the extra dup. The patch is trying to avoid cases where i3 is canonicalised by contextual information provided by i2. But here we place a full copy of i2 into i3 (creating an instruction that is no more expensive). This is a benefit in its own right because the two instructions can then execute in parallel rather than serially. But it also means that, as here, we might be able to remove i2 with later combinations. Perhaps we could also check whether i3 still contains the destination of i2?
[Bug sanitizer/97696] ICE since ASAN_MARK does not handle poly_int sized varibales
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97696 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #9 from Richard Sandiford --- Fixed on trunk and all active release branches.
[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302 --- Comment #5 from Richard Sandiford --- (In reply to Andrew Stubbs from comment #4)
> Yes, that's what the simd-math-3* tests do.
Ah, OK.

> The simd-math-5* tests are explicitly supposed to be doing this in the
> context of the autovectorizer.
>
> If these tests are being compiled as (newly) intended then we should change
> the expected results.
>
> So, questions:
>
> 1. Are the new results actually correct? (So far I only know that being
> different is expected.)
I believe so. We now do the division in 32 bits, as in the original gimple.

> 2. Is there some other testcase form that would exercise the previously
> intended routines?
It should be possible in languages that don't have C's integer promotion rules, if you're up for some Ada or Rust.

> 3. Is the new behaviour configurable? I don't think the 16-bit shift bug
> ever existed on GCN (in which "short" vectors actually have excess bits in
> each lane, much like scalar registers do).
Not AFAIK. The problem is that the gimple→gimple transformation changes the gimple-level semantics of the code. Shifts by out-of-range values are undefined rather than target-defined. (And in other cases that's useful, because it means we don't need to preserve whatever value the target happens to give for an out-of-range shift.)
[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302 --- Comment #3 from Richard Sandiford --- Ah, ok. If the main aim is to test the libgcc routines, it might be safer to use something like:

typedef char v64qi __attribute__((vector_size(64)));

v64qi f(v64qi x, v64qi y) {
  return x / y;
}

instead of relying on vectorisation.
[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302 --- Comment #1 from Richard Sandiford --- The decision to stop narrowing division was deliberate, see the comments in PR113281 for details. Is the purpose of the test to check vectorisation quality, or to check for the right ABI routines?
[Bug tree-optimization/114234] New: [14 Regression] verify_ssa failure with early-break vectorisation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114234 Bug ID: 114234 Summary: [14 Regression] verify_ssa failure with early-break vectorisation Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: ice-on-valid-code Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org Target Milestone: --- The following test ICEs with -Ofast on aarch64:

void bar();

float
foo (float x)
{
  float a = 1;
  float b = x;
  long z = 200;
  for (;;)
    {
      float c = b - 1.0f;
      a *= c;
      z -= 1;
      if (z == 0)
        {
          bar ();
          break;
        }
      if (b <= 3.0f)
        break;
      b = c;
    }
  return a * b;
}

(reduced from wrf). The ICE is:

foo.c:3:1: error: definition in block 15 does not dominate use in block 10
    3 | foo (float x)
      | ^~~
for SSA_NAME: stmp_a_9.10_103 in statement:
a_47 = PHI
PHI argument
stmp_a_9.10_103
for PHI node
a_47 = PHI
during GIMPLE pass: vect
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 Richard Sandiford changed:

           What    |Removed |Added
----------------------------------------------------------------
Attachment #57602 is obsolete|0  |1

--- Comment #42 from Richard Sandiford --- Created attachment 57605 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57605&action=edit proof-of-concept patch to suppress peeling for gaps

How about the attached? It records whether all accesses that require peeling for gaps could instead have used gathers, and only retries when that's true. It means that we retry for only 0.034% of calls to vect_analyze_loop_1 in a build of SPEC2017 with -mcpu=neoverse-v1 -Ofast -fomit-frame-pointer. The figures exclude wrf, which failed for me with:

module_mp_gsfcgce.fppized.f90:852:23:

  852 |    REAL FUNCTION ggamma(X)
      |                       ^
Error: definition in block 18 does not dominate use in block 13
for SSA_NAME: stmp_pf_6.5657_140 in statement:
pf_81 = PHI
PHI argument
stmp_pf_6.5657_140
for PHI node
pf_81 = PHI
during GIMPLE pass: vect
module_mp_gsfcgce.fppized.f90:852:23: internal compiler error: verify_ssa failed

Will look at that tomorrow.
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #41 from Richard Sandiford --- (In reply to Richard Biener from comment #40)
> So I wonder if we can use "local costing" to decide a gather is always OK
> compared to the alternative with peeling for gaps. On x86 gather tends
> to be slow compared to open-coding it.
Yeah, on SVE gathers are generally “enabling” instructions rather than something to use for their own sake. I suppose one problem is that we currently only try to use gathers for single-element groups. If we make a local decision to use gathers while keeping that restriction, we could end up using gathers “unnecessarily” while still needing to peel for gaps for (say) a two-element group. That is, it's only better to use gathers than contiguous loads if by doing that we avoid all need to peel for gaps (and if the cost of peeling for gaps was high enough to justify the cost of using gathers over consecutive loads). One of the things on the list to do (once everything is SLP!) is to support loads with gaps directly via predication, so that we never load elements that aren't needed. E.g. on SVE, a 64-bit predicate (PTRUE .D) can be used with a 32-bit load (LD1W .S) to load only even-indexed elements. So a single-element group with a group size of 2 could be done cheaply with just consecutive loads, without peeling for gaps.
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #39 from Richard Sandiford --- (In reply to Richard Sandiford from comment #38)
> (In reply to Richard Biener from comment #37)
> > Even more iteration looks bad. I do wonder why when gather can avoid
> > peeling for GAPs using load-lanes cannot?
> Like you say, we don't realise that all the loads from array3[i] form a
> single group.
Oops, sorry, I shouldn't have gone off memory. So yeah, it's array1[] where that happens, not array3[]. The reason we don't use load-lanes is that we don't have load-lane instructions for smaller elements in larger containers, so we're forced to use load-and-permute instead.
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #38 from Richard Sandiford --- (In reply to Richard Biener from comment #37)
> Even more iteration looks bad. I do wonder why when gather can avoid
> peeling for GAPs using load-lanes cannot?
Like you say, we don't realise that all the loads from array3[i] form a single group. Note that we're not using load-lanes in either case, since the group size (8) is too big for that. But load-lanes and load-and-permute have the same restriction about when peeling for gaps is required. In contrast, gather loads only ever load data that they actually need.

> Also for the stores we seem to use elementwise stores rather than store-lanes.
What configuration are you trying? The original report was about SVE, so I was trying that. There we use a scatter store.

> To me the most obvious thing to try optimizing in this testcase is DR
> analysis. With -march=armv8.3-a I still see
>
> t.c:26:22: note: === vect_analyze_data_ref_accesses ===
> t.c:26:22: note: Detected single element interleaving array1[0][_8] step 4
> t.c:26:22: note: Detected single element interleaving array1[1][_8] step 4
> t.c:26:22: note: Detected single element interleaving array1[2][_8] step 4
> t.c:26:22: note: Detected single element interleaving array1[3][_8] step 4
> t.c:26:22: note: Detected single element interleaving array1[0][_1] step 4
> t.c:26:22: note: Detected single element interleaving array1[1][_1] step 4
> t.c:26:22: note: Detected single element interleaving array1[2][_1] step 4
> t.c:26:22: note: Detected single element interleaving array1[3][_1] step 4
> t.c:26:22: missed: not consecutive access array2[_4][_8] = _69;
> t.c:26:22: note: using strided accesses
> t.c:26:22: missed: not consecutive access array2[_4][_1] = _67;
> t.c:26:22: note: using strided accesses
>
> so we don't figure
>
> Creating dr for array1[0][_1]
>   base_address:
>   offset from base address: (ssizetype) ((sizetype) (m_111 * 2) * 2)
>   constant offset from base address: 0
>   step: 4
>   base alignment: 16
>   base misalignment: 0
>   offset alignment: 4
>   step alignment: 4
>   base_object: array1
> Access function 0: {m_111 * 2, +, 2}_4
> Access function 1: 0
> Creating dr for array1[0][_8]
> analyze_innermost: success.
>   base_address:
>   offset from base address: (ssizetype) ((sizetype) (m_111 * 2 + 1) * 2)
>   constant offset from base address: 0
>   step: 4
>   base alignment: 16
>   base misalignment: 0
>   offset alignment: 2
>   step alignment: 4
>   base_object: array1
> Access function 0: {m_111 * 2 + 1, +, 2}_4
> Access function 1: 0
>
> belong to the same group (but the access functions tell us it worked out).
> Above we fail to split the + 1 to the constant offset.
OK, but this is moving the question on to how we should optimise the testcase for Advanced SIMD rather than SVE, and how we should optimise the testcase in general, rather than simply recover what we could do before. (SVE is only enabled for -march=armv9-a and above, in case armv8.3-a was intended to enable SVE too.)
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #36 from Richard Sandiford ---
Created attachment 57602
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57602&action=edit
proof-of-concept patch to suppress peeling for gaps

This patch does what I suggested in the previous comment: if the loop needs peeling for gaps, try again without that, and pick the better loop. It seems to restore the original style of code for SVE.

A more polished version would be a bit smarter about when to retry. E.g. it's pointless if the main loop already operates on full vectors (i.e. if peeling 1 iteration is natural in any case). Perhaps the condition should be that either (a) the number of epilogue iterations is known to be equal to the VF of the main loop or (b) the target is known to support partial vectors for the loop's vector_mode.

Any thoughts?
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #35 from Richard Sandiford ---
Maybe I've misunderstood the flow of the ticket, but it looks to me like we do still correctly recognise the truncating scatter stores. And, on their own, we would be able to convert them into masked scatters.

The reason for the epilogue is instead on the load side. There we have a non-strided grouped load, and currently we hard-code the assumption that it is better to use contiguous loads and permutes rather than gather loads where possible. So we have:

  /* As a last resort, trying using a gather load or scatter store.

     ??? Although the code can handle all group sizes correctly,
     it probably isn't a win to use separate strided accesses based
     on nearby locations.  Or, even if it's a win over scalar code,
     it might not be a win over vectorizing at a lower VF, if that
     allows us to use contiguous accesses.  */
  if (*memory_access_type == VMAT_ELEMENTWISE
      && single_element_p
      && loop_vinfo
      && vect_use_strided_gather_scatters_p (stmt_info, loop_vinfo,
                                             masked_p, gs_info))
    *memory_access_type = VMAT_GATHER_SCATTER;

only after we've tried and failed to use load lanes or load+permute. If instead I change the order so that the code above is tried first, then we do use extending gather loads and truncating scatter stores as before, with no epilogue loop.

So I suppose the question is: if we do prefer to use gathers over load+permute for some cases, how do we decide which to use? And can it be done on a per-load basis, or should it instead be a per-loop decision? E.g., if we end up with a loop that needs peeling for gaps, perhaps we should try again and forbid peeling for gaps. Then, if that succeeds, see which loop gives the better overall cost. Of course, trying more things means more compile time…
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #33 from Richard Sandiford --- Can you give me a chance to look at it a bit when I'm back? This doesn't feel like the way to go to me.
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #31 from Richard Sandiford ---
(In reply to Tamar Christina from comment #29)
> This works fine for normal gather and scatters but doesn't work for widening
> gathers and narrowing scatters which only the pattern seems to handle.
I'm supposedly on holiday, so didn't see the IRC discussion, but: as I remember it, there is no narrowing or widening for IFN gathers or scatters as such, even for patterns. One vector's worth of offsets corresponds to one vector's worth of data. But the widths of the data elements and the offset elements can be different. Any sign or zero extension of a loaded vector, or any operation to double or halve the number of vectors, is done separately.

I think it does make sense to stick to that, rather than (say) have IFNs that load two offset vectors into one data vector, or use one offset vector to load two data vectors. Supporting those combinations would mean that we have two different ways in which the offset elements and data elements have different widths. And it isn't really a natural fusion on SVE.
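The convention described above (one offset vector per data vector, with any widening done as a separate operation) can be shown with a scalar model. The names and the lane count are illustrative only, not GCC internals:

```cpp
#include <array>
#include <cstdint>

constexpr unsigned VL = 4; // lanes in one vector, for illustration

// Gather: element i of the result comes from base[off[i]].  The number
// of offsets always equals the number of loaded data elements, even
// though the offset elements (32-bit) are wider than the data (16-bit).
std::array<uint16_t, VL> gather_u16(const uint16_t *base,
                                    const std::array<uint32_t, VL> &off) {
  std::array<uint16_t, VL> r{};
  for (unsigned i = 0; i < VL; ++i)
    r[i] = base[off[i]];
  return r;
}

// Any widening of the loaded elements is a separate step, not fused
// into the gather itself.
std::array<uint32_t, VL> zext_u16_to_u32(const std::array<uint16_t, VL> &v) {
  std::array<uint32_t, VL> r{};
  for (unsigned i = 0; i < VL; ++i)
    r[i] = v[i];
  return r;
}
```

A "widening gather" in this scheme is simply `zext_u16_to_u32(gather_u16(...))`: the extension never changes how many offsets map to how many data elements.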
[Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #8 from Richard Sandiford --- The reason early_ra doesn't help with the original testcase is that early_ra punts on any non-move instruction that has a hard register destination. And it does that because it can't cope well with cases where hard-coded destinations force the wrong choice (unlike the proper allocators, which can change the destination where necessary). The restriction is needed to avoid regressing SVE ACLE tests.
[Bug sanitizer/97696] ICE since ASAN_MARK does not handle poly_int sized variables
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97696

--- Comment #3 from Richard Sandiford ---
Created attachment 57520
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57520&action=edit
Candidate patch

The attached patch seems to fix it. I'm taking next week off, but I'll run the patch through proper testing when I get back.
[Bug sanitizer/97696] ICE since ASAN_MARK does not handle poly_int sized variables
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97696 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org
[Bug middle-end/113205] [14 Regression] internal compiler error: in backward_pass, at tree-vect-slp.cc:5346 since r14-3220
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113205 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #14 from Richard Sandiford --- Finally fixed.
[Bug middle-end/113205] [14 Regression] internal compiler error: in backward_pass, at tree-vect-slp.cc:5346 since r14-3220
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113205

--- Comment #12 from Richard Sandiford ---
Created attachment 57511
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57511&action=edit
Candidate patch

Sorry for the very slow response on this. I'm testing the attached.
[Bug middle-end/113205] [14 Regression] internal compiler error: in backward_pass, at tree-vect-slp.cc:5346 since r14-3220
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113205 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED
[Bug target/112922] [14 Regression] 465.tonto from SPECFP 2006 fails train run on Aarch64-linux with -O2 and -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112922 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|UNCONFIRMED |RESOLVED --- Comment #3 from Richard Sandiford --- Assumed fixed by the patches for PR113295. Please reopen if not.
[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 Bug 26163 depends on bug 112922, which changed state. Bug 112922 Summary: [14 Regression] 465.tonto from SPECFP 2006 fails train run on Aarch64-linux with -O2 and -flto https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112922 What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED
[Bug target/113295] [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 when built with -Ofast -mcpu=native since g:2f46e3578d45ff060a0a329cb39d4f52878f9d5a
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #9 from Richard Sandiford --- Fixed.
[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 Bug 26163 depends on bug 113295, which changed state. Bug 113295 Summary: [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 when built with -Ofast -mcpu=native since g:2f46e3578d45ff060a0a329cb39d4f52878f9d5a https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295 What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED
[Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #9 from Richard Sandiford --- Fixed.
[Bug target/113295] [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 when built with -Ofast -mcpu=native since g:2f46e3578d45ff060a0a329cb39d4f52878f9d5a
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295 --- Comment #6 from Richard Sandiford --- For me the miscompilation is in jkdmem_, where we end up allocating the same registers to both arms of an fcsel. It sounds like it occurs elsewhere too. I have a candidate fix, but need to think a bit more about it.
[Bug preprocessor/114007] gcc chokes on __has_cpp_attribute(clang::unsafe_buffer_usage)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114007 --- Comment #14 from Richard Sandiford --- I might have misunderstood the suggestion and so be arguing against something that no-one is suggesting, but I think [[__extension__ …]] should accept the same things for all standard versions (C23, pre-C23, and GNU). It was intended to be something that header files and macros could use without needing to be sensitive to the user's choice of standard.
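For context, the guard pattern such headers typically rely on looks like the sketch below. The `ATTR_UNSAFE_BUFFER` macro name is hypothetical, and the whole point of this PR is that the `clang::`-scoped probe itself could trip GCC in some modes, which is why the `[[__extension__ …]]` discussion matters:

```cpp
// Hypothetical ATTR_UNSAFE_BUFFER macro: expand to the vendor attribute
// only if the preprocessor can probe for it and the probe succeeds.
#if defined(__has_cpp_attribute)
#  if __has_cpp_attribute(clang::unsafe_buffer_usage)
#    define ATTR_UNSAFE_BUFFER [[clang::unsafe_buffer_usage]]
#  endif
#endif
#ifndef ATTR_UNSAFE_BUFFER
#  define ATTR_UNSAFE_BUFFER // unsupported: expand to nothing
#endif

// The annotation must be harmless on compilers without the attribute.
ATTR_UNSAFE_BUFFER
int legacy_sum(const int *p, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i)
    s += p[i];
  return s;
}
```

On compilers where the attribute is unknown, the macro expands to nothing and the code compiles unchanged; the failure mode in this PR is the probe expression itself, not the expansion.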
[Bug target/113995] ICE: in change_address_1, at emit-rtl.cc:2299 with [[arm::streaming_compatible]] and -march=armv9-a+sve -finstrument-functions -fstack-clash-protection
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113995 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #4 from Richard Sandiford --- Fixed.
[Bug target/113220] [aarch64] ICE Segmentation fault with r14-6178-g8d29b7aca15133
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113220 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #5 from Richard Sandiford --- Fixed
[Bug target/113295] [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 when built with -Ofast -mcpu=native since g:2f46e3578d45ff060a0a329cb39d4f52878f9d5a
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #5 from Richard Sandiford --- Mine. Could be the same as PR112922.
[Bug target/112922] [14 Regression] 465.tonto from SPECFP 2006 fails train run on Aarch64-linux with -O2 and -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112922 --- Comment #2 from Richard Sandiford --- I don't remember there being a deliberate bug fix in that patch, but there were some others later. I suppose the optimistic case is that this first went latent and then was fixed “properly” afterwards. But it could just be latent.
[Bug target/113778] ICE: RTL check: expected elt 3 type 'e' or 'u', have '0' (rtx code_label) in PATTERN, at rtl.h:1506 with -march=armv9-a+sme -fhardened
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113778 Richard Sandiford changed: What|Removed |Added Resolution|--- |DUPLICATE Status|ASSIGNED|RESOLVED --- Comment #4 from Richard Sandiford --- Confirmed that this is the same as PR113220. *** This bug has been marked as a duplicate of bug 113220 ***
[Bug target/113220] [aarch64] ICE Segmentation fault with r14-6178-g8d29b7aca15133
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113220 Richard Sandiford changed: What|Removed |Added CC||zsojka at seznam dot cz --- Comment #3 from Richard Sandiford --- *** Bug 113778 has been marked as a duplicate of this bug. ***
[Bug target/113805] sme2 vs -mtrack-speculation, ICE in extract_insn, at recog.cc:2812 with r14-6178-g8d29b7aca15133
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113805 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #3 from Richard Sandiford --- Fixed.
[Bug target/113995] ICE: in change_address_1, at emit-rtl.cc:2299 with [[arm::streaming_compatible]] and -march=armv9-a+sve -finstrument-functions -fstack-clash-protection
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113995 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #2 from Richard Sandiford --- Possibly a dup of some of the other PRs in this area.
[Bug middle-end/113988] during GIMPLE pass: bitintlower: internal compiler error: in lower_stmt, at gimple-lower-bitint.cc:5470
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #16 from Richard Sandiford --- I think it'd be worth splitting the mode selection out into a helper called bitwise_mode_for_size, to go alongside bitwise_mode_for_mode.
[Bug target/113778] ICE: RTL check: expected elt 3 type 'e' or 'u', have '0' (rtx code_label) in PATTERN, at rtl.h:1506 with -march=armv9-a+sme -fhardened
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113778 Richard Sandiford changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1 Last reconfirmed||2024-02-15 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #3 from Richard Sandiford --- Possibly the same as PR113220
[Bug target/113805] sme2 vs -mtrack-speculation, ICE in extract_insn, at recog.cc:2812 with r14-6178-g8d29b7aca15133
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113805 Richard Sandiford changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2024-02-12 Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #1 from Richard Sandiford --- Heh. I knew this was coming after seeing the previous -mtrack-speculation bug, but didn't have a chance to fix it pre-emptively last week.
[Bug target/113763] [14 Regression] build fails with clang++ host compiler because aarch64.cc uses C++14 constexpr.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113763 --- Comment #14 from Richard Sandiford --- AFAIK, the constructor shouldn't be necessary. (And without it, the whole thing would fit on one line.) LGTM (and preapproved) otherwise. Thanks for doing this.
[Bug target/113763] [14 Regression] build fails with clang++ host compiler because aarch64.cc uses C++14 constexpr.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113763 --- Comment #11 from Richard Sandiford --- Currently away so can't try it myself, but how about just using an ad-hoc structure instead?
[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #34 from Richard Sandiford ---
(In reply to Richard Biener from comment #32)
> Btw, AVX512 knotb will invert all 8 bits and there's no knot just affecting
> the lowest 4 or 2 bits.
>
> It all feels like desaster waiting to happen ;)
Yes :)

> For example BIT_NOT_EXPR is RTL expanded like
>
>     case BIT_NOT_EXPR:
>       op0 = expand_expr (treeop0, subtarget,
>                          VOIDmode, EXPAND_NORMAL);
>       if (modifier == EXPAND_STACK_PARM)
>         target = 0;
>       /* In case we have to reduce the result to bitfield precision
>          for unsigned bitfield expand this as XOR with a proper constant
>          instead.  */
>       if (reduce_bit_field && TYPE_UNSIGNED (type))
>         {
>           int_mode = SCALAR_INT_TYPE_MODE (type);
>           wide_int mask = wi::mask (TYPE_PRECISION (type),
>                                     false, GET_MODE_PRECISION (int_mode));
>
>           temp = expand_binop (int_mode, xor_optab, op0,
>                                immed_wide_int_const (mask, int_mode),
>                                target, 1, OPTAB_LIB_WIDEN);
>
> so we could, for VECTOR_BOOLEAN_TYPE_P with integer mode and
> effective bit-precision set reduce_bit_field and fixup the fallout
> (not sure why the above is only for TYPE_UNSIGNED).
>
> At least it feels similar and doing things the opposite for vectors
> (fixing up at uses) would be odd?
Do you know why we take this approach for integers? Is it for correctness? Or is it supposed to be more optimal?

I can imagine that, for arithmetic types, there are going to be many more instances where upper bits matter (division, right shifts, MIN/MAX, etc.). So perhaps reducing every result is a good trade-off there. But there's an argument that it should be rare for the padding bits in a vector to matter, since very few things would look at the padding bits anyway. So perhaps the cost should be borne by the operations that need canonical integers. Not a strong opinion though, more just devil's advocate.

There again, if e.g. the x86 API guarantees memcmp equality between two masks whose significant bits are equal, then we probably have no choice.
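The XOR trick in the quoted expander can be checked with a scalar model: for an unsigned value with only `prec` significant bits held in a wider container, XOR with a mask of `prec` ones gives the same low bits as bitwise NOT while keeping the padding bits zero. This is only an illustration of the reduce_bit_field idea, not GCC code:

```cpp
#include <cstdint>

// NOT reduced to `prec` bits, expanded as XOR with a mask of ones.
uint8_t not_reduced(uint8_t x, unsigned prec) {
  uint8_t mask = static_cast<uint8_t>((1u << prec) - 1);
  return x ^ mask; // low bits match ~x; padding bits stay zero
}

// Plain NOT on the container mode, for comparison: sets the padding bits,
// which is exactly the knotb-style behaviour mentioned above.
uint8_t not_container(uint8_t x) {
  return static_cast<uint8_t>(~x);
}
```

The two agree on the significant bits but differ on the padding bits, which is why one of them has to pay for canonicalisation somewhere.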
[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #30 from Richard Sandiford ---
(In reply to Richard Biener from comment #29)
> But that's just for CONSTRUCTORs, we got the VIEW_CONVERT_EXPR path for
> VECTOR_CSTs.  But yeah, that _might_ argue we should perform the same
> masking for VECTOR_CST expansion as well, instead of trying to fixup
> in do_compare_and_jump?
But then how would ~ be implemented for things like 4-bit masks? If we use notqi2 then I assume the upper bits could be 1 rather than 0.
[Bug debug/113636] [14 Regression] internal compiler error: in dead_debug_global_find, at valtrack.cc:275 since r14-6290-g9f0f7d802482a8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113636 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #12 from Richard Sandiford --- Fixed. Thanks for the report and help with reproducing.
[Bug target/113623] [14 Regression] ICE in aarch64_pair_mem_from_base since r14-6605
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113623 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #8 from Richard Sandiford --- Fixed.
[Bug target/111677] [12/13 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677

--- Comment #21 from Richard Sandiford ---
(In reply to Alex Coplan from comment #13)
> The problem seems to be this code in aarch64_process_components:
>
>   while (regno != last_regno)
>     {
>       bool frame_related_p = aarch64_emit_cfi_for_reg_p (regno);
>       machine_mode mode = aarch64_reg_save_mode (regno);
>
>       rtx reg = gen_rtx_REG (mode, regno);
>       poly_int64 offset = frame.reg_offset[regno];
>       if (frame_pointer_needed)
>         offset -= frame.bytes_below_hard_fp;
>
>       rtx addr = plus_constant (Pmode, ptr_reg, offset);
>       rtx mem = gen_frame_mem (mode, addr);
>
> which emits a TFmode mem with offset 512, which is out of range for TFmode
> (so we later ICE with an unrecognisable insn). Presumably this just needs
> tweaking to emit a new base anchor in the case of large offsets like this.
> It looks like the code in aarch64_save_callee_saves already does this.
We shouldn't emit new anchor registers here, since unlike in the prologue, we don't have any guarantee that certain registers are free.

aarch64_get_separate_components is supposed to vet shrink-wrappable offsets, but in this case the offset looks valid, since:

  str q22, [sp, #512]

is a valid instruction. Perhaps the constraints are too narrow?
[Bug target/113623] [14 Regression] ICE in aarch64_pair_mem_from_base since r14-6605
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113623 Richard Sandiford changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #6 from Richard Sandiford --- Mine.
[Bug debug/113636] [14 Regression] internal compiler error: in dead_debug_global_find, at valtrack.cc:275 since r14-6290-g9f0f7d802482a8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113636 Richard Sandiford changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Last reconfirmed||2024-01-29 --- Comment #10 from Richard Sandiford --- Mine.
[Bug tree-optimization/113281] Wrong code due to vectorization of shift reduction and missing promotions since r14-3027
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281 --- Comment #24 from Richard Sandiford --- Fixed on trunk so far, but it's latent on branches. I'll see what the trunk fallout is like before asking about backports.
[Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613 Richard Sandiford changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #7 from Richard Sandiford --- early-ra does try to avoid reusing registers too soon, to increase scheduling freedom. But in this case I imagine it handles the two statements as separate regions. Should be fixable by carrying across a round-robin counter.
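A round-robin counter carried across regions, of the kind suggested above, could look like this sketch. It is purely illustrative (not early-ra's actual data structures): the point is that the search position survives from one allocation to the next, so consecutive statements prefer different registers:

```cpp
#include <bitset>

// Pick a free register, starting the search at a round-robin position
// `rr` that is carried across calls instead of restarting at 0, to
// avoid reusing the same register in back-to-back statements.
int pick_reg(const std::bitset<32> &free_regs, unsigned &rr) {
  for (unsigned k = 0; k < 32; ++k) {
    unsigned r = (rr + k) % 32;
    if (free_regs[r]) {
      rr = (r + 1) % 32; // continue after the register just used
      return static_cast<int>(r);
    }
  }
  return -1; // no free register
}
```

With all registers free, successive calls return 0, 1, 2, … rather than 0 every time, which is what gives the scheduler (and ldp/stp formation) more freedom.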
[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #18 from Richard Sandiford ---
(In reply to Tamar Christina from comment #17)
> Well the mid-end has generated the right precision. The type it generates is
> vector(4) vexit_reduc_67;
> so it does say it's a single bit boolean.
>
> Isn't this just an expand problem?
That's what I meant. expand is using a QImode comparison to compare things with 4-bit precision, so I think the masking should happen at that point. How about doing the masking in do_compare_and_jump?
[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #16 from Richard Sandiford ---
(In reply to Richard Biener from comment #15)
> I think the problem is the cbranch pattern which looks at all of the
> QImode mask - but of course it doesn't know it's really V4BImode it's
> working on ...
Yeah. Currently building an x86_64 toolchain to have a look, but I think whatever code uses a cbranch with a higher precision than the inputs should mask off the significant bits beforehand. cbranch can also be used to compare two variable masks.

I suppose we could add a hook to say that padding bits of an integer mask must always be zero (and so it's expand's job to ensure that that holds for any mask operation). But it feels dangerously close to TRULY_NOOP_TRUNCATION for integers.
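The masking being suggested can be shown with a scalar model: two 4-lane boolean masks live in 8-bit containers (QImode holding what is really V4BImode), so an equality test feeding a branch must ignore the 4 padding bits. Illustrative only:

```cpp
#include <cstdint>

// Equality of two 4-bit masks stored in 8-bit containers: XOR the
// containers, then mask off the padding bits before testing for zero.
bool v4bi_equal(uint8_t a, uint8_t b) {
  return ((a ^ b) & 0x0F) == 0;
}
```

A direct container comparison (`a == b`) would wrongly report inequality whenever the padding bits differ, which is exactly how a full-QImode cbranch misfires here.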
[Bug target/113550] data512_t initializers dereference a clobbered register
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113550 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #5 from Richard Sandiford --- Yeah, FWIW, I agree improving the define_split is probably best. Now fixed.
[Bug target/113485] [14 regression] ICE with -fno-guess-branch-probability on aarch64 starting with r14-7187-g74e3e839ab2d36
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113485 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #9 from Richard Sandiford --- Fixed.
[Bug target/113572] [14 Regression] aarch64: internal compiler error in aarch64_sve::vector_cst_all_same
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113572 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #7 from Richard Sandiford --- Fixed.
[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #13 from Richard Sandiford --- I don't think there's any principle that upper bits must be zero. How do we end up with a pattern that depends on that being the case?
[Bug tree-optimization/113281] [14 Regression] Wrong code due to vectorization of shift reduction and missing promotions since r14-3027
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #22 from Richard Sandiford --- Taking following discussion on irc.
[Bug target/113572] [14 Regression] aarch64: internal compiler error in aarch64_sve::vector_cst_all_same
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113572 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED

--- Comment #5 from Richard Sandiford ---
(In reply to Jakub Jelinek from comment #4)
> So, if that part is right, I think we want to use VECTOR_CST_ELT instead of
> VECTOR_CST_ENCODED_ELT, like:
> --- gcc/config/aarch64/aarch64-sve-builtins.cc.jj	2024-01-12 13:47:20.815429012 +0100
> +++ gcc/config/aarch64/aarch64-sve-builtins.cc	2024-01-24 20:58:33.720677634 +0100
> @@ -3474,7 +3474,7 @@ vector_cst_all_same (tree v, unsigned in
>    unsigned int nelts = lcm * VECTOR_CST_NELTS_PER_PATTERN (v);
>    tree first_el = VECTOR_CST_ENCODED_ELT (v, 0);
>    for (unsigned int i = 0; i < nelts; i += step)
> -    if (!operand_equal_p (VECTOR_CST_ENCODED_ELT (v, i), first_el, 0))
> +    if (!operand_equal_p (VECTOR_CST_ELT (v, i), first_el, 0))
>      return false;
>
>    return true;
> which fixes the ICE.
Yeah, that's the correct fix. Sorry for missing it.
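Why VECTOR_CST_ELT is the right accessor follows from the shape of the encoding: only npatterns × nelts_per_pattern elements are stored explicitly, and later elements are extrapolated per pattern, so indexing the encoded array with a full-vector index reads past its end. The sketch below is a simplified model of that scheme, not GCC's implementation:

```cpp
#include <vector>

// Simplified model: npatterns interleaved series, each encoded by its
// first nelts_per_pattern values; element i of the full vector is
// decoded by linear extrapolation of its pattern's series.
int vector_cst_elt(const std::vector<int> &encoded, unsigned npatterns,
                   unsigned nelts_per_pattern, unsigned i) {
  if (i < encoded.size())
    return encoded[i]; // explicitly encoded element
  unsigned pattern = i % npatterns;
  unsigned pos = i / npatterns; // position within this pattern's series
  int last = encoded[pattern + (nelts_per_pattern - 1) * npatterns];
  if (nelts_per_pattern < 2)
    return last; // duplicated pattern: series is constant
  int prev = encoded[pattern + (nelts_per_pattern - 2) * npatterns];
  int step = last - prev;
  return last + step * static_cast<int>(pos - (nelts_per_pattern - 1));
}
```

For example, the stepped constant {0, 1, 2, 3, …} is encoded as {0, 1} with npatterns = 1 and nelts_per_pattern = 2: asking the encoded array directly for element 5 would be out of range, while the decoder extrapolates it. That mirrors the out-of-range ENCODED_ELT access the patch above removes.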
[Bug target/113485] [14 regression] ICE with -fno-guess-branch-probability on aarch64 starting with r14-7187-g74e3e839ab2d36
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113485 Richard Sandiford changed: What|Removed |Added Status|NEW |ASSIGNED

--- Comment #7 from Richard Sandiford ---
I suppose the ZIP1 patterns should just have 64-bit inputs, rather than going to the trouble of creating paradoxical subregs.

> cfun->machine->advsimd_zero_insn use is plain wrong. As the RTL could be
> removed fully from the RTL stream and then it will be GC'ed.
But machine_function is a GTYed structure, so the reference itself should prevent GC. I don't think we should be in the practice of explicitly ggc_free()ing RTL, since callers don't generally know what other references there might be.
[Bug target/109929] profiledbootstrap failure on aarch64-linux-gnu with graphite optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109929 --- Comment #7 from Richard Sandiford --- Hmm, yeah, like you say, neither of those commits should have made a difference to whether bootstrap works. I guess the problem is just latent now.
[Bug rtl-optimization/111267] [14 Regression] Codegen regression from i386 argument passing changes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267 --- Comment #12 from Richard Sandiford --- I don't object to the patch, but for the record: the current heuristics go back a long way. Although I reworked the pass to use rtl-ssa a few years ago, I tried as far as possible to preserve the old heuristics (tested by making sure that there were no unexplained differences over a large set of targets). I wouldn't characterise the old heuristics as a logic error. Although I didn't write them, my understanding is that they were being deliberately conservative, in particular due to the risk of introducing excess register pressure. So this change seems potentially quite invasive for stage 4. Perhaps it'll work out — if so, great! But if there is some fallout, I think we should lean towards reverting the patch and revisiting in GCC 15.
[Bug target/113196] [14 Regression] Failure to use ushll{,2}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113196 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #3 from Richard Sandiford --- Fixed.
[Bug target/112989] [14 Regression] GC ICE with C++, `#include ` and `-fsanitize=address`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112989 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #15 from Richard Sandiford --- I didn't manage to reproduce the PR in the originally reported form, but FWIW, the patches mean that a gcc_unreachable above: return decl; in simulate_builtin_function_decl no longer fires for arm_sve.h or arm_sme.h. Please reopen if there are still some lingering issues.
[Bug target/112989] [14 Regression] GC ICE with C++, `#include ` and `-fsanitize=address`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112989 --- Comment #12 from Richard Sandiford --- > another is try > #pragma GCC aarch64 "arm_sve.h" > after a couple of intentional declarations of the SVE builtins with > non-standard return/argument types and make sure that while it emits some > errors, it doesn't try to use ggc_freed decls in registered tables. FWIW, this is what the g*.target/aarch64/sve/acle/general*/func_redef_* tests are supposed to test (although not specifically targeting ggc_free).
[Bug target/112989] [14 Regression] GC ICE with C++, `#include ` and `-fsanitize=address`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112989 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #10 from Richard Sandiford --- Mine.
[Bug target/113270] [14 Regression] AArch64 ICEs in register_tuple_type since r14-6524
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113270 --- Comment #8 from Richard Sandiford --- Thanks for trying it, and sorry for not doing it myself. The patch LGTM FWIW, so preapproved if it passes testing (which I'm sure it will :))
[Bug target/113270] [14 Regression] AArch64 ICEs in register_tuple_type since r14-6524
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113270 --- Comment #6 from Richard Sandiford --- I think we want the patch in comment 3, but in addition, I then also needed to use the following for a similar SVE case: extern GTY(()) tree scalar_types[NUM_VECTOR_TYPES + 1]; tree scalar_types[NUM_VECTOR_TYPES + 1]; In this case that would mean adding: extern GTY(()) aarch64_simd_type_info aarch64_simd_types[]; just above the definition in aarch64-builtins.cc.
[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #6 from Richard Sandiford --- Fixed. Thanks for the report.