[Bug c++/115192] [11/12/13/14 regression] -O3 miscompilation on x86-64 (loops with vectors and scalars) since r11-6380
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115192 Richard Sandiford changed: What|Removed |Added Summary|[11/12/13/14/15 regression] |[11/12/13/14 regression] |-O3 miscompilation on |-O3 miscompilation on |x86-64 (loops with vectors |x86-64 (loops with vectors |and scalars) since r11-6380 |and scalars) since r11-6380 --- Comment #11 from Richard Sandiford --- Fixed on trunk, will backport if there is no reported fallout.
[Bug c++/115192] [11/12/13/14/15 regression] -O3 miscompilation on x86-64 (loops with vectors and scalars) since r11-6380
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115192 Richard Sandiford changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org
[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635 --- Comment #19 from Richard Sandiford --- (In reply to Richard Biener from comment #14) > Usually targets do have a limit on the actual length but I see > constant_upper_bound_with_limit doesn't query such. But it would > be a more appropriate way to say there might be an actual target limit here? The discussion has moved on, but FWIW: this was a deliberate choice. The thinking at the time was that VLA code should be truly “agnostic” and not hard-code an upper limit. Hard-coding a limit would be hard-coding an assumption that the architectural maximum would never increase in future. (The main counterargument was that any uses of the .B form of TBL would break down for >256-byte vectors. We hardly use such TBLs for autovec though, and could easily choose not to use them at all.) That decision is 8 or 9 years old at this point, so it might seem overly dogmatic now. Even so, I think we should have a strong reason to change tack. It shouldn't just be about trying to avoid poly_ints :)
[Bug rtl-optimization/114664] -fno-omit-frame-pointer causes an ICE during the build of the greenlet package
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114664 --- Comment #14 from Richard Sandiford --- Yeah, I think so.
[Bug rtl-optimization/114664] -fno-omit-frame-pointer causes an ICE during the build of the greenlet package
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114664 --- Comment #12 from Richard Sandiford --- (In reply to Peter Bergner from comment #11) > > > but how are users supposed to know whether > > > -fno-omit-frame-pointer is in effect or not? I've looked and there is no > > > pre-defined macro a user could check. > > That might be a useful thing to have, but if the programmer has no control > > over the build flags (i.e. cannot require/force -fomit-frame-pointer) then I > > think the asm has to take care to save and restore the frame pointer itself. > > > > Dropping "31" from the asm means that the asm must preserve the register. > > Things will go badly if the asm doesn't do that. > > So r31 which we use as our frame-pointer reg is a non-volatile/callee saved > register, so it must be saved, but I guess they (greenlet) cannot use the > method of mentioning it in the asm clobber list to perform that. I was thinking of just the asm in isolation, rather than its effect on the containing function's callee-save set. If you have: asm volatile ("…"); then GCC expects r31 after the asm to equal r31 before the asm, regardless of the -fomit-frame-pointer setting. If the asm is: asm volatile ("li r31,0"); (sorry, I've forgotten the actual Power asm :)) then things will go wrong if GCC tries to use r31 after the asm. If the asm clobbers r31 then it must mention it in the clobber list. As things stand, it's not possible to mention r31 in the clobber list if -fno-omit-frame-pointer. This means that the only option for code that wants to support -fno-omit-frame-pointer is to make the asm's contents preserve r31, using an explicit save and restore if necessary. And that kind-of makes sense. If an asm clobbers the frame pointer, that removes GCC's main option for restoring data after the asm. A lot of other clobbers would be handled by loading data relative to the frame pointer. If the frame pointer itself has gone then things get tricky.
[Bug rtl-optimization/114664] -fno-omit-frame-pointer causes an ICE during the build of the greenlet package
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114664 --- Comment #10 from Richard Sandiford --- (In reply to Peter Bergner from comment #7) > Then that would seem to indicate that mentioning the frame pointer reg in > the asm clobber list is an error Yeah, I agree it's an error. The PR says “ICE”, but is there an internal error? The “cannot be used in ‘asm’ here” is a normal user-facing error, albeit with bad error recovery, leading us to report the same thing multiple times. > but how are users supposed to know whether > -fno-omit-frame-pointer is in effect or not? I've looked and there is no > pre-defined macro a user could check. That might be a useful thing to have, but if the programmer has no control over the build flags (i.e. cannot require/force -fomit-frame-pointer) then I think the asm has to take care to save and restore the frame pointer itself. Dropping "31" from the asm means that the asm must preserve the register. Things will go badly if the asm doesn't do that.
[Bug target/114607] aarch64: Incorrect expansion of svsudot
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114607 --- Comment #2 from Richard Sandiford --- Fixed on trunk. I'll backport in a few weeks if there's no fallout.
[Bug target/114607] aarch64: Incorrect expansion of svsudot
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114607 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2024-04-05 Ever confirmed|0 |1
[Bug target/114607] New: aarch64: Incorrect expansion of svsudot
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114607 Bug ID: 114607 Summary: aarch64: Incorrect expansion of svsudot Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: wrong-code Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org Target Milestone: --- Target: aarch64*-*-* svsudot is supposed to expand to USDOT with the second and third arguments swapped. However, there is a thinko in the code that does the reversal, making it a no-op. Unfortunately, the tests simply accept the buggy form. :-( For example, gcc.target/aarch64/sve/acle/asm/sudot_s32.c contains: /* ** sudot_s32_tied1: ** usdot z0\.s, z2\.b, z4\.b ** ret */ TEST_TRIPLE_Z (sudot_s32_tied1, svint32_t, svint8_t, svuint8_t, z0 = svsudot_s32 (z0, z2, z4), z0 = svsudot (z0, z2, z4)) where the usdot z2 and z4 operands should be in the opposite order.
[Bug target/114603] aarch64: Invalid SVE cnot optimisation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114603 Richard Sandiford changed: What|Removed |Added Last reconfirmed||2024-04-05 Status|UNCONFIRMED |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #2 from Richard Sandiford --- Fixed on trunk so far, but I'll backport if possible.
[Bug target/114603] New: aarch64: Invalid SVE cnot optimisation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114603 Bug ID: 114603 Summary: aarch64: Invalid SVE cnot optimisation Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: wrong-code Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org Target Milestone: --- Target: aarch64*-*-* An overly lax condition on the cnot combine pattern means that we optimise: #include <arm_sve.h> svint32_t foo(svbool_t pg, svint32_t y) { return svsel(svcmpeq(pg, y, 0), svdup_s32(1), svdup_s32(0)); } to a single cnot: foo: cnot z0.s, p0/m, z0.s ret The result must be 0 for inactive elements of pg, whereas the above would leave the elements unchanged instead. This seems to have been around since the SVE ACLE was first added.
[Bug target/114577] Inefficient codegen for SVE/NEON bridge
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114577 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED CC||rsandifo at gcc dot gnu.org Status|UNCONFIRMED |RESOLVED --- Comment #2 from Richard Sandiford --- Fixed.
[Bug target/114521] [11 only] aarch64: wrong code with Neon ld1/st1x4 intrinsics gcc-11 and earlier
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114521 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #2 from Richard Sandiford --- Oops. I was going to upload a patch for the bug here, but it looks like I accidentally committed it while backporting PR97696 to GCC 11. The patch was g:daee0409d195d346562e423da783d5d1cf8ea175. I'm not sure what to do now. Perhaps we should leave it in?
[Bug rtl-optimization/114515] [14 Regression] Failure to use aarch64 lane forms after PR101523
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515 --- Comment #5 from Richard Sandiford --- For the record, the associated new testsuite failures are: FAIL: gcc.target/aarch64/ashltidisi.c scan-assembler-times asr 3 FAIL: gcc.target/aarch64/asimd-mull-elem.c scan-assembler-times \\s+fmul\\tv[0-9]+\\.4s, v[0-9]+\\.4s, v[0-9]+\\.s\\[0\\] 4 FAIL: gcc.target/aarch64/asimd-mull-elem.c scan-assembler-times \\s+mul\\tv[0-9]+\\.4s, v[0-9]+\\.4s, v[0-9]+\\.s\\[0\\] 4 FAIL: gcc.target/aarch64/ccmp_3.c scan-assembler-not \tcbnz\t FAIL: gcc.target/aarch64/pr100056.c scan-assembler-times \\t[us]bfiz\\tw[0-9]+, w[0-9]+, 11 2 FAIL: gcc.target/aarch64/pr100056.c scan-assembler-times \\tadd\\tw[0-9]+, w[0-9]+, w[0-9]+, uxtb\\n 2 FAIL: gcc.target/aarch64/pr108840.c scan-assembler-not and\\tw[0-9]+, w[0-9]+, 31 FAIL: gcc.target/aarch64/pr112105.c scan-assembler-not \\tdup\\t FAIL: gcc.target/aarch64/pr112105.c scan-assembler-times (?n)\\tfmul\\t.*v[0-9]+\\.s\\[0\\]\\n 2 FAIL: gcc.target/aarch64/rev16_2.c scan-assembler-times rev16\\tx[0-9]+ 2 FAIL: gcc.target/aarch64/vaddX_high_cost.c scan-assembler-not dup\\t FAIL: gcc.target/aarch64/vmul_element_cost.c scan-assembler-not dup\\t FAIL: gcc.target/aarch64/vmul_high_cost.c scan-assembler-not dup\\t FAIL: gcc.target/aarch64/vsubX_high_cost.c scan-assembler-not dup\\t FAIL: gcc.target/aarch64/sve/pr98119.c scan-assembler \\tand\\tx[0-9]+, x[0-9]+, #?-31\\n FAIL: gcc.target/aarch64/sve/pred-not-gen-1.c scan-assembler-not \\tbic\\t FAIL: gcc.target/aarch64/sve/pred-not-gen-1.c scan-assembler-times \\tnot\\tp[0-9]+\\.b, p[0-9]+/z, p[0-9]+\\.b\\n 1 FAIL: gcc.target/aarch64/sve/pred-not-gen-4.c scan-assembler-not \\tbic\\t FAIL: gcc.target/aarch64/sve/pred-not-gen-4.c scan-assembler-times \\tnot\\tp[0-9]+\\.b, p[0-9]+/z, p[0-9]+\\.b\\n 1 FAIL: gcc.target/aarch64/sve/var_stride_2.c scan-assembler-times \\tubfiz\\tx[0-9]+, x2, 10, 16\\n 1 FAIL: gcc.target/aarch64/sve/var_stride_2.c scan-assembler-times \\tubfiz\\tx[0-9]+, x3, 10, 16\\n 1 
FAIL: gcc.target/aarch64/sve/var_stride_4.c scan-assembler-times \\tsbfiz\\tx[0-9]+, x2, 10, 32\\n 1 FAIL: gcc.target/aarch64/sve/var_stride_4.c scan-assembler-times \\tsbfiz\\tx[0-9]+, x3, 10, 32\\n 1
[Bug rtl-optimization/114515] [14 Regression] Failure to use aarch64 lane forms after PR101523
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515 --- Comment #4 from Richard Sandiford --- (In reply to Richard Biener from comment #1) > Btw, why does forwprop not do this? Not 100% sure (I wasn't involved in choosing the current heuristics). But fwprop can propagate across blocks, so there is probably more risk of increasing register pressure.
[Bug rtl-optimization/114515] [14 Regression] Failure to use aarch64 lane forms after PR101523
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515 --- Comment #3 from Richard Sandiford --- In RTL terms, the dup is vec_duplicate. The combination is: Trying 10 -> 13: 10: r107:V4SF=vec_duplicate(r115:SF) REG_DEAD r115:SF 13: r110:V4SF=r111:V4SF*r107:V4SF REG_DEAD r111:V4SF Failed to match this instruction: (parallel [ (set (reg:V4SF 110 [ _2 ]) (mult:V4SF (vec_duplicate:V4SF (reg:SF 115)) (reg:V4SF 111 [ *ptr_6(D) ]))) (set (reg:V4SF 107) (vec_duplicate:V4SF (reg:SF 115))) ]) Failed to match this instruction: (parallel [ (set (reg:V4SF 110 [ _2 ]) (mult:V4SF (vec_duplicate:V4SF (reg:SF 115)) (reg:V4SF 111 [ *ptr_6(D) ]))) (set (reg:V4SF 107) (vec_duplicate:V4SF (reg:SF 115))) ]) Successfully matched this instruction: (set (reg:V4SF 107) (vec_duplicate:V4SF (reg:SF 115))) Successfully matched this instruction: (set (reg:V4SF 110 [ _2 ]) (mult:V4SF (vec_duplicate:V4SF (reg:SF 115)) (reg:V4SF 111 [ *ptr_6(D) ]))) allowing combination of insns 10 and 13 original costs 8 + 20 = 28 replacement costs 8 + 20 = 28 modifying insn i2 10: r107:V4SF=vec_duplicate(r115:SF) deferring rescan insn with uid = 10. modifying insn i3 13: r110:V4SF=vec_duplicate(r115:SF)*r111:V4SF REG_DEAD r115:SF REG_DEAD r111:V4SF deferring rescan insn with uid = 13.
[Bug rtl-optimization/114515] New: [14 Regression] Failure to use aarch64 lane forms after PR101523
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515 Bug ID: 114515 Summary: [14 Regression] Failure to use aarch64 lane forms after PR101523 Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org Target Milestone: --- The following test regressed on aarch64 after g:839bc42772ba7af66af3bd16efed4a69511312ae (the fix for PR101523): typedef float v4sf __attribute__((vector_size(16))); void f (v4sf *ptr, float f) { ptr[0] = ptr[0] * (v4sf) { f, f, f, f }; ptr[1] = ptr[1] * (v4sf) { f, f, f, f }; } Compiled with -O2, we previously generated: ldp q1, q31, [x0] fmul v1.4s, v1.4s, v0.s[0] fmul v31.4s, v31.4s, v0.s[0] stp q1, q31, [x0] ret Now we generate: ldp q1, q31, [x0] dup v0.4s, v0.s[0] fmul v1.4s, v1.4s, v0.4s fmul v31.4s, v31.4s, v0.4s stp q1, q31, [x0] ret with the extra dup. The patch is trying to avoid cases where i3 is canonicalised by contextual information provided by i2. But here we place a full copy of i2 into i3 (creating an instruction that is no more expensive). This is a benefit in its own right because the two instructions can then execute in parallel rather than serially. But it also means that, as here, we might be able to remove i2 with later combinations. Perhaps we could also check whether i3 still contains the destination of i2?
[Bug sanitizer/97696] ICE since ASAN_MARK does not handle poly_int sized variables
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97696 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #9 from Richard Sandiford --- Fixed on trunk and all active release branches.
[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302 --- Comment #5 from Richard Sandiford --- (In reply to Andrew Stubbs from comment #4) > Yes, that's what the simd-math-3* tests do. Ah, OK. > The simd-math-5* tests are explicitly supposed to be doing this in the > context of the autovectorizer. > > If these tests are being compiled as (newly) intended then we should change > the expected results. > > So, questions: > > 1. Are the new results actually correct? (So far I only know that being > different is expected.) I believe so. We now do the division in 32 bits, as in the original gimple. > 2. Is there some other testcase form that would exercise the previously > intended routines? It should be possible in languages that don't have C's integer promotion rules, if you're up for some Ada or Rust. > 3. Is the new behaviour configurable? I don't think the 16-bit shift bug > ever existed on GCN (in which "short" vectors actually have excess bits in > each lane, much like scalar registers do). Not AFAIK. The problem is that the gimple→gimple transformation changes the gimple-level semantics of the code. Shifts by out-of-range values are undefined rather than target-defined. (And in other cases that's useful, because it means we don't need to preserve whatever value the target happens to give for an out-of-range shift.)
[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302 --- Comment #3 from Richard Sandiford --- Ah, ok. If the main aim is to test the libgcc routines, it might be safer to use something like: typedef char v64qi __attribute__((vector_size(64))); v64qi f(v64qi x, v64qi y) { return x / y; } instead of relying on vectorisation.
[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302 --- Comment #1 from Richard Sandiford --- The decision to stop narrowing division was deliberate, see the comments in PR113281 for details. Is the purpose of the test to check vectorisation quality, or to check for the right ABI routines?
[Bug tree-optimization/114234] New: [14 Regression] verify_ssa failure with early-break vectorisation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114234 Bug ID: 114234 Summary: [14 Regression] verify_ssa failure with early-break vectorisation Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: ice-on-valid-code Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org Target Milestone: --- The following test ICEs with -Ofast on aarch64: void bar(); float foo (float x) { float a = 1; float b = x; long z = 200; for (;;) { float c = b - 1.0f; a *= c; z -= 1; if (z == 0) { bar (); break; } if (b <= 3.0f) break; b = c; } return a * b; } (reduced from wrf). The ICE is: foo.c:3:1: error: definition in block 15 does not dominate use in block 10 3 | foo (float x) | ^~~ for SSA_NAME: stmp_a_9.10_103 in statement: a_47 = PHI PHI argument stmp_a_9.10_103 for PHI node a_47 = PHI during GIMPLE pass: vect
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 Richard Sandiford changed: What|Removed |Added Attachment #57602|0 |1 is obsolete|| --- Comment #42 from Richard Sandiford --- Created attachment 57605 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57605&action=edit proof-of-concept patch to suppress peeling for gaps How about the attached? It records whether all accesses that require peeling for gaps could instead have used gathers, and only retries when that's true. It means that we retry for only 0.034% of calls to vect_analyze_loop_1 in a build of SPEC2017 with -mcpu=neoverse-v1 -Ofast -fomit-frame-pointer. The figures exclude wrf, which failed for me with: module_mp_gsfcgce.fppized.f90:852:23: 852 |REAL FUNCTION ggamma(X) | ^ Error: definition in block 18 does not dominate use in block 13 for SSA_NAME: stmp_pf_6.5657_140 in statement: pf_81 = PHI PHI argument stmp_pf_6.5657_140 for PHI node pf_81 = PHI during GIMPLE pass: vect module_mp_gsfcgce.fppized.f90:852:23: internal compiler error: verify_ssa failed Will look at that tomorrow.
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #41 from Richard Sandiford --- (In reply to Richard Biener from comment #40) > So I wonder if we can use "local costing" to decide a gather is always OK > compared to the alternative with peeling for gaps. On x86 gather tends > to be slow compared to open-coding it. Yeah, on SVE gathers are generally “enabling” instructions rather than something to use for their own sake. I suppose one problem is that we currently only try to use gathers for single-element groups. If we make a local decision to use gathers while keeping that restriction, we could end up using gathers “unnecessarily” while still needing to peel for gaps for (say) a two-element group. That is, it's only better to use gathers than contiguous loads if by doing that we avoid all need to peel for gaps (and if the cost of peeling for gaps was high enough to justify the cost of using gathers over consecutive loads). One of the things on the list to do (once everything is SLP!) is to support loads with gaps directly via predication, so that we never load elements that aren't needed. E.g. on SVE, a 64-bit predicate (PTRUE .D) can be used with a 32-bit load (LD1W .S) to load only even-indexed elements. So a single-element group with a group size of 2 could be done cheaply with just consecutive loads, without peeling for gaps.
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #39 from Richard Sandiford --- (In reply to Richard Sandiford from comment #38) > (In reply to Richard Biener from comment #37) > > Even more iteration looks bad. I do wonder why when gather can avoid > > peeling for GAPs using load-lanes cannot? > Like you say, we don't realise that all the loads from array3[i] form a > single group. Oops, sorry, I shouldn't have gone off memory. So yeah, it's array1[] where that happens, not array3[]. The reason we don't use load-lanes is that we don't have load-lane instructions for smaller elements in larger containers, so we're forced to use load-and-permute instead.
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #38 from Richard Sandiford --- (In reply to Richard Biener from comment #37) > Even more iteration looks bad. I do wonder why when gather can avoid > peeling for GAPs using load-lanes cannot? Like you say, we don't realise that all the loads from array3[i] form a single group. Note that we're not using load-lanes in either case, since the group size (8) is too big for that. But load-lanes and load-and-permute have the same restriction about when peeling for gaps is required. In contrast, gather loads only ever load data that they actually need. > Also for the stores we seem to use elementwise stores rather than store-lanes. What configuration are you trying? The original report was about SVE, so I was trying that. There we use a scatter store. > To me the most obvious thing to try optimizing in this testcase is DR > analysis. With -march=armv8.3-a I still see > > t.c:26:22: note: === vect_analyze_data_ref_accesses === > t.c:26:22: note: Detected single element interleaving array1[0][_8] step 4 > t.c:26:22: note: Detected single element interleaving array1[1][_8] step 4 > t.c:26:22: note: Detected single element interleaving array1[2][_8] step 4 > t.c:26:22: note: Detected single element interleaving array1[3][_8] step 4 > t.c:26:22: note: Detected single element interleaving array1[0][_1] step 4 > t.c:26:22: note: Detected single element interleaving array1[1][_1] step 4 > t.c:26:22: note: Detected single element interleaving array1[2][_1] step 4 > t.c:26:22: note: Detected single element interleaving array1[3][_1] step 4 > t.c:26:22: missed: not consecutive access array2[_4][_8] = _69; > t.c:26:22: note: using strided accesses > t.c:26:22: missed: not consecutive access array2[_4][_1] = _67; > t.c:26:22: note: using strided accesses > > so we don't figure > > Creating dr for array1[0][_1] > base_address: > offset from base address: (ssizetype) ((sizetype) (m_111 * 2) * 2) > constant offset from base 
address: 0 > step: 4 > base alignment: 16 > base misalignment: 0 > offset alignment: 4 > step alignment: 4 > base_object: array1 > Access function 0: {m_111 * 2, +, 2}_4 > Access function 1: 0 > Creating dr for array1[0][_8] > analyze_innermost: success. > base_address: > offset from base address: (ssizetype) ((sizetype) (m_111 * 2 + 1) * > 2) > constant offset from base address: 0 > step: 4 > base alignment: 16 > base misalignment: 0 > offset alignment: 2 > step alignment: 4 > base_object: array1 > Access function 0: {m_111 * 2 + 1, +, 2}_4 > Access function 1: 0 > > belong to the same group (but the access functions tell us it worked out). > Above we fail to split the + 1 to the constant offset. OK, but this is moving the question on to how we should optimise the testcase for Advanced SIMD rather than SVE, and how we should optimise the testcase in general, rather than simply recover what we could do before. (SVE is only enabled for -march=armv9-a and above, in case armv8.3-a was intended to enable SVE too.)
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #36 from Richard Sandiford --- Created attachment 57602 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57602&action=edit proof-of-concept patch to suppress peeling for gaps This patch does what I suggested in the previous comment: if the loop needs peeling for gaps, try again without that, and pick the better loop. It seems to restore the original style of code for SVE. A more polished version would be a bit smarter about when to retry. E.g. it's pointless if the main loop already operates on full vectors (i.e. if peeling 1 iteration is natural in any case). Perhaps the condition should be that either (a) the number of epilogue iterations is known to be equal to the VF of the main loop or (b) the target is known to support partial vectors for the loop's vector_mode. Any thoughts?
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #35 from Richard Sandiford --- Maybe I've misunderstood the flow of the ticket, but it looks to me like we do still correctly recognise the truncating scatter stores. And, on their own, we would be able to convert them into masked scatters. The reason for the epilogue is instead on the load side. There we have a non-strided grouped load, and currently we hard-code the assumption that it is better to use contiguous loads and permutes rather than gather loads where possible. So we have: /* As a last resort, trying using a gather load or scatter store. ??? Although the code can handle all group sizes correctly, it probably isn't a win to use separate strided accesses based on nearby locations. Or, even if it's a win over scalar code, it might not be a win over vectorizing at a lower VF, if that allows us to use contiguous accesses. */ if (*memory_access_type == VMAT_ELEMENTWISE && single_element_p && loop_vinfo && vect_use_strided_gather_scatters_p (stmt_info, loop_vinfo, masked_p, gs_info)) *memory_access_type = VMAT_GATHER_SCATTER; only after we've tried and failed to use load lanes or load+permute. If instead I change the order so that the code above is tried first, then we do use extending gather loads and truncating scatter stores as before, with no epilogue loop. So I suppose the question is: if we do prefer to use gathers over load+permute for some cases, how do we decide which to use? And can it be done a per-load basis, or should it instead be a per-loop decision? E.g., if we end up with a loop that needs peeling for gaps, perhaps we should try again and forbid peeling for gaps. Then, if that succeeds, see which loop gives the better overall cost. Of course, trying more things means more compile time…
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #33 from Richard Sandiford --- Can you give me a chance to look at it a bit when I'm back? This doesn't feel like the way to go to me.
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #31 from Richard Sandiford --- (In reply to Tamar Christina from comment #29) > This works fine for normal gather and scatters but doesn't work for widening > gathers and narrowing scatters which only the pattern seems to handle. I'm supposedly on holiday, so didn't see the IRC discussion, but: as I remember it, there is no narrowing or widening for IFN gathers or scatters as such, even for patterns. One vector's worth of offsets corresponds to one vector's worth of data. But the widths of the data elements and the offset elements can be different. Any sign or zero extension of a loaded vector, or any operation to double or halve the number of vectors, is done separately. I think it does make sense to stick to that, rather than (say) have IFNs that load two offset vectors into one data vector, or use one offset vector to load two data vectors. Supporting those combinations would mean that we have two different ways in which the offset elements and data elements have different widths. And it isn't really a natural fusion on SVE.
[Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #8 from Richard Sandiford --- The reason early_ra doesn't help with the original testcase is that early_ra punts on any non-move instruction that has a hard register destination. And it does that because it can't cope well with cases where hard-coded destinations force the wrong choice (unlike the proper allocators, which can change the destination where necessary). The restriction is needed to avoid regressing SVE ACLE tests.
[Bug sanitizer/97696] ICE since ASAN_MARK does not handle poly_int sized variables
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97696 --- Comment #3 from Richard Sandiford --- Created attachment 57520 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57520&action=edit Candidate patch The attached patch seems to fix it. I'm taking next week off, but I'll run the patch through proper testing when I get back.
[Bug sanitizer/97696] ICE since ASAN_MARK does not handle poly_int sized variables
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97696 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org
[Bug middle-end/113205] [14 Regression] internal compiler error: in backward_pass, at tree-vect-slp.cc:5346 since r14-3220
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113205 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #14 from Richard Sandiford --- Finally fixed.
[Bug middle-end/113205] [14 Regression] internal compiler error: in backward_pass, at tree-vect-slp.cc:5346 since r14-3220
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113205 --- Comment #12 from Richard Sandiford --- Created attachment 57511 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57511&action=edit Candidate patch Sorry for the very slow response on this. I'm testing the attached.
[Bug middle-end/113205] [14 Regression] internal compiler error: in backward_pass, at tree-vect-slp.cc:5346 since r14-3220
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113205 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED
[Bug target/112922] [14 Regression] 465.tonto from SPECFP 2006 fails train run on Aarch64-linux with -O2 and -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112922 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|UNCONFIRMED |RESOLVED --- Comment #3 from Richard Sandiford --- Assumed fixed by the patches for PR113295. Please reopen if not.
[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 Bug 26163 depends on bug 112922, which changed state. Bug 112922 Summary: [14 Regression] 465.tonto from SPECFP 2006 fails train run on Aarch64-linux with -O2 and -flto https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112922 What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED
[Bug target/113295] [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 when built with -Ofast -mcpu=native since g:2f46e3578d45ff060a0a329cb39d4f52878f9d5a
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #9 from Richard Sandiford --- Fixed.
[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 Bug 26163 depends on bug 113295, which changed state. Bug 113295 Summary: [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 when built with -Ofast -mcpu=native since g:2f46e3578d45ff060a0a329cb39d4f52878f9d5a https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295 What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED
[Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #9 from Richard Sandiford --- Fixed.
[Bug target/113295] [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 when built with -Ofast -mcpu=native since g:2f46e3578d45ff060a0a329cb39d4f52878f9d5a
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295 --- Comment #6 from Richard Sandiford --- For me the miscompilation is in jkdmem_, where we end up allocating the same registers to both arms of an fcsel. It sounds like it occurs elsewhere too. I have a candidate fix, but need to think a bit more about it.
[Bug preprocessor/114007] gcc chokes on __has_cpp_attribute(clang::unsafe_buffer_usage)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114007 --- Comment #14 from Richard Sandiford --- I might have misunderstood the suggestion and so be arguing against something that no-one is suggesting, but I think [[__extension__ …]] should accept the same things for all standard versions (C23, pre-C23, and GNU). It was intended to be something that header files and macros could use without needing to be sensitive to the user's choice of standard.
[Bug target/113995] ICE: in change_address_1, at emit-rtl.cc:2299 with [[arm::streaming_compatible]] and -march=armv9-a+sve -finstrument-functions -fstack-clash-protection
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113995 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #4 from Richard Sandiford --- Fixed.
[Bug target/113220] [aarch64] ICE Segmentation fault with r14-6178-g8d29b7aca15133
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113220 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #5 from Richard Sandiford --- Fixed
[Bug target/113295] [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 when built with -Ofast -mcpu=native since g:2f46e3578d45ff060a0a329cb39d4f52878f9d5a
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #5 from Richard Sandiford --- Mine. Could be the same as PR112922.
[Bug target/112922] [14 Regression] 465.tonto from SPECFP 2006 fails train run on Aarch64-linux with -O2 and -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112922 --- Comment #2 from Richard Sandiford --- I don't remember there being a deliberate bug fix in that patch, but there were some others later. I suppose the optimistic case is that this first went latent and then was fixed “properly” afterwards. But it could just be latent.
[Bug target/113778] ICE: RTL check: expected elt 3 type 'e' or 'u', have '0' (rtx code_label) in PATTERN, at rtl.h:1506 with -march=armv9-a+sme -fhardened
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113778 Richard Sandiford changed: What|Removed |Added Resolution|--- |DUPLICATE Status|ASSIGNED|RESOLVED --- Comment #4 from Richard Sandiford --- Confirmed that this is the same as PR113220. *** This bug has been marked as a duplicate of bug 113220 ***
[Bug target/113220] [aarch64] ICE Segmentation fault with r14-6178-g8d29b7aca15133
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113220 Richard Sandiford changed: What|Removed |Added CC||zsojka at seznam dot cz --- Comment #3 from Richard Sandiford --- *** Bug 113778 has been marked as a duplicate of this bug. ***
[Bug target/113805] sme2 vs -mtrack-speculation, ICE in extract_insn, at recog.cc:2812 with r14-6178-g8d29b7aca15133
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113805 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #3 from Richard Sandiford --- Fixed.
[Bug target/113995] ICE: in change_address_1, at emit-rtl.cc:2299 with [[arm::streaming_compatible]] and -march=armv9-a+sve -finstrument-functions -fstack-clash-protection
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113995 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #2 from Richard Sandiford --- Possibly a dup of some of the other PRs in this area.
[Bug middle-end/113988] during GIMPLE pass: bitintlower: internal compiler error: in lower_stmt, at gimple-lower-bitint.cc:5470
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #16 from Richard Sandiford --- I think it'd be worth splitting the mode selection out into a helper called bitwise_mode_for_size, to go alongside bitwise_mode_for_mode.
[Bug target/113778] ICE: RTL check: expected elt 3 type 'e' or 'u', have '0' (rtx code_label) in PATTERN, at rtl.h:1506 with -march=armv9-a+sme -fhardened
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113778 Richard Sandiford changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1 Last reconfirmed||2024-02-15 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #3 from Richard Sandiford --- Possibly the same as PR113220
[Bug target/113805] sme2 vs -mtrack-speculation, ICE in extract_insn, at recog.cc:2812 with r14-6178-g8d29b7aca15133
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113805 Richard Sandiford changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2024-02-12 Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #1 from Richard Sandiford --- Heh. I knew this was coming after seeing the previous -mtrack-speculation bug, but didn't have a chance to fix it pre-emptively last week.
[Bug target/113763] [14 Regression] build fails with clang++ host compiler because aarch64.cc uses C++14 constexpr.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113763 --- Comment #14 from Richard Sandiford --- AFAIK, the constructor shouldn't be necessary. (And without it, the whole thing would fit on one line.) LGTM (and preapproved) otherwise. Thanks for doing this.
[Bug target/113763] [14 Regression] build fails with clang++ host compiler because aarch64.cc uses C++14 constexpr.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113763 --- Comment #11 from Richard Sandiford --- Currently away so can't try it myself, but how about just using an ad-hoc structure instead?
[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #34 from Richard Sandiford --- (In reply to Richard Biener from comment #32) > Btw, AVX512 knotb will invert all 8 bits and there's no knot just affecting > the lowest 4 or 2 bits. > > It all feels like desaster waiting to happen ;) Yes :) > For example BIT_NOT_EXPR is RTL expanded like > > case BIT_NOT_EXPR: > op0 = expand_expr (treeop0, subtarget, > VOIDmode, EXPAND_NORMAL); > if (modifier == EXPAND_STACK_PARM) > target = 0; > /* In case we have to reduce the result to bitfield precision > for unsigned bitfield expand this as XOR with a proper constant > instead. */ > if (reduce_bit_field && TYPE_UNSIGNED (type)) > { > int_mode = SCALAR_INT_TYPE_MODE (type); > wide_int mask = wi::mask (TYPE_PRECISION (type), > false, GET_MODE_PRECISION (int_mode)); > > temp = expand_binop (int_mode, xor_optab, op0, > immed_wide_int_const (mask, int_mode), > target, 1, OPTAB_LIB_WIDEN); > > so we could, for VECTOR_BOOLEAN_TYPE_P with integer mode and > effective bit-precision set reduce_bit_field and fixup the fallout > (not sure why the above is only for TYPE_UNSIGNED). > > At least it feels similar and doing things the opposite for vectors > (fixing up at uses) would be odd? Do you know why we take this approach for integers? Is it for correctness? Or is it supposed to be more optimal? I can imagine that, for arithmetic types, there are going to be many more instances where upper bits matter (division, right shifts, MIN/MAX, etc.). So perhaps reducing every result is a good trade-off there. But there's an argument that it should be rare for the padding bits in a vector to matter, since very few things would look at the padding bits anyway. So perhaps the cost should be borne by the operations that need canonical integers. Not a strong opinion though, more just devil's advocate. There again, if e.g. the x86 API guarantees memcmp equality between two masks whose significant bits are equal, then we probably have no choice.
[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #30 from Richard Sandiford --- (In reply to Richard Biener from comment #29) > But that's just for CONSTRUCTORs, we got the VIEW_CONVERT_EXPR path for > VECTOR_CSTs. But yeah, that _might_ argue we should perform the same > masking for VECTOR_CST expansion as well, instead of trying to fixup > in do_compare_and_jump? But then how would ~ be implemented for things like 4-bit masks? If we use notqi2 then I assume the upper bits could be 1 rather than 0.
[Bug debug/113636] [14 Regression] internal compiler error: in dead_debug_global_find, at valtrack.cc:275 since r14-6290-g9f0f7d802482a8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113636 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #12 from Richard Sandiford --- Fixed. Thanks for the report and help with reproducing.
[Bug target/113623] [14 Regression] ICE in aarch64_pair_mem_from_base since r14-6605
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113623 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #8 from Richard Sandiford --- Fixed.
[Bug target/111677] [12/13 Regression] darktable build on aarch64 fails with unrecognizable insn due to -fstack-protector changes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111677 --- Comment #21 from Richard Sandiford --- (In reply to Alex Coplan from comment #13) > The problem seems to be this code in aarch64_process_components: > > while (regno != last_regno) > { > bool frame_related_p = aarch64_emit_cfi_for_reg_p (regno); > machine_mode mode = aarch64_reg_save_mode (regno); > > rtx reg = gen_rtx_REG (mode, regno); > poly_int64 offset = frame.reg_offset[regno]; > if (frame_pointer_needed) > offset -= frame.bytes_below_hard_fp; > > rtx addr = plus_constant (Pmode, ptr_reg, offset); > rtx mem = gen_frame_mem (mode, addr); > > which emits a TFmode mem with offset 512, which is out of range for TFmode > (so we later ICE with an unrecognisable insn). Presumably this just needs > tweaking to emit a new base anchor in the case of large offsets like this. > It looks like the code in aarch64_save_callee_saves already does this. We shouldn't emit new anchor registers here, since unlike in the prologue, we don't have any guarantee that certain registers are free. aarch64_get_separate_components is supposed to vet shrink-wrappable offsets, but in this case the offset looks valid, since: str q22, [sp, #512] is a valid instruction. Perhaps the constraints are too narrow?
[Bug target/113623] [14 Regression] ICE in aarch64_pair_mem_from_base since r14-6605
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113623 Richard Sandiford changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #6 from Richard Sandiford --- Mine.
[Bug debug/113636] [14 Regression] internal compiler error: in dead_debug_global_find, at valtrack.cc:275 since r14-6290-g9f0f7d802482a8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113636 Richard Sandiford changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Last reconfirmed||2024-01-29 --- Comment #10 from Richard Sandiford --- Mine.
[Bug tree-optimization/113281] Wrong code due to vectorization of shift reduction and missing promotions since r14-3027
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281 --- Comment #24 from Richard Sandiford --- Fixed on trunk so far, but it's latent on branches. I'll see what the trunk fallout is like before asking about backports.
[Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613 Richard Sandiford changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #7 from Richard Sandiford --- early-ra does try to avoid reusing registers too soon, to increase scheduling freedom. But in this case I imagine it handles the two statements as separate regions. Should be fixable by carrying across a round-robin counter.
[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #18 from Richard Sandiford --- (In reply to Tamar Christina from comment #17) > Well the mid-end has generated the right precision. The type it generates is > vector(4) vexit_reduc_67; > so it does say it's a single bit boolean. > > Isn't this just an expand problem? That's what I meant. expand is using a QImode comparison to compare things with 4-bit precision, so I think the masking should happen at that point. How about doing the masking in do_compare_and_jump?
[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #16 from Richard Sandiford --- (In reply to Richard Biener from comment #15) > I think the problem is the cbranch pattern which looks at all of the > QImode mask - but of course it doesn't know it's really V4BImode it's > working on ... Yeah. Currently building an x86_64 toolchain to have a look, but I think whatever code uses a cbranch with a higher precision than the inputs should mask off the significant bits beforehand. cbranch can also be used for comparing two variable masks too. I suppose we could add a hook to say that padding bits of an integer mask must always be zero (and so it's expand's job to ensure that that holds for any mask operation). But it feels dangerously close to TRULY_NOOP_TRUNCATION for integers.
[Bug target/113550] data512_t initializers dereference a clobbered register
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113550 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #5 from Richard Sandiford --- Yeah, FWIW, I agree improving the define_split is probably best. Now fixed.
[Bug target/113485] [14 regression] ICE with -fno-guess-branch-probability on aarch64 starting with r14-7187-g74e3e839ab2d36
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113485 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #9 from Richard Sandiford --- Fixed.
[Bug target/113572] [14 Regression] aarch64: internal compiler error in aarch64_sve::vector_cst_all_same
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113572 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #7 from Richard Sandiford --- Fixed.
[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #13 from Richard Sandiford --- I don't think there's any principle that upper bits must be zero. How do we end up with a pattern that depends on that being the case?
[Bug tree-optimization/113281] [14 Regression] Wrong code due to vectorization of shift reduction and missing promotions since r14-3027
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #22 from Richard Sandiford --- Taking following discussion on irc.
[Bug target/113572] [14 Regression] aarch64: internal compiler error in aarch64_sve::vector_cst_all_same
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113572 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #5 from Richard Sandiford --- (In reply to Jakub Jelinek from comment #4) > So, if that part is right, I think we want to use VECTOR_CST_ELT instead of > VECTOR_CST_ENCODED_ELT, like: > --- gcc/config/aarch64/aarch64-sve-builtins.cc.jj 2024-01-12 > 13:47:20.815429012 +0100 > +++ gcc/config/aarch64/aarch64-sve-builtins.cc 2024-01-24 > 20:58:33.720677634 > +0100 > @@ -3474,7 +3474,7 @@ vector_cst_all_same (tree v, unsigned in >unsigned int nelts = lcm * VECTOR_CST_NELTS_PER_PATTERN (v); >tree first_el = VECTOR_CST_ENCODED_ELT (v, 0); >for (unsigned int i = 0; i < nelts; i += step) > -if (!operand_equal_p (VECTOR_CST_ENCODED_ELT (v, i), first_el, 0)) > +if (!operand_equal_p (VECTOR_CST_ELT (v, i), first_el, 0)) >return false; > >return true; > which fixes the ICE. Yeah, that's the correct fix. Sorry for missing it.
[Bug target/113485] [14 regression] ICE with -fno-guess-branch-probability on aarch64 starting with r14-7187-g74e3e839ab2d36
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113485 Richard Sandiford changed: What|Removed |Added Status|NEW |ASSIGNED --- Comment #7 from Richard Sandiford --- I suppose the ZIP1 patterns should just have 64-bit inputs, rather than going to the trouble of creating paradoxical subregs. > cfun->machine->advsimd_zero_insn use is plain wrong. As the RTL could be > removed fully from the RTL stream and then it will be GC'ed. But machine_function is a GTYed structure, so the reference itself should prevent GC. I don't think we should be in the practice of explicitly ggc_free()ing RTL, since callers don't generally know what other references there might be.
[Bug target/109929] profiledbootstrap failure on aarch64-linux-gnu with graphite optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109929 --- Comment #7 from Richard Sandiford --- Hmm, yeah, like you say, neither of those commits should have made a difference to whether bootstrap works. I guess the problem is just latent now.
[Bug rtl-optimization/111267] [14 Regression] Codegen regression from i386 argument passing changes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267 --- Comment #12 from Richard Sandiford --- I don't object to the patch, but for the record: the current heuristics go back a long way. Although I reworked the pass to use rtl-ssa a few years ago, I tried as far as possible to preserve the old heuristics (tested by making sure that there were no unexplained differences over a large set of targets). I wouldn't characterise the old heuristics as a logic error. Although I didn't write them, my understanding is that they were being deliberately conservative, in particular due to the risk of introducing excess register pressure. So this change seems potentially quite invasive for stage 4. Perhaps it'll work out — if so, great! But if there is some fallout, I think we should lean towards reverting the patch and revisiting in GCC 15.
[Bug target/113196] [14 Regression] Failure to use ushll{,2}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113196 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #3 from Richard Sandiford --- Fixed.
[Bug target/112989] [14 Regression] GC ICE with C++, `#include <arm_sve.h>` and `-fsanitize=address`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112989 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #15 from Richard Sandiford --- I didn't manage to reproduce the PR in the originally reported form, but FWIW, the patches mean that a gcc_unreachable placed just above the `return decl;` in simulate_builtin_function_decl no longer fires for arm_sve.h or arm_sme.h. Please reopen if there are still some lingering issues.
[Bug target/112989] [14 Regression] GC ICE with C++, `#include <arm_sve.h>` and `-fsanitize=address`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112989 --- Comment #12 from Richard Sandiford --- > another is try > #pragma GCC aarch64 "arm_sve.h" > after a couple of intentional declarations of the SVE builtins with > non-standard return/argument types and make sure that while it emits some > errors, it doesn't try to use ggc_freed decls in registered tables. FWIW, this is what the g*.target/aarch64/sve/acle/general*/func_redef_* tests are supposed to test (although not specifically targeting ggc_free).
[Bug target/112989] [14 Regression] GC ICE with C++, `#include <arm_sve.h>` and `-fsanitize=address`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112989 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #10 from Richard Sandiford --- Mine.
[Bug target/113270] [14 Regression] AArch64 ICEs in register_tuple_type since r14-6524
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113270 --- Comment #8 from Richard Sandiford --- Thanks for trying it, and sorry for not doing it myself. The patch LGTM FWIW, so preapproved if it passes testing (which I'm sure it will :))
[Bug target/113270] [14 Regression] AArch64 ICEs in register_tuple_type since r14-6524
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113270 --- Comment #6 from Richard Sandiford --- I think we want the patch in comment 3, but in addition, I then also needed to use the following for a similar SVE case: extern GTY(()) tree scalar_types[NUM_VECTOR_TYPES + 1]; tree scalar_types[NUM_VECTOR_TYPES + 1]; In this case that would mean adding: extern GTY(()) aarch64_simd_type_info aarch64_simd_types[]; just above the definition in aarch64-builtins.cc.
[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #6 from Richard Sandiford --- Fixed. Thanks for the report.
[Bug c++/68703] __attribute__((vector_size(N))) template member confusion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68703 Richard Sandiford changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #11 from Richard Sandiford --- FWIW, the following adaptation of the original testcase still fails on trunk, but is accepted by Clang: template<int N = 4> struct D { using t = int __attribute__((vector_size(N * sizeof(int)))); t v; int f1() { return this->v[N-1]; } int f2() { return v[N-1]; } }; int main(int ac, char**) { D<> d = { { ac } }; return d.f1() + d.f2(); } Same with a typedef instead of "using". But that's probably just another instance of PR88600/PR58855.
[Bug target/113220] [aarch64] ICE Segmentation fault with r14-6178-g8d29b7aca15133
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113220 Richard Sandiford changed: What|Removed |Added CC|richard.sandiford at arm dot com |rsandifo at gcc dot gnu.org Last reconfirmed||2024-01-03 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Ever confirmed|0 |1 Status|UNCONFIRMED |ASSIGNED --- Comment #1 from Richard Sandiford --- Mine
[Bug target/113196] [14 Regression] Failure to use ushll{,2}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113196 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1 Last reconfirmed||2024-01-02 --- Comment #1 from Richard Sandiford --- Testing a patch that does that. I think it'll depend on late-combine to undo the split in cases where it isn't profitable.
[Bug target/113196] New: [14 Regression] Failure to use ushll{,2}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113196 Bug ID: 113196 Summary: [14 Regression] Failure to use ushll{,2} Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org CC: tnfchris at gcc dot gnu.org Target Milestone: --- Target: aarch64*-*-* For this testcase, adapted from the one for PR110625: int test(unsigned array[4][4]); int foo(unsigned short *a, unsigned long n) { unsigned array[4][4]; for (unsigned i = 0; i < 4; i++, a += 4) { array[i][0] = a[0] << 6; array[i][1] = a[1] << 6; array[i][2] = a[2] << 6; array[i][3] = a[3] << 6; } return test(array); } GCC now uses: mov x1, x0 stp x29, x30, [sp, -80]! movi v30.4s, 0 mov x29, sp ldp q0, q29, [x1] add x0, sp, 16 zip1 v1.8h, v0.8h, v30.8h zip1 v31.8h, v29.8h, v30.8h zip2 v0.8h, v0.8h, v30.8h zip2 v29.8h, v29.8h, v30.8h shl v1.4s, v1.4s, 6 shl v31.4s, v31.4s, 6 shl v0.4s, v0.4s, 6 shl v29.4s, v29.4s, 6 stp q1, q0, [sp, 16] stp q31, q29, [sp, 48] bl test(unsigned int (*) [4]) ldp x29, x30, [sp], 80 ret whereas previously it used USHLL{,2}: mov x1, x0 stp x29, x30, [sp, -80]! mov x29, sp ldp q1, q0, [x1] add x0, sp, 16 ushll v3.4s, v1.4h, 6 ushll v2.4s, v0.4h, 6 ushll2 v1.4s, v1.8h, 6 ushll2 v0.4s, v0.8h, 6 stp q3, q1, [sp, 16] stp q2, q0, [sp, 48] bl test(unsigned int (*) [4]) ldp x29, x30, [sp], 80 ret This changed with g:f26f92b534f9, which expanded zero-extensions to ZIPs. The patch included *ADDW patterns for the new representation, but it looks like there are several more that should be included for full coverage. AIUI, the point of lowering to ZIPs during expand was to allow the zero to be hoisted. An alternative might be to lower during split, but forcibly hoist the zero by inserting around the FUNCTION_BEG note. We could then cache the insn that does that for manual CSE. Godbolt link: https://godbolt.org/z/vzfnebMhb
[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104 Richard Sandiford changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2023-12-30 Ever confirmed|0 |1 CC||rsandifo at gcc dot gnu.org Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #4 from Richard Sandiford --- FWIW, we do get the desired code with -march=armv8-a+sve (even though the test doesn't use SVE). This is because of: /* Consider enabling VECT_COMPARE_COSTS for SVE, both so that we can compare SVE against Advanced SIMD and so that we can compare multiple SVE vectorization approaches against each other. There's not really any point doing this for Advanced SIMD only, since the first mode that works should always be the best. */ if (TARGET_SVE && aarch64_sve_compare_costs) flags |= VECT_COMPARE_COSTS; The testcase in this PR is a counterexample to the claim in the final sentence. I think the comment might predate significant support for mixed-sized Advanced SIMD vectorisation. If we enable SVE (or uncomment the "if" line), the costs are 13 units per vector iteration for 128-bit vectors and 4 units per vector iteration for 64-bit vectors (so 8 units per 128 bits on a parity basis). The 64-bit version is therefore seen as significantly cheaper and is chosen ahead of the 128-bit version. I think this PR is enough proof that we should enable VECT_COMPARE_COSTS even without SVE. Assigning to myself for that.
[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091 --- Comment #5 from Richard Sandiford --- > The issue here is that because the "outer" pattern consumes > patt_64 = (int) patt_63 it should have adjusted _2 = (int) _1 > stmt-to-vectorize > as being the outer pattern root stmt for all this logic to work correctly. I don't think it can though, at least not in general. The final pattern stmt has to compute the same value as the original scalar stmt.
[Bug target/113094] [14 Regression][aarch64] ICE in extract_constrain_insn, at recog.cc:2713 since r14-6290-g9f0f7d802482a8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113094 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #6 from Richard Sandiford --- Fixed.
[Bug target/112948] gcc/config/aarch64/aarch64-early-ra.cc:1953: possible cut'n'paste error ?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112948 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #3 from Richard Sandiford --- Fixed. Thanks for the report.
[Bug target/113094] [14 Regression][aarch64] ICE in extract_constrain_insn, at recog.cc:2713 since r14-6290-g9f0f7d802482a8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113094 Richard Sandiford changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #4 from Richard Sandiford --- Testing a patch. We're doing spurious work on insns that are slated for deletion, but we can't simply delete them first because that would disrupt the main iteration. Easiest fix seems to be to replace them with NOTE_INSN_DELETED first, then iterate, then delete.
[Bug rtl-optimization/111702] [14 Regression] ICE: in insert_regs, at cse.cc:1114 with -O2 -fstack-protector-all -frounding-math
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111702 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED CC||rsandifo at gcc dot gnu.org --- Comment #5 from Richard Sandiford --- Fixed.
[Bug target/113027] New: aarch64 is missing vec_set and vec_extract for structure modes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113027 Bug ID: 113027 Summary: aarch64 is missing vec_set and vec_extract for structure modes Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rsandifo at gcc dot gnu.org Target Milestone: --- The lack of vec_set and vec_extract optabs for structure modes means that the following testcase spills to the stack when compiled at -O2: #include <arm_neon.h> float64x2x2_t f1 (float64x2x2_t x) { x.val[0][1] += 1.0; return x; } float64x2x3_t f2 (float64x2x3_t x) { x.val[0][0] = x.val[1][1] + x.val[2][0]; return x; } float64x2x4_t f3 (float64x2x4_t x) { x.val[0][0] = x.val[1][1] + x.val[2][0] - x.val[3][1]; return x; } For example: f1: sub sp, sp, #32 fmov d31, 1.0e+0 st1 {v0.2d - v1.2d}, [sp] ldr d30, [sp, 8] fadd d31, d31, d30 str d31, [sp, 8] ld1 {v0.2d - v1.2d}, [sp] add sp, sp, 32 ret With the extra patterns, we instead get: f1: dup d31, v0.d[1] fmov d30, 1.0e+0 fadd d30, d31, d30 ins v0.d[1], v30.d[0] ret f2: dup d31, v1.d[1] fadd d31, d31, d2 ins v0.d[0], v31.d[0] ret f3: dup d31, v1.d[1] dup d30, v3.d[1] fadd d31, d31, d2 fsub d30, d31, d30 ins v0.d[0], v30.d[0] ret Fixing this might also make it possible to use structure modes for arrays (c.f. PR109543).
[Bug tree-optimization/109543] Avoid using BLKmode for unions with a non-BLKmode member when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109543 --- Comment #5 from Richard Sandiford --- I think the loop in compute_mode_layout needs to be smarter for unions. At the moment it's sensitive to field order, which doesn't make much conceptual sense. E.g. for the admittedly contrived example:

#include <arm_neon.h>

union u1 {
  int32x2x2_t x;
  __int128 y __attribute__((packed));
};

union u2 {
  __attribute__((packed)) __int128 y;
  int32x2x2_t x;
};

compiled with -mstrict-align, the loop produces V2x2SImode for union u1 (good!) but TImode for union u2 (requires too much alignment). That doesn't matter as things stand, since we don't accept unions with vector modes. But if we did, union u1 would be placed in registers and union u2 wouldn't.
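A hypothetical sketch of the order-insensitive selection the comment argues for, with made-up names (none of these are GCC internals): a candidate mode is viable only if it matches the union's size and does not require more alignment than the union guarantees. Because viability depends only on (size, alignment), an over-aligned mode like TImode is rejected no matter which field order produced the candidate list:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative candidate: a mode name with its size and the alignment
   it would require.  */
struct mode { const char *name; unsigned size, align; };

/* Pick a mode for the union purely from size/alignment constraints,
   so the result cannot depend on how the fields were declared.  */
static const struct mode *
pick_union_mode (const struct mode *cands, int n,
                 unsigned union_size, unsigned union_align)
{
  for (int i = 0; i < n; i++)
    if (cands[i].size == union_size && cands[i].align <= union_align)
      return &cands[i];
  return NULL;  /* no suitable mode: fall back to BLKmode */
}
```

With candidates modelling TImode (16 bytes, 16-byte alignment) and V2x2SImode (16 bytes, 8-byte alignment) for a union that only guarantees 8-byte alignment, this check selects V2x2SImode whichever candidate comes first, matching the behaviour the comment says union u2 should also get.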
[Bug middle-end/80283] [11/12/13/14 Regression] bad SIMD register allocation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80283 --- Comment #39 from Richard Sandiford --- (In reply to Andrew Pinski from comment #38) > For aarch64, the test from comment #11 is so much worse on the trunk than in > GCC 13.2.0. I've been working on a fix for that. I'm hoping to post it today.
[Bug target/112948] gcc/config/aarch64/aarch64-early-ra.cc:1953: possible cut'n'paste error ?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112948 Richard Sandiford changed: What|Removed |Added Last reconfirmed||2023-12-11 Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org --- Comment #1 from Richard Sandiford --- Gah.
[Bug target/112933] gcc.target/aarch64/sme2/acle-asm/read_za16_vg1x2.c fails on aarch64_be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112933 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #3 from Richard Sandiford --- Fixed.
[Bug target/112931] gcc.target/aarch64/sme2/acle-asm/write_za16_vg1x2.c ICEs on aarch64_be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112931 Richard Sandiford changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #4 from Richard Sandiford --- Fixed.
[Bug target/112930] gcc.target/aarch64/sme/call_sm_switch_7.c ICEs on aarch64_be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112930 Richard Sandiford changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #4 from Richard Sandiford --- Fixed.